All of lore.kernel.org
* [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net.
@ 2010-03-06  9:38 xiaohui.xin
  2010-03-06  9:38 ` [PATCH v1 1/3] A device for zero-copy based " xiaohui.xin
  2010-03-07 10:50 ` [PATCH v1 0/3] Provide a zero-copy method " Michael S. Tsirkin
  0 siblings, 2 replies; 33+ messages in thread
From: xiaohui.xin @ 2010-03-06  9:38 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mingo, mst, jdike

The idea is simple: pin the guest VM user-space buffers and then
let the host NIC driver DMA directly to them.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net so it can
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copy-less data transfer through the guest virtio-net frontend.
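
As a rough illustration, the intended user-space flow looks like the
sketch below. This is only a hedged example based on the MPASSTHRU_BINDDEV
ioctl and the /dev/net/mp node added by patch 1/3, plus the existing
VHOST_NET_SET_BACKEND vhost ioctl; the helper name and the minimal error
handling are illustrative, not part of the patches.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <net/if.h>
	#include <linux/vhost.h>
	#include <linux/mpassthru.h>	/* MPASSTHRU_BINDDEV */

	/* Bind a host NIC to the mp device and hand it to vhost-net. */
	static int bind_mp_backend(int vhost_fd, const char *ifname,
				   unsigned int vq_index)
	{
		struct ifreq ifr;
		struct vhost_vring_file backend;
		int mp_fd = open("/dev/net/mp", O_RDWR);

		if (mp_fd < 0)
			return -1;

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
		if (ioctl(mp_fd, MPASSTHRU_BINDDEV, &ifr) < 0)
			return -1;

		/* vhost-net then uses the mp socket for sendmsg/recvmsg. */
		backend.index = vq_index;
		backend.fd = mp_fd;
		return ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
	}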

We also provide multiple submits and asynchronous notifications to
vhost-net.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But in a simple
test with netperf, we found that bandwidth and CPU % both go up,
and the bandwidth gain is much larger than the CPU % increase.

What we have not done yet:
	packet split support
	GRO support
	performance tuning

What we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add a notifier block for the mp device
	rename page_ctor to mp_port in netdevice.h to make it look more generic
	add mp_dev_change_flags() for the mp device to change the NIC state
	add CONFIG_VHOST_PASSTHRU to limit the usage when the module is not loaded
	a small fix for a missing dev_put() on failure
	use a dynamic minor instead of a static minor number
	add a __KERNEL__ guard for mp_get_socket()

Performance:
	using netperf with GSO/TSO disabled on a 10G NIC,
	with packet split mode disabled, compared to the vhost raw socket case.

	bandwidth goes from 1.1 Gbps to 1.7 Gbps
	CPU % goes from 120%-140% to 140%-160%

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v1 1/3] A device for zero-copy based on KVM virtio-net.
  2010-03-06  9:38 [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net xiaohui.xin
@ 2010-03-06  9:38 ` xiaohui.xin
  2010-03-06  9:38   ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications xiaohui.xin
  2010-03-08 11:28   ` [PATCH v1 1/3] A device for zero-copy based on KVM virtio-net Michael S. Tsirkin
  2010-03-07 10:50 ` [PATCH v1 0/3] Provide a zero-copy method " Michael S. Tsirkin
  1 sibling, 2 replies; 33+ messages in thread
From: xiaohui.xin @ 2010-03-06  9:38 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mingo, mst, jdike; +Cc: Xin Xiaohui, Zhao Yu

From: Xin Xiaohui <xiaohui.xin@intel.com>

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between the guest FE and the host NIC.
It pins the guest user-space buffers into host memory and
provides proto_ops (sendmsg/recvmsg) to vhost-net.
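
The core of the pinning step is get_user_pages() on the guest buffers
handed down in the msghdr iovec. Below is a simplified, hedged sketch of
what alloc_page_info() in this patch does for a single iovec entry; the
frag bookkeeping and the error unwinding are omitted, and pin_iov_entry()
is an illustrative name, not a function added by the patch.

	/* Pin one user-space buffer so the NIC can DMA to/from it. */
	static int pin_iov_entry(const struct iovec *iov, struct page **pages,
				 int write)
	{
		unsigned long base = (unsigned long)iov->iov_base;
		int n = ((base & ~PAGE_MASK) + iov->iov_len + ~PAGE_MASK)
				>> PAGE_SHIFT;
		int got;

		down_read(&current->mm->mmap_sem);
		got = get_user_pages(current, current->mm, base, n,
				     write, 0, pages, NULL);
		up_read(&current->mm->mmap_sem);

		return got == n ? n : -EFAULT;
	}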

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
---
 drivers/vhost/Kconfig     |    5 +
 drivers/vhost/Makefile    |    2 +
 drivers/vhost/mpassthru.c | 1202 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1238 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config VHOST_PASSTHRU
+	tristate "Zerocopy network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..744d6cd
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1202 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+struct page_ctor {
+	struct list_head        readq;
+	int 			w_len;
+	int 			r_len;
+	spinlock_t      	read_lock;
+	atomic_t        	refcnt;
+	struct kmem_cache   	*cache;
+	struct net_device   	*dev;
+	struct mpassthru_port	port;
+	void 			*sendctrl;
+	void 			*recvctrl;
+};
+
+struct page_info {
+	struct list_head    	list;
+	int         		header;
+	/* indicates the actual number of bytes
+	 * sent/received in the user-space buffers
+	 */
+	int         		total;
+	int         		offset;
+	struct page     	*pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct 	frag[MAX_SKB_FRAGS+1];
+	struct sk_buff      	*skb;
+	struct page_ctor   	*ctor;
+
+	/* The pointer relayed to the skb, to indicate
+	 * whether it is a user-space allocated skb or a kernel one
+	 */
+	struct skb_user_page    user;
+	struct skb_shared_info	ushinfo;
+
+#define INFO_READ      		0
+#define INFO_WRITE     		1
+	unsigned        	flags;
+	unsigned        	pnum;
+
+	/* Only meaningful for receive: the
+	 * max length allowed
+	 */
+	size_t          	len;
+
+	/* The fields below are for the backend
+	 * driver, currently vhost-net.
+	 */
+	struct vhost_notifier	notifier;
+	unsigned int    	desc_pos;
+	unsigned int 		log;
+	struct iovec 		hdr[VHOST_NET_MAX_SG];
+	struct iovec 		iov[VHOST_NET_MAX_SG];
+	void 			*ctl;
+};
+
+struct mp_struct {
+	struct mp_file   	*mfile;
+	struct net_device       *dev;
+	struct page_ctor	*ctor;
+	struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock            	sk;
+	struct mp_struct       	*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+	int ret = 0;
+
+	rtnl_lock();
+	ret = dev_change_flags(dev, flags);
+	rtnl_unlock();
+
+	if (ret < 0)
+		printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+	return ret;
+}
+
+/* The main function to allocate user space buffers */
+static struct skb_user_page *page_ctor(struct mpassthru_port *port,
+		struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++) {
+		get_page(info->pages[i]);
+		info->frag[i].page = info->pages[i];
+		info->frag[i].page_offset = i ? 0 : info->offset;
+		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+			port->data_len;
+	}
+	info->skb = skb;
+	info->user.frags = info->frag;
+	info->user.ushinfo = &info->ushinfo;
+	return &info->user;
+}
+
+static struct vhost_notifier *create_vhost_notifier(struct vhost_virtqueue *vq,
+			struct page_info *info, int size);
+
+static void mp_vhost_notifier_dtor(struct vhost_notifier *vnotify)
+{
+	struct page_info *info = (struct page_info *)(vnotify->ctrl);
+	int i;
+
+	for (i = 0; i < info->pnum; i++) {
+		if (info->pages[i])
+			put_page(info->pages[i]);
+	}
+
+	if (info->flags == INFO_READ) {
+		skb_shinfo(info->skb)->destructor_arg = &info->user;
+		info->skb->destructor = NULL;
+		kfree(info->skb);
+	}
+
+	kmem_cache_free(info->ctor->cache, info);
+
+	return;
+}
+
+/* A helper to clean the skb before the kfree_skb() */
+
+static void page_dtor_prepare(struct page_info *info)
+{
+	if (info->flags == INFO_READ)
+		if (info->skb)
+			info->skb->head = NULL;
+}
+
+/* The callback to destruct the user space buffers or skb */
+static void page_dtor(struct skb_user_page *user)
+{
+	struct page_info *info;
+	struct page_ctor *ctor;
+	struct sock *sk;
+	struct sk_buff *skb;
+	struct vhost_notifier *vnotify;
+	struct vhost_virtqueue *vq = NULL;
+	unsigned long flags;
+	int i;
+
+	if (!user)
+		return;
+	info = container_of(user, struct page_info, user);
+	if (!info)
+		return;
+	ctor = info->ctor;
+	skb = info->skb;
+
+	page_dtor_prepare(info);
+
+	/* If info->total is 0, queue the info for reuse */
+	if (!info->total) {
+		spin_lock_irqsave(&ctor->read_lock, flags);
+		list_add(&info->list, &ctor->readq);
+		spin_unlock_irqrestore(&ctor->read_lock, flags);
+		return;
+	}
+
+	/* Receive buffers, should be destructed */
+	if (info->flags == INFO_READ) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		info->skb = NULL;
+		return;
+	}
+
+	/* For transmit, we should wait for the hardware to finish the DMA.
+	 * Queue the notifier to wake up the backend driver.
+	 */
+	vq = (struct vhost_virtqueue *)info->ctl;
+	vnotify = create_vhost_notifier(vq, info, info->total);
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&vnotify->list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+	sk = ctor->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	/* locked by mp_mutex */
+	if (rcu_dereference(mp->ctor))
+		return -EBUSY;
+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_mp_port_prep(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	ctor->cache = kmem_cache_create("skb_page_info",
+			sizeof(struct page_info), 0,
+			SLAB_HWCACHE_ALIGN, NULL);
+
+	if (!ctor->cache)
+		goto cache_fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+
+	ctor->w_len = 0;
+	ctor->r_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->port.ctor = page_ctor;
+	ctor->port.sock = &mp->socket;
+	atomic_set(&ctor->refcnt, 1);
+
+	rc = netdev_mp_port_attach(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, ctor);
+
+	/* XXX: Do we need to do set_offload here? */
+
+	return 0;
+
+fail:
+	kmem_cache_destroy(ctor->cache);
+cache_fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+
+static inline void get_page_ctor(struct page_ctor *ctor)
+{
+       atomic_inc(&ctor->refcnt);
+}
+
+static inline void put_page_ctor(struct page_ctor *ctor)
+{
+	if (atomic_dec_and_test(&ctor->refcnt))
+		kfree(ctor);
+}
+
+struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq,
+				struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	return info;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	int i;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	kmem_cache_destroy(ctor->cache);
+	netdev_mp_port_detach(ctor->dev);
+	dev_put(ctor->dev);
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, NULL);
+	synchronize_rcu();
+
+	put_page_ctor(ctor);
+
+	return 0;
+}
+
+/* For transmitting small user-space buffers, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+		int total)
+{
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->total = total;
+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	return info;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the user space address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+			struct iovec *iov, int count, struct frag *frags,
+			int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+
+	down_read(&current->mm->mmap_sem);
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages(current, current->mm, base, n,
+				npages ? 1 : 0, 0, &info->pages[j], NULL);
+		if (rc != n) {
+			up_read(&current->mm->mmap_sem);
+			goto failed;
+		}
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+	up_read(&current->mm->mmap_sem);
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->total = total;
+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->pnum = j;
+
+	if (!npages)
+		info->flags = INFO_WRITE;
+	if (info->flags == INFO_READ) {
+		info->user.start = (u8 *)(((unsigned long)
+				(pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
+				frags[0].offset) - NET_IP_ALIGN - NET_SKB_PAD);
+		info->user.size = iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD;
+	}
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ctor->cache, info);
+
+	return NULL;
+}
+
+struct page_ctor *mp_rcu_get_ctor(struct page_ctor *ctor)
+{
+	struct page_ctor *_ctor = NULL;
+
+	rcu_read_lock();
+	_ctor = rcu_dereference(ctor);
+
+	if (!_ctor) {
+		DBG(KERN_INFO "Device %s cannot do mediate passthru.\n",
+				ctor->dev->name);
+		rcu_read_unlock();
+		return NULL;
+	}
+	get_page_ctor(_ctor);
+	rcu_read_unlock();
+	return _ctor;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(m->msg_control);
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	ctor = mp_rcu_get_ctor(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	ctor->sendctrl = vq;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN) {
+		put_page_ctor(ctor);
+		return -EINVAL;
+	}
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS) {
+			put_page_ctor(ctor);
+			return -EINVAL;
+		}
+	}
+
+copy:
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
+	if (!skb)
+		goto drop;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	memcpy_fromiovec(skb->data, iov, header);
+	skb_put(skb, header);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(ctor, total);
+	} else {
+		info = alloc_page_info(ctor, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; info->pages[i]; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (info != NULL) {
+		info->desc_pos = vq->head;
+		info->ctl = vq;
+		info->total = total;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->user;
+		skb->dev = mp->dev;
+		dev_queue_xmit(skb);
+		mp->dev->stats.tx_packets++;
+		mp->dev->stats.tx_bytes += total;
+		put_page_ctor(ctor);
+		return 0;
+	}
+drop:
+	kfree(skb);
+	if (info) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(info->ctor->cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	put_page_ctor(ctor);
+	return -ENOMEM;
+}
+
+
+static struct vhost_notifier *create_vhost_notifier(struct vhost_virtqueue *vq,
+			struct page_info *info, int size)
+{
+	struct vhost_notifier *vnotify = NULL;
+
+	vnotify = &info->notifier;
+	memset(vnotify, 0, sizeof(struct vhost_notifier));
+	vnotify->vq = vq;
+	vnotify->head = info->desc_pos;
+	vnotify->size = size;
+	vnotify->log = info->log;
+	vnotify->ctrl = (void *)info;
+	vnotify->dtor = mp_vhost_notifier_dtor;
+	return vnotify;
+}
+
+static void mp_recvmsg_notify(struct vhost_virtqueue *vq)
+{
+	struct socket *sock = vq->private_data;
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	struct ethhdr *eth;
+	struct vhost_notifier *vnotify = NULL;
+	int len, i;
+	unsigned long flags;
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = mp_rcu_get_ctor(mp->ctor);
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sock->sk->sk_receive_queue)) != NULL) {
+		if (skb_shinfo(skb)->destructor_arg) {
+			info = container_of(skb_shinfo(skb)->destructor_arg,
+					struct page_info, user);
+			info->skb = skb;
+			if (skb->len > info->len) {
+				mp->dev->stats.rx_dropped++;
+				DBG(KERN_INFO "Discarded truncated rx packet: "
+					" len %d > %zd\n", skb->len, info->len);
+				info->total = skb->len;
+				goto clean;
+			} else {
+				int i;
+				struct skb_shared_info *gshinfo =
+				(struct skb_shared_info *)(&info->ushinfo);
+				struct skb_shared_info *hshinfo =
+						skb_shinfo(skb);
+
+				if (gshinfo->nr_frags < hshinfo->nr_frags)
+					goto clean;
+				eth = eth_hdr(skb);
+				skb_push(skb, ETH_HLEN);
+
+				hdr.hdr_len = skb_headlen(skb);
+				info->total = skb->len;
+
+				for (i = 0; i < gshinfo->nr_frags; i++)
+					gshinfo->frags[i].size = 0;
+				for (i = 0; i < hshinfo->nr_frags; i++)
+					gshinfo->frags[i].size =
+						hshinfo->frags[i].size;
+				memcpy(skb_shinfo(skb), &info->ushinfo,
+						sizeof(struct skb_shared_info));
+			}
+		} else {
+			/* The skb is composed of kernel buffers,
+			 * in case user-space buffers are not sufficient.
+			 * This case should be rare.
+			 */
+			unsigned long flags;
+			int i;
+			struct skb_shared_info *gshinfo = NULL;
+
+			info = NULL;
+
+			spin_lock_irqsave(&ctor->read_lock, flags);
+			if (!list_empty(&ctor->readq)) {
+				info = list_first_entry(&ctor->readq,
+						struct page_info, list);
+				list_del(&info->list);
+			}
+			spin_unlock_irqrestore(&ctor->read_lock, flags);
+			if (!info) {
+				DBG(KERN_INFO "No user buffer available %p\n",
+									skb);
+				skb_queue_head(&sock->sk->sk_receive_queue,
+									skb);
+				break;
+			}
+			info->skb = skb;
+			/* compute the guest skb frags info */
+			gshinfo = (struct skb_shared_info *)(info->user.start +
+					SKB_DATA_ALIGN(info->user.size));
+
+			if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
+				goto clean;
+
+			eth = eth_hdr(skb);
+			skb_push(skb, ETH_HLEN);
+			info->total = skb->len;
+
+			for (i = 0; i < gshinfo->nr_frags; i++)
+				gshinfo->frags[i].size = 0;
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+				gshinfo->frags[i].size =
+					skb_shinfo(skb)->frags[i].size;
+			hdr.hdr_len = min_t(int, skb->len,
+						info->iov[1].iov_len);
+			skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
+		}
+
+		len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
+								 sizeof hdr);
+		if (len) {
+			DBG(KERN_INFO
+				"Unable to write vnet_hdr at addr %p: %d\n",
+				info->hdr->iov_base, len);
+			goto clean;
+		}
+		vnotify = create_vhost_notifier(vq, info,
+				skb->len + sizeof(hdr));
+
+		spin_lock_irqsave(&vq->notify_lock, flags);
+		list_add_tail(&vnotify->list, &vq->notifier);
+		spin_unlock_irqrestore(&vq->notify_lock, flags);
+		continue;
+
+clean:
+		kfree_skb(skb);
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	put_page_ctor(ctor);
+	return;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(m->msg_control);
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	ctor = mp_rcu_get_ctor(mp->ctor);
+	if (!ctor)
+		return -EINVAL;
+
+	ctor->recvctrl = vq;
+
+	/* Error detection in case of an invalid user-space buffer */
+	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		put_page_ctor(ctor);
+		return -EINVAL;
+	}
+
+	npages = ctor->port.npages;
+	payload = ctor->port.data_len;
+
+	/* If the KVM guest virtio-net FE driver uses the SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	put_page_ctor(ctor);
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet head */
+	iov++;
+	count--;
+
+	/* Translate address to kernel */
+	info = alloc_page_info(ctor, iov, count, frags, npages, 0);
+	if (!info) {
+		put_page_ctor(ctor);
+		return -ENOMEM;
+	}
+
+	info->len = total_len;
+	info->hdr[0].iov_base = vq->hdr[0].iov_base;
+	info->hdr[0].iov_len = vq->hdr[0].iov_len;
+	info->offset = frags[0].offset;
+	info->desc_pos = vq->head;
+	info->log = vq->_log;
+	info->ctl = NULL;
+
+	iov--;
+	count++;
+
+	memcpy(info->iov, vq->iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&ctor->read_lock, flag);
+	list_add_tail(&info->list, &ctor->readq);
+	spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+	if (!vq->receiver)
+		vq->receiver = mp_recvmsg_notify;
+
+	put_page_ctor(ctor);
+	return 0;
+}
+
+static void mp_put(struct mp_file *mfile);
+
+static int mp_release(struct socket *sock)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct mp_file *mfile = mp->mfile;
+
+	mp_put(mfile);
+	sock_put(mp->socket.sk);
+	put_net(mfile->net);
+
+	return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+	.release = mp_release,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+	DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	mp->mfile = NULL;
+
+	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	page_ctor_detach(mp);
+	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+	/* Drop the extra count on the net device */
+	dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count))
+		mp_detach(mfile->mp);
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	dev_hold(mp->dev);
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (!mp)
+		return -EINVAL;
+
+	mp_detach(mp);
+	sock_put(mp->socket.sk);
+	mp_put(mfile);
+	return 0;
+}
+
+static void mp_sock_data_ready(struct sock *sk, int len)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	void __user* argp = (void __user *)arg;
+	struct ifreq ifr;
+	struct sock *sk;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -EBUSY;
+
+		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
+			break;
+
+		ret = -ENODEV;
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev)
+			break;
+
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		init_waitqueue_head(&mp->socket.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = page_ctor_attach(mp);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ifr.ifr_flags |= IFF_MPASSTHRU_EXCL;
+		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+out:
+		mutex_unlock(&mp_mutex);
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		kfree(mp);
+err_dev_put:
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		ret = do_unbind(mfile);
+		break;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->socket.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	do_unbind(mfile);
+
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mpassthru_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+			sock = dev->mp_port->sock;
+			mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+			do_unbind(mp->mfile);
+			break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int ret = 0;
+
+	ret = misc_register(&mp_miscdev);
+	if (ret)
+		printk(KERN_ERR "mp: Can't register misc device\n");
+	else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+			mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return ret;
+}
+
+void mp_cleanup(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_cleanup);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..2be21c5
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,29 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
+
+/* MPASSTHRU ifc flags */
+#define IFF_MPASSTHRU		0x0001
+#define IFF_MPASSTHRU_EXCL	0x0002
+
+#ifdef __KERNEL__
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-06  9:38 ` [PATCH v1 1/3] A device for zero-copy based " xiaohui.xin
@ 2010-03-06  9:38   ` xiaohui.xin
  2010-03-06  9:38     ` [PATCH v1 3/3] Let host NIC driver to DMA to guest user space xiaohui.xin
  2010-03-07 11:18     ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications Michael S. Tsirkin
  2010-03-08 11:28   ` [PATCH v1 1/3] A device for zero-copy based on KVM virtio-net Michael S. Tsirkin
  1 sibling, 2 replies; 33+ messages in thread
From: xiaohui.xin @ 2010-03-06  9:38 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mingo, mst, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The vhost-net backend currently supports only synchronous send/recv
operations. This patch provides multiple submits and asynchronous
notifications, which are needed for the zero-copy case.
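
As a hedged illustration of how the new interface is meant to be used
(backend_complete() and my_notifier_dtor() are illustrative names, not
code from this patch set), an asynchronous backend completes a descriptor
roughly as follows: it fills in a vhost_notifier, queues it on the
virtqueue's notifier list, and the notify handlers added to
handle_tx()/handle_rx() below later call vhost_add_used_and_signal() for it.

	static void backend_complete(struct vhost_virtqueue *vq,
				     struct vhost_notifier *vnotify,
				     int head, int size)
	{
		unsigned long flags;

		vnotify->vq = vq;
		vnotify->head = head;	/* descriptor to mark as used */
		vnotify->size = size;	/* byte count reported to the guest */
		vnotify->dtor = my_notifier_dtor;	/* releases backend state */

		spin_lock_irqsave(&vq->notify_lock, flags);
		list_add_tail(&vnotify->list, &vq->notifier);
		spin_unlock_irqrestore(&vq->notify_lock, flags);
	}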

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/vhost/net.c   |  156 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   23 +++++++
 2 files changed, 174 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..24a6c3d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -22,6 +22,7 @@
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -91,6 +92,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq);
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq);
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +131,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -151,6 +160,12 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			vq->head = head;
+			msg.msg_control = (void *)vq;
+		}
+
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
@@ -166,6 +181,10 @@ static void handle_tx(struct vhost_net *net)
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -177,6 +196,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -206,7 +227,8 @@ static void handle_rx(struct vhost_net *net)
 	int err;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+			vq->link_state == VHOST_VQ_LINK_SYNC))
 		return;
 
 	use_mm(net->dev.mm);
@@ -214,9 +236,18 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
-	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+	/* In the async case, for write logging, the simple way is to always
+	 * fetch the log info; whether we really log is decided later.
+	 * Thus, when logging is enabled we have the log info available, and
+	 * when it is disabled, logging is skipped accordingly.
+	 */
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
+		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
 		vq->log : NULL;
 
+	handle_async_rx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -245,6 +276,11 @@ static void handle_rx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			vq->head = head;
+			vq->_log = log;
+			msg.msg_control = (void *)vq;
+		}
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
@@ -259,6 +295,10 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		/* TODO: Should check and handle checksum. */
 		if (err > len) {
 			pr_err("Discarded truncated rx packet: "
@@ -284,10 +324,83 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
 
+struct vhost_notifier *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct vhost_notifier *vnotify = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		vnotify = list_first_entry(&vq->notifier,
+				struct vhost_notifier, list);
+		list_del(&vnotify->list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return vnotify;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+				struct vhost_virtqueue *vq)
+{
+	struct vhost_notifier *vnotify = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+	if (vq != &net->dev.vqs[VHOST_NET_VQ_RX])
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+	vq_log = unlikely(vhost_has_feature(
+				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((vnotify = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				vnotify->head, vnotify->size);
+		log = vnotify->log;
+		size = vnotify->size;
+		rx_total_len += vnotify->size;
+		vnotify->dtor(vnotify);
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+		struct vhost_virtqueue *vq)
+{
+	struct vhost_notifier *vnotify = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+	if (vq != &net->dev.vqs[VHOST_NET_VQ_TX])
+		return;
+
+	while ((vnotify = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				vnotify->head, 0);
+		tx_total_len += vnotify->size;
+		vnotify->dtor(vnotify);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
 static void handle_tx_kick(struct work_struct *work)
 {
 	struct vhost_virtqueue *vq;
@@ -462,7 +575,19 @@ static struct socket *get_tun_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
 {
 	struct socket *sock;
 	if (fd == -1)
@@ -473,9 +598,26 @@ static struct socket *get_socket(int fd)
 	sock = get_tun_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		vq->link_state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		vq->receiver = NULL;
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -493,12 +635,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	}
 	vq = n->vqs + index;
 	mutex_lock(&vq->mutex);
-	sock = get_socket(fd);
+	vq->link_state = VHOST_VQ_LINK_SYNC;
+	sock = get_socket(vq, fd);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -507,8 +652,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	vhost_net_disable_vq(n, vq);
 	rcu_assign_pointer(vq->private_data, sock);
 	vhost_net_enable_vq(n, vq);
-	mutex_unlock(&vq->mutex);
 done:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
@@ -516,6 +661,7 @@ done:
 	}
 	return r;
 err:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	return r;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d1f0453..295bffa 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,22 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 	0,
+	VHOST_VQ_LINK_ASYNC = 	1,
+};
+
+/* The structure to notify the virtqueue for async socket */
+struct vhost_notifier {
+	struct list_head list;
+	struct vhost_virtqueue *vq;
+	int head;
+	int size;
+	int log;
+	void *ctrl;
+	void (*dtor)(struct vhost_notifier *);
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -96,6 +112,13 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differentiate an async socket for zero-copy from a normal one */
+	enum vhost_vq_link_state link_state;
+	int head;
+	int _log;
+	struct list_head notifier;
+	spinlock_t notify_lock;
+	void (*receiver)(struct vhost_virtqueue *);
 };
 
 struct vhost_dev {
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v1 3/3] Let host NIC driver to DMA to guest user space.
  2010-03-06  9:38   ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications xiaohui.xin
@ 2010-03-06  9:38     ` xiaohui.xin
  2010-03-06 17:18       ` Stephen Hemminger
  2010-03-08 11:18       ` Michael S. Tsirkin
  2010-03-07 11:18     ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications Michael S. Tsirkin
  1 sibling, 2 replies; 33+ messages in thread
From: xiaohui.xin @ 2010-03-06  9:38 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mingo, mst, jdike; +Cc: Xin Xiaohui, Zhao Yu

From: Xin Xiaohui <xiaohui.xin@intel.com>

The patch lets the host NIC driver receive user-space skbs, so the
driver has a chance to DMA directly to guest user-space buffers
through a single ethX interface.
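
The patch also adds an ndo_mp_port_prep hook (needed later for packet
split) so a NIC driver can describe its receive buffer geometry. Below is
a hedged sketch of how a driver might fill it in; the foo_* names are
illustrative, and the values simply mirror the defaults used by
netdev_mp_port_prep() in this patch.

	static int foo_ndo_mp_port_prep(struct net_device *dev,
					struct mpassthru_port *port)
	{
		port->hdr_len  = 128;	/* header bytes per rx buffer */
		port->data_len = 2048;	/* payload bytes per rx buffer */
		port->npages   = 1;	/* guest pages per rx buffer */
		return 0;
	}

	static const struct net_device_ops foo_netdev_ops = {
		/* ... the usual ndo_* hooks ... */
		.ndo_mp_port_prep	= foo_ndo_mp_port_prep,
	};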

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
---
 include/linux/netdevice.h |   76 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h    |   30 +++++++++++++++--
 net/core/dev.c            |   32 ++++++++++++++++++
 net/core/skbuff.c         |   79 +++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 205 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..97bf12c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -485,6 +485,17 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct mpassthru_port	{
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	struct skb_user_page	*(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+};
+#endif
 
 /*
  * This structure defines the management hooks for network devices.
@@ -636,6 +647,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_ddp_done)(struct net_device *dev,
 						     u16 xid);
 #endif
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
@@ -891,7 +906,8 @@ struct net_device
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional statistics and wireless sysfs groups */
@@ -2013,6 +2029,62 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
 		return 0;
 	return dev->ethtool_ops->get_flags(dev);
 }
-#endif /* __KERNEL__ */
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+static inline int netdev_mp_port_prep(struct net_device *dev,
+		struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	/* needed by packet split */
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {  /* should be temp */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+
+static inline int netdev_mp_port_attach(struct net_device *dev,
+		struct mpassthru_port *port)
+{
+	if (rcu_dereference(dev->mp_port))
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->mp_port, port);
+
+	return 0;
+}
+
+static inline void netdev_mp_port_detach(struct net_device *dev)
+{
+	if (!rcu_dereference(dev->mp_port))
+		return;
+
+	rcu_assign_pointer(dev->mp_port, NULL);
+	synchronize_rcu();
+}
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
 #endif	/* _LINUX_NETDEVICE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index df7b23a..e59fa57 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -209,6 +209,13 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+struct skb_user_page {
+	u8              *start;
+	int             size;
+	struct skb_frag_struct *frags;
+	struct skb_shared_info *ushinfo;
+	void		(*dtor)(struct skb_user_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
@@ -441,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int fclone,
+				   int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, -1, NULL);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, -1, NULL);
 }
 
 extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1509,6 +1517,22 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
 	__free_page(page);
 }
 
+extern struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
+			struct sk_buff *skb, int npages);
+
+static inline struct skb_user_page *netdev_alloc_user_page(
+		struct net_device *dev,
+		struct sk_buff *skb, unsigned int size)
+{
+	struct skb_user_page *user;
+	int npages = (size < PAGE_SIZE) ? 1 : (size / PAGE_SIZE);
+
+	user = netdev_alloc_user_pages(dev, skb, npages);
+	if (likely(user))
+		return user;
+	return NULL;
+}
+
 /**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..ab8b082 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2265,6 +2265,30 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					struct packet_type **pt_prev,
+					int *ret, struct net_device *orig_dev)
+{
+	struct mpassthru_port *ctor = NULL;
+	struct sock *sk = NULL;
+
+	if (skb->dev)
+		ctor = skb->dev->mp_port;
+	if (!ctor)
+		return skb;
+
+	sk = ctor->sock->sk;
+
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+
+	sk->sk_data_ready(sk, skb->len);
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)      (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2342,6 +2366,9 @@ int netif_receive_skb(struct sk_buff *skb)
 		goto out;
 ncls:
 #endif
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
@@ -2455,6 +2482,11 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 	if (skb_is_gso(skb) || skb_has_frags(skb))
 		goto normal;
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	if (skb->dev && skb->dev->mp_port)
+		goto normal;
+#endif
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(ptype, head, list) {
 		if (ptype->type != type || ptype->dev || !ptype->gro_receive)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 80a9616..6510e5b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -170,13 +170,15 @@ EXPORT_SYMBOL(skb_under_panic);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int fclone, int node, struct net_device *dev)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
-
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	struct skb_user_page *user = NULL;
+#endif
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
 	/* Get the HEAD */
@@ -185,8 +187,26 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 
 	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	if (!dev || !dev->mp_port) { /* Legacy alloc func */
+#endif
+		data = kmalloc_node_track_caller(
+				size + sizeof(struct skb_shared_info),
+				gfp_mask, node);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	} else { /* Allocation may come from the device's page constructor */
+		user = netdev_alloc_user_page(dev, skb, size);
+		if (!user) {
+			data = kmalloc_node_track_caller(
+				size + sizeof(struct skb_shared_info),
+				gfp_mask, node);
+			printk(KERN_INFO "can't alloc user buffer.\n");
+		} else {
+			data = user->start;
+			size = SKB_DATA_ALIGN(user->size);
+		}
+	}
+#endif
 	if (!data)
 		goto nodata;
 
@@ -208,6 +228,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->mac_header = ~0U;
 #endif
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	if (user)
+		memcpy(user->ushinfo, skb_shinfo(skb),
+				sizeof(struct skb_shared_info));
+#endif
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
@@ -231,6 +256,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	shinfo->destructor_arg = user;
+#endif
+
 out:
 	return skb;
 nodata:
@@ -259,7 +288,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node, dev);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -278,6 +307,29 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
+			struct sk_buff *skb, int npages)
+{
+	struct mpassthru_port *ctor;
+	struct skb_user_page *user = NULL;
+
+	rcu_read_lock();
+	ctor = rcu_dereference(dev->mp_port);
+	if (!ctor)
+		goto out;
+
+	BUG_ON(npages > ctor->npages);
+
+	user = ctor->ctor(ctor, skb, npages);
+out:
+	rcu_read_unlock();
+
+	return user;
+}
+EXPORT_SYMBOL(netdev_alloc_user_pages);
+#endif
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
@@ -338,6 +390,10 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_release_data(struct sk_buff *skb)
 {
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	struct skb_user_page *user = skb_shinfo(skb)->destructor_arg;
+#endif
+
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
@@ -349,7 +405,10 @@ static void skb_release_data(struct sk_buff *skb)
 
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
-
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+		if (skb->dev && skb->dev->mp_port && user && user->dtor)
+			user->dtor(user);
+#endif
 		kfree(skb->head);
 	}
 }
@@ -503,8 +562,14 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return 0;
 
-	skb_release_head_state(skb);
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	if (skb->dev && skb->dev->mp_port)
+		return 0;
+#endif
+
 	shinfo = skb_shinfo(skb);
+
+	skb_release_head_state(skb);
 	atomic_set(&shinfo->dataref, 1);
 	shinfo->nr_frags = 0;
 	shinfo->gso_size = 0;
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 3/3] Let host NIC driver to DMA to guest user space.
  2010-03-06  9:38     ` [PATCH v1 3/3] Let host NIC driver to DMA to guest user space xiaohui.xin
@ 2010-03-06 17:18       ` Stephen Hemminger
  2010-03-08 11:18       ` Michael S. Tsirkin
  1 sibling, 0 replies; 33+ messages in thread
From: Stephen Hemminger @ 2010-03-06 17:18 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, mst, jdike, Zhao Yu

On Sat,  6 Mar 2010 17:38:38 +0800
xiaohui.xin@intel.com wrote:

> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> The patch lets the host NIC driver receive user-space skbs, so the
> driver has a chance to DMA directly to guest user-space buffers
> through a single ethX interface.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81@gmail.com>
> Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
> ---
>  include/linux/netdevice.h |   76 ++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/skbuff.h    |   30 +++++++++++++++--
>  net/core/dev.c            |   32 ++++++++++++++++++
>  net/core/skbuff.c         |   79 +++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 205 insertions(+), 12 deletions(-)
> 

There are too many ifdefs in this implementation.
I would prefer to see a few functions (with stubs for the non-ifdef case),
like the network namespace code.
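
For illustration only, an untested sketch of that pattern; netdev_alloc_user_pages
is from the patch, skb_release_user_page is a name I just made up:

#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
extern struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
						     struct sk_buff *skb,
						     int npages);
extern void skb_release_user_page(struct sk_buff *skb);
#else
static inline struct skb_user_page *
netdev_alloc_user_pages(struct net_device *dev, struct sk_buff *skb, int npages)
{
	return NULL;
}

static inline void skb_release_user_page(struct sk_buff *skb)
{
}
#endif

Then __alloc_skb()/skb_release_data() can call these unconditionally and all
the ifdefs stay in one header.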

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net.
  2010-03-06  9:38 [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net xiaohui.xin
  2010-03-06  9:38 ` [PATCH v1 1/3] A device for zero-copy based " xiaohui.xin
@ 2010-03-07 10:50 ` Michael S. Tsirkin
  2010-03-09  7:47   ` Xin, Xiaohui
  1 sibling, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-07 10:50 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Sat, Mar 06, 2010 at 05:38:35PM +0800, xiaohui.xin@intel.com wrote:
> The idea is simple, just to pin the guest VM user space and then
> let host NIC driver has the chance to directly DMA to it. 
> The patches are based on vhost-net backend driver. We add a device
> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> send/recv directly to/from the NIC driver. KVM guest who use the
> vhost-net backend may bind any ethX interface in the host side to
> get copyless data transfer thru guest virtio-net frontend.
> 
> We provide multiple submits and asynchronous notifiicaton to 
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But for simple
> test with netperf, we found bindwidth up and CPU % up too,
> but the bindwidth up ratio is much more than CPU % up ratio.
> 
> What we have not done yet:
> 	packet split support
> 	To support GRO
> 	Performance tuning

Am I right to say that the nic driver needs changes for these patches
to work? If so, please publish the nic driver patches as well.

> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchroush mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC, 
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bindwidth will be from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%

That's pretty low for a 10Gb nic. Are you hitting some other bottleneck,
like high interrupt rate? Also, GSO support and performance tuning
for raw are incomplete. Try comparing with e.g. tap with GSO.

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-06  9:38   ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications xiaohui.xin
  2010-03-06  9:38     ` [PATCH v1 3/3] Let host NIC driver to DMA to guest user space xiaohui.xin
@ 2010-03-07 11:18     ` Michael S. Tsirkin
  2010-03-15  8:46       ` Xin, Xiaohui
  1 sibling, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-07 11:18 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

> +/* The structure to notify the virtqueue for async socket */
> +struct vhost_notifier {
> +	struct list_head list;
> +	struct vhost_virtqueue *vq;
> +	int head;
> +	int size;
> +	int log;
> +	void *ctrl;
> +	void (*dtor)(struct vhost_notifier *);
> +};
> +

So IMO, this is not the best interface between vhost
and your driver, exposing them to each other unnecessarily.

If you think about it, your driver should not care about this structure.
It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor
on completion.  vhost could save its state in ki_user_data.  If your
driver needs to add more data to do more tracking, I think it can put
the skb pointer in the private pointer.
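
Just to illustrate, a rough and untested sketch using the existing
struct kiocb fields (vhost_tx_done is a made-up name):

/* vhost, before handing the buffers to sendmsg(): */
iocb->ki_user_data = vq->head;		/* descriptor index to signal later */
iocb->ki_dtor = vhost_tx_done;		/* completion callback owned by vhost */

/* your driver, if it needs extra tracking of its own: */
iocb->private = skb;

/* your driver, once the hardware is done with the buffers: */
if (iocb->ki_dtor)
	iocb->ki_dtor(iocb);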

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 3/3] Let host NIC driver to DMA to guest user space.
  2010-03-06  9:38     ` [PATCH v1 3/3] Let host NIC driver to DMA to guest user space xiaohui.xin
  2010-03-06 17:18       ` Stephen Hemminger
@ 2010-03-08 11:18       ` Michael S. Tsirkin
  1 sibling, 0 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-08 11:18 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike, Zhao Yu

On Sat, Mar 06, 2010 at 05:38:38PM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> The patch lets the host NIC driver receive user space skbs,
> so the driver has a chance to DMA directly to guest user
> space buffers thru a single ethX interface.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81@gmail.com>
> Sigend-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>


I have a feeling I commented on some of the below issues already.
Do you plan to send a version with comments addressed?

> ---
>  include/linux/netdevice.h |   76 ++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/skbuff.h    |   30 +++++++++++++++--
>  net/core/dev.c            |   32 ++++++++++++++++++
>  net/core/skbuff.c         |   79 +++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 205 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 94958c1..97bf12c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -485,6 +485,17 @@ struct netdev_queue {
>  	unsigned long		tx_dropped;
>  } ____cacheline_aligned_in_smp;
>  
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +struct mpassthru_port	{
> +	int		hdr_len;
> +	int		data_len;
> +	int		npages;
> +	unsigned	flags;
> +	struct socket	*sock;
> +	struct skb_user_page	*(*ctor)(struct mpassthru_port *,
> +				struct sk_buff *, int);
> +};
> +#endif
>  
>  /*
>   * This structure defines the management hooks for network devices.
> @@ -636,6 +647,10 @@ struct net_device_ops {
>  	int			(*ndo_fcoe_ddp_done)(struct net_device *dev,
>  						     u16 xid);
>  #endif
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	int			(*ndo_mp_port_prep)(struct net_device *dev,
> +						struct mpassthru_port *port);
> +#endif
>  };
>  
>  /*
> @@ -891,7 +906,8 @@ struct net_device
>  	struct macvlan_port	*macvlan_port;
>  	/* GARP */
>  	struct garp_port	*garp_port;
> -
> +	/* mpassthru */
> +	struct mpassthru_port	*mp_port;
>  	/* class/net/name entry */
>  	struct device		dev;
>  	/* space for optional statistics and wireless sysfs groups */
> @@ -2013,6 +2029,62 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
>  		return 0;
>  	return dev->ethtool_ops->get_flags(dev);
>  }
> -#endif /* __KERNEL__ */
>  
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +static inline int netdev_mp_port_prep(struct net_device *dev,
> +		struct mpassthru_port *port)
> +{

This function lacks documentation.
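
E.g. a kernel-doc comment along these lines (the wording is only my guess at
the intended semantics):

/**
 * netdev_mp_port_prep - validate mpassthru port parameters for a device
 * @dev:  device the port is about to be bound to
 * @port: port whose hdr_len, data_len and npages are filled in and checked
 *
 * Returns 0 on success, -EINVAL if the parameters cannot be used to build
 * user-space backed receive buffers.
 */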

> +	int rc;
> +	int npages, data_len;
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +
> +	/* needed by packet split */
> +	if (ops->ndo_mp_port_prep) {
> +		rc = ops->ndo_mp_port_prep(dev, port);
> +		if (rc)
> +			return rc;
> +	} else {  /* should be temp */
> +		port->hdr_len = 128;
> +		port->data_len = 2048;
> +		port->npages = 1;

where do the numbers come from?

> +	}
> +
> +	if (port->hdr_len <= 0)
> +		goto err;
> +
> +	npages = port->npages;
> +	data_len = port->data_len;
> +	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
> +			(data_len < PAGE_SIZE * (npages - 1) ||
> +			 data_len > PAGE_SIZE * npages))
> +		goto err;
> +
> +	return 0;
> +err:
> +	dev_warn(&dev->dev, "invalid page constructor parameters\n");
> +
> +	return -EINVAL;
> +}
> +
> +static inline int netdev_mp_port_attach(struct net_device *dev,
> +		struct mpassthru_port *port)
> +{
> +	if (rcu_dereference(dev->mp_port))
> +		return -EBUSY;
> +
> +	rcu_assign_pointer(dev->mp_port, port);
> +
> +	return 0;
> +}
> +
> +static inline void netdev_mp_port_detach(struct net_device *dev)
> +{
> +	if (!rcu_dereference(dev->mp_port))
> +		return;
> +
> +	rcu_assign_pointer(dev->mp_port, NULL);
> +	synchronize_rcu();
> +}

The above looks wrong: rcu_dereference should be called
under an rcu read-side critical section, rcu_assign_pointer usually
should not be, and synchronize_rcu definitely should not be.

As I suggested already, these functions are better open-coded;
rcu is tricky enough as it is without hiding it in inline helpers.
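
Open-coded, and assuming attach/detach are already serialized by a mutex
(mp_mutex in the mp device seems to play that role), it would look roughly
like this (untested):

/* writer side, mutex held, on attach */
if (dev->mp_port)
	return -EBUSY;
rcu_assign_pointer(dev->mp_port, port);

/* writer side, mutex held, on detach */
rcu_assign_pointer(dev->mp_port, NULL);
synchronize_rcu();	/* no reader can still see the old port after this */

/* reader side, e.g. the rx allocation path */
rcu_read_lock();
port = rcu_dereference(dev->mp_port);
if (port)
	user = port->ctor(port, skb, npages);
rcu_read_unlock();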

> +#endif /* CONFIG_VHOST_PASSTHRU */
> +#endif /* __KERNEL__ */
>  #endif	/* _LINUX_NETDEVICE_H */
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index df7b23a..e59fa57 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -209,6 +209,13 @@ struct skb_shared_info {
>  	void *		destructor_arg;
>  };
>  
> +struct skb_user_page {
> +	u8              *start;
> +	int             size;
> +	struct skb_frag_struct *frags;
> +	struct skb_shared_info *ushinfo;
> +	void		(*dtor)(struct skb_user_page *);
> +};
>  /* We divide dataref into two halves.  The higher 16 bits hold references
>   * to the payload part of skb->data.  The lower 16 bits hold references to
>   * the entire skb->data.  A clone of a headerless skb holds the length of
> @@ -441,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
>  extern void consume_skb(struct sk_buff *skb);
>  extern void	       __kfree_skb(struct sk_buff *skb);
>  extern struct sk_buff *__alloc_skb(unsigned int size,
> -				   gfp_t priority, int fclone, int node);
> +				   gfp_t priority, int fclone,
> +				   int node, struct net_device *dev);
>  static inline struct sk_buff *alloc_skb(unsigned int size,
>  					gfp_t priority)
>  {
> -	return __alloc_skb(size, priority, 0, -1);
> +	return __alloc_skb(size, priority, 0, -1, NULL);
>  }
>  
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>  					       gfp_t priority)
>  {
> -	return __alloc_skb(size, priority, 1, -1);
> +	return __alloc_skb(size, priority, 1, -1, NULL);
>  }
>  
>  extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
> @@ -1509,6 +1517,22 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
>  	__free_page(page);
>  }
>  
> +extern struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
> +			struct sk_buff *skb, int npages);
> +
> +static inline struct skb_user_page *netdev_alloc_user_page(
> +		struct net_device *dev,
> +		struct sk_buff *skb, unsigned int size)
> +{
> +	struct skb_user_page *user;
> +	int npages = (size < PAGE_SIZE) ? 1 : (size / PAGE_SIZE);

Should round up to full pages?
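
I.e. presumably something like (untested):

int npages = DIV_ROUND_UP(size, PAGE_SIZE);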

> +
> +	user = netdev_alloc_user_pages(dev, skb, npages);
> +	if (likely(user))
> +		return user;
> +	return NULL;
> +}
> +
>  /**
>   *	skb_clone_writable - is the header of a clone writable
>   *	@skb: buffer to check
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b8f74cf..ab8b082 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2265,6 +2265,30 @@ void netif_nit_deliver(struct sk_buff *skb)
>  	rcu_read_unlock();
>  }
>  
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
> +					struct packet_type **pt_prev,
> +					int *ret, struct net_device *orig_dev)

please document

pt_prev, orig_dev and ret seem unused?

> +{
> +	struct mpassthru_port *ctor = NULL;

Why do you call the port "ctor"?

> +	struct sock *sk = NULL;
> +
> +	if (skb->dev)
> +		ctor = skb->dev->mp_port;
> +	if (!ctor)
> +		return skb;
> +
> +	sk = ctor->sock->sk;
> +
> +	skb_queue_tail(&sk->sk_receive_queue, skb);
> +
> +	sk->sk_data_ready(sk, skb->len);
> +	return NULL;
> +}
> +#else
> +#define handle_mpassthru(skb, pt_prev, ret, orig_dev)      (skb)
> +#endif
> +
>  /**
>   *	netif_receive_skb - process receive buffer from network
>   *	@skb: buffer to process
> @@ -2342,6 +2366,9 @@ int netif_receive_skb(struct sk_buff *skb)
>  		goto out;
>  ncls:
>  #endif
> +	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
> +	if (!skb)
> +		goto out;
>  
>  	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
>  	if (!skb)
> @@ -2455,6 +2482,11 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
>  	if (skb_is_gso(skb) || skb_has_frags(skb))
>  		goto normal;
>  
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	if (skb->dev && skb->dev->mp_port)
> +		goto normal;
> +#endif
> +
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(ptype, head, list) {
>  		if (ptype->type != type || ptype->dev || !ptype->gro_receive)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 80a9616..6510e5b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -170,13 +170,15 @@ EXPORT_SYMBOL(skb_under_panic);
>   *	%GFP_ATOMIC.
>   */
>  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> -			    int fclone, int node)
> +			    int fclone, int node, struct net_device *dev)
>  {
>  	struct kmem_cache *cache;
>  	struct skb_shared_info *shinfo;
>  	struct sk_buff *skb;
>  	u8 *data;
> -
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	struct skb_user_page *user = NULL;
> +#endif
>  	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
>  
>  	/* Get the HEAD */
> @@ -185,8 +187,26 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  		goto out;
>  
>  	size = SKB_DATA_ALIGN(size);
> -	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
> -			gfp_mask, node);
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	if (!dev || !dev->mp_port) { /* Legacy alloc func */
> +#endif
> +		data = kmalloc_node_track_caller(
> +				size + sizeof(struct skb_shared_info),
> +				gfp_mask, node);
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	} else { /* Allocation may from page constructor of device */

what does the comment mean?

> +		user = netdev_alloc_user_page(dev, skb, size);
> +		if (!user) {
> +			data = kmalloc_node_track_caller(
> +				size + sizeof(struct skb_shared_info),
> +				gfp_mask, node);
> +			printk(KERN_INFO "can't alloc user buffer.\n");
> +		} else {
> +			data = user->start;
> +			size = SKB_DATA_ALIGN(user->size);
> +		}
> +	}
> +#endif
>  	if (!data)
>  		goto nodata;
>  
> @@ -208,6 +228,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	skb->mac_header = ~0U;
>  #endif
>  
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	if (user)
> +		memcpy(user->ushinfo, skb_shinfo(skb),
> +				sizeof(struct skb_shared_info));
> +#endif
>  	/* make sure we initialize shinfo sequentially */
>  	shinfo = skb_shinfo(skb);
>  	atomic_set(&shinfo->dataref, 1);
> @@ -231,6 +256,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  
>  		child->fclone = SKB_FCLONE_UNAVAILABLE;
>  	}
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	shinfo->destructor_arg = user;
> +#endif
> +
>  out:
>  	return skb;
>  nodata:
> @@ -259,7 +288,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>  	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
>  	struct sk_buff *skb;
>  
> -	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
> +	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node, dev);
>  	if (likely(skb)) {
>  		skb_reserve(skb, NET_SKB_PAD);
>  		skb->dev = dev;
> @@ -278,6 +307,29 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(__netdev_alloc_page);
>  
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +struct skb_user_page *netdev_alloc_user_pages(struct net_device *dev,
> +			struct sk_buff *skb, int npages)
> +{
> +	struct mpassthru_port *ctor;
> +	struct skb_user_page *user = NULL;
> +
> +	rcu_read_lock();
> +	ctor = rcu_dereference(dev->mp_port);
> +	if (!ctor)
> +		goto out;
> +
> +	BUG_ON(npages > ctor->npages);
> +
> +	user = ctor->ctor(ctor, skb, npages);

With the assumption that "ctor" pins userspace pages,
can't it sleep? If so, you can't call it under an rcu read-side
critical section.

> +out:
> +	rcu_read_unlock();
> +
> +	return user;
> +}
> +EXPORT_SYMBOL(netdev_alloc_user_pages);
> +#endif
> +
>  void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
>  		int size)
>  {
> @@ -338,6 +390,10 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>  
>  static void skb_release_data(struct sk_buff *skb)
>  {
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	struct skb_user_page *user = skb_shinfo(skb)->destructor_arg;
> +#endif
> +
>  	if (!skb->cloned ||
>  	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>  			       &skb_shinfo(skb)->dataref)) {
> @@ -349,7 +405,10 @@ static void skb_release_data(struct sk_buff *skb)
>  
>  		if (skb_has_frags(skb))
>  			skb_drop_fraglist(skb);
> -
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +		if (skb->dev && skb->dev->mp_port && user && user->dtor)
> +			user->dtor(user);
> +#endif
>  		kfree(skb->head);
>  	}
>  }
> @@ -503,8 +562,14 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
>  	if (skb_shared(skb) || skb_cloned(skb))
>  		return 0;
>  
> -	skb_release_head_state(skb);
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +	if (skb->dev && skb->dev->mp_port)
> +		return 0;
> +#endif
> +
>  	shinfo = skb_shinfo(skb);
> +
> +	skb_release_head_state(skb);
>  	atomic_set(&shinfo->dataref, 1);
>  	shinfo->nr_frags = 0;
>  	shinfo->gso_size = 0;
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 1/3] A device for zero-copy based on KVM virtio-net.
  2010-03-06  9:38 ` [PATCH v1 1/3] A device for zero-copy based " xiaohui.xin
  2010-03-06  9:38   ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications xiaohui.xin
@ 2010-03-08 11:28   ` Michael S. Tsirkin
  2010-04-01  9:27     ` Xin Xiaohui
  1 sibling, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-08 11:28 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike, Zhao Yu

On Sat, Mar 06, 2010 at 05:38:36PM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> Add a device to utilize the vhost-net backend driver for
> copy-less data transfer between guest FE and host NIC.
> It pins the guest user space to the host memory and
> provides proto_ops as sendmsg/recvmsg to vhost-net.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81@gmail.com>
> Sigend-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>

I think some of the comments below are repeated.
Do you plan to address them?

> ---
>  drivers/vhost/Kconfig     |    5 +
>  drivers/vhost/Makefile    |    2 +
>  drivers/vhost/mpassthru.c | 1202 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mpassthru.h |   29 ++

I'm not sure it's wise to limit the device to
vhost even if that's the only mode that you are going to
support in the first version.
How about locating the char device under drivers/net/?

>  4 files changed, 1238 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/mpassthru.c
>  create mode 100644 include/linux/mpassthru.h
> 
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index 9f409f4..ee32a3b 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -9,3 +9,8 @@ config VHOST_NET
>  	  To compile this driver as a module, choose M here: the module will
>  	  be called vhost_net.
>  
> +config VHOST_PASSTHRU
> +	tristate "Zerocopy network driver (EXPERIMENTAL)"
> +	depends on VHOST_NET
> +	---help---
> +	  zerocopy network I/O support
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 72dd020..3f79c79 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -1,2 +1,4 @@
>  obj-$(CONFIG_VHOST_NET) += vhost_net.o
>  vhost_net-y := vhost.o net.o
> +
> +obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 0000000..744d6cd
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1202 @@
> +/*
> + *  MPASSTHRU - Mediate passthrough device.
> + *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + *  GNU General Public License for more details.
> + *
> + */
> +
> +#define DRV_NAME        "mpassthru"
> +#define DRV_DESCRIPTION "Mediate passthru device driver"
> +#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
> +
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/major.h>
> +#include <linux/slab.h>
> +#include <linux/smp_lock.h>
> +#include <linux/poll.h>
> +#include <linux/fcntl.h>
> +#include <linux/init.h>
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/miscdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if.h>
> +#include <linux/if_arp.h>
> +#include <linux/if_ether.h>
> +#include <linux/crc32.h>
> +#include <linux/nsproxy.h>
> +#include <linux/uaccess.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mpassthru.h>
> +#include <net/net_namespace.h>
> +#include <net/netns/generic.h>
> +#include <net/rtnetlink.h>
> +#include <net/sock.h>
> +
> +#include <asm/system.h>
> +
> +#include "vhost.h"
> +
> +/* Uncomment to enable debugging */
> +/* #define MPASSTHRU_DEBUG 1 */
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk
> +#else
> +#define DBG(a...)
> +#define DBG1(a...)
> +#endif
> +
> +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
> +#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
> +
> +struct frag {
> +	u16     offset;
> +	u16     size;
> +};
> +
> +struct page_ctor {
> +	struct list_head        readq;
> +	int 			w_len;
> +	int 			r_len;
> +	spinlock_t      	read_lock;
> +	atomic_t        	refcnt;
> +	struct kmem_cache   	*cache;
> +	struct net_device   	*dev;
> +	struct mpassthru_port	port;
> +	void 			*sendctrl;
> +	void 			*recvctrl;
> +};
> +
> +struct page_info {
> +	struct list_head    	list;
> +	int         		header;
> +	/* indicate the actual length of bytes
> +	 * send/recv in the user space buffers
> +	 */
> +	int         		total;
> +	int         		offset;
> +	struct page     	*pages[MAX_SKB_FRAGS+1];
> +	struct skb_frag_struct 	frag[MAX_SKB_FRAGS+1];
> +	struct sk_buff      	*skb;
> +	struct page_ctor   	*ctor;
> +
> +	/* The pointer relayed to skb, to indicate
> +	 * it's a user space allocated skb or kernel
> +	 */
> +	struct skb_user_page    user;
> +	struct skb_shared_info	ushinfo;
> +
> +#define INFO_READ      		0
> +#define INFO_WRITE     		1
> +	unsigned        	flags;
> +	unsigned        	pnum;
> +
> +	/* It's meaningful for receive, means
> +	 * the max length allowed
> +	 */
> +	size_t          	len;
> +
> +	/* The fields after that is for backend
> +	 * driver, now for vhost-net.
> +	 */
> +	struct vhost_notifier	notifier;
> +	unsigned int    	desc_pos;
> +	unsigned int 		log;
> +	struct iovec 		hdr[VHOST_NET_MAX_SG];
> +	struct iovec 		iov[VHOST_NET_MAX_SG];
> +	void 			*ctl;
> +};
> +
> +struct mp_struct {
> +	struct mp_file   	*mfile;
> +	struct net_device       *dev;
> +	struct page_ctor	*ctor;
> +	struct socket           socket;
> +
> +#ifdef MPASSTHRU_DEBUG
> +	int debug;
> +#endif
> +};
> +
> +struct mp_file {
> +	atomic_t count;
> +	struct mp_struct *mp;
> +	struct net *net;
> +};
> +
> +struct mp_sock {
> +	struct sock            	sk;
> +	struct mp_struct       	*mp;
> +};
> +
> +static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
> +{
> +	int ret = 0;
> +
> +	rtnl_lock();
> +	ret = dev_change_flags(dev, flags);
> +	rtnl_unlock();
> +
> +	if (ret < 0)
> +		printk(KERN_ERR "failed to change dev state of %s", dev->name);
> +
> +	return ret;
> +}
> +
> +/* The main function to allocate user space buffers */
> +static struct skb_user_page *page_ctor(struct mpassthru_port *port,
> +		struct sk_buff *skb, int npages)
> +{
> +	int i;
> +	unsigned long flags;
> +	struct page_ctor *ctor;
> +	struct page_info *info = NULL;
> +
> +	ctor = container_of(port, struct page_ctor, port);
> +
> +	spin_lock_irqsave(&ctor->read_lock, flags);
> +	if (!list_empty(&ctor->readq)) {
> +		info = list_first_entry(&ctor->readq, struct page_info, list);
> +		list_del(&info->list);
> +	}
> +	spin_unlock_irqrestore(&ctor->read_lock, flags);
> +	if (!info)
> +		return NULL;
> +
> +	for (i = 0; i < info->pnum; i++) {
> +		get_page(info->pages[i]);
> +		info->frag[i].page = info->pages[i];
> +		info->frag[i].page_offset = i ? 0 : info->offset;
> +		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
> +			port->data_len;
> +	}
> +	info->skb = skb;
> +	info->user.frags = info->frag;
> +	info->user.ushinfo = &info->ushinfo;
> +	return &info->user;
> +}
> +
> +static struct vhost_notifier *create_vhost_notifier(struct vhost_virtqueue *vq,
> +			struct page_info *info, int size);
> +
> +static void mp_vhost_notifier_dtor(struct vhost_notifier *vnotify)
> +{
> +	struct page_info *info = (struct page_info *)(vnotify->ctrl);
> +	int i;
> +
> +	for (i = 0; i < info->pnum; i++) {
> +		if (info->pages[i])
> +			put_page(info->pages[i]);
> +	}
> +
> +	if (info->flags == INFO_READ) {
> +		skb_shinfo(info->skb)->destructor_arg = &info->user;
> +		info->skb->destructor = NULL;
> +		kfree(info->skb);
> +	}
> +
> +	kmem_cache_free(info->ctor->cache, info);
> +
> +	return;
> +}
> +
> +/* A helper to clean the skb before the kfree_skb() */
> +
> +static void page_dtor_prepare(struct page_info *info)
> +{
> +	if (info->flags == INFO_READ)
> +		if (info->skb)
> +			info->skb->head = NULL;
> +}
> +
> +/* The callback to destruct the user space buffers or skb */
> +static void page_dtor(struct skb_user_page *user)
> +{
> +	struct page_info *info;
> +	struct page_ctor *ctor;
> +	struct sock *sk;
> +	struct sk_buff *skb;
> +	struct vhost_notifier *vnotify;
> +	struct vhost_virtqueue *vq = NULL;
> +	unsigned long flags;
> +	int i;
> +
> +	if (!user)
> +		return;
> +	info = container_of(user, struct page_info, user);
> +	if (!info)
> +		return;
> +	ctor = info->ctor;
> +	skb = info->skb;
> +
> +	page_dtor_prepare(info);
> +
> +	/* If the info->total is 0, make it to be reused */
> +	if (!info->total) {
> +		spin_lock_irqsave(&ctor->read_lock, flags);
> +		list_add(&info->list, &ctor->readq);
> +		spin_unlock_irqrestore(&ctor->read_lock, flags);
> +		return;
> +	}
> +
> +	/* Receive buffers, should be destructed */
> +	if (info->flags == INFO_READ) {
> +		for (i = 0; info->pages[i]; i++)
> +			put_page(info->pages[i]);
> +		info->skb = NULL;
> +		return;
> +	}
> +
> +	/* For transmit, we should wait for the DMA finish by hardware.
> +	 * Queue the notifier to wake up the backend driver
> +	 */
> +	vq = (struct vhost_virtqueue *)info->ctl;
> +	vnotify = create_vhost_notifier(vq, info, info->total);
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	list_add_tail(&vnotify->list, &vq->notifier);
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +
> +	sk = ctor->port.sock->sk;
> +	sk->sk_write_space(sk);
> +
> +	return;
> +}
> +
> +static int page_ctor_attach(struct mp_struct *mp)
> +{
> +	int rc;
> +	struct page_ctor *ctor;
> +	struct net_device *dev = mp->dev;
> +
> +	/* locked by mp_mutex */
> +	if (rcu_dereference(mp->ctor))
> +		return -EBUSY;
> +
> +	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
> +	if (!ctor)
> +		return -ENOMEM;
> +	rc = netdev_mp_port_prep(dev, &ctor->port);
> +	if (rc)
> +		goto fail;
> +
> +	ctor->cache = kmem_cache_create("skb_page_info",
> +			sizeof(struct page_info), 0,
> +			SLAB_HWCACHE_ALIGN, NULL);
> +
> +	if (!ctor->cache)
> +		goto cache_fail;
> +
> +	INIT_LIST_HEAD(&ctor->readq);
> +	spin_lock_init(&ctor->read_lock);
> +
> +	ctor->w_len = 0;
> +	ctor->r_len = 0;
> +
> +	dev_hold(dev);
> +	ctor->dev = dev;
> +	ctor->port.ctor = page_ctor;
> +	ctor->port.sock = &mp->socket;
> +	atomic_set(&ctor->refcnt, 1);
> +
> +	rc = netdev_mp_port_attach(dev, &ctor->port);
> +	if (rc)
> +		goto fail;
> +
> +	/* locked by mp_mutex */
> +	rcu_assign_pointer(mp->ctor, ctor);
> +
> +	/* XXX:Need we do set_offload here ? */
> +
> +	return 0;
> +
> +fail:
> +	kmem_cache_destroy(ctor->cache);
> +cache_fail:
> +	kfree(ctor);
> +	dev_put(dev);
> +
> +	return rc;
> +}
> +
> +
> +static inline void get_page_ctor(struct page_ctor *ctor)
> +{
> +       atomic_inc(&ctor->refcnt);
> +}
> +
> +static inline void put_page_ctor(struct page_ctor *ctor)
> +{
> +	if (atomic_dec_and_test(&ctor->refcnt))
> +		kfree(ctor);
> +}
> +
> +struct page_info *info_dequeue(struct page_ctor *ctor)
> +{
> +	unsigned long flags;
> +	struct page_info *info = NULL;
> +	spin_lock_irqsave(&ctor->read_lock, flags);
> +	if (!list_empty(&ctor->readq)) {
> +		info = list_first_entry(&ctor->readq,
> +				struct page_info, list);
> +		list_del(&info->list);
> +	}
> +	spin_unlock_irqrestore(&ctor->read_lock, flags);
> +	return info;
> +}
> +
> +static int page_ctor_detach(struct mp_struct *mp)
> +{
> +	struct page_ctor *ctor;
> +	struct page_info *info;
> +	int i;
> +
> +	ctor = rcu_dereference(mp->ctor);
> +	if (!ctor)
> +		return -ENODEV;
> +
> +	while ((info = info_dequeue(ctor))) {
> +		for (i = 0; i < info->pnum; i++)
> +			if (info->pages[i])
> +				put_page(info->pages[i]);
> +		kmem_cache_free(ctor->cache, info);
> +	}
> +	kmem_cache_destroy(ctor->cache);
> +	netdev_mp_port_detach(ctor->dev);
> +	dev_put(ctor->dev);
> +
> +	/* locked by mp_mutex */
> +	rcu_assign_pointer(mp->ctor, NULL);
> +	synchronize_rcu();
> +
> +	put_page_ctor(ctor);
> +
> +	return 0;
> +}
> +
> +/* For small user space buffers transmit, we don't need to call
> + * get_user_pages().
> + */
> +static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
> +		int total)
> +{
> +	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
> +
> +	if (!info)
> +		return NULL;
> +	info->total = total;
> +	info->user.dtor = page_dtor;
> +	info->ctor = ctor;
> +	info->flags = INFO_WRITE;
> +	return info;
> +}
> +
> +/* The main function to transform the guest user space address
> + * to host kernel address via get_user_pages(). Thus the hardware
> + * can do DMA directly to the user space address.
> + */
> +static struct page_info *alloc_page_info(struct page_ctor *ctor,
> +			struct iovec *iov, int count, struct frag *frags,
> +			int npages, int total)
> +{
> +	int rc;
> +	int i, j, n = 0;
> +	int len;
> +	unsigned long base;
> +	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
> +
> +	if (!info)
> +		return NULL;
> +
> +	down_read(&current->mm->mmap_sem);
> +	for (i = j = 0; i < count; i++) {
> +		base = (unsigned long)iov[i].iov_base;
> +		len = iov[i].iov_len;
> +
> +		if (!len)
> +			continue;
> +		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +
> +		rc = get_user_pages(current, current->mm, base, n,
> +				npages ? 1 : 0, 0, &info->pages[j], NULL);

Try switching to get_user_pages_fast.

We need some limit on the number of pages this can lock,
otherwise it's an obvious DOS.
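
E.g. (untested; MP_MAX_PINNED_PAGES is just a placeholder name, a real limit
probably wants to be accounted per user against RLIMIT_MEMLOCK):

if (j + n > MP_MAX_PINNED_PAGES)
	goto failed;

rc = get_user_pages_fast(base, n, npages ? 1 : 0, &info->pages[j]);

get_user_pages_fast() also lets you drop the mmap_sem handling here.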

> +		if (rc != n) {
> +			up_read(&current->mm->mmap_sem);
> +			goto failed;
> +		}
> +
> +		while (n--) {
> +			frags[j].offset = base & ~PAGE_MASK;
> +			frags[j].size = min_t(int, len,
> +					PAGE_SIZE - frags[j].offset);
> +			len -= frags[j].size;
> +			base += frags[j].size;
> +			j++;
> +		}
> +	}
> +	up_read(&current->mm->mmap_sem);
> +
> +#ifdef CONFIG_HIGHMEM
> +	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
> +		for (i = 0; i < j; i++) {
> +			if (PageHighMem(info->pages[i]))
> +				goto failed;
> +		}
> +	}
> +#endif
> +
> +	info->total = total;
> +	info->user.dtor = page_dtor;
> +	info->ctor = ctor;
> +	info->pnum = j;
> +
> +	if (!npages)
> +		info->flags = INFO_WRITE;
> +	if (info->flags == INFO_READ) {
> +		info->user.start = (u8 *)(((unsigned long)
> +				(pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
> +				frags[0].offset) - NET_IP_ALIGN - NET_SKB_PAD);
> +		info->user.size = iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD;
> +	}
> +	return info;
> +
> +failed:
> +	for (i = 0; i < j; i++)
> +		put_page(info->pages[i]);

For reads, I think you must also mark the page dirty.
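
I.e. wherever the receive pages are finally dropped, something along these
lines (untested):

if (info->flags == INFO_READ)
	set_page_dirty_lock(info->pages[i]);
put_page(info->pages[i]);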

> +
> +	kmem_cache_free(ctor->cache, info);
> +
> +	return NULL;
> +}
> +
> +struct page_ctor *mp_rcu_get_ctor(struct page_ctor *ctor)
> +{
> +	struct page_ctor *_ctor = NULL;
> +
> +	rcu_read_lock();
> +	_ctor = rcu_dereference(ctor);
> +
> +	if (!_ctor) {
> +		DBG(KERN_INFO "Device %s cannot do mediate passthru.\n",
> +				ctor->dev->name);
> +		rcu_read_unlock();
> +		return NULL;
> +	}
> +	get_page_ctor(_ctor);
> +	rcu_read_unlock();
> +	return _ctor;

So you take a reference to ctor under rcu, but then you keep it
after leaving the rcu read-side critical section.
This does not seem right.

> +}
> +
> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> +		struct msghdr *m, size_t total_len)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor;
> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(m->msg_control);
> +	struct iovec *iov = m->msg_iov;
> +	struct page_info *info = NULL;
> +	struct frag frags[MAX_SKB_FRAGS];
> +	struct sk_buff *skb;
> +	int count = m->msg_iovlen;
> +	int total = 0, header, n, i, len, rc;
> +	unsigned long base;
> +
> +	ctor = mp_rcu_get_ctor(mp->ctor);
> +	if (!ctor)
> +		return -ENODEV;
> +
> +	ctor->sendctrl = vq;
> +
> +	total = iov_length(iov, count);
> +
> +	if (total < ETH_HLEN) {
> +		put_page_ctor(ctor);
> +		return -EINVAL;
> +	}
> +
> +	if (total <= COPY_THRESHOLD)
> +		goto copy;
> +
> +	n = 0;
> +	for (i = 0; i < count; i++) {
> +		base = (unsigned long)iov[i].iov_base;
> +		len = iov[i].iov_len;
> +		if (!len)
> +			continue;
> +		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +		if (n > MAX_SKB_FRAGS) {
> +			put_page_ctor(ctor);
> +			return -EINVAL;
> +		}
> +	}
> +
> +copy:
> +	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
> +
> +	skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
> +	if (!skb)
> +		goto drop;
> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +
> +	skb_set_network_header(skb, ETH_HLEN);
> +
> +	memcpy_fromiovec(skb->data, iov, header);
> +	skb_put(skb, header);
> +	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
> +
> +	if (header == total) {
> +		rc = total;
> +		info = alloc_small_page_info(ctor, total);
> +	} else {
> +		info = alloc_page_info(ctor, iov, count, frags, 0, total);
> +		if (info)
> +			for (i = 0; info->pages[i]; i++) {
> +				skb_add_rx_frag(skb, i, info->pages[i],
> +						frags[i].offset, frags[i].size);
> +				info->pages[i] = NULL;
> +			}
> +	}
> +	if (info != NULL) {
> +		info->desc_pos = vq->head;
> +		info->ctl = vq;
> +		info->total = total;
> +		info->skb = skb;
> +		skb_shinfo(skb)->destructor_arg = &info->user;
> +		skb->dev = mp->dev;
> +		dev_queue_xmit(skb);
> +		mp->dev->stats.tx_packets++;
> +		mp->dev->stats.tx_bytes += total;
> +		put_page_ctor(ctor);
> +		return 0;
> +	}
> +drop:
> +	kfree(skb);
> +	if (info) {
> +		for (i = 0; info->pages[i]; i++)
> +			put_page(info->pages[i]);
> +		kmem_cache_free(info->ctor->cache, info);
> +	}
> +	mp->dev->stats.tx_dropped++;
> +	put_page_ctor(ctor);
> +	return -ENOMEM;
> +}
> +
> +
> +static struct vhost_notifier *create_vhost_notifier(struct vhost_virtqueue *vq,
> +			struct page_info *info, int size)
> +{
> +	struct vhost_notifier *vnotify = NULL;
> +
> +	vnotify = &info->notifier;
> +	memset(vnotify, 0, sizeof(struct vhost_notifier));
> +	vnotify->vq = vq;
> +	vnotify->head = info->desc_pos;
> +	vnotify->size = size;
> +	vnotify->log = info->log;
> +	vnotify->ctrl = (void *)info;
> +	vnotify->dtor = mp_vhost_notifier_dtor;
> +	return vnotify;
> +}
> +
> +static void mp_recvmsg_notify(struct vhost_virtqueue *vq)
> +{
> +	struct socket *sock = vq->private_data;
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor = NULL;
> +	struct sk_buff *skb = NULL;
> +	struct page_info *info = NULL;
> +	struct ethhdr *eth;
> +	struct vhost_notifier *vnotify = NULL;
> +	int len, i;
> +	unsigned long flags;
> +
> +	struct virtio_net_hdr hdr = {
> +		.flags = 0,
> +		.gso_type = VIRTIO_NET_HDR_GSO_NONE
> +	};
> +
> +	ctor = mp_rcu_get_ctor(mp->ctor);
> +	if (!ctor)
> +		return;
> +
> +	while ((skb = skb_dequeue(&sock->sk->sk_receive_queue)) != NULL) {
> +		if (skb_shinfo(skb)->destructor_arg) {
> +			info = container_of(skb_shinfo(skb)->destructor_arg,
> +					struct page_info, user);
> +			info->skb = skb;
> +			if (skb->len > info->len) {
> +				mp->dev->stats.rx_dropped++;
> +				DBG(KERN_INFO "Discarded truncated rx packet: "
> +					" len %d > %zd\n", skb->len, info->len);
> +				info->total = skb->len;
> +				goto clean;
> +			} else {
> +				int i;
> +				struct skb_shared_info *gshinfo =
> +				(struct skb_shared_info *)(&info->ushinfo);
> +				struct skb_shared_info *hshinfo =
> +						skb_shinfo(skb);
> +
> +				if (gshinfo->nr_frags < hshinfo->nr_frags)
> +					goto clean;
> +				eth = eth_hdr(skb);
> +				skb_push(skb, ETH_HLEN);
> +
> +				hdr.hdr_len = skb_headlen(skb);
> +				info->total = skb->len;
> +
> +				for (i = 0; i < gshinfo->nr_frags; i++)
> +					gshinfo->frags[i].size = 0;
> +				for (i = 0; i < hshinfo->nr_frags; i++)
> +					gshinfo->frags[i].size =
> +						hshinfo->frags[i].size;
> +				memcpy(skb_shinfo(skb), &info->ushinfo,
> +						sizeof(struct skb_shared_info));
> +			}
> +		} else {
> +			/* The skb composed with kernel buffers
> +			 * in case user space buffers are not sufficent.
> +			 * The case should be rare.
> +			 */
> +			unsigned long flags;
> +			int i;
> +			struct skb_shared_info *gshinfo = NULL;
> +
> +			info = NULL;
> +
> +			spin_lock_irqsave(&ctor->read_lock, flags);
> +			if (!list_empty(&ctor->readq)) {
> +				info = list_first_entry(&ctor->readq,
> +						struct page_info, list);
> +				list_del(&info->list);
> +			}
> +			spin_unlock_irqrestore(&ctor->read_lock, flags);
> +			if (!info) {
> +				DBG(KERN_INFO "No user buffer avaliable %p\n",
> +									skb);
> +				skb_queue_head(&sock->sk->sk_receive_queue,
> +									skb);
> +				break;
> +			}
> +			info->skb = skb;
> +			/* compute the guest skb frags info */
> +			gshinfo = (struct skb_shared_info *)(info->user.start +
> +					SKB_DATA_ALIGN(info->user.size));
> +
> +			if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
> +				goto clean;
> +
> +			eth = eth_hdr(skb);
> +			skb_push(skb, ETH_HLEN);
> +			info->total = skb->len;
> +
> +			for (i = 0; i < gshinfo->nr_frags; i++)
> +				gshinfo->frags[i].size = 0;
> +			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> +				gshinfo->frags[i].size =
> +					skb_shinfo(skb)->frags[i].size;
> +			hdr.hdr_len = min_t(int, skb->len,
> +						info->iov[1].iov_len);
> +			skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
> +		}
> +
> +		len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
> +								 sizeof hdr);
> +		if (len) {
> +			DBG(KERN_INFO
> +				"Unable to write vnet_hdr at addr %p: %d\n",
> +				info->hdr->iov_base, len);
> +			goto clean;
> +		}
> +		vnotify = create_vhost_notifier(vq, info,
> +				skb->len + sizeof(hdr));
> +
> +		spin_lock_irqsave(&vq->notify_lock, flags);
> +		list_add_tail(&vnotify->list, &vq->notifier);
> +		spin_unlock_irqrestore(&vq->notify_lock, flags);
> +		continue;
> +
> +clean:
> +		kfree_skb(skb);
> +		for (i = 0; info->pages[i]; i++)
> +			put_page(info->pages[i]);
> +		kmem_cache_free(ctor->cache, info);
> +	}
> +	put_page_ctor(ctor);
> +	return;
> +}
> +
> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> +		struct msghdr *m, size_t total_len,
> +		int flags)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor;
> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(m->msg_control);
> +	struct iovec *iov = m->msg_iov;
> +	int count = m->msg_iovlen;
> +	int npages, payload;
> +	struct page_info *info;
> +	struct frag frags[MAX_SKB_FRAGS];
> +	unsigned long base;
> +	int i, len;
> +	unsigned long flag;
> +
> +	if (!(flags & MSG_DONTWAIT))
> +		return -EINVAL;
> +
> +	ctor = mp_rcu_get_ctor(mp->ctor);
> +	if (!ctor)
> +		return -EINVAL;
> +
> +	ctor->recvctrl = vq;
> +
> +	/* Error detections in case invalid user space buffer */
> +	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
> +			mp->dev->features & NETIF_F_SG) {
> +		put_page_ctor(ctor);
> +		return -EINVAL;
> +	}
> +
> +	npages = ctor->port.npages;
> +	payload = ctor->port.data_len;
> +
> +	/* If KVM guest virtio-net FE driver use SG feature */
> +	if (count > 2) {
> +		for (i = 2; i < count; i++) {
> +			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
> +			len = iov[i].iov_len;
> +			if (npages == 1)
> +				len = min_t(int, len, PAGE_SIZE - base);
> +			else if (base)
> +				break;
> +			payload -= len;
> +			if (payload <= 0)
> +				goto proceed;
> +			if (npages == 1 || (len & ~PAGE_MASK))
> +				break;
> +		}
> +	}
> +
> +	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
> +				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
> +		goto proceed;
> +
> +	put_page_ctor(ctor);
> +	return -EINVAL;
> +
> +proceed:
> +	/* skip the virtnet head */
> +	iov++;
> +	count--;
> +
> +	/* Translate address to kernel */
> +	info = alloc_page_info(ctor, iov, count, frags, npages, 0);
> +	if (!info) {
> +		put_page_ctor(ctor);
> +		return -ENOMEM;
> +	}
> +
> +	info->len = total_len;
> +	info->hdr[0].iov_base = vq->hdr[0].iov_base;
> +	info->hdr[0].iov_len = vq->hdr[0].iov_len;
> +	info->offset = frags[0].offset;
> +	info->desc_pos = vq->head;
> +	info->log = vq->_log;
> +	info->ctl = NULL;
> +
> +	iov--;
> +	count++;
> +
> +	memcpy(info->iov, vq->iov, sizeof(struct iovec) * count);
> +
> +	spin_lock_irqsave(&ctor->read_lock, flag);
> +	list_add_tail(&info->list, &ctor->readq);
> +	spin_unlock_irqrestore(&ctor->read_lock, flag);
> +
> +	if (!vq->receiver)
> +		vq->receiver = mp_recvmsg_notify;
> +
> +	put_page_ctor(ctor);
> +	return 0;
> +}
> +
> +static void mp_put(struct mp_file *mfile);
> +
> +static int mp_release(struct socket *sock)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct mp_file *mfile = mp->mfile;
> +
> +	mp_put(mfile);
> +	sock_put(mp->socket.sk);
> +	put_net(mfile->net);
> +
> +	return 0;
> +}
> +
> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> +	.sendmsg = mp_sendmsg,
> +	.recvmsg = mp_recvmsg,
> +	.release = mp_release,
> +};
> +
> +static struct proto mp_proto = {
> +	.name           = "mp",
> +	.owner          = THIS_MODULE,
> +	.obj_size       = sizeof(struct mp_sock),
> +};
> +
> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> +	struct mp_file *mfile;
> +	cycle_kernel_lock();
> +	DBG1(KERN_INFO "mp: mp_chr_open\n");
> +
> +	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
> +	if (!mfile)
> +		return -ENOMEM;
> +	atomic_set(&mfile->count, 0);
> +	mfile->mp = NULL;
> +	mfile->net = get_net(current->nsproxy->net_ns);
> +	file->private_data = mfile;
> +	return 0;
> +}
> +
> +static void __mp_detach(struct mp_struct *mp)
> +{
> +	mp->mfile = NULL;
> +
> +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> +	page_ctor_detach(mp);
> +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> +	/* Drop the extra count on the net device */
> +	dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> +	mutex_lock(&mp_mutex);
> +	__mp_detach(mp);
> +	mutex_unlock(&mp_mutex);
> +}
> +
> +static struct mp_struct *mp_get(struct mp_file *mfile)
> +{
> +	struct mp_struct *mp = NULL;
> +	if (atomic_inc_not_zero(&mfile->count))
> +		mp = mfile->mp;
> +
> +	return mp;
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> +	if (atomic_dec_and_test(&mfile->count))
> +		mp_detach(mfile->mp);
> +}
> +
> +static int mp_attach(struct mp_struct *mp, struct file *file)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	int err;
> +
> +	netif_tx_lock_bh(mp->dev);
> +
> +	err = -EINVAL;
> +
> +	if (mfile->mp)
> +		goto out;
> +
> +	err = -EBUSY;
> +	if (mp->mfile)
> +		goto out;
> +
> +	err = 0;
> +	mfile->mp = mp;
> +	mp->mfile = mfile;
> +	mp->socket.file = file;
> +	dev_hold(mp->dev);
> +	sock_hold(mp->socket.sk);
> +	atomic_inc(&mfile->count);
> +
> +out:
> +	netif_tx_unlock_bh(mp->dev);
> +	return err;
> +}
> +
> +static void mp_sock_destruct(struct sock *sk)
> +{
> +	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
> +	kfree(mp);
> +}
> +
> +static int do_unbind(struct mp_file *mfile)
> +{
> +	struct mp_struct *mp = mp_get(mfile);
> +
> +	if (!mp)
> +		return -EINVAL;
> +
> +	mp_detach(mp);
> +	sock_put(mp->socket.sk);
> +	mp_put(mfile);
> +	return 0;
> +}
> +
> +static void mp_sock_data_ready(struct sock *sk, int len)
> +{
> +	if (sk_has_sleeper(sk))
> +		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
> +}
> +
> +static void mp_sock_write_space(struct sock *sk)
> +{
> +	if (sk_has_sleeper(sk))
> +		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
> +}
> +
> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> +		unsigned long arg)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp;
> +	struct net_device *dev;
> +	void __user* argp = (void __user *)arg;
> +	struct ifreq ifr;
> +	struct sock *sk;
> +	int ret;
> +
> +	ret = -EINVAL;
> +
> +	switch (cmd) {
> +	case MPASSTHRU_BINDDEV:
> +		ret = -EFAULT;
> +		if (copy_from_user(&ifr, argp, sizeof ifr))
> +			break;
> +
> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> +
> +		ret = -EBUSY;
> +
> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> +			break;
> +
> +		ret = -ENODEV;
> +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> +		if (!dev)
> +			break;
> +
> +		mutex_lock(&mp_mutex);
> +
> +		ret = -EBUSY;
> +		mp = mfile->mp;
> +		if (mp)
> +			goto err_dev_put;
> +
> +		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
> +		if (!mp) {
> +			ret = -ENOMEM;
> +			goto err_dev_put;
> +		}
> +		mp->dev = dev;
> +		ret = -ENOMEM;
> +
> +		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
> +		if (!sk)
> +			goto err_free_mp;
> +
> +		init_waitqueue_head(&mp->socket.wait);
> +		mp->socket.ops = &mp_socket_ops;
> +		sock_init_data(&mp->socket, sk);
> +		sk->sk_sndbuf = INT_MAX;
> +		container_of(sk, struct mp_sock, sk)->mp = mp;
> +
> +		sk->sk_destruct = mp_sock_destruct;
> +		sk->sk_data_ready = mp_sock_data_ready;
> +		sk->sk_write_space = mp_sock_write_space;
> +
> +		ret = mp_attach(mp, file);
> +		if (ret < 0)
> +			goto err_free_sk;
> +
> +		ret = page_ctor_attach(mp);
> +		if (ret < 0)
> +			goto err_free_sk;
> +
> +		ifr.ifr_flags |= IFF_MPASSTHRU_EXCL;
> +		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +out:
> +		mutex_unlock(&mp_mutex);
> +		break;
> +err_free_sk:
> +		sk_free(sk);
> +err_free_mp:
> +		kfree(mp);
> +err_dev_put:
> +		dev_put(dev);
> +		goto out;
> +
> +	case MPASSTHRU_UNBINDDEV:
> +		ret = do_unbind(mfile);
> +		break;
> +
> +	default:
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp = mp_get(mfile);
> +	struct sock *sk;
> +	unsigned int mask = 0;
> +
> +	if (!mp)
> +		return POLLERR;
> +
> +	sk = mp->socket.sk;
> +
> +	poll_wait(file, &mp->socket.wait, wait);
> +
> +	if (!skb_queue_empty(&sk->sk_receive_queue))
> +		mask |= POLLIN | POLLRDNORM;
> +
> +	if (sock_writeable(sk) ||
> +		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
> +			 sock_writeable(sk)))
> +		mask |= POLLOUT | POLLWRNORM;
> +
> +	if (mp->dev->reg_state != NETREG_REGISTERED)
> +		mask = POLLERR;
> +
> +	mp_put(mfile);
> +	return mask;
> +}
> +
> +static int mp_chr_close(struct inode *inode, struct file *file)
> +{
> +	struct mp_file *mfile = file->private_data;
> +
> +	/*
> +	 * Ignore return value since an error only means there was nothing to
> +	 * do
> +	 */
> +	do_unbind(mfile);
> +
> +	put_net(mfile->net);
> +	kfree(mfile);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations mp_fops = {
> +	.owner  = THIS_MODULE,
> +	.llseek = no_llseek,
> +	.poll   = mp_chr_poll,
> +	.unlocked_ioctl = mp_chr_ioctl,
> +	.open   = mp_chr_open,
> +	.release = mp_chr_close,


qemu will need a way to send packets from userspace
(not from the guest) as well. So I think you will need
a write method as well.
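
A minimal, untested sketch of what I mean; the body is only a placeholder
for building an skb from the iovec and xmitting it, tun/tap style:

static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
				unsigned long count, loff_t pos)
{
	struct mp_file *mfile = iocb->ki_filp->private_data;
	struct mp_struct *mp = mp_get(mfile);
	ssize_t ret = -EBADFD;

	if (!mp)
		return ret;
	/* build an skb from iov and dev_queue_xmit() it, as tun does */
	ret = -ENOSYS;
	mp_put(mfile);
	return ret;
}

plus an .aio_write = mp_chr_aio_write entry in mp_fops.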

> +};
> +
> +static struct miscdevice mp_miscdev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "mp",
> +	.nodename = "net/mp",
> +	.fops = &mp_fops,
> +};
> +
> +static int mp_device_event(struct notifier_block *unused,
> +		unsigned long event, void *ptr)
> +{
> +	struct net_device *dev = ptr;
> +	struct mpassthru_port *port;
> +	struct mp_struct *mp = NULL;
> +	struct socket *sock = NULL;
> +
> +	port = dev->mp_port;
> +	if (port == NULL)
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_UNREGISTER:
> +			sock = dev->mp_port->sock;
> +			mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +			do_unbind(mp->mfile);
> +			break;
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block mp_notifier_block __read_mostly = {
> +	.notifier_call  = mp_device_event,
> +};
> +
> +static int mp_init(void)
> +{
> +	int ret = 0;
> +
> +	ret = misc_register(&mp_miscdev);
> +	if (ret)
> +		printk(KERN_ERR "mp: Can't register misc device\n");
> +	else {
> +		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
> +			mp_miscdev.minor);
> +		register_netdevice_notifier(&mp_notifier_block);
> +	}
> +	return ret;
> +}
> +
> +void mp_cleanup(void)
> +{
> +	unregister_netdevice_notifier(&mp_notifier_block);
> +	misc_deregister(&mp_miscdev);
> +}
> +
> +/* Get an underlying socket object from mp file.  Returns error unless file is
> + * attached to a device.  The returned object works like a packet socket, it
> + * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
> + * holding a reference to the file for as long as the socket is in use. */
> +struct socket *mp_get_socket(struct file *file)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp;
> +
> +	if (file->f_op != &mp_fops)
> +		return ERR_PTR(-EINVAL);
> +	mp = mp_get(mfile);
> +	if (!mp)
> +		return ERR_PTR(-EBADFD);
> +	mp_put(mfile);
> +	return &mp->socket;
> +}
> +EXPORT_SYMBOL_GPL(mp_get_socket);
> +
> +module_init(mp_init);
> +module_exit(mp_cleanup);
> +MODULE_AUTHOR(DRV_COPYRIGHT);
> +MODULE_DESCRIPTION(DRV_DESCRIPTION);
> +MODULE_LICENSE("GPL v2");
> diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
> new file mode 100644
> index 0000000..2be21c5
> --- /dev/null
> +++ b/include/linux/mpassthru.h
> @@ -0,0 +1,29 @@
> +#ifndef __MPASSTHRU_H
> +#define __MPASSTHRU_H
> +
> +#include <linux/types.h>
> +#include <linux/if_ether.h>
> +
> +/* ioctl defines */
> +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
> +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
> +
> +/* MPASSTHRU ifc flags */
> +#define IFF_MPASSTHRU		0x0001
> +#define IFF_MPASSTHRU_EXCL	0x0002
> +
> +#ifdef __KERNEL__
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +struct socket *mp_get_socket(struct file *);
> +#else
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +struct file;
> +struct socket;
> +static inline struct socket *mp_get_socket(struct file *f)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +#endif /* CONFIG_VHOST_PASSTHRU */
> +#endif /* __KERNEL__ */
> +#endif /* __MPASSTHRU_H */
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net.
  2010-03-07 10:50 ` [PATCH v1 0/3] Provide a zero-copy method " Michael S. Tsirkin
@ 2010-03-09  7:47   ` Xin, Xiaohui
  0 siblings, 0 replies; 33+ messages in thread
From: Xin, Xiaohui @ 2010-03-09  7:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

>On Sat, Mar 06, 2010 at 05:38:35PM +0800, xiaohui.xin@intel.com wrote:
>> The idea is simple, just to pin the guest VM user space and then
> >let host NIC driver has the chance to directly DMA to it. 
> >The patches are based on vhost-net backend driver. We add a device
> >which provides proto_ops as sendmsg/recvmsg to vhost-net to
> >send/recv directly to/from the NIC driver. KVM guest who use the
> >vhost-net backend may bind any ethX interface in the host side to
> >get copyless data transfer thru guest virtio-net frontend.
> >
> >We provide multiple submits and asynchronous notifiicaton to 
> >vhost-net too.
> > 
> >Our goal is to improve the bandwidth and reduce the CPU usage.
> >Exact performance data will be provided later. But for simple
> >test with netperf, we found bindwidth up and CPU % up too,
> >but the bindwidth up ratio is much more than CPU % up ratio.
> >
> >What we have not done yet:
> >	packet split support
> >	To support GRO
> >	Performance tuning
> >
>Am I right to say that the nic driver needs changes for these patches
>to work? If so, please publish the nic driver patches as well.

For drivers that do not support packet split mode, the NIC drivers don't need to change.
For drivers that do support packet split, we plan to add the driver API in updated versions.
For now, PS-capable drivers also work if the PS mode is simply disabled.

> > what we have done in v1:
> > 	polish the RCU usage
> >	deal with write logging in asynchroush mode in vhost
> >	add notifier block for mp device
> >	rename page_ctor to mp_port in netdevice.h to make it looks generic
> >	add mp_dev_change_flags() for mp device to change NIC state
> >	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> >	a small fix for missing dev_put when fail
> >	using dynamic minor instead of static minor number
> >	a __KERNEL__ protect to mp_get_sock()
> >
> >performance:
> >	using netperf with GSO/TSO disabled, 10G NIC, 
> >	disabled packet split mode, with raw socket case compared to vhost.
> >
> >	bindwidth will be from 1.1Gbps to 1.7Gbps
> >	CPU % from 120%-140% to 140%-160%

> That's pretty low for a 10Gb nic. Are you hitting some other bottleneck,
> like high interrupt rate? Also, GSO support and performance tuning
> for raw are incomplete. Try comparing with e.g. tap with GSO.

I'm curious too. 
I first tested vhost-net without the zero-copy patch in the RAW socket case
with the ixgbe driver. With that driver GRO is enabled by default and the
netperf numbers were extremely low; after disabling GRO I can get more than
1Gbps. So I thought I had missed something there, but I sent 2 emails to you
about this before and got no reply from you.
Have you got any perf data for the raw socket case with vhost-net?
The data I have found on your web page is always for the tap with GSO case.

If GSO is not supported, I don't think the data can be compared with the
tap with GSO case at 1500 MTU.
Maybe mergeable buffers would help the performance in the raw socket case?

Thanks Xiaohui

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-07 11:18     ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications Michael S. Tsirkin
@ 2010-03-15  8:46       ` Xin, Xiaohui
  2010-03-15  9:23         ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-03-15  8:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

>> +/* The structure to notify the virtqueue for async socket */
>> +struct vhost_notifier {
>> +	struct list_head list;
> >+	struct vhost_virtqueue *vq;
> >+	int head;
> >+	int size;
> >+	int log;
> >+	void *ctrl;
> >+	void (*dtor)(struct vhost_notifier *);
> >+};
> >+

>So IMO, this is not the best interface between vhost
>and your driver, exposing them to each other unnecessarily.
>
>If you think about it, your driver should not care about this structure.
>It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor
>on completion.  vhost could save it's state in ki_user_data.  If your
>driver needs to add more data to do more tracking, I think it can put
>skb pointer in the private pointer.

Then what if I remove struct vhost_notifier and just use struct kiocb, but not the one that comes from sendmsg or recvmsg? Instead I would allocate it within the page_info structure and not implement any aio logic around it. Is that ok?
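
To make that concrete, here is a minimal sketch of what I mean (illustrative only;
page_info is simplified and mp_buffer_done() is a hypothetical helper, not code
from the patch; it assumes the notifier list and notify_lock that the async patch
adds to struct vhost_virtqueue):

struct page_info {
	struct kiocb		iocb;	/* embedded, not the sendmsg/recvmsg one */
	struct list_head	list;
	/* pinned guest pages, skb pointer, ... */
};

/* hypothetical: called by the mp device once the NIC has finished
 * with the buffer (DMA complete) */
static void mp_buffer_done(struct page_info *info, struct vhost_virtqueue *vq,
			   unsigned head, size_t nbytes)
{
	struct kiocb *iocb = &info->iocb;
	unsigned long flags;

	iocb->ki_pos = head;		/* descriptor head to hand back */
	iocb->ki_nbytes = nbytes;	/* bytes received, 0 for tx */

	/* queue for vhost; handle_rx()/handle_tx() drain the list and call
	 * ki_dtor, which frees the page_info */
	spin_lock_irqsave(&vq->notify_lock, flags);
	list_add_tail(&iocb->ki_list, &vq->notifier);
	spin_unlock_irqrestore(&vq->notify_lock, flags);
}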

Sorry, I made a patch, but I don't know how to reply to a mail with a properly formatted patch here....

Thanks
Xiaohui

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-15  8:46       ` Xin, Xiaohui
@ 2010-03-15  9:23         ` Michael S. Tsirkin
  2010-03-16  9:32           ` Xin Xiaohui
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-15  9:23 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Mon, Mar 15, 2010 at 04:46:50PM +0800, Xin, Xiaohui wrote:
> >> +/* The structure to notify the virtqueue for async socket */
> >> +struct vhost_notifier {
> >> +	struct list_head list;
> > >+	struct vhost_virtqueue *vq;
> > >+	int head;
> > >+	int size;
> > >+	int log;
> > >+	void *ctrl;
> > >+	void (*dtor)(struct vhost_notifier *);
> > >+};
> > >+
> 
> >So IMO, this is not the best interface between vhost
> >and your driver, exposing them to each other unnecessarily.
> >
> >If you think about it, your driver should not care about this structure.
> >It could get e.g. a kiocb (sendmsg already gets one), and call ki_dtor
> >on completion.  vhost could save it's state in ki_user_data.  If your
> >driver needs to add more data to do more tracking, I think it can put
> >skb pointer in the private pointer.
> 
> Then if I remove the struct vhost_notifier, and just use struct kiocb, but don't use the one got from sendmsg or recvmsg, but allocated within the page_info structure, and don't implement any aio logic related to it, is that ok?

Hmm, not sure I understand.  It seems both cleaner and easier to use the
iocb passed to sendmsg/recvmsg. No? I am not saying you necessarily must
implement full aio directly.

> Sorry, I made a patch, but don't know how to reply mail with a good formatted patch here....
> 
> Thanks
> Xiaohui

Maybe Documentation/email-clients.txt will help?
Generally you do it like this (at start of mail):

Subject: one line patch summary (overrides mail subject)

multiline patch description

Signed-off-by: <...>

---

Free text comes after the --- delimiter, before the patch.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a140dad..e830b30 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -106,22 +106,41 @@ static void handle_tx(struct vhost_net *net)




-- 
MST

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-15  9:23         ` Michael S. Tsirkin
@ 2010-03-16  9:32           ` Xin Xiaohui
  2010-03-16 11:33             ` [PATCH " Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Xiaohui @ 2010-03-16  9:32 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, mingo, jdike, Xin Xiaohui

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---

Michael,
I don't use the kiocb that comes from sendmsg/recvmsg,
since I have embedded the kiocb in the page_info structure
and allocate it when the page_info is allocated.
Please have a review, and thanks for the instructions on
replying with a patch, which helped me a lot.

Thanks,
Xiaohui

 drivers/vhost/net.c   |  159 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   12 ++++
 2 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..5483848 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq);
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq);
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			vq->head = head;
+			msg.msg_control = (void *)vq;
+		}
+
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
@@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
 	int err;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+			vq->link_state == VHOST_VQ_LINK_SYNC))
 		return;
 
 	use_mm(net->dev.mm);
@@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
-	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+	/* In async cases, for write logging, the simple way is to get
+	 * the log info always, and really logging is decided later.
+	 * Thus, when logging enabled, we can get log, and when logging
+	 * disabled, we can get log disabled accordingly.
+	 */
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
+		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
 		vq->log : NULL;
 
+	handle_async_rx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -245,6 +277,11 @@ static void handle_rx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			vq->head = head;
+			vq->_log = log;
+			msg.msg_control = (void *)vq;
+		}
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
@@ -259,6 +296,10 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		/* TODO: Should check and handle checksum. */
 		if (err > len) {
 			pr_err("Discarded truncated rx packet: "
@@ -284,10 +325,85 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+				struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+	if (vq != &net->dev.vqs[VHOST_NET_VQ_RX])
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+	vq_log = unlikely(vhost_has_feature(
+				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)iocb->ki_user_data;
+		size = iocb->ki_nbytes;
+		rx_total_len += iocb->ki_nbytes;
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+		struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+	if (vq != &net->dev.vqs[VHOST_NET_VQ_TX])
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
 static void handle_tx_kick(struct work_struct *work)
 {
 	struct vhost_virtqueue *vq;
@@ -462,7 +578,19 @@ static struct socket *get_tun_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
 {
 	struct socket *sock;
 	if (fd == -1)
@@ -473,9 +601,26 @@ static struct socket *get_socket(int fd)
 	sock = get_tun_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		vq->link_state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		vq->receiver = NULL;
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -493,12 +638,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	}
 	vq = n->vqs + index;
 	mutex_lock(&vq->mutex);
-	sock = get_socket(fd);
+	vq->link_state = VHOST_VQ_LINK_SYNC;
+	sock = get_socket(vq, fd);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -507,8 +655,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	vhost_net_disable_vq(n, vq);
 	rcu_assign_pointer(vq->private_data, sock);
 	vhost_net_enable_vq(n, vq);
-	mutex_unlock(&vq->mutex);
 done:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
@@ -516,6 +664,7 @@ done:
 	}
 	return r;
 err:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	return r;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d1f0453..297af1c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 	0,
+	VHOST_VQ_LINK_ASYNC = 	1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -96,6 +101,13 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/*Differiate async socket for 0-copy from normal*/
+	enum vhost_vq_link_state link_state;
+	int head;
+	int _log;
+	struct list_head notifier;
+	spinlock_t notify_lock;
+	void (*receiver)(struct vhost_virtqueue *);
 };
 
 struct vhost_dev {
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-16  9:32           ` Xin Xiaohui
@ 2010-03-16 11:33             ` Michael S. Tsirkin
  2010-03-17  9:48               ` Xin, Xiaohui
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-16 11:33 UTC (permalink / raw)
  To: Xin Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Tue, Mar 16, 2010 at 05:32:17PM +0800, Xin Xiaohui wrote:
> The vhost-net backend now only supports synchronous send/recv
> operations. The patch provides multiple submits and asynchronous
> notifications. This is needed for zero-copy case.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> ---
> 
> Michael,
> I don't use the kiocb comes from the sendmsg/recvmsg,
> since I have embeded the kiocb in page_info structure,
> and allocate it when page_info allocated.

So what I suggested was that vhost allocates and tracks the iocbs, and
passes them to your device with sendmsg/recvmsg calls. This way your
device won't need to share structures and locking strategy with vhost:
you get an iocb, handle it, invoke a callback to notify vhost about
completion.

This also gets rid of the 'receiver' callback.
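
Something along these lines (a rough sketch of the split I mean, in the context
of drivers/vhost/net.c; vhost_alloc_iocb() and vhost_async_done() are hypothetical
names, not existing functions):

/* vhost side: submit one rx descriptor asynchronously */
static int vhost_submit_async_rx(struct vhost_virtqueue *vq,
				 struct socket *sock, struct msghdr *msg,
				 size_t len, unsigned head)
{
	/* vhost allocates and tracks the iocb ... */
	struct kiocb *iocb = vhost_alloc_iocb(vq);	/* hypothetical */

	if (!iocb)
		return -ENOMEM;
	iocb->private = vq;			/* vhost state, opaque to the device */
	iocb->ki_pos = head;
	iocb->ki_dtor = vhost_async_done;	/* hypothetical completion hook */

	/* ... and the device keeps it until the data has really arrived,
	 * then just calls iocb->ki_dtor(iocb); it never touches vhost
	 * internals such as the virtqueue or its locks */
	return sock->ops->recvmsg(iocb, sock, msg, len,
				  MSG_DONTWAIT | MSG_TRUNC);
}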

> Please have a review and thanks for the instruction
> for replying email which helps me a lot.
> 
> Thanks,
> Xiaohui
> 
>  drivers/vhost/net.c   |  159 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   12 ++++
>  2 files changed, 166 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 22d5fef..5483848 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -17,11 +17,13 @@
>  #include <linux/workqueue.h>
>  #include <linux/rcupdate.h>
>  #include <linux/file.h>
> +#include <linux/aio.h>
>  
>  #include <linux/net.h>
>  #include <linux/if_packet.h>
>  #include <linux/if_arp.h>
>  #include <linux/if_tun.h>
> +#include <linux/mpassthru.h>
>  
>  #include <net/sock.h>
>  
> @@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
>  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
>  }
>  
> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +					struct vhost_virtqueue *vq);
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> +					struct vhost_virtqueue *vq);
> +

A couple of style comments:

- It's better to arrange functions in such order that forward declarations
aren't necessary.  Since we don't have recursion, this should always be
possible.

- continuation lines should be indented at least at the position of '('
on the previous line.

>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
> @@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
>  		tx_poll_stop(net);
>  	hdr_size = vq->hdr_size;
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
>  		/* Skip header. TODO: support TSO. */
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
>  		msg.msg_iovlen = out;
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			vq->head = head;
> +			msg.msg_control = (void *)vq;

So here a device gets a pointer to vhost_virtqueue structure. If it gets
an iocb and invokes a callback, it would not care about vhost internals.

> +		}
> +
>  		len = iov_length(vq->iov, out);
>  		/* Sanity check */
>  		if (!len) {
> @@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
>  			tx_poll_start(net, sock);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		if (err != len)
>  			pr_err("Truncated TX packet: "
>  			       " len %d != %zd\n", err, len);
> @@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
> @@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
>  	int err;
>  	size_t hdr_size;
>  	struct socket *sock = rcu_dereference(vq->private_data);
> -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> +			vq->link_state == VHOST_VQ_LINK_SYNC))
>  		return;
>  
>  	use_mm(net->dev.mm);
> @@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net *net)
>  	vhost_disable_notify(vq);
>  	hdr_size = vq->hdr_size;
>  
> -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> +	/* In async cases, for write logging, the simple way is to get
> +	 * the log info always, and really logging is decided later.
> +	 * Thus, when logging enabled, we can get log, and when logging
> +	 * disabled, we can get log disabled accordingly.
> +	 */
> +


This adds overhead and might be one of the reasons
your patch does not perform that well. A better way
would be to flush outstanding requests or reread the vq
when logging is enabled.

> +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
>  		vq->log : NULL;
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -245,6 +277,11 @@ static void handle_rx(struct vhost_net *net)
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
>  		msg.msg_iovlen = in;
>  		len = iov_length(vq->iov, in);
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			vq->head = head;
> +			vq->_log = log;
> +			msg.msg_control = (void *)vq;
> +		}
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for RX: "
> @@ -259,6 +296,10 @@ static void handle_rx(struct vhost_net *net)
>  			vhost_discard_vq_desc(vq);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		/* TODO: Should check and handle checksum. */
>  		if (err > len) {
>  			pr_err("Discarded truncated rx packet: "
> @@ -284,10 +325,85 @@ static void handle_rx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
>  
> +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	if (!list_empty(&vq->notifier)) {
> +		iocb = list_first_entry(&vq->notifier,
> +				struct kiocb, ki_list);
> +		list_del(&iocb->ki_list);
> +	}
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +	return iocb;
> +}
> +
> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +				struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	struct vhost_log *vq_log = NULL;
> +	int rx_total_len = 0;
> +	int log, size;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +	if (vq != &net->dev.vqs[VHOST_NET_VQ_RX])
> +		return;
> +
> +	if (vq->receiver)
> +		vq->receiver(vq);
> +	vq_log = unlikely(vhost_has_feature(
> +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, iocb->ki_nbytes);
> +		log = (int)iocb->ki_user_data;
> +		size = iocb->ki_nbytes;
> +		rx_total_len += iocb->ki_nbytes;
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +		if (unlikely(vq_log))
> +			vhost_log_write(vq, vq_log, log, size);
> +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> +		struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	int tx_total_len = 0;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +	if (vq != &net->dev.vqs[VHOST_NET_VQ_TX])
> +		return;
> +

Hard to see why the second check would be necessary

> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, 0);
> +		tx_total_len += iocb->ki_nbytes;
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
>  static void handle_tx_kick(struct work_struct *work)
>  {
>  	struct vhost_virtqueue *vq;
> @@ -462,7 +578,19 @@ static struct socket *get_tun_socket(int fd)
>  	return sock;
>  }
>  
> -static struct socket *get_socket(int fd)
> +static struct socket *get_mp_socket(int fd)
> +{
> +	struct file *file = fget(fd);
> +	struct socket *sock;
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	sock = mp_get_socket(file);
> +	if (IS_ERR(sock))
> +		fput(file);
> +	return sock;
> +}
> +
> +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
>  {
>  	struct socket *sock;
>  	if (fd == -1)
> @@ -473,9 +601,26 @@ static struct socket *get_socket(int fd)
>  	sock = get_tun_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> +	sock = get_mp_socket(fd);
> +	if (!IS_ERR(sock)) {
> +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> +		return sock;
> +	}
>  	return ERR_PTR(-ENOTSOCK);
>  }
>  
> +static void vhost_init_link_state(struct vhost_net *n, int index)
> +{
> +	struct vhost_virtqueue *vq = n->vqs + index;
> +
> +	WARN_ON(!mutex_is_locked(&vq->mutex));
> +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +		vq->receiver = NULL;
> +		INIT_LIST_HEAD(&vq->notifier);
> +		spin_lock_init(&vq->notify_lock);
> +	}
> +}
> +
>  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  {
>  	struct socket *sock, *oldsock;
> @@ -493,12 +638,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	}
>  	vq = n->vqs + index;
>  	mutex_lock(&vq->mutex);
> -	sock = get_socket(fd);
> +	vq->link_state = VHOST_VQ_LINK_SYNC;
> +	sock = get_socket(vq, fd);
>  	if (IS_ERR(sock)) {
>  		r = PTR_ERR(sock);
>  		goto err;
>  	}
>  
> +	vhost_init_link_state(n, index);
> +
>  	/* start polling new socket */
>  	oldsock = vq->private_data;
>  	if (sock == oldsock)
> @@ -507,8 +655,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	vhost_net_disable_vq(n, vq);
>  	rcu_assign_pointer(vq->private_data, sock);
>  	vhost_net_enable_vq(n, vq);
> -	mutex_unlock(&vq->mutex);
>  done:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	if (oldsock) {
>  		vhost_net_flush_vq(n, index);
> @@ -516,6 +664,7 @@ done:
>  	}
>  	return r;
>  err:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	return r;
>  }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d1f0453..297af1c 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -43,6 +43,11 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +enum vhost_vq_link_state {
> +	VHOST_VQ_LINK_SYNC = 	0,
> +	VHOST_VQ_LINK_ASYNC = 	1,
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -96,6 +101,13 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log log[VHOST_NET_MAX_SG];
> +	/*Differiate async socket for 0-copy from normal*/
> +	enum vhost_vq_link_state link_state;
> +	int head;
> +	int _log;
> +	struct list_head notifier;
> +	spinlock_t notify_lock;
> +	void (*receiver)(struct vhost_virtqueue *);
>  };
>  
>  struct vhost_dev {
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-16 11:33             ` [PATCH " Michael S. Tsirkin
@ 2010-03-17  9:48               ` Xin, Xiaohui
  2010-03-17 10:27                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-03-17  9:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

>> Michael,
>> I don't use the kiocb comes from the sendmsg/recvmsg,
> >since I have embeded the kiocb in page_info structure,
> >and allocate it when page_info allocated.

>So what I suggested was that vhost allocates and tracks the iocbs, and
>passes them to your device with sendmsg/ recvmsg calls. This way your
>device won't need to share structures and locking strategy with vhost:
>you get an iocb, handle it, invoke a callback to notify vhost about
>completion.

>This also gets rid of the 'receiver' callback

I'm not sure the receiver callback can be removed here.
The patch describes a work flow like this:
netif_receive_skb() gets the packet; it does nothing but queue the skb
and wake up vhost's handle_rx(). handle_rx() then calls the receiver callback
to deal with the skb and collect the necessary notify info into a list; vhost
owns the list and, in the same handle_rx() context, uses it to complete the requests.

We use the "receiver" callback here because only handle_rx() is woken up from
netif_receive_skb(), and we need mp device context to deal with the skb and
the notify info attached to it. We also take some locks in the callback function.

If I remove the receiver callback, I can only deal with the skb and notify
info in netif_receive_skb(), but that function runs in an interrupt context,
where I don't think I am allowed to take the locks I need, and I cannot
remove those locks.
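
For illustration, the receiver callback is roughly of this shape (a sketch with
hypothetical mp_* names, not the actual mp device code); the point is that it
runs from handle_rx() in vhost's workqueue, i.e. in process context, where the
mp device can do its bookkeeping under its own locks:

static void mp_receiver(struct vhost_virtqueue *vq)
{
	struct mp_struct *mp = mp_from_vq(vq);	/* hypothetical lookup */
	struct sk_buff *skb;
	struct kiocb *iocb;
	unsigned long flags;

	/* drain the skbs that netif_receive_skb() queued in softirq context */
	while ((skb = skb_dequeue(&mp->sock->sk->sk_receive_queue)) != NULL) {
		/* driver-private state, protected by the driver's own lock */
		spin_lock_irqsave(&mp->lock, flags);
		iocb = mp_skb_to_notify_info(mp, skb);	/* hypothetical */
		spin_unlock_irqrestore(&mp->lock, flags);

		/* hand the completed buffer to vhost via the notifier list */
		spin_lock_irqsave(&vq->notify_lock, flags);
		list_add_tail(&iocb->ki_list, &vq->notifier);
		spin_unlock_irqrestore(&vq->notify_lock, flags);
	}
}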


>> Please have a review and thanks for the instruction
>> for replying email which helps me a lot.
>> 
> >Thanks,
> >Xiaohui
> >
> > drivers/vhost/net.c   |  159 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  drivers/vhost/vhost.h |   12 ++++
>>  2 files changed, 166 insertions(+), 5 deletions(-)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >index 22d5fef..5483848 100644
> >--- a/drivers/vhost/net.c
> >+++ b/drivers/vhost/net.c
> >@@ -17,11 +17,13 @@
> > #include <linux/workqueue.h>
> > #include <linux/rcupdate.h>
> > #include <linux/file.h>
> >+#include <linux/aio.h>
> > 
> > #include <linux/net.h>
> > #include <linux/if_packet.h>
> > #include <linux/if_arp.h>
> > #include <linux/if_tun.h>
> >+#include <linux/mpassthru.h>
> > 
> > #include <net/sock.h>
> > 
> >@@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > }
> > 
> >+static void handle_async_rx_events_notify(struct vhost_net *net,
> >+					struct vhost_virtqueue *vq);
> >+
> >+static void handle_async_tx_events_notify(struct vhost_net *net,
> >+					struct vhost_virtqueue *vq);
> >+

>A couple of style comments:
>
>- It's better to arrange functions in such order that forward declarations
>aren't necessary.  Since we don't have recursion, this should always be
>possible.

>- continuation lines should be idented at least at the position of '('
>on the previous line.

Thanks. I'll correct that.

>>  /* Expects to be always run from workqueue - which acts as
>>   * read-size critical section for our kind of RCU. */
>>  static void handle_tx(struct vhost_net *net)
>> @@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
>>  		tx_poll_stop(net);
>>  	hdr_size = vq->hdr_size;
>>  
>> +	handle_async_tx_events_notify(net, vq);
> >+
>>  	for (;;) {
>>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>>  					 ARRAY_SIZE(vq->iov),
> >@@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
>>  		/* Skip header. TODO: support TSO. */
>>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
>>  		msg.msg_iovlen = out;
> >+
> >+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> >+			vq->head = head;
> >+			msg.msg_control = (void *)vq;

>So here a device gets a pointer to vhost_virtqueue structure. If it gets
>an iocb and invokes a callback, it would not care about vhost internals.

>> +		}
>> +
>> 		len = iov_length(vq->iov, out);
>>  		/* Sanity check */
>>  		if (!len) {
>> @@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
>>  			tx_poll_start(net, sock);
>>  			break;
>>  		}
>> +
>> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
>> +			continue;
>>+
>>  		if (err != len)
>>  			pr_err("Truncated TX packet: "
>>  			       " len %d != %zd\n", err, len);
>> @@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
>>  		}
>>  	}
>>  
>> +	handle_async_tx_events_notify(net, vq);
>> +
>>  	mutex_unlock(&vq->mutex);
>>  	unuse_mm(net->dev.mm);
>>  }
>>@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
>>  	int err;
>>  	size_t hdr_size;
>>  	struct socket *sock = rcu_dereference(vq->private_data);
>> -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
>> +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
>> +			vq->link_state == VHOST_VQ_LINK_SYNC))
>>  		return;
>>  
>>  	use_mm(net->dev.mm);
>> @@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net *net)
>>  	vhost_disable_notify(vq);
>>  	hdr_size = vq->hdr_size;
>>  
>> -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>> +	/* In async cases, for write logging, the simple way is to get
>> +	 * the log info always, and really logging is decided later.
>>+	 * Thus, when logging enabled, we can get log, and when logging
>> +	 * disabled, we can get log disabled accordingly.
>> +	 */
>> +


>This adds overhead and might be one of the reasons
>your patch does not perform that well. A better way
>would be to flush outstanding requests or reread the vq
>when logging is enabled.

Since the guest may submit a lot of buffers, and the hardware has already used
them to allocate host skbs, it's difficult to know how many requests are
outstanding and which ones they are; they are not only in the notifier list or
in sk->receive_queue.

But what does reread mean? 

> +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
>  		vq->log : NULL;
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -245,6 +277,11 @@ static void handle_rx(struct vhost_net *net)
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
>  		msg.msg_iovlen = in;
>  		len = iov_length(vq->iov, in);
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			vq->head = head;
> +			vq->_log = log;
> +			msg.msg_control = (void *)vq;
> +		}
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for RX: "
> @@ -259,6 +296,10 @@ static void handle_rx(struct vhost_net *net)
>  			vhost_discard_vq_desc(vq);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		/* TODO: Should check and handle checksum. */
>  		if (err > len) {
>  			pr_err("Discarded truncated rx packet: "
> @@ -284,10 +325,85 @@ static void handle_rx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
>  
> +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	if (!list_empty(&vq->notifier)) {
> +		iocb = list_first_entry(&vq->notifier,
> +				struct kiocb, ki_list);
> +		list_del(&iocb->ki_list);
> +	}
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +	return iocb;
> +}
> +
> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +				struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	struct vhost_log *vq_log = NULL;
> +	int rx_total_len = 0;
> +	int log, size;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +	if (vq != &net->dev.vqs[VHOST_NET_VQ_RX])
> +		return;
> +
> +	if (vq->receiver)
> +		vq->receiver(vq);
> +	vq_log = unlikely(vhost_has_feature(
> +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, iocb->ki_nbytes);
> +		log = (int)iocb->ki_user_data;
> +		size = iocb->ki_nbytes;
> +		rx_total_len += iocb->ki_nbytes;
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +		if (unlikely(vq_log))
> +			vhost_log_write(vq, vq_log, log, size);
> +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> +		struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	int tx_total_len = 0;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +	if (vq != &net->dev.vqs[VHOST_NET_VQ_TX])
> +		return;
> +

Hard to see why the second check would be necessary

> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, 0);
> +		tx_total_len += iocb->ki_nbytes;
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
>  static void handle_tx_kick(struct work_struct *work)
>  {
>  	struct vhost_virtqueue *vq;
> @@ -462,7 +578,19 @@ static struct socket *get_tun_socket(int fd)
>  	return sock;
>  }
>  
> -static struct socket *get_socket(int fd)
> +static struct socket *get_mp_socket(int fd)
> +{
> +	struct file *file = fget(fd);
> +	struct socket *sock;
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	sock = mp_get_socket(file);
> +	if (IS_ERR(sock))
> +		fput(file);
> +	return sock;
> +}
> +
> +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
>  {
>  	struct socket *sock;
>  	if (fd == -1)
> @@ -473,9 +601,26 @@ static struct socket *get_socket(int fd)
>  	sock = get_tun_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> +	sock = get_mp_socket(fd);
> +	if (!IS_ERR(sock)) {
> +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> +		return sock;
> +	}
>  	return ERR_PTR(-ENOTSOCK);
>  }
>  
> +static void vhost_init_link_state(struct vhost_net *n, int index)
> +{
> +	struct vhost_virtqueue *vq = n->vqs + index;
> +
> +	WARN_ON(!mutex_is_locked(&vq->mutex));
> +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +		vq->receiver = NULL;
> +		INIT_LIST_HEAD(&vq->notifier);
> +		spin_lock_init(&vq->notify_lock);
> +	}
> +}
> +
>  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  {
>  	struct socket *sock, *oldsock;
> @@ -493,12 +638,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	}
>  	vq = n->vqs + index;
>  	mutex_lock(&vq->mutex);
> -	sock = get_socket(fd);
> +	vq->link_state = VHOST_VQ_LINK_SYNC;
> +	sock = get_socket(vq, fd);
>  	if (IS_ERR(sock)) {
>  		r = PTR_ERR(sock);
>  		goto err;
>  	}
>  
> +	vhost_init_link_state(n, index);
> +
>  	/* start polling new socket */
>  	oldsock = vq->private_data;
>  	if (sock == oldsock)
> @@ -507,8 +655,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	vhost_net_disable_vq(n, vq);
>  	rcu_assign_pointer(vq->private_data, sock);
>  	vhost_net_enable_vq(n, vq);
> -	mutex_unlock(&vq->mutex);
>  done:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	if (oldsock) {
>  		vhost_net_flush_vq(n, index);
> @@ -516,6 +664,7 @@ done:
>  	}
>  	return r;
>  err:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	return r;
>  }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d1f0453..297af1c 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -43,6 +43,11 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +enum vhost_vq_link_state {
> +	VHOST_VQ_LINK_SYNC = 	0,
> +	VHOST_VQ_LINK_ASYNC = 	1,
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -96,6 +101,13 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log log[VHOST_NET_MAX_SG];
> +	/*Differiate async socket for 0-copy from normal*/
> +	enum vhost_vq_link_state link_state;
> +	int head;
> +	int _log;
> +	struct list_head notifier;
> +	spinlock_t notify_lock;
> +	void (*receiver)(struct vhost_virtqueue *);
>  };
>  
>  struct vhost_dev {
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-17  9:48               ` Xin, Xiaohui
@ 2010-03-17 10:27                 ` Michael S. Tsirkin
  2010-04-01  9:14                   ` Xin Xiaohui
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-03-17 10:27 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Wed, Mar 17, 2010 at 05:48:10PM +0800, Xin, Xiaohui wrote:
> >> Michael,
> >> I don't use the kiocb comes from the sendmsg/recvmsg,
> > >since I have embeded the kiocb in page_info structure,
> > >and allocate it when page_info allocated.
> 
> >So what I suggested was that vhost allocates and tracks the iocbs, and
> >passes them to your device with sendmsg/ recvmsg calls. This way your
> >device won't need to share structures and locking strategy with vhost:
> >you get an iocb, handle it, invoke a callback to notify vhost about
> >completion.
> 
> >This also gets rid of the 'receiver' callback
> 
> I'm not sure receiver callback can be removed here:
> The patch describes a work flow like this:
> netif_receive_skb() gets the packet, it does nothing but just queue the skb
> and wakeup the handle_rx() of vhost. handle_rx() then calls the receiver callback
> to deal with skb and and get the necessary notify info into a list, vhost owns the 
> list and in the same handle_rx() context use it to complete.
> 
> We use "receiver" callback here is because only handle_rx() is waked up from
> netif_receive_skb(), and we need mp device context to deal with the skb and
> notify info attached to it. We also have some lock in the callback function.
> 
> If I remove the receiver callback, I can only deal with the skb and notify
> info in netif_receive_skb(), but this function is in an interrupt context,
> which I think lock is not allowed there. But I cannot remove the lock there.
> 

The basic idea is that vhost passes an iocb to recvmsg and the backend
completes the iocb to signal that data is ready. It is true that the
completion could be in interrupt context, so we need to switch to a
workqueue to handle the event, but the code to do this would live in
vhost.c or net.c.

With this structure your device won't depend on
vhost, and it can go under drivers/net/, opening up the possibility
of using it for zero copy without vhost in the future.
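
As a sketch (hypothetical names, not existing vhost code), the completion hook
vhost sets on the iocb could look like this; the device only calls
iocb->ki_dtor(iocb) on completion, and everything else stays inside vhost:

/* set by vhost as iocb->ki_dtor before passing the iocb to sendmsg/recvmsg;
 * the backend invokes it on completion, possibly from interrupt context */
static void vhost_net_iocb_done(struct kiocb *iocb)
{
	struct vhost_virtqueue *vq = iocb->private;
	unsigned long flags;

	/* only queue the completion here; reporting it to the guest with
	 * vhost_add_used_and_signal() happens later in handle_rx()/handle_tx(),
	 * which run from the vhost workqueue */
	spin_lock_irqsave(&vq->notify_lock, flags);
	list_add_tail(&iocb->ki_list, &vq->notifier);
	spin_unlock_irqrestore(&vq->notify_lock, flags);

	vhost_poll_queue(&vq->poll);
}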



> >> Please have a review and thanks for the instruction
> >> for replying email which helps me a lot.
> >> 
> > >Thanks,
> > >Xiaohui
> > >
> > > drivers/vhost/net.c   |  159 +++++++++++++++++++++++++++++++++++++++++++++++--
> >>  drivers/vhost/vhost.h |   12 ++++
> >>  2 files changed, 166 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > >index 22d5fef..5483848 100644
> > >--- a/drivers/vhost/net.c
> > >+++ b/drivers/vhost/net.c
> > >@@ -17,11 +17,13 @@
> > > #include <linux/workqueue.h>
> > > #include <linux/rcupdate.h>
> > > #include <linux/file.h>
> > >+#include <linux/aio.h>
> > > 
> > > #include <linux/net.h>
> > > #include <linux/if_packet.h>
> > > #include <linux/if_arp.h>
> > > #include <linux/if_tun.h>
> > >+#include <linux/mpassthru.h>
> > > 
> > > #include <net/sock.h>
> > > 
> > >@@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > > 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > > }
> > > 
> > >+static void handle_async_rx_events_notify(struct vhost_net *net,
> > >+					struct vhost_virtqueue *vq);
> > >+
> > >+static void handle_async_tx_events_notify(struct vhost_net *net,
> > >+					struct vhost_virtqueue *vq);
> > >+
> 
> >A couple of style comments:
> >
> >- It's better to arrange functions in such order that forward declarations
> >aren't necessary.  Since we don't have recursion, this should always be
> >possible.
> 
> >- continuation lines should be idented at least at the position of '('
> >on the previous line.
> 
> Thanks. I'd correct that.
> 
> >>  /* Expects to be always run from workqueue - which acts as
> >>   * read-size critical section for our kind of RCU. */
> >>  static void handle_tx(struct vhost_net *net)
> >> @@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net)
> >>  		tx_poll_stop(net);
> >>  	hdr_size = vq->hdr_size;
> >>  
> >> +	handle_async_tx_events_notify(net, vq);
> > >+
> >>  	for (;;) {
> >>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >>  					 ARRAY_SIZE(vq->iov),
> > >@@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net)
> >>  		/* Skip header. TODO: support TSO. */
> >>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> >>  		msg.msg_iovlen = out;
> > >+
> > >+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > >+			vq->head = head;
> > >+			msg.msg_control = (void *)vq;
> 
> >So here a device gets a pointer to vhost_virtqueue structure. If it gets
> >an iocb and invokes a callback, it would not care about vhost internals.
> 
> >> +		}
> >> +
> >> 		len = iov_length(vq->iov, out);
> >>  		/* Sanity check */
> >>  		if (!len) {
> >> @@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net)
> >>  			tx_poll_start(net, sock);
> >>  			break;
> >>  		}
> >> +
> >> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> >> +			continue;
> >>+
> >>  		if (err != len)
> >>  			pr_err("Truncated TX packet: "
> >>  			       " len %d != %zd\n", err, len);
> >> @@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net)
> >>  		}
> >>  	}
> >>  
> >> +	handle_async_tx_events_notify(net, vq);
> >> +
> >>  	mutex_unlock(&vq->mutex);
> >>  	unuse_mm(net->dev.mm);
> >>  }
> >>@@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net)
> >>  	int err;
> >>  	size_t hdr_size;
> >>  	struct socket *sock = rcu_dereference(vq->private_data);
> >> -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> >> +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> >> +			vq->link_state == VHOST_VQ_LINK_SYNC))
> >>  		return;
> >>  
> >>  	use_mm(net->dev.mm);
> >> @@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net *net)
> >>  	vhost_disable_notify(vq);
> >>  	hdr_size = vq->hdr_size;
> >>  
> >> -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> >> +	/* In async cases, for write logging, the simple way is to get
> >> +	 * the log info always, and really logging is decided later.
> >>+	 * Thus, when logging enabled, we can get log, and when logging
> >> +	 * disabled, we can get log disabled accordingly.
> >> +	 */
> >> +
> 
> 
> >This adds overhead and might be one of the reasons
> >your patch does not perform that well. A better way
> >would be to flush outstanding requests or reread the vq
> >when logging is enabled.
> 
> Since the guest may submit a lot of buffers and h/w have already used them
> to allocate host skb, it's difficult to know how many and which one is the
> outstanding request, it's not just only inside in notifier list or sk->receive_queue.

Well, that was just a thought.  I guess there needs to be some way to
recover outstanding requests, at least for cleanup when the device is closed?
Maybe we could put in a special request marked "flush" and wait until it
completes?
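
Very roughly, the flush idea could look something like this (purely hypothetical,
nothing like it exists in the patches; it also assumes the backend completes
requests in submission order):

struct flush_marker {
	struct kiocb		iocb;
	struct completion	done;
};

static void flush_marker_done(struct kiocb *iocb)
{
	struct flush_marker *m = container_of(iocb, struct flush_marker, iocb);

	complete(&m->done);
}

/* hypothetical: called on device close/cleanup, after the last real request */
static void vhost_flush_outstanding(struct vhost_virtqueue *vq,
				    struct socket *sock)
{
	struct flush_marker m;

	memset(&m, 0, sizeof(m));
	init_completion(&m.done);
	m.iocb.ki_dtor = flush_marker_done;

	mp_submit_flush(sock, &m.iocb);	/* hypothetical backend hook */
	wait_for_completion(&m.done);	/* all earlier requests are now done,
					 * given in-order completion */
}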

> But what does reread mean? 

If we want to know the physical address of the iovec, we can look in the
virtqueue to find it. I do this in one go when building up the iovec
now, but logged mode is not a common case so it is not a must.

> > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> >  		vq->log : NULL;
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -245,6 +277,11 @@ static void handle_rx(struct vhost_net *net)
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> >  		msg.msg_iovlen = in;
> >  		len = iov_length(vq->iov, in);
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			vq->head = head;
> > +			vq->_log = log;
> > +			msg.msg_control = (void *)vq;
> > +		}
> >  		/* Sanity check */
> >  		if (!len) {
> >  			vq_err(vq, "Unexpected header len for RX: "
> > @@ -259,6 +296,10 @@ static void handle_rx(struct vhost_net *net)
> >  			vhost_discard_vq_desc(vq);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		/* TODO: Should check and handle checksum. */
> >  		if (err > len) {
> >  			pr_err("Discarded truncated rx packet: "
> > @@ -284,10 +325,85 @@ static void handle_rx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> >  
> > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > +	if (!list_empty(&vq->notifier)) {
> > +		iocb = list_first_entry(&vq->notifier,
> > +				struct kiocb, ki_list);
> > +		list_del(&iocb->ki_list);
> > +	}
> > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > +	return iocb;
> > +}
> > +
> > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > +				struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	struct vhost_log *vq_log = NULL;
> > +	int rx_total_len = 0;
> > +	int log, size;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +	if (vq != &net->dev.vqs[VHOST_NET_VQ_RX])
> > +		return;
> > +
> > +	if (vq->receiver)
> > +		vq->receiver(vq);
> > +	vq_log = unlikely(vhost_has_feature(
> > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, iocb->ki_nbytes);
> > +		log = (int)iocb->ki_user_data;
> > +		size = iocb->ki_nbytes;
> > +		rx_total_len += iocb->ki_nbytes;
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +		if (unlikely(vq_log))
> > +			vhost_log_write(vq, vq_log, log, size);
> > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > +		struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	int tx_total_len = 0;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +	if (vq != &net->dev.vqs[VHOST_NET_VQ_TX])
> > +		return;
> > +
> 
> Hard to see why the second check would be necessary
> 
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, 0);
> > +		tx_total_len += iocb->ki_nbytes;
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> >  static void handle_tx_kick(struct work_struct *work)
> >  {
> >  	struct vhost_virtqueue *vq;
> > @@ -462,7 +578,19 @@ static struct socket *get_tun_socket(int fd)
> >  	return sock;
> >  }
> >  
> > -static struct socket *get_socket(int fd)
> > +static struct socket *get_mp_socket(int fd)
> > +{
> > +	struct file *file = fget(fd);
> > +	struct socket *sock;
> > +	if (!file)
> > +		return ERR_PTR(-EBADF);
> > +	sock = mp_get_socket(file);
> > +	if (IS_ERR(sock))
> > +		fput(file);
> > +	return sock;
> > +}
> > +
> > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> >  {
> >  	struct socket *sock;
> >  	if (fd == -1)
> > @@ -473,9 +601,26 @@ static struct socket *get_socket(int fd)
> >  	sock = get_tun_socket(fd);
> >  	if (!IS_ERR(sock))
> >  		return sock;
> > +	sock = get_mp_socket(fd);
> > +	if (!IS_ERR(sock)) {
> > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > +		return sock;
> > +	}
> >  	return ERR_PTR(-ENOTSOCK);
> >  }
> >  
> > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > +{
> > +	struct vhost_virtqueue *vq = n->vqs + index;
> > +
> > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +		vq->receiver = NULL;
> > +		INIT_LIST_HEAD(&vq->notifier);
> > +		spin_lock_init(&vq->notify_lock);
> > +	}
> > +}
> > +
> >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  {
> >  	struct socket *sock, *oldsock;
> > @@ -493,12 +638,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	}
> >  	vq = n->vqs + index;
> >  	mutex_lock(&vq->mutex);
> > -	sock = get_socket(fd);
> > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > +	sock = get_socket(vq, fd);
> >  	if (IS_ERR(sock)) {
> >  		r = PTR_ERR(sock);
> >  		goto err;
> >  	}
> >  
> > +	vhost_init_link_state(n, index);
> > +
> >  	/* start polling new socket */
> >  	oldsock = vq->private_data;
> >  	if (sock == oldsock)
> > @@ -507,8 +655,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	vhost_net_disable_vq(n, vq);
> >  	rcu_assign_pointer(vq->private_data, sock);
> >  	vhost_net_enable_vq(n, vq);
> > -	mutex_unlock(&vq->mutex);
> >  done:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	if (oldsock) {
> >  		vhost_net_flush_vq(n, index);
> > @@ -516,6 +664,7 @@ done:
> >  	}
> >  	return r;
> >  err:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	return r;
> >  }
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index d1f0453..297af1c 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -43,6 +43,11 @@ struct vhost_log {
> >  	u64 len;
> >  };
> >  
> > +enum vhost_vq_link_state {
> > +	VHOST_VQ_LINK_SYNC = 	0,
> > +	VHOST_VQ_LINK_ASYNC = 	1,
> > +};
> > +
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >  	struct vhost_dev *dev;
> > @@ -96,6 +101,13 @@ struct vhost_virtqueue {
> >  	/* Log write descriptors */
> >  	void __user *log_base;
> >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > +	/*Differiate async socket for 0-copy from normal*/
> > +	enum vhost_vq_link_state link_state;
> > +	int head;
> > +	int _log;
> > +	struct list_head notifier;
> > +	spinlock_t notify_lock;
> > +	void (*receiver)(struct vhost_virtqueue *);
> >  };
> >  
> >  struct vhost_dev {
> > -- 
> > 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-03-17 10:27                 ` Michael S. Tsirkin
@ 2010-04-01  9:14                   ` Xin Xiaohui
  2010-04-01 11:02                     ` [PATCH " Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Xiaohui @ 2010-04-01  9:14 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, mingo, jdike, Xin Xiaohui

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---

Michael,
Now vhost allocates and destroys the kiocb and passes it in through
sendmsg/recvmsg. I did not remove vq->receiver, since what the
callback does is related to structures owned by the mp device,
and I think keeping them isolated from vhost is a good thing for us all.
It also does not prevent the mp device from becoming independent of vhost
in the future. Later, when the mp device can be a real device that
provides asynchronous read/write operations instead of only exposing
proto_ops, it will use another callback function that is not
related to vhost at all.

For the write logging, do you have a function at hand with which we can
recompute the log? If so, I think I can use it to recompute the
log info when logging is suddenly enabled.
For the outstanding requests, do you mean all the user buffers
submitted before the logging ioctl changed? That may be a lot, and
some of them are still in NIC ring descriptors; waiting for them to
finish may take some time. I think it is also reasonable for logging
to simply take effect right after the logging ioctl changes.

Thanks
Xiaohui

 drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+
+	vq_log = unlikely(vhost_has_feature(
+				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)iocb->ki_user_data;
+		size = iocb->ki_nbytes;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+			if (!iocb)
+				break;
+			iocb->ki_pos = head;
+			iocb->private = (void *)vq;
+		}
+
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
@@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			vhost_discard_vq_desc(vq);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
 	int err;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+			vq->link_state == VHOST_VQ_LINK_SYNC))
 		return;
 
 	use_mm(net->dev.mm);
@@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
-	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+	/* In the async case, for write logging, the simple way is to
+	 * always collect the log info, and decide later whether to
+	 * really log. Thus, when logging is enabled we log, and when
+	 * logging is disabled we skip it accordingly.
+	 */
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
+		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
 		vq->log : NULL;
 
+	handle_async_rx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+			if (!iocb)
+				break;
+			iocb->private = vq;
+			iocb->ki_pos = head;
+			iocb->ki_user_data = log;
+		}
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
@@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
 			vhost_discard_vq_desc(vq);
 			break;
 		}
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
+			continue;
+
 		/* TODO: Should check and handle checksum. */
 		if (err > len) {
 			pr_err("Discarded truncated rx packet: "
@@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
 
+
 static void handle_tx_kick(struct work_struct *work)
 {
 	struct vhost_virtqueue *vq;
@@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 	return 0;
 }
 
@@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_notifier_cleanup(struct vhost_net *n)
+{
+	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		kmem_cache_destroy(n->cache);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_notifier_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
 {
 	struct socket *sock;
 	if (fd == -1)
@@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
 	sock = get_tun_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		vq->link_state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		vq->receiver = NULL;
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache) {
+			n->cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
+		}
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	}
 	vq = n->vqs + index;
 	mutex_lock(&vq->mutex);
-	sock = get_socket(fd);
+	vq->link_state = VHOST_VQ_LINK_SYNC;
+	sock = get_socket(vq, fd);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	vhost_net_disable_vq(n, vq);
 	rcu_assign_pointer(vq->private_data, sock);
 	vhost_net_enable_vq(n, vq);
-	mutex_unlock(&vq->mutex);
 done:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
@@ -516,6 +690,7 @@ done:
 	}
 	return r;
 err:
+	mutex_unlock(&vq->mutex);
 	mutex_unlock(&n->dev.mutex);
 	return r;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d1f0453..cffe39a 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 	0,
+	VHOST_VQ_LINK_ASYNC = 	1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -96,6 +101,11 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differentiate an async socket for 0-copy from a normal one */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
+	void (*receiver)(struct vhost_virtqueue *);
 };
 
 struct vhost_dev {
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re:[PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-03-08 11:28   ` [PATCH v1 1/3] A device for zero-copy based on KVM virtio-net Michael S. Tsirkin
@ 2010-04-01  9:27     ` Xin Xiaohui
  2010-04-01 11:08       ` [PATCH " Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin Xiaohui @ 2010-04-01  9:27 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81, Xin Xiaohui

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user space to the host memory and
provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
---

Michael,
Sorry, I did not resolve all your comments this time.
I did not move the device out of the vhost directory because I
have not implemented real asynchronous read/write operations
for the mp device yet. We hope to do this after the network
code is checked in.

For the DoS issue, I'm not sure what a reasonable limit on how much
get_user_pages() can pin would be; should we compute it from the bandwidth?

We now use get_user_pages_fast() and set_page_dirty_lock().
We removed rcu_read_lock()/unlock(), since the ctor pointer is
only changed by the BIND/UNBIND ioctls, and during that time
the NIC is always stopped and all outstanding requests are done,
so the ctor pointer cannot race into a wrong state.

Qemu needs a userspace write; is that a synchronous or an
asynchronous one?
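
For context, here is a rough sketch of how a userspace backend could
drive the mp device together with vhost-net (illustrative only: the
helper name is made up, error handling is minimal, and the qemu
integration itself is not part of this patch):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/vhost.h>
#include <linux/mpassthru.h>

/* illustrative helper, not part of the patch */
static int bind_mp_backend(int vhost_fd, const char *ifname)
{
	struct vhost_vring_file backend = { .index = 0 /* RX vq; TX is set the same way */ };
	struct ifreq ifr;
	int mp_fd = open("/dev/net/mp", O_RDWR);

	if (mp_fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	/* pin the host NIC to the mp device */
	if (ioctl(mp_fd, MPASSTHRU_BINDDEV, &ifr) < 0)
		return -1;

	/* hand the mp fd to vhost-net as the backend socket */
	backend.fd = mp_fd;
	return ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
}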

Thanks
Xiaohui

 drivers/vhost/Kconfig     |    5 +
 drivers/vhost/Makefile    |    2 +
 drivers/vhost/mpassthru.c | 1162 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config VHOST_PASSTHRU
+	tristate "Zerocopy network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..6e8fc4d
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1162 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+struct page_ctor {
+	struct list_head        readq;
+	int 			w_len;
+	int 			r_len;
+	spinlock_t      	read_lock;
+	struct kmem_cache   	*cache;
+	struct net_device   	*dev;
+	struct mpassthru_port	port;
+};
+
+struct page_info {
+	void 			*ctrl;
+	struct list_head    	list;
+	int         		header;
+	/* indicate the actual length of bytes
+	 * send/recv in the user space buffers
+	 */
+	int         		total;
+	int         		offset;
+	struct page     	*pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct 	frag[MAX_SKB_FRAGS+1];
+	struct sk_buff      	*skb;
+	struct page_ctor   	*ctor;
+
+	/* The pointer relayed to the skb, to indicate
+	 * whether it's a user space allocated skb or a kernel one
+	 */
+	struct skb_user_page    user;
+	struct skb_shared_info	ushinfo;
+
+#define INFO_READ      		0
+#define INFO_WRITE     		1
+	unsigned        	flags;
+	unsigned        	pnum;
+
+	/* It's meaningful for receive, means
+	 * the max length allowed
+	 */
+	size_t          	len;
+
+	/* The fields after this are for the backend
+	 * driver, currently vhost-net.
+	 */
+
+	struct kiocb		*iocb;
+	unsigned int    	desc_pos;
+	unsigned int 		log;
+	struct iovec 		hdr[VHOST_NET_MAX_SG];
+	struct iovec 		iov[VHOST_NET_MAX_SG];
+};
+
+struct mp_struct {
+	struct mp_file   	*mfile;
+	struct net_device       *dev;
+	struct page_ctor	*ctor;
+	struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock            	sk;
+	struct mp_struct       	*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+	int ret = 0;
+
+	rtnl_lock();
+	ret = dev_change_flags(dev, flags);
+	rtnl_unlock();
+
+	if (ret < 0)
+		printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+	return ret;
+}
+
+/* The main function to allocate user space buffers */
+static struct skb_user_page *page_ctor(struct mpassthru_port *port,
+					struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++) {
+		get_page(info->pages[i]);
+		info->frag[i].page = info->pages[i];
+		info->frag[i].page_offset = i ? 0 : info->offset;
+		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+			port->data_len;
+	}
+	info->skb = skb;
+	info->user.frags = info->frag;
+	info->user.ushinfo = &info->ushinfo;
+	return &info->user;
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	for (i = 0; i < info->pnum; i++) {
+		if (info->pages[i])
+			put_page(info->pages[i]);
+	}
+
+	if (info->flags == INFO_READ) {
+		skb_shinfo(info->skb)->destructor_arg = &info->user;
+		info->skb->destructor = NULL;
+		kfree_skb(info->skb);
+	}
+
+	kmem_cache_free(info->ctor->cache, info);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_iovec = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->private = (void *)info;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_user_data = info->log;
+	iocb->ki_dtor = mp_ki_dtor;
+	return iocb;
+}
+
+/* A helper to clean the skb before the kfree_skb() */
+
+static void page_dtor_prepare(struct page_info *info)
+{
+	if (info->flags == INFO_READ)
+		if (info->skb)
+			info->skb->head = NULL;
+}
+
+/* The callback to destruct the user space buffers or skb */
+static void page_dtor(struct skb_user_page *user)
+{
+	struct page_info *info;
+	struct page_ctor *ctor;
+	struct sock *sk;
+	struct sk_buff *skb;
+	struct kiocb *iocb = NULL;
+	struct vhost_virtqueue *vq = NULL;
+	unsigned long flags;
+	int i;
+
+	if (!user)
+		return;
+	info = container_of(user, struct page_info, user);
+	if (!info)
+		return;
+	ctor = info->ctor;
+	skb = info->skb;
+
+	page_dtor_prepare(info);
+
+	/* If info->total is 0, put it back on the readq for reuse */
+	if (!info->total) {
+		spin_lock_irqsave(&ctor->read_lock, flags);
+		list_add(&info->list, &ctor->readq);
+		spin_unlock_irqrestore(&ctor->read_lock, flags);
+		return;
+	}
+
+	if (info->flags == INFO_READ)
+		return;
+
+	/* For transmit, we should wait for the hardware DMA to finish.
+	 * Queue the notifier to wake up the backend driver.
+	 */
+	vq = (struct vhost_virtqueue *)info->ctrl;
+	iocb = create_iocb(info, info->total);
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+	sk = ctor->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	/* locked by mp_mutex */
+	if (rcu_dereference(mp->ctor))
+		return -EBUSY;
+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_mp_port_prep(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	ctor->cache = kmem_cache_create("skb_page_info",
+			sizeof(struct page_info), 0,
+			SLAB_HWCACHE_ALIGN, NULL);
+
+	if (!ctor->cache)
+		goto cache_fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+
+	ctor->w_len = 0;
+	ctor->r_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->port.ctor = page_ctor;
+	ctor->port.sock = &mp->socket;
+
+	rc = netdev_mp_port_attach(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, ctor);
+
+	/* XXX:Need we do set_offload here ? */
+
+	return 0;
+
+fail:
+	kmem_cache_destroy(ctor->cache);
+cache_fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq,
+				struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	return info;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	struct vhost_virtqueue *vq = NULL;
+	struct kiocb *iocb = NULL;
+	int i;
+	unsigned long flags;
+
+	/* locked by mp_mutex */
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		vq = (struct vhost_virtqueue *)(info->ctrl);
+		iocb = create_iocb(info, 0);
+
+		spin_lock_irqsave(&vq->notify_lock, flags);
+		list_add_tail(&iocb->ki_list, &vq->notifier);
+		spin_unlock_irqrestore(&vq->notify_lock, flags);
+
+		kmem_cache_free(ctor->cache, info);
+	}
+	kmem_cache_destroy(ctor->cache);
+	netdev_mp_port_detach(ctor->dev);
+	dev_put(ctor->dev);
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, NULL);
+	synchronize_rcu();
+
+	kfree(ctor);
+	return 0;
+}
+
+/* For transmitting small user space buffers, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+						struct kiocb *iocb, int total)
+{
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->total = total;
+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	return info;
+}
+
+/* The main function to translate guest user space addresses
+ * to host kernel addresses via get_user_pages(), so that the
+ * hardware can DMA directly to the user space buffers.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+					struct kiocb *iocb, struct iovec *iov,
+					int count, struct frag *frags,
+					int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+						&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(ctor->dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->total = total;
+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	if (info->flags == INFO_READ) {
+		info->user.start = (u8 *)(((unsigned long)
+				(pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
+				frags[0].offset) - NET_IP_ALIGN - NET_SKB_PAD);
+		info->user.size = iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD;
+		for (i = 0; i < j; i++)
+			set_page_dirty_lock(info->pages[i]);
+	}
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ctor->cache, info);
+
+	return NULL;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+			struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
+	if (!skb)
+		goto drop;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	memcpy_fromiovec(skb->data, iov, header);
+	skb_put(skb, header);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(ctor, iocb, total);
+	} else {
+		info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; info->pages[i]; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->ctrl = vq;
+		info->total = total;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->user;
+		skb->dev = mp->dev;
+		dev_queue_xmit(skb);
+		mp->dev->stats.tx_packets++;
+		mp->dev->stats.tx_bytes += total;
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(info->ctor->cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+
+
+static void mp_recvmsg_notify(struct vhost_virtqueue *vq)
+{
+	struct socket *sock = vq->private_data;
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	struct ethhdr *eth;
+	struct kiocb *iocb = NULL;
+	int len, i;
+	unsigned long flags;
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sock->sk->sk_receive_queue)) != NULL) {
+		if (skb_shinfo(skb)->destructor_arg) {
+			info = container_of(skb_shinfo(skb)->destructor_arg,
+					struct page_info, user);
+			info->skb = skb;
+			if (skb->len > info->len) {
+				mp->dev->stats.rx_dropped++;
+				DBG(KERN_INFO "Discarded truncated rx packet: "
+					" len %d > %zd\n", skb->len, info->len);
+				info->total = skb->len;
+				goto clean;
+			} else {
+				int i;
+				struct skb_shared_info *gshinfo =
+				(struct skb_shared_info *)(&info->ushinfo);
+				struct skb_shared_info *hshinfo =
+						skb_shinfo(skb);
+
+				if (gshinfo->nr_frags < hshinfo->nr_frags)
+					goto clean;
+				eth = eth_hdr(skb);
+				skb_push(skb, ETH_HLEN);
+
+				hdr.hdr_len = skb_headlen(skb);
+				info->total = skb->len;
+
+				for (i = 0; i < gshinfo->nr_frags; i++)
+					gshinfo->frags[i].size = 0;
+				for (i = 0; i < hshinfo->nr_frags; i++)
+					gshinfo->frags[i].size =
+						hshinfo->frags[i].size;
+				memcpy(skb_shinfo(skb), &info->ushinfo,
+						sizeof(struct skb_shared_info));
+			}
+		} else {
+			/* The skb is composed of kernel buffers
+			 * in case the user space buffers are not sufficient.
+			 * This case should be rare.
+			 */
+			unsigned long flags;
+			int i;
+			struct skb_shared_info *gshinfo = NULL;
+
+			info = NULL;
+
+			spin_lock_irqsave(&ctor->read_lock, flags);
+			if (!list_empty(&ctor->readq)) {
+				info = list_first_entry(&ctor->readq,
+						struct page_info, list);
+				list_del(&info->list);
+			}
+			spin_unlock_irqrestore(&ctor->read_lock, flags);
+			if (!info) {
+				DBG(KERN_INFO "No user buffer available %p\n",
+									skb);
+				skb_queue_head(&sock->sk->sk_receive_queue,
+									skb);
+				break;
+			}
+			info->skb = skb;
+			/* compute the guest skb frags info */
+			gshinfo = (struct skb_shared_info *)(info->user.start +
+					SKB_DATA_ALIGN(info->user.size));
+
+			if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
+				goto clean;
+
+			eth = eth_hdr(skb);
+			skb_push(skb, ETH_HLEN);
+			info->total = skb->len;
+
+			for (i = 0; i < gshinfo->nr_frags; i++)
+				gshinfo->frags[i].size = 0;
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+				gshinfo->frags[i].size =
+					skb_shinfo(skb)->frags[i].size;
+			hdr.hdr_len = min_t(int, skb->len,
+						info->iov[1].iov_len);
+			skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
+		}
+
+		len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
+								 sizeof hdr);
+		if (len) {
+			DBG(KERN_INFO
+				"Unable to write vnet_hdr at addr %p: %d\n",
+				info->hdr->iov_base, len);
+			goto clean;
+		}
+		iocb = create_iocb(info, skb->len + sizeof(hdr));
+
+		spin_lock_irqsave(&vq->notify_lock, flags);
+		list_add_tail(&iocb->ki_list, &vq->notifier);
+		spin_unlock_irqrestore(&vq->notify_lock, flags);
+		continue;
+
+clean:
+		kfree_skb(skb);
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	return;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+			struct msghdr *m, size_t total_len,
+			int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -EINVAL;
+
+	/* Error detection in case of an invalid user space buffer */
+	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = ctor->port.npages;
+	payload = ctor->port.data_len;
+
+	/* If the KVM guest virtio-net FE driver uses the SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet head */
+	iov++;
+	count--;
+
+	/* Translate address to kernel */
+	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->len = total_len;
+	info->hdr[0].iov_base = vq->hdr[0].iov_base;
+	info->hdr[0].iov_len = vq->hdr[0].iov_len;
+	info->offset = frags[0].offset;
+	info->desc_pos = iocb->ki_pos;
+	info->log = iocb->ki_user_data;
+	info->ctrl = vq;
+
+	iov--;
+	count++;
+
+	memcpy(info->iov, vq->iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&ctor->read_lock, flag);
+	list_add_tail(&info->list, &ctor->readq);
+	spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+	if (!vq->receiver)
+		vq->receiver = mp_recvmsg_notify;
+
+	return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	mp->mfile = NULL;
+
+	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	page_ctor_detach(mp);
+	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+	/* Drop the extra count on the net device */
+	dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count))
+		mp_detach(mfile->mp);
+}
+
+static int mp_release(struct socket *sock)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct mp_file *mfile = mp->mfile;
+
+	mp_put(mfile);
+	sock_put(mp->socket.sk);
+	put_net(mfile->net);
+
+	return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+	.release = mp_release,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+	DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	dev_hold(mp->dev);
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (!mp)
+		return -EINVAL;
+
+	mp_detach(mp);
+	sock_put(mp->socket.sk);
+	mp_put(mfile);
+	return 0;
+}
+
+static void mp_sock_data_ready(struct sock *sk, int len)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	void __user* argp = (void __user *)arg;
+	struct ifreq ifr;
+	struct sock *sk;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -EBUSY;
+
+		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
+			break;
+
+		ret = -ENODEV;
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev)
+			break;
+
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		init_waitqueue_head(&mp->socket.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = page_ctor_attach(mp);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ifr.ifr_flags |= IFF_MPASSTHRU_EXCL;
+		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+out:
+		mutex_unlock(&mp_mutex);
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		kfree(mp);
+err_dev_put:
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		ret = do_unbind(mfile);
+		break;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->socket.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	do_unbind(mfile);
+
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mpassthru_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+			sock = dev->mp_port->sock;
+			mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+			do_unbind(mp->mfile);
+			break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int ret = 0;
+
+	ret = misc_register(&mp_miscdev);
+	if (ret)
+		printk(KERN_ERR "mp: Can't register misc device\n");
+	else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+			mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return ret;
+}
+
+void mp_cleanup(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_cleanup);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..2be21c5
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,29 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
+
+/* MPASSTHRU ifc flags */
+#define IFF_MPASSTHRU		0x0001
+#define IFF_MPASSTHRU_EXCL	0x0002
+
+#ifdef __KERNEL__
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_VHOST_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-01  9:14                   ` Xin Xiaohui
@ 2010-04-01 11:02                     ` Michael S. Tsirkin
  2010-04-02  2:16                       ` Xin, Xiaohui
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-01 11:02 UTC (permalink / raw)
  To: Xin Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Thu, Apr 01, 2010 at 05:14:56PM +0800, Xin Xiaohui wrote:
> The vhost-net backend now only supports synchronous send/recv
> operations. The patch provides multiple submits and asynchronous
> notifications. This is needed for zero-copy case.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> ---
> 
> Michael,
> Now, I made vhost alloc/destroy the kiocb and pass it through
> sendmsg/recvmsg. I did not remove vq->receiver, since what the
> callback does is related to the structures owned by the mp device,
> and I think isolating them from vhost is a good thing for us all.
> It also will not prevent the mp device from being independent of vhost
> in the future. Later, when the mp device can be a real device which
> provides asynchronous read/write operations and does not just expose
> proto_ops, it will use another callback function which is not
> related to vhost at all.

Thanks, I'll look at the code!

> For the write logging, do you have a function in hand that we can
> use to recompute the log? If so, I think I can use it to recompute the
> log info when logging is suddenly enabled.
> For the outstanding requests, do you mean all the user buffers that were
> submitted before the logging ioctl changed? That may be a lot, and
> some of them are still in NIC ring descriptors. Waiting for them to
> finish may take some time. I think it is also reasonable that logging
> takes effect just after the logging ioctl changes.

The key point is that after the logging ioctl returns, any
subsequent change to memory must be logged. It does not
matter when the request was submitted; otherwise we will
get memory corruption on migration.
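
To be concrete, here is only a sketch (mirroring the RX completion
path your patch already has, with a made-up function name): the log
offset captured at submit time is consulted, and the dirty log
written, only at completion time, so completions that land after the
ioctl are still logged:

static void log_async_rx_completion(struct vhost_net *net,
				    struct vhost_virtqueue *vq,
				    struct kiocb *iocb)
{
	/* decided at completion time, not at submit time */
	struct vhost_log *vq_log =
		vhost_has_feature(&net->dev, VHOST_F_LOG_ALL) ?
			vq->log : NULL;

	vhost_add_used_and_signal(&net->dev, vq,
				  iocb->ki_pos, iocb->ki_nbytes);
	if (vq_log)
		/* log offset was stashed in ki_user_data at submit */
		vhost_log_write(vq, vq_log,
				(int)iocb->ki_user_data,
				iocb->ki_nbytes);
}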

> Thanks
> Xiaohui
> 
>  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   10 +++
>  2 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 22d5fef..2aafd90 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -17,11 +17,13 @@
>  #include <linux/workqueue.h>
>  #include <linux/rcupdate.h>
>  #include <linux/file.h>
> +#include <linux/aio.h>
>  
>  #include <linux/net.h>
>  #include <linux/if_packet.h>
>  #include <linux/if_arp.h>
>  #include <linux/if_tun.h>
> +#include <linux/mpassthru.h>
>  
>  #include <net/sock.h>
>  
> @@ -47,6 +49,7 @@ struct vhost_net {
>  	struct vhost_dev dev;
>  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
>  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> +	struct kmem_cache       *cache;
>  	/* Tells us whether we are polling a socket for TX.
>  	 * We only do this when socket buffer fills up.
>  	 * Protected by tx vq lock. */
> @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
>  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
>  }
>  
> +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	if (!list_empty(&vq->notifier)) {
> +		iocb = list_first_entry(&vq->notifier,
> +				struct kiocb, ki_list);
> +		list_del(&iocb->ki_list);
> +	}
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +	return iocb;
> +}
> +
> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +					struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	struct vhost_log *vq_log = NULL;
> +	int rx_total_len = 0;
> +	int log, size;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +
> +	if (vq->receiver)
> +		vq->receiver(vq);
> +
> +	vq_log = unlikely(vhost_has_feature(
> +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, iocb->ki_nbytes);
> +		log = (int)iocb->ki_user_data;
> +		size = iocb->ki_nbytes;
> +		rx_total_len += iocb->ki_nbytes;
> +
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +		kmem_cache_free(net->cache, iocb);
> +
> +		if (unlikely(vq_log))
> +			vhost_log_write(vq, vq_log, log, size);
> +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> +					struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	int tx_total_len = 0;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, 0);
> +		tx_total_len += iocb->ki_nbytes;
> +
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +
> +		kmem_cache_free(net->cache, iocb);
> +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	struct kiocb *iocb = NULL;
>  	unsigned head, out, in, s;
>  	struct msghdr msg = {
>  		.msg_name = NULL,
> @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
>  		tx_poll_stop(net);
>  	hdr_size = vq->hdr_size;
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
>  		/* Skip header. TODO: support TSO. */
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
>  		msg.msg_iovlen = out;
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> +			if (!iocb)
> +				break;
> +			iocb->ki_pos = head;
> +			iocb->private = (void *)vq;
> +		}
> +
>  		len = iov_length(vq->iov, out);
>  		/* Sanity check */
>  		if (!len) {
> @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
>  			break;
>  		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
>  		if (unlikely(err < 0)) {
>  			vhost_discard_vq_desc(vq);
>  			tx_poll_start(net, sock);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		if (err != len)
>  			pr_err("Truncated TX packet: "
>  			       " len %d != %zd\n", err, len);
> @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
> @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
>  static void handle_rx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> +	struct kiocb *iocb = NULL;
>  	unsigned head, out, in, log, s;
>  	struct vhost_log *vq_log;
>  	struct msghdr msg = {
> @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
>  	int err;
>  	size_t hdr_size;
>  	struct socket *sock = rcu_dereference(vq->private_data);
> -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> +			vq->link_state == VHOST_VQ_LINK_SYNC))
>  		return;
>  
>  	use_mm(net->dev.mm);
> @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
>  	vhost_disable_notify(vq);
>  	hdr_size = vq->hdr_size;
>  
> -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> +	/* In async cases, for write logging, the simple way is to get
> +	 * the log info always, and really logging is decided later.
> +	 * Thus, when logging enabled, we can get log, and when logging
> +	 * disabled, we can get log disabled accordingly.
> +	 */
> +
> +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
>  		vq->log : NULL;
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
>  		msg.msg_iovlen = in;
>  		len = iov_length(vq->iov, in);
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> +			if (!iocb)
> +				break;
> +			iocb->private = vq;
> +			iocb->ki_pos = head;
> +			iocb->ki_user_data = log;
> +		}
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for RX: "
> @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> -		err = sock->ops->recvmsg(NULL, sock, &msg,
> +
> +		err = sock->ops->recvmsg(iocb, sock, &msg,
>  					 len, MSG_DONTWAIT | MSG_TRUNC);
>  		/* TODO: Check specific error and bomb out unless EAGAIN? */
>  		if (err < 0) {
>  			vhost_discard_vq_desc(vq);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		/* TODO: Should check and handle checksum. */
>  		if (err > len) {
>  			pr_err("Discarded truncated rx packet: "
> @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
>  
> +
>  static void handle_tx_kick(struct work_struct *work)
>  {
>  	struct vhost_virtqueue *vq;
> @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
>  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> +	n->cache = NULL;
>  	return 0;
>  }
>  
> @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
>  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
>  }
>  
> +static void vhost_notifier_cleanup(struct vhost_net *n)
> +{
> +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> +	struct kiocb *iocb = NULL;
> +	if (n->cache) {
> +		while ((iocb = notify_dequeue(vq)) != NULL)
> +			kmem_cache_free(n->cache, iocb);
> +		kmem_cache_destroy(n->cache);
> +	}
> +}
> +
>  static int vhost_net_release(struct inode *inode, struct file *f)
>  {
>  	struct vhost_net *n = f->private_data;
> @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	/* We do an extra flush before freeing memory,
>  	 * since jobs can re-queue themselves. */
>  	vhost_net_flush(n);
> +	vhost_notifier_cleanup(n);
>  	kfree(n);
>  	return 0;
>  }
> @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
>  	return sock;
>  }
>  
> -static struct socket *get_socket(int fd)
> +static struct socket *get_mp_socket(int fd)
> +{
> +	struct file *file = fget(fd);
> +	struct socket *sock;
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	sock = mp_get_socket(file);
> +	if (IS_ERR(sock))
> +		fput(file);
> +	return sock;
> +}
> +
> +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
>  {
>  	struct socket *sock;
>  	if (fd == -1)
> @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
>  	sock = get_tun_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> +	sock = get_mp_socket(fd);
> +	if (!IS_ERR(sock)) {
> +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> +		return sock;
> +	}
>  	return ERR_PTR(-ENOTSOCK);
>  }
>  
> +static void vhost_init_link_state(struct vhost_net *n, int index)
> +{
> +	struct vhost_virtqueue *vq = n->vqs + index;
> +
> +	WARN_ON(!mutex_is_locked(&vq->mutex));
> +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +		vq->receiver = NULL;
> +		INIT_LIST_HEAD(&vq->notifier);
> +		spin_lock_init(&vq->notify_lock);
> +		if (!n->cache) {
> +			n->cache = kmem_cache_create("vhost_kiocb",
> +					sizeof(struct kiocb), 0,
> +					SLAB_HWCACHE_ALIGN, NULL);
> +		}
> +	}
> +}
> +
>  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  {
>  	struct socket *sock, *oldsock;
> @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	}
>  	vq = n->vqs + index;
>  	mutex_lock(&vq->mutex);
> -	sock = get_socket(fd);
> +	vq->link_state = VHOST_VQ_LINK_SYNC;
> +	sock = get_socket(vq, fd);
>  	if (IS_ERR(sock)) {
>  		r = PTR_ERR(sock);
>  		goto err;
>  	}
>  
> +	vhost_init_link_state(n, index);
> +
>  	/* start polling new socket */
>  	oldsock = vq->private_data;
>  	if (sock == oldsock)
> @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	vhost_net_disable_vq(n, vq);
>  	rcu_assign_pointer(vq->private_data, sock);
>  	vhost_net_enable_vq(n, vq);
> -	mutex_unlock(&vq->mutex);
>  done:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	if (oldsock) {
>  		vhost_net_flush_vq(n, index);
> @@ -516,6 +690,7 @@ done:
>  	}
>  	return r;
>  err:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	return r;
>  }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d1f0453..cffe39a 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -43,6 +43,11 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +enum vhost_vq_link_state {
> +	VHOST_VQ_LINK_SYNC = 	0,
> +	VHOST_VQ_LINK_ASYNC = 	1,
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -96,6 +101,11 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log log[VHOST_NET_MAX_SG];
> +	/*Differiate async socket for 0-copy from normal*/
> +	enum vhost_vq_link_state link_state;
> +	struct list_head notifier;
> +	spinlock_t notify_lock;
> +	void (*receiver)(struct vhost_virtqueue *);
>  };
>  
>  struct vhost_dev {
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-01  9:27     ` Xin Xiaohui
@ 2010-04-01 11:08       ` Michael S. Tsirkin
  2010-04-06  5:41         ` Xin, Xiaohui
  2010-04-07  2:41         ` Xin, Xiaohui
  0 siblings, 2 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-01 11:08 UTC (permalink / raw)
  To: Xin Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81

On Thu, Apr 01, 2010 at 05:27:18PM +0800, Xin Xiaohui wrote:
> Add a device to utilize the vhost-net backend driver for
> copy-less data transfer between guest FE and host NIC.
> It pins the guest user space to the host memory and
> provides proto_ops as sendmsg/recvmsg to vhost-net.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81@gmail.com>
> Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
> ---
> 
> Michael,
> Sorry, I did not resolve all your comments this time.
> I did not move the device out of the vhost directory because I
> have not implemented real asynchronous read/write operations
> for the mp device yet. We hope to do this after the network
> code is checked in.

Well, placement of code is not such a major issue.
It's just that code under drivers/net gets more and better
review than drivers/vhost. I'll try to get Dave's opinion.

> 
> For the DoS issue, I'm not sure what a reasonable limit on how much
> get_user_pages() can pin would be; should we compute it from the bandwidth?

There's a ulimit for locked memory. Can we use this, decreasing
the value in the rlimit array? We can do this when the backend is
enabled and re-increment it when the backend is disabled.
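
Something along these lines, perhaps (just a sketch; the helper and
variable names are made up, and how much to charge is left open):

/* illustrative only: reserve the pinned memory out of RLIMIT_MEMLOCK
 * while the backend is enabled */
static int mp_reserve_locked_rlimit(unsigned long pin_bytes)
{
	struct rlimit *rlim;
	int ret = 0;

	task_lock(current->group_leader);
	rlim = &current->signal->rlim[RLIMIT_MEMLOCK];
	if (rlim->rlim_cur < pin_bytes)
		ret = -ENOMEM;
	else
		rlim->rlim_cur -= pin_bytes;
	task_unlock(current->group_leader);
	return ret;
}

/* and rlim_cur += pin_bytes again when the backend is disabled */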

> We now use get_user_pages_fast() and set_page_dirty_lock().
> We removed rcu_read_lock()/unlock(), since the ctor pointer is
> only changed by the BIND/UNBIND ioctls, and during that time
> the NIC is always stopped and all outstanding requests are done,
> so the ctor pointer cannot race into a wrong state.
> 
> Qemu needs a userspace write; is that a synchronous or an
> asynchronous one?

It's a synchronous non-blocking write.

> Thanks
> Xiaohui
> 
>  drivers/vhost/Kconfig     |    5 +
>  drivers/vhost/Makefile    |    2 +
>  drivers/vhost/mpassthru.c | 1162 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mpassthru.h |   29 ++
>  4 files changed, 1198 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/mpassthru.c
>  create mode 100644 include/linux/mpassthru.h
> 
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index 9f409f4..ee32a3b 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -9,3 +9,8 @@ config VHOST_NET
>  	  To compile this driver as a module, choose M here: the module will
>  	  be called vhost_net.
>  
> +config VHOST_PASSTHRU
> +	tristate "Zerocopy network driver (EXPERIMENTAL)"
> +	depends on VHOST_NET
> +	---help---
> +	  zerocopy network I/O support
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 72dd020..3f79c79 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -1,2 +1,4 @@
>  obj-$(CONFIG_VHOST_NET) += vhost_net.o
>  vhost_net-y := vhost.o net.o
> +
> +obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 0000000..6e8fc4d
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1162 @@
> +/*
> + *  MPASSTHRU - Mediate passthrough device.
> + *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + *  GNU General Public License for more details.
> + *
> + */
> +
> +#define DRV_NAME        "mpassthru"
> +#define DRV_DESCRIPTION "Mediate passthru device driver"
> +#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
> +
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/major.h>
> +#include <linux/slab.h>
> +#include <linux/smp_lock.h>
> +#include <linux/poll.h>
> +#include <linux/fcntl.h>
> +#include <linux/init.h>
> +#include <linux/aio.h>
> +
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/miscdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if.h>
> +#include <linux/if_arp.h>
> +#include <linux/if_ether.h>
> +#include <linux/crc32.h>
> +#include <linux/nsproxy.h>
> +#include <linux/uaccess.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mpassthru.h>
> +#include <net/net_namespace.h>
> +#include <net/netns/generic.h>
> +#include <net/rtnetlink.h>
> +#include <net/sock.h>
> +
> +#include <asm/system.h>
> +
> +#include "vhost.h"
> +
> +/* Uncomment to enable debugging */
> +/* #define MPASSTHRU_DEBUG 1 */
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk
> +#else
> +#define DBG(a...)
> +#define DBG1(a...)
> +#endif
> +
> +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
> +#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
> +
> +struct frag {
> +	u16     offset;
> +	u16     size;
> +};
> +
> +struct page_ctor {
> +	struct list_head        readq;
> +	int 			w_len;
> +	int 			r_len;
> +	spinlock_t      	read_lock;
> +	struct kmem_cache   	*cache;
> +	struct net_device   	*dev;
> +	struct mpassthru_port	port;
> +};
> +
> +struct page_info {
> +	void 			*ctrl;
> +	struct list_head    	list;
> +	int         		header;
> +	/* indicate the actual length of bytes
> +	 * send/recv in the user space buffers
> +	 */
> +	int         		total;
> +	int         		offset;
> +	struct page     	*pages[MAX_SKB_FRAGS+1];
> +	struct skb_frag_struct 	frag[MAX_SKB_FRAGS+1];
> +	struct sk_buff      	*skb;
> +	struct page_ctor   	*ctor;
> +
> +	/* The pointer relayed to skb, to indicate
> +	 * it's a user space allocated skb or kernel
> +	 */
> +	struct skb_user_page    user;
> +	struct skb_shared_info	ushinfo;
> +
> +#define INFO_READ      		0
> +#define INFO_WRITE     		1
> +	unsigned        	flags;
> +	unsigned        	pnum;
> +
> +	/* It's meaningful for receive, means
> +	 * the max length allowed
> +	 */
> +	size_t          	len;
> +
> +	/* The fields after that is for backend
> +	 * driver, now for vhost-net.
> +	 */
> +
> +	struct kiocb		*iocb;
> +	unsigned int    	desc_pos;
> +	unsigned int 		log;
> +	struct iovec 		hdr[VHOST_NET_MAX_SG];
> +	struct iovec 		iov[VHOST_NET_MAX_SG];
> +};
> +
> +struct mp_struct {
> +	struct mp_file   	*mfile;
> +	struct net_device       *dev;
> +	struct page_ctor	*ctor;
> +	struct socket           socket;
> +
> +#ifdef MPASSTHRU_DEBUG
> +	int debug;
> +#endif
> +};
> +
> +struct mp_file {
> +	atomic_t count;
> +	struct mp_struct *mp;
> +	struct net *net;
> +};
> +
> +struct mp_sock {
> +	struct sock            	sk;
> +	struct mp_struct       	*mp;
> +};
> +
> +static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
> +{
> +	int ret = 0;
> +
> +	rtnl_lock();
> +	ret = dev_change_flags(dev, flags);
> +	rtnl_unlock();
> +
> +	if (ret < 0)
> +		printk(KERN_ERR "failed to change dev state of %s", dev->name);
> +
> +	return ret;
> +}
> +
> +/* The main function to allocate user space buffers */
> +static struct skb_user_page *page_ctor(struct mpassthru_port *port,
> +					struct sk_buff *skb, int npages)
> +{
> +	int i;
> +	unsigned long flags;
> +	struct page_ctor *ctor;
> +	struct page_info *info = NULL;
> +
> +	ctor = container_of(port, struct page_ctor, port);
> +
> +	spin_lock_irqsave(&ctor->read_lock, flags);
> +	if (!list_empty(&ctor->readq)) {
> +		info = list_first_entry(&ctor->readq, struct page_info, list);
> +		list_del(&info->list);
> +	}
> +	spin_unlock_irqrestore(&ctor->read_lock, flags);
> +	if (!info)
> +		return NULL;
> +
> +	for (i = 0; i < info->pnum; i++) {
> +		get_page(info->pages[i]);
> +		info->frag[i].page = info->pages[i];
> +		info->frag[i].page_offset = i ? 0 : info->offset;
> +		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
> +			port->data_len;
> +	}
> +	info->skb = skb;
> +	info->user.frags = info->frag;
> +	info->user.ushinfo = &info->ushinfo;
> +	return &info->user;
> +}
> +
> +static void mp_ki_dtor(struct kiocb *iocb)
> +{
> +	struct page_info *info = (struct page_info *)(iocb->private);
> +	int i;
> +
> +	for (i = 0; i < info->pnum; i++) {
> +		if (info->pages[i])
> +			put_page(info->pages[i]);
> +	}
> +
> +	if (info->flags == INFO_READ) {
> +		skb_shinfo(info->skb)->destructor_arg = &info->user;
> +		info->skb->destructor = NULL;
> +		kfree_skb(info->skb);
> +	}
> +
> +	kmem_cache_free(info->ctor->cache, info);
> +
> +	return;
> +}
> +
> +static struct kiocb *create_iocb(struct page_info *info, int size)
> +{
> +	struct kiocb *iocb = NULL;
> +
> +	iocb = info->iocb;
> +	if (!iocb)
> +		return iocb;
> +	iocb->ki_flags = 0;
> +	iocb->ki_users = 1;
> +	iocb->ki_key = 0;
> +	iocb->ki_ctx = NULL;
> +	iocb->ki_cancel = NULL;
> +	iocb->ki_retry = NULL;
> +	iocb->ki_iovec = NULL;
> +	iocb->ki_eventfd = NULL;
> +	iocb->private = (void *)info;
> +	iocb->ki_pos = info->desc_pos;
> +	iocb->ki_nbytes = size;
> +	iocb->ki_user_data = info->log;
> +	iocb->ki_dtor = mp_ki_dtor;
> +	return iocb;
> +}
> +
> +/* A helper to clean the skb before the kfree_skb() */
> +
> +static void page_dtor_prepare(struct page_info *info)
> +{
> +	if (info->flags == INFO_READ)
> +		if (info->skb)
> +			info->skb->head = NULL;
> +}
> +
> +/* The callback to destruct the user space buffers or skb */
> +static void page_dtor(struct skb_user_page *user)
> +{
> +	struct page_info *info;
> +	struct page_ctor *ctor;
> +	struct sock *sk;
> +	struct sk_buff *skb;
> +	struct kiocb *iocb = NULL;
> +	struct vhost_virtqueue *vq = NULL;
> +	unsigned long flags;
> +	int i;
> +
> +	if (!user)
> +		return;
> +	info = container_of(user, struct page_info, user);
> +	if (!info)
> +		return;
> +	ctor = info->ctor;
> +	skb = info->skb;
> +
> +	page_dtor_prepare(info);
> +
> +	/* If the info->total is 0, make it to be reused */
> +	if (!info->total) {
> +		spin_lock_irqsave(&ctor->read_lock, flags);
> +		list_add(&info->list, &ctor->readq);
> +		spin_unlock_irqrestore(&ctor->read_lock, flags);
> +		return;
> +	}
> +
> +	if (info->flags == INFO_READ)
> +		return;
> +
> +	/* For transmit, we should wait for the DMA finish by hardware.
> +	 * Queue the notifier to wake up the backend driver
> +	 */
> +	vq = (struct vhost_virtqueue *)info->ctrl;
> +	iocb = create_iocb(info, info->total);
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	list_add_tail(&iocb->ki_list, &vq->notifier);
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +
> +	sk = ctor->port.sock->sk;
> +	sk->sk_write_space(sk);
> +
> +	return;
> +}
> +
> +static int page_ctor_attach(struct mp_struct *mp)
> +{
> +	int rc;
> +	struct page_ctor *ctor;
> +	struct net_device *dev = mp->dev;
> +
> +	/* locked by mp_mutex */
> +	if (rcu_dereference(mp->ctor))
> +		return -EBUSY;
> +
> +	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
> +	if (!ctor)
> +		return -ENOMEM;
> +	rc = netdev_mp_port_prep(dev, &ctor->port);
> +	if (rc)
> +		goto fail;
> +
> +	ctor->cache = kmem_cache_create("skb_page_info",
> +			sizeof(struct page_info), 0,
> +			SLAB_HWCACHE_ALIGN, NULL);
> +
> +	if (!ctor->cache)
> +		goto cache_fail;
> +
> +	INIT_LIST_HEAD(&ctor->readq);
> +	spin_lock_init(&ctor->read_lock);
> +
> +	ctor->w_len = 0;
> +	ctor->r_len = 0;
> +
> +	dev_hold(dev);
> +	ctor->dev = dev;
> +	ctor->port.ctor = page_ctor;
> +	ctor->port.sock = &mp->socket;
> +
> +	rc = netdev_mp_port_attach(dev, &ctor->port);
> +	if (rc)
> +		goto fail;
> +
> +	/* locked by mp_mutex */
> +	rcu_assign_pointer(mp->ctor, ctor);
> +
> +	/* XXX:Need we do set_offload here ? */
> +
> +	return 0;
> +
> +fail:
> +	kmem_cache_destroy(ctor->cache);
> +cache_fail:
> +	kfree(ctor);
> +	dev_put(dev);
> +
> +	return rc;
> +}
> +
> +struct page_info *info_dequeue(struct page_ctor *ctor)
> +{
> +	unsigned long flags;
> +	struct page_info *info = NULL;
> +	spin_lock_irqsave(&ctor->read_lock, flags);
> +	if (!list_empty(&ctor->readq)) {
> +		info = list_first_entry(&ctor->readq,
> +				struct page_info, list);
> +		list_del(&info->list);
> +	}
> +	spin_unlock_irqrestore(&ctor->read_lock, flags);
> +	return info;
> +}
> +
> +static int page_ctor_detach(struct mp_struct *mp)
> +{
> +	struct page_ctor *ctor;
> +	struct page_info *info;
> +	struct vhost_virtqueue *vq = NULL;
> +	struct kiocb *iocb = NULL;
> +	int i;
> +	unsigned long flags;
> +
> +	/* locked by mp_mutex */
> +	ctor = rcu_dereference(mp->ctor);
> +	if (!ctor)
> +		return -ENODEV;
> +
> +	while ((info = info_dequeue(ctor))) {
> +		for (i = 0; i < info->pnum; i++)
> +			if (info->pages[i])
> +				put_page(info->pages[i]);
> +		vq = (struct vhost_virtqueue *)(info->ctrl);
> +		iocb = create_iocb(info, 0);
> +
> +		spin_lock_irqsave(&vq->notify_lock, flags);
> +		list_add_tail(&iocb->ki_list, &vq->notifier);
> +		spin_unlock_irqrestore(&vq->notify_lock, flags);
> +
> +		kmem_cache_free(ctor->cache, info);
> +	}
> +	kmem_cache_destroy(ctor->cache);
> +	netdev_mp_port_detach(ctor->dev);
> +	dev_put(ctor->dev);
> +
> +	/* locked by mp_mutex */
> +	rcu_assign_pointer(mp->ctor, NULL);
> +	synchronize_rcu();
> +
> +	kfree(ctor);
> +	return 0;
> +}
> +
> +/* For small user space buffers transmit, we don't need to call
> + * get_user_pages().
> + */
> +static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
> +						struct kiocb *iocb, int total)
> +{
> +	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
> +
> +	if (!info)
> +		return NULL;
> +	info->total = total;
> +	info->user.dtor = page_dtor;
> +	info->ctor = ctor;
> +	info->flags = INFO_WRITE;
> +	info->iocb = iocb;
> +	return info;
> +}
> +
> +/* The main function to transform the guest user space address
> + * to host kernel address via get_user_pages(). Thus the hardware
> + * can do DMA directly to the user space address.
> + */
> +static struct page_info *alloc_page_info(struct page_ctor *ctor,
> +					struct kiocb *iocb, struct iovec *iov,
> +					int count, struct frag *frags,
> +					int npages, int total)
> +{
> +	int rc;
> +	int i, j, n = 0;
> +	int len;
> +	unsigned long base;
> +	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
> +
> +	if (!info)
> +		return NULL;
> +
> +	for (i = j = 0; i < count; i++) {
> +		base = (unsigned long)iov[i].iov_base;
> +		len = iov[i].iov_len;
> +
> +		if (!len)
> +			continue;
> +		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +
> +		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
> +						&info->pages[j]);
> +		if (rc != n)
> +			goto failed;
> +
> +		while (n--) {
> +			frags[j].offset = base & ~PAGE_MASK;
> +			frags[j].size = min_t(int, len,
> +					PAGE_SIZE - frags[j].offset);
> +			len -= frags[j].size;
> +			base += frags[j].size;
> +			j++;
> +		}
> +	}
> +
> +#ifdef CONFIG_HIGHMEM
> +	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
> +		for (i = 0; i < j; i++) {
> +			if (PageHighMem(info->pages[i]))
> +				goto failed;
> +		}
> +	}
> +#endif
> +
> +	info->total = total;
> +	info->user.dtor = page_dtor;
> +	info->ctor = ctor;
> +	info->pnum = j;
> +	info->iocb = iocb;
> +	if (!npages)
> +		info->flags = INFO_WRITE;
> +	if (info->flags == INFO_READ) {
> +		info->user.start = (u8 *)(((unsigned long)
> +				(pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
> +				frags[0].offset) - NET_IP_ALIGN - NET_SKB_PAD);
> +		info->user.size = iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD;
> +		for (i = 0; i < j; i++)
> +			set_page_dirty_lock(info->pages[i]);
> +	}
> +	return info;
> +
> +failed:
> +	for (i = 0; i < j; i++)
> +		put_page(info->pages[i]);
> +
> +	kmem_cache_free(ctor->cache, info);
> +
> +	return NULL;
> +}
> +
> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> +			struct msghdr *m, size_t total_len)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor;
> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
> +	struct iovec *iov = m->msg_iov;
> +	struct page_info *info = NULL;
> +	struct frag frags[MAX_SKB_FRAGS];
> +	struct sk_buff *skb;
> +	int count = m->msg_iovlen;
> +	int total = 0, header, n, i, len, rc;
> +	unsigned long base;
> +
> +	ctor = rcu_dereference(mp->ctor);
> +	if (!ctor)
> +		return -ENODEV;
> +
> +	total = iov_length(iov, count);
> +
> +	if (total < ETH_HLEN)
> +		return -EINVAL;
> +
> +	if (total <= COPY_THRESHOLD)
> +		goto copy;
> +
> +	n = 0;
> +	for (i = 0; i < count; i++) {
> +		base = (unsigned long)iov[i].iov_base;
> +		len = iov[i].iov_len;
> +		if (!len)
> +			continue;
> +		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +		if (n > MAX_SKB_FRAGS)
> +			return -EINVAL;
> +	}
> +
> +copy:
> +	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
> +
> +	skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
> +	if (!skb)
> +		goto drop;
> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +
> +	skb_set_network_header(skb, ETH_HLEN);
> +
> +	memcpy_fromiovec(skb->data, iov, header);
> +	skb_put(skb, header);
> +	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
> +
> +	if (header == total) {
> +		rc = total;
> +		info = alloc_small_page_info(ctor, iocb, total);
> +	} else {
> +		info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
> +		if (info)
> +			for (i = 0; info->pages[i]; i++) {
> +				skb_add_rx_frag(skb, i, info->pages[i],
> +						frags[i].offset, frags[i].size);
> +				info->pages[i] = NULL;
> +			}
> +	}
> +	if (info != NULL) {
> +		info->desc_pos = iocb->ki_pos;
> +		info->ctrl = vq;
> +		info->total = total;
> +		info->skb = skb;
> +		skb_shinfo(skb)->destructor_arg = &info->user;
> +		skb->dev = mp->dev;
> +		dev_queue_xmit(skb);
> +		mp->dev->stats.tx_packets++;
> +		mp->dev->stats.tx_bytes += total;
> +		return 0;
> +	}
> +drop:
> +	kfree_skb(skb);
> +	if (info) {
> +		for (i = 0; info->pages[i]; i++)
> +			put_page(info->pages[i]);
> +		kmem_cache_free(info->ctor->cache, info);
> +	}
> +	mp->dev->stats.tx_dropped++;
> +	return -ENOMEM;
> +}
> +
> +
> +static void mp_recvmsg_notify(struct vhost_virtqueue *vq)
> +{
> +	struct socket *sock = vq->private_data;
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor = NULL;
> +	struct sk_buff *skb = NULL;
> +	struct page_info *info = NULL;
> +	struct ethhdr *eth;
> +	struct kiocb *iocb = NULL;
> +	int len, i;
> +	unsigned long flags;
> +
> +	struct virtio_net_hdr hdr = {
> +		.flags = 0,
> +		.gso_type = VIRTIO_NET_HDR_GSO_NONE
> +	};
> +
> +	ctor = rcu_dereference(mp->ctor);
> +	if (!ctor)
> +		return;
> +
> +	while ((skb = skb_dequeue(&sock->sk->sk_receive_queue)) != NULL) {
> +		if (skb_shinfo(skb)->destructor_arg) {
> +			info = container_of(skb_shinfo(skb)->destructor_arg,
> +					struct page_info, user);
> +			info->skb = skb;
> +			if (skb->len > info->len) {
> +				mp->dev->stats.rx_dropped++;
> +				DBG(KERN_INFO "Discarded truncated rx packet: "
> +					" len %d > %zd\n", skb->len, info->len);
> +				info->total = skb->len;
> +				goto clean;
> +			} else {
> +				int i;
> +				struct skb_shared_info *gshinfo =
> +				(struct skb_shared_info *)(&info->ushinfo);
> +				struct skb_shared_info *hshinfo =
> +						skb_shinfo(skb);
> +
> +				if (gshinfo->nr_frags < hshinfo->nr_frags)
> +					goto clean;
> +				eth = eth_hdr(skb);
> +				skb_push(skb, ETH_HLEN);
> +
> +				hdr.hdr_len = skb_headlen(skb);
> +				info->total = skb->len;
> +
> +				for (i = 0; i < gshinfo->nr_frags; i++)
> +					gshinfo->frags[i].size = 0;
> +				for (i = 0; i < hshinfo->nr_frags; i++)
> +					gshinfo->frags[i].size =
> +						hshinfo->frags[i].size;
> +				memcpy(skb_shinfo(skb), &info->ushinfo,
> +						sizeof(struct skb_shared_info));
> +			}
> +		} else {
> +			/* The skb is composed of kernel buffers,
> +			 * in case user space buffers are not sufficient.
> +			 * The case should be rare.
> +			 */
> +			unsigned long flags;
> +			int i;
> +			struct skb_shared_info *gshinfo = NULL;
> +
> +			info = NULL;
> +
> +			spin_lock_irqsave(&ctor->read_lock, flags);
> +			if (!list_empty(&ctor->readq)) {
> +				info = list_first_entry(&ctor->readq,
> +						struct page_info, list);
> +				list_del(&info->list);
> +			}
> +			spin_unlock_irqrestore(&ctor->read_lock, flags);
> +			if (!info) {
> +				DBG(KERN_INFO "No user buffer available %p\n",
> +									skb);
> +				skb_queue_head(&sock->sk->sk_receive_queue,
> +									skb);
> +				break;
> +			}
> +			info->skb = skb;
> +			/* compute the guest skb frags info */
> +			gshinfo = (struct skb_shared_info *)(info->user.start +
> +					SKB_DATA_ALIGN(info->user.size));
> +
> +			if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
> +				goto clean;
> +
> +			eth = eth_hdr(skb);
> +			skb_push(skb, ETH_HLEN);
> +			info->total = skb->len;
> +
> +			for (i = 0; i < gshinfo->nr_frags; i++)
> +				gshinfo->frags[i].size = 0;
> +			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> +				gshinfo->frags[i].size =
> +					skb_shinfo(skb)->frags[i].size;
> +			hdr.hdr_len = min_t(int, skb->len,
> +						info->iov[1].iov_len);
> +			skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
> +		}
> +
> +		len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
> +								 sizeof hdr);
> +		if (len) {
> +			DBG(KERN_INFO
> +				"Unable to write vnet_hdr at addr %p: %d\n",
> +				info->hdr->iov_base, len);
> +			goto clean;
> +		}
> +		iocb = create_iocb(info, skb->len + sizeof(hdr));
> +
> +		spin_lock_irqsave(&vq->notify_lock, flags);
> +		list_add_tail(&iocb->ki_list, &vq->notifier);
> +		spin_unlock_irqrestore(&vq->notify_lock, flags);
> +		continue;
> +
> +clean:
> +		kfree_skb(skb);
> +		for (i = 0; info->pages[i]; i++)
> +			put_page(info->pages[i]);
> +		kmem_cache_free(ctor->cache, info);
> +	}
> +	return;
> +}
> +
> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> +			struct msghdr *m, size_t total_len,
> +			int flags)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor;
> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
> +	struct iovec *iov = m->msg_iov;
> +	int count = m->msg_iovlen;
> +	int npages, payload;
> +	struct page_info *info;
> +	struct frag frags[MAX_SKB_FRAGS];
> +	unsigned long base;
> +	int i, len;
> +	unsigned long flag;
> +
> +	if (!(flags & MSG_DONTWAIT))
> +		return -EINVAL;
> +
> +	ctor = rcu_dereference(mp->ctor);
> +	if (!ctor)
> +		return -EINVAL;
> +
> +	/* Error detection in case of an invalid user space buffer */
> +	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
> +			mp->dev->features & NETIF_F_SG) {
> +		return -EINVAL;
> +	}
> +
> +	npages = ctor->port.npages;
> +	payload = ctor->port.data_len;
> +
> +	/* If KVM guest virtio-net FE driver use SG feature */
> +	if (count > 2) {
> +		for (i = 2; i < count; i++) {
> +			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
> +			len = iov[i].iov_len;
> +			if (npages == 1)
> +				len = min_t(int, len, PAGE_SIZE - base);
> +			else if (base)
> +				break;
> +			payload -= len;
> +			if (payload <= 0)
> +				goto proceed;
> +			if (npages == 1 || (len & ~PAGE_MASK))
> +				break;
> +		}
> +	}
> +
> +	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
> +				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
> +		goto proceed;
> +
> +	return -EINVAL;
> +
> +proceed:
> +	/* skip the virtnet head */
> +	iov++;
> +	count--;
> +
> +	/* Translate address to kernel */
> +	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
> +	if (!info)
> +		return -ENOMEM;
> +	info->len = total_len;
> +	info->hdr[0].iov_base = vq->hdr[0].iov_base;
> +	info->hdr[0].iov_len = vq->hdr[0].iov_len;
> +	info->offset = frags[0].offset;
> +	info->desc_pos = iocb->ki_pos;
> +	info->log = iocb->ki_user_data;
> +	info->ctrl = vq;
> +
> +	iov--;
> +	count++;
> +
> +	memcpy(info->iov, vq->iov, sizeof(struct iovec) * count);
> +
> +	spin_lock_irqsave(&ctor->read_lock, flag);
> +	list_add_tail(&info->list, &ctor->readq);
> +	spin_unlock_irqrestore(&ctor->read_lock, flag);
> +
> +	if (!vq->receiver)
> +		vq->receiver = mp_recvmsg_notify;
> +
> +	return 0;
> +}
> +
> +static void __mp_detach(struct mp_struct *mp)
> +{
> +	mp->mfile = NULL;
> +
> +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> +	page_ctor_detach(mp);
> +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> +	/* Drop the extra count on the net device */
> +	dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> +	mutex_lock(&mp_mutex);
> +	__mp_detach(mp);
> +	mutex_unlock(&mp_mutex);
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> +	if (atomic_dec_and_test(&mfile->count))
> +		mp_detach(mfile->mp);
> +}
> +
> +static int mp_release(struct socket *sock)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct mp_file *mfile = mp->mfile;
> +
> +	mp_put(mfile);
> +	sock_put(mp->socket.sk);
> +	put_net(mfile->net);
> +
> +	return 0;
> +}
> +
> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> +	.sendmsg = mp_sendmsg,
> +	.recvmsg = mp_recvmsg,
> +	.release = mp_release,
> +};
> +
> +static struct proto mp_proto = {
> +	.name           = "mp",
> +	.owner          = THIS_MODULE,
> +	.obj_size       = sizeof(struct mp_sock),
> +};
> +
> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> +	struct mp_file *mfile;
> +	cycle_kernel_lock();
> +	DBG1(KERN_INFO "mp: mp_chr_open\n");
> +
> +	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
> +	if (!mfile)
> +		return -ENOMEM;
> +	atomic_set(&mfile->count, 0);
> +	mfile->mp = NULL;
> +	mfile->net = get_net(current->nsproxy->net_ns);
> +	file->private_data = mfile;
> +	return 0;
> +}
> +
> +
> +static struct mp_struct *mp_get(struct mp_file *mfile)
> +{
> +	struct mp_struct *mp = NULL;
> +	if (atomic_inc_not_zero(&mfile->count))
> +		mp = mfile->mp;
> +
> +	return mp;
> +}
> +
> +
> +static int mp_attach(struct mp_struct *mp, struct file *file)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	int err;
> +
> +	netif_tx_lock_bh(mp->dev);
> +
> +	err = -EINVAL;
> +
> +	if (mfile->mp)
> +		goto out;
> +
> +	err = -EBUSY;
> +	if (mp->mfile)
> +		goto out;
> +
> +	err = 0;
> +	mfile->mp = mp;
> +	mp->mfile = mfile;
> +	mp->socket.file = file;
> +	dev_hold(mp->dev);
> +	sock_hold(mp->socket.sk);
> +	atomic_inc(&mfile->count);
> +
> +out:
> +	netif_tx_unlock_bh(mp->dev);
> +	return err;
> +}
> +
> +static void mp_sock_destruct(struct sock *sk)
> +{
> +	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
> +	kfree(mp);
> +}
> +
> +static int do_unbind(struct mp_file *mfile)
> +{
> +	struct mp_struct *mp = mp_get(mfile);
> +
> +	if (!mp)
> +		return -EINVAL;
> +
> +	mp_detach(mp);
> +	sock_put(mp->socket.sk);
> +	mp_put(mfile);
> +	return 0;
> +}
> +
> +static void mp_sock_data_ready(struct sock *sk, int len)
> +{
> +	if (sk_has_sleeper(sk))
> +		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
> +}
> +
> +static void mp_sock_write_space(struct sock *sk)
> +{
> +	if (sk_has_sleeper(sk))
> +		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
> +}
> +
> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> +		unsigned long arg)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp;
> +	struct net_device *dev;
> +	void __user* argp = (void __user *)arg;
> +	struct ifreq ifr;
> +	struct sock *sk;
> +	int ret;
> +
> +	ret = -EINVAL;
> +
> +	switch (cmd) {
> +	case MPASSTHRU_BINDDEV:
> +		ret = -EFAULT;
> +		if (copy_from_user(&ifr, argp, sizeof ifr))
> +			break;
> +
> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> +
> +		ret = -EBUSY;
> +
> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> +			break;
> +
> +		ret = -ENODEV;
> +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> +		if (!dev)
> +			break;
> +
> +		mutex_lock(&mp_mutex);
> +
> +		ret = -EBUSY;
> +		mp = mfile->mp;
> +		if (mp)
> +			goto err_dev_put;
> +
> +		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
> +		if (!mp) {
> +			ret = -ENOMEM;
> +			goto err_dev_put;
> +		}
> +		mp->dev = dev;
> +		ret = -ENOMEM;
> +
> +		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
> +		if (!sk)
> +			goto err_free_mp;
> +
> +		init_waitqueue_head(&mp->socket.wait);
> +		mp->socket.ops = &mp_socket_ops;
> +		sock_init_data(&mp->socket, sk);
> +		sk->sk_sndbuf = INT_MAX;
> +		container_of(sk, struct mp_sock, sk)->mp = mp;
> +
> +		sk->sk_destruct = mp_sock_destruct;
> +		sk->sk_data_ready = mp_sock_data_ready;
> +		sk->sk_write_space = mp_sock_write_space;
> +
> +		ret = mp_attach(mp, file);
> +		if (ret < 0)
> +			goto err_free_sk;
> +
> +		ret = page_ctor_attach(mp);
> +		if (ret < 0)
> +			goto err_free_sk;
> +
> +		ifr.ifr_flags |= IFF_MPASSTHRU_EXCL;
> +		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +out:
> +		mutex_unlock(&mp_mutex);
> +		break;
> +err_free_sk:
> +		sk_free(sk);
> +err_free_mp:
> +		kfree(mp);
> +err_dev_put:
> +		dev_put(dev);
> +		goto out;
> +
> +	case MPASSTHRU_UNBINDDEV:
> +		ret = do_unbind(mfile);
> +		break;
> +
> +	default:
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp = mp_get(mfile);
> +	struct sock *sk;
> +	unsigned int mask = 0;
> +
> +	if (!mp)
> +		return POLLERR;
> +
> +	sk = mp->socket.sk;
> +
> +	poll_wait(file, &mp->socket.wait, wait);
> +
> +	if (!skb_queue_empty(&sk->sk_receive_queue))
> +		mask |= POLLIN | POLLRDNORM;
> +
> +	if (sock_writeable(sk) ||
> +		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
> +			 sock_writeable(sk)))
> +		mask |= POLLOUT | POLLWRNORM;
> +
> +	if (mp->dev->reg_state != NETREG_REGISTERED)
> +		mask = POLLERR;
> +
> +	mp_put(mfile);
> +	return mask;
> +}
> +
> +static int mp_chr_close(struct inode *inode, struct file *file)
> +{
> +	struct mp_file *mfile = file->private_data;
> +
> +	/*
> +	 * Ignore return value since an error only means there was nothing to
> +	 * do
> +	 */
> +	do_unbind(mfile);
> +
> +	put_net(mfile->net);
> +	kfree(mfile);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations mp_fops = {
> +	.owner  = THIS_MODULE,
> +	.llseek = no_llseek,
> +	.poll   = mp_chr_poll,
> +	.unlocked_ioctl = mp_chr_ioctl,
> +	.open   = mp_chr_open,
> +	.release = mp_chr_close,
> +};
> +
> +static struct miscdevice mp_miscdev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "mp",
> +	.nodename = "net/mp",
> +	.fops = &mp_fops,
> +};
> +
> +static int mp_device_event(struct notifier_block *unused,
> +		unsigned long event, void *ptr)
> +{
> +	struct net_device *dev = ptr;
> +	struct mpassthru_port *port;
> +	struct mp_struct *mp = NULL;
> +	struct socket *sock = NULL;
> +
> +	port = dev->mp_port;
> +	if (port == NULL)
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_UNREGISTER:
> +			sock = dev->mp_port->sock;
> +			mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +			do_unbind(mp->mfile);
> +			break;
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block mp_notifier_block __read_mostly = {
> +	.notifier_call  = mp_device_event,
> +};
> +
> +static int mp_init(void)
> +{
> +	int ret = 0;
> +
> +	ret = misc_register(&mp_miscdev);
> +	if (ret)
> +		printk(KERN_ERR "mp: Can't register misc device\n");
> +	else {
> +		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
> +			mp_miscdev.minor);
> +		register_netdevice_notifier(&mp_notifier_block);
> +	}
> +	return ret;
> +}
> +
> +void mp_cleanup(void)
> +{
> +	unregister_netdevice_notifier(&mp_notifier_block);
> +	misc_deregister(&mp_miscdev);
> +}
> +
> +/* Get an underlying socket object from mp file.  Returns error unless file is
> + * attached to a device.  The returned object works like a packet socket, it
> + * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
> + * holding a reference to the file for as long as the socket is in use. */
> +struct socket *mp_get_socket(struct file *file)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp;
> +
> +	if (file->f_op != &mp_fops)
> +		return ERR_PTR(-EINVAL);
> +	mp = mp_get(mfile);
> +	if (!mp)
> +		return ERR_PTR(-EBADFD);
> +	mp_put(mfile);
> +	return &mp->socket;
> +}
> +EXPORT_SYMBOL_GPL(mp_get_socket);
> +
> +module_init(mp_init);
> +module_exit(mp_cleanup);
> +MODULE_AUTHOR(DRV_COPYRIGHT);
> +MODULE_DESCRIPTION(DRV_DESCRIPTION);
> +MODULE_LICENSE("GPL v2");
> diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
> new file mode 100644
> index 0000000..2be21c5
> --- /dev/null
> +++ b/include/linux/mpassthru.h
> @@ -0,0 +1,29 @@
> +#ifndef __MPASSTHRU_H
> +#define __MPASSTHRU_H
> +
> +#include <linux/types.h>
> +#include <linux/if_ether.h>
> +
> +/* ioctl defines */
> +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
> +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
> +
> +/* MPASSTHRU ifc flags */
> +#define IFF_MPASSTHRU		0x0001
> +#define IFF_MPASSTHRU_EXCL	0x0002
> +
> +#ifdef __KERNEL__
> +#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
> +struct socket *mp_get_socket(struct file *);
> +#else
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +struct file;
> +struct socket;
> +static inline struct socket *mp_get_socket(struct file *f)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +#endif /* CONFIG_VHOST_PASSTHRU */
> +#endif /* __KERNEL__ */
> +#endif /* __MPASSTHRU_H */
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-01 11:02                     ` [PATCH " Michael S. Tsirkin
@ 2010-04-02  2:16                       ` Xin, Xiaohui
  2010-04-04 11:40                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-04-02  2:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike


>> For the write logging, do you have a function in hand that we can use to
>> recompute the log? If so, I think I can use it to recompute the
>> log info when the logging is suddenly enabled.
>> For the outstanding requests, do you mean all the user buffers that have
>> been submitted before the logging ioctl changed? That may be a lot, and
>> some of them are still in NIC ring descriptors. Waiting for them to be
>> finished may need some time. I think it is also reasonable that, when the
>> logging ioctl changes, the logging changes just after that.

>The key point is that after the logging ioctl returns, any
>subsequent change to memory must be logged. It does not
>matter when the request was submitted; otherwise we will
>get memory corruption on migration.

The change to memory happens in vhost_add_used_and_signal(), right?
So after the ioctl returns, it is enough to recompute the log info for the events
in the async queue, since the ioctl and the write-log operations are all
protected by vq->mutex.
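
As a rough sketch of what I mean (only an illustration reusing the
helpers from the patch; the function name and where it would be called
from are assumptions, not actual patch code), the completion path can
take the log decision while holding the same vq->mutex the ioctl takes:

static void flush_async_notify_log(struct vhost_net *net,
				   struct vhost_virtqueue *vq)
{
	struct vhost_log *vq_log;
	struct kiocb *iocb;

	/* The caller holds vq->mutex, the same mutex the logging ioctl
	 * takes, so the logging state read here is up to date. */
	vq_log = vhost_has_feature(&net->dev, VHOST_F_LOG_ALL) ?
			vq->log : NULL;

	while ((iocb = notify_dequeue(vq)) != NULL) {
		vhost_add_used_and_signal(&net->dev, vq,
				iocb->ki_pos, iocb->ki_nbytes);
		if (vq_log)
			vhost_log_write(vq, vq_log,
					(int)iocb->ki_user_data,
					iocb->ki_nbytes);
		if (iocb->ki_dtor)
			iocb->ki_dtor(iocb);
		kmem_cache_free(net->cache, iocb);
	}
}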

Thanks
Xiaohui

> Thanks
> Xiaohui
> 
>  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   10 +++
>  2 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 22d5fef..2aafd90 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -17,11 +17,13 @@
>  #include <linux/workqueue.h>
>  #include <linux/rcupdate.h>
>  #include <linux/file.h>
> +#include <linux/aio.h>
>  
>  #include <linux/net.h>
>  #include <linux/if_packet.h>
>  #include <linux/if_arp.h>
>  #include <linux/if_tun.h>
> +#include <linux/mpassthru.h>
>  
>  #include <net/sock.h>
>  
> @@ -47,6 +49,7 @@ struct vhost_net {
>  	struct vhost_dev dev;
>  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
>  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> +	struct kmem_cache       *cache;
>  	/* Tells us whether we are polling a socket for TX.
>  	 * We only do this when socket buffer fills up.
>  	 * Protected by tx vq lock. */
> @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
>  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
>  }
>  
> +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	if (!list_empty(&vq->notifier)) {
> +		iocb = list_first_entry(&vq->notifier,
> +				struct kiocb, ki_list);
> +		list_del(&iocb->ki_list);
> +	}
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +	return iocb;
> +}
> +
> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +					struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	struct vhost_log *vq_log = NULL;
> +	int rx_total_len = 0;
> +	int log, size;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +
> +	if (vq->receiver)
> +		vq->receiver(vq);
> +
> +	vq_log = unlikely(vhost_has_feature(
> +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, iocb->ki_nbytes);
> +		log = (int)iocb->ki_user_data;
> +		size = iocb->ki_nbytes;
> +		rx_total_len += iocb->ki_nbytes;
> +
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +		kmem_cache_free(net->cache, iocb);
> +
> +		if (unlikely(vq_log))
> +			vhost_log_write(vq, vq_log, log, size);
> +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> +					struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	int tx_total_len = 0;
> +
> +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> +		return;
> +
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, 0);
> +		tx_total_len += iocb->ki_nbytes;
> +
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);
> +
> +		kmem_cache_free(net->cache, iocb);
> +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	struct kiocb *iocb = NULL;
>  	unsigned head, out, in, s;
>  	struct msghdr msg = {
>  		.msg_name = NULL,
> @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
>  		tx_poll_stop(net);
>  	hdr_size = vq->hdr_size;
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
>  		/* Skip header. TODO: support TSO. */
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
>  		msg.msg_iovlen = out;
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> +			if (!iocb)
> +				break;
> +			iocb->ki_pos = head;
> +			iocb->private = (void *)vq;
> +		}
> +
>  		len = iov_length(vq->iov, out);
>  		/* Sanity check */
>  		if (!len) {
> @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
>  			break;
>  		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
>  		if (unlikely(err < 0)) {
>  			vhost_discard_vq_desc(vq);
>  			tx_poll_start(net, sock);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		if (err != len)
>  			pr_err("Truncated TX packet: "
>  			       " len %d != %zd\n", err, len);
> @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
> @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
>  static void handle_rx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> +	struct kiocb *iocb = NULL;
>  	unsigned head, out, in, log, s;
>  	struct vhost_log *vq_log;
>  	struct msghdr msg = {
> @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
>  	int err;
>  	size_t hdr_size;
>  	struct socket *sock = rcu_dereference(vq->private_data);
> -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> +			vq->link_state == VHOST_VQ_LINK_SYNC))
>  		return;
>  
>  	use_mm(net->dev.mm);
> @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
>  	vhost_disable_notify(vq);
>  	hdr_size = vq->hdr_size;
>  
> -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> +	/* In async cases, for write logging, the simple way is to get
> +	 * the log info always, and really logging is decided later.
> +	 * Thus, when logging enabled, we can get log, and when logging
> +	 * disabled, we can get log disabled accordingly.
> +	 */
> +
> +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
>  		vq->log : NULL;
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
>  		msg.msg_iovlen = in;
>  		len = iov_length(vq->iov, in);
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> +			if (!iocb)
> +				break;
> +			iocb->private = vq;
> +			iocb->ki_pos = head;
> +			iocb->ki_user_data = log;
> +		}
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for RX: "
> @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> -		err = sock->ops->recvmsg(NULL, sock, &msg,
> +
> +		err = sock->ops->recvmsg(iocb, sock, &msg,
>  					 len, MSG_DONTWAIT | MSG_TRUNC);
>  		/* TODO: Check specific error and bomb out unless EAGAIN? */
>  		if (err < 0) {
>  			vhost_discard_vq_desc(vq);
>  			break;
>  		}
> +
> +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> +			continue;
> +
>  		/* TODO: Should check and handle checksum. */
>  		if (err > len) {
>  			pr_err("Discarded truncated rx packet: "
> @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_rx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
>  
> +
>  static void handle_tx_kick(struct work_struct *work)
>  {
>  	struct vhost_virtqueue *vq;
> @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
>  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> +	n->cache = NULL;
>  	return 0;
>  }
>  
> @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
>  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
>  }
>  
> +static void vhost_notifier_cleanup(struct vhost_net *n)
> +{
> +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> +	struct kiocb *iocb = NULL;
> +	if (n->cache) {
> +		while ((iocb = notify_dequeue(vq)) != NULL)
> +			kmem_cache_free(n->cache, iocb);
> +		kmem_cache_destroy(n->cache);
> +	}
> +}
> +
>  static int vhost_net_release(struct inode *inode, struct file *f)
>  {
>  	struct vhost_net *n = f->private_data;
> @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	/* We do an extra flush before freeing memory,
>  	 * since jobs can re-queue themselves. */
>  	vhost_net_flush(n);
> +	vhost_notifier_cleanup(n);
>  	kfree(n);
>  	return 0;
>  }
> @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
>  	return sock;
>  }
>  
> -static struct socket *get_socket(int fd)
> +static struct socket *get_mp_socket(int fd)
> +{
> +	struct file *file = fget(fd);
> +	struct socket *sock;
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	sock = mp_get_socket(file);
> +	if (IS_ERR(sock))
> +		fput(file);
> +	return sock;
> +}
> +
> +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
>  {
>  	struct socket *sock;
>  	if (fd == -1)
> @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
>  	sock = get_tun_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> +	sock = get_mp_socket(fd);
> +	if (!IS_ERR(sock)) {
> +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> +		return sock;
> +	}
>  	return ERR_PTR(-ENOTSOCK);
>  }
>  
> +static void vhost_init_link_state(struct vhost_net *n, int index)
> +{
> +	struct vhost_virtqueue *vq = n->vqs + index;
> +
> +	WARN_ON(!mutex_is_locked(&vq->mutex));
> +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +		vq->receiver = NULL;
> +		INIT_LIST_HEAD(&vq->notifier);
> +		spin_lock_init(&vq->notify_lock);
> +		if (!n->cache) {
> +			n->cache = kmem_cache_create("vhost_kiocb",
> +					sizeof(struct kiocb), 0,
> +					SLAB_HWCACHE_ALIGN, NULL);
> +		}
> +	}
> +}
> +
>  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  {
>  	struct socket *sock, *oldsock;
> @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	}
>  	vq = n->vqs + index;
>  	mutex_lock(&vq->mutex);
> -	sock = get_socket(fd);
> +	vq->link_state = VHOST_VQ_LINK_SYNC;
> +	sock = get_socket(vq, fd);
>  	if (IS_ERR(sock)) {
>  		r = PTR_ERR(sock);
>  		goto err;
>  	}
>  
> +	vhost_init_link_state(n, index);
> +
>  	/* start polling new socket */
>  	oldsock = vq->private_data;
>  	if (sock == oldsock)
> @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	vhost_net_disable_vq(n, vq);
>  	rcu_assign_pointer(vq->private_data, sock);
>  	vhost_net_enable_vq(n, vq);
> -	mutex_unlock(&vq->mutex);
>  done:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	if (oldsock) {
>  		vhost_net_flush_vq(n, index);
> @@ -516,6 +690,7 @@ done:
>  	}
>  	return r;
>  err:
> +	mutex_unlock(&vq->mutex);
>  	mutex_unlock(&n->dev.mutex);
>  	return r;
>  }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d1f0453..cffe39a 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -43,6 +43,11 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +enum vhost_vq_link_state {
> +	VHOST_VQ_LINK_SYNC = 	0,
> +	VHOST_VQ_LINK_ASYNC = 	1,
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -96,6 +101,11 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log log[VHOST_NET_MAX_SG];
> +	/* Differentiate async socket for 0-copy from normal */
> +	enum vhost_vq_link_state link_state;
> +	struct list_head notifier;
> +	spinlock_t notify_lock;
> +	void (*receiver)(struct vhost_virtqueue *);
>  };
>  
>  struct vhost_dev {
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-02  2:16                       ` Xin, Xiaohui
@ 2010-04-04 11:40                         ` Michael S. Tsirkin
  2010-04-06  5:46                           ` Xin, Xiaohui
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-04 11:40 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Fri, Apr 02, 2010 at 10:16:16AM +0800, Xin, Xiaohui wrote:
> 
> >> For the write logging, do you have a function in hand that we can use to
> >> recompute the log? If so, I think I can use it to recompute the
> >> log info when the logging is suddenly enabled.
> >> For the outstanding requests, do you mean all the user buffers that have
> >> been submitted before the logging ioctl changed? That may be a lot, and
> >> some of them are still in NIC ring descriptors. Waiting for them to be
> >> finished may need some time. I think it is also reasonable that, when the
> >> logging ioctl changes, the logging changes just after that.
> 
> >The key point is that after the logging ioctl returns, any
> >subsequent change to memory must be logged. It does not
> >matter when the request was submitted; otherwise we will
> >get memory corruption on migration.
> 
> The change to memory happens in vhost_add_used_and_signal(), right?
> So after the ioctl returns, it is enough to recompute the log info for the events
> in the async queue, since the ioctl and the write-log operations are all
> protected by vq->mutex.
> 
> Thanks
> Xiaohui

Yes, I think this will work.

> > Thanks
> > Xiaohui
> > 
> >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/vhost/vhost.h |   10 +++
> >  2 files changed, 192 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 22d5fef..2aafd90 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -17,11 +17,13 @@
> >  #include <linux/workqueue.h>
> >  #include <linux/rcupdate.h>
> >  #include <linux/file.h>
> > +#include <linux/aio.h>
> >  
> >  #include <linux/net.h>
> >  #include <linux/if_packet.h>
> >  #include <linux/if_arp.h>
> >  #include <linux/if_tun.h>
> > +#include <linux/mpassthru.h>
> >  
> >  #include <net/sock.h>
> >  
> > @@ -47,6 +49,7 @@ struct vhost_net {
> >  	struct vhost_dev dev;
> >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > +	struct kmem_cache       *cache;
> >  	/* Tells us whether we are polling a socket for TX.
> >  	 * We only do this when socket buffer fills up.
> >  	 * Protected by tx vq lock. */
> > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> >  }
> >  
> > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > +	if (!list_empty(&vq->notifier)) {
> > +		iocb = list_first_entry(&vq->notifier,
> > +				struct kiocb, ki_list);
> > +		list_del(&iocb->ki_list);
> > +	}
> > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > +	return iocb;
> > +}
> > +
> > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	struct vhost_log *vq_log = NULL;
> > +	int rx_total_len = 0;
> > +	int log, size;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	if (vq->receiver)
> > +		vq->receiver(vq);
> > +
> > +	vq_log = unlikely(vhost_has_feature(
> > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, iocb->ki_nbytes);
> > +		log = (int)iocb->ki_user_data;
> > +		size = iocb->ki_nbytes;
> > +		rx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +		kmem_cache_free(net->cache, iocb);
> > +
> > +		if (unlikely(vq_log))
> > +			vhost_log_write(vq, vq_log, log, size);
> > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	int tx_total_len = 0;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, 0);
> > +		tx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +
> > +		kmem_cache_free(net->cache, iocb);
> > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> >  /* Expects to be always run from workqueue - which acts as
> >   * read-size critical section for our kind of RCU. */
> >  static void handle_tx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, s;
> >  	struct msghdr msg = {
> >  		.msg_name = NULL,
> > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> >  		tx_poll_stop(net);
> >  	hdr_size = vq->hdr_size;
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> >  		/* Skip header. TODO: support TSO. */
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> >  		msg.msg_iovlen = out;
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->ki_pos = head;
> > +			iocb->private = (void *)vq;
> > +		}
> > +
> >  		len = iov_length(vq->iov, out);
> >  		/* Sanity check */
> >  		if (!len) {
> > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> >  			break;
> >  		}
> >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> >  		if (unlikely(err < 0)) {
> >  			vhost_discard_vq_desc(vq);
> >  			tx_poll_start(net, sock);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		if (err != len)
> >  			pr_err("Truncated TX packet: "
> >  			       " len %d != %zd\n", err, len);
> > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> >  static void handle_rx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, log, s;
> >  	struct vhost_log *vq_log;
> >  	struct msghdr msg = {
> > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> >  	int err;
> >  	size_t hdr_size;
> >  	struct socket *sock = rcu_dereference(vq->private_data);
> > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> >  		return;
> >  
> >  	use_mm(net->dev.mm);
> > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> >  	vhost_disable_notify(vq);
> >  	hdr_size = vq->hdr_size;
> >  
> > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > +	/* In async cases, for write logging, the simple way is to get
> > +	 * the log info always, and really logging is decided later.
> > +	 * Thus, when logging enabled, we can get log, and when logging
> > +	 * disabled, we can get log disabled accordingly.
> > +	 */
> > +
> > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> >  		vq->log : NULL;
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> >  		msg.msg_iovlen = in;
> >  		len = iov_length(vq->iov, in);
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->private = vq;
> > +			iocb->ki_pos = head;
> > +			iocb->ki_user_data = log;
> > +		}
> >  		/* Sanity check */
> >  		if (!len) {
> >  			vq_err(vq, "Unexpected header len for RX: "
> > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> >  			       iov_length(vq->hdr, s), hdr_size);
> >  			break;
> >  		}
> > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > +
> > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> >  		if (err < 0) {
> >  			vhost_discard_vq_desc(vq);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		/* TODO: Should check and handle checksum. */
> >  		if (err > len) {
> >  			pr_err("Discarded truncated rx packet: "
> > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> >  
> > +
> >  static void handle_tx_kick(struct work_struct *work)
> >  {
> >  	struct vhost_virtqueue *vq;
> > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > +	n->cache = NULL;
> >  	return 0;
> >  }
> >  
> > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> >  }
> >  
> > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > +{
> > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> > +	if (n->cache) {
> > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > +			kmem_cache_free(n->cache, iocb);
> > +		kmem_cache_destroy(n->cache);
> > +	}
> > +}
> > +
> >  static int vhost_net_release(struct inode *inode, struct file *f)
> >  {
> >  	struct vhost_net *n = f->private_data;
> > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >  	/* We do an extra flush before freeing memory,
> >  	 * since jobs can re-queue themselves. */
> >  	vhost_net_flush(n);
> > +	vhost_notifier_cleanup(n);
> >  	kfree(n);
> >  	return 0;
> >  }
> > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> >  	return sock;
> >  }
> >  
> > -static struct socket *get_socket(int fd)
> > +static struct socket *get_mp_socket(int fd)
> > +{
> > +	struct file *file = fget(fd);
> > +	struct socket *sock;
> > +	if (!file)
> > +		return ERR_PTR(-EBADF);
> > +	sock = mp_get_socket(file);
> > +	if (IS_ERR(sock))
> > +		fput(file);
> > +	return sock;
> > +}
> > +
> > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> >  {
> >  	struct socket *sock;
> >  	if (fd == -1)
> > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> >  	sock = get_tun_socket(fd);
> >  	if (!IS_ERR(sock))
> >  		return sock;
> > +	sock = get_mp_socket(fd);
> > +	if (!IS_ERR(sock)) {
> > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > +		return sock;
> > +	}
> >  	return ERR_PTR(-ENOTSOCK);
> >  }
> >  
> > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > +{
> > +	struct vhost_virtqueue *vq = n->vqs + index;
> > +
> > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +		vq->receiver = NULL;
> > +		INIT_LIST_HEAD(&vq->notifier);
> > +		spin_lock_init(&vq->notify_lock);
> > +		if (!n->cache) {
> > +			n->cache = kmem_cache_create("vhost_kiocb",
> > +					sizeof(struct kiocb), 0,
> > +					SLAB_HWCACHE_ALIGN, NULL);
> > +		}
> > +	}
> > +}
> > +
> >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  {
> >  	struct socket *sock, *oldsock;
> > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	}
> >  	vq = n->vqs + index;
> >  	mutex_lock(&vq->mutex);
> > -	sock = get_socket(fd);
> > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > +	sock = get_socket(vq, fd);
> >  	if (IS_ERR(sock)) {
> >  		r = PTR_ERR(sock);
> >  		goto err;
> >  	}
> >  
> > +	vhost_init_link_state(n, index);
> > +
> >  	/* start polling new socket */
> >  	oldsock = vq->private_data;
> >  	if (sock == oldsock)
> > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	vhost_net_disable_vq(n, vq);
> >  	rcu_assign_pointer(vq->private_data, sock);
> >  	vhost_net_enable_vq(n, vq);
> > -	mutex_unlock(&vq->mutex);
> >  done:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	if (oldsock) {
> >  		vhost_net_flush_vq(n, index);
> > @@ -516,6 +690,7 @@ done:
> >  	}
> >  	return r;
> >  err:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	return r;
> >  }
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index d1f0453..cffe39a 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -43,6 +43,11 @@ struct vhost_log {
> >  	u64 len;
> >  };
> >  
> > +enum vhost_vq_link_state {
> > +	VHOST_VQ_LINK_SYNC = 	0,
> > +	VHOST_VQ_LINK_ASYNC = 	1,
> > +};
> > +
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >  	struct vhost_dev *dev;
> > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> >  	/* Log write descriptors */
> >  	void __user *log_base;
> >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > +	/*Differiate async socket for 0-copy from normal*/
> > +	enum vhost_vq_link_state link_state;
> > +	struct list_head notifier;
> > +	spinlock_t notify_lock;
> > +	void (*receiver)(struct vhost_virtqueue *);
> >  };
> >  
> >  struct vhost_dev {
> > -- 
> > 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-01 11:08       ` [PATCH " Michael S. Tsirkin
@ 2010-04-06  5:41         ` Xin, Xiaohui
  2010-04-06  7:49           ` Michael S. Tsirkin
  2010-04-07  2:41         ` Xin, Xiaohui
  1 sibling, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-04-06  5:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81

Michael,
>> 
>>For the DOS issue, I'm not sure what a reasonable limit is on how much
>>get_user_pages() can pin; should we compute it from the bandwidth?

>There's a ulimit for locked memory. Can we use this, decreasing
>the value in the rlimit array? We can do this when the backend is
>enabled and re-increment it when the backend is disabled.

I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found
that its initial value is 0x10000; after the right shift by PAGE_SHIFT,
that is only 16 pages we can lock, which seems too small, since the
guest virtio-net driver may submit a lot of requests at one time.
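
Just to make the idea concrete, this is roughly what the accounting would
look like (an untested sketch only; mp_account_pinned() is a hypothetical
helper for the mp device, not something in the posted patches), in the style
of other code that charges pinned user pages against RLIMIT_MEMLOCK:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/capability.h>

/* Sketch: charge npages of pinned guest memory against RLIMIT_MEMLOCK.
 * Called before get_user_pages(); the same amount would be subtracted
 * from mm->locked_vm again when the backend is disabled. */
static int mp_account_pinned(struct mm_struct *mm, unsigned long npages)
{
	unsigned long locked, lock_limit;
	int ret = 0;

	down_write(&mm->mmap_sem);
	locked = mm->locked_vm + npages;
	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
			>> PAGE_SHIFT;
	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
		ret = -ENOMEM;		/* would exceed the ulimit */
	else
		mm->locked_vm = locked;
	up_write(&mm->mmap_sem);
	return ret;
}

With the default 0x10000 limit only 16 pages can be accounted before this
starts failing, which is why the default value looks too small for the
zero-copy case.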


Thanks
Xiaohui


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-04 11:40                         ` Michael S. Tsirkin
@ 2010-04-06  5:46                           ` Xin, Xiaohui
  2010-04-06  7:51                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-04-06  5:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

Michael,
> >>> For the write logging, do you have a function in hand that we can
> >>> use to recompute the log? If so, I think I can use it to recompute the
> >>> log info when the logging is suddenly enabled.
> >>> For the outstanding requests, do you mean all the user buffers that have
> >>> been submitted before the logging ioctl changed? That may be a lot, and
> >>> some of them are still in NIC ring descriptors. Waiting for them to
> >>> finish may take some time. I think it is also reasonable that, when the
> >>> logging ioctl changes, the logging simply changes from that point on.

> >>The key point is that after the logging ioctl returns, any
> >>subsequent change to memory must be logged. It does not
> >>matter when the request was submitted; otherwise we will
> >>get memory corruption on migration.

> >The change to memory happens in vhost_add_used_and_signal(), right?
> >So after the ioctl returns, it is ok to just recompute the log info for the
> >events in the async queue, since the ioctl and write-log operations are all
> >protected by vq->mutex.
 
>> Thanks
>> Xiaohui

>Yes, I think this will work.

Thanks, so do you have the function to recompute the log info at hand that I can
use? I vaguely remember that you mentioned it some time ago.

> > Thanks
> > Xiaohui
> > 
> >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/vhost/vhost.h |   10 +++
> >  2 files changed, 192 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 22d5fef..2aafd90 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -17,11 +17,13 @@
> >  #include <linux/workqueue.h>
> >  #include <linux/rcupdate.h>
> >  #include <linux/file.h>
> > +#include <linux/aio.h>
> >  
> >  #include <linux/net.h>
> >  #include <linux/if_packet.h>
> >  #include <linux/if_arp.h>
> >  #include <linux/if_tun.h>
> > +#include <linux/mpassthru.h>
> >  
> >  #include <net/sock.h>
> >  
> > @@ -47,6 +49,7 @@ struct vhost_net {
> >  	struct vhost_dev dev;
> >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > +	struct kmem_cache       *cache;
> >  	/* Tells us whether we are polling a socket for TX.
> >  	 * We only do this when socket buffer fills up.
> >  	 * Protected by tx vq lock. */
> > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> >  }
> >  
> > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > +	if (!list_empty(&vq->notifier)) {
> > +		iocb = list_first_entry(&vq->notifier,
> > +				struct kiocb, ki_list);
> > +		list_del(&iocb->ki_list);
> > +	}
> > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > +	return iocb;
> > +}
> > +
> > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	struct vhost_log *vq_log = NULL;
> > +	int rx_total_len = 0;
> > +	int log, size;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	if (vq->receiver)
> > +		vq->receiver(vq);
> > +
> > +	vq_log = unlikely(vhost_has_feature(
> > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, iocb->ki_nbytes);
> > +		log = (int)iocb->ki_user_data;
> > +		size = iocb->ki_nbytes;
> > +		rx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +		kmem_cache_free(net->cache, iocb);
> > +
> > +		if (unlikely(vq_log))
> > +			vhost_log_write(vq, vq_log, log, size);
> > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	int tx_total_len = 0;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, 0);
> > +		tx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +
> > +		kmem_cache_free(net->cache, iocb);
> > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> >  /* Expects to be always run from workqueue - which acts as
> >   * read-size critical section for our kind of RCU. */
> >  static void handle_tx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, s;
> >  	struct msghdr msg = {
> >  		.msg_name = NULL,
> > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> >  		tx_poll_stop(net);
> >  	hdr_size = vq->hdr_size;
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> >  		/* Skip header. TODO: support TSO. */
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> >  		msg.msg_iovlen = out;
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->ki_pos = head;
> > +			iocb->private = (void *)vq;
> > +		}
> > +
> >  		len = iov_length(vq->iov, out);
> >  		/* Sanity check */
> >  		if (!len) {
> > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> >  			break;
> >  		}
> >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> >  		if (unlikely(err < 0)) {
> >  			vhost_discard_vq_desc(vq);
> >  			tx_poll_start(net, sock);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		if (err != len)
> >  			pr_err("Truncated TX packet: "
> >  			       " len %d != %zd\n", err, len);
> > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> >  static void handle_rx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, log, s;
> >  	struct vhost_log *vq_log;
> >  	struct msghdr msg = {
> > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> >  	int err;
> >  	size_t hdr_size;
> >  	struct socket *sock = rcu_dereference(vq->private_data);
> > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> >  		return;
> >  
> >  	use_mm(net->dev.mm);
> > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> >  	vhost_disable_notify(vq);
> >  	hdr_size = vq->hdr_size;
> >  
> > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > +	/* In async cases, for write logging, the simple way is to get
> > +	 * the log info always, and really logging is decided later.
> > +	 * Thus, when logging enabled, we can get log, and when logging
> > +	 * disabled, we can get log disabled accordingly.
> > +	 */
> > +
> > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> >  		vq->log : NULL;
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> >  		msg.msg_iovlen = in;
> >  		len = iov_length(vq->iov, in);
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->private = vq;
> > +			iocb->ki_pos = head;
> > +			iocb->ki_user_data = log;
> > +		}
> >  		/* Sanity check */
> >  		if (!len) {
> >  			vq_err(vq, "Unexpected header len for RX: "
> > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> >  			       iov_length(vq->hdr, s), hdr_size);
> >  			break;
> >  		}
> > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > +
> > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> >  		if (err < 0) {
> >  			vhost_discard_vq_desc(vq);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		/* TODO: Should check and handle checksum. */
> >  		if (err > len) {
> >  			pr_err("Discarded truncated rx packet: "
> > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> >  
> > +
> >  static void handle_tx_kick(struct work_struct *work)
> >  {
> >  	struct vhost_virtqueue *vq;
> > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > +	n->cache = NULL;
> >  	return 0;
> >  }
> >  
> > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> >  }
> >  
> > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > +{
> > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> > +	if (n->cache) {
> > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > +			kmem_cache_free(n->cache, iocb);
> > +		kmem_cache_destroy(n->cache);
> > +	}
> > +}
> > +
> >  static int vhost_net_release(struct inode *inode, struct file *f)
> >  {
> >  	struct vhost_net *n = f->private_data;
> > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >  	/* We do an extra flush before freeing memory,
> >  	 * since jobs can re-queue themselves. */
> >  	vhost_net_flush(n);
> > +	vhost_notifier_cleanup(n);
> >  	kfree(n);
> >  	return 0;
> >  }
> > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> >  	return sock;
> >  }
> >  
> > -static struct socket *get_socket(int fd)
> > +static struct socket *get_mp_socket(int fd)
> > +{
> > +	struct file *file = fget(fd);
> > +	struct socket *sock;
> > +	if (!file)
> > +		return ERR_PTR(-EBADF);
> > +	sock = mp_get_socket(file);
> > +	if (IS_ERR(sock))
> > +		fput(file);
> > +	return sock;
> > +}
> > +
> > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> >  {
> >  	struct socket *sock;
> >  	if (fd == -1)
> > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> >  	sock = get_tun_socket(fd);
> >  	if (!IS_ERR(sock))
> >  		return sock;
> > +	sock = get_mp_socket(fd);
> > +	if (!IS_ERR(sock)) {
> > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > +		return sock;
> > +	}
> >  	return ERR_PTR(-ENOTSOCK);
> >  }
> >  
> > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > +{
> > +	struct vhost_virtqueue *vq = n->vqs + index;
> > +
> > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +		vq->receiver = NULL;
> > +		INIT_LIST_HEAD(&vq->notifier);
> > +		spin_lock_init(&vq->notify_lock);
> > +		if (!n->cache) {
> > +			n->cache = kmem_cache_create("vhost_kiocb",
> > +					sizeof(struct kiocb), 0,
> > +					SLAB_HWCACHE_ALIGN, NULL);
> > +		}
> > +	}
> > +}
> > +
> >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  {
> >  	struct socket *sock, *oldsock;
> > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	}
> >  	vq = n->vqs + index;
> >  	mutex_lock(&vq->mutex);
> > -	sock = get_socket(fd);
> > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > +	sock = get_socket(vq, fd);
> >  	if (IS_ERR(sock)) {
> >  		r = PTR_ERR(sock);
> >  		goto err;
> >  	}
> >  
> > +	vhost_init_link_state(n, index);
> > +
> >  	/* start polling new socket */
> >  	oldsock = vq->private_data;
> >  	if (sock == oldsock)
> > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	vhost_net_disable_vq(n, vq);
> >  	rcu_assign_pointer(vq->private_data, sock);
> >  	vhost_net_enable_vq(n, vq);
> > -	mutex_unlock(&vq->mutex);
> >  done:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	if (oldsock) {
> >  		vhost_net_flush_vq(n, index);
> > @@ -516,6 +690,7 @@ done:
> >  	}
> >  	return r;
> >  err:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	return r;
> >  }
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index d1f0453..cffe39a 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -43,6 +43,11 @@ struct vhost_log {
> >  	u64 len;
> >  };
> >  
> > +enum vhost_vq_link_state {
> > +	VHOST_VQ_LINK_SYNC = 	0,
> > +	VHOST_VQ_LINK_ASYNC = 	1,
> > +};
> > +
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >  	struct vhost_dev *dev;
> > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> >  	/* Log write descriptors */
> >  	void __user *log_base;
> >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > +	/*Differiate async socket for 0-copy from normal*/
> > +	enum vhost_vq_link_state link_state;
> > +	struct list_head notifier;
> > +	spinlock_t notify_lock;
> > +	void (*receiver)(struct vhost_virtqueue *);
> >  };
> >  
> >  struct vhost_dev {
> > -- 
> > 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-06  5:41         ` Xin, Xiaohui
@ 2010-04-06  7:49           ` Michael S. Tsirkin
  0 siblings, 0 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-06  7:49 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81

On Tue, Apr 06, 2010 at 01:41:37PM +0800, Xin, Xiaohui wrote:
> Michael,
> >> 
> >>For the DOS issue, I'm not sure what a reasonable limit is on how much
> >>get_user_pages() can pin; should we compute it from the bandwidth?
> 
> >There's a ulimit for locked memory. Can we use this, decreasing
> >the value in the rlimit array? We can do this when the backend is
> >enabled and re-increment it when the backend is disabled.
> 
> I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found
> that its initial value is 0x10000; after the right shift by PAGE_SHIFT,
> that is only 16 pages we can lock, which seems too small, since the
> guest virtio-net driver may submit a lot of requests at one time.
> 
> 
> Thanks
> Xiaohui

Yes, that's the default, but the system administrator can always increase
this value with ulimit if necessary.
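
For example, a privileged management tool (or qemu itself) could raise the
limit before the zero-copy backend is enabled. This is only an illustrative
userspace sketch, not actual qemu code:

#include <sys/resource.h>

/* Raise RLIMIT_MEMLOCK to at least 'bytes'.  Raising the hard limit
 * needs CAP_SYS_RESOURCE (typically root). */
static int raise_memlock_limit(rlim_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) < 0)
		return -1;
	if (rl.rlim_cur >= bytes)
		return 0;			/* already big enough */
	rl.rlim_cur = bytes;
	if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < bytes)
		rl.rlim_max = bytes;		/* privileged operation */
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}

An unprivileged user can do the same thing up to the hard limit with
"ulimit -l" in the shell that starts qemu.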

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-06  5:46                           ` Xin, Xiaohui
@ 2010-04-06  7:51                             ` Michael S. Tsirkin
  2010-04-07  1:36                               ` Xin, Xiaohui
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-06  7:51 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Tue, Apr 06, 2010 at 01:46:56PM +0800, Xin, Xiaohui wrote:
> Michael,
> > >>> For the write logging, do you have a function in hand that we can
> > >>> use to recompute the log? If so, I think I can use it to recompute the
> > >>> log info when the logging is suddenly enabled.
> > >>> For the outstanding requests, do you mean all the user buffers that have
> > >>> been submitted before the logging ioctl changed? That may be a lot, and
> > >>> some of them are still in NIC ring descriptors. Waiting for them to
> > >>> finish may take some time. I think it is also reasonable that, when the
> > >>> logging ioctl changes, the logging simply changes from that point on.
> 
> > >>The key point is that after the logging ioctl returns, any
> > >>subsequent change to memory must be logged. It does not
> > >>matter when the request was submitted; otherwise we will
> > >>get memory corruption on migration.
> 
> > >The change to memory happens in vhost_add_used_and_signal(), right?
> > >So after the ioctl returns, it is ok to just recompute the log info for the
> > >events in the async queue, since the ioctl and write-log operations are all
> > >protected by vq->mutex.
>  
> >> Thanks
> >> Xiaohui
> 
> >Yes, I think this will work.
> 
> Thanks, so do you have the function to recompute the log info at hand that I can
> use? I vaguely remember that you mentioned it some time ago.

Doesn't just rerunning vhost_get_vq_desc work?

> > > Thanks
> > > Xiaohui
> > > 
> > >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > >  drivers/vhost/vhost.h |   10 +++
> > >  2 files changed, 192 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 22d5fef..2aafd90 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -17,11 +17,13 @@
> > >  #include <linux/workqueue.h>
> > >  #include <linux/rcupdate.h>
> > >  #include <linux/file.h>
> > > +#include <linux/aio.h>
> > >  
> > >  #include <linux/net.h>
> > >  #include <linux/if_packet.h>
> > >  #include <linux/if_arp.h>
> > >  #include <linux/if_tun.h>
> > > +#include <linux/mpassthru.h>
> > >  
> > >  #include <net/sock.h>
> > >  
> > > @@ -47,6 +49,7 @@ struct vhost_net {
> > >  	struct vhost_dev dev;
> > >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > > +	struct kmem_cache       *cache;
> > >  	/* Tells us whether we are polling a socket for TX.
> > >  	 * We only do this when socket buffer fills up.
> > >  	 * Protected by tx vq lock. */
> > > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > >  }
> > >  
> > > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > > +	if (!list_empty(&vq->notifier)) {
> > > +		iocb = list_first_entry(&vq->notifier,
> > > +				struct kiocb, ki_list);
> > > +		list_del(&iocb->ki_list);
> > > +	}
> > > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > > +	return iocb;
> > > +}
> > > +
> > > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > > +					struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	struct vhost_log *vq_log = NULL;
> > > +	int rx_total_len = 0;
> > > +	int log, size;
> > > +
> > > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +		return;
> > > +
> > > +	if (vq->receiver)
> > > +		vq->receiver(vq);
> > > +
> > > +	vq_log = unlikely(vhost_has_feature(
> > > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +		vhost_add_used_and_signal(&net->dev, vq,
> > > +				iocb->ki_pos, iocb->ki_nbytes);
> > > +		log = (int)iocb->ki_user_data;
> > > +		size = iocb->ki_nbytes;
> > > +		rx_total_len += iocb->ki_nbytes;
> > > +
> > > +		if (iocb->ki_dtor)
> > > +			iocb->ki_dtor(iocb);
> > > +		kmem_cache_free(net->cache, iocb);
> > > +
> > > +		if (unlikely(vq_log))
> > > +			vhost_log_write(vq, vq_log, log, size);
> > > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > > +			vhost_poll_queue(&vq->poll);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > > +					struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	int tx_total_len = 0;
> > > +
> > > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +		return;
> > > +
> > > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +		vhost_add_used_and_signal(&net->dev, vq,
> > > +				iocb->ki_pos, 0);
> > > +		tx_total_len += iocb->ki_nbytes;
> > > +
> > > +		if (iocb->ki_dtor)
> > > +			iocb->ki_dtor(iocb);
> > > +
> > > +		kmem_cache_free(net->cache, iocb);
> > > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > > +			vhost_poll_queue(&vq->poll);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > >  /* Expects to be always run from workqueue - which acts as
> > >   * read-size critical section for our kind of RCU. */
> > >  static void handle_tx(struct vhost_net *net)
> > >  {
> > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > > +	struct kiocb *iocb = NULL;
> > >  	unsigned head, out, in, s;
> > >  	struct msghdr msg = {
> > >  		.msg_name = NULL,
> > > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > >  		tx_poll_stop(net);
> > >  	hdr_size = vq->hdr_size;
> > >  
> > > +	handle_async_tx_events_notify(net, vq);
> > > +
> > >  	for (;;) {
> > >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >  					 ARRAY_SIZE(vq->iov),
> > > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > >  		/* Skip header. TODO: support TSO. */
> > >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > >  		msg.msg_iovlen = out;
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +			if (!iocb)
> > > +				break;
> > > +			iocb->ki_pos = head;
> > > +			iocb->private = (void *)vq;
> > > +		}
> > > +
> > >  		len = iov_length(vq->iov, out);
> > >  		/* Sanity check */
> > >  		if (!len) {
> > > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > >  			break;
> > >  		}
> > >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > >  		if (unlikely(err < 0)) {
> > >  			vhost_discard_vq_desc(vq);
> > >  			tx_poll_start(net, sock);
> > >  			break;
> > >  		}
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +			continue;
> > > +
> > >  		if (err != len)
> > >  			pr_err("Truncated TX packet: "
> > >  			       " len %d != %zd\n", err, len);
> > > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > >  		}
> > >  	}
> > >  
> > > +	handle_async_tx_events_notify(net, vq);
> > > +
> > >  	mutex_unlock(&vq->mutex);
> > >  	unuse_mm(net->dev.mm);
> > >  }
> > > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > >  static void handle_rx(struct vhost_net *net)
> > >  {
> > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > +	struct kiocb *iocb = NULL;
> > >  	unsigned head, out, in, log, s;
> > >  	struct vhost_log *vq_log;
> > >  	struct msghdr msg = {
> > > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > >  	int err;
> > >  	size_t hdr_size;
> > >  	struct socket *sock = rcu_dereference(vq->private_data);
> > > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> > >  		return;
> > >  
> > >  	use_mm(net->dev.mm);
> > > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > >  	vhost_disable_notify(vq);
> > >  	hdr_size = vq->hdr_size;
> > >  
> > > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > +	/* In async cases, for write logging, the simple way is to get
> > > +	 * the log info always, and really logging is decided later.
> > > +	 * Thus, when logging enabled, we can get log, and when logging
> > > +	 * disabled, we can get log disabled accordingly.
> > > +	 */
> > > +
> > > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > >  		vq->log : NULL;
> > >  
> > > +	handle_async_rx_events_notify(net, vq);
> > > +
> > >  	for (;;) {
> > >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >  					 ARRAY_SIZE(vq->iov),
> > > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > >  		msg.msg_iovlen = in;
> > >  		len = iov_length(vq->iov, in);
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +			if (!iocb)
> > > +				break;
> > > +			iocb->private = vq;
> > > +			iocb->ki_pos = head;
> > > +			iocb->ki_user_data = log;
> > > +		}
> > >  		/* Sanity check */
> > >  		if (!len) {
> > >  			vq_err(vq, "Unexpected header len for RX: "
> > > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > >  			       iov_length(vq->hdr, s), hdr_size);
> > >  			break;
> > >  		}
> > > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > > +
> > > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> > >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> > >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> > >  		if (err < 0) {
> > >  			vhost_discard_vq_desc(vq);
> > >  			break;
> > >  		}
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +			continue;
> > > +
> > >  		/* TODO: Should check and handle checksum. */
> > >  		if (err > len) {
> > >  			pr_err("Discarded truncated rx packet: "
> > > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > >  		}
> > >  	}
> > >  
> > > +	handle_async_rx_events_notify(net, vq);
> > > +
> > >  	mutex_unlock(&vq->mutex);
> > >  	unuse_mm(net->dev.mm);
> > >  }
> > >  
> > > +
> > >  static void handle_tx_kick(struct work_struct *work)
> > >  {
> > >  	struct vhost_virtqueue *vq;
> > > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > > +	n->cache = NULL;
> > >  	return 0;
> > >  }
> > >  
> > > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > >  }
> > >  
> > > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > > +{
> > > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > > +	struct kiocb *iocb = NULL;
> > > +	if (n->cache) {
> > > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > > +			kmem_cache_free(n->cache, iocb);
> > > +		kmem_cache_destroy(n->cache);
> > > +	}
> > > +}
> > > +
> > >  static int vhost_net_release(struct inode *inode, struct file *f)
> > >  {
> > >  	struct vhost_net *n = f->private_data;
> > > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > >  	/* We do an extra flush before freeing memory,
> > >  	 * since jobs can re-queue themselves. */
> > >  	vhost_net_flush(n);
> > > +	vhost_notifier_cleanup(n);
> > >  	kfree(n);
> > >  	return 0;
> > >  }
> > > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > >  	return sock;
> > >  }
> > >  
> > > -static struct socket *get_socket(int fd)
> > > +static struct socket *get_mp_socket(int fd)
> > > +{
> > > +	struct file *file = fget(fd);
> > > +	struct socket *sock;
> > > +	if (!file)
> > > +		return ERR_PTR(-EBADF);
> > > +	sock = mp_get_socket(file);
> > > +	if (IS_ERR(sock))
> > > +		fput(file);
> > > +	return sock;
> > > +}
> > > +
> > > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > >  {
> > >  	struct socket *sock;
> > >  	if (fd == -1)
> > > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > >  	sock = get_tun_socket(fd);
> > >  	if (!IS_ERR(sock))
> > >  		return sock;
> > > +	sock = get_mp_socket(fd);
> > > +	if (!IS_ERR(sock)) {
> > > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > > +		return sock;
> > > +	}
> > >  	return ERR_PTR(-ENOTSOCK);
> > >  }
> > >  
> > > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > > +{
> > > +	struct vhost_virtqueue *vq = n->vqs + index;
> > > +
> > > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +		vq->receiver = NULL;
> > > +		INIT_LIST_HEAD(&vq->notifier);
> > > +		spin_lock_init(&vq->notify_lock);
> > > +		if (!n->cache) {
> > > +			n->cache = kmem_cache_create("vhost_kiocb",
> > > +					sizeof(struct kiocb), 0,
> > > +					SLAB_HWCACHE_ALIGN, NULL);
> > > +		}
> > > +	}
> > > +}
> > > +
> > >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  {
> > >  	struct socket *sock, *oldsock;
> > > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  	}
> > >  	vq = n->vqs + index;
> > >  	mutex_lock(&vq->mutex);
> > > -	sock = get_socket(fd);
> > > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > > +	sock = get_socket(vq, fd);
> > >  	if (IS_ERR(sock)) {
> > >  		r = PTR_ERR(sock);
> > >  		goto err;
> > >  	}
> > >  
> > > +	vhost_init_link_state(n, index);
> > > +
> > >  	/* start polling new socket */
> > >  	oldsock = vq->private_data;
> > >  	if (sock == oldsock)
> > > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  	vhost_net_disable_vq(n, vq);
> > >  	rcu_assign_pointer(vq->private_data, sock);
> > >  	vhost_net_enable_vq(n, vq);
> > > -	mutex_unlock(&vq->mutex);
> > >  done:
> > > +	mutex_unlock(&vq->mutex);
> > >  	mutex_unlock(&n->dev.mutex);
> > >  	if (oldsock) {
> > >  		vhost_net_flush_vq(n, index);
> > > @@ -516,6 +690,7 @@ done:
> > >  	}
> > >  	return r;
> > >  err:
> > > +	mutex_unlock(&vq->mutex);
> > >  	mutex_unlock(&n->dev.mutex);
> > >  	return r;
> > >  }
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index d1f0453..cffe39a 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -43,6 +43,11 @@ struct vhost_log {
> > >  	u64 len;
> > >  };
> > >  
> > > +enum vhost_vq_link_state {
> > > +	VHOST_VQ_LINK_SYNC = 	0,
> > > +	VHOST_VQ_LINK_ASYNC = 	1,
> > > +};
> > > +
> > >  /* The virtqueue structure describes a queue attached to a device. */
> > >  struct vhost_virtqueue {
> > >  	struct vhost_dev *dev;
> > > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > >  	/* Log write descriptors */
> > >  	void __user *log_base;
> > >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > > +	/*Differiate async socket for 0-copy from normal*/
> > > +	enum vhost_vq_link_state link_state;
> > > +	struct list_head notifier;
> > > +	spinlock_t notify_lock;
> > > +	void (*receiver)(struct vhost_virtqueue *);
> > >  };
> > >  
> > >  struct vhost_dev {
> > > -- 
> > > 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-06  7:51                             ` Michael S. Tsirkin
@ 2010-04-07  1:36                               ` Xin, Xiaohui
  2010-04-07  8:18                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-04-07  1:36 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

Michael,
> > >>>> For the write logging, do you have a function in hand that we can
> > >>>> use to recompute the log? If so, I think I can use it to recompute the
> > >>>> log info when the logging is suddenly enabled.
> > >>>> For the outstanding requests, do you mean all the user buffers that have
> > >>>> been submitted before the logging ioctl changed? That may be a lot, and
> > >>>> some of them are still in NIC ring descriptors. Waiting for them to
> > >>>> finish may take some time. I think it is also reasonable that, when the
> > >>>> logging ioctl changes, the logging simply changes from that point on.

> > >>>The key point is that after the logging ioctl returns, any
> > >>>subsequent change to memory must be logged. It does not
> > >>>matter when the request was submitted; otherwise we will
> > >>>get memory corruption on migration.

> > >>The change to memory happens in vhost_add_used_and_signal(), right?
> > >>So after the ioctl returns, it is ok to just recompute the log info for the
> > >>events in the async queue, since the ioctl and write-log operations are all
> > >>protected by vq->mutex.

> >>> Thanks
> >> >Xiaohui

> >>Yes, I think this will work.

>> Thanks, so do you have the function to recompute the log info at hand that I can
>> use? I vaguely remember that you mentioned it some time ago.

>Doesn't just rerunning vhost_get_vq_desc work?

Am I missing something here?
vhost_get_vq_desc() looks in the vq, finds the first available buffer, and converts it
to an iovec. I think the first available buffer is not one of the buffers already sitting
in the async queue, so I don't think rerunning vhost_get_vq_desc() can work.

Thanks
Xiaohui

> > > Thanks
> > > Xiaohui
> > >
> > >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > >  drivers/vhost/vhost.h |   10 +++
> > >  2 files changed, 192 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 22d5fef..2aafd90 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -17,11 +17,13 @@
> > >  #include <linux/workqueue.h>
> > >  #include <linux/rcupdate.h>
> > >  #include <linux/file.h>
> > > +#include <linux/aio.h>
> > >
> > >  #include <linux/net.h>
> > >  #include <linux/if_packet.h>
> > >  #include <linux/if_arp.h>
> > >  #include <linux/if_tun.h>
> > > +#include <linux/mpassthru.h>
> > >
> > >  #include <net/sock.h>
> > >
> > > @@ -47,6 +49,7 @@ struct vhost_net {
> > >   struct vhost_dev dev;
> > >   struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > >   struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > > + struct kmem_cache       *cache;
> > >   /* Tells us whether we are polling a socket for TX.
> > >    * We only do this when socket buffer fills up.
> > >    * Protected by tx vq lock. */
> > > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > >   net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > >  }
> > >
> > > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > > +{
> > > + struct kiocb *iocb = NULL;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&vq->notify_lock, flags);
> > > + if (!list_empty(&vq->notifier)) {
> > > +         iocb = list_first_entry(&vq->notifier,
> > > +                         struct kiocb, ki_list);
> > > +         list_del(&iocb->ki_list);
> > > + }
> > > + spin_unlock_irqrestore(&vq->notify_lock, flags);
> > > + return iocb;
> > > +}
> > > +
> > > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > > +                                 struct vhost_virtqueue *vq)
> > > +{
> > > + struct kiocb *iocb = NULL;
> > > + struct vhost_log *vq_log = NULL;
> > > + int rx_total_len = 0;
> > > + int log, size;
> > > +
> > > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +         return;
> > > +
> > > + if (vq->receiver)
> > > +         vq->receiver(vq);
> > > +
> > > + vq_log = unlikely(vhost_has_feature(
> > > +                         &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +         vhost_add_used_and_signal(&net->dev, vq,
> > > +                         iocb->ki_pos, iocb->ki_nbytes);
> > > +         log = (int)iocb->ki_user_data;
> > > +         size = iocb->ki_nbytes;
> > > +         rx_total_len += iocb->ki_nbytes;
> > > +
> > > +         if (iocb->ki_dtor)
> > > +                 iocb->ki_dtor(iocb);
> > > +         kmem_cache_free(net->cache, iocb);
> > > +
> > > +         if (unlikely(vq_log))
> > > +                 vhost_log_write(vq, vq_log, log, size);
> > > +         if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > > +                 vhost_poll_queue(&vq->poll);
> > > +                 break;
> > > +         }
> > > + }
> > > +}
> > > +
> > > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > > +                                 struct vhost_virtqueue *vq)
> > > +{
> > > + struct kiocb *iocb = NULL;
> > > + int tx_total_len = 0;
> > > +
> > > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +         return;
> > > +
> > > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +         vhost_add_used_and_signal(&net->dev, vq,
> > > +                         iocb->ki_pos, 0);
> > > +         tx_total_len += iocb->ki_nbytes;
> > > +
> > > +         if (iocb->ki_dtor)
> > > +                 iocb->ki_dtor(iocb);
> > > +
> > > +         kmem_cache_free(net->cache, iocb);
> > > +         if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > > +                 vhost_poll_queue(&vq->poll);
> > > +                 break;
> > > +         }
> > > + }
> > > +}
> > > +
> > >  /* Expects to be always run from workqueue - which acts as
> > >   * read-size critical section for our kind of RCU. */
> > >  static void handle_tx(struct vhost_net *net)
> > >  {
> > >   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > > + struct kiocb *iocb = NULL;
> > >   unsigned head, out, in, s;
> > >   struct msghdr msg = {
> > >           .msg_name = NULL,
> > > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > >           tx_poll_stop(net);
> > >   hdr_size = vq->hdr_size;
> > >
> > > + handle_async_tx_events_notify(net, vq);
> > > +
> > >   for (;;) {
> > >           head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >                                    ARRAY_SIZE(vq->iov),
> > > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > >           /* Skip header. TODO: support TSO. */
> > >           s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > >           msg.msg_iovlen = out;
> > > +
> > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +                 iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +                 if (!iocb)
> > > +                         break;
> > > +                 iocb->ki_pos = head;
> > > +                 iocb->private = (void *)vq;
> > > +         }
> > > +
> > >           len = iov_length(vq->iov, out);
> > >           /* Sanity check */
> > >           if (!len) {
> > > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > >                   break;
> > >           }
> > >           /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > -         err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > +         err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > >           if (unlikely(err < 0)) {
> > >                   vhost_discard_vq_desc(vq);
> > >                   tx_poll_start(net, sock);
> > >                   break;
> > >           }
> > > +
> > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +                 continue;
> > > +
> > >           if (err != len)
> > >                   pr_err("Truncated TX packet: "
> > >                          " len %d != %zd\n", err, len);
> > > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > >           }
> > >   }
> > >
> > > + handle_async_tx_events_notify(net, vq);
> > > +
> > >   mutex_unlock(&vq->mutex);
> > >   unuse_mm(net->dev.mm);
> > >  }
> > > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > >  static void handle_rx(struct vhost_net *net)
> > >  {
> > >   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > + struct kiocb *iocb = NULL;
> > >   unsigned head, out, in, log, s;
> > >   struct vhost_log *vq_log;
> > >   struct msghdr msg = {
> > > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > >   int err;
> > >   size_t hdr_size;
> > >   struct socket *sock = rcu_dereference(vq->private_data);
> > > - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > + if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > > +                 vq->link_state == VHOST_VQ_LINK_SYNC))
> > >           return;
> > >
> > >   use_mm(net->dev.mm);
> > > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > >   vhost_disable_notify(vq);
> > >   hdr_size = vq->hdr_size;
> > >
> > > - vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > + /* In async cases, for write logging, the simple way is to get
> > > +  * the log info always, and really logging is decided later.
> > > +  * Thus, when logging enabled, we can get log, and when logging
> > > +  * disabled, we can get log disabled accordingly.
> > > +  */
> > > +
> > > + vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > > +         (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > >           vq->log : NULL;
> > >
> > > + handle_async_rx_events_notify(net, vq);
> > > +
> > >   for (;;) {
> > >           head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >                                    ARRAY_SIZE(vq->iov),
> > > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > >           s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > >           msg.msg_iovlen = in;
> > >           len = iov_length(vq->iov, in);
> > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +                 iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +                 if (!iocb)
> > > +                         break;
> > > +                 iocb->private = vq;
> > > +                 iocb->ki_pos = head;
> > > +                 iocb->ki_user_data = log;
> > > +         }
> > >           /* Sanity check */
> > >           if (!len) {
> > >                   vq_err(vq, "Unexpected header len for RX: "
> > > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > >                          iov_length(vq->hdr, s), hdr_size);
> > >                   break;
> > >           }
> > > -         err = sock->ops->recvmsg(NULL, sock, &msg,
> > > +
> > > +         err = sock->ops->recvmsg(iocb, sock, &msg,
> > >                                    len, MSG_DONTWAIT | MSG_TRUNC);
> > >           /* TODO: Check specific error and bomb out unless EAGAIN? */
> > >           if (err < 0) {
> > >                   vhost_discard_vq_desc(vq);
> > >                   break;
> > >           }
> > > +
> > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +                 continue;
> > > +
> > >           /* TODO: Should check and handle checksum. */
> > >           if (err > len) {
> > >                   pr_err("Discarded truncated rx packet: "
> > > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > >           }
> > >   }
> > >
> > > + handle_async_rx_events_notify(net, vq);
> > > +
> > >   mutex_unlock(&vq->mutex);
> > >   unuse_mm(net->dev.mm);
> > >  }
> > >
> > > +
> > >  static void handle_tx_kick(struct work_struct *work)
> > >  {
> > >   struct vhost_virtqueue *vq;
> > > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > >   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > >   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > >   n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > > + n->cache = NULL;
> > >   return 0;
> > >  }
> > >
> > > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > >   vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > >  }
> > >
> > > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > > +{
> > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > > + struct kiocb *iocb = NULL;
> > > + if (n->cache) {
> > > +         while ((iocb = notify_dequeue(vq)) != NULL)
> > > +                 kmem_cache_free(n->cache, iocb);
> > > +         kmem_cache_destroy(n->cache);
> > > + }
> > > +}
> > > +
> > >  static int vhost_net_release(struct inode *inode, struct file *f)
> > >  {
> > >   struct vhost_net *n = f->private_data;
> > > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > >   /* We do an extra flush before freeing memory,
> > >    * since jobs can re-queue themselves. */
> > >   vhost_net_flush(n);
> > > + vhost_notifier_cleanup(n);
> > >   kfree(n);
> > >   return 0;
> > >  }
> > > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > >   return sock;
> > >  }
> > >
> > > -static struct socket *get_socket(int fd)
> > > +static struct socket *get_mp_socket(int fd)
> > > +{
> > > + struct file *file = fget(fd);
> > > + struct socket *sock;
> > > + if (!file)
> > > +         return ERR_PTR(-EBADF);
> > > + sock = mp_get_socket(file);
> > > + if (IS_ERR(sock))
> > > +         fput(file);
> > > + return sock;
> > > +}
> > > +
> > > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > >  {
> > >   struct socket *sock;
> > >   if (fd == -1)
> > > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > >   sock = get_tun_socket(fd);
> > >   if (!IS_ERR(sock))
> > >           return sock;
> > > + sock = get_mp_socket(fd);
> > > + if (!IS_ERR(sock)) {
> > > +         vq->link_state = VHOST_VQ_LINK_ASYNC;
> > > +         return sock;
> > > + }
> > >   return ERR_PTR(-ENOTSOCK);
> > >  }
> > >
> > > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > > +{
> > > + struct vhost_virtqueue *vq = n->vqs + index;
> > > +
> > > + WARN_ON(!mutex_is_locked(&vq->mutex));
> > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +         vq->receiver = NULL;
> > > +         INIT_LIST_HEAD(&vq->notifier);
> > > +         spin_lock_init(&vq->notify_lock);
> > > +         if (!n->cache) {
> > > +                 n->cache = kmem_cache_create("vhost_kiocb",
> > > +                                 sizeof(struct kiocb), 0,
> > > +                                 SLAB_HWCACHE_ALIGN, NULL);
> > > +         }
> > > + }
> > > +}
> > > +
> > >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  {
> > >   struct socket *sock, *oldsock;
> > > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >   }
> > >   vq = n->vqs + index;
> > >   mutex_lock(&vq->mutex);
> > > - sock = get_socket(fd);
> > > + vq->link_state = VHOST_VQ_LINK_SYNC;
> > > + sock = get_socket(vq, fd);
> > >   if (IS_ERR(sock)) {
> > >           r = PTR_ERR(sock);
> > >           goto err;
> > >   }
> > >
> > > + vhost_init_link_state(n, index);
> > > +
> > >   /* start polling new socket */
> > >   oldsock = vq->private_data;
> > >   if (sock == oldsock)
> > > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >   vhost_net_disable_vq(n, vq);
> > >   rcu_assign_pointer(vq->private_data, sock);
> > >   vhost_net_enable_vq(n, vq);
> > > - mutex_unlock(&vq->mutex);
> > >  done:
> > > + mutex_unlock(&vq->mutex);
> > >   mutex_unlock(&n->dev.mutex);
> > >   if (oldsock) {
> > >           vhost_net_flush_vq(n, index);
> > > @@ -516,6 +690,7 @@ done:
> > >   }
> > >   return r;
> > >  err:
> > > + mutex_unlock(&vq->mutex);
> > >   mutex_unlock(&n->dev.mutex);
> > >   return r;
> > >  }
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index d1f0453..cffe39a 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -43,6 +43,11 @@ struct vhost_log {
> > >   u64 len;
> > >  };
> > >
> > > +enum vhost_vq_link_state {
> > > + VHOST_VQ_LINK_SYNC =    0,
> > > + VHOST_VQ_LINK_ASYNC =   1,
> > > +};
> > > +
> > >  /* The virtqueue structure describes a queue attached to a device. */
> > >  struct vhost_virtqueue {
> > >   struct vhost_dev *dev;
> > > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > >   /* Log write descriptors */
> > >   void __user *log_base;
> > >   struct vhost_log log[VHOST_NET_MAX_SG];
> > > + /*Differiate async socket for 0-copy from normal*/
> > > + enum vhost_vq_link_state link_state;
> > > + struct list_head notifier;
> > > + spinlock_t notify_lock;
> > > + void (*receiver)(struct vhost_virtqueue *);
> > >  };
> > >
> > >  struct vhost_dev {
> > > --
> > > 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-01 11:08       ` [PATCH " Michael S. Tsirkin
  2010-04-06  5:41         ` Xin, Xiaohui
@ 2010-04-07  2:41         ` Xin, Xiaohui
  2010-04-07  8:15           ` Michael S. Tsirkin
  1 sibling, 1 reply; 33+ messages in thread
From: Xin, Xiaohui @ 2010-04-07  2:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81

Michael,
 
>> Qemu needs a userspace write; is that a synchronous one or
>>an asynchronous one?

>It's a synchronous non-blocking write.
Sorry, why does Qemu live migration need the device to have a userspace write?
How does the write operation work? And why is a read operation not a concern here?

Thanks
Xiaohui

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-07  2:41         ` Xin, Xiaohui
@ 2010-04-07  8:15           ` Michael S. Tsirkin
  2010-04-07  9:00             ` xiaohui.xin
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-07  8:15 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike, yzhao81

On Wed, Apr 07, 2010 at 10:41:08AM +0800, Xin, Xiaohui wrote:
> Michael,
>  
> >> Qemu needs a userspace write; is that a synchronous one or
> >>an asynchronous one?
> 
> >It's a synchronous non-blocking write.
> Sorry, why does Qemu live migration need the device to have a userspace write?
> How does the write operation work? And why is a read operation not a concern here?
> 
> Thanks
> Xiaohui

Roughly, with Ethernet bridges, moving a device from one location in
the network to another makes forwarding tables incorrect (or incomplete)
until outgoing traffic from the device causes these tables
to be updated. Since there's no guarantee that the guest
will generate outgoing traffic, after migration qemu sends out several
dummy packets itself.
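
What is in those packets hardly matters; a learning bridge only looks at the
source MAC address. Purely as an illustration (this is not qemu's actual code,
and the ethertype below is arbitrary), a dummy announce frame could be built
like this:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Build a minimal broadcast frame whose source MAC is the guest's MAC,
 * so that bridges along the new path re-learn where the guest lives. */
static size_t build_announce_frame(uint8_t buf[60], const uint8_t guest_mac[6])
{
	memset(buf, 0xff, 6);			/* destination: broadcast */
	memcpy(buf + 6, guest_mac, 6);		/* source: the guest's MAC */
	buf[12] = 0x08;				/* arbitrary ethertype */
	buf[13] = 0x00;
	memset(buf + 14, 0, 60 - 14);		/* pad to minimum length */
	return 60;
}

That's why the backend needs to support a plain userspace write: after
migration qemu writes these dummy frames into the device itself.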

-- 
MST

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-07  1:36                               ` Xin, Xiaohui
@ 2010-04-07  8:18                                 ` Michael S. Tsirkin
  2010-04-08  9:07                                   ` xiaohui.xin
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-07  8:18 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Wed, Apr 07, 2010 at 09:36:36AM +0800, Xin, Xiaohui wrote:
> Michael,
> > > >>>> For the write logging, do you have a function in hand that we can
> > > >>>> use to recompute the log? If so, I think I can use it to recompute the
> > > >>>> log info when the logging is suddenly enabled.
> > > >>>> For the outstanding requests, do you mean all the user buffers that have
> > > >>>> been submitted before the logging ioctl changed? That may be a lot, and
> > > >>>> some of them are still in NIC ring descriptors. Waiting for them to
> > > >>>> finish may take some time. I think it is also reasonable that, when the
> > > >>>> logging ioctl changes, the logging simply changes from that point on.
> 
> > > >>>The key point is that after the logging ioctl returns, any
> > > >>>subsequent change to memory must be logged. It does not
> > > >>>matter when the request was submitted; otherwise we will
> > > >>>get memory corruption on migration.
> 
> > > >>The change to memory happens in vhost_add_used_and_signal(), right?
> > > >>So after the ioctl returns, it is ok to just recompute the log info for the
> > > >>events in the async queue, since the ioctl and write-log operations are all
> > > >>protected by vq->mutex.
> 
> > >>> Thanks
> > >> >Xiaohui
> 
> > >>Yes, I think this will work.
> 
> >> Thanks, so do you have the function to recompute the log info at hand that I can
> >> use? I vaguely remember that you mentioned it some time ago.
> 
> >Doesn't just rerunning vhost_get_vq_desc work?
> 
> Am I missing something here?
> vhost_get_vq_desc() looks in the vq, finds the first available buffer, and converts it
> to an iovec. I think the first available buffer is not one of the buffers already sitting
> in the async queue, so I don't think rerunning vhost_get_vq_desc() can work.
> 
> Thanks
> Xiaohui

Right, but we can move the head back, so we'll find the same buffers
again, or add a variant of vhost_get_vq_desc that will process
descriptors already consumed.
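
For the first option, a minimal sketch of "moving the head back" could be a
hypothetical helper that mirrors what vhost_discard_vq_desc() already does
(caller holds vq->mutex):

	/* Sketch: re-expose the most recently fetched descriptor so that the
	 * next vhost_get_vq_desc() call returns the same head again. */
	static void vhost_requeue_last_desc(struct vhost_virtqueue *vq)
	{
		vq->last_avail_idx--;
	}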

> > > > Thanks
> > > > Xiaohui
> > > >
> > > >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > > >  drivers/vhost/vhost.h |   10 +++
> > > >  2 files changed, 192 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > index 22d5fef..2aafd90 100644
> > > > --- a/drivers/vhost/net.c
> > > > +++ b/drivers/vhost/net.c
> > > > @@ -17,11 +17,13 @@
> > > >  #include <linux/workqueue.h>
> > > >  #include <linux/rcupdate.h>
> > > >  #include <linux/file.h>
> > > > +#include <linux/aio.h>
> > > >
> > > >  #include <linux/net.h>
> > > >  #include <linux/if_packet.h>
> > > >  #include <linux/if_arp.h>
> > > >  #include <linux/if_tun.h>
> > > > +#include <linux/mpassthru.h>
> > > >
> > > >  #include <net/sock.h>
> > > >
> > > > @@ -47,6 +49,7 @@ struct vhost_net {
> > > >   struct vhost_dev dev;
> > > >   struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > > >   struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > > > + struct kmem_cache       *cache;
> > > >   /* Tells us whether we are polling a socket for TX.
> > > >    * We only do this when socket buffer fills up.
> > > >    * Protected by tx vq lock. */
> > > > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > > >   net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > > >  }
> > > >
> > > > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > > > +{
> > > > + struct kiocb *iocb = NULL;
> > > > + unsigned long flags;
> > > > +
> > > > + spin_lock_irqsave(&vq->notify_lock, flags);
> > > > + if (!list_empty(&vq->notifier)) {
> > > > +         iocb = list_first_entry(&vq->notifier,
> > > > +                         struct kiocb, ki_list);
> > > > +         list_del(&iocb->ki_list);
> > > > + }
> > > > + spin_unlock_irqrestore(&vq->notify_lock, flags);
> > > > + return iocb;
> > > > +}
> > > > +
> > > > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > > > +                                 struct vhost_virtqueue *vq)
> > > > +{
> > > > + struct kiocb *iocb = NULL;
> > > > + struct vhost_log *vq_log = NULL;
> > > > + int rx_total_len = 0;
> > > > + int log, size;
> > > > +
> > > > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > > +         return;
> > > > +
> > > > + if (vq->receiver)
> > > > +         vq->receiver(vq);
> > > > +
> > > > + vq_log = unlikely(vhost_has_feature(
> > > > +                         &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > > > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > > > +         vhost_add_used_and_signal(&net->dev, vq,
> > > > +                         iocb->ki_pos, iocb->ki_nbytes);
> > > > +         log = (int)iocb->ki_user_data;
> > > > +         size = iocb->ki_nbytes;
> > > > +         rx_total_len += iocb->ki_nbytes;
> > > > +
> > > > +         if (iocb->ki_dtor)
> > > > +                 iocb->ki_dtor(iocb);
> > > > +         kmem_cache_free(net->cache, iocb);
> > > > +
> > > > +         if (unlikely(vq_log))
> > > > +                 vhost_log_write(vq, vq_log, log, size);
> > > > +         if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > > > +                 vhost_poll_queue(&vq->poll);
> > > > +                 break;
> > > > +         }
> > > > + }
> > > > +}
> > > > +
> > > > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > > > +                                 struct vhost_virtqueue *vq)
> > > > +{
> > > > + struct kiocb *iocb = NULL;
> > > > + int tx_total_len = 0;
> > > > +
> > > > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > > +         return;
> > > > +
> > > > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > > > +         vhost_add_used_and_signal(&net->dev, vq,
> > > > +                         iocb->ki_pos, 0);
> > > > +         tx_total_len += iocb->ki_nbytes;
> > > > +
> > > > +         if (iocb->ki_dtor)
> > > > +                 iocb->ki_dtor(iocb);
> > > > +
> > > > +         kmem_cache_free(net->cache, iocb);
> > > > +         if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > > > +                 vhost_poll_queue(&vq->poll);
> > > > +                 break;
> > > > +         }
> > > > + }
> > > > +}
> > > > +
> > > >  /* Expects to be always run from workqueue - which acts as
> > > >   * read-size critical section for our kind of RCU. */
> > > >  static void handle_tx(struct vhost_net *net)
> > > >  {
> > > >   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > > > + struct kiocb *iocb = NULL;
> > > >   unsigned head, out, in, s;
> > > >   struct msghdr msg = {
> > > >           .msg_name = NULL,
> > > > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > > >           tx_poll_stop(net);
> > > >   hdr_size = vq->hdr_size;
> > > >
> > > > + handle_async_tx_events_notify(net, vq);
> > > > +
> > > >   for (;;) {
> > > >           head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > >                                    ARRAY_SIZE(vq->iov),
> > > > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > > >           /* Skip header. TODO: support TSO. */
> > > >           s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > > >           msg.msg_iovlen = out;
> > > > +
> > > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > > +                 iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > > +                 if (!iocb)
> > > > +                         break;
> > > > +                 iocb->ki_pos = head;
> > > > +                 iocb->private = (void *)vq;
> > > > +         }
> > > > +
> > > >           len = iov_length(vq->iov, out);
> > > >           /* Sanity check */
> > > >           if (!len) {
> > > > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > > >                   break;
> > > >           }
> > > >           /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > > -         err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > > +         err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > > >           if (unlikely(err < 0)) {
> > > >                   vhost_discard_vq_desc(vq);
> > > >                   tx_poll_start(net, sock);
> > > >                   break;
> > > >           }
> > > > +
> > > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > > +                 continue;
> > > > +
> > > >           if (err != len)
> > > >                   pr_err("Truncated TX packet: "
> > > >                          " len %d != %zd\n", err, len);
> > > > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > > >           }
> > > >   }
> > > >
> > > > + handle_async_tx_events_notify(net, vq);
> > > > +
> > > >   mutex_unlock(&vq->mutex);
> > > >   unuse_mm(net->dev.mm);
> > > >  }
> > > > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > > >  static void handle_rx(struct vhost_net *net)
> > > >  {
> > > >   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > > + struct kiocb *iocb = NULL;
> > > >   unsigned head, out, in, log, s;
> > > >   struct vhost_log *vq_log;
> > > >   struct msghdr msg = {
> > > > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > > >   int err;
> > > >   size_t hdr_size;
> > > >   struct socket *sock = rcu_dereference(vq->private_data);
> > > > - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > > + if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > > > +                 vq->link_state == VHOST_VQ_LINK_SYNC))
> > > >           return;
> > > >
> > > >   use_mm(net->dev.mm);
> > > > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > > >   vhost_disable_notify(vq);
> > > >   hdr_size = vq->hdr_size;
> > > >
> > > > - vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > > + /* In async cases, for write logging, the simple way is to get
> > > > +  * the log info always, and really logging is decided later.
> > > > +  * Thus, when logging enabled, we can get log, and when logging
> > > > +  * disabled, we can get log disabled accordingly.
> > > > +  */
> > > > +
> > > > + vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > > > +         (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > > >           vq->log : NULL;
> > > >
> > > > + handle_async_rx_events_notify(net, vq);
> > > > +
> > > >   for (;;) {
> > > >           head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > >                                    ARRAY_SIZE(vq->iov),
> > > > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > > >           s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > > >           msg.msg_iovlen = in;
> > > >           len = iov_length(vq->iov, in);
> > > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > > +                 iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > > +                 if (!iocb)
> > > > +                         break;
> > > > +                 iocb->private = vq;
> > > > +                 iocb->ki_pos = head;
> > > > +                 iocb->ki_user_data = log;
> > > > +         }
> > > >           /* Sanity check */
> > > >           if (!len) {
> > > >                   vq_err(vq, "Unexpected header len for RX: "
> > > > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > > >                          iov_length(vq->hdr, s), hdr_size);
> > > >                   break;
> > > >           }
> > > > -         err = sock->ops->recvmsg(NULL, sock, &msg,
> > > > +
> > > > +         err = sock->ops->recvmsg(iocb, sock, &msg,
> > > >                                    len, MSG_DONTWAIT | MSG_TRUNC);
> > > >           /* TODO: Check specific error and bomb out unless EAGAIN? */
> > > >           if (err < 0) {
> > > >                   vhost_discard_vq_desc(vq);
> > > >                   break;
> > > >           }
> > > > +
> > > > +         if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > > +                 continue;
> > > > +
> > > >           /* TODO: Should check and handle checksum. */
> > > >           if (err > len) {
> > > >                   pr_err("Discarded truncated rx packet: "
> > > > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > > >           }
> > > >   }
> > > >
> > > > + handle_async_rx_events_notify(net, vq);
> > > > +
> > > >   mutex_unlock(&vq->mutex);
> > > >   unuse_mm(net->dev.mm);
> > > >  }
> > > >
> > > > +
> > > >  static void handle_tx_kick(struct work_struct *work)
> > > >  {
> > > >   struct vhost_virtqueue *vq;
> > > > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > > >   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > > >   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > > >   n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > > > + n->cache = NULL;
> > > >   return 0;
> > > >  }
> > > >
> > > > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > > >   vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > > >  }
> > > >
> > > > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > > > +{
> > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > > > + struct kiocb *iocb = NULL;
> > > > + if (n->cache) {
> > > > +         while ((iocb = notify_dequeue(vq)) != NULL)
> > > > +                 kmem_cache_free(n->cache, iocb);
> > > > +         kmem_cache_destroy(n->cache);
> > > > + }
> > > > +}
> > > > +
> > > >  static int vhost_net_release(struct inode *inode, struct file *f)
> > > >  {
> > > >   struct vhost_net *n = f->private_data;
> > > > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > > >   /* We do an extra flush before freeing memory,
> > > >    * since jobs can re-queue themselves. */
> > > >   vhost_net_flush(n);
> > > > + vhost_notifier_cleanup(n);
> > > >   kfree(n);
> > > >   return 0;
> > > >  }
> > > > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > > >   return sock;
> > > >  }
> > > >
> > > > -static struct socket *get_socket(int fd)
> > > > +static struct socket *get_mp_socket(int fd)
> > > > +{
> > > > + struct file *file = fget(fd);
> > > > + struct socket *sock;
> > > > + if (!file)
> > > > +         return ERR_PTR(-EBADF);
> > > > + sock = mp_get_socket(file);
> > > > + if (IS_ERR(sock))
> > > > +         fput(file);
> > > > + return sock;
> > > > +}
> > > > +
> > > > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > > >  {
> > > >   struct socket *sock;
> > > >   if (fd == -1)
> > > > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > > >   sock = get_tun_socket(fd);
> > > >   if (!IS_ERR(sock))
> > > >           return sock;
> > > > + sock = get_mp_socket(fd);
> > > > + if (!IS_ERR(sock)) {
> > > > +         vq->link_state = VHOST_VQ_LINK_ASYNC;
> > > > +         return sock;
> > > > + }
> > > >   return ERR_PTR(-ENOTSOCK);
> > > >  }
> > > >
> > > > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > > > +{
> > > > + struct vhost_virtqueue *vq = n->vqs + index;
> > > > +
> > > > + WARN_ON(!mutex_is_locked(&vq->mutex));
> > > > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > > +         vq->receiver = NULL;
> > > > +         INIT_LIST_HEAD(&vq->notifier);
> > > > +         spin_lock_init(&vq->notify_lock);
> > > > +         if (!n->cache) {
> > > > +                 n->cache = kmem_cache_create("vhost_kiocb",
> > > > +                                 sizeof(struct kiocb), 0,
> > > > +                                 SLAB_HWCACHE_ALIGN, NULL);
> > > > +         }
> > > > + }
> > > > +}
> > > > +
> > > >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > > >  {
> > > >   struct socket *sock, *oldsock;
> > > > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > > >   }
> > > >   vq = n->vqs + index;
> > > >   mutex_lock(&vq->mutex);
> > > > - sock = get_socket(fd);
> > > > + vq->link_state = VHOST_VQ_LINK_SYNC;
> > > > + sock = get_socket(vq, fd);
> > > >   if (IS_ERR(sock)) {
> > > >           r = PTR_ERR(sock);
> > > >           goto err;
> > > >   }
> > > >
> > > > + vhost_init_link_state(n, index);
> > > > +
> > > >   /* start polling new socket */
> > > >   oldsock = vq->private_data;
> > > >   if (sock == oldsock)
> > > > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > > >   vhost_net_disable_vq(n, vq);
> > > >   rcu_assign_pointer(vq->private_data, sock);
> > > >   vhost_net_enable_vq(n, vq);
> > > > - mutex_unlock(&vq->mutex);
> > > >  done:
> > > > + mutex_unlock(&vq->mutex);
> > > >   mutex_unlock(&n->dev.mutex);
> > > >   if (oldsock) {
> > > >           vhost_net_flush_vq(n, index);
> > > > @@ -516,6 +690,7 @@ done:
> > > >   }
> > > >   return r;
> > > >  err:
> > > > + mutex_unlock(&vq->mutex);
> > > >   mutex_unlock(&n->dev.mutex);
> > > >   return r;
> > > >  }
> > > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > > index d1f0453..cffe39a 100644
> > > > --- a/drivers/vhost/vhost.h
> > > > +++ b/drivers/vhost/vhost.h
> > > > @@ -43,6 +43,11 @@ struct vhost_log {
> > > >   u64 len;
> > > >  };
> > > >
> > > > +enum vhost_vq_link_state {
> > > > + VHOST_VQ_LINK_SYNC =    0,
> > > > + VHOST_VQ_LINK_ASYNC =   1,
> > > > +};
> > > > +
> > > >  /* The virtqueue structure describes a queue attached to a device. */
> > > >  struct vhost_virtqueue {
> > > >   struct vhost_dev *dev;
> > > > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > > >   /* Log write descriptors */
> > > >   void __user *log_base;
> > > >   struct vhost_log log[VHOST_NET_MAX_SG];
> > > > + /* Differentiate async socket for 0-copy from normal */
> > > > + enum vhost_vq_link_state link_state;
> > > > + struct list_head notifier;
> > > > + spinlock_t notify_lock;
> > > > + void (*receiver)(struct vhost_virtqueue *);
> > > >  };
> > > >
> > > >  struct vhost_dev {
> > > > --
> > > > 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re:[PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-07  8:15           ` Michael S. Tsirkin
@ 2010-04-07  9:00             ` xiaohui.xin
  2010-04-07 11:17               ` [PATCH " Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: xiaohui.xin @ 2010-04-07  9:00 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, mingo, jdike, Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

---

Michael,
Thanks a lot for the explanation. I have drafted a patch for the qemu write
after looking into the tun driver. Does it do it the right way?

Thanks
Xiaohui

 drivers/vhost/mpassthru.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index e9449ac..1cde097 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -1065,6 +1065,49 @@ static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
 	return mask;
 }
 
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result;
+
+	if (!mp)
+		return -EBADFD;
+
+	/* currently, async is not supported */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -EFAULT;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+	skb_set_network_header(skb, ETH_HLEN);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+	mp->dev->stats.tx_packets++;
+	mp->dev->stats.tx_bytes += len;
+
+	mp_put(mp);
+	return result;
+}
+
 static int mp_chr_close(struct inode *inode, struct file *file)
 {
 	struct mp_file *mfile = file->private_data;
@@ -1084,6 +1127,8 @@ static int mp_chr_close(struct inode *inode, struct file *file)
 static const struct file_operations mp_fops = {
 	.owner  = THIS_MODULE,
 	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
 	.poll   = mp_chr_poll,
 	.unlocked_ioctl = mp_chr_ioctl,
 	.open   = mp_chr_open,
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
  2010-04-07  9:00             ` xiaohui.xin
@ 2010-04-07 11:17               ` Michael S. Tsirkin
  0 siblings, 0 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2010-04-07 11:17 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike

On Wed, Apr 07, 2010 at 05:00:39PM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> ---
> 
> Michael,
> Thanks a lot for the explanation. I have drafted a patch for the qemu write
> after looking into the tun driver. Does it do it the right way?
> 
> Thanks
> Xiaohui
> 
>  drivers/vhost/mpassthru.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 45 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> index e9449ac..1cde097 100644
> --- a/drivers/vhost/mpassthru.c
> +++ b/drivers/vhost/mpassthru.c
> @@ -1065,6 +1065,49 @@ static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
>  	return mask;
>  }
>  
> +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct mp_struct *mp = mp_get(file->private_data);
> +	struct sock *sk = mp->socket.sk;
> +	struct sk_buff *skb;
> +	int len, err;
> +	ssize_t result;
> +
> +	if (!mp)
> +		return -EBADFD;
> +

Can this happen? When?
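
Note, by the way, that mp->socket.sk is already dereferenced above, before
this check runs; if mp really can be NULL, the check has to move up, along
the lines of (sketch only):

	struct mp_struct *mp = mp_get(file->private_data);
	struct sock *sk;

	if (!mp)
		return -EBADFD;
	sk = mp->socket.sk;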

> +	/* currently, async is not supported */
> +	if (!is_sync_kiocb(iocb))
> +		return -EFAULT;

Really necessary? I think do_sync_write handles all this.
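
For reference, do_sync_write() is roughly shaped like the sketch below
(paraphrased from memory, not the exact fs/read_write.c code), so the kiocb
your ->aio_write sees through the plain .write path is always a synchronous
one:

	ssize_t do_sync_write(struct file *filp, const char __user *buf,
			      size_t len, loff_t *ppos)
	{
		struct iovec iov = { .iov_base = (void __user *)buf,
				     .iov_len = len };
		struct kiocb kiocb;
		ssize_t ret;

		init_sync_kiocb(&kiocb, filp);	/* is_sync_kiocb() is true */
		kiocb.ki_pos = *ppos;
		ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
		if (ret == -EIOCBQUEUED)
			ret = wait_on_sync_kiocb(&kiocb);
		*ppos = kiocb.ki_pos;
		return ret;
	}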

> +
> +	len = iov_length(iov, count);
> +	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
> +				  file->f_flags & O_NONBLOCK, &err);
> +
> +	if (!skb)
> +		return -EFAULT;

Surely not EFAULT. -EAGAIN?

> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +	skb_put(skb, len);
> +
> +	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
> +		kfree_skb(skb);
> +		return -EFAULT;
> +	}
> +	skb_set_network_header(skb, ETH_HLEN);

Is this really right or necessary? Also,
probably need to check that length is at least ETH_ALEN before
doing this.
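
For instance, rejecting short writes up front, before the skb is even
allocated (sketch only; ETH_HLEN used here so that the whole header,
including the type field read below, is known to be present):

	if (len < ETH_HLEN)	/* need at least a full ethernet header */
		return -EINVAL;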

> +	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);

eth_type_trans?
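
From memory, when a frame is written from userspace in tap mode, the tun
driver does roughly:

	skb->protocol = eth_type_trans(skb, tun->dev);
	netif_rx_ni(skb);

though whether that fits the dev_queue_xmit() path used here is a separate
question.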

> +	skb->dev = mp->dev;
> +
> +	dev_queue_xmit(skb);
> +	mp->dev->stats.tx_packets++;
> +	mp->dev->stats.tx_bytes += len;

Doesn't the hard start xmit function for the device
increment the counters?
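
For what it's worth, drivers typically bump these in their hard_start_xmit
path already, roughly:

	dev->stats.tx_packets++;
	dev->stats.tx_bytes += skb->len;

so incrementing them again here would double-count.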

> +
> +	mp_put(mp);
> +	return result;
> +}
> +
>  static int mp_chr_close(struct inode *inode, struct file *file)
>  {
>  	struct mp_file *mfile = file->private_data;
> @@ -1084,6 +1127,8 @@ static int mp_chr_close(struct inode *inode, struct file *file)
>  static const struct file_operations mp_fops = {
>  	.owner  = THIS_MODULE,
>  	.llseek = no_llseek,
> +	.write  = do_sync_write,
> +	.aio_write = mp_chr_aio_write,
>  	.poll   = mp_chr_poll,
>  	.unlocked_ioctl = mp_chr_ioctl,
>  	.open   = mp_chr_open,
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
  2010-04-07  8:18                                 ` Michael S. Tsirkin
@ 2010-04-08  9:07                                   ` xiaohui.xin
  0 siblings, 0 replies; 33+ messages in thread
From: xiaohui.xin @ 2010-04-08  9:07 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, linux-kernel, jdike, Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

---
Michael,
This is a small patch for the write logging issue with the async queue.
I have added a __vhost_get_vq_desc() function which can compute the log
info for any valid buffer index. __vhost_get_vq_desc() is factored out
of the existing code in vhost_get_vq_desc(), and I use it to recompute
the log info when logging is enabled.

Thanks
Xiaohui

 drivers/vhost/net.c   |   27 ++++++++---
 drivers/vhost/vhost.c |  115 ++++++++++++++++++++++++++++---------------------
 drivers/vhost/vhost.h |    5 ++
 3 files changed, 90 insertions(+), 57 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2aafd90..00a45ef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -115,7 +115,8 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 	struct kiocb *iocb = NULL;
 	struct vhost_log *vq_log = NULL;
 	int rx_total_len = 0;
-	int log, size;
+	unsigned int head, log, in, out;
+	int size;
 
 	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
 		return;
@@ -130,14 +131,25 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 				iocb->ki_pos, iocb->ki_nbytes);
 		log = (int)iocb->ki_user_data;
 		size = iocb->ki_nbytes;
+		head = iocb->ki_pos;
 		rx_total_len += iocb->ki_nbytes;
 
 		if (iocb->ki_dtor)
 			iocb->ki_dtor(iocb);
 		kmem_cache_free(net->cache, iocb);
 
-		if (unlikely(vq_log))
+		/* When logging is enabled, the log info must be recomputed,
+		 * since these buffers sat in the async queue and may not
+		 * have picked up log info earlier.
+		 */
+		if (unlikely(vq_log)) {
+			if (!log)
+				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						    ARRAY_SIZE(vq->iov),
+						    &out, &in, vq_log,
+						    &log, head);
 			vhost_log_write(vq, vq_log, log, size);
+		}
 		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
@@ -313,14 +325,13 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
-	/* In async cases, for write logging, the simple way is to get
-	 * the log info always, and really logging is decided later.
-	 * Thus, when logging enabled, we can get log, and when logging
-	 * disabled, we can get log disabled accordingly.
+	/* In async cases, when write logging is enabled, buffers already
+	 * submitted to the async queue may not have picked up log info
+	 * before logging was turned on, so we recompute the log info
+	 * when needed. We do this in handle_async_rx_events_notify().
 	 */
 
-	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
-		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
 	handle_async_rx_events_notify(net, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 97233d5..53dab80 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -715,66 +715,21 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 			   struct iovec iov[], unsigned int iov_size,
 			   unsigned int *out_num, unsigned int *in_num,
-			   struct vhost_log *log, unsigned int *log_num)
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head)
 {
 	struct vring_desc desc;
-	unsigned int i, head, found = 0;
-	u16 last_avail_idx;
+	unsigned int i = head, found = 0;
 	int ret;
 
-	/* Check it isn't doing very strange things with descriptor numbers. */
-	last_avail_idx = vq->last_avail_idx;
-	if (get_user(vq->avail_idx, &vq->avail->idx)) {
-		vq_err(vq, "Failed to access avail idx at %p\n",
-		       &vq->avail->idx);
-		return vq->num;
-	}
-
-	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
-		vq_err(vq, "Guest moved used index from %u to %u",
-		       last_avail_idx, vq->avail_idx);
-		return vq->num;
-	}
-
-	/* If there's nothing new since last we looked, return invalid. */
-	if (vq->avail_idx == last_avail_idx)
-		return vq->num;
-
-	/* Only get avail ring entries after they have been exposed by guest. */
-	rmb();
-
-	/* Grab the next descriptor number they're advertising, and increment
-	 * the index we've seen. */
-	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
-		vq_err(vq, "Failed to read head: idx %d address %p\n",
-		       last_avail_idx,
-		       &vq->avail->ring[last_avail_idx % vq->num]);
-		return vq->num;
-	}
-
-	/* If their number is silly, that's an error. */
-	if (head >= vq->num) {
-		vq_err(vq, "Guest says index %u > %u is available",
-		       head, vq->num);
-		return vq->num;
-	}
-
 	/* When we start there are none of either input nor output. */
 	*out_num = *in_num = 0;
 	if (unlikely(log))
 		*log_num = 0;
 
-	i = head;
 	do {
 		unsigned iov_count = *in_num + *out_num;
 		if (i >= vq->num) {
@@ -833,8 +788,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 			*out_num += ret;
 		}
 	} while ((i = next_desc(&desc)) != -1);
+	return head;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[], unsigned int iov_size,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num)
+{
+	struct vring_desc desc;
+	unsigned int i, head, found = 0;
+	u16 last_avail_idx;
+	unsigned int ret;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vq->last_avail_idx;
+	if (get_user(vq->avail_idx, &vq->avail->idx)) {
+		vq_err(vq, "Failed to access avail idx at %p\n",
+		       &vq->avail->idx);
+		return vq->num;
+	}
+
+	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+		vq_err(vq, "Guest moved used index from %u to %u",
+		       last_avail_idx, vq->avail_idx);
+		return vq->num;
+	}
+
+	/* If there's nothing new since last we looked, return invalid. */
+	if (vq->avail_idx == last_avail_idx)
+		return vq->num;
+
+	/* Only get avail ring entries after they have been exposed by guest. */
+	rmb();
+
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+		vq_err(vq, "Failed to read head: idx %d address %p\n",
+		       last_avail_idx,
+		       &vq->avail->ring[last_avail_idx % vq->num]);
+		return vq->num;
+	}
+
+	/* If their number is silly, that's an error. */
+	if (head >= vq->num) {
+		vq_err(vq, "Guest says index %u > %u is available",
+		       head, vq->num);
+		return vq->num;
+	}
+
+	ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
+				  out_num, in_num,
+				  log, log_num, head);
 
 	/* On success, increment avail index. */
+	if (ret == vq->num)
+		return ret;
 	vq->last_avail_idx++;
 	return head;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index cffe39a..a74a6d4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -132,6 +132,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
 			   struct iovec iov[], unsigned int iov_count,
 			   unsigned int *out_num, unsigned int *in_num,
 			   struct vhost_log *log, unsigned int *log_num);
+unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			   struct iovec iov[], unsigned int iov_count,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head);
 void vhost_discard_vq_desc(struct vhost_virtqueue *);
 
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-- 
1.5.4.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2010-04-08  9:05 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-06  9:38 [PATCH v1 0/3] Provide a zero-copy method on KVM virtio-net xiaohui.xin
2010-03-06  9:38 ` [PATCH v1 1/3] A device for zero-copy based " xiaohui.xin
2010-03-06  9:38   ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications xiaohui.xin
2010-03-06  9:38     ` [PATCH v1 3/3] Let host NIC driver to DMA to guest user space xiaohui.xin
2010-03-06 17:18       ` Stephen Hemminger
2010-03-08 11:18       ` Michael S. Tsirkin
2010-03-07 11:18     ` [PATCH v1 2/3] Provides multiple submits and asynchronous notifications Michael S. Tsirkin
2010-03-15  8:46       ` Xin, Xiaohui
2010-03-15  9:23         ` Michael S. Tsirkin
2010-03-16  9:32           ` Xin Xiaohui
2010-03-16 11:33             ` [PATCH " Michael S. Tsirkin
2010-03-17  9:48               ` Xin, Xiaohui
2010-03-17 10:27                 ` Michael S. Tsirkin
2010-04-01  9:14                   ` Xin Xiaohui
2010-04-01 11:02                     ` [PATCH " Michael S. Tsirkin
2010-04-02  2:16                       ` Xin, Xiaohui
2010-04-04 11:40                         ` Michael S. Tsirkin
2010-04-06  5:46                           ` Xin, Xiaohui
2010-04-06  7:51                             ` Michael S. Tsirkin
2010-04-07  1:36                               ` Xin, Xiaohui
2010-04-07  8:18                                 ` Michael S. Tsirkin
2010-04-08  9:07                                   ` xiaohui.xin
2010-03-08 11:28   ` [PATCH v1 1/3] A device for zero-copy based on KVM virtio-net Michael S. Tsirkin
2010-04-01  9:27     ` Xin Xiaohui
2010-04-01 11:08       ` [PATCH " Michael S. Tsirkin
2010-04-06  5:41         ` Xin, Xiaohui
2010-04-06  7:49           ` Michael S. Tsirkin
2010-04-07  2:41         ` Xin, Xiaohui
2010-04-07  8:15           ` Michael S. Tsirkin
2010-04-07  9:00             ` xiaohui.xin
2010-04-07 11:17               ` [PATCH " Michael S. Tsirkin
2010-03-07 10:50 ` [PATCH v1 0/3] Provide a zero-copy method " Michael S. Tsirkin
2010-03-09  7:47   ` Xin, Xiaohui
