All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
@ 2010-08-06  9:23 ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

We provide an zero-copy method which driver side may get external
buffers to DMA. Here external means driver don't use kernel space
to allocate skb buffers. Currently the external buffer can be from
guest virtio-net driver.

The idea is simple, just to pin the guest VM user space and then
let host NIC driver has the chance to directly DMA to it. 
The patches are based on vhost-net backend driver. We add a device
which provides proto_ops as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. KVM guest who use the
vhost-net backend may bind any ethX interface in the host side to
get copyless data transfer thru guest virtio-net frontend.

patch 01-10:  	net core and kernel changes.
patch 11-13:  	new device as interface to mantpulate external buffers.
patch 14: 	for vhost-net.
patch 15:	An example on modifying NIC driver to using napi_gro_frags().
patch 16:	An example how to get guest buffers based on driver
		who using napi_gro_frags().

The guest virtio-net driver submits multiple requests thru vhost-net
backend driver to the kernel. And the requests are queued and then
completed after corresponding actions in h/w are done.

For read, user space buffers are dispensed to NIC driver for rx when
a page constructor API is invoked. Means NICs can allocate user buffers
from a page constructor. We add a hook in netif_receive_skb() function
to intercept the incoming packets, and notify the zero-copy device.

For write, the zero-copy deivce may allocates a new host skb and puts
payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notifiicaton to 
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchroush mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it looks generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
	a small fix for missing dev_put when fail
	using dynamic minor instead of static minor number
	a __KERNEL__ protect to mp_get_sock()

what we have done in v2:
	
	remove most of the RCU usage, since the ctor pointer is only
	changed by BIND/UNBIND ioctl, and during that time, NIC will be
	stopped to get good cleanup(all outstanding requests are finished),
	so the ctor pointer cannot be raced into wrong situation.

	Remove the struct vhost_notifier with struct kiocb.
	Let vhost-net backend to alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() when read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten 
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent Dos
	by using RLIMIT_MEMLOCK
	

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.
	

what we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
	move create_iocb() out of page_dtor which may happen in interrupt context
	-This remove the potential issues which lock called in interrupt context
	make the cache used by mp, vhost as static, and created/destoryed during
	modules init/exit functions.
	-This makes multiple mp guest created at the same time.

what we have done in v7:
	some cleanup prepared to suppprt PS mode

what we have done in v8:
	discarding the modifications to point skb->data to guest buffer directly.
	Add code to modify driver to support napi_gro_frags() with Herbert's comments.
	To support PS mode.
	Add mergeable buffer support in mp device.
	Add GSO/GRO support in mp deice.
	Address comments from Eric Dumazet about cache line and rcu usage.

what we have done in v9:
	v8 patch is based on a fix in dev_gro_receive().
	But Herbert did not agree with the fix we have sent out.
	And he suggest another fix. v9 is modified to base on that fix.
		

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
@ 2010-08-06  9:23 ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

We provide an zero-copy method which driver side may get external
buffers to DMA. Here external means driver don't use kernel space
to allocate skb buffers. Currently the external buffer can be from
guest virtio-net driver.

The idea is simple, just to pin the guest VM user space and then
let host NIC driver has the chance to directly DMA to it. 
The patches are based on vhost-net backend driver. We add a device
which provides proto_ops as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. KVM guest who use the
vhost-net backend may bind any ethX interface in the host side to
get copyless data transfer thru guest virtio-net frontend.

patch 01-10:  	net core and kernel changes.
patch 11-13:  	new device as interface to mantpulate external buffers.
patch 14: 	for vhost-net.
patch 15:	An example on modifying NIC driver to using napi_gro_frags().
patch 16:	An example how to get guest buffers based on driver
		who using napi_gro_frags().

The guest virtio-net driver submits multiple requests thru vhost-net
backend driver to the kernel. And the requests are queued and then
completed after corresponding actions in h/w are done.

For read, user space buffers are dispensed to NIC driver for rx when
a page constructor API is invoked. Means NICs can allocate user buffers
from a page constructor. We add a hook in netif_receive_skb() function
to intercept the incoming packets, and notify the zero-copy device.

For write, the zero-copy deivce may allocates a new host skb and puts
payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notifiicaton to 
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchroush mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it looks generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
	a small fix for missing dev_put when fail
	using dynamic minor instead of static minor number
	a __KERNEL__ protect to mp_get_sock()

what we have done in v2:
	
	remove most of the RCU usage, since the ctor pointer is only
	changed by BIND/UNBIND ioctl, and during that time, NIC will be
	stopped to get good cleanup(all outstanding requests are finished),
	so the ctor pointer cannot be raced into wrong situation.

	Remove the struct vhost_notifier with struct kiocb.
	Let vhost-net backend to alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() when read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten 
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent Dos
	by using RLIMIT_MEMLOCK
	

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.
	

what we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
	move create_iocb() out of page_dtor which may happen in interrupt context
	-This remove the potential issues which lock called in interrupt context
	make the cache used by mp, vhost as static, and created/destoryed during
	modules init/exit functions.
	-This makes multiple mp guest created at the same time.

what we have done in v7:
	some cleanup prepared to suppprt PS mode

what we have done in v8:
	discarding the modifications to point skb->data to guest buffer directly.
	Add code to modify driver to support napi_gro_frags() with Herbert's comments.
	To support PS mode.
	Add mergeable buffer support in mp device.
	Add GSO/GRO support in mp deice.
	Address comments from Eric Dumazet about cache line and rcu usage.

what we have done in v9:
	v8 patch is based on a fix in dev_gro_receive().
	But Herbert did not agree with the fix we have sent out.
	And he suggest another fix. v9 is modified to base on that fix.
		

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 01/16] Add a new structure for skb buffer from external.
  2010-08-06  9:23 ` xiaohui.xin
@ 2010-08-06  9:23   ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for a skb which pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct		page *page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 01/16] Add a new structure for skb buffer from external.
@ 2010-08-06  9:23   ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for a skb which pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct		page *page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 02/16] Add a new struct for device to manipulate external buffer.
  2010-08-06  9:23   ` xiaohui.xin
@ 2010-08-06  9:23     ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..ba582e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,25 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure in structure net_device, the new field is
+ * named as mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability for the net device driver,
+ * a socket, and an external buffer creator, external means
+ * skb buffer belongs to the device may not be allocated from
+ * kernel space.
+ */
+struct mpassthru_port	{
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	int		vnet_hlen;
+	struct skb_ext_page *(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+	struct skb_ext_page *(*hash)(struct net_device *,
+				struct page *);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +971,8 @@ struct net_device {
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 02/16] Add a new struct for device to manipulate external buffer.
@ 2010-08-06  9:23     ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..ba582e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,25 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure in structure net_device, the new field is
+ * named as mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability for the net device driver,
+ * a socket, and an external buffer creator, external means
+ * skb buffer belongs to the device may not be allocated from
+ * kernel space.
+ */
+struct mpassthru_port	{
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	int		vnet_hlen;
+	struct skb_ext_page *(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+	struct skb_ext_page *(*hash)(struct net_device *,
+				struct page *);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +971,8 @@ struct net_device {
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 03/16] Add a ndo_mp_port_prep func to net_device_ops.
  2010-08-06  9:23     ` xiaohui.xin
@ 2010-08-06  9:23       ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

If the driver want to allocate external buffers,
then it can export it's capability, as the skb
buffer header length, the page length can be DMA, etc.
The external buffers owner may utilize this.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ba582e1..aba0308 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -710,6 +710,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 03/16] Add a ndo_mp_port_prep func to net_device_ops.
@ 2010-08-06  9:23       ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

If the driver want to allocate external buffers,
then it can export it's capability, as the skb
buffer header length, the page length can be DMA, etc.
The external buffers owner may utilize this.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ba582e1..aba0308 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -710,6 +710,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
-- 
1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 04/16] Add a function make external buffer owner to query capability.
  2010-08-06  9:23       ` xiaohui.xin
@ 2010-08-06  9:23         ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    The external buffer owner can use the functions to get
    the capability of the underlying NIC driver.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhaonew@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>

---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   49 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aba0308..5f192de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+				struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..636f11b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* To support meidate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+		struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 04/16] Add a function make external buffer owner to query capability.
@ 2010-08-06  9:23         ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    The external buffer owner can use the functions to get
    the capability of the underlying NIC driver.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhaonew@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>

---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   49 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aba0308..5f192de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+				struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..636f11b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* To support meidate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+		struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 05/16] Add a function to indicate if device use external buffer.
  2010-08-06  9:23         ` xiaohui.xin
@ 2010-08-06  9:23           ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5f192de..23d6ec0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 05/16] Add a function to indicate if device use external buffer.
@ 2010-08-06  9:23           ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5f192de..23d6ec0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 06/16] Use callback to deal with skb_release_data() specially.
  2010-08-06  9:23           ` xiaohui.xin
@ 2010-08-06  9:23             ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    If buffer is external, then use the callback to destruct
    buffers.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    3 ++-
 net/core/skbuff.c      |    8 ++++++++
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 74af06c..ab29675 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,10 +197,11 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
 /* The structure is for a skb which pages may point to
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..117d82b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -217,6 +217,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->tx_flags.flags = 0;
+	shinfo->destructor_arg = NULL;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
@@ -350,6 +351,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 06/16] Use callback to deal with skb_release_data() specially.
@ 2010-08-06  9:23             ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    If buffer is external, then use the callback to destruct
    buffers.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    3 ++-
 net/core/skbuff.c      |    8 ++++++++
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 74af06c..ab29675 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,10 +197,11 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
 /* The structure is for a skb which pages may point to
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..117d82b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -217,6 +217,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->tx_flags.flags = 0;
+	shinfo->destructor_arg = NULL;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
@@ -350,6 +351,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 07/16] Modify netdev_alloc_page() to get external buffer
  2010-08-06  9:23             ` xiaohui.xin
@ 2010-08-06  9:23               ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Currently, it can get external buffers from mp device.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..1a61e2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mpassthru_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 07/16] Modify netdev_alloc_page() to get external buffer
@ 2010-08-06  9:23               ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Currently, it can get external buffers from mp device.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..1a61e2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mpassthru_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 08/16] Modify netdev_free_page() to release external buffer
  2010-08-06  9:23               ` xiaohui.xin
@ 2010-08-06  9:23                 ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ab29675..3d7f70e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1512,9 +1512,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1a61e2b..bbf4707 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -306,6 +306,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 08/16] Modify netdev_free_page() to release external buffer
@ 2010-08-06  9:23                 ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ab29675..3d7f70e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1512,9 +1512,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1a61e2b..bbf4707 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -306,6 +306,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 09/16] Don't do skb recycle, if device use external buffer.
  2010-08-06  9:23                 ` xiaohui.xin
@ 2010-08-06  9:23                   ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbf4707..9b156bb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return 0;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 09/16] Don't do skb recycle, if device use external buffer.
@ 2010-08-06  9:23                   ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbf4707..9b156bb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return 0;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 10/16] Add a hook to intercept external buffers from NIC driver.
  2010-08-06  9:23                   ` xiaohui.xin
@ 2010-08-06  9:23                     ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The hook is called in netif_receive_skb().
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/dev.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 636f11b..4b379b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mpassthru_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)     (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 10/16] Add a hook to intercept external buffers from NIC driver.
@ 2010-08-06  9:23                     ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The hook is called in netif_receive_skb().
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/dev.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 636f11b..4b379b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mpassthru_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)     (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 11/16] Add header file for mp device.
  2010-08-06  9:23                     ` xiaohui.xin
@ 2010-08-06  9:23                       ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/mpassthru.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 11/16] Add header file for mp device.
@ 2010-08-06  9:23                       ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/mpassthru.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 13/16] Add a kconfig entry and make entry for mp device.
  2010-08-06  9:23                       ` xiaohui.xin
@ 2010-08-06  9:23                         ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support, we call it as mediate passthru to
+	  be distiguish with hardare passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 13/16] Add a kconfig entry and make entry for mp device.
@ 2010-08-06  9:23                         ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support, we call it as mediate passthru to
+	  be distiguish with hardare passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-08-06  9:23                         ` xiaohui.xin
@ 2010-08-06  9:23                           ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The patch add mp(mediate passthru) device, which now
based on vhost-net backend driver and provides proto_ops
to send/receive guest buffers data from/to guest vitio-net
driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c | 1419 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1419 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..a5c1456
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1419 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+#define	HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head        list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	int                     header;
+	/* indicate the actual length of bytes
+	 * send/recv in the external buffers
+	 */
+	int                     total;
+	int                     offset;
+	struct page             *pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+	struct sk_buff          *skb;
+	struct page_ctor        *ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a external allocated skb or kernel
+	 */
+	struct skb_ext_page    ext_page;
+	struct skb_shared_info  ushinfo;
+
+#define INFO_READ                      0
+#define INFO_WRITE                     1
+	unsigned                flags;
+	unsigned                pnum;
+
+	/* It's meaningful for receive, means
+	 * the max length allowed
+	 */
+	size_t                  len;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+
+	struct kiocb            *iocb;
+	unsigned int            desc_pos;
+	struct iovec            hdr[MAX_SKB_FRAGS + 2];
+	struct iovec            iov[MAX_SKB_FRAGS + 2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+	struct list_head        readq;
+	int			wq_len;
+	int			rq_len;
+	spinlock_t		read_lock;
+	/* record the locked pages */
+	int			lock_pages;
+	struct rlimit		o_rlim;
+	struct net_device	*dev;
+	struct mpassthru_port	port;
+	struct page_info	**hash_table;
+};
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device       *dev;
+	struct page_ctor	*ctor;
+	struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+	int ret = 0;
+
+	rtnl_lock();
+	ret = dev_change_flags(dev, flags);
+	rtnl_unlock();
+
+	if (ret < 0)
+		printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+	return ret;
+}
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
+		struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+		ctor->rq_len--;
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_ctor *ctor,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_ctor *ctor,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = mp->ctor;
+	struct page_info *info;
+
+	info = mp_hash_lookup(ctor, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	/* locked by mp_mutex */
+	if (mp->ctor)
+		return -EBUSY;
+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_mp_port_prep(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+	ctor->hash_table = kzalloc(sizeof(struct page_info *) * HASH_BUCKETS,
+			GFP_KERNEL);
+	if (!ctor->hash_table)
+		goto fail_hash;
+
+	ctor->rq_len = 0;
+	ctor->wq_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->port.ctor = page_ctor;
+	ctor->port.sock = &mp->socket;
+	ctor->port.hash = mp_lookup;
+	ctor->lock_pages = 0;
+
+	/* locked by mp_mutex */
+	dev->mp_port = &ctor->port;
+	mp->ctor = ctor;
+
+	return 0;
+
+fail_hash:
+	kfree(ctor->hash_table);
+
+fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq,
+				struct page_info, list);
+		list_del(&info->list);
+		ctor->rq_len--;
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	return info;
+}
+
+static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
+			      unsigned long cur, unsigned long max)
+{
+	struct rlimit new_rlim, *old_rlim;
+	int retval;
+
+	if (resource != RLIMIT_MEMLOCK)
+		return -EINVAL;
+	new_rlim.rlim_cur = cur;
+	new_rlim.rlim_max = max;
+
+	old_rlim = current->signal->rlim + resource;
+
+	/* remember the old rlimit value when backend enabled */
+	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
+	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
+
+	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+			!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+
+	retval = security_task_setrlimit(resource, &new_rlim);
+	if (retval)
+		return retval;
+
+	task_lock(current->group_leader);
+	*old_rlim = new_rlim;
+	task_unlock(current->group_leader);
+	return 0;
+}
+
+static void relinquish_resource(struct page_ctor *ctor)
+{
+	if (!(ctor->dev->flags & IFF_UP) &&
+			!(ctor->wq_len + ctor->rq_len))
+		printk(KERN_INFO "relinquish_resource\n");
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		mp_hash_delete(info->ctor, info);
+		if (info->skb) {
+			info->skb->destructor = NULL;
+			kfree_skb(info->skb);
+		}
+		info->ctor->rq_len--;
+	} else
+		info->ctor->wq_len--;
+	/* Decrement the number of locked pages */
+	info->ctor->lock_pages -= info->pnum;
+	kmem_cache_free(ext_page_info_cache, info);
+	relinquish_resource(info->ctor);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	int i;
+
+	/* locked by mp_mutex */
+	ctor = mp->ctor;
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		ctor->rq_len--;
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+
+	relinquish_resource(ctor);
+
+	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+			   ctor->o_rlim.rlim_cur,
+			   ctor->o_rlim.rlim_max);
+
+	/* locked by mp_mutex */
+	ctor->dev->mp_port = NULL;
+	dev_put(ctor->dev);
+
+	mp->ctor = NULL;
+	kfree(ctor->hash_table);
+	kfree(ctor);
+	return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	mp->mfile = NULL;
+
+	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	page_ctor_detach(mp);
+	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+	/* Drop the extra count on the net device */
+	dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count))
+		mp_detach(mfile->mp);
+}
+
+static void iocb_tag(struct kiocb *iocb)
+{
+	iocb->ki_flags = 1;
+}
+
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_ext_page *ext_page)
+{
+	struct page_info *info;
+	struct page_ctor *ctor;
+	struct sock *sk;
+	struct sk_buff *skb;
+
+	if (!ext_page)
+		return;
+	info = container_of(ext_page, struct page_info, ext_page);
+	if (!info)
+		return;
+	ctor = info->ctor;
+	skb = info->skb;
+
+	if (info->flags == INFO_READ) {
+		create_iocb(info, 0);
+		return;
+	}
+
+	/* For transmit, we should wait for the DMA finish by hardware.
+	 * Queue the notifier to wake up the backend driver
+	 */
+
+	iocb_tag(info->iocb);
+	sk = ctor->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+/* For small exteranl buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, int total)
+{
+	struct page_info *info =
+		kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->total = total;
+	info->ext_page.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	return info;
+}
+
+typedef u32 key_mp_t;
+static inline key_mp_t mp_hash(struct page *page, int buckets)
+{
+	key_mp_t k;
+
+	k = ((((unsigned long)page << 32UL) >> 32UL) / 0x38) % buckets ;
+	return k;
+}
+
+static void mp_hash_insert(struct page_ctor *ctor,
+		struct page *page, struct page_info *page_info)
+{
+	struct page_info *tmp;
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	if (!ctor->hash_table[key]) {
+		ctor->hash_table[key] = page_info;
+		return;
+	}
+
+	tmp = ctor->hash_table[key];
+	while (tmp->next)
+		tmp = tmp->next;
+
+	tmp->next = page_info;
+	page_info->prev = tmp;
+	return;
+}
+
+static struct page_info *mp_hash_delete(struct page_ctor *ctor,
+					struct page_info *info)
+{
+	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+	int i;
+
+	tmp = ctor->hash_table[key];
+	while (tmp) {
+		if (tmp == info) {
+			if (!tmp->prev) {
+				ctor->hash_table[key] = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = NULL;
+			} else {
+				tmp->prev->next = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = tmp->prev;
+			}
+			return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+static struct page_info *mp_hash_lookup(struct page_ctor *ctor,
+					struct page *page)
+{
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	int i;
+	tmp = ctor->hash_table[key];
+	while (tmp) {
+		for (i = 0; i < tmp->pnum; i++) {
+			if (tmp->pages[i] == page)
+				return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the external buffer address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base, lock_limit;
+	struct page_info *info = NULL;
+
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;
+
+	if (ctor->lock_pages + count > lock_limit && npages) {
+		printk(KERN_INFO "exceed the locked memory rlimit.");
+		return NULL;
+	}
+
+	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+				&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->total = total;
+	info->ext_page.dtor = page_dtor;
+	info->ext_page.page = info->pages[0];
+	info->ctor = ctor;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	if (info->flags == INFO_READ) {
+		if (frags[0].offset == 0 && iocb->ki_iovec[0].iov_len) {
+			frags[0].offset = iocb->ki_iovec[0].iov_len;
+			ctor->port.vnet_hlen = iocb->ki_iovec[0].iov_len;
+		}
+		for (i = 0; i < j; i++)
+			mp_hash_insert(ctor, info->pages[i], info);
+	}
+	/* increment the number of locked pages */
+	ctor->lock_pages += j;
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return NULL;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	int len;
+
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = mp->ctor;
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		struct page *page;
+		int off;
+		int size = 0, i = 0;
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct skb_ext_page *ext_page =
+			(struct skb_ext_page *)(shinfo->destructor_arg); 
+		int free = 0;
+
+		if (skb->ip_summed == CHECKSUM_COMPLETE)
+			DBG(KERN_INFO "csum %d\n", skb->ip_summed);
+
+		if (shinfo->frags[0].page == ext_page->page) {
+			info = container_of(ext_page, struct page_info, ext_page);
+			if (shinfo->nr_frags)
+				hdr.num_buffers = shinfo->nr_frags;
+			else
+				hdr.num_buffers = shinfo->nr_frags + 1;
+		} else {
+			info = container_of(ext_page, struct page_info, ext_page);
+			hdr.num_buffers = shinfo->nr_frags + 1;
+		}
+		skb_push(skb, ETH_HLEN);
+
+		if (skb_is_gso(skb)) {
+			hdr.hdr.hdr_len = skb_headlen(skb);
+			hdr.hdr.gso_size = shinfo->gso_size;
+			if (shinfo->gso_type & SKB_GSO_TCPV4)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+			else if (shinfo->gso_type & SKB_GSO_TCPV6)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+			else if (shinfo->gso_type & SKB_GSO_UDP)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+			else
+				BUG();
+			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+		} else
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			hdr.hdr.csum_start =
+				skb->csum_start - skb_headroom(skb);
+			hdr.hdr.csum_offset = skb->csum_offset;
+		}
+
+		info->skb = NULL;
+		off = info->hdr[0].iov_len;
+		len = memcpy_toiovec(info->iov, (unsigned char *)&hdr, off);
+		if (len) {
+			DBG(KERN_INFO
+				"Unable to write vnet_hdr at addr %p: %d\n",
+				info->iov_base, len);
+			goto clean;
+		}
+
+		memcpy_toiovec(info->iov, skb->data, skb_headlen(skb));
+
+		info->iocb->ki_left = hdr.num_buffers;
+		if (shinfo->frags[0].page == ext_page->page) {
+			size = shinfo->frags[0].size +
+				shinfo->frags[0].page_offset - off;
+			i = 1;
+		} else {
+			size = skb_headlen(skb);
+			i = 0;
+		}
+		create_iocb(info, off + size);
+		for (i = i; i < shinfo->nr_frags; i++) {
+			page = shinfo->frags[i].page;
+			info = mp_hash_lookup(ctor, shinfo->frags[i].page);
+			info->skb = NULL;
+			create_iocb(info, shinfo->frags[i].size);
+		}
+		info->skb = skb;
+		shinfo->nr_frags = 0;
+		shinfo->destructor_arg = NULL;
+		continue;
+clean:
+		kfree_skb(skb);
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	return;
+}
+
+static inline struct sk_buff *mp_alloc_skb(struct sock *sk, size_t prepad,
+					   size_t len, size_t linear,
+					   int noblock, int *err)
+{
+	struct sk_buff *skb;
+
+	/* Under a page?  Don't bother with paged skb. */
+	if (prepad + len < PAGE_SIZE || !linear)
+		linear = len;
+
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+			err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, prepad);
+	skb_put(skb, linear);
+	skb->data_len = len - linear;
+	skb->len += len - linear;
+
+	return skb;
+}
+
+static int mp_skb_from_vnet_hdr(struct sk_buff *skb,
+		struct virtio_net_hdr *vnet_hdr)
+{
+	unsigned short gso_type = 0;
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			gso_type = SKB_GSO_TCPV6;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
+			gso_type |= SKB_GSO_TCP_ECN;
+
+		if (vnet_hdr->gso_size == 0)
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
+					vnet_hdr->csum_offset))
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
+		skb_shinfo(skb)->gso_type = gso_type;
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+	return 0;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct virtio_net_hdr vnet_hdr = {0};
+	int hdr_len = 0;
+	struct page_ctor *ctor;
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	ctor = mp->ctor;
+	if (!ctor)
+		return -ENODEV;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	hdr_len = sizeof(vnet_hdr);
+	if ((total - iocb->ki_iovec[0].iov_len) < 0)
+		return -EINVAL;
+
+	rc = memcpy_fromiovecend((void *)&vnet_hdr, iocb->ki_iovec, 0, hdr_len);
+	if (rc < 0)
+		return -EINVAL;
+
+	if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+			vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
+			vnet_hdr.hdr_len)
+		vnet_hdr.hdr_len = vnet_hdr.csum_start +
+			vnet_hdr.csum_offset + 2;
+
+	if (vnet_hdr.hdr_len > total)
+		return -EINVAL;
+
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = mp_alloc_skb(sock->sk, NET_IP_ALIGN, header,
+			   iocb->ki_iovec[0].iov_len, 1, &rc);
+
+	if (!skb)
+		goto drop;
+
+	skb_set_network_header(skb, ETH_HLEN);
+	memcpy_fromiovec(skb->data, iov, header);
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	rc = mp_skb_from_vnet_hdr(skb, &vnet_hdr);
+	if (rc)
+		goto drop;
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(ctor, iocb, total);
+	} else {
+		info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; info->pages[i]; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->total = total;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->ext_page;
+		skb->dev = mp->dev;
+		ctor->wq_len++;
+		create_iocb(info, info->total);
+		dev_queue_xmit(skb);
+		if (!ctor->rq_len)
+			sock->sk->sk_state_change(sock->sk);
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	ctor = mp->ctor;
+	if (!ctor)
+		return -EINVAL;
+
+	/* Error detections in case invalid external buffer */
+	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = ctor->port.npages;
+	payload = ctor->port.data_len;
+
+	/* If KVM guest virtio-net FE driver use SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet head */
+	if (count > 1) {
+		iov++;
+		count--;
+	}
+
+	if (!ctor->lock_pages || !ctor->rq_len) {
+		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+				iocb->ki_user_data * 4096 * 2,
+				iocb->ki_user_data * 4096 * 2);
+	}
+
+	/* Translate address to kernel */
+	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->len = total_len;
+	info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+	info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+	iocb->ki_iovec[0].iov_len = 0;
+	iocb->ki_left = 0;
+	info->offset = frags[0].offset;
+	info->desc_pos = iocb->ki_pos;
+
+	if (count > 1) {
+		iov--;
+		count++;
+	}
+
+	memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&ctor->read_lock, flag);
+	list_add_tail(&info->list, &ctor->readq);
+	spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+	ctor->rq_len++;
+
+	return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+	DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	dev_hold(mp->dev);
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (!mp)
+		return -EINVAL;
+
+	mp_detach(mp);
+	sock_put(mp->socket.sk);
+	mp_put(mfile);
+	return 0;
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	void __user* argp = (void __user *)arg;
+	struct ifreq ifr;
+	struct sock *sk;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -ENODEV;
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev)
+			break;
+
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+
+		/* the device can be only bind once */
+		if (dev_is_mpassthru(dev))
+			goto err_dev_put;
+
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		init_waitqueue_head(&mp->socket.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+		sk->sk_state_change = mp_sock_state_change;
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = page_ctor_attach(mp);
+		if (ret < 0)
+			goto err_free_sk;
+		mp_dev_change_flags(mp->dev, mp->dev->flags & (~IFF_UP));
+		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+		sk->sk_state_change(sk);
+out:
+		mutex_unlock(&mp_mutex);
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		kfree(mp);
+err_dev_put:
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		ret = do_unbind(mfile);
+		break;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->socket.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result = 0;
+
+	if (!mp)
+		return -EBADFD;
+
+	/* currently, async is not supported.
+	 * but we may support real async aio from user application,
+	 * maybe qemu virtio-net backend.
+	 */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+
+	if (unlikely(len) < ETH_HLEN)
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -EFAULT;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+
+	mp_put(file->private_data);
+	return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	do_unbind(mfile);
+
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
+				unsigned long arg)
+{
+	return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = mp_chr_compat_ioctl,
+#endif
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mpassthru_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+	struct sock *sk;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		sock = dev->mp_port->sock;
+		mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+		do_unbind(mp->mfile);
+		break;
+	case NETDEV_CHANGE:
+		sk = dev->mp_port->sock->sk;
+		sk->sk_state_change(sk);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int err = 0;
+
+	ext_page_info_cache = kmem_cache_create("skb_page_info",
+						sizeof(struct page_info),
+						0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!ext_page_info_cache)
+		return -ENOMEM;
+
+	err = misc_register(&mp_miscdev);
+	if (err) {
+		printk(KERN_ERR "mp: Can't register misc device\n");
+		kmem_cache_destroy(ext_page_info_cache);
+	} else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+				mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return err;
+}
+
+void mp_exit(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+	kmem_cache_destroy(ext_page_info_cache);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_exit);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
@ 2010-08-06  9:23                           ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The patch add mp(mediate passthru) device, which now
based on vhost-net backend driver and provides proto_ops
to send/receive guest buffers data from/to guest vitio-net
driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c | 1419 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1419 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..a5c1456
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1419 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+#define	HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head        list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	int                     header;
+	/* indicate the actual length of bytes
+	 * send/recv in the external buffers
+	 */
+	int                     total;
+	int                     offset;
+	struct page             *pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+	struct sk_buff          *skb;
+	struct page_ctor        *ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a external allocated skb or kernel
+	 */
+	struct skb_ext_page    ext_page;
+	struct skb_shared_info  ushinfo;
+
+#define INFO_READ                      0
+#define INFO_WRITE                     1
+	unsigned                flags;
+	unsigned                pnum;
+
+	/* It's meaningful for receive, means
+	 * the max length allowed
+	 */
+	size_t                  len;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+
+	struct kiocb            *iocb;
+	unsigned int            desc_pos;
+	struct iovec            hdr[MAX_SKB_FRAGS + 2];
+	struct iovec            iov[MAX_SKB_FRAGS + 2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+	struct list_head        readq;
+	int			wq_len;
+	int			rq_len;
+	spinlock_t		read_lock;
+	/* record the locked pages */
+	int			lock_pages;
+	struct rlimit		o_rlim;
+	struct net_device	*dev;
+	struct mpassthru_port	port;
+	struct page_info	**hash_table;
+};
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device       *dev;
+	struct page_ctor	*ctor;
+	struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+	int ret = 0;
+
+	rtnl_lock();
+	ret = dev_change_flags(dev, flags);
+	rtnl_unlock();
+
+	if (ret < 0)
+		printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+	return ret;
+}
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
+		struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+		ctor->rq_len--;
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_ctor *ctor,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_ctor *ctor,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = mp->ctor;
+	struct page_info *info;
+
+	info = mp_hash_lookup(ctor, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	/* locked by mp_mutex */
+	if (mp->ctor)
+		return -EBUSY;
+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_mp_port_prep(dev, &ctor->port);
+	if (rc)
+		goto fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+	ctor->hash_table = kzalloc(sizeof(struct page_info *) * HASH_BUCKETS,
+			GFP_KERNEL);
+	if (!ctor->hash_table)
+		goto fail_hash;
+
+	ctor->rq_len = 0;
+	ctor->wq_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->port.ctor = page_ctor;
+	ctor->port.sock = &mp->socket;
+	ctor->port.hash = mp_lookup;
+	ctor->lock_pages = 0;
+
+	/* locked by mp_mutex */
+	dev->mp_port = &ctor->port;
+	mp->ctor = ctor;
+
+	return 0;
+
+fail_hash:
+	kfree(ctor->hash_table);
+
+fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq,
+				struct page_info, list);
+		list_del(&info->list);
+		ctor->rq_len--;
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	return info;
+}
+
+static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
+			      unsigned long cur, unsigned long max)
+{
+	struct rlimit new_rlim, *old_rlim;
+	int retval;
+
+	if (resource != RLIMIT_MEMLOCK)
+		return -EINVAL;
+	new_rlim.rlim_cur = cur;
+	new_rlim.rlim_max = max;
+
+	old_rlim = current->signal->rlim + resource;
+
+	/* remember the old rlimit value when backend enabled */
+	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
+	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
+
+	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+			!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+
+	retval = security_task_setrlimit(resource, &new_rlim);
+	if (retval)
+		return retval;
+
+	task_lock(current->group_leader);
+	*old_rlim = new_rlim;
+	task_unlock(current->group_leader);
+	return 0;
+}
+
+static void relinquish_resource(struct page_ctor *ctor)
+{
+	if (!(ctor->dev->flags & IFF_UP) &&
+			!(ctor->wq_len + ctor->rq_len))
+		printk(KERN_INFO "relinquish_resource\n");
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		mp_hash_delete(info->ctor, info);
+		if (info->skb) {
+			info->skb->destructor = NULL;
+			kfree_skb(info->skb);
+		}
+		info->ctor->rq_len--;
+	} else
+		info->ctor->wq_len--;
+	/* Decrement the number of locked pages */
+	info->ctor->lock_pages -= info->pnum;
+	kmem_cache_free(ext_page_info_cache, info);
+	relinquish_resource(info->ctor);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	int i;
+
+	/* locked by mp_mutex */
+	ctor = mp->ctor;
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		ctor->rq_len--;
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+
+	relinquish_resource(ctor);
+
+	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+			   ctor->o_rlim.rlim_cur,
+			   ctor->o_rlim.rlim_max);
+
+	/* locked by mp_mutex */
+	ctor->dev->mp_port = NULL;
+	dev_put(ctor->dev);
+
+	mp->ctor = NULL;
+	kfree(ctor->hash_table);
+	kfree(ctor);
+	return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	mp->mfile = NULL;
+
+	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	page_ctor_detach(mp);
+	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+	/* Drop the extra count on the net device */
+	dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count))
+		mp_detach(mfile->mp);
+}
+
+static void iocb_tag(struct kiocb *iocb)
+{
+	iocb->ki_flags = 1;
+}
+
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_ext_page *ext_page)
+{
+	struct page_info *info;
+	struct page_ctor *ctor;
+	struct sock *sk;
+	struct sk_buff *skb;
+
+	if (!ext_page)
+		return;
+	info = container_of(ext_page, struct page_info, ext_page);
+	if (!info)
+		return;
+	ctor = info->ctor;
+	skb = info->skb;
+
+	if (info->flags == INFO_READ) {
+		create_iocb(info, 0);
+		return;
+	}
+
+	/* For transmit, we should wait for the DMA finish by hardware.
+	 * Queue the notifier to wake up the backend driver
+	 */
+
+	iocb_tag(info->iocb);
+	sk = ctor->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+/* For small exteranl buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, int total)
+{
+	struct page_info *info =
+		kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->total = total;
+	info->ext_page.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	return info;
+}
+
+typedef u32 key_mp_t;
+static inline key_mp_t mp_hash(struct page *page, int buckets)
+{
+	key_mp_t k;
+
+	k = ((((unsigned long)page << 32UL) >> 32UL) / 0x38) % buckets ;
+	return k;
+}
+
+static void mp_hash_insert(struct page_ctor *ctor,
+		struct page *page, struct page_info *page_info)
+{
+	struct page_info *tmp;
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	if (!ctor->hash_table[key]) {
+		ctor->hash_table[key] = page_info;
+		return;
+	}
+
+	tmp = ctor->hash_table[key];
+	while (tmp->next)
+		tmp = tmp->next;
+
+	tmp->next = page_info;
+	page_info->prev = tmp;
+	return;
+}
+
+static struct page_info *mp_hash_delete(struct page_ctor *ctor,
+					struct page_info *info)
+{
+	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+	int i;
+
+	tmp = ctor->hash_table[key];
+	while (tmp) {
+		if (tmp == info) {
+			if (!tmp->prev) {
+				ctor->hash_table[key] = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = NULL;
+			} else {
+				tmp->prev->next = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = tmp->prev;
+			}
+			return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+static struct page_info *mp_hash_lookup(struct page_ctor *ctor,
+					struct page *page)
+{
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	int i;
+	tmp = ctor->hash_table[key];
+	while (tmp) {
+		for (i = 0; i < tmp->pnum; i++) {
+			if (tmp->pages[i] == page)
+				return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the external buffer address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base, lock_limit;
+	struct page_info *info = NULL;
+
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;
+
+	if (ctor->lock_pages + count > lock_limit && npages) {
+		printk(KERN_INFO "exceed the locked memory rlimit.");
+		return NULL;
+	}
+
+	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+				&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->total = total;
+	info->ext_page.dtor = page_dtor;
+	info->ext_page.page = info->pages[0];
+	info->ctor = ctor;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	if (info->flags == INFO_READ) {
+		if (frags[0].offset == 0 && iocb->ki_iovec[0].iov_len) {
+			frags[0].offset = iocb->ki_iovec[0].iov_len;
+			ctor->port.vnet_hlen = iocb->ki_iovec[0].iov_len;
+		}
+		for (i = 0; i < j; i++)
+			mp_hash_insert(ctor, info->pages[i], info);
+	}
+	/* increment the number of locked pages */
+	ctor->lock_pages += j;
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return NULL;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	int len;
+
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = mp->ctor;
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		struct page *page;
+		int off;
+		int size = 0, i = 0;
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct skb_ext_page *ext_page =
+			(struct skb_ext_page *)(shinfo->destructor_arg); 
+		int free = 0;
+
+		if (skb->ip_summed == CHECKSUM_COMPLETE)
+			DBG(KERN_INFO "csum %d\n", skb->ip_summed);
+
+		if (shinfo->frags[0].page == ext_page->page) {
+			info = container_of(ext_page, struct page_info, ext_page);
+			if (shinfo->nr_frags)
+				hdr.num_buffers = shinfo->nr_frags;
+			else
+				hdr.num_buffers = shinfo->nr_frags + 1;
+		} else {
+			info = container_of(ext_page, struct page_info, ext_page);
+			hdr.num_buffers = shinfo->nr_frags + 1;
+		}
+		skb_push(skb, ETH_HLEN);
+
+		if (skb_is_gso(skb)) {
+			hdr.hdr.hdr_len = skb_headlen(skb);
+			hdr.hdr.gso_size = shinfo->gso_size;
+			if (shinfo->gso_type & SKB_GSO_TCPV4)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+			else if (shinfo->gso_type & SKB_GSO_TCPV6)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+			else if (shinfo->gso_type & SKB_GSO_UDP)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+			else
+				BUG();
+			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+		} else
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			hdr.hdr.csum_start =
+				skb->csum_start - skb_headroom(skb);
+			hdr.hdr.csum_offset = skb->csum_offset;
+		}
+
+		info->skb = NULL;
+		off = info->hdr[0].iov_len;
+		len = memcpy_toiovec(info->iov, (unsigned char *)&hdr, off);
+		if (len) {
+			DBG(KERN_INFO
+				"Unable to write vnet_hdr at addr %p: %d\n",
+				info->iov_base, len);
+			goto clean;
+		}
+
+		memcpy_toiovec(info->iov, skb->data, skb_headlen(skb));
+
+		info->iocb->ki_left = hdr.num_buffers;
+		if (shinfo->frags[0].page == ext_page->page) {
+			size = shinfo->frags[0].size +
+				shinfo->frags[0].page_offset - off;
+			i = 1;
+		} else {
+			size = skb_headlen(skb);
+			i = 0;
+		}
+		create_iocb(info, off + size);
+		for (i = i; i < shinfo->nr_frags; i++) {
+			page = shinfo->frags[i].page;
+			info = mp_hash_lookup(ctor, shinfo->frags[i].page);
+			info->skb = NULL;
+			create_iocb(info, shinfo->frags[i].size);
+		}
+		info->skb = skb;
+		shinfo->nr_frags = 0;
+		shinfo->destructor_arg = NULL;
+		continue;
+clean:
+		kfree_skb(skb);
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	return;
+}
+
+static inline struct sk_buff *mp_alloc_skb(struct sock *sk, size_t prepad,
+					   size_t len, size_t linear,
+					   int noblock, int *err)
+{
+	struct sk_buff *skb;
+
+	/* Under a page?  Don't bother with paged skb. */
+	if (prepad + len < PAGE_SIZE || !linear)
+		linear = len;
+
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+			err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, prepad);
+	skb_put(skb, linear);
+	skb->data_len = len - linear;
+	skb->len += len - linear;
+
+	return skb;
+}
+
+static int mp_skb_from_vnet_hdr(struct sk_buff *skb,
+		struct virtio_net_hdr *vnet_hdr)
+{
+	unsigned short gso_type = 0;
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			gso_type = SKB_GSO_TCPV6;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
+			gso_type |= SKB_GSO_TCP_ECN;
+
+		if (vnet_hdr->gso_size == 0)
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
+					vnet_hdr->csum_offset))
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
+		skb_shinfo(skb)->gso_type = gso_type;
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+	return 0;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct virtio_net_hdr vnet_hdr = {0};
+	int hdr_len = 0;
+	struct page_ctor *ctor;
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	ctor = mp->ctor;
+	if (!ctor)
+		return -ENODEV;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	hdr_len = sizeof(vnet_hdr);
+	if ((total - iocb->ki_iovec[0].iov_len) < 0)
+		return -EINVAL;
+
+	rc = memcpy_fromiovecend((void *)&vnet_hdr, iocb->ki_iovec, 0, hdr_len);
+	if (rc < 0)
+		return -EINVAL;
+
+	if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+			vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
+			vnet_hdr.hdr_len)
+		vnet_hdr.hdr_len = vnet_hdr.csum_start +
+			vnet_hdr.csum_offset + 2;
+
+	if (vnet_hdr.hdr_len > total)
+		return -EINVAL;
+
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = mp_alloc_skb(sock->sk, NET_IP_ALIGN, header,
+			   iocb->ki_iovec[0].iov_len, 1, &rc);
+
+	if (!skb)
+		goto drop;
+
+	skb_set_network_header(skb, ETH_HLEN);
+	memcpy_fromiovec(skb->data, iov, header);
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	rc = mp_skb_from_vnet_hdr(skb, &vnet_hdr);
+	if (rc)
+		goto drop;
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(ctor, iocb, total);
+	} else {
+		info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; info->pages[i]; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->total = total;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->ext_page;
+		skb->dev = mp->dev;
+		ctor->wq_len++;
+		create_iocb(info, info->total);
+		dev_queue_xmit(skb);
+		if (!ctor->rq_len)
+			sock->sk->sk_state_change(sock->sk);
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	ctor = mp->ctor;
+	if (!ctor)
+		return -EINVAL;
+
+	/* Error detections in case invalid external buffer */
+	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = ctor->port.npages;
+	payload = ctor->port.data_len;
+
+	/* If KVM guest virtio-net FE driver use SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet head */
+	if (count > 1) {
+		iov++;
+		count--;
+	}
+
+	if (!ctor->lock_pages || !ctor->rq_len) {
+		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+				iocb->ki_user_data * 4096 * 2,
+				iocb->ki_user_data * 4096 * 2);
+	}
+
+	/* Translate address to kernel */
+	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->len = total_len;
+	info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+	info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+	iocb->ki_iovec[0].iov_len = 0;
+	iocb->ki_left = 0;
+	info->offset = frags[0].offset;
+	info->desc_pos = iocb->ki_pos;
+
+	if (count > 1) {
+		iov--;
+		count++;
+	}
+
+	memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&ctor->read_lock, flag);
+	list_add_tail(&info->list, &ctor->readq);
+	spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+	ctor->rq_len++;
+
+	return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+	DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	dev_hold(mp->dev);
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (!mp)
+		return -EINVAL;
+
+	mp_detach(mp);
+	sock_put(mp->socket.sk);
+	mp_put(mfile);
+	return 0;
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	void __user* argp = (void __user *)arg;
+	struct ifreq ifr;
+	struct sock *sk;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -ENODEV;
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev)
+			break;
+
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+
+		/* the device can be only bind once */
+		if (dev_is_mpassthru(dev))
+			goto err_dev_put;
+
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		init_waitqueue_head(&mp->socket.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+		sk->sk_state_change = mp_sock_state_change;
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = page_ctor_attach(mp);
+		if (ret < 0)
+			goto err_free_sk;
+		mp_dev_change_flags(mp->dev, mp->dev->flags & (~IFF_UP));
+		mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+		sk->sk_state_change(sk);
+out:
+		mutex_unlock(&mp_mutex);
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		kfree(mp);
+err_dev_put:
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		ret = do_unbind(mfile);
+		break;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->socket.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result = 0;
+
+	if (!mp)
+		return -EBADFD;
+
+	/* currently, async is not supported.
+	 * but we may support real async aio from user application,
+	 * maybe qemu virtio-net backend.
+	 */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+
+	if (unlikely(len) < ETH_HLEN)
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -EFAULT;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+
+	mp_put(file->private_data);
+	return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	do_unbind(mfile);
+
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
+				unsigned long arg)
+{
+	return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = mp_chr_compat_ioctl,
+#endif
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mpassthru_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+	struct sock *sk;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		sock = dev->mp_port->sock;
+		mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+		do_unbind(mp->mfile);
+		break;
+	case NETDEV_CHANGE:
+		sk = dev->mp_port->sock->sk;
+		sk->sk_state_change(sk);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int err = 0;
+
+	ext_page_info_cache = kmem_cache_create("skb_page_info",
+						sizeof(struct page_info),
+						0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!ext_page_info_cache)
+		return -ENOMEM;
+
+	err = misc_register(&mp_miscdev);
+	if (err) {
+		printk(KERN_ERR "mp: Can't register misc device\n");
+		kmem_cache_destroy(ext_page_info_cache);
+	} else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+				mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return err;
+}
+
+void mp_exit(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+	kmem_cache_destroy(ext_page_info_cache);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_exit);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 14/16] Provides multiple submits and asynchronous notifications.
  2010-08-06  9:23                           ` xiaohui.xin
@ 2010-08-06  9:23                             ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/vhost/net.c   |  348 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |   79 +++++++++++
 drivers/vhost/vhost.h |   15 ++
 3 files changed, 414 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b38abc6..c4bc815 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -39,6 +41,8 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -49,6 +53,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -93,11 +98,190 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+	int count;
+
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_desc(&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head = iocb->ki_pos;
+					rx_total_len += iocb->ki_nbytes;
+
+					if (iocb->ki_dtor)
+						iocb->ki_dtor(iocb);
+					kmem_cache_free(net->cache, iocb);
+
+					if (unlikely(vq_log)) {
+						if (!log)
+							__vhost_get_desc(
+							&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+						vhost_log_write(
+							vq, vq_log, log, size);
+					}
+				} else
+					break;
+
+				i++;
+				iocb == NULL;
+				if (count)
+					iocb = notify_dequeue(vq);
+			}
+			vhost_add_used_and_signal_n(
+					&net->dev, vq, vq->heads, hc);
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct list_head *entry, *tmp;
+	unsigned long flags;
+	int tx_total_len = 0;
+
+	if (!is_async_vq(vq))
+		return;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_for_each_safe(entry, tmp, &vq->notifier) {
+		iocb = list_entry(entry,
+				  struct kiocb, ki_list);
+		if (!iocb->ki_flags)
+			continue;
+		list_del(&iocb->ki_list);
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+				 struct vhost_virtqueue *vq,
+				 unsigned head)
+{
+	struct kiocb *iocb = NULL;
+
+	if (!is_async_vq(vq))
+		return NULL;
+
+	iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+	if (!iocb)
+		return NULL;
+	iocb->private = vq;
+	iocb->ki_pos = head;
+	iocb->ki_dtor = handle_iocb;
+	if (vq == &net->dev.vqs[VHOST_NET_VQ_RX])
+		iocb->ki_user_data = vq->num;
+	iocb->ki_iovec = vq->hdr;
+	return iocb;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -130,6 +314,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	vhost_hlen = vq->vhost_hlen;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_desc(&net->dev, vq, vq->iov,
 				      ARRAY_SIZE(vq->iov),
@@ -138,10 +324,13 @@ static void handle_tx(struct vhost_net *net)
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
-			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
-				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-				break;
+			if (!is_async_vq(vq)) {
+				if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
+					tx_poll_start(net, sock);
+					set_bit(SOCK_ASYNC_NOSPACE,
+						&sock->flags);
+					break;
+				}
 			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
@@ -158,6 +347,13 @@ static void handle_tx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
 		msg.msg_iovlen = out;
 		len = iov_length(vq->iov, out);
+
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, head);
+			if (!iocb)
+				break;
+		}
+
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for TX: "
@@ -166,12 +362,18 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_desc(vq, 1);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -183,6 +385,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -205,7 +409,8 @@ static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned in, log, s;
+	struct kiocb *iocb = NULL;
+	unsigned in, out, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -225,25 +430,42 @@ static void handle_rx(struct vhost_net *net)
 	int err, headcount, datalen;
 	size_t vhost_hlen;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+		      !is_async_vq(vq)))
 		return;
-
 	use_mm(net->dev.mm);
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
 	vhost_hlen = vq->vhost_hlen;
 
+	/* In async cases, when write log is enabled, in case the submitted
+	 * buffers did not get log info before the log enabling, so we'd
+	 * better recompute the log info when needed. We do this in
+	 * handle_async_rx_events_notify().
+	 */
+
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
-	while ((datalen = vhost_head_len(vq, sock->sk))) {
-		headcount = vhost_get_desc_n(vq, vq->heads,
-					     datalen + vhost_hlen,
-					     &in, vq_log, &log);
+	handle_async_rx_events_notify(net, vq, sock);
+
+	while (is_async_vq(vq) ||
+		(datalen = vhost_head_len(vq, sock->sk)) != 0) {
+		if (is_async_vq(vq))
+			headcount =
+				vhost_get_desc(&net->dev, vq, vq->iov,
+						ARRAY_SIZE(vq->iov),
+						&out, &in,
+						vq->log, &log);
+		else
+			headcount = vhost_get_desc_n(vq, vq->heads,
+						     datalen + vhost_hlen,
+						     &in, vq_log, &log);
 		if (headcount < 0)
 			break;
 		/* OK, now we need to know about added descriptors. */
-		if (!headcount) {
+		if ((!headcount && !is_async_vq(vq)) ||
+			(headcount == vq->num && is_async_vq(vq))) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
@@ -256,7 +478,12 @@ static void handle_rx(struct vhost_net *net)
 		}
 		/* We don't need to be notified again. */
 		/* Skip header. TODO: support TSO. */
+		if (is_async_vq(vq) && vhost_hlen == sizeof(hdr)) {
+			vq->hdr[0].iov_len = vhost_hlen;
+			goto nomove;
+		}
 		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
+nomove:
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
 		/* Sanity check */
@@ -266,13 +493,23 @@ static void handle_rx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), vhost_hlen);
 			break;
 		}
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, headcount);
+			if (!iocb)
+				break;
+		}
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_desc(vq, headcount);
 			break;
 		}
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != datalen) {
 			pr_err("Discarded rx packet: "
 			       " len %d, expected %zd\n", err, datalen);
@@ -280,6 +517,9 @@ static void handle_rx(struct vhost_net *net)
 			continue;
 		}
 		len = err;
+		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
+			hdr.num_buffers = headcount;
+
 		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr,
 				     vhost_hlen);
 		if (err) {
@@ -287,18 +527,7 @@ static void handle_rx(struct vhost_net *net)
 			       vq->iov->iov_base, err);
 			break;
 		}
-		/* TODO: Should check and handle checksum. */
-		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
-			struct iovec *iov = vhost_hlen ? vq->hdr : vq->iov;
-
-			if (memcpy_toiovecend(iov, (unsigned char *)&headcount,
-				      offsetof(typeof(hdr), num_buffers),
-				      sizeof(hdr.num_buffers))) {
-				vq_err(vq, "Failed num_buffers write");
-				vhost_discard_desc(vq, headcount);
-				break;
-			}
-		}
+
 		len += vhost_hlen;
 		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
 					    headcount);
@@ -311,6 +540,8 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq, sock);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -364,6 +595,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 
 	f->private_data = n;
 
@@ -427,6 +659,21 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+	/* clean the notifier */
+	struct vhost_virtqueue *vq;
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		vq = &n->dev.vqs[VHOST_NET_VQ_TX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -443,6 +690,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_async_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -494,21 +742,58 @@ static struct socket *get_tap_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+				 enum vhost_vq_link_state *state)
 {
 	struct socket *sock;
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
+
+	*state = VHOST_VQ_LINK_SYNC;
+
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
 	sock = get_tap_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	/* If we dont' have notify_cache, then dont do mpassthru */
+	if (!notify_cache)
+		return ERR_PTR(-ENOTSOCK);
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		*state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache)
+			n->cache = notify_cache;
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -532,12 +817,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(vq, fd, &vq->link_state);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -687,6 +974,9 @@ int vhost_net_init(void)
 	r = misc_register(&vhost_net_misc);
 	if (r)
 		goto err_reg;
+	notify_cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
 	return 0;
 err_reg:
 	vhost_cleanup();
@@ -700,6 +990,8 @@ void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
 	vhost_cleanup();
+	if (notify_cache)
+		kmem_cache_destroy(notify_cache);
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 118c8e0..66ff5c5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -909,6 +909,85 @@ err:
 	return r;
 }
 
+unsigned __vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[], unsigned int iov_size,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head)
+{
+	struct vring_desc desc;
+	unsigned int i, found = 0;
+	u16 last_avail_idx;
+	int ret;
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+	if (unlikely(log))
+		*log_num = 0;
+
+	i = head;
+	do {
+		unsigned iov_count = *in_num + *out_num;
+		if (i >= vq->num) {
+			vq_err(vq, "Desc index is %u > %u, head = %u",
+			       i, vq->num, head);
+			return vq->num;
+		}
+		if (++found > vq->num) {
+			vq_err(vq, "Loop detected: last one at %u "
+			       "vq size %u head %u\n",
+			       i, vq->num, head);
+			return vq->num;
+		}
+		ret = copy_from_user(&desc, vq->desc + i, sizeof desc);
+		if (ret) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+			       i, vq->desc + i);
+			return vq->num;
+		}
+		if (desc.flags & VRING_DESC_F_INDIRECT) {
+			ret = get_indirect(dev, vq, iov, iov_size,
+					   out_num, in_num,
+					   log, log_num, &desc);
+			if (ret < 0) {
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", i);
+				return vq->num;
+			}
+			continue;
+		}
+
+		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
+				     iov_size - iov_count);
+		if (ret < 0) {
+			vq_err(vq, "Translation failure %d descriptor idx %d\n",
+			       ret, i);
+			return vq->num;
+		}
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += ret;
+			if (unlikely(log)) {
+				log[*log_num].addr = desc.addr;
+				log[*log_num].len = desc.len;
+				++*log_num;
+			}
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (*in_num) {
+				vq_err(vq, "Descriptor has out after in: "
+				       "idx %d\n", i);
+				return vq->num;
+			}
+			*out_num += ret;
+		}
+	} while ((i = next_desc(&desc)) != -1);
+
+	return head;
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 08d740a..54c6d0b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 0,
+	VHOST_VQ_LINK_ASYNC = 1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -98,6 +103,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differiate async socket for 0-copy from normal */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
 };
 
 struct vhost_dev {
@@ -125,6 +134,11 @@ int vhost_log_access_ok(struct vhost_dev *);
 int vhost_get_desc_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
 		     int datalen, unsigned int *iovcount, struct vhost_log *log,
 		     unsigned int *log_num);
+unsigned __vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			struct iovec iov[], unsigned int iov_count,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num,
+			unsigned int head);
 unsigned vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
 			   struct iovec iov[], unsigned int iov_count,
 			   unsigned int *out_num, unsigned int *in_num,
@@ -165,6 +179,7 @@ enum {
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features = rcu_dereference(dev->acked_features);
+	acked_features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
 	return acked_features & (1 << bit);
 }
 
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 14/16] Provides multiple submits and asynchronous notifications.
@ 2010-08-06  9:23                             ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/vhost/net.c   |  348 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |   79 +++++++++++
 drivers/vhost/vhost.h |   15 ++
 3 files changed, 414 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b38abc6..c4bc815 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -39,6 +41,8 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -49,6 +53,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -93,11 +98,190 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+	int count;
+
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_desc(&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head = iocb->ki_pos;
+					rx_total_len += iocb->ki_nbytes;
+
+					if (iocb->ki_dtor)
+						iocb->ki_dtor(iocb);
+					kmem_cache_free(net->cache, iocb);
+
+					if (unlikely(vq_log)) {
+						if (!log)
+							__vhost_get_desc(
+							&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+						vhost_log_write(
+							vq, vq_log, log, size);
+					}
+				} else
+					break;
+
+				i++;
+				iocb == NULL;
+				if (count)
+					iocb = notify_dequeue(vq);
+			}
+			vhost_add_used_and_signal_n(
+					&net->dev, vq, vq->heads, hc);
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct list_head *entry, *tmp;
+	unsigned long flags;
+	int tx_total_len = 0;
+
+	if (!is_async_vq(vq))
+		return;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_for_each_safe(entry, tmp, &vq->notifier) {
+		iocb = list_entry(entry,
+				  struct kiocb, ki_list);
+		if (!iocb->ki_flags)
+			continue;
+		list_del(&iocb->ki_list);
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+				 struct vhost_virtqueue *vq,
+				 unsigned head)
+{
+	struct kiocb *iocb = NULL;
+
+	if (!is_async_vq(vq))
+		return NULL;
+
+	iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+	if (!iocb)
+		return NULL;
+	iocb->private = vq;
+	iocb->ki_pos = head;
+	iocb->ki_dtor = handle_iocb;
+	if (vq == &net->dev.vqs[VHOST_NET_VQ_RX])
+		iocb->ki_user_data = vq->num;
+	iocb->ki_iovec = vq->hdr;
+	return iocb;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -130,6 +314,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	vhost_hlen = vq->vhost_hlen;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_desc(&net->dev, vq, vq->iov,
 				      ARRAY_SIZE(vq->iov),
@@ -138,10 +324,13 @@ static void handle_tx(struct vhost_net *net)
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
-			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
-				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-				break;
+			if (!is_async_vq(vq)) {
+				if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
+					tx_poll_start(net, sock);
+					set_bit(SOCK_ASYNC_NOSPACE,
+						&sock->flags);
+					break;
+				}
 			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
@@ -158,6 +347,13 @@ static void handle_tx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
 		msg.msg_iovlen = out;
 		len = iov_length(vq->iov, out);
+
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, head);
+			if (!iocb)
+				break;
+		}
+
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for TX: "
@@ -166,12 +362,18 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_desc(vq, 1);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -183,6 +385,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -205,7 +409,8 @@ static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned in, log, s;
+	struct kiocb *iocb = NULL;
+	unsigned in, out, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -225,25 +430,42 @@ static void handle_rx(struct vhost_net *net)
 	int err, headcount, datalen;
 	size_t vhost_hlen;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+		      !is_async_vq(vq)))
 		return;
-
 	use_mm(net->dev.mm);
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
 	vhost_hlen = vq->vhost_hlen;
 
+	/* In async cases, when write log is enabled, in case the submitted
+	 * buffers did not get log info before the log enabling, so we'd
+	 * better recompute the log info when needed. We do this in
+	 * handle_async_rx_events_notify().
+	 */
+
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
-	while ((datalen = vhost_head_len(vq, sock->sk))) {
-		headcount = vhost_get_desc_n(vq, vq->heads,
-					     datalen + vhost_hlen,
-					     &in, vq_log, &log);
+	handle_async_rx_events_notify(net, vq, sock);
+
+	while (is_async_vq(vq) ||
+		(datalen = vhost_head_len(vq, sock->sk)) != 0) {
+		if (is_async_vq(vq))
+			headcount =
+				vhost_get_desc(&net->dev, vq, vq->iov,
+						ARRAY_SIZE(vq->iov),
+						&out, &in,
+						vq->log, &log);
+		else
+			headcount = vhost_get_desc_n(vq, vq->heads,
+						     datalen + vhost_hlen,
+						     &in, vq_log, &log);
 		if (headcount < 0)
 			break;
 		/* OK, now we need to know about added descriptors. */
-		if (!headcount) {
+		if ((!headcount && !is_async_vq(vq)) ||
+			(headcount == vq->num && is_async_vq(vq))) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
@@ -256,7 +478,12 @@ static void handle_rx(struct vhost_net *net)
 		}
 		/* We don't need to be notified again. */
 		/* Skip header. TODO: support TSO. */
+		if (is_async_vq(vq) && vhost_hlen == sizeof(hdr)) {
+			vq->hdr[0].iov_len = vhost_hlen;
+			goto nomove;
+		}
 		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
+nomove:
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
 		/* Sanity check */
@@ -266,13 +493,23 @@ static void handle_rx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), vhost_hlen);
 			break;
 		}
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, headcount);
+			if (!iocb)
+				break;
+		}
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_desc(vq, headcount);
 			break;
 		}
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != datalen) {
 			pr_err("Discarded rx packet: "
 			       " len %d, expected %zd\n", err, datalen);
@@ -280,6 +517,9 @@ static void handle_rx(struct vhost_net *net)
 			continue;
 		}
 		len = err;
+		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
+			hdr.num_buffers = headcount;
+
 		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr,
 				     vhost_hlen);
 		if (err) {
@@ -287,18 +527,7 @@ static void handle_rx(struct vhost_net *net)
 			       vq->iov->iov_base, err);
 			break;
 		}
-		/* TODO: Should check and handle checksum. */
-		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
-			struct iovec *iov = vhost_hlen ? vq->hdr : vq->iov;
-
-			if (memcpy_toiovecend(iov, (unsigned char *)&headcount,
-				      offsetof(typeof(hdr), num_buffers),
-				      sizeof(hdr.num_buffers))) {
-				vq_err(vq, "Failed num_buffers write");
-				vhost_discard_desc(vq, headcount);
-				break;
-			}
-		}
+
 		len += vhost_hlen;
 		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
 					    headcount);
@@ -311,6 +540,8 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq, sock);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -364,6 +595,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 
 	f->private_data = n;
 
@@ -427,6 +659,21 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+	/* clean the notifier */
+	struct vhost_virtqueue *vq;
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		vq = &n->dev.vqs[VHOST_NET_VQ_TX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -443,6 +690,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_async_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -494,21 +742,58 @@ static struct socket *get_tap_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+				 enum vhost_vq_link_state *state)
 {
 	struct socket *sock;
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
+
+	*state = VHOST_VQ_LINK_SYNC;
+
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
 	sock = get_tap_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	/* If we dont' have notify_cache, then dont do mpassthru */
+	if (!notify_cache)
+		return ERR_PTR(-ENOTSOCK);
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		*state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache)
+			n->cache = notify_cache;
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -532,12 +817,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(vq, fd, &vq->link_state);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -687,6 +974,9 @@ int vhost_net_init(void)
 	r = misc_register(&vhost_net_misc);
 	if (r)
 		goto err_reg;
+	notify_cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
 	return 0;
 err_reg:
 	vhost_cleanup();
@@ -700,6 +990,8 @@ void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
 	vhost_cleanup();
+	if (notify_cache)
+		kmem_cache_destroy(notify_cache);
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 118c8e0..66ff5c5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -909,6 +909,85 @@ err:
 	return r;
 }
 
+unsigned __vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[], unsigned int iov_size,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head)
+{
+	struct vring_desc desc;
+	unsigned int i, found = 0;
+	u16 last_avail_idx;
+	int ret;
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+	if (unlikely(log))
+		*log_num = 0;
+
+	i = head;
+	do {
+		unsigned iov_count = *in_num + *out_num;
+		if (i >= vq->num) {
+			vq_err(vq, "Desc index is %u > %u, head = %u",
+			       i, vq->num, head);
+			return vq->num;
+		}
+		if (++found > vq->num) {
+			vq_err(vq, "Loop detected: last one at %u "
+			       "vq size %u head %u\n",
+			       i, vq->num, head);
+			return vq->num;
+		}
+		ret = copy_from_user(&desc, vq->desc + i, sizeof desc);
+		if (ret) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+			       i, vq->desc + i);
+			return vq->num;
+		}
+		if (desc.flags & VRING_DESC_F_INDIRECT) {
+			ret = get_indirect(dev, vq, iov, iov_size,
+					   out_num, in_num,
+					   log, log_num, &desc);
+			if (ret < 0) {
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", i);
+				return vq->num;
+			}
+			continue;
+		}
+
+		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
+				     iov_size - iov_count);
+		if (ret < 0) {
+			vq_err(vq, "Translation failure %d descriptor idx %d\n",
+			       ret, i);
+			return vq->num;
+		}
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += ret;
+			if (unlikely(log)) {
+				log[*log_num].addr = desc.addr;
+				log[*log_num].len = desc.len;
+				++*log_num;
+			}
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (*in_num) {
+				vq_err(vq, "Descriptor has out after in: "
+				       "idx %d\n", i);
+				return vq->num;
+			}
+			*out_num += ret;
+		}
+	} while ((i = next_desc(&desc)) != -1);
+
+	return head;
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 08d740a..54c6d0b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 0,
+	VHOST_VQ_LINK_ASYNC = 1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -98,6 +103,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differiate async socket for 0-copy from normal */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
 };
 
 struct vhost_dev {
@@ -125,6 +134,11 @@ int vhost_log_access_ok(struct vhost_dev *);
 int vhost_get_desc_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
 		     int datalen, unsigned int *iovcount, struct vhost_log *log,
 		     unsigned int *log_num);
+unsigned __vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			struct iovec iov[], unsigned int iov_count,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num,
+			unsigned int head);
 unsigned vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
 			   struct iovec iov[], unsigned int iov_count,
 			   unsigned int *out_num, unsigned int *in_num,
@@ -165,6 +179,7 @@ enum {
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features = rcu_dereference(dev->acked_features);
+	acked_features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
 	return acked_features & (1 << bit);
 }
 
-- 
1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 15/16] An example how to modifiy NIC driver to use napi_gro_frags() interface
  2010-08-06  9:23                             ` xiaohui.xin
@ 2010-08-06  9:23                               ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is made on ixgbe driver.
It provides API is_rx_buffer_mapped_as_page() to indicate
if the driver use napi_gro_frags() interface or not.
The example allocates 2 pages for DMA for one ring descriptor
using netdev_alloc_page(). When packets is coming, using
napi_gro_frags() to allocate skb and to receive the packets.

---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  138 +++++++++++++++++++++++++++++++--------
 2 files changed, 112 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..cfe6853 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
 
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                            PCI_DMA_FROMDEVICE);
 		}
 
-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                         rx_ring->rx_buf_len,
 			                         PCI_DMA_FROMDEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = pci_map_page(pdev, bi->page_skb,
+					bi->page_skb_offset,
+					(PAGE_SIZE / 2),
+					PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
 	dma_addr_t dma;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return ((!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page);
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -868,29 +899,57 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
+		if (is_no_buffer(rx_buffer_info))
+			break;
+
 		cleaned = true;
-		skb = rx_buffer_info->skb;
-		prefetch(skb->data);
-		rx_buffer_info->skb = NULL;
 
-		if (rx_buffer_info->dma) {
-			if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
-			    (!(staterr & IXGBE_RXD_STAT_EOP)) &&
-				 (!(skb->prev)))
-				/*
-				 * When HWRSC is enabled, delay unmapping
-				 * of the first packet. It carries the
-				 * header information, HW may still
-				 * access the header after the writeback.
-				 * Only unmap it when EOP is reached
-				 */
-				IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
-			else
-				pci_unmap_single(pdev, rx_buffer_info->dma,
-				                 rx_ring->rx_buf_len,
-				                 PCI_DMA_FROMDEVICE);
-			rx_buffer_info->dma = 0;
-			skb_put(skb, len);
+		if (!rx_buffer_info->mapped_as_page) {
+			skb = rx_buffer_info->skb;
+			prefetch(skb->data);
+			rx_buffer_info->skb = NULL;
+
+			if (rx_buffer_info->dma) {
+				if ((adapter->flags2 &
+					IXGBE_FLAG2_RSC_ENABLED) &&
+					(!(staterr & IXGBE_RXD_STAT_EOP)) &&
+					(!(skb->prev)))
+					/*
+					 * When HWRSC is enabled, delay unmapping
+					 * of the first packet. It carries the
+					 * header information, HW may still
+					 * access the header after the writeback.
+					 * Only unmap it when EOP is reached
+					 */
+					IXGBE_RSC_CB(skb)->dma =
+							rx_buffer_info->dma;
+				else
+					pci_unmap_single(pdev,
+							rx_buffer_info->dma,
+							rx_ring->rx_buf_len,
+							PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_put(skb, len);
+			}
+		} else {
+			skb = napi_get_frags(napi);
+			prefetch(rx_buffer_info->page_skb_offset);
+			rx_buffer_info->skb = NULL;
+			if (rx_buffer_info->dma) {
+				pci_unmap_page(pdev, rx_buffer_info->dma,
+						PAGE_SIZE / 2,
+						PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_fill_page_desc(skb,
+						skb_shinfo(skb)->nr_frags,
+						rx_buffer_info->page_skb,
+						rx_buffer_info->page_skb_offset,
+						len);
+				rx_buffer_info->page_skb = NULL;
+				skb->len += len;
+				skb->data_len += len;
+				skb->truesize += len;
+			}
 		}
 
 		if (upper_len) {
@@ -956,6 +1015,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				rx_buffer_info->dma = next_buffer->dma;
 				next_buffer->skb = skb;
 				next_buffer->dma = 0;
+				if (rx_buffer_info->mapped_as_page) {
+					rx_buffer_info->page_skb =
+							next_buffer->page_skb;
+					next_buffer->page_skb = NULL;
+					next_buffer->skb = NULL;
+				}
 			} else {
 				skb->next = next_buffer->skb;
 				skb->next->prev = skb;
@@ -975,7 +1040,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		total_rx_bytes += skb->len;
 		total_rx_packets++;
 
-		skb->protocol = eth_type_trans(skb, adapter->netdev);
+		if (!rx_buffer_info->mapped_as_page)
+			skb->protocol = eth_type_trans(skb, adapter->netdev);
 #ifdef IXGBE_FCOE
 		/* if ddp, not passing to ULD unless for FCP_RSP or error */
 		if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
@@ -984,7 +1050,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				goto next_desc;
 		}
 #endif /* IXGBE_FCOE */
-		ixgbe_receive_skb(q_vector, skb, staterr, rx_ring, rx_desc);
+
+		if (!rx_buffer_info->mapped_as_page)
+			ixgbe_receive_skb(q_vector, skb, staterr,
+						rx_ring, rx_desc);
+		else {
+			skb_record_rx_queue(skb, rx_ring->queue_index);
+			napi_gro_frags(napi);
+		}
 
 next_desc:
 		rx_desc->wb.upper.status_error = 0;
@@ -3131,9 +3204,16 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 
 		rx_buffer_info = &rx_ring->rx_buffer_info[i];
 		if (rx_buffer_info->dma) {
-			pci_unmap_single(pdev, rx_buffer_info->dma,
-			                 rx_ring->rx_buf_len,
-			                 PCI_DMA_FROMDEVICE);
+			if (!rx_buffer_info->mapped_as_page) {
+				pci_unmap_single(pdev, rx_buffer_info->dma,
+						rx_ring->rx_buf_len,
+						PCI_DMA_FROMDEVICE);
+			} else {
+				pci_unmap_page(pdev, rx_buffer_info->dma,
+						PAGE_SIZE / 2,
+						PCI_DMA_FROMDEVICE);
+				rx_buffer_info->page_skb = NULL;
+			}
 			rx_buffer_info->dma = 0;
 		}
 		if (rx_buffer_info->skb) {
@@ -3158,7 +3238,7 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 			               PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
 			rx_buffer_info->page_dma = 0;
 		}
-		put_page(rx_buffer_info->page);
+		netdev_free_page(adapter->netdev, rx_buffer_info->page);
 		rx_buffer_info->page = NULL;
 		rx_buffer_info->page_offset = 0;
 	}
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 15/16] An example how to modifiy NIC driver to use napi_gro_frags() interface
@ 2010-08-06  9:23                               ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is made on ixgbe driver.
It provides API is_rx_buffer_mapped_as_page() to indicate
if the driver use napi_gro_frags() interface or not.
The example allocates 2 pages for DMA for one ring descriptor
using netdev_alloc_page(). When packets is coming, using
napi_gro_frags() to allocate skb and to receive the packets.

---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  138 +++++++++++++++++++++++++++++++--------
 2 files changed, 112 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..cfe6853 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
 
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                            PCI_DMA_FROMDEVICE);
 		}
 
-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                         rx_ring->rx_buf_len,
 			                         PCI_DMA_FROMDEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = pci_map_page(pdev, bi->page_skb,
+					bi->page_skb_offset,
+					(PAGE_SIZE / 2),
+					PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
 	dma_addr_t dma;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return ((!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page);
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -868,29 +899,57 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
+		if (is_no_buffer(rx_buffer_info))
+			break;
+
 		cleaned = true;
-		skb = rx_buffer_info->skb;
-		prefetch(skb->data);
-		rx_buffer_info->skb = NULL;
 
-		if (rx_buffer_info->dma) {
-			if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
-			    (!(staterr & IXGBE_RXD_STAT_EOP)) &&
-				 (!(skb->prev)))
-				/*
-				 * When HWRSC is enabled, delay unmapping
-				 * of the first packet. It carries the
-				 * header information, HW may still
-				 * access the header after the writeback.
-				 * Only unmap it when EOP is reached
-				 */
-				IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
-			else
-				pci_unmap_single(pdev, rx_buffer_info->dma,
-				                 rx_ring->rx_buf_len,
-				                 PCI_DMA_FROMDEVICE);
-			rx_buffer_info->dma = 0;
-			skb_put(skb, len);
+		if (!rx_buffer_info->mapped_as_page) {
+			skb = rx_buffer_info->skb;
+			prefetch(skb->data);
+			rx_buffer_info->skb = NULL;
+
+			if (rx_buffer_info->dma) {
+				if ((adapter->flags2 &
+					IXGBE_FLAG2_RSC_ENABLED) &&
+					(!(staterr & IXGBE_RXD_STAT_EOP)) &&
+					(!(skb->prev)))
+					/*
+					 * When HWRSC is enabled, delay unmapping
+					 * of the first packet. It carries the
+					 * header information, HW may still
+					 * access the header after the writeback.
+					 * Only unmap it when EOP is reached
+					 */
+					IXGBE_RSC_CB(skb)->dma =
+							rx_buffer_info->dma;
+				else
+					pci_unmap_single(pdev,
+							rx_buffer_info->dma,
+							rx_ring->rx_buf_len,
+							PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_put(skb, len);
+			}
+		} else {
+			skb = napi_get_frags(napi);
+			prefetch(rx_buffer_info->page_skb_offset);
+			rx_buffer_info->skb = NULL;
+			if (rx_buffer_info->dma) {
+				pci_unmap_page(pdev, rx_buffer_info->dma,
+						PAGE_SIZE / 2,
+						PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_fill_page_desc(skb,
+						skb_shinfo(skb)->nr_frags,
+						rx_buffer_info->page_skb,
+						rx_buffer_info->page_skb_offset,
+						len);
+				rx_buffer_info->page_skb = NULL;
+				skb->len += len;
+				skb->data_len += len;
+				skb->truesize += len;
+			}
 		}
 
 		if (upper_len) {
@@ -956,6 +1015,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				rx_buffer_info->dma = next_buffer->dma;
 				next_buffer->skb = skb;
 				next_buffer->dma = 0;
+				if (rx_buffer_info->mapped_as_page) {
+					rx_buffer_info->page_skb =
+							next_buffer->page_skb;
+					next_buffer->page_skb = NULL;
+					next_buffer->skb = NULL;
+				}
 			} else {
 				skb->next = next_buffer->skb;
 				skb->next->prev = skb;
@@ -975,7 +1040,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		total_rx_bytes += skb->len;
 		total_rx_packets++;
 
-		skb->protocol = eth_type_trans(skb, adapter->netdev);
+		if (!rx_buffer_info->mapped_as_page)
+			skb->protocol = eth_type_trans(skb, adapter->netdev);
 #ifdef IXGBE_FCOE
 		/* if ddp, not passing to ULD unless for FCP_RSP or error */
 		if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
@@ -984,7 +1050,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				goto next_desc;
 		}
 #endif /* IXGBE_FCOE */
-		ixgbe_receive_skb(q_vector, skb, staterr, rx_ring, rx_desc);
+
+		if (!rx_buffer_info->mapped_as_page)
+			ixgbe_receive_skb(q_vector, skb, staterr,
+						rx_ring, rx_desc);
+		else {
+			skb_record_rx_queue(skb, rx_ring->queue_index);
+			napi_gro_frags(napi);
+		}
 
 next_desc:
 		rx_desc->wb.upper.status_error = 0;
@@ -3131,9 +3204,16 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 
 		rx_buffer_info = &rx_ring->rx_buffer_info[i];
 		if (rx_buffer_info->dma) {
-			pci_unmap_single(pdev, rx_buffer_info->dma,
-			                 rx_ring->rx_buf_len,
-			                 PCI_DMA_FROMDEVICE);
+			if (!rx_buffer_info->mapped_as_page) {
+				pci_unmap_single(pdev, rx_buffer_info->dma,
+						rx_ring->rx_buf_len,
+						PCI_DMA_FROMDEVICE);
+			} else {
+				pci_unmap_page(pdev, rx_buffer_info->dma,
+						PAGE_SIZE / 2,
+						PCI_DMA_FROMDEVICE);
+				rx_buffer_info->page_skb = NULL;
+			}
 			rx_buffer_info->dma = 0;
 		}
 		if (rx_buffer_info->skb) {
@@ -3158,7 +3238,7 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 			               PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
 			rx_buffer_info->page_dma = 0;
 		}
-		put_page(rx_buffer_info->page);
+		netdev_free_page(adapter->netdev, rx_buffer_info->page);
 		rx_buffer_info->page = NULL;
 		rx_buffer_info->page_offset = 0;
 	}
-- 
1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 16/16] An example how to alloc user buffer based on napi_gro_frags() interface.
  2010-08-06  9:23                               ` xiaohui.xin
@ 2010-08-06  9:23                                 ` xiaohui.xin
  -1 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is made on ixgbe driver which using napi_gro_frags().
It can get buffers from guest side directly using netdev_alloc_page()
and release guest buffers using netdev_free_page().
---
 drivers/net/ixgbe/ixgbe_main.c |   25 +++++++++++++++++++++----
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index cfe6853..c563111 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -691,7 +691,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
 					struct net_device *dev)
 {
-	return true;
+	return dev_is_mpassthru(dev);
+}
+
+static u32 get_page_skb_offset(struct net_device *dev)
+{
+	if (!dev_is_mpassthru(dev))
+		return 0;
+	return dev->mp_port->vnet_hlen;
 }
 
 /**
@@ -764,7 +771,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
 			}
-			bi->page_skb_offset = 0;
+			bi->page_skb_offset =
+				get_page_skb_offset(adapter->netdev);
 			bi->dma = pci_map_page(pdev, bi->page_skb,
 					bi->page_skb_offset,
 					(PAGE_SIZE / 2),
@@ -899,8 +907,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
-		if (is_no_buffer(rx_buffer_info))
+		if (is_no_buffer(rx_buffer_info)) {
+			printk("no buffers\n");
 			break;
+		}
 
 		cleaned = true;
 
@@ -945,6 +955,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 						rx_buffer_info->page_skb,
 						rx_buffer_info->page_skb_offset,
 						len);
+				if (dev_is_mpassthru(netdev) &&
+						netdev->mp_port->hash)
+					skb_shinfo(skb)->destructor_arg =
+					netdev->mp_port->hash(netdev,
+					rx_buffer_info->page_skb);
+
 				rx_buffer_info->page_skb = NULL;
 				skb->len += len;
 				skb->data_len += len;
@@ -962,7 +978,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			                   upper_len);
 
 			if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
-			    (page_count(rx_buffer_info->page) != 1))
+			    (page_count(rx_buffer_info->page) != 1) ||
+				dev_is_mpassthru(netdev))
 				rx_buffer_info->page = NULL;
 			else
 				get_page(rx_buffer_info->page);
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC PATCH v9 16/16] An example how to alloc user buffer based on napi_gro_frags() interface.
@ 2010-08-06  9:23                                 ` xiaohui.xin
  0 siblings, 0 replies; 64+ messages in thread
From: xiaohui.xin @ 2010-08-06  9:23 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is made on ixgbe driver which using napi_gro_frags().
It can get buffers from guest side directly using netdev_alloc_page()
and release guest buffers using netdev_free_page().
---
 drivers/net/ixgbe/ixgbe_main.c |   25 +++++++++++++++++++++----
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index cfe6853..c563111 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -691,7 +691,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
 					struct net_device *dev)
 {
-	return true;
+	return dev_is_mpassthru(dev);
+}
+
+static u32 get_page_skb_offset(struct net_device *dev)
+{
+	if (!dev_is_mpassthru(dev))
+		return 0;
+	return dev->mp_port->vnet_hlen;
 }
 
 /**
@@ -764,7 +771,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
 			}
-			bi->page_skb_offset = 0;
+			bi->page_skb_offset =
+				get_page_skb_offset(adapter->netdev);
 			bi->dma = pci_map_page(pdev, bi->page_skb,
 					bi->page_skb_offset,
 					(PAGE_SIZE / 2),
@@ -899,8 +907,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
-		if (is_no_buffer(rx_buffer_info))
+		if (is_no_buffer(rx_buffer_info)) {
+			printk("no buffers\n");
 			break;
+		}
 
 		cleaned = true;
 
@@ -945,6 +955,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 						rx_buffer_info->page_skb,
 						rx_buffer_info->page_skb_offset,
 						len);
+				if (dev_is_mpassthru(netdev) &&
+						netdev->mp_port->hash)
+					skb_shinfo(skb)->destructor_arg =
+					netdev->mp_port->hash(netdev,
+					rx_buffer_info->page_skb);
+
 				rx_buffer_info->page_skb = NULL;
 				skb->len += len;
 				skb->data_len += len;
@@ -962,7 +978,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			                   upper_len);
 
 			if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
-			    (page_count(rx_buffer_info->page) != 1))
+			    (page_count(rx_buffer_info->page) != 1) ||
+				dev_is_mpassthru(netdev))
 				rx_buffer_info->page = NULL;
 			else
 				get_page(rx_buffer_info->page);
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-08-06  9:23 ` xiaohui.xin
@ 2010-08-11  1:23   ` Shirley Ma
  -1 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  1:23 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

Hello Xiaohui,

On Fri, 2010-08-06 at 17:23 +0800, xiaohui.xin@intel.com wrote:
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. 

Have you had any performance data to share here? I tested my
experimental macvtap zero copy for TX only. The performance I have seen
as below without any tuning, (default setting):

Before: netperf 16K message size results with 60 secs run is 7.5Gb/s
over ixgbe 10GbE card. perf top shows:

2103.00 12.9% copy_user_generic_string
1541.00  9.4% handle_tx
1490.00  9.1% _raw_spin_unlock_irqrestore
1361.00  8.3% _raw_spin_lock_irqsave
1288.00  7.9% _raw_spin_lock
924.00  5.7% vhost_worker

After: netperf results with 60 secs run is 8.1Gb/s, perf output:

1093.00  9.9% _raw_spin_unlock_irqrestore
1048.00  9.5% handle_tx
934.00  8.5% _raw_spin_lock_irqsave
864.00  7.9% _raw_spin_lock
644.00  5.9% vhost_worker
387.00  3.5% use_mm 

I am still working on collecting more data (latency, cpu
utilization...). I will let you know once I get all data for macvtap TX
zero copy. Also I found some vhost performance regression on the new
kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it.

Shirley
 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
@ 2010-08-11  1:23   ` Shirley Ma
  0 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  1:23 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

Hello Xiaohui,

On Fri, 2010-08-06 at 17:23 +0800, xiaohui.xin@intel.com wrote:
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. 

Have you had any performance data to share here? I tested my
experimental macvtap zero copy for TX only. The performance I have seen
as below without any tuning, (default setting):

Before: netperf 16K message size results with 60 secs run is 7.5Gb/s
over ixgbe 10GbE card. perf top shows:

2103.00 12.9% copy_user_generic_string
1541.00  9.4% handle_tx
1490.00  9.1% _raw_spin_unlock_irqrestore
1361.00  8.3% _raw_spin_lock_irqsave
1288.00  7.9% _raw_spin_lock
924.00  5.7% vhost_worker

After: netperf results with 60 secs run is 8.1Gb/s, perf output:

1093.00  9.9% _raw_spin_unlock_irqrestore
1048.00  9.5% handle_tx
934.00  8.5% _raw_spin_lock_irqsave
864.00  7.9% _raw_spin_lock
644.00  5.9% vhost_worker
387.00  3.5% use_mm 

I am still working on collecting more data (latency, cpu
utilization...). I will let you know once I get all data for macvtap TX
zero copy. Also I found some vhost performance regression on the new
kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it.

Shirley
 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-08-11  1:23   ` Shirley Ma
@ 2010-08-11  1:43     ` Shirley Ma
  -1 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  1:43 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

On Tue, 2010-08-10 at 18:23 -0700, Shirley Ma wrote:
> Also I found some vhost performance regression on the new
> kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 

I forgot to mention the kernel I used 2.6.36 one. And I found the native
host BW is limited to 8.0Gb/s, so the regression might come from the
device driver not vhost.

Shirley




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
@ 2010-08-11  1:43     ` Shirley Ma
  0 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  1:43 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

On Tue, 2010-08-10 at 18:23 -0700, Shirley Ma wrote:
> Also I found some vhost performance regression on the new
> kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 

I forgot to mention the kernel I used 2.6.36 one. And I found the native
host BW is limited to 8.0Gb/s, so the regression might come from the
device driver not vhost.

Shirley




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-08-11  1:43     ` Shirley Ma
@ 2010-08-11  6:01       ` Shirley Ma
  -1 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  6:01 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

On Tue, 2010-08-10 at 18:43 -0700, Shirley Ma wrote:
> > Also I found some vhost performance regression on the new
> > kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 
> 
> I forgot to mention the kernel I used 2.6.36 one. And I found the
> native
> host BW is limited to 8.0Gb/s, so the regression might come from the
> device driver not vhost.

Something is very interesting, when binding ixgbe interrupts to cpu1,
and running netperf/netserver on cpu0, the native host to host
performance is still around 8.0Gb/s, however, the macvtap zero copy
result is 9.0Gb/s.

root@localhost ~]# netperf -H 192.168.10.74 -c -C -l60 -T0,0 -- -m 64K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168..
10.74) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  65536    60.00      9013.59   53.01    8.21     0.963   0.597

Below is perf top output:

              578.00  6.5% copy_user_generic_string   
              381.00  4.3% vmx_vcpu_run                
              250.00  2.8% schedule                    
              207.00  2.3% vhost_get_vq_desc           
              204.00  2.3% _raw_spin_lock_irqsave      
              197.00  2.2% translate_desc              
              193.00  2.2% memcpy_fromiovec            
              162.00  1.8% gup_pte_range   

We can compare your results with mine to see any difference.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
@ 2010-08-11  6:01       ` Shirley Ma
  0 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  6:01 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

On Tue, 2010-08-10 at 18:43 -0700, Shirley Ma wrote:
> > Also I found some vhost performance regression on the new
> > kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 
> 
> I forgot to mention the kernel I used 2.6.36 one. And I found the
> native
> host BW is limited to 8.0Gb/s, so the regression might come from the
> device driver not vhost.

Something is very interesting, when binding ixgbe interrupts to cpu1,
and running netperf/netserver on cpu0, the native host to host
performance is still around 8.0Gb/s, however, the macvtap zero copy
result is 9.0Gb/s.

root@localhost ~]# netperf -H 192.168.10.74 -c -C -l60 -T0,0 -- -m 64K
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168..
10.74) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  65536    60.00      9013.59   53.01    8.21     0.963   0.597

Below is perf top output:

              578.00  6.5% copy_user_generic_string   
              381.00  4.3% vmx_vcpu_run                
              250.00  2.8% schedule                    
              207.00  2.3% vhost_get_vq_desc           
              204.00  2.3% _raw_spin_lock_irqsave      
              197.00  2.2% translate_desc              
              193.00  2.2% memcpy_fromiovec            
              162.00  1.8% gup_pte_range   

We can compare your results with mine to see any difference.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-08-11  6:01       ` Shirley Ma
@ 2010-08-11  6:55         ` Shirley Ma
  -1 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  6:55 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

On Tue, 2010-08-10 at 23:01 -0700, Shirley Ma wrote:
> On Tue, 2010-08-10 at 18:43 -0700, Shirley Ma wrote:
> > > Also I found some vhost performance regression on the new
> > > kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 
> > 
> > I forgot to mention the kernel I used 2.6.36 one. And I found the
> > native
> > host BW is limited to 8.0Gb/s, so the regression might come from the
> > device driver not vhost.
> 
> Something is very interesting, when binding ixgbe interrupts to cpu1,
> and running netperf/netserver on cpu0, the native host to host
> performance is still around 8.0Gb/s, however, the macvtap zero copy
> result is 9.0Gb/s.
> 
> root@localhost ~]# netperf -H 192.168.10.74 -c -C -l60 -T0,0 -- -m 64K
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168..
> 10.74) port 0 AF_INET : cpu bind
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  65536    60.00      9013.59   53.01    8.21     0.963   0.597
> 
> Below is perf top output:
> 
>               578.00  6.5% copy_user_generic_string   
>               381.00  4.3% vmx_vcpu_run                
>               250.00  2.8% schedule                    
>               207.00  2.3% vhost_get_vq_desc           
>               204.00  2.3% _raw_spin_lock_irqsave      
>               197.00  2.2% translate_desc              
>               193.00  2.2% memcpy_fromiovec            
>               162.00  1.8% gup_pte_range   
> 
> We can compare your results with mine to see any difference.

When binding vhost thread to cpu3, qemu I/O thread to cpu2, macvtap zero
copy patch can get 9.4Gb/s. 

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168.10.74) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  65536    60.00      9408.19   55.69    8.45     0.970   0.589

Shirley


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
@ 2010-08-11  6:55         ` Shirley Ma
  0 siblings, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-08-11  6:55 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

On Tue, 2010-08-10 at 23:01 -0700, Shirley Ma wrote:
> On Tue, 2010-08-10 at 18:43 -0700, Shirley Ma wrote:
> > > Also I found some vhost performance regression on the new
> > > kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 
> > 
> > I forgot to mention the kernel I used 2.6.36 one. And I found the
> > native
> > host BW is limited to 8.0Gb/s, so the regression might come from the
> > device driver not vhost.
> 
> Something is very interesting, when binding ixgbe interrupts to cpu1,
> and running netperf/netserver on cpu0, the native host to host
> performance is still around 8.0Gb/s, however, the macvtap zero copy
> result is 9.0Gb/s.
> 
> root@localhost ~]# netperf -H 192.168.10.74 -c -C -l60 -T0,0 -- -m 64K
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168..
> 10.74) port 0 AF_INET : cpu bind
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  65536    60.00      9013.59   53.01    8.21     0.963   0.597
> 
> Below is perf top output:
> 
>               578.00  6.5% copy_user_generic_string   
>               381.00  4.3% vmx_vcpu_run                
>               250.00  2.8% schedule                    
>               207.00  2.3% vhost_get_vq_desc           
>               204.00  2.3% _raw_spin_lock_irqsave      
>               197.00  2.2% translate_desc              
>               193.00  2.2% memcpy_fromiovec            
>               162.00  1.8% gup_pte_range   
> 
> We can compare your results with mine to see any difference.

When binding vhost thread to cpu3, qemu I/O thread to cpu2, macvtap zero
copy patch can get 9.4Gb/s. 

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168.10.74) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  65536    60.00      9408.19   55.69    8.45     0.970   0.589

Shirley


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-08-11  1:23   ` Shirley Ma
  (?)
  (?)
@ 2010-09-03 10:14   ` Michael S. Tsirkin
  2010-09-03 20:29     ` Sridhar Samudrala
  -1 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-03 10:14 UTC (permalink / raw)
  To: Shirley Ma
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Tue, Aug 10, 2010 at 06:23:24PM -0700, Shirley Ma wrote:
> Hello Xiaohui,
> 
> On Fri, 2010-08-06 at 17:23 +0800, xiaohui.xin@intel.com wrote:
> > Our goal is to improve the bandwidth and reduce the CPU usage.
> > Exact performance data will be provided later. 
> 
> Have you had any performance data to share here? I tested my
> experimental macvtap zero copy for TX only. The performance I have seen
> as below without any tuning, (default setting):
> 
> Before: netperf 16K message size results with 60 secs run is 7.5Gb/s
> over ixgbe 10GbE card. perf top shows:
> 
> 2103.00 12.9% copy_user_generic_string
> 1541.00  9.4% handle_tx
> 1490.00  9.1% _raw_spin_unlock_irqrestore
> 1361.00  8.3% _raw_spin_lock_irqsave
> 1288.00  7.9% _raw_spin_lock
> 924.00  5.7% vhost_worker
> 
> After: netperf results with 60 secs run is 8.1Gb/s, perf output:
> 
> 1093.00  9.9% _raw_spin_unlock_irqrestore
> 1048.00  9.5% handle_tx
> 934.00  8.5% _raw_spin_lock_irqsave
> 864.00  7.9% _raw_spin_lock
> 644.00  5.9% vhost_worker
> 387.00  3.5% use_mm 
> 
> I am still working on collecting more data (latency, cpu
> utilization...). I will let you know once I get all data for macvtap TX
> zero copy. Also I found some vhost performance regression on the new
> kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it.
> 
> Shirley

Could you please try disabling mergeable buffers, and see if this gets
you back where you were?
-global virtio-net-pci.mrg_rxbuf=off 

-- 
MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-08-11  6:55         ` Shirley Ma
  (?)
@ 2010-09-03 10:52         ` Michael S. Tsirkin
  2010-09-13 18:48           ` Shirley Ma
  2010-09-13 21:35           ` Shirley Ma
  -1 siblings, 2 replies; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-03 10:52 UTC (permalink / raw)
  To: Shirley Ma
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Tue, Aug 10, 2010 at 11:55:04PM -0700, Shirley Ma wrote:
> On Tue, 2010-08-10 at 23:01 -0700, Shirley Ma wrote:
> > On Tue, 2010-08-10 at 18:43 -0700, Shirley Ma wrote:
> > > > Also I found some vhost performance regression on the new
> > > > kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it. 
> > > 
> > > I forgot to mention the kernel I used 2.6.36 one. And I found the
> > > native
> > > host BW is limited to 8.0Gb/s, so the regression might come from the
> > > device driver not vhost.
> > 
> > Something is very interesting, when binding ixgbe interrupts to cpu1,
> > and running netperf/netserver on cpu0, the native host to host
> > performance is still around 8.0Gb/s, however, the macvtap zero copy
> > result is 9.0Gb/s.
> > 
> > root@localhost ~]# netperf -H 192.168.10.74 -c -C -l60 -T0,0 -- -m 64K
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168..
> > 10.74) port 0 AF_INET : cpu bind
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > 
> >  87380  16384  65536    60.00      9013.59   53.01    8.21     0.963   0.597
> > 
> > Below is perf top output:
> > 
> >               578.00  6.5% copy_user_generic_string   
> >               381.00  4.3% vmx_vcpu_run                
> >               250.00  2.8% schedule                    
> >               207.00  2.3% vhost_get_vq_desc           
> >               204.00  2.3% _raw_spin_lock_irqsave      
> >               197.00  2.2% translate_desc              
> >               193.00  2.2% memcpy_fromiovec            
> >               162.00  1.8% gup_pte_range   
> > 
> > We can compare your results with mine to see any difference.

Could you look at the guest as well?

> When binding vhost thread to cpu3, qemu I/O thread to cpu2, macvtap zero
> copy patch can get 9.4Gb/s. 
> 
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168.10.74) port 0 AF_INET : cpu bind
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  65536    60.00      9408.19   55.69    8.45     0.970   0.589
> 
> Shirley

OTOH CPU utilization is up too.

-- 
MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-09-03 10:14   ` Michael S. Tsirkin
@ 2010-09-03 20:29     ` Sridhar Samudrala
  0 siblings, 0 replies; 64+ messages in thread
From: Sridhar Samudrala @ 2010-09-03 20:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem,
	herbert, jdike

On Fri, 2010-09-03 at 13:14 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 10, 2010 at 06:23:24PM -0700, Shirley Ma wrote:
> > Hello Xiaohui,
> > 
> > On Fri, 2010-08-06 at 17:23 +0800, xiaohui.xin@intel.com wrote:
> > > Our goal is to improve the bandwidth and reduce the CPU usage.
> > > Exact performance data will be provided later. 
> > 
> > Have you had any performance data to share here? I tested my
> > experimental macvtap zero copy for TX only. The performance I have seen
> > as below without any tuning, (default setting):
> > 
> > Before: netperf 16K message size results with 60 secs run is 7.5Gb/s
> > over ixgbe 10GbE card. perf top shows:
> > 
> > 2103.00 12.9% copy_user_generic_string
> > 1541.00  9.4% handle_tx
> > 1490.00  9.1% _raw_spin_unlock_irqrestore
> > 1361.00  8.3% _raw_spin_lock_irqsave
> > 1288.00  7.9% _raw_spin_lock
> > 924.00  5.7% vhost_worker
> > 
> > After: netperf results with 60 secs run is 8.1Gb/s, perf output:
> > 
> > 1093.00  9.9% _raw_spin_unlock_irqrestore
> > 1048.00  9.5% handle_tx
> > 934.00  8.5% _raw_spin_lock_irqsave
> > 864.00  7.9% _raw_spin_lock
> > 644.00  5.9% vhost_worker
> > 387.00  3.5% use_mm 
> > 
> > I am still working on collecting more data (latency, cpu
> > utilization...). I will let you know once I get all data for macvtap TX
> > zero copy. Also I found some vhost performance regression on the new
> > kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it.
> > 
> > Shirley
> 
> Could you please try disabling mergeable buffers, and see if this gets
> you back where you were?
> -global virtio-net-pci.mrg_rxbuf=off 

I don't think Shirley had mergeable buffers on when she ran these tests. 
The qemu patch to support mergeable buffers with vhost is not yet upstream.

Shirley is on vacation and will be back on Sept 7 and can provide more
detailed performance data and post her patch.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-08-06  9:23                           ` xiaohui.xin
  (?)
  (?)
@ 2010-09-06 11:11                           ` Michael S. Tsirkin
  2010-09-10 13:40                             ` Xin, Xiaohui
  -1 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-06 11:11 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

So - does this driver help reduce service demand signifiantly?
Some comments from looking at the code:

On Fri, Aug 06, 2010 at 05:23:41PM +0800, xiaohui.xin@intel.com wrote:
> +static struct page_info *alloc_page_info(struct page_ctor *ctor,
> +		struct kiocb *iocb, struct iovec *iov,
> +		int count, struct frag *frags,
> +		int npages, int total)
> +{
> +	int rc;
> +	int i, j, n = 0;
> +	int len;
> +	unsigned long base, lock_limit;
> +	struct page_info *info = NULL;
> +
> +	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> +	lock_limit >>= PAGE_SHIFT;

Playing with rlimit on data path, transparently to the application in this way
looks strange to me, I suspect this has unexpected security implications.
Further, applications may have other uses for locked memory
besides mpassthru - you should not just take it because it's there.

Can we have an ioctl that lets userspace configure how much
memory to lock? This ioctl will decrement the rlimit and store
the data in the device structure so we can do accounting
internally. Put it back on close or on another ioctl.
Need to be careful for when this operation gets called
again with 0 or another small value while we have locked memory -
maybe just fail with EBUSY?  or wait until it gets unlocked?
Maybe 0 can be special-cased and deactivate zero-copy?.


> +
> +	if (ctor->lock_pages + count > lock_limit && npages) {
> +		printk(KERN_INFO "exceed the locked memory rlimit.");
> +		return NULL;
> +	}
> +
> +	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);

You seem to fill in all memory, why zalloc? this is data path ...

> +
> +	if (!info)
> +		return NULL;
> +
> +	for (i = j = 0; i < count; i++) {
> +		base = (unsigned long)iov[i].iov_base;
> +		len = iov[i].iov_len;
> +
> +		if (!len)
> +			continue;
> +		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +
> +		rc = get_user_pages_fast(base, n, npages ? 1 : 0,

npages controls whether this is a write? Why?

> +				&info->pages[j]);
> +		if (rc != n)
> +			goto failed;
> +
> +		while (n--) {
> +			frags[j].offset = base & ~PAGE_MASK;
> +			frags[j].size = min_t(int, len,
> +					PAGE_SIZE - frags[j].offset);
> +			len -= frags[j].size;
> +			base += frags[j].size;
> +			j++;
> +		}
> +	}
> +
> +#ifdef CONFIG_HIGHMEM
> +	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
> +		for (i = 0; i < j; i++) {
> +			if (PageHighMem(info->pages[i]))
> +				goto failed;
> +		}
> +	}
> +#endif

Are non-highdma devices worth bothering with? If yes -
are there other limitations devices might have that we need to handle?
E.g. what about non-s/g devices or no checksum offloading?

> +		skb_push(skb, ETH_HLEN);
> +
> +		if (skb_is_gso(skb)) {
> +			hdr.hdr.hdr_len = skb_headlen(skb);
> +			hdr.hdr.gso_size = shinfo->gso_size;
> +			if (shinfo->gso_type & SKB_GSO_TCPV4)
> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +			else if (shinfo->gso_type & SKB_GSO_TCPV6)
> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +			else if (shinfo->gso_type & SKB_GSO_UDP)
> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
> +			else
> +				BUG();
> +			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
> +				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> +
> +		} else
> +			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +
> +		if (skb->ip_summed == CHECKSUM_PARTIAL) {
> +			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +			hdr.hdr.csum_start =
> +				skb->csum_start - skb_headroom(skb);
> +			hdr.hdr.csum_offset = skb->csum_offset;
> +		}

We have this code in tun, macvtap and packet socket already.
Could this be a good time to move these into networking core?
I'm not asking you to do this right now, but could this generic
virtio-net to skb stuff be encapsulated in functions?

-- 
MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-06 11:11                           ` [RFC PATCH v9 12/16] Add mp(mediate passthru) device Michael S. Tsirkin
@ 2010-09-10 13:40                             ` Xin, Xiaohui
  2010-09-11  7:41                               ` Xin, Xiaohui
  2010-09-11  9:42                               ` Xin, Xiaohui
  0 siblings, 2 replies; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-10 13:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

Michael,
Sorry to reply the mail late.

>So - does this driver help reduce service demand signifiantly?

I'm looking at the performance now.

>Some comments from looking at the code:
>
>On Fri, Aug 06, 2010 at 05:23:41PM +0800, xiaohui.xin@intel.com wrote:
>> +static struct page_info *alloc_page_info(struct page_ctor *ctor,
>> +		struct kiocb *iocb, struct iovec *iov,
>> +		int count, struct frag *frags,
>> +		int npages, int total)
>> +{
>> +	int rc;
>> +	int i, j, n = 0;
>> +	int len;
>> +	unsigned long base, lock_limit;
>> +	struct page_info *info = NULL;
>> +
>> +	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
>> +	lock_limit >>= PAGE_SHIFT;
>
>Playing with rlimit on data path, transparently to the application in this way
>looks strange to me, I suspect this has unexpected security implications.
>Further, applications may have other uses for locked memory
>besides mpassthru - you should not just take it because it's there.
>
>Can we have an ioctl that lets userspace configure how much
>memory to lock? This ioctl will decrement the rlimit and store
>the data in the device structure so we can do accounting
>internally. Put it back on close or on another ioctl.

Yes, we can decrement the rlimit in ioctl in one time to avoid
data path.

>Need to be careful for when this operation gets called
>again with 0 or another small value while we have locked memory -
>maybe just fail with EBUSY?  or wait until it gets unlocked?
>Maybe 0 can be special-cased and deactivate zero-copy?.
>
In fact, if we choose RLIMIT_MEMLOCK to limit the lock memory,
the default value is only 16 pages. It's too small to make the device to 
work. So we always to configure it with a large value. 
I think, if rlimit value after decrement is < 0, then deactivate zero-copy 
is better. 0 maybe ok.

>> +
>> +	if (ctor->lock_pages + count > lock_limit && npages) {
>> +		printk(KERN_INFO "exceed the locked memory rlimit.");
>> +		return NULL;
>> +	}
>> +
>> +	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
>
>You seem to fill in all memory, why zalloc? this is data path ...

Ok, Let me check this.

>
>> +
>> +	if (!info)
>> +		return NULL;
>> +
>> +	for (i = j = 0; i < count; i++) {
>> +		base = (unsigned long)iov[i].iov_base;
>> +		len = iov[i].iov_len;
>> +
>> +		if (!len)
>> +			continue;
>> +		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
>> +
>> +		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
>
>npages controls whether this is a write? Why?

We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() with npages = 0.

>
>> +				&info->pages[j]);
>> +		if (rc != n)
>> +			goto failed;
>> +
>> +		while (n--) {
>> +			frags[j].offset = base & ~PAGE_MASK;
>> +			frags[j].size = min_t(int, len,
>> +					PAGE_SIZE - frags[j].offset);
>> +			len -= frags[j].size;
>> +			base += frags[j].size;
>> +			j++;
>> +		}
>> +	}
>> +
>> +#ifdef CONFIG_HIGHMEM
>> +	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
>> +		for (i = 0; i < j; i++) {
>> +			if (PageHighMem(info->pages[i]))
>> +				goto failed;
>> +		}
>> +	}
>> +#endif
>
>Are non-highdma devices worth bothering with? If yes -
>are there other limitations devices might have that we need to handle?
>E.g. what about non-s/g devices or no checksum offloading?.

Basically I think there is no limitations for both, but let me have a check.

>
>> +		skb_push(skb, ETH_HLEN);
>> +
>> +		if (skb_is_gso(skb)) {
>> +			hdr.hdr.hdr_len = skb_headlen(skb);
>> +			hdr.hdr.gso_size = shinfo->gso_size;
>> +			if (shinfo->gso_type & SKB_GSO_TCPV4)
>> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +			else if (shinfo->gso_type & SKB_GSO_TCPV6)
>> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>> +			else if (shinfo->gso_type & SKB_GSO_UDP)
>> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
>> +			else
>> +				BUG();
>> +			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
>> +				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>> +
>> +		} else
>> +			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
>> +
>> +		if (skb->ip_summed == CHECKSUM_PARTIAL) {
>> +			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +			hdr.hdr.csum_start =
>> +				skb->csum_start - skb_headroom(skb);
>> +			hdr.hdr.csum_offset = skb->csum_offset;
>> +		}
>
>We have this code in tun, macvtap and packet socket already.
>Could this be a good time to move these into networking core?
>I'm not asking you to do this right now, but could this generic
>virtio-net to skb stuff be encapsulated in functions?

It seems reasonable.

>
>--
>MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-10 13:40                             ` Xin, Xiaohui
@ 2010-09-11  7:41                               ` Xin, Xiaohui
  2010-09-12 13:37                                 ` Michael S. Tsirkin
  2010-09-11  9:42                               ` Xin, Xiaohui
  1 sibling, 1 reply; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-11  7:41 UTC (permalink / raw)
  To: Xin, Xiaohui, Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>>Playing with rlimit on data path, transparently to the application in this way
>>looks strange to me, I suspect this has unexpected security implications.
>>Further, applications may have other uses for locked memory
>>besides mpassthru - you should not just take it because it's there.
>>
>>Can we have an ioctl that lets userspace configure how much
>>memory to lock? This ioctl will decrement the rlimit and store
>>the data in the device structure so we can do accounting
>>internally. Put it back on close or on another ioctl.
>Yes, we can decrement the rlimit in ioctl in one time to avoid
>data path.
>
>>Need to be careful for when this operation gets called
>>again with 0 or another small value while we have locked memory -
>>maybe just fail with EBUSY?  or wait until it gets unlocked?
>>Maybe 0 can be special-cased and deactivate zero-copy?.
>>

How about we don't use a new ioctl, but just check the rlimit 
in one MPASSTHRU_BINDDEV ioctl? If we find mp device
break the rlimit, then we fail the bind ioctl, and thus can't do 
zero copy any more.

>In fact, if we choose RLIMIT_MEMLOCK to limit the lock memory,
>the default value is only 16 pages. It's too small to make the device to
>work. So we always to configure it with a large value.
>I think, if rlimit value after decrement is < 0, then deactivate zero-copy
>is better. 0 maybe ok.
>

>>> +
>>> +	if (ctor->lock_pages + count > lock_limit && npages) {
>>> +		printk(KERN_INFO "exceed the locked memory rlimit.");
>>> +		return NULL;
>>> +	}
>>> +
>>> +	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
>>
>>You seem to fill in all memory, why zalloc? this is data path ...
>
>Ok, Let me check this.
>
>>
>>> +
>>> +	if (!info)
>>> +		return NULL;
>>> +
>>> +	for (i = j = 0; i < count; i++) {
>>> +		base = (unsigned long)iov[i].iov_base;
>>> +		len = iov[i].iov_len;
>>> +
>>> +		if (!len)
>>> +			continue;
>>> +		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
>>> +
>>> +		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
>>
>>npages controls whether this is a write? Why?
>
>We use npages as a flag here. In mp_sendmsg(), we called alloc_page_info() with npages =
>0.
>
>>
>>> +				&info->pages[j]);
>>> +		if (rc != n)
>>> +			goto failed;
>>> +
>>> +		while (n--) {
>>> +			frags[j].offset = base & ~PAGE_MASK;
>>> +			frags[j].size = min_t(int, len,
>>> +					PAGE_SIZE - frags[j].offset);
>>> +			len -= frags[j].size;
>>> +			base += frags[j].size;
>>> +			j++;
>>> +		}
>>> +	}
>>> +
>>> +#ifdef CONFIG_HIGHMEM
>>> +	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
>>> +		for (i = 0; i < j; i++) {
>>> +			if (PageHighMem(info->pages[i]))
>>> +				goto failed;
>>> +		}
>>> +	}
>>> +#endif
>>
>>Are non-highdma devices worth bothering with? If yes -
>>are there other limitations devices might have that we need to handle?
>>E.g. what about non-s/g devices or no checksum offloading?.
>
>Basically I think there is no limitations for both, but let me have a check.
>
>>
>>> +		skb_push(skb, ETH_HLEN);
>>> +
>>> +		if (skb_is_gso(skb)) {
>>> +			hdr.hdr.hdr_len = skb_headlen(skb);
>>> +			hdr.hdr.gso_size = shinfo->gso_size;
>>> +			if (shinfo->gso_type & SKB_GSO_TCPV4)
>>> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>>> +			else if (shinfo->gso_type & SKB_GSO_TCPV6)
>>> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>>> +			else if (shinfo->gso_type & SKB_GSO_UDP)
>>> +				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
>>> +			else
>>> +				BUG();
>>> +			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
>>> +				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>>> +
>>> +		} else
>>> +			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
>>> +
>>> +		if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>> +			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>>> +			hdr.hdr.csum_start =
>>> +				skb->csum_start - skb_headroom(skb);
>>> +			hdr.hdr.csum_offset = skb->csum_offset;
>>> +		}
>>
>>We have this code in tun, macvtap and packet socket already.
>>Could this be a good time to move these into networking core?
>>I'm not asking you to do this right now, but could this generic
>>virtio-net to skb stuff be encapsulated in functions?
>
>It seems reasonable.
>
>>
>>--
>>MST
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-10 13:40                             ` Xin, Xiaohui
  2010-09-11  7:41                               ` Xin, Xiaohui
@ 2010-09-11  9:42                               ` Xin, Xiaohui
  1 sibling, 0 replies; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-11  9:42 UTC (permalink / raw)
  To: Xin, Xiaohui, Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>>> +
>>> +	if (ctor->lock_pages + count > lock_limit && npages) {
>>> +		printk(KERN_INFO "exceed the locked memory rlimit.");
>>> +		return NULL;
>>> +	}
>>> +
>>> +	info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
>>
>>You seem to fill in all memory, why zalloc? this is data path ...
>
>Ok, Let me check this.

It's mainly for info->next and info->prev, these two fields will be used in hash functions.
But you are right, since most fields will be refilled. The new version includes the fix. 

Thanks
Xiaohui

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-11  7:41                               ` Xin, Xiaohui
@ 2010-09-12 13:37                                 ` Michael S. Tsirkin
  2010-09-15  3:13                                   ` Xin, Xiaohui
  0 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-12 13:37 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
> >>Playing with rlimit on data path, transparently to the application in this way
> >>looks strange to me, I suspect this has unexpected security implications.
> >>Further, applications may have other uses for locked memory
> >>besides mpassthru - you should not just take it because it's there.
> >>
> >>Can we have an ioctl that lets userspace configure how much
> >>memory to lock? This ioctl will decrement the rlimit and store
> >>the data in the device structure so we can do accounting
> >>internally. Put it back on close or on another ioctl.
> >Yes, we can decrement the rlimit in ioctl in one time to avoid
> >data path.
> >
> >>Need to be careful for when this operation gets called
> >>again with 0 or another small value while we have locked memory -
> >>maybe just fail with EBUSY?  or wait until it gets unlocked?
> >>Maybe 0 can be special-cased and deactivate zero-copy?.
> >>
> 
> How about we don't use a new ioctl, but just check the rlimit 
> in one MPASSTHRU_BINDDEV ioctl? If we find mp device
> break the rlimit, then we fail the bind ioctl, and thus can't do 
> zero copy any more.

Yes, and not just check, but decrement as well.
I think we should give userspace control over
how much memory we can lock and subtract from the rlimit.
It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
Then increment the rlimit back on unbind and on close?

This opens up an interesting condition: process 1
calls bind, process 2 calls unbind or close.
This will increment rlimit for process 2.
Not sure how to fix this properly.

-- 
MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-09-03 10:52         ` Michael S. Tsirkin
@ 2010-09-13 18:48           ` Shirley Ma
  2010-09-13 21:35           ` Shirley Ma
  1 sibling, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-09-13 18:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Fri, 2010-09-03 at 13:52 +0300, Michael S. Tsirkin wrote:
> Could you look at the guest as well?

I just finished the patch built against the recent kernel. I will submit
perf data for both guest, host along with the patch.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.
  2010-09-03 10:52         ` Michael S. Tsirkin
  2010-09-13 18:48           ` Shirley Ma
@ 2010-09-13 21:35           ` Shirley Ma
  1 sibling, 0 replies; 64+ messages in thread
From: Shirley Ma @ 2010-09-13 21:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Fri, 2010-09-03 at 13:52 +0300, Michael S. Tsirkin wrote:
> > When binding vhost thread to cpu3, qemu I/O thread to cpu2, macvtap
> zero
> > copy patch can get 9.4Gb/s. 
> > 
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.74 (192.168.10.74) port 0 AF_INET : cpu bind
> > Recv   Send    Send                          Utilization
> Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send
> Recv
> > Size   Size    Size     Time     Throughput  local    remote   local
> remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB
> us/KB
> > 
> >  87380  16384  65536    60.00      9408.19   55.69    8.45     0.970
> 0.589
> > 
> > Shirley
> 
> OTOH CPU utilization is up too.

w/i macvtap zero copy patch, the BW can reach link w/i more cpu usage,
w/o macvtap zero copy patch, the BW can't be up to link speed. To
achieve same BW, CPU utilization is lower when using zero copy.

Shirley


^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-12 13:37                                 ` Michael S. Tsirkin
@ 2010-09-15  3:13                                   ` Xin, Xiaohui
  2010-09-15 11:28                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-15  3:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>From: Michael S. Tsirkin [mailto:mst@redhat.com]
>Sent: Sunday, September 12, 2010 9:37 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
>> >>Playing with rlimit on data path, transparently to the application in this way
>> >>looks strange to me, I suspect this has unexpected security implications.
>> >>Further, applications may have other uses for locked memory
>> >>besides mpassthru - you should not just take it because it's there.
>> >>
>> >>Can we have an ioctl that lets userspace configure how much
>> >>memory to lock? This ioctl will decrement the rlimit and store
>> >>the data in the device structure so we can do accounting
>> >>internally. Put it back on close or on another ioctl.
>> >Yes, we can decrement the rlimit in ioctl in one time to avoid
>> >data path.
>> >
>> >>Need to be careful for when this operation gets called
>> >>again with 0 or another small value while we have locked memory -
>> >>maybe just fail with EBUSY?  or wait until it gets unlocked?
>> >>Maybe 0 can be special-cased and deactivate zero-copy?.
>> >>
>>
>> How about we don't use a new ioctl, but just check the rlimit
>> in one MPASSTHRU_BINDDEV ioctl? If we find mp device
>> break the rlimit, then we fail the bind ioctl, and thus can't do
>> zero copy any more.
>
>Yes, and not just check, but decrement as well.
>I think we should give userspace control over
>how much memory we can lock and subtract from the rlimit.
>It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
>Then increment the rlimit back on unbind and on close?
>
>This opens up an interesting condition: process 1
>calls bind, process 2 calls unbind or close.
>This will increment rlimit for process 2.
>Not sure how to fix this properly.
>
I can't too, can we do any synchronous operations on rlimit stuff?
I quite suspect in it.
 
>--
>MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-15  3:13                                   ` Xin, Xiaohui
@ 2010-09-15 11:28                                     ` Michael S. Tsirkin
  2010-09-17  3:16                                       ` Xin, Xiaohui
  2010-09-20  8:08                                       ` xiaohui.xin
  0 siblings, 2 replies; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15 11:28 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Wed, Sep 15, 2010 at 11:13:44AM +0800, Xin, Xiaohui wrote:
> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >Sent: Sunday, September 12, 2010 9:37 PM
> >To: Xin, Xiaohui
> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >jdike@linux.intel.com
> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >
> >On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
> >> >>Playing with rlimit on data path, transparently to the application in this way
> >> >>looks strange to me, I suspect this has unexpected security implications.
> >> >>Further, applications may have other uses for locked memory
> >> >>besides mpassthru - you should not just take it because it's there.
> >> >>
> >> >>Can we have an ioctl that lets userspace configure how much
> >> >>memory to lock? This ioctl will decrement the rlimit and store
> >> >>the data in the device structure so we can do accounting
> >> >>internally. Put it back on close or on another ioctl.
> >> >Yes, we can decrement the rlimit in ioctl in one time to avoid
> >> >data path.
> >> >
> >> >>Need to be careful for when this operation gets called
> >> >>again with 0 or another small value while we have locked memory -
> >> >>maybe just fail with EBUSY?  or wait until it gets unlocked?
> >> >>Maybe 0 can be special-cased and deactivate zero-copy?.
> >> >>
> >>
> >> How about we don't use a new ioctl, but just check the rlimit
> >> in one MPASSTHRU_BINDDEV ioctl? If we find mp device
> >> break the rlimit, then we fail the bind ioctl, and thus can't do
> >> zero copy any more.
> >
> >Yes, and not just check, but decrement as well.
> >I think we should give userspace control over
> >how much memory we can lock and subtract from the rlimit.
> >It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
> >Then increment the rlimit back on unbind and on close?
> >
> >This opens up an interesting condition: process 1
> >calls bind, process 2 calls unbind or close.
> >This will increment rlimit for process 2.
> >Not sure how to fix this properly.
> >
> I can't too, can we do any synchronous operations on rlimit stuff?
> I quite suspect in it.
>  
> >--
> >MST

Here's what infiniband does: simply pass the amount of memory userspace
wants you to lock on an ioctl, and verify that either you have
CAP_IPC_LOCK or this number does not exceed the current rlimit.  (must
be on ioctl, not on open, as we likely want the fd passed around between
processes), but do not decrement rlimit.  Use this on following
operations.  Be careful if this can be changed while operations are in
progress.

This does mean that the effective amount of memory that userspace can
lock is doubled, but at least it is not unlimited, and we sidestep all
other issues such as userspace running out of lockable memory simply by
virtue of using the driver.

-- 
MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-15 11:28                                     ` Michael S. Tsirkin
@ 2010-09-17  3:16                                       ` Xin, Xiaohui
  2010-09-20  8:08                                       ` xiaohui.xin
  1 sibling, 0 replies; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-17  3:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>From: Michael S. Tsirkin [mailto:mst@redhat.com]
>Sent: Wednesday, September 15, 2010 7:28 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Wed, Sep 15, 2010 at 11:13:44AM +0800, Xin, Xiaohui wrote:
>> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >Sent: Sunday, September 12, 2010 9:37 PM
>> >To: Xin, Xiaohui
>> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >jdike@linux.intel.com
>> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >
>> >On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
>> >> >>Playing with rlimit on data path, transparently to the application in this way
>> >> >>looks strange to me, I suspect this has unexpected security implications.
>> >> >>Further, applications may have other uses for locked memory
>> >> >>besides mpassthru - you should not just take it because it's there.
>> >> >>
>> >> >>Can we have an ioctl that lets userspace configure how much
>> >> >>memory to lock? This ioctl will decrement the rlimit and store
>> >> >>the data in the device structure so we can do accounting
>> >> >>internally. Put it back on close or on another ioctl.
>> >> >Yes, we can decrement the rlimit in ioctl in one time to avoid
>> >> >data path.
>> >> >
>> >> >>Need to be careful for when this operation gets called
>> >> >>again with 0 or another small value while we have locked memory -
>> >> >>maybe just fail with EBUSY?  or wait until it gets unlocked?
>> >> >>Maybe 0 can be special-cased and deactivate zero-copy?.
>> >> >>
>> >>
>> >> How about we don't use a new ioctl, but just check the rlimit
>> >> in one MPASSTHRU_BINDDEV ioctl? If we find mp device
>> >> break the rlimit, then we fail the bind ioctl, and thus can't do
>> >> zero copy any more.
>> >
>> >Yes, and not just check, but decrement as well.
>> >I think we should give userspace control over
>> >how much memory we can lock and subtract from the rlimit.
>> >It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
>> >Then increment the rlimit back on unbind and on close?
>> >
>> >This opens up an interesting condition: process 1
>> >calls bind, process 2 calls unbind or close.
>> >This will increment rlimit for process 2.
>> >Not sure how to fix this properly.
>> >
>> I can't too, can we do any synchronous operations on rlimit stuff?
>> I quite suspect in it.
>>
>> >--
>> >MST
>
>Here's what infiniband does: simply pass the amount of memory userspace
>wants you to lock on an ioctl, and verify that either you have
>CAP_IPC_LOCK or this number does not exceed the current rlimit.  (must
>be on ioctl, not on open, as we likely want the fd passed around between
>processes), but do not decrement rlimit.  Use this on following
>operations.  Be careful if this can be changed while operations are in
>progress.
>
>This does mean that the effective amount of memory that userspace can
>lock is doubled, but at least it is not unlimited, and we sidestep all
>other issues such as userspace running out of lockable memory simply by
>virtue of using the driver.
>

What I have done in mp device is almost the same as it. The difference is that
I do not check the capability, and I use my own parameter ctor->pages instead
of mm->locked_vm.

So currently, 1) add the capability check 2) use mm->locked_vm 3) add
an ioctl for userspace to configure how much memory can lock.
 
>--
>MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-15 11:28                                     ` Michael S. Tsirkin
  2010-09-17  3:16                                       ` Xin, Xiaohui
@ 2010-09-20  8:08                                       ` xiaohui.xin
  2010-09-20 11:36                                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 64+ messages in thread
From: xiaohui.xin @ 2010-09-20  8:08 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike, mst; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

---
Michael,
I have move the ioctl to configure the locked memory to vhost and 
check the limit with mm->locked_vm. please have a look.

Thanks
Xiaohui

 drivers/vhost/mpassthru.c |   74 +++++++++----------------------------------
 drivers/vhost/net.c       |   78 ++++++++++++++++++++++++++++++++++++++------
 include/linux/vhost.h     |    3 ++
 3 files changed, 85 insertions(+), 70 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index d86d94c..fd3827b 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -109,9 +109,6 @@ struct page_ctor {
 	int			wq_len;
 	int			rq_len;
 	spinlock_t		read_lock;
-	/* record the locked pages */
-	int			lock_pages;
-	struct rlimit		o_rlim;
 	struct net_device	*dev;
 	struct mpassthru_port	port;
 	struct page_info	**hash_table;
@@ -231,7 +228,6 @@ static int page_ctor_attach(struct mp_struct *mp)
 	ctor->port.ctor = page_ctor;
 	ctor->port.sock = &mp->socket;
 	ctor->port.hash = mp_lookup;
-	ctor->lock_pages = 0;
 
 	/* locked by mp_mutex */
 	dev->mp_port = &ctor->port;
@@ -264,37 +260,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
 	return info;
 }
 
-static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
-			      unsigned long cur, unsigned long max)
-{
-	struct rlimit new_rlim, *old_rlim;
-	int retval;
-
-	if (resource != RLIMIT_MEMLOCK)
-		return -EINVAL;
-	new_rlim.rlim_cur = cur;
-	new_rlim.rlim_max = max;
-
-	old_rlim = current->signal->rlim + resource;
-
-	/* remember the old rlimit value when backend enabled */
-	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
-	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
-
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
-			!capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
-	retval = security_task_setrlimit(resource, &new_rlim);
-	if (retval)
-		return retval;
-
-	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
-	task_unlock(current->group_leader);
-	return 0;
-}
-
 static void relinquish_resource(struct page_ctor *ctor)
 {
 	if (!(ctor->dev->flags & IFF_UP) &&
@@ -322,8 +287,6 @@ static void mp_ki_dtor(struct kiocb *iocb)
 		info->ctor->rq_len--;
 	} else
 		info->ctor->wq_len--;
-	/* Decrement the number of locked pages */
-	info->ctor->lock_pages -= info->pnum;
 	kmem_cache_free(ext_page_info_cache, info);
 	relinquish_resource(info->ctor);
 
@@ -349,7 +312,7 @@ static struct kiocb *create_iocb(struct page_info *info, int size)
 	iocb->ki_dtor(iocb);
 	iocb->private = (void *)info;
 	iocb->ki_dtor = mp_ki_dtor;
-
+	iocb->ki_user_data = info->pnum;
 	return iocb;
 }
 
@@ -375,10 +338,6 @@ static int page_ctor_detach(struct mp_struct *mp)
 
 	relinquish_resource(ctor);
 
-	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
-			   ctor->o_rlim.rlim_cur,
-			   ctor->o_rlim.rlim_max);
-
 	/* locked by mp_mutex */
 	ctor->dev->mp_port = NULL;
 	dev_put(ctor->dev);
@@ -565,21 +524,23 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
 	int rc;
 	int i, j, n = 0;
 	int len;
-	unsigned long base, lock_limit;
+	unsigned long base, lock_limit, locked;
 	struct page_info *info = NULL;
 
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-	lock_limit >>= PAGE_SHIFT;
+	down_write(&current->mm->mmap_sem);
+	locked     = count + current->mm->locked_vm;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
-	if (ctor->lock_pages + count > lock_limit && npages) {
-		printk(KERN_INFO "exceed the locked memory rlimit.");
-		return NULL;
-	}
+	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK))
+		goto out;
 
 	info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
 	
 	if (!info)
-		return NULL;
+		goto out;
+
+	up_write(&current->mm->mmap_sem);
+
 	info->skb = NULL;
 	info->next = info->prev = NULL;
 
@@ -633,8 +594,7 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
 		for (i = 0; i < j; i++)
 			mp_hash_insert(ctor, info->pages[i], info);
 	}
-	/* increment the number of locked pages */
-	ctor->lock_pages += j;
+
 	return info;
 
 failed:
@@ -642,7 +602,9 @@ failed:
 		put_page(info->pages[i]);
 
 	kmem_cache_free(ext_page_info_cache, info);
-
+	return NULL;
+out:
+	up(&current->mm->mmap_sem);
 	return NULL;
 }
 
@@ -1006,12 +968,6 @@ proceed:
 		count--;
 	}
 
-	if (!ctor->lock_pages || !ctor->rq_len) {
-		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
-				iocb->ki_user_data * 4096 * 2,
-				iocb->ki_user_data * 4096 * 2);
-	}
-
 	/* Translate address to kernel */
 	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
 	if (!info)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c4bc815..da78837 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,6 +42,7 @@ enum {
 };
 
 static struct kmem_cache *notify_cache;
+static struct rlimit orig_rlim;
 
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
@@ -136,13 +137,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 	struct vhost_log *vq_log = NULL;
 	int rx_total_len = 0;
 	unsigned int head, log, in, out;
-	int size;
-	int count;
-
-	struct virtio_net_hdr_mrg_rxbuf hdr = {
-		.hdr.flags = 0,
-		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
-	};
+	int size, free = 0;
 
 	if (!is_async_vq(vq))
 		return;
@@ -160,7 +155,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 			size = iocb->ki_nbytes;
 			head = iocb->ki_pos;
 			rx_total_len += iocb->ki_nbytes;
-
+			free += iocb->ki_user_data;
 			if (iocb->ki_dtor)
 				iocb->ki_dtor(iocb);
 			kmem_cache_free(net->cache, iocb);
@@ -192,6 +187,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 					size = iocb->ki_nbytes;
 					head = iocb->ki_pos;
 					rx_total_len += iocb->ki_nbytes;
+					free += iocb->ki_user_data;
 
 					if (iocb->ki_dtor)
 						iocb->ki_dtor(iocb);
@@ -211,7 +207,6 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 					break;
 
 				i++;
-				iocb == NULL;
 				if (count)
 					iocb = notify_dequeue(vq);
 			}
@@ -219,6 +214,10 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
 					&net->dev, vq, vq->heads, hc);
 		}
 	}
+	/* record locked memroy */
+	down_write(&current->mm->mmap_sem);
+	current->mm->locked_vm -= free;
+	up_write(&current->mm->mmap_sem);
 }
 
 static void handle_async_tx_events_notify(struct vhost_net *net,
@@ -227,7 +226,7 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
 	struct kiocb *iocb = NULL;
 	struct list_head *entry, *tmp;
 	unsigned long flags;
-	int tx_total_len = 0;
+	int tx_total_len = 0, free = 0;
 
 	if (!is_async_vq(vq))
 		return;
@@ -242,7 +241,7 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
 		vhost_add_used_and_signal(&net->dev, vq,
 				iocb->ki_pos, 0);
 		tx_total_len += iocb->ki_nbytes;
-
+		free += iocb->ki_user_data;
 		if (iocb->ki_dtor)
 			iocb->ki_dtor(iocb);
 
@@ -253,6 +252,10 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
 		}
 	}
 	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	/* record locked memroy */
+	down_write(&current->mm->mmap_sem);
+	current->mm->locked_vm -= free;
+	up_write(&current->mm->mmap_sem);
 }
 
 static struct kiocb *create_iocb(struct vhost_net *net,
@@ -581,6 +584,7 @@ static void handle_rx_net(struct work_struct *work)
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct rlimit *old_rlim;
 	int r;
 	if (!n)
 		return -ENOMEM;
@@ -597,6 +601,12 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 	n->cache = NULL;
 
+	old_rlim = current->signal->rlim + RLIMIT_MEMLOCK;
+
+	/* remember the old rlimit value when backend enabled */
+	orig_rlim.rlim_cur = old_rlim->rlim_cur;
+	orig_rlim.rlim_max = old_rlim->rlim_max;
+
 	f->private_data = n;
 
 	return 0;
@@ -659,6 +669,39 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static long vhost_net_set_mem_locked(struct vhost_net *n,
+				     unsigned long cur,
+				     unsigned long max)
+{
+	struct rlimit new_rlim, *old_rlim;
+	int retval = 0;
+
+	mutex_lock(&n->dev.mutex);
+	new_rlim.rlim_cur = cur;
+	new_rlim.rlim_max = max;
+
+	old_rlim = current->signal->rlim + RLIMIT_MEMLOCK;
+
+	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+			!capable(CAP_SYS_RESOURCE)) {
+		retval = -EPERM;
+		goto err;
+	}
+
+	retval = security_task_setrlimit(RLIMIT_MEMLOCK, &new_rlim);
+	if (retval) {
+		retval = retval;
+		goto err;
+	}
+
+	task_lock(current->group_leader);
+	*old_rlim = new_rlim;
+	task_unlock(current->group_leader);
+err:
+	mutex_unlock(&n->dev.mutex);
+	return retval;
+}
+
 static void vhost_async_cleanup(struct vhost_net *n)
 {
 	/* clean the notifier */
@@ -691,6 +734,10 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
 	vhost_async_cleanup(n);
+	/* return back the rlimit */
+	vhost_net_set_mem_locked(n,
+				 orig_rlim.rlim_cur,
+				 orig_rlim.rlim_max);
 	kfree(n);
 	return 0;
 }
@@ -846,6 +893,7 @@ err:
 	return r;
 }
 
+
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
 	struct socket *tx_sock = NULL;
@@ -913,6 +961,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 	void __user *argp = (void __user *)arg;
 	u64 __user *featurep = argp;
 	struct vhost_vring_file backend;
+	struct rlimit rlim;
 	u64 features;
 	int r;
 	switch (ioctl) {
@@ -933,6 +982,13 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 		return vhost_net_set_features(n, features);
 	case VHOST_RESET_OWNER:
 		return vhost_net_reset_owner(n);
+	case VHOST_SET_MEM_LOCKED:
+		r = copy_from_user(&rlim, argp, sizeof rlim);
+		if (r < 0)
+			return r;
+		return vhost_net_set_mem_locked(n,
+						rlim.rlim_cur,
+						rlim.rlim_max);
 	default:
 		mutex_lock(&n->dev.mutex);
 		r = vhost_dev_ioctl(&n->dev, ioctl, arg);
diff --git a/include/linux/vhost.h b/include/linux/vhost.h
index e847f1e..df93f5a 100644
--- a/include/linux/vhost.h
+++ b/include/linux/vhost.h
@@ -92,6 +92,9 @@ struct vhost_memory {
 /* Specify an eventfd file descriptor to signal on log write. */
 #define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
 
+/* Specify how much locked memory can be used */
+#define VHOST_SET_MEM_LOCKED	_IOW(VHOST_VIRTIO, 0x08, struct rlimit)
+
 /* Ring setup. */
 /* Set number of descriptors in ring. This parameter can not
  * be modified while ring is running (bound to a device). */
-- 
1.5.4.4


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-20  8:08                                       ` xiaohui.xin
@ 2010-09-20 11:36                                         ` Michael S. Tsirkin
  2010-09-21  1:39                                           ` Xin, Xiaohui
  0 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-20 11:36 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> ---
> Michael,
> I have move the ioctl to configure the locked memory to vhost

It's ok to move this to vhost but vhost does not
know how much memory is needed by the backend.
So I think we'll need another ioctl in the backend
to tell userspace how much memory is needed?

It seems a bit cleaner as a backend ioctl as vhost
does not lock memory itself, but I am not
principally opposed.

> and 
> check the limit with mm->locked_vm. please have a look.
> 
> Thanks
> Xiaohui
> 
>  drivers/vhost/mpassthru.c |   74 +++++++++----------------------------------
>  drivers/vhost/net.c       |   78 ++++++++++++++++++++++++++++++++++++++------
>  include/linux/vhost.h     |    3 ++
>  3 files changed, 85 insertions(+), 70 deletions(-)
> 
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> index d86d94c..fd3827b 100644
> --- a/drivers/vhost/mpassthru.c
> +++ b/drivers/vhost/mpassthru.c
> @@ -109,9 +109,6 @@ struct page_ctor {
>  	int			wq_len;
>  	int			rq_len;
>  	spinlock_t		read_lock;
> -	/* record the locked pages */
> -	int			lock_pages;
> -	struct rlimit		o_rlim;
>  	struct net_device	*dev;
>  	struct mpassthru_port	port;
>  	struct page_info	**hash_table;
> @@ -231,7 +228,6 @@ static int page_ctor_attach(struct mp_struct *mp)
>  	ctor->port.ctor = page_ctor;
>  	ctor->port.sock = &mp->socket;
>  	ctor->port.hash = mp_lookup;
> -	ctor->lock_pages = 0;
>  
>  	/* locked by mp_mutex */
>  	dev->mp_port = &ctor->port;
> @@ -264,37 +260,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
>  	return info;
>  }
>  
> -static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
> -			      unsigned long cur, unsigned long max)
> -{
> -	struct rlimit new_rlim, *old_rlim;
> -	int retval;
> -
> -	if (resource != RLIMIT_MEMLOCK)
> -		return -EINVAL;
> -	new_rlim.rlim_cur = cur;
> -	new_rlim.rlim_max = max;
> -
> -	old_rlim = current->signal->rlim + resource;
> -
> -	/* remember the old rlimit value when backend enabled */
> -	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
> -	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
> -
> -	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
> -			!capable(CAP_SYS_RESOURCE))
> -		return -EPERM;
> -
> -	retval = security_task_setrlimit(resource, &new_rlim);
> -	if (retval)
> -		return retval;
> -
> -	task_lock(current->group_leader);
> -	*old_rlim = new_rlim;
> -	task_unlock(current->group_leader);
> -	return 0;
> -}
> -
>  static void relinquish_resource(struct page_ctor *ctor)
>  {
>  	if (!(ctor->dev->flags & IFF_UP) &&
> @@ -322,8 +287,6 @@ static void mp_ki_dtor(struct kiocb *iocb)
>  		info->ctor->rq_len--;
>  	} else
>  		info->ctor->wq_len--;
> -	/* Decrement the number of locked pages */
> -	info->ctor->lock_pages -= info->pnum;
>  	kmem_cache_free(ext_page_info_cache, info);
>  	relinquish_resource(info->ctor);
>  
> @@ -349,7 +312,7 @@ static struct kiocb *create_iocb(struct page_info *info, int size)
>  	iocb->ki_dtor(iocb);
>  	iocb->private = (void *)info;
>  	iocb->ki_dtor = mp_ki_dtor;
> -
> +	iocb->ki_user_data = info->pnum;
>  	return iocb;
>  }
>  
> @@ -375,10 +338,6 @@ static int page_ctor_detach(struct mp_struct *mp)
>  
>  	relinquish_resource(ctor);
>  
> -	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> -			   ctor->o_rlim.rlim_cur,
> -			   ctor->o_rlim.rlim_max);
> -
>  	/* locked by mp_mutex */
>  	ctor->dev->mp_port = NULL;
>  	dev_put(ctor->dev);
> @@ -565,21 +524,23 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
>  	int rc;
>  	int i, j, n = 0;
>  	int len;
> -	unsigned long base, lock_limit;
> +	unsigned long base, lock_limit, locked;
>  	struct page_info *info = NULL;
>  
> -	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> -	lock_limit >>= PAGE_SHIFT;
> +	down_write(&current->mm->mmap_sem);
> +	locked     = count + current->mm->locked_vm;
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  
> -	if (ctor->lock_pages + count > lock_limit && npages) {
> -		printk(KERN_INFO "exceed the locked memory rlimit.");
> -		return NULL;
> -	}
> +	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK))
> +		goto out;
>  
>  	info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
>  	
>  	if (!info)
> -		return NULL;
> +		goto out;
> +
> +	up_write(&current->mm->mmap_sem);
> +
>  	info->skb = NULL;
>  	info->next = info->prev = NULL;
>  

Sorry I wasnt clear, I didn't really mean we copy everything
from infiniband, just the capability checks and locked_mm use.
These guys don't do registration on data path so
they can play games with locked_mm etc on registration.
But lock on data path, taking mmap sem and doing security checks
on data path would be bad for performance.
I would expect this to cause contention especially
as we'll go for multiqueue.

Here's what I really meant:
	SET_MEM_LOCKED gets a 32 bit integer (or a 64 bit
	if you like - just not long).
	the meaning of which is "this is how much
	memory device can lock".
	This ioctl does rlim_cur and capability checks,
	if passed immediately increments locked_vm counter
	by the *maximum amount specified*.
	Device must store the value by which we incremented
	locked_vm and the mm pointer (if this is vhost ioctl
	it has the owner already). Let's call this
	field lock_limit.


	Lock limit can also take into account
	e.g. device tx queue depth and our queue size.
	Either we give another ioctl that tells userspace
	about these and let it make the decision,
	or simply cap lock_limit ourselves
	depending on these parameters.

	If another SET_MEM_LOCKED ioctl is made,
	decrement lcoked_vm in the stored mm,
	and redo the operation on current->mm
	(note: might be different!).

	This ioctl should probably fail if backend is active
	(already has locked some pages), such an
	approach makes it easy as we do not need to
	find and unlock pages.

	Each time you want to lock some memory you check that
	1. current->mm matches the stored mm.
	2. (number of pages locked + amount we want to lock) * PAGE_SIZE <= lock_limit.


	close and RESET_OWNER decrement and drop mm reference
	(note: on close
	we decrement owner's locked_vm, not current->mm
	which might be different).




> @@ -633,8 +594,7 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
>  		for (i = 0; i < j; i++)
>  			mp_hash_insert(ctor, info->pages[i], info);
>  	}
> -	/* increment the number of locked pages */
> -	ctor->lock_pages += j;
> +
>  	return info;
>  
>  failed:
> @@ -642,7 +602,9 @@ failed:
>  		put_page(info->pages[i]);
>  
>  	kmem_cache_free(ext_page_info_cache, info);
> -
> +	return NULL;
> +out:
> +	up(&current->mm->mmap_sem);
>  	return NULL;
>  }
>  
> @@ -1006,12 +968,6 @@ proceed:
>  		count--;
>  	}
>  
> -	if (!ctor->lock_pages || !ctor->rq_len) {
> -		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> -				iocb->ki_user_data * 4096 * 2,
> -				iocb->ki_user_data * 4096 * 2);
> -	}
> -
>  	/* Translate address to kernel */
>  	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
>  	if (!info)
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c4bc815..da78837 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -42,6 +42,7 @@ enum {
>  };
>  
>  static struct kmem_cache *notify_cache;
> +static struct rlimit orig_rlim;
>  
>  enum vhost_net_poll_state {
>  	VHOST_NET_POLL_DISABLED = 0,
> @@ -136,13 +137,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>  	struct vhost_log *vq_log = NULL;
>  	int rx_total_len = 0;
>  	unsigned int head, log, in, out;
> -	int size;
> -	int count;
> -
> -	struct virtio_net_hdr_mrg_rxbuf hdr = {
> -		.hdr.flags = 0,
> -		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
> -	};
> +	int size, free = 0;
>  
>  	if (!is_async_vq(vq))
>  		return;
> @@ -160,7 +155,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>  			size = iocb->ki_nbytes;
>  			head = iocb->ki_pos;
>  			rx_total_len += iocb->ki_nbytes;
> -
> +			free += iocb->ki_user_data;
>  			if (iocb->ki_dtor)
>  				iocb->ki_dtor(iocb);
>  			kmem_cache_free(net->cache, iocb);
> @@ -192,6 +187,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>  					size = iocb->ki_nbytes;
>  					head = iocb->ki_pos;
>  					rx_total_len += iocb->ki_nbytes;
> +					free += iocb->ki_user_data;
>  
>  					if (iocb->ki_dtor)
>  						iocb->ki_dtor(iocb);
> @@ -211,7 +207,6 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>  					break;
>  
>  				i++;
> -				iocb == NULL;
>  				if (count)
>  					iocb = notify_dequeue(vq);
>  			}
> @@ -219,6 +214,10 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>  					&net->dev, vq, vq->heads, hc);
>  		}
>  	}
> +	/* record locked memroy */
> +	down_write(&current->mm->mmap_sem);
> +	current->mm->locked_vm -= free;
> +	up_write(&current->mm->mmap_sem);
>  }
>  
>  static void handle_async_tx_events_notify(struct vhost_net *net,
> @@ -227,7 +226,7 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
>  	struct kiocb *iocb = NULL;
>  	struct list_head *entry, *tmp;
>  	unsigned long flags;
> -	int tx_total_len = 0;
> +	int tx_total_len = 0, free = 0;
>  
>  	if (!is_async_vq(vq))
>  		return;
> @@ -242,7 +241,7 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
>  		vhost_add_used_and_signal(&net->dev, vq,
>  				iocb->ki_pos, 0);
>  		tx_total_len += iocb->ki_nbytes;
> -
> +		free += iocb->ki_user_data;
>  		if (iocb->ki_dtor)
>  			iocb->ki_dtor(iocb);
>  
> @@ -253,6 +252,10 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
>  		}
>  	}
>  	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +	/* record locked memroy */
> +	down_write(&current->mm->mmap_sem);
> +	current->mm->locked_vm -= free;
> +	up_write(&current->mm->mmap_sem);
>  }
>  
>  static struct kiocb *create_iocb(struct vhost_net *net,
> @@ -581,6 +584,7 @@ static void handle_rx_net(struct work_struct *work)
>  static int vhost_net_open(struct inode *inode, struct file *f)
>  {
>  	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
> +	struct rlimit *old_rlim;
>  	int r;
>  	if (!n)
>  		return -ENOMEM;
> @@ -597,6 +601,12 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
>  	n->cache = NULL;
>  
> +	old_rlim = current->signal->rlim + RLIMIT_MEMLOCK;
> +
> +	/* remember the old rlimit value when backend enabled */
> +	orig_rlim.rlim_cur = old_rlim->rlim_cur;
> +	orig_rlim.rlim_max = old_rlim->rlim_max;
> +
>  	f->private_data = n;
>  
>  	return 0;
> @@ -659,6 +669,39 @@ static void vhost_net_flush(struct vhost_net *n)
>  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
>  }
>  
> +static long vhost_net_set_mem_locked(struct vhost_net *n,
> +				     unsigned long cur,
> +				     unsigned long max)
> +{

So one issue here is that when this is called on close,
current might be different from owner, with bad results.

I really think avoiding modifying rlimit is
the simplest way to go for now.

> +	struct rlimit new_rlim, *old_rlim;
> +	int retval = 0;
> +
> +	mutex_lock(&n->dev.mutex);
> +	new_rlim.rlim_cur = cur;
> +	new_rlim.rlim_max = max;
> +
> +	old_rlim = current->signal->rlim + RLIMIT_MEMLOCK;
> +
> +	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
> +			!capable(CAP_SYS_RESOURCE)) {
> +		retval = -EPERM;
> +		goto err;
> +	}
> +
> +	retval = security_task_setrlimit(RLIMIT_MEMLOCK, &new_rlim);
> +	if (retval) {
> +		retval = retval;
> +		goto err;
> +	}
> +
> +	task_lock(current->group_leader);
> +	*old_rlim = new_rlim;
> +	task_unlock(current->group_leader);
> +err:
> +	mutex_unlock(&n->dev.mutex);
> +	return retval;
> +}
> +
>  static void vhost_async_cleanup(struct vhost_net *n)
>  {
>  	/* clean the notifier */
> @@ -691,6 +734,10 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	 * since jobs can re-queue themselves. */
>  	vhost_net_flush(n);
>  	vhost_async_cleanup(n);
> +	/* return back the rlimit */
> +	vhost_net_set_mem_locked(n,
> +				 orig_rlim.rlim_cur,
> +				 orig_rlim.rlim_max);
>  	kfree(n);
>  	return 0;
>  }
> @@ -846,6 +893,7 @@ err:
>  	return r;
>  }
>  
> +
>  static long vhost_net_reset_owner(struct vhost_net *n)
>  {
>  	struct socket *tx_sock = NULL;
> @@ -913,6 +961,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>  	void __user *argp = (void __user *)arg;
>  	u64 __user *featurep = argp;
>  	struct vhost_vring_file backend;
> +	struct rlimit rlim;
>  	u64 features;
>  	int r;
>  	switch (ioctl) {
> @@ -933,6 +982,13 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>  		return vhost_net_set_features(n, features);
>  	case VHOST_RESET_OWNER:
>  		return vhost_net_reset_owner(n);
> +	case VHOST_SET_MEM_LOCKED:
> +		r = copy_from_user(&rlim, argp, sizeof rlim);
> +		if (r < 0)
> +			return r;
> +		return vhost_net_set_mem_locked(n,
> +						rlim.rlim_cur,
> +						rlim.rlim_max);
>  	default:
>  		mutex_lock(&n->dev.mutex);
>  		r = vhost_dev_ioctl(&n->dev, ioctl, arg);
> diff --git a/include/linux/vhost.h b/include/linux/vhost.h
> index e847f1e..df93f5a 100644
> --- a/include/linux/vhost.h
> +++ b/include/linux/vhost.h
> @@ -92,6 +92,9 @@ struct vhost_memory {
>  /* Specify an eventfd file descriptor to signal on log write. */
>  #define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
>  
> +/* Specify how much locked memory can be used */
> +#define VHOST_SET_MEM_LOCKED	_IOW(VHOST_VIRTIO, 0x08, struct rlimit)
> +

This is not a good structure to use: its size varies between
64 and 32 bit. rlimit64 would be better.
Also, you will have to include resource.h from here.

>  /* Ring setup. */
>  /* Set number of descriptors in ring. This parameter can not
>   * be modified while ring is running (bound to a device). */
> -- 
> 1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-20 11:36                                         ` Michael S. Tsirkin
@ 2010-09-21  1:39                                           ` Xin, Xiaohui
  2010-09-21 13:14                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-21  1:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>From: Michael S. Tsirkin [mailto:mst@redhat.com]
>Sent: Monday, September 20, 2010 7:37 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
>> From: Xin Xiaohui <xiaohui.xin@intel.com>
>>
>> ---
>> Michael,
>> I have move the ioctl to configure the locked memory to vhost
>
>It's ok to move this to vhost but vhost does not
>know how much memory is needed by the backend.

I think the backend here you mean is mp device.
Actually, the memory needed is related to vq->num to run zero-copy
smoothly. That means mp device did not know it but vhost did.
And the rlimt stuff is per process, we use current pointer to set
and check the rlimit, the operations should be in the same process.
Now the check operations are in vhost process, as mp_recvmsg() or
mp_sendmsg() are called by vhost. So set operations should be in
vhost process too, it's natural.

>So I think we'll need another ioctl in the backend
>to tell userspace how much memory is needed?
>
Except vhost tells it to mp device, mp did not know
how much memory is needed to run zero-copy smoothly.
Is userspace interested about the memory mp is needed?

>It seems a bit cleaner as a backend ioctl as vhost
>does not lock memory itself, but I am not
>principally opposed.
>
>> and
>> check the limit with mm->locked_vm. please have a look.
>>
>> Thanks
>> Xiaohui
>>
>>  drivers/vhost/mpassthru.c |   74 +++++++++----------------------------------
>>  drivers/vhost/net.c       |   78
>++++++++++++++++++++++++++++++++++++++------
>>  include/linux/vhost.h     |    3 ++
>>  3 files changed, 85 insertions(+), 70 deletions(-)
>>
>> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
>> index d86d94c..fd3827b 100644
>> --- a/drivers/vhost/mpassthru.c
>> +++ b/drivers/vhost/mpassthru.c
>> @@ -109,9 +109,6 @@ struct page_ctor {
>>      int                     wq_len;
>>      int                     rq_len;
>>      spinlock_t              read_lock;
>> -    /* record the locked pages */
>> -    int                     lock_pages;
>> -    struct rlimit           o_rlim;
>>      struct net_device       *dev;
>>      struct mpassthru_port   port;
>>      struct page_info        **hash_table;
>> @@ -231,7 +228,6 @@ static int page_ctor_attach(struct mp_struct *mp)
>>      ctor->port.ctor = page_ctor;
>>      ctor->port.sock = &mp->socket;
>>      ctor->port.hash = mp_lookup;
>> -    ctor->lock_pages = 0;
>>
>>      /* locked by mp_mutex */
>>      dev->mp_port = &ctor->port;
>> @@ -264,37 +260,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
>>      return info;
>>  }
>>
>> -static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
>> -                          unsigned long cur, unsigned long max)
>> -{
>> -    struct rlimit new_rlim, *old_rlim;
>> -    int retval;
>> -
>> -    if (resource != RLIMIT_MEMLOCK)
>> -            return -EINVAL;
>> -    new_rlim.rlim_cur = cur;
>> -    new_rlim.rlim_max = max;
>> -
>> -    old_rlim = current->signal->rlim + resource;
>> -
>> -    /* remember the old rlimit value when backend enabled */
>> -    ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
>> -    ctor->o_rlim.rlim_max = old_rlim->rlim_max;
>> -
>> -    if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
>> -                    !capable(CAP_SYS_RESOURCE))
>> -            return -EPERM;
>> -
>> -    retval = security_task_setrlimit(resource, &new_rlim);
>> -    if (retval)
>> -            return retval;
>> -
>> -    task_lock(current->group_leader);
>> -    *old_rlim = new_rlim;
>> -    task_unlock(current->group_leader);
>> -    return 0;
>> -}
>> -
>>  static void relinquish_resource(struct page_ctor *ctor)
>>  {
>>      if (!(ctor->dev->flags & IFF_UP) &&
>> @@ -322,8 +287,6 @@ static void mp_ki_dtor(struct kiocb *iocb)
>>              info->ctor->rq_len--;
>>      } else
>>              info->ctor->wq_len--;
>> -    /* Decrement the number of locked pages */
>> -    info->ctor->lock_pages -= info->pnum;
>>      kmem_cache_free(ext_page_info_cache, info);
>>      relinquish_resource(info->ctor);
>>
>> @@ -349,7 +312,7 @@ static struct kiocb *create_iocb(struct page_info *info, int size)
>>      iocb->ki_dtor(iocb);
>>      iocb->private = (void *)info;
>>      iocb->ki_dtor = mp_ki_dtor;
>> -
>> +    iocb->ki_user_data = info->pnum;
>>      return iocb;
>>  }
>>
>> @@ -375,10 +338,6 @@ static int page_ctor_detach(struct mp_struct *mp)
>>
>>      relinquish_resource(ctor);
>>
>> -    set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
>> -                       ctor->o_rlim.rlim_cur,
>> -                       ctor->o_rlim.rlim_max);
>> -
>>      /* locked by mp_mutex */
>>      ctor->dev->mp_port = NULL;
>>      dev_put(ctor->dev);
>> @@ -565,21 +524,23 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
>>      int rc;
>>      int i, j, n = 0;
>>      int len;
>> -    unsigned long base, lock_limit;
>> +    unsigned long base, lock_limit, locked;
>>      struct page_info *info = NULL;
>>
>> -    lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
>> -    lock_limit >>= PAGE_SHIFT;
>> +    down_write(&current->mm->mmap_sem);
>> +    locked     = count + current->mm->locked_vm;
>> +    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>
>> -    if (ctor->lock_pages + count > lock_limit && npages) {
>> -            printk(KERN_INFO "exceed the locked memory rlimit.");
>> -            return NULL;
>> -    }
>> +    if ((locked > lock_limit) && !capable(CAP_IPC_LOCK))
>> +            goto out;
>>
>>      info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
>>
>>      if (!info)
>> -            return NULL;
>> +            goto out;
>> +
>> +    up_write(&current->mm->mmap_sem);
>> +
>>      info->skb = NULL;
>>      info->next = info->prev = NULL;
>>
>
>Sorry I wasnt clear, I didn't really mean we copy everything
>from infiniband, just the capability checks and locked_mm use.
>These guys don't do registration on data path so
>they can play games with locked_mm etc on registration.
>But lock on data path, taking mmap sem and doing security checks
>on data path would be bad for performance.
>I would expect this to cause contention especially
>as we'll go for multiqueue.
>
>Here's what I really meant:
>       SET_MEM_LOCKED gets a 32 bit integer (or a 64 bit
>       if you like - just not long).
>       the meaning of which is "this is how much
>       memory device can lock".
>       This ioctl does rlim_cur and capability checks,
>       if passed immediately increments locked_vm counter
>       by the *maximum amount specified*.
>       Device must store the value by which we incremented
>       locked_vm and the mm pointer (if this is vhost ioctl
>       it has the owner already). Let's call this
>       field lock_limit.
>
>
>       Lock limit can also take into account
>       e.g. device tx queue depth and our queue size.
>       Either we give another ioctl that tells userspace
>       about these and let it make the decision,
>       or simply cap lock_limit ourselves
>       depending on these parameters.
>
>       If another SET_MEM_LOCKED ioctl is made,
>       decrement lcoked_vm in the stored mm,
>       and redo the operation on current->mm
>       (note: might be different!).
>
>       This ioctl should probably fail if backend is active
>       (already has locked some pages), such an
>       approach makes it easy as we do not need to
>       find and unlock pages.
>
>       Each time you want to lock some memory you check that
>       1. current->mm matches the stored mm.
>       2. (number of pages locked + amount we want to lock) * PAGE_SIZE <= lock_limit.
>
>
>       close and RESET_OWNER decrement and drop mm reference
>       (note: on close
>       we decrement owner's locked_vm, not current->mm
>       which might be different).
>
>
>
>
>> @@ -633,8 +594,7 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
>>              for (i = 0; i < j; i++)
>>                      mp_hash_insert(ctor, info->pages[i], info);
>>      }
>> -    /* increment the number of locked pages */
>> -    ctor->lock_pages += j;
>> +
>>      return info;
>>
>>  failed:
>> @@ -642,7 +602,9 @@ failed:
>>              put_page(info->pages[i]);
>>
>>      kmem_cache_free(ext_page_info_cache, info);
>> -
>> +    return NULL;
>> +out:
>> +    up(&current->mm->mmap_sem);
>>      return NULL;
>>  }
>>
>> @@ -1006,12 +968,6 @@ proceed:
>>              count--;
>>      }
>>
>> -    if (!ctor->lock_pages || !ctor->rq_len) {
>> -            set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
>> -                            iocb->ki_user_data * 4096 * 2,
>> -                            iocb->ki_user_data * 4096 * 2);
>> -    }
>> -
>>      /* Translate address to kernel */
>>      info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
>>      if (!info)
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index c4bc815..da78837 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -42,6 +42,7 @@ enum {
>>  };
>>
>>  static struct kmem_cache *notify_cache;
>> +static struct rlimit orig_rlim;
>>
>>  enum vhost_net_poll_state {
>>      VHOST_NET_POLL_DISABLED = 0,
>> @@ -136,13 +137,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>>      struct vhost_log *vq_log = NULL;
>>      int rx_total_len = 0;
>>      unsigned int head, log, in, out;
>> -    int size;
>> -    int count;
>> -
>> -    struct virtio_net_hdr_mrg_rxbuf hdr = {
>> -            .hdr.flags = 0,
>> -            .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
>> -    };
>> +    int size, free = 0;
>>
>>      if (!is_async_vq(vq))
>>              return;
>> @@ -160,7 +155,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>>                      size = iocb->ki_nbytes;
>>                      head = iocb->ki_pos;
>>                      rx_total_len += iocb->ki_nbytes;
>> -
>> +                    free += iocb->ki_user_data;
>>                      if (iocb->ki_dtor)
>>                              iocb->ki_dtor(iocb);
>>                      kmem_cache_free(net->cache, iocb);
>> @@ -192,6 +187,7 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>>                                      size = iocb->ki_nbytes;
>>                                      head = iocb->ki_pos;
>>                                      rx_total_len += iocb->ki_nbytes;
>> +                                    free += iocb->ki_user_data;
>>
>>                                      if (iocb->ki_dtor)
>>                                              iocb->ki_dtor(iocb);
>> @@ -211,7 +207,6 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>>                                      break;
>>
>>                              i++;
>> -                            iocb == NULL;
>>                              if (count)
>>                                      iocb = notify_dequeue(vq);
>>                      }
>> @@ -219,6 +214,10 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
>>                                      &net->dev, vq, vq->heads, hc);
>>              }
>>      }
>> +    /* record locked memroy */
>> +    down_write(&current->mm->mmap_sem);
>> +    current->mm->locked_vm -= free;
>> +    up_write(&current->mm->mmap_sem);
>>  }
>>
>>  static void handle_async_tx_events_notify(struct vhost_net *net,
>> @@ -227,7 +226,7 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
>>      struct kiocb *iocb = NULL;
>>      struct list_head *entry, *tmp;
>>      unsigned long flags;
>> -    int tx_total_len = 0;
>> +    int tx_total_len = 0, free = 0;
>>
>>      if (!is_async_vq(vq))
>>              return;
>> @@ -242,7 +241,7 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
>>              vhost_add_used_and_signal(&net->dev, vq,
>>                              iocb->ki_pos, 0);
>>              tx_total_len += iocb->ki_nbytes;
>> -
>> +            free += iocb->ki_user_data;
>>              if (iocb->ki_dtor)
>>                      iocb->ki_dtor(iocb);
>>
>> @@ -253,6 +252,10 @@ static void handle_async_tx_events_notify(struct vhost_net *net,
>>              }
>>      }
>>      spin_unlock_irqrestore(&vq->notify_lock, flags);
>> +    /* record locked memroy */
>> +    down_write(&current->mm->mmap_sem);
>> +    current->mm->locked_vm -= free;
>> +    up_write(&current->mm->mmap_sem);
>>  }
>>
>>  static struct kiocb *create_iocb(struct vhost_net *net,
>> @@ -581,6 +584,7 @@ static void handle_rx_net(struct work_struct *work)
>>  static int vhost_net_open(struct inode *inode, struct file *f)
>>  {
>>      struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
>> +    struct rlimit *old_rlim;
>>      int r;
>>      if (!n)
>>              return -ENOMEM;
>> @@ -597,6 +601,12 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>>      n->tx_poll_state = VHOST_NET_POLL_DISABLED;
>>      n->cache = NULL;
>>
>> +    old_rlim = current->signal->rlim + RLIMIT_MEMLOCK;
>> +
>> +    /* remember the old rlimit value when backend enabled */
>> +    orig_rlim.rlim_cur = old_rlim->rlim_cur;
>> +    orig_rlim.rlim_max = old_rlim->rlim_max;
>> +
>>      f->private_data = n;
>>
>>      return 0;
>> @@ -659,6 +669,39 @@ static void vhost_net_flush(struct vhost_net *n)
>>      vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
>>  }
>>
>> +static long vhost_net_set_mem_locked(struct vhost_net *n,
>> +                                 unsigned long cur,
>> +                                 unsigned long max)
>> +{
>
>So one issue here is that when this is called on close,
>current might be different from owner, with bad results.
>
>I really think avoiding modifying rlimit is
>the simplest way to go for now.
>
>> +    struct rlimit new_rlim, *old_rlim;
>> +    int retval = 0;
>> +
>> +    mutex_lock(&n->dev.mutex);
>> +    new_rlim.rlim_cur = cur;
>> +    new_rlim.rlim_max = max;
>> +
>> +    old_rlim = current->signal->rlim + RLIMIT_MEMLOCK;
>> +
>> +    if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
>> +                    !capable(CAP_SYS_RESOURCE)) {
>> +            retval = -EPERM;
>> +            goto err;
>> +    }
>> +
>> +    retval = security_task_setrlimit(RLIMIT_MEMLOCK, &new_rlim);
>> +    if (retval) {
>> +            retval = retval;
>> +            goto err;
>> +    }
>> +
>> +    task_lock(current->group_leader);
>> +    *old_rlim = new_rlim;
>> +    task_unlock(current->group_leader);
>> +err:
>> +    mutex_unlock(&n->dev.mutex);
>> +    return retval;
>> +}
>> +
>>  static void vhost_async_cleanup(struct vhost_net *n)
>>  {
>>      /* clean the notifier */
>> @@ -691,6 +734,10 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>>       * since jobs can re-queue themselves. */
>>      vhost_net_flush(n);
>>      vhost_async_cleanup(n);
>> +    /* return back the rlimit */
>> +    vhost_net_set_mem_locked(n,
>> +                             orig_rlim.rlim_cur,
>> +                             orig_rlim.rlim_max);
>>      kfree(n);
>>      return 0;
>>  }
>> @@ -846,6 +893,7 @@ err:
>>      return r;
>>  }
>>
>> +
>>  static long vhost_net_reset_owner(struct vhost_net *n)
>>  {
>>      struct socket *tx_sock = NULL;
>> @@ -913,6 +961,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>>      void __user *argp = (void __user *)arg;
>>      u64 __user *featurep = argp;
>>      struct vhost_vring_file backend;
>> +    struct rlimit rlim;
>>      u64 features;
>>      int r;
>>      switch (ioctl) {
>> @@ -933,6 +982,13 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>>              return vhost_net_set_features(n, features);
>>      case VHOST_RESET_OWNER:
>>              return vhost_net_reset_owner(n);
>> +    case VHOST_SET_MEM_LOCKED:
>> +            r = copy_from_user(&rlim, argp, sizeof rlim);
>> +            if (r < 0)
>> +                    return r;
>> +            return vhost_net_set_mem_locked(n,
>> +                                            rlim.rlim_cur,
>> +                                            rlim.rlim_max);
>>      default:
>>              mutex_lock(&n->dev.mutex);
>>              r = vhost_dev_ioctl(&n->dev, ioctl, arg);
>> diff --git a/include/linux/vhost.h b/include/linux/vhost.h
>> index e847f1e..df93f5a 100644
>> --- a/include/linux/vhost.h
>> +++ b/include/linux/vhost.h
>> @@ -92,6 +92,9 @@ struct vhost_memory {
>>  /* Specify an eventfd file descriptor to signal on log write. */
>>  #define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
>>
>> +/* Specify how much locked memory can be used */
>> +#define VHOST_SET_MEM_LOCKED        _IOW(VHOST_VIRTIO, 0x08, struct rlimit)
>> +
>
>This is not a good structure to use: its size varies between
>64 and 32 bit. rlimit64 would be better.
>Also, you will have to include resource.h from here.
>
>>  /* Ring setup. */
>>  /* Set number of descriptors in ring. This parameter can not
>>   * be modified while ring is running (bound to a device). */
>> --
>> 1.5.4.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-21  1:39                                           ` Xin, Xiaohui
@ 2010-09-21 13:14                                             ` Michael S. Tsirkin
  2010-09-22 11:41                                               ` Xin, Xiaohui
  0 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-21 13:14 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >Sent: Monday, September 20, 2010 7:37 PM
> >To: Xin, Xiaohui
> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >jdike@linux.intel.com
> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >
> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
> >> From: Xin Xiaohui <xiaohui.xin@intel.com>
> >>
> >> ---
> >> Michael,
> >> I have move the ioctl to configure the locked memory to vhost
> >
> >It's ok to move this to vhost but vhost does not
> >know how much memory is needed by the backend.
> 
> I think the backend here you mean is mp device.
> Actually, the memory needed is related to vq->num to run zero-copy
> smoothly.
> That means mp device did not know it but vhost did.

Well, this might be so if you insist on locking
all posted buffers immediately. However, let's assume I have a
very large ring and prepost a ton of RX buffers:
there's no need to lock all of them directly:

if we have buffers A and B, we can lock A, pass it
to hardware, and when A is consumed unlock A, lock B
and pass it to hardware.


It's not really critical. But note we can always have userspace
tell MP device all it wants to know, after all.

> And the rlimt stuff is per process, we use current pointer to set
> and check the rlimit, the operations should be in the same process.

Well no, the ring is handled from the kernel thread: we switch the mm to
point to the owner task so copy from/to user and friends work, but you
can't access the rlimit etc.

> Now the check operations are in vhost process, as mp_recvmsg() or
> mp_sendmsg() are called by vhost.

Hmm, what do you mean by the check operations?
send/recv are data path operations, they shouldn't
do any checks, should they?

> So set operations should be in
> vhost process too, it's natural.
> 
> >So I think we'll need another ioctl in the backend
> >to tell userspace how much memory is needed?
> >
> Except vhost tells it to mp device, mp did not know
> how much memory is needed to run zero-copy smoothly.
> Is userspace interested about the memory mp is needed?

Couldn't parse this last question.
I think userspace generally does want control over
how much memory we'll lock. We should not just lock
as much as we can.

-- 
MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-21 13:14                                             ` Michael S. Tsirkin
@ 2010-09-22 11:41                                               ` Xin, Xiaohui
  2010-09-22 11:55                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-22 11:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>-----Original Message-----
>From: Michael S. Tsirkin [mailto:mst@redhat.com]
>Sent: Tuesday, September 21, 2010 9:14 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
>> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >Sent: Monday, September 20, 2010 7:37 PM
>> >To: Xin, Xiaohui
>> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >jdike@linux.intel.com
>> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >
>> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
>> >> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> >>
>> >> ---
>> >> Michael,
>> >> I have move the ioctl to configure the locked memory to vhost
>> >
>> >It's ok to move this to vhost but vhost does not
>> >know how much memory is needed by the backend.
>>
>> I think the backend here you mean is mp device.
>> Actually, the memory needed is related to vq->num to run zero-copy
>> smoothly.
>> That means mp device did not know it but vhost did.
>
>Well, this might be so if you insist on locking
>all posted buffers immediately. However, let's assume I have a
>very large ring and prepost a ton of RX buffers:
>there's no need to lock all of them directly:
>
>if we have buffers A and B, we can lock A, pass it
>to hardware, and when A is consumed unlock A, lock B
>and pass it to hardware.
>
>
>It's not really critical. But note we can always have userspace
>tell MP device all it wants to know, after all.
>
Ok. Here are two values we have mentioned, one is how much memory
user application wants to lock, and one is how much memory locked
is needed to run smoothly. When net backend is setup, we first need
an ioctl to get how much memory is needed to lock, and then we call
another ioctl to set how much it want to lock. Is that what's in your mind? 

>> And the rlimt stuff is per process, we use current pointer to set
>> and check the rlimit, the operations should be in the same process.
>
>Well no, the ring is handled from the kernel thread: we switch the mm to
>point to the owner task so copy from/to user and friends work, but you
>can't access the rlimit etc.
>
Yes, the userspace and vhost kernel is not the same process. But we can
record the task pointer as mm.

>> Now the check operations are in vhost process, as mp_recvmsg() or
>> mp_sendmsg() are called by vhost.
>
>Hmm, what do you mean by the check operations?
>send/recv are data path operations, they shouldn't
>do any checks, should they?
>
As you mentioned what infiniband driver done:
        down_write(&current->mm->mmap_sem);

        locked     = npages + current->mm->locked_vm;
        lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

        if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
                ret = -ENOMEM;
                goto out;
        }

        cur_base = addr & PAGE_MASK;

        ret = 0;
        while (npages) {
                ret = get_user_pages(current, current->mm, cur_base,
                                     min_t(unsigned long, npages,
                                           PAGE_SIZE / sizeof (struct page *)),
                                     1, !umem->writable, page_list, vma_list);

I think it's a data path too. We do the check because get_user_pages() really pin and locked
the memory. 

>> So set operations should be in
>> vhost process too, it's natural.
>>
>> >So I think we'll need another ioctl in the backend
>> >to tell userspace how much memory is needed?
>> >
>> Except vhost tells it to mp device, mp did not know
>> how much memory is needed to run zero-copy smoothly.
>> Is userspace interested about the memory mp is needed?
>
>Couldn't parse this last question.
>I think userspace generally does want control over
>how much memory we'll lock. We should not just lock
>as much as we can.
>
>--
>MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-22 11:41                                               ` Xin, Xiaohui
@ 2010-09-22 11:55                                                 ` Michael S. Tsirkin
  2010-09-23 12:56                                                   ` Xin, Xiaohui
  0 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-22 11:55 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >Sent: Tuesday, September 21, 2010 9:14 PM
> >To: Xin, Xiaohui
> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >jdike@linux.intel.com
> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >
> >On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
> >> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >> >Sent: Monday, September 20, 2010 7:37 PM
> >> >To: Xin, Xiaohui
> >> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >> >jdike@linux.intel.com
> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >> >
> >> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
> >> >> From: Xin Xiaohui <xiaohui.xin@intel.com>
> >> >>
> >> >> ---
> >> >> Michael,
> >> >> I have move the ioctl to configure the locked memory to vhost
> >> >
> >> >It's ok to move this to vhost but vhost does not
> >> >know how much memory is needed by the backend.
> >>
> >> I think the backend here you mean is mp device.
> >> Actually, the memory needed is related to vq->num to run zero-copy
> >> smoothly.
> >> That means mp device did not know it but vhost did.
> >
> >Well, this might be so if you insist on locking
> >all posted buffers immediately. However, let's assume I have a
> >very large ring and prepost a ton of RX buffers:
> >there's no need to lock all of them directly:
> >
> >if we have buffers A and B, we can lock A, pass it
> >to hardware, and when A is consumed unlock A, lock B
> >and pass it to hardware.
> >
> >
> >It's not really critical. But note we can always have userspace
> >tell MP device all it wants to know, after all.
> >
> Ok. Here are two values we have mentioned, one is how much memory
> user application wants to lock, and one is how much memory locked
> is needed to run smoothly. When net backend is setup, we first need
> an ioctl to get how much memory is needed to lock, and then we call
> another ioctl to set how much it want to lock. Is that what's in your mind? 

That's fine.

> >> And the rlimt stuff is per process, we use current pointer to set
> >> and check the rlimit, the operations should be in the same process.
> >
> >Well no, the ring is handled from the kernel thread: we switch the mm to
> >point to the owner task so copy from/to user and friends work, but you
> >can't access the rlimit etc.
> >
> Yes, the userspace and vhost kernel is not the same process. But we can
> record the task pointer as mm.

So you will have to store mm and do device->mm, not current->mm.
Anyway, better not touch mm on data path.

> >> Now the check operations are in vhost process, as mp_recvmsg() or
> >> mp_sendmsg() are called by vhost.
> >
> >Hmm, what do you mean by the check operations?
> >send/recv are data path operations, they shouldn't
> >do any checks, should they?
> >
> As you mentioned what infiniband driver done:
>         down_write(&current->mm->mmap_sem);
> 
>         locked     = npages + current->mm->locked_vm;
>         lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
>         if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>                 ret = -ENOMEM;
>                 goto out;
>         }
> 
>         cur_base = addr & PAGE_MASK;
> 
>         ret = 0;
>         while (npages) {
>                 ret = get_user_pages(current, current->mm, cur_base,
>                                      min_t(unsigned long, npages,
>                                            PAGE_SIZE / sizeof (struct page *)),
>                                      1, !umem->writable, page_list, vma_list);
> 
> I think it's a data path too.

in infiniband this is used to 'register memory' which is not data path.

> We do the check because get_user_pages() really pin and locked
> the memory. 

Don't do this. Performance will be bad.
Do the check once in ioctl and increment locked_vm by max amount you will use.
On data path just make sure you do not exceed what userspace told you
to.

> 
> >> So set operations should be in
> >> vhost process too, it's natural.
> >>
> >> >So I think we'll need another ioctl in the backend
> >> >to tell userspace how much memory is needed?
> >> >
> >> Except vhost tells it to mp device, mp did not know
> >> how much memory is needed to run zero-copy smoothly.
> >> Is userspace interested about the memory mp is needed?
> >
> >Couldn't parse this last question.
> >I think userspace generally does want control over
> >how much memory we'll lock. We should not just lock
> >as much as we can.
> >
> >--
> >MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-22 11:55                                                 ` Michael S. Tsirkin
@ 2010-09-23 12:56                                                   ` Xin, Xiaohui
  2010-09-26 11:50                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-23 12:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>-----Original Message-----
>From: Michael S. Tsirkin [mailto:mst@redhat.com]
>Sent: Wednesday, September 22, 2010 7:55 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
>> >-----Original Message-----
>> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >Sent: Tuesday, September 21, 2010 9:14 PM
>> >To: Xin, Xiaohui
>> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >jdike@linux.intel.com
>> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >
>> >On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
>> >> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >> >Sent: Monday, September 20, 2010 7:37 PM
>> >> >To: Xin, Xiaohui
>> >> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >> >jdike@linux.intel.com
>> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >> >
>> >> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
>> >> >> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> >> >>
>> >> >> ---
>> >> >> Michael,
>> >> >> I have move the ioctl to configure the locked memory to vhost
>> >> >
>> >> >It's ok to move this to vhost but vhost does not
>> >> >know how much memory is needed by the backend.
>> >>
>> >> I think the backend here you mean is mp device.
>> >> Actually, the memory needed is related to vq->num to run zero-copy
>> >> smoothly.
>> >> That means mp device did not know it but vhost did.
>> >
>> >Well, this might be so if you insist on locking
>> >all posted buffers immediately. However, let's assume I have a
>> >very large ring and prepost a ton of RX buffers:
>> >there's no need to lock all of them directly:
>> >
>> >if we have buffers A and B, we can lock A, pass it
>> >to hardware, and when A is consumed unlock A, lock B
>> >and pass it to hardware.
>> >
>> >
>> >It's not really critical. But note we can always have userspace
>> >tell MP device all it wants to know, after all.
>> >
>> Ok. Here are two values we have mentioned, one is how much memory
>> user application wants to lock, and one is how much memory locked
>> is needed to run smoothly. When net backend is setup, we first need
>> an ioctl to get how much memory is needed to lock, and then we call
>> another ioctl to set how much it want to lock. Is that what's in your mind?
>
>That's fine.
>
>> >> And the rlimt stuff is per process, we use current pointer to set
>> >> and check the rlimit, the operations should be in the same process.
>> >
>> >Well no, the ring is handled from the kernel thread: we switch the mm to
>> >point to the owner task so copy from/to user and friends work, but you
>> >can't access the rlimit etc.
>> >
>> Yes, the userspace and vhost kernel is not the same process. But we can
>> record the task pointer as mm.
>
>So you will have to store mm and do device->mm, not current->mm.
>Anyway, better not touch mm on data path.
>
>> >> Now the check operations are in vhost process, as mp_recvmsg() or
>> >> mp_sendmsg() are called by vhost.
>> >
>> >Hmm, what do you mean by the check operations?
>> >send/recv are data path operations, they shouldn't
>> >do any checks, should they?
>> >
>> As you mentioned what infiniband driver done:
>>         down_write(&current->mm->mmap_sem);
>>
>>         locked     = npages + current->mm->locked_vm;
>>         lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>
>>         if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>>                 ret = -ENOMEM;
>>                 goto out;
>>         }
>>
>>         cur_base = addr & PAGE_MASK;
>>
>>         ret = 0;
>>         while (npages) {
>>                 ret = get_user_pages(current, current->mm, cur_base,
>>                                      min_t(unsigned long, npages,
>>                                            PAGE_SIZE / sizeof (struct page *)),
>>                                      1, !umem->writable, page_list, vma_list);
>>
>> I think it's a data path too.
>
>in infiniband this is used to 'register memory' which is not data path.
>
>> We do the check because get_user_pages() really pin and locked
>> the memory.
>
>Don't do this. Performance will be bad.
>Do the check once in ioctl and increment locked_vm by max amount you will use.
>On data path just make sure you do not exceed what userspace told you
>to.

What's in my mind is that in the ioctl which to get the memory locked needed to run smoothly,
it just return a value of how much memory is needed by mp device.
And then in the ioctl which to set the memory locked by user space, it check the rlimit and
increment locked_vm by user want. But I'm not sure how to "make sure do not exceed what
userspace told to". If we don't check locked_vm, what do we use to check? And Is it another kind of check on data path?

>
>>
>> >> So set operations should be in
>> >> vhost process too, it's natural.
>> >>
>> >> >So I think we'll need another ioctl in the backend
>> >> >to tell userspace how much memory is needed?
>> >> >
>> >> Except vhost tells it to mp device, mp did not know
>> >> how much memory is needed to run zero-copy smoothly.
>> >> Is userspace interested about the memory mp is needed?
>> >
>> >Couldn't parse this last question.
>> >I think userspace generally does want control over
>> >how much memory we'll lock. We should not just lock
>> >as much as we can.
>> >
>> >--
>> >MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-23 12:56                                                   ` Xin, Xiaohui
@ 2010-09-26 11:50                                                     ` Michael S. Tsirkin
  2010-09-27  0:42                                                       ` Xin, Xiaohui
  0 siblings, 1 reply; 64+ messages in thread
From: Michael S. Tsirkin @ 2010-09-26 11:50 UTC (permalink / raw)
  To: Xin, Xiaohui; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

On Thu, Sep 23, 2010 at 08:56:33PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >Sent: Wednesday, September 22, 2010 7:55 PM
> >To: Xin, Xiaohui
> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >jdike@linux.intel.com
> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >
> >On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
> >> >-----Original Message-----
> >> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >> >Sent: Tuesday, September 21, 2010 9:14 PM
> >> >To: Xin, Xiaohui
> >> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >> >jdike@linux.intel.com
> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >> >
> >> >On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
> >> >> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >> >> >Sent: Monday, September 20, 2010 7:37 PM
> >> >> >To: Xin, Xiaohui
> >> >> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> >> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
> >> >> >jdike@linux.intel.com
> >> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
> >> >> >
> >> >> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
> >> >> >> From: Xin Xiaohui <xiaohui.xin@intel.com>
> >> >> >>
> >> >> >> ---
> >> >> >> Michael,
> >> >> >> I have move the ioctl to configure the locked memory to vhost
> >> >> >
> >> >> >It's ok to move this to vhost but vhost does not
> >> >> >know how much memory is needed by the backend.
> >> >>
> >> >> I think the backend here you mean is mp device.
> >> >> Actually, the memory needed is related to vq->num to run zero-copy
> >> >> smoothly.
> >> >> That means mp device did not know it but vhost did.
> >> >
> >> >Well, this might be so if you insist on locking
> >> >all posted buffers immediately. However, let's assume I have a
> >> >very large ring and prepost a ton of RX buffers:
> >> >there's no need to lock all of them directly:
> >> >
> >> >if we have buffers A and B, we can lock A, pass it
> >> >to hardware, and when A is consumed unlock A, lock B
> >> >and pass it to hardware.
> >> >
> >> >
> >> >It's not really critical. But note we can always have userspace
> >> >tell MP device all it wants to know, after all.
> >> >
> >> Ok. Here are two values we have mentioned, one is how much memory
> >> user application wants to lock, and one is how much memory locked
> >> is needed to run smoothly. When net backend is setup, we first need
> >> an ioctl to get how much memory is needed to lock, and then we call
> >> another ioctl to set how much it want to lock. Is that what's in your mind?
> >
> >That's fine.
> >
> >> >> And the rlimt stuff is per process, we use current pointer to set
> >> >> and check the rlimit, the operations should be in the same process.
> >> >
> >> >Well no, the ring is handled from the kernel thread: we switch the mm to
> >> >point to the owner task so copy from/to user and friends work, but you
> >> >can't access the rlimit etc.
> >> >
> >> Yes, the userspace and vhost kernel is not the same process. But we can
> >> record the task pointer as mm.
> >
> >So you will have to store mm and do device->mm, not current->mm.
> >Anyway, better not touch mm on data path.
> >
> >> >> Now the check operations are in vhost process, as mp_recvmsg() or
> >> >> mp_sendmsg() are called by vhost.
> >> >
> >> >Hmm, what do you mean by the check operations?
> >> >send/recv are data path operations, they shouldn't
> >> >do any checks, should they?
> >> >
> >> As you mentioned what infiniband driver done:
> >>         down_write(&current->mm->mmap_sem);
> >>
> >>         locked     = npages + current->mm->locked_vm;
> >>         lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >>
> >>         if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> >>                 ret = -ENOMEM;
> >>                 goto out;
> >>         }
> >>
> >>         cur_base = addr & PAGE_MASK;
> >>
> >>         ret = 0;
> >>         while (npages) {
> >>                 ret = get_user_pages(current, current->mm, cur_base,
> >>                                      min_t(unsigned long, npages,
> >>                                            PAGE_SIZE / sizeof (struct page *)),
> >>                                      1, !umem->writable, page_list, vma_list);
> >>
> >> I think it's a data path too.
> >
> >in infiniband this is used to 'register memory' which is not data path.
> >
> >> We do the check because get_user_pages() really pin and locked
> >> the memory.
> >
> >Don't do this. Performance will be bad.
> >Do the check once in ioctl and increment locked_vm by max amount you will use.
> >On data path just make sure you do not exceed what userspace told you
> >to.
> 
> What's in my mind is that in the ioctl which to get the memory locked needed to run smoothly,
> it just return a value of how much memory is needed by mp device.
> And then in the ioctl which to set the memory locked by user space, it check the rlimit and
> increment locked_vm by user want.

Fine.

> But I'm not sure how to "make sure do not exceed what
> userspace told to". If we don't check locked_vm, what do we use to check? And Is it another kind of check on data path?

An example: on ioctl we have incremented locked_vm by say 128K.
We will record this number 128K in mp data structure and on data path
verify that amount of memory we actually lock with get_user_pages_fast
does not exceed 128K. This is not part of mm and so can use
any locking scheme, no need to take mm semaphore.



> >
> >>
> >> >> So set operations should be in
> >> >> vhost process too, it's natural.
> >> >>
> >> >> >So I think we'll need another ioctl in the backend
> >> >> >to tell userspace how much memory is needed?
> >> >> >
> >> >> Except vhost tells it to mp device, mp did not know
> >> >> how much memory is needed to run zero-copy smoothly.
> >> >> Is userspace interested about the memory mp is needed?
> >> >
> >> >Couldn't parse this last question.
> >> >I think userspace generally does want control over
> >> >how much memory we'll lock. We should not just lock
> >> >as much as we can.
> >> >
> >> >--
> >> >MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
  2010-09-26 11:50                                                     ` Michael S. Tsirkin
@ 2010-09-27  0:42                                                       ` Xin, Xiaohui
  0 siblings, 0 replies; 64+ messages in thread
From: Xin, Xiaohui @ 2010-09-27  0:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike

>From: Michael S. Tsirkin [mailto:mst@redhat.com]
>Sent: Sunday, September 26, 2010 7:50 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Thu, Sep 23, 2010 at 08:56:33PM +0800, Xin, Xiaohui wrote:
>> >-----Original Message-----
>> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >Sent: Wednesday, September 22, 2010 7:55 PM
>> >To: Xin, Xiaohui
>> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >jdike@linux.intel.com
>> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >
>> >On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
>> >> >-----Original Message-----
>> >> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >> >Sent: Tuesday, September 21, 2010 9:14 PM
>> >> >To: Xin, Xiaohui
>> >> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >> >jdike@linux.intel.com
>> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >> >
>> >> >On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
>> >> >> >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> >> >> >Sent: Monday, September 20, 2010 7:37 PM
>> >> >> >To: Xin, Xiaohui
>> >> >> >Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> >> >> >mingo@elte.hu; davem@davemloft.net; herbert@gondor.hengli.com.au;
>> >> >> >jdike@linux.intel.com
>> >> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >> >> >
>> >> >> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@intel.com wrote:
>> >> >> >> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> >> >> >>
>> >> >> >> ---
>> >> >> >> Michael,
>> >> >> >> I have move the ioctl to configure the locked memory to vhost
>> >> >> >
>> >> >> >It's ok to move this to vhost but vhost does not
>> >> >> >know how much memory is needed by the backend.
>> >> >>
>> >> >> I think the backend here you mean is mp device.
>> >> >> Actually, the memory needed is related to vq->num to run zero-copy
>> >> >> smoothly.
>> >> >> That means mp device did not know it but vhost did.
>> >> >
>> >> >Well, this might be so if you insist on locking
>> >> >all posted buffers immediately. However, let's assume I have a
>> >> >very large ring and prepost a ton of RX buffers:
>> >> >there's no need to lock all of them directly:
>> >> >
>> >> >if we have buffers A and B, we can lock A, pass it
>> >> >to hardware, and when A is consumed unlock A, lock B
>> >> >and pass it to hardware.
>> >> >
>> >> >
>> >> >It's not really critical. But note we can always have userspace
>> >> >tell MP device all it wants to know, after all.
>> >> >
>> >> Ok. Here are two values we have mentioned, one is how much memory
>> >> user application wants to lock, and one is how much memory locked
>> >> is needed to run smoothly. When net backend is setup, we first need
>> >> an ioctl to get how much memory is needed to lock, and then we call
>> >> another ioctl to set how much it want to lock. Is that what's in your mind?
>> >
>> >That's fine.
>> >
>> >> >> And the rlimt stuff is per process, we use current pointer to set
>> >> >> and check the rlimit, the operations should be in the same process.
>> >> >
>> >> >Well no, the ring is handled from the kernel thread: we switch the mm to
>> >> >point to the owner task so copy from/to user and friends work, but you
>> >> >can't access the rlimit etc.
>> >> >
>> >> Yes, the userspace and vhost kernel is not the same process. But we can
>> >> record the task pointer as mm.
>> >
>> >So you will have to store mm and do device->mm, not current->mm.
>> >Anyway, better not touch mm on data path.
>> >
>> >> >> Now the check operations are in vhost process, as mp_recvmsg() or
>> >> >> mp_sendmsg() are called by vhost.
>> >> >
>> >> >Hmm, what do you mean by the check operations?
>> >> >send/recv are data path operations, they shouldn't
>> >> >do any checks, should they?
>> >> >
>> >> As you mentioned what infiniband driver done:
>> >>         down_write(&current->mm->mmap_sem);
>> >>
>> >>         locked     = npages + current->mm->locked_vm;
>> >>         lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> >>
>> >>         if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>> >>                 ret = -ENOMEM;
>> >>                 goto out;
>> >>         }
>> >>
>> >>         cur_base = addr & PAGE_MASK;
>> >>
>> >>         ret = 0;
>> >>         while (npages) {
>> >>                 ret = get_user_pages(current, current->mm, cur_base,
>> >>                                      min_t(unsigned long, npages,
>> >>                                            PAGE_SIZE / sizeof (struct page
>*)),
>> >>                                      1, !umem->writable, page_list,
>vma_list);
>> >>
>> >> I think it's a data path too.
>> >
>> >in infiniband this is used to 'register memory' which is not data path.
>> >
>> >> We do the check because get_user_pages() really pin and locked
>> >> the memory.
>> >
>> >Don't do this. Performance will be bad.
>> >Do the check once in ioctl and increment locked_vm by max amount you will use.
>> >On data path just make sure you do not exceed what userspace told you
>> >to.
>>
>> What's in my mind is that in the ioctl which to get the memory locked needed to run
>smoothly,
>> it just return a value of how much memory is needed by mp device.
>> And then in the ioctl which to set the memory locked by user space, it check the rlimit and
>> increment locked_vm by user want.
>
>Fine.
>
>> But I'm not sure how to "make sure do not exceed what
>> userspace told to". If we don't check locked_vm, what do we use to check? And Is it
>another kind of check on data path?
>
>An example: on ioctl we have incremented locked_vm by say 128K.
>We will record this number 128K in mp data structure and on data path
>verify that amount of memory we actually lock with get_user_pages_fast
>does not exceed 128K. This is not part of mm and so can use
>any locking scheme, no need to take mm semaphore.
>
>
Thanks, and later, I did do that in v11 patches. 

>
>> >
>> >>
>> >> >> So set operations should be in
>> >> >> vhost process too, it's natural.
>> >> >>
>> >> >> >So I think we'll need another ioctl in the backend
>> >> >> >to tell userspace how much memory is needed?
>> >> >> >
>> >> >> Except vhost tells it to mp device, mp did not know
>> >> >> how much memory is needed to run zero-copy smoothly.
>> >> >> Is userspace interested about the memory mp is needed?
>> >> >
>> >> >Couldn't parse this last question.
>> >> >I think userspace generally does want control over
>> >> >how much memory we'll lock. We should not just lock
>> >> >as much as we can.
>> >> >
>> >> >--
>> >> >MST

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2010-09-27  0:42 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-06  9:23 [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net xiaohui.xin
2010-08-06  9:23 ` xiaohui.xin
2010-08-06  9:23 ` [RFC PATCH v9 01/16] Add a new structure for skb buffer from external xiaohui.xin
2010-08-06  9:23   ` xiaohui.xin
2010-08-06  9:23   ` [RFC PATCH v9 02/16] Add a new struct for device to manipulate external buffer xiaohui.xin
2010-08-06  9:23     ` xiaohui.xin
2010-08-06  9:23     ` [RFC PATCH v9 03/16] Add a ndo_mp_port_prep func to net_device_ops xiaohui.xin
2010-08-06  9:23       ` xiaohui.xin
2010-08-06  9:23       ` [RFC PATCH v9 04/16] Add a function make external buffer owner to query capability xiaohui.xin
2010-08-06  9:23         ` xiaohui.xin
2010-08-06  9:23         ` [RFC PATCH v9 05/16] Add a function to indicate if device use external buffer xiaohui.xin
2010-08-06  9:23           ` xiaohui.xin
2010-08-06  9:23           ` [RFC PATCH v9 06/16] Use callback to deal with skb_release_data() specially xiaohui.xin
2010-08-06  9:23             ` xiaohui.xin
2010-08-06  9:23             ` [RFC PATCH v9 07/16] Modify netdev_alloc_page() to get external buffer xiaohui.xin
2010-08-06  9:23               ` xiaohui.xin
2010-08-06  9:23               ` [RFC PATCH v9 08/16] Modify netdev_free_page() to release " xiaohui.xin
2010-08-06  9:23                 ` xiaohui.xin
2010-08-06  9:23                 ` [RFC PATCH v9 09/16] Don't do skb recycle, if device use " xiaohui.xin
2010-08-06  9:23                   ` xiaohui.xin
2010-08-06  9:23                   ` [RFC PATCH v9 10/16] Add a hook to intercept external buffers from NIC driver xiaohui.xin
2010-08-06  9:23                     ` xiaohui.xin
2010-08-06  9:23                     ` [RFC PATCH v9 11/16] Add header file for mp device xiaohui.xin
2010-08-06  9:23                       ` xiaohui.xin
2010-08-06  9:23                       ` [RFC PATCH v9 13/16] Add a kconfig entry and make entry " xiaohui.xin
2010-08-06  9:23                         ` xiaohui.xin
2010-08-06  9:23                         ` [RFC PATCH v9 12/16] Add mp(mediate passthru) device xiaohui.xin
2010-08-06  9:23                           ` xiaohui.xin
2010-08-06  9:23                           ` [RFC PATCH v9 14/16] Provides multiple submits and asynchronous notifications xiaohui.xin
2010-08-06  9:23                             ` xiaohui.xin
2010-08-06  9:23                             ` [RFC PATCH v9 15/16] An example how to modifiy NIC driver to use napi_gro_frags() interface xiaohui.xin
2010-08-06  9:23                               ` xiaohui.xin
2010-08-06  9:23                               ` [RFC PATCH v9 16/16] An example how to alloc user buffer based on " xiaohui.xin
2010-08-06  9:23                                 ` xiaohui.xin
2010-09-06 11:11                           ` [RFC PATCH v9 12/16] Add mp(mediate passthru) device Michael S. Tsirkin
2010-09-10 13:40                             ` Xin, Xiaohui
2010-09-11  7:41                               ` Xin, Xiaohui
2010-09-12 13:37                                 ` Michael S. Tsirkin
2010-09-15  3:13                                   ` Xin, Xiaohui
2010-09-15 11:28                                     ` Michael S. Tsirkin
2010-09-17  3:16                                       ` Xin, Xiaohui
2010-09-20  8:08                                       ` xiaohui.xin
2010-09-20 11:36                                         ` Michael S. Tsirkin
2010-09-21  1:39                                           ` Xin, Xiaohui
2010-09-21 13:14                                             ` Michael S. Tsirkin
2010-09-22 11:41                                               ` Xin, Xiaohui
2010-09-22 11:55                                                 ` Michael S. Tsirkin
2010-09-23 12:56                                                   ` Xin, Xiaohui
2010-09-26 11:50                                                     ` Michael S. Tsirkin
2010-09-27  0:42                                                       ` Xin, Xiaohui
2010-09-11  9:42                               ` Xin, Xiaohui
2010-08-11  1:23 ` [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net Shirley Ma
2010-08-11  1:23   ` Shirley Ma
2010-08-11  1:43   ` Shirley Ma
2010-08-11  1:43     ` Shirley Ma
2010-08-11  6:01     ` Shirley Ma
2010-08-11  6:01       ` Shirley Ma
2010-08-11  6:55       ` Shirley Ma
2010-08-11  6:55         ` Shirley Ma
2010-09-03 10:52         ` Michael S. Tsirkin
2010-09-13 18:48           ` Shirley Ma
2010-09-13 21:35           ` Shirley Ma
2010-09-03 10:14   ` Michael S. Tsirkin
2010-09-03 20:29     ` Sridhar Samudrala

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.