* [PATCH V2 net-next 0/7] vhost-net rx batching
@ 2017-03-30  7:22 Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing Jason Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

Hi all:

This series tries to implement rx batching for vhost-net. This is done
by batching the dequeuing from the skb_array exported by the
underlying socket and passing the skb back through msg_control to
finish the userspace copying.

Tests show at most a 19% improvement in rx pps.
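
At a high level, the rx path after this series looks roughly like the
sketch below (illustrative only; it reuses the names introduced in the
patches and is not code to be applied):

        /* vhost-net rx, simplified: refill the local batch when it is
         * empty, then hand one skb at a time to the socket via
         * msg_control so recvmsg() copies it to the guest buffers.
         */
        if (nvq->rh == nvq->rt) {
                nvq->rh = 0;
                nvq->rt = skb_array_consume_batched(nvq->rx_array, nvq->rxq,
                                                    VHOST_RX_BATCH);
        }
        msg.msg_control = nvq->rxq[nvq->rh++];
        sock->ops->recvmsg(sock, &msg, sock_len, MSG_DONTWAIT | MSG_TRUNC);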

Please review.

Thanks

Changes from V1:
- switch to using a for() loop in __ptr_ring_consume_batched()
- rename peek_head_len_batched() to fetch_skbs()
- use skb_array_consume_batched() instead of
  skb_array_consume_batched_bh() since no consumer runs in bh
- drop the lockless peeking patch since skb_array could be resized, so
  it's not safe to call the lockless one

Jason Wang (7):
  ptr_ring: introduce batch dequeuing
  skb_array: introduce batch dequeuing
  tun: export skb_array
  tap: export skb_array
  tun: support receiving skb through msg_control
  tap: support receiving skb from msg_control
  vhost_net: try batch dequing from skb array

 drivers/net/tap.c         | 25 +++++++++++++++---
 drivers/net/tun.c         | 31 ++++++++++++++++------
 drivers/vhost/net.c       | 64 +++++++++++++++++++++++++++++++++++++++++++---
 include/linux/if_tap.h    |  5 ++++
 include/linux/if_tun.h    |  5 ++++
 include/linux/ptr_ring.h  | 65 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/skb_array.h | 25 ++++++++++++++++++
 7 files changed, 204 insertions(+), 16 deletions(-)

-- 
2.7.4


* [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30 13:53   ` Michael S. Tsirkin
  2017-03-30  7:22 ` [PATCH V2 net-next 2/7] skb_array: " Jason Wang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

This patch introduces a batched version of consuming: the consumer can
dequeue more than one pointer from the ring at a time. We don't care
about the reordering of reads here, so no compiler barrier is needed.
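
A caller of the new helper might look roughly like this (usage sketch
only, not part of the patch; process() is a hypothetical per-entry
handler):

        void *batch[8];
        int i, n;

        n = ptr_ring_consume_batched(ring, batch, ARRAY_SIZE(batch));
        for (i = 0; i < n; i++)
                process(batch[i]);      /* hypothetical consumer of each entry */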

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/ptr_ring.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..2be0f350 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
 	return ptr;
 }
 
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+					     void **array, int n)
+{
+	void *ptr;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		ptr = __ptr_ring_consume(r);
+		if (!ptr)
+			break;
+		array[i] = ptr;
+	}
+
+	return i;
+}
+
 /*
  * Note: resize (below) nests producer lock within consumer lock, so if you
  * call this in interrupt or BH context, you must disable interrupts/BH when
@@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
 	return ptr;
 }
 
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,
+					   void **array, int n)
+{
+	int ret;
+
+	spin_lock(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock(&r->consumer_lock);
+
+	return ret;
+}
+
+static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
+					       void **array, int n)
+{
+	int ret;
+
+	spin_lock_irq(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock_irq(&r->consumer_lock);
+
+	return ret;
+}
+
+static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
+					       void **array, int n)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&r->consumer_lock, flags);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock_irqrestore(&r->consumer_lock, flags);
+
+	return ret;
+}
+
+static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
+					      void **array, int n)
+{
+	int ret;
+
+	spin_lock_bh(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	spin_unlock_bh(&r->consumer_lock);
+
+	return ret;
+}
+
 /* Cast to structure type and call a function without discarding from FIFO.
  * Function must return a value.
  * Callers must take consumer_lock.
-- 
2.7.4


* [PATCH V2 net-next 2/7] skb_array: introduce batch dequeuing
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 3/7] tun: export skb_array Jason Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/skb_array.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..90e44b9 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -97,21 +97,46 @@ static inline struct sk_buff *skb_array_consume(struct skb_array *a)
 	return ptr_ring_consume(&a->ring);
 }
 
+static inline int skb_array_consume_batched(struct skb_array *a,
+					    void **array, int n)
+{
+	return ptr_ring_consume_batched(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_irq(struct skb_array *a)
 {
 	return ptr_ring_consume_irq(&a->ring);
 }
 
+static inline int skb_array_consume_batched_irq(struct skb_array *a,
+						void **array, int n)
+{
+	return ptr_ring_consume_batched_irq(&a->ring, array, n);
+}
+
 static inline struct sk_buff *skb_array_consume_any(struct skb_array *a)
 {
 	return ptr_ring_consume_any(&a->ring);
 }
 
+static inline int skb_array_consume_batched_any(struct skb_array *a,
+						void **array, int n)
+{
+	return ptr_ring_consume_batched_any(&a->ring, array, n);
+}
+
+
 static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 {
 	return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline int skb_array_consume_batched_bh(struct skb_array *a,
+					       void **array, int n)
+{
+	return ptr_ring_consume_batched_bh(&a->ring, array, n);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
 	if (likely(skb)) {
-- 
2.7.4


* [PATCH V2 net-next 3/7] tun: export skb_array
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 2/7] skb_array: " Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 4/7] tap: " Jason Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

This patch exports the skb_array through tun_get_skb_array(). Callers
can then manipulate the skb array directly.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c      | 13 +++++++++++++
 include/linux/if_tun.h |  5 +++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c418f0a..70dd9ec 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2613,6 +2613,19 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
+struct skb_array *tun_get_skb_array(struct file *file)
+{
+	struct tun_file *tfile;
+
+	if (file->f_op != &tun_fops)
+		return ERR_PTR(-EINVAL);
+	tfile = file->private_data;
+	if (!tfile)
+		return ERR_PTR(-EBADFD);
+	return &tfile->tx_array;
+}
+EXPORT_SYMBOL_GPL(tun_get_skb_array);
+
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index ed6da2e..bf9bdf4 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -19,6 +19,7 @@
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
+struct skb_array *tun_get_skb_array(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -28,5 +29,9 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tun_get_skb_array(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TUN */
 #endif /* __IF_TUN_H */
-- 
2.7.4


* [PATCH V2 net-next 4/7] tap: export skb_array
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
                   ` (2 preceding siblings ...)
  2017-03-30  7:22 ` [PATCH V2 net-next 3/7] tun: export skb_array Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control Jason Wang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

This patch exports the skb_array through tap_get_skb_array(). Callers
can then manipulate the skb array directly.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tap.c      | 13 +++++++++++++
 include/linux/if_tap.h |  5 +++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4d4173d..abdaf86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1193,6 +1193,19 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
+struct skb_array *tap_get_skb_array(struct file *file)
+{
+	struct tap_queue *q;
+
+	if (file->f_op != &tap_fops)
+		return ERR_PTR(-EINVAL);
+	q = file->private_data;
+	if (!q)
+		return ERR_PTR(-EBADFD);
+	return &q->skb_array;
+}
+EXPORT_SYMBOL_GPL(tap_get_skb_array);
+
 int tap_queue_resize(struct tap_dev *tap)
 {
 	struct net_device *dev = tap->dev;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 3482c3c..4837157 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -3,6 +3,7 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
+struct skb_array *tap_get_skb_array(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -12,6 +13,10 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline struct skb_array *tap_get_skb_array(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
 #endif /* CONFIG_TAP */
 
 #include <net/sock.h>
-- 
2.7.4


* [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
                   ` (3 preceding siblings ...)
  2017-03-30  7:22 ` [PATCH V2 net-next 4/7] tap: " Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30 15:06   ` Michael S. Tsirkin
  2017-03-30  7:22 ` [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control Jason Wang
  2017-03-30  7:22 ` [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array Jason Wang
  6 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

This patch allows tun_recvmsg() to receive an skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 70dd9ec..a82bced 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
 
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
-	struct sk_buff *skb;
 	ssize_t ret;
 	int err;
 
@@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	if (!iov_iter_count(to))
 		return 0;
 
-	/* Read frames from ring */
-	skb = tun_ring_recv(tfile, noblock, &err);
-	if (!skb)
-		return err;
+	if (!skb) {
+		/* Read frames from ring */
+		skb = tun_ring_recv(tfile, noblock, &err);
+		if (!skb)
+			return err;
+	}
 
 	ret = tun_put_user(tun, tfile, skb, to);
 	if (unlikely(ret < 0))
@@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	if (!tun)
 		return -EBADFD;
-	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
+	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 					 SOL_PACKET, TUN_TX_TIMESTAMP);
 		goto out;
 	}
-	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > (ssize_t)total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4


* [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
                   ` (4 preceding siblings ...)
  2017-03-30  7:22 ` [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30 15:03   ` Michael S. Tsirkin
  2017-03-30  7:22 ` [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array Jason Wang
  6 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

This patch allows tap_recvmsg() to receive an skb from its caller
through msg_control. Vhost_net will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tap.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index abdaf86..07d9174 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
 
 static ssize_t tap_do_read(struct tap_queue *q,
 			   struct iov_iter *to,
-			   int noblock)
+			   int noblock, struct sk_buff *skb)
 {
 	DEFINE_WAIT(wait);
-	struct sk_buff *skb;
 	ssize_t ret = 0;
 
 	if (!iov_iter_count(to))
 		return 0;
 
+	if (skb)
+		goto done;
+
 	while (1) {
 		if (!noblock)
 			prepare_to_wait(sk_sleep(&q->sk), &wait,
@@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
 	if (!noblock)
 		finish_wait(sk_sleep(&q->sk), &wait);
 
+done:
 	if (skb) {
 		ret = tap_put_user(q, skb, to);
 		if (unlikely(ret < 0))
@@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct tap_queue *q = file->private_data;
 	ssize_t len = iov_iter_count(to), ret;
 
-	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
+	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
 	ret = min_t(ssize_t, ret, len);
 	if (ret > 0)
 		iocb->ki_pos = ret;
@@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
 	int ret;
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
+	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
+			  m->msg_control);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
-- 
2.7.4


* [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
  2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
                   ` (5 preceding siblings ...)
  2017-03-30  7:22 ` [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control Jason Wang
@ 2017-03-30  7:22 ` Jason Wang
  2017-03-30 14:21   ` Michael S. Tsirkin
  6 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2017-03-30  7:22 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

We used to dequeue one skb at a time during recvmsg() from the
skb_array; this can be inefficient because of poor cache utilization
and the spinlock touched for each packet. This patch batches the
dequeuing by calling the batch dequeuing helpers explicitly on the
exported skb array and passing each skb back through msg_control for
the underlying socket to finish the userspace copying.

Tests were done by XDP1:
- small buffer:
  Before: 1.88Mpps
  After : 2.25Mpps (+19.6%)
- mergeable buffer:
  Before: 1.83Mpps
  After : 2.10Mpps (+14.7%)

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b51989..ffa78c6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -28,6 +28,8 @@
 #include <linux/if_macvlan.h>
 #include <linux/if_tap.h>
 #include <linux/if_vlan.h>
+#include <linux/skb_array.h>
+#include <linux/skbuff.h>
 
 #include <net/sock.h>
 
@@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
 	struct vhost_virtqueue *vq;
 };
 
+#define VHOST_RX_BATCH 64
 struct vhost_net_virtqueue {
 	struct vhost_virtqueue vq;
 	size_t vhost_hlen;
@@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
 	/* Reference counting for outstanding ubufs.
 	 * Protected by vq mutex. Writers must also take device mutex. */
 	struct vhost_net_ubuf_ref *ubufs;
+	struct skb_array *rx_array;
+	void *rxq[VHOST_RX_BATCH];
+	int rt;
+	int rh;
 };
 
 struct vhost_net {
@@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
 		n->vqs[i].ubufs = NULL;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
+		n->vqs[i].rt = 0;
+		n->vqs[i].rh = 0;
 	}
 
 }
@@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
 	mutex_unlock(&vq->mutex);
 }
 
-static int peek_head_len(struct sock *sk)
+static int fetch_skbs(struct vhost_net_virtqueue *rvq)
+{
+	if (rvq->rh != rvq->rt)
+		goto out;
+
+	rvq->rh = rvq->rt = 0;
+	rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
+					    VHOST_RX_BATCH);
+	if (!rvq->rt)
+		return 0;
+out:
+	return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
+}
+
+static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 {
 	struct socket *sock = sk->sk_socket;
 	struct sk_buff *head;
 	int len = 0;
 	unsigned long flags;
 
+	if (rvq->rx_array)
+		return fetch_skbs(rvq);
+
 	if (sock->ops->peek_len)
 		return sock->ops->peek_len(sock);
 
@@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
 	return skb_queue_empty(&sk->sk_receive_queue);
 }
 
-static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
+static int vhost_net_rx_peek_head_len(struct vhost_net *net,
+				      struct sock *sk)
 {
+	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned long uninitialized_var(endtime);
-	int len = peek_head_len(sk);
+	int len = peek_head_len(rvq, sk);
 
 	if (!len && vq->busyloop_timeout) {
 		/* Both tx vq and rx socket were polled here */
@@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 			vhost_poll_queue(&vq->poll);
 		mutex_unlock(&vq->mutex);
 
-		len = peek_head_len(sk);
+		len = peek_head_len(rvq, sk);
 	}
 
 	return len;
@@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
 		/* On error, stop handling until the next kick. */
 		if (unlikely(headcount < 0))
 			goto out;
+		if (nvq->rx_array)
+			msg.msg_control = nvq->rxq[nvq->rh++];
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
 			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].done_idx = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
+		n->vqs[i].rt = 0;
+		n->vqs[i].rh = 0;
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
@@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 					struct vhost_virtqueue *vq)
 {
 	struct socket *sock;
+	struct vhost_net_virtqueue *nvq =
+		container_of(vq, struct vhost_net_virtqueue, vq);
 
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
 	vhost_net_disable_vq(n, vq);
 	vq->private_data = NULL;
+	while (nvq->rh != nvq->rt)
+		kfree_skb(nvq->rxq[nvq->rh++]);
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -953,6 +989,25 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
+static struct skb_array *get_tap_skb_array(int fd)
+{
+	struct skb_array *array;
+	struct file *file = fget(fd);
+
+	if (!file)
+		return NULL;
+	array = tun_get_skb_array(file);
+	if (!IS_ERR(array))
+		goto out;
+	array = tap_get_skb_array(file);
+	if (!IS_ERR(array))
+		goto out;
+	array = NULL;
+out:
+	fput(file);
+	return array;
+}
+
 static struct socket *get_tap_socket(int fd)
 {
 	struct file *file = fget(fd);
@@ -1029,6 +1084,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vq->private_data = sock;
+		nvq->rx_array = get_tap_skb_array(fd);
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
-- 
2.7.4


* Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
  2017-03-30  7:22 ` [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing Jason Wang
@ 2017-03-30 13:53   ` Michael S. Tsirkin
  2017-03-31  3:52     ` Jason Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2017-03-30 13:53 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel

On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote:
> This patch introduce a batched version of consuming, consumer can
> dequeue more than one pointers from the ring at a time. We don't care
> about the reorder of reading here so no need for compiler barrier.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  include/linux/ptr_ring.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 6c70444..2be0f350 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
>  	return ptr;
>  }
>  
> +static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
> +					     void **array, int n)

Can we use a shorter name? ptr_ring_consume_batch?

> +{
> +	void *ptr;
> +	int i;
> +
> +	for (i = 0; i < n; i++) {
> +		ptr = __ptr_ring_consume(r);
> +		if (!ptr)
> +			break;
> +		array[i] = ptr;
> +	}
> +
> +	return i;
> +}
> +
>  /*
>   * Note: resize (below) nests producer lock within consumer lock, so if you
>   * call this in interrupt or BH context, you must disable interrupts/BH when

I'd like to add a code comment here explaining why we don't
care about cpu or compiler reordering. And I think the reason is
in the way you use this API: in vhost it does not matter
if you get fewer entries than are present in the ring.
That's ok but needs to be noted
in a code comment so people use this function correctly.

Also, I think you need to repeat the comment about cpu_relax
near this function: if someone uses it in a loop,
a compiler barrier is needed to prevent the compiler from
optimizing it out.

I note that ptr_ring_consume currently lacks any of these
comments so I'm ok with merging as is, and I'll add
documentation on top.
Like this perhaps?

/* Consume up to n entries and return the number of entries consumed
 * or 0 on ring empty.
 * Note: this might return early with fewer entries than are present in the
 * ring.
 * Note: callers invoking this in a loop must use a compiler barrier,
 * for example cpu_relax(). Callers must take consumer_lock
 * if the ring is ever resized - see e.g. ptr_ring_consume_batch.
 */
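
For illustration, a busy-polling caller of the unlocked variant might end
up with something like this (a sketch only, not part of the patch; it
assumes the caller already owns the consumer side):

        void *batch[16];
        int n;

        /* cpu_relax() acts as a compiler barrier here, forcing the ring
         * state to be re-read on every iteration.
         */
        while (!(n = __ptr_ring_consume_batched(r, batch, 16)))
                cpu_relax();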



> @@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
>  	return ptr;
>  }
>  
> +static inline int ptr_ring_consume_batched(struct ptr_ring *r,
> +					   void **array, int n)
> +{
> +	int ret;
> +
> +	spin_lock(&r->consumer_lock);
> +	ret = __ptr_ring_consume_batched(r, array, n);
> +	spin_unlock(&r->consumer_lock);
> +
> +	return ret;
> +}
> +
> +static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
> +					       void **array, int n)
> +{
> +	int ret;
> +
> +	spin_lock_irq(&r->consumer_lock);
> +	ret = __ptr_ring_consume_batched(r, array, n);
> +	spin_unlock_irq(&r->consumer_lock);
> +
> +	return ret;
> +}
> +
> +static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
> +					       void **array, int n)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	spin_lock_irqsave(&r->consumer_lock, flags);
> +	ret = __ptr_ring_consume_batched(r, array, n);
> +	spin_unlock_irqrestore(&r->consumer_lock, flags);
> +
> +	return ret;
> +}
> +
> +static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
> +					      void **array, int n)
> +{
> +	int ret;
> +
> +	spin_lock_bh(&r->consumer_lock);
> +	ret = __ptr_ring_consume_batched(r, array, n);
> +	spin_unlock_bh(&r->consumer_lock);
> +
> +	return ret;
> +}
> +
>  /* Cast to structure type and call a function without discarding from FIFO.
>   * Function must return a value.
>   * Callers must take consumer_lock.
> -- 
> 2.7.4


* Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
  2017-03-30  7:22 ` [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array Jason Wang
@ 2017-03-30 14:21   ` Michael S. Tsirkin
  2017-03-31  4:02     ` Jason Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2017-03-30 14:21 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel

On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote:
> We used to dequeue one skb during recvmsg() from skb_array, this could
> be inefficient because of the bad cache utilization

which cache does this refer to btw?

> and spinlock
> touching for each packet.

Do you mean the effect of extra two atomics here?

> This patch tries to batch them by calling
> batch dequeuing helpers explicitly on the exported skb array and pass
> the skb back through msg_control for underlayer socket to finish the
> userspace copying.
> 
> Tests were done by XDP1:
> - small buffer:
>   Before: 1.88Mpps
>   After : 2.25Mpps (+19.6%)
> - mergeable buffer:
>   Before: 1.83Mpps
>   After : 2.10Mpps (+14.7%)
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Looks like I misread the code previously. More comments below,
sorry about not asking these questions earlier.

> ---
>  drivers/vhost/net.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 60 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9b51989..ffa78c6 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -28,6 +28,8 @@
>  #include <linux/if_macvlan.h>
>  #include <linux/if_tap.h>
>  #include <linux/if_vlan.h>
> +#include <linux/skb_array.h>
> +#include <linux/skbuff.h>
>  
>  #include <net/sock.h>
>  
> @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
>  	struct vhost_virtqueue *vq;
>  };
>  
> +#define VHOST_RX_BATCH 64
>  struct vhost_net_virtqueue {
>  	struct vhost_virtqueue vq;
>  	size_t vhost_hlen;

Could you please try playing with batch size and see
what the effect is?


> @@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
>  	/* Reference counting for outstanding ubufs.
>  	 * Protected by vq mutex. Writers must also take device mutex. */
>  	struct vhost_net_ubuf_ref *ubufs;
> +	struct skb_array *rx_array;
> +	void *rxq[VHOST_RX_BATCH];
> +	int rt;
> +	int rh;
>  };
>  
>  struct vhost_net {
> @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
>  		n->vqs[i].ubufs = NULL;
>  		n->vqs[i].vhost_hlen = 0;
>  		n->vqs[i].sock_hlen = 0;
> +		n->vqs[i].rt = 0;
> +		n->vqs[i].rh = 0;
>  	}
>  
>  }
> @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
>  	mutex_unlock(&vq->mutex);
>  }
>  
> -static int peek_head_len(struct sock *sk)
> +static int fetch_skbs(struct vhost_net_virtqueue *rvq)
> +{
> +	if (rvq->rh != rvq->rt)
> +		goto out;
> +
> +	rvq->rh = rvq->rt = 0;
> +	rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
> +					    VHOST_RX_BATCH);
> +	if (!rvq->rt)
> +		return 0;
> +out:
> +	return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
> +}
> +
> +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
>  {
>  	struct socket *sock = sk->sk_socket;
>  	struct sk_buff *head;
>  	int len = 0;
>  	unsigned long flags;
>  
> +	if (rvq->rx_array)
> +		return fetch_skbs(rvq);
> +
>  	if (sock->ops->peek_len)
>  		return sock->ops->peek_len(sock);
>  
> @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
>  	return skb_queue_empty(&sk->sk_receive_queue);
>  }
>  
> -static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
> +static int vhost_net_rx_peek_head_len(struct vhost_net *net,
> +				      struct sock *sk)
>  {
> +	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
>  	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>  	struct vhost_virtqueue *vq = &nvq->vq;
>  	unsigned long uninitialized_var(endtime);
> -	int len = peek_head_len(sk);
> +	int len = peek_head_len(rvq, sk);
>  
>  	if (!len && vq->busyloop_timeout) {
>  		/* Both tx vq and rx socket were polled here */
> @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>  			vhost_poll_queue(&vq->poll);
>  		mutex_unlock(&vq->mutex);
>  
> -		len = peek_head_len(sk);
> +		len = peek_head_len(rvq, sk);
>  	}
>  
>  	return len;
> @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(headcount < 0))
>  			goto out;
> +		if (nvq->rx_array)
> +			msg.msg_control = nvq->rxq[nvq->rh++];
>  		/* On overrun, truncate and discard */
>  		if (unlikely(headcount > UIO_MAXIOV)) {
>  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);

So there's a bit of a mystery here. vhost code isn't
batched; all we are batching is the fetch from the tun ring.

So what is the source of the speedup?

Are queued spinlocks that expensive? They shouldn't be ...
Could you try using virt_spin_lock instead (at least as a quick hack)
to see whether that helps?
  
> @@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  		n->vqs[i].done_idx = 0;
>  		n->vqs[i].vhost_hlen = 0;
>  		n->vqs[i].sock_hlen = 0;
> +		n->vqs[i].rt = 0;
> +		n->vqs[i].rh = 0;
>  	}
>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>
> @@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
>  					struct vhost_virtqueue *vq)
>  {
>  	struct socket *sock;
> +	struct vhost_net_virtqueue *nvq =
> +		container_of(vq, struct vhost_net_virtqueue, vq);
>  
>  	mutex_lock(&vq->mutex);
>  	sock = vq->private_data;
>  	vhost_net_disable_vq(n, vq);
>  	vq->private_data = NULL;
> +	while (nvq->rh != nvq->rt)
> +		kfree_skb(nvq->rxq[nvq->rh++]);
>  	mutex_unlock(&vq->mutex);
>  	return sock;
>  }

So I didn't realise it but of course the effect will be
dropped packets if we just connect and disconnect without
consuming anything.

So I think it's worth it to try analysing the speedup a bit
and see whether we can get the gains without queueing
the skbs in vhost.

> @@ -953,6 +989,25 @@ static struct socket *get_raw_socket(int fd)
>  	return ERR_PTR(r);
>  }
>  
> +static struct skb_array *get_tap_skb_array(int fd)

That's a confusing name, pls prefix with vhost_, not tap.

> +{
> +	struct skb_array *array;
> +	struct file *file = fget(fd);
> +
> +	if (!file)
> +		return NULL;
> +	array = tun_get_skb_array(file);
> +	if (!IS_ERR(array))
> +		goto out;
> +	array = tap_get_skb_array(file);
> +	if (!IS_ERR(array))
> +		goto out;
> +	array = NULL;
> +out:
> +	fput(file);
> +	return array;
> +}
> +
>  static struct socket *get_tap_socket(int fd)
>  {
>  	struct file *file = fget(fd);
> @@ -1029,6 +1084,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  
>  		vhost_net_disable_vq(n, vq);
>  		vq->private_data = sock;
> +		nvq->rx_array = get_tap_skb_array(fd);
>  		r = vhost_vq_init_access(vq);
>  		if (r)
>  			goto err_used;
> -- 
> 2.7.4


* Re: [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control
  2017-03-30  7:22 ` [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control Jason Wang
@ 2017-03-30 15:03   ` Michael S. Tsirkin
  2017-03-31  4:07     ` Jason Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2017-03-30 15:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel

On Thu, Mar 30, 2017 at 03:22:29PM +0800, Jason Wang wrote:
> This patch makes tap_recvmsg() can receive from skb from its caller
> through msg_control. Vhost_net will be the first user.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/tap.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index abdaf86..07d9174 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
>  
>  static ssize_t tap_do_read(struct tap_queue *q,
>  			   struct iov_iter *to,
> -			   int noblock)
> +			   int noblock, struct sk_buff *skb)
>  {
>  	DEFINE_WAIT(wait);
> -	struct sk_buff *skb;
>  	ssize_t ret = 0;
>  
>  	if (!iov_iter_count(to))
>  		return 0;
>  
> +	if (skb)
> +		goto done;
> +
>  	while (1) {
>  		if (!noblock)
>  			prepare_to_wait(sk_sleep(&q->sk), &wait,
> @@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>  	if (!noblock)
>  		finish_wait(sk_sleep(&q->sk), &wait);
>  
> +done:

Please just use an if {} block here. goto on error is ok,
but we are far from done here and goto done is misleading.
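
I.e. roughly this shape (a sketch of the suggested structure only; the
elided parts are the existing code, unchanged):

        if (!skb) {
                while (1) {
                        /* existing prepare_to_wait()/dequeue/schedule()
                         * loop, exactly as today
                         */
                }
                if (!noblock)
                        finish_wait(sk_sleep(&q->sk), &wait);
        }

        if (skb) {
                ret = tap_put_user(q, skb, to);
                /* existing error handling */
        }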


>  	if (skb) {
>  		ret = tap_put_user(q, skb, to);
>  		if (unlikely(ret < 0))
> @@ -872,7 +875,7 @@ static ssize_t tap_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	struct tap_queue *q = file->private_data;
>  	ssize_t len = iov_iter_count(to), ret;
>  
> -	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK);
> +	ret = tap_do_read(q, to, file->f_flags & O_NONBLOCK, NULL);
>  	ret = min_t(ssize_t, ret, len);
>  	if (ret > 0)
>  		iocb->ki_pos = ret;
> @@ -1155,7 +1158,8 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
>  	int ret;
>  	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
>  		return -EINVAL;
> -	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
> +	ret = tap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT,
> +			  m->msg_control);
>  	if (ret > total_len) {
>  		m->msg_flags |= MSG_TRUNC;
>  		ret = flags & MSG_TRUNC ? ret : total_len;
> -- 
> 2.7.4


* Re: [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control
  2017-03-30  7:22 ` [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control Jason Wang
@ 2017-03-30 15:06   ` Michael S. Tsirkin
  2017-03-31  4:10     ` Jason Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2017-03-30 15:06 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel

On Thu, Mar 30, 2017 at 03:22:28PM +0800, Jason Wang wrote:
> This patch makes tun_recvmsg() can receive from skb from its caller
> through msg_control. Vhost_net will be the first user.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Do we need to bother with tun? I didn't realize one
can even use that with vhost. What would be the point of
all the virtio header stuff dealing with checksums etc?

Even if you see a use-case is it worth optimizing?


> ---
>  drivers/net/tun.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 70dd9ec..a82bced 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1498,9 +1498,8 @@ static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
>  
>  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>  			   struct iov_iter *to,
> -			   int noblock)
> +			   int noblock, struct sk_buff *skb)
>  {
> -	struct sk_buff *skb;
>  	ssize_t ret;
>  	int err;
>  
> @@ -1509,10 +1508,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>  	if (!iov_iter_count(to))
>  		return 0;
>  
> -	/* Read frames from ring */
> -	skb = tun_ring_recv(tfile, noblock, &err);
> -	if (!skb)
> -		return err;
> +	if (!skb) {
> +		/* Read frames from ring */
> +		skb = tun_ring_recv(tfile, noblock, &err);
> +		if (!skb)
> +			return err;
> +	}
>  
>  	ret = tun_put_user(tun, tfile, skb, to);
>  	if (unlikely(ret < 0))
> @@ -1532,7 +1533,7 @@ static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  
>  	if (!tun)
>  		return -EBADFD;
> -	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK);
> +	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
>  	ret = min_t(ssize_t, ret, len);
>  	if (ret > 0)
>  		iocb->ki_pos = ret;
> @@ -1634,7 +1635,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>  					 SOL_PACKET, TUN_TX_TIMESTAMP);
>  		goto out;
>  	}
> -	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
> +	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT,
> +			  m->msg_control);
>  	if (ret > (ssize_t)total_len) {
>  		m->msg_flags |= MSG_TRUNC;
>  		ret = flags & MSG_TRUNC ? ret : total_len;
> -- 
> 2.7.4


* Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
  2017-03-30 13:53   ` Michael S. Tsirkin
@ 2017-03-31  3:52     ` Jason Wang
  2017-03-31 14:31       ` Michael S. Tsirkin
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2017-03-31  3:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel



On 2017年03月30日 21:53, Michael S. Tsirkin wrote:
> On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote:
>> This patch introduce a batched version of consuming, consumer can
>> dequeue more than one pointers from the ring at a time. We don't care
>> about the reorder of reading here so no need for compiler barrier.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   include/linux/ptr_ring.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 65 insertions(+)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index 6c70444..2be0f350 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
>>   	return ptr;
>>   }
>>   
>> +static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
>> +					     void **array, int n)
> Can we use a shorter name? ptr_ring_consume_batch?

Ok, but at least we need to keep the prefix since there's a locked version.



>
>> +{
>> +	void *ptr;
>> +	int i;
>> +
>> +	for (i = 0; i < n; i++) {
>> +		ptr = __ptr_ring_consume(r);
>> +		if (!ptr)
>> +			break;
>> +		array[i] = ptr;
>> +	}
>> +
>> +	return i;
>> +}
>> +
>>   /*
>>    * Note: resize (below) nests producer lock within consumer lock, so if you
>>    * call this in interrupt or BH context, you must disable interrupts/BH when
> I'd like to add a code comment here explaining why we don't
> care about cpu or compiler reordering. And I think the reason is
> in the way you use this API: in vhost it does not matter
> if you get less entries than present in the ring.
> That's ok but needs to be noted
> in a code comment so people use this function correctly.

Interesting, but I still think it's not necessary.

If the consumer is doing busy polling, it will eventually get the
entries. If the consumer needs a notification from the producer, it
should drain the queue, which means it needs to enable notification
before the last consume attempt; otherwise it would be a bug. The batch
consuming function in this patch guarantees returning at least one
pointer if there are any, which looks sufficient for correctness?
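
The pattern being described, roughly (illustrative pseudocode; the
enable/disable/wait notification helpers are hypothetical placeholders,
not code from this series):

        for (;;) {
                n = skb_array_consume_batched(a, batch, BATCH);
                if (n)
                        break;                  /* got work */
                enable_notification();          /* hypothetical: re-arm producer signal */
                /* final drain after re-arming, to close the race */
                n = skb_array_consume_batched(a, batch, BATCH);
                if (n) {
                        disable_notification(); /* hypothetical */
                        break;
                }
                wait_for_notification();        /* hypothetical */
        }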

Thanks

>
> Also, I think you need to repeat the comment about cpu_relax
> near this function: if someone uses it in a loop,
> a compiler barrier is needed to prevent compiler from
> optimizing it out.
>
> I note that ptr_ring_consume currently lacks any of these
> comments so I'm ok with merging as is, and I'll add
> documentation on top.
> Like this perhaps?
>
> /* Consume up to n entries and return the number of entries consumed
>   * or 0 on ring empty.
>   * Note: this might return early with less entries than present in the
>   * ring.
>   * Note: callers invoking this in a loop must use a compiler barrier,
>   * for example cpu_relax(). Callers must take consumer_lock
>   * if the ring is ever resized - see e.g. ptr_ring_consume_batch.
>   */
>
>
>
>> @@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
>>   	return ptr;
>>   }
>>   
>> +static inline int ptr_ring_consume_batched(struct ptr_ring *r,
>> +					   void **array, int n)
>> +{
>> +	int ret;
>> +
>> +	spin_lock(&r->consumer_lock);
>> +	ret = __ptr_ring_consume_batched(r, array, n);
>> +	spin_unlock(&r->consumer_lock);
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
>> +					       void **array, int n)
>> +{
>> +	int ret;
>> +
>> +	spin_lock_irq(&r->consumer_lock);
>> +	ret = __ptr_ring_consume_batched(r, array, n);
>> +	spin_unlock_irq(&r->consumer_lock);
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
>> +					       void **array, int n)
>> +{
>> +	unsigned long flags;
>> +	int ret;
>> +
>> +	spin_lock_irqsave(&r->consumer_lock, flags);
>> +	ret = __ptr_ring_consume_batched(r, array, n);
>> +	spin_unlock_irqrestore(&r->consumer_lock, flags);
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
>> +					      void **array, int n)
>> +{
>> +	int ret;
>> +
>> +	spin_lock_bh(&r->consumer_lock);
>> +	ret = __ptr_ring_consume_batched(r, array, n);
>> +	spin_unlock_bh(&r->consumer_lock);
>> +
>> +	return ret;
>> +}
>> +
>>   /* Cast to structure type and call a function without discarding from FIFO.
>>    * Function must return a value.
>>    * Callers must take consumer_lock.
>> -- 
>> 2.7.4


* Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
  2017-03-30 14:21   ` Michael S. Tsirkin
@ 2017-03-31  4:02     ` Jason Wang
  2017-03-31  6:47       ` Jason Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2017-03-31  4:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel



On 2017年03月30日 22:21, Michael S. Tsirkin wrote:
> On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote:
>> We used to dequeue one skb during recvmsg() from skb_array, this could
>> be inefficient because of the bad cache utilization
> which cache does this refer to btw?

Both icache and dcache more or less.

>
>> and spinlock
>> touching for each packet.
> Do you mean the effect of extra two atomics here?

In fact four; packet length peeking needs another two.

>
>> This patch tries to batch them by calling
>> batch dequeuing helpers explicitly on the exported skb array and pass
>> the skb back through msg_control for underlayer socket to finish the
>> userspace copying.
>>
>> Tests were done by XDP1:
>> - small buffer:
>>    Before: 1.88Mpps
>>    After : 2.25Mpps (+19.6%)
>> - mergeable buffer:
>>    Before: 1.83Mpps
>>    After : 2.10Mpps (+14.7%)
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Looks like I misread the code previously. More comments below,
> sorry about not asking these questions earlier.
>
>> ---
>>   drivers/vhost/net.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 60 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 9b51989..ffa78c6 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -28,6 +28,8 @@
>>   #include <linux/if_macvlan.h>
>>   #include <linux/if_tap.h>
>>   #include <linux/if_vlan.h>
>> +#include <linux/skb_array.h>
>> +#include <linux/skbuff.h>
>>   
>>   #include <net/sock.h>
>>   
>> @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
>>   	struct vhost_virtqueue *vq;
>>   };
>>   
>> +#define VHOST_RX_BATCH 64
>>   struct vhost_net_virtqueue {
>>   	struct vhost_virtqueue vq;
>>   	size_t vhost_hlen;
> Could you please try playing with batch size and see
> what the effect is?

Ok. I tried 32 which seems slower than 64 but still faster than no batching.

>
>> @@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
>>   	/* Reference counting for outstanding ubufs.
>>   	 * Protected by vq mutex. Writers must also take device mutex. */
>>   	struct vhost_net_ubuf_ref *ubufs;
>> +	struct skb_array *rx_array;
>> +	void *rxq[VHOST_RX_BATCH];
>> +	int rt;
>> +	int rh;
>>   };
>>   
>>   struct vhost_net {
>> @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
>>   		n->vqs[i].ubufs = NULL;
>>   		n->vqs[i].vhost_hlen = 0;
>>   		n->vqs[i].sock_hlen = 0;
>> +		n->vqs[i].rt = 0;
>> +		n->vqs[i].rh = 0;
>>   	}
>>   
>>   }
>> @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
>>   	mutex_unlock(&vq->mutex);
>>   }
>>   
>> -static int peek_head_len(struct sock *sk)
>> +static int fetch_skbs(struct vhost_net_virtqueue *rvq)
>> +{
>> +	if (rvq->rh != rvq->rt)
>> +		goto out;
>> +
>> +	rvq->rh = rvq->rt = 0;
>> +	rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
>> +					    VHOST_RX_BATCH);
>> +	if (!rvq->rt)
>> +		return 0;
>> +out:
>> +	return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
>> +}
>> +
>> +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
>>   {
>>   	struct socket *sock = sk->sk_socket;
>>   	struct sk_buff *head;
>>   	int len = 0;
>>   	unsigned long flags;
>>   
>> +	if (rvq->rx_array)
>> +		return fetch_skbs(rvq);
>> +
>>   	if (sock->ops->peek_len)
>>   		return sock->ops->peek_len(sock);
>>   
>> @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
>>   	return skb_queue_empty(&sk->sk_receive_queue);
>>   }
>>   
>> -static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>> +static int vhost_net_rx_peek_head_len(struct vhost_net *net,
>> +				      struct sock *sk)
>>   {
>> +	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
>>   	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>>   	struct vhost_virtqueue *vq = &nvq->vq;
>>   	unsigned long uninitialized_var(endtime);
>> -	int len = peek_head_len(sk);
>> +	int len = peek_head_len(rvq, sk);
>>   
>>   	if (!len && vq->busyloop_timeout) {
>>   		/* Both tx vq and rx socket were polled here */
>> @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>>   			vhost_poll_queue(&vq->poll);
>>   		mutex_unlock(&vq->mutex);
>>   
>> -		len = peek_head_len(sk);
>> +		len = peek_head_len(rvq, sk);
>>   	}
>>   
>>   	return len;
>> @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
>>   		/* On error, stop handling until the next kick. */
>>   		if (unlikely(headcount < 0))
>>   			goto out;
>> +		if (nvq->rx_array)
>> +			msg.msg_control = nvq->rxq[nvq->rh++];
>>   		/* On overrun, truncate and discard */
>>   		if (unlikely(headcount > UIO_MAXIOV)) {
>>   			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> So there's a bit of a mystery here. vhost code isn't
> batched, all we are batching is the fetch from the tun ring.

I already have vhost batching code on top of this (e.g. descriptor
index prefetching and batched used ring updating, like dpdk). Batched
dequeuing from the skb array is a prerequisite for that.

>
> So what is the source of the speedup?

Well, perf diff shows something like this:

     13.69%   +2.05%  [kernel.vmlinux]  [k] copy_user_generic_string
     10.77%   +2.04%  [vhost]           [k] vhost_signal
      9.59%   -3.28%  [kernel.vmlinux]  [k] copy_to_iter
      7.22%           [tun]             [k] tun_peek_len
      6.06%   -1.50%  [tun]             [k] tun_do_read.part.45
      4.83%   +4.13%  [vhost]           [k] vhost_get_vq_desc
      4.61%   -4.42%  [kernel.vmlinux]  [k] _raw_spin_lock

Batching eliminates 95% of the calls to raw_spin_lock.

>
> Are queued spinlocks that expensive? They shouldn't be ...
> Could you try using virt_spin_lock instead (at least as a quick hack)
> to see whether that helps?
>    

Will try.

>> @@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>>   		n->vqs[i].done_idx = 0;
>>   		n->vqs[i].vhost_hlen = 0;
>>   		n->vqs[i].sock_hlen = 0;
>> +		n->vqs[i].rt = 0;
>> +		n->vqs[i].rh = 0;
>>   	}
>>   	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>>
>> @@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
>>   					struct vhost_virtqueue *vq)
>>   {
>>   	struct socket *sock;
>> +	struct vhost_net_virtqueue *nvq =
>> +		container_of(vq, struct vhost_net_virtqueue, vq);
>>   
>>   	mutex_lock(&vq->mutex);
>>   	sock = vq->private_data;
>>   	vhost_net_disable_vq(n, vq);
>>   	vq->private_data = NULL;
>> +	while (nvq->rh != nvq->rt)
>> +		kfree_skb(nvq->rxq[nvq->rh++]);
>>   	mutex_unlock(&vq->mutex);
>>   	return sock;
>>   }
> So I didn't realise it but of course the effect will be
> dropped packets if we just connect and disconnect without
> consuming anything.

Any reason we need to care about this?

>
> So I think it's worth it to try analysing the speedup a bit
> and see whether we can get the gains without queueing
> the skbs in vhost.

Technically, other userspace may call recvmsg() at the same time, so
it's not easy to get the same speedup as this patch does.

>> @@ -953,6 +989,25 @@ static struct socket *get_raw_socket(int fd)
>>   	return ERR_PTR(r);
>>   }
>>   
>> +static struct skb_array *get_tap_skb_array(int fd)
> That's a confusing name, pls prefix with vhost_, not tap.

Ok, but I just followed the naming of the existing code (e.g.
get_tap_socket below).

Thanks

>
>> +{
>> +	struct skb_array *array;
>> +	struct file *file = fget(fd);
>> +
>> +	if (!file)
>> +		return NULL;
>> +	array = tun_get_skb_array(file);
>> +	if (!IS_ERR(array))
>> +		goto out;
>> +	array = tap_get_skb_array(file);
>> +	if (!IS_ERR(array))
>> +		goto out;
>> +	array = NULL;
>> +out:
>> +	fput(file);
>> +	return array;
>> +}
>> +
>>   static struct socket *get_tap_socket(int fd)
>>   {
>>   	struct file *file = fget(fd);
>> @@ -1029,6 +1084,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>>   
>>   		vhost_net_disable_vq(n, vq);
>>   		vq->private_data = sock;
>> +		nvq->rx_array = get_tap_skb_array(fd);
>>   		r = vhost_vq_init_access(vq);
>>   		if (r)
>>   			goto err_used;
>> -- 
>> 2.7.4


* Re: [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control
  2017-03-30 15:03   ` Michael S. Tsirkin
@ 2017-03-31  4:07     ` Jason Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-31  4:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel



On 2017年03月30日 23:03, Michael S. Tsirkin wrote:
> On Thu, Mar 30, 2017 at 03:22:29PM +0800, Jason Wang wrote:
>> This patch makes tap_recvmsg() can receive from skb from its caller
>> through msg_control. Vhost_net will be the first user.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/net/tap.c | 12 ++++++++----
>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>> index abdaf86..07d9174 100644
>> --- a/drivers/net/tap.c
>> +++ b/drivers/net/tap.c
>> @@ -824,15 +824,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>   
>>   static ssize_t tap_do_read(struct tap_queue *q,
>>   			   struct iov_iter *to,
>> -			   int noblock)
>> +			   int noblock, struct sk_buff *skb)
>>   {
>>   	DEFINE_WAIT(wait);
>> -	struct sk_buff *skb;
>>   	ssize_t ret = 0;
>>   
>>   	if (!iov_iter_count(to))
>>   		return 0;
>>   
>> +	if (skb)
>> +		goto done;
>> +
>>   	while (1) {
>>   		if (!noblock)
>>   			prepare_to_wait(sk_sleep(&q->sk), &wait,
>> @@ -856,6 +858,7 @@ static ssize_t tap_do_read(struct tap_queue *q,
>>   	if (!noblock)
>>   		finish_wait(sk_sleep(&q->sk), &wait);
>>   
>> +done:
> Please just use an if {} block here. goto on error is ok,
> but we are far from done here and goto done is misleading.
>
>

Ok.

Thanks.


* Re: [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control
  2017-03-30 15:06   ` Michael S. Tsirkin
@ 2017-03-31  4:10     ` Jason Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-31  4:10 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel



On 2017年03月30日 23:06, Michael S. Tsirkin wrote:
> On Thu, Mar 30, 2017 at 03:22:28PM +0800, Jason Wang wrote:
>> This patch makes tun_recvmsg() can receive from skb from its caller
>> through msg_control. Vhost_net will be the first user.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Do we need to bother with tun? I didn't realize one
> can even use that with vhost. What would be the point of
> all the virtio header stuff dealing with checksums etc?
>
> Even if you see a use-case is it worth optimizing?
>
>

It's for tap in fact. I used "tun" just because we already have a
tap.c, which is used by macvtap.

Thanks


* Re: [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array
  2017-03-31  4:02     ` Jason Wang
@ 2017-03-31  6:47       ` Jason Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-03-31  6:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel



On 2017年03月31日 12:02, Jason Wang wrote:
>
>
> On 2017年03月30日 22:21, Michael S. Tsirkin wrote:
>> On Thu, Mar 30, 2017 at 03:22:30PM +0800, Jason Wang wrote:
>>> We used to dequeue one skb during recvmsg() from skb_array, this could
>>> be inefficient because of the bad cache utilization
>> which cache does this refer to btw?
>
> Both icache and dcache more or less.
>
>>
>>> and spinlock
>>> touching for each packet.
>> Do you mean the effect of extra two atomics here?
>
> In fact four, packet length peeking needs another two.
>
>>
>>> This patch tries to batch them by calling
>>> batch dequeuing helpers explicitly on the exported skb array and pass
>>> the skb back through msg_control for underlayer socket to finish the
>>> userspace copying.
>>>
>>> Tests were done by XDP1:
>>> - small buffer:
>>>    Before: 1.88Mpps
>>>    After : 2.25Mpps (+19.6%)
>>> - mergeable buffer:
>>>    Before: 1.83Mpps
>>>    After : 2.10Mpps (+14.7%)
>>>
>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> Looks like I misread the code previously. More comments below,
>> sorry about not asking these questions earlier.
>>
>>> ---
>>>   drivers/vhost/net.c | 64 
>>> +++++++++++++++++++++++++++++++++++++++++++++++++----
>>>   1 file changed, 60 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index 9b51989..ffa78c6 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -28,6 +28,8 @@
>>>   #include <linux/if_macvlan.h>
>>>   #include <linux/if_tap.h>
>>>   #include <linux/if_vlan.h>
>>> +#include <linux/skb_array.h>
>>> +#include <linux/skbuff.h>
>>>     #include <net/sock.h>
>>>   @@ -85,6 +87,7 @@ struct vhost_net_ubuf_ref {
>>>       struct vhost_virtqueue *vq;
>>>   };
>>>   +#define VHOST_RX_BATCH 64
>>>   struct vhost_net_virtqueue {
>>>       struct vhost_virtqueue vq;
>>>       size_t vhost_hlen;
>> Could you please try playing with batch size and see
>> what the effect is?
>
> Ok. I tried 32 which seems slower than 64 but still faster than no 
> batching.

Ok, result is:

no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps


>
>>
>>> @@ -99,6 +102,10 @@ struct vhost_net_virtqueue {
>>>       /* Reference counting for outstanding ubufs.
>>>        * Protected by vq mutex. Writers must also take device mutex. */
>>>       struct vhost_net_ubuf_ref *ubufs;
>>> +    struct skb_array *rx_array;
>>> +    void *rxq[VHOST_RX_BATCH];
>>> +    int rt;
>>> +    int rh;
>>>   };
>>>     struct vhost_net {
>>> @@ -201,6 +208,8 @@ static void vhost_net_vq_reset(struct vhost_net *n)
>>>           n->vqs[i].ubufs = NULL;
>>>           n->vqs[i].vhost_hlen = 0;
>>>           n->vqs[i].sock_hlen = 0;
>>> +        n->vqs[i].rt = 0;
>>> +        n->vqs[i].rh = 0;
>>>       }
>>>     }
>>> @@ -503,13 +512,30 @@ static void handle_tx(struct vhost_net *net)
>>>       mutex_unlock(&vq->mutex);
>>>   }
>>>   -static int peek_head_len(struct sock *sk)
>>> +static int fetch_skbs(struct vhost_net_virtqueue *rvq)
>>> +{
>>> +    if (rvq->rh != rvq->rt)
>>> +        goto out;
>>> +
>>> +    rvq->rh = rvq->rt = 0;
>>> +    rvq->rt = skb_array_consume_batched(rvq->rx_array, rvq->rxq,
>>> +                        VHOST_RX_BATCH);
>>> +    if (!rvq->rt)
>>> +        return 0;
>>> +out:
>>> +    return __skb_array_len_with_tag(rvq->rxq[rvq->rh]);
>>> +}
>>> +
>>> +static int peek_head_len(struct vhost_net_virtqueue *rvq, struct 
>>> sock *sk)
>>>   {
>>>       struct socket *sock = sk->sk_socket;
>>>       struct sk_buff *head;
>>>       int len = 0;
>>>       unsigned long flags;
>>>   +    if (rvq->rx_array)
>>> +        return fetch_skbs(rvq);
>>> +
>>>       if (sock->ops->peek_len)
>>>           return sock->ops->peek_len(sock);
>>>   @@ -535,12 +561,14 @@ static int sk_has_rx_data(struct sock *sk)
>>>       return skb_queue_empty(&sk->sk_receive_queue);
>>>   }
>>>   -static int vhost_net_rx_peek_head_len(struct vhost_net *net, 
>>> struct sock *sk)
>>> +static int vhost_net_rx_peek_head_len(struct vhost_net *net,
>>> +                      struct sock *sk)
>>>   {
>>> +    struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
>>>       struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>>>       struct vhost_virtqueue *vq = &nvq->vq;
>>>       unsigned long uninitialized_var(endtime);
>>> -    int len = peek_head_len(sk);
>>> +    int len = peek_head_len(rvq, sk);
>>>         if (!len && vq->busyloop_timeout) {
>>>           /* Both tx vq and rx socket were polled here */
>>> @@ -561,7 +589,7 @@ static int vhost_net_rx_peek_head_len(struct 
>>> vhost_net *net, struct sock *sk)
>>>               vhost_poll_queue(&vq->poll);
>>>           mutex_unlock(&vq->mutex);
>>>   -        len = peek_head_len(sk);
>>> +        len = peek_head_len(rvq, sk);
>>>       }
>>>         return len;
>>> @@ -699,6 +727,8 @@ static void handle_rx(struct vhost_net *net)
>>>           /* On error, stop handling until the next kick. */
>>>           if (unlikely(headcount < 0))
>>>               goto out;
>>> +        if (nvq->rx_array)
>>> +            msg.msg_control = nvq->rxq[nvq->rh++];
>>>           /* On overrun, truncate and discard */
>>>           if (unlikely(headcount > UIO_MAXIOV)) {
>>>               iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>> So there's a bit of a mystery here. vhost code isn't
>> batched; all we are batching is the fetch from the tun ring.
>
> I already have vhost batching code on top (e.g. descriptor index 
> prefetching and batched used ring updating, like dpdk). Batched dequeuing 
> from the skb array is a prerequisite for that.
>
>>
>> So what is the source of the speedup?
>
> Well, perf diff told something like this:
>
>     13.69%   +2.05%  [kernel.vmlinux]  [k] copy_user_generic_string
>     10.77%   +2.04%  [vhost]           [k] vhost_signal
>      9.59%   -3.28%  [kernel.vmlinux]  [k] copy_to_iter
>      7.22%           [tun]             [k] tun_peek_len
>      6.06%   -1.50%  [tun]             [k] tun_do_read.part.45
>      4.83%   +4.13%  [vhost]           [k] vhost_get_vq_desc
>      4.61%   -4.42%  [kernel.vmlinux]  [k] _raw_spin_lock
>
> Batching eliminates 95% of the calls to raw_spin_lock.
>
>>
>> Are queued spinlocks that expensive? They shouldn't be ...
>> Could you try using virt_spin_lock instead (at least as a quick hack)
>> to see whether that helps?
>
> Will try.

I doubt it will make much difference; virt_spin_lock() has:

     do {
         while (atomic_read(&lock->val) != 0)
             cpu_relax();
     } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

while queued_spin_lock():

     val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
     if (likely(val == 0))
         return;
     queued_spin_lock_slowpath(lock, val);

Since there were no other consumers during the test, the only difference is that 
the queued version uses atomic_cmpxchg_acquire(), a relaxed variant of atomic_cmpxchg().

Thanks

>
>>> @@ -841,6 +871,8 @@ static int vhost_net_open(struct inode *inode, 
>>> struct file *f)
>>>           n->vqs[i].done_idx = 0;
>>>           n->vqs[i].vhost_hlen = 0;
>>>           n->vqs[i].sock_hlen = 0;
>>> +        n->vqs[i].rt = 0;
>>> +        n->vqs[i].rh = 0;
>>>       }
>>>       vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>>>
>>> @@ -856,11 +888,15 @@ static struct socket *vhost_net_stop_vq(struct 
>>> vhost_net *n,
>>>                       struct vhost_virtqueue *vq)
>>>   {
>>>       struct socket *sock;
>>> +    struct vhost_net_virtqueue *nvq =
>>> +        container_of(vq, struct vhost_net_virtqueue, vq);
>>>         mutex_lock(&vq->mutex);
>>>       sock = vq->private_data;
>>>       vhost_net_disable_vq(n, vq);
>>>       vq->private_data = NULL;
>>> +    while (nvq->rh != nvq->rt)
>>> +        kfree_skb(nvq->rxq[nvq->rh++]);
>>>       mutex_unlock(&vq->mutex);
>>>       return sock;
>>>   }
>> So I didn't realise it but of course the effect will be
>> dropped packets if we just connect and disconnect without
>> consuming anything.
>
> Any reason that we need to care about this?
>
>>
>> So I think it's worth it to try analysing the speedup a bit
>> and see whether we can get the gains without queueing
>> the skbs in vhost.
>
> Technically, other userspace may do recvmsg at the same time, so it's 
> not easy to gain the same speedup as this patch.
>
>>> @@ -953,6 +989,25 @@ static struct socket *get_raw_socket(int fd)
>>>       return ERR_PTR(r);
>>>   }
>>>   +static struct skb_array *get_tap_skb_array(int fd)
>> That's a confusing name, pls prefix with vhost_, not tap.
>
> Ok, but I just followed the naming of the existing code (e.g. 
> get_tap_socket below).
>
> Thanks
>
>>
>>> +{
>>> +    struct skb_array *array;
>>> +    struct file *file = fget(fd);
>>> +
>>> +    if (!file)
>>> +        return NULL;
>>> +    array = tun_get_skb_array(file);
>>> +    if (!IS_ERR(array))
>>> +        goto out;
>>> +    array = tap_get_skb_array(file);
>>> +    if (!IS_ERR(array))
>>> +        goto out;
>>> +    array = NULL;
>>> +out:
>>> +    fput(file);
>>> +    return array;
>>> +}
>>> +
>>>   static struct socket *get_tap_socket(int fd)
>>>   {
>>>       struct file *file = fget(fd);
>>> @@ -1029,6 +1084,7 @@ static long vhost_net_set_backend(struct 
>>> vhost_net *n, unsigned index, int fd)
>>>             vhost_net_disable_vq(n, vq);
>>>           vq->private_data = sock;
>>> +        nvq->rx_array = get_tap_skb_array(fd);
>>>           r = vhost_vq_init_access(vq);
>>>           if (r)
>>>               goto err_used;
>>> -- 
>>> 2.7.4
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
  2017-03-31  3:52     ` Jason Wang
@ 2017-03-31 14:31       ` Michael S. Tsirkin
  2017-04-01  5:14         ` Jason Wang
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2017-03-31 14:31 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel

On Fri, Mar 31, 2017 at 11:52:24AM +0800, Jason Wang wrote:
> 
> 
> On 2017-03-30 21:53, Michael S. Tsirkin wrote:
> > On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote:
> > > This patch introduce a batched version of consuming, consumer can
> > > dequeue more than one pointers from the ring at a time. We don't care
> > > about the reorder of reading here so no need for compiler barrier.
> > > 
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> > >   include/linux/ptr_ring.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 65 insertions(+)
> > > 
> > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > index 6c70444..2be0f350 100644
> > > --- a/include/linux/ptr_ring.h
> > > +++ b/include/linux/ptr_ring.h
> > > @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
> > >   	return ptr;
> > >   }
> > > +static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
> > > +					     void **array, int n)
> > Can we use a shorter name? ptr_ring_consume_batch?
> 
> Ok, but at least we need to keep the prefix since there's a locked version.
> 
> 
> 
> > 
> > > +{
> > > +	void *ptr;
> > > +	int i;
> > > +
> > > +	for (i = 0; i < n; i++) {
> > > +		ptr = __ptr_ring_consume(r);
> > > +		if (!ptr)
> > > +			break;
> > > +		array[i] = ptr;
> > > +	}
> > > +
> > > +	return i;
> > > +}
> > > +
> > >   /*
> > >    * Note: resize (below) nests producer lock within consumer lock, so if you
> > >    * call this in interrupt or BH context, you must disable interrupts/BH when
> > I'd like to add a code comment here explaining why we don't
> > care about cpu or compiler reordering. And I think the reason is
> > in the way you use this API: in vhost it does not matter
> > if you get less entries than present in the ring.
> > That's ok but needs to be noted
> > in a code comment so people use this function correctly.
> 
> Interesting, but I still think it's not necessary.
> 
> If the consumer is busy polling, it will eventually get the entries. If
> the consumer needs notification from the producer, it should drain the queue,
> which means it needs to enable notification before the last consume call;
> otherwise it is a bug. The batch consuming function in this patch
> guarantees returning at least one pointer if there are entries in the ring;
> doesn't this look sufficient for correctness?
> 
> Thanks

You ask for N entries but get N-1. This seems to imply the
ring is now empty. Do we guarantee this?


> > 
> > Also, I think you need to repeat the comment about cpu_relax
> > near this function: if someone uses it in a loop,
> > a compiler barrier is needed to prevent compiler from
> > optimizing it out.
> > 
> > I note that ptr_ring_consume currently lacks any of these
> > comments so I'm ok with merging as is, and I'll add
> > documentation on top.
> > Like this perhaps?
> > 
> > /* Consume up to n entries and return the number of entries consumed
> >   * or 0 on ring empty.
> >   * Note: this might return early with less entries than present in the
> >   * ring.
> >   * Note: callers invoking this in a loop must use a compiler barrier,
> >   * for example cpu_relax(). Callers must take consumer_lock
> >   * if the ring is ever resized - see e.g. ptr_ring_consume_batch.
> >   */
> > 
> > 
> > 
> > > @@ -297,6 +313,55 @@ static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
> > >   	return ptr;
> > >   }
> > > +static inline int ptr_ring_consume_batched(struct ptr_ring *r,
> > > +					   void **array, int n)
> > > +{
> > > +	int ret;
> > > +
> > > +	spin_lock(&r->consumer_lock);
> > > +	ret = __ptr_ring_consume_batched(r, array, n);
> > > +	spin_unlock(&r->consumer_lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static inline int ptr_ring_consume_batched_irq(struct ptr_ring *r,
> > > +					       void **array, int n)
> > > +{
> > > +	int ret;
> > > +
> > > +	spin_lock_irq(&r->consumer_lock);
> > > +	ret = __ptr_ring_consume_batched(r, array, n);
> > > +	spin_unlock_irq(&r->consumer_lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static inline int ptr_ring_consume_batched_any(struct ptr_ring *r,
> > > +					       void **array, int n)
> > > +{
> > > +	unsigned long flags;
> > > +	int ret;
> > > +
> > > +	spin_lock_irqsave(&r->consumer_lock, flags);
> > > +	ret = __ptr_ring_consume_batched(r, array, n);
> > > +	spin_unlock_irqrestore(&r->consumer_lock, flags);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r,
> > > +					      void **array, int n)
> > > +{
> > > +	int ret;
> > > +
> > > +	spin_lock_bh(&r->consumer_lock);
> > > +	ret = __ptr_ring_consume_batched(r, array, n);
> > > +	spin_unlock_bh(&r->consumer_lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >   /* Cast to structure type and call a function without discarding from FIFO.
> > >    * Function must return a value.
> > >    * Callers must take consumer_lock.
> > > -- 
> > > 2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing
  2017-03-31 14:31       ` Michael S. Tsirkin
@ 2017-04-01  5:14         ` Jason Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2017-04-01  5:14 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel



On 2017-03-31 22:31, Michael S. Tsirkin wrote:
> On Fri, Mar 31, 2017 at 11:52:24AM +0800, Jason Wang wrote:
>> On 2017-03-30 21:53, Michael S. Tsirkin wrote:
>>> On Thu, Mar 30, 2017 at 03:22:24PM +0800, Jason Wang wrote:
>>>> This patch introduce a batched version of consuming, consumer can
>>>> dequeue more than one pointers from the ring at a time. We don't care
>>>> about the reorder of reading here so no need for compiler barrier.
>>>>
>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>> ---
>>>>    include/linux/ptr_ring.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 65 insertions(+)
>>>>
>>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>>> index 6c70444..2be0f350 100644
>>>> --- a/include/linux/ptr_ring.h
>>>> +++ b/include/linux/ptr_ring.h
>>>> @@ -247,6 +247,22 @@ static inline void *__ptr_ring_consume(struct ptr_ring *r)
>>>>    	return ptr;
>>>>    }
>>>> +static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
>>>> +					     void **array, int n)
>>> Can we use a shorter name? ptr_ring_consume_batch?
>> Ok, but at least we need to keep the prefix since there's a locked version.
>>
>>
>>
>>>> +{
>>>> +	void *ptr;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < n; i++) {
>>>> +		ptr = __ptr_ring_consume(r);
>>>> +		if (!ptr)
>>>> +			break;
>>>> +		array[i] = ptr;
>>>> +	}
>>>> +
>>>> +	return i;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Note: resize (below) nests producer lock within consumer lock, so if you
>>>>     * call this in interrupt or BH context, you must disable interrupts/BH when
>>> I'd like to add a code comment here explaining why we don't
>>> care about cpu or compiler reordering. And I think the reason is
>>> in the way you use this API: in vhost it does not matter
>>> if you get less entries than present in the ring.
>>> That's ok but needs to be noted
>>> in a code comment so people use this function correctly.
>> Interesting, but I still think it's not necessary.
>>
>> If the consumer is busy polling, it will eventually get the entries. If
>> the consumer needs notification from the producer, it should drain the queue,
>> which means it needs to enable notification before the last consume call;
>> otherwise it is a bug. The batch consuming function in this patch
>> guarantees returning at least one pointer if there are entries in the ring;
>> doesn't this look sufficient for correctness?
>>
>> Thanks
> You ask for N entries but get N-1. This seems to imply the
> ring is now empty. Do we guarantee this?

I think the consumer cannot assume the ring is empty, since the producer can 
produce at the same time. It needs to enable notification and do another 
poll in this case.
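
In other words, something like this on the consumer side (a hypothetical
sketch; enable_notify()/disable_notify()/process() are placeholders, not
functions from this series):

	for (;;) {
		int n = ptr_ring_consume_batched(r, batch, BATCH);

		if (n) {
			process(batch, n);	/* got entries, keep polling */
			continue;
		}
		enable_notify(r);		/* ask the producer for a kick */
		/* final re-poll after enabling notification, so entries
		 * queued in the meantime are not missed
		 */
		n = ptr_ring_consume_batched(r, batch, BATCH);
		if (!n)
			break;			/* really empty, wait for the kick */
		disable_notify(r);
		process(batch, n);
	}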

Thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-04-01  5:14 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-30  7:22 [PATCH V2 net-next 0/7] vhost-net rx batching Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 1/7] ptr_ring: introduce batch dequeuing Jason Wang
2017-03-30 13:53   ` Michael S. Tsirkin
2017-03-31  3:52     ` Jason Wang
2017-03-31 14:31       ` Michael S. Tsirkin
2017-04-01  5:14         ` Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 2/7] skb_array: " Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 3/7] tun: export skb_array Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 4/7] tap: " Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 5/7] tun: support receiving skb through msg_control Jason Wang
2017-03-30 15:06   ` Michael S. Tsirkin
2017-03-31  4:10     ` Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 6/7] tap: support receiving skb from msg_control Jason Wang
2017-03-30 15:03   ` Michael S. Tsirkin
2017-03-31  4:07     ` Jason Wang
2017-03-30  7:22 ` [PATCH V2 net-next 7/7] vhost_net: try batch dequing from skb array Jason Wang
2017-03-30 14:21   ` Michael S. Tsirkin
2017-03-31  4:02     ` Jason Wang
2017-03-31  6:47       ` Jason Wang
