* [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
@ 2014-10-11  7:16 Jason Wang
  2014-10-11  7:16 ` [PATCH net-next RFC 1/3] virtio: support for urgent descriptors Jason Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel
  Cc: linux-api, kvm, Jason Wang

Hello all:

We currently free old transmitted packets in ndo_start_xmit(), so every
packet must also be orphaned there. This reduces the overhead of tx
interrupts and achieves better performance, but it does not work well for
protocols such as TCP. TCP depends on the value of sk_wmem_alloc to
implement various optimizations for streams of small packets, such as TCP
small queues and auto corking. Orphaning packets early in ndo_start_xmit()
more or less disables those optimizations, since sk_wmem_alloc is no longer
accurate. This leads to very low throughput for TCP streams of small
writes.
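
For reference, TSQ throttles a socket once the bytes it still holds in the
tx path exceed a limit; roughly (paraphrasing the check in
tcp_write_xmit(), not an exact quote):

	limit = max(skb->truesize, sk->sk_pacing_rate >> 10);
	if (atomic_read(&sk->sk_wmem_alloc) > limit) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;	/* wait for tx completions to free some skbs */
	}

Orphaning every skb in ndo_start_xmit() drops sk_wmem_alloc back down
immediately, so this check (and auto corking, which tests the same
counter) almost never triggers.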

This series tries to solve the issue by enabling tx interrupts for all TCP
packets except those with the PUSH bit set and pure ACKs. This is done
through support for urgent descriptors, which can force an interrupt for a
specific packet. If the tx interrupt is enabled for a packet, there is no
need to orphan it in ndo_start_xmit(); we can free it in the tx napi
scheduled by the tx interrupt. sk_wmem_alloc is then more accurate than
before, and TCP can batch more on small writes. TCP produces larger skbs
in this case, improving both throughput and cpu utilization.
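
To illustrate the per-packet policy, patch 3/3 roughly does the following
in start_xmit() (a sketch of the real hunk, using the helpers it
introduces):

	bool urgent = virtnet_skb_needs_intr(skb); /* TCP, payload, no PSH */

	err = urgent ?
	      virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC) :
	      virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);

	if (!urgent) {
		/* no tx interrupt will come; give the memory back to the
		 * socket now, as before */
		skb_orphan(skb);
		nf_reset(skb);
	}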

Tests show great improvements for TCP streams of small writes. For most
other cases, throughput and cpu utilization are the same as before. In
only a few cases was higher cpu utilization noticed, which needs more
investigation.

Review and comments are welcome.

Thanks

Test result:

- Two Intel Xeon 5600 hosts (8 cores each) with back-to-back connected
  82599ES NICs.
- netperf tests between guest and a remote host.
- 1 queue, 2 vcpus, with zerocopy-enabled vhost_net.
- Both host and guest run net-next.git with the patches applied.
- Values in '[]' indicate an obvious difference (significance greater
  than 95%).
- The significance of the difference between the two averages is
  calculated using an unpaired t-test that takes the SD of the averages
  into account.
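- For reference, the unpaired (Welch's) t statistic for means m1, m2 with
  standard deviations s1, s2 over n1, n2 runs is:
  t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)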

Guest RX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/+3.7872%/+3.2307%/+0.5390%/
64/2/-0.2325%/+2.9552%/-3.0962%/
64/4/[-2.0296%]/+2.2955%/[-4.2280%]/
64/8/+0.0944%/[+2.2654%]/-2.4662%/
256/1/+1.1947%/-2.5462%/+3.8386%/
256/2/-1.6477%/+3.4421%/-4.9301%/
256/4/[-5.9526%]/[+6.8861%]/[-11.9951%]/
256/8/-3.6470%/-1.5887%/-2.0916%/
1024/1/-4.2225%/-1.3238%/-2.9376%/
1024/2/+0.3568%/+1.8439%/-1.4601%/
1024/4/-0.7065%/-0.0099%/-2.3483%/
1024/8/-1.8620%/-2.4774%/+0.6310%/
4096/1/+0.0115%/-0.3693%/+0.3823%/
4096/2/-0.0209%/+0.8730%/-0.8862%/
4096/4/+0.0729%/-7.0303%/+7.6403%/
4096/8/-2.3720%/+0.0507%/-2.4214%/
16384/1/+0.0222%/-1.8672%/+1.9254%/
16384/2/+0.0986%/+3.2968%/-3.0961%/
16384/4/-1.2059%/+7.4291%/-8.0379%/
16384/8/-1.4893%/+0.3403%/-1.8234%/
65535/1/-0.0445%/-1.4060%/+1.3808%/
65535/2/-0.0311%/+0.9610%/-0.9827%/
65535/4/-0.7015%/+0.3660%/-1.0637%/
65535/8/-3.1585%/+11.1302%/[-12.8576%]/

Guest TX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/[+75.2622%]/[-14.3928%]/[+104.7283%]/
64/2/[+68.9596%]/[-12.6655%]/[+93.4625%]/
64/4/[+68.0126%]/[-12.7982%]/[+92.6710%]/
64/8/[+67.9870%]/[-12.6297%]/[+92.2703%]/
256/1/[+160.4177%]/[-26.9643%]/[+256.5624%]/
256/2/[+48.4357%]/[-24.3380%]/[+96.1825%]/
256/4/[+48.3663%]/[-24.1127%]/[+95.5087%]/
256/8/[+47.9722%]/[-24.2516%]/[+95.3469%]/
1024/1/[+54.4474%]/[-52.9223%]/[+228.0694%]/
1024/2/+0.0742%/[-12.7444%]/[+14.6908%]/
1024/4/[+0.5524%]/-0.0327%/+0.5853%/
1024/8/[-1.2783%]/[+6.2902%]/[-7.1206%]/
4096/1/+0.0778%/-13.1121%/+15.1804%/
4096/2/+0.0189%/[-11.3176%]/[+12.7832%]/
4096/4/+0.0218%/-1.0389%/+1.0718%/
4096/8/-1.3774%/[+12.7396%]/[-12.5218%]/
16384/1/+0.0136%/-2.5043%/+2.5826%/
16384/2/+0.0509%/[-15.3846%]/[+18.2420%]/
16384/4/-0.0163%/[-4.8808%]/[+5.1141%]/
16384/8/[-1.7249%]/[+13.9174%]/[-13.7313%]/
65535/1/+0.0686%/-5.4942%/+5.8862%/
65535/2/+0.0043%/[-7.5816%]/[+8.2082%]/
65535/4/+0.0080%/[-7.2993%]/[+7.8827%]/
65535/8/[-1.3669%]/[+16.6536%]/[-15.4479%]/

Guest TCP_RR
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
256/1/-0.2914%/+12.6457%/-11.4848%/
256/25/-0.5968%/-5.0531%/+4.6935%/
256/50/+0.0262%/+0.2079%/-0.1813%/
4096/1/+2.6965%/[+16.1248%]/[-11.5636%]/
4096/25/-0.5002%/+0.5449%/-1.0395%/
4096/50/[-2.0987%]/-0.0330%/[-2.0664%]/

Tests on mlx4 are ongoing; I will post the results next week.

Jason Wang (3):
  virtio: support for urgent descriptors
  vhost: support urgent descriptors
  virtio-net: conditionally enable tx interrupt

 drivers/net/virtio_net.c         | 164 ++++++++++++++++++++++++++++++---------
 drivers/vhost/net.c              |  43 +++++++---
 drivers/vhost/scsi.c             |  23 ++++--
 drivers/vhost/test.c             |   5 +-
 drivers/vhost/vhost.c            |  44 +++++++----
 drivers/vhost/vhost.h            |  19 +++--
 drivers/virtio/virtio_ring.c     |  75 +++++++++++++++++-
 include/linux/virtio.h           |  14 ++++
 include/uapi/linux/virtio_ring.h |   5 +-
 9 files changed, 308 insertions(+), 84 deletions(-)

-- 
1.8.3.1


* [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
  2014-10-11  7:16 [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt Jason Wang
@ 2014-10-11  7:16 ` Jason Wang
  2014-10-12  9:27   ` Michael S. Tsirkin
  2014-10-15  5:40   ` Rusty Russell
  2014-10-11  7:16 ` [PATCH net-next RFC 2/3] vhost: support " Jason Wang
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm

Below should be useful for some experiments Jason is doing.
I thought I'd send it out for early review/feedback.

The event idx feature allows us to defer interrupts until
a specific number of descriptors has been used.
Sometimes it might be useful to get an interrupt after
a specific descriptor, regardless.
This adds a descriptor flag for this, and an API
to create an urgent output descriptor.
This is still an RFC:
we'll need a feature bit for drivers to detect this,
but we've run out of feature bits for virtio 0.X.
For experimentation purposes, drivers can assume
this is set, or add a driver-specific feature bit.
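
To illustrate, with event idx negotiated the device's decision to
interrupt becomes roughly the following (a sketch; used_urgent_desc and
the index bookkeeping are placeholders, and the vhost side is wired up in
patch 2/3):

	if (used_urgent_desc &&
	    !(avail->flags & VRING_AVAIL_F_NO_URGENT_INTERRUPT))
		interrupt();	/* urgent: ignore the event index */
	else if (vring_need_event(used_event, new_idx, old_idx))
		interrupt();	/* normal event idx path */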

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
 include/linux/virtio.h           | 14 ++++++++
 include/uapi/linux/virtio_ring.h |  5 ++-
 3 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a..a5188c6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
 
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
+				     bool urgent,
 				     struct scatterlist *sgs[],
 				     struct scatterlist *(*next)
 				       (struct scatterlist *, unsigned int *),
@@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	/* Use a single buffer which doesn't continue */
 	head = vq->free_head;
 	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+	if (urgent)
+		vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
 	vq->vring.desc[head].addr = virt_to_phys(desc);
 	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
 	kmemleak_ignore(desc);
@@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
+			        bool urgent,
 				struct scatterlist *sgs[],
 				struct scatterlist *(*next)
 				  (struct scatterlist *, unsigned int *),
@@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
-		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
+		head = vring_add_indirect(vq, urgent, sgs, next, total_sg, total_out,
 					  total_in,
 					  out_sgs, in_sgs, gfp);
 		if (likely(head >= 0))
@@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
+			if (urgent) {
+				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
+				urgent = false;
+			}
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
@@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+			if (urgent) {
+				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
+				urgent = false;
+			}
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
@@ -305,6 +317,8 @@ add_head:
 
 /**
  * virtqueue_add_sgs - expose buffers to other end
+ * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *          after this descriptor was completed
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
  * @out_num: the number of scatterlists readable by other side
@@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
 		for (sg = sgs[i]; sg; sg = sg_next(sg))
 			total_in++;
 	}
-	return virtqueue_add(_vq, sgs, sg_next_chained,
+	return virtqueue_add(_vq, false, sgs, sg_next_chained,
 			     total_out, total_in, out_sgs, in_sgs, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
@@ -360,11 +374,35 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
 			 void *data,
 			 gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+	return virtqueue_add(vq, false, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
 
 /**
+ * virtqueue_add_outbuf_urgent - expose output buffers to other end
+ *          in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *          after this descriptor was completed
+ * @vq: the struct virtqueue we're talking about.
+ * @sgs: array of scatterlists (need not be terminated!)
+ * @num: the number of scatterlists readable by other side
+ * @data: the token identifying the buffer.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
+			 struct scatterlist sg[], unsigned int num,
+			 void *data,
+			 gfp_t gfp)
+{
+	return virtqueue_add(vq, true, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_urgent);
+
+/**
  * virtqueue_add_inbuf - expose input buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of scatterlists (need not be terminated!)
@@ -382,7 +420,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
 			void *data,
 			gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
+	return virtqueue_add(vq, false, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
 
@@ -595,6 +633,14 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
+void virtqueue_disable_cb_urgent(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	vq->vring.avail->flags |= VRING_AVAIL_F_NO_URGENT_INTERRUPT;
+}
+EXPORT_SYMBOL_GPL(virtqueue_disable_cb_urgent);
+
 /**
  * virtqueue_enable_cb_prepare - restart callbacks after disable_cb
  * @vq: the struct virtqueue we're talking about.
@@ -626,6 +672,19 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
+unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used_idx;
+
+	START_USE(vq);
+	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
+	last_used_idx = vq->last_used_idx;
+	END_USE(vq);
+	return last_used_idx;
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
+
 /**
  * virtqueue_poll - query pending used buffers
  * @vq: the struct virtqueue we're talking about.
@@ -662,6 +721,14 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 
+bool virtqueue_enable_cb_urgent(struct virtqueue *_vq)
+{
+	unsigned last_used_idx = virtqueue_enable_cb_prepare_urgent(_vq);
+
+	return !virtqueue_poll(_vq, last_used_idx);
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_urgent);
+
 /**
  * virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
  * @vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b46671e..68be5f2 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -39,6 +39,12 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
 			 void *data,
 			 gfp_t gfp);
 
+int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
+				struct scatterlist sg[], unsigned int num,
+				void *data,
+				gfp_t gfp);
+
+
 int virtqueue_add_inbuf(struct virtqueue *vq,
 			struct scatterlist sg[], unsigned int num,
 			void *data,
@@ -61,10 +67,16 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
 void virtqueue_disable_cb(struct virtqueue *vq);
 
+void virtqueue_disable_cb_urgent(struct virtqueue *vq);
+
 bool virtqueue_enable_cb(struct virtqueue *vq);
 
+bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
+
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
 
+unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *vq);
+
 bool virtqueue_poll(struct virtqueue *vq, unsigned);
 
 bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index a99f9b7..daf5bb0 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -39,6 +39,9 @@
 #define VRING_DESC_F_WRITE	2
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT	4
+/* This means the descriptor should cause an interrupt
+ * ignoring avail event idx. */
+#define VRING_DESC_F_URGENT	8
 
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
@@ -48,7 +51,7 @@
  * when you consume a buffer.  It's unreliable, so it's simply an
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT	1
-
+#define VRING_AVAIL_F_NO_URGENT_INTERRUPT	2
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
-- 
1.8.3.1


* [PATCH net-next RFC 2/3] vhost: support urgent descriptors
  2014-10-11  7:16 [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt Jason Wang
  2014-10-11  7:16 ` [PATCH net-next RFC 1/3] virtio: support for urgent descriptors Jason Wang
@ 2014-10-11  7:16 ` Jason Wang
  2014-10-11  7:16 ` [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt Jason Wang
       [not found] ` <1413011806-3813-1-git-send-email-jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm

This patch lets vhost-net support urgent descriptors. For the zerocopy
case, two new types of length were introduced to make it work.
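
The resulting per-buffer state machine for vq->heads[i].len (a summary of
the values in the hunk below):

	VHOST_DMA_IN_PROGRESS (1) --DMA ok--> VHOST_DMA_DONE_LEN (3)
	VHOST_DMA_URGENT      (2) --DMA ok--> VHOST_DMA_DONE_LEN_URGENT (4)
	either in-flight state --DMA err--> VHOST_DMA_FAILED_LEN (5)

vhost_zerocopy_signal_used() then signals urgently if any completed
buffer carries the urgent state, and resets consumed entries to
VHOST_DMA_CLEAR_LEN.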

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   | 43 +++++++++++++++++++++++++++++++------------
 drivers/vhost/scsi.c  | 23 +++++++++++++++--------
 drivers/vhost/test.c  |  5 +++--
 drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 19 +++++++++++++------
 5 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..37b0bb5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,9 +48,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
+#define VHOST_DMA_FAILED_LEN	5
+/* Lower device DMA done, urgent bit set */
+#define VHOST_DMA_DONE_LEN_URGENT	4
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
+#define VHOST_DMA_DONE_LEN	3
+/* Lower device DMA in progress, urgent bit set */
+#define VHOST_DMA_URGENT	2
 /* Lower device DMA in progress */
 #define VHOST_DMA_IN_PROGRESS	1
 /* Buffer unused */
@@ -284,11 +288,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 		container_of(vq, struct vhost_net_virtqueue, vq);
 	int i, add;
 	int j = 0;
+	bool urgent = false;
 
 	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
 		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
 			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+			urgent = urgent || vq->heads[i].len == VHOST_DMA_DONE_LEN_URGENT;
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			++j;
 		} else
@@ -296,7 +302,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 	}
 	while (j) {
 		add = min(UIO_MAXIOV - nvq->done_idx, j);
-		vhost_add_used_and_signal_n(vq->dev, vq,
+		vhost_add_used_and_signal_n(vq->dev, vq, urgent,
 					    &vq->heads[nvq->done_idx], add);
 		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
 		j -= add;
@@ -311,9 +317,14 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 
 	rcu_read_lock_bh();
 
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
-		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+	if (success) {
+		if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS)
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+		else
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN_URGENT;
+	} else {
+		vq->heads[ubuf->desc].len = VHOST_DMA_FAILED_LEN;
+	}
 	cnt = vhost_net_ubuf_put(ubufs);
 
 	/*
@@ -363,6 +374,7 @@ static void handle_tx(struct vhost_net *net)
 	zcopy = nvq->ubufs;
 
 	for (;;) {
+		bool urgent;
 		/* Release DMAs done buffers first */
 		if (zcopy)
 			vhost_zerocopy_signal_used(net, vq);
@@ -374,7 +386,7 @@ static void handle_tx(struct vhost_net *net)
 			      % UIO_MAXIOV == nvq->done_idx))
 			break;
 
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
@@ -417,7 +429,8 @@ static void handle_tx(struct vhost_net *net)
 			ubuf = nvq->ubuf_info + nvq->upend_idx;
 
 			vq->heads[nvq->upend_idx].id = head;
-			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			vq->heads[nvq->upend_idx].len = urgent ?
+				VHOST_DMA_URGENT : VHOST_DMA_IN_PROGRESS;
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
@@ -445,7 +458,7 @@ static void handle_tx(struct vhost_net *net)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
 		if (!zcopy_used)
-			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+			vhost_add_used_and_signal(&net->dev, vq, urgent, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
 		total_len += len;
@@ -488,6 +501,7 @@ static int peek_head_len(struct sock *sk)
  *	returns number of buffer heads allocated, negative on error
  */
 static int get_rx_bufs(struct vhost_virtqueue *vq,
+		       bool *urgentp,
 		       struct vring_used_elem *heads,
 		       int datalen,
 		       unsigned *iovcount,
@@ -502,11 +516,13 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 	int r, nlogs = 0;
 
 	while (datalen > 0 && headcount < quota) {
+		bool urgent = false;
+
 		if (unlikely(seg >= UIO_MAXIOV)) {
 			r = -ENOBUFS;
 			goto err;
 		}
-		r = vhost_get_vq_desc(vq, vq->iov + seg,
+		r = vhost_get_vq_desc(vq, &urgent, vq->iov + seg,
 				      ARRAY_SIZE(vq->iov) - seg, &out,
 				      &in, log, log_num);
 		if (unlikely(r < 0))
@@ -527,6 +543,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 			nlogs += *log_num;
 			log += *log_num;
 		}
+		*urgentp = *urgentp || urgent;
 		heads[headcount].id = d;
 		heads[headcount].len = iov_length(vq->iov + seg, in);
 		datalen -= heads[headcount].len;
@@ -590,9 +607,11 @@ static void handle_rx(struct vhost_net *net)
 	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
 	while ((sock_len = peek_head_len(sock->sk))) {
+		bool urgent = false;
+
 		sock_len += sock_hlen;
 		vhost_len = sock_len + vhost_hlen;
-		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
+		headcount = get_rx_bufs(vq, &urgent, vq->heads, vhost_len,
 					&in, vq_log, &log,
 					likely(mergeable) ? UIO_MAXIOV : 1);
 		/* On error, stop handling until the next kick. */
@@ -654,7 +673,7 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq, headcount);
 			break;
 		}
-		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
+		vhost_add_used_and_signal_n(&net->dev, vq, urgent, vq->heads,
 					    headcount);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 69906ca..0a7e5bc 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -72,6 +72,8 @@ struct tcm_vhost_cmd {
 	int tvc_vq_desc;
 	/* virtio-scsi initiator task attribute */
 	int tvc_task_attr;
+	/* Descriptor urgent? */
+	bool tvc_vq_desc_urgent;
 	/* virtio-scsi initiator data direction */
 	enum dma_data_direction tvc_data_direction;
 	/* Expected data transfer length from virtio-scsi header */
@@ -606,6 +608,7 @@ tcm_vhost_do_evt_work(struct vhost_scsi *vs, struct tcm_vhost_evt *evt)
 	struct virtio_scsi_event __user *eventp;
 	unsigned out, in;
 	int head, ret;
+	bool urgent;
 
 	if (!vq->private_data) {
 		vs->vs_events_missed = true;
@@ -614,7 +617,7 @@ tcm_vhost_do_evt_work(struct vhost_scsi *vs, struct tcm_vhost_evt *evt)
 
 again:
 	vhost_disable_notify(&vs->dev, vq);
-	head = vhost_get_vq_desc(vq, vq->iov,
+	head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 			ARRAY_SIZE(vq->iov), &out, &in,
 			NULL, NULL);
 	if (head < 0) {
@@ -643,7 +646,7 @@ again:
 	eventp = vq->iov[out].iov_base;
 	ret = __copy_to_user(eventp, event, sizeof(*event));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_add_used_and_signal(&vs->dev, vq, urgent, head, 0);
 	else
 		vq_err(vq, "Faulted on tcm_vhost_send_event\n");
 }
@@ -704,7 +707,8 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 		ret = copy_to_user(cmd->tvc_resp, &v_rsp, sizeof(v_rsp));
 		if (likely(ret == 0)) {
 			struct vhost_scsi_virtqueue *q;
-			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
+			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc_urgent,
+				       cmd->tvc_vq_desc, 0);
 			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
 			vq = q - vs->vqs;
 			__set_bit(vq, signal);
@@ -947,6 +951,7 @@ static void tcm_vhost_submission_work(struct work_struct *work)
 static void
 vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 			   struct vhost_virtqueue *vq,
+			   bool urgent,
 			   int head, unsigned out)
 {
 	struct virtio_scsi_cmd_resp __user *resp;
@@ -958,7 +963,7 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 	resp = vq->iov[out].iov_base;
 	ret = __copy_to_user(resp, &rsp, sizeof(rsp));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_add_used_and_signal(&vs->dev, vq, urgent, head, 0);
 	else
 		pr_err("Faulted on virtio_scsi_cmd_resp\n");
 }
@@ -980,6 +985,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 	u8 *target, *lunp, task_attr;
 	bool hdr_pi;
 	void *req, *cdb;
+	bool urgent;
 
 	mutex_lock(&vq->mutex);
 	/*
@@ -993,7 +999,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 	vhost_disable_notify(&vs->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					ARRAY_SIZE(vq->iov), &out, &in,
 					NULL, NULL);
 		pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
@@ -1067,7 +1073,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 
 		/* virtio-scsi spec requires byte 0 of the lun to be 1 */
 		if (unlikely(*lunp != 1)) {
-			vhost_scsi_send_bad_target(vs, vq, head, out);
+			vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 			continue;
 		}
 
@@ -1075,7 +1081,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 
 		/* Target does not exist, fail the request */
 		if (unlikely(!tpg)) {
-			vhost_scsi_send_bad_target(vs, vq, head, out);
+			vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 			continue;
 		}
 
@@ -1187,6 +1193,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		 * tcm_vhost_queue_data_in() and tcm_vhost_queue_status()
 		 */
 		cmd->tvc_vq_desc = head;
+		cmd->tvc_vq_desc_urgent = urgent;
 		/*
 		 * Dispatch tv_cmd descriptor for cmwq execution in process
 		 * context provided by tcm_vhost_workqueue.  This also ensures
@@ -1203,7 +1210,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 err_free:
 	vhost_scsi_free_cmd(cmd);
 err_cmd:
-	vhost_scsi_send_bad_target(vs, vq, head, out);
+	vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 out:
 	mutex_unlock(&vq->mutex);
 }
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index d9c501e..757f3a2 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -42,6 +42,7 @@ static void handle_vq(struct vhost_test *n)
 	int head;
 	size_t len, total_len = 0;
 	void *private;
+	bool urgent;
 
 	mutex_lock(&vq->mutex);
 	private = vq->private_data;
@@ -53,7 +54,7 @@ static void handle_vq(struct vhost_test *n)
 	vhost_disable_notify(&n->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
@@ -79,7 +80,7 @@ static void handle_vq(struct vhost_test *n)
 			vq_err(vq, "Unexpected 0 len for TX\n");
 			break;
 		}
-		vhost_add_used_and_signal(&n->dev, vq, head, 0);
+		vhost_add_used_and_signal(&n->dev, vq, urgent, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_TEST_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..8a35e14 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -186,6 +186,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->last_used_idx = 0;
 	vq->signalled_used = 0;
 	vq->signalled_used_valid = false;
+	vq->urgent = false;
 	vq->used_flags = 0;
 	vq->log_used = false;
 	vq->log_addr = -1ull;
@@ -1201,7 +1202,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
  * This function returns the descriptor number found, or vq->num (which is
  * never a valid descriptor number) if none was found.  A negative code is
  * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+int vhost_get_vq_desc(struct vhost_virtqueue *vq, bool *urgent,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num)
@@ -1211,6 +1212,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	u16 last_avail_idx;
 	int ret;
 
+	*urgent = false;
+
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
 	if (unlikely(__get_user(vq->avail_idx, &vq->avail->idx))) {
@@ -1274,6 +1277,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->desc + i);
 			return -EFAULT;
 		}
+		if (desc.flags & VRING_DESC_F_URGENT)
+			*urgent = true;
 		if (desc.flags & VRING_DESC_F_INDIRECT) {
 			ret = get_indirect(vq, iov, iov_size,
 					   out_num, in_num,
@@ -1333,11 +1338,11 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
 
 /* After we've used one of their buffers, we tell them about it.  We'll then
  * want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
+int vhost_add_used(struct vhost_virtqueue *vq, bool urgent, unsigned int head, int len)
 {
 	struct vring_used_elem heads = { head, len };
 
-	return vhost_add_used_n(vq, &heads, 1);
+	return vhost_add_used_n(vq, urgent, &heads, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
@@ -1386,7 +1391,8 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 /* After we've used one of their buffers, we tell them about it.  We'll then
  * want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+int vhost_add_used_n(struct vhost_virtqueue *vq, bool urgent,
+		     struct vring_used_elem *heads,
 		     unsigned count)
 {
 	int start, n, r;
@@ -1416,13 +1422,14 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 		if (vq->log_ctx)
 			eventfd_signal(vq->log_ctx, 1);
 	}
+	vq->urgent = vq->urgent || urgent;
 	return r;
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-	__u16 old, new, event;
+	__u16 old, new, event, flags;
 	bool v;
 	/* Flush out used index updates. This is paired
 	 * with the barrier that the Guest executes when enabling
@@ -1433,14 +1440,17 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	    unlikely(vq->avail_idx == vq->last_avail_idx))
 		return true;
 
-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
-		__u16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
-			vq_err(vq, "Failed to get flags");
-			return true;
-		}
-		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
+	if (__get_user(flags, &vq->avail->flags)) {
+		vq_err(vq, "Failed to get flags");
+		return true;
 	}
+
+	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX))
+		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
+
+	if (vq->urgent && !(flags & VRING_AVAIL_F_NO_URGENT_INTERRUPT))
+		return true;
+
 	old = vq->signalled_used;
 	v = vq->signalled_used_valid;
 	new = vq->signalled_used = vq->last_used_idx;
@@ -1460,17 +1470,20 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	/* Signal the Guest tell them we used something up. */
-	if (vq->call_ctx && vhost_notify(dev, vq))
+	if (vq->call_ctx && vhost_notify(dev, vq)) {
 		eventfd_signal(vq->call_ctx, 1);
+		vq->urgent = false;
+	}
 }
 EXPORT_SYMBOL_GPL(vhost_signal);
 
 /* And here's the combo meal deal.  Supersize me! */
 void vhost_add_used_and_signal(struct vhost_dev *dev,
 			       struct vhost_virtqueue *vq,
+			       bool urgent,
 			       unsigned int head, int len)
 {
-	vhost_add_used(vq, head, len);
+	vhost_add_used(vq, urgent, head, len);
 	vhost_signal(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
@@ -1478,9 +1491,10 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
 /* multi-buffer version of vhost_add_used_and_signal */
 void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 				 struct vhost_virtqueue *vq,
+				 bool urgent,
 				 struct vring_used_elem *heads, unsigned count)
 {
-	vhost_add_used_n(vq, heads, count);
+	vhost_add_used_n(vq, urgent, heads, count);
 	vhost_signal(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..61ca542 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -96,6 +96,9 @@ struct vhost_virtqueue {
 	/* Last used index value we have signalled on */
 	bool signalled_used_valid;
 
+	/* Urgent descriptor was used */
+	bool urgent;
+
 	/* Log writes to used structure. */
 	bool log_used;
 	u64 log_addr;
@@ -138,20 +141,24 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
 
-int vhost_get_vq_desc(struct vhost_virtqueue *,
+int vhost_get_vq_desc(struct vhost_virtqueue *, bool *urgent,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_init_used(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
-		     unsigned count);
-void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
+int vhost_add_used(struct vhost_virtqueue *, bool urgent,
+		   unsigned int head, int len);
+int vhost_add_used_n(struct vhost_virtqueue *, bool urgent,
+		     struct vring_used_elem *heads, unsigned count);
+void vhost_add_used_and_signal(struct vhost_dev *,
+			       struct vhost_virtqueue *,
+			       bool urgent,
 			       unsigned int id, int len);
 void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
-			       struct vring_used_elem *heads, unsigned count);
+				 bool urgent,
+				 struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
-- 
1.8.3.1


* [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
  2014-10-11  7:16 [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt Jason Wang
  2014-10-11  7:16 ` [PATCH net-next RFC 1/3] virtio: support for urgent descriptors Jason Wang
  2014-10-11  7:16 ` [PATCH net-next RFC 2/3] vhost: support " Jason Wang
@ 2014-10-11  7:16 ` Jason Wang
  2014-10-11 14:48   ` Eric Dumazet
  2014-10-14 21:51   ` Michael S. Tsirkin
       [not found] ` <1413011806-3813-1-git-send-email-jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 2 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm

We used to free transmitted packets in ndo_start_xmit() to get better
performance. One side effect is that skb_orphan() needs to be called in
ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For TCP, this
means several optimizations such as TCP small queues and auto corking
cannot work well. This can lead to very low throughput for streams of
small packets.

Thanks to the urgent descriptor support, this patch tries to solve the
issue by enabling the tx interrupt selectively for stream packets. This
means we don't need to orphan TCP stream packets in ndo_start_xmit();
instead we enable the tx interrupt for those packets. When the tx
interrupt fires, a tx napi is scheduled to free those packets.

With this method, sk_wmem_alloc of the TCP socket is more accurate than
before, which lets TCP batch more through TSQ and auto corking.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 128 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5810841..b450fc4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {
 
 	/* Name of the send queue: output.$index */
 	char name[40];
+
+	struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
+static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	int sent = 0;
+
+	while (sent < budget &&
+	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+
+		u64_stats_update_begin(&stats->tx_syncp);
+		stats->tx_bytes += skb->len;
+		stats->tx_packets++;
+		u64_stats_update_end(&stats->tx_syncp);
+
+		dev_kfree_skb_any(skb);
+		sent++;
+	}
+
+	return sent;
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
 	struct virtnet_info *vi = vq->vdev->priv;
+	struct send_queue *sq = &vi->sq[vq2txq(vq)];
 
-	/* Suppress further interrupts. */
-	virtqueue_disable_cb(vq);
-
-	/* We were probably waiting for more output buffers. */
-	netif_wake_subqueue(vi->dev, vq2txq(vq));
+	if (napi_schedule_prep(&sq->napi)) {
+		virtqueue_disable_cb(vq);
+		virtqueue_disable_cb_urgent(vq);
+		__napi_schedule(&sq->napi);
+	}
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -772,7 +799,38 @@ again:
 	return received;
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct send_queue *sq =
+		container_of(napi, struct send_queue, napi);
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+	unsigned int r, sent = 0;
+
+again:
+	__netif_tx_lock(txq, smp_processor_id());
+	sent += free_old_xmit_skbs(sq, budget - sent);
+
+	if (sent < budget) {
+		r = virtqueue_enable_cb_prepare_urgent(sq->vq);
+		napi_complete(napi);
+		__netif_tx_unlock(txq);
+		if (unlikely(virtqueue_poll(sq->vq, r)) &&
+		    napi_schedule_prep(napi)) {
+			virtqueue_disable_cb_urgent(sq->vq);
+			__napi_schedule(napi);
+			goto again;
+		}
+	} else {
+		__netif_tx_unlock(txq);
+	}
+
+	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+	return sent;
+}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
+
 /* must be called with local_bh_disable()d */
 static int virtnet_busy_poll(struct napi_struct *napi)
 {
@@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 		virtnet_napi_enable(&vi->rq[i]);
+		napi_enable(&vi->sq[i].napi);
 	}
 
 	return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-
-		u64_stats_update_begin(&stats->tx_syncp);
-		stats->tx_bytes += skb->len;
-		stats->tx_packets++;
-		u64_stats_update_end(&stats->tx_syncp);
-
-		dev_kfree_skb_any(skb);
-	}
-}
-
-static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
+static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool urgent)
 {
 	struct skb_vnet_hdr *hdr;
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -908,7 +948,43 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		sg_set_buf(sq->sg, hdr, hdr_len);
 		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
 	}
-	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
+	if (urgent)
+		return virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg,
+						   skb, GFP_ATOMIC);
+	else
+		return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
+					    GFP_ATOMIC);
+}
+
+static bool virtnet_skb_needs_intr(struct sk_buff *skb)
+{
+	union {
+		unsigned char *network;
+		struct iphdr *ipv4;
+		struct ipv6hdr *ipv6;
+	} hdr;
+	struct tcphdr *th = tcp_hdr(skb);
+	u16 payload_len;
+
+	hdr.network = skb_network_header(skb);
+
+	/* Only IPv4/IPv6 with TCP is supported */
+	if ((skb->protocol == htons(ETH_P_IP)) &&
+	    hdr.ipv4->protocol == IPPROTO_TCP) {
+		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
+			      th->doff * 4;
+	} else if ((skb->protocol == htons(ETH_P_IPV6)) &&
+		   hdr.ipv6->nexthdr == IPPROTO_TCP) {
+		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
+	} else {
+		return false;
+	}
+
+	/* We don't want to delay packets with the PUSH bit set or pure ACKs */
+	if (!th->psh && payload_len)
+		return true;
+
+	return false;
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -916,13 +992,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int qnum = skb_get_queue_mapping(skb);
 	struct send_queue *sq = &vi->sq[qnum];
-	int err;
+	bool urgent = virtnet_skb_needs_intr(skb);
+	int err, qsize = virtqueue_get_vring_size(sq->vq);
 
+	virtqueue_disable_cb_urgent(sq->vq);
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(sq);
+	free_old_xmit_skbs(sq, qsize);
 
 	/* Try to transmit */
-	err = xmit_skb(sq, skb);
+	err = xmit_skb(sq, skb, urgent);
 
 	/* This should not happen! */
 	if (unlikely(err)) {
@@ -935,22 +1013,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		return NETDEV_TX_OK;
 	}
 
-	/* Don't wait up for transmitted skbs to be freed. */
-	skb_orphan(skb);
-	nf_reset(skb);
+	if (!urgent) {
+		skb_orphan(skb);
+		nf_reset(skb);
+	}
 
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 		netif_stop_subqueue(dev, qnum);
+		virtqueue_disable_cb_urgent(sq->vq);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq);
+			free_old_xmit_skbs(sq, qsize);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
+	} else if (virtqueue_enable_cb_urgent(sq->vq)) {
+		free_old_xmit_skbs(sq, qsize);
 	}
 
 	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
@@ -1132,8 +1214,10 @@ static int virtnet_close(struct net_device *dev)
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		napi_disable(&vi->rq[i].napi);
+		napi_disable(&vi->sq[i].napi);
+	}
 
 	return 0;
 }
@@ -1452,8 +1536,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 {
 	int i;
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		netif_napi_del(&vi->rq[i].napi);
+		netif_napi_del(&vi->sq[i].napi);
+	}
 
 	kfree(vi->rq);
 	kfree(vi->sq);
@@ -1607,6 +1693,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
 		napi_hash_add(&vi->rq[i].napi);
+		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
@@ -1912,8 +2000,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	if (netif_running(vi->dev)) {
 		for (i = 0; i < vi->max_queue_pairs; i++) {
 			napi_disable(&vi->rq[i].napi);
+			napi_disable(&vi->sq[i].napi);
 			napi_hash_del(&vi->rq[i].napi);
 			netif_napi_del(&vi->rq[i].napi);
+			netif_napi_del(&vi->sq[i].napi);
 		}
 	}
 
@@ -1938,8 +2028,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(&vi->rq[i]);
+			napi_enable(&vi->sq[i].napi);
+		}
 	}
 
 	netif_device_attach(vi->dev);
-- 
1.8.3.1


* Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
  2014-10-11  7:16 ` [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt Jason Wang
@ 2014-10-11 14:48   ` Eric Dumazet
  2014-10-13  6:02     ` Jason Wang
  2014-10-14 21:51   ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2014-10-11 14:48 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, mst, netdev, linux-kernel, virtualization, linux-api

On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
> We used to free transmitted packets in ndo_start_xmit() to get better
> performance. One side effect is that skb_orphan() needs to be called in
> ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For TCP, this
> means several optimizations such as TCP small queues and auto corking
> cannot work well. This can lead to very low throughput for streams of
> small packets.
> 
> Thanks to the urgent descriptor support, this patch tries to solve the
> issue by enabling the tx interrupt selectively for stream packets. This
> means we don't need to orphan TCP stream packets in ndo_start_xmit();
> instead we enable the tx interrupt for those packets. When the tx
> interrupt fires, a tx napi is scheduled to free those packets.
> 
> With this method, sk_wmem_alloc of the TCP socket is more accurate than
> before, which lets TCP batch more through TSQ and auto corking.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 128 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5810841..b450fc4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>  
>  	/* Name of the send queue: output.$index */
>  	char name[40];
> +
> +	struct napi_struct napi;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>  
> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> +	struct sk_buff *skb;
> +	unsigned int len;
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> +	int sent = 0;
> +
> +	while (sent < budget &&
> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		pr_debug("Sent skb %p\n", skb);
> +
> +		u64_stats_update_begin(&stats->tx_syncp);
> +		stats->tx_bytes += skb->len;
> +		stats->tx_packets++;
> +		u64_stats_update_end(&stats->tx_syncp);
> +
> +		dev_kfree_skb_any(skb);
> +		sent++;
> +	}
> +

You could accumulate skb->len in a totlen var, and perform a single

	u64_stats_update_begin(&stats->tx_syncp);
	stats->tx_bytes += totlen;
	stats->tx_packets += sent;
	u64_stats_update_end(&stats->tx_syncp);

after the loop.


> +	return sent;
> +}
> +

...

> +
> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
> +{
> +	union {
> +		unsigned char *network;
> +		struct iphdr *ipv4;
> +		struct ipv6hdr *ipv6;
> +	} hdr;
> +	struct tcphdr *th = tcp_hdr(skb);
> +	u16 payload_len;
> +
> +	hdr.network = skb_network_header(skb);
> +
> +	/* Only IPv4/IPv6 with TCP is supported */

	Oh well, yet another packet flow dissector :)

	If most packets are caught by your implementation, you could use it
as a fast path and fall back to skb_flow_dissect() for encapsulated
traffic.

	struct flow_keys keys;   

	if (!skb_flow_dissect(skb, &keys)) 
		return false;

	if (keys.ip_proto != IPPROTO_TCP)
		return false;

	then check __skb_get_poff() to see how to get th, and check whether
there is some payload...
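
Something like this, completely untested, and assuming flow_keys carries
thoff and ip_proto:

	static bool virtnet_skb_needs_intr(struct sk_buff *skb)
	{
		struct flow_keys keys;
		struct tcphdr _th;
		const struct tcphdr *th;

		if (!skb_flow_dissect(skb, &keys))
			return false;
		if (keys.ip_proto != IPPROTO_TCP)
			return false;

		th = skb_header_pointer(skb, keys.thoff, sizeof(_th), &_th);
		if (!th)
			return false;

		/* payload present and no PSH -> worth an interrupt */
		return !th->psh && skb->len > keys.thoff + th->doff * 4;
	}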


> +	if ((skb->protocol == htons(ETH_P_IP)) &&
> +	    hdr.ipv4->protocol == IPPROTO_TCP) {
> +		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
> +			      th->doff * 4;
> +	} else if ((skb->protocol == htons(ETH_P_IPV6)) &&
> +		   hdr.ipv6->nexthdr == IPPROTO_TCP) {
> +		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
> +	} else {
> +		return false;
> +	}
> +
> +	/* We don't want to delay packets with the PUSH bit set or pure ACKs */
> +	if (!th->psh && payload_len)
> +		return true;
> +
> +	return false;
>  }


* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
  2014-10-11  7:16 ` [PATCH net-next RFC 1/3] virtio: support for urgent descriptors Jason Wang
@ 2014-10-12  9:27   ` Michael S. Tsirkin
  2014-10-13  6:22     ` Jason Wang
  2014-10-15  5:40   ` Rusty Russell
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2014-10-12  9:27 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
> Below should be useful for some experiments Jason is doing.
> I thought I'd send it out for early review/feedback.
> 
> The event idx feature allows us to defer interrupts until
> a specific number of descriptors has been used.
> Sometimes it might be useful to get an interrupt after
> a specific descriptor, regardless.
> This adds a descriptor flag for this, and an API
> to create an urgent output descriptor.
> This is still an RFC:
> we'll need a feature bit for drivers to detect this,
> but we've run out of feature bits for virtio 0.X.
> For experimentation purposes, drivers can assume
> this is set, or add a driver-specific feature bit.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

I see that as compared to my original patch, you have
added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
I don't think it's necessary, see below.

As such, I think this patch should be split:
- original patch adding support for urgent descriptors
- a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?




> ---
>  drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
>  include/linux/virtio.h           | 14 ++++++++
>  include/uapi/linux/virtio_ring.h |  5 ++-
>  3 files changed, 89 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 4d08f45a..a5188c6 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
>  
>  /* Set up an indirect table of descriptors and add it to the queue. */
>  static inline int vring_add_indirect(struct vring_virtqueue *vq,
> +				     bool urgent,
>  				     struct scatterlist *sgs[],
>  				     struct scatterlist *(*next)
>  				       (struct scatterlist *, unsigned int *),
> @@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  	/* Use a single buffer which doesn't continue */
>  	head = vq->free_head;
>  	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
> +	if (urgent)
> +		vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
>  	vq->vring.desc[head].addr = virt_to_phys(desc);
>  	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
>  	kmemleak_ignore(desc);
> @@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  }
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
> +			        bool urgent,
>  				struct scatterlist *sgs[],
>  				struct scatterlist *(*next)
>  				  (struct scatterlist *, unsigned int *),
> @@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
>  	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
> -		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
> +		head = vring_add_indirect(vq, urgent, sgs, next, total_sg, total_out,
>  					  total_in,
>  					  out_sgs, in_sgs, gfp);
>  		if (likely(head >= 0))
> @@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	for (n = 0; n < out_sgs; n++) {
>  		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
>  			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
> +			if (urgent) {
> +				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
> +				urgent = false;
> +			}
>  			vq->vring.desc[i].addr = sg_phys(sg);
>  			vq->vring.desc[i].len = sg->length;
>  			prev = i;
> @@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	for (; n < (out_sgs + in_sgs); n++) {
>  		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
>  			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
> +			if (urgent) {
> +				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
> +				urgent = false;
> +			}
>  			vq->vring.desc[i].addr = sg_phys(sg);
>  			vq->vring.desc[i].len = sg->length;
>  			prev = i;
> @@ -305,6 +317,8 @@ add_head:
>  
>  /**
>   * virtqueue_add_sgs - expose buffers to other end
> + * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt
> + *          after this descriptor was completed
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of terminated scatterlists.
>   * @out_num: the number of scatterlists readable by other side
> @@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
>  		for (sg = sgs[i]; sg; sg = sg_next(sg))
>  			total_in++;
>  	}
> -	return virtqueue_add(_vq, sgs, sg_next_chained,
> +	return virtqueue_add(_vq, false, sgs, sg_next_chained,
>  			     total_out, total_in, out_sgs, in_sgs, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
> @@ -360,11 +374,35 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 void *data,
>  			 gfp_t gfp)
>  {
> -	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
> +	return virtqueue_add(vq, false, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>  
>  /**
> + * virtqueue_add_outbuf_urgent - expose output buffers to other end
> + *          in case virtqueue_enable_cb_delayed was called, cause an interrupt
> + *          after this descriptor was completed
> + * @vq: the struct virtqueue we're talking about.
> + * @sgs: array of scatterlists (need not be terminated!)
> + * @num: the number of scatterlists readable by other side
> + * @data: the token identifying the buffer.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
> +			 struct scatterlist sg[], unsigned int num,
> +			 void *data,
> +			 gfp_t gfp)
> +{
> +	return virtqueue_add(vq, true, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_urgent);
> +
> +/**
>   * virtqueue_add_inbuf - expose input buffers to other end
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of scatterlists (need not be terminated!)
> @@ -382,7 +420,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
>  			void *data,
>  			gfp_t gfp)
>  {
> -	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
> +	return virtqueue_add(vq, false, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
>  
> @@ -595,6 +633,14 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
>  
> +void virtqueue_disable_cb_urgent(struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +
> +	vq->vring.avail->flags |= VRING_AVAIL_F_NO_URGENT_INTERRUPT;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_disable_cb_urgent);
> +
>  /**
>   * virtqueue_enable_cb_prepare - restart callbacks after disable_cb
>   * @vq: the struct virtqueue we're talking about.
> @@ -626,6 +672,19 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>  
> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	u16 last_used_idx;
> +
> +	START_USE(vq);
> +	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
> +	last_used_idx = vq->last_used_idx;
> +	END_USE(vq);
> +	return last_used_idx;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
> +

You can implement virtqueue_enable_cb_prepare_urgent
simply by clearing VRING_AVAIL_F_NO_INTERRUPT.

The effect is the same: the host sends interrupts only if there
is an urgent descriptor.
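
For illustration, a minimal sketch of this alternative (the same
function body as above with only the flag changed; untested):

        unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
        {
        	struct vring_virtqueue *vq = to_vvq(_vq);
        	u16 last_used_idx;

        	START_USE(vq);
        	/* Re-enable ordinary interrupts; with event idx negotiated,
        	 * the host still suppresses non-urgent interrupts, so only
        	 * urgent descriptors end up interrupting. */
        	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
        	last_used_idx = vq->last_used_idx;
        	END_USE(vq);
        	return last_used_idx;
        }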

>  /**
>   * virtqueue_poll - query pending used buffers
>   * @vq: the struct virtqueue we're talking about.
> @@ -662,6 +721,14 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
>  
> +bool virtqueue_enable_cb_urgent(struct virtqueue *_vq)
> +{
> +	unsigned last_used_idx = virtqueue_enable_cb_prepare_urgent(_vq);
> +
> +	return !virtqueue_poll(_vq, last_used_idx);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_urgent);
> +
>  /**
>   * virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
>   * @vq: the struct virtqueue we're talking about.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index b46671e..68be5f2 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -39,6 +39,11 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 void *data,
>  			 gfp_t gfp);
>  
> +int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
> +				struct scatterlist sg[], unsigned int num,
> +				void *data,
> +				gfp_t gfp);
> +
>  int virtqueue_add_inbuf(struct virtqueue *vq,
>  			struct scatterlist sg[], unsigned int num,
>  			void *data,
> @@ -61,10 +67,16 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
>  
>  void virtqueue_disable_cb(struct virtqueue *vq);
>  
> +void virtqueue_disable_cb_urgent(struct virtqueue *vq);
> +
>  bool virtqueue_enable_cb(struct virtqueue *vq);
>  
> +bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
> +
>  unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
>  
> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *vq);
> +
>  bool virtqueue_poll(struct virtqueue *vq, unsigned);
>  
>  bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index a99f9b7..daf5bb0 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -39,6 +39,9 @@
>  #define VRING_DESC_F_WRITE	2
>  /* This means the buffer contains a list of buffer descriptors. */
>  #define VRING_DESC_F_INDIRECT	4
> +/* This means the descriptor should cause an interrupt
> + * ignoring avail event idx. */
> +#define VRING_DESC_F_URGENT	8
>  
>  /* The Host uses this in used->flags to advise the Guest: don't kick me when
>   * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
> @@ -48,7 +51,7 @@
>   * when you consume a buffer.  It's unreliable, so it's simply an
>   * optimization.  */
>  #define VRING_AVAIL_F_NO_INTERRUPT	1
> -
> +#define VRING_AVAIL_F_NO_URGENT_INTERRUPT	2
>  /* We support indirect buffer descriptors */
>  #define VIRTIO_RING_F_INDIRECT_DESC	28
>  
> -- 
> 1.8.3.1
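
To make the new API concrete, here is a minimal driver-side usage
sketch (hypothetical code with invented names; patch 3/3 below is the
real user, and this ignores NAPI and locking):

        /* queue a packet that must interrupt once the host completes it */
        static int xmit_one_urgent(struct virtqueue *vq, struct scatterlist *sg,
        			   unsigned int num_sg, struct sk_buff *skb)
        {
        	int err = virtqueue_add_outbuf_urgent(vq, sg, num_sg, skb,
        					      GFP_ATOMIC);
        	if (err)
        		return err;
        	virtqueue_kick(vq);
        	return 0;
        }

        /* reclaim transmitted packets in the interrupt callback */
        static void xmit_done(struct virtqueue *vq)
        {
        	struct sk_buff *skb;
        	unsigned int len;

        	do {
        		virtqueue_disable_cb_urgent(vq);
        		while ((skb = virtqueue_get_buf(vq, &len)))
        			dev_kfree_skb_any(skb);
        		/* enable returns false if more buffers were used
        		 * in the meantime: reclaim those as well */
        	} while (!virtqueue_enable_cb_urgent(vq));
        }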

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
  2014-10-11 14:48   ` Eric Dumazet
@ 2014-10-13  6:02     ` Jason Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-13  6:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kvm, mst, netdev, linux-kernel, virtualization, linux-api

On 10/11/2014 10:48 PM, Eric Dumazet wrote:
> On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
>> We used to free transmitted packets in ndo_start_xmit() to get better
>> performance. One side effect is that skb_orphan() needs to be called in
>> ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For TCP, this
>> means several optimizations, such as TCP small queues and auto corking,
>> could not work well. This can lead to extra low throughput for streams
>> of small packets.
>>
>> Thanks to the urgent descriptor support, this patch tries to solve this
>> issue by enabling the tx interrupt selectively for stream packets. This
>> means we don't need to orphan TCP stream packets in ndo_start_xmit(),
>> but can enable the tx interrupt for those packets instead. After we get
>> a tx interrupt, a tx napi is scheduled to free those packets.
>>
>> With this method, sk_wmem_alloc of the TCP socket is more accurate than
>> before, which lets TCP batch more through TSQ and auto corking.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 128 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 5810841..b450fc4 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -72,6 +72,8 @@ struct send_queue {
>>  
>>  	/* Name of the send queue: output.$index */
>>  	char name[40];
>> +
>> +	struct napi_struct napi;
>>  };
>>  
>>  /* Internal representation of a receive virtqueue */
>> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>>  	return p;
>>  }
>>  
>> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
>> +{
>> +	struct sk_buff *skb;
>> +	unsigned int len;
>> +	struct virtnet_info *vi = sq->vq->vdev->priv;
>> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>> +	int sent = 0;
>> +
>> +	while (sent < budget &&
>> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>> +		pr_debug("Sent skb %p\n", skb);
>> +
>> +		u64_stats_update_begin(&stats->tx_syncp);
>> +		stats->tx_bytes += skb->len;
>> +		stats->tx_packets++;
>> +		u64_stats_update_end(&stats->tx_syncp);
>> +
>> +		dev_kfree_skb_any(skb);
>> +		sent++;
>> +	}
>> +
> You could accumulate skb->len in a totlen var, and perform a single
>
> 	u64_stats_update_begin(&stats->tx_syncp);
> 	stats->tx_bytes += totlen;
> 	stats->tx_packets += sent;
> 	u64_stats_update_end(&stats->tx_syncp);
>
> after the loop.
>

Yes, will do this in a separate patch.
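
Roughly, the reworked helper would look like this (a sketch of the
suggestion applied to the function above; untested):

        static int free_old_xmit_skbs(struct send_queue *sq, int budget)
        {
        	struct sk_buff *skb;
        	unsigned int len, totlen = 0;
        	struct virtnet_info *vi = sq->vq->vdev->priv;
        	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
        	int sent = 0;

        	while (sent < budget &&
        	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
        		totlen += skb->len;
        		dev_kfree_skb_any(skb);
        		sent++;
        	}

        	/* one seqcount-protected update instead of one per packet */
        	u64_stats_update_begin(&stats->tx_syncp);
        	stats->tx_bytes += totlen;
        	stats->tx_packets += sent;
        	u64_stats_update_end(&stats->tx_syncp);

        	return sent;
        }
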
>> +	return sent;
>> +}
>> +
> ...
>
>> +
>> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
>> +{
>> +	union {
>> +		unsigned char *network;
>> +		struct iphdr *ipv4;
>> +		struct ipv6hdr *ipv6;
>> +	} hdr;
>> +	struct tcphdr *th = tcp_hdr(skb);
>> +	u16 payload_len;
>> +
>> +	hdr.network = skb_network_header(skb);
>> +
>> +	/* Only IPv4/IPv6 with TCP is supported */
> 	Oh well, yet another packet flow dissector :)
>
> 	If most packets were caught by your implementation, you could use it
> for the fast path and fall back to skb_flow_dissect() for encapsulated
> traffic.
>
> 	struct flow_keys keys;   
>
> 	if (!skb_flow_dissect(skb, &keys)) 
> 		return false;
>
> 	if (keys.ip_proto != IPPROTO_TCP)
> 		return false;
>
> 	then check __skb_get_poff() to see how to get th, and check whether
> there is some payload...
>
>

Yes, but we don't know whether most packets are TCP or encapsulated TCP;
it depends on the userspace application. If they are not, it looks like
skb_flow_dissect() can bring some overhead; or could that overhead be
ignored?
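
For reference, the suggested fallback would look roughly like this (a
sketch only, assuming this kernel generation's skb_flow_dissect() and
__skb_get_poff() signatures; the PUSH bit test would still need the
TCP header):

        static bool virtnet_skb_needs_intr(struct sk_buff *skb)
        {
        	struct flow_keys keys;
        	u32 poff;

        	if (!skb_flow_dissect(skb, &keys))
        		return false;

        	if (keys.ip_proto != IPPROTO_TCP)
        		return false;

        	/* __skb_get_poff() returns the offset of the TCP payload;
        	 * anything past it means the packet carries data */
        	poff = __skb_get_poff(skb);
        	return poff && skb->len > poff;
        }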


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
  2014-10-12  9:27   ` Michael S. Tsirkin
@ 2014-10-13  6:22     ` Jason Wang
  2014-10-13  7:16       ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Wang @ 2014-10-13  6:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On 10/12/2014 05:27 PM, Michael S. Tsirkin wrote:
> On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> I see that as compared to my original patch, you have
> added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
> I don't think it's necessary, see below.
>
> As such, I think this patch should be split:
> - original patch adding support for urgent descriptors
> - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?

Not sure this is a good idea, since the API of the first patch would be
incomplete on its own.
>
>> ---
>>  drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
>>  include/linux/virtio.h           | 14 ++++++++
>>  include/uapi/linux/virtio_ring.h |  5 ++-
>>  3 files changed, 89 insertions(+), 5 deletions(-)
>>
[...]
>>  
>> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
>> +{
>> +	struct vring_virtqueue *vq = to_vvq(_vq);
>> +	u16 last_used_idx;
>> +
>> +	START_USE(vq);
>> +	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
>> +	last_used_idx = vq->last_used_idx;
>> +	END_USE(vq);
>> +	return last_used_idx;
>> +}
>> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
>> +
> You can implement virtqueue_enable_cb_prepare_urgent
> simply by clearing VRING_AVAIL_F_NO_INTERRUPT.
>
> The effect is the same: the host sends interrupts only if there
> is an urgent descriptor.

Seems not; consider the case where event index is disabled. This would
turn on all interrupts.
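
To spell that out, the host-side decision is roughly the following (a
model of the legacy vring rules plus the proposed urgent flag, not code
from patch 2/3):

        static bool host_should_signal(const struct vring *vr, bool urgent,
        			       bool event_idx, u16 old, u16 new_idx)
        {
        	if (urgent)
        		return !(vr->avail->flags &
        			 VRING_AVAIL_F_NO_URGENT_INTERRUPT);
        	if (!event_idx)
        		/* legacy: every used buffer interrupts unless
        		 * VRING_AVAIL_F_NO_INTERRUPT is set */
        		return !(vr->avail->flags &
        			 VRING_AVAIL_F_NO_INTERRUPT);
        	return vring_need_event(vring_used_event(vr), new_idx, old);
        }

So with event idx disabled, clearing VRING_AVAIL_F_NO_INTERRUPT falls
into the legacy branch and interrupts on every used buffer.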

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
  2014-10-13  6:22     ` Jason Wang
@ 2014-10-13  7:16       ` Michael S. Tsirkin
  0 siblings, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2014-10-13  7:16 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On Mon, Oct 13, 2014 at 02:22:12PM +0800, Jason Wang wrote:
> On 10/12/2014 05:27 PM, Michael S. Tsirkin wrote:
> > On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
> >> Below should be useful for some experiments Jason is doing.
> >> I thought I'd send it out for early review/feedback.
> >>
> >> event idx feature allows us to defer interrupts until
> >> a specific # of descriptors were used.
> >> Sometimes it might be useful to get an interrupt after
> >> a specific descriptor, regardless.
> >> This adds a descriptor flag for this, and an API
> >> to create an urgent output descriptor.
> >> This is still an RFC:
> >> we'll need a feature bit for drivers to detect this,
> >> but we've run out of feature bits for virtio 0.X.
> >> For experimentation purposes, drivers can assume
> >> this is set, or add a driver-specific feature bit.
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> > I see that as compared to my original patch, you have
> > added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
> > I don't think it's necessary, see below.
> >
> > As such, I think this patch should be split:
> > - original patch adding support for urgent descriptors
> > - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?
> 
> Not sure this is a good idea, since the API of the first patch would be
> incomplete on its own.
> >
> >> ---
> >>  drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
> >>  include/linux/virtio.h           | 14 ++++++++
> >>  include/uapi/linux/virtio_ring.h |  5 ++-
> >>  3 files changed, 89 insertions(+), 5 deletions(-)
> >>
> [...]
> >>  
> >> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
> >> +{
> >> +	struct vring_virtqueue *vq = to_vvq(_vq);
> >> +	u16 last_used_idx;
> >> +
> >> +	START_USE(vq);
> >> +	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
> >> +	last_used_idx = vq->last_used_idx;
> >> +	END_USE(vq);
> >> +	return last_used_idx;
> >> +}
> >> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
> >> +
> > You can implement virtqueue_enable_cb_prepare_urgent
> > simply by clearing VRING_AVAIL_F_NO_INTERRUPT.
> >
> > The effect is the same: the host sends interrupts only if there
> > is an urgent descriptor.
> 
> Seems not; consider the case where event index is disabled. This would
> turn on all interrupts.

This means that a legacy device without event index support
will get more interrupts.
Sounds reasonable.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
       [not found] ` <1413011806-3813-1-git-send-email-jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-10-14 18:53   ` David Miller
  2014-10-14 21:51     ` Michael S. Tsirkin
  2014-10-14 23:06     ` Michael S. Tsirkin
  0 siblings, 2 replies; 18+ messages in thread
From: David Miller @ 2014-10-14 18:53 UTC (permalink / raw)
  To: jasowang-H+wXaHxf7aLQT0dZR+AlfA
  Cc: rusty-8n+1lVoiYb80n/F98K4Iww, mst-H+wXaHxf7aLQT0dZR+AlfA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

From: Jason Wang <jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Date: Sat, 11 Oct 2014 15:16:43 +0800

> We free old transmitted packets in ndo_start_xmit() currently, so any
> packet must be orphaned also there. This was used to reduce the overhead of
> tx interrupt to achieve better performance. But this may not work for some
> protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
> implement various optimization for small packets stream such as TCP small
> queue and auto corking. But orphaning packets early in ndo_start_xmit()
> disable such things more or less since sk_wmem_alloc was not accurate. This
> lead extra low throughput for TCP stream of small writes.
> 
> This series tries to solve this issue by enable tx interrupts for all TCP
> packets other than the ones with push bit or pure ACK. This is done through
> the support of urgent descriptor which can force an interrupt for a
> specified packet. If tx interrupt was enabled for a packet, there's no need
> to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
> by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
> can batch more for small write. More larger skb was produced by TCP in this
> case to improve both throughput and cpu utilization.
> 
> Test shows great improvements on small write tcp streams. For most of the
> other cases, the throughput and cpu utilization are the same in the
> past. Only few cases, more cpu utilization was noticed which needs more
> investigation.
> 
> Review and comments are welcomed.

I think proper accounting and queueing (at all levels, not just TCP
sockets) is more important than trying to skim a bunch of cycles by
avoiding TX interrupts.

Having an event to free the SKB is absolutely essential for the stack
to operate correctly.

And with virtio-net you don't even have the excuse of "the HW
unfortunately doesn't have an appropriate TX event."

So please don't play games, and instead use TX interrupts all the
time.  You can mitigate them in various ways, but don't turn them on
selectively based upon traffic type, that's terrible.

You can even use ->xmit_more to defer the TX interrupt indication to
the final TX packet in the chain.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 18:53   ` [PATCH net-next RFC 0/3] virtio-net: Conditionally " David Miller
@ 2014-10-14 21:51     ` Michael S. Tsirkin
  2014-10-15  3:24       ` Jason Wang
  2014-10-14 23:06     ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2014-10-14 21:51 UTC (permalink / raw)
  To: David Miller; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Sat, 11 Oct 2014 15:16:43 +0800
> 
> > We free old transmitted packets in ndo_start_xmit() currently, so any
> > packet must be orphaned also there. This was used to reduce the overhead of
> > tx interrupt to achieve better performance. But this may not work for some
> > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
> > implement various optimization for small packets stream such as TCP small
> > queue and auto corking. But orphaning packets early in ndo_start_xmit()
> > disable such things more or less since sk_wmem_alloc was not accurate. This
> > lead extra low throughput for TCP stream of small writes.
> > 
> > This series tries to solve this issue by enable tx interrupts for all TCP
> > packets other than the ones with push bit or pure ACK. This is done through
> > the support of urgent descriptor which can force an interrupt for a
> > specified packet. If tx interrupt was enabled for a packet, there's no need
> > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
> > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
> > can batch more for small write. More larger skb was produced by TCP in this
> > case to improve both throughput and cpu utilization.
> > 
> > Test shows great improvements on small write tcp streams. For most of the
> > other cases, the throughput and cpu utilization are the same in the
> > past. Only few cases, more cpu utilization was noticed which needs more
> > investigation.
> > 
> > Review and comments are welcomed.
> 
> I think proper accounting and queueing (at all levels, not just TCP
> sockets) is more important than trying to skim a bunch of cycles by
> avoiding TX interrupts.
> 
> Having an event to free the SKB is absolutely essential for the stack
> to operate correctly.
> 
> And with virtio-net you don't even have the excuse of "the HW
> unfortunately doesn't have an appropriate TX event."
> 
> So please don't play games, and instead use TX interrupts all the
> time.  You can mitigate them in various ways, but don't turn them on
> selectively based upon traffic type, that's terrible.

This got me thinking: how about using virtqueue_enable_cb_delayed
for this mitigation?

It's pretty easy to implement - I'll send a proof of concept patch
separately.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
  2014-10-11  7:16 ` [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt Jason Wang
  2014-10-11 14:48   ` Eric Dumazet
@ 2014-10-14 21:51   ` Michael S. Tsirkin
  2014-10-15  3:34     ` Jason Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2014-10-14 21:51 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On Sat, Oct 11, 2014 at 03:16:46PM +0800, Jason Wang wrote:
> We used to free transmitted packets in ndo_start_xmit() to get better
> performance. One side effect is that skb_orphan() needs to be called in
> ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For TCP, this
> means several optimizations, such as TCP small queues and auto corking,
> could not work well. This can lead to extra low throughput for streams
> of small packets.
>
> Thanks to the urgent descriptor support, this patch tries to solve this
> issue by enabling the tx interrupt selectively for stream packets. This
> means we don't need to orphan TCP stream packets in ndo_start_xmit(),
> but can enable the tx interrupt for those packets instead. After we get
> a tx interrupt, a tx napi is scheduled to free those packets.
>
> With this method, sk_wmem_alloc of the TCP socket is more accurate than
> before, which lets TCP batch more through TSQ and auto corking.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 128 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5810841..b450fc4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>  
>  	/* Name of the send queue: output.$index */
>  	char name[40];
> +
> +	struct napi_struct napi;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>  
> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> +	struct sk_buff *skb;
> +	unsigned int len;
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> +	int sent = 0;
> +
> +	while (sent < budget &&
> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		pr_debug("Sent skb %p\n", skb);
> +
> +		u64_stats_update_begin(&stats->tx_syncp);
> +		stats->tx_bytes += skb->len;
> +		stats->tx_packets++;
> +		u64_stats_update_end(&stats->tx_syncp);
> +
> +		dev_kfree_skb_any(skb);
> +		sent++;
> +	}
> +
> +	return sent;
> +}
> +
>  static void skb_xmit_done(struct virtqueue *vq)
>  {
>  	struct virtnet_info *vi = vq->vdev->priv;
> +	struct send_queue *sq = &vi->sq[vq2txq(vq)];
>  
> -	/* Suppress further interrupts. */
> -	virtqueue_disable_cb(vq);
> -
> -	/* We were probably waiting for more output buffers. */
> -	netif_wake_subqueue(vi->dev, vq2txq(vq));
> +	if (napi_schedule_prep(&sq->napi)) {
> +		virtqueue_disable_cb(vq);
> +		virtqueue_disable_cb_urgent(vq);

This disable_cb is no longer safe in the xmit_done callback,
since the queue can be running at the same time.

You must do it under the tx lock. And yes, this likely will not
work well without event_idx. We'll probably need extra
synchronization for such old hosts.



> +		__napi_schedule(&sq->napi);
> +	}
>  }
>  
>  static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
> @@ -772,7 +799,38 @@ again:
>  	return received;
>  }
>  
> +static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> +{
> +	struct send_queue *sq =
> +		container_of(napi, struct send_queue, napi);
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> +	unsigned int r, sent = 0;
> +
> +again:
> +	__netif_tx_lock(txq, smp_processor_id());
> +	sent += free_old_xmit_skbs(sq, budget - sent);
> +
> +	if (sent < budget) {
> +		r = virtqueue_enable_cb_prepare_urgent(sq->vq);
> +		napi_complete(napi);
> +		__netif_tx_unlock(txq);
> +		if (unlikely(virtqueue_poll(sq->vq, r)) &&
> +		    napi_schedule_prep(napi)) {
> +			virtqueue_disable_cb_urgent(sq->vq);
> +			__napi_schedule(napi);
> +			goto again;
> +		}
> +	} else {
> +		__netif_tx_unlock(txq);
> +	}
> +
> +	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
> +	return sent;
> +}
> +
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> +
>  /* must be called with local_bh_disable()d */
>  static int virtnet_busy_poll(struct napi_struct *napi)
>  {
> @@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev)
>  			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>  				schedule_delayed_work(&vi->refill, 0);
>  		virtnet_napi_enable(&vi->rq[i]);
> +		napi_enable(&vi->sq[i].napi);
>  	}
>  
>  	return 0;
>  }
>  
> -static void free_old_xmit_skbs(struct send_queue *sq)
> -{
> -	struct sk_buff *skb;
> -	unsigned int len;
> -	struct virtnet_info *vi = sq->vq->vdev->priv;
> -	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> -
> -	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		pr_debug("Sent skb %p\n", skb);
> -
> -		u64_stats_update_begin(&stats->tx_syncp);
> -		stats->tx_bytes += skb->len;
> -		stats->tx_packets++;
> -		u64_stats_update_end(&stats->tx_syncp);
> -
> -		dev_kfree_skb_any(skb);
> -	}
> -}
> -
> -static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> +static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool urgent)
>  {
>  	struct skb_vnet_hdr *hdr;
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> @@ -908,7 +948,43 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  		sg_set_buf(sq->sg, hdr, hdr_len);
>  		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
>  	}
> -	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> +	if (urgent)
> +		return virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg,
> +						   skb, GFP_ATOMIC);
> +	else
> +		return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
> +					    GFP_ATOMIC);
> +}
> +
> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
> +{
> +	union {
> +		unsigned char *network;
> +		struct iphdr *ipv4;
> +		struct ipv6hdr *ipv6;
> +	} hdr;
> +	struct tcphdr *th = tcp_hdr(skb);
> +	u16 payload_len;
> +
> +	hdr.network = skb_network_header(skb);
> +
> +	/* Only IPv4/IPv6 with TCP is supported */
> +	if ((skb->protocol == htons(ETH_P_IP)) &&
> +	    hdr.ipv4->protocol == IPPROTO_TCP) {
> +		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
> +			      th->doff * 4;
> +	} else if ((skb->protocol == htons(ETH_P_IPV6)) &&
> +		   hdr.ipv6->nexthdr == IPPROTO_TCP) {
> +		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
> +	} else {
> +		return false;
> +	}
> +
> +	/* We don't want to delay packets with the PUSH bit set or pure ACKs */
> +	if (!th->psh && payload_len)
> +		return true;
> +
> +	return false;
>  }
>  
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> @@ -916,13 +992,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int qnum = skb_get_queue_mapping(skb);
>  	struct send_queue *sq = &vi->sq[qnum];
> -	int err;
> +	bool urgent = virtnet_skb_needs_intr(skb);
> +	int err, qsize = virtqueue_get_vring_size(sq->vq);
>  
> +	virtqueue_disable_cb_urgent(sq->vq);
>  	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(sq);
> +	free_old_xmit_skbs(sq, qsize);
>  
>  	/* Try to transmit */
> -	err = xmit_skb(sq, skb);
> +	err = xmit_skb(sq, skb, urgent);
>  
>  	/* This should not happen! */
>  	if (unlikely(err)) {
> @@ -935,22 +1013,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  		return NETDEV_TX_OK;
>  	}
>  
> -	/* Don't wait up for transmitted skbs to be freed. */
> -	skb_orphan(skb);
> -	nf_reset(skb);
> +	if (!urgent) {
> +		skb_orphan(skb);
> +		nf_reset(skb);
> +	}
>  
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>  		netif_stop_subqueue(dev, qnum);
> +		virtqueue_disable_cb_urgent(sq->vq);
>  		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>  			/* More just got used, free them then recheck. */
> -			free_old_xmit_skbs(sq);
> +			free_old_xmit_skbs(sq, qsize);
>  			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>  				netif_start_subqueue(dev, qnum);
>  				virtqueue_disable_cb(sq->vq);
>  			}
>  		}
> +	} else if (virtqueue_enable_cb_urgent(sq->vq)) {
> +		free_old_xmit_skbs(sq, qsize);
>  	}
>  
>  	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
> @@ -1132,8 +1214,10 @@ static int virtnet_close(struct net_device *dev)
>  	/* Make sure refill_work doesn't re-enable napi! */
>  	cancel_delayed_work_sync(&vi->refill);
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		napi_disable(&vi->rq[i].napi);
> +		napi_disable(&vi->sq[i].napi);
> +	}
>  
>  	return 0;
>  }
> @@ -1452,8 +1536,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>  {
>  	int i;
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		netif_napi_del(&vi->rq[i].napi);
> +		netif_napi_del(&vi->sq[i].napi);
> +	}
>  
>  	kfree(vi->rq);
>  	kfree(vi->sq);
> @@ -1607,6 +1693,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
>  			       napi_weight);
>  		napi_hash_add(&vi->rq[i].napi);
> +		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
> +			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
>  		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
> @@ -1912,8 +2000,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  	if (netif_running(vi->dev)) {
>  		for (i = 0; i < vi->max_queue_pairs; i++) {
>  			napi_disable(&vi->rq[i].napi);
> +			napi_disable(&vi->sq[i].napi);
>  			napi_hash_del(&vi->rq[i].napi);
>  			netif_napi_del(&vi->rq[i].napi);
> +			netif_napi_del(&vi->sq[i].napi);
>  		}
>  	}
>  
> @@ -1938,8 +2028,10 @@ static int virtnet_restore(struct virtio_device *vdev)
>  			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>  				schedule_delayed_work(&vi->refill, 0);
>  
> -		for (i = 0; i < vi->max_queue_pairs; i++)
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
>  			virtnet_napi_enable(&vi->rq[i]);
> +			napi_enable(&vi->sq[i].napi);
> +		}
>  	}
>  
>  	netif_device_attach(vi->dev);
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 18:53   ` [PATCH net-next RFC 0/3] virtio-net: Conditionally " David Miller
  2014-10-14 21:51     ` Michael S. Tsirkin
@ 2014-10-14 23:06     ` Michael S. Tsirkin
  2014-10-15  7:28       ` Jason Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2014-10-14 23:06 UTC (permalink / raw)
  To: David Miller; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Sat, 11 Oct 2014 15:16:43 +0800
> 
> > We free old transmitted packets in ndo_start_xmit() currently, so any
> > packet must be orphaned also there. This was used to reduce the overhead of
> > tx interrupt to achieve better performance. But this may not work for some
> > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
> > implement various optimization for small packets stream such as TCP small
> > queue and auto corking. But orphaning packets early in ndo_start_xmit()
> > disable such things more or less since sk_wmem_alloc was not accurate. This
> > lead extra low throughput for TCP stream of small writes.
> > 
> > This series tries to solve this issue by enable tx interrupts for all TCP
> > packets other than the ones with push bit or pure ACK. This is done through
> > the support of urgent descriptor which can force an interrupt for a
> > specified packet. If tx interrupt was enabled for a packet, there's no need
> > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
> > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
> > can batch more for small write. More larger skb was produced by TCP in this
> > case to improve both throughput and cpu utilization.
> > 
> > Test shows great improvements on small write tcp streams. For most of the
> > other cases, the throughput and cpu utilization are the same in the
> > past. Only few cases, more cpu utilization was noticed which needs more
> > investigation.
> > 
> > Review and comments are welcomed.
> 
> I think proper accounting and queueing (at all levels, not just TCP
> sockets) is more important than trying to skim a bunch of cycles by
> avoiding TX interrupts.
> 
> Having an event to free the SKB is absolutely essential for the stack
> to operate correctly.
> 
> And with virtio-net you don't even have the excuse of "the HW
> unfortunately doesn't have an appropriate TX event."
> 
> So please don't play games, and instead use TX interrupts all the
> time.  You can mitigate them in various ways, but don't turn them on
> selectively based upon traffic type, that's terrible.
> 
> You can even use ->xmit_more to defer the TX interrupt indication to
> the final TX packet in the chain.

I guess we can just defer the kick; the interrupt will naturally be
deferred as well.
This should solve the problem for old hosts too.

We'll also need to implement bql for this.
Something like the below?
Completely untested - posting here to see if I figured the
API out correctly. Has to be applied on top of the previous patch.

---

virtio_net: bql + xmit_more

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62c059d..c245047 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -213,13 +213,15 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
-static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+static int free_old_xmit_skbs(struct netdev_queue *txq,
+			      struct send_queue *sq, int budget)
 {
 	struct sk_buff *skb;
 	unsigned int len;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 	int sent = 0;
+	unsigned int bytes = 0;
 
 	while (sent < budget &&
 	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
@@ -227,6 +229,7 @@ static int free_old_xmit_skbs(struct send_queue *sq, int budget)
 
 		u64_stats_update_begin(&stats->tx_syncp);
 		stats->tx_bytes += skb->len;
+		bytes += skb->len;
 		stats->tx_packets++;
 		u64_stats_update_end(&stats->tx_syncp);
 
@@ -234,6 +237,8 @@ static int free_old_xmit_skbs(struct send_queue *sq, int budget)
 		sent++;
 	}
 
+	netdev_tx_completed_queue(txq, sent, bytes);
+
 	return sent;
 }
 
@@ -802,7 +807,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 again:
 	__netif_tx_lock(txq, smp_processor_id());
 	virtqueue_disable_cb(sq->vq);
-	sent += free_old_xmit_skbs(sq, budget - sent);
+	sent += free_old_xmit_skbs(txq, sq, budget - sent);
 
 	if (sent < budget) {
 		r = virtqueue_enable_cb_prepare(sq->vq);
@@ -951,6 +956,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	int qnum = skb_get_queue_mapping(skb);
 	struct send_queue *sq = &vi->sq[qnum];
 	int err, qsize = virtqueue_get_vring_size(sq->vq);
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
+	bool kick = !skb->xmit_more || netif_xmit_stopped(txq);
+	unsigned int bytes = skb->len;
 
 	virtqueue_disable_cb(sq->vq);
 
@@ -967,7 +975,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(sq->vq);
+
+	netdev_tx_sent_queue(txq, bytes);
+
+	if (kick)
+		virtqueue_kick(sq->vq);
 
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
@@ -975,14 +987,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		netif_stop_subqueue(dev, qnum);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq, qsize);
+			free_old_xmit_skbs(txq, sq, qsize);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
 	} else if (virtqueue_enable_cb_delayed(sq->vq)) {
-		free_old_xmit_skbs(sq, qsize);
+		free_old_xmit_skbs(txq, sq, qsize);
 	}
 
 	return NETDEV_TX_OK;
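
One more thing a non-RFC version would probably also need (my
assumption, based on how BQL is used in other drivers): the BQL state
has to be forgotten when the device is reset, e.g. something like:

        /* sketch: drop in-flight BQL accounting on device reset */
        for (i = 0; i < vi->max_queue_pairs; i++)
        	netdev_tx_reset_queue(netdev_get_tx_queue(vi->dev, i));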

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 21:51     ` Michael S. Tsirkin
@ 2014-10-15  3:24       ` Jason Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-15  3:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Miller
  Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On 10/15/2014 05:51 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Sat, 11 Oct 2014 15:16:43 +0800
>>
>>> We free old transmitted packets in ndo_start_xmit() currently, so any
>>> packet must be orphaned also there. This was used to reduce the overhead of
>>> tx interrupt to achieve better performance. But this may not work for some
>>> protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
>>> implement various optimization for small packets stream such as TCP small
>>> queue and auto corking. But orphaning packets early in ndo_start_xmit()
>>> disable such things more or less since sk_wmem_alloc was not accurate. This
>>> lead extra low throughput for TCP stream of small writes.
>>>
>>> This series tries to solve this issue by enable tx interrupts for all TCP
>>> packets other than the ones with push bit or pure ACK. This is done through
>>> the support of urgent descriptor which can force an interrupt for a
>>> specified packet. If tx interrupt was enabled for a packet, there's no need
>>> to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
>>> by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
>>> can batch more for small write. More larger skb was produced by TCP in this
>>> case to improve both throughput and cpu utilization.
>>>
>>> Test shows great improvements on small write tcp streams. For most of the
>>> other cases, the throughput and cpu utilization are the same in the
>>> past. Only few cases, more cpu utilization was noticed which needs more
>>> investigation.
>>>
>>> Review and comments are welcomed.
>> I think proper accounting and queueing (at all levels, not just TCP
>> sockets) is more important than trying to skim a bunch of cycles by
>> avoiding TX interrupts.
>>
>> Having an event to free the SKB is absolutely essential for the stack
>> to operate correctly.
>>
>> And with virtio-net you don't even have the excuse of "the HW
>> unfortunately doesn't have an appropriate TX event."
>>
>> So please don't play games, and instead use TX interrupts all the
>> time.  You can mitigate them in various ways, but don't turn them on
>> selectively based upon traffic type, that's terrible.
> This got me thinking: how about using virtqueue_enable_cb_delayed
> for this mitigation?

Should work; another possible solution is interrupt coalescing, which
can also speed up the case without event index.
> It's pretty easy to implement - I'll send a proof of concept patch
> separately.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
  2014-10-14 21:51   ` Michael S. Tsirkin
@ 2014-10-15  3:34     ` Jason Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-15  3:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On 10/15/2014 05:51 AM, Michael S. Tsirkin wrote:
> On Sat, Oct 11, 2014 at 03:16:46PM +0800, Jason Wang wrote:
>> > We used to free transmitted packets in ndo_start_xmit() to get better
>> > performance. One side effect is that skb_orphan() needs to be called in
>> > ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For TCP, this
>> > means several optimizations, such as TCP small queues and auto corking,
>> > could not work well. This can lead to extra low throughput for streams
>> > of small packets.
>> >
>> > Thanks to the urgent descriptor support, this patch tries to solve this
>> > issue by enabling the tx interrupt selectively for stream packets. This
>> > means we don't need to orphan TCP stream packets in ndo_start_xmit(),
>> > but can enable the tx interrupt for those packets instead. After we get
>> > a tx interrupt, a tx napi is scheduled to free those packets.
>> >
>> > With this method, sk_wmem_alloc of the TCP socket is more accurate than
>> > before, which lets TCP batch more through TSQ and auto corking.
>> > 
>> > Signed-off-by: Jason Wang <jasowang@redhat.com>
>> > ---
>> >  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
>> >  1 file changed, 128 insertions(+), 36 deletions(-)
>> > 
>> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > index 5810841..b450fc4 100644
>> > --- a/drivers/net/virtio_net.c
>> > +++ b/drivers/net/virtio_net.c
>> > @@ -72,6 +72,8 @@ struct send_queue {
>> >  
>> >  	/* Name of the send queue: output.$index */
>> >  	char name[40];
>> > +
>> > +	struct napi_struct napi;
>> >  };
>> >  
>> >  /* Internal representation of a receive virtqueue */
>> > @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>> >  	return p;
>> >  }
>> >  
>> > +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
>> > +{
>> > +	struct sk_buff *skb;
>> > +	unsigned int len;
>> > +	struct virtnet_info *vi = sq->vq->vdev->priv;
>> > +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>> > +	int sent = 0;
>> > +
>> > +	while (sent < budget &&
>> > +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>> > +		pr_debug("Sent skb %p\n", skb);
>> > +
>> > +		u64_stats_update_begin(&stats->tx_syncp);
>> > +		stats->tx_bytes += skb->len;
>> > +		stats->tx_packets++;
>> > +		u64_stats_update_end(&stats->tx_syncp);
>> > +
>> > +		dev_kfree_skb_any(skb);
>> > +		sent++;
>> > +	}
>> > +
>> > +	return sent;
>> > +}
>> > +
>> >  static void skb_xmit_done(struct virtqueue *vq)
>> >  {
>> >  	struct virtnet_info *vi = vq->vdev->priv;
>> > +	struct send_queue *sq = &vi->sq[vq2txq(vq)];
>> >  
>> > -	/* Suppress further interrupts. */
>> > -	virtqueue_disable_cb(vq);
>> > -
>> > -	/* We were probably waiting for more output buffers. */
>> > -	netif_wake_subqueue(vi->dev, vq2txq(vq));
>> > +	if (napi_schedule_prep(&sq->napi)) {
>> > +		virtqueue_disable_cb(vq);
>> > +		virtqueue_disable_cb_urgent(vq);
> This disable_cb is no longer safe in the xmit_done callback,
> since the queue can be running at the same time.
>
> You must do it under the tx lock. And yes, this likely will not
> work well without event_idx. We'll probably need extra
> synchronization for such old hosts.
>
>
>

Yes, and the virtqueue_enable_cb_prepare() in virtnet_poll_tx() needs to
be synced with virtqueue_enable_cb_delayed(). Otherwise an old idx will
be published and we may still see a tx interrupt storm.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
  2014-10-11  7:16 ` [PATCH net-next RFC 1/3] virtio: support for urgent descriptors Jason Wang
  2014-10-12  9:27   ` Michael S. Tsirkin
@ 2014-10-15  5:40   ` Rusty Russell
  2014-10-17  5:23     ` Jason Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Rusty Russell @ 2014-10-15  5:40 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm

Jason Wang <jasowang@redhat.com> writes:
> Below should be useful for some experiments Jason is doing.
> I thought I'd send it out for early review/feedback.
>
> event idx feature allows us to defer interrupts until
> a specific # of descriptors were used.
> Sometimes it might be useful to get an interrupt after
> a specific descriptor, regardless.
> This adds a descriptor flag for this, and an API
> to create an urgent output descriptor.
> This is still an RFC:
> we'll need a feature bit for drivers to detect this,
> but we've run out of feature bits for virtio 0.X.
> For experimentation purposes, drivers can assume
> this is set, or add a driver-specific feature bit.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
networking (which tends to take packets in order) couldn't we just set
the event counter to give us a tx interrupt at the packet we want?

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 23:06     ` Michael S. Tsirkin
@ 2014-10-15  7:28       ` Jason Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-15  7:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Miller
  Cc: kvm, netdev, linux-kernel, virtualization, linux-api

On 10/15/2014 07:06 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
>> > From: Jason Wang <jasowang@redhat.com>
>> > Date: Sat, 11 Oct 2014 15:16:43 +0800
>> > 
>>> > > We free old transmitted packets in ndo_start_xmit() currently, so any
>>> > > packet must be orphaned also there. This was used to reduce the overhead of
>>> > > tx interrupt to achieve better performance. But this may not work for some
>>> > > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
>>> > > implement various optimization for small packets stream such as TCP small
>>> > > queue and auto corking. But orphaning packets early in ndo_start_xmit()
>>> > > disable such things more or less since sk_wmem_alloc was not accurate. This
>>> > > lead extra low throughput for TCP stream of small writes.
>>> > > 
>>> > > This series tries to solve this issue by enable tx interrupts for all TCP
>>> > > packets other than the ones with push bit or pure ACK. This is done through
>>> > > the support of urgent descriptor which can force an interrupt for a
>>> > > specified packet. If tx interrupt was enabled for a packet, there's no need
>>> > > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
>>> > > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
>>> > > can batch more for small write. More larger skb was produced by TCP in this
>>> > > case to improve both throughput and cpu utilization.
>>> > > 
>>> > > Test shows great improvements on small write tcp streams. For most of the
>>> > > other cases, the throughput and cpu utilization are the same in the
>>> > > past. Only few cases, more cpu utilization was noticed which needs more
>>> > > investigation.
>>> > > 
>>> > > Review and comments are welcomed.
>> > 
>> > I think proper accounting and queueing (at all levels, not just TCP
>> > sockets) is more important than trying to skim a bunch of cycles by
>> > avoiding TX interrupts.
>> > 
>> > Having an event to free the SKB is absolutely essential for the stack
>> > to operate correctly.
>> > 
>> > And with virtio-net you don't even have the excuse of "the HW
>> > unfortunately doesn't have an appropriate TX event."
>> > 
>> > So please don't play games, and instead use TX interrupts all the
>> > time.  You can mitigate them in various ways, but don't turn them on
>> > selectively based upon traffic type, that's terrible.
>> > 
>> > You can even use ->xmit_more to defer the TX interrupt indication to
>> > the final TX packet in the chain.
> I guess we can just defer the kick; the interrupt will naturally be
> deferred as well.
> This should solve the problem for old hosts too.

Interrupts were delayed but not reduced; to support this we need to
publish the avail idx as the used event. This should reduce tx
interrupts in the case of bulk dequeuing.

I will draft a new RFC series containing this.
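
The core of that idea as a sketch (assuming VIRTIO_RING_F_EVENT_IDX is
negotiated):

        /* publish the current avail idx as the used event, so the host
         * interrupts only after consuming everything queued so far */
        vring_used_event(&vq->vring) = vq->vring.avail->idx;
        /* make the new event index visible before the host checks it */
        virtio_mb(vq->weak_barriers);
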
>
> We'll also need to implement bql for this.
> Something like the below?
> Completely untested - posting here to see if I figured the
> API out correctly. Has to be applied on top of the previous patch.

Looks so. I believe it's better to have, but not a must.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
  2014-10-15  5:40   ` Rusty Russell
@ 2014-10-17  5:23     ` Jason Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2014-10-17  5:23 UTC (permalink / raw)
  To: Rusty Russell, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm

On 10/15/2014 01:40 PM, Rusty Russell wrote:
> Jason Wang <jasowang@redhat.com> writes:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
> networking (which tends to take packets in order) couldn't we just set
> the event counter to give us a tx interrupt at the packet we want?
>
> Cheers,
> Rusty.

Yes, we could. The recent RFC for enabling tx interrupts uses this.
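
Concretely, instead of a descriptor flag the driver would just move the
used event index to the descriptor it cares about, e.g. (hypothetical
helper):

        /* ask for a tx interrupt once the buffer at idx is used */
        static void virtqueue_request_intr_at(struct vring_virtqueue *vq,
        				      u16 idx)
        {
        	vring_used_event(&vq->vring) = idx;
        	virtio_mb(vq->weak_barriers);
        }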

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-10-17  5:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-11  7:16 [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt Jason Wang
2014-10-11  7:16 ` [PATCH net-next RFC 1/3] virtio: support for urgent descriptors Jason Wang
2014-10-12  9:27   ` Michael S. Tsirkin
2014-10-13  6:22     ` Jason Wang
2014-10-13  7:16       ` Michael S. Tsirkin
2014-10-15  5:40   ` Rusty Russell
2014-10-17  5:23     ` Jason Wang
2014-10-11  7:16 ` [PATCH net-next RFC 2/3] vhost: support " Jason Wang
2014-10-11  7:16 ` [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt Jason Wang
2014-10-11 14:48   ` Eric Dumazet
2014-10-13  6:02     ` Jason Wang
2014-10-14 21:51   ` Michael S. Tsirkin
2014-10-15  3:34     ` Jason Wang
     [not found] ` <1413011806-3813-1-git-send-email-jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-10-14 18:53   ` [PATCH net-next RFC 0/3] virtio-net: Conditionally " David Miller
2014-10-14 21:51     ` Michael S. Tsirkin
2014-10-15  3:24       ` Jason Wang
2014-10-14 23:06     ` Michael S. Tsirkin
2014-10-15  7:28       ` Jason Wang
