* [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
@ 2014-10-11  7:16 Jason Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm

Hello all:

We currently free old transmitted packets in ndo_start_xmit(), so any
packet must also be orphaned there. This was done to reduce the overhead of
tx interrupts and achieve better performance. But it may not work well for
some protocols such as TCP streams. TCP depends on the value of sk_wmem_alloc
to implement various optimizations for streams of small packets, such as TCP
small queues and auto corking. Orphaning packets early in ndo_start_xmit()
more or less disables such optimizations, since sk_wmem_alloc is no longer
accurate. This leads to very low throughput for TCP streams of small writes.

This series tries to solve the issue by enabling tx interrupts for all TCP
packets other than those with the push bit set or pure ACKs. This is done
through support for urgent descriptors, which can force an interrupt for a
specified packet. If the tx interrupt is enabled for a packet, there's no need
to orphan it in ndo_start_xmit(); we can free it in tx napi, which is scheduled
by the tx interrupt. Then sk_wmem_alloc is more accurate than before and TCP
can batch more for small writes. Larger skbs are produced by TCP in this
case, improving both throughput and cpu utilization.
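
A minimal sketch of the per-packet decision, just to illustrate the idea
(skb_wants_tx_interrupt() and virtqueue_add_outbuf_urgent() are placeholder
names for this mail, not the code in patch 3/3):

/*
 * Sketch only: what the per-packet choice could look like in the
 * virtio-net xmit path.
 */
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/skbuff.h>
#include <linux/virtio.h>

static bool skb_wants_tx_interrupt(const struct sk_buff *skb)
{
	const struct tcphdr *th;

	/* IPv4/TCP only for brevity; everything else keeps the old path. */
	if (skb->protocol != htons(ETH_P_IP) ||
	    ip_hdr(skb)->protocol != IPPROTO_TCP)
		return false;

	th = tcp_hdr(skb);

	/* Packets with the push bit set and pure ACKs (no payload) keep
	 * the old behaviour: orphaned in ndo_start_xmit(), no tx
	 * interrupt requested.
	 */
	if (th->psh)
		return false;
	if (skb->len == skb_transport_offset(skb) + tcp_hdrlen(skb))
		return false;

	return true;
}

	/* In start_xmit(), after building the sg list: */
	if (skb_wants_tx_interrupt(skb)) {
		/* Urgent descriptor: device raises a tx interrupt and the
		 * skb is freed later from tx napi.
		 */
		err = virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg,
						  skb, GFP_ATOMIC);
	} else {
		skb_orphan(skb);
		err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
					   skb, GFP_ATOMIC);
	}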

Tests show great improvements for TCP streams of small writes. For most of
the other cases, throughput and cpu utilization are the same as before. Only
in a few cases was more cpu utilization noticed, which needs more
investigation.

Review and comments are welcomed.

Thanks

Test result:

- Two Intel Xeon 5600 machines (8 cores) with back-to-back connected
  82599ES NICs
- netperf test between guest and remote host
- 1 queue, 2 vcpus, with zerocopy-enabled vhost_net
- both host and guest run net-next.git with the patches applied.
- Values in '[]' indicate an obvious difference (significance greater
  than 95%).
- The significance of the difference between the two averages is calculated
  using an unpaired t-test that takes the SD of the averages into account
  (see the sketch below).
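
For reference, a minimal standalone sketch of that calculation (Welch's
unpaired t-test from per-run means, SDs, and run counts; the numbers below
are made up, and this is not the script actually used):

/* Sketch: Welch's unpaired t-test computed from two sample summaries
 * (mean, SD, number of runs). Compare |t| against the t distribution's
 * critical value for the chosen confidence level and df.
 */
#include <math.h>
#include <stdio.h>

struct summary {
	double mean;
	double sd;
	int n;
};

static double welch_t(struct summary a, struct summary b, double *df)
{
	double va = (a.sd * a.sd) / a.n;
	double vb = (b.sd * b.sd) / b.n;

	/* Welch-Satterthwaite degrees of freedom. */
	*df = (va + vb) * (va + vb) /
	      (va * va / (a.n - 1) + vb * vb / (b.n - 1));
	return (a.mean - b.mean) / sqrt(va + vb);
}

int main(void)
{
	/* Made-up throughput summaries (Mbit/s) for one test case. */
	struct summary base = { .mean = 940.0, .sd = 12.0, .n = 5 };
	struct summary patched = { .mean = 975.0, .sd = 10.0, .n = 5 };
	double df, t = welch_t(base, patched, &df);

	printf("t = %.3f, df = %.1f\n", t, df);
	return 0;
}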

Guest RX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/+3.7872%/+3.2307%/+0.5390%/
64/2/-0.2325%/+2.9552%/-3.0962%/
64/4/[-2.0296%]/+2.2955%/[-4.2280%]/
64/8/+0.0944%/[+2.2654%]/-2.4662%/
256/1/+1.1947%/-2.5462%/+3.8386%/
256/2/-1.6477%/+3.4421%/-4.9301%/
256/4/[-5.9526%]/[+6.8861%]/[-11.9951%]/
256/8/-3.6470%/-1.5887%/-2.0916%/
1024/1/-4.2225%/-1.3238%/-2.9376%/
1024/2/+0.3568%/+1.8439%/-1.4601%/
1024/4/-0.7065%/-0.0099%/-2.3483%/
1024/8/-1.8620%/-2.4774%/+0.6310%/
4096/1/+0.0115%/-0.3693%/+0.3823%/
4096/2/-0.0209%/+0.8730%/-0.8862%/
4096/4/+0.0729%/-7.0303%/+7.6403%/
4096/8/-2.3720%/+0.0507%/-2.4214%/
16384/1/+0.0222%/-1.8672%/+1.9254%/
16384/2/+0.0986%/+3.2968%/-3.0961%/
16384/4/-1.2059%/+7.4291%/-8.0379%/
16384/8/-1.4893%/+0.3403%/-1.8234%/
65535/1/-0.0445%/-1.4060%/+1.3808%/
65535/2/-0.0311%/+0.9610%/-0.9827%/
65535/4/-0.7015%/+0.3660%/-1.0637%/
65535/8/-3.1585%/+11.1302%/[-12.8576%]/

Guest TX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/[+75.2622%]/[-14.3928%]/[+104.7283%]/
64/2/[+68.9596%]/[-12.6655%]/[+93.4625%]/
64/4/[+68.0126%]/[-12.7982%]/[+92.6710%]/
64/8/[+67.9870%]/[-12.6297%]/[+92.2703%]/
256/1/[+160.4177%]/[-26.9643%]/[+256.5624%]/
256/2/[+48.4357%]/[-24.3380%]/[+96.1825%]/
256/4/[+48.3663%]/[-24.1127%]/[+95.5087%]/
256/8/[+47.9722%]/[-24.2516%]/[+95.3469%]/
1024/1/[+54.4474%]/[-52.9223%]/[+228.0694%]/
1024/2/+0.0742%/[-12.7444%]/[+14.6908%]/
1024/4/[+0.5524%]/-0.0327%/+0.5853%/
1024/8/[-1.2783%]/[+6.2902%]/[-7.1206%]/
4096/1/+0.0778%/-13.1121%/+15.1804%/
4096/2/+0.0189%/[-11.3176%]/[+12.7832%]/
4096/4/+0.0218%/-1.0389%/+1.0718%/
4096/8/-1.3774%/[+12.7396%]/[-12.5218%]/
16384/1/+0.0136%/-2.5043%/+2.5826%/
16384/2/+0.0509%/[-15.3846%]/[+18.2420%]/
16384/4/-0.0163%/[-4.8808%]/[+5.1141%]/
16384/8/[-1.7249%]/[+13.9174%]/[-13.7313%]/
65535/1/+0.0686%/-5.4942%/+5.8862%/
65535/2/+0.0043%/[-7.5816%]/[+8.2082%]/
65535/4/+0.0080%/[-7.2993%]/[+7.8827%]/
65535/8/[-1.3669%]/[+16.6536%]/[-15.4479%]/

Guest TCP_RR
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
256/1/-0.2914%/+12.6457%/-11.4848%/
256/25/-0.5968%/-5.0531%/+4.6935%/
256/50/+0.0262%/+0.2079%/-0.1813%/
4096/1/+2.6965%/[+16.1248%]/[-11.5636%]/
4096/25/-0.5002%/+0.5449%/-1.0395%/
4096/50/[-2.0987%]/-0.0330%/[-2.0664%]/

Tests on mlx4 are ongoing; I will post the results next week.

Jason Wang (3):
  virtio: support for urgent descriptors
  vhost: support urgent descriptors
  virtio-net: conditionally enable tx interrupt

 drivers/net/virtio_net.c         | 164 ++++++++++++++++++++++++++++++---------
 drivers/vhost/net.c              |  43 +++++++---
 drivers/vhost/scsi.c             |  23 ++++--
 drivers/vhost/test.c             |   5 +-
 drivers/vhost/vhost.c            |  44 +++++++----
 drivers/vhost/vhost.h            |  19 +++--
 drivers/virtio/virtio_ring.c     |  75 +++++++++++++++++-
 include/linux/virtio.h           |  14 ++++
 include/uapi/linux/virtio_ring.h |   5 +-
 9 files changed, 308 insertions(+), 84 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 23:06     ` Michael S. Tsirkin
@ 2014-10-15  7:28       ` Jason Wang
  -1 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2014-10-15  7:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Miller
  Cc: rusty, virtualization, netdev, linux-kernel, linux-api, kvm

On 10/15/2014 07:06 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Sat, 11 Oct 2014 15:16:43 +0800
>>
>>> We free old transmitted packets in ndo_start_xmit() currently, so any
>>> packet must be orphaned also there. This was used to reduce the overhead of
>>> tx interrupt to achieve better performance. But this may not work for some
>>> protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
>>> implement various optimization for small packets stream such as TCP small
>>> queue and auto corking. But orphaning packets early in ndo_start_xmit()
>>> disable such things more or less since sk_wmem_alloc was not accurate. This
>>> lead extra low throughput for TCP stream of small writes.
>>>
>>> This series tries to solve this issue by enable tx interrupts for all TCP
>>> packets other than the ones with push bit or pure ACK. This is done through
>>> the support of urgent descriptor which can force an interrupt for a
>>> specified packet. If tx interrupt was enabled for a packet, there's no need
>>> to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
>>> by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
>>> can batch more for small write. More larger skb was produced by TCP in this
>>> case to improve both throughput and cpu utilization.
>>>
>>> Test shows great improvements on small write tcp streams. For most of the
>>> other cases, the throughput and cpu utilization are the same in the
>>> past. Only few cases, more cpu utilization was noticed which needs more
>>> investigation.
>>>
>>> Review and comments are welcomed.
>>
>> I think proper accounting and queueing (at all levels, not just TCP
>> sockets) is more important than trying to skim a bunch of cycles by
>> avoiding TX interrupts.
>>
>> Having an event to free the SKB is absolutely essential for the stack
>> to operate correctly.
>>
>> And with virtio-net you don't even have the excuse of "the HW
>> unfortunately doesn't have an appropriate TX event."
>>
>> So please don't play games, and instead use TX interrupts all the
>> time.  You can mitigate them in various ways, but don't turn them on
>> selectively based upon traffic type, that's terrible.
>>
>> You can even use ->xmit_more to defer the TX interrupt indication to
>> the final TX packet in the chain.
> I guess we can just defer the kick, interrupt will naturally be
> deferred as well.
> This should solve the problem for old hosts as well.

Interrupts were delayed but not reduced; to support this we need to publish
the avail idx as the used event. This should reduce tx interrupts in the case
of bulk dequeuing.

I will draft a new RFC series containing this.
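
Concretely, "publish avail idx as used event" means using the event index
machinery so the device only signals once it has consumed everything queued
so far. A rough sketch of the mechanism only, not the eventual series:

/* Sketch only: with VIRTIO_RING_F_EVENT_IDX the driver publishes the
 * index after which it wants to be interrupted. Writing the current
 * avail index asks the device to interrupt only once it has used
 * everything queued so far, so bulk dequeuing raises one interrupt
 * instead of one per packet. Memory barriers and index wrapping are
 * omitted; the real logic lives in virtqueue_enable_cb()/_delayed().
 */
#include <linux/virtio_ring.h>

static void irq_after_current_batch(struct vring *vr, u16 avail_idx)
{
	/* used_event lives at the tail of the avail ring. */
	vring_used_event(vr) = avail_idx;
}
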
>
> We'll also need to implement bql for this.
> Something like the below?
> Completely untested - posting here to see if I figured the
> API out correctly. Has to be applied on top of the previous patch.

Looks so. I believe it's better to have, but not a must.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 21:51     ` Michael S. Tsirkin
@ 2014-10-15  3:24       ` Jason Wang
  -1 siblings, 0 replies; 14+ messages in thread
From: Jason Wang @ 2014-10-15  3:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Miller
  Cc: rusty, virtualization, netdev, linux-kernel, linux-api, kvm

On 10/15/2014 05:51 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Sat, 11 Oct 2014 15:16:43 +0800
>>
>>> We free old transmitted packets in ndo_start_xmit() currently, so any
>>> packet must be orphaned also there. This was used to reduce the overhead of
>>> tx interrupt to achieve better performance. But this may not work for some
>>> protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
>>> implement various optimization for small packets stream such as TCP small
>>> queue and auto corking. But orphaning packets early in ndo_start_xmit()
>>> disable such things more or less since sk_wmem_alloc was not accurate. This
>>> lead extra low throughput for TCP stream of small writes.
>>>
>>> This series tries to solve this issue by enable tx interrupts for all TCP
>>> packets other than the ones with push bit or pure ACK. This is done through
>>> the support of urgent descriptor which can force an interrupt for a
>>> specified packet. If tx interrupt was enabled for a packet, there's no need
>>> to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
>>> by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
>>> can batch more for small write. More larger skb was produced by TCP in this
>>> case to improve both throughput and cpu utilization.
>>>
>>> Test shows great improvements on small write tcp streams. For most of the
>>> other cases, the throughput and cpu utilization are the same in the
>>> past. Only few cases, more cpu utilization was noticed which needs more
>>> investigation.
>>>
>>> Review and comments are welcomed.
>> I think proper accounting and queueing (at all levels, not just TCP
>> sockets) is more important than trying to skim a bunch of cycles by
>> avoiding TX interrupts.
>>
>> Having an event to free the SKB is absolutely essential for the stack
>> to operate correctly.
>>
>> And with virtio-net you don't even have the excuse of "the HW
>> unfortunately doesn't have an appropriate TX event."
>>
>> So please don't play games, and instead use TX interrupts all the
>> time.  You can mitigate them in various ways, but don't turn them on
>> selectively based upon traffic type, that's terrible.
> This got me thinking: how about using virtqueue_enable_cb_delayed
> for this mitigation?

Should work. Another possible solution is interrupt coalescing, which can
also speed up the case without event index.
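
For reference, a simplified sketch of a tx reclaim path built around
virtqueue_enable_cb_delayed() (illustration only; locking, stats and the
napi plumbing are omitted):

/* Sketch: reclaim transmitted skbs, then re-arm the callback with
 * virtqueue_enable_cb_delayed(), which (with event index) asks the
 * device to hold off the next interrupt until roughly 3/4 of the
 * outstanding buffers have been used.
 */
#include <linux/netdevice.h>
#include <linux/virtio.h>

static void tx_reclaim(struct virtqueue *vq)
{
	struct sk_buff *skb;
	unsigned int len;

	do {
		while ((skb = virtqueue_get_buf(vq, &len)) != NULL)
			dev_kfree_skb_any(skb);
		/* false means more buffers were used while re-enabling
		 * callbacks, so go around again rather than miss them.
		 */
	} while (!virtqueue_enable_cb_delayed(vq));
}
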
> It's pretty easy to implement - I'll send a proof of concept patch
> separately.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 18:53   ` David Miller
@ 2014-10-14 23:06     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2014-10-14 23:06 UTC (permalink / raw)
  To: David Miller
  Cc: jasowang, rusty, virtualization, netdev, linux-kernel, linux-api, kvm

On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Sat, 11 Oct 2014 15:16:43 +0800
> 
> > We free old transmitted packets in ndo_start_xmit() currently, so any
> > packet must be orphaned also there. This was used to reduce the overhead of
> > tx interrupt to achieve better performance. But this may not work for some
> > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
> > implement various optimization for small packets stream such as TCP small
> > queue and auto corking. But orphaning packets early in ndo_start_xmit()
> > disable such things more or less since sk_wmem_alloc was not accurate. This
> > lead extra low throughput for TCP stream of small writes.
> > 
> > This series tries to solve this issue by enable tx interrupts for all TCP
> > packets other than the ones with push bit or pure ACK. This is done through
> > the support of urgent descriptor which can force an interrupt for a
> > specified packet. If tx interrupt was enabled for a packet, there's no need
> > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
> > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
> > can batch more for small write. More larger skb was produced by TCP in this
> > case to improve both throughput and cpu utilization.
> > 
> > Test shows great improvements on small write tcp streams. For most of the
> > other cases, the throughput and cpu utilization are the same in the
> > past. Only few cases, more cpu utilization was noticed which needs more
> > investigation.
> > 
> > Review and comments are welcomed.
> 
> I think proper accounting and queueing (at all levels, not just TCP
> sockets) is more important than trying to skim a bunch of cycles by
> avoiding TX interrupts.
> 
> Having an event to free the SKB is absolutely essential for the stack
> to operate correctly.
> 
> And with virtio-net you don't even have the excuse of "the HW
> unfortunately doesn't have an appropriate TX event."
> 
> So please don't play games, and instead use TX interrupts all the
> time.  You can mitigate them in various ways, but don't turn them on
> selectively based upon traffic type, that's terrible.
> 
> You can even use ->xmit_more to defer the TX interrupt indication to
> the final TX packet in the chain.

I guess we can just defer the kick, interrupt will naturally be
deferred as well.
This should solve the problem for old hosts as well.

We'll also need to implement bql for this.
Something like the below?
Completely untested - posting here to see if I figured the
API out correctly. Has to be applied on top of the previous patch.

---

virtio_net: bql + xmit_more

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62c059d..c245047 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -213,13 +213,15 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
-static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+static int free_old_xmit_skbs(struct netdev_queue *txq,
+			      struct send_queue *sq, int budget)
 {
 	struct sk_buff *skb;
 	unsigned int len;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 	int sent = 0;
+	unsigned int bytes = 0;
 
 	while (sent < budget &&
 	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
@@ -227,6 +229,7 @@ static int free_old_xmit_skbs(struct send_queue *sq, int budget)
 
 		u64_stats_update_begin(&stats->tx_syncp);
 		stats->tx_bytes += skb->len;
+		bytes += skb->len;
 		stats->tx_packets++;
 		u64_stats_update_end(&stats->tx_syncp);
 
@@ -234,6 +237,8 @@ static int free_old_xmit_skbs(struct send_queue *sq, int budget)
 		sent++;
 	}
 
+	netdev_tx_completed_queue(txq, sent, bytes);
+
 	return sent;
 }
 
@@ -802,7 +807,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 again:
 	__netif_tx_lock(txq, smp_processor_id());
 	virtqueue_disable_cb(sq->vq);
-	sent += free_old_xmit_skbs(sq, budget - sent);
+	sent += free_old_xmit_skbs(txq, sq, budget - sent);
 
 	if (sent < budget) {
 		r = virtqueue_enable_cb_prepare(sq->vq);
@@ -951,6 +956,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	int qnum = skb_get_queue_mapping(skb);
 	struct send_queue *sq = &vi->sq[qnum];
 	int err, qsize = virtqueue_get_vring_size(sq->vq);
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
+	bool kick = !skb->xmit_more || netif_xmit_stopped(txq);
+	unsigned int bytes = skb->len;
 
 	virtqueue_disable_cb(sq->vq);
 
@@ -967,7 +975,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(sq->vq);
+
+	netdev_tx_sent_queue(txq, bytes);
+
+	if (kick)
+		virtqueue_kick(sq->vq);
 
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
@@ -975,14 +987,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		netif_stop_subqueue(dev, qnum);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq, qsize);
+			free_old_xmit_skbs(txq, sq, qsize);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
 	} else if (virtqueue_enable_cb_delayed(sq->vq)) {
-		free_old_xmit_skbs(sq, qsize);
+		free_old_xmit_skbs(txq, sq, qsize);
 	}
 
 	return NETDEV_TX_OK;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
  2014-10-14 18:53   ` David Miller
@ 2014-10-14 21:51     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2014-10-14 21:51 UTC (permalink / raw)
  To: David Miller
  Cc: jasowang, rusty, virtualization, netdev, linux-kernel, linux-api, kvm

On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Sat, 11 Oct 2014 15:16:43 +0800
> 
> > We free old transmitted packets in ndo_start_xmit() currently, so any
> > packet must be orphaned also there. This was used to reduce the overhead of
> > tx interrupt to achieve better performance. But this may not work for some
> > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
> > implement various optimization for small packets stream such as TCP small
> > queue and auto corking. But orphaning packets early in ndo_start_xmit()
> > disable such things more or less since sk_wmem_alloc was not accurate. This
> > lead extra low throughput for TCP stream of small writes.
> > 
> > This series tries to solve this issue by enable tx interrupts for all TCP
> > packets other than the ones with push bit or pure ACK. This is done through
> > the support of urgent descriptor which can force an interrupt for a
> > specified packet. If tx interrupt was enabled for a packet, there's no need
> > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
> > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
> > can batch more for small write. More larger skb was produced by TCP in this
> > case to improve both throughput and cpu utilization.
> > 
> > Test shows great improvements on small write tcp streams. For most of the
> > other cases, the throughput and cpu utilization are the same in the
> > past. Only few cases, more cpu utilization was noticed which needs more
> > investigation.
> > 
> > Review and comments are welcomed.
> 
> I think proper accounting and queueing (at all levels, not just TCP
> sockets) is more important than trying to skim a bunch of cycles by
> avoiding TX interrupts.
> 
> Having an event to free the SKB is absolutely essential for the stack
> to operate correctly.
> 
> And with virtio-net you don't even have the excuse of "the HW
> unfortunately doesn't have an appropriate TX event."
> 
> So please don't play games, and instead use TX interrupts all the
> time.  You can mitigate them in various ways, but don't turn them on
> selectively based upon traffic type, that's terrible.

This got me thinking: how about using virtqueue_enable_cb_delayed
for this mitigation?

It's pretty easy to implement - I'll send a proof of concept patch
separately.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
@ 2014-10-14 18:53   ` David Miller
  0 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2014-10-14 18:53 UTC (permalink / raw)
  To: jasowang; +Cc: rusty, mst, virtualization, netdev, linux-kernel, linux-api, kvm

From: Jason Wang <jasowang@redhat.com>
Date: Sat, 11 Oct 2014 15:16:43 +0800

> We free old transmitted packets in ndo_start_xmit() currently, so any
> packet must be orphaned also there. This was used to reduce the overhead of
> tx interrupt to achieve better performance. But this may not work for some
> protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
> implement various optimization for small packets stream such as TCP small
> queue and auto corking. But orphaning packets early in ndo_start_xmit()
> disable such things more or less since sk_wmem_alloc was not accurate. This
> lead extra low throughput for TCP stream of small writes.
> 
> This series tries to solve this issue by enable tx interrupts for all TCP
> packets other than the ones with push bit or pure ACK. This is done through
> the support of urgent descriptor which can force an interrupt for a
> specified packet. If tx interrupt was enabled for a packet, there's no need
> to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
> by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
> can batch more for small write. More larger skb was produced by TCP in this
> case to improve both throughput and cpu utilization.
> 
> Test shows great improvements on small write tcp streams. For most of the
> other cases, the throughput and cpu utilization are the same in the
> past. Only few cases, more cpu utilization was noticed which needs more
> investigation.
> 
> Review and comments are welcomed.

I think proper accounting and queueing (at all levels, not just TCP
sockets) is more important than trying to skim a bunch of cycles by
avoiding TX interrupts.

Having an event to free the SKB is absolutely essential for the stack
to operate correctly.

And with virtio-net you don't even have the excuse of "the HW
unfortunately doesn't have an appropriate TX event."

So please don't play games, and instead use TX interrupts all the
time.  You can mitigate them in various ways, but don't turn them on
selectively based upon traffic type, that's terrible.

You can even use ->xmit_more to defer the TX interrupt indication to
the final TX packet in the chain.
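
In virtio-net terms that maps to deferring the kick on ->xmit_more, roughly
as follows (same pattern as the bql patch earlier on this page; sketch only):

/* Sketch: kick the device (and hence eventually take a tx completion)
 * only on the final skb of a chain, as signalled by ->xmit_more, or
 * when the queue is stopping so nothing is left hanging.
 */
#include <linux/netdevice.h>
#include <linux/virtio.h>

static void kick_if_last(struct virtqueue *vq, struct netdev_queue *txq,
			 const struct sk_buff *skb)
{
	bool kick = !skb->xmit_more || netif_xmit_stopped(txq);

	if (kick)
		virtqueue_kick(vq);
}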

^ permalink raw reply	[flat|nested] 14+ messages in thread
