* Network performance with small packets
@ 2011-01-25 21:09 Steve Dobbelstein
  2011-01-26 15:17 ` Michael S. Tsirkin
  2011-02-02 18:38 ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Steve Dobbelstein @ 2011-01-25 21:09 UTC (permalink / raw)
  To: kvm


I am working on a KVM network performance issue found in our lab running
the DayTrader benchmark.  The benchmark throughput takes a significant hit
when running the application server in a KVM guest versus on bare metal.
We have dug into the problem and found that DayTrader's use of small
packets exposes KVM's overhead of handling network packets.  I have been
able to reproduce the performance hit with a simpler setup using the
netperf benchmark with the TCP_RR test and the request and response sizes
set to 256 bytes.  I run the benchmark between two physical systems, each
using a 1GB link.  In order to get the maximum throughput for the system I
have to run 100 instances of netperf.  When I run the netserver processes
in a guest, I see a maximum throughput that is 51% of what I get if I run
the netserver processes directly on the host.  The CPU utilization in the
guest is only 85% at maximum throughput, whereas it is 100% on bare metal.
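
For reference, each netperf instance is started with a command along these
lines (the netserver address and run length below are placeholders, not the
exact values from our runs):

  # netserver must already be running on the target (host or guest)
  # start 100 concurrent TCP_RR instances with 256-byte requests/responses
  for i in $(seq 1 100); do
      netperf -H <netserver-ip> -t TCP_RR -l 60 -- -r 256,256 &
  done
  wait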

The KVM host has 16 CPUs.  The KVM guest is configured with 2 VCPUs.  When
I run netperf on the host I boot the host with maxcpus=2 on the kernel
command line.  The host is running the current KVM upstream kernel along
with the current upstream qemu.  Here is the qemu command used to launch
the guest:
/build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on
 -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2
-vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize

We have tried various proposed fixes, each with varying amounts of success.
One such fix was to add code to the vhost thread such that when it found
the work queue empty it wouldn't just exit the thread but rather would
delay for 50 microseconds and then recheck the queue.  If there was work on
the queue it would loop back and process it, else it would exit the thread.
The change got us a 13% improvement in the DayTrader throughput.

Running the same netperf configuration on the same hardware but using a
different hypervisor gets us significantly better throughput numbers.   The
guest on that hypervisor runs at 100% CPU utilization.  The various fixes
we have tried have not gotten us close to the throughput seen on the other
hypervisor.  I'm looking for ideas/input from the KVM experts on how to
make KVM perform better when handling small packets.

Thanks,
Steve



* Re: Network performance with small packets
  2011-01-25 21:09 Network performance with small packets Steve Dobbelstein
@ 2011-01-26 15:17 ` Michael S. Tsirkin
  2011-01-27 18:44   ` Shirley Ma
  2011-02-02 18:38 ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-01-26 15:17 UTC (permalink / raw)
  To: Steve Dobbelstein; +Cc: kvm

On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote:
> 
> I am working on a KVM network performance issue found in our lab running
> the DayTrader benchmark.  The benchmark throughput takes a significant hit
> when running the application server in a KVM guest verses on bare metal.
> We have dug into the problem and found that DayTrader's use of small
> packets exposes KVM's overhead of handling network packets.  I have been
> able to reproduce the performance hit with a simpler setup using the
> netperf benchmark with the TCP_RR test and the request and response sizes
> set to 256 bytes.  I run the benchmark between two physical systems, each
> using a 1GB link.  In order to get the maximum throughput for the system I
> have to run 100 instances of netperf.  When I run the netserver processes
> in a guest, I see a maximum throughput that is 51% of what I get if I run
> the netserver processes directly on the host.  The CPU utilization in the
> guest is only 85% at maximum throughput, whereas it is 100% on bare metal.
> 
> The KVM host has 16 CPUs.  The KVM guest is configured with 2 VCPUs.  When
> I run netperf on the host I boot the host with maxcpus=2 on the kernel
> command line.  The host is running the current KVM upstream kernel along
> with the current upstream qemu.  Here is the qemu command used to launch
> the guest:
> /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on
>  -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2
> -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize
> 
> We have tried various proposed fixes, each with varying amounts of success.
> One such fix was to add code to the vhost thread such that when it found
> the work queue empty it wouldn't just exit the thread but rather would
> delay for 50 microseconds and then recheck the queue.  If there was work on
> the queue it would loop back and process it, else it would exit the thread.
> The change got us a 13% improvement in the DayTrader throughput.
> 
> Running the same netperf configuration on the same hardware but using a
> different hypervisor gets us significantly better throughput numbers.   The
> guest on that hypervisor runs at 100% CPU utilization.  The various fixes
> we have tried have not gotten us close to the throughput seen on the other
> hypervisor.  I'm looking for ideas/input from the KVM experts on how to
> make KVM perform better when handling small packets.
> 
> Thanks,
> Steve

I am seeing a similar problem, and am trying to fix that.
My current theory is that this is a variant of a receive livelock:
if the application isn't fast enough to process
incoming data, the guest net stack switches
from prequeue to backlog handling.

One thing I noticed is that locking the vhost thread
and the vcpu to the same physical CPU almost doubles the
bandwidth.  Can you confirm that in your setup?
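
One way to do that pinning, as a rough sketch (the thread IDs and the CPU
number are placeholders), is with taskset:

  # the vhost worker thread is named vhost-<qemu pid>; the vcpu thread IDs
  # can be found e.g. with ps -eLo pid,lwp,psr,comm
  taskset -pc 0 <vhost-thread-id>
  taskset -pc 0 <vcpu-thread-id>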

My current guess is that when we lock both to
a single CPU, netperf in the guest gets scheduled,
slowing down the vhost thread in the host.

I also noticed that this specific workload
performs better with vhost off: presumably
we are loading the guest less.

-- 
MST


* Re: Network performance with small packets
  2011-01-26 15:17 ` Michael S. Tsirkin
@ 2011-01-27 18:44   ` Shirley Ma
  2011-01-27 19:00     ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-01-27 18:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Steve Dobbelstein, kvm

On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote:
> I am seeing a similar problem, and am trying to fix that.
> My current theory is that this is a variant of a receive livelock:
> if the application isn't fast enough to process
> incoming data, the guest net stack switches
> from prequeue to backlog handling.
> 
> One thing I noticed is that locking the vhost thread
> and the vcpu to the same physical CPU almost doubles the
> bandwidth.  Can you confirm that in your setup?
> 
> My current guess is that when we lock both to
> a single CPU, netperf in guest gets scheduled
> slowing down the vhost thread in the host.
> 
> I also noticed that this specific workload
> performs better with vhost off: presumably
> we are loading the guest less. 

I found a similar issue for the small-message-size TCP_STREAM test when the
guest is the TX side. I found that when I slow down TX, the BW doubles
for 1K to 4K message sizes.

Shirley



* Re: Network performance with small packets
  2011-01-27 18:44   ` Shirley Ma
@ 2011-01-27 19:00     ` Michael S. Tsirkin
  2011-01-27 19:09       ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-01-27 19:00 UTC (permalink / raw)
  To: Shirley Ma; +Cc: Steve Dobbelstein, kvm

On Thu, Jan 27, 2011 at 10:44:34AM -0800, Shirley Ma wrote:
> On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote:
> > I am seeing a similar problem, and am trying to fix that.
> > My current theory is that this is a variant of a receive livelock:
> > if the application isn't fast enough to process
> > incoming data, the guest net stack switches
> > from prequeue to backlog handling.
> > 
> > One thing I noticed is that locking the vhost thread
> > and the vcpu to the same physical CPU almost doubles the
> > bandwidth.  Can you confirm that in your setup?
> > 
> > My current guess is that when we lock both to
> > a single CPU, netperf in guest gets scheduled
> > slowing down the vhost thread in the host.
> > 
> > I also noticed that this specific workload
> > performs better with vhost off: presumably
> > we are loading the guest less. 
> 
> I found similar issue for small message size TCP_STREAM test when guest
> as TX. I found when I slow down TX, the BW performance will be doubled
> for 1K to 4K message size.
> 
> Shirley

Interesting. In particular running vhost and the transmitting guest
on the same host would have the effect of slowing down TX.
Does it double the BW for you too?

-- 
MST


* Re: Network performance with small packets
  2011-01-27 19:00     ` Michael S. Tsirkin
@ 2011-01-27 19:09       ` Shirley Ma
  2011-01-27 19:31         ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-01-27 19:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Steve Dobbelstein, kvm

On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote:
> Interesting. In particular running vhost and the transmitting guest
> on the same host would have the effect of slowing down TX.
> Does it double the BW for you too?
> 

Running vhost and the TX guest on the same host does not seem to be enough to
slow down TX. I still need to play around to get double or even triple BW for
guest TX to the local host; for a 1K message size, the BW can increase
from 2.X Gb/s to 6.X Gb/s.

Thanks
Shirley



* Re: Network performance with small packets
  2011-01-27 19:09       ` Shirley Ma
@ 2011-01-27 19:31         ` Michael S. Tsirkin
  2011-01-27 19:45           ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-01-27 19:31 UTC (permalink / raw)
  To: Shirley Ma; +Cc: Steve Dobbelstein, kvm

On Thu, Jan 27, 2011 at 11:09:00AM -0800, Shirley Ma wrote:
> On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote:
> > Interesting. In particular running vhost and the transmitting guest
> > on the same host would have the effect of slowing down TX.
> > Does it double the BW for you too?
> > 
> 
> Running vhost and TX guest on the same host seems not good enough to
> slow down TX. In order to gain the double even triple BW for guest TX to
> local host I still need to play around, so 1K message size, BW is able
> to increase from 2.XGb/s to 6.XGb/s.
> 
> Thanks
> Shirley

Well slowing down the guest does not sound hard - for example we can
request guest notifications, or send extra interrupts :)
A slightly more sophisticated thing to try is to
poll the vq a bit more aggressively.
For example, if we handled some requests and now the tx vq is empty,
reschedule and yield. Worth a try?

-- 
MST


* Re: Network performance with small packets
  2011-01-27 19:31         ` Michael S. Tsirkin
@ 2011-01-27 19:45           ` Shirley Ma
  2011-01-27 20:05             ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-01-27 19:45 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Steve Dobbelstein, kvm

On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote:
> Well slowing down the guest does not sound hard - for example we can
> request guest notifications, or send extra interrupts :)
> A slightly more sophisticated thing to try is to
> poll the vq a bit more aggressively.
> For example if we handled some requests and now tx vq is empty,
> reschedule and yeild. Worth a try?

I used dropping packets at a higher level to slow down TX. I am still
thinking about what the right approach is here.

Requesting guest notifications and extra interrupts is what we want to
avoid in order to reduce VM exits and save CPU. I don't think that's a good
approach.

By polling the vq a bit more aggressively, you meant vhost, right?

Shirley



* Re: Network performance with small packets
  2011-01-27 19:45           ` Shirley Ma
@ 2011-01-27 20:05             ` Michael S. Tsirkin
  2011-01-27 20:15               ` Shirley Ma
  2011-01-27 21:02               ` Network performance with small packets David Miller
  0 siblings, 2 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-01-27 20:05 UTC (permalink / raw)
  To: Shirley Ma; +Cc: Steve Dobbelstein, kvm, netdev

On Thu, Jan 27, 2011 at 11:45:47AM -0800, Shirley Ma wrote:
> On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote:
> > Well slowing down the guest does not sound hard - for example we can
> > request guest notifications, or send extra interrupts :)
> > A slightly more sophisticated thing to try is to
> > poll the vq a bit more aggressively.
> > For example if we handled some requests and now tx vq is empty,
> > reschedule and yeild. Worth a try?
> 
> I used dropping packets in high level to slow down TX.
> I am still
> thinking what's the right the approach here. 

Interesting. Could this be a variant of the now famous bufferbloat then?

I guess we could drop some packets if we see we are not keeping up. For
example, if we see that the ring is > X% full, we could quickly complete
Y% of them without transmitting the packets onward. Or maybe we should drop
some bytes rather than packets.

> 
> Requesting guest notification and extra interrupts is what we want to
> avoid to reduce VM exits for saving CPUs. I don't think it's good.

Yes, but how do you explain the regression?
One simple theory is that the guest net stack became faster
and so the host can't keep up.


> 
> By polling the vq a bit more aggressively, you meant vhost, right?
> 
> Shirley

Yes.


* Re: Network performance with small packets
  2011-01-27 20:05             ` Michael S. Tsirkin
@ 2011-01-27 20:15               ` Shirley Ma
  2011-01-28 18:29                 ` Steve Dobbelstein
  2011-01-27 21:02               ` Network performance with small packets David Miller
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-01-27 20:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Steve Dobbelstein, kvm, netdev

On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:
> Interesting. Could this is be a variant of the now famuous bufferbloat
> then?
> 
> I guess we could drop some packets if we see we are not keeping up.
> For
> example if we see that the ring is > X% full, we could quickly
> complete
> Y% without transmitting packets on. Or maybe we should drop some bytes
> not packets.
It's worth trying to figure out what the best approach is. I will make a
patch.


> > 
> > Requesting guest notification and extra interrupts is what we want
> to
> > avoid to reduce VM exits for saving CPUs. I don't think it's good.
> 
> Yes but how do you explain regression?
> One simple theory is that guest net stack became faster
> and so the host can't keep up.

Yes, that's what I think here. Some qdisc code has been changed
recently.
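
(For reference, the qdisc attached to the guest NIC and its drop/requeue
counters can be checked with something like the following; eth0 is just a
placeholder for the guest's virtio interface name.)

  tc -s qdisc show dev eth0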

> > 
> > By polling the vq a bit more aggressively, you meant vhost, right?
> > 
> > Shirley
> 
> Yes. 

I had a similar patch before; I can modify it and test it out.

Shirley



* Re: Network performance with small packets
  2011-01-27 20:05             ` Michael S. Tsirkin
  2011-01-27 20:15               ` Shirley Ma
@ 2011-01-27 21:02               ` David Miller
  2011-01-27 21:30                 ` Shirley Ma
  1 sibling, 1 reply; 88+ messages in thread
From: David Miller @ 2011-01-27 21:02 UTC (permalink / raw)
  To: mst; +Cc: mashirle, steved, kvm, netdev

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Thu, 27 Jan 2011 22:05:48 +0200

> Interesting. Could this is be a variant of the now famuous bufferbloat then?

Sigh, bufferbloat is the new global warming... :-/


* Re: Network performance with small packets
  2011-01-27 21:02               ` Network performance with small packets David Miller
@ 2011-01-27 21:30                 ` Shirley Ma
  2011-01-28 12:16                   ` Michael S. Tsirkin
  2011-02-01 17:23                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Shirley Ma @ 2011-01-27 21:30 UTC (permalink / raw)
  To: David Miller; +Cc: mst, steved, kvm, netdev

On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
> > Interesting. Could this is be a variant of the now famuous
> bufferbloat then?
> 
> Sigh, bufferbloat is the new global warming... :-/ 

Yep, some places become colder, some other places become warmer; same with
the BW results, sometimes faster, sometimes slower. :)

Shirley



* Re: Network performance with small packets
  2011-01-27 21:30                 ` Shirley Ma
@ 2011-01-28 12:16                   ` Michael S. Tsirkin
  2011-02-01  0:24                     ` Steve Dobbelstein
  2011-02-01 17:23                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-01-28 12:16 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, steved, kvm, netdev

On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
> On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
> > > Interesting. Could this is be a variant of the now famuous
> > bufferbloat then?
> > 
> > Sigh, bufferbloat is the new global warming... :-/ 
> 
> Yep, some places become colder, some other places become warmer; Same as
> BW results, sometimes faster, sometimes slower. :)
> 
> Shirley

OK, so thinking about it more, maybe the issue is this:
the tx ring becomes full. We process one request and interrupt the guest;
then it adds one request and the queue is full again.

Maybe the following will help it stabilize?
By itself it does nothing, but if you set
all the parameters to a huge value we will
only interrupt when we see an empty ring.
Which might be too much: pls try other values
in the middle: e.g. make bufs half the ring,
or bytes some small value, or packets some
small value etc.

Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
 
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
 


* Re: Network performance with small packets
  2011-01-27 20:15               ` Shirley Ma
@ 2011-01-28 18:29                 ` Steve Dobbelstein
  2011-01-28 22:51                   ` Steve Dobbelstein
  2011-02-01 15:52                   ` [PATCHv2 dontapply] vhost-net tx tuning Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Steve Dobbelstein @ 2011-01-28 18:29 UTC (permalink / raw)
  To: mashirle; +Cc: kvm, Michael S. Tsirkin, netdev

mashirle@linux.vnet.ibm.com wrote on 01/27/2011 02:15:05 PM:

> On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:
> > One simple theory is that guest net stack became faster
> > and so the host can't keep up.
>
> Yes, that's what I think here. Some qdisc code has been changed
> recently.

I ran a test with txqueuelen set to 128, instead of the default of 1000, in
the guest in an attempt to slow down the guest transmits.  The change had
no effect on either the throughput or the CPU usage.
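
For reference, the change was along the lines of the following in the guest
(eth0 standing in for whatever the virtio NIC is called):

  ip link set dev eth0 txqueuelen 128   # or: ifconfig eth0 txqueuelen 128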

On the other hand, I ran some tests with different CPU pinnings and
with/without hyperthreading enabled.  Here is a summary of the results.

Pinning configuration 1:  pin the VCPUs and pin the vhost thread to one of
the VCPU CPUs
Pinning configuration 2:  pin the VCPUs and pin the vhost thread to a
separate CPU on the same socket
Pinning configuration 3:  pin the VCPUs and pin the vhost thread to a
separate CPU on a different socket

HT   Pinning   Throughput  CPU
Yes  config 1  - 40%       - 40%
Yes  config 2  - 37%       - 35%
Yes  config 3  - 37%       - 36%
No   none         0%       -  5%
No   config 1  - 41%       - 43%
No   config 2  + 32%       -  4%
No   config 3  + 34%       +  9%

Pinning the vhost thread to the same CPU as a guest VCPU hurts performance.
Turning off hyperthreading and pinning the VCPUs and vhost thread to
separate CPUs significantly improves performance, getting it into the
competitive range with other hypervisors.

Steve D.



* Re: Network performance with small packets
  2011-01-28 18:29                 ` Steve Dobbelstein
@ 2011-01-28 22:51                   ` Steve Dobbelstein
  2011-02-01 15:52                   ` [PATCHv2 dontapply] vhost-net tx tuning Michael S. Tsirkin
  1 sibling, 0 replies; 88+ messages in thread
From: Steve Dobbelstein @ 2011-01-28 22:51 UTC (permalink / raw)
  To: Steve Dobbelstein; +Cc: kvm, kvm-owner, mashirle, Michael S. Tsirkin, netdev

steved@us.ibm.com wrote on 01/28/2011 12:29:37 PM:

> > On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:
> > > One simple theory is that guest net stack became faster
> > > and so the host can't keep up.
> >
> > Yes, that's what I think here. Some qdisc code has been changed
> > recently.
>
> I ran a test with txqueuelen set to 128, instead of the default of 1000,
in
> the guest in an attempt to slow down the guest transmits.  The change had
> no effect on the throughput nor on the CPU usage.
>
> On the other hand, I ran some tests with different CPU pinnings and
> with/without hyperthreading enabled.  Here is a summary of the results.
>
> Pinning configuration 1:  pin the VCPUs and pin the vhost thread to one
of
> the VCPU CPUs
> Pinning configuration 2:  pin the VCPUs and pin the vhost thread to a
> separate CPU on the same socket
> Pinning configuration 3:  pin the VCPUs and pin the vhost thread to a
> separate CPU a different socket
>
> HT   Pinning   Throughput  CPU
> Yes  config 1  - 40%       - 40%
> Yes  config 2  - 37%       - 35%
> Yes  config 3  - 37%       - 36%
> No   none         0%       -  5%
> No   config 1  - 41%       - 43%
> No   config 2  + 32%       -  4%
> No   config 3  + 34%       +  9%
>
> Pinning the vhost thread to the same CPU as a guest VCPU hurts
performance.
> Turning off hyperthreading and pinning the VPUS and vhost thread to
> separate CPUs significantly improves performance, getting it into the
> competitive range with other hypervisors.
>
> Steve D.

Those results for configs 2 and 3 with hyperthreading on are a little
strange.  Digging into the cause, I found that my automation script for
pinning the vhost thread failed and pinned it to CPU 1, the same as config
1, giving results similar to config 1.  I reran the tests, making sure the
pinning script did the right thing.  The results are more consistent.

HT   Pinning   Throughput  CPU
Yes  config 1  - 40%       - 40%
Yes  config 2  + 33%       -  8%
Yes  config 3  + 34%       +  9%
No   none         0%       -  5%
No   config 1  - 41%       - 43%
No   config 2  + 32%       -  4%
No   config 3  + 34%       +  9%

It appears that we have a scheduling problem.  If the processes are pinned,
we can get good performance.

We also see that hyperthreading makes little difference.
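
A quick way to double-check where the threads actually end up (just a sketch;
the grep pattern is an example) is:

  # lwp = thread ID, psr = CPU the thread last ran on;
  # vhost worker threads are named vhost-<qemu pid>
  ps -eLo pid,lwp,psr,comm | egrep 'vhost|qemu'
  taskset -pc <thread-id>   # show the affinity list actually applied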

Sorry for the initial misleading data.

Steve D.




* Re: Network performance with small packets
  2011-01-28 12:16                   ` Michael S. Tsirkin
@ 2011-02-01  0:24                     ` Steve Dobbelstein
  2011-02-01  1:30                       ` Sridhar Samudrala
  2011-02-01  5:54                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Steve Dobbelstein @ 2011-02-01  0:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, kvm, mashirle, netdev

"Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:

> OK, so thinking about it more, maybe the issue is this:
> tx becomes full. We process one request and interrupt the guest,
> then it adds one request and the queue is full again.
>
> Maybe the following will help it stabilize?
> By itself it does nothing, but if you set
> all the parameters to a huge value we will
> only interrupt when we see an empty ring.
> Which might be too much: pls try other values
> in the middle: e.g. make bufs half the ring,
> or bytes some small value, or packets some
> small value etc.
>
> Warning: completely untested.
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index aac05bc..6769cdc 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,13 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>
> +int tx_bytes_coalesce = 0;
> +module_param(tx_bytes_coalesce, int, 0644);
> +int tx_bufs_coalesce = 0;
> +module_param(tx_bufs_coalesce, int, 0644);
> +int tx_packets_coalesce = 0;
> +module_param(tx_packets_coalesce, int, 0644);
> +
>  enum {
>     VHOST_NET_VQ_RX = 0,
>     VHOST_NET_VQ_TX = 1,
> @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
>     int err, wmem;
>     size_t hdr_size;
>     struct socket *sock;
> +   int bytes_coalesced = 0;
> +   int bufs_coalesced = 0;
> +   int packets_coalesced = 0;
>
>     /* TODO: check that we are running from vhost_worker? */
>     sock = rcu_dereference_check(vq->private_data, 1);
> @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
>        if (err != len)
>           pr_debug("Truncated TX packet: "
>               " len %d != %zd\n", err, len);
> -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
>        total_len += len;
> +      packets_coalesced += 1;
> +      bytes_coalesced += len;
> +      bufs_coalesced += in;

Should this instead be:
      bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not
zero, and, if so, error out of the loop.  After the check, "in" is not
touched until it is added to bufs_coalesced, effectively not changing
bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
below.

Or am I missing something?

> +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> +              bytes_coalesced > tx_bytes_coalesce ||
> +              bufs_coalesced > tx_bufs_coalesce))
> +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +      else
> +         vhost_add_used(vq, head, 0);
>        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>           vhost_poll_queue(&vq->poll);
>           break;
>        }
>     }
>
> +   if (likely(packets_coalesced > tx_packets_coalesce ||
> +         bytes_coalesced > tx_bytes_coalesce ||
> +         bufs_coalesced > tx_bufs_coalesce))
> +      vhost_signal(&net->dev, vq);
>     mutex_unlock(&vq->mutex);
>  }
>

Steve D.



* Re: Network performance with small packets
  2011-02-01  0:24                     ` Steve Dobbelstein
@ 2011-02-01  1:30                       ` Sridhar Samudrala
  2011-02-01  5:56                         ` Michael S. Tsirkin
  2011-02-01 21:09                         ` Shirley Ma
  2011-02-01  5:54                       ` Michael S. Tsirkin
  1 sibling, 2 replies; 88+ messages in thread
From: Sridhar Samudrala @ 2011-02-01  1:30 UTC (permalink / raw)
  To: Steve Dobbelstein; +Cc: Michael S. Tsirkin, David Miller, kvm, mashirle, netdev

On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:
> 
> > OK, so thinking about it more, maybe the issue is this:
> > tx becomes full. We process one request and interrupt the guest,
> > then it adds one request and the queue is full again.
> >
> > Maybe the following will help it stabilize?
> > By itself it does nothing, but if you set
> > all the parameters to a huge value we will
> > only interrupt when we see an empty ring.
> > Which might be too much: pls try other values
> > in the middle: e.g. make bufs half the ring,
> > or bytes some small value, or packets some
> > small value etc.
> >
> > Warning: completely untested.
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index aac05bc..6769cdc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,13 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> >
> > +int tx_bytes_coalesce = 0;
> > +module_param(tx_bytes_coalesce, int, 0644);
> > +int tx_bufs_coalesce = 0;
> > +module_param(tx_bufs_coalesce, int, 0644);
> > +int tx_packets_coalesce = 0;
> > +module_param(tx_packets_coalesce, int, 0644);
> > +
> >  enum {
> >     VHOST_NET_VQ_RX = 0,
> >     VHOST_NET_VQ_TX = 1,
> > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> >     int err, wmem;
> >     size_t hdr_size;
> >     struct socket *sock;
> > +   int bytes_coalesced = 0;
> > +   int bufs_coalesced = 0;
> > +   int packets_coalesced = 0;
> >
> >     /* TODO: check that we are running from vhost_worker? */
> >     sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> >        if (err != len)
> >           pr_debug("Truncated TX packet: "
> >               " len %d != %zd\n", err, len);
> > -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
> >        total_len += len;
> > +      packets_coalesced += 1;
> > +      bytes_coalesced += len;
> > +      bufs_coalesced += in;
> 
> Should this instead be:
>       bufs_coalesced += out;
> 
> Perusing the code I see that earlier there is a check to see if "in" is not
> zero, and, if so, error out of the loop.  After the check, "in" is not
> touched until it is added to bufs_coalesced, effectively not changing
> bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
> below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by
delaying the signaling until we come out of the for loop.
It definitely reduced the number of vmexits significantly for the small-message
guest-to-host stream test, and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	if (total_len > 0)
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
 

> 
> Or am I missing something?
> 
> > +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > +              bytes_coalesced > tx_bytes_coalesce ||
> > +              bufs_coalesced > tx_bufs_coalesce))
> > +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +      else
> > +         vhost_add_used(vq, head, 0);
> >        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >           vhost_poll_queue(&vq->poll);
> >           break;
> >        }
> >     }
> >
> > +   if (likely(packets_coalesced > tx_packets_coalesce ||
> > +         bytes_coalesced > tx_bytes_coalesce ||
> > +         bufs_coalesced > tx_bufs_coalesce))
> > +      vhost_signal(&net->dev, vq);
> >     mutex_unlock(&vq->mutex);
> >  }

It is possible that we can miss signaling the guest even after
processing a few pkts, if we don't hit any of these conditions.

> >
> 
> Steve D.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Network performance with small packets
  2011-02-01  0:24                     ` Steve Dobbelstein
  2011-02-01  1:30                       ` Sridhar Samudrala
@ 2011-02-01  5:54                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01  5:54 UTC (permalink / raw)
  To: Steve Dobbelstein; +Cc: David Miller, kvm, mashirle, netdev

On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:
> 
> > OK, so thinking about it more, maybe the issue is this:
> > tx becomes full. We process one request and interrupt the guest,
> > then it adds one request and the queue is full again.
> >
> > Maybe the following will help it stabilize?
> > By itself it does nothing, but if you set
> > all the parameters to a huge value we will
> > only interrupt when we see an empty ring.
> > Which might be too much: pls try other values
> > in the middle: e.g. make bufs half the ring,
> > or bytes some small value, or packets some
> > small value etc.
> >
> > Warning: completely untested.
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index aac05bc..6769cdc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,13 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> >
> > +int tx_bytes_coalesce = 0;
> > +module_param(tx_bytes_coalesce, int, 0644);
> > +int tx_bufs_coalesce = 0;
> > +module_param(tx_bufs_coalesce, int, 0644);
> > +int tx_packets_coalesce = 0;
> > +module_param(tx_packets_coalesce, int, 0644);
> > +
> >  enum {
> >     VHOST_NET_VQ_RX = 0,
> >     VHOST_NET_VQ_TX = 1,
> > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> >     int err, wmem;
> >     size_t hdr_size;
> >     struct socket *sock;
> > +   int bytes_coalesced = 0;
> > +   int bufs_coalesced = 0;
> > +   int packets_coalesced = 0;
> >
> >     /* TODO: check that we are running from vhost_worker? */
> >     sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> >        if (err != len)
> >           pr_debug("Truncated TX packet: "
> >               " len %d != %zd\n", err, len);
> > -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
> >        total_len += len;
> > +      packets_coalesced += 1;
> > +      bytes_coalesced += len;
> > +      bufs_coalesced += in;
> 
> Should this instead be:
>       bufs_coalesced += out;

Correct.

> Perusing the code I see that earlier there is a check to see if "in" is not
> zero, and, if so, error out of the loop.  After the check, "in" is not
> touched until it is added to bufs_coalesced, effectively not changing
> bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
> below.
> 
> Or am I missing something?
> 
> > +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > +              bytes_coalesced > tx_bytes_coalesce ||
> > +              bufs_coalesced > tx_bufs_coalesce))
> > +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +      else
> > +         vhost_add_used(vq, head, 0);
> >        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >           vhost_poll_queue(&vq->poll);
> >           break;
> >        }
> >     }
> >
> > +   if (likely(packets_coalesced > tx_packets_coalesce ||
> > +         bytes_coalesced > tx_bytes_coalesce ||
> > +         bufs_coalesced > tx_bufs_coalesce))
> > +      vhost_signal(&net->dev, vq);
> >     mutex_unlock(&vq->mutex);
> >  }
> >
> 
> Steve D.


* Re: Network performance with small packets
  2011-02-01  1:30                       ` Sridhar Samudrala
@ 2011-02-01  5:56                         ` Michael S. Tsirkin
  2011-02-01 21:09                         ` Shirley Ma
  1 sibling, 0 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01  5:56 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Steve Dobbelstein, David Miller, kvm, mashirle, netdev

On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
> On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> wrote on 01/28/2011 06:16:16 AM:
> > 
> > > OK, so thinking about it more, maybe the issue is this:
> > > tx becomes full. We process one request and interrupt the guest,
> > > then it adds one request and the queue is full again.
> > >
> > > Maybe the following will help it stabilize?
> > > By itself it does nothing, but if you set
> > > all the parameters to a huge value we will
> > > only interrupt when we see an empty ring.
> > > Which might be too much: pls try other values
> > > in the middle: e.g. make bufs half the ring,
> > > or bytes some small value, or packets some
> > > small value etc.
> > >
> > > Warning: completely untested.
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index aac05bc..6769cdc 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -32,6 +32,13 @@
> > >   * Using this limit prevents one virtqueue from starving others. */
> > >  #define VHOST_NET_WEIGHT 0x80000
> > >
> > > +int tx_bytes_coalesce = 0;
> > > +module_param(tx_bytes_coalesce, int, 0644);
> > > +int tx_bufs_coalesce = 0;
> > > +module_param(tx_bufs_coalesce, int, 0644);
> > > +int tx_packets_coalesce = 0;
> > > +module_param(tx_packets_coalesce, int, 0644);
> > > +
> > >  enum {
> > >     VHOST_NET_VQ_RX = 0,
> > >     VHOST_NET_VQ_TX = 1,
> > > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> > >     int err, wmem;
> > >     size_t hdr_size;
> > >     struct socket *sock;
> > > +   int bytes_coalesced = 0;
> > > +   int bufs_coalesced = 0;
> > > +   int packets_coalesced = 0;
> > >
> > >     /* TODO: check that we are running from vhost_worker? */
> > >     sock = rcu_dereference_check(vq->private_data, 1);
> > > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> > >        if (err != len)
> > >           pr_debug("Truncated TX packet: "
> > >               " len %d != %zd\n", err, len);
> > > -      vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > >        total_len += len;
> > > +      packets_coalesced += 1;
> > > +      bytes_coalesced += len;
> > > +      bufs_coalesced += in;
> > 
> > Should this instead be:
> >       bufs_coalesced += out;
> > 
> > Perusing the code I see that earlier there is a check to see if "in" is not
> > zero, and, if so, error out of the loop.  After the check, "in" is not
> > touched until it is added to bufs_coalesced, effectively not changing
> > bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
> > below.
> 
> Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
> 
> I tried a simpler version of this patch without any tunables by
> delaying the signaling until we come out of the for loop.
> It definitely reduced the number of vmexits significantly for small message
> guest to host stream test and the throughput went up a little.
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9b3ca10..5f9fae9 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		vhost_add_used(vq, head, 0);
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
>  		}
>  	}
>  
> +	if (total_len > 0)
> +		vhost_signal(&net->dev, vq);
>  	mutex_unlock(&vq->mutex);
>  }
>  
> 
> > 
> > Or am I missing something?
> > 
> > > +      if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > > +              bytes_coalesced > tx_bytes_coalesce ||
> > > +              bufs_coalesced > tx_bufs_coalesce))
> > > +         vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > > +      else
> > > +         vhost_add_used(vq, head, 0);
> > >        if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > >           vhost_poll_queue(&vq->poll);
> > >           break;
> > >        }
> > >     }
> > >
> > > +   if (likely(packets_coalesced > tx_packets_coalesce ||
> > > +         bytes_coalesced > tx_bytes_coalesce ||
> > > +         bufs_coalesced > tx_bufs_coalesce))
> > > +      vhost_signal(&net->dev, vq);
> > >     mutex_unlock(&vq->mutex);
> > >  }
> 
> It is possible that we can miss signaling the guest even after
> processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be
   if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
      vhost_signal(&net->dev, vq);

> > >
> > 
> > Steve D.
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCHv2 dontapply] vhost-net tx tuning
  2011-01-28 18:29                 ` Steve Dobbelstein
  2011-01-28 22:51                   ` Steve Dobbelstein
@ 2011-02-01 15:52                   ` Michael S. Tsirkin
  2011-02-01 23:07                     ` Sridhar Samudrala
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 15:52 UTC (permalink / raw)
  To: Steve Dobbelstein; +Cc: mashirle, kvm, netdev

OK, so thinking about it more, maybe the issue is this:
the tx ring becomes full. We process one request and interrupt the guest;
then it adds one request and the queue is full again.

Maybe the following will help it stabilize?  By default with it we will
only interrupt when we see an empty ring.
Which is likely too much: pls try other values
in the middle: e.g. make bufs half the ring,
or bytes some small value like half ring * 200, or packets some
small value etc.

Set any one parameter to 0 to get current
behaviour (interrupt immediately when enabled).

Warning: completely untested.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+int tx_bytes_coalesce = 1000000000;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 1000000000;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 1000000000;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += out;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
 
+	if (likely(packets_coalesced &&
+		   bytes_coalesced &&
+		   bufs_coalesced))
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
 


* Re: Network performance with small packets
  2011-01-27 21:30                 ` Shirley Ma
  2011-01-28 12:16                   ` Michael S. Tsirkin
@ 2011-02-01 17:23                   ` Michael S. Tsirkin
       [not found]                     ` <1296590943.26937.797.camel@localhost.localdomain>
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 17:23 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, steved, kvm, netdev

On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
> On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
> > > Interesting. Could this is be a variant of the now famuous
> > bufferbloat then?
> > 
> > Sigh, bufferbloat is the new global warming... :-/ 
> 
> Yep, some places become colder, some other places become warmer; Same as
> BW results, sometimes faster, sometimes slower. :)
> 
> Shirley

Sent a tuning patch (v2) that might help.
Could you try it and play with the module parameters please?
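
With the patch loaded, the knobs appear under /sys/module/vhost_net/parameters/
and can be changed at runtime. As a starting point (the values below are only
examples, assuming a 256-entry tx ring):

  # bufs at half the ring, bytes at roughly half ring * 200, packets small
  echo 128   > /sys/module/vhost_net/parameters/tx_bufs_coalesce
  echo 25600 > /sys/module/vhost_net/parameters/tx_bytes_coalesce
  echo 32    > /sys/module/vhost_net/parameters/tx_packets_coalesce
  # setting any one of them to 0 restores the current behaviour
  echo 0     > /sys/module/vhost_net/parameters/tx_packets_coalesce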

-- 
MST


* Re: Network performance with small packets
       [not found]                       ` <20110201201715.GA30050@redhat.com>
@ 2011-02-01 20:25                         ` Shirley Ma
  2011-02-01 21:21                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 20:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, steved, netdev, kvm

On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
> > > > On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
> > > > > > Interesting. Could this is be a variant of the now famuous
> > > > > bufferbloat then?
> > > > > 
> > > > > Sigh, bufferbloat is the new global warming... :-/ 
> > > > 
> > > > Yep, some places become colder, some other places become warmer;
> > > Same as
> > > > BW results, sometimes faster, sometimes slower. :)
> > > > 
> > > > Shirley
> > > 
> > > Sent a tuning patch (v2) that might help.
> > > Could you try it and play with the module parameters please? 
> > 
> > Hello Michael,
> > 
> > Sure I will play with this patch to see how it could help. 
> > 
> > I am looking at guest side as well, I found a couple issues on guest
> > side:
> > 
> > 1. free_old_xmit_skbs() should return the number of skbs instead of
> the
> > total of sgs since we are using ring size to stop/start netif queue.
> > static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> > {
> >         struct sk_buff *skb;
> >         unsigned int len, tot_sgs = 0;
> > 
> >         while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> >                 pr_debug("Sent skb %p\n", skb);
> >                 vi->dev->stats.tx_bytes += skb->len;
> >                 vi->dev->stats.tx_packets++;
> >                 tot_sgs += skb_vnet_hdr(skb)->num_sg;
> >                 dev_kfree_skb_any(skb);
> >         }
> >         return tot_sgs; <---- should return numbers of skbs to track
> > ring usage here, I think;
> > }
> > 
> > Did the old guest use number of buffers to track ring usage before?
> > 
> > 2. In start_xmit, I think we should move capacity +=
> free_old_xmit_skbs
> > before netif_stop_queue(); so we avoid unnecessary netif queue
> > stop/start. This condition is heavily hit for small message size.
> > 
> > Also we capacity checking condition should change to something like
> half
> > of the vring.num size, instead of comparing 2+MAX_SKB_FRAGS?
> > 
> >        if (capacity < 2+MAX_SKB_FRAGS) {
> >                 netif_stop_queue(dev);
> >                 if (unlikely(!virtqueue_enable_cb(vi->svq))) {
> >                         /* More just got used, free them then
> recheck.
> > */
> >                         capacity += free_old_xmit_skbs(vi);
> >                         if (capacity >= 2+MAX_SKB_FRAGS) {
> >                                 netif_start_queue(dev);
> >                                 virtqueue_disable_cb(vi->svq);
> >                         }
> >                 }
> >         }
> > 
> > 3. Looks like the xmit callback is only used to wake the queue when
> the
> > queue has stopped, right? Should we put a condition check here?
> > static void skb_xmit_done(struct virtqueue *svq)
> > {
> >         struct virtnet_info *vi = svq->vdev->priv;
> > 
> >         /* Suppress further interrupts. */
> >         virtqueue_disable_cb(svq);
> > 
> >         /* We were probably waiting for more output buffers. */
> > --->   if (netif_queue_stopped(vi->dev))
> >         netif_wake_queue(vi->dev);
> > }
> > 
> > 
> > Shirley
> 
> Well the return value is used to calculate capacity and that counts
> the # of s/g. No?

Nope, the current guest kernel uses descriptors, not the number of sgs. I am
not sure about the old guest.

> From cache utilization POV it might be better to read from the skb and
> not peek at virtio header though...
> Pls Cc the lists on any discussions in the future.
> 
> -- 
> MST

Sorry I missed reply all. :(

Shirley



* Re: Network performance with small packets
  2011-02-01  1:30                       ` Sridhar Samudrala
  2011-02-01  5:56                         ` Michael S. Tsirkin
@ 2011-02-01 21:09                         ` Shirley Ma
  2011-02-01 21:24                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 21:09 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Steve Dobbelstein, Michael S. Tsirkin, David Miller, kvm,
	mashirle, netdev

On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:
> Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
> 
> I tried a simpler version of this patch without any tunables by
> delaying the signaling until we come out of the for loop.
> It definitely reduced the number of vmexits significantly for small
> message
> guest to host stream test and the throughput went up a little.
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9b3ca10..5f9fae9 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
>                 if (err != len)
>                         pr_debug("Truncated TX packet: "
>                                  " len %d != %zd\n", err, len);
> -               vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +               vhost_add_used(vq, head, 0);
>                 total_len += len;
>                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>                         vhost_poll_queue(&vq->poll);
> @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
>                 }
>         }
> 
> +       if (total_len > 0)
> +               vhost_signal(&net->dev, vq);
>         mutex_unlock(&vq->mutex);
>  }

Reducing the signaling will reduce the CPU utilization by reducing VM
exits.

The small-message BW is a problem where we have seen a fast guest and a slow
vhost; even when I increased VHOST_NET_WEIGHT several times, it didn't help
BW that much. For large message sizes, vhost is able to process all packets
in time. I have played around with the guest/host code, and so far I only see
a huge BW improvement by dropping packets on the guest side.
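
(One way to see the effect on exits, assuming debugfs is mounted at
/sys/kernel/debug, is to sample the kvm counters around a run, or to watch
the rates with the kvm_stat tool:)

  cat /sys/kernel/debug/kvm/exits
  # ... run the test ...
  cat /sys/kernel/debug/kvm/exits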

Thanks
Shirley



* Re: Network performance with small packets
  2011-02-01 20:25                         ` Shirley Ma
@ 2011-02-01 21:21                           ` Michael S. Tsirkin
  2011-02-01 21:28                             ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 21:21 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, steved, netdev, kvm

On Tue, Feb 01, 2011 at 12:25:08PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
> > > > > On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
> > > > > > > Interesting. Could this is be a variant of the now famuous
> > > > > > bufferbloat then?
> > > > > > 
> > > > > > Sigh, bufferbloat is the new global warming... :-/ 
> > > > > 
> > > > > Yep, some places become colder, some other places become warmer;
> > > > Same as
> > > > > BW results, sometimes faster, sometimes slower. :)
> > > > > 
> > > > > Shirley
> > > > 
> > > > Sent a tuning patch (v2) that might help.
> > > > Could you try it and play with the module parameters please? 
> > > 
> > > Hello Michael,
> > > 
> > > Sure I will play with this patch to see how it could help. 
> > > 
> > > I am looking at guest side as well, I found a couple issues on guest
> > > side:
> > > 
> > > 1. free_old_xmit_skbs() should return the number of skbs instead of
> > the
> > > total of sgs since we are using ring size to stop/start netif queue.
> > > static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> > > {
> > >         struct sk_buff *skb;
> > >         unsigned int len, tot_sgs = 0;
> > > 
> > >         while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > >                 pr_debug("Sent skb %p\n", skb);
> > >                 vi->dev->stats.tx_bytes += skb->len;
> > >                 vi->dev->stats.tx_packets++;
> > >                 tot_sgs += skb_vnet_hdr(skb)->num_sg;
> > >                 dev_kfree_skb_any(skb);
> > >         }
> > >         return tot_sgs; <---- should return numbers of skbs to track
> > > ring usage here, I think;
> > > }
> > > 
> > > Did the old guest use number of buffers to track ring usage before?
> > > 
> > > 2. In start_xmit, I think we should move capacity +=
> > free_old_xmit_skbs
> > > before netif_stop_queue(); so we avoid unnecessary netif queue
> > > stop/start. This condition is heavily hit for small message size.
> > > 
> > > Also we capacity checking condition should change to something like
> > half
> > > of the vring.num size, instead of comparing 2+MAX_SKB_FRAGS?
> > > 
> > >        if (capacity < 2+MAX_SKB_FRAGS) {
> > >                 netif_stop_queue(dev);
> > >                 if (unlikely(!virtqueue_enable_cb(vi->svq))) {
> > >                         /* More just got used, free them then recheck. */
> > >                         capacity += free_old_xmit_skbs(vi);
> > >                         if (capacity >= 2+MAX_SKB_FRAGS) {
> > >                                 netif_start_queue(dev);
> > >                                 virtqueue_disable_cb(vi->svq);
> > >                         }
> > >                 }
> > >         }
> > > 
> > > 3. Looks like the xmit callback is only used to wake the queue when the
> > > queue has stopped, right? Should we put a condition check here?
> > > static void skb_xmit_done(struct virtqueue *svq)
> > > {
> > >         struct virtnet_info *vi = svq->vdev->priv;
> > > 
> > >         /* Suppress further interrupts. */
> > >         virtqueue_disable_cb(svq);
> > > 
> > >         /* We were probably waiting for more output buffers. */
> > > --->   if (netif_queue_stopped(vi->dev))
> > >         netif_wake_queue(vi->dev);
> > > }
> > > 
> > > 
> > > Shirley
> > 
> > Well the return value is used to calculate capacity and that counts
> > the # of s/g. No?
> 
> Nope, the current guest kernel uses descriptors not number of sgs.

Confused. We compare capacity to skb frags, no?
That's sg I think ...

> not sure the old guest.
> 
> > From cache utilization POV it might be better to read from the skb and
> > not peek at virtio header though...
> > Pls Cc the lists on any discussions in the future.
> > 
> > -- 
> > MST
> 
> Sorry I missed reply all. :(
> 
> Shirley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:09                         ` Shirley Ma
@ 2011-02-01 21:24                           ` Michael S. Tsirkin
  2011-02-01 21:32                             ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 21:24 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 01:09:45PM -0800, Shirley Ma wrote:
> On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:
> > Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
> > 
> > I tried a simpler version of this patch without any tunables by
> > delaying the signaling until we come out of the for loop.
> > It definitely reduced the number of vmexits significantly for small
> > message
> > guest to host stream test and the throughput went up a little.
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 9b3ca10..5f9fae9 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
> >                 if (err != len)
> >                         pr_debug("Truncated TX packet: "
> >                                  " len %d != %zd\n", err, len);
> > -               vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +               vhost_add_used(vq, head, 0);
> >                 total_len += len;
> >                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >                         vhost_poll_queue(&vq->poll);
> > @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
> >                 }
> >         }
> > 
> > +       if (total_len > 0)
> > +               vhost_signal(&net->dev, vq);
> >         mutex_unlock(&vq->mutex);
> >  }
> 
> Reducing the signaling will reduce the CPU utilization by reducing VM
> exits. 
> 
> The small message BW is a problem we have seen faster guest/slow vhost,
> even I increased VHOST_NET_WEIGHT times, it didn't help that much for
> BW. For large message size, vhost is able to process all packets on
> time. I played around with guest/host codes, I only see huge BW
> improvement by dropping packets on guest side so far.
> 
> Thanks
> Shirley


My theory is that the issue is not signalling itself.
Rather, our queue fills up, then the host handles
one packet and sends an interrupt, and we
immediately wake the queue. So once the vq
gets full, it stays full.

If you try my patch with the bufs threshold set to e.g. half the vq,
what we will do is send an interrupt only after we have processed half
the vq. So the host has half the vq to go, and the guest has half the vq
to fill.

See?
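
In code terms, the core of the idea is something like this (sketch only,
untested; the actual tuning patch, quoted later in the thread, also
tracks bytes and packets and makes the thresholds module parameters):

	/* in handle_tx(): complete buffers without signalling ... */
	vhost_add_used(vq, head, 0);
	bufs_coalesced += out;
	if (unlikely(bufs_coalesced > tx_bufs_coalesce)) {
		/* crossed the threshold: now interrupt the guest */
		vhost_signal(&net->dev, vq);
		bufs_coalesced = 0;
	}

	/* after the loop, flush anything still pending: */
	if (bufs_coalesced)
		vhost_signal(&net->dev, vq);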

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:21                           ` Michael S. Tsirkin
@ 2011-02-01 21:28                             ` Shirley Ma
  2011-02-01 21:41                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 21:28 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, steved, netdev, kvm

On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> Confused. We compare capacity to skb frags, no?
> That's sg I think ...

The current guest kernel uses indirect buffers; num_free returns how many
descriptors are available, not how many skb frags. So the comparison is
wrong here.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:24                           ` Michael S. Tsirkin
@ 2011-02-01 21:32                             ` Shirley Ma
  2011-02-01 21:42                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 21:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> My theory is that the issue is not signalling.
> Rather, our queue fills up, then host handles
> one packet and sends an interrupt, and we
> immediately wake the queue. So the vq
> once it gets full, stays full.

From the printk debugging output, it might not be exactly the case. The
ring gets full, runs a bit, then gets full again, runs a bit, then fills
up again...

> If you try my patch with bufs threshold set to e.g.
> half the vq, what we will do is send interrupt after we have processed
> half the vq.  So host has half the vq to go, and guest has half the vq
> to fill.
> 
> See?

I am cleaning up my setup to run your patch ...

Shirley



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:28                             ` Shirley Ma
@ 2011-02-01 21:41                               ` Michael S. Tsirkin
  2011-02-02  4:39                                 ` Krishna Kumar2
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 21:41 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, steved, netdev, kvm, rusty

On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > Confused. We compare capacity to skb frags, no?
> > That's sg I think ...
> 
> Current guest kernel use indirect buffers, num_free returns how many
> available descriptors not skb frags. So it's wrong here.
> 
> Shirley

I see. Good point. In other words, the buffer we complete was indirect,
but when we add a new one we cannot allocate an indirect buffer, so we
consume multiple descriptors. And then we start the queue and the add
will fail. I guess we need some kind of API to figure out whether the
buffer we completed was indirect?

Another failure mode is when skb_xmit_done wakes the queue: it might be
too early, there might not be space for the next packet in the vq yet.

A solution might be to keep some kind of pool around for indirect
buffers; we wanted to do that for block anyway ...

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:32                             ` Shirley Ma
@ 2011-02-01 21:42                               ` Michael S. Tsirkin
  2011-02-01 21:53                                 ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 21:42 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> > My theory is that the issue is not signalling.
> > Rather, our queue fills up, then host handles
> > one packet and sends an interrupt, and we
> > immediately wake the queue. So the vq
> > once it gets full, stays full.
> 
> >From the printk debugging output, it might not be exactly the case. The
> ring gets full, run a bit, then gets full, then run a bit, then full...

Yes, but does it get even half empty in between?

> > If you try my patch with bufs threshold set to e.g.
> > half the vq, what we will do is send interrupt after we have processed
> > half the vq.  So host has half the vq to go, and guest has half the vq
> > to fill.
> > 
> > See?
> 
> I am cleaning up my set up to run your patch ...
> 
> Shirley
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:42                               ` Michael S. Tsirkin
@ 2011-02-01 21:53                                 ` Shirley Ma
  2011-02-01 21:56                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 21:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, 2011-02-01 at 23:42 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> > > My theory is that the issue is not signalling.
> > > Rather, our queue fills up, then host handles
> > > one packet and sends an interrupt, and we
> > > immediately wake the queue. So the vq
> > > once it gets full, stays full.
> > 
> > >From the printk debugging output, it might not be exactly the case.
> The
> > ring gets full, run a bit, then gets full, then run a bit, then
> full...
> 
> Yes, but does it get even half empty in between?

Sometimes, but most of the time it does not get half empty in between.
printk slows the traffic down, though, so it's not accurate. I think your
patch will improve the performance if it signals the guest when half of
the ring is empty.

But you manage signaling by TX bytes; I would like to change it to half
of the ring size instead. Is that OK?

Shirley




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:53                                 ` Shirley Ma
@ 2011-02-01 21:56                                   ` Michael S. Tsirkin
  2011-02-01 22:59                                     ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-01 21:56 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 01:53:05PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:42 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
> > > > My theory is that the issue is not signalling.
> > > > Rather, our queue fills up, then host handles
> > > > one packet and sends an interrupt, and we
> > > > immediately wake the queue. So the vq
> > > > once it gets full, stays full.
> > > 
> > > >From the printk debugging output, it might not be exactly the case.
> > The
> > > ring gets full, run a bit, then gets full, then run a bit, then
> > full...
> > 
> > Yes, but does it get even half empty in between?
> 
> Sometimes, most of them not half of empty in between. But printk slow
> down the traffics, so it's not accurate. I think your patch will improve
> the performance if it signals guest when half of the ring size is
> empty. 
> 
> But you manage signal by using TX bytes,

There are flags for bytes, buffers and packets.
Try playing with any one of them :)
Just be sure to use v2.


>I would like to change it to
> half of the ring size instead for signaling. Is that OK?
> 
> Shirley
> 
> 

Sure that is why I made it a parameter so you can experiment.

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:56                                   ` Michael S. Tsirkin
@ 2011-02-01 22:59                                     ` Shirley Ma
  2011-02-02  4:40                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 22:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> There are flags for bytes, buffers and packets.
> Try playing with any one of them :)
> Just be sure to use v2.
> 
> 
> >I would like to change it to
> > half of the ring size instead for signaling. Is that OK?
> > 
> > Shirley
> > 
> > 
> 
> Sure that is why I made it a parameter so you can experiment. 

The initial test results show that CPU utilization has been reduced
somewhat and BW has increased somewhat with the default parameters: for
1K message size, BW goes from 2.5Gb/s to about 2.8Gb/s and CPU
utilization drops from 4x% to 38% (similar to the results from the patch
I submitted a while ago to reduce signaling on vhost), but still far from
the packet-dropping results.

I am going to change the code to use 1/2 of the ring size to wake the
netif queue.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCHv2 dontapply] vhost-net tx tuning
  2011-02-01 15:52                   ` [PATCHv2 dontapply] vhost-net tx tuning Michael S. Tsirkin
@ 2011-02-01 23:07                     ` Sridhar Samudrala
  2011-02-01 23:27                       ` Shirley Ma
  2011-02-02  4:36                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Sridhar Samudrala @ 2011-02-01 23:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Steve Dobbelstein, mashirle, kvm, netdev

On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote:
> OK, so thinking about it more, maybe the issue is this:
> tx becomes full. We process one request and interrupt the guest,
> then it adds one request and the queue is full again.
> 
> Maybe the following will help it stabilize?  By default with it we will
> only interrupt when we see an empty ring.
> Which is likely too much: please try other values
> in the middle: e.g. make bufs half the ring,
> or bytes some small value like half ring * 200, or packets some
> small value etc.
> 
> Set any one parameter to 0 to get current
> behaviour (interrupt immediately when enabled).
> 
> Warning: completely untested.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index aac05bc..6769cdc 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,13 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
> 
> +int tx_bytes_coalesce = 1000000000;
> +module_param(tx_bytes_coalesce, int, 0644);
> +int tx_bufs_coalesce = 1000000000;
> +module_param(tx_bufs_coalesce, int, 0644);
> +int tx_packets_coalesce = 1000000000;
> +module_param(tx_packets_coalesce, int, 0644);
> +
>  enum {
>  	VHOST_NET_VQ_RX = 0,
>  	VHOST_NET_VQ_TX = 1,
> @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	int bytes_coalesced = 0;
> +	int bufs_coalesced = 0;
> +	int packets_coalesced = 0;
> 
>  	/* TODO: check that we are running from vhost_worker? */
>  	sock = rcu_dereference_check(vq->private_data, 1);
> @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
>  		total_len += len;
> +		packets_coalesced += 1;
> +		bytes_coalesced += len;
> +		bufs_coalesced += out;
> +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
> +			     bytes_coalesced > tx_bytes_coalesce ||
> +			     bufs_coalesced > tx_bufs_coalesce))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);

I think the counters that exceed the limits need to be reset to 0 here.
Otherwise we keep signaling for every buffer once we hit this condition.
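
Something like this, perhaps (sketch against the patch above, untested):

		if (unlikely(packets_coalesced > tx_packets_coalesce ||
			     bytes_coalesced > tx_bytes_coalesce ||
			     bufs_coalesced > tx_bufs_coalesce)) {
			vhost_add_used_and_signal(&net->dev, vq, head, 0);
			/* start a new coalescing window after each signal */
			packets_coalesced = 0;
			bytes_coalesced = 0;
			bufs_coalesced = 0;
		} else
			vhost_add_used(vq, head, 0);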

Thanks
Sridhar

> +		else
> +			vhost_add_used(vq, head, 0);
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
>  			break;
>  		}
>  	}
> 
> +	if (likely(packets_coalesced &&
> +		   bytes_coalesced &&
> +		   bufs_coalesced))
> +		vhost_signal(&net->dev, vq);
>  	mutex_unlock(&vq->mutex);
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCHv2 dontapply] vhost-net tx tuning
  2011-02-01 23:07                     ` Sridhar Samudrala
@ 2011-02-01 23:27                       ` Shirley Ma
  2011-02-02  4:36                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-01 23:27 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Steve Dobbelstein, mashirle, kvm, netdev

On Tue, 2011-02-01 at 15:07 -0800, Sridhar Samudrala wrote:
> I think the counters that exceed the limits need to be reset to 0
> here.
> Otherwise we keep signaling for every buffer once we hit this
> condition. 

I will modify the patch to rerun the test to see the difference.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCHv2 dontapply] vhost-net tx tuning
  2011-02-01 23:07                     ` Sridhar Samudrala
  2011-02-01 23:27                       ` Shirley Ma
@ 2011-02-02  4:36                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02  4:36 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Steve Dobbelstein, mashirle, kvm, netdev

On Tue, Feb 01, 2011 at 03:07:38PM -0800, Sridhar Samudrala wrote:
> On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote:
> > OK, so thinking about it more, maybe the issue is this:
> > tx becomes full. We process one request and interrupt the guest,
> > then it adds one request and the queue is full again.
> > 
> > Maybe the following will help it stabilize?  By default with it we will
> > only interrupt when we see an empty ring.
> > Which is likely too much: please try other values
> > in the middle: e.g. make bufs half the ring,
> > or bytes some small value like half ring * 200, or packets some
> > small value etc.
> > 
> > Set any one parameter to 0 to get current
> > behaviour (interrupt immediately when enabled).
> > 
> > Warning: completely untested.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index aac05bc..6769cdc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,13 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> > 
> > +int tx_bytes_coalesce = 1000000000;
> > +module_param(tx_bytes_coalesce, int, 0644);
> > +int tx_bufs_coalesce = 1000000000;
> > +module_param(tx_bufs_coalesce, int, 0644);
> > +int tx_packets_coalesce = 1000000000;
> > +module_param(tx_packets_coalesce, int, 0644);
> > +
> >  enum {
> >  	VHOST_NET_VQ_RX = 0,
> >  	VHOST_NET_VQ_TX = 1,
> > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> >  	int err, wmem;
> >  	size_t hdr_size;
> >  	struct socket *sock;
> > +	int bytes_coalesced = 0;
> > +	int bufs_coalesced = 0;
> > +	int packets_coalesced = 0;
> > 
> >  	/* TODO: check that we are running from vhost_worker? */
> >  	sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> >  		if (err != len)
> >  			pr_debug("Truncated TX packet: "
> >  				 " len %d != %zd\n", err, len);
> > -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> >  		total_len += len;
> > +		packets_coalesced += 1;
> > +		bytes_coalesced += len;
> > +		bufs_coalesced += out;
> > +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > +			     bytes_coalesced > tx_bytes_coalesce ||
> > +			     bufs_coalesced > tx_bufs_coalesce))
> > +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> 
> I think the counters that exceed the limits need to be reset to 0 here.
> Otherwise we keep signaling for every buffer once we hit this condition.
> 
> Thanks
> Sridhar

Correct, good catch.

> > +		else
> > +			vhost_add_used(vq, head, 0);
> >  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >  			vhost_poll_queue(&vq->poll);
> >  			break;
> >  		}
> >  	}
> > 
> > +	if (likely(packets_coalesced &&
> > +		   bytes_coalesced &&
> > +		   bufs_coalesced))
> > +		vhost_signal(&net->dev, vq);
> >  	mutex_unlock(&vq->mutex);
> >  }
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 21:41                               ` Michael S. Tsirkin
@ 2011-02-02  4:39                                 ` Krishna Kumar2
  2011-02-02  4:42                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Krishna Kumar2 @ 2011-02-02  4:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, kvm, Shirley Ma, netdev, rusty, steved

> "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
>
> On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > Confused. We compare capacity to skb frags, no?
> > > That's sg I think ...
> >
> > Current guest kernel use indirect buffers, num_free returns how many
> > available descriptors not skb frags. So it's wrong here.
> >
> > Shirley
>
> I see. Good point. In other words when we complete the buffer
> it was indirect, but when we add a new one we
> can not allocate indirect so we consume.
> And then we start the queue and add will fail.
> I guess we need some kind of API to figure out
> whether the buf we complete was indirect?
>
> Another failure mode is when skb_xmit_done
> wakes the queue: it might be too early, there
> might not be space for the next packet in the vq yet.

I am not sure if this is the problem - shouldn't you
see these messages:
	if (likely(capacity == -ENOMEM)) {
		dev_warn(&dev->dev,
			"TX queue failure: out of memory\n");
	} else {
		dev->stats.tx_fifo_errors++;
		dev_warn(&dev->dev,
			"Unexpected TX queue failure: %d\n",
			capacity);
	}
in next xmit? I am not getting this in my testing.

> A solution might be to keep some kind of pool
> around for indirect, we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Thanks,

- KK


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-01 22:59                                     ` Shirley Ma
@ 2011-02-02  4:40                                       ` Michael S. Tsirkin
  2011-02-02  6:05                                         ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02  4:40 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> > There are flags for bytes, buffers and packets.
> > Try playing with any one of them :)
> > Just be sure to use v2.
> > 
> > 
> > >I would like to change it to
> > > half of the ring size instead for signaling. Is that OK?
> > > 
> > > Shirley
> > > 
> > > 
> > 
> > Sure that is why I made it a parameter so you can experiment. 
> 
> The initial test results shows that the CPUs utilization has been
> reduced some, and BW has increased some with the default parameters,
> like 1K message size BW goes from 2.5Gb/s about 2.8Gb/s, CPU utilization
> down from 4x% to 38%, (Similar results from the patch I submitted a
> while ago to reduce signaling on vhost) but far away from dropping
> packet results.
> 
> I am going to change the code to use 1/2 ring size to wake the netif
> queue.
> 
> Shirley

Just tweak the parameters with sysfs, you do not have to edit the code:
echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect,
packets will typically use 1 buffer each).

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  4:39                                 ` Krishna Kumar2
@ 2011-02-02  4:42                                   ` Michael S. Tsirkin
  2011-02-09  0:37                                     ` Rusty Russell
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02  4:42 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: David Miller, kvm, Shirley Ma, netdev, rusty, steved

On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> >
> > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > Confused. We compare capacity to skb frags, no?
> > > > That's sg I think ...
> > >
> > > Current guest kernel use indirect buffers, num_free returns how many
> > > available descriptors not skb frags. So it's wrong here.
> > >
> > > Shirley
> >
> > I see. Good point. In other words when we complete the buffer
> > it was indirect, but when we add a new one we
> > can not allocate indirect so we consume.
> > And then we start the queue and add will fail.
> > I guess we need some kind of API to figure out
> > whether the buf we complete was indirect?
> >
> > Another failure mode is when skb_xmit_done
> > wakes the queue: it might be too early, there
> > might not be space for the next packet in the vq yet.
> 
> I am not sure if this is the problem - shouldn't you
> see these messages:
> 	if (likely(capacity == -ENOMEM)) {
> 		dev_warn(&dev->dev,
> 			"TX queue failure: out of memory\n");
> 	} else {
> 		dev->stats.tx_fifo_errors++;
> 		dev_warn(&dev->dev,
> 			"Unexpected TX queue failure: %d\n",
> 			capacity);
> 	}
> in next xmit? I am not getting this in my testing.

Yes, I don't think we hit this in our testing,
simply because we don't stress memory.
Disable indirect, then you might see this.

> > A solution might be to keep some kind of pool
> > around for indirect, we wanted to do it for block anyway ...
> 
> Your vhost patch should fix this automatically. Right?

Reduce the chance of it happening, yes.

> 
> Thanks,
> 
> - KK

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  4:40                                       ` Michael S. Tsirkin
@ 2011-02-02  6:05                                         ` Shirley Ma
  2011-02-02  6:19                                           ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02  6:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, 2011-02-02 at 06:40 +0200, Michael S. Tsirkin wrote:
> Just tweak the parameters with sysfs, you do not have to edit the code:
> echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce
> 
> Or in a similar way for tx_packets_coalesce (since we use indirect,
> packets will typically use 1 buffer each).

We should use packets instead of buffers: in the indirect case one packet
has multiple buffers, but each packet uses only one descriptor from the
ring (default size is 256).

echo 128 > /sys/module/vhost_net/parameters/tx_packets_coalesce

The change I am making is: only once the netif queue has stopped do we
start counting num_free descriptors to decide when to send the signal
that wakes the netif queue.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  6:05                                         ` Shirley Ma
@ 2011-02-02  6:19                                           ` Shirley Ma
  2011-02-02  6:29                                             ` Michael S. Tsirkin
  2011-02-02  6:34                                             ` Krishna Kumar2
  0 siblings, 2 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-02  6:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> 
> The way I am changing is only when netif queue has stopped, then we
> start to count num_free descriptors to send the signal to wake netif
> queue. 

I forgot to mention, the code change I am making is in the guest kernel:
the xmit callback only wakes up the queue when it's stopped && num_free >=
1/2 * vq->num. I added a new API in virtio_ring for this.
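
The new API is just a free-descriptor count, something like the sketch
below in drivers/virtio/virtio_ring.c (untested here; num_free is the
counter vring_virtqueue already keeps for its own bookkeeping):

	/* How many descriptors are currently unused in the ring. */
	unsigned int virtqueue_free_size(struct virtqueue *_vq)
	{
		struct vring_virtqueue *vq = to_vvq(_vq);

		return vq->num_free;
	}
	EXPORT_SYMBOL_GPL(virtqueue_free_size);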

However, reducing vhost signaling is needed as well. The patch I
submitted a while ago showed both CPU and BW improvements.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  6:19                                           ` Shirley Ma
@ 2011-02-02  6:29                                             ` Michael S. Tsirkin
  2011-02-02  7:14                                               ` Shirley Ma
  2011-02-02  6:34                                             ` Krishna Kumar2
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02  6:29 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > 
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue. 
> 
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However,
wouldn't just the signaling reduction be enough, without guest changes?

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.
> 
> Thanks
> Shirley

Which patch was that?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  6:19                                           ` Shirley Ma
  2011-02-02  6:29                                             ` Michael S. Tsirkin
@ 2011-02-02  6:34                                             ` Krishna Kumar2
  2011-02-02  7:03                                               ` Shirley Ma
  2011-02-02 10:48                                               ` Michael S. Tsirkin
  1 sibling, 2 replies; 88+ messages in thread
From: Krishna Kumar2 @ 2011-02-02  6:34 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> >
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue.
>
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

FYI :)

I have tried this before. There are a couple of issues:

1. the free count will not reduce until you run free_old_xmit_skbs,
   which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a
   queue that was just awakened (current cb was due to the delay
   in disabling cb's).

You have to call free_old_xmit_skbs() under netif_queue_stopped()
check to avoid the race.
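
i.e. something along these lines (sketch, untested):

	static void skb_xmit_done(struct virtqueue *svq)
	{
		struct virtnet_info *vi = svq->vdev->priv;

		/* Suppress further interrupts. */
		virtqueue_disable_cb(svq);

		/* Reap only while the queue is stopped; calling
		 * free_old_xmit_skbs() unconditionally races with a
		 * queue that was just awakened. */
		if (netif_queue_stopped(vi->dev))
			free_old_xmit_skbs(vi);

		netif_wake_queue(vi->dev);
	}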

I got a small improvement in my testing up to some number of threads
(32 or 48?), but beyond that I was getting a regression.

Thanks,

- KK

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  6:34                                             ` Krishna Kumar2
@ 2011-02-02  7:03                                               ` Shirley Ma
  2011-02-02  7:37                                                 ` Krishna Kumar2
  2011-02-02 10:48                                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02  7:03 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 12:04 +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then
> we
> > > start to count num_free descriptors to send the signal to wake
> netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel,
> in
> > xmit call back only wake up the queue when it's stopped && num_free
> >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)

> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't wake the queue;
just call free_old_xmit_skbs(), and if not enough was freed, enable the
callback until half of the ring is free, then wake the netif queue. But
somehow I didn't reach the performance of dropping packets; I need to
think about it more. :)

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  6:29                                             ` Michael S. Tsirkin
@ 2011-02-02  7:14                                               ` Shirley Ma
  2011-02-02  7:33                                                 ` Shirley Ma
  2011-02-02 10:48                                                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-02  7:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > 
> > > The way I am changing is only when netif queue has stopped, then
> we
> > > start to count num_free descriptors to send the signal to wake
> netif
> > > queue. 
> > 
> > I forgot to mention, the code change I am making is in guest kernel,
> in
> > xmit call back only wake up the queue when it's stopped && num_free
> >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> Interesting. Yes, I agree an API extension would be helpful. However,
> wouldn't just the signaling reduction be enough, without guest
> changes?

w/i the guest change, I played around with the parameters; for example,
for 1K message size I could get 3.7Gb/s at 42% CPU, with BW increasing
from 2.5Gb/s. w/i dropping packets, I was able to get up to 6.2Gb/s with
similar CPU usage.

> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.
> > 
> > Thanks
> > Shirley
> 
> Which patch was that? 

The patch was called "vhost: TX used buffer guest signal accumulation".
You suggested splitting add_used_bufs and signal. I am still thinking
about the best way to coordinate guest (virtio_kick) with vhost
(handle_tx), and vhost (signaling) with guest (xmit callback), to reduce
the overheads, so I haven't submitted the new patch yet.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  7:14                                               ` Shirley Ma
@ 2011-02-02  7:33                                                 ` Shirley Ma
  2011-02-02 10:49                                                   ` Michael S. Tsirkin
  2011-02-02 10:48                                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02  7:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> w/i guest change, I played around the parameters,for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage. 

I meant w/o guest change, only vhost changes. Sorry about that.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  7:03                                               ` Shirley Ma
@ 2011-02-02  7:37                                                 ` Krishna Kumar2
  0 siblings, 0 replies; 88+ messages in thread
From: Krishna Kumar2 @ 2011-02-02  7:37 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

> Shirley Ma <mashirle@us.ibm.com> wrote:
>
> > I have tried this before. There are a couple of issues:
> >
> > 1. the free count will not reduce until you run free_old_xmit_skbs,
> >    which will not run anymore since the tx queue is stopped.
> > 2. You cannot call free_old_xmit_skbs directly as it races with a
> >    queue that was just awakened (current cb was due to the delay
> >    in disabling cb's).
> >
> > You have to call free_old_xmit_skbs() under netif_queue_stopped()
> > check to avoid the race.
>
> Yes, that' what I did, when the netif queue stop, don't enable the
> queue, just free_old_xmit_skbs(), if not enough freed, then enabling
> callback until half of the ring size are freed, then wake the netif
> queue. But somehow I didn't reach the performance compared to drop
> packets, need to think about it more. :)

Did you check if the number of vmexits increased with this
patch? This is possible if the device was keeping up (and
not going into a stop, start, xmit 1 packet, stop, start
loop). Also maybe you should try for 1/4th instead of 1/2?

MST's delayed signalling should avoid this issue, I haven't
tried both together.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  6:34                                             ` Krishna Kumar2
  2011-02-02  7:03                                               ` Shirley Ma
@ 2011-02-02 10:48                                               ` Michael S. Tsirkin
  2011-02-02 15:39                                                 ` Shirley Ma
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 10:48 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Shirley Ma, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 12:04:37PM +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then we
> > > start to count num_free descriptors to send the signal to wake netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel, in
> > xmit call back only wake up the queue when it's stopped && num_free >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)
> 
> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.
> 
> I got a small improvement in my testing upto some number of threads
> (32 or 48?), but beyond that I was getting a regression.
> 
> Thanks,
> 
> - KK
> 
> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.

Yes, I think doing this in the host is much simpler,
just send an interrupt after there's a decent amount
of space in the queue.

Having said that the simple heuristic that I coded
might be a bit too simple.

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  7:14                                               ` Shirley Ma
  2011-02-02  7:33                                                 ` Shirley Ma
@ 2011-02-02 10:48                                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 10:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 11:14:51PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > > 
> > > > The way I am changing is only when netif queue has stopped, then
> > we
> > > > start to count num_free descriptors to send the signal to wake
> > netif
> > > > queue. 
> > > 
> > > I forgot to mention, the code change I am making is in guest kernel,
> > in
> > > xmit call back only wake up the queue when it's stopped && num_free
> > >=
> > > 1/2 *vq->num, I add a new API in virtio_ring.
> > 
> > Interesting. Yes, I agree an API extension would be helpful. However,
> > wouldn't just the signaling reduction be enough, without guest
> > changes?
> 
> w/i guest change, I played around the parameters,for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage.

We need to consider them separately IMO.  What's the best we can get
without guest change?  And which parameters give it?
There will always be old guests, and as far as I can tell
it should work better from host.

> > > However vhost signaling reduction is needed as well. The patch I
> > > submitted a while ago showed both CPUs and BW improvement.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Which patch was that? 
> 
> The patch was called "vhost: TX used buffer guest signal accumulation".
Yes, a somewhat similar idea.

> You suggested to split add_used_bufs and signal.
Exactly. And this is basically what this patch does.

> I am still thinking
> what's the best approach to cooperate guest (virtio_kick) and
> vhost(handle_tx), vhost(signaling) and guest (xmit callback) to reduce
> the overheads, so I haven't submit the new patch yet.
> 
> Thanks
> Shirley


-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  7:33                                                 ` Shirley Ma
@ 2011-02-02 10:49                                                   ` Michael S. Tsirkin
  2011-02-02 15:42                                                     ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 10:49 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > w/i guest change, I played around the parameters,for example: I could
> > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > size,
> > w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> > usage. 
> 
> I meant w/o guest change, only vhost changes. Sorry about that.
> 
> Shirley

Ah, excellent. What were the parameters?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 10:48                                               ` Michael S. Tsirkin
@ 2011-02-02 15:39                                                 ` Shirley Ma
  2011-02-02 15:47                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 15:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> Yes, I think doing this in the host is much simpler,
> just send an interrupt after there's a decent amount
> of space in the queue.
> 
> Having said that the simple heuristic that I coded
> might be a bit too simple.

From the debugging output I have seen so far (a single small-message
TCP_STREAM test), I think the right approach is to patch both guest and
vhost. The problem I have found is a regression for the single
small-message TCP_STREAM test. The old kernel works well for TCP_STREAM;
only the new kernel has the problem.

For Steven's problem, it's a multiple-stream TCP_RR issue; the old guest
doesn't perform well, and neither does the new guest kernel. We tested
the patch that reduces vhost signaling before; it didn't help the
performance at all.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 10:49                                                   ` Michael S. Tsirkin
@ 2011-02-02 15:42                                                     ` Shirley Ma
  2011-02-02 15:48                                                       ` Michael S. Tsirkin
  2011-02-02 18:20                                                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 15:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > w/i guest change, I played around the parameters,for example: I
> could
> > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > size,
> > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> CPU
> > > usage. 
> > 
> > I meant w/o guest change, only vhost changes. Sorry about that.
> > 
> > Shirley
> 
> Ah, excellent. What were the parameters? 

I used half of the ring size, 129, for the packet counter, but the
performance is still not as good as dropping packets in the guest:
3.7Gb/s vs. 6.2Gb/s.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 15:39                                                 ` Shirley Ma
@ 2011-02-02 15:47                                                   ` Michael S. Tsirkin
  2011-02-02 17:10                                                     ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 15:47 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > Yes, I think doing this in the host is much simpler,
> > just send an interrupt after there's a decent amount
> > of space in the queue.
> > 
> > Having said that the simple heuristic that I coded
> > might be a bit too simple.
> 
> >From the debugging out what I have seen so far (a single small message
> TCP_STEAM test), I think the right approach is to patch both guest and
> vhost.

One problem is that slowing down the guest helps here.
So there's a chance that just by adding complexity
in the guest driver we get a small improvement :(

We can't rely on a patched guest anyway, so
I think it is best to test guest and host changes separately.

And I do agree something needs to be done in the guest too. For example,
when vqs share an interrupt we might invoke a callback when we see the vq
is not empty, even though the interrupt was not requested for it. We
should probably check whether interrupts are enabled there?
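
e.g. something like this in vring_interrupt() (sketch only, from memory
of the virtio_ring internals, untested):

	irqreturn_t vring_interrupt(int irq, void *_vq)
	{
		struct vring_virtqueue *vq = to_vvq(_vq);

		if (!more_used(vq))
			return IRQ_NONE;

		/* This vq shares the irq line but has callbacks
		 * suppressed: don't bother the driver. */
		if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
			return IRQ_HANDLED;

		if (vq->vq.callback)
			vq->vq.callback(&vq->vq);

		return IRQ_HANDLED;
	}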

> The problem I have found is a regression for single  small
> message TCP_STEAM test. Old kernel works well for TCP_STREAM, only new
> kernel has problem.

Likely new kernel is faster :)

> For Steven's problem, it's multiple stream TCP_RR issues, the old guest
> doesn't perform well, so does new guest kernel. We tested reducing vhost
> signaling patch before, it didn't help the performance at all.
> 
> Thanks
> Shirley

Yes, it seems unrelated to tx interrupts.

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 15:42                                                     ` Shirley Ma
@ 2011-02-02 15:48                                                       ` Michael S. Tsirkin
  2011-02-02 17:12                                                         ` Shirley Ma
  2011-02-02 18:20                                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 15:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > > w/i guest change, I played around the parameters,for example: I
> > could
> > > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > > size,
> > > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> > CPU
> > > > usage. 
> > > 
> > > I meant w/o guest change, only vhost changes. Sorry about that.
> > > 
> > > Shirley
> > 
> > Ah, excellent. What were the parameters? 
> 
> I used half of the ring size 129 for packet counters, but the
> performance is still not as good as dropping packets on guest, 3.7 Gb/s
> vs. 6.2Gb/s.
> 
> Shirley

And this is with sndbuf=0 in the host, yes?
And do you see a lot of tx interrupts?
How many packets per interrupt?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 15:47                                                   ` Michael S. Tsirkin
@ 2011-02-02 17:10                                                     ` Shirley Ma
  2011-02-02 17:32                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 17:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> > On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > > Yes, I think doing this in the host is much simpler,
> > > just send an interrupt after there's a decent amount
> > > of space in the queue.
> > > 
> > > Having said that the simple heuristic that I coded
> > > might be a bit too simple.
> > 
> > >From the debugging out what I have seen so far (a single small
> message
> > TCP_STEAM test), I think the right approach is to patch both guest
> and
> > vhost.
> 
> One problem is slowing down the guest helps here.
> So there's a chance that just by adding complexity
> in guest driver we get a small improvement :(
> 
> We can't rely on a patched guest anyway, so
> I think it is best to test guest and host changes separately.
> 
> And I do agree something needs to be done in guest too,
> for example when vqs share an interrupt, we
> might invoke a callback when we see vq is not empty
> even though it's not requested. Probably should
> check interrupts enabled here?

Yes, I modified the xmit callback to something like this:

static void skb_xmit_done(struct virtqueue *svq)
{
        struct virtnet_info *vi = svq->vdev->priv;

        /* Suppress further interrupts. */
        virtqueue_disable_cb(svq);

        /* We were probably waiting for more output buffers. */
        if (netif_queue_stopped(vi->dev)) {
                free_old_xmit_skbs(vi);
                if (virtqueue_free_size(svq)  <= svq->vring.num / 2) {
                        virtqueue_enable_cb(svq);
			return;
		}
        }
	netif_wake_queue(vi->dev);
}

> > The problem I have found is a regression for single  small
> > message TCP_STEAM test. Old kernel works well for TCP_STREAM, only
> new
> > kernel has problem.
> 
> Likely new kernel is faster :)

> > For Steven's problem, it's multiple stream TCP_RR issues, the old
> guest
> > doesn't perform well, so does new guest kernel. We tested reducing
> vhost
> > signaling patch before, it didn't help the performance at all.
> > 
> > Thanks
> > Shirley
> 
> Yes, it seems unrelated to tx interrupts. 

The issue is more likely related to latency. Do you have anything in
mind on how to reduce vhost latency?

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 15:48                                                       ` Michael S. Tsirkin
@ 2011-02-02 17:12                                                         ` Shirley Ma
  0 siblings, 0 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 17:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, 2011-02-02 at 17:48 +0200, Michael S. Tsirkin wrote:
> And this is with sndbuf=0 in host, yes?
> And do you see a lot of tx interrupts?
> How packets per interrupt?

Nope, sndbuf doesn't matter since I never hit the sock wmem condition in
vhost. I am still playing around; let me know what data you would like me
to collect.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 17:10                                                     ` Shirley Ma
@ 2011-02-02 17:32                                                       ` Michael S. Tsirkin
  2011-02-02 18:11                                                         ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 17:32 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> > > On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > > > Yes, I think doing this in the host is much simpler,
> > > > just send an interrupt after there's a decent amount
> > > > of space in the queue.
> > > > 
> > > > Having said that the simple heuristic that I coded
> > > > might be a bit too simple.
> > > 
> > > From the debugging output I have seen so far (a single small
> > > message TCP_STREAM test), I think the right approach is to patch
> > > both guest and vhost.
> > 
> > One problem is slowing down the guest helps here.
> > So there's a chance that just by adding complexity
> > in guest driver we get a small improvement :(
> > 
> > We can't rely on a patched guest anyway, so
> > I think it is best to test guest and host changes separately.
> > 
> > And I do agree something needs to be done in guest too,
> > for example when vqs share an interrupt, we
> > might invoke a callback when we see vq is not empty
> > even though it's not requested. Probably should
> > check interrupts enabled here?
> 
> Yes, I modified xmit callback something like below:
> 
> static void skb_xmit_done(struct virtqueue *svq)
> {
>         struct virtnet_info *vi = svq->vdev->priv;
> 
>         /* Suppress further interrupts. */
>         virtqueue_disable_cb(svq);
> 
>         /* We were probably waiting for more output buffers. */
>         if (netif_queue_stopped(vi->dev)) {
>                 free_old_xmit_skbs(vi);
>                 if (virtqueue_free_size(svq)  <= svq->vring.num / 2) {
>                         virtqueue_enable_cb(svq);
> 			return;
> 		}
>         }
> 	netif_wake_queue(vi->dev);
> }

OK, but this should have no effect with a vhost patch
which should ensure that we don't get an interrupt
until the queue is at least half empty.
Right?
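
For illustration, a rough sketch of that kind of heuristic in vhost's
handle_tx() -- this is not the actual patch, and the 'pending' counter is
made up for the example:

	/* after each vhost_add_used(), count completions and only
	 * consider signalling once half a ring's worth has built up */
	if (++pending >= vq->num / 2) {
		vhost_signal(&net->dev, vq);
		pending = 0;
	}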

> > > The problem I have found is a regression for a single small
> > > message TCP_STREAM test. The old kernel works well for TCP_STREAM;
> > > only the new kernel has the problem.
> > 
> > Likely new kernel is faster :)
> 
> > > For Steven's problem, it's a multiple-stream TCP_RR issue; the old
> > > guest doesn't perform well, and neither does the new guest kernel. We
> > > tested a patch to reduce vhost signaling before; it didn't help the
> > > performance at all.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Yes, it seems unrelated to tx interrupts. 
> 
> The issue is more likely related to latency.

Could be. Why do you think so?

> Do you have anything in
> mind on how to reduce vhost latency?
> 
> Thanks
> Shirley

Hmm, bypassing the bridge might help a bit.
Are you using tap+bridge or macvtap?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 17:32                                                       ` Michael S. Tsirkin
@ 2011-02-02 18:11                                                         ` Shirley Ma
  2011-02-02 18:27                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 18:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> OK, but this should have no effect with a vhost patch
> which should ensure that we don't get an interrupt
> until the queue is at least half empty.
> Right?

There should be some coordination between guest and vhost. We shouldn't
count the TX packets while the netif queue is enabled, since the next guest
TX xmit will free any used buffers in vhost. We need to be careful here in
case we miss an interrupt after the netif queue has stopped.

However, we can't change old guests, so we can test the patches separately:
guest only, vhost only, and the combination.

> > > 
> > > Yes, it seems unrelated to tx interrupts. 
> > 
> > The issue is more likely related to latency.
> 
> Could be. Why do you think so?

Since I played with a latency hack, I can see a performance difference for
different latencies.

> > Do you have anything in
> > mind on how to reduce vhost latency?
> > 
> > Thanks
> > Shirley
> 
> Hmm, bypassing the bridge might help a bit.
> Are you using tap+bridge or macvtap? 

I am using tap+bridge for the TCP_RR test; I think Steven tested macvtap
before. He might have some data from his workload performance
measurements.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 15:42                                                     ` Shirley Ma
  2011-02-02 15:48                                                       ` Michael S. Tsirkin
@ 2011-02-02 18:20                                                       ` Michael S. Tsirkin
  2011-02-02 18:26                                                         ` Shirley Ma
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 18:20 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > > w/i guest change, I played around the parameters,for example: I
> > could
> > > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > > size,
> > > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> > CPU
> > > > usage. 
> > > 
> > > I meant w/o guest change, only vhost changes. Sorry about that.
> > > 
> > > Shirley
> > 
> > Ah, excellent. What were the parameters? 
> 
> I used half of the ring size 129 for packet counters,
> but the
> performance is still not as good as dropping packets on guest, 3.7 Gb/s
> vs. 6.2Gb/s.
> 
> Shirley

How many packets and bytes per interrupt are sent?
Also, what about other values for the counter, and other counters?

What does your patch do? Just drop packets instead of
stopping the interface?

To understand when we should drop packets
in the guest, we need to know *why* it helps.
Otherwise, how do we know it will work for others?
Note that the qdisc will drop packets when it overruns -
so what is different? Also, are we over-running some other queue
somewhere?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 18:20                                                       ` Michael S. Tsirkin
@ 2011-02-02 18:26                                                         ` Shirley Ma
  0 siblings, 0 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 18:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm,
	mashirle, netdev

On Wed, 2011-02-02 at 20:20 +0200, Michael S. Tsirkin wrote:
> How many packets and bytes per interrupt are sent?
> Also, what about other values for the counters and other counters?
> 
> What does your patch do? Just drop packets instead of
> stopping the interface?
> 
> To have an understanding when should we drop packets
> in the guest, we need to know *why* does it help.
> Otherwise, how do we know it will work for others?
> Note that qdisc will drop packets when it overruns -
> so what is different? Also, are we over-running some other queue
> somewhere? 

Agreed. I am trying to put more debugging output to look for all these
answers.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 18:11                                                         ` Shirley Ma
@ 2011-02-02 18:27                                                           ` Michael S. Tsirkin
  2011-02-02 19:29                                                             ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 18:27 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > OK, but this should have no effect with a vhost patch
> > which should ensure that we don't get an interrupt
> > until the queue is at least half empty.
> > Right?
> 
> There should be some coordination between guest and vhost.

What kind of coordination? With a patched vhost and a full ring,
you should get an interrupt per 100 packets.
Is this what you see? And if yes, isn't the guest patch
doing nothing then?

> We shouldn't
> count the TX packets when netif queue is enabled since next guest TX
> xmit will free any used buffers in vhost. We need to be careful here in
> case we miss the interrupts when netif queue has stopped.
> 
> However we can't change old guest so we can test the patches separately
> for guest only, vhost only, and the combination.
> 
> > > > 
> > > > Yes, it seems unrelated to tx interrupts. 
> > > 
> > > The issue is more likely related to latency.
> > 
> > Could be. Why do you think so?
> 
> Since I played with latency hack, I can see performance difference for
> different latency.

Which hack was that?

> > > Do you have anything in
> > > mind on how to reduce vhost latency?
> > > 
> > > Thanks
> > > Shirley
> > 
> > Hmm, bypassing the bridge might help a bit.
> > Are you using tap+bridge or macvtap? 
> 
> I am using tap+bridge for TCP_RR test, I think Steven tested macvtap
> before. He might have some data from his workload performance
> measurement.
> 
> Shirley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-01-25 21:09 Network performance with small packets Steve Dobbelstein
  2011-01-26 15:17 ` Michael S. Tsirkin
@ 2011-02-02 18:38 ` Michael S. Tsirkin
  2011-02-02 19:15   ` Steve Dobbelstein
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 18:38 UTC (permalink / raw)
  To: Steve Dobbelstein; +Cc: kvm

On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote:
> 
> I am working on a KVM network performance issue found in our lab running
> the DayTrader benchmark.  The benchmark throughput takes a significant hit
> when running the application server in a KVM guest verses on bare metal.
> We have dug into the problem and found that DayTrader's use of small
> packets exposes KVM's overhead of handling network packets.  I have been
> able to reproduce the performance hit with a simpler setup using the
> netperf benchmark with the TCP_RR test and the request and response sizes
> set to 256 bytes.  I run the benchmark between two physical systems, each
> using a 1GB link.  In order to get the maximum throughput for the system I
> have to run 100 instances of netperf.  When I run the netserver processes
> in a guest, I see a maximum throughput that is 51% of what I get if I run
> the netserver processes directly on the host.  The CPU utilization in the
> guest is only 85% at maximum throughput, whereas it is 100% on bare metal.

You are stressing the scheduler pretty hard with this test :)
Is your real benchmark also using a huge number of threads?
If it's not, you might be seeing a different issue.
IOW, the netperf degradation might not be network-related at all,
but have to do with the speed of context switching in the guest.
Thoughts?

> The KVM host has 16 CPUs.  The KVM guest is configured with 2 VCPUs.  When
> I run netperf on the host I boot the host with maxcpus=2 on the kernel
> command line.  The host is running the current KVM upstream kernel along
> with the current upstream qemu.  Here is the qemu command used to launch
> the guest:
> /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on
>  -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2
> -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize
> 
> We have tried various proposed fixes, each with varying amounts of success.
> One such fix was to add code to the vhost thread such that when it found
> the work queue empty it wouldn't just exit the thread but rather would
> delay for 50 microseconds and then recheck the queue.  If there was work on
> the queue it would loop back and process it, else it would exit the thread.
> The change got us a 13% improvement in the DayTrader throughput.
> 
> Running the same netperf configuration on the same hardware but using a
> different hypervisor gets us significantly better throughput numbers.   The
> guest on that hypervisor runs at 100% CPU utilization.  The various fixes
> we have tried have not gotten us close to the throughput seen on the other
> hypervisor.  I'm looking for ideas/input from the KVM experts on how to
> make KVM perform better when handling small packets.
> 
> Thanks,
> Steve
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 18:38 ` Michael S. Tsirkin
@ 2011-02-02 19:15   ` Steve Dobbelstein
  0 siblings, 0 replies; 88+ messages in thread
From: Steve Dobbelstein @ 2011-02-02 19:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm

"Michael S. Tsirkin" <mst@redhat.com> wrote on 02/02/2011 12:38:47 PM:

> On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote:
> >
> > I am working on a KVM network performance issue found in our lab
running
> > the DayTrader benchmark.  The benchmark throughput takes a significant
hit
> > when running the application server in a KVM guest verses on bare
metal.
> > We have dug into the problem and found that DayTrader's use of small
> > packets exposes KVM's overhead of handling network packets.  I have
been
> > able to reproduce the performance hit with a simpler setup using the
> > netperf benchmark with the TCP_RR test and the request and response
sizes
> > set to 256 bytes.  I run the benchmark between two physical systems,
each
> > using a 1GB link.  In order to get the maximum throughput for the
system I
> > have to run 100 instances of netperf.  When I run the netserver
processes
> > in a guest, I see a maximum throughput that is 51% of what I get if I
run
> > the netserver processes directly on the host.  The CPU utilization in
the
> > guest is only 85% at maximum throughput, whereas it is 100% on bare
metal.
>
> You are stressing the scheduler pretty hard with this test :)
> Is your real benchmark also using a huge number of threads?

Yes.  The real benchmark has 60 threads handling client requests and 48
threads talking to a database server.

> If it's not, you might be seeing a different issue.
> IOW, the netperf degradation might not be network-related at all,
> but have to do with speed of context switch in guest.
> Thoughts?

Yes, context switches can add to the overhead.  We have that data captured,
and I can look at it.  What makes me think that's not the issue is that the
CPU utilization in the guest is only about 85% at maximum throughput.
Throughput/CPU is comparable to a different hypervisor, but that hypervisor
runs at full CPU utilization and gets better throughput.  I can't help but
think KVM would get better throughput if it could just keep the guest VCPUs
busy.

Recently I have been playing with different CPU pinnings for the guest
VCPUs and the vhost thread.  Certain combinations can get us up to a 35%
improvement in throughput with the same throughput/CPU ratio.  CPU
utilization was 94% -- not full CPU utilization, but it does illustrate
that we can get better throughput if we keep the guest VCPUs busy.  At this
point it's looking more like a scheduler issue.  We're starting to dig
through the scheduler code for clues.
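
For reference, one way to do that kind of pinning from the host (the core
numbers and thread IDs below are placeholders, not necessarily what we
used):

taskset -pc 0 <qemu VCPU 0 thread ID>
taskset -pc 1 <qemu VCPU 1 thread ID>
taskset -pc 2 <vhost thread ID>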

Steve D.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 18:27                                                           ` Michael S. Tsirkin
@ 2011-02-02 19:29                                                             ` Shirley Ma
  2011-02-02 20:17                                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 19:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> > On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > > OK, but this should have no effect with a vhost patch
> > > which should ensure that we don't get an interrupt
> > > until the queue is at least half empty.
> > > Right?
> > 
> > There should be some coordination between guest and vhost.
> 
> What kind of coordination? With a patched vhost, and a full ring.
> you should get an interrupt per 100 packets.
> Is this what you see? And if yes, isn't the guest patch
> doing nothing then?

vhost_signal won't be able to send any TX interrupts to the guest when the
guest TX interrupt is disabled. The guest TX interrupt is only enabled when
the guest runs out of descriptors.
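
(For reference, roughly how that suppression works -- a simplified sketch,
not the exact vhost code; avail_flags stands for the flags value vhost
reads from the guest's avail ring:)

	/* guest side: virtqueue_disable_cb() sets the flag in the avail ring */
	vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;

	/* host side: vhost only injects a TX interrupt if the guest
	 * left the flag clear */
	if (!(avail_flags & VRING_AVAIL_F_NO_INTERRUPT))
		eventfd_signal(vq->call_ctx, 1);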

> > We shouldn't
> > count the TX packets when netif queue is enabled since next guest TX
> > xmit will free any used buffers in vhost. We need to be careful here
> in
> > case we miss the interrupts when netif queue has stopped.
> > 
> > However we can't change old guest so we can test the patches
> separately
> > for guest only, vhost only, and the combination.
> > 
> > > > > 
> > > > > Yes, it seems unrelated to tx interrupts. 
> > > > 
> > > > The issue is more likely related to latency.
> > > 
> > > Could be. Why do you think so?
> > 
> > Since I played with latency hack, I can see performance difference
> for
> > different latency.
> 
> Which hack was that? 

I tried accumulating multiple guest-to-host notifications for TX xmits;
it did help the multiple-stream TCP_RR results. I also forced vhost
handle_tx to handle more packets; both hacks seemed to help.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 19:29                                                             ` Shirley Ma
@ 2011-02-02 20:17                                                               ` Michael S. Tsirkin
  2011-02-02 21:03                                                                 ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 20:17 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> > > On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > > > OK, but this should have no effect with a vhost patch
> > > > which should ensure that we don't get an interrupt
> > > > until the queue is at least half empty.
> > > > Right?
> > > 
> > > There should be some coordination between guest and vhost.
> > 
> > What kind of coordination? With a patched vhost, and a full ring.
> > you should get an interrupt per 100 packets.
> > Is this what you see? And if yes, isn't the guest patch
> > doing nothing then?
> 
> vhost_signal won't be able send any TX interrupts to guest when guest TX
> interrupt is disabled. Guest TX interrupt is only enabled when running
> out of descriptors.

Well, this is also the only case where the queue is stopped, no?

> > > We shouldn't
> > > count the TX packets when netif queue is enabled since next guest TX
> > > xmit will free any used buffers in vhost. We need to be careful here
> > in
> > > case we miss the interrupts when netif queue has stopped.
> > > 
> > > However we can't change old guest so we can test the patches
> > separately
> > > for guest only, vhost only, and the combination.
> > > 
> > > > > > 
> > > > > > Yes, it seems unrelated to tx interrupts. 
> > > > > 
> > > > > The issue is more likely related to latency.
> > > > 
> > > > Could be. Why do you think so?
> > > 
> > > Since I played with latency hack, I can see performance difference
> > for
> > > different latency.
> > 
> > Which hack was that? 
> 
> I tried to accumulate multiple guest to host notifications for TX xmits,
> it did help multiple streams TCP_RR results;

I don't see a point in delaying the used idx update, do you?
So delaying just the signal seems better, right?

> I also forced vhost
> handle_tx to handle more packets; both hack seemed help.
> 
> Thanks
> Shirley

Haven't noticed that part, how does your patch make it
handle more packets?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 20:17                                                               ` Michael S. Tsirkin
@ 2011-02-02 21:03                                                                 ` Shirley Ma
  2011-02-02 21:20                                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 21:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> Well, this is also the only case where the queue is stopped, no?
Yes. I got some debugging data; I saw that sometimes there were many
packets waiting to be freed in the guest between vhost_signal and the guest
xmit callback. Looks like too much time is spent between vhost_signal and
the guest xmit callback?

> > I tried to accumulate multiple guest to host notifications for TX
> xmits,
> > it did help multiple streams TCP_RR results;
> I don't see a point to delay used idx update, do you?

It might cause each vhost handle_tx run to process more packets.

> So delaying just signal seems better, right?

I think I need to define a test matrix to collect data for guest-to-host
TX xmit across the different tests.

Data to be collected:
---------------------
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting to be freed between vhost signaling and
the guest callback
7. performance results

Test
----
1. TCP_STREAM single stream test for 1K to 4K message size
2. TCP_RR (64 instance test): 128 - 1K request/response size

Different hacks
---------------
1. Baseline data (with the patch to fix the capacity check first;
free_old_xmit_skbs returns the number of skbs)

2. Drop packet data (will put some debugging in generic networking code)

3. Delay guest netif queue wake up until certain descriptors (1/2 ring
size, 1/4 ring size...) are available once the queue has stopped.

4. Accumulate more packets per vhost signal in handle_tx?

5. 3 & 4 combinations

6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 

7. Accumulate more packets per vhost handle_tx() by adding some delay?

> Haven't noticed that part, how does your patch make it
> handle more packets?

Added a delay in handle_tx().

What else?

It will take some time to do this.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 21:03                                                                 ` Shirley Ma
@ 2011-02-02 21:20                                                                   ` Michael S. Tsirkin
  2011-02-02 21:41                                                                     ` Shirley Ma
  2011-02-03  5:05                                                                     ` Shirley Ma
  0 siblings, 2 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-02 21:20 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > Well, this is also the only case where the queue is stopped, no?
> Yes. I got some debugging data, I saw that sometimes there were so many
> packets were waiting for free in guest between vhost_signal & guest xmit
> callback.

What does this mean?

> Looks like the time spent too long from vhost_signal to guest
> xmit callback?



> > > I tried to accumulate multiple guest to host notifications for TX
> > xmits,
> > > it did help multiple streams TCP_RR results;
> > I don't see a point to delay used idx update, do you?
> 
> It might cause per vhost handle_tx processed more packets.

I don't understand. It's a couple of writes - what is the issue?

> > So delaying just signal seems better, right?
> 
> I think I need to define the test matrix to collect data for TX xmit
> from guest to host here for different tests.
> 
> Data to be collected:
> ---------------------
> 1. kvm_stat for VM, I/O exits
> 2. cpu utilization for both guest and host
> 3. cat /proc/interrupts on guest
> 4. packets rate from vhost handle_tx per loop
> 5. guest netif queue stop rate
> 6. how many packets are waiting for free between vhost signaling and
> guest callback
> 7. performance results
> 
> Test
> ----
> 1. TCP_STREAM single stream test for 1K to 4K message size
> 2. TCP_RR (64 instance test): 128 - 1K request/response size
> 
> Different hacks
> ---------------
> 1. Base line data ( with the patch to fix capacity check first,
> free_old_xmit_skbs returns number of skbs)
> 
> 2. Drop packet data (will put some debugging in generic networking code)
> 
> 3. Delay guest netif queue wake up until certain descriptors (1/2 ring
> size, 1/4 ring size...) are available once the queue has stopped.
> 
> 4. Accumulate more packets per vhost signal in handle_tx?
> 
> 5. 3 & 4 combinations
> 
> 6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 
> 
> 7. Accumulate more packets per vhost handle_tx() by adding some delay?
> 
> > Haven't noticed that part, how does your patch make it
> handle more packets?
> 
> Added a delay in handle_tx().
> 
> What else?
> 
> It would take sometimes to do this.
> 
> Shirley


Need to think about this.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 21:20                                                                   ` Michael S. Tsirkin
@ 2011-02-02 21:41                                                                     ` Shirley Ma
  2011-02-03  5:59                                                                       ` Michael S. Tsirkin
  2011-02-03  5:05                                                                     ` Shirley Ma
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-02 21:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > > Well, this is also the only case where the queue is stopped, no?
> > Yes. I got some debugging data, I saw that sometimes there were so
> many
> > packets were waiting for free in guest between vhost_signal & guest
> xmit
> > callback.
> 
> What does this mean?

Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full,
		enable_cb()

guest skb_xmit_done()
	disable_cb,
	printk free_old_xmit_skbs <-- it was between more than 1/2 and the full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled)
		signal guest to free xmit buffers

So between the guest filling the ring, stopping the queue, and enabling the
callback, and the guest receiving the callback from the host and running
free_old_xmit_skbs, there were between about half a ring and a full ring of
descriptors available. I thought there would be only a few. (I disabled
your vhost patch for this test.)
 

> > Looks like the time spent too long from vhost_signal to guest
> > xmit callback?
> 
> 
> 
> > > > I tried to accumulate multiple guest to host notifications for
> TX
> > > xmits,
> > > > it did help multiple streams TCP_RR results;
> > > I don't see a point to delay used idx update, do you?
> > 
> > It might cause per vhost handle_tx processed more packets.
> 
> I don't understand. It's a couple of writes - what is the issue?

Oh, handle_tx could process more packets per loop in the multiple-stream
TCP_RR case. I need to print out the data rate per loop to confirm this.

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 21:20                                                                   ` Michael S. Tsirkin
  2011-02-02 21:41                                                                     ` Shirley Ma
@ 2011-02-03  5:05                                                                     ` Shirley Ma
  2011-02-03  6:13                                                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-03  5:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > I think I need to define the test matrix to collect data for TX xmit
> > from guest to host here for different tests.
> > 
> > Data to be collected:
> > ---------------------
> > 1. kvm_stat for VM, I/O exits
> > 2. cpu utilization for both guest and host
> > 3. cat /proc/interrupts on guest
> > 4. packets rate from vhost handle_tx per loop
> > 5. guest netif queue stop rate
> > 6. how many packets are waiting for free between vhost signaling and
> > guest callback
> > 7. performance results
> > 
> > Test
> > ----
> > 1. TCP_STREAM single stream test for 1K to 4K message size
> > 2. TCP_RR (64 instance test): 128 - 1K request/response size
> > 
> > Different hacks
> > ---------------
> > 1. Base line data ( with the patch to fix capacity check first,
> > free_old_xmit_skbs returns number of skbs)
> > 
> > 2. Drop packet data (will put some debugging in generic networking
> code)

Since I found that the netif queue stop/wake-up is so expensive, I created
a packet-dropping patch on the guest side so I don't need to debug the
generic networking code.

guest start_xmit()
	capacity = free_old_xmit_skbs() + virtqueue_get_num_freed()
	if (capacity == 0) {
		drop this packet;
		return;
	}

In the patch, both guest TX interrupts and the callback have been omitted.
The host-side vhost_signal in handle_tx can be removed entirely as well.
(A new virtio_ring API is needed to export the total number of free
descriptors here -- virtqueue_get_num_freed.)
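
A minimal sketch of what that helper could look like in virtio_ring.c
(assuming it simply exposes the ring's current free count):

unsigned int virtqueue_get_num_freed(struct virtqueue *_vq)
{
	struct vring_virtqueue *vq = to_vvq(_vq);

	/* number of descriptors currently free in this ring */
	return vq->num_free;
}
EXPORT_SYMBOL_GPL(virtqueue_get_num_freed);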

Initial TCP_STREAM performance results I got for guest to local host:
4.2Gb/s for 1K message size (vs. 2.5Gb/s),
6.2Gb/s for 2K message size (vs. 3.8Gb/s), and
9.8Gb/s for 4K message size (vs. 5.xGb/s).

Since the large message size (64K) doesn't hit the (capacity == 0) case,
the performance is only a little better (from 13.x Gb/s to 14.x Gb/s).

kvm_stat output shows a significant reduction in exits for both VM and
I/O, and no guest TX interrupts.

With dropping packets, TCP retransmissions have increased here, so the
performance numbers vary.

This might not be a good solution, but it gave us some idea of how
expensive the netif queue stop/wake-up and guest/host notification are.

I couldn't find a better way to reduce the netif queue stop/wake-up rate
for small message sizes. But I think once we can address this, guest TX
performance for small message sizes will improve dramatically.

I also compared this with the approach of returning TX_BUSY when
(capacity == 0); it is not as good as dropping packets.

> > 3. Delay guest netif queue wake up until certain descriptors (1/2
> ring
> > size, 1/4 ring size...) are available once the queue has stopped.
> > 
> > 4. Accumulate more packets per vhost signal in handle_tx?
> > 
> > 5. 3 & 4 combinations
> > 
> > 6. Accumulate more packets per guest kick() (TCP_RR) by adding a
> timer? 
> > 
> > 7. Accumulate more packets per vhost handle_tx() by adding some
> delay?
> > 
> > > Haven't noticed that part, how does your patch make it
> > handle more packets?
> > 
> > Added a delay in handle_tx().
> > 
> > What else?
> > 
> > It would take sometimes to do this.
> > 
> > Shirley
> 
> 
> Need to think about this.
> 
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02 21:41                                                                     ` Shirley Ma
@ 2011-02-03  5:59                                                                       ` Michael S. Tsirkin
  2011-02-03  6:09                                                                         ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-03  5:59 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 01:41:33PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > > On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > > > Well, this is also the only case where the queue is stopped, no?
> > > Yes. I got some debugging data, I saw that sometimes there were so
> > many
> > > packets were waiting for free in guest between vhost_signal & guest
> > xmit
> > > callback.
> > 
> > What does this mean?
> 
> Let's look at the sequence here:
> 
> guest start_xmit()
> 	xmit_skb()
> 	if ring is full,
> 		enable_cb()
> 
> guest skb_xmit_done()
> 	disable_cb,
>         printk free_old_xmit_skbs <-- it was between more than 1/2 to
> full ring size
> 	printk vq->num_free 
> 
> vhost handle_tx()
> 	if (guest interrupt is enabled)
> 		signal guest to free xmit buffers
> 
> So between guest queue full/stopped queue/enable call back to guest
> receives the callback from host to free_old_xmit_skbs, there were about
> 1/2 to full ring size descriptors available. I thought there were only a
> few. (I disabled your vhost patch for this test.)


The expected number is vq->num - max skb frags - 2.
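(E.g., assuming a 256-entry ring and MAX_SKB_FRAGS = 18 on 4K pages, that
would be 256 - 18 - 2 = 236.)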

> 
> > > Looks like the time spent too long from vhost_signal to guest
> > > xmit callback?
> > 
> > 
> > 
> > > > > I tried to accumulate multiple guest to host notifications for
> > TX
> > > > xmits,
> > > > > it did help multiple streams TCP_RR results;
> > > > I don't see a point to delay used idx update, do you?
> > > 
> > > It might cause per vhost handle_tx processed more packets.
> > 
> > I don't understand. It's a couple of writes - what is the issue?
> 
> Oh, handle_tx could process more packets per loop for multiple streams
> TCP_RR case. I need to print out the data rate per loop to confirm this.
> 
> Shirley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-03  5:59                                                                       ` Michael S. Tsirkin
@ 2011-02-03  6:09                                                                         ` Shirley Ma
  2011-02-03  6:16                                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-03  6:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote:
> > Let's look at the sequence here:
> > 
> > guest start_xmit()
> >       xmit_skb()
> >       if ring is full,
> >               enable_cb()
> > 
> > guest skb_xmit_done()
> >       disable_cb,
> >         printk free_old_xmit_skbs <-- it was between more than 1/2
> to
> > full ring size
> >       printk vq->num_free 
> > 
> > vhost handle_tx()
> >       if (guest interrupt is enabled)
> >               signal guest to free xmit buffers
> > 
> > So between guest queue full/stopped queue/enable call back to guest
> > receives the callback from host to free_old_xmit_skbs, there were
> about
> > 1/2 to full ring size descriptors available. I thought there were
> only a
> > few. (I disabled your vhost patch for this test.)
> 
> 
> The expected number is vq->num - max skb frags - 2. 

It varied (up to the ring size, 256). This is using indirect buffers; it
returned how many descriptors were freed, not the number of buffers.

Why do you think it is vq->num - max skb frags - 2 here?

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-03  5:05                                                                     ` Shirley Ma
@ 2011-02-03  6:13                                                                       ` Michael S. Tsirkin
  2011-02-03 15:58                                                                         ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-03  6:13 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 09:05:56PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > > I think I need to define the test matrix to collect data for TX xmit
> > > from guest to host here for different tests.
> > > 
> > > Data to be collected:
> > > ---------------------
> > > 1. kvm_stat for VM, I/O exits
> > > 2. cpu utilization for both guest and host
> > > 3. cat /proc/interrupts on guest
> > > 4. packets rate from vhost handle_tx per loop
> > > 5. guest netif queue stop rate
> > > 6. how many packets are waiting for free between vhost signaling and
> > > guest callback
> > > 7. performance results
> > > 
> > > Test
> > > ----
> > > 1. TCP_STREAM single stream test for 1K to 4K message size
> > > 2. TCP_RR (64 instance test): 128 - 1K request/response size
> > > 
> > > Different hacks
> > > ---------------
> > > 1. Base line data ( with the patch to fix capacity check first,
> > > free_old_xmit_skbs returns number of skbs)
> > > 
> > > 2. Drop packet data (will put some debugging in generic networking
> > code)
> 
> Since I found that the netif queue stop/wake up is so expensive, I
> created a dropping packets patch on guest side so I don't need to debug
> generic networking code.
> 
> guest start_xmit()
> 	capacity = free_old_xmit_skb() + virtqueue_get_num_freed()
> 	if (capacity == 0)
> 		drop this packet;
> 		return;
> 
> In the patch, both guest TX interrupts and callback have been omitted.
> Host vhost_signal in handle_tx can totally be removed as well. (A new
> virtio_ring API is needed for exporting total of num_free descriptors
> here -- virtioqueue_get_num_freed)
> 
> Initial TCP_STREAM performance results I got for guest to local host 
> 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
> 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
> 9.8Gb/s for 4K message size. (vs.5.xGb/s)

What is the average packet size, # bytes per ack, and the # of interrupts
per packet? It could be that just slowing down transmission
makes GSO work better.

> Since large message size (64K) doesn't hit (capacity == 0) case, so the
> performance only has a little better. (from 13.xGb/s to 14.x Gb/s)
> 
> kvm_stat output shows significant exits reduction for both VM and I/O,
> no guest TX interrupts.
> 
> With dropping packets, TCP retrans has been increased here, so I can see
> performance numbers are various.
> 
> This might be not a good solution, but it gave us some ideas on
> expensive netif queue stop/wake up between guest and host notification.
> 
> I couldn't find a better solution on how to reduce netif queue stop/wake
> up rate for small message size. But I think once we can address this,
> the guest TX performance will burst for small message size.
> 
> I also compared this with return TX_BUSY approach when (capacity == 0),
> it is not as good as dropping packets.
> 
> > > 3. Delay guest netif queue wake up until certain descriptors (1/2
> > ring
> > > size, 1/4 ring size...) are available once the queue has stopped.
> > > 
> > > 4. Accumulate more packets per vhost signal in handle_tx?
> > > 
> > > 5. 3 & 4 combinations
> > > 
> > > 6. Accumulate more packets per guest kick() (TCP_RR) by adding a
> > timer? 
> > > 
> > > 7. Accumulate more packets per vhost handle_tx() by adding some
> > delay?
> > > 
> > > > Haven't noticed that part, how does your patch make it
> > > handle more packets?
> > > 
> > > Added a delay in handle_tx().
> > > 
> > > What else?
> > > 
> > > It would take sometimes to do this.
> > > 
> > > Shirley
> > 
> > 
> > Need to think about this.
> > 
> > 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-03  6:09                                                                         ` Shirley Ma
@ 2011-02-03  6:16                                                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-03  6:16 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Wed, Feb 02, 2011 at 10:09:14PM -0800, Shirley Ma wrote:
> On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote:
> > > Let's look at the sequence here:
> > > 
> > > guest start_xmit()
> > >       xmit_skb()
> > >       if ring is full,
> > >               enable_cb()
> > > 
> > > guest skb_xmit_done()
> > >       disable_cb,
> > >         printk free_old_xmit_skbs <-- it was between more than 1/2
> > to
> > > full ring size
> > >       printk vq->num_free 
> > > 
> > > vhost handle_tx()
> > >       if (guest interrupt is enabled)
> > >               signal guest to free xmit buffers
> > > 
> > > So between guest queue full/stopped queue/enable call back to guest
> > > receives the callback from host to free_old_xmit_skbs, there were
> > about
> > > 1/2 to full ring size descriptors available. I thought there were
> > only a
> > > few. (I disabled your vhost patch for this test.)
> > 
> > 
> > The expected number is vq->num - max skb frags - 2. 
> 
> It was various (up to the ring size 256). This is using indirection
> buffers, it returned how many freed descriptors, not number of buffers.
> 
> Why do you think it is vq->num - max skb frags - 2 here?
> 
> Shirley

well queue is stopped which happens when

        if (capacity < 2+MAX_SKB_FRAGS) {
                netif_stop_queue(dev);
                if (unlikely(!virtqueue_enable_cb(vi->svq))) {
                        /* More just got used, free them then recheck.
 * */
                        capacity += free_old_xmit_skbs(vi);
                        if (capacity >= 2+MAX_SKB_FRAGS) {
                                netif_start_queue(dev);
                                virtqueue_disable_cb(vi->svq);
                        }
                }
        }

This should be the most common case.
I guess the case with += free_old_xmit_skbs is what can get us more.
But it should be rare. Can you count how common it is?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-03  6:13                                                                       ` Michael S. Tsirkin
@ 2011-02-03 15:58                                                                         ` Shirley Ma
  2011-02-03 16:20                                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-02-03 15:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote:
> > Initial TCP_STREAM performance results I got for guest to local
> host 
> > 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
> > 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
> > 9.8Gb/s for 4K message size. (vs.5.xGb/s)
> 
> What is the average packet size, # bytes per ack, and the # of
> interrupts
> per packet? It could be that just slowing down trahsmission
> makes GSO work better. 

There are no TX interrupts with packet dropping.

GSO/TSO is the key for small-message performance; w/o GSO/TSO, the
performance is limited to about 2Gb/s no matter how big the message size
is. I think any work we try here will increase the large-packet rate. BTW,
with packet dropping, TCP increased fast retransmits, not slow starts.

I will collect tcpdump and netstat data before and after, to compare
packet size/rate with and without the patch.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-03 15:58                                                                         ` Shirley Ma
@ 2011-02-03 16:20                                                                           ` Michael S. Tsirkin
  2011-02-03 17:18                                                                             ` Shirley Ma
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-03 16:20 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Thu, Feb 03, 2011 at 07:58:00AM -0800, Shirley Ma wrote:
> On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote:
> > > Initial TCP_STREAM performance results I got for guest to local
> > host 
> > > 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
> > > 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
> > > 9.8Gb/s for 4K message size. (vs.5.xGb/s)
> > 
> > What is the average packet size, # bytes per ack, and the # of
> > interrupts
> > per packet? It could be that just slowing down trahsmission
> > makes GSO work better. 
> 
> There is no TX interrupts with dropping packet.
> 
> GSO/TSO is the key for small message performance, w/o GSO/TSO, the
> performance is limited to about 2Gb/s no matter how big the message size
> it is. I think any work we try here will increase large packet size
> rate. BTW for dropping packet, TCP increased fast retrans, not slow
> start. 
> 
> I will collect tcpdump, netstart before and after data to compare packet
> size/rate w/o w/i the patch.
> 
> Thanks
> Shirley

Just a thought: does it help to make tx queue len of the
virtio device smaller?
E.g. match the vq size?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-03 16:20                                                                           ` Michael S. Tsirkin
@ 2011-02-03 17:18                                                                             ` Shirley Ma
  0 siblings, 0 replies; 88+ messages in thread
From: Shirley Ma @ 2011-02-03 17:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein

On Thu, 2011-02-03 at 18:20 +0200, Michael S. Tsirkin wrote:
> Just a thought: does it help to make tx queue len of the
> virtio device smaller? 

Yes, that's what I did before; reducing txqueuelen causes the qdisc to
drop packets early. But it's hard to get a performance gain by controlling
txqueuelen: I tried it on different systems, and it required different
values.
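
For reference, this is the kind of tuning I mean, e.g. in the guest:

	ip link set dev eth0 txqueuelen 256

but the value that helped was different on each system.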

Also, I tried another patch: instead of dropping packets, I used a timer
(2 jiffies) to enable/disable the queue on the guest without interrupt
notification. It gets better performance than the original, but worse than
dropping packets, because netif stop/wake-up happens too often.

vhost definitely needs improvement for handling small message sizes. It's
unable to handle the small-message packet rate for queue size 256, even
with ring size 1024. QEMU does not seem to allow increasing the TX ring
size to 2K (qemu-kvm fails to start with no errors), so I am not able to
test that.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-02  4:42                                   ` Michael S. Tsirkin
@ 2011-02-09  0:37                                     ` Rusty Russell
  2011-02-09  0:53                                       ` Michael S. Tsirkin
  2011-03-08 21:57                                       ` Shirley Ma
  0 siblings, 2 replies; 88+ messages in thread
From: Rusty Russell @ 2011-02-09  0:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, Shirley Ma, netdev, steved

On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> > >
> > > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > > Confused. We compare capacity to skb frags, no?
> > > > > That's sg I think ...
> > > >
> > > > Current guest kernel use indirect buffers, num_free returns how many
> > > > available descriptors not skb frags. So it's wrong here.
> > > >
> > > > Shirley
> > >
> > > I see. Good point. In other words when we complete the buffer
> > > it was indirect, but when we add a new one we
> > > can not allocate indirect so we consume.
> > > And then we start the queue and add will fail.
> > > I guess we need some kind of API to figure out
> > > whether the buf we complete was indirect?

I've finally read this thread... I think we need to get more serious
with our stats gathering to diagnose these kinds of performance issues.

This is a start; it should tell us what is actually happening to the
virtio ring(s) without significant performance impact...

Subject: virtio: CONFIG_VIRTIO_STATS

For performance problems we'd like to know exactly what the ring looks
like.  This patch adds stats indexed by how-full-ring-is; we could extend
it to also record them by how-used-ring-is if we need.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -7,6 +7,14 @@ config VIRTIO_RING
 	tristate
 	depends on VIRTIO
 
+config VIRTIO_STATS
+	bool "Virtio debugging stats (EXPERIMENTAL)"
+	depends on VIRTIO_RING
+	select DEBUG_FS
+	---help---
+	  Virtio stats collected by how full the ring is at any time,
+	  presented under debugfs/virtio/<name>-<vq>/<num-used>/
+
 config VIRTIO_PCI
 	tristate "PCI driver for virtio devices (EXPERIMENTAL)"
 	depends on PCI && EXPERIMENTAL
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -21,6 +21,7 @@
 #include <linux/virtio_config.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/debugfs.h>
 
 /* virtio guest is communicating with a virtual "device" that actually runs on
  * a host processor.  Memory barriers are used to control SMP effects. */
@@ -95,6 +96,11 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+#ifdef CONFIG_VIRTIO_STATS
+	struct vring_stat *stats;
+	struct dentry *statdir;
+#endif
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -106,6 +112,87 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+#ifdef CONFIG_VIRTIO_STATS
+/* We have an array of these, indexed by how full the ring is. */
+struct vring_stat {
+	/* How many interrupts? */
+	size_t interrupt_nowork, interrupt_work;
+	/* How many non-notify kicks, how many notify kicks, how many add notify? */
+	size_t kick_no_notify, kick_notify, add_notify;
+	/* How many adds? */
+	size_t add_direct, add_indirect, add_fail;
+	/* How many gets? */
+	size_t get;
+	/* How many disable callbacks? */
+	size_t disable_cb;
+	/* How many enables? */
+	size_t enable_cb_retry, enable_cb_success;
+};
+
+static struct dentry *virtio_stats;
+
+static void create_stat_files(struct vring_virtqueue *vq)
+{
+	char name[80];
+	unsigned int i;
+
+	/* Racy in theory, but we don't care. */
+	if (!virtio_stats)
+		virtio_stats = debugfs_create_dir("virtio-stats", NULL);
+
+	sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name);
+	vq->statdir = debugfs_create_dir(name, virtio_stats);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		struct dentry *dir;
+
+		sprintf(name, "%i", i);
+		dir = debugfs_create_dir(name, vq->statdir);
+		debugfs_create_size_t("interrupt_nowork", 0400, dir,
+				      &vq->stats[i].interrupt_nowork);
+		debugfs_create_size_t("interrupt_work", 0400, dir,
+				      &vq->stats[i].interrupt_work);
+		debugfs_create_size_t("kick_no_notify", 0400, dir,
+				      &vq->stats[i].kick_no_notify);
+		debugfs_create_size_t("kick_notify", 0400, dir,
+				      &vq->stats[i].kick_notify);
+		debugfs_create_size_t("add_notify", 0400, dir,
+				      &vq->stats[i].add_notify);
+		debugfs_create_size_t("add_direct", 0400, dir,
+				      &vq->stats[i].add_direct);
+		debugfs_create_size_t("add_indirect", 0400, dir,
+				      &vq->stats[i].add_indirect);
+		debugfs_create_size_t("add_fail", 0400, dir,
+				      &vq->stats[i].add_fail);
+		debugfs_create_size_t("get", 0400, dir,
+				      &vq->stats[i].get);
+		debugfs_create_size_t("disable_cb", 0400, dir,
+				      &vq->stats[i].disable_cb);
+		debugfs_create_size_t("enable_cb_retry", 0400, dir,
+				      &vq->stats[i].enable_cb_retry);
+		debugfs_create_size_t("enable_cb_success", 0400, dir,
+				      &vq->stats[i].enable_cb_success);
+	}
+}
+
+static void delete_stat_files(struct vring_virtqueue *vq)
+{
+	debugfs_remove_recursive(vq->statdir);
+}
+
+#define add_stat(vq, name)						\
+	do {								\
+		struct vring_virtqueue *_vq = (vq);			\
+		_vq->stats[_vq->num_free - _vq->vring.num].name++;	\
+	} while (0)
+
+#else
+#define add_stat(vq, name)
+static void delete_stat_files(struct vring_virtqueue *vq)
+{
+}
+#endif
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static int vring_add_indirect(struct vring_virtqueue *vq,
 			      struct scatterlist sg[],
@@ -121,6 +208,8 @@ static int vring_add_indirect(struct vri
 	if (!desc)
 		return -ENOMEM;
 
+	add_stat(vq, add_indirect);
+
 	/* Transfer entries from the sg list into the indirect page */
 	for (i = 0; i < out; i++) {
 		desc[i].flags = VRING_DESC_F_NEXT;
@@ -183,17 +272,22 @@ int virtqueue_add_buf_gfp(struct virtque
 	BUG_ON(out + in == 0);
 
 	if (vq->num_free < out + in) {
+		add_stat(vq, add_fail);
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 out + in, vq->num_free);
 		/* FIXME: for historical reasons, we force a notify here if
 		 * there are outgoing parts to the buffer.  Presumably the
 		 * host should service the ring ASAP. */
-		if (out)
+		if (out) {
+			add_stat(vq, add_notify);
 			vq->notify(&vq->vq);
+		}
 		END_USE(vq);
 		return -ENOSPC;
 	}
 
+	add_stat(vq, add_direct);
+
 	/* We're about to use some buffers from the free list. */
 	vq->num_free -= out + in;
 
@@ -248,9 +342,12 @@ void virtqueue_kick(struct virtqueue *_v
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
-	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
+	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
+		add_stat(vq, kick_notify);
 		/* Prod other side to tell it about changes. */
 		vq->notify(&vq->vq);
+	} else
+		add_stat(vq, kick_no_notify);
 
 	END_USE(vq);
 }
@@ -294,6 +391,8 @@ void *virtqueue_get_buf(struct virtqueue
 
 	START_USE(vq);
 
+	add_stat(vq, get);
+
 	if (unlikely(vq->broken)) {
 		END_USE(vq);
 		return NULL;
@@ -333,6 +432,7 @@ void virtqueue_disable_cb(struct virtque
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
+	add_stat(vq, disable_cb);
 	vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
@@ -348,10 +448,12 @@ bool virtqueue_enable_cb(struct virtqueu
 	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
 	virtio_mb();
 	if (unlikely(more_used(vq))) {
+		add_stat(vq, enable_cb_retry);
 		END_USE(vq);
 		return false;
 	}
 
+	add_stat(vq, enable_cb_success);
 	END_USE(vq);
 	return true;
 }
@@ -387,10 +489,12 @@ irqreturn_t vring_interrupt(int irq, voi
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
 	if (!more_used(vq)) {
+		add_stat(vq, interrupt_nowork);
 		pr_debug("virtqueue interrupt with no work for %p\n", vq);
 		return IRQ_NONE;
 	}
 
+	add_stat(vq, interrupt_work);
 	if (unlikely(vq->broken))
 		return IRQ_HANDLED;
 
@@ -451,6 +555,15 @@ struct virtqueue *vring_new_virtqueue(un
 	}
 	vq->data[i] = NULL;
 
+#ifdef CONFIG_VIRTIO_STATS
+	vq->stats = kzalloc(sizeof(*vq->stats) * num, GFP_KERNEL);
+	if (!vq->stats) {
+		kfree(vq);
+		return NULL;
+	}
+	create_stat_files(vq);
+#endif
+
 	return &vq->vq;
 }
 EXPORT_SYMBOL_GPL(vring_new_virtqueue);
@@ -458,6 +571,7 @@ EXPORT_SYMBOL_GPL(vring_new_virtqueue);
 void vring_del_virtqueue(struct virtqueue *vq)
 {
 	list_del(&vq->list);
+	delete_stat_files(to_vvq(vq));
 	kfree(to_vvq(vq));
 }
 EXPORT_SYMBOL_GPL(vring_del_virtqueue);

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-09  0:37                                     ` Rusty Russell
@ 2011-02-09  0:53                                       ` Michael S. Tsirkin
  2011-02-09  1:39                                         ` Rusty Russell
  2011-03-08 21:57                                       ` Shirley Ma
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-09  0:53 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar2, David Miller, kvm, Shirley Ma, netdev, steved

On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
> On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > > > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> > > >
> > > > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > > > Confused. We compare capacity to skb frags, no?
> > > > > > That's sg I think ...
> > > > >
> > > > > Current guest kernel use indirect buffers, num_free returns how many
> > > > > available descriptors not skb frags. So it's wrong here.
> > > > >
> > > > > Shirley
> > > >
> > > > I see. Good point. In other words when we complete the buffer
> > > > it was indirect, but when we add a new one we
> > > > can not allocate indirect so we consume.
> > > > And then we start the queue and add will fail.
> > > > I guess we need some kind of API to figure out
> > > > whether the buf we complete was indirect?
> 
> I've finally read this thread... I think we need to get more serious
> with our stats gathering to diagnose these kind of performance issues.
> 
> This is a start; it should tell us what is actually happening to the
> virtio ring(s) without significant performance impact...
> 
> Subject: virtio: CONFIG_VIRTIO_STATS
> 
> For performance problems we'd like to know exactly what the ring looks
> like.  This patch adds stats indexed by how-full-ring-is; we could extend
> it to also record them by how-used-ring-is if we need.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Not sure whether the intent is to merge this. If yes -
would it make sense to use tracing for this instead?
That's what kvm does.
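
For reference, a tracepoint-based variant could look roughly like the
sketch below.  The event name, fields and header location are purely
illustrative (this is not an existing kernel API), but the sketch shows
the shape of what "use tracing for this" would mean: one event per
virtqueue_kick(), recording how full the ring was and whether the host
was actually notified, consumed via ftrace/perf rather than per-slot
debugfs counters.

#undef TRACE_SYSTEM
#define TRACE_SYSTEM virtio

#if !defined(_TRACE_VIRTIO_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VIRTIO_H

#include <linux/tracepoint.h>

TRACE_EVENT(virtqueue_kick,
	TP_PROTO(const char *name, unsigned int in_use, bool notified),
	TP_ARGS(name, in_use, notified),
	TP_STRUCT__entry(
		__string(name, name)
		__field(unsigned int, in_use)
		__field(bool, notified)
	),
	TP_fast_assign(
		__assign_str(name, name);
		__entry->in_use = in_use;
		__entry->notified = notified;
	),
	TP_printk("vq %s in_use %u notified %d",
		  __get_str(name), __entry->in_use, __entry->notified)
);

#endif /* _TRACE_VIRTIO_H */

/* This must stay outside the multi-read guard. */
#include <trace/define_trace.h>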

> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -7,6 +7,14 @@ config VIRTIO_RING
>  	tristate
>  	depends on VIRTIO
>  
> +config VIRTIO_STATS
> +	bool "Virtio debugging stats (EXPERIMENTAL)"
> +	depends on VIRTIO_RING
> +	select DEBUG_FS
> +	---help---
> +	  Virtio stats collected by how full the ring is at any time,
> +	  presented under debugfs/virtio/<name>-<vq>/<num-used>/
> +
>  config VIRTIO_PCI
>  	tristate "PCI driver for virtio devices (EXPERIMENTAL)"
>  	depends on PCI && EXPERIMENTAL
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -21,6 +21,7 @@
>  #include <linux/virtio_config.h>
>  #include <linux/device.h>
>  #include <linux/slab.h>
> +#include <linux/debugfs.h>
>  
>  /* virtio guest is communicating with a virtual "device" that actually runs on
>   * a host processor.  Memory barriers are used to control SMP effects. */
> @@ -95,6 +96,11 @@ struct vring_virtqueue
>  	/* How to notify other side. FIXME: commonalize hcalls! */
>  	void (*notify)(struct virtqueue *vq);
>  
> +#ifdef CONFIG_VIRTIO_STATS
> +	struct vring_stat *stats;
> +	struct dentry *statdir;
> +#endif
> +
>  #ifdef DEBUG
>  	/* They're supposed to lock for us. */
>  	unsigned int in_use;
> @@ -106,6 +112,87 @@ struct vring_virtqueue
>  
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>  
> +#ifdef CONFIG_VIRTIO_STATS
> +/* We have an array of these, indexed by how full the ring is. */
> +struct vring_stat {
> +	/* How many interrupts? */
> +	size_t interrupt_nowork, interrupt_work;
> +	/* How many non-notify kicks, how many notify kicks, how many add notify? */
> +	size_t kick_no_notify, kick_notify, add_notify;
> +	/* How many adds? */
> +	size_t add_direct, add_indirect, add_fail;
> +	/* How many gets? */
> +	size_t get;
> +	/* How many disable callbacks? */
> +	size_t disable_cb;
> +	/* How many enables? */
> +	size_t enable_cb_retry, enable_cb_success;
> +};
> +
> +static struct dentry *virtio_stats;
> +
> +static void create_stat_files(struct vring_virtqueue *vq)
> +{
> +	char name[80];
> +	unsigned int i;
> +
> +	/* Racy in theory, but we don't care. */
> +	if (!virtio_stats)
> +		virtio_stats = debugfs_create_dir("virtio-stats", NULL);
> +
> +	sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name);
> +	vq->statdir = debugfs_create_dir(name, virtio_stats);
> +
> +	for (i = 0; i < vq->vring.num; i++) {
> +		struct dentry *dir;
> +
> +		sprintf(name, "%i", i);
> +		dir = debugfs_create_dir(name, vq->statdir);
> +		debugfs_create_size_t("interrupt_nowork", 0400, dir,
> +				      &vq->stats[i].interrupt_nowork);
> +		debugfs_create_size_t("interrupt_work", 0400, dir,
> +				      &vq->stats[i].interrupt_work);
> +		debugfs_create_size_t("kick_no_notify", 0400, dir,
> +				      &vq->stats[i].kick_no_notify);
> +		debugfs_create_size_t("kick_notify", 0400, dir,
> +				      &vq->stats[i].kick_notify);
> +		debugfs_create_size_t("add_notify", 0400, dir,
> +				      &vq->stats[i].add_notify);
> +		debugfs_create_size_t("add_direct", 0400, dir,
> +				      &vq->stats[i].add_direct);
> +		debugfs_create_size_t("add_indirect", 0400, dir,
> +				      &vq->stats[i].add_indirect);
> +		debugfs_create_size_t("add_fail", 0400, dir,
> +				      &vq->stats[i].add_fail);
> +		debugfs_create_size_t("get", 0400, dir,
> +				      &vq->stats[i].get);
> +		debugfs_create_size_t("disable_cb", 0400, dir,
> +				      &vq->stats[i].disable_cb);
> +		debugfs_create_size_t("enable_cb_retry", 0400, dir,
> +				      &vq->stats[i].enable_cb_retry);
> +		debugfs_create_size_t("enable_cb_success", 0400, dir,
> +				      &vq->stats[i].enable_cb_success);
> +	}
> +}
> +
> +static void delete_stat_files(struct vring_virtqueue *vq)
> +{
> +	debugfs_remove_recursive(vq->statdir);
> +}
> +
> +#define add_stat(vq, name)						\
> +	do {								\
> +		struct vring_virtqueue *_vq = (vq);			\
> +		_vq->stats[_vq->vring.num - _vq->num_free].name++;	\
> +	} while (0)
> +
> +#else
> +#define add_stat(vq, name)
> +static void delete_stat_files(struct vring_virtqueue *vq)
> +{
> +}
> +#endif
> +
>  /* Set up an indirect table of descriptors and add it to the queue. */
>  static int vring_add_indirect(struct vring_virtqueue *vq,
>  			      struct scatterlist sg[],
> @@ -121,6 +208,8 @@ static int vring_add_indirect(struct vri
>  	if (!desc)
>  		return -ENOMEM;
>  
> +	add_stat(vq, add_indirect);
> +
>  	/* Transfer entries from the sg list into the indirect page */
>  	for (i = 0; i < out; i++) {
>  		desc[i].flags = VRING_DESC_F_NEXT;
> @@ -183,17 +272,22 @@ int virtqueue_add_buf_gfp(struct virtque
>  	BUG_ON(out + in == 0);
>  
>  	if (vq->num_free < out + in) {
> +		add_stat(vq, add_fail);
>  		pr_debug("Can't add buf len %i - avail = %i\n",
>  			 out + in, vq->num_free);
>  		/* FIXME: for historical reasons, we force a notify here if
>  		 * there are outgoing parts to the buffer.  Presumably the
>  		 * host should service the ring ASAP. */
> -		if (out)
> +		if (out) {
> +			add_stat(vq, add_notify);
>  			vq->notify(&vq->vq);
> +		}
>  		END_USE(vq);
>  		return -ENOSPC;
>  	}
>  
> +	add_stat(vq, add_direct);
> +
>  	/* We're about to use some buffers from the free list. */
>  	vq->num_free -= out + in;
>  
> @@ -248,9 +342,12 @@ void virtqueue_kick(struct virtqueue *_v
>  	/* Need to update avail index before checking if we should notify */
>  	virtio_mb();
>  
> -	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
> +	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
> +		add_stat(vq, kick_notify);
>  		/* Prod other side to tell it about changes. */
>  		vq->notify(&vq->vq);
> +	} else
> +		add_stat(vq, kick_no_notify);
>  
>  	END_USE(vq);
>  }
> @@ -294,6 +391,8 @@ void *virtqueue_get_buf(struct virtqueue
>  
>  	START_USE(vq);
>  
> +	add_stat(vq, get);
> +
>  	if (unlikely(vq->broken)) {
>  		END_USE(vq);
>  		return NULL;
> @@ -333,6 +432,7 @@ void virtqueue_disable_cb(struct virtque
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  
> +	add_stat(vq, disable_cb);
>  	vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
> @@ -348,10 +448,12 @@ bool virtqueue_enable_cb(struct virtqueu
>  	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
>  	virtio_mb();
>  	if (unlikely(more_used(vq))) {
> +		add_stat(vq, enable_cb_retry);
>  		END_USE(vq);
>  		return false;
>  	}
>  
> +	add_stat(vq, enable_cb_success);
>  	END_USE(vq);
>  	return true;
>  }
> @@ -387,10 +489,12 @@ irqreturn_t vring_interrupt(int irq, voi
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  
>  	if (!more_used(vq)) {
> +		add_stat(vq, interrupt_nowork);
>  		pr_debug("virtqueue interrupt with no work for %p\n", vq);
>  		return IRQ_NONE;
>  	}
>  
> +	add_stat(vq, interrupt_work);
>  	if (unlikely(vq->broken))
>  		return IRQ_HANDLED;
>  
> @@ -451,6 +555,15 @@ struct virtqueue *vring_new_virtqueue(un
>  	}
>  	vq->data[i] = NULL;
>  
> +#ifdef CONFIG_VIRTIO_STATS
> +	vq->stats = kzalloc(sizeof(*vq->stats) * num, GFP_KERNEL);
> +	if (!vq->stats) {
> +		kfree(vq);
> +		return NULL;
> +	}
> +	create_stat_files(vq);
> +#endif
> +
>  	return &vq->vq;
>  }
>  EXPORT_SYMBOL_GPL(vring_new_virtqueue);
> @@ -458,6 +571,7 @@ EXPORT_SYMBOL_GPL(vring_new_virtqueue);
>  void vring_del_virtqueue(struct virtqueue *vq)
>  {
>  	list_del(&vq->list);
> +	delete_stat_files(to_vvq(vq));
>  	kfree(to_vvq(vq));
>  }
>  EXPORT_SYMBOL_GPL(vring_del_virtqueue);

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-09  0:53                                       ` Michael S. Tsirkin
@ 2011-02-09  1:39                                         ` Rusty Russell
  2011-02-09  1:55                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 88+ messages in thread
From: Rusty Russell @ 2011-02-09  1:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, Shirley Ma, netdev, steved

On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
> On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
> > On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
> > > On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > > > > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> > > > >
> > > > > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > > > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > > > > Confused. We compare capacity to skb frags, no?
> > > > > > > That's sg I think ...
> > > > > >
> > > > > > Current guest kernel use indirect buffers, num_free returns how many
> > > > > > available descriptors not skb frags. So it's wrong here.
> > > > > >
> > > > > > Shirley
> > > > >
> > > > > I see. Good point. In other words when we complete the buffer
> > > > > it was indirect, but when we add a new one we
> > > > > can not allocate indirect so we consume.
> > > > > And then we start the queue and add will fail.
> > > > > I guess we need some kind of API to figure out
> > > > > whether the buf we complete was indirect?
> > 
> > I've finally read this thread... I think we need to get more serious
> > with our stats gathering to diagnose these kind of performance issues.
> > 
> > This is a start; it should tell us what is actually happening to the
> > virtio ring(s) without significant performance impact...
> > 
> > Subject: virtio: CONFIG_VIRTIO_STATS
> > 
> > For performance problems we'd like to know exactly what the ring looks
> > like.  This patch adds stats indexed by how-full-ring-is; we could extend
> > it to also record them by how-used-ring-is if we need.
> > 
> > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 
> Not sure whether the intent is to merge this. If yes -
> would it make sense to use tracing for this instead?
> That's what kvm does.

Intent wasn't; I've not used tracepoints before, but maybe we should
consider a longer-term monitoring solution?

Patch welcome!

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-09  1:39                                         ` Rusty Russell
@ 2011-02-09  1:55                                           ` Michael S. Tsirkin
  2011-02-09  7:43                                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-02-09  1:55 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar2, David Miller, kvm, Shirley Ma, netdev, steved

On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote:
> On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
> > On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
> > > On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
> > > > On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> > > > > >
> > > > > > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > > > > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > > > > > Confused. We compare capacity to skb frags, no?
> > > > > > > > That's sg I think ...
> > > > > > >
> > > > > > > Current guest kernel use indirect buffers, num_free returns how many
> > > > > > > available descriptors not skb frags. So it's wrong here.
> > > > > > >
> > > > > > > Shirley
> > > > > >
> > > > > > I see. Good point. In other words when we complete the buffer
> > > > > > it was indirect, but when we add a new one we
> > > > > > can not allocate indirect so we consume.
> > > > > > And then we start the queue and add will fail.
> > > > > > I guess we need some kind of API to figure out
> > > > > > whether the buf we complete was indirect?
> > > 
> > > I've finally read this thread... I think we need to get more serious
> > > with our stats gathering to diagnose these kind of performance issues.
> > > 
> > > This is a start; it should tell us what is actually happening to the
> > > virtio ring(s) without significant performance impact...
> > > 
> > > Subject: virtio: CONFIG_VIRTIO_STATS
> > > 
> > > For performance problems we'd like to know exactly what the ring looks
> > > like.  This patch adds stats indexed by how-full-ring-is; we could extend
> > > it to also record them by how-used-ring-is if we need.
> > > 
> > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > 
> > Not sure whether the intent is to merge this. If yes -
> > would it make sense to use tracing for this instead?
> > That's what kvm does.
> 
> Intent wasn't; I've not used tracepoints before, but maybe we should
> consider a longer-term monitoring solution?
> 
> Patch welcome!
> 
> Cheers,
> Rusty.

Sure, I'll look into this.

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-09  1:55                                           ` Michael S. Tsirkin
@ 2011-02-09  7:43                                             ` Stefan Hajnoczi
  0 siblings, 0 replies; 88+ messages in thread
From: Stefan Hajnoczi @ 2011-02-09  7:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, Krishna Kumar2, David Miller, kvm, Shirley Ma,
	netdev, steved

On Wed, Feb 9, 2011 at 1:55 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote:
>> On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
>> > On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
>> > > On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
>> > > > On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
>> > > > > > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
>> > > > > >
>> > > > > > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
>> > > > > > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
>> > > > > > > > Confused. We compare capacity to skb frags, no?
>> > > > > > > > That's sg I think ...
>> > > > > > >
>> > > > > > > Current guest kernel use indirect buffers, num_free returns how many
>> > > > > > > available descriptors not skb frags. So it's wrong here.
>> > > > > > >
>> > > > > > > Shirley
>> > > > > >
>> > > > > > I see. Good point. In other words when we complete the buffer
>> > > > > > it was indirect, but when we add a new one we
>> > > > > > can not allocate indirect so we consume.
>> > > > > > And then we start the queue and add will fail.
>> > > > > > I guess we need some kind of API to figure out
>> > > > > > whether the buf we complete was indirect?
>> > >
>> > > I've finally read this thread... I think we need to get more serious
>> > > with our stats gathering to diagnose these kind of performance issues.
>> > >
>> > > This is a start; it should tell us what is actually happening to the
>> > > virtio ring(s) without significant performance impact...
>> > >
>> > > Subject: virtio: CONFIG_VIRTIO_STATS
>> > >
>> > > For performance problems we'd like to know exactly what the ring looks
>> > > like.  This patch adds stats indexed by how-full-ring-is; we could extend
>> > > it to also record them by how-used-ring-is if we need.
>> > >
>> > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
>> >
>> > Not sure whether the intent is to merge this. If yes -
>> > would it make sense to use tracing for this instead?
>> > That's what kvm does.
>>
>> Intent wasn't; I've not used tracepoints before, but maybe we should
>> consider a longer-term monitoring solution?
>>
>> Patch welcome!
>>
>> Cheers,
>> Rusty.
>
> Sure, I'll look into this.

There are several virtio trace events already in QEMU today (see the
trace-events file):
virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
virtqueue_flush(void *vq, unsigned int count) "vq %p count %u"
virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) "vq %p elem %p in_num %u out_num %u"
virtio_queue_notify(void *vdev, int n, void *vq) "vdev %p n %d vq %p"
virtio_irq(void *vq) "vq %p"
virtio_notify(void *vdev, void *vq) "vdev %p vq %p"

These can be used by building QEMU with a suitable tracing backend
like SystemTap (see docs/tracing.txt).
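
As a minimal sketch (the surrounding function is illustrative only),
each event is emitted from QEMU's C code through a generated
trace_<name>() helper that matches its declaration in trace-events;
with no tracing backend configured these helpers compile to no-ops:

#include "trace.h"

static void notify_guest_example(void *vdev, void *vq)
{
    /* Matches the virtio_notify declaration above: records the device
     * and virtqueue pointers when the guest is interrupted. */
    trace_virtio_notify(vdev, vq);
}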

Inside the guest I've used dynamic ftrace in the past, although static
tracepoints would be nice.

Stefan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-02-09  0:37                                     ` Rusty Russell
  2011-02-09  0:53                                       ` Michael S. Tsirkin
@ 2011-03-08 21:57                                       ` Shirley Ma
  2011-03-09  2:21                                         ` Andrew Theurer
  1 sibling, 1 reply; 88+ messages in thread
From: Shirley Ma @ 2011-03-08 21:57 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Michael S. Tsirkin, Krishna Kumar2, David Miller, kvm, netdev, steved

On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
> I've finally read this thread... I think we need to get more serious
> with our stats gathering to diagnose these kind of performance issues.
> 
> This is a start; it should tell us what is actually happening to the
> virtio ring(s) without significant performance impact... 

Should we also add similar stats on the vhost vq for monitoring
vhost_signal & vhost_notify?
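
Something along these lines, for instance.  This is only a sketch: the
stat_* fields are hypothetical additions to struct vhost_virtqueue, and
the vhost_notify()/call_ctx plumbing is assumed to be as in
drivers/vhost/vhost.c:

/* Count how often we actually interrupt the guest versus how often the
 * notification was suppressed. */
void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
	if (vq->call_ctx && vhost_notify(dev, vq)) {
		vq->stat_signalled++;		/* hypothetical counter */
		eventfd_signal(vq->call_ctx, 1);
	} else {
		vq->stat_suppressed++;		/* hypothetical counter */
	}
}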

Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-03-08 21:57                                       ` Shirley Ma
@ 2011-03-09  2:21                                         ` Andrew Theurer
  2011-03-09 15:42                                           ` Shirley Ma
  2011-03-10  1:49                                           ` Rusty Russell
  0 siblings, 2 replies; 88+ messages in thread
From: Andrew Theurer @ 2011-03-09  2:21 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Rusty Russell, Michael S. Tsirkin, Krishna Kumar2, David Miller,
	kvm, netdev, steved, Tom Lendacky

On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote:
> On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
> > I've finally read this thread... I think we need to get more serious
> > with our stats gathering to diagnose these kind of performance issues.
> > 
> > This is a start; it should tell us what is actually happening to the
> > virtio ring(s) without significant performance impact... 
> 
> Should we also add similar stats on the vhost vq for monitoring
> vhost_signal & vhost_notify?

Tom L has started using Rusty's patches and found some interesting
results, sent yesterday:
http://marc.info/?l=kvm&m=129953710930124&w=2


-Andrew
> 
> Shirley
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-03-09  2:21                                         ` Andrew Theurer
@ 2011-03-09 15:42                                           ` Shirley Ma
  2011-03-10  1:49                                           ` Rusty Russell
  1 sibling, 0 replies; 88+ messages in thread
From: Shirley Ma @ 2011-03-09 15:42 UTC (permalink / raw)
  To: habanero
  Cc: Rusty Russell, Michael S. Tsirkin, Krishna Kumar2, David Miller,
	kvm, netdev, steved, Tom Lendacky

On Tue, 2011-03-08 at 20:21 -0600, Andrew Theurer wrote:
> Tom L has started using Rusty's patches and found some interesting
> results, sent yesterday:
> http://marc.info/?l=kvm&m=129953710930124&w=2

Thanks.  Very good experiment.  I have been struggling with guest/vhost
optimization work for a while.  I have created several experimental
patches, and the performance results really depend on the workload.

Based on the discussions and findings, it seems that to make progress on
virtio_net/vhost optimization we really need to collect more statistics
on both virtio_net and vhost, for both TX and RX.

A way to count only the guest exits, I/O exits, and irq injections
caused by the guest networking stack would also be helpful.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-03-09  2:21                                         ` Andrew Theurer
  2011-03-09 15:42                                           ` Shirley Ma
@ 2011-03-10  1:49                                           ` Rusty Russell
  2011-04-12 20:01                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 88+ messages in thread
From: Rusty Russell @ 2011-03-10  1:49 UTC (permalink / raw)
  To: habanero, Shirley Ma
  Cc: Michael S. Tsirkin, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky

On Tue, 08 Mar 2011 20:21:18 -0600, Andrew Theurer <habanero@linux.vnet.ibm.com> wrote:
> On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote:
> > On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
> > > I've finally read this thread... I think we need to get more serious
> > > with our stats gathering to diagnose these kind of performance issues.
> > > 
> > > This is a start; it should tell us what is actually happening to the
> > > virtio ring(s) without significant performance impact... 
> > 
> > Should we also add similar stats on the vhost vq for monitoring
> > vhost_signal & vhost_notify?
> 
> Tom L has started using Rusty's patches and found some interesting
> results, sent yesterday:
> http://marc.info/?l=kvm&m=129953710930124&w=2

Hmm, I'm not subscribed to kvm@ any more, so I didn't get this;
replying here:

> Also, it looks like vhost is sending a lot of notifications for
> packets it has received before the guest can get scheduled to disable
> notifications and begin processing the packets resulting in some lock
> contention in the guest (and high interrupt rates).

Yes, this is a virtio design flaw, but one that should be fixable.
We have room at the end of the ring in which we can put a "last_used"
count.  Then we can tell whether a wakeup would be redundant before the
guest updates the flag.

Here's an old patch where I played with implementing this:

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where
you're up to in consuming the ring.  However, to completely understand
what's going on from the outside, this information must be exposed.
For example, if you want to save and restore a virtio_ring, but you're
not the consumer because the kernel is using it directly.

Fortunately, we have room to expand: the ring is always a whole number
of pages and there's hundreds of bytes of padding after the avail ring
and the used ring, whatever the number of descriptors (which must be a
power of 2).

We add a feature bit so the guest can tell the host that it's writing
out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/virtio/virtio_ring.c |   23 +++++++++++++++--------
 include/linux/virtio_ring.h  |   12 +++++++++++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;
 
-	/* Last used index we've seen. */
-	u16 last_used_idx;
-
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
@@ -278,12 +275,13 @@ static void detach_buf(struct vring_virt
 
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != vq->vring.used->idx;
+	return vring_last_used(&vq->vring) != vq->vring.used->idx;
 }
 
 static void *vring_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_used_elem *u;
 	void *ret;
 	unsigned int i;
 
@@ -300,8 +298,11 @@ static void *vring_get_buf(struct virtqu
 		return NULL;
 	}
 
-	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
-	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+	u = &vq->vring.used->ring[vring_last_used(&vq->vring) % vq->vring.num];
+	i = u->id;
+	*len = u->len;
+	/* Make sure we don't reload i after doing checks. */
+	rmb();
 
 	if (unlikely(i >= vq->vring.num)) {
 		BAD_RING(vq, "id %u out of range\n", i);
@@ -315,7 +316,8 @@ static void *vring_get_buf(struct virtqu
 	/* detach_buf clears data, so grab it now. */
 	ret = vq->data[i];
 	detach_buf(vq, i);
-	vq->last_used_idx++;
+	vring_last_used(&vq->vring)++;
+
 	END_USE(vq);
 	return ret;
 }
@@ -402,7 +404,6 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
-	vq->last_used_idx = 0;
 	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -413,6 +414,10 @@ struct virtqueue *vring_new_virtqueue(un
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 
+	/* We publish indices whether they offer it or not: if not, it's junk
+	 * space anyway.  But calling this acknowledges the feature. */
+	virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_INDICES);
+
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
 		vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -443,6 +448,8 @@ void vring_transport_features(struct vir
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
+		case VIRTIO_RING_F_PUBLISH_INDICES:
+			break;
 		default:
 			/* We don't understand this bit. */
 			clear_bit(i, vdev->features);
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -29,6 +29,9 @@
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
+/* We publish our last-seen used index at the end of the avail ring. */
+#define VIRTIO_RING_F_PUBLISH_INDICES	29
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc
 {
@@ -87,6 +90,7 @@ struct vring {
  *	__u16 avail_flags;
  *	__u16 avail_idx;
  *	__u16 available[num];
+ *	__u16 last_used_idx;
  *
  *	// Padding to the next align boundary.
  *	char pad[];
@@ -95,6 +99,7 @@ struct vring {
  *	__u16 used_flags;
  *	__u16 used_idx;
  *	struct vring_used_elem used[num];
+ *	__u16 last_avail_idx;
  * };
  */
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
@@ -111,9 +116,14 @@ static inline unsigned vring_size(unsign
 {
 	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
 		 + align - 1) & ~(align - 1))
-		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
+		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num + 2;
 }
 
+/* We publish the last-seen used index at the end of the available ring, and
+ * vice-versa.  These are at the end for backwards compatibility. */
+#define vring_last_used(vr) ((vr)->avail->ring[(vr)->num])
+#define vring_last_avail(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
+
 #ifdef __KERNEL__
 #include <linux/irqreturn.h>
 struct virtio_device;
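
With the index published like this, the other side can suppress
redundant wakeups even before the guest gets around to setting
VRING_AVAIL_F_NO_INTERRUPT.  A rough sketch of the check (the helper and
the exact policy here are only an illustration; vring_last_used() is the
accessor added above):

/* Signal only when the guest had already caught up to the used index
 * as of our last signal; if it is still working through those entries
 * it will pick up the new ones by itself, so another interrupt would
 * be redundant. */
static bool need_signal(const struct vring *vr, __u16 prev_used_idx)
{
	return vring_last_used(vr) == prev_used_idx;
}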

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-03-10  1:49                                           ` Rusty Russell
@ 2011-04-12 20:01                                             ` Michael S. Tsirkin
  2011-04-14 11:28                                               ` Rusty Russell
  0 siblings, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-04-12 20:01 UTC (permalink / raw)
  To: Rusty Russell
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger

On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> Here's an old patch where I played with implementing this:

...

> 
> virtio: put last_used and last_avail index into ring itself.
> 
> Generally, the other end of the virtio ring doesn't need to see where
> you're up to in consuming the ring.  However, to completely understand
> what's going on from the outside, this information must be exposed.
> For example, if you want to save and restore a virtio_ring, but you're
> not the consumer because the kernel is using it directly.
> 
> Fortunately, we have room to expand:

This seems to be true for x86 kvm and lguest but is it true
for s390?

        err = vmem_add_mapping(config->address,
                               vring_size(config->num,
                                          KVM_S390_VIRTIO_RING_ALIGN));
        if (err)
                goto out;
                
        vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
                                 vdev, (void *) config->address,
                                 kvm_notify, callback, name);
        if (!vq) {
                err = -ENOMEM;
                goto unmap;
        }
        


> the ring is always a whole number
> of pages and there's hundreds of bytes of padding after the avail ring
> and the used ring, whatever the number of descriptors (which must be a
> power of 2).
> 
> We add a feature bit so the guest can tell the host that it's writing
> out the current value there, if it wants to use that.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> ---

....

> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -29,6 +29,9 @@
>  /* We support indirect buffer descriptors */
>  #define VIRTIO_RING_F_INDIRECT_DESC	28
>  
> +/* We publish our last-seen used index at the end of the avail ring. */
> +#define VIRTIO_RING_F_PUBLISH_INDICES	29
> +
>  /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
>  struct vring_desc
>  {
> @@ -87,6 +90,7 @@ struct vring {
>   *	__u16 avail_flags;
>   *	__u16 avail_idx;
>   *	__u16 available[num];
> + *	__u16 last_used_idx;
>   *
>   *	// Padding to the next align boundary.
>   *	char pad[];
> @@ -95,6 +99,7 @@ struct vring {
>   *	__u16 used_flags;
>   *	__u16 used_idx;
>   *	struct vring_used_elem used[num];
> + *	__u16 last_avail_idx;
>   * };
>   */
>  static inline void vring_init(struct vring *vr, unsigned int num, void *p,
> @@ -111,9 +116,14 @@ static inline unsigned vring_size(unsign
>  {
>  	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
>  		 + align - 1) & ~(align - 1))
> -		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
> +		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num + 2;
>  }
>  
> +/* We publish the last-seen used index at the end of the available ring, and
> + * vice-versa.  These are at the end for backwards compatibility. */
> +#define vring_last_used(vr) ((vr)->avail->ring[(vr)->num])
> +#define vring_last_avail(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
> +

Will this last bit work on s390?
If I understand correctly the memory is allocated by host there?

>  #ifdef __KERNEL__
>  #include <linux/irqreturn.h>
>  struct virtio_device;

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-04-12 20:01                                             ` Michael S. Tsirkin
@ 2011-04-14 11:28                                               ` Rusty Russell
  2011-04-14 12:40                                                 ` Michael S. Tsirkin
  2011-04-14 16:03                                                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 88+ messages in thread
From: Rusty Russell @ 2011-04-14 11:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger

On Tue, 12 Apr 2011 23:01:12 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> > Here's an old patch where I played with implementing this:
> 
> ...
> 
> > 
> > virtio: put last_used and last_avail index into ring itself.
> > 
> > Generally, the other end of the virtio ring doesn't need to see where
> > you're up to in consuming the ring.  However, to completely understand
> > what's going on from the outside, this information must be exposed.
> > For example, if you want to save and restore a virtio_ring, but you're
> > not the consumer because the kernel is using it directly.
> > 
> > Fortunately, we have room to expand:
> 
> This seems to be true for x86 kvm and lguest but is it true
> for s390?

Yes, the ring is page aligned, so there's always room.

> Will this last bit work on s390?
> If I understand correctly the memory is allocated by host there?

They have to offer the feature, so if they have some way of allocating
non-page-aligned amounts of memory, they'll have to add those extra 2
bytes.

So I think it's OK...
Rusty.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-04-14 11:28                                               ` Rusty Russell
@ 2011-04-14 12:40                                                 ` Michael S. Tsirkin
  2011-04-14 16:03                                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-04-14 12:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger

On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
> On Tue, 12 Apr 2011 23:01:12 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> > > Here's an old patch where I played with implementing this:
> > 
> > ...
> > 
> > > 
> > > virtio: put last_used and last_avail index into ring itself.
> > > 
> > > Generally, the other end of the virtio ring doesn't need to see where
> > > you're up to in consuming the ring.  However, to completely understand
> > > what's going on from the outside, this information must be exposed.
> > > For example, if you want to save and restore a virtio_ring, but you're
> > > not the consumer because the kernel is using it directly.
> > > 
> > > Fortunately, we have room to expand:
> > 
> > This seems to be true for x86 kvm and lguest but is it true
> > for s390?
> 
> Yes, the ring is page aligned, so there's always room.
> 
> > Will this last bit work on s390?
> > If I understand correctly the memory is allocated by host there?
> 
> They have to offer the feature, so if they have some way of allocating
> non-page-aligned amounts of memory, they'll have to add those extra 2
> bytes.
> 
> So I think it's OK...
> Rusty.

Correct. I wonder whether we need to pass the relevant flag
to vring_size. If we do, we'll need to add a new function
for that, though, since vring_size is exported to userspace.
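
Something like the following, perhaps.  This is a sketch only: the name
is made up, and the arithmetic simply mirrors the vring_size() change in
the patch earlier in the thread:

static inline unsigned vring_size_indices(unsigned int num,
					  unsigned long align,
					  bool publish_indices)
{
	unsigned size = ((sizeof(struct vring_desc) * num
			  + sizeof(__u16) * (2 + num)
			  + align - 1) & ~(align - 1))
			+ sizeof(__u16) * 2
			+ sizeof(struct vring_used_elem) * num;

	/* Extra __u16 for last_avail_idx after the used ring; the
	 * last_used_idx after the avail ring fits in the existing
	 * alignment padding. */
	if (publish_indices)
		size += 2;

	return size;
}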

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-04-14 11:28                                               ` Rusty Russell
  2011-04-14 12:40                                                 ` Michael S. Tsirkin
@ 2011-04-14 16:03                                                 ` Michael S. Tsirkin
  2011-04-19  0:33                                                   ` Rusty Russell
  1 sibling, 1 reply; 88+ messages in thread
From: Michael S. Tsirkin @ 2011-04-14 16:03 UTC (permalink / raw)
  To: Rusty Russell
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger

On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
> On Tue, 12 Apr 2011 23:01:12 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
> > > Here's an old patch where I played with implementing this:
> > 
> > ...
> > 
> > > 
> > > virtio: put last_used and last_avail index into ring itself.
> > > 
> > > Generally, the other end of the virtio ring doesn't need to see where
> > > you're up to in consuming the ring.  However, to completely understand
> > > what's going on from the outside, this information must be exposed.
> > > For example, if you want to save and restore a virtio_ring, but you're
> > > not the consumer because the kernel is using it directly.
> > > 
> > > Fortunately, we have room to expand:
> > 
> > This seems to be true for x86 kvm and lguest but is it true
> > for s390?
> 
> Yes, the ring is page aligned, so there's always room.
> 
> > Will this last bit work on s390?
> > If I understand correctly the memory is allocated by host there?
> 
> They have to offer the feature, so if they have some way of allocating
> non-page-aligned amounts of memory, they'll have to add those extra 2
> bytes.
> 
> So I think it's OK...
> Rusty.

To clarify, my concern is that we always seem to try to map
these extra 2 bytes, which conceivably might fail?

-- 
MST

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Network performance with small packets
  2011-04-14 16:03                                                 ` Michael S. Tsirkin
@ 2011-04-19  0:33                                                   ` Rusty Russell
  0 siblings, 0 replies; 88+ messages in thread
From: Rusty Russell @ 2011-04-19  0:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, Shirley Ma, Krishna Kumar2, David Miller, kvm, netdev,
	steved, Tom Lendacky, borntraeger

On Thu, 14 Apr 2011 19:03:59 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
> > They have to offer the feature, so if the have some way of allocating
> > non-page-aligned amounts of memory, they'll have to add those extra 2
> > bytes.
> > 
> > So I think it's OK...
> > Rusty.
> 
> To clarify, my concern is that we always seem to try to map
> these extra 2 bytes, which thinkably might fail?

No, if you look at the layout it's clear that there's always most of a
page left for this extra room, both in the middle and at the end.  (For
example, with a 256-entry ring the descriptor table plus avail ring take
4096 + 516 bytes, leaving roughly 3.5K of alignment padding before the
page-aligned used ring, and the used ring itself occupies only 2052
bytes of its final page.)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2011-04-19  1:39 UTC | newest]

Thread overview: 88+ messages
2011-01-25 21:09 Network performance with small packets Steve Dobbelstein
2011-01-26 15:17 ` Michael S. Tsirkin
2011-01-27 18:44   ` Shirley Ma
2011-01-27 19:00     ` Michael S. Tsirkin
2011-01-27 19:09       ` Shirley Ma
2011-01-27 19:31         ` Michael S. Tsirkin
2011-01-27 19:45           ` Shirley Ma
2011-01-27 20:05             ` Michael S. Tsirkin
2011-01-27 20:15               ` Shirley Ma
2011-01-28 18:29                 ` Steve Dobbelstein
2011-01-28 22:51                   ` Steve Dobbelstein
2011-02-01 15:52                   ` [PATCHv2 dontapply] vhost-net tx tuning Michael S. Tsirkin
2011-02-01 23:07                     ` Sridhar Samudrala
2011-02-01 23:27                       ` Shirley Ma
2011-02-02  4:36                       ` Michael S. Tsirkin
2011-01-27 21:02               ` Network performance with small packets David Miller
2011-01-27 21:30                 ` Shirley Ma
2011-01-28 12:16                   ` Michael S. Tsirkin
2011-02-01  0:24                     ` Steve Dobbelstein
2011-02-01  1:30                       ` Sridhar Samudrala
2011-02-01  5:56                         ` Michael S. Tsirkin
2011-02-01 21:09                         ` Shirley Ma
2011-02-01 21:24                           ` Michael S. Tsirkin
2011-02-01 21:32                             ` Shirley Ma
2011-02-01 21:42                               ` Michael S. Tsirkin
2011-02-01 21:53                                 ` Shirley Ma
2011-02-01 21:56                                   ` Michael S. Tsirkin
2011-02-01 22:59                                     ` Shirley Ma
2011-02-02  4:40                                       ` Michael S. Tsirkin
2011-02-02  6:05                                         ` Shirley Ma
2011-02-02  6:19                                           ` Shirley Ma
2011-02-02  6:29                                             ` Michael S. Tsirkin
2011-02-02  7:14                                               ` Shirley Ma
2011-02-02  7:33                                                 ` Shirley Ma
2011-02-02 10:49                                                   ` Michael S. Tsirkin
2011-02-02 15:42                                                     ` Shirley Ma
2011-02-02 15:48                                                       ` Michael S. Tsirkin
2011-02-02 17:12                                                         ` Shirley Ma
2011-02-02 18:20                                                       ` Michael S. Tsirkin
2011-02-02 18:26                                                         ` Shirley Ma
2011-02-02 10:48                                                 ` Michael S. Tsirkin
2011-02-02  6:34                                             ` Krishna Kumar2
2011-02-02  7:03                                               ` Shirley Ma
2011-02-02  7:37                                                 ` Krishna Kumar2
2011-02-02 10:48                                               ` Michael S. Tsirkin
2011-02-02 15:39                                                 ` Shirley Ma
2011-02-02 15:47                                                   ` Michael S. Tsirkin
2011-02-02 17:10                                                     ` Shirley Ma
2011-02-02 17:32                                                       ` Michael S. Tsirkin
2011-02-02 18:11                                                         ` Shirley Ma
2011-02-02 18:27                                                           ` Michael S. Tsirkin
2011-02-02 19:29                                                             ` Shirley Ma
2011-02-02 20:17                                                               ` Michael S. Tsirkin
2011-02-02 21:03                                                                 ` Shirley Ma
2011-02-02 21:20                                                                   ` Michael S. Tsirkin
2011-02-02 21:41                                                                     ` Shirley Ma
2011-02-03  5:59                                                                       ` Michael S. Tsirkin
2011-02-03  6:09                                                                         ` Shirley Ma
2011-02-03  6:16                                                                           ` Michael S. Tsirkin
2011-02-03  5:05                                                                     ` Shirley Ma
2011-02-03  6:13                                                                       ` Michael S. Tsirkin
2011-02-03 15:58                                                                         ` Shirley Ma
2011-02-03 16:20                                                                           ` Michael S. Tsirkin
2011-02-03 17:18                                                                             ` Shirley Ma
2011-02-01  5:54                       ` Michael S. Tsirkin
2011-02-01 17:23                   ` Michael S. Tsirkin
     [not found]                     ` <1296590943.26937.797.camel@localhost.localdomain>
     [not found]                       ` <20110201201715.GA30050@redhat.com>
2011-02-01 20:25                         ` Shirley Ma
2011-02-01 21:21                           ` Michael S. Tsirkin
2011-02-01 21:28                             ` Shirley Ma
2011-02-01 21:41                               ` Michael S. Tsirkin
2011-02-02  4:39                                 ` Krishna Kumar2
2011-02-02  4:42                                   ` Michael S. Tsirkin
2011-02-09  0:37                                     ` Rusty Russell
2011-02-09  0:53                                       ` Michael S. Tsirkin
2011-02-09  1:39                                         ` Rusty Russell
2011-02-09  1:55                                           ` Michael S. Tsirkin
2011-02-09  7:43                                             ` Stefan Hajnoczi
2011-03-08 21:57                                       ` Shirley Ma
2011-03-09  2:21                                         ` Andrew Theurer
2011-03-09 15:42                                           ` Shirley Ma
2011-03-10  1:49                                           ` Rusty Russell
2011-04-12 20:01                                             ` Michael S. Tsirkin
2011-04-14 11:28                                               ` Rusty Russell
2011-04-14 12:40                                                 ` Michael S. Tsirkin
2011-04-14 16:03                                                 ` Michael S. Tsirkin
2011-04-19  0:33                                                   ` Rusty Russell
2011-02-02 18:38 ` Michael S. Tsirkin
2011-02-02 19:15   ` Steve Dobbelstein
