* Flow Control and Port Mirroring Revisited
@ 2011-01-06  9:33 Simon Horman
  2011-01-06 10:22 ` Eric Dumazet
  2011-01-06 10:27 ` Michael S. Tsirkin
  0 siblings, 2 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-06  9:33 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, Jesse Gross, dev, virtualization, netdev, kvm,
	Michael S. Tsirkin

Hi,

Back in October I reported that I noticed a problem whereby flow control
breaks down when openvswitch is configured to mirror a port[1].

I have (finally) looked into this further and the problem appears to relate
to cloning of skbs, as Jesse Gross originally suspected.

More specifically, in do_execute_actions[2] the first n-1 times that an skb
needs to be transmitted it is cloned first and the final time the original
skb is used.

In the case that there is only one action, which is the normal case, then
the original skb will be used. But in the case of mirroring the cloning
comes into effect. And in my case the cloned skb seems to go to the (slow)
eth1 interface while the original skb goes to the (fast) dummy0 interface
that I set up to be a mirror. The result is that dummy0 "paces" the flow,
and it's a cracking pace at that.

As an experiment I hacked do_execute_actions() to use the original skb
for the first action instead of the last one.  In my case the result was
that eth1 "paces" the flow, and things work reasonably nicely.

Well, sort of. Things work well for non-GSO skbs but extremely poorly for
GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
netserver. I'm unsure why, but I digress.

It seems to me that my hack illustrates the point that the flow ends up
being "paced" by one interface. However I think that what would be
desirable is that the flow is "paced" by the slowest link. Unfortunately
I'm unsure how to achieve that.

One idea that I had was to skb_get() the original skb each time it is
cloned - that is easy enough. But unfortunately it seems to me that
approach would require some sort of callback mechanism in kfree_skb() so
that the cloned skbs can kfree_skb() the original skb.

Ideas would be greatly appreciated.

[1] http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
[2] http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06  9:33 Flow Control and Port Mirroring Revisited Simon Horman
@ 2011-01-06 10:22 ` Eric Dumazet
  2011-01-06 12:44   ` Simon Horman
  2011-01-06 10:27 ` Michael S. Tsirkin
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2011-01-06 10:22 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thursday 06 January 2011 at 18:33 +0900, Simon Horman wrote:
> Hi,
> 
> Back in October I reported that I noticed a problem whereby flow control
> breaks down when openvswitch is configured to mirror a port[1].
> 
> I have (finally) looked into this further and the problem appears to relate
> to cloning of skbs, as Jesse Gross originally suspected.
> 
> More specifically, in do_execute_actions[2] the first n-1 times that an skb
> needs to be transmitted it is cloned first and the final time the original
> skb is used.
> 
> In the case that there is only one action, which is the normal case, then
> the original skb will be used. But in the case of mirroring the cloning
> comes into effect. And in my case the cloned skb seems to go to the (slow)
> eth1 interface while the original skb goes to the (fast) dummy0 interface
> that I set up to be a mirror. The result is that dummy0 "paces" the flow,
> and its a cracking pace at that.
> 
> As an experiment I hacked do_execute_actions() to use the original skb
> for the first action instead of the last one.  In my case the result was
> that eth1 "paces" the flow, and things work reasonably nicely.
> 
> Well, sort of. Things work well for non-GSO skbs but extremely poorly for
> GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
> netserv. I'm unsure why, but I digress.
> 
> It seems to me that my hack illustrates the point that the flow ends up
> being "paced" by one interface. However I think that what would be
> desirable is that the flow is "paced" by the slowest link. Unfortunately
> I'm unsure how to achieve that.
> 

Hi Simon !

"pacing" is done because skb is attached to a socket, and a socket has a
limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
of all truesize skbs in flight.
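
(For reference, the generic socket send buffer defaults are visible with:

	# sysctl net.core.wmem_default net.core.wmem_max

and a socket can adjust its own limit with setsockopt(SO_SNDBUF). For the
tap device used with vhost-net the sndbuf is set per device, via qemu's
sndbuf= tap option mentioned later in this thread.)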

When you enter something that:

1) Gets a clone of the skb and queues the clone to device X
2) Queues the original skb to device Y

Then:	the socket sndbuf is not affected at all by the device X queue.
	It is the speed of device Y that matters.

You want to get servo control on both X and Y.

You could try to:

1) Get a clone of the skb.
   Attach it to the socket too (so that the socket gets feedback when the
   clone is finally orphaned) with skb_set_owner_w().
   Queue the clone to device X.

Unfortunately, stacked skb->destructor() makes this possible only for a
known destructor (aka sock_wfree()).

> One idea that I had was to skb_get() the original skb each time it is
> cloned - that is easy enough. But unfortunately it seems to me that
> approach would require some sort of callback mechanism in kfree_skb() so
> that the cloned skbs can kfree_skb() the original skb.
> 
> Ideas would be greatly appreciated.
> 
> [1] http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
> [2] http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD
> --



* Re: Flow Control and Port Mirroring Revisited
  2011-01-06  9:33 Flow Control and Port Mirroring Revisited Simon Horman
  2011-01-06 10:22 ` Eric Dumazet
@ 2011-01-06 10:27 ` Michael S. Tsirkin
  2011-01-06 11:30   ` Simon Horman
  1 sibling, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-06 10:27 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm

On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> Hi,
> 
> Back in October I reported that I noticed a problem whereby flow control
> breaks down when openvswitch is configured to mirror a port[1].

Apropos the UDP flow control.  See this
http://www.spinics.net/lists/netdev/msg150806.html
for some problems it introduces.
Unfortunately UDP does not have built-in flow control.
At some level it's just conceptually broken:
it's not present in physical networks so why should
we try and emulate it in a virtual network?


Specifically, when you do:
# netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
You are asking: what happens if I push data faster than it can be received?
But why is this an interesting question?
Ask 'what is the maximum rate at which I can send data with X% packet
loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
-b and -w flags for this. It needs to be configured
with --enable-intervals=yes for them to work.
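
For example, something along these lines (values purely illustrative, and
netperf must have been built with --enable-intervals=yes):

# netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -b 100 -w 10 -- -m 1472

which sends bursts of 100 packets every 10ms rather than flat out.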

If you pose the questions this way the problem of pacing
the execution just goes away.

> 
> I have (finally) looked into this further and the problem appears to relate
> to cloning of skbs, as Jesse Gross originally suspected.
> 
> More specifically, in do_execute_actions[2] the first n-1 times that an skb
> needs to be transmitted it is cloned first and the final time the original
> skb is used.
> 
> In the case that there is only one action, which is the normal case, then
> the original skb will be used. But in the case of mirroring the cloning
> comes into effect. And in my case the cloned skb seems to go to the (slow)
> eth1 interface while the original skb goes to the (fast) dummy0 interface
> that I set up to be a mirror. The result is that dummy0 "paces" the flow,
> and its a cracking pace at that.
> 
> As an experiment I hacked do_execute_actions() to use the original skb
> for the first action instead of the last one.  In my case the result was
> that eth1 "paces" the flow, and things work reasonably nicely.
> 
> Well, sort of. Things work well for non-GSO skbs but extremely poorly for
> GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
> netserv. I'm unsure why, but I digress.
> 
> It seems to me that my hack illustrates the point that the flow ends up
> being "paced" by one interface. However I think that what would be
> desirable is that the flow is "paced" by the slowest link. Unfortunately
> I'm unsure how to achieve that.

What if you have multiple UDP sockets with different targets
in the guest?

> One idea that I had was to skb_get() the original skb each time it is
> cloned - that is easy enough. But unfortunately it seems to me that
> approach would require some sort of callback mechanism in kfree_skb() so
> that the cloned skbs can kfree_skb() the original skb.
> 
> Ideas would be greatly appreciated.
> 
> [1] http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
> [2] http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 10:27 ` Michael S. Tsirkin
@ 2011-01-06 11:30   ` Simon Horman
  2011-01-06 12:07     ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-06 11:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm

On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > Hi,
> > 
> > Back in October I reported that I noticed a problem whereby flow control
> > breaks down when openvswitch is configured to mirror a port[1].
> 
> Apropos the UDP flow control.  See this
> http://www.spinics.net/lists/netdev/msg150806.html
> for some problems it introduces.
> Unfortunately UDP does not have built-in flow control.
> At some level it's just conceptually broken:
> it's not present in physical networks so why should
> we try and emulate it in a virtual network?
> 
> 
> Specifically, when you do:
> # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> You are asking: what happens if I push data faster than it can be received?
> But why is this an interesting question?
> Ask 'what is the maximum rate at which I can send data with %X packet
> loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> -b and -w flags for this. It needs to be configured
> with --enable-intervals=yes for them to work.
> 
> If you pose the questions this way the problem of pacing
> the execution just goes away.

I am aware that UDP inherently lacks flow control.

The aspect of flow control that I am interested in is situations where the
guest can create large amounts of work for the host. However, it seems that
in the case of virtio with vhost-net the CPU utilisation is almost entirely
attributable to the vhost and qemu-system processes, and in the case of
virtio without vhost-net the CPU is used by the qemu-system process. In both
cases I assume that I could use a cgroup or something similar to limit the
guests.

Assuming all of that is true then, from a resource control point of view,
which is mostly what I am concerned about, the problem goes away. However,
I still think that it would be nice to resolve the situation I described.

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 11:30   ` Simon Horman
@ 2011-01-06 12:07     ` Michael S. Tsirkin
  2011-01-06 12:29       ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-06 12:07 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm

On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
> On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > > Hi,
> > > 
> > > Back in October I reported that I noticed a problem whereby flow control
> > > breaks down when openvswitch is configured to mirror a port[1].
> > 
> > Apropos the UDP flow control.  See this
> > http://www.spinics.net/lists/netdev/msg150806.html
> > for some problems it introduces.
> > Unfortunately UDP does not have built-in flow control.
> > At some level it's just conceptually broken:
> > it's not present in physical networks so why should
> > we try and emulate it in a virtual network?
> > 
> > 
> > Specifically, when you do:
> > # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> > You are asking: what happens if I push data faster than it can be received?
> > But why is this an interesting question?
> > Ask 'what is the maximum rate at which I can send data with %X packet
> > loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> > -b and -w flags for this. It needs to be configured
> > with --enable-intervals=yes for them to work.
> > 
> > If you pose the questions this way the problem of pacing
> > the execution just goes away.
> 
> I am aware that UDP inherently lacks flow control.

Everyone is aware of that, but this is always followed by a 'however'
:).

> The aspect of flow control that I am interested in is situations where the
> guest can create large amounts of work for the host. However, it seems that
> in the case of virtio with vhostnet that the CPU utilisation seems to be
> almost entirely attributable to the vhost and qemu-system processes.  And
> in the case of virtio without vhost net the CPU is used by the qemu-system
> process. In both case I assume that I could use a cgroup or something
> similar to limit the guests.

cgroups, yes. The vhost process inherits the cgroups
from the qemu process so you can limit them all.

If you are after limiting the max throughput of the guest
you can do this with cgroups as well.
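
For example, a rough sketch (controller mount point, group name and weight
are illustrative):

# mkdir /cgroup/cpu/vm1
# echo 512 > /cgroup/cpu/vm1/cpu.shares
# echo $QEMU_PID > /cgroup/cpu/vm1/tasks

after which the vhost thread started for that guest should land in the same
group.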

> Assuming all of that is true then from a resource control problem point of
> view, which is mostly what I am concerned about, the problem goes away.
> However, I still think that it would be nice to resolve the situation I
> described.

We need to articulate what's wrong here, otherwise we won't
be able to resolve the situation. We are sending UDP packets
as fast as we can and some receivers can't cope. Is this the problem?
We have made attempts in the past to add a pseudo flow control to make
UDP on the same host work better. Maybe they help some, but they also
certainly introduce problems.

-- 
MST

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 12:07     ` Michael S. Tsirkin
@ 2011-01-06 12:29       ` Simon Horman
  2011-01-06 12:47         ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-06 12:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm

On Thu, Jan 06, 2011 at 02:07:22PM +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
> > On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > > > Hi,
> > > > 
> > > > Back in October I reported that I noticed a problem whereby flow control
> > > > breaks down when openvswitch is configured to mirror a port[1].
> > > 
> > > Apropos the UDP flow control.  See this
> > > http://www.spinics.net/lists/netdev/msg150806.html
> > > for some problems it introduces.
> > > Unfortunately UDP does not have built-in flow control.
> > > At some level it's just conceptually broken:
> > > it's not present in physical networks so why should
> > > we try and emulate it in a virtual network?
> > > 
> > > 
> > > Specifically, when you do:
> > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> > > You are asking: what happens if I push data faster than it can be received?
> > > But why is this an interesting question?
> > > Ask 'what is the maximum rate at which I can send data with %X packet
> > > loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> > > -b and -w flags for this. It needs to be configured
> > > with --enable-intervals=yes for them to work.
> > > 
> > > If you pose the questions this way the problem of pacing
> > > the execution just goes away.
> > 
> > I am aware that UDP inherently lacks flow control.
> 
> Everyone's is aware of that, but this is always followed by a 'however'
> :).
> 
> > The aspect of flow control that I am interested in is situations where the
> > guest can create large amounts of work for the host. However, it seems that
> > in the case of virtio with vhostnet that the CPU utilisation seems to be
> > almost entirely attributable to the vhost and qemu-system processes.  And
> > in the case of virtio without vhost net the CPU is used by the qemu-system
> > process. In both case I assume that I could use a cgroup or something
> > similar to limit the guests.
> 
> cgroups, yes. the vhost process inherits the cgroups
> from the qemu process so you can limit them all.
> 
> If you are after limiting the max troughput of the guest
> you can do this with cgroups as well.

Do you mean a CPU cgroup or something else?

> > Assuming all of that is true then from a resource control problem point of
> > view, which is mostly what I am concerned about, the problem goes away.
> > However, I still think that it would be nice to resolve the situation I
> > described.
> 
> We need to articulate what's wrong here, otherwise we won't
> be able to resolve the situation. We are sending UDP packets
> as fast as we can and some receivers can't cope. Is this the problem?
> We have made attempts to add a pseudo flow control in the past
> in an attempt to make UDP on the same host work better.
> Maybe they help some but they also sure introduce problems.

In the case where port mirroring is not active, which is the
usual case, to some extent there is flow control in place due to
(as Eric Dumazet pointed out) the socket buffer.

When port mirroring is activated the flow control operates based
only on one port - which can't be controlled by the administrator
in an obvious way.

I think that it would be more intuitive if flow control was
based on sending a packet to all ports rather than just one.

Though now that I think about it some more, perhaps this isn't the best
either. Consider, for instance, the case where data is being sent to dummy0
and suddenly adding a mirror on eth1 slows everything down.

So perhaps there needs to be another knob to tune when setting
up port-mirroring. Or perhaps the current situation isn't so bad.

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 10:22 ` Eric Dumazet
@ 2011-01-06 12:44   ` Simon Horman
  2011-01-06 13:28     ` Eric Dumazet
  2011-01-06 22:38     ` Jesse Gross
  0 siblings, 2 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-06 12:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
> Le jeudi 06 janvier 2011 à 18:33 +0900, Simon Horman a écrit :
> > Hi,
> > 
> > Back in October I reported that I noticed a problem whereby flow control
> > breaks down when openvswitch is configured to mirror a port[1].
> > 
> > I have (finally) looked into this further and the problem appears to relate
> > to cloning of skbs, as Jesse Gross originally suspected.
> > 
> > More specifically, in do_execute_actions[2] the first n-1 times that an skb
> > needs to be transmitted it is cloned first and the final time the original
> > skb is used.
> > 
> > In the case that there is only one action, which is the normal case, then
> > the original skb will be used. But in the case of mirroring the cloning
> > comes into effect. And in my case the cloned skb seems to go to the (slow)
> > eth1 interface while the original skb goes to the (fast) dummy0 interface
> > that I set up to be a mirror. The result is that dummy0 "paces" the flow,
> > and its a cracking pace at that.
> > 
> > As an experiment I hacked do_execute_actions() to use the original skb
> > for the first action instead of the last one.  In my case the result was
> > that eth1 "paces" the flow, and things work reasonably nicely.
> > 
> > Well, sort of. Things work well for non-GSO skbs but extremely poorly for
> > GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
> > netserv. I'm unsure why, but I digress.
> > 
> > It seems to me that my hack illustrates the point that the flow ends up
> > being "paced" by one interface. However I think that what would be
> > desirable is that the flow is "paced" by the slowest link. Unfortunately
> > I'm unsure how to achieve that.
> > 
> 
> Hi Simon !
> 
> "pacing" is done because skb is attached to a socket, and a socket has a
> limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
> of all truesize skbs in flight.
> 
> When you enter something that :
> 
> 1) Get a clone of the skb, queue the clone to device X
> 2) queue the original skb to device Y
> 
> Then :	Socket sndbuf is not affected at all by device X queue.
> 	This is speed on device Y that matters.
> 
> You want to get servo control on both X and Y
> 
> You could try to
> 
> 1) Get a clone of skb
>    Attach it to socket too (so that socket get a feedback of final
> orphaning for the clone) with skb_set_owner_w()
>    queue the clone to device X
> 
> Unfortunatly, stacked skb->destructor() makes this possible only for
> known destructor (aka sock_wfree())

Hi Eric !

Thanks for the advice. I had thought about the socket buffer but at some
point it slipped my mind.

In any case the following patch seems to implement the change that I had in
mind. However, my discussions with Michael S. Tsirkin elsewhere in this
thread are beginning to make me think that perhaps this change isn't the
best solution.

diff --git a/datapath/actions.c b/datapath/actions.c
index 5e16143..505f13f 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 
 	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
 		if (prev_port != -1) {
-			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+			if (nskb) {
+				if (skb->sk)
+					skb_set_owner_w(nskb, skb->sk);
+				do_output(dp, nskb, prev_port);
+			}
 			prev_port = -1;
 		}

I got a rather nasty panic without the if (skb->sk) check;
I guess some skbs don't have a socket.

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 12:29       ` Simon Horman
@ 2011-01-06 12:47         ` Michael S. Tsirkin
  0 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-06 12:47 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm

On Thu, Jan 06, 2011 at 09:29:02PM +0900, Simon Horman wrote:
> On Thu, Jan 06, 2011 at 02:07:22PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
> > > On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
> > > > > Hi,
> > > > > 
> > > > > Back in October I reported that I noticed a problem whereby flow control
> > > > > breaks down when openvswitch is configured to mirror a port[1].
> > > > 
> > > > Apropos the UDP flow control.  See this
> > > > http://www.spinics.net/lists/netdev/msg150806.html
> > > > for some problems it introduces.
> > > > Unfortunately UDP does not have built-in flow control.
> > > > At some level it's just conceptually broken:
> > > > it's not present in physical networks so why should
> > > > we try and emulate it in a virtual network?
> > > > 
> > > > 
> > > > Specifically, when you do:
> > > > # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
> > > > You are asking: what happens if I push data faster than it can be received?
> > > > But why is this an interesting question?
> > > > Ask 'what is the maximum rate at which I can send data with %X packet
> > > > loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
> > > > -b and -w flags for this. It needs to be configured
> > > > with --enable-intervals=yes for them to work.
> > > > 
> > > > If you pose the questions this way the problem of pacing
> > > > the execution just goes away.
> > > 
> > > I am aware that UDP inherently lacks flow control.
> > 
> > Everyone's is aware of that, but this is always followed by a 'however'
> > :).
> > 
> > > The aspect of flow control that I am interested in is situations where the
> > > guest can create large amounts of work for the host. However, it seems that
> > > in the case of virtio with vhostnet that the CPU utilisation seems to be
> > > almost entirely attributable to the vhost and qemu-system processes.  And
> > > in the case of virtio without vhost net the CPU is used by the qemu-system
> > > process. In both case I assume that I could use a cgroup or something
> > > similar to limit the guests.
> > 
> > cgroups, yes. the vhost process inherits the cgroups
> > from the qemu process so you can limit them all.
> > 
> > If you are after limiting the max troughput of the guest
> > you can do this with cgroups as well.
> 
> Do you mean a CPU cgroup or something else?

net classifier cgroup
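
Roughly, a sketch (mount point, names, classids and rates are all
illustrative):

# mount -t cgroup -o net_cls net_cls /cgroup/net_cls
# mkdir /cgroup/net_cls/vm1
# echo 0x00010001 > /cgroup/net_cls/vm1/net_cls.classid	# maps to class 1:1
# echo $QEMU_PID > /cgroup/net_cls/vm1/tasks
# tc qdisc add dev eth1 root handle 1: htb
# tc class add dev eth1 parent 1: classid 1:1 htb rate 100mbit
# tc filter add dev eth1 parent 1: protocol ip prio 10 handle 1: cgroup

i.e. net_cls tags the traffic and tc does the actual rate limiting.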

> > > Assuming all of that is true then from a resource control problem point of
> > > view, which is mostly what I am concerned about, the problem goes away.
> > > However, I still think that it would be nice to resolve the situation I
> > > described.
> > 
> > We need to articulate what's wrong here, otherwise we won't
> > be able to resolve the situation. We are sending UDP packets
> > as fast as we can and some receivers can't cope. Is this the problem?
> > We have made attempts to add a pseudo flow control in the past
> > in an attempt to make UDP on the same host work better.
> > Maybe they help some but they also sure introduce problems.
> 
> In the case where port mirroring is not active, which is the
> usual case, to some extent there is flow control in place due to
> (as Eric Dumazet pointed out) the socket buffer.
> 
> When port mirroring is activated the flow control operates based
> only on one port - which can't be controlled by the administrator
> in an obvious way.
> 
> I think that it would be more intuitive if flow control was
> based on sending a packet to all ports rather than just one.
> 
> Though now I think about it some more, perhaps this isn't the best either.
> For instance the case where data was being sent to dummy0 and suddenly
> adding a mirror on eth1 slowed everything down.
> 
> So perhaps there needs to be another knob to tune when setting
> up port-mirroring. Or perhaps the current situation isn't so bad.

To understand whether it's bad, you'd need to measure it.
The netperf manual says:
	5.2.4 UDP_STREAM

		A UDP_STREAM test is similar to a TCP_STREAM test except UDP is used as
	the transport rather than TCP.

		A UDP_STREAM test has no end-to-end flow control - UDP provides none
	and neither does netperf. However, if you wish, you can configure netperf with
	--enable-intervals=yes to enable the global command-line -b and -w options to
	pace bursts of traffic onto the network.

	This has a number of implications.

	...
and one of the implications is that the max throughput
might not be reached when you try to send as much data as possible.
It might be confusing that this is what netperf does by default with UDP_STREAM:
if the endpoint is much faster than the network the issue might not appear.

-- 
MST

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 12:44   ` Simon Horman
@ 2011-01-06 13:28     ` Eric Dumazet
  2011-01-06 22:01       ` Simon Horman
  2011-01-06 22:38     ` Jesse Gross
  1 sibling, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2011-01-06 13:28 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thursday 06 January 2011 at 21:44 +0900, Simon Horman wrote:

> Hi Eric !
> 
> Thanks for the advice. I had thought about the socket buffer but at some
> point it slipped my mind.
> 
> In any case the following patch seems to implement the change that I had in
> mind. However my discussions Michael Tsirkin elsewhere in this thread are
> beginning to make me think that think that perhaps this change isn't the
> best solution.
> 
> diff --git a/datapath/actions.c b/datapath/actions.c
> index 5e16143..505f13f 100644
> --- a/datapath/actions.c
> +++ b/datapath/actions.c
> @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>  
>  	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
>  		if (prev_port != -1) {
> -			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
> +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (nskb) {
> +				if (skb->sk)
> +					skb_set_owner_w(nskb, skb->sk);
> +				do_output(dp, nskb, prev_port);
> +			}
>  			prev_port = -1;
>  		}
> 
> I got a rather nasty panic without the if (skb->sk),
> I guess some skbs don't have a socket.

Indeed, some packets are not linked to a socket.

(ARP packets for example)

Sorry, I should have mentioned it :)



* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 13:28     ` Eric Dumazet
@ 2011-01-06 22:01       ` Simon Horman
  0 siblings, 0 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-06 22:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rusty Russell, virtualization, Jesse Gross, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thu, Jan 06, 2011 at 02:28:18PM +0100, Eric Dumazet wrote:
> Le jeudi 06 janvier 2011 à 21:44 +0900, Simon Horman a écrit :
> 
> > Hi Eric !
> > 
> > Thanks for the advice. I had thought about the socket buffer but at some
> > point it slipped my mind.
> > 
> > In any case the following patch seems to implement the change that I had in
> > mind. However my discussions Michael Tsirkin elsewhere in this thread are
> > beginning to make me think that think that perhaps this change isn't the
> > best solution.
> > 
> > diff --git a/datapath/actions.c b/datapath/actions.c
> > index 5e16143..505f13f 100644
> > --- a/datapath/actions.c
> > +++ b/datapath/actions.c
> > @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> >  
> >  	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
> >  		if (prev_port != -1) {
> > -			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
> > +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> > +			if (nskb) {
> > +				if (skb->sk)
> > +					skb_set_owner_w(nskb, skb->sk);
> > +				do_output(dp, nskb, prev_port);
> > +			}
> >  			prev_port = -1;
> >  		}
> > 
> > I got a rather nasty panic without the if (skb->sk),
> > I guess some skbs don't have a socket.
> 
> Indeed, some packets are not linked to a socket.
> 
> (ARP packets for example)
> 
> Sorry, I should have mentioned it :)

Not at all, the occasional panic during hacking is good for the soul.

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 12:44   ` Simon Horman
  2011-01-06 13:28     ` Eric Dumazet
@ 2011-01-06 22:38     ` Jesse Gross
  2011-01-07  1:23       ` Simon Horman
  1 sibling, 1 reply; 40+ messages in thread
From: Jesse Gross @ 2011-01-06 22:38 UTC (permalink / raw)
  To: Simon Horman
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thu, Jan 6, 2011 at 7:44 AM, Simon Horman <horms@verge.net.au> wrote:
> On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
>> Le jeudi 06 janvier 2011 à 18:33 +0900, Simon Horman a écrit :
>> > Hi,
>> >
>> > Back in October I reported that I noticed a problem whereby flow control
>> > breaks down when openvswitch is configured to mirror a port[1].
>> >
>> > I have (finally) looked into this further and the problem appears to relate
>> > to cloning of skbs, as Jesse Gross originally suspected.
>> >
>> > More specifically, in do_execute_actions[2] the first n-1 times that an skb
>> > needs to be transmitted it is cloned first and the final time the original
>> > skb is used.
>> >
>> > In the case that there is only one action, which is the normal case, then
>> > the original skb will be used. But in the case of mirroring the cloning
>> > comes into effect. And in my case the cloned skb seems to go to the (slow)
>> > eth1 interface while the original skb goes to the (fast) dummy0 interface
>> > that I set up to be a mirror. The result is that dummy0 "paces" the flow,
>> > and its a cracking pace at that.
>> >
>> > As an experiment I hacked do_execute_actions() to use the original skb
>> > for the first action instead of the last one.  In my case the result was
>> > that eth1 "paces" the flow, and things work reasonably nicely.
>> >
>> > Well, sort of. Things work well for non-GSO skbs but extremely poorly for
>> > GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
>> > netserv. I'm unsure why, but I digress.
>> >
>> > It seems to me that my hack illustrates the point that the flow ends up
>> > being "paced" by one interface. However I think that what would be
>> > desirable is that the flow is "paced" by the slowest link. Unfortunately
>> > I'm unsure how to achieve that.
>> >
>>
>> Hi Simon !
>>
>> "pacing" is done because skb is attached to a socket, and a socket has a
>> limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
>> of all truesize skbs in flight.
>>
>> When you enter something that :
>>
>> 1) Get a clone of the skb, queue the clone to device X
>> 2) queue the original skb to device Y
>>
>> Then :        Socket sndbuf is not affected at all by device X queue.
>>       This is speed on device Y that matters.
>>
>> You want to get servo control on both X and Y
>>
>> You could try to
>>
>> 1) Get a clone of skb
>>    Attach it to socket too (so that socket get a feedback of final
>> orphaning for the clone) with skb_set_owner_w()
>>    queue the clone to device X
>>
>> Unfortunatly, stacked skb->destructor() makes this possible only for
>> known destructor (aka sock_wfree())
>
> Hi Eric !
>
> Thanks for the advice. I had thought about the socket buffer but at some
> point it slipped my mind.
>
> In any case the following patch seems to implement the change that I had in
> mind. However my discussions Michael Tsirkin elsewhere in this thread are
> beginning to make me think that think that perhaps this change isn't the
> best solution.

I know that everyone likes a nice netperf result but I agree with
Michael that this probably isn't the right question to be asking.  I
don't think that socket buffers are a real solution to the flow
control problem: they happen to provide that functionality but it's
more of a side effect than anything.  It's just that the amount of
memory consumed by packets in the queue(s) doesn't really have any
implicit meaning for flow control (think multiple physical adapters,
all with the same speed instead of a virtual device and a physical
device with wildly different speeds).  The analog in the physical
world that you're looking for would be Ethernet flow control.
Obviously, if the question is limiting CPU or memory consumption then
that's a different story.

This patch also double counts memory, since the full size of the
packet will be accounted for by each clone, even though they share the
actual packet data.  Probably not too significant here but it might be
when flooding/mirroring to many interfaces.  This is at least fixable
(the Xen-style accounting through page tracking deals with it, though
it has its own problems).

* Re: Flow Control and Port Mirroring Revisited
  2011-01-06 22:38     ` Jesse Gross
@ 2011-01-07  1:23       ` Simon Horman
  2011-01-10  9:31         ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-07  1:23 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:

[ snip ]
> 
> I know that everyone likes a nice netperf result but I agree with
> Michael that this probably isn't the right question to be asking.  I
> don't think that socket buffers are a real solution to the flow
> control problem: they happen to provide that functionality but it's
> more of a side effect than anything.  It's just that the amount of
> memory consumed by packets in the queue(s) doesn't really have any
> implicit meaning for flow control (think multiple physical adapters,
> all with the same speed instead of a virtual device and a physical
> device with wildly different speeds).  The analog in the physical
> world that you're looking for would be Ethernet flow control.
> Obviously, if the question is limiting CPU or memory consumption then
> that's a different story.

Point taken. I will see if I can control CPU (and thus memory) consumption
using cgroups and/or tc.

> This patch also double counts memory, since the full size of the
> packet will be accounted for by each clone, even though they share the
> actual packet data.  Probably not too significant here but it might be
> when flooding/mirroring to many interfaces.  This is at least fixable
> (the Xen-style accounting through page tracking deals with it, though
> it has its own problems).

Agreed on all counts.



* Re: Flow Control and Port Mirroring Revisited
  2011-01-07  1:23       ` Simon Horman
@ 2011-01-10  9:31         ` Simon Horman
  2011-01-13  6:47           ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-10  9:31 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> 
> [ snip ]
> > 
> > I know that everyone likes a nice netperf result but I agree with
> > Michael that this probably isn't the right question to be asking.  I
> > don't think that socket buffers are a real solution to the flow
> > control problem: they happen to provide that functionality but it's
> > more of a side effect than anything.  It's just that the amount of
> > memory consumed by packets in the queue(s) doesn't really have any
> > implicit meaning for flow control (think multiple physical adapters,
> > all with the same speed instead of a virtual device and a physical
> > device with wildly different speeds).  The analog in the physical
> > world that you're looking for would be Ethernet flow control.
> > Obviously, if the question is limiting CPU or memory consumption then
> > that's a different story.
> 
> Point taken. I will see if I can control CPU (and thus memory) consumption
> using cgroups and/or tc.

I have found that I can successfully control the throughput using
the following techniques

1) Place a tc egress filter on dummy0 (example commands below)

2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1;
   this is effectively the same as one of my hacks to the datapath
   that I mentioned in an earlier mail. The result is that eth1
   "paces" the connection.

3) 2) + place a tc egress filter on eth1
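
Roughly the sort of commands used for 1) and 2) above (device names,
OpenFlow port numbers, rates and burst sizes are all illustrative):

# tc qdisc add dev dummy0 root handle 1: prio
# tc filter add dev dummy0 parent 1: protocol ip prio 1 u32 \
	match u32 0 0 police rate 500mbit burst 100k drop flowid :1
# ovs-ofctl add-flow br0 in_port=1,actions=output:2,output:3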

Which mostly makes sense to me, although I am a little confused about
why 1) needs a filter on dummy0 (a filter on eth1 has no effect)
but 3) needs a filter on eth1 (a filter on dummy0 has no effect,
even if the skb is sent to dummy0 last).

I also had some limited success using CPU cgroups, though obviously
that targets CPU usage and thus the effect on throughput is fairly coarse.
In short, it's a useful technique but not one that bears further
discussion here.


* Re: Flow Control and Port Mirroring Revisited
  2011-01-10  9:31         ` Simon Horman
@ 2011-01-13  6:47           ` Simon Horman
  2011-01-13 15:45             ` Jesse Gross
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-13  6:47 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > 
> > [ snip ]
> > > 
> > > I know that everyone likes a nice netperf result but I agree with
> > > Michael that this probably isn't the right question to be asking.  I
> > > don't think that socket buffers are a real solution to the flow
> > > control problem: they happen to provide that functionality but it's
> > > more of a side effect than anything.  It's just that the amount of
> > > memory consumed by packets in the queue(s) doesn't really have any
> > > implicit meaning for flow control (think multiple physical adapters,
> > > all with the same speed instead of a virtual device and a physical
> > > device with wildly different speeds).  The analog in the physical
> > > world that you're looking for would be Ethernet flow control.
> > > Obviously, if the question is limiting CPU or memory consumption then
> > > that's a different story.
> > 
> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > using cgroups and/or tc.
> 
> I have found that I can successfully control the throughput using
> the following techniques
> 
> 1) Place a tc egress filter on dummy0
> 
> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
>    this is effectively the same as one of my hacks to the datapath
>    that I mentioned in an earlier mail. The result is that eth1
>    "paces" the connection.

Further to this, I wonder if there is any interest in providing
a method to switch the action order - using ovs-ofctl is a hack imho -
and/or switching the default action order for mirroring.

> 3) 2) + place a tc egress filter on eth1
> 
> Which mostly makes sense to me although I am a little confused about
> why 1) needs a filter on dummy0 (a filter on eth1 has no effect)
> but 3) needs a filter on eth1 (a filter on dummy0 has no effect,
> even if the skb is sent to dummy0 last.
> 
> I also had some limited success using CPU cgroups, though obviously
> that targets CPU usage and thus the effect on throughput is fairly course.
> In short, its a useful technique but not one that bares further
> discussion here.
> 

* Re: Flow Control and Port Mirroring Revisited
  2011-01-13  6:47           ` Simon Horman
@ 2011-01-13 15:45             ` Jesse Gross
  2011-01-13 23:41               ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Jesse Gross @ 2011-01-13 15:45 UTC (permalink / raw)
  To: Simon Horman
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
>> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
>> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
>> >
>> > [ snip ]
>> > >
>> > > I know that everyone likes a nice netperf result but I agree with
>> > > Michael that this probably isn't the right question to be asking.  I
>> > > don't think that socket buffers are a real solution to the flow
>> > > control problem: they happen to provide that functionality but it's
>> > > more of a side effect than anything.  It's just that the amount of
>> > > memory consumed by packets in the queue(s) doesn't really have any
>> > > implicit meaning for flow control (think multiple physical adapters,
>> > > all with the same speed instead of a virtual device and a physical
>> > > device with wildly different speeds).  The analog in the physical
>> > > world that you're looking for would be Ethernet flow control.
>> > > Obviously, if the question is limiting CPU or memory consumption then
>> > > that's a different story.
>> >
>> > Point taken. I will see if I can control CPU (and thus memory) consumption
>> > using cgroups and/or tc.
>>
>> I have found that I can successfully control the throughput using
>> the following techniques
>>
>> 1) Place a tc egress filter on dummy0
>>
>> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
>>    this is effectively the same as one of my hacks to the datapath
>>    that I mentioned in an earlier mail. The result is that eth1
>>    "paces" the connection.
>
> Further to this, I wonder if there is any interest in providing
> a method to switch the action order - using ovs-ofctl is a hack imho -
> and/or switching the default action order for mirroring.

I'm not sure that there is a way to do this that is correct in the
generic case.  It's possible that the destination could be a VM while
packets are being mirrored to a physical device or we could be
multicasting or some other arbitrarily complex scenario.  Just think
of what a physical switch would do if it has ports with two different
speeds.

* Re: Flow Control and Port Mirroring Revisited
  2011-01-13 15:45             ` Jesse Gross
@ 2011-01-13 23:41               ` Simon Horman
  2011-01-14  4:58                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-13 23:41 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin

On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> >> >
> >> > [ snip ]
> >> > >
> >> > > I know that everyone likes a nice netperf result but I agree with
> >> > > Michael that this probably isn't the right question to be asking.  I
> >> > > don't think that socket buffers are a real solution to the flow
> >> > > control problem: they happen to provide that functionality but it's
> >> > > more of a side effect than anything.  It's just that the amount of
> >> > > memory consumed by packets in the queue(s) doesn't really have any
> >> > > implicit meaning for flow control (think multiple physical adapters,
> >> > > all with the same speed instead of a virtual device and a physical
> >> > > device with wildly different speeds).  The analog in the physical
> >> > > world that you're looking for would be Ethernet flow control.
> >> > > Obviously, if the question is limiting CPU or memory consumption then
> >> > > that's a different story.
> >> >
> >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> >> > using cgroups and/or tc.
> >>
> >> I have found that I can successfully control the throughput using
> >> the following techniques
> >>
> >> 1) Place a tc egress filter on dummy0
> >>
> >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> >>    this is effectively the same as one of my hacks to the datapath
> >>    that I mentioned in an earlier mail. The result is that eth1
> >>    "paces" the connection.
> >
> > Further to this, I wonder if there is any interest in providing
> > a method to switch the action order - using ovs-ofctl is a hack imho -
> > and/or switching the default action order for mirroring.
> 
> I'm not sure that there is a way to do this that is correct in the
> generic case.  It's possible that the destination could be a VM while
> packets are being mirrored to a physical device or we could be
> multicasting or some other arbitrarily complex scenario.  Just think
> of what a physical switch would do if it has ports with two different
> speeds.

Yes, I have considered that case. And I agree that perhaps there
is no sensible default. But perhaps we could make it configurable somehow?

* Re: Flow Control and Port Mirroring Revisited
  2011-01-13 23:41               ` Simon Horman
@ 2011-01-14  4:58                 ` Michael S. Tsirkin
  2011-01-14  6:35                   ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-14  4:58 UTC (permalink / raw)
  To: Simon Horman
  Cc: Jesse Gross, Eric Dumazet, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > >> >
> > >> > [ snip ]
> > >> > >
> > >> > > I know that everyone likes a nice netperf result but I agree with
> > >> > > Michael that this probably isn't the right question to be asking.  I
> > >> > > don't think that socket buffers are a real solution to the flow
> > >> > > control problem: they happen to provide that functionality but it's
> > >> > > more of a side effect than anything.  It's just that the amount of
> > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > >> > > implicit meaning for flow control (think multiple physical adapters,
> > >> > > all with the same speed instead of a virtual device and a physical
> > >> > > device with wildly different speeds).  The analog in the physical
> > >> > > world that you're looking for would be Ethernet flow control.
> > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > >> > > that's a different story.
> > >> >
> > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > >> > using cgroups and/or tc.
> > >>
> > >> I have found that I can successfully control the throughput using
> > >> the following techniques
> > >>
> > >> 1) Place a tc egress filter on dummy0
> > >>
> > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > >>    this is effectively the same as one of my hacks to the datapath
> > >>    that I mentioned in an earlier mail. The result is that eth1
> > >>    "paces" the connection.

This is actually a bug. It means that one slow connection will
affect fast ones. I intend to change the default for qemu to sndbuf=0:
this will fix it but break your "pacing". So please do not count on this behaviour.
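
(For reference, sndbuf is an option of the tap backend, along the lines of:

	qemu -netdev tap,id=net0,ifname=tap0,vhost=on,sndbuf=0 \
	     -device virtio-net-pci,netdev=net0 ...

the exact spelling depends on the qemu version and how the tap is created.)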

> > > Further to this, I wonder if there is any interest in providing
> > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > and/or switching the default action order for mirroring.
> > 
> > I'm not sure that there is a way to do this that is correct in the
> > generic case.  It's possible that the destination could be a VM while
> > packets are being mirrored to a physical device or we could be
> > multicasting or some other arbitrarily complex scenario.  Just think
> > of what a physical switch would do if it has ports with two different
> > speeds.
> 
> Yes, I have considered that case. And I agree that perhaps there
> is no sensible default. But perhaps we could make it configurable somehow?

The fix is at the application level. Run netperf with -b and -w flags to
limit the speed to a sensible value.

-- 
MST

* Re: Flow Control and Port Mirroring Revisited
  2011-01-14  4:58                 ` Michael S. Tsirkin
@ 2011-01-14  6:35                   ` Simon Horman
  2011-01-14  6:54                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-14  6:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jesse Gross, Eric Dumazet, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > >> >
> > > >> > [ snip ]
> > > >> > >
> > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > >> > > don't think that socket buffers are a real solution to the flow
> > > >> > > control problem: they happen to provide that functionality but it's
> > > >> > > more of a side effect than anything.  It's just that the amount of
> > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > >> > > all with the same speed instead of a virtual device and a physical
> > > >> > > device with wildly different speeds).  The analog in the physical
> > > >> > > world that you're looking for would be Ethernet flow control.
> > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > >> > > that's a different story.
> > > >> >
> > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > >> > using cgroups and/or tc.
> > > >>
> > > >> I have found that I can successfully control the throughput using
> > > >> the following techniques
> > > >>
> > > >> 1) Place a tc egress filter on dummy0
> > > >>
> > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > >>    this is effectively the same as one of my hacks to the datapath
> > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > >>    "paces" the connection.
> 
> This is actually a bug. This means that one slow connection will affect
> fast ones. I intend to change the default for qemu to sndbuf=0 : this
> will fix it but break your "pacing". So pls do not count on this
> behaviour.

Do you have a patch I could test?

> > > > Further to this, I wonder if there is any interest in providing
> > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > and/or switching the default action order for mirroring.
> > > 
> > > I'm not sure that there is a way to do this that is correct in the
> > > generic case.  It's possible that the destination could be a VM while
> > > packets are being mirrored to a physical device or we could be
> > > multicasting or some other arbitrarily complex scenario.  Just think
> > > of what a physical switch would do if it has ports with two different
> > > speeds.
> > 
> > Yes, I have considered that case. And I agree that perhaps there
> > is no sensible default. But perhaps we could make it configurable somehow?
> 
> The fix is at the application level. Run netperf with -b and -w flags to
> limit the speed to a sensible value.

Perhaps I should have stated my goals more clearly.
I'm interested in situations where I don't control the application.


* Re: Flow Control and Port Mirroring Revisited
  2011-01-14  6:35                   ` Simon Horman
@ 2011-01-14  6:54                     ` Michael S. Tsirkin
  2011-01-16 22:37                       ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-14  6:54 UTC (permalink / raw)
  To: Simon Horman
  Cc: Jesse Gross, Eric Dumazet, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > >> >
> > > > >> > [ snip ]
> > > > >> > >
> > > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > > >> > > don't think that socket buffers are a real solution to the flow
> > > > >> > > control problem: they happen to provide that functionality but it's
> > > > >> > > more of a side effect than anything.  It's just that the amount of
> > > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > > >> > > all with the same speed instead of a virtual device and a physical
> > > > >> > > device with wildly different speeds).  The analog in the physical
> > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > > >> > > that's a different story.
> > > > >> >
> > > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > > >> > using cgroups and/or tc.
> > > > >>
> > > > >> I have found that I can successfully control the throughput using
> > > > >> the following techniques
> > > > >>
> > > > >> 1) Place a tc egress filter on dummy0
> > > > >>
> > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > > >>    this is effectively the same as one of my hacks to the datapath
> > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > >>    "paces" the connection.
> > 
> > This is actually a bug. This means that one slow connection will affect
> > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > will fix it but break your "pacing". So pls do not count on this
> > behaviour.
> 
> Do you have a patch I could test?

You can (and users already can) just run qemu with sndbuf=0. But if you
like, below.
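
For reference, sndbuf is an option of the tap backend, so something along
these lines (the other tap options here are just placeholders for whatever
you normally pass) already gives you sndbuf=0 with an unpatched qemu:

qemu -net nic,model=virtio -net tap,ifname=tap0,script=no,sndbuf=0 ...

sndbuf=0 effectively removes the send buffer limit on the tap device.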

> > > > > Further to this, I wonder if there is any interest in providing
> > > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > > and/or switching the default action order for mirroring.
> > > > 
> > > > I'm not sure that there is a way to do this that is correct in the
> > > > generic case.  It's possible that the destination could be a VM while
> > > > packets are being mirrored to a physical device or we could be
> > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > of what a physical switch would do if it has ports with two different
> > > > speeds.
> > > 
> > > Yes, I have considered that case. And I agree that perhaps there
> > > is no sensible default. But perhaps we could make it configurable somehow?
> > 
> > The fix is at the application level. Run netperf with -b and -w flags to
> > limit the speed to a sensible value.
> 
> Perhaps I should have stated my goals more clearly.
> I'm interested in situations where I don't control the application.

Well an application that streams UDP without any throttling
at the application level will break on a physical network, right?
So I am not sure why one should try to make it work on the virtual one.

But let's assume that you do want to throttle the guest
for reasons such as QOS. The proper approach seems
to be to throttle the sender, not have a dummy throttled
receiver "pacing" it. Place the qemu process in the
correct net_cls cgroup, set the class id and apply a rate limit?
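
As a rough sketch of what I mean (mount point, group name, device and rate
are all just examples, and $QEMU_PID stands for the guest's qemu process):

# put the guest's qemu threads in a net_cls group with class id 1:1
mkdir /sys/fs/cgroup/net_cls/guest0
echo 0x00010001 > /sys/fs/cgroup/net_cls/guest0/net_cls.classid
for tid in $(ls /proc/$QEMU_PID/task); do
    echo $tid > /sys/fs/cgroup/net_cls/guest0/tasks
done

# rate limit whatever gets classified as 1:1 on the physical interface
tc qdisc add dev eth1 root handle 1: htb
tc class add dev eth1 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev eth1 parent 1: protocol ip prio 10 handle 1: cgroup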


---

diff --git a/net/tap-linux.c b/net/tap-linux.c
index f7aa904..0dbcdd4 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -87,7 +87,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
  * a good default, given a 1500 byte MTU.
  */
-#define TAP_DEFAULT_SNDBUF 1024*1024
+#define TAP_DEFAULT_SNDBUF 0
 
 int tap_set_sndbuf(int fd, QemuOpts *opts)
 {

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-14  6:54                     ` Michael S. Tsirkin
@ 2011-01-16 22:37                       ` Simon Horman
  2011-01-16 23:56                         ` Rusty Russell
  2011-01-17 10:26                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-16 22:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jesse Gross, Eric Dumazet, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> > On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > > >> >
> > > > > >> > [ snip ]
> > > > > >> > >
> > > > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > > > >> > > don't think that socket buffers are a real solution to the flow
> > > > > >> > > control problem: they happen to provide that functionality but it's
> > > > > >> > > more of a side effect than anything.  It's just that the amount of
> > > > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > > > >> > > all with the same speed instead of a virtual device and a physical
> > > > > >> > > device with wildly different speeds).  The analog in the physical
> > > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > > > >> > > that's a different story.
> > > > > >> >
> > > > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > > > >> > using cgroups and/or tc.
> > > > > >>
> > > > > >> I have found that I can successfully control the throughput using
> > > > > >> the following techniques
> > > > > >>
> > > > > >> 1) Place a tc egress filter on dummy0
> > > > > >>
> > > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > > > >>    this is effectively the same as one of my hacks to the datapath
> > > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > > >>    "paces" the connection.
> > > 
> > > This is actually a bug. This means that one slow connection will affect
> > > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > > will fix it but break your "pacing". So pls do not count on this
> > > behaviour.
> > 
> > Do you have a patch I could test?
> 
> You can (and users already can) just run qemu with sndbuf=0. But if you
> like, below.

Thanks

> > > > > > Further to this, I wonder if there is any interest in providing
> > > > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > > > and/or switching the default action order for mirroring.
> > > > > 
> > > > > I'm not sure that there is a way to do this that is correct in the
> > > > > generic case.  It's possible that the destination could be a VM while
> > > > > packets are being mirrored to a physical device or we could be
> > > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > > of what a physical switch would do if it has ports with two different
> > > > > speeds.
> > > > 
> > > > Yes, I have considered that case. And I agree that perhaps there
> > > > is no sensible default. But perhaps we could make it configurable somehow?
> > > 
> > > The fix is at the application level. Run netperf with -b and -w flags to
> > > limit the speed to a sensible value.
> > 
> > Perhaps I should have stated my goals more clearly.
> > I'm interested in situations where I don't control the application.
> 
> Well an application that streams UDP without any throttling
> at the application level will break on a physical network, right?
> So I am not sure why one should try to make it work on the virtual one.
> 
> But let's assume that you do want to throttle the guest
> for reasons such as QOS. The proper approach seems
> to be to throttle the sender, not have a dummy throttled
> receiver "pacing" it. Place the qemu process in the
> correct net_cls cgroup, set the class id and apply a rate limit?

I would like to be able to use a class to rate limit egress packets.
That much works fine for me.

What I would also like is for there to be back-pressure such that the guest
doesn't consume lots of CPU, spinning, sending packets as fast as it can,
almost all of which are dropped. That does seem like a lot of wasted
CPU to me.

Unfortunately there are several problems with this and I am fast concluding
that I will need to use a CPU cgroup. Which does make some sense, as what I
am really trying to limit here is CPU usage, not network packet rates - even
if the test using the CPU is netperf.  So long as the CPU usage can
(mostly) be attributed to the guest, using a cgroup should work fine.  And
indeed it seems to in my limited testing.
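
For concreteness, the sort of setup I mean is along these lines - the mount
point, group name and share value are illustrative, and $QEMU_PID stands for
the guest's qemu process:

mkdir /sys/fs/cgroup/cpu/guest0
echo 128 > /sys/fs/cgroup/cpu/guest0/cpu.shares        # relative CPU weight
for tid in $(ls /proc/$QEMU_PID/task); do              # all qemu threads
    echo $tid > /sys/fs/cgroup/cpu/guest0/tasks
done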

One scenario in which I don't think it is possible for there to be
back-pressure in a meaningful sense is if root in the guest sets
/proc/sys/net/core/wmem_default to a large value, say 2000000.
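
That is, nothing more exotic than root in the guest running something like:

sysctl -w net.core.wmem_default=2000000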


I do think that to some extent there is back-pressure provided by sockbuf
in the case where a process on the host is sending directly to a physical
interface.  And to my mind it would be "nice" if the same kind of
back-pressure was present in guests.  But through our discussions of the
past week or so I get the feeling that is not your view of things.

Perhaps I could characterise the guest situation by saying:
	Egress packet rates can be controlled using tc on the host;
	Guest CPU usage can be controlled using CPU cgroups on the host;
	Sockbuf controls memory usage on the host;
	Back-pressure is irrelevant.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-16 22:37                       ` Simon Horman
@ 2011-01-16 23:56                         ` Rusty Russell
  2011-01-17 10:38                           ` Michael S. Tsirkin
  2011-01-17 10:26                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 40+ messages in thread
From: Rusty Russell @ 2011-01-16 23:56 UTC (permalink / raw)
  To: Simon Horman
  Cc: Michael S. Tsirkin, Jesse Gross, Eric Dumazet, virtualization,
	dev, virtualization, netdev, kvm

On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:

[snip]

I've been away, but what concerns me is that socket buffer limits are
bypassed in various configurations, due to skb cloning.  We should probably
drop such limits altogether, or fix them to be consistent.

Simple fix is as someone suggested here, to attach the clone.  That might
seriously reduce your sk limit, though.  I haven't thought about it hard,
but might it make sense to move ownership into skb_shared_info; ie. the
data, rather than the skb head?

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-16 22:37                       ` Simon Horman
  2011-01-16 23:56                         ` Rusty Russell
@ 2011-01-17 10:26                         ` Michael S. Tsirkin
  2011-01-18 19:41                           ` Rick Jones
  1 sibling, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-17 10:26 UTC (permalink / raw)
  To: Simon Horman
  Cc: Jesse Gross, Eric Dumazet, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Mon, Jan 17, 2011 at 07:37:30AM +0900, Simon Horman wrote:
> On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> > > On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > > > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > > > >> >
> > > > > > >> > [ snip ]
> > > > > > >> > >
> > > > > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > > > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > > > > >> > > don't think that socket buffers are a real solution to the flow
> > > > > > >> > > control problem: they happen to provide that functionality but it's
> > > > > > >> > > more of a side effect than anything.  It's just that the amount of
> > > > > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > > > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > > > > >> > > all with the same speed instead of a virtual device and a physical
> > > > > > >> > > device with wildly different speeds).  The analog in the physical
> > > > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > > > > >> > > that's a different story.
> > > > > > >> >
> > > > > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > > > > >> > using cgroups and/or tc.
> > > > > > >>
> > > > > > >> I have found that I can successfully control the throughput using
> > > > > > >> the following techniques
> > > > > > >>
> > > > > > >> 1) Place a tc egress filter on dummy0
> > > > > > >>
> > > > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > > > > >>    this is effectively the same as one of my hacks to the datapath
> > > > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > > > >>    "paces" the connection.
> > > > 
> > > > This is actually a bug. This means that one slow connection will affect
> > > > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > > > will fix it but break your "pacing". So pls do not count on this
> > > > behaviour.
> > > 
> > > Do you have a patch I could test?
> > 
> > You can (and users already can) just run qemu with sndbuf=0. But if you
> > like, below.
> 
> Thanks
> 
> > > > > > > Further to this, I wonder if there is any interest in providing
> > > > > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > > > > and/or switching the default action order for mirroring.
> > > > > > 
> > > > > > I'm not sure that there is a way to do this that is correct in the
> > > > > > generic case.  It's possible that the destination could be a VM while
> > > > > > packets are being mirrored to a physical device or we could be
> > > > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > > > of what a physical switch would do if it has ports with two different
> > > > > > speeds.
> > > > > 
> > > > > Yes, I have considered that case. And I agree that perhaps there
> > > > > is no sensible default. But perhaps we could make it configurable somehow?
> > > > 
> > > > The fix is at the application level. Run netperf with -b and -w flags to
> > > > limit the speed to a sensible value.
> > > 
> > > Perhaps I should have stated my goals more clearly.
> > > I'm interested in situations where I don't control the application.
> > 
> > Well an application that streams UDP without any throttling
> > at the application level will break on a physical network, right?
> > So I am not sure why one should try to make it work on the virtual one.
> > 
> > But let's assume that you do want to throttle the guest
> > for reasons such as QOS. The proper approach seems
> > to be to throttle the sender, not have a dummy throttled
> > receiver "pacing" it. Place the qemu process in the
> > correct net_cls cgroup, set the class id and apply a rate limit?
> 
> I would like to be able to use a class to rate limit egress packets.
> That much works fine for me.
> 
> What I would also like is for there to be back-pressure such that the guest
> doesn't consume lots of CPU, spinning, sending packets as fast as it can,
> almost all of which are dropped. That does seem like a lot of wasted
> CPU to me.
> 
> Unfortunately there are several problems with this and I am fast concluding
> that I will need to use a CPU cgroup. Which does make some sense, as what I
> am really trying to limit here is CPU usage, not network packet rates - even
> if the test using the CPU is netperf.  So long as the CPU usage can
> (mostly) be attributed to the guest, using a cgroup should work fine.  And
> indeed it seems to in my limited testing.
> 
> One scenario in which I don't think it is possible for there to be
> back-pressure in a meaningful sense is if root in the guest sets
> /proc/sys/net/core/wmem_default to a large value, say 2000000.
> 
> 
> I do think that to some extent there is back-pressure provided by sockbuf
> in the case where a process on the host is sending directly to a physical
> interface.  And to my mind it would be "nice" if the same kind of
> back-pressure was present in guests.  But through our discussions of the
> past week or so I get the feeling that is not your view of things.

It might be nice. Unfortunately this is not what we have implemented:
the sockbuf backpressure blocks the socket, what we have blocks all
transmit from the guest. Another issue is that the strategy we have
seems to be broken if the target is a guest on another machine.

So it won't be all that simple to implement well, and before we try,
I'd like to know whether there are applications that are helped
by it. For example, we could try to measure latency at various
pps and see whether the backpressure helps. netperf has -b, -w
flags which might help these measurements.

> Perhaps I could characterise the guest situation by saying:
> 	Egress packet rates can be controlled using tc on the host;
> 	Guest CPU usage can be controlled using CPU cgroups on the host;
> 	Sockbuf controls memory usage on the host;

Not really, the memory usage on the host is controlled by the
various queue lengths in the host. E.g. if you send packets to
the physical device, they will get queued there.

> 	Back-pressure is irrelevant.

Or at least, broken :)

-- 
MST

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-16 23:56                         ` Rusty Russell
@ 2011-01-17 10:38                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-17 10:38 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Simon Horman, Jesse Gross, Eric Dumazet, virtualization, dev,
	virtualization, netdev, kvm

On Mon, Jan 17, 2011 at 10:26:25AM +1030, Rusty Russell wrote:
> On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:
> 
> [snip]
> 
> I've been away, but what concerns me is that socket buffer limits are
> bypassed in various configurations, due to skb cloning.  We should probably
> drop such limits altogether, or fix them to be consistent.

Further, it looks like when the limits are not bypassed, they
easily result in deadlocks. For example, with
multiple tun devices attached to a single bridge in host,
if a number of these have their queues blocked,
others will reach the socket buffer limit and
traffic on the bridge will get blocked altogether.

It might be better to drop the limits altogether
unless we can fix them. Happily, as the limits are off by
default, doing so does not require kernel changes.

> Simple fix is as someone suggested here, to attach the clone.  That might
> seriously reduce your sk limit, though.  I haven't thought about it hard,
> but might it make sense to move ownership into skb_shared_info; ie. the
> data, rather than the skb head?
> 
> Cheers,
> Rusty.

tracking data ownership might benefit others such as various zero-copy
strategies. It might need to be done per-page, though, not per-skb.

-- 
MST

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-17 10:26                         ` Michael S. Tsirkin
@ 2011-01-18 19:41                           ` Rick Jones
  2011-01-18 20:13                             ` Michael S. Tsirkin
  2011-01-20  8:38                             ` Simon Horman
  0 siblings, 2 replies; 40+ messages in thread
From: Rick Jones @ 2011-01-18 19:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Horman, Jesse Gross, Eric Dumazet, Rusty Russell,
	virtualization, dev, virtualization, netdev, kvm

> So it won't be all that simple to implement well, and before we try,
> I'd like to know whether there are applications that are helped
> by it. For example, we could try to measure latency at various
> pps and see whether the backpressure helps. netperf has -b, -w
> flags which might help these measurements.

Those options are enabled when one adds --enable-burst to the pre-compilation 
./configure  of netperf (one doesn't have to recompile netserver).  However, if 
one is also looking at latency statistics via the -j option in the top-of-trunk, 
or simply at the histogram with --enable-histogram on the ./configure and a 
verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf 
from:

http://www.netperf.org/svn/netperf2/trunk

to get the recently added support for accurate (netperf level) RTT measurements
on burst-mode request/response tests.
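
In other words, roughly this (the checkout directory is of course up to you):

svn checkout http://www.netperf.org/svn/netperf2/trunk netperf2_trunk
cd netperf2_trunk
./configure --enable-omni --enable-burst --enable-histogram
make                # netperf and netserver are then built under src/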

happy benchmarking,

rick jones

PS - the enhanced latency statistics from -j are only available in the "omni" 
version of the TCP_RR test.  To get that add a --enable-omni to the ./configure 
- and in this case both netperf and netserver have to be recompiled.  For very 
basic output one can peruse the output of:

src/netperf -t omni -- -O /?

and then pick those outputs of interest and put them into an output selection 
file which one then passes to either (test-specific) -o, -O or -k to get CSV,
"Human" or keyval output respectively.  E.G.

raj@tardy:~/netperf2_trunk$ cat foo
THROUGHPUT,THROUGHPUT_UNITS
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY

when foo is passed to -o one will get those all on one line of CSV.  To -O one 
gets three lines of more netperf-classic-like "human" readable output, and when 
one passes that to -k one gets a string of keyval output a la:

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 
AF_INET : histogram
THROUGHPUT=29454.12
THROUGHPUT_UNITS=Trans/s
RT_LATENCY=33.951
MIN_LATENCY=19
MEAN_LATENCY=32.00
MAX_LATENCY=126
P50_LATENCY=32
P90_LATENCY=38
P99_LATENCY=41
STDDEV_LATENCY=5.46

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0: 3553: 45244: 237790: 7859:   86:   10:    3:    0:    0
HUNDRED_USEC  :    0:    2:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_MSEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
 >100_SECS: 0
HIST_TOTAL:      294547


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-18 19:41                           ` Rick Jones
@ 2011-01-18 20:13                             ` Michael S. Tsirkin
  2011-01-18 21:28                               ` Rick Jones
  2011-01-19  9:11                               ` Simon Horman
  2011-01-20  8:38                             ` Simon Horman
  1 sibling, 2 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-18 20:13 UTC (permalink / raw)
  To: Rick Jones
  Cc: Simon Horman, Jesse Gross, Eric Dumazet, Rusty Russell,
	virtualization, dev, virtualization, netdev, kvm

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> >So it won't be all that simple to implement well, and before we try,
> >I'd like to know whether there are applications that are helped
> >by it. For example, we could try to measure latency at various
> >pps and see whether the backpressure helps. netperf has -b, -w
> >flags which might help these measurements.
> 
> Those options are enabled when one adds --enable-burst to the
> pre-compilation ./configure  of netperf (one doesn't have to
> recompile netserver).  However, if one is also looking at latency
> statistics via the -j option in the top-of-trunk, or simply at the
> histogram with --enable-histogram on the ./configure and a verbosity
> level of 2 (global -v 2) then one wants the very top of trunk
> netperf from:
> 
> http://www.netperf.org/svn/netperf2/trunk
> 
> to get the recently added support for accurate (netperf level) RTT
> measurements on burst-mode request/response tests.
> 
> happy benchmarking,
> 
> rick jones
> 
> PS - the enhanced latency statistics from -j are only available in
> the "omni" version of the TCP_RR test.  To get that add a
> --enable-omni to the ./configure - and in this case both netperf and
> netserver have to be recompiled.


Is this TCP only? I would love to get latency data from UDP as well.

>  For very basic output one can
> peruse the output of:
> 
> src/netperf -t omni -- -O /?
> 
> and then pick those outputs of interest and put them into an output
> selection file which one then passes to either (test-specific) -o,
> -O or -k to get CSV, "Human" or keyval output respectively.  E.G.
> 
> raj@tardy:~/netperf2_trunk$ cat foo
> THROUGHPUT,THROUGHPUT_UNITS
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> 
> when foo is passed to -o one will get those all on one line of CSV.
> To -O one gets three lines of more netperf-classic-like "human"
> readable output, and when one passes that to -k one gets a string of
> keyval output a la:
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
> OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost
> (127.0.0.1) port 0 AF_INET : histogram
> THROUGHPUT=29454.12
> THROUGHPUT_UNITS=Trans/s
> RT_LATENCY=33.951
> MIN_LATENCY=19
> MEAN_LATENCY=32.00
> MAX_LATENCY=126
> P50_LATENCY=32
> P90_LATENCY=38
> P99_LATENCY=41
> STDDEV_LATENCY=5.46
> 
> Histogram of request/response times
> UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> TEN_USEC      :    0: 3553: 45244: 237790: 7859:   86:   10:    3:    0:    0
> HUNDRED_USEC  :    0:    2:    0:    0:    0:    0:    0:    0:    0:    0
> UNIT_MSEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> >100_SECS: 0
> HIST_TOTAL:      294547

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-18 20:13                             ` Michael S. Tsirkin
@ 2011-01-18 21:28                               ` Rick Jones
  2011-01-19  9:11                               ` Simon Horman
  1 sibling, 0 replies; 40+ messages in thread
From: Rick Jones @ 2011-01-18 21:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Horman, Jesse Gross, Eric Dumazet, Rusty Russell,
	virtualization, dev, virtualization, netdev, kvm

Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> 
>>PS - the enhanced latency statistics from -j are only available in
>>the "omni" version of the TCP_RR test.  To get that add a
>>--enable-omni to the ./configure - and in this case both netperf and
>>netserver have to be recompiled.
> 
> Is this TCP only? I would love to get latency data from UDP as well.

I believe it will work with UDP request response as well.  The omni test code 
strives to be protocol agnostic.  (I'm sure there are bugs of course, there 
always are.)

There is though the added complication of there being no specific matching of 
requests to responses.  The code as written takes advantage of TCP's in-order 
semantics and recovery from packet loss.  In a "plain" UDP_RR test, with one at 
a time transactions, if either the request or response are lost, data flow 
effectively stops there until the timer expires.  So, one has "reasonable" RTT 
numbers from before that point.  In a burst UDP RR test, the code doesn't know 
which request/response was lost and so the matching being done to get RTTs will 
be off by each lost datagram.  And if something were re-ordered the timestamps
would be off even without a datagram loss event.

To "fix" that would require netperf do something it has not yet done in 18-odd 
years :)  That is actually echo something back from the netserver on the RR test 
- either an id, or a timestamp.  That means "dirtying" the buffers which means 
still more cache misses, from places other than the actual stack. Not beyond the 
realm of the possible, but it would be a bit of departure for "normal" operation 
(*) and could enforce a minimum request/response size beyond the present single 
byte (ok, perhaps only two or four bytes :).  But that, perhaps, is a discussion 
best left to netperf-talk at netperf.org.

happy benchmarking,

rick jones

(*) netperf does have the concept of reading from and/or dirtying buffers, 
put in back in the days of COW/page-remapping in HP-UX 9.0, but that was mainly
to force COW and/or show the effect of the required data cache purges/flushes. 
As such it was made conditional on DIRTY being defined.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-18 20:13                             ` Michael S. Tsirkin
  2011-01-18 21:28                               ` Rick Jones
@ 2011-01-19  9:11                               ` Simon Horman
  1 sibling, 0 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-19  9:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rick Jones, Jesse Gross, Eric Dumazet, Rusty Russell,
	virtualization, dev, virtualization, netdev, kvm

On Tue, Jan 18, 2011 at 10:13:33PM +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > >So it won't be all that simple to implement well, and before we try,
> > >I'd like to know whether there are applications that are helped
> > >by it. For example, we could try to measure latency at various
> > >pps and see whether the backpressure helps. netperf has -b, -w
> > >flags which might help these measurements.
> > 
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure  of netperf (one doesn't have to
> > recompile netserver).  However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2) then one wants the very top of trunk
> > netperf from:
> > 
> > http://www.netperf.org/svn/netperf2/trunk
> > 
> > to get the recently added support for accurate (netperf level) RTT
> > measurements on burst-mode request/response tests.
> > 
> > happy benchmarking,
> > 
> > rick jones

Thanks Rick, that is really helpful.

> > PS - the enhanced latency statistics from -j are only available in
> > the "omni" version of the TCP_RR test.  To get that add a
> > --enable-omni to the ./configure - and in this case both netperf and
> > netserver have to be recompiled.
> 
> 
> Is this TCP only? I would love to get latency data from UDP as well.

At a glance, -- -T UDP is what you are after.
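
That is (untested, just combining that with Rick's earlier example),
something like:

src/netperf -t omni -j -v 2 -- -T UDP -d rr -r 1 -k foo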

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-18 19:41                           ` Rick Jones
  2011-01-18 20:13                             ` Michael S. Tsirkin
@ 2011-01-20  8:38                             ` Simon Horman
  2011-01-21  2:30                               ` Rick Jones
  2011-01-21  9:59                               ` Michael S. Tsirkin
  1 sibling, 2 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-20  8:38 UTC (permalink / raw)
  To: Rick Jones
  Cc: Michael S. Tsirkin, Jesse Gross, Rusty Russell, virtualization,
	dev, virtualization, netdev, kvm

[ Trimmed Eric from CC list as vger was complaining that it is too long ]

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> >So it won't be all that simple to implement well, and before we try,
> >I'd like to know whether there are applications that are helped
> >by it. For example, we could try to measure latency at various
> >pps and see whether the backpressure helps. netperf has -b, -w
> >flags which might help these measurements.
> 
> Those options are enabled when one adds --enable-burst to the
> pre-compilation ./configure  of netperf (one doesn't have to
> recompile netserver).  However, if one is also looking at latency
> statistics via the -j option in the top-of-trunk, or simply at the
> histogram with --enable-histogram on the ./configure and a verbosity
> level of 2 (global -v 2) then one wants the very top of trunk
> netperf from:

Hi,

I have constructed a test where I run an un-paced  UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
Briefly I get the following results from the omni test..

1. Omni test only:		MEAN_LATENCY=272.00
2. Omni and stream test:	MEAN_LATENCY=3423.00
3. cpu and net_cls group:	MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest
   and guest tasks added to the groups
4. 100Mbit/s class:		MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100MBit/s HTB class
5. cpu.shares=128:		MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUS:			MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop
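   (i.e. something like "while true; do true; done &", one instance per CPU)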

There is a bit of noise in the results as the two netperf invocations
aren't started at exactly the same moment

For reference, my netperf invocations are:
netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

foo contains
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-20  8:38                             ` Simon Horman
@ 2011-01-21  2:30                               ` Rick Jones
  2011-01-21  9:59                               ` Michael S. Tsirkin
  1 sibling, 0 replies; 40+ messages in thread
From: Rick Jones @ 2011-01-21  2:30 UTC (permalink / raw)
  To: Simon Horman
  Cc: Michael S. Tsirkin, Jesse Gross, Rusty Russell, virtualization,
	dev, virtualization, netdev, kvm

Simon Horman wrote:
> [ Trimmed Eric from CC list as vger was complaining that it is too long ]
>...
> I have constructed a test where I run an un-paced  UDP_STREAM test in
> one guest and a paced omni rr test in another guest at the same time.
> Briefly I get the following results from the omni test..
> 
>...
 >
> There is a bit of noise in the results as the two netperf invocations
> aren't started at exactly the same moment
> 
> For reference, my netperf invocations are:
> netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
> netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

Since the -b and -w are in the test-specific portion, this test was not actually 
  paced. The -w will have been ignored entirely (IIRC) and the -b will have 
attempted to set the "burst" size of a --enable-burst ./configured netperf.  If 
netperf was ./configured that way, it will have had two rr transactions in 
flight at one time - the "regular" one and then the one additional from the -b 
option.  If netperf was not ./configured with --enable-burst then a warning 
message should have been emitted.

Also, I am guessing you wanted TCP_NODELAY set, and that is -D but not a global 
-D.  I'm reasonably confident the -m 200 will have been ignored, but it would be 
best to drop it. So, I think your second line needs to be:

netperf.omni -p 12866 -c -C -H 172.17.60.216 -t omni -j -v 2 -b 1 -w 200 -- -r 1 -d rr -k foo -D

If you want the request and response sizes to be 200 bytes, use -r 200 
(test-specific).

Also, if you ./configure with --enable-omni first, that netserver will 
understand both omni and non-omni tests at the same time and you don't have to 
have a second netserver on a different control port.  You can also go-in to 
config.h after the ./configure and unset WANT_MIGRATION and then UDP_STREAM in 
netperf will be the "true" classic UDP_STREAM code rather than the migrated to 
omni path.

> foo contains
> PROTOCOL
> THROUGHPUT,THROUGHPUT_UNITS
> LOCAL_SEND_THROUGHPUT
> LOCAL_RECV_THROUGHPUT
> REMOTE_SEND_THROUGHPUT
> REMOTE_RECV_THROUGHPUT
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> LOCAL_CPU_UTIL,REMOTE_CPU_UTIL

As the -k file parsing option didn't care until recently (within the hour or 
so), I think it didn't matter that you had more than four lines (assuming that 
is a verbatim cat of foo).  However, if you pull the *current* top of trunk, it 
will probably start to care - I'm in the midst of adding support for "direct 
output selection" in the -k, -o and -O options and also cleaning-up the omni 
printing code to the point where there is only the one routine parsing the
output selection file.  Currently that is the one for "human" output, which has 
a four line restriction.  I will try to make it smarter as I go.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-20  8:38                             ` Simon Horman
  2011-01-21  2:30                               ` Rick Jones
@ 2011-01-21  9:59                               ` Michael S. Tsirkin
  2011-01-21 18:04                                 ` Rick Jones
  2011-01-21 23:11                                 ` Simon Horman
  1 sibling, 2 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-21  9:59 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> 
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > >So it won't be all that simple to implement well, and before we try,
> > >I'd like to know whether there are applications that are helped
> > >by it. For example, we could try to measure latency at various
> > >pps and see whether the backpressure helps. netperf has -b, -w
> > >flags which might help these measurements.
> > 
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure  of netperf (one doesn't have to
> > recompile netserver).  However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2) then one wants the very top of trunk
> > netperf from:
> 
> Hi,
> 
> I have constructed a test where I run an un-paced  UDP_STREAM test in
> one guest and a paced omni rr test in another guest at the same time.

Hmm, what is this supposed to measure?  Basically each time you run an
un-paced UDP_STREAM you get some random load on the network.
You can't tell what it was exactly, only that it was between
the send and receive throughput.

> Briefly I get the following results from the omni test..
> 
> 1. Omni test only:		MEAN_LATENCY=272.00
> 2. Omni and stream test:	MEAN_LATENCY=3423.00
> 3. cpu and net_cls group:	MEAN_LATENCY=493.00
>    As per 2 plus cgroups are created for each guest
>    and guest tasks added to the groups
> 4. 100Mbit/s class:		MEAN_LATENCY=273.00
>    As per 3 plus the net_cls groups each have a 100MBit/s HTB class
> 5. cpu.shares=128:		MEAN_LATENCY=652.00
>    As per 4 plus the cpu groups have cpu.shares set to 128
> 6. Busy CPUS:			MEAN_LATENCY=15126.00
>    As per 5 but the CPUs are made busy using a simple shell while loop
> 
> There is a bit of noise in the results as the two netperf invocations
> aren't started at exactly the same moment
> 
> For reference, my netperf invocations are:
> netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
> netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200
> 
> foo contains
> PROTOCOL
> THROUGHPUT,THROUGHPUT_UNITS
> LOCAL_SEND_THROUGHPUT
> LOCAL_RECV_THROUGHPUT
> REMOTE_SEND_THROUGHPUT
> REMOTE_RECV_THROUGHPUT
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> LOCAL_CPU_UTIL,REMOTE_CPU_UTIL

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-21  9:59                               ` Michael S. Tsirkin
@ 2011-01-21 18:04                                 ` Rick Jones
  2011-01-21 23:11                                 ` Simon Horman
  1 sibling, 0 replies; 40+ messages in thread
From: Rick Jones @ 2011-01-21 18:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Horman, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

>>I have constructed a test where I run an un-paced  UDP_STREAM test in
>>one guest and a paced omni rr test in another guest at the same time.
> 
> 
> Hmm, what is this supposed to measure?  Basically each time you run an
> un-paced UDP_STREAM you get some random load on the network.

Well, if the netperf is (effectively) pinned to a given CPU, presumably it would 
be trying to generate UDP datagrams at the same rate each time.  Indeed though, 
no guarantee that rate would consistently get through each time.

But then, that is where one can use the confidence intervals options to get an
idea of how much the rate varied.

rick jones

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-21  9:59                               ` Michael S. Tsirkin
  2011-01-21 18:04                                 ` Rick Jones
@ 2011-01-21 23:11                                 ` Simon Horman
  2011-01-22 21:57                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-21 23:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > 
> > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > >So it won't be all that simple to implement well, and before we try,
> > > >I'd like to know whether there are applications that are helped
> > > >by it. For example, we could try to measure latency at various
> > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > >flags which might help these measurements.
> > > 
> > > Those options are enabled when one adds --enable-burst to the
> > > pre-compilation ./configure  of netperf (one doesn't have to
> > > recompile netserver).  However, if one is also looking at latency
> > > statistics via the -j option in the top-of-trunk, or simply at the
> > > histogram with --enable-histogram on the ./configure and a verbosity
> > > level of 2 (global -v 2) then one wants the very top of trunk
> > > netperf from:
> > 
> > Hi,
> > 
> > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > one guest and a paced omni rr test in another guest at the same time.
> 
> Hmm, what is this supposed to measure?  Basically each time you run an
> un-paced UDP_STREAM you get some random load on the network.
> You can't tell what it was exactly, only that it was between
> the send and receive throughput.

Rick mentioned in another email that I messed up my test parameters a bit,
so I will re-run the tests, incorporating his suggestions.

What I was attempting to measure was the effect of an unpaced UDP_STREAM
on the latency of more moderated traffic. Because I am interested in
what effect an abusive guest has on other guests and how that may be
mitigated.

Could you suggest some tests that you feel are more appropriate?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-21 23:11                                 ` Simon Horman
@ 2011-01-22 21:57                                   ` Michael S. Tsirkin
  2011-01-23  6:38                                     ` Simon Horman
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-22 21:57 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > > 
> > > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > > >So it won't be all that simple to implement well, and before we try,
> > > > >I'd like to know whether there are applications that are helped
> > > > >by it. For example, we could try to measure latency at various
> > > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > > >flags which might help these measurements.
> > > > 
> > > > Those options are enabled when one adds --enable-burst to the
> > > > pre-compilation ./configure  of netperf (one doesn't have to
> > > > recompile netserver).  However, if one is also looking at latency
> > > > statistics via the -j option in the top-of-trunk, or simply at the
> > > > histogram with --enable-histogram on the ./configure and a verbosity
> > > > level of 2 (global -v 2) then one wants the very top of trunk
> > > > netperf from:
> > > 
> > > Hi,
> > > 
> > > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > > one guest and a paced omni rr test in another guest at the same time.
> > 
> > Hmm, what is this supposed to measure?  Basically each time you run an
> > un-paced UDP_STREAM you get some random load on the network.
> > You can't tell what it was exactly, only that it was between
> > the send and receive throughput.
> 
> Rick mentioned in another email that I messed up my test parameters a bit,
> so I will re-run the tests, incorporating his suggestions.
> 
> What I was attempting to measure was the effect of an unpaced UDP_STREAM
> on the latency of more moderated traffic. Because I am interested in
> what effect an abusive guest has on other guests and how that may be
> mitigated.
> 
> Could you suggest some tests that you feel are more appropriate?

Yes. To rephrase my concern in these terms, besides the malicious guest
you have other software in the host (netperf) that interferes with
the traffic, and it cooperates with the malicious guest.
Right?

IMO for a malicious guest you would send
UDP packets that then get dropped by the host.

For example block netperf in host so that
it does not consume packets from the socket.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-22 21:57                                   ` Michael S. Tsirkin
@ 2011-01-23  6:38                                     ` Simon Horman
  2011-01-23 10:39                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Simon Horman @ 2011-01-23  6:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
> On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> > On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > > > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > > > 
> > > > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > > > >So it won't be all that simple to implement well, and before we try,
> > > > > >I'd like to know whether there are applications that are helped
> > > > > >by it. For example, we could try to measure latency at various
> > > > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > > > >flags which might help these measurements.
> > > > > 
> > > > > Those options are enabled when one adds --enable-burst to the
> > > > > pre-compilation ./configure  of netperf (one doesn't have to
> > > > > recompile netserver).  However, if one is also looking at latency
> > > > > statistics via the -j option in the top-of-trunk, or simply at the
> > > > > histogram with --enable-histogram on the ./configure and a verbosity
> > > > > level of 2 (global -v 2) then one wants the very top of trunk
> > > > > netperf from:
> > > > 
> > > > Hi,
> > > > 
> > > > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > > > one guest and a paced omni rr test in another guest at the same time.
> > > 
> > > Hmm, what is this supposed to measure?  Basically each time you run an
> > > un-paced UDP_STREAM you get some random load on the network.
> > > You can't tell what it was exactly, only that it was between
> > > the send and receive throughput.
> > 
> > Rick mentioned in another email that I messed up my test parameters a bit,
> > so I will re-run the tests, incorporating his suggestions.
> > 
> > What I was attempting to measure was the effect of an unpaced UDP_STREAM
> > on the latency of more moderated traffic. Because I am interested in
> > what effect an abusive guest has on other guests and how that may be
> > mitigated.
> > 
> > Could you suggest some tests that you feel are more appropriate?
> 
> > Yes. To rephrase my concern in these terms, besides the malicious guest
> > you have other software in the host (netperf) that interferes with
> the traffic, and it cooperates with the malicious guest.
> Right?

Yes, that is the scenario in this test.

> IMO for a malicious guest you would send
> UDP packets that then get dropped by the host.
> 
> For example block netperf in host so that
> it does not consume packets from the socket.

I'm more interested in rate-limiting netperf than blocking it.
But in any case, do you mean use iptables or tc based on
classification made by net_cls?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-23  6:38                                     ` Simon Horman
@ 2011-01-23 10:39                                       ` Michael S. Tsirkin
  2011-01-23 13:53                                         ` Simon Horman
  2011-01-24 18:27                                         ` Rick Jones
  0 siblings, 2 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-23 10:39 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote:
> On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
> > On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> > > On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > > > > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > > > > 
> > > > > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > > > > >So it won't be all that simple to implement well, and before we try,
> > > > > > >I'd like to know whether there are applications that are helped
> > > > > > >by it. For example, we could try to measure latency at various
> > > > > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > > > > >flags which might help these measurements.
> > > > > > 
> > > > > > Those options are enabled when one adds --enable-burst to the
> > > > > > pre-compilation ./configure  of netperf (one doesn't have to
> > > > > > recompile netserver).  However, if one is also looking at latency
> > > > > > statistics via the -j option in the top-of-trunk, or simply at the
> > > > > > histogram with --enable-histogram on the ./configure and a verbosity
> > > > > > level of 2 (global -v 2) then one wants the very top of trunk
> > > > > > netperf from:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > > > > one guest and a paced omni rr test in another guest at the same time.
> > > > 
> > > > Hmm, what is this supposed to measure?  Basically each time you run an
> > > > un-paced UDP_STREAM you get some random load on the network.
> > > > You can't tell what it was exactly, only that it was between
> > > > the send and receive throughput.
> > > 
> > > Rick mentioned in another email that I messed up my test parameters a bit,
> > > so I will re-run the tests, incorporating his suggestions.
> > > 
> > > What I was attempting to measure was the effect of an unpaced UDP_STREAM
> > > on the latency of more moderated traffic. Because I am interested in
> > > what effect an abusive guest has on other guests and how that may be
> > > mitigated.
> > > 
> > > Could you suggest some tests that you feel are more appropriate?
> > 
> > > Yes. To rephrase my concern in these terms, besides the malicious guest
> > > you have other software in the host (netperf) that interferes with
> > the traffic, and it cooperates with the malicious guest.
> > Right?
> 
> Yes, that is the scenario in this test.

Yes but I think that you want to put some controlled load on host.
Let's assume that we impove the speed somehow and now you can push more
bytes per second without loss.  Result might be a regression in your
test because you let the guest push "as much as it can" and suddenly it
can push more data through.  OTOH with packet loss the load on host is
anywhere in between send and receive throughput: there's no easy way to
measure it from netperf: the earlier some buffers overrun, the earlier
the packets get dropped and the less the load on host.

This is why I say that, to get a specific load on the host, you want to
limit the sender to a specific bandwidth and then either
- make sure the packet loss % is close to 0, or
- make sure the packet loss % is close to 100%.
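
As a rough illustration of the kind of rate-limited sender meant here,
one could use the -b/-w burst options mentioned earlier in the thread
(the destination, burst size, wait time and message size below are
arbitrary placeholders, and the exact option semantics should be checked
against the local --enable-burst netperf build):

    # paced UDP sender: small bursts with a fixed wait between them, so
    # the offered load is roughly known instead of "as fast as possible"
    netperf -H $GUEST_IP -t UDP_STREAM -b 8 -w 10 -- -m 1024

Comparing netperf's reported send and receive message counts for such a
run then tells you whether the loss at that offered load is near 0% or
near 100%.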

> > IMO for a malicious guest you would send
> > UDP packets that then get dropped by the host.
> > 
> > For example block netperf in host so that
> > it does not consume packets from the socket.
> 
> I'm more interested in rate-limiting netperf than blocking it.

Well, I mean netperf on the host.

> But in any case, do you mean use iptables or tc based on
> classification made by net_cls?

Just to block netperf you can send it SIGSTOP :)
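
A minimal sketch of that, assuming a single netserver process on the
host (PID discovery via pgrep is just one convenient way to do it):

    # freeze netserver so its socket buffer fills up and subsequent
    # datagrams are dropped at the socket
    kill -STOP $(pgrep netserver)
    # ... run the test ...
    # wake it again afterwards so it can still report its results
    kill -CONT $(pgrep netserver)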

-- 
MST

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-23 10:39                                       ` Michael S. Tsirkin
@ 2011-01-23 13:53                                         ` Simon Horman
  2011-01-24 18:27                                         ` Rick Jones
  1 sibling, 0 replies; 40+ messages in thread
From: Simon Horman @ 2011-01-23 13:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Sun, Jan 23, 2011 at 12:39:02PM +0200, Michael S. Tsirkin wrote:
> On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote:
> > On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
> > > On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> > > > On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:

[snip]

> > > > > Hmm, what is this supposed to measure?  Basically each time you run an
> > > > > un-paced UDP_STREAM you get some random load on the network.
> > > > > You can't tell what it was exactly, only that it was between
> > > > > the send and receive throughput.
> > > > 
> > > > Rick mentioned in another email that I messed up my test parameters a bit,
> > > > so I will re-run the tests, incorporating his suggestions.
> > > > 
> > > > What I was attempting to measure was the effect of an unpaced UDP_STREAM
> > > > on the latency of more moderated traffic, because I am interested in
> > > > what effect an abusive guest has on other guests and how that may be
> > > > mitigated.
> > > > 
> > > > Could you suggest some tests that you feel are more appropriate?
> > > 
> > > Yes. To rephrase my concern in these terms, besides the malicious guest
> > > you have another piece of software on the host (netperf) that interferes
> > > with the traffic, and it cooperates with the malicious guest.
> > > Right?
> > 
> > Yes, that is the scenario in this test.
> 
> Yes, but I think that you want to put some controlled load on the host.
> Let's assume that we improve the speed somehow and now you can push more
> bytes per second without loss.  The result might be a regression in your
> test, because you let the guest push "as much as it can" and suddenly it
> can push more data through.  OTOH, with packet loss the load on the host
> is anywhere in between the send and receive throughput: there's no easy
> way to measure it from netperf: the earlier some buffers overrun, the
> earlier the packets get dropped and the lower the load on the host.
> 
> This is why I say that, to get a specific load on the host, you want to
> limit the sender to a specific bandwidth and then either
> - make sure the packet loss % is close to 0, or
> - make sure the packet loss % is close to 100%.

Thanks, and sorry for being a bit slow.  I now see what you have
been getting at with regards to limiting the tests.
I will see about getting some numbers based on your suggestions.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-23 10:39                                       ` Michael S. Tsirkin
  2011-01-23 13:53                                         ` Simon Horman
@ 2011-01-24 18:27                                         ` Rick Jones
  2011-01-24 18:36                                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 40+ messages in thread
From: Rick Jones @ 2011-01-24 18:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Horman, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

> 
> Just to block netperf you can send it SIGSTOP :)
> 

Clever :)  One could, I suppose, achieve the same result by making the remote 
receive socket buffer size smaller than the UDP message size and then not worry 
about having to learn the netserver's PID to send it the SIGSTOP.  I *think* the 
semantics will be substantially the same?  Both will be drops at the socket 
buffer, albeit for different reasons.  The "too small socket buffer" version 
though doesn't require one to remember to "wake" the netserver in time to have it 
send results back to netperf without netperf tossing up an error and not 
reporting any statistics.

Also, netperf has a "no control connection" mode where you can, in effect, cause 
it to send UDP datagrams out into the void - I put it there to allow folks to 
test against the likes of the echo, discard and chargen services, but it may have 
a use here.  It requires that one specify the destination IP and port for the 
"data connection" explicitly via the test-specific options.  In that mode the 
only stats reported are those local to netperf rather than netserver.
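
From memory of the netperf trunk of that era (so the option names below
are assumptions to be checked against the manual, not a tested recipe),
an invocation would look something like:

    # hypothetical: global -N skips the control connection entirely,
    # while the test-specific -H option (and the corresponding port
    # option) name where the datagrams are sent
    netperf -N -t UDP_STREAM -- -H 192.168.0.2 -m 1024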

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-24 18:27                                         ` Rick Jones
@ 2011-01-24 18:36                                           ` Michael S. Tsirkin
  2011-01-24 19:01                                             ` Rick Jones
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-24 18:36 UTC (permalink / raw)
  To: Rick Jones
  Cc: Simon Horman, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
> >
> >Just to block netperf you can send it SIGSTOP :)
> >
> 
> Clever :)  One could I suppose achieve the same result by making the
> remote receive socket buffer size smaller than the UDP message size
> and then not worry about having to learn the netserver's PID to send
> it the SIGSTOP.  I *think* the semantics will be substantially the
> same?

If you could set it, yes. But at least Linux ignores
any value substantially smaller than 1K, and then
multiplies that by 2:

        case SO_RCVBUF:
                /* Don't error on this BSD doesn't and if you think
                   about it this is right. Otherwise apps have to
                   play 'guess the biggest size' games. RCVBUF/SNDBUF
                   are treated in BSD as hints */

                if (val > sysctl_rmem_max)
                        val = sysctl_rmem_max;
set_rcvbuf:     
                sk->sk_userlocks |= SOCK_RCVBUF_LOCK;

                /*
                 * We double it on the way in to account for
                 * "struct sk_buff" etc. overhead.   Applications
                 * assume that the SO_RCVBUF setting they make will
                 * allow that much actual data to be received on that
                 * socket.
                 *
                 * Applications are unaware that "struct sk_buff" and
                 * other overheads allocate from the receive buffer
                 * during socket buffer allocation. 
                 *
                 * And after considering the possible alternatives,
                 * returning the value we actually used in getsockopt
                 * is the most desirable behavior.
                 */ 
                if ((val * 2) < SOCK_MIN_RCVBUF)
                        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
                else
                        sk->sk_rcvbuf = val * 2;

and

/*                      
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */             
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))
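
A tiny user-space sketch of what that clamp looks like from the
application's point of view (not from this thread, just an illustration;
the exact value printed depends on sizeof(struct sk_buff) for the kernel
in question):

        #include <stdio.h>
        #include <sys/socket.h>

        int main(void)
        {
                int s = socket(AF_INET, SOCK_DGRAM, 0);
                int val = 1, got;
                socklen_t len = sizeof(got);

                /* ask for a 1-byte receive buffer ... */
                setsockopt(s, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
                /* ... and see what we actually got: SOCK_MIN_RCVBUF,
                 * i.e. 2048 + sizeof(struct sk_buff), not 2 */
                getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len);
                printf("asked for %d, got %d\n", val, got);
                return 0;
        }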


>  Both will be drops at the socket buffer, albeit for
> different reasons.  The "too small socket buffer" version though
> doesn't require one to remember to "wake" the netserver in time to have
> it send results back to netperf without netperf tossing up an error
> and not reporting any statistics.
> 
> Also, netperf has a "no control connection" mode where you can, in
> effect, cause it to send UDP datagrams out into the void - I put it
> there to allow folks to test against the likes of the echo, discard and
> chargen services, but it may have a use here.  It requires that one
> specify the destination IP and port for the "data connection"
> explicitly via the test-specific options.  In that mode the only
> stats reported are those local to netperf rather than netserver.

Ah, sounds perfect.

> happy benchmarking,
> 
> rick jones


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-24 18:36                                           ` Michael S. Tsirkin
@ 2011-01-24 19:01                                             ` Rick Jones
  2011-01-24 19:42                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Rick Jones @ 2011-01-24 19:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Horman, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

Michael S. Tsirkin wrote:
> On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
> 
>>>Just to block netperf you can send it SIGSTOP :)
>>>
>>
>>Clever :)  One could I suppose achieve the same result by making the
>>remote receive socket buffer size smaller than the UDP message size
>>and then not worry about having to learn the netserver's PID to send
>>it the SIGSTOP.  I *think* the semantics will be substantially the
>>same?
> 
> 
> If you could set it, yes. But at least Linux ignores
> any value substantially smaller than 1K, and then
> multiplies that by 2:
> 
>         case SO_RCVBUF:
>                 /* Don't error on this BSD doesn't and if you think
>                    about it this is right. Otherwise apps have to
>                    play 'guess the biggest size' games. RCVBUF/SNDBUF
>                    are treated in BSD as hints */
> 
>                 if (val > sysctl_rmem_max)
>                         val = sysctl_rmem_max;
> set_rcvbuf:     
>                 sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> 
>                 /*
>                  * We double it on the way in to account for
>                  * "struct sk_buff" etc. overhead.   Applications
>                  * assume that the SO_RCVBUF setting they make will
>                  * allow that much actual data to be received on that
>                  * socket.
>                  *
>                  * Applications are unaware that "struct sk_buff" and
>                  * other overheads allocate from the receive buffer
>                  * during socket buffer allocation. 
>                  *
>                  * And after considering the possible alternatives,
>                  * returning the value we actually used in getsockopt
>                  * is the most desirable behavior.
>                  */ 
>                 if ((val * 2) < SOCK_MIN_RCVBUF)
>                         sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
>                 else
>                         sk->sk_rcvbuf = val * 2;
> 
> and
> 
> /*                      
>  * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
>  * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
>  */             
> #define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))

Pity - seems to work back on 2.6.26:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     2882334      0    2361.17
    256           10.00           0              0.00

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Still, even with that (or SIGSTOP) we don't really know where the packets were 
dropped, right?  There is no guarantee they weren't dropped before they got to 
the socket buffer.

happy benchmarking,
rick jones

PS - here is with a -S 1024 option:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     1679269      0    1375.64
   2048           10.00     1490662           1221.13

showing that there is a decent chance that many of the frames were dropped at 
the socket buffer, but not all - I suppose I could/should be checking netstat 
stats... :)
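
One rough way to do that (a guess at the relevant counters rather than
something that was run for these numbers) is to compare the UDP
statistics before and after the run:

    # the UDP "packet receive errors" counter should account for
    # datagrams that reached the socket but did not fit in its
    # receive buffer
    netstat -su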

And just a little more, only because I was curious :)

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     1869134      0     384.29
262142           10.00     1869134            384.29

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     3076363      0     632.49
    256           10.00           0              0.00


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Flow Control and Port Mirroring Revisited
  2011-01-24 19:01                                             ` Rick Jones
@ 2011-01-24 19:42                                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2011-01-24 19:42 UTC (permalink / raw)
  To: Rick Jones
  Cc: Simon Horman, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm

On Mon, Jan 24, 2011 at 11:01:45AM -0800, Rick Jones wrote:
> Michael S. Tsirkin wrote:
> >On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
> >
> >>>Just to block netperf you can send it SIGSTOP :)
> >>>
> >>
> >>Clever :)  One could I suppose achieve the same result by making the
> >>remote receive socket buffer size smaller than the UDP message size
> >>and then not worry about having to learn the netserver's PID to send
> >>it the SIGSTOP.  I *think* the semantics will be substantially the
> >>same?
> >
> >
> >If you could set it, yes. But at least Linux ignores
> >any value substantially smaller than 1K, and then
> >multiplies that by 2:
> >
> >        case SO_RCVBUF:
> >                /* Don't error on this BSD doesn't and if you think
> >                   about it this is right. Otherwise apps have to
> >                   play 'guess the biggest size' games. RCVBUF/SNDBUF
> >                   are treated in BSD as hints */
> >
> >                if (val > sysctl_rmem_max)
> >                        val = sysctl_rmem_max;
> >set_rcvbuf:
> >                sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> >
> >                /*
> >                 * We double it on the way in to account for
> >                 * "struct sk_buff" etc. overhead.   Applications
> >                 * assume that the SO_RCVBUF setting they make will
> >                 * allow that much actual data to be received on that
> >                 * socket.
> >                 *
> >                 * Applications are unaware that "struct sk_buff" and
> >                 * other overheads allocate from the receive buffer
> >                 * during socket buffer allocation.
> >                 *
> >                 * And after considering the possible alternatives,
> >                 * returning the value we actually used in getsockopt
> >                 * is the most desirable behavior.
> >                 */
> >                if ((val * 2) < SOCK_MIN_RCVBUF)
> >                        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
> >                else
> >                        sk->sk_rcvbuf = val * 2;
> >
> >and
> >
> >/*
> > * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
> > * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
> > */
> >#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))
> 
> Pity - seems to work back on 2.6.26:

Hmm, that code is there at least as far back as 2.6.12.

> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928    1024   10.00     2882334      0    2361.17
>    256           10.00           0              0.00
> 
> raj@tardy:~/netperf2_trunk$ uname -a
> Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
> 
> Still, even with that (or SIGSTOP) we don't really know where the
> packets were dropped, right?  There is no guarantee they weren't
> dropped before they got to the socket buffer.
> 
> happy benchmarking,
> rick jones

Right. Better to send to a port with no socket listening there;
that would drop the packets at an early (if not the earliest
possible) opportunity.
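
As a stand-alone sketch of that kind of sender (nothing from this
thread; the address and port are made up, and in practice the
no-control-connection netperf mode mentioned earlier in the thread does
the same job):

        /* blast fixed-size UDP datagrams at a port assumed to have no
         * listener, so the receiving host drops each one early rather
         * than in a socket buffer */
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>

        int main(void)
        {
                int s = socket(AF_INET, SOCK_DGRAM, 0);
                struct sockaddr_in dst = { 0 };
                char buf[1024] = { 0 };

                dst.sin_family = AF_INET;
                dst.sin_port = htons(54321);    /* no listener assumed */
                inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);

                for (;;)
                        sendto(s, buf, sizeof(buf), 0,
                               (struct sockaddr *)&dst, sizeof(dst));
        }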

> PS - here is with a -S 1024 option:
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928    1024   10.00     1679269      0    1375.64
>   2048           10.00     1490662           1221.13
> 
> showing that there is a decent chance that many of the frames were
> dropped at the socket buffer, but not all - I suppose I could/should
> be checking netstat stats... :)
> 
> And just a little more, only because I was curious :)
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928     257   10.00     1869134      0     384.29
> 262142           10.00     1869134            384.29
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928     257   10.00     3076363      0     632.49
>    256           10.00           0              0.00

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-01-24 19:42 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-06  9:33 Flow Control and Port Mirroring Revisited Simon Horman
2011-01-06 10:22 ` Eric Dumazet
2011-01-06 12:44   ` Simon Horman
2011-01-06 13:28     ` Eric Dumazet
2011-01-06 22:01       ` Simon Horman
2011-01-06 22:38     ` Jesse Gross
2011-01-07  1:23       ` Simon Horman
2011-01-10  9:31         ` Simon Horman
2011-01-13  6:47           ` Simon Horman
2011-01-13 15:45             ` Jesse Gross
2011-01-13 23:41               ` Simon Horman
2011-01-14  4:58                 ` Michael S. Tsirkin
2011-01-14  6:35                   ` Simon Horman
2011-01-14  6:54                     ` Michael S. Tsirkin
2011-01-16 22:37                       ` Simon Horman
2011-01-16 23:56                         ` Rusty Russell
2011-01-17 10:38                           ` Michael S. Tsirkin
2011-01-17 10:26                         ` Michael S. Tsirkin
2011-01-18 19:41                           ` Rick Jones
2011-01-18 20:13                             ` Michael S. Tsirkin
2011-01-18 21:28                               ` Rick Jones
2011-01-19  9:11                               ` Simon Horman
2011-01-20  8:38                             ` Simon Horman
2011-01-21  2:30                               ` Rick Jones
2011-01-21  9:59                               ` Michael S. Tsirkin
2011-01-21 18:04                                 ` Rick Jones
2011-01-21 23:11                                 ` Simon Horman
2011-01-22 21:57                                   ` Michael S. Tsirkin
2011-01-23  6:38                                     ` Simon Horman
2011-01-23 10:39                                       ` Michael S. Tsirkin
2011-01-23 13:53                                         ` Simon Horman
2011-01-24 18:27                                         ` Rick Jones
2011-01-24 18:36                                           ` Michael S. Tsirkin
2011-01-24 19:01                                             ` Rick Jones
2011-01-24 19:42                                               ` Michael S. Tsirkin
2011-01-06 10:27 ` Michael S. Tsirkin
2011-01-06 11:30   ` Simon Horman
2011-01-06 12:07     ` Michael S. Tsirkin
2011-01-06 12:29       ` Simon Horman
2011-01-06 12:47         ` Michael S. Tsirkin
