Regression in throughput between kvm guests over virtual bridge

* Regression in throughput between kvm guests over virtual bridge
@ 2017-09-12 17:56 Matthew Rosato
  2017-09-13  1:16 ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-12 17:56 UTC (permalink / raw)
  To: netdev, jasowang; +Cc: davem, mst

We are seeing a regression for a subset of workloads across KVM guests
over a virtual bridge between host kernel 4.12 and 4.13.  Bisecting
points to c67df11f "vhost_net: try batch dequing from skb array"

In the regressed environment, we are running 4 kvm guests, 2 running as
uperf servers and 2 running as uperf clients, all on a single host.
They are connected via a virtual bridge.  The uperf client profile looks
like:

<?xml version="1.0"?>
<profile name="TCP_STREAM">
  <group nprocs="1">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=192.168.122.103
protocol=tcp"/>
    </transaction>
    <transaction duration="300">
      <flowop type="write" options="count=16 size=30000"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>

So, 1 tcp streaming instance per client.  When upgrading the host kernel
from 4.12->4.13, we see about a 30% drop in throughput for this
scenario.  After the bisect, I further verified that reverting c67df11f
on 4.13 "fixes" the throughput for this scenario.

On the other hand, if we increase the load by upping the number of
streaming instances to 50 (nprocs="50") or even 10, we see instead a
~10% increase in throughput when upgrading host from 4.12->4.13.

So it may be the issue is specific to "light load" scenarios.  I would
expect some overhead for the batching, but 30% seems significant...  Any
thoughts on what might be happening here?
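
For anyone trying to reproduce the light-load versus heavy-load comparison, the profile above can be generated for several nprocs values rather than edited by hand. This is only a sketch: the function name build_profile and its parameters are hypothetical, and the element and attribute names are copied from the profile quoted above.

```python
import xml.etree.ElementTree as ET


def build_profile(remote_host: str, nprocs: int = 1, duration_s: int = 300) -> str:
    """Return a uperf TCP_STREAM profile shaped like the one quoted above."""
    profile = ET.Element("profile", name="TCP_STREAM")
    group = ET.SubElement(profile, "group", nprocs=str(nprocs))

    # Transaction 1: open the TCP connection to the uperf server.
    connect = ET.SubElement(group, "transaction", iterations="1")
    ET.SubElement(connect, "flowop", type="connect",
                  options=f"remotehost={remote_host} protocol=tcp")

    # Transaction 2: stream writes for the requested duration.
    stream = ET.SubElement(group, "transaction", duration=str(duration_s))
    ET.SubElement(stream, "flowop", type="write", options="count=16 size=30000")

    # Transaction 3: tear the connection down.
    teardown = ET.SubElement(group, "transaction", iterations="1")
    ET.SubElement(teardown, "flowop", type="disconnect")

    return '<?xml version="1.0"?>\n' + ET.tostring(profile, encoding="unicode")


if __name__ == "__main__":
    # 1 instance is the regressing case; 10 and 50 are the heavier loads.
    for n in (1, 10, 50):
        print(build_profile("192.168.122.103", nprocs=n))
```

Only the nprocs attribute changes between the three runs, so any difference in behaviour can be attributed to the number of streaming instances.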


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-12 17:56 Regression in throughput between kvm guests over virtual bridge Matthew Rosato
@ 2017-09-13  1:16 ` Jason Wang
  2017-09-13  8:13   ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-13  1:16 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst



On 2017-09-13 01:56, Matthew Rosato wrote:
> We are seeing a regression for a subset of workloads across KVM guests
> over a virtual bridge between host kernel 4.12 and 4.13.  Bisecting
> points to c67df11f "vhost_net: try batch dequing from skb array"
>
> In the regressed environment, we are running 4 kvm guests, 2 running as
> uperf servers and 2 running as uperf clients, all on a single host.
> They are connected via a virtual bridge.  The uperf client profile looks
> like:
>
> <?xml version="1.0"?>
> <profile name="TCP_STREAM">
>    <group nprocs="1">
>      <transaction iterations="1">
>        <flowop type="connect" options="remotehost=192.168.122.103
> protocol=tcp"/>
>      </transaction>
>      <transaction duration="300">
>        <flowop type="write" options="count=16 size=30000"/>
>      </transaction>
>      <transaction iterations="1">
>        <flowop type="disconnect"/>
>      </transaction>
>    </group>
> </profile>
>
> So, 1 tcp streaming instance per client.  When upgrading the host kernel
> from 4.12->4.13, we see about a 30% drop in throughput for this
> scenario.  After the bisect, I further verified that reverting c67df11f
> on 4.13 "fixes" the throughput for this scenario.
>
> On the other hand, if we increase the load by upping the number of
> streaming instances to 50 (nprocs="50") or even 10, we see instead a
> ~10% increase in throughput when upgrading host from 4.12->4.13.
>
> So it may be the issue is specific to "light load" scenarios.  I would
> expect some overhead for the batching, but 30% seems significant...  Any
> thoughts on what might be happening here?
>

Hi, thanks for bisecting. I will try to see if I can reproduce it.
Various factors could have an impact on stream performance. If possible,
could you collect the #pkts and average packet size during the test? And
if your guest version is above 4.12, could you please retry with
napi_tx=true?

Thanks


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-13  1:16 ` Jason Wang
@ 2017-09-13  8:13   ` Jason Wang
  2017-09-13 16:59     ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-13  8:13 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst



On 2017-09-13 09:16, Jason Wang wrote:
>
>
> On 2017-09-13 01:56, Matthew Rosato wrote:
>> We are seeing a regression for a subset of workloads across KVM guests
>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
>> points to c67df11f "vhost_net: try batch dequing from skb array"
>>
>> In the regressed environment, we are running 4 kvm guests, 2 running as
>> uperf servers and 2 running as uperf clients, all on a single host.
>> They are connected via a virtual bridge.  The uperf client profile looks
>> like:
>>
>> <?xml version="1.0"?>
>> <profile name="TCP_STREAM">
>>    <group nprocs="1">
>>      <transaction iterations="1">
>>        <flowop type="connect" options="remotehost=192.168.122.103
>> protocol=tcp"/>
>>      </transaction>
>>      <transaction duration="300">
>>        <flowop type="write" options="count=16 size=30000"/>
>>      </transaction>
>>      <transaction iterations="1">
>>        <flowop type="disconnect"/>
>>      </transaction>
>>    </group>
>> </profile>
>>
>> So, 1 tcp streaming instance per client.  When upgrading the host kernel
>> from 4.12->4.13, we see about a 30% drop in throughput for this
>> scenario.  After the bisect, I further verified that reverting c67df11f
>> on 4.13 "fixes" the throughput for this scenario.
>>
>> On the other hand, if we increase the load by upping the number of
>> streaming instances to 50 (nprocs="50") or even 10, we see instead a
>> ~10% increase in throughput when upgrading host from 4.12->4.13.
>>
>> So it may be the issue is specific to "light load" scenarios.  I would
>> expect some overhead for the batching, but 30% seems significant...  Any
>> thoughts on what might be happening here?
>>
>
> Hi, thanks for the bisecting. Will try to see if I can reproduce. 
> Various factors could have impact on stream performance. If possible, 
> could you collect the #pkts and average packet size during the test? 
> And if you guest version is above 4.12, could you please retry with 
> napi_tx=true?
>
> Thanks

Unfortunately, I could not reproduce it locally. I'm using net-next.git
as the guest. I can get ~42Gb/s on an Intel(R) Xeon(R) CPU E5-2650 0 @
2.00GHz both before and after the commit. I use 1 vcpu and 1 queue, and
manually pin the vcpu and vhost threads to separate CPUs on the host (in
the same NUMA node).
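
The pinning described above can be scripted along the following lines. This is a rough sketch, not the exact procedure used here: pin_threads is a hypothetical helper, the thread IDs in the usage note are placeholders, and checking that the chosen CPUs share a NUMA node is left to the caller. It relies only on os.sched_setaffinity, which is available on Linux.

```python
import os
from typing import Dict


def pin_threads(tid_to_cpu: Dict[int, int]) -> None:
    """Pin each thread ID to exactly one host CPU (Linux only)."""
    for tid, cpu in tid_to_cpu.items():
        # A one-element set restricts the thread to that single CPU.
        os.sched_setaffinity(tid, {cpu})


def show_affinity(tid: int) -> None:
    """Print the CPUs a thread is currently allowed to run on."""
    print(tid, sorted(os.sched_getaffinity(tid)))
```

With placeholder IDs, pin_threads({12345: 2, 12346: 3}) would put a vcpu thread on CPU 2 and its vhost thread on CPU 3. Pinning threads that belong to another user requires the appropriate privileges.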

Can you hit this regression consistently, and what are your qemu command
line and #cpus on the host? Is zerocopy enabled?

Thanks


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-13  8:13   ` Jason Wang
@ 2017-09-13 16:59     ` Matthew Rosato
  2017-09-14  4:21       ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-13 16:59 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst

On 09/13/2017 04:13 AM, Jason Wang wrote:
> 
> 
> On 2017-09-13 09:16, Jason Wang wrote:
>>
>>
>> On 2017-09-13 01:56, Matthew Rosato wrote:
>>> We are seeing a regression for a subset of workloads across KVM guests
>>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
>>> points to c67df11f "vhost_net: try batch dequing from skb array"
>>>
>>> In the regressed environment, we are running 4 kvm guests, 2 running as
>>> uperf servers and 2 running as uperf clients, all on a single host.
>>> They are connected via a virtual bridge.  The uperf client profile looks
>>> like:
>>>
>>> <?xml version="1.0"?>
>>> <profile name="TCP_STREAM">
>>>    <group nprocs="1">
>>>      <transaction iterations="1">
>>>        <flowop type="connect" options="remotehost=192.168.122.103
>>> protocol=tcp"/>
>>>      </transaction>
>>>      <transaction duration="300">
>>>        <flowop type="write" options="count=16 size=30000"/>
>>>      </transaction>
>>>      <transaction iterations="1">
>>>        <flowop type="disconnect"/>
>>>      </transaction>
>>>    </group>
>>> </profile>
>>>
>>> So, 1 tcp streaming instance per client.  When upgrading the host kernel
>>> from 4.12->4.13, we see about a 30% drop in throughput for this
>>> scenario.  After the bisect, I further verified that reverting c67df11f
>>> on 4.13 "fixes" the throughput for this scenario.
>>>
>>> On the other hand, if we increase the load by upping the number of
>>> streaming instances to 50 (nprocs="50") or even 10, we see instead a
>>> ~10% increase in throughput when upgrading host from 4.12->4.13.
>>>
>>> So it may be the issue is specific to "light load" scenarios.  I would
>>> expect some overhead for the batching, but 30% seems significant...  Any
>>> thoughts on what might be happening here?
>>>
>>
>> Hi, thanks for the bisecting. Will try to see if I can reproduce.
>> Various factors could have impact on stream performance. If possible,
>> could you collect the #pkts and average packet size during the test?
>> And if you guest version is above 4.12, could you please retry with
>> napi_tx=true?

Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 -
4.4.0-93-generic specifically).  Here's a throughput report (uperf) and
#pkts and average packet size (tcpstat) for one of the uperf clients:

host 4.12 / guest 4.4:
throughput: 29.98Gb/s
#pkts=33465571 avg packet size=33755.70

host 4.13 / guest 4.4:
throughput: 20.36Gb/s
#pkts=21233399 avg packet size=36130.69

I ran the test again using net-next.git as guest kernel, with and
without napi_tx=true.  napi_tx did not seem to have any significant
impact on throughput.  However, the guest kernel shift from
4.4->net-next improved things.  I can still see a regression between
host 4.12 and 4.13, but it's more on the order of 10-15% - another sample:

host 4.12 / guest net-next (without napi_tx):
throughput: 28.88Gb/s
#pkts=31743116 avg packet size=33779.78

host 4.13 / guest net-next (without napi_tx):
throughput: 24.34Gb/s
#pkts=25532724 avg packet size=35963.20
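
As a sanity check on the figures above, the relative drops and the throughput implied by the packet counts can be worked out in a few lines. This is a sketch under two assumptions that are not stated in the report: that the tcpstat capture covered the full 300-second write transaction, and that Gb/s means 10^9 bits per second. The helper names are made up for illustration.

```python
def pct_drop(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def implied_gbps(pkts: int, avg_size_bytes: float, duration_s: float = 300.0) -> float:
    """Throughput implied by a packet count and a mean packet size."""
    return pkts * avg_size_bytes * 8 / duration_s / 1e9


# (host kernel, guest kernel) -> (uperf Gb/s, #pkts, avg packet size),
# copied from the report above.
SAMPLES = {
    ("4.12", "4.4"): (29.98, 33465571, 33755.70),
    ("4.13", "4.4"): (20.36, 21233399, 36130.69),
    ("4.12", "net-next"): (28.88, 31743116, 33779.78),
    ("4.13", "net-next"): (24.34, 25532724, 35963.20),
}

if __name__ == "__main__":
    for guest in ("4.4", "net-next"):
        old, new = SAMPLES[("4.12", guest)], SAMPLES[("4.13", guest)]
        print(f"guest {guest}: throughput -{pct_drop(old[0], new[0]):.1f}%, "
              f"packets -{pct_drop(old[1], new[1]):.1f}%")
        for host in ("4.12", "4.13"):
            gbps, pkts, size = SAMPLES[(host, guest)]
            print(f"  host {host}: reported {gbps:.2f} Gb/s, "
                  f"implied {implied_gbps(pkts, size):.2f} Gb/s")
```

For the guest 4.4 pair the throughput drop works out to roughly 32%, in line with the "about 30%" reported at the start of the thread, and the implied throughput lands close to the uperf figure in each case if the 300-second assumption holds.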

>>
>> Thanks
> 
> Unfortunately, I could not reproduce it locally. I'm using net-next.git
> as guest. I can get ~42Gb/s on Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> for both before and after the commit. I use 1 vcpu and 1 queue, and pin
> vcpu and vhost threads into separate cpu on host manually (in same numa
> node).

The environment is quite a bit different -- I'm running in an LPAR on a
z13 (s390x).  We've seen the issue in various configurations, the
smallest thus far was a host partition w/ 40G and 20 CPUs defined (the
numbers above were gathered w/ this configuration).  Each guest has 4GB
and 4 vcpus.  No pinning / affinity configured.

> 
> Can you hit this regression constantly and what's you qemu command line

Yes, the regression seems consistent.  I can try tweaking some of the
host and guest definitions to see if it makes a difference.

The guests are instantiated from libvirt - Here's one of the resulting
qemu command lines:

/usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S
-object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes
-machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m
4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config
-nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -drive
file=/dev/disk/by-id/scsi-3600507630bffc0380000000000001803,format=raw,if=none,id=drive-virtio-disk0
-device
virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device
virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device
virtio-net-ccw,netdev=hostnet1,id=net1,mac=02:54:00:89:d4:01,devno=fe.0.00a1
-chardev pty,id=charconsole0 -device
sclpconsole,chardev=charconsole0,id=console0 -device
virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on

In the above, net0 is used for a macvtap connection (not used in the
experiment, just for a reliable ssh connection - can remove if needed).
net1 is the bridge connection used for the uperf tests.


> and #cpus on host? Is zerocopy enabled?

Host info provided above.

cat /sys/module/vhost_net/parameters/experimental_zcopytx
1
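
The same check can be scripted when comparing several hosts. A minimal sketch, assuming the usual /sys/module/<module>/parameters/<name> layout shown in the command above; read_module_param and its sysfs_root argument are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return a loaded module's parameter value, or None if it is not readable."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no permission.
        return None


if __name__ == "__main__":
    value = read_module_param("vhost_net", "experimental_zcopytx")
    print("experimental_zcopytx =", value)
```

A return value of "1" corresponds to the output shown above; None means the file could not be read on that host.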


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-13 16:59     ` Matthew Rosato
@ 2017-09-14  4:21       ` Jason Wang
  2017-09-15  3:36         ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-14  4:21 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst



On 2017-09-14 00:59, Matthew Rosato wrote:
> On 09/13/2017 04:13 AM, Jason Wang wrote:
>>
>> On 2017-09-13 09:16, Jason Wang wrote:
>>>
>>> On 2017-09-13 01:56, Matthew Rosato wrote:
>>>> We are seeing a regression for a subset of workloads across KVM guests
>>>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
>>>> points to c67df11f "vhost_net: try batch dequing from skb array"
>>>>
>>>> In the regressed environment, we are running 4 kvm guests, 2 running as
>>>> uperf servers and 2 running as uperf clients, all on a single host.
>>>> They are connected via a virtual bridge.  The uperf client profile looks
>>>> like:
>>>>
>>>> <?xml version="1.0"?>
>>>> <profile name="TCP_STREAM">
>>>>     <group nprocs="1">
>>>>       <transaction iterations="1">
>>>>         <flowop type="connect" options="remotehost=192.168.122.103
>>>> protocol=tcp"/>
>>>>       </transaction>
>>>>       <transaction duration="300">
>>>>         <flowop type="write" options="count=16 size=30000"/>
>>>>       </transaction>
>>>>       <transaction iterations="1">
>>>>         <flowop type="disconnect"/>
>>>>       </transaction>
>>>>     </group>
>>>> </profile>
>>>>
>>>> So, 1 tcp streaming instance per client.  When upgrading the host kernel
>>>> from 4.12->4.13, we see about a 30% drop in throughput for this
>>>> scenario.  After the bisect, I further verified that reverting c67df11f
>>>> on 4.13 "fixes" the throughput for this scenario.
>>>>
>>>> On the other hand, if we increase the load by upping the number of
>>>> streaming instances to 50 (nprocs="50") or even 10, we see instead a
>>>> ~10% increase in throughput when upgrading host from 4.12->4.13.
>>>>
>>>> So it may be the issue is specific to "light load" scenarios.  I would
>>>> expect some overhead for the batching, but 30% seems significant...  Any
>>>> thoughts on what might be happening here?
>>>>
>>> Hi, thanks for the bisecting. Will try to see if I can reproduce.
>>> Various factors could have impact on stream performance. If possible,
>>> could you collect the #pkts and average packet size during the test?
>>> And if you guest version is above 4.12, could you please retry with
>>> napi_tx=true?
> Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 -
> 4.4.0-93-generic specifically).  Here's a throughput report (uperf) and
> #pkts and average packet size (tcpstat) for one of the uperf clients:
>
> host 4.12 / guest 4.4:
> throughput: 29.98Gb/s
> #pkts=33465571 avg packet size=33755.70
>
> host 4.13 / guest 4.4:
> throughput: 20.36Gb/s
> #pkts=21233399 avg packet size=36130.69

I tested guest 4.4 on an Intel machine and still can't reproduce it :(

>
> I ran the test again using net-next.git as guest kernel, with and
> without napi_tx=true.  napi_tx did not seem to have any significant
> impact on throughput.  However, the guest kernel shift from
> 4.4->net-next improved things.  I can still see a regression between
> host 4.12 and 4.13, but it's more on the order of 10-15% - another sample:
>
> host 4.12 / guest net-next (without napi_tx):
> throughput: 28.88Gb/s
> #pkts=31743116 avg packet size=33779.78
>
> host 4.13 / guest net-next (without napi_tx):
> throughput: 24.34Gb/s
> #pkts=25532724 avg packet size=35963.20

Thanks for the numbers. I originally suspected batching would lead to
more packets of smaller size, but that does not appear to be the case.
The lower packet count is also a hint that there's a delay somewhere.

>
>>> Thanks
>> Unfortunately, I could not reproduce it locally. I'm using net-next.git
>> as guest. I can get ~42Gb/s on Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>> for both before and after the commit. I use 1 vcpu and 1 queue, and pin
>> vcpu and vhost threads into separate cpu on host manually (in same numa
>> node).
> The environment is quite a bit different -- I'm running in an LPAR on a
> z13 (s390x).  We've seen the issue in various configurations, the
> smallest thus far was a host partition w/ 40G and 20 CPUs defined (the
> numbers above were gathered w/ this configuration).  Each guest has 4GB
> and 4 vcpus.  No pinning / affinity configured.

Unfortunately, I don't have s390x on hand. Will try to get one.

>
>> Can you hit this regression constantly and what's you qemu command line
> Yes, the regression seems consistent.  I can try tweaking some of the
> host and guest definitions to see if it makes a difference.

Is the issue gone if you reduce VHOST_RX_BATCH to 1? It would also be
helpful to collect a perf diff to see if anything interesting shows up.
(Since 4.4 shows a more obvious regression, please use 4.4.)

>
> The guests are instantiated from libvirt - Here's one of the resulting
> qemu command lines:
>
> /usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S
> -object
> secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes
> -machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m
> 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
> 44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config
> -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
> -no-shutdown -boot strict=on -drive
> file=/dev/disk/by-id/scsi-3600507630bffc0380000000000001803,format=raw,if=none,id=drive-virtio-disk0
> -device
> virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device
> virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001
> -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device
> virtio-net-ccw,netdev=hostnet1,id=net1,mac=02:54:00:89:d4:01,devno=fe.0.00a1
> -chardev pty,id=charconsole0 -device
> sclpconsole,chardev=charconsole0,id=console0 -device
> virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on
>
> In the above, net0 is used for a macvtap connection (not used in the
> experiment, just for a reliable ssh connection - can remove if needed).
> net1 is the bridge connection used for the uperf tests.
>
>
>> and #cpus on host? Is zerocopy enabled?
> Host info provided above.
>
> cat /sys/module/vhost_net/parameters/experimental_zcopytx
> 1

It may be worth trying to disable zerocopy, or running the test from
host to guest instead of guest to guest, to rule out a possible issue on
the sender side.

Thanks


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-14  4:21       ` Jason Wang
@ 2017-09-15  3:36         ` Matthew Rosato
  2017-09-15  8:55           ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-15  3:36 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst


> Is the issue gone if you reduce VHOST_RX_BATCH to 1? And it would be
> also helpful to collect perf diff to see if anything interesting.
> (Consider 4.4 shows more obvious regression, please use 4.4).
> 

Issue still exists when I force VHOST_RX_BATCH = 1

Collected perf data, with 4.12 as the baseline, 4.13 as delta1 and
4.13+VHOST_RX_BATCH=1 as delta2. All guests running 4.4.  Same scenario,
2 uperf client guests, 2 uperf slave guests - I collected perf data
against 1 uperf client process and 1 uperf slave process.  Here are the
significant diffs:

uperf client:

75.09%   +9.32%   +8.52%  [kernel.kallsyms]   [k] enabled_wait
 9.04%   -4.11%   -3.79%  [kernel.kallsyms]   [k] __copy_from_user
 2.30%   -0.79%   -0.71%  [kernel.kallsyms]   [k] arch_free_page
 2.17%   -0.65%   -0.58%  [kernel.kallsyms]   [k] arch_alloc_page
 0.69%   -0.25%   -0.24%  [kernel.kallsyms]   [k] get_page_from_freelist
 0.56%   +0.08%   +0.14%  [kernel.kallsyms]   [k] virtio_ccw_kvm_notify
 0.42%   -0.11%   -0.09%  [kernel.kallsyms]   [k] tcp_sendmsg
 0.31%   -0.15%   -0.14%  [kernel.kallsyms]   [k] tcp_write_xmit

uperf slave:

72.44%   +8.99%   +8.85%  [kernel.kallsyms]   [k] enabled_wait
 8.99%   -3.67%   -3.51%  [kernel.kallsyms]   [k] __copy_to_user
 2.31%   -0.71%   -0.67%  [kernel.kallsyms]   [k] arch_free_page
 2.16%   -0.67%   -0.63%  [kernel.kallsyms]   [k] arch_alloc_page
 0.89%   -0.14%   -0.11%  [kernel.kallsyms]   [k] virtio_ccw_kvm_notify
 0.71%   -0.30%   -0.30%  [kernel.kallsyms]   [k] get_page_from_freelist
 0.70%   -0.25%   -0.29%  [kernel.kallsyms]   [k] __wake_up_sync_key
 0.61%   -0.22%   -0.22%  [kernel.kallsyms]   [k] virtqueue_add_inbuf


> 
> May worth to try disable zerocopy or do the test form host to guest
> instead of guest to guest to exclude the possible issue of sender.
> 

With zerocopy disabled, still seeing the regression.  The provided perf
#s have zerocopy enabled.

I replaced 1 uperf guest and instead ran that uperf client as a host
process, pointing at a guest.  All traffic still over the virtual
bridge.  In this setup, it's still easy to see the regression for the
remaining guest1<->guest2 uperf run, but the host<->guest3 run does NOT
exhibit a reliable regression pattern.  The significant perf diffs from
the host uperf process (baseline=4.12, delta=4.13):


59.96%   +5.03%  [kernel.kallsyms]           [k] enabled_wait
 6.47%   -2.27%  [kernel.kallsyms]           [k] raw_copy_to_user
 5.52%   -1.63%  [kernel.kallsyms]           [k] raw_copy_from_user
 0.87%   -0.30%  [kernel.kallsyms]           [k] get_page_from_freelist
 0.69%   +0.30%  [kernel.kallsyms]           [k] finish_task_switch
 0.66%   -0.15%  [kernel.kallsyms]           [k] swake_up
 0.58%   -0.00%  [vhost]                     [k] vhost_get_vq_desc
   ...
 0.42%   +0.50%  [kernel.kallsyms]           [k] ckc_irq_pending

I also tried flipping the uperf stream around (a guest uperf client is
communicating to a slave uperf process on the host) and also cannot see
the regression pattern.  So it seems to require a guest on both ends of
the connection.
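
To compare listings like the ones above without reading them by eye, the rows can be parsed and ranked by the size of the change. This is a sketch that assumes the column layout shown in this message (an unsigned baseline percentage, then signed deltas, then the DSO in brackets and the symbol); parse_row and top_movers are hypothetical helpers, not part of perf.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

PCT = re.compile(r"^([+-]?)(\d+\.\d+)%$")


class DiffRow(NamedTuple):
    baseline: Optional[float]  # percent in the baseline profile, if listed
    deltas: List[float]        # signed changes, one per delta column
    dso: str
    symbol: str


def parse_row(line: str) -> Optional[DiffRow]:
    """Parse one listing row; return None for anything that is not a row."""
    head, sep, tail = line.partition("[")
    if not sep or "]" not in tail:
        return None
    baseline, deltas = None, []
    for token in head.split():
        m = PCT.match(token)
        if m is None:
            return None
        value = float(m.group(2))
        if m.group(1):  # signed token: a delta column
            deltas.append(-value if m.group(1) == "-" else value)
        else:           # unsigned token: the baseline column
            baseline = value
    dso, _, rest = tail.partition("]")
    parts = rest.split()  # e.g. ["[k]", "enabled_wait"]
    if len(parts) < 2:
        return None
    return DiffRow(baseline, deltas, dso, parts[-1])


def top_movers(lines: Iterable[str], n: int = 3) -> List[DiffRow]:
    """Rows with the largest absolute change in the first delta column."""
    rows = [r for r in map(parse_row, lines) if r is not None and r.deltas]
    return sorted(rows, key=lambda r: abs(r.deltas[0]), reverse=True)[:n]


if __name__ == "__main__":
    sample = [
        "75.09%   +9.32%   +8.52%  [kernel.kallsyms]   [k] enabled_wait",
        " 9.04%   -4.11%   -3.79%  [kernel.kallsyms]   [k] __copy_from_user",
        " 2.30%   -0.79%   -0.71%  [kernel.kallsyms]   [k] arch_free_page",
        " 0.56%   +0.08%   +0.14%  [kernel.kallsyms]   [k] virtio_ccw_kvm_notify",
    ]
    for row in top_movers(sample):
        print(f"{row.symbol:24s} {row.deltas[0]:+.2f}")
```

Telling the baseline from the deltas by the presence of a sign also copes with rows where one of the columns is blank.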


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-15  3:36         ` Matthew Rosato
@ 2017-09-15  8:55           ` Jason Wang
  2017-09-15 19:19             ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-15  8:55 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst



On 2017-09-15 11:36, Matthew Rosato wrote:
>> Is the issue gone if you reduce VHOST_RX_BATCH to 1? And it would be
>> also helpful to collect perf diff to see if anything interesting.
>> (Consider 4.4 shows more obvious regression, please use 4.4).
>>
> Issue still exists when I force VHOST_RX_BATCH = 1

Interesting, so this looks more like an issue with the changes in
vhost_net than with batch dequeuing itself. I tried this on Intel but
still can't hit it.

>
> Collected perf data, with 4.12 as the baseline, 4.13 as delta1 and
> 4.13+VHOST_RX_BATCH=1 as delta2. All guests running 4.4.  Same scenario,
> 2 uperf client guests, 2 uperf slave guests - I collected perf data
> against 1 uperf client process and 1 uperf slave process.  Here are the
> significant diffs:
>
> uperf client:
>
> 75.09%   +9.32%   +8.52%  [kernel.kallsyms]   [k] enabled_wait
>   9.04%   -4.11%   -3.79%  [kernel.kallsyms]   [k] __copy_from_user
>   2.30%   -0.79%   -0.71%  [kernel.kallsyms]   [k] arch_free_page
>   2.17%   -0.65%   -0.58%  [kernel.kallsyms]   [k] arch_alloc_page
>   0.69%   -0.25%   -0.24%  [kernel.kallsyms]   [k] get_page_from_freelist
>   0.56%   +0.08%   +0.14%  [kernel.kallsyms]   [k] virtio_ccw_kvm_notify
>   0.42%   -0.11%   -0.09%  [kernel.kallsyms]   [k] tcp_sendmsg
>   0.31%   -0.15%   -0.14%  [kernel.kallsyms]   [k] tcp_write_xmit
>
> uperf slave:
>
> 72.44%   +8.99%   +8.85%  [kernel.kallsyms]   [k] enabled_wait
>   8.99%   -3.67%   -3.51%  [kernel.kallsyms]   [k] __copy_to_user
>   2.31%   -0.71%   -0.67%  [kernel.kallsyms]   [k] arch_free_page
>   2.16%   -0.67%   -0.63%  [kernel.kallsyms]   [k] arch_alloc_page
>   0.89%   -0.14%   -0.11%  [kernel.kallsyms]   [k] virtio_ccw_kvm_notify
>   0.71%   -0.30%   -0.30%  [kernel.kallsyms]   [k] get_page_from_freelist
>   0.70%   -0.25%   -0.29%  [kernel.kallsyms]   [k] __wake_up_sync_key
>   0.61%   -0.22%   -0.22%  [kernel.kallsyms]   [k] virtqueue_add_inbuf

It looks like vhost is slowed down for some reason, which leads to more
idle time on 4.13+VHOST_RX_BATCH=1. I would appreciate it if you could
collect the perf diff on the host, one for rx and one for tx.

>
>
>> May worth to try disable zerocopy or do the test form host to guest
>> instead of guest to guest to exclude the possible issue of sender.
>>
> With zerocopy disabled, still seeing the regression.  The provided perf
> #s have zerocopy enabled.
>
> I replaced 1 uperf guest and instead ran that uperf client as a host
> process, pointing at a guest.  All traffic still over the virtual
> bridge.  In this setup, it's still easy to see the regression for the
> remaining guest1<->guest2 uperf run, but the host<->guest3 run does NOT
> exhibit a reliable regression pattern.  The significant perf diffs from
> the host uperf process (baseline=4.12, delta=4.13):
>
>
> 59.96%   +5.03%  [kernel.kallsyms]           [k] enabled_wait
>   6.47%   -2.27%  [kernel.kallsyms]           [k] raw_copy_to_user
>   5.52%   -1.63%  [kernel.kallsyms]           [k] raw_copy_from_user
>   0.87%   -0.30%  [kernel.kallsyms]           [k] get_page_from_freelist
>   0.69%   +0.30%  [kernel.kallsyms]           [k] finish_task_switch
>   0.66%   -0.15%  [kernel.kallsyms]           [k] swake_up
>   0.58%   -0.00%  [vhost]                     [k] vhost_get_vq_desc
>     ...
>   0.42%   +0.50%  [kernel.kallsyms]           [k] ckc_irq_pending

Another hint that we should perf the vhost threads.

>
> I also tried flipping the uperf stream around (a guest uperf client is
> communicating to a slave uperf process on the host) and also cannot see
> the regression pattern.  So it seems to require a guest on both ends of
> the connection.
>

Yes. Will try to get an s390 environment.

Thanks


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-15  8:55           ` Jason Wang
@ 2017-09-15 19:19             ` Matthew Rosato
  2017-09-18  3:13               ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-15 19:19 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst

> It looks like vhost is slowed down for some reason which leads to more
> idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
> perf.diff on host, one for rx and one for tx.
> 

perf data below for the associated vhost threads, baseline=4.12,
delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1

Client vhost:

60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
 2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
 1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
 1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
 1.09%   +0.28%   +0.35%  [vhost]            [k] vhost_get_vq_desc
 1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
 0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
 0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
 0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
 0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
 0.79%   +0.09%   +0.19%  [vhost]            [k] __vhost_add_used_n
 0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
 0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
 0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
 0.58%   -0.15%   -0.12%  [ebtables]         [k] ebt_do_table
 0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
   ...
 0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
   ...
 0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
   ...
         +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
         +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
         +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
         +0.24%   +0.23%  [vhost_net]        [k] vhost_net_buf_peek

Server vhost:

61.93%  -10.72%  -10.91%  [kernel.vmlinux]   [k] raw_copy_to_user
 9.25%   +0.47%   +0.86%  [kernel.vmlinux]   [k] free_hot_cold_page
 5.16%   +1.41%   +1.57%  [vhost]            [k] vhost_get_vq_desc
 5.12%   -3.81%   -3.78%  [kernel.vmlinux]   [k] skb_release_data
 3.30%   +0.42%   +0.55%  [kernel.vmlinux]   [k] raw_copy_from_user
 1.29%   +2.20%   +2.28%  [kernel.vmlinux]   [k] copy_page_to_iter
 1.24%   +1.65%   +0.45%  [vhost_net]        [k] handle_rx
 1.08%   +3.03%   +2.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
 0.96%   +0.70%   +1.10%  [vhost]            [k] translate_desc
 0.69%   -0.20%   -0.22%  [kernel.vmlinux]   [k] tun_do_read.part.10
 0.69%                    [kernel.vmlinux]   [k] tun_peek_len
 0.67%   +0.75%   +0.78%  [kernel.vmlinux]   [k] eventfd_signal
 0.52%   +0.96%   +0.98%  [kernel.vmlinux]   [k] finish_task_switch
 0.50%   +0.05%   +0.09%  [vhost]            [k] vhost_add_used_n
   ...
         +0.63%   +0.58%  [vhost_net]        [k] vhost_net_buf_peek
         +0.32%   +0.32%  [kernel.vmlinux]   [k] _copy_to_iter
         +0.19%   +0.19%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
         +0.11%   +0.21%  [vhost]            [k] vhost_umem_interval_tr
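
One thing to watch for when reading the listings above: get_task_policy.part.7 appears only in the baseline and get_task_policy.part.9 only in the deltas. Suffixes such as .part.N are added by GCC when it splits or clones a function, and the number can differ between builds, so these are likely the same function under two names. The sketch below folds such rows together. It assumes that a blank baseline means the symbol is absent from the 4.12 profile and a blank delta means it is absent from the newer one; base_symbol and merge_clones are hypothetical helpers.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Suffixes GCC appends to split or cloned functions, e.g. foo.part.7
_CLONE_SUFFIX = re.compile(r"(\.(part|isra|constprop)\.\d+)+$")

Row = Tuple[str, Optional[float], Optional[float]]  # (symbol, baseline %, delta)


def base_symbol(symbol: str) -> str:
    """Strip compiler-generated clone suffixes from a symbol name."""
    return _CLONE_SUFFIX.sub("", symbol)


def merge_clones(rows: Iterable[Row]) -> Dict[str, Tuple[float, float]]:
    """Map each base symbol to (baseline %, new %) summed over its clones."""
    merged = defaultdict(lambda: [0.0, 0.0])
    for symbol, baseline, delta in rows:
        entry = merged[base_symbol(symbol)]
        base = baseline or 0.0
        entry[0] += base
        # Assumption: a blank delta means the symbol is gone from the new profile.
        entry[1] += (base + delta) if delta is not None else 0.0
    return {name: (old, new) for name, (old, new) in merged.items()}


if __name__ == "__main__":
    rows = [
        ("get_task_policy.part.7", 0.74, None),  # baseline only
        ("get_task_policy.part.9", None, 1.50),  # 4.13 only
    ]
    for name, (old, new) in merge_clones(rows).items():
        print(f"{name}: {old:.2f}% -> {new:.2f}%")
```

Read this way, the two rows describe a single function whose share roughly doubled rather than one function disappearing and another appearing.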


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-15 19:19             ` Matthew Rosato
@ 2017-09-18  3:13               ` Jason Wang
  2017-09-18  4:14                 ` [PATCH] vhost_net: conditionally enable tx polling kbuild test robot
  2017-09-18  7:36                 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
  0 siblings, 2 replies; 42+ messages in thread
From: Jason Wang @ 2017-09-18  3:13 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst

[-- Attachment #1: Type: text/plain, Size: 3468 bytes --]



On 2017-09-16 03:19, Matthew Rosato wrote:
>> It looks like vhost is slowed down for some reason which leads to more
>> idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
>> perf.diff on host, one for rx and one for tx.
>>
> perf data below for the associated vhost threads, baseline=4.12,
> delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1
>
> Client vhost:
>
> 60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
> 13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
>   2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
>   1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
>   1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
>   1.09%   +0.28%   +0.35%  [vhost]            [k] vhost_get_vq_desc
>   1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
>   0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
>   0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
>   0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
>   0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
>   0.79%   +0.09%   +0.19%  [vhost]            [k] __vhost_add_used_n
>   0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
>   0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
>   0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
>   0.58%   -0.15%   -0.12%  [ebtables]         [k] ebt_do_table
>   0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
>     ...
>   0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
>     ...
>   0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
>     ...
>           +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
>           +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>           +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
>           +0.24%   +0.23%  [vhost_net]        [k] vhost_net_buf_peek
>
> Server vhost:
>
> 61.93%  -10.72%  -10.91%  [kernel.vmlinux]   [k] raw_copy_to_user
>   9.25%   +0.47%   +0.86%  [kernel.vmlinux]   [k] free_hot_cold_page
>   5.16%   +1.41%   +1.57%  [vhost]            [k] vhost_get_vq_desc
>   5.12%   -3.81%   -3.78%  [kernel.vmlinux]   [k] skb_release_data
>   3.30%   +0.42%   +0.55%  [kernel.vmlinux]   [k] raw_copy_from_user
>   1.29%   +2.20%   +2.28%  [kernel.vmlinux]   [k] copy_page_to_iter
>   1.24%   +1.65%   +0.45%  [vhost_net]        [k] handle_rx
>   1.08%   +3.03%   +2.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
>   0.96%   +0.70%   +1.10%  [vhost]            [k] translate_desc
>   0.69%   -0.20%   -0.22%  [kernel.vmlinux]   [k] tun_do_read.part.10
>   0.69%                    [kernel.vmlinux]   [k] tun_peek_len
>   0.67%   +0.75%   +0.78%  [kernel.vmlinux]   [k] eventfd_signal
>   0.52%   +0.96%   +0.98%  [kernel.vmlinux]   [k] finish_task_switch
>   0.50%   +0.05%   +0.09%  [vhost]            [k] vhost_add_used_n
>     ...
>           +0.63%   +0.58%  [vhost_net]        [k] vhost_net_buf_peek
>           +0.32%   +0.32%  [kernel.vmlinux]   [k] _copy_to_iter
>           +0.19%   +0.19%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>           +0.11%   +0.21%  [vhost]            [k] vhost_umem_interval_tr
>

It looks like there are more wakeups, for some reason that is not yet known.

Could you please try the attached patch to see if it solves or mitigates
the issue?

Thanks

[-- Attachment #2: 0001-vhost_net-conditionally-enable-tx-polling.patch --]
[-- Type: text/x-patch, Size: 899 bytes --]

>From 63b276ed881c1e2a89b7ea35b6f328f70ddd6185 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Mon, 18 Sep 2017 10:56:30 +0800
Subject: [PATCH] vhost_net: conditionally enable tx polling

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 58585ec..397d86a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -471,6 +471,7 @@ static void handle_tx(struct vhost_net *net)
 		goto out;
 
 	vhost_disable_notify(&net->dev, vq);
+	vhost_net_disable_vq(net, vq);
 
 	hdr_size = nvq->vhost_hlen;
 	zcopy = nvq->ubufs;
@@ -562,6 +563,8 @@ static void handle_tx(struct vhost_net *net)
 					% UIO_MAXIOV;
 			}
 			vhost_discard_vq_desc(vq, 1);
+			if (err = -EAGAIN)
+				vhost_net_enable_vq(net, vq);
 			break;
 		}
 		if (err != len)
-- 
1.8.3.1



* Re: [PATCH] vhost_net: conditionally enable tx polling
  2017-09-18  3:13               ` Jason Wang
@ 2017-09-18  4:14                 ` kbuild test robot
  2017-09-18  7:36                 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
  1 sibling, 0 replies; 42+ messages in thread
From: kbuild test robot @ 2017-09-18  4:14 UTC (permalink / raw)
  To: Jason Wang; +Cc: kbuild-all, Matthew Rosato, netdev, davem, mst

[-- Attachment #1: Type: text/plain, Size: 5778 bytes --]

Hi Jason,

[auto build test WARNING on vhost/linux-next]
[also build test WARNING on v4.14-rc1 next-20170915]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-Wang/vhost_net-conditionally-enable-tx-polling/20170918-112041
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: x86_64-randconfig-x009-201738 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers//vhost/net.c: In function 'handle_tx':
>> drivers//vhost/net.c:565:4: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
       if (err = -EAGAIN)
       ^~
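
A minimal user-space sketch of what this warning points at (not the kernel code itself; the helper names are hypothetical). With `=`, the condition stores -EAGAIN in err and then tests that non-zero value, so the branch is taken for every error. With `==`, it is taken only when sendmsg actually returned -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper mirroring the flagged line: the assignment
 * overwrites err with -EAGAIN, which is non-zero, so this returns
 * true for every input. */
bool should_reenable_buggy(int err)
{
	if ((err = -EAGAIN))	/* extra parentheses only silence -Wparentheses */
		return true;
	return false;
}

/* Hypothetical helper with the intended comparison. */
bool should_reenable_fixed(int err)
{
	return err == -EAGAIN;
}

/* Usage: an unrelated error triggers the buggy variant but not the
 * fixed one. */
void demo(void)
{
	assert(should_reenable_buggy(-ENOBUFS));
	assert(!should_reenable_fixed(-ENOBUFS));
	assert(should_reenable_fixed(-EAGAIN));
}
```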

vim +565 drivers//vhost/net.c

   442	
   443	/* Expects to be always run from workqueue - which acts as
   444	 * read-size critical section for our kind of RCU. */
   445	static void handle_tx(struct vhost_net *net)
   446	{
   447		struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
   448		struct vhost_virtqueue *vq = &nvq->vq;
   449		unsigned out, in;
   450		int head;
   451		struct msghdr msg = {
   452			.msg_name = NULL,
   453			.msg_namelen = 0,
   454			.msg_control = NULL,
   455			.msg_controllen = 0,
   456			.msg_flags = MSG_DONTWAIT,
   457		};
   458		size_t len, total_len = 0;
   459		int err;
   460		size_t hdr_size;
   461		struct socket *sock;
   462		struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
   463		bool zcopy, zcopy_used;
   464	
   465		mutex_lock(&vq->mutex);
   466		sock = vq->private_data;
   467		if (!sock)
   468			goto out;
   469	
   470		if (!vq_iotlb_prefetch(vq))
   471			goto out;
   472	
   473		vhost_disable_notify(&net->dev, vq);
   474		vhost_net_disable_vq(net, vq);
   475	
   476		hdr_size = nvq->vhost_hlen;
   477		zcopy = nvq->ubufs;
   478	
   479		for (;;) {
   480			/* Release DMAs done buffers first */
   481			if (zcopy)
   482				vhost_zerocopy_signal_used(net, vq);
   483	
   484			/* If more outstanding DMAs, queue the work.
   485			 * Handle upend_idx wrap around
   486			 */
   487			if (unlikely(vhost_exceeds_maxpend(net)))
   488				break;
   489	
   490			head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
   491							ARRAY_SIZE(vq->iov),
   492							&out, &in);
   493			/* On error, stop handling until the next kick. */
   494			if (unlikely(head < 0))
   495				break;
   496			/* Nothing new?  Wait for eventfd to tell us they refilled. */
   497			if (head == vq->num) {
   498				if (unlikely(vhost_enable_notify(&net->dev, vq))) {
   499					vhost_disable_notify(&net->dev, vq);
   500					continue;
   501				}
   502				break;
   503			}
   504			if (in) {
   505				vq_err(vq, "Unexpected descriptor format for TX: "
   506				       "out %d, int %d\n", out, in);
   507				break;
   508			}
   509			/* Skip header. TODO: support TSO. */
   510			len = iov_length(vq->iov, out);
   511			iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
   512			iov_iter_advance(&msg.msg_iter, hdr_size);
   513			/* Sanity check */
   514			if (!msg_data_left(&msg)) {
   515				vq_err(vq, "Unexpected header len for TX: "
   516				       "%zd expected %zd\n",
   517				       len, hdr_size);
   518				break;
   519			}
   520			len = msg_data_left(&msg);
   521	
   522			zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
   523					   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
   524					      nvq->done_idx
   525					   && vhost_net_tx_select_zcopy(net);
   526	
   527			/* use msg_control to pass vhost zerocopy ubuf info to skb */
   528			if (zcopy_used) {
   529				struct ubuf_info *ubuf;
   530				ubuf = nvq->ubuf_info + nvq->upend_idx;
   531	
   532				vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
   533				vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
   534				ubuf->callback = vhost_zerocopy_callback;
   535				ubuf->ctx = nvq->ubufs;
   536				ubuf->desc = nvq->upend_idx;
   537				msg.msg_control = ubuf;
   538				msg.msg_controllen = sizeof(ubuf);
   539				ubufs = nvq->ubufs;
   540				atomic_inc(&ubufs->refcount);
   541				nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
   542			} else {
   543				msg.msg_control = NULL;
   544				ubufs = NULL;
   545			}
   546	
   547			total_len += len;
   548			if (total_len < VHOST_NET_WEIGHT &&
   549			    !vhost_vq_avail_empty(&net->dev, vq) &&
   550			    likely(!vhost_exceeds_maxpend(net))) {
   551				msg.msg_flags |= MSG_MORE;
   552			} else {
   553				msg.msg_flags &= ~MSG_MORE;
   554			}
   555	
   556			/* TODO: Check specific error and bomb out unless ENOBUFS? */
   557			err = sock->ops->sendmsg(sock, &msg, len);
   558			if (unlikely(err < 0)) {
   559				if (zcopy_used) {
   560					vhost_net_ubuf_put(ubufs);
   561					nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
   562						% UIO_MAXIOV;
   563				}
   564				vhost_discard_vq_desc(vq, 1);
 > 565				if (err = -EAGAIN)
   566					vhost_net_enable_vq(net, vq);
   567				break;
   568			}
   569			if (err != len)
   570				pr_debug("Truncated TX packet: "
   571					 " len %d != %zd\n", err, len);
   572			if (!zcopy_used)
   573				vhost_add_used_and_signal(&net->dev, vq, head, 0);
   574			else
   575				vhost_zerocopy_signal_used(net, vq);
   576			vhost_net_tx_packet(net);
   577			if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
   578				vhost_poll_queue(&vq->poll);
   579				break;
   580			}
   581		}
   582	out:
   583		mutex_unlock(&vq->mutex);
   584	}
   585	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31832 bytes --]


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-18  3:13               ` Jason Wang
  2017-09-18  4:14                 ` [PATCH] vhost_net: conditionally enable tx polling kbuild test robot
@ 2017-09-18  7:36                 ` Jason Wang
  2017-09-18 18:11                   ` Matthew Rosato
  1 sibling, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-18  7:36 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst

[-- Attachment #1: Type: text/plain, Size: 4108 bytes --]



On 2017-09-18 11:13, Jason Wang wrote:
>
>
> On 2017-09-16 03:19, Matthew Rosato wrote:
>>> It looks like vhost is slowed down for some reason which leads to more
>>> idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
>>> perf.diff on host, one for rx and one for tx.
>>>
>> perf data below for the associated vhost threads, baseline=4.12,
>> delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1
>>
>> Client vhost:
>>
>> 60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
>> 13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
>>   2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
>>   1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
>>   1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
>>   1.09%   +0.28%   +0.35%  [vhost]            [k] vhost_get_vq_desc
>>   1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
>>   0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
>>   0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
>>   0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
>>   0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
>>   0.79%   +0.09%   +0.19%  [vhost]            [k] __vhost_add_used_n
>>   0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
>>   0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
>>   0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
>>   0.58%   -0.15%   -0.12%  [ebtables]         [k] ebt_do_table
>>   0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
>>     ...
>>   0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
>>     ...
>>   0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
>>     ...
>>           +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
>>           +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>>           +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
>>           +0.24%   +0.23%  [vhost_net]        [k] vhost_net_buf_peek
>>
>> Server vhost:
>>
>> 61.93%  -10.72%  -10.91%  [kernel.vmlinux]   [k] raw_copy_to_user
>>   9.25%   +0.47%   +0.86%  [kernel.vmlinux]   [k] free_hot_cold_page
>>   5.16%   +1.41%   +1.57%  [vhost]            [k] vhost_get_vq_desc
>>   5.12%   -3.81%   -3.78%  [kernel.vmlinux]   [k] skb_release_data
>>   3.30%   +0.42%   +0.55%  [kernel.vmlinux]   [k] raw_copy_from_user
>>   1.29%   +2.20%   +2.28%  [kernel.vmlinux]   [k] copy_page_to_iter
>>   1.24%   +1.65%   +0.45%  [vhost_net]        [k] handle_rx
>>   1.08%   +3.03%   +2.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
>>   0.96%   +0.70%   +1.10%  [vhost]            [k] translate_desc
>>   0.69%   -0.20%   -0.22%  [kernel.vmlinux]   [k] tun_do_read.part.10
>>   0.69%                    [kernel.vmlinux]   [k] tun_peek_len
>>   0.67%   +0.75%   +0.78%  [kernel.vmlinux]   [k] eventfd_signal
>>   0.52%   +0.96%   +0.98%  [kernel.vmlinux]   [k] finish_task_switch
>>   0.50%   +0.05%   +0.09%  [vhost]            [k] vhost_add_used_n
>>     ...
>>           +0.63%   +0.58%  [vhost_net]        [k] vhost_net_buf_peek
>>           +0.32%   +0.32%  [kernel.vmlinux]   [k] _copy_to_iter
>>           +0.19%   +0.19%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>>           +0.11%   +0.21%  [vhost]            [k] vhost_umem_interval_tr
>>
>
> It looks like there are more wakeups, for some reason that is not yet known.
>
> Could you please try the attached patch to see if it solves or mitigates
> the issue?
>
> Thanks 

My bad, please try this.

Thanks

[-- Attachment #2: 0001-vhost_net-conditionally-enable-tx-polling.patch --]
[-- Type: text/x-patch, Size: 898 bytes --]

>From 8be3edfcd415ba6157ab34d250127c6f2b21ff5d Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Mon, 18 Sep 2017 10:56:30 +0800
Subject: [PATCH] vhost_net: conditionally enable tx polling

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 58585ec..2b308e0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -471,6 +471,7 @@ static void handle_tx(struct vhost_net *net)
 		goto out;
 
 	vhost_disable_notify(&net->dev, vq);
+	vhost_net_disable_vq(net, vq);
 
 	hdr_size = nvq->vhost_hlen;
 	zcopy = nvq->ubufs;
@@ -562,6 +563,8 @@ static void handle_tx(struct vhost_net *net)
 					% UIO_MAXIOV;
 			}
 			vhost_discard_vq_desc(vq, 1);
+			if (err == -EAGAIN)
+				vhost_net_enable_vq(net, vq);
 			break;
 		}
 		if (err != len)
-- 
2.7.4



* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-18  7:36                 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
@ 2017-09-18 18:11                   ` Matthew Rosato
  2017-09-20  6:27                     ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-18 18:11 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst

On 09/18/2017 03:36 AM, Jason Wang wrote:
> 
> 
> On 2017-09-18 11:13, Jason Wang wrote:
>>
>>
>> On 2017-09-16 03:19, Matthew Rosato wrote:
>>>> It looks like vhost is slowed down for some reason which leads to more
>>>> idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
>>>> perf.diff on host, one for rx and one for tx.
>>>>
>>> perf data below for the associated vhost threads, baseline=4.12,
>>> delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1
>>>
>>> Client vhost:
>>>
>>> 60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
>>> 13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
>>>   2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
>>>   1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
>>>   1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
>>>   1.09%   +0.28%   +0.35%  [vhost]            [k] vhost_get_vq_desc
>>>   1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
>>>   0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
>>>   0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
>>>   0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
>>>   0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
>>>   0.79%   +0.09%   +0.19%  [vhost]            [k] __vhost_add_used_n
>>>   0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
>>>   0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
>>>   0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
>>>   0.58%   -0.15%   -0.12%  [ebtables]         [k] ebt_do_table
>>>   0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
>>>     ...
>>>   0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
>>>     ...
>>>   0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
>>>     ...
>>>           +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
>>>           +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>>>           +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
>>>           +0.24%   +0.23%  [vhost_net]        [k] vhost_net_buf_peek
>>>
>>> Server vhost:
>>>
>>> 61.93%  -10.72%  -10.91%  [kernel.vmlinux]   [k] raw_copy_to_user
>>>   9.25%   +0.47%   +0.86%  [kernel.vmlinux]   [k] free_hot_cold_page
>>>   5.16%   +1.41%   +1.57%  [vhost]            [k] vhost_get_vq_desc
>>>   5.12%   -3.81%   -3.78%  [kernel.vmlinux]   [k] skb_release_data
>>>   3.30%   +0.42%   +0.55%  [kernel.vmlinux]   [k] raw_copy_from_user
>>>   1.29%   +2.20%   +2.28%  [kernel.vmlinux]   [k] copy_page_to_iter
>>>   1.24%   +1.65%   +0.45%  [vhost_net]        [k] handle_rx
>>>   1.08%   +3.03%   +2.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
>>>   0.96%   +0.70%   +1.10%  [vhost]            [k] translate_desc
>>>   0.69%   -0.20%   -0.22%  [kernel.vmlinux]   [k] tun_do_read.part.10
>>>   0.69%                    [kernel.vmlinux]   [k] tun_peek_len
>>>   0.67%   +0.75%   +0.78%  [kernel.vmlinux]   [k] eventfd_signal
>>>   0.52%   +0.96%   +0.98%  [kernel.vmlinux]   [k] finish_task_switch
>>>   0.50%   +0.05%   +0.09%  [vhost]            [k] vhost_add_used_n
>>>     ...
>>>           +0.63%   +0.58%  [vhost_net]        [k] vhost_net_buf_peek
>>>           +0.32%   +0.32%  [kernel.vmlinux]   [k] _copy_to_iter
>>>           +0.19%   +0.19%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>>>           +0.11%   +0.21%  [vhost]            [k] vhost_umem_interval_tr
>>>
>>
>> It looks like there are more wakeups, for some reason that is not yet known.
>>
>> Could you please try the attached patch to see if it solves or mitigates
>> the issue?
>>
>> Thanks 
> 
> My bad, please try this.
> 
> Thanks

Thanks Jason.  Built 4.13 + supplied patch, I see some decrease in
wakeups, but there's still quite a bit more compared to 4.12
(baseline=4.12, delta1=4.13, delta2=4.13+patch):

client:
 2.00%   +3.69%   +2.55%  [kernel.vmlinux]   [k] __wake_up_sync_key

server:
 1.08%   +3.03%   +1.85%  [kernel.vmlinux]   [k] __wake_up_sync_key


Throughput was roughly equivalent to base 4.13 (so, still seeing the
regression w/ this patch applied).


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-18 18:11                   ` Matthew Rosato
@ 2017-09-20  6:27                     ` Jason Wang
  2017-09-20 19:38                       ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-20  6:27 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst

[-- Attachment #1: Type: text/plain, Size: 4826 bytes --]



On 2017-09-19 02:11, Matthew Rosato wrote:
> On 09/18/2017 03:36 AM, Jason Wang wrote:
>>
>> On 2017-09-18 11:13, Jason Wang wrote:
>>>
>>> On 2017-09-16 03:19, Matthew Rosato wrote:
>>>>> It looks like vhost is slowed down for some reason which leads to more
>>>>> idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
>>>>> perf.diff on host, one for rx and one for tx.
>>>>>
>>>> perf data below for the associated vhost threads, baseline=4.12,
>>>> delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1
>>>>
>>>> Client vhost:
>>>>
>>>> 60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
>>>> 13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
>>>>    2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
>>>>    1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
>>>>    1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
>>>>    1.09%   +0.28%   +0.35%  [vhost]            [k] vhost_get_vq_desc
>>>>    1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
>>>>    0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
>>>>    0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
>>>>    0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
>>>>    0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
>>>>    0.79%   +0.09%   +0.19%  [vhost]            [k] __vhost_add_used_n
>>>>    0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
>>>>    0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
>>>>    0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
>>>>    0.58%   -0.15%   -0.12%  [ebtables]         [k] ebt_do_table
>>>>    0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
>>>>      ...
>>>>    0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
>>>>      ...
>>>>    0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
>>>>      ...
>>>>            +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
>>>>            +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>>>>            +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
>>>>            +0.24%   +0.23%  [vhost_net]        [k] vhost_net_buf_peek
>>>>
>>>> Server vhost:
>>>>
>>>> 61.93%  -10.72%  -10.91%  [kernel.vmlinux]   [k] raw_copy_to_user
>>>>    9.25%   +0.47%   +0.86%  [kernel.vmlinux]   [k] free_hot_cold_page
>>>>    5.16%   +1.41%   +1.57%  [vhost]            [k] vhost_get_vq_desc
>>>>    5.12%   -3.81%   -3.78%  [kernel.vmlinux]   [k] skb_release_data
>>>>    3.30%   +0.42%   +0.55%  [kernel.vmlinux]   [k] raw_copy_from_user
>>>>    1.29%   +2.20%   +2.28%  [kernel.vmlinux]   [k] copy_page_to_iter
>>>>    1.24%   +1.65%   +0.45%  [vhost_net]        [k] handle_rx
>>>>    1.08%   +3.03%   +2.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
>>>>    0.96%   +0.70%   +1.10%  [vhost]            [k] translate_desc
>>>>    0.69%   -0.20%   -0.22%  [kernel.vmlinux]   [k] tun_do_read.part.10
>>>>    0.69%                    [kernel.vmlinux]   [k] tun_peek_len
>>>>    0.67%   +0.75%   +0.78%  [kernel.vmlinux]   [k] eventfd_signal
>>>>    0.52%   +0.96%   +0.98%  [kernel.vmlinux]   [k] finish_task_switch
>>>>    0.50%   +0.05%   +0.09%  [vhost]            [k] vhost_add_used_n
>>>>      ...
>>>>            +0.63%   +0.58%  [vhost_net]        [k] vhost_net_buf_peek
>>>>            +0.32%   +0.32%  [kernel.vmlinux]   [k] _copy_to_iter
>>>>            +0.19%   +0.19%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
>>>>            +0.11%   +0.21%  [vhost]            [k] vhost_umem_interval_tr
>>>>
>>> It looks like there are more wakeups, for some reason that is not yet known.
>>>
>>> Could you please try the attached patch to see if it solves or mitigates
>>> the issue?
>>>
>>> Thanks
>> My bad, please try this.
>>
>> Thanks
> Thanks Jason.  Built 4.13 + supplied patch, I see some decrease in
> wakeups, but there's still quite a bit more compared to 4.12
> (baseline=4.12, delta1=4.13, delta2=4.13+patch):
>
> client:
>   2.00%   +3.69%   +2.55%  [kernel.vmlinux]   [k] __wake_up_sync_key
>
> server:
>   1.08%   +3.03%   +1.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
>
>
> Throughput was roughly equivalent to base 4.13 (so, still seeing the
> regression w/ this patch applied).
>

That seems to make some progress on wakeup mitigation. The previous patch
tries to reduce unnecessary traversal of the waitqueue during rx. The
attached patch goes further and disables rx polling while tx is being
processed. Please try it to see if it makes any difference.
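
The idea behind the earlier "conditionally enable tx polling" patch can be modelled in user space as follows. This is only a sketch under stated assumptions: `model_vq`, `model_send` and `model_handle_tx` are hypothetical stand-ins, not vhost APIs. Polling is switched off when tx processing starts and switched back on only if the send path pushes back with -EAGAIN, so a run that never hits backpressure leaves the queue off the waitqueue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a virtqueue's socket-polling state. */
struct model_vq {
	bool polling;	/* true while the vq would sit on the socket waitqueue */
	int room;	/* free slots left in the modelled socket queue */
};

/* Hypothetical stand-in for sendmsg(): fails with -EAGAIN when full. */
int model_send(struct model_vq *vq)
{
	if (vq->room == 0)
		return -EAGAIN;
	vq->room--;
	return 0;
}

/* Try to send npkts packets. Polling is disabled on entry and only
 * re-enabled if the send path returns -EAGAIN.
 * Returns the number of packets sent. */
int model_handle_tx(struct model_vq *vq, int npkts)
{
	int sent = 0;

	vq->polling = false;
	while (sent < npkts) {
		int err = model_send(vq);

		if (err < 0) {
			if (err == -EAGAIN)
				vq->polling = true;
			break;
		}
		sent++;
	}
	return sent;
}

/* Usage: with room for 2 packets and 5 to send, 2 go out and polling
 * ends up enabled because the third attempt hit -EAGAIN. */
void demo(void)
{
	struct model_vq vq = { .polling = true, .room = 2 };

	assert(model_handle_tx(&vq, 5) == 2);
	assert(vq.polling);
}
```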

And two questions:
- Does the issue still exist if you run uperf between 2 VMs (instead of 4 VMs)?
- Does enabling batching in the tap of the sending VM improve performance
(ethtool -C $tap rx-frames 64)?

Thanks

[-- Attachment #2: 0001-vhost_net-avoid-unnecessary-wakeups-during-tx.patch --]
[-- Type: text/x-patch, Size: 1938 bytes --]

>From d57ad96083fc57205336af1b5ea777e5185f1581 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Wed, 20 Sep 2017 11:44:49 +0800
Subject: [PATCH] vhost_net: avoid unnecessary wakeups during tx

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ed476fa..e7349cf 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -444,8 +444,11 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
+	struct vhost_net_virtqueue *rx_nvq = &net->vqs[VHOST_NET_VQ_RX];
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
+	struct vhost_virtqueue *rx_vq = &rx_nvq->vq;
+
 	unsigned out, in;
 	int head;
 	struct msghdr msg = {
@@ -462,6 +465,10 @@ static void handle_tx(struct vhost_net *net)
 	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
 	bool zcopy, zcopy_used;
 
+	mutex_lock(&rx_vq->mutex);
+	vhost_net_disable_vq(net, rx_vq);
+	mutex_unlock(&rx_vq->mutex);
+
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
 	if (!sock)
@@ -574,13 +581,21 @@ static void handle_tx(struct vhost_net *net)
 		else
 			vhost_zerocopy_signal_used(net, vq);
 		vhost_net_tx_packet(net);
-		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
-			vhost_poll_queue(&vq->poll);
+		if (unlikely(total_len >= VHOST_NET_WEIGHT))
 			break;
-		}
 	}
 out:
 	mutex_unlock(&vq->mutex);
+
+	mutex_lock(&rx_vq->mutex);
+	vhost_net_enable_vq(net, rx_vq);
+	mutex_unlock(&rx_vq->mutex);
+
+	if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+		mutex_lock(&vq->mutex);
+		vhost_poll_queue(&vq->poll);
+		mutex_unlock(&vq->mutex);
+	}
 }
 
 static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
-- 
1.8.3.1



* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-20  6:27                     ` Jason Wang
@ 2017-09-20 19:38                       ` Matthew Rosato
  2017-09-22  4:03                         ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-20 19:38 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst


> That seems to make some progress on wakeup mitigation. The previous patch
> tries to reduce unnecessary traversal of the waitqueue during rx. The
> attached patch goes further and disables rx polling while tx is being
> processed. Please try it to see if it makes any difference.

Unfortunately, this patch doesn't seem to have made a difference.  I
tried runs with both this patch and the previous patch applied, as well
as only this patch applied for comparison (numbers from vhost thread of
sending VM):

4.12    4.13     patch1   patch2   patch1+2
2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key

In each case, the regression in throughput was still present.
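
To put the deltas above on a common scale, here is a small sketch. It assumes the delta columns are percentage-point differences against the 4.12 baseline; the function names are hypothetical. Under that assumption the __wake_up_sync_key share is about 5.69% on 4.13, 4.55% with patch1, 4.81% with patch2 and 4.69% with both, so patch1 removes roughly 31% of the increase.

```c
#include <assert.h>

/* Share of samples in the compared run, assuming the delta is
 * (compared share - baseline share) in percentage points. */
double absolute_share(double baseline_pct, double delta_pct)
{
	return baseline_pct + delta_pct;
}

/* Fraction of the unpatched increase that a patched run removes. */
double increase_removed(double delta_unpatched, double delta_patched)
{
	return (delta_unpatched - delta_patched) / delta_unpatched;
}

/* Usage: the 4.13 share is about 5.69%, and patch1 removes
 * (3.69 - 2.55) / 3.69, i.e. a bit under a third of the increase. */
void demo(void)
{
	double share = absolute_share(2.00, 3.69);
	double removed = increase_removed(3.69, 2.55);

	assert(share > 5.68 && share < 5.70);
	assert(removed > 0.30 && removed < 0.32);
}
```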

> And two questions:
> - Does the issue still exist if you run uperf between 2 VMs (instead of 4 VMs)?

Verified that the second set of guests are not actually required, I can
see the regression with only 2 VMs.

> - Does enabling batching in the tap of the sending VM improve performance
> (ethtool -C $tap rx-frames 64)?

I tried this, but it did not help (actually seemed to make things a
little worse)


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-20 19:38                       ` Matthew Rosato
@ 2017-09-22  4:03                         ` Jason Wang
  2017-09-25 20:18                           ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Wang @ 2017-09-22  4:03 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst



On 2017-09-21 03:38, Matthew Rosato wrote:
>> That seems to make some progress on wakeup mitigation. The previous patch
>> tries to reduce unnecessary traversal of the waitqueue during rx. The
>> attached patch goes further and disables rx polling while tx is being
>> processed. Please try it to see if it makes any difference.
> Unfortunately, this patch doesn't seem to have made a difference.  I
> tried runs with both this patch and the previous patch applied, as well
> as only this patch applied for comparison (numbers from vhost thread of
> sending VM):
>
> 4.12    4.13     patch1   patch2   patch1+2
> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>
> In each case, the regression in throughput was still present.

This probably means some other sources of wakeups were missed. Could
you please record the callers of __wake_up_sync_key()?

>
>> And two questions:
>> - Does the issue still exist if you run uperf between 2 VMs (instead of 4 VMs)?
> Verified that the second set of guests are not actually required, I can
> see the regression with only 2 VMs.
>
>> - Does enabling batching in the tap of the sending VM improve performance
>> (ethtool -C $tap rx-frames 64)?
> I tried this, but it did not help (actually seemed to make things a
> little worse)
>

I still can't see what would lead to more wakeups. I will take more
time to look at this issue and keep you posted.

Thanks


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-22  4:03                         ` Jason Wang
@ 2017-09-25 20:18                           ` Matthew Rosato
  2017-10-05 20:07                             ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-09-25 20:18 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst

On 09/22/2017 12:03 AM, Jason Wang wrote:
> 
> 
> On 2017-09-21 03:38, Matthew Rosato wrote:
>>> That seems to make some progress on wakeup mitigation. The previous patch
>>> tries to reduce unnecessary traversal of the waitqueue during rx. The
>>> attached patch goes further and disables rx polling while tx is being
>>> processed. Please try it to see if it makes any difference.
>> Unfortunately, this patch doesn't seem to have made a difference.  I
>> tried runs with both this patch and the previous patch applied, as well
>> as only this patch applied for comparison (numbers from vhost thread of
>> sending VM):
>>
>> 4.12    4.13     patch1   patch2   patch1+2
>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>
>> In each case, the regression in throughput was still present.
> 
> This probably means some other cases of the wakeups were missed. Could
> you please record the callers of __wake_up_sync_key()?
> 

Hi Jason,

With your 2 previous patches applied, every call to __wake_up_sync_key
(for both sender and server vhost threads) shows the following stack trace:

     vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key <-sock_def_readable
     vhost-11478-11520 [002] ....   312.927230: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __dev_queue_xmit
 => br_dev_queue_push_xmit
 => br_forward_finish
 => __br_forward
 => br_handle_frame_finish
 => br_handle_frame
 => __netif_receive_skb_core
 => netif_receive_skb_internal
 => tun_get_user
 => tun_sendmsg
 => handle_tx
 => vhost_worker
 => kthread
 => kernel_thread_starter
 => kernel_thread_starter
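
For anyone repeating this kind of caller analysis, a trace log in the
layout shown above (an event line followed by "=>" frames) can be tallied
with a short script. This is only a sketch: the helper name count_stacks
and the abbreviated sample text, including its second timestamp, are
illustrative assumptions and not part of the recorded trace.

```python
from collections import Counter


def count_stacks(trace_text):
    """Count how often each distinct '=>' call chain appears in trace_text."""
    stacks = Counter()
    frames = []
    for line in trace_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("=>"):
            # One frame of the current stack trace.
            frames.append(stripped[2:].strip())
        elif frames:
            # Any non-frame line ends the stack collected so far.
            stacks[tuple(frames)] += 1
            frames = []
    if frames:
        stacks[tuple(frames)] += 1
    return stacks


# Abbreviated, illustrative sample (not the full trace from the report).
SAMPLE = """\
 vhost-11478-11520 [002] ....   312.927230: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => tun_sendmsg
 => handle_tx
 vhost-11478-11520 [002] ....   312.927240: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => tun_sendmsg
 => handle_tx
"""

if __name__ == "__main__":
    for chain, hits in count_stacks(SAMPLE).most_common():
        print(hits, " <- ".join(chain))
```

If every wakeup really comes from the same path, the tally contains a
single chain, which matches the observation above.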

>>
>>> And two questions:
>>> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)
>> Verified that the second set of guests are not actually required, I can
>> see the regression with only 2 VMs.
>>
>>> - Can enable batching in the tap of sending VM improve the performance
>>> (ethtool -C $tap rx-frames 64)
>> I tried this, but it did not help (actually seemed to make things a
>> little worse)
>>
> 
>  I still can't see a reason that can lead more wakeups, will take more
> time to look at this issue and keep you posted.
> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-09-25 20:18                           ` Matthew Rosato
@ 2017-10-05 20:07                             ` Matthew Rosato
  2017-10-11  2:41                               ` Jason Wang
  2017-10-12 18:31                               ` Wei Xu
  0 siblings, 2 replies; 42+ messages in thread
From: Matthew Rosato @ 2017-10-05 20:07 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst

On 09/25/2017 04:18 PM, Matthew Rosato wrote:
> On 09/22/2017 12:03 AM, Jason Wang wrote:
>>
>>
>> On 2017年09月21日 03:38, Matthew Rosato wrote:
>>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>>> patch goes even further which disables rx polling during processing tx.
>>>> Please try it to see if it has any difference.
>>> Unfortunately, this patch doesn't seem to have made a difference.  I
>>> tried runs with both this patch and the previous patch applied, as well
>>> as only this patch applied for comparison (numbers from vhost thread of
>>> sending VM):
>>>
>>> 4.12    4.13     patch1   patch2   patch1+2
>>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>>
>>> In each case, the regression in throughput was still present.
>>
>> This probably means some other cases of the wakeups were missed. Could
>> you please record the callers of __wake_up_sync_key()?
>>
> 
> Hi Jason,
> 
> With your 2 previous patches applied, every call to __wake_up_sync_key
> (for both sender and server vhost threads) shows the following stack trace:
> 
>      vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
> <-sock_def_readable
>      vhost-11478-11520 [002] ....   312.927230: <stack trace>
>  => dev_hard_start_xmit
>  => sch_direct_xmit
>  => __dev_queue_xmit
>  => br_dev_queue_push_xmit
>  => br_forward_finish
>  => __br_forward
>  => br_handle_frame_finish
>  => br_handle_frame
>  => __netif_receive_skb_core
>  => netif_receive_skb_internal
>  => tun_get_user
>  => tun_sendmsg
>  => handle_tx
>  => vhost_worker
>  => kthread
>  => kernel_thread_starter
>  => kernel_thread_starter
> 

Ping...  Jason, any other ideas or suggestions?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-05 20:07                             ` Matthew Rosato
@ 2017-10-11  2:41                               ` Jason Wang
  2017-10-12 18:31                               ` Wei Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Jason Wang @ 2017-10-11  2:41 UTC (permalink / raw)
  To: Matthew Rosato, netdev; +Cc: davem, mst



On 2017年10月06日 04:07, Matthew Rosato wrote:
> On 09/25/2017 04:18 PM, Matthew Rosato wrote:
>> On 09/22/2017 12:03 AM, Jason Wang wrote:
>>>
>>> On 2017年09月21日 03:38, Matthew Rosato wrote:
>>>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>>>> patch goes even further which disables rx polling during processing tx.
>>>>> Please try it to see if it has any difference.
>>>> Unfortunately, this patch doesn't seem to have made a difference.  I
>>>> tried runs with both this patch and the previous patch applied, as well
>>>> as only this patch applied for comparison (numbers from vhost thread of
>>>> sending VM):
>>>>
>>>> 4.12    4.13     patch1   patch2   patch1+2
>>>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>>>
>>>> In each case, the regression in throughput was still present.
>>> This probably means some other cases of the wakeups were missed. Could
>>> you please record the callers of __wake_up_sync_key()?
>>>
>> Hi Jason,
>>
>> With your 2 previous patches applied, every call to __wake_up_sync_key
>> (for both sender and server vhost threads) shows the following stack trace:
>>
>>       vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
>> <-sock_def_readable
>>       vhost-11478-11520 [002] ....   312.927230: <stack trace>
>>   => dev_hard_start_xmit
>>   => sch_direct_xmit
>>   => __dev_queue_xmit
>>   => br_dev_queue_push_xmit
>>   => br_forward_finish
>>   => __br_forward
>>   => br_handle_frame_finish
>>   => br_handle_frame
>>   => __netif_receive_skb_core
>>   => netif_receive_skb_internal
>>   => tun_get_user
>>   => tun_sendmsg
>>   => handle_tx
>>   => vhost_worker
>>   => kthread
>>   => kernel_thread_starter
>>   => kernel_thread_starter
>>
> Ping...  Jason, any other ideas or suggestions?
>

Sorry for the late reply; I am recovering from a long holiday. I will get
back to this soon.

Thanks

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-05 20:07                             ` Matthew Rosato
  2017-10-11  2:41                               ` Jason Wang
@ 2017-10-12 18:31                               ` Wei Xu
  2017-10-18 20:17                                 ` Matthew Rosato
  1 sibling, 1 reply; 42+ messages in thread
From: Wei Xu @ 2017-10-12 18:31 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, netdev, davem, mst

On Thu, Oct 05, 2017 at 04:07:45PM -0400, Matthew Rosato wrote:
> 
> Ping...  Jason, any other ideas or suggestions?

Hi Matthew,
I have recently been running a similar test on x86 for this patch. Here
are some differences between our testbeds.

1. It is nice that you see an improvement with 50+ instances (or are these
connections?), which should be quite helpful in narrowing down the issue.
You have also identified the cost (wait/wakeup). A kind reminder: did you
pin the uperf client/server along the whole path, in addition to the vhost
and vcpu threads?

2. It might be useful to shorten the traffic path as a reference. What I am
running is roughly:
    pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)

In my experience the bridge driver (br_forward(), etc.) can affect
performance, so I eventually settled on this simplified testbed. It fully
isolates the traffic from both userspace and the host kernel stack (1 and
50 instances, bridge driver, etc.) and therefore reduces potential
interference.

The downside is that it needs DPDK support in the guest. Has DPDK ever been
run in an s390x guest? An alternative approach is to run XDP drop directly
on the virtio-net NIC in the guest, although this requires compiling XDP
inside the guest, which needs a newer distro (Fedora 25+ in my case, or
Ubuntu 16.10, not sure).

3. BTW, did you enable hugepages for your guest? They can affect performance
to some degree, depending on the memory demand when generating traffic. I
didn't see any such option in your command lines.

I hope this doesn't make things more complicated for you. :) We will keep
working on this and update you.

Thanks,
Wei

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-12 18:31                               ` Wei Xu
@ 2017-10-18 20:17                                 ` Matthew Rosato
  2017-10-23  2:06                                   ` Jason Wang
  2017-10-23 13:57                                   ` Wei Xu
  0 siblings, 2 replies; 42+ messages in thread
From: Matthew Rosato @ 2017-10-18 20:17 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, netdev, davem, mst

On 10/12/2017 02:31 PM, Wei Xu wrote:
> On Thu, Oct 05, 2017 at 04:07:45PM -0400, Matthew Rosato wrote:
>>
>> Ping...  Jason, any other ideas or suggestions?
> 
> Hi Matthew,
> Recently I am doing similar test on x86 for this patch, here are some,
> differences between our testbeds.
> 
> 1. It is nice you have got improvement with 50+ instances(or connections here?)
> which would be quite helpful to address the issue, also you've figured out the
> cost(wait/wakeup), kindly reminder did you pin uperf client/server along the whole
> path besides vhost and vcpu threads? 

Was not previously doing any pinning whatsoever, just reproducing an
environment that one of our testers here was running.  Reducing guest
vcpu count from 4->1, still see the regression.  Then, pinned each vcpu
thread and vhost thread to a separate host CPU -- still made no
difference (regression still present).

> 
> 2. It might be useful to short the traffic path as a reference, What I am running
> is briefly like:
>     pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
> 
> The bridge driver(br_forward(), etc) might impact performance due to my personal
> experience, so eventually I settled down with this simplified testbed which fully
> isolates the traffic from both userspace and host kernel stack(1 and 50 instances,
> bridge driver, etc), therefore reduces potential interferences.
> 
> The down side of this is that it needs DPDK support in guest, has this ever be
> run on s390x guest? An alternative approach is to directly run XDP drop on
> virtio-net nic in guest, while this requires compiling XDP inside guest which needs
> a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
> 

I made an attempt at DPDK, but it has not been run on s390x as far as
I'm aware and didn't seem trivial to get working.

So instead I took your alternate suggestion & did:
pktgen(host) -> tap(x) -> guest(xdp_drop)

When running this setup, I am not able to reproduce the regression.  As
mentioned previously, I am also unable to reproduce when running one end
of the uperf connection from the host - I have only ever been able to
reproduce when both ends of the uperf connection are running within a guest.

> 3. BTW, did you enable hugepage for your guest? It would  performance more
> or less depends on the memory demand when generating traffic, I didn't see
> similar command lines in yours.
> 

s390x does not currently support passing through hugetlb backing via
QEMU mem-path.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-18 20:17                                 ` Matthew Rosato
@ 2017-10-23  2:06                                   ` Jason Wang
  2017-10-23  2:13                                     ` Michael S. Tsirkin
  2017-10-25 20:21                                     ` Matthew Rosato
  2017-10-23 13:57                                   ` Wei Xu
  1 sibling, 2 replies; 42+ messages in thread
From: Jason Wang @ 2017-10-23  2:06 UTC (permalink / raw)
  To: Matthew Rosato, Wei Xu, mst; +Cc: netdev, davem



On 2017年10月19日 04:17, Matthew Rosato wrote:
>> 2. It might be useful to short the traffic path as a reference, What I am running
>> is briefly like:
>>      pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
>>
>> The bridge driver(br_forward(), etc) might impact performance due to my personal
>> experience, so eventually I settled down with this simplified testbed which fully
>> isolates the traffic from both userspace and host kernel stack(1 and 50 instances,
>> bridge driver, etc), therefore reduces potential interferences.
>>
>> The down side of this is that it needs DPDK support in guest, has this ever be
>> run on s390x guest? An alternative approach is to directly run XDP drop on
>> virtio-net nic in guest, while this requires compiling XDP inside guest which needs
>> a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
>>
> I made an attempt at DPDK, but it has not been run on s390x as far as
> I'm aware and didn't seem trivial to get working.
>
> So instead I took your alternate suggestion & did:
> pktgen(host) -> tap(x) -> guest(xdp_drop)
>
> When running this setup, I am not able to reproduce the regression.  As
> mentioned previously, I am also unable to reproduce when running one end
> of the uperf connection from the host - I have only ever been able to
> reproduce when both ends of the uperf connection are running within a guest.
>

Thanks for the test. Looking at the code, the only obvious difference
when BATCH is 1 is that one spinlock, previously taken via
tun_peek_len(), is now avoided since we can do the check locally. I wonder
whether this speeds up handle_rx() a little and in turn leads to more
wakeups at some rates/sizes of TCP stream. To test this, maybe you can try:

- enable busy polling, using poll-us=1000, and see if we still get the
regression
- measure the pps of pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2

Michael, do you have any other possibility in mind?

Thanks

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-23  2:06                                   ` Jason Wang
@ 2017-10-23  2:13                                     ` Michael S. Tsirkin
  2017-10-25 20:21                                     ` Matthew Rosato
  1 sibling, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2017-10-23  2:13 UTC (permalink / raw)
  To: Jason Wang; +Cc: Matthew Rosato, Wei Xu, netdev, davem

On Mon, Oct 23, 2017 at 10:06:36AM +0800, Jason Wang wrote:
> 
> 
> On 2017年10月19日 04:17, Matthew Rosato wrote:
> > > 2. It might be useful to short the traffic path as a reference, What I am running
> > > is briefly like:
> > >      pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
> > > 
> > > The bridge driver(br_forward(), etc) might impact performance due to my personal
> > > experience, so eventually I settled down with this simplified testbed which fully
> > > isolates the traffic from both userspace and host kernel stack(1 and 50 instances,
> > > bridge driver, etc), therefore reduces potential interferences.
> > > 
> > > The down side of this is that it needs DPDK support in guest, has this ever be
> > > run on s390x guest? An alternative approach is to directly run XDP drop on
> > > virtio-net nic in guest, while this requires compiling XDP inside guest which needs
> > > a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
> > > 
> > I made an attempt at DPDK, but it has not been run on s390x as far as
> > I'm aware and didn't seem trivial to get working.
> > 
> > So instead I took your alternate suggestion & did:
> > pktgen(host) -> tap(x) -> guest(xdp_drop)
> > 
> > When running this setup, I am not able to reproduce the regression.  As
> > mentioned previously, I am also unable to reproduce when running one end
> > of the uperf connection from the host - I have only ever been able to
> > reproduce when both ends of the uperf connection are running within a guest.
> > 
> 
> Thanks for the test. Looking at the code, the only obvious difference when
> BATCH is 1 is that one spinlock which was previously called by
> tun_peek_len() was avoided since we can do it locally. I wonder whether or
> not this speeds up handle_rx() a little more then leads more wakeups during
> some rates/sizes of TCP stream. To prove this, maybe you can try:
> 
> - enable busy polling, using poll-us=1000, and to see if we can still get
> the regression
> - measure the pps pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2
> 
> Michael, any another possibility in your mind?
> 
> Thanks

Not really. I still suspect that, since it's s390 only, there's some kind
of race condition where we wake up a task repeatedly.

-- 
MST

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-23  2:06                                   ` Jason Wang
  2017-10-23  2:13                                     ` Michael S. Tsirkin
@ 2017-10-25 20:21                                     ` Matthew Rosato
  2017-10-26  9:44                                       ` Wei Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-10-25 20:21 UTC (permalink / raw)
  To: Jason Wang, Wei Xu, mst; +Cc: netdev, davem

On 10/22/2017 10:06 PM, Jason Wang wrote:
> 
> 
> On 2017年10月19日 04:17, Matthew Rosato wrote:
>>> 2. It might be useful to short the traffic path as a reference, What
>>> I am running
>>> is briefly like:
>>>      pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
>>>
>>> The bridge driver(br_forward(), etc) might impact performance due to
>>> my personal
>>> experience, so eventually I settled down with this simplified testbed
>>> which fully
>>> isolates the traffic from both userspace and host kernel stack(1 and
>>> 50 instances,
>>> bridge driver, etc), therefore reduces potential interferences.
>>>
>>> The down side of this is that it needs DPDK support in guest, has
>>> this ever be
>>> run on s390x guest? An alternative approach is to directly run XDP
>>> drop on
>>> virtio-net nic in guest, while this requires compiling XDP inside
>>> guest which needs
>>> a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
>>>
>> I made an attempt at DPDK, but it has not been run on s390x as far as
>> I'm aware and didn't seem trivial to get working.
>>
>> So instead I took your alternate suggestion & did:
>> pktgen(host) -> tap(x) -> guest(xdp_drop)
>>
>> When running this setup, I am not able to reproduce the regression.  As
>> mentioned previously, I am also unable to reproduce when running one end
>> of the uperf connection from the host - I have only ever been able to
>> reproduce when both ends of the uperf connection are running within a
>> guest.
>>
> 
> Thanks for the test. Looking at the code, the only obvious difference
> when BATCH is 1 is that one spinlock which was previously called by
> tun_peek_len() was avoided since we can do it locally. I wonder whether
> or not this speeds up handle_rx() a little more then leads more wakeups
> during some rates/sizes of TCP stream. To prove this, maybe you can try:
> 
> - enable busy polling, using poll-us=1000, and to see if we can still
> get the regression

Enabled poll-us=1000 for both guests - drastically reduces throughput,
but can still see the regression between host 4.12->4.13 running the
uperf workload


> - measure the pps pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2
> 

I'm getting apparent stalls when I run pktgen from the guest in this
manner...  (pktgen thread continues spinning after the first 5000
packets make it to vm2, but no further packets get sent).  Not sure why yet.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-25 20:21                                     ` Matthew Rosato
@ 2017-10-26  9:44                                       ` Wei Xu
  2017-10-26 17:53                                         ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Wei Xu @ 2017-10-26  9:44 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Wed, Oct 25, 2017 at 04:21:26PM -0400, Matthew Rosato wrote:
> On 10/22/2017 10:06 PM, Jason Wang wrote:
> > 
> > 
> > On 2017年10月19日 04:17, Matthew Rosato wrote:
> >>> 2. It might be useful to short the traffic path as a reference, What
> >>> I am running
> >>> is briefly like:
> >>>      pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
> >>>
> >>> The bridge driver(br_forward(), etc) might impact performance due to
> >>> my personal
> >>> experience, so eventually I settled down with this simplified testbed
> >>> which fully
> >>> isolates the traffic from both userspace and host kernel stack(1 and
> >>> 50 instances,
> >>> bridge driver, etc), therefore reduces potential interferences.
> >>>
> >>> The down side of this is that it needs DPDK support in guest, has
> >>> this ever be
> >>> run on s390x guest? An alternative approach is to directly run XDP
> >>> drop on
> >>> virtio-net nic in guest, while this requires compiling XDP inside
> >>> guest which needs
> >>> a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
> >>>
> >> I made an attempt at DPDK, but it has not been run on s390x as far as
> >> I'm aware and didn't seem trivial to get working.
> >>
> >> So instead I took your alternate suggestion & did:
> >> pktgen(host) -> tap(x) -> guest(xdp_drop)
> >>
> >> When running this setup, I am not able to reproduce the regression.  As
> >> mentioned previously, I am also unable to reproduce when running one end
> >> of the uperf connection from the host - I have only ever been able to
> >> reproduce when both ends of the uperf connection are running within a
> >> guest.
> >>
> > 
> > Thanks for the test. Looking at the code, the only obvious difference
> > when BATCH is 1 is that one spinlock which was previously called by
> > tun_peek_len() was avoided since we can do it locally. I wonder whether
> > or not this speeds up handle_rx() a little more then leads more wakeups
> > during some rates/sizes of TCP stream. To prove this, maybe you can try:
> > 
> > - enable busy polling, using poll-us=1000, and to see if we can still
> > get the regression
> 
> Enabled poll-us=1000 for both guests - drastically reduces throughput,
> but can still see the regression between host 4.12->4.13 running the
> uperf workload
> 
> 
> > - measure the pps pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2
> > 
> 
> I'm getting apparent stalls when I run pktgen from the guest in this
> manner...  (pktgen thread continues spinning after the first 5000
> packets make it to vm2, but no further packets get sent).  Not sure why yet.
> 


Are you using the same binding as mentioned in your previous mail? The
stall might be caused by CPU contention between pktgen and vhost. Could you
please try running pktgen from another idle CPU by adjusting the binding?

BTW, did you see any improvement when running pktgen from the host, given
that no regression was found there? Since this can be reproduced with only
1 vcpu per guest, could you try this binding? It might help simplify the
problem.
  vcpu0  -> cpu2
  vhost  -> cpu3
  pktgen -> cpu1 
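
A binding like the one above could be applied with a small script along
these lines. This is a minimal sketch, assuming a Linux host; the BINDING
table only mirrors the suggestion above, the helper name pin is made up
for illustration, and the real thread IDs of the vcpu and vhost threads
would have to be looked up on the host (for example under
/proc/<pid>/task).

```python
import os

# Hypothetical mapping of thread role to host CPU, mirroring the list above.
BINDING = {"vcpu0": 2, "vhost": 3, "pktgen": 1}


def pin(tid, cpu):
    """Restrict thread or process `tid` to one host CPU (Linux only)."""
    os.sched_setaffinity(tid, {cpu})
    # Read the mask back so the caller can confirm it took effect.
    return os.sched_getaffinity(tid)


if __name__ == "__main__":
    # Demonstration on the calling process (tid 0), using a CPU it is
    # already allowed to run on, so the example works on any Linux box.
    allowed = sorted(os.sched_getaffinity(0))
    print(pin(0, allowed[0]))
```

On the real host, pin() would be called once per thread ID with the CPU
from the table.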

Wei

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-26  9:44                                       ` Wei Xu
@ 2017-10-26 17:53                                         ` Matthew Rosato
  2017-10-31  7:07                                           ` Wei Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-10-26 17:53 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, mst, netdev, davem


> 
> Are you using the same binding as mentioned in previous mail sent by you? it
> might be caused by cpu convention between pktgen and vhost, could you please
> try to run pktgen from another idle cpu by adjusting the binding? 

I don't think that's the case -- I can cause pktgen to hang in the guest
without any cpu binding, and with vhost disabled even.

> BTW, did you see any improvement when running pktgen from the host if no 
> regression was found? Since this can be reproduced with only 1 vcpu for
> guest, may you try this bind? This might help simplify the problem.
>   vcpu0  -> cpu2
>   vhost  -> cpu3
>   pktgen -> cpu1 
> 

Yes -- I ran the pktgen test from host to guest with the binding
described.  I see an approx 5% increase in throughput from 4.12->4.13.
Some numbers:

host-4.12: 1384486.2pps 663.8MB/sec
host-4.13: 1434598.6pps 688.2MB/sec
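
The relative change between the two sample lines above can be checked with
a few lines of arithmetic. The sketch below uses only the two figures
quoted here; the helper name percent_change is illustrative. For these two
samples the change works out to roughly 3.6%, somewhat below the
approximate figure mentioned, which may reflect an average over more runs
than the two shown.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# pktgen host->guest results quoted above.
PPS = {"4.12": 1384486.2, "4.13": 1434598.6}
MB_PER_SEC = {"4.12": 663.8, "4.13": 688.2}

if __name__ == "__main__":
    print(f"pps:    {percent_change(PPS['4.12'], PPS['4.13']):+.2f}%")
    print(f"MB/sec: {percent_change(MB_PER_SEC['4.12'], MB_PER_SEC['4.13']):+.2f}%")
```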

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-26 17:53                                         ` Matthew Rosato
@ 2017-10-31  7:07                                           ` Wei Xu
  2017-10-31  7:00                                             ` Jason Wang
  2017-11-03  4:30                                             ` Matthew Rosato
  0 siblings, 2 replies; 42+ messages in thread
From: Wei Xu @ 2017-10-31  7:07 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Thu, Oct 26, 2017 at 01:53:12PM -0400, Matthew Rosato wrote:
> 
> > 
> > Are you using the same binding as mentioned in previous mail sent by you? it
> > might be caused by cpu convention between pktgen and vhost, could you please
> > try to run pktgen from another idle cpu by adjusting the binding? 
> 
> I don't think that's the case -- I can cause pktgen to hang in the guest
> without any cpu binding, and with vhost disabled even.

Yes, I ran a test and it also hangs in the guest. Until we figure that out,
could you try UDP with uperf for these cases?

VM   -> Host
Host -> VM
VM   -> VM

> 
> > BTW, did you see any improvement when running pktgen from the host if no 
> > regression was found? Since this can be reproduced with only 1 vcpu for
> > guest, may you try this bind? This might help simplify the problem.
> >   vcpu0  -> cpu2
> >   vhost  -> cpu3
> >   pktgen -> cpu1 
> > 
> 
> Yes -- I ran the pktgen test from host to guest with the binding
> described.  I see an approx 5% increase in throughput from 4.12->4.13.
> Some numbers:
> 
> host-4.12: 1384486.2pps 663.8MB/sec
> host-4.13: 1434598.6pps 688.2MB/sec

That's great, at least we are aligned in this case.

Jason, any thoughts on this? 

Wei

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-31  7:07                                           ` Wei Xu
@ 2017-10-31  7:00                                             ` Jason Wang
  2017-11-03  4:30                                             ` Matthew Rosato
  1 sibling, 0 replies; 42+ messages in thread
From: Jason Wang @ 2017-10-31  7:00 UTC (permalink / raw)
  To: Wei Xu, Matthew Rosato; +Cc: mst, netdev, davem



On 2017年10月31日 15:07, Wei Xu wrote:
>>> BTW, did you see any improvement when running pktgen from the host if no
>>> regression was found? Since this can be reproduced with only 1 vcpu for
>>> guest, may you try this bind? This might help simplify the problem.
>>>    vcpu0  -> cpu2
>>>    vhost  -> cpu3
>>>    pktgen -> cpu1
>>>
>> Yes -- I ran the pktgen test from host to guest with the binding
>> described.  I see an approx 5% increase in throughput from 4.12->4.13.
>> Some numbers:
>>
>> host-4.12: 1384486.2pps 663.8MB/sec
>> host-4.13: 1434598.6pps 688.2MB/sec
> That's great, at least we are aligned in this case.
>
> Jason, any thoughts on this?
>
> Wei
>

The good news is that pps has increased. I think the first step is to move
things forward a little by reposting the tx polling optimization.

I will post a new version soon.

Thanks

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-31  7:07                                           ` Wei Xu
  2017-10-31  7:00                                             ` Jason Wang
@ 2017-11-03  4:30                                             ` Matthew Rosato
  2017-11-04 23:35                                               ` Wei Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-11-03  4:30 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, mst, netdev, davem

On 10/31/2017 03:07 AM, Wei Xu wrote:
> On Thu, Oct 26, 2017 at 01:53:12PM -0400, Matthew Rosato wrote:
>>
>>>
>>> Are you using the same binding as mentioned in previous mail sent by you? it
>>> might be caused by cpu convention between pktgen and vhost, could you please
>>> try to run pktgen from another idle cpu by adjusting the binding? 
>>
>> I don't think that's the case -- I can cause pktgen to hang in the guest
>> without any cpu binding, and with vhost disabled even.
> 
> Yes, I did a test and it also hangs in guest, before we figure it out,
> maybe you try udp with uperf with this case?
> 
> VM   -> Host
> Host -> VM
> VM   -> VM
> 

Here are averaged run numbers (Gbps throughput) across 4.12, 4.13 and
net-next with and without Jason's recent "vhost_net: conditionally
enable tx polling" applied (referred to as 'patch' below).  1 uperf
instance in each case:

uperf TCP:
	 4.12	4.13	4.13+patch	net-next	net-next+patch
----------------------------------------------------------------------
VM->VM	 35.2	16.5	20.84		22.2		24.36
VM->Host 42.15	43.57	44.90		30.83		32.26
Host->VM 53.17	41.51	42.18		37.05		37.30

uperf UDP:
	 4.12	4.13	4.13+patch	net-next	net-next+patch
----------------------------------------------------------------------
VM->VM	 24.93	21.63	25.09		8.86		9.62
VM->Host 40.21	38.21	39.72		8.74		9.35
Host->VM 31.26	30.18	31.25		7.2		9.26

The net is that Jason's recent patch definitely improves things across
the board at 4.13 as well as at net-next -- But the VM<->VM TCP numbers
I am observing are still lower than base 4.12.

A separate concern is why my UDP numbers look so bad on net-next (have
not bisected this yet).
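
To make the tables above easier to compare, the numbers can be expressed
relative to the 4.12 baseline. The following is a sketch that simply
re-enters the figures from the two tables; the helper name
relative_to_base is illustrative.

```python
COLUMNS = ("4.12", "4.13", "4.13+patch", "net-next", "net-next+patch")

# Gbps figures copied from the uperf tables above.
TCP = {
    "VM->VM":   (35.2, 16.5, 20.84, 22.2, 24.36),
    "VM->Host": (42.15, 43.57, 44.90, 30.83, 32.26),
    "Host->VM": (53.17, 41.51, 42.18, 37.05, 37.30),
}
UDP = {
    "VM->VM":   (24.93, 21.63, 25.09, 8.86, 9.62),
    "VM->Host": (40.21, 38.21, 39.72, 8.74, 9.35),
    "Host->VM": (31.26, 30.18, 31.25, 7.2, 9.26),
}


def relative_to_base(row):
    """Percent change of each column versus the first (4.12) column."""
    base = row[0]
    return {
        name: (value - base) / base * 100.0
        for name, value in zip(COLUMNS[1:], row[1:])
    }


if __name__ == "__main__":
    for label, table in (("TCP", TCP), ("UDP", UDP)):
        for path, row in table.items():
            deltas = ", ".join(
                f"{name} {delta:+.1f}%"
                for name, delta in relative_to_base(row).items()
            )
            print(f"{label} {path}: {deltas}")
```

For the VM->VM TCP row this gives a drop of about 53% at 4.13 and about
41% at 4.13 with the patch, both relative to 4.12.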

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-03  4:30                                             ` Matthew Rosato
@ 2017-11-04 23:35                                               ` Wei Xu
  2017-11-08  1:02                                                 ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Wei Xu @ 2017-11-04 23:35 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Fri, Nov 03, 2017 at 12:30:12AM -0400, Matthew Rosato wrote:
> On 10/31/2017 03:07 AM, Wei Xu wrote:
> > On Thu, Oct 26, 2017 at 01:53:12PM -0400, Matthew Rosato wrote:
> >>
> >>>
> >>> Are you using the same binding as mentioned in previous mail sent by you? it
> >>> might be caused by cpu convention between pktgen and vhost, could you please
> >>> try to run pktgen from another idle cpu by adjusting the binding? 
> >>
> >> I don't think that's the case -- I can cause pktgen to hang in the guest
> >> without any cpu binding, and with vhost disabled even.
> > 
> > Yes, I did a test and it also hangs in guest, before we figure it out,
> > maybe you try udp with uperf with this case?
> > 
> > VM   -> Host
> > Host -> VM
> > VM   -> VM
> > 
> 
> Here are averaged run numbers (Gbps throughput) across 4.12, 4.13 and
> net-next with and without Jason's recent "vhost_net: conditionally
> enable tx polling" applied (referred to as 'patch' below).  1 uperf
> instance in each case:

Thanks a lot for the test. 

> 
> uperf TCP:
> 	 4.12	4.13	4.13+patch	net-next	net-next+patch
> ----------------------------------------------------------------------
> VM->VM	 35.2	16.5	20.84		22.2		24.36

Are you using the same server/test suite? You mentioned the number was around
28Gb/s for 4.12 and that it dropped about 40% for 4.13, so it seems things have
changed. Are there any performance tuning options on the server to maximize
CPU utilization?

I had a similar experience on x86 servers and desktops before, where the
results fluctuated quite a bit from run to run.

> VM->Host 42.15	43.57	44.90		30.83		32.26
> Host->VM 53.17	41.51	42.18		37.05		37.30

This is a bit odd. I remember you said there was no regression when
testing Host->VM, is that right?

> 
> uperf UDP:
> 	 4.12	4.13	4.13+patch	net-next	net-next+patch
> ----------------------------------------------------------------------
> VM->VM	 24.93	21.63	25.09		8.86		9.62
> VM->Host 40.21	38.21	39.72		8.74		9.35
> Host->VM 31.26	30.18	31.25		7.2		9.26

This case should be quite similar to pktgen: if you got an improvement with
pktgen, UDP usually shows the same. Could you please try disabling
tso, gso, gro and ufo on all host tap devices and guest virtio-net devices?
Currently the most significant tests would be like this AFAICT:

Host->VM     4.12    4.13
 TCP:
 UDP:
pktgen:

I don't want to bother you too much, so 4.12 and 4.13 without Jason's patch
should be enough, since we have already seen positive numbers for it; you can
also skip net-next for now.

If you see that UDP and pktgen are aligned, then it might be helpful to continue
with the other two cases; otherwise we fail in the first place.

> The net is that Jason's recent patch definitely improves things across
> the board at 4.13 as well as at net-next -- But the VM<->VM TCP numbers
> I am observing are still lower than base 4.12.

Cool.

> 
> A separate concern is why my UDP numbers look so bad on net-next (have
> not bisected this yet).

This might be another issue. I am on vacation and will try it on x86 once I am
back at work next Wednesday.

Wei

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-04 23:35                                               ` Wei Xu
@ 2017-11-08  1:02                                                 ` Matthew Rosato
  2017-11-11 20:59                                                   ` Matthew Rosato
  2017-11-12 15:40                                                   ` Wei Xu
  0 siblings, 2 replies; 42+ messages in thread
From: Matthew Rosato @ 2017-11-08  1:02 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, mst, netdev, davem

On 11/04/2017 07:35 PM, Wei Xu wrote:
> On Fri, Nov 03, 2017 at 12:30:12AM -0400, Matthew Rosato wrote:
>> On 10/31/2017 03:07 AM, Wei Xu wrote:
>>> On Thu, Oct 26, 2017 at 01:53:12PM -0400, Matthew Rosato wrote:
>>>>
>>>>>
>>>>> Are you using the same binding as mentioned in previous mail sent by you? it
>>>>> might be caused by cpu convention between pktgen and vhost, could you please
>>>>> try to run pktgen from another idle cpu by adjusting the binding? 
>>>>
>>>> I don't think that's the case -- I can cause pktgen to hang in the guest
>>>> without any cpu binding, and with vhost disabled even.
>>>
>>> Yes, I did a test and it also hangs in guest, before we figure it out,
>>> maybe you try udp with uperf with this case?
>>>
>>> VM   -> Host
>>> Host -> VM
>>> VM   -> VM
>>>
>>
>> Here are averaged run numbers (Gbps throughput) across 4.12, 4.13 and
>> net-next with and without Jason's recent "vhost_net: conditionally
>> enable tx polling" applied (referred to as 'patch' below).  1 uperf
>> instance in each case:
> 
> Thanks a lot for the test. 
> 
>>
>> uperf TCP:
>> 	 4.12	4.13	4.13+patch	net-next	net-next+patch
>> ----------------------------------------------------------------------
>> VM->VM	 35.2	16.5	20.84		22.2		24.36
> 
> Are you using the same server/test suite? You mentioned the number was around 
> 28Gb for 4.12 and it dropped about 40% for 4.13, it seems thing changed, are
> there any options for performance tuning on the server to maximize the cpu
> utilization? 

I experience some volatility as I am running on 1 of multiple LPARs
available to this system (they are sharing physical resources).  But I
think the real issue was that I left my guest environment set to 4
vcpus, but was binding assuming there was 1 vcpu (was working on
something else, forgot to change back).  This likely tainted my most
recent results, sorry.

> 
> I had similar experience on x86 server and desktop before and it made that
> the result number always went up and down pretty much.
> 
>> VM->Host 42.15	43.57	44.90		30.83		32.26
>> Host->VM 53.17	41.51	42.18		37.05		37.30
> 
> This is a bit odd, I remember you said there was no regression while 
> testing Host>VM, wasn't it? 
> 
>>
>> uperf UDP:
>> 	 4.12	4.13	4.13+patch	net-next	net-next+patch
>> ----------------------------------------------------------------------
>> VM->VM	 24.93	21.63	25.09		8.86		9.62
>> VM->Host 40.21	38.21	39.72		8.74		9.35
>> Host->VM 31.26	30.18	31.25		7.2		9.26
> 
> This case should be quite similar with pktgen, if you got improvement with
> pktgen, usually it was also the same for UDP, could you please try to disable
> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
> the most significant tests would be like this AFAICT:
> 
> Host->VM     4.12    4.13
>  TCP:
>  UDP:
> pktgen:
> 
> Don't want to bother you too much, so maybe 4.12 & 4.13 without Jason's patch should
> work since we have seen positive number for that, you can also temporarily skip
> net-next as well.

Here are the requested numbers, averaged over numerous runs --  guest is
4GB+1vcpu, host uperf/pktgen bound to 1 host CPU + qemu and vhost thread
pinned to other unique host CPUs.  tso, gso, gro, ufo disabled on host
taps / guest virtio-net devs as requested:

Host->VM	4.12		4.13
TCP:		9.92Gb/s	6.44Gb/s
UDP:		5.77Gb/s	6.63Gb/s
pktgen:		1572403pps	1904265pps

UDP/pktgen both show improvement from 4.12->4.13.  More interesting,
however, is that I am seeing the TCP regression for the first time from
host->VM.  I wonder if the combination of CPU binding + disabling of one
or more of tso/gso/gro/ufo is related.
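
For reference, the offload settings described above could be applied with
ethtool roughly as follows. This is only a sketch: the device names `tap0` and
`eth0` are placeholders (the thread does not name the actual devices), and the
host and guest commands have to be run on their respective systems as root.

```python
import subprocess

# Offloads that were disabled for the Host->VM runs.
OFFLOADS = ("tso", "gso", "gro", "ufo")

def offload_off_cmd(dev):
    """Build the `ethtool -K` command that turns the offloads off on `dev`."""
    cmd = ["ethtool", "-K", dev]
    for feature in OFFLOADS:
        cmd += [feature, "off"]
    return cmd

def disable_offloads(dev):
    """Run the command; needs root and an existing device."""
    subprocess.run(offload_off_cmd(dev), check=True)

# Placeholder device names (assumptions, not taken from the thread):
#   on the host:   disable_offloads("tap0")
#   in the guest:  disable_offloads("eth0")
```

On kernels that no longer support UFO, ethtool may refuse the `ufo` setting; in
that case it can be dropped from `OFFLOADS`.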

> 
> If you see UDP and pktgen are aligned, then it might be helpful to continue
> the other two cases, otherwise we fail in the first place.

I will start gathering those numbers tomorrow.

> 
>> The net is that Jason's recent patch definitely improves things across
>> the board at 4.13 as well as at net-next -- But the VM<->VM TCP numbers
>> I am observing are still lower than base 4.12.
> 
> Cool.
> 
>>
>> A separate concern is why my UDP numbers look so bad on net-next (have
>> not bisected this yet).
> 
> This might be another issue, I am in vacation, will try it on x86 once back
> to work on next Wednesday.
> 
> Wei
> 
>>
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-08  1:02                                                 ` Matthew Rosato
@ 2017-11-11 20:59                                                   ` Matthew Rosato
  2017-11-12 18:34                                                     ` Wei Xu
  2017-11-12 15:40                                                   ` Wei Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-11-11 20:59 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, mst, netdev, davem

>> This case should be quite similar with pktgen, if you got improvement with
>> pktgen, usually it was also the same for UDP, could you please try to disable
>> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
>> the most significant tests would be like this AFAICT:
>>
>> Host->VM     4.12    4.13
>>  TCP:
>>  UDP:
>> pktgen:
>>
>> Don't want to bother you too much, so maybe 4.12 & 4.13 without Jason's patch should
>> work since we have seen positive number for that, you can also temporarily skip
>> net-next as well.
> 
> Here are the requested numbers, averaged over numerous runs --  guest is
> 4GB+1vcpu, host uperf/pktgen bound to 1 host CPU + qemu and vhost thread
> pinned to other unique host CPUs.  tso, gso, gro, ufo disabled on host
> taps / guest virtio-net devs as requested:
> 
> Host->VM	4.12		4.13
> TCP:		9.92Gb/s	6.44Gb/s
> UDP:		5.77Gb/s	6.63Gb/s
> pktgen:		1572403pps	1904265pps
> 
> UDP/pktgen both show improvement from 4.12->4.13.  More interesting,
> however, is that I am seeing the TCP regression for the first time from
> host->VM.  I wonder if the combination of CPU binding + disabling of one
> or more of tso/gso/gro/ufo is related.
> 
>>
>> If you see UDP and pktgen are aligned, then it might be helpful to continue
>> the other two cases, otherwise we fail in the first place.
> 

I continued running many iterations of these tests between 4.12 and
4.13.  My throughput findings can be summarized as:

VM->VM case:
UDP:  roughly equivalent
TCP:  Consistent regression (5-10%)

VM->Host
Both UDP and TCP traffic are roughly equivalent.

Host->VM
UDP+pktgen: improvement (5-10%), but inconsistent
TCP: Consistent regression (25-30%)

Host->VM UDP and pktgen seemed to show improvement in some runs, and in
others seemed to mirror 4.12-level performance.

The TCP regression for VM->VM is no surprise, we started with that.
It's still consistent, but smaller in this specific environment.

The TCP regression in Host->VM is interesting because I wasn't seeing it
consistently before binding CPUs + disabling tso/gso/gro/ufo.  Also
interesting because of how large it is -- By any chance can you see this
regression on x86 with the same configuration?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-11 20:59                                                   ` Matthew Rosato
@ 2017-11-12 18:34                                                     ` Wei Xu
  2017-11-14 20:11                                                       ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Wei Xu @ 2017-11-12 18:34 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> >> This case should be quite similar with pktgen, if you got improvement with
> >> pktgen, usually it was also the same for UDP, could you please try to disable
> >> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
> >> the most significant tests would be like this AFAICT:
> >>
> >> Host->VM     4.12    4.13
> >>  TCP:
> >>  UDP:
> >> pktgen:
> >>
> >> Don't want to bother you too much, so maybe 4.12 & 4.13 without Jason's patch should
> >> work since we have seen positive number for that, you can also temporarily skip
> >> net-next as well.
> > 
> > Here are the requested numbers, averaged over numerous runs --  guest is
> > 4GB+1vcpu, host uperf/pktgen bound to 1 host CPU + qemu and vhost thread
> > pinned to other unique host CPUs.  tso, gso, gro, ufo disabled on host
> > taps / guest virtio-net devs as requested:
> > 
> > Host->VM	4.12		4.13
> > TCP:		9.92Gb/s	6.44Gb/s
> > UDP:		5.77Gb/s	6.63Gb/s
> > pktgen:		1572403pps	1904265pps
> > 
> > UDP/pktgen both show improvement from 4.12->4.13.  More interesting,
> > however, is that I am seeing the TCP regression for the first time from
> > host->VM.  I wonder if the combination of CPU binding + disabling of one
> > or more of tso/gso/gro/ufo is related.
> > 
> >>
> >> If you see UDP and pktgen are aligned, then it might be helpful to continue
> >> the other two cases, otherwise we fail in the first place.
> > 
> 
> I continued running many iterations of these tests between 4.12 and
> 4.13..  My throughput findings can be summarized as:

Really nice to have these numbers.

> 
> VM->VM case:
> UDP:  roughly equivalent
> TCP:  Consistent regression (5-10%)
> 
> VM->Host
> Both UDP and TCP traffic are roughly equivalent.

The patch improves Rx performance from the guest's point of view, so Tx
should show no big difference, since there are far fewer Rx packets than Tx
packets in this case.

> 
> Host->VM
> UDP+pktgen: improvement (5-10%), but inconsistent
> TCP: Consistent regression (25-30%)

Maybe we can try to figure out this case first since it is the shortest path.
Can you have a look at the TCP statistics and paste a few outputs between tests?
I suspect there is some retransmission, zero window probing, etc.

> 
> Host->VM UDP and pktgen seemed to show improvement in some runs, and in
> others seemed to mirror 4.12-level performance.
> 
> The TCP regression for VM->VM is no surprise, we started with that.
> It's still consistent, but smaller in this specific environment.

Right, there are too many factors that might influence the performance.

> 
> The TCP regression in Host->VM is interesting because I wasn't seeing it
> consistently before binding CPUs + disabling tso/gso/gro/ufo.  Also
> interesting because of how large it is -- By any chance can you see this
> regression on x86 with the same configuration?

I had a quick test and it seems I also see a drop on x86 without tso, gro, etc.
The data with/without tso, gso, etc. is below; I will check the TCP statistics
and let you know soon.

4.12  
    --------------------------------------------------------------------------
    master            32.34s   112.63GB    29.91Gb/s      4031090        0.00
    master            32.33s    32.58GB     8.66Gb/s      1166014        0.00
    -------------------------------------------------------------------------

4.13
    -------------------------------------------------------------------------
    master            32.35s   119.17GB    31.64Gb/s      4265190        0.00
    master            32.33s    27.02GB     7.18Gb/s       967007        0.00
    -------------------------------------------------------------------------

Wei 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-12 18:34                                                     ` Wei Xu
@ 2017-11-14 20:11                                                       ` Matthew Rosato
  2017-11-20 19:25                                                         ` Matthew Rosato
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-11-14 20:11 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, mst, netdev, davem

On 11/12/2017 01:34 PM, Wei Xu wrote:
> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>> This case should be quite similar with pktgen, if you got improvement with
>>>> pktgen, usually it was also the same for UDP, could you please try to disable
>>>> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
>>>> the most significant tests would be like this AFAICT:
>>>>
>>>> Host->VM     4.12    4.13
>>>>  TCP:
>>>>  UDP:
>>>> pktgen:
>>>>
>>>> Don't want to bother you too much, so maybe 4.12 & 4.13 without Jason's patch should
>>>> work since we have seen positive number for that, you can also temporarily skip
>>>> net-next as well.
>>>
>>> Here are the requested numbers, averaged over numerous runs --  guest is
>>> 4GB+1vcpu, host uperf/pktgen bound to 1 host CPU + qemu and vhost thread
>>> pinned to other unique host CPUs.  tso, gso, gro, ufo disabled on host
>>> taps / guest virtio-net devs as requested:
>>>
>>> Host->VM	4.12		4.13
>>> TCP:		9.92Gb/s	6.44Gb/s
>>> UDP:		5.77Gb/s	6.63Gb/s
>>> pktgen:		1572403pps	1904265pps
>>>
>>> UDP/pktgen both show improvement from 4.12->4.13.  More interesting,
>>> however, is that I am seeing the TCP regression for the first time from
>>> host->VM.  I wonder if the combination of CPU binding + disabling of one
>>> or more of tso/gso/gro/ufo is related.
>>>
>>>>
>>>> If you see UDP and pktgen are aligned, then it might be helpful to continue
>>>> the other two cases, otherwise we fail in the first place.
>>>
>>
>> I continued running many iterations of these tests between 4.12 and
>> 4.13..  My throughput findings can be summarized as:
> 
> Really nice to have these numbers.
> 

Wasn't sure if you were asking for the individual #s -- Just in case,
here are the other averages I used to draw my conclusions:

VM->VM		4.12		4.13
UDP		9.06Gb/s	8.99Gb/s
TCP		9.16Gb/s	8.67Gb/s

VM->Host	4.12		4.13
UDP		9.70Gb/s	9.53Gb/s
TCP		6.12Gb/s	6.00Gb/s
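
To put the averages above and the quoted Host->VM table on a common scale, the
relative change from 4.12 to 4.13 can be computed as below. This is a small
sketch that uses only the numbers posted in this thread; the dictionary layout
and the helper name `pct_change` are just for illustration.

```python
# Averages posted in this thread, as (4.12, 4.13) pairs.
RESULTS = {
    "Host->VM TCP (Gb/s)": (9.92, 6.44),
    "Host->VM UDP (Gb/s)": (5.77, 6.63),
    "Host->VM pktgen (pps)": (1572403, 1904265),
    "VM->VM UDP (Gb/s)": (9.06, 8.99),
    "VM->VM TCP (Gb/s)": (9.16, 8.67),
    "VM->Host UDP (Gb/s)": (9.70, 9.53),
    "VM->Host TCP (Gb/s)": (6.12, 6.00),
}

def pct_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0

for name, (old, new) in RESULTS.items():
    print(f"{name:22s} {pct_change(old, new):+6.1f}%")
```

With these inputs the VM->VM TCP change comes out at a bit over -5%, and both
VM->Host changes stay within about 2%.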

>>
>> VM->VM case:
>> UDP:  roughly equivalent
>> TCP:  Consistent regression (5-10%)
>>
>> VM->Host
>> Both UDP and TCP traffic are roughly equivalent.
> 
> The patch improves performance for Rx from guest point of view, so the Tx
> would be no big difference since the Rx packets are far less than Tx in 
> this case.
> 
>>
>> Host->VM
>> UDP+pktgen: improvement (5-10%), but inconsistent
>> TCP: Consistent regression (25-30%)
> 
> Maybe we can try to figure out this case first since it is the shortest path,
> can you have a look at TCP statistics and paste a few outputs between tests?
> I am suspecting there are some retransmitting, zero window probing, etc.
> 

Grabbed some netstat -s results after a few minutes of running (snipped
uninteresting icmp and udp sections).  The test was the TCP Host->VM
scenario, with binding and tso/gso/gro/ufo disabled as before:


Host 4.12

Ip:
    Forwarding: 1
    3724964 total packets received
    0 forwarded
    0 incoming packets discarded
    3724964 incoming packets delivered
    5000026 requests sent out
Tcp:
    4 active connection openings
    1 passive connection openings
    0 failed connection attempts
    0 connection resets received
    1 connections established
    3724954 segments received
    133112205 segments sent out
    93106 segments retransmitted
    0 bad segments received
    2 resets sent
TcpExt:
    5 delayed acks sent
    8 packets directly queued to recvmsg prequeue
    TCPDirectCopyFromPrequeue: 1736
    146 packet headers predicted
    4 packet headers predicted and directly queued to user
    3218205 acknowledgments not containing data payload received
    506561 predicted acknowledgments
    TCPSackRecovery: 2096
    TCPLostRetransmit: 860
    93106 fast retransmits
    TCPLossProbes: 5
    TCPSackShifted: 1959097
    TCPSackMerged: 458343
    TCPSackShiftFallback: 7969
    TCPRcvCoalesce: 2
    TCPOrigDataSent: 133112178
    TCPHystartTrainDetect: 2
    TCPHystartTrainCwnd: 96
    TCPWinProbe: 2
IpExt:
    InBcastPkts: 4
    InOctets: 226014831
    OutOctets: 193103919403
    InBcastOctets: 1312
    InNoECTPkts: 3724964


Host 4.13

Ip:
    Forwarding: 1
    5930785 total packets received
    0 forwarded
    0 incoming packets discarded
    5930785 incoming packets delivered
    4495113 requests sent out
Tcp:
    4 active connection openings
    1 passive connection openings
    0 failed connection attempts
    0 connection resets received
    1 connections established
    5930775 segments received
    73226521 segments sent out
    13975 segments retransmitted
    0 bad segments received
    4 resets sent
TcpExt:
    5 delayed acks sent
    8 packets directly queued to recvmsg prequeue
    TCPDirectCopyFromPrequeue: 1736
    18 packet headers predicted
    4 packet headers predicted and directly queued to user
    4091720 acknowledgments not containing data payload received
    1838984 predicted acknowledgments
    TCPSackRecovery: 9920
    TCPLostRetransmit: 31
    13975 fast retransmits
    TCPLossProbes: 6
    TCPSackShifted: 1700187
    TCPSackMerged: 1143698
    TCPSackShiftFallback: 23839
    TCPRcvCoalesce: 2
    TCPOrigDataSent: 73226494
    TCPHystartTrainDetect: 2
    TCPHystartTrainCwnd: 530
IpExt:
    InBcastPkts: 4
    InOctets: 344809215
    OutOctets: 106285682663
    InBcastOctets: 1312
    InNoECTPkts: 5930785


Guest 4.12

Ip:
    133112471 total packets received
    1 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    133112470 incoming packets delivered
    3724897 requests sent out
    40 outgoing packets dropped
Tcp:
    0 active connections openings
    6 passive connection openings
    0 failed connection attempts
    2 connection resets received
    2 connections established
    133112301 segments received
    3724731 segments send out
    0 segments retransmited
    0 bad segments received.
    5 resets sent
TcpExt:
    1 TCP sockets finished time wait in fast timer
    13 delayed acks sent
    138408 packets directly queued to recvmsg prequeue.
    33119208 bytes directly in process context from backlog
    1907783720 bytes directly received in process context from prequeue
    127259218 packet headers predicted
    1313774 packets header predicted and directly queued to user
    24 acknowledgments not containing data payload received
    196 predicted acknowledgments
    2 connections reset due to early user close
    TCPRcvCoalesce: 117069950
    TCPOFOQueue: 2425393
    TCPFromZeroWindowAdv: 109
    TCPToZeroWindowAdv: 109
    TCPWantZeroWindowAdv: 4487
    TCPOrigDataSent: 223
    TCPACKSkippedSeq: 1
IpExt:
    InBcastPkts: 2
    InOctets: 199630961414
    OutOctets: 226019278
    InBcastOctets: 656
    InNoECTPkts: 133112471


Guest 4.13

Ip:
    73226690 total packets received
    1 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    73226689 incoming packets delivered
    5930853 requests sent out
    40 outgoing packets dropped
Tcp:
    0 active connections openings
    6 passive connection openings
    0 failed connection attempts
    2 connection resets received
    2 connections established
    73226522 segments received
    5930688 segments send out
    0 segments retransmited
    0 bad segments received.
    2 resets sent
TcpExt:
    1 TCP sockets finished time wait in fast timer
    13 delayed acks sent
    490503 packets directly queued to recvmsg prequeue.
    306976 bytes directly in process context from backlog
    6875924176 bytes directly received in process context from prequeue
    65617512 packet headers predicted
    4735750 packets header predicted and directly queued to user
    20 acknowledgments not containing data payload received
    61 predicted acknowledgments
    2 connections reset due to early user close
    TCPRcvCoalesce: 60654609
    TCPOFOQueue: 2857814
    TCPOrigDataSent: 85
IpExt:
    InBcastPkts: 1
    InOctets: 109839485374
    OutOctets: 344816614
    InBcastOctets: 328
    InNoECTPkts: 73226690
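
A couple of ratios can be derived from the host-side counters above. The sketch
below uses only the posted "segments sent out", "segments retransmitted" and
"segments received" values. Treating the segments received by the host as
mostly ACKs from the guest is an assumption (the stream is one-way), and the
ratios are descriptive only; they do not identify the cause by themselves.

```python
# Host-side TCP counters copied from the netstat -s output above.
HOST = {
    "4.12": {"sent": 133112205, "retrans": 93106, "received": 3724954},
    "4.13": {"sent": 73226521, "retrans": 13975, "received": 5930775},
}

def retrans_pct(c):
    """Retransmitted segments as a percentage of all segments sent."""
    return 100.0 * c["retrans"] / c["sent"]

def sent_per_received(c):
    """Segments sent by the host per segment received back from the guest."""
    return c["sent"] / c["received"]

for kernel, c in HOST.items():
    print(f"{kernel}: retrans {retrans_pct(c):.3f}%  "
          f"sent/received {sent_per_received(c):.1f}")
```

On these numbers the retransmit percentage is lower on 4.13 than on 4.12, while
the host receives roughly one segment per 12 sent on 4.13 versus one per 35-36
on 4.12.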


>>
>> Host->VM UDP and pktgen seemed to show improvement in some runs, and in
>> others seemed to mirror 4.12-level performance.
>>
>> The TCP regression for VM->VM is no surprise, we started with that.
>> It's still consistent, but smaller in this specific environment.
> 
> Right, there are too many facts might influent the performance.
> 
>>
>> The TCP regression in Host->VM is interesting because I wasn't seeing it
>> consistently before binding CPUs + disabling tso/gso/gro/ufo.  Also
>> interesting because of how large it is -- By any chance can you see this
>> regression on x86 with the same configuration?
> 
> Had a quick test and it seems I also got drop on x86 without tso,gro,..., data
> with/without tso,gso,..., will check out tcp statistics and let you know soon.
> 
> 4.12  
>     --------------------------------------------------------------------------
>     master            32.34s   112.63GB    29.91Gb/s      4031090        0.00
>     master            32.33s    32.58GB     8.66Gb/s      1166014        0.00
>     -------------------------------------------------------------------------
> 
> 4.13
>     -------------------------------------------------------------------------
>     master            32.35s   119.17GB    31.64Gb/s      4265190        0.00
>     master            32.33s    27.02GB     7.18Gb/s       967007        0.00
>     -------------------------------------------------------------------------
> 
> Wei 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-14 20:11                                                       ` Matthew Rosato
@ 2017-11-20 19:25                                                         ` Matthew Rosato
  2017-11-27 16:21                                                           ` Wei Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-11-20 19:25 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, mst, netdev, davem

On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> On 11/12/2017 01:34 PM, Wei Xu wrote:
>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>> This case should be quite similar with pktgen, if you got improvement with
>>>>> pktgen, usually it was also the same for UDP, could you please try to disable
>>>>> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
>>>>> the most significant tests would be like this AFAICT:
>>>>>
>>>>> Host->VM     4.12    4.13
>>>>>  TCP:
>>>>>  UDP:
>>>>> pktgen:

So, I automated these scenarios for extended overnight runs and started
experiencing OOM conditions overnight on a 40G system.  I did a bisect
and it also points to c67df11f.  I can see a leak in at least all of the
Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
fastest leak.

I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
intervals until a large percentage of host memory was consumed.  The numbers
below were taken after the last pktgen run completed.  The summary is that a
very large # of active skbuff_head_cache entries can be seen -- the sums of
the alloc and free calls match up, but the # of active skbuff_head_cache
entries keeps growing each time the workload is run and never goes back down
in between runs.

free -h:
     total        used        free      shared  buff/cache   available
Mem:   39G         31G        6.6G        472K        1.4G        6.8G

slabtop:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1001952 1000610  99%    0.75K  23856	   42    763392K skbuff_head_cache
126192 126153  99%    0.36K   2868	 44     45888K ksm_rmap_item
100485 100435  99%    0.41K   1305	 77     41760K kernfs_node_cache
 63294  39598  62%    0.48K    959	 66     30688K dentry
 31968  31719  99%    0.88K    888	 36     28416K inode_cache

/sys/kernel/slab/skbuff_head_cache/alloc_calls:
    259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10

/sys/kernel/slab/skbuff_head_cache/free_calls:
  13492 <not-available> age=4295073614 pid=0 cpus=0
 978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733 cpus=1-19
      6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325 cpus=4,8,10,12,14
      3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269 pid=0-11605 cpus=5,7,12
      1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
      2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
      1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
      1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
      3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273 pid=9915-11581 cpus=8,16,18
      2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155 pid=11605-11699 cpus=2,9
      1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835 pid=331 cpus=11
   8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571 pid=11863 cpus=0


By comparison, when running 4.13 with c67df11f reverted, here's the same
output after the exact same test:

free -h:
       total        used        free      shared  buff/cache   available
Mem:     39G        783M         37G        472K        637M         37G

slabtop:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
   714    256  35%    0.75K     17	 42	  544K skbuff_head_cache

/sys/kernel/slab/skbuff_head_cache/alloc_calls:
    257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
/sys/kernel/slab/skbuff_head_cache/free_calls:
    255 <not-available> age=4295003081 pid=0 cpus=0
      1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
      1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
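
The alloc_calls/free_calls totals from the leaking run can be cross-checked
with a short script. The sketch below assumes the usual "count symbol+offset
..." layout of those slub_debug files and keeps only the count and call site of
each line posted above (the age/pid/cpus fields are omitted); the helper name
`parse_calls` is just for illustration.

```python
import re

# Count and call site from the leaking 4.13 run (other fields omitted).
ALLOC_CALLS = """\
    259 __alloc_skb+0x68/0x188
1000351 __build_skb+0x42/0xb0
"""

FREE_CALLS = """\
  13492 <not-available>
 978298 tun_do_read.part.10+0x18c/0x6a0
      6 skb_free_datagram+0x32/0x78
      3 __dev_kfree_skb_any+0x5e/0x70
      1 netlink_broadcast_filtered+0x172/0x470
      2 netlink_dump+0x268/0x2a8
      1 netlink_unicast+0x1ae/0x220
      1 tcp_recvmsg+0x2e2/0xa60
      3 unix_stream_read_generic+0x810/0x908
      2 tap_do_read+0x16a/0x488
      1 macvlan_process_broadcast+0x17e/0x1e0
   8800 pktgen_thread_worker+0x80a/0x16d8
"""

def parse_calls(text):
    """Return {call_site: count} for lines that start with a number."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S+)", line)
        if m:
            site = m.group(2)
            counts[site] = counts.get(site, 0) + int(m.group(1))
    return counts

alloc = parse_calls(ALLOC_CALLS)
free = parse_calls(FREE_CALLS)
print(sum(alloc.values()), sum(free.values()))
print(max(alloc, key=alloc.get))
```

Both sums come to 1000610, which equals the ACTIVE count slabtop reports for
skbuff_head_cache, and nearly all of those objects were allocated through
__build_skb.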

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-20 19:25                                                         ` Matthew Rosato
@ 2017-11-27 16:21                                                           ` Wei Xu
  2017-11-28  1:36                                                             ` Jason Wang
  0 siblings, 1 reply; 42+ messages in thread
From: Wei Xu @ 2017-11-27 16:21 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> > On 11/12/2017 01:34 PM, Wei Xu wrote:
> >> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> >>>>> This case should be quite similar with pktgen, if you got improvement with
> >>>>> pktgen, usually it was also the same for UDP, could you please try to disable
> >>>>> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
> >>>>> the most significant tests would be like this AFAICT:
> >>>>>
> >>>>> Host->VM     4.12    4.13
> >>>>>  TCP:
> >>>>>  UDP:
> >>>>> pktgen:
> 
> So, I automated these scenarios for extended overnight runs and started
> experiencing OOM conditions overnight on a 40G system.  I did a bisect
> and it also points to c67df11f.  I can see a leak in at least all of the
> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
> fastest leak.
> 
> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
> intervals until a large% of host memory was consumed.  Numbers below
> after the last pktgen run completed. The summary is that a very large #
> of active skbuff_head_cache entries can be seen - The sum of alloc/free
> calls match up, but the # of active skbuff_head_cache entries keeps
> growing each time the workload is run and never goes back down in
> between runs.
> 
> free -h:
>      total        used        free      shared  buff/cache   available
> Mem:   39G         31G        6.6G        472K        1.4G        6.8G
> 
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 
> 1001952 1000610  99%    0.75K  23856	   42    763392K skbuff_head_cache
> 126192 126153  99%    0.36K   2868	 44     45888K ksm_rmap_item
> 100485 100435  99%    0.41K   1305	 77     41760K kernfs_node_cache
>  63294  39598  62%    0.48K    959	 66     30688K dentry
>  31968  31719  99%    0.88K    888	 36     28416K inode_cache
> 
> /sys/kernel/slab/skbuff_head_cache/alloc_calls :
>     259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
> 
> /sys/kernel/slab/skbuff_head_cache/free_calls:
>   13492 <not-available> age=4295073614 pid=0 cpus=0
>  978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
> cpus=1-19
>       6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
> cpus=4,8,10,12,14
>       3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
> pid=0-11605 cpus=5,7,12
>       1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>       2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>       1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>       3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
> pid=9915-11581 cpus=8,16,18
>       2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
> pid=11605-11699 cpus=2,9
>       1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
> pid=331 cpus=11
>    8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571
> pid=11863 cpus=0
> 
> 
> By comparison, when running 4.13 with c67df11f reverted, here's the same
> output after the exact same test:
> 
> free -h:
>        total        used        free      shared  buff/cache   available
> Mem:     39G        783M         37G        472K        637M         37G
> 
> slabtop:
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>    714    256  35%    0.75K     17	 42	  544K skbuff_head_cache
> 
> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>     257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
> /sys/kernel/slab/skbuff_head_cache/free_calls:
>     255 <not-available> age=4295003081 pid=0 cpus=0
>       1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
> 

Thanks a lot for the test, and sorry for the late update. I was working on
the code path and didn't find anything helpful to you till today.

I did some tests, and initially it turned out that the bottleneck was on the
guest kernel stack (napi) side. After tracking the traffic footprints, it
appeared that the loss happened when the vring was full and could not be
drained by the guest. That then triggered an SKB drop in the vhost driver
because there was no headcount to fill it with. This can be avoided by deferring
consumption of the SKB until a sufficient headcount has been obtained, as in
the patch below.

Could you please try it? It is based on 4.13 and I also applied Jason's
'conditionally enable tx polling' patch.
    https://lkml.org/lkml/2016/6/1/39

I only tested the single-instance case from Host -> VM with uperf & iperf3. I
like iperf3 a bit more since it reports the retransmit count and cwnd by itself
during testing. :)

To maximize the performance of the single-instance case, two vcpus are needed:
one runs the kernel napi and the other serves the socket syscalls
(mostly reads) from the uperf/iperf userspace process. So I gave the guest two
vcpus and pinned the iperf/uperf slave to the one not used by kernel napi. You
may need to find out which one to pin to by checking the CPU utilization in a
quick trial run before starting the long-duration test.

With the patch there is a slight performance improvement for TCP (host/guest
offload off) on x86. 4.12 still wins in roughly 20-30% of runs, but the cwnd
and retransmit statistics are almost the same now; before the patch, 'retrans'
was about 10x higher and cwnd about 6x smaller than on 4.12.

Here is one typical sample of my tests.
                4.12          4.13
offload on:   36.8Gbits     37.4Gbits
offload off:  7.68Gbits     7.84Gbits

I also borrowed an s390x machine with 6 cpus and 4G memory from the system z
team. There 4.12 still seems a bit faster than 4.13 in the offload-off case;
could you please check whether this is aligned with your test bed?
                4.12          4.13
offload on:   37.3Gbits     38.3Gbits
offload off:  6.26Gbits     6.06Gbits

For pktgen, I got about a 10% improvement (xdp1 drop on guest), which is a bit
faster than Jason's number before.
                4.12          4.13
              3.33 Mpps     3.70 Mpps

Thanks again for all the tests you have done.

Wei

--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
                /* On error, stop handling until the next kick. */
                if (unlikely(headcount < 0))
                        goto out;
-               if (nvq->rx_array)
-                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
                /* On overrun, truncate and discard */
                if (unlikely(headcount > UIO_MAXIOV)) {
                        iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
                         * they refilled. */
                        goto out;
                }
+
+               if (nvq->rx_array)
+                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+
                /* We don't need to be notified again. */
                iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
                fixup = msg.msg_iter;
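
To illustrate why the ordering in the patch matters, here is a toy user-space
model. It is not kernel code: toy_handle_rx(), struct toy_result and the enum
are hypothetical names made up for this sketch, and it assumes that a packet
which has already been consumed but then finds no free descriptor is simply
lost, which is the drop described above.

```c
#include <assert.h>

/* Toy model only: the two orderings of "consume" vs. "check for room". */
enum consume_order { CONSUME_BEFORE_CHECK, CONSUME_AFTER_CHECK };

struct toy_result {
	int delivered;     /* packets copied into guest descriptors */
	int lost;          /* packets consumed but never delivered */
	int still_queued;  /* packets left in the queue for the next kick */
};

struct toy_result toy_handle_rx(int packets, int free_descs,
				enum consume_order order)
{
	struct toy_result r = { 0, 0, 0 };
	int queued = packets;

	assert(packets >= 0 && free_descs >= 0);

	while (queued > 0) {
		int consumed = 0;

		if (order == CONSUME_BEFORE_CHECK) {
			queued--;	/* packet leaves the queue first */
			consumed = 1;
		}

		if (free_descs == 0) {
			/* Ring is full: stop until the guest refills it. */
			if (consumed)
				r.lost++;	/* nobody owns this packet now */
			break;
		}

		if (order == CONSUME_AFTER_CHECK)
			queued--;	/* consume only once room is known */

		free_descs--;
		r.delivered++;
	}

	r.still_queued = queued;
	return r;
}
```

With 5 packets and 3 free descriptors, the consume-first ordering delivers 3,
loses 1 and leaves 1 queued, while the check-first ordering delivers 3, loses
none and leaves 2 queued for the next round.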

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-27 16:21                                                           ` Wei Xu
@ 2017-11-28  1:36                                                             ` Jason Wang
  2017-11-28  2:44                                                               ` Matthew Rosato
  2017-11-28  3:51                                                               ` Wei Xu
  0 siblings, 2 replies; 42+ messages in thread
From: Jason Wang @ 2017-11-28  1:36 UTC (permalink / raw)
  To: Wei Xu, Matthew Rosato; +Cc: mst, netdev, davem



On 2017-11-28 00:21, Wei Xu wrote:
> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>> This case should be quite similar with pkgten, if you got improvement with
>>>>>>> pktgen, usually it was also the same for UDP, could you please try to disable
>>>>>>> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
>>>>>>> the most significant tests would be like this AFAICT:
>>>>>>>
>>>>>>> Host->VM     4.12    4.13
>>>>>>>   TCP:
>>>>>>>   UDP:
>>>>>>> pktgen:
>> So, I automated these scenarios for extended overnight runs and started
>> experiencing OOM conditions overnight on a 40G system.  I did a bisect
>> and it also points to c67df11f.  I can see a leak in at least all of the
>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
>> fastest leak.
>>
>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>> intervals until a large% of host memory was consumed.  Numbers below
>> after the last pktgen run completed. The summary is that a very large #
>> of active skbuff_head_cache entries can be seen - The sum of alloc/free
>> calls match up, but the # of active skbuff_head_cache entries keeps
>> growing each time the workload is run and never goes back down in
>> between runs.
>>
>> free -h:
>>       total        used        free      shared  buff/cache   available
>> Mem:   39G         31G        6.6G        472K        1.4G        6.8G
>>
>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>
>> 1001952 1000610  99%    0.75K  23856	   42    763392K skbuff_head_cache
>> 126192 126153  99%    0.36K   2868	 44     45888K ksm_rmap_item
>> 100485 100435  99%    0.41K   1305	 77     41760K kernfs_node_cache
>>   63294  39598  62%    0.48K    959	 66     30688K dentry
>>   31968  31719  99%    0.88K    888	 36     28416K inode_cache
>>
>> /sys/kernel/slab/skbuff_head_cache/alloc_calls :
>>      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
>>
>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>    13492 <not-available> age=4295073614 pid=0 cpus=0
>>   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
>> cpus=1-19
>>        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
>> cpus=4,8,10,12,14
>>        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
>> pid=0-11605 cpus=5,7,12
>>        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>>        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
>> pid=9915-11581 cpus=8,16,18
>>        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
>> pid=11605-11699 cpus=2,9
>>        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
>> pid=331 cpus=11
>>     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571
>> pid=11863 cpus=0
>>
>>
>> By comparison, when running 4.13 with c67df11f reverted, here's the same
>> output after the exact same test:
>>
>> free -h:
>>         total        used        free      shared  buff/cache   available
>> Mem:     39G        783M         37G        472K        637M         37G
>>
>> slabtop:
>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>     714    256  35%    0.75K     17	 42	  544K skbuff_head_cache
>>
>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>      255 <not-available> age=4295003081 pid=0 cpus=0
>>        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>
> Thanks a lot for the test, and sorry for the late update, I was working on
> the code path and didn't find anything helpful to you till today.
>
> I did some tests and initially it turned out that the bottleneck was the guest
> kernel stack(napi) side, followed by tracking the traffic footprints and it
> appeared as the loss happened when vring was full and could not be drained
> out by the guest, afterwards it triggered a SKB drop in vhost driver due
> to no headcount to fill it with, it can be avoided by deferring consuming the
> SKB after having obtained a sufficient headcount with below patch.
>
> Could you please try it? It is based on 4.13 and I also applied Jason's
> 'conditionally enable tx polling' patch.
>      https://lkml.org/lkml/2016/6/1/39

This patch has already been merged.

>
> I only tested one instance case from Host -> VM with uperf & iperf3, I like
> iperf3 a bit more since it spontaneously tells the retransmitted and cwnd
> during testing. :)
>
> To maximize the performance of one instance case, two vcpus are needed,
> one does the kernel napi and the other one should serve the socket syscall
> (mostly reading) from uperf/iperf userspace, so I set two vcpus to the guest
> and pinned the iperf/uperf slave to the one not used by kernel napi, you may
> need to check out which one you should pin properly by seeing the CPU
> utilization with a quick trial test before running the long duration test.
>
> Slight performance improvement for tcp with the patch(host/guest offload off)
> on x86, also 4.12 wins the game with 20-30% possibility from time to time, but
> the cwnd and retransmitted statistics are almost the same now, the 'retrans'
> was about 10x times more and cwnd was 6x smaller than 4.12 before.
>
> Here is one typical sample of my tests.
>                  4.12          4.13
> offload on:   36.8Gbits     37.4Gbits
> offload off:  7.68Gbits     7.84Gbits
>
> I also borrowed a s390x machine with 6 cpus and 4G memory from system z team,
> it seems 4.12 is still a bit faster than 4.13, could you please see if this
> is aligned with your test bed?
>                  4.12          4.13
> offload on:   37.3Gbits     38.3Gbits
> offload off:  6.26Gbits     6.06Gbits
>
> For pktgen, I got 10% improvement(xdp1 drop on guest) which is a bit faster
> than Jason's number before.
>                  4.12          4.13
>                3.33 Mpss     3.70 Mpps
>
> Thanks again for all the tests your have done.
>
> Wei
>
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>                  /* On error, stop handling until the next kick. */
>                  if (unlikely(headcount < 0))
>                          goto out;
> -               if (nvq->rx_array)
> -                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>                  /* On overrun, truncate and discard */
>                  if (unlikely(headcount > UIO_MAXIOV)) {

I think you need to do msg.msg_control = vhost_net_buf_consume() here too.

>                          iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>                           * they refilled. */
>                          goto out;
>                  }
> +
> +               if (nvq->rx_array)
> +                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> +
>                  /* We don't need to be notified again. */
>                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>                  fixup = msg.msg_iter;
>
>

Good catch, this fixes the memory leak too.

I suggest posting a formal patch for -net as soon as possible too, since
it is a valid fix even if it does not help performance.

Thanks

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-28  1:36                                                             ` Jason Wang
@ 2017-11-28  2:44                                                               ` Matthew Rosato
  2017-11-28 18:00                                                                 ` Wei Xu
  2017-11-28  3:51                                                               ` Wei Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Rosato @ 2017-11-28  2:44 UTC (permalink / raw)
  To: Jason Wang, Wei Xu; +Cc: mst, netdev, davem

On 11/27/2017 08:36 PM, Jason Wang wrote:
> 
> 
> On 2017-11-28 00:21, Wei Xu wrote:
>> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>>> This case should be quite similar with pkgten, if you got
>>>>>>>> improvement with
>>>>>>>> pktgen, usually it was also the same for UDP, could you please
>>>>>>>> try to disable
>>>>>>>> tso, gso, gro, ufo on all host tap devices and guest virtio-net
>>>>>>>> devices? Currently
>>>>>>>> the most significant tests would be like this AFAICT:
>>>>>>>>
>>>>>>>> Host->VM     4.12    4.13
>>>>>>>>   TCP:
>>>>>>>>   UDP:
>>>>>>>> pktgen:
>>> So, I automated these scenarios for extended overnight runs and started
>>> experiencing OOM conditions overnight on a 40G system.  I did a bisect
>>> and it also points to c67df11f.  I can see a leak in at least all of the
>>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
>>> fastest leak.
>>>
>>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>>> intervals until a large% of host memory was consumed.  Numbers below
>>> after the last pktgen run completed. The summary is that a very large #
>>> of active skbuff_head_cache entries can be seen - The sum of alloc/free
>>> calls match up, but the # of active skbuff_head_cache entries keeps
>>> growing each time the workload is run and never goes back down in
>>> between runs.
>>>
>>> free -h:
>>>       total        used        free      shared  buff/cache   available
>>> Mem:   39G         31G        6.6G        472K        1.4G        6.8G
>>>
>>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>
>>> 1001952 1000610  99%    0.75K  23856       42    763392K
>>> skbuff_head_cache
>>> 126192 126153  99%    0.36K   2868     44     45888K ksm_rmap_item
>>> 100485 100435  99%    0.41K   1305     77     41760K kernfs_node_cache
>>>   63294  39598  62%    0.48K    959     66     30688K dentry
>>>   31968  31719  99%    0.88K    888     36     28416K inode_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls :
>>>      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776
>>> cpus=0,2,4,18
>>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863
>>> cpus=0,10
>>>
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>    13492 <not-available> age=4295073614 pid=0 cpus=0
>>>   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
>>> cpus=1-19
>>>        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
>>> cpus=4,8,10,12,14
>>>        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
>>> pid=0-11605 cpus=5,7,12
>>>        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>>        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325
>>> cpus=4,12
>>>        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>>        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
>>> pid=9915-11581 cpus=8,16,18
>>>        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
>>> pid=11605-11699 cpus=2,9
>>>        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
>>> pid=331 cpus=11
>>>     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen]
>>> age=8545/62184/110571
>>> pid=11863 cpus=0
>>>
>>>
>>> By comparison, when running 4.13 with c67df11f reverted, here's the same
>>> output after the exact same test:
>>>
>>> free -h:
>>>         total        used        free      shared  buff/cache  
>>> available
>>> Mem:     39G        783M         37G        472K        637M         37G
>>>
>>> slabtop:
>>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>     714    256  35%    0.75K     17     42      544K skbuff_head_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>>      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>      255 <not-available> age=4295003081 pid=0 cpus=0
>>>        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>>
>> Thanks a lot for the test, and sorry for the late update, I was
>> working on
>> the code path and didn't find anything helpful to you till today.
>>
>> I did some tests and initially it turned out that the bottleneck was
>> the guest
>> kernel stack(napi) side, followed by tracking the traffic footprints
>> and it
>> appeared as the loss happened when vring was full and could not be
>> drained
>> out by the guest, afterwards it triggered a SKB drop in vhost driver due
>> to no headcount to fill it with, it can be avoided by deferring
>> consuming the
>> SKB after having obtained a sufficient headcount with below patch.
>>
>> Could you please try it? It is based on 4.13 and I also applied Jason's
>> 'conditionally enable tx polling' patch.
>>      https://lkml.org/lkml/2016/6/1/39
> 
> This patch has already been merged.
> 
>>
>> I only tested one instance case from Host -> VM with uperf & iperf3, I
>> like
>> iperf3 a bit more since it spontaneously tells the retransmitted and cwnd
>> during testing. :)
>>
>> To maximize the performance of one instance case, two vcpus are needed,
>> one does the kernel napi and the other one should serve the socket
>> syscall
>> (mostly reading) from uperf/iperf userspace, so I set two vcpus to the
>> guest
>> and pinned the iperf/uperf slave to the one not used by kernel napi,
>> you may
>> need to check out which one you should pin properly by seeing the CPU
>> utilization with a quick trial test before running the long duration
>> test.
>>
>> Slight performance improvement for tcp with the patch(host/guest
>> offload off)
>> on x86, also 4.12 wins the game with 20-30% possibility from time to
>> time, but
>> the cwnd and retransmitted statistics are almost the same now, the
>> 'retrans'
>> was about 10x times more and cwnd was 6x smaller than 4.12 before.
>>
>> Here is one typical sample of my tests.
>>                  4.12          4.13
>> offload on:   36.8Gbits     37.4Gbits
>> offload off:  7.68Gbits     7.84Gbits
>>
>> I also borrowed a s390x machine with 6 cpus and 4G memory from system
>> z team,
>> it seems 4.12 is still a bit faster than 4.13, could you please see if
>> this
>> is aligned with your test bed?
>>                  4.12          4.13
>> offload on:   37.3Gbits     38.3Gbits
>> offload off:  6.26Gbits     6.06Gbits
>>
>> For pktgen, I got 10% improvement(xdp1 drop on guest) which is a bit
>> faster
>> than Jason's number before.
>>                  4.12          4.13
>>                3.33 Mpss     3.70 Mpps
>>
>> Thanks again for all the tests your have done.
>>
>> Wei
>>
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>>                  /* On error, stop handling until the next kick. */
>>                  if (unlikely(headcount < 0))
>>                          goto out;
>> -               if (nvq->rx_array)
>> -                       msg.msg_control =
>> vhost_net_buf_consume(&nvq->rxq);
>>                  /* On overrun, truncate and discard */
>>                  if (unlikely(headcount > UIO_MAXIOV)) {
> 
> I think you need do msg.msg_control = vhost_net_buf_consume() here too.
> 
>>                          iov_iter_init(&msg.msg_iter, READ, vq->iov,
>> 1, 1);
>> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>>                           * they refilled. */
>>                          goto out;
>>                  }
>> +
>> +               if (nvq->rx_array)
>> +                       msg.msg_control =
>> vhost_net_buf_consume(&nvq->rxq);
>> +
>>                  /* We don't need to be notified again. */
>>                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in,
>> vhost_len);
>>                  fixup = msg.msg_iter;
>>
>>
> 
> Good catch, this fixes the memory leak too.
> 
> I suggest to post a formal patch for -net as soon as possible too since
> it was a valid fix even if it does not help for performance.
>
> Thanks
> 

+1 to posting this patch formally.  I also verified that it resolves the
memory leak I was experiencing.

In terms of performance numbers, here are quick #s using the original
environment where the regression was noted (4GB, 4vcpu guests, no CPU
binding, TCP VM<->VM):

4.12:	34.71Gb/s
4.13:	18.80Gb/s
4.13+:	38.26Gb/s

I'll keep running numbers, but that looks very promising.
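
For reference, the relative changes implied by the three numbers above can be
worked out with a small helper. This is only a sketch; percent_change() is a
hypothetical name, and the inputs are the throughput figures from this mail.

```c
#include <assert.h>

/* Relative change of value against base, in percent. */
double percent_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}
```

percent_change(34.71, 18.80) comes out at about -45.8 (4.13 vs. 4.12), and
percent_change(34.71, 38.26) at about +10.2 (4.13+ vs. 4.12).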

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-28  2:44                                                               ` Matthew Rosato
@ 2017-11-28 18:00                                                                 ` Wei Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Wei Xu @ 2017-11-28 18:00 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Mon, Nov 27, 2017 at 09:44:07PM -0500, Matthew Rosato wrote:
> On 11/27/2017 08:36 PM, Jason Wang wrote:
> > 
> > 
> > On 2017-11-28 00:21, Wei Xu wrote:
> >> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
> >>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> >>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
> >>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> >>>>>>>> This case should be quite similar with pkgten, if you got
> >>>>>>>> improvement with
> >>>>>>>> pktgen, usually it was also the same for UDP, could you please
> >>>>>>>> try to disable
> >>>>>>>> tso, gso, gro, ufo on all host tap devices and guest virtio-net
> >>>>>>>> devices? Currently
> >>>>>>>> the most significant tests would be like this AFAICT:
> >>>>>>>>
> >>>>>>>> Host->VM     4.12    4.13
> >>>>>>>>   TCP:
> >>>>>>>>   UDP:
> >>>>>>>> pktgen:
> >>> So, I automated these scenarios for extended overnight runs and started
> >>> experiencing OOM conditions overnight on a 40G system.  I did a bisect
> >>> and it also points to c67df11f.  I can see a leak in at least all of the
> >>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
> >>> fastest leak.
> >>>
> >>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
> >>> intervals until a large% of host memory was consumed.  Numbers below
> >>> after the last pktgen run completed. The summary is that a very large #
> >>> of active skbuff_head_cache entries can be seen - The sum of alloc/free
> >>> calls match up, but the # of active skbuff_head_cache entries keeps
> >>> growing each time the workload is run and never goes back down in
> >>> between runs.
> >>>
> >>> free -h:
> >>>       total        used        free      shared  buff/cache   available
> >>> Mem:   39G         31G        6.6G        472K        1.4G        6.8G
> >>>
> >>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>
> >>> 1001952 1000610  99%    0.75K  23856       42    763392K
> >>> skbuff_head_cache
> >>> 126192 126153  99%    0.36K   2868     44     45888K ksm_rmap_item
> >>> 100485 100435  99%    0.41K   1305     77     41760K kernfs_node_cache
> >>>   63294  39598  62%    0.48K    959     66     30688K dentry
> >>>   31968  31719  99%    0.88K    888     36     28416K inode_cache
> >>>
> >>> /sys/kernel/slab/skbuff_head_cache/alloc_calls :
> >>>      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776
> >>> cpus=0,2,4,18
> >>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863
> >>> cpus=0,10
> >>>
> >>> /sys/kernel/slab/skbuff_head_cache/free_calls:
> >>>    13492 <not-available> age=4295073614 pid=0 cpus=0
> >>>   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
> >>> cpus=1-19
> >>>        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
> >>> cpus=4,8,10,12,14
> >>>        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
> >>> pid=0-11605 cpus=5,7,12
> >>>        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
> >>>        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325
> >>> cpus=4,12
> >>>        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
> >>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
> >>>        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
> >>> pid=9915-11581 cpus=8,16,18
> >>>        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
> >>> pid=11605-11699 cpus=2,9
> >>>        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
> >>> pid=331 cpus=11
> >>>     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen]
> >>> age=8545/62184/110571
> >>> pid=11863 cpus=0
> >>>
> >>>
> >>> By comparison, when running 4.13 with c67df11f reverted, here's the same
> >>> output after the exact same test:
> >>>
> >>> free -h:
> >>>         total        used        free      shared  buff/cache  
> >>> available
> >>> Mem:     39G        783M         37G        472K        637M         37G
> >>>
> >>> slabtop:
> >>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>     714    256  35%    0.75K     17     42      544K skbuff_head_cache
> >>>
> >>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
> >>>      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
> >>> /sys/kernel/slab/skbuff_head_cache/free_calls:
> >>>      255 <not-available> age=4295003081 pid=0 cpus=0
> >>>        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
> >>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
> >>>
> >> Thanks a lot for the test, and sorry for the late update, I was
> >> working on
> >> the code path and didn't find anything helpful to you till today.
> >>
> >> I did some tests and initially it turned out that the bottleneck was
> >> the guest
> >> kernel stack(napi) side, followed by tracking the traffic footprints
> >> and it
> >> appeared as the loss happened when vring was full and could not be
> >> drained
> >> out by the guest, afterwards it triggered a SKB drop in vhost driver due
> >> to no headcount to fill it with, it can be avoided by deferring
> >> consuming the
> >> SKB after having obtained a sufficient headcount with below patch.
> >>
> >> Could you please try it? It is based on 4.13 and I also applied Jason's
> >> 'conditionally enable tx polling' patch.
> >>      https://lkml.org/lkml/2016/6/1/39
> > 
> > This patch has already been merged.
> > 
> >>
> >> I only tested one instance case from Host -> VM with uperf & iperf3, I
> >> like
> >> iperf3 a bit more since it spontaneously tells the retransmitted and cwnd
> >> during testing. :)
> >>
> >> To maximize the performance of one instance case, two vcpus are needed,
> >> one does the kernel napi and the other one should serve the socket
> >> syscall
> >> (mostly reading) from uperf/iperf userspace, so I set two vcpus to the
> >> guest
> >> and pinned the iperf/uperf slave to the one not used by kernel napi,
> >> you may
> >> need to check out which one you should pin properly by seeing the CPU
> >> utilization with a quick trial test before running the long duration
> >> test.
> >>
> >> Slight performance improvement for tcp with the patch(host/guest
> >> offload off)
> >> on x86, also 4.12 wins the game with 20-30% possibility from time to
> >> time, but
> >> the cwnd and retransmitted statistics are almost the same now, the
> >> 'retrans'
> >> was about 10x times more and cwnd was 6x smaller than 4.12 before.
> >>
> >> Here is one typical sample of my tests.
> >>                  4.12          4.13
> >> offload on:   36.8Gbits     37.4Gbits
> >> offload off:  7.68Gbits     7.84Gbits
> >>
> >> I also borrowed a s390x machine with 6 cpus and 4G memory from system
> >> z team,
> >> it seems 4.12 is still a bit faster than 4.13, could you please see if
> >> this
> >> is aligned with your test bed?
> >>                  4.12          4.13
> >> offload on:   37.3Gbits     38.3Gbits
> >> offload off:  6.26Gbits     6.06Gbits
> >>
> >> For pktgen, I got 10% improvement(xdp1 drop on guest) which is a bit
> >> faster
> >> than Jason's number before.
> >>                  4.12          4.13
> >>                3.33 Mpss     3.70 Mpps
> >>
> >> Thanks again for all the tests your have done.
> >>
> >> Wei
> >>
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
> >>                  /* On error, stop handling until the next kick. */
> >>                  if (unlikely(headcount < 0))
> >>                          goto out;
> >> -               if (nvq->rx_array)
> >> -                       msg.msg_control =
> >> vhost_net_buf_consume(&nvq->rxq);
> >>                  /* On overrun, truncate and discard */
> >>                  if (unlikely(headcount > UIO_MAXIOV)) {
> > 
> > I think you need do msg.msg_control = vhost_net_buf_consume() here too.
> > 
> >>                          iov_iter_init(&msg.msg_iter, READ, vq->iov,
> >> 1, 1);
> >> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
> >>                           * they refilled. */
> >>                          goto out;
> >>                  }
> >> +
> >> +               if (nvq->rx_array)
> >> +                       msg.msg_control =
> >> vhost_net_buf_consume(&nvq->rxq);
> >> +
> >>                  /* We don't need to be notified again. */
> >>                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in,
> >> vhost_len);
> >>                  fixup = msg.msg_iter;
> >>
> >>
> > 
> > Good catch, this fixes the memory leak too.
> > 
> > I suggest to post a formal patch for -net as soon as possible too since
> > it was a valid fix even if it does not help for performance.
> >
> > Thanks
> > 
> 
> +1 to posting this patch formally.  I also verified that it resolves the
> memory leak I was experiencing.
> 
> In terms of performance numbers, here are quick #s using the original
> environment where the regression was noted (4GB, 4vcpu guests, no CPU
> binding, TCP VM<->VM):
> 
> 4.12:	34.71Gb/s
> 4.13:	18.80Gb/s
> 4.13+:	38.26Gb/s
> 

Great to know the numbers. Patch sent. Thank you so much for all your
thorough tests; they really helped a lot in figuring this out.

Wei

> I'll keep running numbers, but that looks very promising.
> 

* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-28  1:36                                                             ` Jason Wang
  2017-11-28  2:44                                                               ` Matthew Rosato
@ 2017-11-28  3:51                                                               ` Wei Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Wei Xu @ 2017-11-28  3:51 UTC (permalink / raw)
  To: Jason Wang; +Cc: Matthew Rosato, mst, netdev, davem

On Tue, Nov 28, 2017 at 09:36:37AM +0800, Jason Wang wrote:
> 
> 
> On 2017-11-28 00:21, Wei Xu wrote:
> > On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
> > > On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> > > > On 11/12/2017 01:34 PM, Wei Xu wrote:
> > > > > On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> > > > > > > > > This case should be quite similar to pktgen; if you got improvement with
> > > > > > > > pktgen, usually it was also the same for UDP, could you please try to disable
> > > > > > > > tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
> > > > > > > > the most significant tests would be like this AFAICT:
> > > > > > > > 
> > > > > > > > Host->VM     4.12    4.13
> > > > > > > >   TCP:
> > > > > > > >   UDP:
> > > > > > > > pktgen:
> > > So, I automated these scenarios for extended overnight runs and started
> > > experiencing OOM conditions overnight on a 40G system.  I did a bisect
> > > and it also points to c67df11f.  I can see a leak in at least all of the
> > > Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
> > > fastest leak.
> > > 
> > > I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
> > > intervals until a large% of host memory was consumed.  Numbers below
> > > after the last pktgen run completed. The summary is that a very large #
> > > of active skbuff_head_cache entries can be seen - The sum of alloc/free
> > > calls match up, but the # of active skbuff_head_cache entries keeps
> > > growing each time the workload is run and never goes back down in
> > > between runs.
> > > 
> > > free -h:
> > >        total   used   free   shared   buff/cache   available
> > > Mem:     39G    31G   6.6G     472K         1.4G        6.8G
> > > 
> > > slabtop:
> > >    OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
> > > 1001952 1000610  99%     0.75K  23856        42     763392K  skbuff_head_cache
> > >  126192  126153  99%     0.36K   2868        44      45888K  ksm_rmap_item
> > >  100485  100435  99%     0.41K   1305        77      41760K  kernfs_node_cache
> > >   63294   39598  62%     0.48K    959        66      30688K  dentry
> > >   31968   31719  99%     0.88K    888        36      28416K  inode_cache
> > > 
> > > /sys/kernel/slab/skbuff_head_cache/alloc_calls:
> > >      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
> > > 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
> > > 
> > > /sys/kernel/slab/skbuff_head_cache/free_calls:
> > >    13492 <not-available> age=4295073614 pid=0 cpus=0
> > >   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733 cpus=1-19
> > >        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325 cpus=4,8,10,12,14
> > >        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269 pid=0-11605 cpus=5,7,12
> > >        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
> > >        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
> > >        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
> > >        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
> > >        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273 pid=9915-11581 cpus=8,16,18
> > >        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155 pid=11605-11699 cpus=2,9
> > >        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835 pid=331 cpus=11
> > >     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571 pid=11863 cpus=0
> > > 
> > > 
> > > By comparison, when running 4.13 with c67df11f reverted, here's the same
> > > output after the exact same test:
> > > 
> > > free -h:
> > >         total        used        free      shared  buff/cache   available
> > > Mem:     39G        783M         37G        472K        637M         37G
> > > 
> > > slabtop:
> > >    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > >     714    256  35%    0.75K     17	 42	  544K skbuff_head_cache
> > > 
> > > /sys/kernel/slab/skbuff_head_cache/alloc_calls:
> > >      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
> > > /sys/kernel/slab/skbuff_head_cache/free_calls:
> > >      255 <not-available> age=4295003081 pid=0 cpus=0
> > >        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
> > >        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
> > > 
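As a sanity check on the skbuff_head_cache line quoted above for base 4.13,
the slabtop columns can be cross-multiplied. The sketch uses only the figures
from that line; the per-slab size is derived from them rather than stated
anywhere in the output, and the variable names are illustrative.

```python
# Figures from the skbuff_head_cache slabtop line on base 4.13.
objs, obj_per_slab, slabs = 1001952, 42, 23856
obj_size_kib, cache_size_kib = 0.75, 763392

# Every slab holds OBJ/SLAB objects, so OBJS should equal SLABS * OBJ/SLAB.
assert slabs * obj_per_slab == objs

slab_kib = cache_size_kib / slabs          # memory backing one slab
payload_mib = objs * obj_size_kib / 1024   # memory in the objects themselves
cache_mib = cache_size_kib / 1024

print(f"slab size: {slab_kib:.0f} KiB")
print(f"object payload: {payload_mib:.0f} MiB of {cache_mib:.1f} MiB cache")
```

The cache itself therefore accounts for well under 1G of the 31G in use. It
holds only the skb heads, so the packet data attached to them would presumably
show up elsewhere.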
> > Thanks a lot for the test, and sorry for the late update, I was working on
> > the code path and didn't find anything helpful to you till today.
> > 
> > I did some tests, and initially the bottleneck turned out to be the guest
> > kernel stack (napi) side. Tracking the traffic footprint then showed that
> > the loss happened when the vring was full and could not be drained by the
> > guest. This triggered an SKB drop in the vhost driver because there was no
> > headcount to fill it with. It can be avoided by deferring consumption of the
> > SKB until a sufficient headcount has been obtained, with the patch below.
> > 
> > Could you please try it? It is based on 4.13 and I also applied Jason's
> > 'conditionally enable tx polling' patch.
> >      https://lkml.org/lkml/2016/6/1/39
> 
> This patch has already been merged.
> 
> > 
> > I only tested the one-instance case from Host -> VM with uperf & iperf3. I like
> > iperf3 a bit more since it reports retransmits and cwnd by itself during
> > testing. :)
> > 
> > To maximize the performance of the one-instance case, two vcpus are needed:
> > one does the kernel napi and the other serves the socket syscalls (mostly
> > reads) from uperf/iperf userspace. So I gave the guest two vcpus and pinned
> > the iperf/uperf slave to the one not used by kernel napi. You may need to
> > check which one to pin to by looking at the CPU utilization in a quick trial
> > run before starting the long-duration test.
> > 
> > There is a slight performance improvement for tcp with the patch (host/guest
> > offload off) on x86. 4.12 still wins in roughly 20-30% of runs, but the cwnd
> > and retransmit statistics are almost the same now; before the patch, 'retrans'
> > was about 10x higher and cwnd about 6x smaller than on 4.12.
> > 
> > Here is one typical sample of my tests.
> >                  4.12          4.13
> > offload on:   36.8Gbits     37.4Gbits
> > offload off:  7.68Gbits     7.84Gbits
> > 
> > I also borrowed a s390x machine with 6 cpus and 4G memory from system z team,
> > it seems 4.12 is still a bit faster than 4.13, could you please see if this
> > is aligned with your test bed?
> >                  4.12          4.13
> > offload on:   37.3Gbits     38.3Gbits
> > offload off:  6.26Gbits     6.06Gbits
> > 
> > For pktgen, I got 10% improvement(xdp1 drop on guest) which is a bit faster
> > than Jason's number before.
> >                  4.12          4.13
> >                3.33 Mpps     3.70 Mpps
> > 
> > Thanks again for all the tests you have done.
> > 
> > Wei
> > 
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
> >                  /* On error, stop handling until the next kick. */
> >                  if (unlikely(headcount < 0))
> >                          goto out;
> > -               if (nvq->rx_array)
> > -                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >                  /* On overrun, truncate and discard */
> >                  if (unlikely(headcount > UIO_MAXIOV)) {
> 
> I think you need do msg.msg_control = vhost_net_buf_consume() here too.
> 
> >                          iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
> >                           * they refilled. */
> >                          goto out;
> >                  }
> > +
> > +               if (nvq->rx_array)
> > +                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > +
> >                  /* We don't need to be notified again. */
> >                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
> >                  fixup = msg.msg_iter;
> > 
> > 
> 
> Good catch, this fixes the memory leak too.
> 
> I suggest to post a formal patch for -net as soon as possible too since it
> was a valid fix even if it does not help for performance.

OK, will post it soon.

Wei

> 
> Thanks
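
For readers following the diff, the effect of moving the
vhost_net_buf_consume() call can be illustrated with a toy userspace model.
This is only a sketch under simplifying assumptions, not the kernel code:
simulate_rx, the packet strings and the headcounts list are hypothetical
stand-ins for the batched skb array and for the headcount obtained on each
loop iteration. The overrun path that Jason mentions is not modelled.

```python
from collections import deque

def simulate_rx(packets, headcounts, consume_early):
    """Toy model of the receive loop.

    packets       -- packets already dequeued in a batch from the tap queue
    headcounts    -- guest buffers available on each loop iteration
    consume_early -- True models the 4.13 ordering, False the patched one
    Returns (delivered, lost, still_batched).
    """
    batch = deque(packets)
    delivered, lost = [], []
    for headcount in headcounts:
        if not batch:
            break
        pkt = batch.popleft() if consume_early else None
        if headcount == 0:
            # Ring is full: the loop bails out until the guest refills it.
            if pkt is not None:
                lost.append(pkt)  # taken out of the batch but never handed on
            break
        if pkt is None:
            pkt = batch.popleft()  # deferred: consume only once buffers exist
        delivered.append(pkt)
    return delivered, lost, list(batch)

for early in (True, False):
    print(early, simulate_rx(["p0", "p1", "p2"], [1, 0], consume_early=early))
```

In the early-consume case the packet that hits the full ring is neither
delivered nor left in the batch, which is consistent with both the leak and
the packet loss discussed in this thread. With deferred consumption it stays
queued for the next kick.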


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-11-08  1:02                                                 ` Matthew Rosato
  2017-11-11 20:59                                                   ` Matthew Rosato
@ 2017-11-12 15:40                                                   ` Wei Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Wei Xu @ 2017-11-12 15:40 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, mst, netdev, davem

On Tue, Nov 07, 2017 at 08:02:48PM -0500, Matthew Rosato wrote:
> On 11/04/2017 07:35 PM, Wei Xu wrote:
> > On Fri, Nov 03, 2017 at 12:30:12AM -0400, Matthew Rosato wrote:
> >> On 10/31/2017 03:07 AM, Wei Xu wrote:
> >>> On Thu, Oct 26, 2017 at 01:53:12PM -0400, Matthew Rosato wrote:
> >>>>
> >>>>>
> >>>>> Are you using the same binding as mentioned in the previous mail you sent? It
> >>>>> might be caused by cpu contention between pktgen and vhost; could you please
> >>>>> try to run pktgen from another idle cpu by adjusting the binding?
> >>>>
> >>>> I don't think that's the case -- I can cause pktgen to hang in the guest
> >>>> without any cpu binding, and with vhost disabled even.
> >>>
> >>> Yes, I did a test and it also hangs in guest, before we figure it out,
> >>> maybe you try udp with uperf with this case?
> >>>
> >>> VM   -> Host
> >>> Host -> VM
> >>> VM   -> VM
> >>>
> >>
> >> Here are averaged run numbers (Gbps throughput) across 4.12, 4.13 and
> >> net-next with and without Jason's recent "vhost_net: conditionally
> >> enable tx polling" applied (referred to as 'patch' below).  1 uperf
> >> instance in each case:
> > 
> > Thanks a lot for the test. 
> > 
> >>
> >> uperf TCP:
> >> 	 4.12	4.13	4.13+patch	net-next	net-next+patch
> >> ----------------------------------------------------------------------
> >> VM->VM	 35.2	16.5	20.84		22.2		24.36
> > 
> > Are you using the same server/test suite? You mentioned the number was around 
> > 28Gb for 4.12 and it dropped about 40% for 4.13, it seems thing changed, are
> > there any options for performance tuning on the server to maximize the cpu
> > utilization? 
> 
> I experience some volatility as I am running on 1 of multiple LPARs
> available to this system (they are sharing physical resources).  But I
> think the real issue was that I left my guest environment set to 4
> vcpus, but was binding assuming there was 1 vcpu (was working on
> something else, forgot to change back).  This likely tainted my most
> recent results, sorry.

Not a problem at all, also thanks for the feedback. :)

> 
> > 
> > I had similar experience on x86 server and desktop before and it made that
> > the result number always went up and down pretty much.
> > 
> >> VM->Host 42.15	43.57	44.90		30.83		32.26
> >> Host->VM 53.17	41.51	42.18		37.05		37.30
> > 
> > This is a bit odd, I remember you said there was no regression while 
> > testing Host>VM, wasn't it? 
> > 
> >>
> >> uperf UDP:
> >> 	 4.12	4.13	4.13+patch	net-next	net-next+patch
> >> ----------------------------------------------------------------------
> >> VM->VM	 24.93	21.63	25.09		8.86		9.62
> >> VM->Host 40.21	38.21	39.72		8.74		9.35
> >> Host->VM 31.26	30.18	31.25		7.2		9.26
> > 
> > > This case should be quite similar to pktgen; if you got improvement with
> > pktgen, usually it was also the same for UDP, could you please try to disable
> > tso, gso, gro, ufo on all host tap devices and guest virtio-net devices? Currently
> > the most significant tests would be like this AFAICT:
> > 
> > Host->VM     4.12    4.13
> >  TCP:
> >  UDP:
> > pktgen:
> > 
> > Don't want to bother you too much, so maybe 4.12 & 4.13 without Jason's patch should
> > work since we have seen positive number for that, you can also temporarily skip
> > net-next as well.
> 
> Here are the requested numbers, averaged over numerous runs --  guest is
> 4GB+1vcpu, host uperf/pktgen bound to 1 host CPU + qemu and vhost thread
> pinned to other unique host CPUs.  tso, gso, gro, ufo disabled on host
> taps / guest virtio-net devs as requested:
> 
> Host->VM	4.12		4.13
> TCP:		9.92Gb/s	6.44Gb/s
> UDP:		5.77Gb/s	6.63Gb/s
> pktgen:		1572403pps	1904265pps
> 
> UDP/pktgen both show improvement from 4.12->4.13.  More interesting,
> however, is that I am seeing the TCP regression for the first time from
> host->VM.  I wonder if the combination of CPU binding + disabling of one
> or more of tso/gso/gro/ufo is related.

Interesting. Then maybe we can address the regression based on this case first,
if we can reproduce it. Can you look at the difference in TCP statistics on
both the host and guest side with 'netstat -s' between tests?
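
One way to compare the counters is to snapshot 'netstat -s' before and after a
run and diff the numeric lines. The sketch below assumes the common layout in
which a counter line starts with its value followed by a description; lines in
other layouts are ignored. The helper names are illustrative, and the sample
snapshots and their numbers are invented purely for illustration; they are not
measurements from this thread.

```python
import re

COUNTER = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")

def parse_counters(text):
    """Map 'description' -> value for lines of the form '<number> <description>'."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters

def diff_counters(before, after):
    """Return the counters that changed between two snapshots, with their deltas."""
    old, new = parse_counters(before), parse_counters(after)
    return {k: new[k] - old.get(k, 0) for k in new if new[k] != old.get(k, 0)}

# Invented sample snapshots (not real data from these tests).
before = """Tcp:
    10 segments retransmitted
    500 segments received
"""
after = """Tcp:
    25 segments retransmitted
    500 segments received
"""
print(diff_counters(before, after))
```

Note that descriptions repeated in several sections would collide in this
simple dictionary, so a real script would likely key on the section name too.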

Wei

> 
> > 
> > If you see UDP and pktgen are aligned, then it might be helpful to continue
> > the other two cases, otherwise we fail in the first place.
> 
> I will start gathering those numbers tomorrow.
> 
> > 
> >> The net is that Jason's recent patch definitely improves things across
> >> the board at 4.13 as well as at net-next -- But the VM<->VM TCP numbers
> >> I am observing are still lower than base 4.12.
> > 
> > Cool.
> > 
> >>
> >> A separate concern is why my UDP numbers look so bad on net-next (have
> >> not bisected this yet).
> > 
> > This might be another issue, I am in vacation, will try it on x86 once back
> > to work on next Wednesday.
> > 
> > Wei
> > 
> >>
> > 
> 


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-18 20:17                                 ` Matthew Rosato
  2017-10-23  2:06                                   ` Jason Wang
@ 2017-10-23 13:57                                   ` Wei Xu
  2017-10-25 20:31                                     ` Matthew Rosato
  1 sibling, 1 reply; 42+ messages in thread
From: Wei Xu @ 2017-10-23 13:57 UTC (permalink / raw)
  To: Matthew Rosato; +Cc: Jason Wang, netdev, davem, mst

On Wed, Oct 18, 2017 at 04:17:51PM -0400, Matthew Rosato wrote:
> On 10/12/2017 02:31 PM, Wei Xu wrote:
> > On Thu, Oct 05, 2017 at 04:07:45PM -0400, Matthew Rosato wrote:
> >>
> >> Ping...  Jason, any other ideas or suggestions?
> > 
> > Hi Matthew,
> > Recently I am doing similar test on x86 for this patch, here are some,
> > differences between our testbeds.
> > 
> > 1. It is nice you have got improvement with 50+ instances(or connections here?)
> > which would be quite helpful to address the issue, also you've figured out the
> > cost(wait/wakeup), kindly reminder did you pin uperf client/server along the whole
> > path besides vhost and vcpu threads? 
> 
> Was not previously doing any pinning whatsoever, just reproducing an
> environment that one of our testers here was running.  Reducing guest
> vcpu count from 4->1, still see the regression.  Then, pinned each vcpu
> thread and vhost thread to a separate host CPU -- still made no
> difference (regression still present).
> 
> > 
> > 2. It might be useful to short the traffic path as a reference, What I am running
> > is briefly like:
> >     pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
> > 
> > The bridge driver(br_forward(), etc) might impact performance due to my personal
> > experience, so eventually I settled down with this simplified testbed which fully
> > isolates the traffic from both userspace and host kernel stack(1 and 50 instances,
> > bridge driver, etc), therefore reduces potential interferences.
> > 
> > The down side of this is that it needs DPDK support in guest, has this ever be
> > run on s390x guest? An alternative approach is to directly run XDP drop on
> > virtio-net nic in guest, while this requires compiling XDP inside guest which needs
> > a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
> > 
> 
> I made an attempt at DPDK, but it has not been run on s390x as far as
> I'm aware and didn't seem trivial to get working.
> 
> So instead I took your alternate suggestion & did:
> pktgen(host) -> tap(x) -> guest(xdp_drop)

It is really nice of you to have tried this. I also tried it on x86 with
two Ubuntu 16.04 guests, but unfortunately I couldn't reproduce the regression
either. I did, however, get lower throughput with 50 instances than with one
instance (1-4 vcpus). Is this the same on s390x?

> 
> When running this setup, I am not able to reproduce the regression.  As
> mentioned previously, I am also unable to reproduce when running one end
> of the uperf connection from the host - I have only ever been able to
> reproduce when both ends of the uperf connection are running within a guest.

If there was no regression, did you see an improvement when running uperf from
the host?

It would be helpful to run pktgen from the VM as Jason suggested in another
mail (pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2). This is very close to your
original test case and can help determine whether tcp or the bridge driver
gives us any clue.

I am also interested in your hardware platform: how many NUMA nodes do you have,
and what is your binding (vcpu/vhost/pktgen)? In my case, I have a server with 4
NUMA nodes and 12 cpus per socket. I explicitly launch qemu from cpu0, then bind
vhost (Rx/Tx) to cpus 2 & 3, and the vcpus start from cpu 4 (3 vcpus for
each).

> 
> > 3. BTW, did you enable hugepages for your guest? It would affect performance more
> > or less, depending on the memory demand when generating traffic. I didn't see
> > such an option in your command lines.
> > 
> 
> s390x does not currently support passing through hugetlb backing via
> QEMU mem-path.

Okay, thanks for sharing this.

Wei


> 


* Re: Regression in throughput between kvm guests over virtual bridge
  2017-10-23 13:57                                   ` Wei Xu
@ 2017-10-25 20:31                                     ` Matthew Rosato
  0 siblings, 0 replies; 42+ messages in thread
From: Matthew Rosato @ 2017-10-25 20:31 UTC (permalink / raw)
  To: Wei Xu; +Cc: Jason Wang, netdev, davem, mst

On 10/23/2017 09:57 AM, Wei Xu wrote:
> On Wed, Oct 18, 2017 at 04:17:51PM -0400, Matthew Rosato wrote:
>> On 10/12/2017 02:31 PM, Wei Xu wrote:
>>> On Thu, Oct 05, 2017 at 04:07:45PM -0400, Matthew Rosato wrote:
>>>>
>>>> Ping...  Jason, any other ideas or suggestions?
>>>
>>> Hi Matthew,
>>> Recently I am doing similar test on x86 for this patch, here are some,
>>> differences between our testbeds.
>>>
>>> 1. It is nice you have got improvement with 50+ instances(or connections here?)
>>> which would be quite helpful to address the issue, also you've figured out the
>>> cost(wait/wakeup), kindly reminder did you pin uperf client/server along the whole
>>> path besides vhost and vcpu threads? 
>>
>> Was not previously doing any pinning whatsoever, just reproducing an
>> environment that one of our testers here was running.  Reducing guest
>> vcpu count from 4->1, still see the regression.  Then, pinned each vcpu
>> thread and vhost thread to a separate host CPU -- still made no
>> difference (regression still present).
>>
>>>
>>> 2. It might be useful to short the traffic path as a reference, What I am running
>>> is briefly like:
>>>     pktgen(host kernel) -> tap(x) -> guest(DPDK testpmd)
>>>
>>> The bridge driver(br_forward(), etc) might impact performance due to my personal
>>> experience, so eventually I settled down with this simplified testbed which fully
>>> isolates the traffic from both userspace and host kernel stack(1 and 50 instances,
>>> bridge driver, etc), therefore reduces potential interferences.
>>>
>>> The down side of this is that it needs DPDK support in guest, has this ever be
>>> run on s390x guest? An alternative approach is to directly run XDP drop on
>>> virtio-net nic in guest, while this requires compiling XDP inside guest which needs
>>> a newer distro(Fedora 25+ in my case or Ubuntu 16.10, not sure).
>>>
>>
>> I made an attempt at DPDK, but it has not been run on s390x as far as
>> I'm aware and didn't seem trivial to get working.
>>
>> So instead I took your alternate suggestion & did:
>> pktgen(host) -> tap(x) -> guest(xdp_drop)
> 
> It is really nice of you for having tried this, I also tried this on x86 with 
> two ubuntu 16.04 guests, but unfortunately I couldn't reproduce it as well,
> but I did get lower throughput with 50 instances than one instance(1-4 vcpus),
> is this the same on s390x? 

For me, the total throughput is higher from 50 instances than for 1
instance when host kernel is 4.13.  However, when running a 50 instance
uperf load I cannot reproduce the regression, either.  Throughput is a
little bit better when host is 4.13 vs 4.12 for a 50 instance run.

> 
>>
>> When running this setup, I am not able to reproduce the regression.  As
>> mentioned previously, I am also unable to reproduce when running one end
>> of the uperf connection from the host - I have only ever been able to
>> reproduce when both ends of the uperf connection are running within a guest.
> 
> Did you see improvement when running uperf from the host if no regression? 
> 
> It would be pretty nice to run pktgen from the VM as Jason suggested in another
> mail(pktgen(vm1) -> tap1 -> bridge -> tap2 -> vm2), this is super close to your
> original test case and can help to determine if we can get some clue with tcp or
> bridge driver.
> 
> Also I am interested in your hardware platform, how many NUMA nodes do you have?
> what about your binding(vcpu/vhost/pktgen). For my case, I got a server with 4
> NUMA nodes and 12 cpus for each sockets, and I am explicitly launching qemu from
> cpu0, then bind vhost(Rx/Tx) to cpu 2&3, and vcpus start from cpu 4(3 vcpus for
> each).

I'm running in an LPAR on a z13.  The particular LPAR I am using to
reproduce has 20 CPUs and 40G of memory assigned, all in 1 NUMA node.  I
was initially recreating an issue uncovered by someone else's test, and
thus was doing no cpu binding -- But have attempted binding vhost and
vcpu threads to individual host CPUs and it seemed to have no impact on
the noted regression.  When doing said binding, I did: qemu-guestA ->
cpu0(or 0-3 when running 4vcpu), qemu-guestA-vhost -> cpu4, qemu-guestB
-> cpu8(or 8-11 when running 4vcpu), qemu-guestB-vhost -> cpu12.
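
The binding described above could be scripted along these lines. This is a
sketch only: the role names and the pin and apply_binding helpers are
illustrative, the thread IDs would have to be looked up separately (for
example with ps), and os.sched_setaffinity is Linux-specific.

```python
import os

# CPU plan for the 1-vcpu case as described above; role names are illustrative.
BINDING = {
    "qemu-guestA": {0},
    "qemu-guestA-vhost": {4},
    "qemu-guestB": {8},
    "qemu-guestB-vhost": {12},
}

def pin(tid, cpus):
    """Restrict thread/process `tid` to `cpus` and return the resulting mask."""
    os.sched_setaffinity(tid, cpus)
    return os.sched_getaffinity(tid)

def apply_binding(tids):
    """`tids` maps a role name from BINDING to a thread ID found by the caller."""
    return {role: pin(tid, BINDING[role]) for role, tid in tids.items()}
```

For the 4-vcpu case, the guest entries would become {0, 1, 2, 3} and
{8, 9, 10, 11} respectively.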


end of thread, other threads:[~2017-11-28 17:39 UTC | newest]

Thread overview: 42+ messages
2017-09-12 17:56 Regression in throughput between kvm guests over virtual bridge Matthew Rosato
2017-09-13  1:16 ` Jason Wang
2017-09-13  8:13   ` Jason Wang
2017-09-13 16:59     ` Matthew Rosato
2017-09-14  4:21       ` Jason Wang
2017-09-15  3:36         ` Matthew Rosato
2017-09-15  8:55           ` Jason Wang
2017-09-15 19:19             ` Matthew Rosato
2017-09-18  3:13               ` Jason Wang
2017-09-18  4:14                 ` [PATCH] vhost_net: conditionally enable tx polling kbuild test robot
2017-09-18  7:36                 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
2017-09-18 18:11                   ` Matthew Rosato
2017-09-20  6:27                     ` Jason Wang
2017-09-20 19:38                       ` Matthew Rosato
2017-09-22  4:03                         ` Jason Wang
2017-09-25 20:18                           ` Matthew Rosato
2017-10-05 20:07                             ` Matthew Rosato
2017-10-11  2:41                               ` Jason Wang
2017-10-12 18:31                               ` Wei Xu
2017-10-18 20:17                                 ` Matthew Rosato
2017-10-23  2:06                                   ` Jason Wang
2017-10-23  2:13                                     ` Michael S. Tsirkin
2017-10-25 20:21                                     ` Matthew Rosato
2017-10-26  9:44                                       ` Wei Xu
2017-10-26 17:53                                         ` Matthew Rosato
2017-10-31  7:07                                           ` Wei Xu
2017-10-31  7:00                                             ` Jason Wang
2017-11-03  4:30                                             ` Matthew Rosato
2017-11-04 23:35                                               ` Wei Xu
2017-11-08  1:02                                                 ` Matthew Rosato
2017-11-11 20:59                                                   ` Matthew Rosato
2017-11-12 18:34                                                     ` Wei Xu
2017-11-14 20:11                                                       ` Matthew Rosato
2017-11-20 19:25                                                         ` Matthew Rosato
2017-11-27 16:21                                                           ` Wei Xu
2017-11-28  1:36                                                             ` Jason Wang
2017-11-28  2:44                                                               ` Matthew Rosato
2017-11-28 18:00                                                                 ` Wei Xu
2017-11-28  3:51                                                               ` Wei Xu
2017-11-12 15:40                                                   ` Wei Xu
2017-10-23 13:57                                   ` Wei Xu
2017-10-25 20:31                                     ` Matthew Rosato
