* [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch
@ 2012-03-22 23:48 Shirley Ma
  2012-03-27 10:09 ` Jason Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Shirley Ma @ 2012-03-22 23:48 UTC (permalink / raw)
  To: Michael S. Tsirkin, netdev, kvm, tahm

Sorry for being late to submit this patch. I have spent lots of time
trying to find the best approach. This effort is still going on...

This patch is built against the net-next tree.

This is an experimental RFC patch. The purpose of this patch is to
address KVM networking scalability and NUMA scheduling issues.

The existing vhost implementation creates one vhost thread per device
(virtio_net). The RX and TX work of a VM's device is handled by the
same vhost thread.

One limitation of this implementation is that as the number of VMs or
the number of virtio-net interfaces increases, more vhost threads are
created, which consumes more kernel resources and induces more thread
context-switch and scheduling overhead. We noticed that KVM network
performance doesn't scale with an increasing number of VMs.

The other limitation is that a single vhost thread processes both RX
and TX, so one kind of work can block the other.

To address these limitations, we are proposing a per-cpu vhost thread
model in which the number of vhost threads is limited and equal to the
number of online cpus on the host.
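
To make the scaling argument concrete, here is a small sketch of how
the vhost thread count grows under the two models. The helper names
are hypothetical and are not taken from the patch; the sketch only
restates the counting described above.

```c
#include <assert.h>

/*
 * Hypothetical helper, not from the patch: the existing model creates
 * one vhost thread per virtio-net device, so the count grows with the
 * number of VMs and with the number of devices per VM.
 */
static unsigned int vhost_threads_per_device(unsigned int nr_vms,
                                             unsigned int devs_per_vm)
{
    return nr_vms * devs_per_vm;
}

/*
 * Hypothetical helper, not from the patch: the proposed model creates
 * one vhost thread per online cpu, independent of the number of VMs.
 */
static unsigned int vhost_threads_per_cpu(unsigned int nr_online_cpus)
{
    assert(nr_online_cpus > 0);
    return nr_online_cpus;
}
```

For example, 24 VMs with one device each on an 8 cpu host would use
24 vhost threads in the existing model and 8 in the per-cpu model,
while 2 VMs on the same host would use 2 and 8 respectively.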

Based on our testing experience, the vcpus can be scheduled across cpu
sockets even when the number of vcpus is smaller than the number of
cores per cpu socket and there is no other activity besides the KVM
networking workload. We found that performance is better if the vhost
thread is scheduled on the same socket where the work is received.

So in this per-cpu vhost thread implementation, a vhost thread is
selected dynamically based on where the TX/RX work is initiated. A vhost
thread on the same cpu socket is selected, but not on the same cpu as the
vcpu/interrupt thread that initiated the TX/RX work.
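
A minimal sketch of that selection rule, for illustration only. The
topology table, NR_CPUS value and pick_vhost_cpu() are assumptions made
up for this example; they are not the patch's code, and the scan below
simply takes the first other cpu on the socket rather than the load
balanced or random choices mentioned later in this mail.

```c
#include <assert.h>

#define NR_CPUS 8 /* assumption: 2 sockets x 4 cores, like the test host */

/* Hypothetical topology table: cpu_socket[cpu] is the socket id of cpu. */
static const int cpu_socket[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/*
 * Pick the cpu whose vhost thread should run the work: a cpu on the
 * same socket as the initiating cpu, but not the initiating cpu itself.
 * The scan starts just after the initiator and wraps around. If the
 * socket has no other cpu, fall back to the initiator.
 */
static int pick_vhost_cpu(int initiator)
{
    int i, cpu;

    assert(initiator >= 0 && initiator < NR_CPUS);
    for (i = 1; i < NR_CPUS; i++) {
        cpu = (initiator + i) % NR_CPUS;
        if (cpu_socket[cpu] == cpu_socket[initiator])
            return cpu;
    }
    return initiator;
}
```

With this table, work initiated on cpu 0 goes to the vhost thread on
cpu 1, and work initiated on cpu 7 wraps around to cpu 4, so the work
stays on the initiator's socket in both cases.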

When we tested this RFC patch, the other interesting thing we found is
that the performance results also seem related to NIC flow steering. We
are spending time evaluating different NICs' flow director
implementations now. We will enhance this patch based on our findings later.

We have tried different scheduling granularities: per-device, per-vq,
and per-work-type (tx_kick, rx_kick, tx_net, rx_net) vhost scheduling.
So far we have found that per-vq scheduling is good enough.

We also tried different algorithms for selecting which per-cpu vhost
thread on a given cpu socket will run the work: average load balancing,
random selection...

From our test results, we found that scalability has been
significantly improved. This patch also helps small-packet
performance. In one case, a 24-VM, 256-byte tcp_stream test improved
from 810Mb/s to 9.1Gb/s. :)

However, we are seeing some regressions in a local guest to guest
scenario on an 8 cpu NUMA system.
(We created two local VMs, each with 2 vcpus. Without this patch, the
number of threads is 4 vcpus + 2 vhosts = 6; with this patch it is
4 vcpus + 8 vhosts = 12, which causes more context switches. When I
change the scheduling to use 2-4 vhost threads, the regressions are
gone. I am continuing to investigate how to improve local guest to
guest performance for a small number of VMs. Once I find a clue, I will
share it here.)

The cpu hotplug support isn't in place yet. I will post it later.

Since each per-cpu vhost thread will handle multiple vqs, we will be
able to reduce or remove vq notifications when the thread is heavily
loaded in the future.

Here are my test results for remote host to guest tests (tcp_rrs,
udp_rrs, tcp_stream). The guest has 2 vcpus; the host has two cpu
sockets with 4 cores each.

TCP_STREAM      256     512     1K      2K      4K      8K      16K
--------------------------------------------------------------------
Original
H->Guest        2501    4238    4744    5256    7203    6975    5799
Patch
H->Guest        1676    2290    3149    8026    8439    8283    8216

Original
Guest->H        744     1773    5675    1397    8207    7296    8117
Patch
Guest->H        1041    1386    5407    7057    8298    8127    8241

60 instances TCP_RR: patch 150K trans/s vs. 91K trans/s
(65% improvement, with vcpus pinned by taskset to the same socket)
60 instances UDP_RR: patch 172K trans/s vs. 103K trans/s
(67% improvement, with vcpus pinned by taskset to the same socket)
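
The percentages above follow from the transaction rates. As a quick
check, here is a sketch with a hypothetical helper (pct_gain() is only
for illustration) that computes the relative gain, rounded to the
nearest whole percent, using the rates in units of K trans/s.

```c
#include <assert.h>

/*
 * Percent gain of 'patched' over 'baseline', rounded to the nearest
 * integer. Both arguments use the same unit (here: K trans/s).
 */
static unsigned int pct_gain(unsigned int baseline, unsigned int patched)
{
    assert(baseline > 0 && patched >= baseline);
    return ((patched - baseline) * 100 + baseline / 2) / baseline;
}
```

pct_gain(91, 150) gives 65 for TCP_RR and pct_gain(103, 172) gives 67
for UDP_RR, matching the figures quoted above.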

Tom has run tests from 1 VM to 24 VMs for different workloads. He will
post the results here soon.

If the host scheduler ensures that a VM's vcpus are not scheduled to
another socket (i.e. the vcpus are cpu-masked to the same socket), then
the performance will be better.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Tested-by: Tom Lendacky <toml@us.ibm.com>
---

 drivers/vhost/net.c                  |   26 ++-
 drivers/vhost/vhost.c                |  289 +++++++++++++++++++++++----------
 drivers/vhost/vhost.h                |   16 ++-
 3 files changed, 232 insertions(+), 103 deletions(-)

Thanks
Shirley


* Re: [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch
  2012-03-22 23:48 [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch Shirley Ma
@ 2012-03-27 10:09 ` Jason Wang
  2012-03-27 17:43   ` Shirley Ma
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Wang @ 2012-03-27 10:09 UTC (permalink / raw)
  To: Shirley Ma; +Cc: Michael S. Tsirkin, netdev, kvm, tahm

Hi:

Thanks for the work; it looks very reasonable. Some questions below.

On 03/23/2012 07:48 AM, Shirley Ma wrote:
> Sorry for being late to submit this patch. I have spent lots of time
> trying to find the best approach. This effort is still going on...
>
> This patch is built against net-next tree.
>
> This is an experimental RFC patch. The purpose of this patch is to
> address KVM networking scalability and NUMA scheduling issue.

This also needs testing on non-NUMA machines. I see that on a non-NUMA
machine you just choose the cpu that initiated the work, which seems
suboptimal.
> The existing implementation of vhost creats a vhost thread per-device
> (virtio_net) based. RX and TX work of a VMs per-device is handled by
> same vhost thread.
>
> One of the limitation of this implementation is with increasing the
> number VMs or the number of virtio-net interfces, more vhost threads are
> created, it will consume more kernel resources, and induce more threads
> context switches/scheduling overhead. We noticed that the KVM network
> performance doesn't scale with increasing number of VMs.
>
> The other limitation is to have single vhost thread to process both RX
> and TX, the work will be blocked. So we create this per cpu vhost thread
> implementation. The number of vhost cpu threads is limited to the number
> of cpus on the host.
>
> To address these limitations, we are propsing a per-cpu vhost thread
> model where the number of vhost threads are limited and equal to the
> number of online cpus on the host.

The number of vhost threads needs more consideration. Consider a host
with 1024 cores and a card with 16 tx/rx queues: do we really need
1024 vhost threads?
>
> Based on our testing experience, the vcpus can be scheduled across cpu
> sockets even when the number of vcpus is smaller than the number of
> cores per cpu socket and there is no other  activities besides KVM
> networking workload. We found that if vhost thread is scheduled on the
> same socket as the work is received, the performance will be better.
>
> So in this per cpu vhost thread implementation, a vhost thread is
> selected dynamically based on where the TX/RX work is initiated. A vhost
> thread on the same cpu socket is selected but not on the same cpu as the
> vcpu/interrupt thread that initizated the TX/RX work.
>
> When we test this RFC patch, the other interesting thing we found is the
> performance results also seem related to NIC flow steering. We are
> spending time on evaluate different NICs flow director implementation
> now. We will enhance this patch based on our findings later.
>
> We have tried different scheduling: per-device based, per vq based and
> per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost scheduling,
> we found that so far the per vq based scheduling is good enough for now.

Could you please explain more about those scheduling strategies? Does
per-device mean that a dedicated vhost thread handles all work from
that vhost device? As you mentioned, the scheduling could perhaps be
improved by taking the flow steering info (queue mapping, rxhash etc.)
of the skb in the host into account.
>
> We also tried different algorithm to select which cpu vhost thread will
> running on a specific cpu socket: avg_load balance, and randomly...

It may be worth accounting for out-of-order packets during the test,
since for a single stream a different cpu/vhost/physical queue may be
chosen to do the packet transmission/reception.
>
> > From our test results, we found that the scalability has been
> significantly improved. And this patch is also helpful for small packets
> performance.
>
> Hoever, we are seeing some regressions in a local guest to guest
> scenario on a 8 cpu NUMA system.
>
> In one case, 24 VMs 256 bytes tcp_stream test shows it has improved from
> 810Mb/s to 9.1Gb/s. :)
> (We created two local VMs, and each VM has 2 vcpus. W/o this patch, the
> number of threads is 4 vcpus + 2 vhosts = 6, w/i this patch is 4 vcpus +
> 8 vhosts = 12. It causes more context switches. When I change the
> scheduling to use 2-4 vhost threads, the regressions are gone. I am
> continue investigation on how to make small number of VMs, local guest
> to gues performance better. Once I find the clue, I will share here.)
>
> The cpu hotplug support hasn't in place yet. I will post it later.

Another question: why not just use a workqueue? It has full support
for cpu hotplug and allows more policies.
>
> Since we have per cpu vhost thread, each vhost thread will handle
> multiple vqs, so we will be able to reduce/remove vq notification when
> the work is heavy loaded in future.

Does this issue still exist if event index is used? If vhost does not
publish a new used index, the guest will not kick again.
>
> Here is my test results for remote host to guest test: tcp_rrs, udp_rrs,
> tcp_stream with guest has 2 vpus, host has two cpu socket, each socket
> has 4 cores.
>
> TCP_STREAM	256	512	1K	2K	4K	8K	16K
> --------------------------------------------------------------------
> Original
> H->Guest	2501	4238	4744	5256	7203	6975	5799 		Patch
> H->Guest	1676	2290	3149	8026	8439	8283	8216	
> 								
> Original
> Guest->H	744	1773	5675	1397	8207	7296	8117	
> Patch
> Guest->Host	1041	1386	5407	7057	8298	8127	8241

Looks like there's some noise in the results; the throughput of
"original Guest->Host 2K" looks too low. Also, strangely, I see
regressions in guest packet transmission when testing this patch
(guest to local host TCP_STREAM on a NUMA machine).
>
> 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/sec
> 65%  improved with taskset vcpus on the same socket
> 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s
> 67%  improved with taskset vcpus on the same socket
>
> Tom has run 1VM to 24 VMs test for different work. He will post it here
> soon.
>
> If the host scheduler ensures that the VM's vcpus are not scheduled to
> another socket (i.e. cpu mask the vcpus on same socket) then the
> performance will be better.
>
> Signed-off-by: Shirley Ma<xma@us.ibm.com>
> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> Tested-by: Tom Lendacky<toml@us.ibm.com>
> ---
>
>   drivers/vhost/net.c                  |   26 ++-
>   drivers/vhost/vhost.c                |  289
> +++++++++++++++++++++++----------
>   drivers/vhost/vhost.h                |   16 ++-
>   3 files changed, 232 insertions(+), 103 deletions(-)
>
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch
  2012-03-27 10:09 ` Jason Wang
@ 2012-03-27 17:43   ` Shirley Ma
  2012-04-05 12:28     ` Michael S. Tsirkin
  0 siblings, 1 reply; 6+ messages in thread
From: Shirley Ma @ 2012-03-27 17:43 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, netdev, kvm, tahm

On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> Hi:
> 
> Thanks for the work and it looks very reasonable, some questions
> below.
> 
> On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > Sorry for being late to submit this patch. I have spent lots of time
> > trying to find the best approach. This effort is still going on...
> >
> > This patch is built against net-next tree.
> >
> > This is an experimental RFC patch. The purpose of this patch is to
> > address KVM networking scalability and NUMA scheduling issue.
> 
> Need also test for non-NUMA machine, I see that you just choose the
> cpu 
> that initiates the work for non-numa machine which seems sub optimal.

Good suggestions. I don't have any non-NUMA systems, but KK ran some
tests on a non-NUMA system. He saw around a 20% performance gain for a
single VM, local host to guest. I hope we can run a full test on a
non-NUMA system.

On a non-NUMA system, the same per-cpu vhost thread will always be
picked consistently for a particular vq, since all cores are on the
same cpu socket. So there will be two per-cpu vhost threads handling
TX and RX simultaneously.

> > The existing implementation of vhost creats a vhost thread
> per-device
> > (virtio_net) based. RX and TX work of a VMs per-device is handled by
> > same vhost thread.
> >
> > One of the limitation of this implementation is with increasing the
> > number VMs or the number of virtio-net interfces, more vhost threads
> are
> > created, it will consume more kernel resources, and induce more
> threads
> > context switches/scheduling overhead. We noticed that the KVM
> network
> > performance doesn't scale with increasing number of VMs.
> >
> > The other limitation is to have single vhost thread to process both
> RX
> > and TX, the work will be blocked. So we create this per cpu vhost
> thread
> > implementation. The number of vhost cpu threads is limited to the
> number
> > of cpus on the host.
> >
> > To address these limitations, we are propsing a per-cpu vhost thread
> > model where the number of vhost threads are limited and equal to the
> > number of online cpus on the host.
> 
> The number of vhost thread needs more consideration. Consider that we 
> have a 1024 cores host with a card have 16 tx/rx queues, do we really 
> need 1024 vhost threads?

In this case, we could add a module parameter to limit the number of
cores/sockets to be used.
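
As a rough sketch of what such a limit could look like (the name
nr_vhost_threads() and the "0 means no limit" convention are
assumptions for illustration, not part of the posted patch), the
thread count would be the number of online cpus capped by the
parameter:

```c
#include <assert.h>

/*
 * Hypothetical helper: number of per-cpu vhost threads to create.
 * 'max_threads' stands in for a module parameter; 0 is assumed to
 * mean "no limit". The result never exceeds the online cpu count.
 */
static unsigned int nr_vhost_threads(unsigned int nr_online_cpus,
                                     unsigned int max_threads)
{
    assert(nr_online_cpus > 0);
    if (max_threads == 0 || max_threads > nr_online_cpus)
        return nr_online_cpus;
    return max_threads;
}
```

On the 1024 core host in the question, a limit of 16 would give 16
vhost threads. The sketch does not decide which cpus those threads
would run on.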

> >
> > Based on our testing experience, the vcpus can be scheduled across
> cpu
> > sockets even when the number of vcpus is smaller than the number of
> > cores per cpu socket and there is no other  activities besides KVM
> > networking workload. We found that if vhost thread is scheduled on
> the
> > same socket as the work is received, the performance will be better.
> >
> > So in this per cpu vhost thread implementation, a vhost thread is
> > selected dynamically based on where the TX/RX work is initiated. A
> vhost
> > thread on the same cpu socket is selected but not on the same cpu as
> the
> > vcpu/interrupt thread that initizated the TX/RX work.
> >
> > When we test this RFC patch, the other interesting thing we found is
> the
> > performance results also seem related to NIC flow steering. We are
> > spending time on evaluate different NICs flow director
> implementation
> > now. We will enhance this patch based on our findings later.
> >
> > We have tried different scheduling: per-device based, per vq based
> and
> > per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost
> scheduling,
> > we found that so far the per vq based scheduling is good enough for
> now.
> 
> Could you please explain more about those scheduling strategies? Does 
> per-device based means let a dedicated vhost thread to handle all
> work 
> from that vhost device? As you mentioned, maybe an improvement of the 
> scheduling to take flow steering info (queue mapping, rxhash etc.) of 
> skb in host into account.

Yes, per-device scheduling means one per-cpu vhost thread handles all
work from one particular vhost device.

Yes, we think taking flow steering info into account in the scheduling
would help performance. I am studying this now.

> >
> > We also tried different algorithm to select which cpu vhost thread
> will
> > running on a specific cpu socket: avg_load balance, and randomly...
> 
> May worth to account the out-of-oder packet during the test as for a 
> single stream as different cpu/vhost/physical queue may be chose to
> do 
> the packet transmission/reception?

Good point. I haven't gone through all the data yet. The netstat output
might tell us something.

We used an Intel 10G NIC to run all the tests. For a single stream
test, the Intel NIC steers the receive irq to the same irq/queue on
which the TX packets were sent. So when we mask the vcpus of the same
VM to one socket, we shouldn't hit the packet out-of-order case. We
might hit out-of-order packets when the vcpus run across sockets.

> >
> > > From our test results, we found that the scalability has been
> > significantly improved. And this patch is also helpful for small
> packets
> > performance.
> >
> > Hoever, we are seeing some regressions in a local guest to guest
> > scenario on a 8 cpu NUMA system.
> >
> > In one case, 24 VMs 256 bytes tcp_stream test shows it has improved
> from
> > 810Mb/s to 9.1Gb/s. :)
> > (We created two local VMs, and each VM has 2 vcpus. W/o this patch,
> the
> > number of threads is 4 vcpus + 2 vhosts = 6, w/i this patch is 4
> vcpus +
> > 8 vhosts = 12. It causes more context switches. When I change the
> > scheduling to use 2-4 vhost threads, the regressions are gone. I am
> > continue investigation on how to make small number of VMs, local
> guest
> > to gues performance better. Once I find the clue, I will share
> here.)
> >
> > The cpu hotplug support hasn't in place yet. I will post it later.
> 
> Another question is why not just using workqueue? It has full support 
> for cpu hotplug and allow more polices.

Yes, it's good to use a workqueue. I just did everything on top of the
current implementation so that it's easy to compare/analyze the
performance data.

I remember that the vhost implementation changed from workqueue to
thread for some reason, but I can't recall the reason.

> >
> > Since we have per cpu vhost thread, each vhost thread will handle
> > multiple vqs, so we will be able to reduce/remove vq notification
> when
> > the work is heavy loaded in future.
> 
> Does this issue still exist if event index is used? If vhost does not 
> publish new used index, guest would not kick again.

Since the vhost model has been changed so that one thread handles work
from multiple VMs' vqs, it isn't necessary to enable notification
(publishing a new used index) for those vqs whose future work will be
processed on the same per-cpu vhost thread, as long as that per-cpu
vhost thread is still running.

> >
> > Here is my test results for remote host to guest test: tcp_rrs,
> udp_rrs,
> > tcp_stream with guest has 2 vpus, host has two cpu socket, each
> socket
> > has 4 cores.
> >
> > TCP_STREAM    256     512     1K      2K      4K      8K      16K
> > --------------------------------------------------------------------
> > Original
> >
> H->Guest      2501    4238    4744    5256    7203    6975    5799            Patch
> >
> H->Guest      1676    2290    3149    8026    8439    8283    8216    
> >                                                               
> > Original
> >
> Guest->H      744     1773    5675    1397    8207    7296    8117    
> > Patch
> > Guest->Host   1041    1386    5407    7057    8298    8127    8241
> 
> Looks like there's some noise in the result, the throughput of
> "original 
> guest -> Host 2K" looks too low. And some strange is that I see 
> regressions of packet transmission of guest when testing this patch.
> ( 
> Guest to Local Host TCP_STREAM in a NUMA machine).

Yes, since I didn't mask the vcpus to the same socket, it might come
from out-of-order packets. I will rerun the test with the vcpus masked
to the same socket to see if there is any difference.

You can refer to Tom's results. His test is more formal than mine.

> >
> > 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/sec
> > 65%  improved with taskset vcpus on the same socket
> > 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s
> > 67%  improved with taskset vcpus on the same socket
> >
> > Tom has run 1VM to 24 VMs test for different work. He will post it
> here
> > soon.
> >
> > If the host scheduler ensures that the VM's vcpus are not scheduled
> to
> > another socket (i.e. cpu mask the vcpus on same socket) then the
> > performance will be better.
> >
> > Signed-off-by: Shirley Ma<xma@us.ibm.com>
> > Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> > Tested-by: Tom Lendacky<toml@us.ibm.com>
> > ---
> >
> >   drivers/vhost/net.c                  |   26 ++-
> >   drivers/vhost/vhost.c                |  289
> > +++++++++++++++++++++++----------
> >   drivers/vhost/vhost.h                |   16 ++-
> >   3 files changed, 232 insertions(+), 103 deletions(-)
> >
> > Thanks
> > Shirley
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 


* Re: [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch
  2012-03-27 17:43   ` Shirley Ma
@ 2012-04-05 12:28     ` Michael S. Tsirkin
  2012-04-05 15:22       ` [RFC PATCH 0/1] NUMA aware scheduling per cpu " Shirley Ma
  0 siblings, 1 reply; 6+ messages in thread
From: Michael S. Tsirkin @ 2012-04-05 12:28 UTC (permalink / raw)
  To: Shirley Ma; +Cc: Jason Wang, netdev, kvm, tahm

On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
> On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> > Hi:
> > 
> > Thanks for the work and it looks very reasonable, some questions
> > below.

Yes I am happy to see the per-cpu work resurrected.
Some comments below.

> > On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > > Sorry for being late to submit this patch. I have spent lots of time
> > > trying to find the best approach. This effort is still going on...
> > >
> > > This patch is built against net-next tree.
> > >
> > > This is an experimental RFC patch. The purpose of this patch is to
> > > address KVM networking scalability and NUMA scheduling issue.
> > 
> > Need also test for non-NUMA machine, I see that you just choose the
> > cpu 
> > that initiates the work for non-numa machine which seems sub optimal.
> 
> Good suggestions. I don't have any non-numa systems. But KK run some
> tests on non-numa system. He could see around 20% performance gain for
> single VMs local host to guest. I hope we can run a full test on
> non-numa system.
> 
> On non-numa system, the same per vhost-cpu thread will be always picked
> up consistently for a particular vq since all cores are on same cpu
> socket. So there will be two per-cpu vhost threads handle TX/RX
> simultaneously.
> 
> > > The existing implementation of vhost creats a vhost thread
> > per-device
> > > (virtio_net) based. RX and TX work of a VMs per-device is handled by
> > > same vhost thread.
> > >
> > > One of the limitation of this implementation is with increasing the
> > > number VMs or the number of virtio-net interfces, more vhost threads
> > are
> > > created, it will consume more kernel resources, and induce more
> > threads
> > > context switches/scheduling overhead. We noticed that the KVM
> > network
> > > performance doesn't scale with increasing number of VMs.
> > >
> > > The other limitation is to have single vhost thread to process both
> > RX
> > > and TX, the work will be blocked. So we create this per cpu vhost
> > thread
> > > implementation. The number of vhost cpu threads is limited to the
> > number
> > > of cpus on the host.
> > >
> > > To address these limitations, we are propsing a per-cpu vhost thread
> > > model where the number of vhost threads are limited and equal to the
> > > number of online cpus on the host.
> > 
> > The number of vhost thread needs more consideration. Consider that we 
> > have a 1024 cores host with a card have 16 tx/rx queues, do we really 
> > need 1024 vhost threads?
> 
> In this case, we could add a module parameter to limit the number of
> cores/sockets to be used.

Hmm. And then which cores would we run on?
Also, is the parameter different between guests?
Another idea is to scale the # of threads on demand.

Sharing the same thread between guests is also an
interesting approach. If we did this, per-cpu
wouldn't be so expensive, but making this work well
with cgroups would be a challenge.


> > >
> > > Based on our testing experience, the vcpus can be scheduled across
> > cpu
> > > sockets even when the number of vcpus is smaller than the number of
> > > cores per cpu socket and there is no other  activities besides KVM
> > > networking workload. We found that if vhost thread is scheduled on
> > the
> > > same socket as the work is received, the performance will be better.
> > >
> > > So in this per cpu vhost thread implementation, a vhost thread is
> > > selected dynamically based on where the TX/RX work is initiated. A
> > vhost
> > > thread on the same cpu socket is selected but not on the same cpu as
> > the
> > > vcpu/interrupt thread that initizated the TX/RX work.
> > >
> > > When we test this RFC patch, the other interesting thing we found is
> > the
> > > performance results also seem related to NIC flow steering. We are
> > > spending time on evaluate different NICs flow director
> > implementation
> > > now. We will enhance this patch based on our findings later.
> > >
> > > We have tried different scheduling: per-device based, per vq based
> > and
> > > per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost
> > scheduling,
> > > we found that so far the per vq based scheduling is good enough for
> > now.
> > 
> > Could you please explain more about those scheduling strategies? Does 
> > per-device based means let a dedicated vhost thread to handle all
> > work 
> > from that vhost device? As you mentioned, maybe an improvement of the 
> > scheduling to take flow steering info (queue mapping, rxhash etc.) of 
> > skb in host into account.
> 
> Yes, per-device scheduling means one per-cpu vhost theads handle all
> works from one particular vhost-device.
> 
> Yes, we think scheduling to take flow steering info would help
> performance. I am studying this now.

Did anything interesting turn up?


> > >
> > > We also tried different algorithm to select which cpu vhost thread
> > will
> > > running on a specific cpu socket: avg_load balance, and randomly...
> > 
> > May worth to account the out-of-oder packet during the test as for a 
> > single stream as different cpu/vhost/physical queue may be chose to
> > do 
> > the packet transmission/reception?
> 
> Good point. I haven't gone through all data yet. netstat output might
> tell us something.
> 
> We used Intel 10G NIC to run all test. For a single steam test, Intel
> NIC receiving irq steers with same irq/queue which TX packets have been
> sent. So when we mask vcpus from same VM on one socket, we shouldn't hit
> packet out-of-order case. We might hit packet out of order when vcpus
> run across sockets.
> 
> > >
> > > > From our test results, we found that the scalability has been
> > > significantly improved. And this patch is also helpful for small
> > packets
> > > performance.
> > >
> > > Hoever, we are seeing some regressions in a local guest to guest
> > > scenario on a 8 cpu NUMA system.
> > > In one case, 24 VMs 256 bytes tcp_stream test shows it has improved
> > from
> > > 810Mb/s to 9.1Gb/s. :)
> > > (We created two local VMs, and each VM has 2 vcpus. W/o this patch,
> > the
> > > number of threads is 4 vcpus + 2 vhosts = 6, w/i this patch is 4
> > vcpus +
> > > 8 vhosts = 12. It causes more context switches. When I change the
> > > scheduling to use 2-4 vhost threads, the regressions are gone. I am
> > > continue investigation on how to make small number of VMs, local
> > guest
> > > to gues performance better. Once I find the clue, I will share
> > here.)

So, that's one obvious reason. But there could be other explanations:
1. You explicitly mask out the same CPU. But if the socket
   is very small (it's likely each socket is 2 CPUs or even 1 here),
   this might limit the scheduler drastically.
2. If the guests end up running on the same socket, you cause
   more IPIs, which cause exits for the other guest.

> > >
> > > The cpu hotplug support hasn't in place yet. I will post it later.

Not yet done, right?

> > Another question is why not just using workqueue? It has full support 
> > for cpu hotplug and allow more polices.
> 
> Yes, it's good to use workqueue. I just did everything on top of current
> implementation so it's easy to compare/analyze the performance data.
> 
> I remembered the vhost implementation changed from workqueue to thread
> for some reason. I couldn't recall the reason.

At the time the implementation didn't perform well with per-cpu
threads. We wanted a single thread so switched to use just that.

> > >
> > > Since we have per cpu vhost thread, each vhost thread will handle
> > > multiple vqs, so we will be able to reduce/remove vq notification
> > when
> > > the work is heavy loaded in future.
> > 
> > Does this issue still exist if event index is used? If vhost does not 
> > publish new used index, guest would not kick again.
> 
> Since the vhost model has been changed to handle multiple VMs' vqs work,
> then it's not necessary to enable these VMs' vqs notification (published
> new used idex) where these vqs' future work will be processed on the
> same per-cpu vhost thread, as long as the per-cpu vhost thread is still
> running.
> 
> > >
> > > Here is my test results for remote host to guest test: tcp_rrs,
> > udp_rrs,
> > > tcp_stream with guest has 2 vpus, host has two cpu socket, each
> > socket
> > > has 4 cores.
> > >
> > > TCP_STREAM    256     512     1K      2K      4K      8K      16K
> > > --------------------------------------------------------------------
> > > Original
> > >
> > H->Guest      2501    4238    4744    5256    7203    6975    5799            Patch
> > >
> > H->Guest      1676    2290    3149    8026    8439    8283    8216    
> > >                                                               
> > > Original
> > >
> > Guest->H      744     1773    5675    1397    8207    7296    8117    
> > > Patch
> > > Guest->Host   1041    1386    5407    7057    8298    8127    8241
> > 
> > Looks like there's some noise in the result, the throughput of
> > "original 
> > guest -> Host 2K" looks too low. And some strange is that I see 
> > regressions of packet transmission of guest when testing this patch.
> > ( 
> > Guest to Local Host TCP_STREAM in a NUMA machine).
> 
> Yes, since I didn't mask the vcpus on the same socket, it might come
> from packets out of order. I will rerun the test w/i masking vcpus on
> the same socket to see any difference.

Did anything interesting turn up?

> You can reference Tom's results. His test is more formal than mine.
> 
> > >
> > > 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/sec
> > > 65%  improved with taskset vcpus on the same socket
> > > 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s
> > > 67%  improved with taskset vcpus on the same socket
> > >
> > > Tom has run 1VM to 24 VMs test for different work. He will post it
> > > here soon.
> > >
> > > If the host scheduler ensures that the VM's vcpus are not scheduled
> > > to another socket (i.e. cpu mask the vcpus on same socket) then the
> > > performance will be better.
> > >
> > > Signed-off-by: Shirley Ma<xma@us.ibm.com>
> > > Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> > > Tested-by: Tom Lendacky<toml@us.ibm.com>
> > > ---
> > >
> > >   drivers/vhost/net.c                  |   26 ++-
> > >   drivers/vhost/vhost.c                |  289 +++++++++++++++++++++++----------
> > >   drivers/vhost/vhost.h                |   16 ++-
> > >   3 files changed, 232 insertions(+), 103 deletions(-)
> > >
> > > Thanks
> > > Shirley
> > >
> > 
> > 

Also a question: how does this interact with zero copy tx?

-- 
MST


* Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch
  2012-04-05 12:28     ` Michael S. Tsirkin
@ 2012-04-05 15:22       ` Shirley Ma
  2012-04-05 16:48         ` Shirley Ma
  0 siblings, 1 reply; 6+ messages in thread
From: Shirley Ma @ 2012-04-05 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, netdev, kvm, tahm, vivek

On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
> > On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > Thanks for the work and it looks very reasonable, some questions
> > > below.
> 
> Yes I am happy to see the per-cpu work resurrected.
> Some comments below.
Glad to see you have time to review this.

> > > On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > > > Sorry for being late to submit this patch. I have spent lots of
> > > > time trying to find the best approach. This effort is still going
> > > > on...
> > > >
> > > > This patch is built against net-next tree.
> > > >
> > > > This is an experimental RFC patch. The purpose of this patch is to
> > > > address KVM networking scalability and NUMA scheduling issue.
> > > 
> > > This also needs testing on a non-NUMA machine. I see that you just
> > > choose the cpu that initiates the work on a non-numa machine, which
> > > seems suboptimal.
> > 
> > Good suggestions. I don't have any non-numa systems. But KK ran some
> > tests on a non-numa system. He could see around 20% performance gain
> > for single VM local host to guest. I hope we can run a full test on a
> > non-numa system.
> > 
> > On a non-numa system, the same per-cpu vhost thread will always be
> > picked consistently for a particular vq since all cores are on the same
> > cpu socket. So there will be two per-cpu vhost threads handling TX/RX
> > simultaneously.
> > 
> > > > The existing implementation of vhost creates a vhost thread per
> > > > device (virtio_net). RX and TX work of a VM's device is handled by
> > > > the same vhost thread.
> > > >
> > > > One limitation of this implementation is that as the number of VMs
> > > > or the number of virtio-net interfaces increases, more vhost threads
> > > > are created, which consumes more kernel resources and induces more
> > > > thread context switch/scheduling overhead. We noticed that the KVM
> > > > network performance doesn't scale with an increasing number of VMs.
> > > >
> > > > The other limitation is that with a single vhost thread processing
> > > > both RX and TX, the work will be blocked. So we create this per cpu
> > > > vhost thread implementation. The number of vhost cpu threads is
> > > > limited to the number of cpus on the host.
> > > >
> > > > To address these limitations, we are proposing a per-cpu vhost
> > > > thread model where the number of vhost threads is limited and equal
> > > > to the number of online cpus on the host.
> > > 
> > > The number of vhost threads needs more consideration. Consider that
> > > we have a 1024-core host with a card that has 16 tx/rx queues: do we
> > > really need 1024 vhost threads?
> > 
> > In this case, we could add a module parameter to limit the number of
> > cores/sockets to be used.
> 
> Hmm. And then which cores would we run on?
> Also, is the parameter different between guests?
> Another idea is to scale the # of threads on demand.

If we are able to pass the number of guests/vcpus to vhost, we can
scale the vhost threads. Is there any API to get this info?


> Sharing the same thread between guests is also an
> interesting approach, if we did this then per-cpu
> won't be so expensive but making this work well
> with cgroups would be a challenge.

Yes, I am now comparing the approach of sharing a vhost thread pool
among guests with the per-cpu vhost approach.

It's a challenge to work with cgroups anyway.

> 
> > > >
> > > > Based on our testing experience, the vcpus can be scheduled across
> > > > cpu sockets even when the number of vcpus is smaller than the number
> > > > of cores per cpu socket and there is no other activity besides the
> > > > KVM networking workload. We found that if the vhost thread is
> > > > scheduled on the same socket where the work is received, the
> > > > performance will be better.
> > > >
> > > > So in this per cpu vhost thread implementation, a vhost thread is
> > > > selected dynamically based on where the TX/RX work is initiated. A
> > > > vhost thread on the same cpu socket is selected, but not on the same
> > > > cpu as the vcpu/interrupt thread that initiated the TX/RX work.
> > > >
> > > > When we tested this RFC patch, the other interesting thing we found
> > > > is that the performance results also seem related to NIC flow
> > > > steering. We are spending time on evaluating different NICs' flow
> > > > director implementations now. We will enhance this patch based on
> > > > our findings later.
> > > >
> > > > We have tried different scheduling: per-device based, per vq based
> > > > and per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost
> > > > scheduling. We found that so far the per vq based scheduling is good
> > > > enough for now.
> > > 
> > > Could you please explain more about those scheduling strategies?
> > > Does per-device based mean letting a dedicated vhost thread handle
> > > all work from that vhost device? As you mentioned, maybe an
> > > improvement would be for the scheduling to take flow steering info
> > > (queue mapping, rxhash etc.) of the skb in the host into account.
> > 
> > Yes, per-device scheduling means one per-cpu vhost thread handles all
> > work from one particular vhost device.
> > 
> > Yes, we think scheduling that takes flow steering info into account
> > would help performance. I am studying this now.
> 
> Did anything interesting turn up?

Not yet, still investigating.

> 
> > > >
> > > > We also tried different algorithms to select which cpu the vhost
> > > > thread will run on within a specific cpu socket: avg_load balance,
> > > > and randomly...
> > > 
> > > It may be worth accounting for out-of-order packets during the test,
> > > since for a single stream a different cpu/vhost/physical queue may be
> > > chosen to do the packet transmission/reception?
> > 
> > Good point. I haven't gone through all the data yet. netstat output
> > might tell us something.
> > 
> > We used an Intel 10G NIC to run all tests. For a single stream test,
> > the Intel NIC steers receive irqs to the same irq/queue on which the TX
> > packets were sent. So when we mask the vcpus of one VM onto one socket,
> > we shouldn't hit the packet out-of-order case. We might hit packets
> > out of order when vcpus run across sockets.
> > 
> > > >
> > > > From our test results, we found that the scalability has been
> > > > significantly improved. And this patch is also helpful for small
> > > > packets performance.
> > > >
> > > > However, we are seeing some regressions in a local guest to guest
> > > > scenario on an 8 cpu NUMA system.
> > > > In one case, 24 VMs 256 bytes tcp_stream test shows it has improved
> > > > from 810Mb/s to 9.1Gb/s. :)
> > > > (We created two local VMs, and each VM has 2 vcpus. Without this
> > > > patch, the number of threads is 4 vcpus + 2 vhosts = 6; with this
> > > > patch it is 4 vcpus + 8 vhosts = 12. It causes more context
> > > > switches. When I change the scheduling to use 2-4 vhost threads, the
> > > > regressions are gone. I am continuing to investigate how to make
> > > > local guest to guest performance better for a small number of VMs.
> > > > Once I find a clue, I will share it here.)
> 
> So, that's one obvious reason. But there could be other explanations:
> 1. You explicitly mask out the same CPU. But if the socket
>    is very small (it's likely each socket is 2 CPUs or even 1 here),
>    this might limit the scheduler drastically.
Only if we limit the guest vcpus to the same socket. By default the host
schedules vcpus across sockets.

> 2. If guest ends up running on the same socket, you cause
>    more IPIs which cause exits for the other guest.
I tried different approaches to schedule the vhost thread: 1. check
loadavg on a particular cpu; 2. randomly pick a cpu. The performance
didn't differ much with a small number of VMs.

On Tom's 1-24 VMs scalability test, it had impressive results compared
to the existing approach as the number of VMs increased. So it might
not be a big issue.
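
As a rough illustration of the two policies mentioned above (this is a
user-space sketch, not code from the patch), the selection rule "same
socket, but not the same cpu as the one that initiated the work" could
look like the following. The function name, the cpu-to-socket map, the
load table and the fallback for a single-cpu socket are all assumptions
made for the example.

```python
import random


def pick_vhost_cpu(initiating_cpu, cpu_to_socket, load=None, rng=random):
    """Pick a cpu for the vhost worker: same socket, different cpu.

    cpu_to_socket: dict mapping cpu id -> socket id (hypothetical input).
    load: optional dict mapping cpu id -> load estimate; when given, the
          least loaded candidate wins, otherwise a candidate is picked at
          random.
    """
    socket = cpu_to_socket[initiating_cpu]
    candidates = [cpu for cpu, s in cpu_to_socket.items()
                  if s == socket and cpu != initiating_cpu]
    if not candidates:
        # Assumption: on a single-cpu socket there is nothing else to
        # choose, so fall back to the initiating cpu.
        return initiating_cpu
    if load is not None:
        # Policy 1: least loaded candidate, ties broken by lowest cpu id.
        return min(candidates, key=lambda cpu: (load.get(cpu, 0.0), cpu))
    # Policy 2: random candidate on the same socket.
    return rng.choice(candidates)


# Hypothetical topology: two sockets with four cores each.
topology = {cpu: cpu // 4 for cpu in range(8)}
print(pick_vhost_cpu(1, topology, load={0: 0.5, 2: 0.1, 3: 0.9}))
print(pick_vhost_cpu(5, topology))
```

With only a handful of candidate cpus per socket, both policies choose
from the same small set, which is one possible reason the difference is
hard to see with a small number of VMs.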

> > > >
> > > > The cpu hotplug support isn't in place yet. I will post it later.
> 
> Not yet done, right?

Done now, under testing.

> > > Another question is why not just use a workqueue? It has full
> > > support for cpu hotplug and allows more policies.
> > 
> > Yes, it's good to use workqueue. I just did everything on top of the
> > current implementation so it's easy to compare/analyze the performance
> > data.
> > 
> > I remember the vhost implementation changed from workqueue to thread
> > for some reason. I can't recall the reason.
> 
> At the time the implementation didn't perform well with per-cpu
> threads. We wanted a single thread so switched to use just that.
> 
> > > >
> > > > Since we have per cpu vhost threads, each vhost thread will handle
> > > > multiple vqs, so we will be able to reduce/remove vq notification
> > > > when the work is heavily loaded in future.
> > > 
> > > Does this issue still exist if event index is used? If vhost does
> > > not publish new used index, guest would not kick again.
> > 
> > Since the vhost model has been changed to handle work from multiple
> > VMs' vqs, it's not necessary to enable notification (publish a new
> > used index) for those vqs whose future work will be processed on the
> > same per-cpu vhost thread, as long as that per-cpu vhost thread is
> > still running.
> > 
> > > >
> > > > Here are my test results for the remote host to guest test:
> > > > tcp_rrs, udp_rrs, tcp_stream, where the guest has 2 vcpus and the
> > > > host has two cpu sockets, each socket with 4 cores.
> > > >
> > > > TCP_STREAM    256     512     1K      2K      4K      8K      16K
> > > > --------------------------------------------------------------------
> > > > Original
> > > > H->Guest      2501    4238    4744    5256    7203    6975    5799
> > > > Patch
> > > > H->Guest      1676    2290    3149    8026    8439    8283    8216
> > > >
> > > > Original
> > > > Guest->Host   744     1773    5675    1397    8207    7296    8117
> > > > Patch
> > > > Guest->Host   1041    1386    5407    7057    8298    8127    8241
> > > 
> > > Looks like there's some noise in the result, the throughput of
> > > "original guest -> Host 2K" looks too low. And something strange is
> > > that I see regressions of packet transmission of the guest when
> > > testing this patch (Guest to Local Host TCP_STREAM in a NUMA
> > > machine).
> > 
> > Yes, since I didn't mask the vcpus on the same socket, it might come
> > from packets out of order. I will rerun the test with the vcpus
> > masked onto the same socket to see any difference.
> 
> Did anything interesting turn up?

Haven't had time to focus on single stream result yet.

> 
> > You can reference Tom's results. His test is more formal than mine.
> > 
> > > >
> > > > 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/sec
> > > > 65%  improved with taskset vcpus on the same socket
> > > > 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s
> > > > 67%  improved with taskset vcpus on the same socket
> > > >
> > > > Tom has run 1VM to 24 VMs test for different work. He will post
> > > > it here soon.
> > > >
> > > > If the host scheduler ensures that the VM's vcpus are not
> > > > scheduled to another socket (i.e. cpu mask the vcpus on same
> > > > socket) then the performance will be better.
> > > >
> > > > Signed-off-by: Shirley Ma<xma@us.ibm.com>
> > > > Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> > > > Tested-by: Tom Lendacky<toml@us.ibm.com>
> > > > ---
> > > >
> > > >   drivers/vhost/net.c                  |   26 ++-
> > > >   drivers/vhost/vhost.c                |  289 +++++++++++++++++++++++----------
> > > >   drivers/vhost/vhost.h                |   16 ++-
> > > >   3 files changed, 232 insertions(+), 103 deletions(-)
> > > >
> > > > Thanks
> > > > Shirley
> > > >
> > > 
> > > 
> 
> Also a question: how does this interact with zero copy tx? 

Yes, I tested this with zero copy tx. The vhost thread which handles tx
work has been significantly reduced.

Thanks
Shirley


* Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch
  2012-04-05 15:22       ` [RFC PATCH 0/1] NUMA aware scheduling per cpu " Shirley Ma
@ 2012-04-05 16:48         ` Shirley Ma
  0 siblings, 0 replies; 6+ messages in thread
From: Shirley Ma @ 2012-04-05 16:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, netdev, kvm, tahm, vivek

On Thu, 2012-04-05 at 08:22 -0700, Shirley Ma wrote:

> Haven't had time to focus on single stream result yet. 

I forgot to mention that if I switch the vhost scheduling from per-vq
based to per-device based, this minor single stream test regression
goes away. However, the improvement in tcp_rrs, udp_rrs and the other
stream cases is then also gone.
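
To make the difference between the two granularities concrete, here is
a toy model (not the patch's selection logic, which picks threads
dynamically based on where the work is initiated). It assumes a fixed
mapping from a scheduling key to a worker index and two vqs per device
(0 = rx, 1 = tx); the function name and the modulo mapping are
hypothetical.

```python
def worker_for(device_id, vq_index, num_workers, per_vq=True):
    """Map a (device, vq) pair to a worker index.

    per_vq=True:  each vq of a device is keyed separately, so the rx and
                  tx work of one device can land on different workers.
    per_vq=False: the key is the device alone, so all of its vqs share
                  one worker.
    Assumes two vqs per device (0 = rx, 1 = tx).
    """
    if per_vq:
        key = device_id * 2 + vq_index
    else:
        key = device_id
    return key % num_workers


# Device 3 on a host with 8 workers.
print(worker_for(3, 0, 8), worker_for(3, 1, 8))                # per vq
print(worker_for(3, 0, 8, per_vq=False),
      worker_for(3, 1, 8, per_vq=False))                       # per device
```

In this model the per-device key keeps a single stream's rx and tx work
on one worker, while the per-vq key lets them run in parallel on two
workers. That would be consistent with the trade-off described above,
but whether it is the actual cause has not been confirmed in this
thread.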

Shirley


Thread overview: 6+ messages
2012-03-22 23:48 [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch Shirley Ma
2012-03-27 10:09 ` Jason Wang
2012-03-27 17:43   ` Shirley Ma
2012-04-05 12:28     ` Michael S. Tsirkin
2012-04-05 15:22       ` [RFC PATCH 0/1] NUMA aware scheduling per cpu " Shirley Ma
2012-04-05 16:48         ` Shirley Ma