From: Jinrong Liang <ljr.kernel@gmail.com>
To: Tianqiang Xu <skyele@sjtu.edu.cn>
Cc: x86@kernel.org, pbonzini@redhat.com, seanjc@google.com,
	vkuznets@redhat.com, wanpengli@tencent.com, jmattson@google.com,
	joro@8bytes.org, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, kvm@vger.kernel.org, hpa@zytor.com,
	jarkko@kernel.org, dave.hansen@linux.intel.com,
	linux-kernel@vger.kernel.org, linux-sgx@vger.kernel.org
Subject: Re: [PATCH 1/4] KVM: x86: Introduce .pcpu_is_idle() stub infrastructure
Date: Fri, 17 Dec 2021 17:39:24 +0800
Message-ID: <CAFg_LQWV56zok563F8WbPEuUiJeeEhfUK3ua+tcm8ChZETWKWg@mail.gmail.com>
In-Reply-To: <20210831015919.13006-1-skyele@sjtu.edu.cn>

Hi Tianqiang,
On Fri, Dec 17, 2021 at 15:55, Tianqiang Xu <skyele@sjtu.edu.cn> wrote:
>
> This patch series aims to fix a performance issue caused by the current
> para-virtualized scheduling design.
>
> The current para-virtualized scheduling design uses the 'preempted' field of
> kvm_steal_time to avoid scheduling tasks on a preempted vCPU.
> However, when the pCPU where the preempted vCPU most recently ran is idle,
> this results in low CPU utilization and, consequently, poor performance.
>
> The new 'is_idle' field of kvm_steal_time precisely reveals the status of
> the pCPU where the preempted vCPU most recently ran, and thus improves CPU
> utilization.
>
> pcpu_is_idle() is used to get the value of the 'is_idle' field of
> kvm_steal_time.
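
A minimal sketch of the idea described above, for reference only. The field
layout, the reuse of a pad byte for 'is_idle', and the helper below are
illustrative assumptions, not the actual definitions from this series:

/*
 * Illustrative sketch only -- not the code from this series.  It assumes
 * 'is_idle' reuses one of the pad bytes that follow 'preempted' in the
 * guest/host shared steal-time record.
 */
struct kvm_steal_time_sketch {
	unsigned long long steal;
	unsigned int version;
	unsigned int flags;
	unsigned char preempted;
	unsigned char is_idle;		/* assumed new field: last pCPU is idle */
	unsigned char u8_pad[2];
	unsigned int pad[11];
};

/*
 * Placement idea: a preempted vCPU remains a reasonable wakeup target when
 * the pCPU it last ran on is idle, because the hypervisor can resume it
 * there immediately instead of leaving that pCPU unused.
 */
static int good_wakeup_target(const struct kvm_steal_time_sketch *st)
{
	if (!st->preempted)
		return 1;		/* vCPU is currently running */
	return st->is_idle;		/* preempted, but its pCPU is idle */
}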
>
> Experiments on a VM with 16 vCPUs show that the patch reduces execution time
> by around 50% to 80% for most PARSEC benchmarks.
> This also holds true for a VM with 112 vCPUs.
>
> Experiments on 2 VMs, each with 112 vCPUs, show that the patch reduces
> execution time by around 20% to 80% for most PARSEC benchmarks.
>
> Test environment:
> -- PowerEdge R740
> -- Intel(R) Xeon(R) Gold 6238R CPU (56 cores / 112 threads)
> -- 190 GB host DRAM
> -- QEMU 5.0.0
> -- PARSEC 3.0 Native Inputs
> -- Host is idle during the test
> -- Host and guest kernels are both 5.14.0
>
> Results:
> 1. 1 VM, 16 vCPUs, 16 threads.
>    Host Topology: sockets=2 cores=28 threads=2
>    VM Topology:   sockets=1 cores=16 threads=1
>    Command: <path to parsec>/bin/parsecmgmt -a run -p <benchmark> -i native -n 16
>    The statistics below are the real (wall-clock) time of each benchmark run (lower is better).
>
>                         before patch    after patch     improvements
> bodytrack               52.866s         22.619s         57.21%
> fluidanimate            84.009s         38.148s         54.59%
> streamcluster           270.17s         42.726s         84.19%
> splash2x.ocean_cp       31.932s         9.539s          70.13%
> splash2x.ocean_ncp      36.063s         14.189s         60.65%
> splash2x.volrend        134.587s        21.79s          83.81%
>
> 2. 1 VM, 112 vCPUs. Some benchmarks require the number of threads to be a
> power of 2, so we run them with 64 and 128 threads.
>    Host Topology: sockets=2 cores=28 threads=2
>    VM Topology:   sockets=1 cores=112 threads=1
>    Command: <path to parsec>/bin/parsecmgmt -a run -p <benchmark> -i native -n <64,112,128>
>    The statistics below are the real (wall-clock) time of each benchmark run (lower is better).
>
>                                         before patch    after patch     improvements
> fluidanimate(64 thread)                 124.235s        27.924s         77.52%
> fluidanimate(128 thread)                169.127s        64.541s         61.84%
> streamcluster(112 thread)               861.879s        496.66s         42.37%
> splash2x.ocean_cp(64 thread)            46.415s         18.527s         60.08%
> splash2x.ocean_cp(128 thread)           53.647s         28.929s         46.08%
> splash2x.ocean_ncp(64 thread)           47.613s         19.576s         58.89%
> splash2x.ocean_ncp(128 thread)          54.94s          29.199s         46.85%
> splash2x.volrend(112 thread)            801.384s        144.824s        81.93%
>
> 3. 2 VMs, each with 112 vCPUs. Some benchmarks require the number of threads
> to be a power of 2, so we run them with 64 and 128 threads.
>    Host Topology: sockets=2 cores=28 threads=2
>    VM Topology:   sockets=1 cores=112 threads=1
>    Command: <path to parsec>/bin/parsecmgmt -a run -p <benchmark> -i native -n <64,112,128>
>    The statistics below are the average real (wall-clock) time of each benchmark run across the 2 VMs (lower is better).
>
>                                         before patch    after patch     improvements
> fluidanimate(64 thread)                 135.2125s       49.827s         63.15%
> fluidanimate(128 thread)                178.309s        86.964s         51.23%
> splash2x.ocean_cp(64 thread)            47.4505s        20.314s         57.19%
> splash2x.ocean_cp(128 thread)           55.5645s        30.6515s        44.84%
> splash2x.ocean_ncp(64 thread)           49.9775s        23.489s         53.00%
> splash2x.ocean_ncp(128 thread)          56.847s         28.545s         49.79%
> splash2x.volrend(112 thread)            838.939s        239.632s        71.44%
>
> Due to space limits, we list only representative statistics here.

I ran a performance test following the description in this patch series, but
did not see the performance improvement it describes.

I suspect that a significant difference between my kernel configuration and
yours is the cause. Could you please provide more detailed test information,
such as the kernel configuration options that must be enabled or disabled?
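
For reference, and only as an assumption about which configuration might
matter here: the guest-side steal-time path is gated on the host advertising
the feature, roughly as sketched below (kernel-internal API, illustration
only). I would also expect options such as CONFIG_KVM_GUEST and
CONFIG_PARAVIRT to be relevant.

/*
 * Sketch (guest kernel context): steal-time based hints are only usable
 * when running as a KVM guest whose host advertises KVM_FEATURE_STEAL_TIME
 * via CPUID.
 */
#include <linux/kvm_para.h>

static bool steal_time_hints_available(void)
{
	return kvm_para_available() &&
	       kvm_para_has_feature(KVM_FEATURE_STEAL_TIME);
}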

Regards,
Jinrong Liang

Thread overview: 10+ messages
2021-08-31  1:59 [PATCH 1/4] KVM: x86: Introduce .pcpu_is_idle() stub infrastructure Tianqiang Xu
2021-08-31  1:59 ` [PATCH 2/4] Scheduler changes Tianqiang Xu
2021-08-31  7:14   ` Peter Zijlstra
2021-08-31  1:59 ` [PATCH 3/4] KVM host implementation Tianqiang Xu
2021-08-31  7:16   ` Peter Zijlstra
2021-08-31  7:17     ` Peter Zijlstra
2021-08-31  1:59 ` [PATCH 4/4] KVM guest implementation Tianqiang Xu
2021-08-31  7:21   ` Peter Zijlstra
2021-09-01  4:39 ` [PATCH 1/4] KVM: x86: Introduce .pcpu_is_idle() stub infrastructure Jarkko Sakkinen
2021-12-17  9:39 ` Jinrong Liang [this message]
