From: yezengruan <yezengruan@huawei.com>
To: Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>
Cc: daniel.lezcano@linaro.org, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, peterz@infradead.org,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux@armlinux.org.uk,
	steven.price@arm.com, kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v2 0/6] KVM: arm64: VCPU preempted check support
Date: Wed, 16 Dec 2020 16:45:44 +0800
Message-ID: <6c1f0896-b78f-c92f-4c3b-9ab17400487b@huawei.com>
In-Reply-To: <b1d23a82d6a7caa79a99597fb83472be@kernel.org>



On 2020/1/15 22:14, Marc Zyngier wrote:
> On 2020-01-13 12:12, Will Deacon wrote:
>> [+PeterZ]
>>
>> On Thu, Dec 26, 2019 at 09:58:27PM +0800, Zengruan Ye wrote:
>>> This patch set aims to support the vcpu_is_preempted() functionality
>>> under KVM/arm64, which allows the guest to check whether a VCPU is
>>> currently running or not. This will enhance lock performance on
>>> overcommitted hosts (more runnable VCPUs than physical CPUs in the
>>> system), as busy-waiting on preempted VCPUs hurts system
>>> performance far more than yielding early.
>>>
>>> We have observed some performance improvements in UnixBench tests.
>>>
>>> unix benchmark result:
>>>   host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
>>>   guest: kernel 5.5.0-rc1, 16 VCPUs
>>>
>>>                test-case                |    after-patch    |   before-patch
>>> ----------------------------------------+-------------------+------------------
>>>  Dhrystone 2 using register variables   | 334600751.0 lps   | 335319028.3 lps
>>>  Double-Precision Whetstone             |     32856.1 MWIPS |     32849.6 MWIPS
>>>  Execl Throughput                       |      3662.1 lps   |      2718.0 lps
>>>  File Copy 1024 bufsize 2000 maxblocks  |    432906.4 KBps  |    158011.8 KBps
>>>  File Copy 256 bufsize 500 maxblocks    |    116023.0 KBps  |     37664.0 KBps
>>>  File Copy 4096 bufsize 8000 maxblocks  |   1432769.8 KBps  |    441108.8 KBps
>>>  Pipe Throughput                        |   6405029.6 lps   |   6021457.6 lps
>>>  Pipe-based Context Switching           |    185872.7 lps   |    184255.3 lps
>>>  Process Creation                       |      4025.7 lps   |      3706.6 lps
>>>  Shell Scripts (1 concurrent)           |      6745.6 lpm   |      6436.1 lpm
>>>  Shell Scripts (8 concurrent)           |       998.7 lpm   |       931.1 lpm
>>>  System Call Overhead                   |   3913363.1 lps   |   3883287.8 lps
>>> ----------------------------------------+-------------------+------------------
>>>  System Benchmarks Index Score          |      1835.1       |      1327.6
>>
>> Interesting, thanks for the numbers.
>>
>> So it looks like there is a decent improvement to be had from targetted vCPU
>> wakeup, but I really dislike the explicit PV interface and it's already been
>> shown to interact badly with the WFE-based polling in smp_cond_load_*().
>>
>> Rather than expose a divergent interface, I would instead like to explore an
>> improvement to smp_cond_load_*() and see how that performs before we commit
>> to something more intrusive. Marc and I looked at this very briefly in the
>> past, and the basic idea is to register all of the WFE sites with the
>> hypervisor, indicating which register contains the address being spun on
>> and which register contains the "bad" value. That way, you don't bother
>> rescheduling a vCPU if the value at the address is still bad, because you
>> know it will exit immediately.
>>
>> Of course, the devil is in the details because when I say "address", that's
>> a guest virtual address, so you need to play some tricks in the hypervisor
>> so that you have a separate mapping for the lockword (it's enough to keep
>> track of the physical address).
>>
>> Our hacks are here but we basically ran out of time to work on them beyond
>> an unoptimised and hacky prototype:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy
>>
>> Marc -- how would you prefer to handle this?
>
> Let me try and rebase this thing to a modern kernel (I doubt it applies without
> conflicts to mainline). We can then have a discussion about its merit on the list
> once I post it. It'd be good to have a pointer to the benchmarks that have been
> used here.
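
For readers following the thread, the conditional-yield idea Will describes above can be modelled in a few lines. This is only a toy sketch in Python, not the code in the pvcy branch: the WaitSite record, the function names, and the dictionary standing in for guest memory are all assumptions made for illustration. It shows the one decision the hypervisor would make: skip a waiting vCPU while the word it spins on still holds the "bad" value.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WaitSite:
    """Hypothetical record of what a vCPU registered before executing WFE."""
    phys_addr: int   # physical address of the word being spun on
    bad_value: int   # value for which the guest would simply go back to waiting


def worth_rescheduling(site: WaitSite, memory: Dict[int, int]) -> bool:
    """A waiting vCPU is only worth running once the word no longer holds the bad value."""
    return memory[site.phys_addr] != site.bad_value


def runnable_vcpus(waiting: Dict[int, WaitSite], memory: Dict[int, int]) -> List[int]:
    """Return the ids of waiting vCPUs whose wait condition may now be satisfied."""
    return sorted(vcpu for vcpu, site in waiting.items()
                  if worth_rescheduling(site, memory))


# Toy example: vCPU 0 spins on a word that is still "bad", vCPU 1 on one that changed.
memory = {0x1000: 1, 0x2000: 0}
waiting = {0: WaitSite(0x1000, bad_value=1), 1: WaitSite(0x2000, bad_value=1)}
print(runnable_vcpus(waiting, memory))  # [1]
```

The sketch keys the lookup on a physical address because, as noted above, the registered address is a guest virtual address and the hypervisor needs its own way to reach the lock word.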

Hi Marc, Will,

My apologies for the slow reply. What is the latest status of this
PV cond yield prototype?

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy

I recently re-ran the UnixBench comparison between the vCPU preempted check
and PV cond yield. The results are as follows:


UnixBench results (index values):
  host:  kernel 5.10.0-rc6, HiSilicon Kunpeng920, 8 CPUs
  guest: kernel 5.10.0-rc6, 16 VCPUs

 Test case (index value)               | 5.10.0-rc6 | pv_cond_yield | vcpu_is_preempted
---------------------------------------+------------+---------------+-------------------
 Dhrystone 2 using register variables  |  29164.0   |    29156.9    |    29207.2
 Double-Precision Whetstone            |   6807.6   |     6789.2    |     6912.1
 Execl Throughput                      |    856.7   |     1195.6    |      863.1
 File Copy 1024 bufsize 2000 maxblocks |    189.9   |      923.5    |     1094.2
 File Copy 256 bufsize 500 maxblocks   |    121.9   |      578.4    |      588.7
 File Copy 4096 bufsize 8000 maxblocks |    419.9   |     1992.0    |     2733.7
 Pipe Throughput                       |   6727.2   |     6670.2    |     6743.2
 Pipe-based Context Switching          |    486.9   |      547.0    |      471.9
 Process Creation                      |    353.4   |      345.1    |      338.5
 Shell Scripts (1 concurrent)          |   3187.2   |     1432.2    |     2798.7
 Shell Scripts (8 concurrent)          |   3410.5   |     1360.1    |     2672.9
 System Call Overhead                  |   2967.0   |     3273.9    |     3497.9
---------------------------------------+------------+---------------+-------------------
 System Benchmarks Index Score         |   1410.0   |     1885.8    |     2128.5
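
To make the last row easier to compare, here is a small Python sketch that turns the overall scores into relative gains over the unpatched 5.10.0-rc6 guest. It also recomputes the baseline score as the geometric mean of the per-test indices, which is how UnixBench derives its overall index. The numbers are copied from the table above; the function names are only illustrative.

```python
import math

# Index values copied from the 5.10.0-rc6 (unpatched) column of the table above.
baseline_indices = [29164.0, 6807.6, 856.7, 189.9, 121.9, 419.9,
                    6727.2, 486.9, 353.4, 3187.2, 3410.5, 2967.0]

# Overall scores copied from the last row of the table.
scores = {"5.10.0-rc6": 1410.0, "pv_cond_yield": 1885.8, "vcpu_is_preempted": 2128.5}


def geometric_mean(values):
    """Geometric mean, computed in log space to avoid a huge intermediate product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def gain_percent(score, baseline):
    """Relative change of score over baseline, in percent."""
    return (score / baseline - 1.0) * 100.0


print(f"recomputed baseline score: {geometric_mean(baseline_indices):.1f}")
for name in ("pv_cond_yield", "vcpu_is_preempted"):
    print(f"{name}: {gain_percent(scores[name], scores['5.10.0-rc6']):+.1f}%")
# pv_cond_yield: +33.7%
# vcpu_is_preempted: +51.0%
```

The overall gain hides per-test differences: in the table, both patched kernels score lower than the baseline on the two Shell Scripts rows.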


Thanks,

Zengruan

>
> Thanks,
>
>         M.



