linux-arm-kernel.lists.infradead.org archive mirror
* [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
@ 2022-11-04  6:20 Usama Arif
  2022-11-04  6:21 ` [v2 1/6] KVM: arm64: Document PV-lock interface Usama Arif
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:20 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

This patchset adds support for vcpu_is_preempted on arm64, which allows the
guest to check whether a vCPU has been scheduled out. This is useful to know
in case the vCPU was holding a lock. vcpu_is_preempted can be used to improve
performance in locking (see the owner_on_cpu usage in mutex_spin_on_owner,
mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and in scheduling
(see available_idle_cpu, which is used in several places in kernel/sched/fair.c,
e.g. in wake_affine to determine which CPU can run soonest).
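
As a rough illustration of how the generic locking code consumes this, a
spin-on-owner loop only keeps spinning while the lock owner is actually
running. The sketch below is modelled loosely on the owner_on_cpu() helper;
the function name is illustrative, not the exact upstream code:

static inline bool owner_running_sketch(struct task_struct *owner)
{
	/*
	 * Keep spinning on the lock owner only while it is on a CPU and
	 * that (virtual) CPU has not been preempted by the host.
	 */
	return READ_ONCE(owner->on_cpu) &&
	       !vcpu_is_preempted(task_cpu(owner));
}

Without this patchset, vcpu_is_preempted() always returns false on arm64, so
the spinner keeps spinning even when the owner's vCPU has been scheduled out
by the host.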

This patchset shows an improvement on overcommitted hosts (vCPUs > pCPUs),
where waiting on preempted vCPUs reduces performance.

This patchset is inspired by the paravirt_steal_clock implementation and by
the work originally done by Zengruan Ye:
https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/.

All the results in the experiments below were obtained on an AWS r6g.metal
instance, which has 64 pCPUs.

The following table shows the index results of UnixBench running on a 128 vCPU VM
with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.
TestName                                6.0.0 base  6.0.0+vcpu_is_preempted    % improvement for vcpu_is_preempted
Dhrystone 2 using register variables    187761      191274.7                   1.871368389
Double-Precision Whetstone              96743.6     98414.4                    1.727039308
Execl Throughput                        689.3       10426                      1412.548963
File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
Pipe Throughput                         76819.5     78601.5                    2.319723508
Pipe-based Context Switching            3444.8      13414.5                    289.4130283
Process Creation                        301.1       293.4                      -2.557289937
Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
System Call Overhead                    3426        3729.4                     8.855808523

System Benchmarks Index Score           3053        11534                      277.7923354

This shows a 277% overall improvement using these patches.

The biggest improvement is in the shell scripts benchmark, which forks a lot
of processes. Forking acquires an rwsem, and a large chunk of time is spent
contending on it in the base 6.0.0 kernel. This can be seen in one of the
call stacks from the perf output of the shell scripts benchmark on 6.0.0 base
(pseudo-NMI was enabled for the perf numbers below):
- 33.79% el0_svc
   - 33.43% do_el0_svc
      - 33.43% el0_svc_common.constprop.3
         - 33.30% invoke_syscall
            - 17.27% __arm64_sys_clone
               - 17.27% __do_sys_clone
                  - 17.26% kernel_clone
                     - 16.73% copy_process
                        - 11.91% dup_mm
                           - 11.82% dup_mmap
                              - 9.15% down_write
                                 - 8.87% rwsem_down_write_slowpath
                                    - 8.48% osq_lock

Just under 50% of the total time in the shell script benchmarks ends up being
spent in osq_lock in the base 6.0.0 kernel:
  Children      Self  Command   Shared Object        Symbol
   17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
    6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
    4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
    3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
    3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
    3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
    3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
    3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
    3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
    0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
    0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock

and this comes down to less than 1% in total with the 6.0.0+vcpu_is_preempted kernel:
  Children      Self  Command   Shared Object        Symbol
     0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
     0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
     0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
     0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
     0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
     0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
     0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
     0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
     0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
     0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock

To make sure there is no change in performance when vCPUs < pCPUs, UnixBench
was also run on a 32 vCPU VM. The kernel with vcpu_is_preempted implemented
performed 0.9% better overall than the base kernel, and the individual
benchmarks were within +/-2% of 6.0.0 base.
Hence the patches have no negative effect when vCPUs < pCPUs.


The other method discussed in https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
was PV conditional yield (pvcy) by Marc Zyngier and Will Deacon, which reduces
vCPU rescheduling when the vCPU would exit immediately
(https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy).
Those patches were ported to the 6.0.0 kernel and tested with UnixBench on a
128 vCPU VM:

TestName                                6.0.0 base  6.0.0+pvcy      % improvement for pvcy
Dhrystone 2 using register variables    187761      183128          -2.467498575
Double-Precision Whetstone              96743.6     96645           -0.101918887
Execl Throughput                        689.3       1019.8          47.9471928
File Copy 1024 bufsize 2000 maxblocks   549.5       2029.7          269.3721565
File Copy 256 bufsize 500 maxblocks     400.7       1439.4          259.2213626
File Copy 4096 bufsize 8000 maxblocks   894.3       3434.1          283.9986582
Pipe Throughput                         76819.5     74268.8         -3.320380893
Pipe-based Context Switching            3444.8      5963.3          73.11019508
Process Creation                        301.1       163.2           -45.79873796
Shell Scripts (1 concurrent)            1248.1      1859.7          49.00248378
Shell Scripts (8 concurrent)            781.2       1171            49.89759345
System Call Overhead                    3426        3194.4          -6.760070053

System Benchmarks Index Score           3053        4605            50.83524402

pvcy shows a smaller overall improvement (50%) compared to vcpu_is_preempted
(277%). Host-side flamegraph analysis shows that ~60% of the host time when
using pvcy is spent in kvm_handle_wfx, compared with ~1.5% when using
vcpu_is_preempted, which is why vcpu_is_preempted shows a larger improvement.

It might be that pvcy can be used in combination with vcpu_is_preempted, but
this series is intended to start a discussion on vcpu_is_preempted, as it
shows a much bigger performance improvement and is much easier to review
standalone.

The corresponding QEMU change used to test this is at
https://github.com/uarif1/qemu/commit/2da2c2927ae8de8f03f439804a0dad9cf68501b6.
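
For reference, a minimal sketch of what the VMM has to do per vCPU is shown
below. This assumes the standard KVM vCPU device-attribute ioctls; the
set_pvlock_ipa() helper, its error handling and the choice of IPA are
illustrative only, and the actual QEMU change above may differ:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Illustrative sketch: tell KVM where in guest IPA space the 64-byte
 * pvlock structure for this vCPU lives. This is repeated for every vCPU,
 * with the IPA carved out of guest memory reserved by the VMM.
 */
static int set_pvlock_ipa(int vcpu_fd, uint64_t ipa)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PVLOCK_CTRL,
		.attr  = KVM_ARM_VCPU_PVLOCK_IPA,
		.addr  = (uint64_t)&ipa,	/* pointer to the 64-bit IPA */
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}

The guest later retrieves the same IPA via the
ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID hypercall added in this series.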

Looking forward to your response!
Thanks,
Usama
---
RFC->v2
- Fixed table and code referencing in pvlock documentation
- Switched to using a single hypercall similar to ptp_kvm and made check
  for has_kvm_pvlock simpler

Usama Arif (6):
  KVM: arm64: Document PV-lock interface
  KVM: arm64: Add SMCCC paravirtualised lock calls
  KVM: arm64: Support pvlock preempted via shared structure
  KVM: arm64: Provide VCPU attributes for PV lock
  KVM: arm64: Support the VCPU preemption check
  KVM: selftests: add tests for PV time specific hypercall

 Documentation/virt/kvm/arm/hypercalls.rst     |   3 +
 Documentation/virt/kvm/arm/index.rst          |   1 +
 Documentation/virt/kvm/arm/pvlock.rst         |  52 ++++++++
 Documentation/virt/kvm/devices/vcpu.rst       |  25 ++++
 arch/arm64/include/asm/kvm_host.h             |  25 ++++
 arch/arm64/include/asm/paravirt.h             |   2 +
 arch/arm64/include/asm/pvlock-abi.h           |  17 +++
 arch/arm64/include/asm/spinlock.h             |  16 ++-
 arch/arm64/include/uapi/asm/kvm.h             |   3 +
 arch/arm64/kernel/paravirt.c                  | 112 ++++++++++++++++++
 arch/arm64/kernel/setup.c                     |   3 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |   8 ++
 arch/arm64/kvm/guest.c                        |   9 ++
 arch/arm64/kvm/hypercalls.c                   |   8 ++
 arch/arm64/kvm/pvlock.c                       | 100 ++++++++++++++++
 include/linux/arm-smccc.h                     |   8 ++
 include/linux/cpuhotplug.h                    |   1 +
 include/uapi/linux/kvm.h                      |   2 +
 tools/arch/arm64/include/uapi/asm/kvm.h       |   1 +
 tools/include/linux/arm-smccc.h               |   8 ++
 .../selftests/kvm/aarch64/hypercalls.c        |   2 +
 22 files changed, 406 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
 create mode 100644 arch/arm64/include/asm/pvlock-abi.h
 create mode 100644 arch/arm64/kvm/pvlock.c

-- 
2.25.1



* [v2 1/6] KVM: arm64: Document PV-lock interface
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
@ 2022-11-04  6:21 ` Usama Arif
  2022-11-07 17:56   ` Punit Agrawal
  2022-11-04  6:21 ` [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls Usama Arif
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:21 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

Introduce a paravirtualization interface for KVM/arm64 to obtain whether
the VCPU is currently running or not.

The PV lock structure of the guest is allocated by user space.

A hypercall interface is provided for the guest to interrogate the
location of the shared memory structures.

Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 Documentation/virt/kvm/arm/index.rst    |  1 +
 Documentation/virt/kvm/arm/pvlock.rst   | 52 +++++++++++++++++++++++++
 Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++
 3 files changed, 78 insertions(+)
 create mode 100644 Documentation/virt/kvm/arm/pvlock.rst

diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index e84848432158..b8499dc00a6a 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -10,4 +10,5 @@ ARM
    hyp-abi
    hypercalls
    pvtime
+   pvlock
    ptp_kvm
diff --git a/Documentation/virt/kvm/arm/pvlock.rst b/Documentation/virt/kvm/arm/pvlock.rst
new file mode 100644
index 000000000000..d3c391b16d36
--- /dev/null
+++ b/Documentation/virt/kvm/arm/pvlock.rst
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Paravirtualized lock support for arm64
+======================================
+
+KVM/arm64 provides a hypervisor service call for paravirtualized guests to
+determine whether a VCPU is currently running or not.
+
+A new SMCCC compatible hypercall is defined:
+
+* ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID:   0xC6000002
+
+ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID
+
+    ============= ========    ==========================================
+    Function ID:  (uint32)    0xC6000002
+    Return value: (int64)     IPA of the pv lock data structure for this
+                              VCPU. On failure:
+                              NOT_SUPPORTED (-1)
+    ============= ========    ==========================================
+
+The IPA returned by PV_LOCK_PREEMPTED should be mapped by the guest as normal
+memory with inner and outer write back caching attributes, in the inner
+shareable domain.
+
+PV_LOCK_PREEMPTED returns the structure for the calling VCPU.
+
+PV lock state
+-------------
+
+The structure pointed to by the PV_LOCK_PREEMPTED hypercall is as follows:
+
++-----------+-------------+-------------+---------------------------------+
+| Field     | Byte Length | Byte Offset | Description                     |
++===========+=============+=============+=================================+
+| preempted |      8      |      0      | Indicate if the VCPU that owns  |
+|           |             |             | this struct is running or not.  |
+|           |             |             | Non-zero values mean the VCPU   |
+|           |             |             | has been preempted. Zero means  |
+|           |             |             | the VCPU is not preempted.      |
++-----------+-------------+-------------+---------------------------------+
+
+The preempted field will be updated to 1 by the hypervisor prior to scheduling
+a VCPU. When the VCPU is scheduled out, the preempted field will be updated
+to 0 by the hypervisor.
+
+The structure will be present within a reserved region of the normal memory
+given to the guest. The guest should not attempt to write into this memory.
+There is a structure per VCPU of the guest.
+
+For the user space interface see
+:ref:`Documentation/virt/kvm/devices/vcpu.rst <kvm_arm_vcpu_pvlock_ctrl>`.
\ No newline at end of file
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 716aa3edae14..ea81b98f0e19 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -263,3 +263,28 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+.. _kvm_arm_vcpu_pvlock_ctrl:
+
+5. GROUP: KVM_ARM_VCPU_PVLOCK_CTRL
+==================================
+
+:Architectures: ARM64
+
+5.1 ATTRIBUTE: KVM_ARM_VCPU_PVLOCK_IPA
+--------------------------------------
+
+:Parameters: 64-bit base address
+
+Returns:
+
+	 =======  ======================================
+	 -ENXIO   PV lock not implemented
+	 -EEXIST  Base address already set for this VCPU
+	 -EINVAL  Base address not 64 byte aligned
+	 =======  ======================================
+
+Specifies the base address of the pv lock structure for this VCPU. The
+base address must be 64 byte aligned and exist within a valid guest memory
+region. See Documentation/virt/kvm/arm/pvlock.rst for more information
+including the layout of the pv lock structure.
-- 
2.25.1



* [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
  2022-11-04  6:21 ` [v2 1/6] KVM: arm64: Document PV-lock interface Usama Arif
@ 2022-11-04  6:21 ` Usama Arif
  2022-11-07 17:58   ` Punit Agrawal
  2022-11-04  6:21 ` [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure Usama Arif
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:21 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

Add a new SMCCC compatible hypercalls for PV lock features:
  ARM_SMCCC_KVM_FUNC_PV_LOCK:   0xC6000002

Also add the header file which defines the ABI for the paravirtualized
lock features we're about to add.

Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 arch/arm64/include/asm/pvlock-abi.h | 17 +++++++++++++++++
 include/linux/arm-smccc.h           |  8 ++++++++
 tools/include/linux/arm-smccc.h     |  8 ++++++++
 3 files changed, 33 insertions(+)
 create mode 100644 arch/arm64/include/asm/pvlock-abi.h

diff --git a/arch/arm64/include/asm/pvlock-abi.h b/arch/arm64/include/asm/pvlock-abi.h
new file mode 100644
index 000000000000..3f4574071679
--- /dev/null
+++ b/arch/arm64/include/asm/pvlock-abi.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright(c) 2019 Huawei Technologies Co., Ltd
+ * Author: Zengruan Ye <yezengruan@huawei.com>
+ *         Usama Arif <usama.arif@bytedance.com>
+ */
+
+#ifndef __ASM_PVLOCK_ABI_H
+#define __ASM_PVLOCK_ABI_H
+
+struct pvlock_vcpu_state {
+	__le64 preempted;
+	/* Structure must be 64 byte aligned, pad to that size */
+	u8 padding[56];
+} __packed;
+
+#endif
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 220c8c60e021..104c10035b10 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -112,6 +112,7 @@
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES		0
 #define ARM_SMCCC_KVM_FUNC_PTP			1
+#define ARM_SMCCC_KVM_FUNC_PV_LOCK		2
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
 #define ARM_SMCCC_KVM_NUM_FUNCS			128
 
@@ -151,6 +152,13 @@
 			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
 			   0x21)
 
+/* Paravirtualised lock calls */
+#define ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID		\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_64,			\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,		\
+			   ARM_SMCCC_KVM_FUNC_PV_LOCK)
+
 /* TRNG entropy source calls (defined by ARM DEN0098) */
 #define ARM_SMCCC_TRNG_VERSION					\
 	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
diff --git a/tools/include/linux/arm-smccc.h b/tools/include/linux/arm-smccc.h
index 63ce9bebccd3..c21e539c0228 100644
--- a/tools/include/linux/arm-smccc.h
+++ b/tools/include/linux/arm-smccc.h
@@ -111,6 +111,7 @@
 /* KVM "vendor specific" services */
 #define ARM_SMCCC_KVM_FUNC_FEATURES		0
 #define ARM_SMCCC_KVM_FUNC_PTP			1
+#define ARM_SMCCC_KVM_FUNC_PV_LOCK		2
 #define ARM_SMCCC_KVM_FUNC_FEATURES_2		127
 #define ARM_SMCCC_KVM_NUM_FUNCS			128
 
@@ -150,6 +151,13 @@
 			   ARM_SMCCC_OWNER_STANDARD_HYP,	\
 			   0x21)
 
+/* Paravirtualised lock calls */
+#define ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID		\
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
+			   ARM_SMCCC_SMC_64,			\
+			   ARM_SMCCC_OWNER_VENDOR_HYP,		\
+			   ARM_SMCCC_KVM_FUNC_PV_LOCK)
+
 /* TRNG entropy source calls (defined by ARM DEN0098) */
 #define ARM_SMCCC_TRNG_VERSION					\
 	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,			\
-- 
2.25.1



* [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
  2022-11-04  6:21 ` [v2 1/6] KVM: arm64: Document PV-lock interface Usama Arif
  2022-11-04  6:21 ` [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls Usama Arif
@ 2022-11-04  6:21 ` Usama Arif
  2022-11-07 18:02   ` Punit Agrawal
  2022-11-04  6:21 ` [v2 4/6] KVM: arm64: Provide VCPU attributes for PV lock Usama Arif
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:21 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

Implement the service call for configuring a shared structure between a
VCPU and the hypervisor in which the hypervisor can tell whether the
VCPU is running or not.

The preempted field is zero if the VCPU is not preempted.
Any other value means the VCPU has been preempted.

Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 Documentation/virt/kvm/arm/hypercalls.rst |  3 ++
 arch/arm64/include/asm/kvm_host.h         | 18 ++++++++++
 arch/arm64/include/uapi/asm/kvm.h         |  1 +
 arch/arm64/kvm/Makefile                   |  2 +-
 arch/arm64/kvm/arm.c                      |  8 +++++
 arch/arm64/kvm/hypercalls.c               |  8 +++++
 arch/arm64/kvm/pvlock.c                   | 43 +++++++++++++++++++++++
 tools/arch/arm64/include/uapi/asm/kvm.h   |  1 +
 8 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kvm/pvlock.c

diff --git a/Documentation/virt/kvm/arm/hypercalls.rst b/Documentation/virt/kvm/arm/hypercalls.rst
index 3e23084644ba..872a16226ace 100644
--- a/Documentation/virt/kvm/arm/hypercalls.rst
+++ b/Documentation/virt/kvm/arm/hypercalls.rst
@@ -127,6 +127,9 @@ The pseudo-firmware bitmap register are as follows:
     Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
       The bit represents the Precision Time Protocol KVM service.
 
+    Bit-2: KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK:
+      The bit represents the Paravirtualized lock service.
+
 Errors:
 
     =======  =============================================================
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 45e2136322ba..18303b30b7e9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -417,6 +417,11 @@ struct kvm_vcpu_arch {
 		u64 last_steal;
 		gpa_t base;
 	} steal;
+
+	/* Guest PV lock state */
+	struct {
+		gpa_t base;
+	} pv;
 };
 
 /*
@@ -840,6 +845,19 @@ static inline bool kvm_arm_is_pvtime_enabled(struct kvm_vcpu_arch *vcpu_arch)
 	return (vcpu_arch->steal.base != GPA_INVALID);
 }
 
+static inline void kvm_arm_pvlock_preempted_init(struct kvm_vcpu_arch *vcpu_arch)
+{
+	vcpu_arch->pv.base = GPA_INVALID;
+}
+
+static inline bool kvm_arm_is_pvlock_preempted_ready(struct kvm_vcpu_arch *vcpu_arch)
+{
+	return (vcpu_arch->pv.base != GPA_INVALID);
+}
+
+gpa_t kvm_init_pvlock(struct kvm_vcpu *vcpu);
+void kvm_update_pvlock_preempted(struct kvm_vcpu *vcpu, u64 preempted);
+
 void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
 
 struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 316917b98707..bd05ece5c590 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -365,6 +365,7 @@ enum {
 enum {
 	KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT	= 0,
 	KVM_REG_ARM_VENDOR_HYP_BIT_PTP		= 1,
+	KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK	= 2,
 #ifdef __KERNEL__
 	KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
 #endif
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 5e33c2d4645a..e1f711885916 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -10,7 +10,7 @@ include $(srctree)/virt/kvm/Makefile.kvm
 obj-$(CONFIG_KVM) += kvm.o
 obj-$(CONFIG_KVM) += hyp/
 
-kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
+kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o pvlock.o \
 	 inject_fault.o va_layout.o handle_exit.o \
 	 guest.o debug.o reset.o sys_regs.o stacktrace.o \
 	 vgic-sys-reg-v3.o fpsimd.o pkvm.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 94d33e296e10..73da4ac859fd 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -345,6 +345,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	kvm_arm_pvtime_vcpu_init(&vcpu->arch);
 
+	kvm_arm_pvlock_preempted_init(&vcpu->arch);
+
 	vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
 
 	err = kvm_vgic_vcpu_init(vcpu);
@@ -420,6 +422,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	if (vcpu_has_ptrauth(vcpu))
 		vcpu_ptrauth_disable(vcpu);
+
+	if (kvm_arm_is_pvlock_preempted_ready(&vcpu->arch))
+		kvm_update_pvlock_preempted(vcpu, 0);
+
 	kvm_arch_vcpu_load_debug_state_flags(vcpu);
 
 	if (!cpumask_test_cpu(smp_processor_id(), vcpu->kvm->arch.supported_cpus))
@@ -433,6 +439,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	if (has_vhe())
 		kvm_vcpu_put_sysregs_vhe(vcpu);
 	kvm_timer_vcpu_put(vcpu);
+	if (kvm_arm_is_pvlock_preempted_ready(&vcpu->arch))
+		kvm_update_pvlock_preempted(vcpu, 1);
 	kvm_vgic_put(vcpu);
 	kvm_vcpu_pmu_restore_host(vcpu);
 	kvm_arm_vmid_clear_active();
diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index c9f401fa01a9..ec85b4b2a272 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -116,6 +116,9 @@ static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id)
 	case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
 		return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_PTP,
 				&smccc_feat->vendor_hyp_bmap);
+	case ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID:
+		return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK,
+				&smccc_feat->vendor_hyp_bmap);
 	default:
 		return kvm_hvc_call_default_allowed(func_id);
 	}
@@ -201,6 +204,11 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 		if (gpa != GPA_INVALID)
 			val[0] = gpa;
 		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID:
+		gpa = kvm_init_pvlock(vcpu);
+		if (gpa != GPA_INVALID)
+			val[0] = gpa;
+		break;
 	case ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID:
 		val[0] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0;
 		val[1] = ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1;
diff --git a/arch/arm64/kvm/pvlock.c b/arch/arm64/kvm/pvlock.c
new file mode 100644
index 000000000000..3eb35ab31481
--- /dev/null
+++ b/arch/arm64/kvm/pvlock.c
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright(c) 2019 Huawei Technologies Co., Ltd
+ * Author: Zengruan Ye <yezengruan@huawei.com>
+ *         Usama Arif <usama.arif@bytedance.com>
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/kvm_host.h>
+
+#include <asm/pvlock-abi.h>
+
+#include <kvm/arm_hypercalls.h>
+
+gpa_t kvm_init_pvlock(struct kvm_vcpu *vcpu)
+{
+	struct pvlock_vcpu_state init_values = {};
+	struct kvm *kvm = vcpu->kvm;
+	u64 base = vcpu->arch.pv.base;
+	int idx;
+
+	if (base == GPA_INVALID)
+		return base;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	kvm_write_guest(kvm, base, &init_values, sizeof(init_values));
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	return base;
+}
+
+void kvm_update_pvlock_preempted(struct kvm_vcpu *vcpu, u64 preempted)
+{
+	int idx;
+	u64 offset;
+	struct kvm *kvm = vcpu->kvm;
+	u64 base = vcpu->arch.pv.base;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	offset = offsetof(struct pvlock_vcpu_state, preempted);
+	kvm_put_guest(kvm, base + offset, cpu_to_le64(preempted));
+	srcu_read_unlock(&kvm->srcu, idx);
+}
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index 316917b98707..bd05ece5c590 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -365,6 +365,7 @@ enum {
 enum {
 	KVM_REG_ARM_VENDOR_HYP_BIT_FUNC_FEAT	= 0,
 	KVM_REG_ARM_VENDOR_HYP_BIT_PTP		= 1,
+	KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK	= 2,
 #ifdef __KERNEL__
 	KVM_REG_ARM_VENDOR_HYP_BMAP_BIT_COUNT,
 #endif
-- 
2.25.1



* [v2 4/6] KVM: arm64: Provide VCPU attributes for PV lock
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
                   ` (2 preceding siblings ...)
  2022-11-04  6:21 ` [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure Usama Arif
@ 2022-11-04  6:21 ` Usama Arif
  2022-11-04  6:21 ` [v2 5/6] KVM: arm64: Support the VCPU preemption check Usama Arif
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:21 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

Allow user space to inform the KVM host where in the physical memory
map the paravirtualized lock structures should be located.

User space can set an attribute on the VCPU providing the IPA base
address of the PV lock structure for that VCPU. This must be
repeated for every VCPU in the VM.

The address is given in terms of the physical address visible to
the guest and must be 64 byte aligned. The guest will discover the
address via a hypercall.

Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 arch/arm64/include/asm/kvm_host.h |  7 ++++
 arch/arm64/include/uapi/asm/kvm.h |  2 ++
 arch/arm64/kvm/guest.c            |  9 +++++
 arch/arm64/kvm/pvlock.c           | 57 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          |  2 ++
 5 files changed, 77 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 18303b30b7e9..86aeca8a4393 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -829,6 +829,13 @@ int kvm_arm_pvtime_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_pvtime_has_attr(struct kvm_vcpu *vcpu,
 			    struct kvm_device_attr *attr);
 
+int kvm_arm_pvlock_set_attr(struct kvm_vcpu *vcpu,
+			    struct kvm_device_attr *attr);
+int kvm_arm_pvlock_get_attr(struct kvm_vcpu *vcpu,
+			    struct kvm_device_attr *attr);
+int kvm_arm_pvlock_has_attr(struct kvm_vcpu *vcpu,
+			    struct kvm_device_attr *attr);
+
 extern unsigned int kvm_arm_vmid_bits;
 int kvm_arm_vmid_alloc_init(void);
 void kvm_arm_vmid_alloc_free(void);
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index bd05ece5c590..71010bacaaab 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -412,6 +412,8 @@ enum {
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
+#define KVM_ARM_VCPU_PVLOCK_CTRL	3
+#define   KVM_ARM_VCPU_PVLOCK_IPA	0
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 2ff13a3f8479..7e765e3256ef 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -959,6 +959,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_set_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_PVLOCK_CTRL:
+		ret = kvm_arm_pvlock_set_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -982,6 +985,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_get_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_PVLOCK_CTRL:
+		ret = kvm_arm_pvlock_get_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -1005,6 +1011,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_has_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_PVLOCK_CTRL:
+		ret = kvm_arm_pvlock_has_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
diff --git a/arch/arm64/kvm/pvlock.c b/arch/arm64/kvm/pvlock.c
index 3eb35ab31481..b08a287a4811 100644
--- a/arch/arm64/kvm/pvlock.c
+++ b/arch/arm64/kvm/pvlock.c
@@ -41,3 +41,60 @@ void kvm_update_pvlock_preempted(struct kvm_vcpu *vcpu, u64 preempted)
 	kvm_put_guest(kvm, base + offset, cpu_to_le64(preempted));
 	srcu_read_unlock(&kvm->srcu, idx);
 }
+
+int kvm_arm_pvlock_set_attr(struct kvm_vcpu *vcpu,
+			    struct kvm_device_attr *attr)
+{
+	u64 __user *user = (u64 __user *)attr->addr;
+	struct kvm *kvm = vcpu->kvm;
+	u64 ipa;
+	int ret = 0;
+	int idx;
+
+	if (attr->attr != KVM_ARM_VCPU_PVLOCK_IPA)
+		return -ENXIO;
+
+	if (get_user(ipa, user))
+		return -EFAULT;
+	if (!IS_ALIGNED(ipa, 64))
+		return -EINVAL;
+	if (vcpu->arch.pv.base != GPA_INVALID)
+		return -EEXIST;
+
+	/* Check the address is in a valid memslot */
+	idx = srcu_read_lock(&kvm->srcu);
+	if (kvm_is_error_hva(gfn_to_hva(kvm, ipa >> PAGE_SHIFT)))
+		ret = -EINVAL;
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	if (!ret)
+		vcpu->arch.pv.base = ipa;
+
+	return ret;
+}
+
+int kvm_arm_pvlock_get_attr(struct kvm_vcpu *vcpu,
+			    struct kvm_device_attr *attr)
+{
+	u64 __user *user = (u64 __user *)attr->addr;
+	u64 ipa;
+
+	if (attr->attr != KVM_ARM_VCPU_PVLOCK_IPA)
+		return -ENXIO;
+
+	ipa = vcpu->arch.pv.base;
+
+	if (put_user(ipa, user))
+		return -EFAULT;
+	return 0;
+}
+
+int kvm_arm_pvlock_has_attr(struct kvm_vcpu *vcpu,
+			    struct kvm_device_attr *attr)
+{
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_PVLOCK_IPA:
+		return 0;
+	}
+	return -ENXIO;
+}
\ No newline at end of file
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..1fe3cce5c84a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1429,6 +1429,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
 	KVM_DEV_TYPE_ARM_PV_TIME,
 #define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
+	KVM_DEV_TYPE_ARM_PV_LOCK,
+#define KVM_DEV_TYPE_ARM_PV_LOCK	KVM_DEV_TYPE_ARM_PV_LOCK
 	KVM_DEV_TYPE_MAX,
 };
 
-- 
2.25.1



* [v2 5/6] KVM: arm64: Support the VCPU preemption check
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
                   ` (3 preceding siblings ...)
  2022-11-04  6:21 ` [v2 4/6] KVM: arm64: Provide VCPU attributes for PV lock Usama Arif
@ 2022-11-04  6:21 ` Usama Arif
  2022-11-04  6:21 ` [v2 6/6] KVM: selftests: add tests for PV time specific hypercall Usama Arif
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:21 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

Support the vcpu_is_preempted() functionality under KVM/arm64. This will
enhance lock performance on overcommitted hosts (more runnable VCPUs
than physical CPUs in the system) as doing busy waits for preempted
VCPUs will hurt system performance far worse than early yielding.

Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 arch/arm64/include/asm/paravirt.h |   2 +
 arch/arm64/include/asm/spinlock.h |  16 ++++-
 arch/arm64/kernel/paravirt.c      | 112 ++++++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c         |   3 +
 include/linux/cpuhotplug.h        |   1 +
 5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/paravirt.h b/arch/arm64/include/asm/paravirt.h
index 9aa193e0e8f2..4ccb4356c56b 100644
--- a/arch/arm64/include/asm/paravirt.h
+++ b/arch/arm64/include/asm/paravirt.h
@@ -19,10 +19,12 @@ static inline u64 paravirt_steal_clock(int cpu)
 }
 
 int __init pv_time_init(void);
+int __init pv_lock_init(void);
 
 #else
 
 #define pv_time_init() do {} while (0)
+#define pv_lock_init() do {} while (0)
 
 #endif // CONFIG_PARAVIRT
 
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index 0525c0b089ed..7023efa4de96 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -10,7 +10,20 @@
 
 /* See include/linux/spinlock.h */
 #define smp_mb__after_spinlock()	smp_mb()
+#define vcpu_is_preempted vcpu_is_preempted
+
+#ifdef CONFIG_PARAVIRT
+#include <linux/static_call_types.h>
+
+bool dummy_vcpu_is_preempted(int cpu);
 
+DECLARE_STATIC_CALL(pv_vcpu_is_preempted, dummy_vcpu_is_preempted);
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return static_call(pv_vcpu_is_preempted)(cpu);
+}
+
+#else
 /*
  * Changing this will break osq_lock() thanks to the call inside
  * smp_cond_load_relaxed().
@@ -18,10 +31,11 @@
  * See:
  * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
  */
-#define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
 	return false;
 }
 
+#endif /* CONFIG_PARAVIRT */
+
 #endif /* __ASM_SPINLOCK_H */
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index 57c7c211f8c7..a340f5f4327f 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -20,8 +20,10 @@
 #include <linux/types.h>
 #include <linux/static_call.h>
 
+#include <asm/hypervisor.h>
 #include <asm/paravirt.h>
 #include <asm/pvclock-abi.h>
+#include <asm/pvlock-abi.h>
 #include <asm/smp_plat.h>
 
 struct static_key paravirt_steal_enabled;
@@ -38,7 +40,12 @@ struct pv_time_stolen_time_region {
 	struct pvclock_vcpu_stolen_time __rcu *kaddr;
 };
 
+struct pv_lock_state_region {
+	struct pvlock_vcpu_state __rcu *kaddr;
+};
+
 static DEFINE_PER_CPU(struct pv_time_stolen_time_region, stolen_time_region);
+static DEFINE_PER_CPU(struct pv_lock_state_region, lock_state_region);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
@@ -178,3 +185,108 @@ int __init pv_time_init(void)
 
 	return 0;
 }
+
+static bool native_vcpu_is_preempted(int cpu)
+{
+	return false;
+}
+
+DEFINE_STATIC_CALL(pv_vcpu_is_preempted, native_vcpu_is_preempted);
+
+static bool para_vcpu_is_preempted(int cpu)
+{
+	struct pv_lock_state_region *reg;
+	__le64 preempted_le;
+
+	reg = per_cpu_ptr(&lock_state_region, cpu);
+	if (!reg->kaddr) {
+		pr_warn_once("PV lock enabled but not configured for cpu %d\n",
+			     cpu);
+		return false;
+	}
+
+	preempted_le = le64_to_cpu(READ_ONCE(reg->kaddr->preempted));
+
+	return !!(preempted_le);
+}
+
+static int pvlock_vcpu_state_dying_cpu(unsigned int cpu)
+{
+	struct pv_lock_state_region *reg;
+
+	reg = this_cpu_ptr(&lock_state_region);
+	if (!reg->kaddr)
+		return 0;
+
+	memunmap(reg->kaddr);
+	memset(reg, 0, sizeof(*reg));
+
+	return 0;
+}
+
+static int init_pvlock_vcpu_state(unsigned int cpu)
+{
+	struct pv_lock_state_region *reg;
+	struct arm_smccc_res res;
+
+	reg = this_cpu_ptr(&lock_state_region);
+
+	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID, &res);
+
+	if (res.a0 == SMCCC_RET_NOT_SUPPORTED) {
+		pr_warn("Failed to init PV lock data structure\n");
+		return -EINVAL;
+	}
+
+	reg->kaddr = memremap(res.a0,
+			      sizeof(struct pvlock_vcpu_state),
+			      MEMREMAP_WB);
+
+	if (!reg->kaddr) {
+		pr_warn("Failed to map PV lock data structure\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int kvm_arm_init_pvlock(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ARM_KVM_PVLOCK_STARTING,
+				"hypervisor/arm/pvlock:starting",
+				init_pvlock_vcpu_state,
+				pvlock_vcpu_state_dying_cpu);
+	if (ret < 0) {
+		pr_warn("PV-lock init failed\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static bool has_kvm_pvlock(void)
+{
+	return kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_PV_LOCK);
+}
+
+int __init pv_lock_init(void)
+{
+	int ret;
+
+	if (is_hyp_mode_available())
+		return 0;
+
+	if (!has_kvm_pvlock())
+		return 0;
+
+	ret = kvm_arm_init_pvlock();
+	if (ret)
+		return ret;
+
+	static_call_update(pv_vcpu_is_preempted, para_vcpu_is_preempted);
+	pr_info("using PV-lock preempted\n");
+
+	return 0;
+}
\ No newline at end of file
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index fea3223704b6..05ca07ac5800 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -42,6 +42,7 @@
 #include <asm/cpu_ops.h>
 #include <asm/kasan.h>
 #include <asm/numa.h>
+#include <asm/paravirt.h>
 #include <asm/sections.h>
 #include <asm/setup.h>
 #include <asm/smp_plat.h>
@@ -360,6 +361,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	smp_init_cpus();
 	smp_build_mpidr_hash();
 
+	pv_lock_init();
+
 	/* Init percpu seeds for random tags after cpus are set up. */
 	kasan_init_sw_tags();
 
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index f61447913db9..c0ee11855c73 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -192,6 +192,7 @@ enum cpuhp_state {
 	/* Must be the last timer callback */
 	CPUHP_AP_DUMMY_TIMER_STARTING,
 	CPUHP_AP_ARM_XEN_STARTING,
+	CPUHP_AP_ARM_KVM_PVLOCK_STARTING,
 	CPUHP_AP_ARM_CORESIGHT_STARTING,
 	CPUHP_AP_ARM_CORESIGHT_CTI_STARTING,
 	CPUHP_AP_ARM64_ISNDEP_STARTING,
-- 
2.25.1



* [v2 6/6] KVM: selftests: add tests for PV time specific hypercall
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
                   ` (4 preceding siblings ...)
  2022-11-04  6:21 ` [v2 5/6] KVM: arm64: Support the VCPU preemption check Usama Arif
@ 2022-11-04  6:21 ` Usama Arif
  2022-11-04  9:02 ` [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Marc Zyngier
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-04  6:21 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme
  Cc: fam.zheng, liangma, punit.agrawal, Usama Arif

This is a vendor specific hypercall.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 tools/testing/selftests/kvm/aarch64/hypercalls.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/kvm/aarch64/hypercalls.c b/tools/testing/selftests/kvm/aarch64/hypercalls.c
index a39da3fe4952..c5c304e886a5 100644
--- a/tools/testing/selftests/kvm/aarch64/hypercalls.c
+++ b/tools/testing/selftests/kvm/aarch64/hypercalls.c
@@ -78,6 +78,8 @@ static const struct test_hvc_info hvc_info[] = {
 	TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID,
 			ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID),
 	TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, 0),
+	TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID,
+			ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID),
 	TEST_HVC_INFO(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID, KVM_PTP_VIRT_COUNTER),
 };
 
-- 
2.25.1



* Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
                   ` (5 preceding siblings ...)
  2022-11-04  6:21 ` [v2 6/6] KVM: selftests: add tests for PV time specific hypercall Usama Arif
@ 2022-11-04  9:02 ` Marc Zyngier
  2022-11-06 16:35 ` Marc Zyngier
  2022-11-07 18:24 ` Punit Agrawal
  8 siblings, 0 replies; 22+ messages in thread
From: Marc Zyngier @ 2022-11-04  9:02 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

On Fri, 04 Nov 2022 06:20:59 +0000,
Usama Arif <usama.arif@bytedance.com> wrote:
> 
> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> to check if a vcpu was scheduled out, which is useful to know incase it was
> holding a lock. vcpu_is_preempted can be used to improve
> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
> for e.g. in wake_affine to determine which CPU can run soonest):

Please refrain from reposting a series only two days after the initial
one. One week is a minimum, and only if there are enough review
comments to justify a respin (there were no valuable comments so far).

Reposting more often only results in the review process being
exponentially delayed.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
                   ` (6 preceding siblings ...)
  2022-11-04  9:02 ` [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Marc Zyngier
@ 2022-11-06 16:35 ` Marc Zyngier
  2022-11-07 12:00   ` [External] " Usama Arif
  2022-11-07 18:24 ` Punit Agrawal
  8 siblings, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2022-11-06 16:35 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

On Fri, 04 Nov 2022 06:20:59 +0000,
Usama Arif <usama.arif@bytedance.com> wrote:
> 
> This patchset adds support for vcpu_is_preempted in arm64, which
> allows the guest to check if a vcpu was scheduled out, which is
> useful to know incase it was holding a lock. vcpu_is_preempted can
> be used to improve performance in locking (see owner_on_cpu usage in
> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
> and osq_lock) and scheduling (see available_idle_cpu which is used
> in several places in kernel/sched/fair.c for e.g. in wake_affine to
> determine which CPU can run soonest):

[...]

> pvcy shows a smaller overall improvement (50%) compared to
> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
> compared with ~1.5% when using vcpu_is_preempted, hence
> vcpu_is_preempted shows a larger improvement.

And have you worked out *why* we spend so much time handling WFE?

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-06 16:35 ` Marc Zyngier
@ 2022-11-07 12:00   ` Usama Arif
  2022-11-18  0:20     ` Marc Zyngier
  0 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2022-11-07 12:00 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal



On 06/11/2022 16:35, Marc Zyngier wrote:
> On Fri, 04 Nov 2022 06:20:59 +0000,
> Usama Arif <usama.arif@bytedance.com> wrote:
>>
>> This patchset adds support for vcpu_is_preempted in arm64, which
>> allows the guest to check if a vcpu was scheduled out, which is
>> useful to know incase it was holding a lock. vcpu_is_preempted can
>> be used to improve performance in locking (see owner_on_cpu usage in
>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>> and osq_lock) and scheduling (see available_idle_cpu which is used
>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>> determine which CPU can run soonest):
> 
> [...]
> 
>> pvcy shows a smaller overall improvement (50%) compared to
>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>> compared with ~1.5% when using vcpu_is_preempted, hence
>> vcpu_is_preempted shows a larger improvement.
> 
> And have you worked out *why* we spend so much time handling WFE?
> 
> 	M.

It's from the following change in the pvcy patchset:

diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e778eefcf214..915644816a85 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
         }

         if (esr & ESR_ELx_WFx_ISS_WFE) {
-               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
+               int state;
+               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
+                       schedule();
+
+               if (state == -1)
+                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
         } else {
                 if (esr & ESR_ELx_WFx_ISS_WFxT)
                         vcpu_set_flag(vcpu, IN_WFIT);


If my understanding of the pvcy changes is correct, whenever pvcy
returns an unchanged vcpu state, we would schedule to another vcpu, and
it's the constant scheduling where the time is spent. I guess the effects
are much higher when the lock contention is very high. This can be seen
from the pvcy host-side flamegraph as well, with ~67% of the time spent
in the schedule() call in kvm_handle_wfx. For reference, I have put the
graph at:
https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg

Thanks,
Usama

> 


* Re: [v2 1/6] KVM: arm64: Document PV-lock interface
  2022-11-04  6:21 ` [v2 1/6] KVM: arm64: Document PV-lock interface Usama Arif
@ 2022-11-07 17:56   ` Punit Agrawal
  2022-11-07 18:05     ` Usama Arif
  0 siblings, 1 reply; 22+ messages in thread
From: Punit Agrawal @ 2022-11-07 17:56 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

Hi Usama,

Usama Arif <usama.arif@bytedance.com> writes:

> Introduce a paravirtualization interface for KVM/arm64 to obtain whether
> the VCPU is currently running or not.
>
> The PV lock structure of the guest is allocated by user space.
>
> A hypercall interface is provided for the guest to interrogate the
> location of the shared memory structures.
>
> Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> ---
>  Documentation/virt/kvm/arm/index.rst    |  1 +
>  Documentation/virt/kvm/arm/pvlock.rst   | 52 +++++++++++++++++++++++++
>  Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++
>  3 files changed, 78 insertions(+)
>  create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
>
> diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
> index e84848432158..b8499dc00a6a 100644
> --- a/Documentation/virt/kvm/arm/index.rst
> +++ b/Documentation/virt/kvm/arm/index.rst
> @@ -10,4 +10,5 @@ ARM
>     hyp-abi
>     hypercalls
>     pvtime
> +   pvlock
>     ptp_kvm
> diff --git a/Documentation/virt/kvm/arm/pvlock.rst b/Documentation/virt/kvm/arm/pvlock.rst
> new file mode 100644
> index 000000000000..d3c391b16d36
> --- /dev/null
> +++ b/Documentation/virt/kvm/arm/pvlock.rst
> @@ -0,0 +1,52 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Paravirtualized lock support for arm64
> +======================================
> +
> +KVM/arm64 provides a hypervisor service call for paravirtualized guests to
> +determine whether a VCPU is currently running or not.
> +
> +A new SMCCC compatible hypercall is defined:
> +
> +* ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID:   0xC6000002
> +
> +ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID
> +
> +    ============= ========    ==========================================
> +    Function ID:  (uint32)    0xC6000002
> +    Return value: (int64)     IPA of the pv lock data structure for this
> +                              VCPU. On failure:
> +                              NOT_SUPPORTED (-1)
> +    ============= ========    ==========================================
> +
> +The IPA returned by PV_LOCK_PREEMPTED should be mapped by the guest as normal
> +memory with inner and outer write back caching attributes, in the inner
> +shareable domain.
> +
> +PV_LOCK_PREEMPTED returns the structure for the calling VCPU.
> +
> +PV lock state
> +-------------
> +
> +The structure pointed to by the PV_LOCK_PREEMPTED hypercall is as follows:
> +
> ++-----------+-------------+-------------+---------------------------------+
> +| Field     | Byte Length | Byte Offset | Description                     |
> ++===========+=============+=============+=================================+
> +| preempted |      8      |      0      | Indicate if the VCPU that owns  |
> +|           |             |             | this struct is running or not.  |
> +|           |             |             | Non-zero values mean the VCPU   |
> +|           |             |             | has been preempted. Zero means  |
> +|           |             |             | the VCPU is not preempted.      |
> ++-----------+-------------+-------------+---------------------------------+
> +
> +The preempted field will be updated to 1 by the hypervisor prior to scheduling
> +a VCPU. When the VCPU is scheduled out, the preempted field will be updated
> +to 0 by the hypervisor.

The text above doesn't match the description in the table. Please update
the text to align it with the code.

[...]



* Re: [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls
  2022-11-04  6:21 ` [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls Usama Arif
@ 2022-11-07 17:58   ` Punit Agrawal
  2022-11-07 18:08     ` Usama Arif
  0 siblings, 1 reply; 22+ messages in thread
From: Punit Agrawal @ 2022-11-07 17:58 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

Usama Arif <usama.arif@bytedance.com> writes:

> Add a new SMCCC compatible hypercalls for PV lock features:
>   ARM_SMCCC_KVM_FUNC_PV_LOCK:   0xC6000002
>
> Also add the header file which defines the ABI for the paravirtualized
> lock features we're about to add.
>
> Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> ---
>  arch/arm64/include/asm/pvlock-abi.h | 17 +++++++++++++++++
>  include/linux/arm-smccc.h           |  8 ++++++++
>  tools/include/linux/arm-smccc.h     |  8 ++++++++
>  3 files changed, 33 insertions(+)
>  create mode 100644 arch/arm64/include/asm/pvlock-abi.h
>
> diff --git a/arch/arm64/include/asm/pvlock-abi.h b/arch/arm64/include/asm/pvlock-abi.h
> new file mode 100644
> index 000000000000..3f4574071679
> --- /dev/null
> +++ b/arch/arm64/include/asm/pvlock-abi.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright(c) 2019 Huawei Technologies Co., Ltd
> + * Author: Zengruan Ye <yezengruan@huawei.com>
> + *         Usama Arif <usama.arif@bytedance.com>
> + */
> +
> +#ifndef __ASM_PVLOCK_ABI_H
> +#define __ASM_PVLOCK_ABI_H
> +
> +struct pvlock_vcpu_state {
> +	__le64 preempted;
> +	/* Structure must be 64 byte aligned, pad to that size */
> +	u8 padding[56];
> +} __packed;

For structure alignment, I'd have expected to see the "aligned" attribute
used. Is there any benefit in using padding to achieve alignment?

[...]


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure
  2022-11-04  6:21 ` [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure Usama Arif
@ 2022-11-07 18:02   ` Punit Agrawal
  2022-11-07 18:09     ` Usama Arif
  0 siblings, 1 reply; 22+ messages in thread
From: Punit Agrawal @ 2022-11-07 18:02 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

Usama Arif <usama.arif@bytedance.com> writes:

> Implement the service call for configuring a shared structure between a
> VCPU and the hypervisor in which the hypervisor can tell whether the
> VCPU is running or not.
>
> The preempted field is zero if the VCPU is not preempted.
> Any other value means the VCPU has been preempted.
>
> Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> ---
>  Documentation/virt/kvm/arm/hypercalls.rst |  3 ++
>  arch/arm64/include/asm/kvm_host.h         | 18 ++++++++++
>  arch/arm64/include/uapi/asm/kvm.h         |  1 +
>  arch/arm64/kvm/Makefile                   |  2 +-
>  arch/arm64/kvm/arm.c                      |  8 +++++
>  arch/arm64/kvm/hypercalls.c               |  8 +++++
>  arch/arm64/kvm/pvlock.c                   | 43 +++++++++++++++++++++++
>  tools/arch/arm64/include/uapi/asm/kvm.h   |  1 +
>  8 files changed, 83 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/pvlock.c
>
> diff --git a/Documentation/virt/kvm/arm/hypercalls.rst b/Documentation/virt/kvm/arm/hypercalls.rst
> index 3e23084644ba..872a16226ace 100644
> --- a/Documentation/virt/kvm/arm/hypercalls.rst
> +++ b/Documentation/virt/kvm/arm/hypercalls.rst
> @@ -127,6 +127,9 @@ The pseudo-firmware bitmap register are as follows:
>      Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
>        The bit represents the Precision Time Protocol KVM service.
>  
> +    Bit-2: KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK:
> +      The bit represents the Paravirtualized lock service.
> +
>  Errors:
>  
>      =======  =============================================================
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 45e2136322ba..18303b30b7e9 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -417,6 +417,11 @@ struct kvm_vcpu_arch {
>  		u64 last_steal;
>  		gpa_t base;
>  	} steal;
> +
> +	/* Guest PV lock state */
> +	struct {
> +		gpa_t base;
> +	} pv;

Using "pv" for the structure isn't quite describing the usage well. It'd
be better to call it "pv_lock" or "pvlock" at the least.
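
For illustration, the suggestion amounts to something like the following
(a sketch of the rename only, not a tested change):

	/* Guest PV lock state */
	struct {
		gpa_t base;
	} pv_lock;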

[...]



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v2 1/6] KVM: arm64: Document PV-lock interface
  2022-11-07 17:56   ` Punit Agrawal
@ 2022-11-07 18:05     ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-07 18:05 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma



On 07/11/2022 17:56, Punit Agrawal wrote:
> Hi Usama,
> 
> Usama Arif <usama.arif@bytedance.com> writes:
> 
>> Introduce a paravirtualization interface for KVM/arm64 to obtain whether
>> the VCPU is currently running or not.
>>
>> The PV lock structure of the guest is allocated by user space.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> location of the shared memory structures.
>>
>> Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>> ---
>>   Documentation/virt/kvm/arm/index.rst    |  1 +
>>   Documentation/virt/kvm/arm/pvlock.rst   | 52 +++++++++++++++++++++++++
>>   Documentation/virt/kvm/devices/vcpu.rst | 25 ++++++++++++
>>   3 files changed, 78 insertions(+)
>>   create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
>>
>> diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
>> index e84848432158..b8499dc00a6a 100644
>> --- a/Documentation/virt/kvm/arm/index.rst
>> +++ b/Documentation/virt/kvm/arm/index.rst
>> @@ -10,4 +10,5 @@ ARM
>>      hyp-abi
>>      hypercalls
>>      pvtime
>> +   pvlock
>>      ptp_kvm
>> diff --git a/Documentation/virt/kvm/arm/pvlock.rst b/Documentation/virt/kvm/arm/pvlock.rst
>> new file mode 100644
>> index 000000000000..d3c391b16d36
>> --- /dev/null
>> +++ b/Documentation/virt/kvm/arm/pvlock.rst
>> @@ -0,0 +1,52 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +Paravirtualized lock support for arm64
>> +======================================
>> +
>> +KVM/arm64 provides a hypervisor service call for paravirtualized guests to
>> +determine whether a VCPU is currently running or not.
>> +
>> +A new SMCCC compatible hypercall is defined:
>> +
>> +* ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID:   0xC6000002
>> +
>> +ARM_SMCCC_VENDOR_HYP_KVM_PV_LOCK_FUNC_ID
>> +
>> +    ============= ========    ==========================================
>> +    Function ID:  (uint32)    0xC6000002
>> +    Return value: (int64)     IPA of the pv lock data structure for this
>> +                              VCPU. On failure:
>> +                              NOT_SUPPORTED (-1)
>> +    ============= ========    ==========================================
>> +
>> +The IPA returned by PV_LOCK_PREEMPTED should be mapped by the guest as normal
>> +memory with inner and outer write back caching attributes, in the inner
>> +shareable domain.
>> +
>> +PV_LOCK_PREEMPTED returns the structure for the calling VCPU.
>> +
>> +PV lock state
>> +-------------
>> +
>> +The structure pointed to by the PV_LOCK_PREEMPTED hypercall is as follows:
>> +
>> ++-----------+-------------+-------------+---------------------------------+
>> +| Field     | Byte Length | Byte Offset | Description                     |
>> ++===========+=============+=============+=================================+
>> +| preempted |      8      |      0      | Indicate if the VCPU that owns  |
>> +|           |             |             | this struct is running or not.  |
>> +|           |             |             | Non-zero values mean the VCPU   |
>> +|           |             |             | has been preempted. Zero means  |
>> +|           |             |             | the VCPU is not preempted.      |
>> ++-----------+-------------+-------------+---------------------------------+
>> +
>> +The preempted field will be updated to 1 by the hypervisor prior to scheduling
>> +a VCPU. When the VCPU is scheduled out, the preempted field will be updated
>> +to 0 by the hypervisor.
> 
> The text above doesn't match the description in the table. Please update
> the texts to align them with the code.
> 
Will make it clearer in the next revision. Thanks.

> [...]
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls
  2022-11-07 17:58   ` Punit Agrawal
@ 2022-11-07 18:08     ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-07 18:08 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma



On 07/11/2022 17:58, Punit Agrawal wrote:
> Usama Arif <usama.arif@bytedance.com> writes:
> 
>> Add a new SMCCC compatible hypercall for PV lock features:
>>    ARM_SMCCC_KVM_FUNC_PV_LOCK:   0xC6000002
>>
>> Also add the header file which defines the ABI for the paravirtualized
>> lock features we're about to add.
>>
>> Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>> ---
>>   arch/arm64/include/asm/pvlock-abi.h | 17 +++++++++++++++++
>>   include/linux/arm-smccc.h           |  8 ++++++++
>>   tools/include/linux/arm-smccc.h     |  8 ++++++++
>>   3 files changed, 33 insertions(+)
>>   create mode 100644 arch/arm64/include/asm/pvlock-abi.h
>>
>> diff --git a/arch/arm64/include/asm/pvlock-abi.h b/arch/arm64/include/asm/pvlock-abi.h
>> new file mode 100644
>> index 000000000000..3f4574071679
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/pvlock-abi.h
>> @@ -0,0 +1,17 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright(c) 2019 Huawei Technologies Co., Ltd
>> + * Author: Zengruan Ye <yezengruan@huawei.com>
>> + *         Usama Arif <usama.arif@bytedance.com>
>> + */
>> +
>> +#ifndef __ASM_PVLOCK_ABI_H
>> +#define __ASM_PVLOCK_ABI_H
>> +
>> +struct pvlock_vcpu_state {
>> +	__le64 preempted;
>> +	/* Structure must be 64 byte aligned, pad to that size */
>> +	u8 padding[56];
>> +} __packed;
> 
> For structure alignment, I'd have expected to see the use of "aligned"
> attribute. Is there any benefit in using padding to achieve alignment?
> 
I just made it consistent with pvclock-abi.h. If it's more appropriate, I 
can change it to the aligned attribute in the next patchset:

diff --git a/arch/arm64/include/asm/pvlock-abi.h b/arch/arm64/include/asm/pvlock-abi.h
index 3f4574071679..bacec4cb927a 100644
--- a/arch/arm64/include/asm/pvlock-abi.h
+++ b/arch/arm64/include/asm/pvlock-abi.h
@@ -11,7 +11,6 @@
  struct pvlock_vcpu_state {
         __le64 preempted;
         /* Structure must be 64 byte aligned, pad to that size */
-       u8 padding[56];
-} __packed;
+} __attribute__((aligned(64)));
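
If the attribute route is taken, a compile-time check could also document 
the 64-byte ABI size (a sketch on top of the diff above, not part of the 
posted change):

	/* The shared ABI structure must stay exactly 64 bytes. */
	static_assert(sizeof(struct pvlock_vcpu_state) == 64);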


> [...]


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure
  2022-11-07 18:02   ` Punit Agrawal
@ 2022-11-07 18:09     ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-07 18:09 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma



On 07/11/2022 18:02, Punit Agrawal wrote:
> Usama Arif <usama.arif@bytedance.com> writes:
> 
>> Implement the service call for configuring a shared structure between a
>> VCPU and the hypervisor in which the hypervisor can tell whether the
>> VCPU is running or not.
>>
>> The preempted field is zero if the VCPU is not preempted.
>> Any other value means the VCPU has been preempted.
>>
>> Signed-off-by: Zengruan Ye <yezengruan@huawei.com>
>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>> ---
>>   Documentation/virt/kvm/arm/hypercalls.rst |  3 ++
>>   arch/arm64/include/asm/kvm_host.h         | 18 ++++++++++
>>   arch/arm64/include/uapi/asm/kvm.h         |  1 +
>>   arch/arm64/kvm/Makefile                   |  2 +-
>>   arch/arm64/kvm/arm.c                      |  8 +++++
>>   arch/arm64/kvm/hypercalls.c               |  8 +++++
>>   arch/arm64/kvm/pvlock.c                   | 43 +++++++++++++++++++++++
>>   tools/arch/arm64/include/uapi/asm/kvm.h   |  1 +
>>   8 files changed, 83 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/arm64/kvm/pvlock.c
>>
>> diff --git a/Documentation/virt/kvm/arm/hypercalls.rst b/Documentation/virt/kvm/arm/hypercalls.rst
>> index 3e23084644ba..872a16226ace 100644
>> --- a/Documentation/virt/kvm/arm/hypercalls.rst
>> +++ b/Documentation/virt/kvm/arm/hypercalls.rst
>> @@ -127,6 +127,9 @@ The pseudo-firmware bitmap register are as follows:
>>       Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
>>         The bit represents the Precision Time Protocol KVM service.
>>   
>> +    Bit-2: KVM_REG_ARM_VENDOR_HYP_BIT_PV_LOCK:
>> +      The bit represents the Paravirtualized lock service.
>> +
>>   Errors:
>>   
>>       =======  =============================================================
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 45e2136322ba..18303b30b7e9 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -417,6 +417,11 @@ struct kvm_vcpu_arch {
>>   		u64 last_steal;
>>   		gpa_t base;
>>   	} steal;
>> +
>> +	/* Guest PV lock state */
>> +	struct {
>> +		gpa_t base;
>> +	} pv;
> 
> Using "pv" for the structure isn't quite describing the usage well. It'd
> be better to call it "pv_lock" or "pvlock" at the least.
> 
Yes, makes sense, will change in the next patchset. Thanks.

> [...]
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
                   ` (7 preceding siblings ...)
  2022-11-06 16:35 ` Marc Zyngier
@ 2022-11-07 18:24 ` Punit Agrawal
  2022-11-09 19:38   ` Usama Arif
  8 siblings, 1 reply; 22+ messages in thread
From: Punit Agrawal @ 2022-11-07 18:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

Hi Usama,

Usama Arif <usama.arif@bytedance.com> writes:

> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> to check if a vcpu was scheduled out, which is useful to know incase it was
> holding a lock. vcpu_is_preempted can be used to improve
> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
> for e.g. in wake_affine to determine which CPU can run soonest):
>
> This patchset shows improvement on overcommitted hosts (vCPUs > pCPUS), as waiting
> for preempted vCPUs reduces performance.
>
> This patchset is inspired from the para_steal_clock implementation and from the
> work originally done by Zengruan Ye:
> https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/.
>
> All the results in the below experiments are done on an aws r6g.metal instance
> which has 64 pCPUs.
>
> The following table shows the index results of UnixBench running on a 128 vCPU VM
> with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.
> TestName                                6.0.0 base  6.0.0+vcpu_is_preempted    % improvement for vcpu_is_preempted
> Dhrystone 2 using register variables    187761      191274.7                   1.871368389
> Double-Precision Whetstone              96743.6     98414.4                    1.727039308
> Execl Throughput                        689.3       10426                      1412.548963
> File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
> File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
> File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
> Pipe Throughput                         76819.5     78601.5                    2.319723508
> Pipe-based Context Switching            3444.8      13414.5                    289.4130283
> Process Creation                        301.1       293.4                      -2.557289937
> Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
> Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
> System Call Overhead                    3426        3729.4                     8.855808523
>
> System Benchmarks Index Score           3053        11534                      277.7923354
>
> This shows a 277% overall improvement using these patches.
>
> The biggest improvement is in the shell scripts benchmark, which forks a lot of processes.
> This acquires rwsem lock where a large chunk of time is spent in base 6.0.0 kernel.
> This can be seen from one of the callstack of the perf output of the shell
> scripts benchmark on 6.0.0 base (pseudo NMI enabled for perf numbers below):
> - 33.79% el0_svc
>    - 33.43% do_el0_svc
>       - 33.43% el0_svc_common.constprop.3
>          - 33.30% invoke_syscall
>             - 17.27% __arm64_sys_clone
>                - 17.27% __do_sys_clone
>                   - 17.26% kernel_clone
>                      - 16.73% copy_process
>                         - 11.91% dup_mm
>                            - 11.82% dup_mmap
>                               - 9.15% down_write
>                                  - 8.87% rwsem_down_write_slowpath
>                                     - 8.48% osq_lock
>
> Just under 50% of the total time in the shell script benchmarks ends up being
> spent in osq_lock in the base 6.0.0 kernel:
>   Children      Self  Command   Shared Object        Symbol
>    17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
>     6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
>     4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
>     3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
>     3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
>     3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
>     3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
>     3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
>     3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
>     0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
>     0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock
>
> and this comes down to less than 1% total with 6.0.0+vcpu_is_preempted kernel:
>   Children      Self  Command   Shared Object        Symbol
>      0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
>      0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
>      0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
>      0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
>      0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
>      0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
>      0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock
>
> To make sure, there is no change in performance when vCPUs < pCPUs, UnixBench
> was run on a 32 CPU VM. The kernel with vcpu_is_preempted implemented
> performed 0.9% better overall than base kernel, and the individual benchmarks
> were within +/-2% improvement over 6.0.0 base.
> Hence the patches have no negative affect when vCPUs < pCPUs.
>
>
> The other method discussed in https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
> was pv conditional yield by Marc Zyngier and Will Deacon to reduce vCPU reschedule
> if the vCPU will exit immediately.
> (https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy).
> The patches were ported to 6.0.0 kernel and tested with UnixBench with a 128 vCPU VM:
>
> TestName                                6.0.0 base  6.0.0+pvcy      % improvement for pvcy
> Dhrystone 2 using register variables    187761      183128          -2.467498575
> Double-Precision Whetstone              96743.6     96645           -0.101918887
> Execl Throughput                        689.3       1019.8          47.9471928
> File Copy 1024 bufsize 2000 maxblocks   549.5       2029.7          269.3721565
> File Copy 256 bufsize 500 maxblocks     400.7       1439.4          259.2213626
> File Copy 4096 bufsize 8000 maxblocks   894.3       3434.1          283.9986582
> Pipe Throughput                         76819.5     74268.8         -3.320380893
> Pipe-based Context Switching            3444.8      5963.3          73.11019508
> Process Creation                        301.1       163.2           -45.79873796
> Shell Scripts (1 concurrent)            1248.1      1859.7          49.00248378
> Shell Scripts (8 concurrent)            781.2       1171            49.89759345
> System Call Overhead                    3426        3194.4          -6.760070053
>
> System Benchmarks Index Score           3053        4605            50.83524402
>
> pvcy shows a smaller overall improvement (50%) compared to vcpu_is_preempted (277%).
> Host side flamegraph analysis shows that ~60% of the host time when using pvcy
> is spent in kvm_handle_wfx, compared with ~1.5% when using vcpu_is_preempted,
> hence vcpu_is_preempted shows a larger improvement.
>
> It might be that pvcy can be used in combination with vcpu_is_preempted, but this
> series is to start a discussion on vcpu_is_preempted as it shows a much bigger
> improvement in performance and its much easier to review vcpu_is_preempted standalone.

Looking at both patchsets - this one and pvcy - it looks to me that
vcpu_is_preempted() and the pvcy patches are somewhat orthogonal. The
former optimizes mutexes and rwsems in their optimistic spinning phase,
while the latter goes after spinlocks (via wfe).

Unless I'm missing something, the features are not necessarily comparable
on the same workloads - UnixBench is probably mutex heavy and doesn't
show as much benefit with just the pvcy changes. I wonder if it's easy
to have both features enabled to see this in effect.

I've left some comments on the patches, but there's no need to respin just
yet. Let's see if there is any other feedback.

Thanks,
Punit

[...]



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-07 18:24 ` Punit Agrawal
@ 2022-11-09 19:38   ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2022-11-09 19:38 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will, maz,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma



On 07/11/2022 18:24, Punit Agrawal wrote:
> Hi Usama,
> 
> Usama Arif <usama.arif@bytedance.com> writes:
> 
>> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
>> to check if a vcpu was scheduled out, which is useful to know incase it was
>> holding a lock. vcpu_is_preempted can be used to improve
>> performance in locking (see owner_on_cpu usage in mutex_spin_on_owner,
>> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
>> (see available_idle_cpu which is used in several places in kernel/sched/fair.c
>> for e.g. in wake_affine to determine which CPU can run soonest):
>>
>> This patchset shows improvement on overcommitted hosts (vCPUs > pCPUS), as waiting
>> for preempted vCPUs reduces performance.
>>
>> This patchset is inspired from the para_steal_clock implementation and from the
>> work originally done by Zengruan Ye:
>> https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/.
>>
>> All the results in the below experiments are done on an aws r6g.metal instance
>> which has 64 pCPUs.
>>
>> The following table shows the index results of UnixBench running on a 128 vCPU VM
>> with (6.0.0+vcpu_is_preempted) and without (6.0.0 base) the patchset.
>> TestName                                6.0.0 base  6.0.0+vcpu_is_preempted    % improvement for vcpu_is_preempted
>> Dhrystone 2 using register variables    187761      191274.7                   1.871368389
>> Double-Precision Whetstone              96743.6     98414.4                    1.727039308
>> Execl Throughput                        689.3       10426                      1412.548963
>> File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
>> File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
>> File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
>> Pipe Throughput                         76819.5     78601.5                    2.319723508
>> Pipe-based Context Switching            3444.8      13414.5                    289.4130283
>> Process Creation                        301.1       293.4                      -2.557289937
>> Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
>> Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
>> System Call Overhead                    3426        3729.4                     8.855808523
>>
>> System Benchmarks Index Score           3053        11534                      277.7923354
>>
>> This shows a 277% overall improvement using these patches.
>>
>> The biggest improvement is in the shell scripts benchmark, which forks a lot of processes.
>> This acquires rwsem lock where a large chunk of time is spent in base 6.0.0 kernel.
>> This can be seen from one of the callstack of the perf output of the shell
>> scripts benchmark on 6.0.0 base (pseudo NMI enabled for perf numbers below):
>> - 33.79% el0_svc
>>     - 33.43% do_el0_svc
>>        - 33.43% el0_svc_common.constprop.3
>>           - 33.30% invoke_syscall
>>              - 17.27% __arm64_sys_clone
>>                 - 17.27% __do_sys_clone
>>                    - 17.26% kernel_clone
>>                       - 16.73% copy_process
>>                          - 11.91% dup_mm
>>                             - 11.82% dup_mmap
>>                                - 9.15% down_write
>>                                   - 8.87% rwsem_down_write_slowpath
>>                                      - 8.48% osq_lock
>>
>> Just under 50% of the total time in the shell script benchmarks ends up being
>> spent in osq_lock in the base 6.0.0 kernel:
>>    Children      Self  Command   Shared Object        Symbol
>>     17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
>>      6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
>>      4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
>>      3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
>>      3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
>>      3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
>>      3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
>>      3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
>>      3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
>>      0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
>>      0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock
>>
>> and this comes down to less than 1% total with 6.0.0+vcpu_is_preempted kernel:
>>    Children      Self  Command   Shared Object        Symbol
>>       0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
>>       0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
>>       0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
>>       0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
>>       0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
>>       0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
>>       0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
>>       0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
>>       0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
>>       0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock
>>
>> To make sure, there is no change in performance when vCPUs < pCPUs, UnixBench
>> was run on a 32 CPU VM. The kernel with vcpu_is_preempted implemented
>> performed 0.9% better overall than base kernel, and the individual benchmarks
>> were within +/-2% improvement over 6.0.0 base.
>> Hence the patches have no negative affect when vCPUs < pCPUs.
>>
>>
>> The other method discussed in https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
>> was pv conditional yield by Marc Zyngier and Will Deacon to reduce vCPU reschedule
>> if the vCPU will exit immediately.
>> (https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy).
>> The patches were ported to 6.0.0 kernel and tested with UnixBench with a 128 vCPU VM:
>>
>> TestName                                6.0.0 base  6.0.0+pvcy      % improvement for pvcy
>> Dhrystone 2 using register variables    187761      183128          -2.467498575
>> Double-Precision Whetstone              96743.6     96645           -0.101918887
>> Execl Throughput                        689.3       1019.8          47.9471928
>> File Copy 1024 bufsize 2000 maxblocks   549.5       2029.7          269.3721565
>> File Copy 256 bufsize 500 maxblocks     400.7       1439.4          259.2213626
>> File Copy 4096 bufsize 8000 maxblocks   894.3       3434.1          283.9986582
>> Pipe Throughput                         76819.5     74268.8         -3.320380893
>> Pipe-based Context Switching            3444.8      5963.3          73.11019508
>> Process Creation                        301.1       163.2           -45.79873796
>> Shell Scripts (1 concurrent)            1248.1      1859.7          49.00248378
>> Shell Scripts (8 concurrent)            781.2       1171            49.89759345
>> System Call Overhead                    3426        3194.4          -6.760070053
>>
>> System Benchmarks Index Score           3053        4605            50.83524402
>>
>> pvcy shows a smaller overall improvement (50%) compared to vcpu_is_preempted (277%).
>> Host side flamegraph analysis shows that ~60% of the host time when using pvcy
>> is spent in kvm_handle_wfx, compared with ~1.5% when using vcpu_is_preempted,
>> hence vcpu_is_preempted shows a larger improvement.
>>
>> It might be that pvcy can be used in combination with vcpu_is_preempted, but this
>> series is to start a discussion on vcpu_is_preempted as it shows a much bigger
>> improvement in performance and its much easier to review vcpu_is_preempted standalone.
> 
> Looking at both the patchsets - this one and the pvcy, it looks to me
> that vcpu_is_preempted() and the pvcy patches are somewhat
> orthogonal. The former is optimizing mutex and rwsem in their optimistic
> spinning phase while the latter is going after spinlocks (via wfe).
> 
> Unless I'm missing something the features are not necessarily comparable
> on the same workloads - unixbench is probably mutex heavy and doesn't
> show as much benefit with just the pvcy changes. I wonder if it's easy
> to have both the features enabled to see this in effect.
> 
> I've left some comments on the patches; but no need to respin just
> yet. Let's see if there is any other feedback.
> 
> Thanks,
> Punit
> 

There was a small bug in v2, where pv_lock_init was called too early in 
the boot (in setup_arch), so pvlock_vcpu_state was not initialized for 
vCPU 0 (the state was initialized for vCPUs 1-127 during secondary core 
boot, so the rest of the vCPUs were using pvlock correctly). I will send 
the next revision making it an early_initcall (sketched below) along with 
addressing Punit's comments, but will wait for further comments on v2 
before sending it. I have tested it with early_initcall and didn't see a 
significant change in performance (which is expected, as only 1 out of 
128 vCPUs wasn't using pvlock correctly).
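
A minimal sketch of that change (pv_lock_init is the helper from this 
series; the wrapper name is made up here):

	/* Early initcalls run before secondary CPUs are brought up, so the
	 * boot CPU (vCPU 0) also registers its pvlock_vcpu_state with the
	 * host. */
	static int __init pv_lock_init_call(void)
	{
		pv_lock_init();
		return 0;
	}
	early_initcall(pv_lock_init_call);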

I tried the pvcy+vcpu_is_preempted patches together and saw a slight 
reduction in performance compared to pvcy only.
As a summary, with the above change to early_initcall included, the 
overall UnixBench score improvements are:

Change over 6.0.0 base kernel                 % improvement over base
vcpu_is_preempted                             279%
pvcy                                          51%
pvcy+vcpu_is_preempted                        37%

Thanks,
Usama


> [...]
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-07 12:00   ` [External] " Usama Arif
@ 2022-11-18  0:20     ` Marc Zyngier
  2022-11-24 13:55       ` Usama Arif
  0 siblings, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2022-11-18  0:20 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal

On Mon, 07 Nov 2022 12:00:44 +0000,
Usama Arif <usama.arif@bytedance.com> wrote:
> 
> 
> 
> On 06/11/2022 16:35, Marc Zyngier wrote:
> > On Fri, 04 Nov 2022 06:20:59 +0000,
> > Usama Arif <usama.arif@bytedance.com> wrote:
> >> 
> >> This patchset adds support for vcpu_is_preempted in arm64, which
> >> allows the guest to check if a vcpu was scheduled out, which is
> >> useful to know incase it was holding a lock. vcpu_is_preempted can
> >> be used to improve performance in locking (see owner_on_cpu usage in
> >> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
> >> and osq_lock) and scheduling (see available_idle_cpu which is used
> >> in several places in kernel/sched/fair.c for e.g. in wake_affine to
> >> determine which CPU can run soonest):
> > 
> > [...]
> > 
> >> pvcy shows a smaller overall improvement (50%) compared to
> >> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
> >> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
> >> compared with ~1.5% when using vcpu_is_preempted, hence
> >> vcpu_is_preempted shows a larger improvement.
> > 
> > And have you worked out *why* we spend so much time handling WFE?
> > 
> > 	M.
> 
> It's from the following change in the pvcy patchset:
> 
> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index e778eefcf214..915644816a85 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>         }
> 
>         if (esr & ESR_ELx_WFx_ISS_WFE) {
> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
> +               int state;
> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
> +                       schedule();
> +
> +               if (state == -1)
> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>         } else {
>                 if (esr & ESR_ELx_WFx_ISS_WFxT)
>                         vcpu_set_flag(vcpu, IN_WFIT);
> 
> 
> If my understanding of the pvcy changes is correct, whenever pvcy
> returns an unchanged vcpu state, we would schedule to another vcpu,
> and it's the constant scheduling where the time is spent. I guess the
> effects are much higher when the lock contention is very high. This
> can be seen from the pvcy host-side flamegraph as well, with ~67% of
> the time spent in the schedule() call in kvm_handle_wfx. For
> reference, I have put the graph at:
> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg

The real issue here is that we don't try to pick the right vcpu to
run, and strictly rely on schedule() to eventually pick something that
can run.

An interesting thing to do would be to try and fit the directed yield
mechanism there. It would be a lot more interesting than the one-off
vcpu_is_preempted hack, as it gives us a low-level primitive on which
to construct things (pvcy is effectively a mwait-like primitive).

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-18  0:20     ` Marc Zyngier
@ 2022-11-24 13:55       ` Usama Arif
  2022-12-05 13:43         ` Usama Arif
  0 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2022-11-24 13:55 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal



On 18/11/2022 00:20, Marc Zyngier wrote:
> On Mon, 07 Nov 2022 12:00:44 +0000,
> Usama Arif <usama.arif@bytedance.com> wrote:
>>
>>
>>
>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>
>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>> determine which CPU can run soonest):
>>>
>>> [...]
>>>
>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>> vcpu_is_preempted shows a larger improvement.
>>>
>>> And have you worked out *why* we spend so much time handling WFE?
>>>
>>> 	M.
>>
>> Its from the following change in pvcy patchset:
>>
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>> index e778eefcf214..915644816a85 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>          }
>>
>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>> +               int state;
>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>> +                       schedule();
>> +
>> +               if (state == -1)
>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>          } else {
>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>
>>
>> If my understanding is correct of the pvcy changes, whenever pvcy
>> returns an unchanged vcpu state, we would schedule to another
>> vcpu. And its the constant scheduling where the time is spent. I guess
>> the affects are much higher when the lock contention is very
>> high. This can be seem from the pvcy host side flamegraph as well with
>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>> reference, I have put the graph at:
>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
> 
> The real issue here is that we don't try to pick the right vcpu to
> run, and strictly rely on schedule() to eventually pick something that
> can run.
> 
> An interesting to do would be to try and fit the directed yield
> mechanism there. It would be a lot more interesting than the one-off
> vcpu_is_preempted hack, as it gives us a low-level primitive on which
> to construct things (pvcy is effectively a mwait-like primitive).

We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
would we determine which vcpu to yield to?

IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
code, e.g. mutex, rtmutex, rwsem and osq_lock. It is also used in the 
scheduler to better determine which CPU a task can run on soonest, to 
select an idle core, etc. I am not sure all of these cases would be 
optimized by pvcy. Also, with vcpu_is_preempted, some of the lock-heavy 
benchmarks come down from spending around 50% of the time in locking to 
less than 1% (so I am not sure how much more room there is for 
improvement).
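
For context, the generic helper those locking paths rely on looks roughly 
like this around v6.0 (paraphrased from include/linux/sched.h; this series 
provides the arm64 vcpu_is_preempted() behind it):

	static inline bool owner_on_cpu(struct task_struct *owner)
	{
		/* Skip optimistic spinning when the owner's CPU is preempted. */
		return READ_ONCE(owner->on_cpu) &&
		       !vcpu_is_preempted(task_cpu(owner));
	}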

We could also use vcpu_is_preempted to optimize IPI performance (along 
with directed yield to the target IPI vCPU), similar to how it's done on 
x86 
(https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
This case definitely won't be covered by pvcy.

Considering all the above, i.e. the core kernel integration already 
present and possible future use cases of vcpu_is_preempted, maybe it's 
worth making vcpu_is_preempted work on arm64 independently of pvcy?

Thanks,
Usama

> 
> 	M.
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-11-24 13:55       ` Usama Arif
@ 2022-12-05 13:43         ` Usama Arif
  2023-01-17 10:50           ` Usama Arif
  0 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2022-12-05 13:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, linux-arm-kernel, kvmarm, kvm, linux-doc,
	virtualization, linux, yezengruan, catalin.marinas, will,
	steven.price, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal



On 24/11/2022 13:55, Usama Arif wrote:
> 
> 
> On 18/11/2022 00:20, Marc Zyngier wrote:
>> On Mon, 07 Nov 2022 12:00:44 +0000,
>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>
>>>
>>>
>>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>>
>>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>>> determine which CPU can run soonest):
>>>>
>>>> [...]
>>>>
>>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>>> vcpu_is_preempted shows a larger improvement.
>>>>
>>>> And have you worked out *why* we spend so much time handling WFE?
>>>>
>>>>     M.
>>>
>>> Its from the following change in pvcy patchset:
>>>
>>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>>> index e778eefcf214..915644816a85 100644
>>> --- a/arch/arm64/kvm/handle_exit.c
>>> +++ b/arch/arm64/kvm/handle_exit.c
>>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>>          }
>>>
>>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>> +               int state;
>>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>>> +                       schedule();
>>> +
>>> +               if (state == -1)
>>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>          } else {
>>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>>
>>>
>>> If my understanding is correct of the pvcy changes, whenever pvcy
>>> returns an unchanged vcpu state, we would schedule to another
>>> vcpu. And its the constant scheduling where the time is spent. I guess
>>> the affects are much higher when the lock contention is very
>>> high. This can be seem from the pvcy host side flamegraph as well with
>>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>>> reference, I have put the graph at:
>>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
>>
>> The real issue here is that we don't try to pick the right vcpu to
>> run, and strictly rely on schedule() to eventually pick something that
>> can run.
>>
>> An interesting to do would be to try and fit the directed yield
>> mechanism there. It would be a lot more interesting than the one-off
>> vcpu_is_preempted hack, as it gives us a low-level primitive on which
>> to construct things (pvcy is effectively a mwait-like primitive).
> 
> We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
> would we determine which vcpu to yield to?
> 
> IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
> code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in 
> scheduler to determine better which vCPU we can run on soonest, select 
> idle core, etc. I am not sure if all of these cases will be optimized by 
> pvcy? Also, with vcpu_is_preempted, some of the lock heavy benchmarks 
> come down from spending around 50% of the time in lock to less than 1% 
> (so not sure how much more room is there for improvement).
> 
> We could also use vcpu_is_preempted to optimize IPI performance (along 
> with directed yield to target IPI vCPU) similar to how its done in x86 
> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
> This case definitely wont be covered by pvcy.
> 
> Considering all the above, i.e. the core kernel integration already 
> present and possible future usecases of vcpu_is_preempted, maybe its 
> worth making vcpu_is_preempted work on arm independently of pvcy?
> 

Hi,

Just wanted to check if there are any comments on the above. I can send a 
v3 with the doc and code fixes suggested in the earlier reviews if that 
makes sense.

Thanks,
Usama

> Thanks,
> Usama
> 
>>
>>     M.
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check
  2022-12-05 13:43         ` Usama Arif
@ 2023-01-17 10:50           ` Usama Arif
  0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2023-01-17 10:50 UTC (permalink / raw)
  To: Marc Zyngier, catalin.marinas, will, steven.price, pbonzini
  Cc: linux-kernel, linux-arm-kernel, kvm, linux-doc, virtualization,
	linux, yezengruan, mark.rutland, bagasdotme, fam.zheng, liangma,
	punit.agrawal



On 05/12/2022 13:43, Usama Arif wrote:
> 
> 
> On 24/11/2022 13:55, Usama Arif wrote:
>>
>>
>> On 18/11/2022 00:20, Marc Zyngier wrote:
>>> On Mon, 07 Nov 2022 12:00:44 +0000,
>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>
>>>>
>>>>
>>>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>>>> Usama Arif <usama.arif@bytedance.com> wrote:
>>>>>>
>>>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>>>> determine which CPU can run soonest):
>>>>>
>>>>> [...]
>>>>>
>>>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>>>> vcpu_is_preempted shows a larger improvement.
>>>>>
>>>>> And have you worked out *why* we spend so much time handling WFE?
>>>>>
>>>>>     M.
>>>>
>>>> Its from the following change in pvcy patchset:
>>>>
>>>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>>>> index e778eefcf214..915644816a85 100644
>>>> --- a/arch/arm64/kvm/handle_exit.c
>>>> +++ b/arch/arm64/kvm/handle_exit.c
>>>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>>>          }
>>>>
>>>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>>>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>> +               int state;
>>>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>>>> +                       schedule();
>>>> +
>>>> +               if (state == -1)
>>>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>>          } else {
>>>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>>>
>>>>
>>>> If my understanding is correct of the pvcy changes, whenever pvcy
>>>> returns an unchanged vcpu state, we would schedule to another
>>>> vcpu. And its the constant scheduling where the time is spent. I guess
>>>> the affects are much higher when the lock contention is very
>>>> high. This can be seem from the pvcy host side flamegraph as well with
>>>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>>>> reference, I have put the graph at:
>>>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
>>>
>>> The real issue here is that we don't try to pick the right vcpu to
>>> run, and strictly rely on schedule() to eventually pick something that
>>> can run.
>>>
>>> An interesting to do would be to try and fit the directed yield
>>> mechanism there. It would be a lot more interesting than the one-off
>>> vcpu_is_preempted hack, as it gives us a low-level primitive on which
>>> to construct things (pvcy is effectively a mwait-like primitive).
>>
>> We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
>> would we determine which vcpu to yield to?
>>
>> IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
>> code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in 
>> scheduler to determine better which vCPU we can run on soonest, select 
>> idle core, etc. I am not sure if all of these cases will be optimized 
>> by pvcy? Also, with vcpu_is_preempted, some of the lock heavy 
>> benchmarks come down from spending around 50% of the time in lock to 
>> less than 1% (so not sure how much more room is there for improvement).
>>
>> We could also use vcpu_is_preempted to optimize IPI performance (along 
>> with directed yield to target IPI vCPU) similar to how its done in x86 
>> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
>> This case definitely wont be covered by pvcy.
>>
>> Considering all the above, i.e. the core kernel integration already 
>> present and possible future usecases of vcpu_is_preempted, maybe its 
>> worth making vcpu_is_preempted work on arm independently of pvcy?
>>
> 
> Hi,
> 
> Just wanted to check if there are any comments on above? I can send a v3 
> with the doc and code fixes suggested in the earlier reviews if it makes 
> sense?
> 
> Thanks,
> Usama
> 
>> Thanks,
>> Usama
>>

Hi,

The discussion on the patches died down around November. I have sent v3 
of the patches 
(https://lore.kernel.org/all/20230117102930.1053337-1-usama.arif@bytedance.com/) 
to hopefully restart it. I think there is a significant performance 
improvement to be had from implementing vcpu_is_preempted on arm64: it is 
well integrated into mutex, rtmutex, rwsem, osq_lock and the scheduler, 
and could potentially be used to improve IPI performance in the future.

Thanks,
Usama

>>>
>>>     M.
>>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-01-17 10:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-04  6:20 [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Usama Arif
2022-11-04  6:21 ` [v2 1/6] KVM: arm64: Document PV-lock interface Usama Arif
2022-11-07 17:56   ` Punit Agrawal
2022-11-07 18:05     ` Usama Arif
2022-11-04  6:21 ` [v2 2/6] KVM: arm64: Add SMCCC paravirtualised lock calls Usama Arif
2022-11-07 17:58   ` Punit Agrawal
2022-11-07 18:08     ` Usama Arif
2022-11-04  6:21 ` [v2 3/6] KVM: arm64: Support pvlock preempted via shared structure Usama Arif
2022-11-07 18:02   ` Punit Agrawal
2022-11-07 18:09     ` Usama Arif
2022-11-04  6:21 ` [v2 4/6] KVM: arm64: Provide VCPU attributes for PV lock Usama Arif
2022-11-04  6:21 ` [v2 5/6] KVM: arm64: Support the VCPU preemption check Usama Arif
2022-11-04  6:21 ` [v2 6/6] KVM: selftests: add tests for PV time specific hypercall Usama Arif
2022-11-04  9:02 ` [v2 0/6] KVM: arm64: implement vcpu_is_preempted check Marc Zyngier
2022-11-06 16:35 ` Marc Zyngier
2022-11-07 12:00   ` [External] " Usama Arif
2022-11-18  0:20     ` Marc Zyngier
2022-11-24 13:55       ` Usama Arif
2022-12-05 13:43         ` Usama Arif
2023-01-17 10:50           ` Usama Arif
2022-11-07 18:24 ` Punit Agrawal
2022-11-09 19:38   ` Usama Arif
