* [PATCH v4 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
@ 2017-02-15 21:37 Waiman Long
From: Waiman Long @ 2017-02-15 21:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-arch, x86, linux-kernel, virtualization, xen-devel, kvm,
	Pan Xinhui, Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross, Waiman Long

 v3->v4:
  - Fix x86-32 build error.

 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
    in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
    to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
    KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
    patch wasn't changed.
  - Drop patch 2.

As the overhead of the callee-save vcpu_is_preempted() was found to
have a measurable impact on system performance in a VM guest,
especially an x86-64 guest, this patch set reduces that overhead by
replacing the C __kvm_vcpu_is_preempted() function with an optimized
version of __raw_callee_save___kvm_vcpu_is_preempted() written in
assembly.
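
For context, vcpu_is_preempted() sits on the spin loops of the locking
slow paths, so it is evaluated on every spin iteration. The sketch
below is illustrative only (modeled loosely on the osq_lock() waiter
loop; the function and parameter names are hypothetical, not part of
this series) and shows the kind of call site whose overhead this
series targets:

/* Illustrative sketch: a spinning waiter gives up early when the vCPU
 * it is waiting on has been preempted by the hypervisor, so
 * vcpu_is_preempted() runs once per loop iteration. */
static bool spin_on_prev(struct optimistic_spin_node *node, int prev_cpu)
{
	while (!READ_ONCE(node->locked)) {
		if (need_resched() || vcpu_is_preempted(prev_cpu))
			return false;	/* stop spinning and block */
		cpu_relax();
	}
	return true;
}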

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h      |  2 +-
 arch/x86/include/asm/qspinlock.h     |  2 +-
 arch/x86/kernel/kvm.c                | 32 +++++++++++++++++++++++++++++++-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 4 files changed, 34 insertions(+), 4 deletions(-)

-- 
1.8.3.1

* [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
From: Waiman Long @ 2017-02-15 21:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-arch, x86, linux-kernel, virtualization, xen-devel, kvm,
	Pan Xinhui, Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross, Waiman Long

The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a
better-optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
downcast from long to int is not a problem as the vCPU number won't
exceed 32 bits.
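
As an illustration of that downcast (the wrapper name below is
hypothetical; the real Xen hook is wired up through pv_lock_ops),
truncating the argument is all the Xen side has to do, since any valid
CPU number fits in an int:

/* Hypothetical wrapper, for illustration only. */
static bool xen_vcpu_is_preempted(long cpu)
{
	/* cpu < nr_cpu_ids, so the truncation cannot lose information */
	return xen_vcpu_stolen((int)cpu);
}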

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/x86/include/asm/paravirt.h      | 2 +-
 arch/x86/include/asm/qspinlock.h     | 2 +-
 arch/x86/kernel/kvm.c                | 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1eea6ca..f75fbfe 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu)
 	PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
 	return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
 	return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
 	local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
 	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
 		__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
 	return false;
 }
-- 
1.8.3.1

* [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
From: Waiman Long @ 2017-02-15 21:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-arch, x86, linux-kernel, virtualization, xen-devel, kvm,
	Pan Xinhui, Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross, Waiman Long

It was found that, when running a fio sequential write test with an
XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the
%CPU times as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64, primarily because of at least one more cacheline
of data access from saving and restoring registers (8 of them) to and
from the stack, as well as one more level of function call.
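
For reference, the thunk that PV_CALLEE_SAVE_REGS_THUNK() wraps around
the C function looks roughly like this on x86-64 (a sketch of the
generated code, not a verbatim disassembly); the pushes and pops are
the extra stack traffic referred to above:

__raw_callee_save___kvm_vcpu_is_preempted:
	push %rcx; push %rdx; push %rsi; push %rdi	/* save the 8 caller- */
	push %r8;  push %r9;  push %r10; push %r11	/* clobbered registers */
	call __kvm_vcpu_is_preempted			/* the C implementation */
	pop  %r11; pop  %r10; pop  %r9;  pop  %r8	/* restore them all */
	pop  %rdi; pop  %rsi; pop  %rdx; pop  %rcx
	ret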

To reduce this performance overhead, an optimized assembly version of
the __raw_callee_save___kvm_vcpu_is_preempted() function is provided
for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk was as follows:

   I/O Type      w/o patch    with patch
   --------      ---------    ----------
   random read   8141.2 MB/s  8497.1 MB/s
   seq read      8229.4 MB/s  8304.2 MB/s
   random write  1675.5 MB/s  1701.5 MB/s
   seq write     1681.3 MB/s  1699.9 MB/s

The patch yields a modest increase in aggregate bandwidth for all four
I/O types.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was verified with a test kernel module that compares
the output of the C __kvm_vcpu_is_preempted() with that of the
assembly __raw_callee_save___kvm_vcpu_is_preempted() and confirms that
they match.
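
A minimal sketch of such a verification module is shown below. It is
illustrative only: the real module is not part of this series, and the
sketch assumes both implementations are reachable from module context
(neither symbol is exported in mainline, so an actual test would
export them temporarily or be built into the kernel).

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/cpumask.h>

extern bool __kvm_vcpu_is_preempted(long cpu);
extern bool __raw_callee_save___kvm_vcpu_is_preempted(long cpu);

static int __init preempt_check_init(void)
{
	int cpu, mismatch = 0;

	for_each_possible_cpu(cpu) {
		/* The preempted flag can flip between the two calls, so a
		 * real test would retry or run with the target quiesced. */
		bool c_val   = __kvm_vcpu_is_preempted(cpu);
		bool asm_val = __raw_callee_save___kvm_vcpu_is_preempted(cpu);

		if (c_val != asm_val) {
			pr_err("cpu %d: C=%d asm=%d\n", cpu, c_val, asm_val);
			mismatch++;
		}
	}
	pr_info("vcpu_is_preempted check: %d mismatches\n", mismatch);
	return 0;
}

static void __exit preempt_check_exit(void)
{
}

module_init(preempt_check_init);
module_exit(preempt_check_exit);
MODULE_LICENSE("GPL");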

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/x86/kernel/kvm.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..e423435 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
 	local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
 	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,11 +598,40 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack. It is assumed that the preempted value
+ * is at an offset of 16 from the beginning of the kvm_steal_time structure
+ * which is verified by the BUILD_BUG_ON() macro below.
+ */
+#define PREEMPTED_OFFSET	16
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq	__per_cpu_offset(,%rdi,8), %rax;"
+"cmpb	$0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);"
+"setne	%al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
 void __init kvm_spinlock_init(void)
 {
+#ifdef CONFIG_X86_64
+	BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted)
+		!= PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1));
+#endif
+
 	if (!kvm_para_available())
 		return;
 	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
-- 
1.8.3.1
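
To relate the hand-written assembly above back to C, this is roughly
the computation it performs (a sketch only, as if placed next to the
per-cpu steal_time definition in kvm.c; the function name is
hypothetical and the instruction-to-statement mapping is in the
comments):

/*
 *   movq __per_cpu_offset(,%rdi,8), %rax
 *           -> rax = __per_cpu_offset[cpu], that CPU's per-cpu base
 *   cmpb $0, PREEMPTED_OFFSET+steal_time(%rax)
 *           -> load the 'preempted' byte of that CPU's steal_time
 *   setne %al
 *           -> return preempted != 0 in %al, the bool return register
 */
static bool kvm_vcpu_is_preempted_sketch(long cpu)
{
	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

	return src->preempted != 0;
}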

* Re: [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
From: Peter Zijlstra @ 2017-02-16 16:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-arch, x86,
	linux-kernel, virtualization, xen-devel, kvm, Pan Xinhui,
	Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross

On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote:
> The cpu argument in the function prototype of vcpu_is_preempted()
> is changed from int to long. That makes it easier to provide a better
> optimized assembly version of that function.
> 
> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
> downcast from long to int is not a problem as vCPU number won't exceed
> 32 bits.
> 

Note that because of the cast in PVOP_CALL_ARG1() this patch is
pointless.

Then again, it doesn't seem to affect code generation, so why not. Takes
away the reliance on that weird cast.
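
The cast in question is the per-argument wrapper in
arch/x86/include/asm/paravirt_types.h, which at the time read
essentially as follows:

/* Every pvop call argument is force-cast before it is placed in its
 * register, so even with an 'int cpu' the full 64-bit register ends
 * up holding a well-defined value rather than stale upper bits. */
#define PVOP_CALL_ARG1(x)		((unsigned long)(x))
#define PVOP_CALL_ARG2(x)		((unsigned long)(x))
#define PVOP_CALL_ARG3(x)		((unsigned long)(x))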

* Re: [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
From: Peter Zijlstra @ 2017-02-16 16:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-arch, x86,
	linux-kernel, virtualization, xen-devel, kvm, Pan Xinhui,
	Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross, andrew.cooper3

On Wed, Feb 15, 2017 at 04:37:50PM -0500, Waiman Long wrote:
> +/*
> + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
> + * restoring to/from the stack. It is assumed that the preempted value
> + * is at an offset of 16 from the beginning of the kvm_steal_time structure
> + * which is verified by the BUILD_BUG_ON() macro below.
> + */
> +#define PREEMPTED_OFFSET	16

As per Andrew's suggestion, the 'right' way is something like so.

---
 asm-offsets_64.c |   11 +++++++++++
 kvm.c            |   14 ++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -13,6 +13,10 @@ static char syscalls_ia32[] = {
 #include <asm/syscalls_32.h>
 };
 
+#ifdef CONFIG_KVM_GUEST
+#include <asm/kvm_para.h>
+#endif
+
 int main(void)
 {
 #ifdef CONFIG_PARAVIRT
@@ -22,6 +26,13 @@ int main(void)
 	BLANK();
 #endif
 
+#ifdef CONFIG_KVM_GUEST
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+	OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted);
+	BLANK();
+#endif
+#endif
+
 #define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry)
 	ENTRY(bx);
 	ENTRY(cx);
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -600,22 +600,21 @@ PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_
 
 #else
 
+#include <asm/asm-offsets.h>
+
 extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
 
 /*
  * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
- * restoring to/from the stack. It is assumed that the preempted value
- * is at an offset of 16 from the beginning of the kvm_steal_time structure
- * which is verified by the BUILD_BUG_ON() macro below.
+ * restoring to/from the stack. 
  */
-#define PREEMPTED_OFFSET	16
 asm(
 ".pushsection .text;"
 ".global __raw_callee_save___kvm_vcpu_is_preempted;"
 ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
 "__raw_callee_save___kvm_vcpu_is_preempted:"
 "movq	__per_cpu_offset(,%rdi,8), %rax;"
-"cmpb	$0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);"
+"cmpb	$0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
 "setne	%al;"
 "ret;"
 ".popsection");
@@ -627,11 +626,6 @@ asm(
  */
 void __init kvm_spinlock_init(void)
 {
-#ifdef CONFIG_X86_64
-	BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted)
-		!= PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1));
-#endif
-
 	if (!kvm_para_available())
 		return;
 	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
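
For readers unfamiliar with the asm-offsets machinery: asm-offsets_64.c
is compiled to assembly during the build, and each OFFSET() entry is
turned into a #define in the generated include/generated/asm-offsets.h,
roughly like this (value shown for illustration; the build derives it
from the real structure layout):

#define KVM_STEAL_TIME_preempted 16 /* offsetof(struct kvm_steal_time, preempted) */

That is also why the BUILD_BUG_ON() in kvm_spinlock_init() can go away:
the offset now tracks struct kvm_steal_time automatically instead of
being a hand-maintained constant.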

* Re: [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
From: Waiman Long @ 2017-02-16 21:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-arch, x86,
	linux-kernel, virtualization, xen-devel, kvm, Pan Xinhui,
	Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross, andrew.cooper3

On 02/16/2017 11:48 AM, Peter Zijlstra wrote:
> On Wed, Feb 15, 2017 at 04:37:50PM -0500, Waiman Long wrote:
>> +/*
>> + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
>> + * restoring to/from the stack. It is assumed that the preempted value
>> + * is at an offset of 16 from the beginning of the kvm_steal_time structure
>> + * which is verified by the BUILD_BUG_ON() macro below.
>> + */
>> +#define PREEMPTED_OFFSET	16
> As per Andrew's suggestion, the 'right' way is something like so.

Thanks for the tip. I was not aware of the asm-offsets stuff. I will
update the patch to use it.

Cheers,
Longman

* Re: [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
From: Waiman Long @ 2017-02-16 21:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-arch, x86,
	linux-kernel, virtualization, xen-devel, kvm, Pan Xinhui,
	Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross

On 02/16/2017 11:09 AM, Peter Zijlstra wrote:
> On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote:
>> The cpu argument in the function prototype of vcpu_is_preempted()
>> is changed from int to long. That makes it easier to provide a better
>> optimized assembly version of that function.
>>
>> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
>> downcast from long to int is not a problem as vCPU number won't exceed
>> 32 bits.
>>
> Note that because of the cast in PVOP_CALL_ARG1() this patch is
> pointless.
>
> Then again, it doesn't seem to affect code generation, so why not. Takes
> away the reliance on that weird cast.

I added this patch because I am a bit uneasy about relying on the
upper 32 bits of rdi being cleared and assuming that the compiler
won't have left a previous use of those bits there. It gives me peace
of mind.

Cheers,
Longman
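
The concern can be illustrated with the calling convention: for an int
argument the SysV x86-64 ABI only defines the low 32 bits of the
register, while the hand-written thunk indexes with the full 64-bit
%rdi. A sketch of the (theoretical, not observed) failure mode that
the long argument rules out:

	/* With 'int cpu', only %edi is guaranteed meaningful; bits 63:32
	 * of %rdi may hold leftovers from earlier code unless something
	 * (such as the PVOP_CALL_ARG1() cast) clears or defines them. */
	movq	__per_cpu_offset(,%rdi,8), %rax	/* stale bits => bogus index */

	/* With 'long cpu', the caller must materialize all 64 bits of
	 * %rdi, so the scaled index is always the intended CPU number. */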

* Re: [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
From: Peter Zijlstra @ 2017-02-17  9:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-arch, x86,
	linux-kernel, virtualization, xen-devel, kvm, Pan Xinhui,
	Paolo Bonzini, Radim Krčmář,
	Boris Ostrovsky, Juergen Gross

On Thu, Feb 16, 2017 at 04:02:57PM -0500, Waiman Long wrote:
> On 02/16/2017 11:09 AM, Peter Zijlstra wrote:
> > On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote:
> >> The cpu argument in the function prototype of vcpu_is_preempted()
> >> is changed from int to long. That makes it easier to provide a better
> >> optimized assembly version of that function.
> >>
> >> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
> >> downcast from long to int is not a problem as vCPU number won't exceed
> >> 32 bits.
> >>
> > Note that because of the cast in PVOP_CALL_ARG1() this patch is
> > pointless.
> >
> > Then again, it doesn't seem to affect code generation, so why not. Takes
> > away the reliance on that weird cast.
> 
> I add this patch because I am a bit uneasy about clearing the upper 32
> bits of rdi and assuming that the compiler won't have a previous use of
> those bits. It gives me peace of mind.

So currently the PVOP_CALL_ARG#() macros force cast everything to
(unsigned long) anyway, but it would be good not to rely on that I
think, so yes.
