* [RFC PATCH 00/26] Runtime paravirt patching
@ 2020-04-08  5:02 ` Ankur Arora
  0 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:02 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

A KVM host (or another hypervisor) might advertise paravirtualized
features and optimization hints (e.g. KVM_HINTS_REALTIME) which might
become stale over the lifetime of the guest. For instance, the
host might go from being undersubscribed to being oversubscribed
(or the other way round) and it would make sense for the guest to
switch pv-ops based on that.
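
Today the guest consumes such hints only once, at boot. As a rough
sketch (loosely based on kvm_spinlock_init() in arch/x86/kernel/kvm.c,
with details elided), the decision looks like this and is never
revisited:

	static void __init kvm_spinlock_init(void)
	{
		/* Host promises not to preempt vCPUs: skip PV spinlocks. */
		if (kvm_para_has_hint(KVM_HINTS_REALTIME))
			return;

		if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
			return;

		/*
		 * Committed for the lifetime of the guest, even if the
		 * hint later becomes stale.
		 */
		__pv_init_lock_hash();
		pv_ops.lock.queued_spin_lock_slowpath =
					__pv_queued_spin_lock_slowpath;
		pv_ops.lock.queued_spin_unlock =
					PV_CALLEE_SAVE(__pv_queued_spin_unlock);
	}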

This locktorture splat that I saw on the guest while testing is
indicative of the problem:

  [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
  [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
  [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220

(Caused by an oversubscribed host but using mismatched native pv_lock_ops
on the guest.)

This series addresses the problem by doing paravirt switching at runtime.

We keep an interesting subset of pv-ops (pv_lock_ops only for now,
but PV-TLB ops are also good candidates) in .parainstructions.runtime,
while discarding the .parainstructions as usual at init. This is then
used for switching back and forth between native and paravirt mode.
([1] lists some representative numbers of the increased memory
footprint.)

Mechanism: the patching itself is done using stop_machine(). That is
not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
via text_poke_bp(), but I'm using this to address two issues:
 1) emulation in text_poke() can only easily handle a small set
 of instructions and this is problematic for inlined pv-ops (and see
 a possible alternatives use-case below.)
 2) paravirt patching might have inter-dependent ops (e.g.
 lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
 need to be updated atomically.)

The alternative use-case is a runtime version of apply_alternatives()
(not posted with this patch-set) that can be used for some safe subset
of X86_FEATUREs. This could be useful in conjunction with the ongoing
late microcode loading work that Mihai Carabas and others have been
working on.

Also, there are points of similarity with the ongoing static_call work
which does rewriting of indirect calls. The difference here is that
we need to switch a group of calls atomically and, given that
some of them can be inlined, need to handle a wider variety of opcodes.

To patch safely we need to satisfy these constraints:

 - No references to insn sequences under replacement on any kernel stack
   once replacement is in progress. Without this constraint we might end
   up returning to an address that is in the middle of an instruction.

 - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
   lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
   a good example.

 - handle a broader set of insns than CALL and JMP: some pv-ops end up
   getting inlined. Alternatives can contain arbitrary instructions.

 - locking operations can be called from interrupt handlers which means
   we cannot trivially use IPIs for flushing.

Handling these necessitates that target pv-ops not be preemptible.
Once that is a given (for safety these need to be explicitly whitelisted
in runtime_patch()), use a state-machine with the primary CPU doing the
patching and secondary CPUs in a sync_core() loop.
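
Roughly, with illustrative names only (struct patch_desc,
patch_site_worker() and text_poke_site_all() are not the series'
actual interfaces), the flow is:

	static int patch_site_worker(void *info)
	{
		struct patch_desc *desc = info;

		if (smp_processor_id() == desc->primary_cpu) {
			/* Rewrite each whitelisted, non-preemptible pv-op site. */
			text_poke_site_all(desc);
			WRITE_ONCE(desc->done, 1);
		} else {
			/*
			 * Secondary CPUs sit in stop_machine() context (IRQs
			 * off) and serialize after each update so they never
			 * execute a partially written instruction.
			 */
			while (!READ_ONCE(desc->done))
				sync_core();
		}

		return 0;
	}

	/* ... stop_machine(patch_site_worker, &desc, cpu_online_mask); */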

In case we hit an INT3/BP (in NMI or thread-context) we make forward
progress by continuing the patching instead of emulating.

One remaining issue is inter-dependent pv-ops which are also executed in
the NMI handler -- patching can potentially deadlock in case of multiple
NMIs. Handle these by pushing some of this work into the NMI handler,
where we know it will be uninterrupted.

There are four main sets of patches in this series:

 1. PV-ops management (patches 1-10, 20): mostly infrastructure and
 refactoring pieces to make paravirt patching usable at runtime. For the
 most part scoped under CONFIG_PARAVIRT_RUNTIME.

 Patches 1-7, to persist part of parainstructions in memory:
  "x86/paravirt: Specify subsection in PVOP macros"
  "x86/paravirt: Allow paravirt patching post-init"
  "x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
  "x86/alternatives: Refactor alternatives_smp_module*
  "x86/alternatives: Rename alternatives_smp*, smp_alt_module
  "x86/alternatives: Remove stale symbols
  "x86/paravirt: Persist .parainstructions.runtime"

 Patches 8-10, develop the interfaces to safely switch pv-ops:
  "x86/paravirt: Stash native pv-ops"
  "x86/paravirt: Add runtime_patch()"
  "x86/paravirt: Add primitives to stage pv-ops"

 Patch 20 enables switching of pv_lock_ops:
  "x86/paravirt: Enable pv-spinlocks in runtime_patch()"

 2. Non-emulated text poking (patches 11-19)

 Patches 11-13 are mostly refactoring to split __text_poke() into map,
 unmap and poke/memcpy phases, with the poke portion being re-entrant:
  "x86/alternatives: Remove return value of text_poke*()"
  "x86/alternatives: Use __get_unlocked_pte() in text_poke()"
  "x86/alternatives: Split __text_poke()"

 Patches 15, 17 add the actual poking state-machine:
  "x86/alternatives: Non-emulated text poking"
  "x86/alternatives: Add patching logic in text_poke_site()"

 with patches 14 and 18 containing the pieces for BP handling:
  "x86/alternatives: Handle native insns in text_poke_loc*()"
  "x86/alternatives: Handle BP in non-emulated text poking"

 and patch 19 provides the ability to use the state-machine above in an
 NMI context (fixes some potential deadlocks when handling inter-
 dependent operations and multiple NMIs):
  "x86/alternatives: NMI safe runtime patching".

 Patch 16 provides the interface (paravirt_runtime_patch()) to use the
 poking mechanism developed above and patch 21 adds a selftest:
  "x86/alternatives: Add paravirt patching at runtime"
  "x86/alternatives: Paravirt runtime selftest"

 3. KVM guest changes to be able to use this (patches 22-23,25-26):
  "kvm/paravirt: Encapsulate KVM pv switching logic"
  "x86/kvm: Add worker to trigger runtime patching"
  "x86/kvm: Guest support for dynamic hints"
  "x86/kvm: Add hint change notifier for KVM_HINT_REALTIME".

 4. KVM host changes to notify the guest of a change (patch 24):
  "x86/kvm: Support dynamic CPUID hints"

Testing:
With paravirt patching, the code is mostly stable on Intel and AMD
systems under kernbench and locktorture, with paravirt toggling (with
and without synthetic NMIs) in the background.

Queued spinlock performance for locktorture is also on expected lines:
 [ 1533.221563] Writes:  Total: 1048759000  Max/Min: 0/0   Fail: 0 
 # toggle PV spinlocks

 [ 1594.713699] Writes:  Total: 1111660545  Max/Min: 0/0   Fail: 0 
 # PV spinlocks (in ~60 seconds) = 62,901,545

 # toggle native spinlocks
 [ 1656.117175] Writes:  Total: 1113888840  Max/Min: 0/0   Fail: 0 
  # native spinlocks (in ~60 seconds) = 2,228,295

The alternatives testing is more limited, with it being used to rewrite
mostly harmless X86_FEATUREs while running load in the background.

Patches also at:

ssh://git@github.com/terminus/linux.git alternatives-rfc-upstream-v1

Please review.

Thanks
Ankur

[1] The precise change in memory footprint depends on config options
but the following example inlines queued_spin_unlock() (which forms
the bulk of the added state). The added footprint is the size of the
.parainstructions.runtime section:

 $ objdump -h vmlinux|grep .parainstructions
 Idx Name                       Size      VMA               LMA                File-off  Algn
  27 .parainstructions          0001013c  ffffffff82895000  0000000002895000  01c95000  2**3
  28 .parainstructions.runtime  0000cd2c  ffffffff828a5140  00000000028a5140  01ca5140  2**3
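
 (For this config, 0xcd2c works out to 52524 bytes, i.e. roughly 52 KB
 of added .parainstructions.runtime state.)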

  $ size vmlinux                                         
  text       data       bss        dec      hex       filename
  13726196   12302814   14094336   40123346 2643bd2   vmlinux

Ankur Arora (26):
  x86/paravirt: Specify subsection in PVOP macros
  x86/paravirt: Allow paravirt patching post-init
  x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
  x86/alternatives: Refactor alternatives_smp_module*
  x86/alternatives: Rename alternatives_smp*, smp_alt_module
  x86/alternatives: Remove stale symbols
  x86/paravirt: Persist .parainstructions.runtime
  x86/paravirt: Stash native pv-ops
  x86/paravirt: Add runtime_patch()
  x86/paravirt: Add primitives to stage pv-ops
  x86/alternatives: Remove return value of text_poke*()
  x86/alternatives: Use __get_unlocked_pte() in text_poke()
  x86/alternatives: Split __text_poke()
  x86/alternatives: Handle native insns in text_poke_loc*()
  x86/alternatives: Non-emulated text poking
  x86/alternatives: Add paravirt patching at runtime
  x86/alternatives: Add patching logic in text_poke_site()
  x86/alternatives: Handle BP in non-emulated text poking
  x86/alternatives: NMI safe runtime patching
  x86/paravirt: Enable pv-spinlocks in runtime_patch()
  x86/alternatives: Paravirt runtime selftest
  kvm/paravirt: Encapsulate KVM pv switching logic
  x86/kvm: Add worker to trigger runtime patching
  x86/kvm: Support dynamic CPUID hints
  x86/kvm: Guest support for dynamic hints
  x86/kvm: Add hint change notifier for KVM_HINT_REALTIME

 Documentation/virt/kvm/api.rst        |  17 +
 Documentation/virt/kvm/cpuid.rst      |   9 +-
 arch/x86/Kconfig                      |  14 +
 arch/x86/Kconfig.debug                |  13 +
 arch/x86/entry/entry_64.S             |   5 +
 arch/x86/include/asm/alternative.h    |  20 +-
 arch/x86/include/asm/kvm_host.h       |   6 +
 arch/x86/include/asm/kvm_para.h       |  17 +
 arch/x86/include/asm/paravirt.h       |  10 +-
 arch/x86/include/asm/paravirt_types.h | 230 ++++--
 arch/x86/include/asm/text-patching.h  |  18 +-
 arch/x86/include/uapi/asm/kvm_para.h  |   2 +
 arch/x86/kernel/Makefile              |   1 +
 arch/x86/kernel/alternative.c         | 987 +++++++++++++++++++++++---
 arch/x86/kernel/kvm.c                 | 191 ++++-
 arch/x86/kernel/module.c              |  42 +-
 arch/x86/kernel/paravirt.c            |  16 +-
 arch/x86/kernel/paravirt_patch.c      |  61 ++
 arch/x86/kernel/pv_selftest.c         | 264 +++++++
 arch/x86/kernel/pv_selftest.h         |  15 +
 arch/x86/kernel/setup.c               |   2 +
 arch/x86/kernel/vmlinux.lds.S         |  16 +
 arch/x86/kvm/cpuid.c                  |   3 +-
 arch/x86/kvm/x86.c                    |  39 +
 include/asm-generic/kvm_para.h        |  12 +
 include/asm-generic/vmlinux.lds.h     |   8 +
 include/linux/kvm_para.h              |   5 +
 include/linux/mm.h                    |  16 +-
 include/linux/preempt.h               |  17 +
 include/uapi/linux/kvm.h              |   4 +
 kernel/locking/lock_events.c          |   2 +-
 mm/memory.c                           |   9 +-
 32 files changed, 1850 insertions(+), 221 deletions(-)
 create mode 100644 arch/x86/kernel/pv_selftest.c
 create mode 100644 arch/x86/kernel/pv_selftest.h

-- 
2.20.1


* [RFC PATCH 01/26] x86/paravirt: Specify subsection in PVOP macros
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:02   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:02 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Allow PVOP macros to specify a subsection such that _paravirt_alt() can
optionally put sites in .parainstructions.*.
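
To illustrate the effect (".runtime" below is just an example suffix;
this patch keeps PV_SUFFIX as "" so nothing changes yet):

	paravirt_alt("", PARAVIRT_CALL)         /* site goes to .parainstructions */
	paravirt_alt(".runtime", PARAVIRT_CALL) /* site goes to .parainstructions.runtime */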

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/paravirt_types.h | 158 +++++++++++++++++---------
 1 file changed, 102 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 732f62e04ddb..37e8f27a3b9d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -337,6 +337,9 @@ struct paravirt_patch_template {
 extern struct pv_info pv_info;
 extern struct paravirt_patch_template pv_ops;
 
+/* Sub-section for .parainstructions */
+#define PV_SUFFIX ""
+
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
@@ -350,9 +353,9 @@ extern struct paravirt_patch_template pv_ops;
  * Generate some code, and mark it as patchable by the
  * apply_paravirt() alternate instruction patcher.
  */
-#define _paravirt_alt(insn_string, type, clobber)	\
+#define _paravirt_alt(sec, insn_string, type, clobber)	\
 	"771:\n\t" insn_string "\n" "772:\n"		\
-	".pushsection .parainstructions,\"a\"\n"	\
+	".pushsection .parainstructions" sec ",\"a\"\n"	\
 	_ASM_ALIGN "\n"					\
 	_ASM_PTR " 771b\n"				\
 	"  .byte " type "\n"				\
@@ -361,8 +364,9 @@ extern struct paravirt_patch_template pv_ops;
 	".popsection\n"
 
 /* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)					\
-	_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+#define paravirt_alt(sec, insn_string)					\
+	_paravirt_alt(sec, insn_string, "%c[paravirt_typenum]",		\
+		      "%c[paravirt_clobber]")
 
 /* Simple instruction patching code. */
 #define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
@@ -414,7 +418,7 @@ int paravirt_disable_iospace(void);
  * unfortunately, are quite a bit (r8 - r11)
  *
  * The call instruction itself is marked by placing its start address
- * and size into the .parainstructions section, so that
+ * and size into the .parainstructions* sections, so that
  * apply_paravirt() in arch/i386/kernel/alternative.c can do the
  * appropriate patching under the control of the backend pv_init_ops
  * implementation.
@@ -512,7 +516,7 @@ int paravirt_disable_iospace(void);
 	})
 
 
-#define ____PVOP_CALL(rettype, op, clbr, call_clbr, extra_clbr,		\
+#define ____PVOP_CALL(sec, rettype, op, clbr, call_clbr, extra_clbr,	\
 		      pre, post, ...)					\
 	({								\
 		rettype __ret;						\
@@ -522,7 +526,7 @@ int paravirt_disable_iospace(void);
 		/* since this condition will never hold */		\
 		if (sizeof(rettype) > sizeof(unsigned long)) {		\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_alt(sec, PARAVIRT_CALL)	\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -532,7 +536,7 @@ int paravirt_disable_iospace(void);
 			__ret = (rettype)((((u64)__edx) << 32) | __eax); \
 		} else {						\
 			asm volatile(pre				\
-				     paravirt_alt(PARAVIRT_CALL)	\
+				     paravirt_alt(sec, PARAVIRT_CALL)	\
 				     post				\
 				     : call_clbr, ASM_CALL_CONSTRAINT	\
 				     : paravirt_type(op),		\
@@ -544,22 +548,22 @@ int paravirt_disable_iospace(void);
 		__ret;							\
 	})
 
-#define __PVOP_CALL(rettype, op, pre, post, ...)			\
-	____PVOP_CALL(rettype, op, CLBR_ANY, PVOP_CALL_CLOBBERS,	\
+#define __PVOP_CALL(sec, rettype, op, pre, post, ...)			\
+	____PVOP_CALL(sec, rettype, op, CLBR_ANY, PVOP_CALL_CLOBBERS,	\
 		      EXTRA_CLOBBERS, pre, post, ##__VA_ARGS__)
 
-#define __PVOP_CALLEESAVE(rettype, op, pre, post, ...)			\
-	____PVOP_CALL(rettype, op.func, CLBR_RET_REG,			\
+#define __PVOP_CALLEESAVE(sec, rettype, op, pre, post, ...)		\
+	____PVOP_CALL(sec, rettype, op.func, CLBR_RET_REG,		\
 		      PVOP_CALLEE_CLOBBERS, ,				\
 		      pre, post, ##__VA_ARGS__)
 
 
-#define ____PVOP_VCALL(op, clbr, call_clbr, extra_clbr, pre, post, ...)	\
+#define ____PVOP_VCALL(sec, op, clbr, call_clbr, extra_clbr, pre, post, ...)	\
 	({								\
 		PVOP_VCALL_ARGS;					\
 		PVOP_TEST_NULL(op);					\
 		asm volatile(pre					\
-			     paravirt_alt(PARAVIRT_CALL)		\
+			     paravirt_alt(sec, PARAVIRT_CALL)		\
 			     post					\
 			     : call_clbr, ASM_CALL_CONSTRAINT		\
 			     : paravirt_type(op),			\
@@ -568,85 +572,127 @@ int paravirt_disable_iospace(void);
 			     : "memory", "cc" extra_clbr);		\
 	})
 
-#define __PVOP_VCALL(op, pre, post, ...)				\
-	____PVOP_VCALL(op, CLBR_ANY, PVOP_VCALL_CLOBBERS,		\
+#define __PVOP_VCALL(sec, op, pre, post, ...)				\
+	____PVOP_VCALL(sec, op, CLBR_ANY, PVOP_VCALL_CLOBBERS,		\
 		       VEXTRA_CLOBBERS,					\
 		       pre, post, ##__VA_ARGS__)
 
-#define __PVOP_VCALLEESAVE(op, pre, post, ...)				\
-	____PVOP_VCALL(op.func, CLBR_RET_REG,				\
+#define __PVOP_VCALLEESAVE(sec, op, pre, post, ...)			\
+	____PVOP_VCALL(sec, op.func, CLBR_RET_REG,			\
 		      PVOP_VCALLEE_CLOBBERS, ,				\
 		      pre, post, ##__VA_ARGS__)
 
 
 
-#define PVOP_CALL0(rettype, op)						\
-	__PVOP_CALL(rettype, op, "", "")
-#define PVOP_VCALL0(op)							\
-	__PVOP_VCALL(op, "", "")
+#define _PVOP_CALL0(sec, rettype, op)					\
+	__PVOP_CALL(sec, rettype, op, "", "")
+#define _PVOP_VCALL0(sec, op)						\
+	__PVOP_VCALL(sec, op, "", "")
 
-#define PVOP_CALLEE0(rettype, op)					\
-	__PVOP_CALLEESAVE(rettype, op, "", "")
-#define PVOP_VCALLEE0(op)						\
-	__PVOP_VCALLEESAVE(op, "", "")
+#define _PVOP_CALLEE0(sec, rettype, op)					\
+	__PVOP_CALLEESAVE(sec, rettype, op, "", "")
+#define _PVOP_VCALLEE0(sec, op)						\
+	__PVOP_VCALLEESAVE(sec, op, "", "")
 
 
-#define PVOP_CALL1(rettype, op, arg1)					\
-	__PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1))
-#define PVOP_VCALL1(op, arg1)						\
-	__PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1))
+#define _PVOP_CALL1(sec, rettype, op, arg1)				\
+	__PVOP_CALL(sec, rettype, op, "", "", PVOP_CALL_ARG1(arg1))
+#define _PVOP_VCALL1(sec, op, arg1)					\
+	__PVOP_VCALL(sec, op, "", "", PVOP_CALL_ARG1(arg1))
 
-#define PVOP_CALLEE1(rettype, op, arg1)					\
-	__PVOP_CALLEESAVE(rettype, op, "", "", PVOP_CALL_ARG1(arg1))
-#define PVOP_VCALLEE1(op, arg1)						\
-	__PVOP_VCALLEESAVE(op, "", "", PVOP_CALL_ARG1(arg1))
+#define _PVOP_CALLEE1(sec, rettype, op, arg1)				\
+	__PVOP_CALLEESAVE(sec, rettype, op, "", "", PVOP_CALL_ARG1(arg1))
+#define _PVOP_VCALLEE1(sec, op, arg1)					\
+	__PVOP_VCALLEESAVE(sec, op, "", "", PVOP_CALL_ARG1(arg1))
 
-
-#define PVOP_CALL2(rettype, op, arg1, arg2)				\
-	__PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1),		\
+#define _PVOP_CALL2(sec, rettype, op, arg1, arg2)			\
+	__PVOP_CALL(sec, rettype, op, "", "", PVOP_CALL_ARG1(arg1),	\
 		    PVOP_CALL_ARG2(arg2))
-#define PVOP_VCALL2(op, arg1, arg2)					\
-	__PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1),			\
+#define _PVOP_VCALL2(sec, op, arg1, arg2)				\
+	__PVOP_VCALL(sec, op, "", "", PVOP_CALL_ARG1(arg1),		\
 		     PVOP_CALL_ARG2(arg2))
 
-#define PVOP_CALLEE2(rettype, op, arg1, arg2)				\
-	__PVOP_CALLEESAVE(rettype, op, "", "", PVOP_CALL_ARG1(arg1),	\
+#define _PVOP_CALLEE2(sec, rettype, op, arg1, arg2)			\
+	__PVOP_CALLEESAVE(sec, rettype, op, "", "", PVOP_CALL_ARG1(arg1), \
 			  PVOP_CALL_ARG2(arg2))
-#define PVOP_VCALLEE2(op, arg1, arg2)					\
-	__PVOP_VCALLEESAVE(op, "", "", PVOP_CALL_ARG1(arg1),		\
+#define _PVOP_VCALLEE2(sec, op, arg1, arg2)				\
+	__PVOP_VCALLEESAVE(sec, op, "", "", PVOP_CALL_ARG1(arg1),	\
 			   PVOP_CALL_ARG2(arg2))
 
 
-#define PVOP_CALL3(rettype, op, arg1, arg2, arg3)			\
-	__PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1),		\
+#define _PVOP_CALL3(sec, rettype, op, arg1, arg2, arg3)			\
+	__PVOP_CALL(sec, rettype, op, "", "", PVOP_CALL_ARG1(arg1),	\
 		    PVOP_CALL_ARG2(arg2), PVOP_CALL_ARG3(arg3))
-#define PVOP_VCALL3(op, arg1, arg2, arg3)				\
-	__PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1),			\
+#define _PVOP_VCALL3(sec, op, arg1, arg2, arg3)				\
+	__PVOP_VCALL(sec, op, "", "", PVOP_CALL_ARG1(arg1),		\
 		     PVOP_CALL_ARG2(arg2), PVOP_CALL_ARG3(arg3))
 
 /* This is the only difference in x86_64. We can make it much simpler */
 #ifdef CONFIG_X86_32
-#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4)			\
-	__PVOP_CALL(rettype, op,					\
+#define _PVOP_CALL4(sec, rettype, op, arg1, arg2, arg3, arg4)		\
+	__PVOP_CALL(sec, rettype, op,					\
 		    "push %[_arg4];", "lea 4(%%esp),%%esp;",		\
 		    PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2),		\
 		    PVOP_CALL_ARG3(arg3), [_arg4] "mr" ((u32)(arg4)))
-#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4)				\
-	__PVOP_VCALL(op,						\
+#define _PVOP_VCALL4(sec, op, arg1, arg2, arg3, arg4)			\
+	__PVOP_VCALL(sec, op,						\
 		    "push %[_arg4];", "lea 4(%%esp),%%esp;",		\
 		    "0" ((u32)(arg1)), "1" ((u32)(arg2)),		\
 		    "2" ((u32)(arg3)), [_arg4] "mr" ((u32)(arg4)))
 #else
-#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4)			\
-	__PVOP_CALL(rettype, op, "", "",				\
+#define _PVOP_CALL4(sec, rettype, op, arg1, arg2, arg3, arg4)		\
+	__PVOP_CALL(sec, rettype, op, "", "",				\
 		    PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2),		\
 		    PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4))
-#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4)				\
-	__PVOP_VCALL(op, "", "",					\
+#define _PVOP_VCALL4(sec, op, arg1, arg2, arg3, arg4)			\
+	__PVOP_VCALL(sec, op, "", "",					\
 		     PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2),	\
 		     PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4))
 #endif
 
+/*
+ * PVOP macros for .parainstructions
+ */
+#define PVOP_CALL0(rettype, op)						\
+	_PVOP_CALL0(PV_SUFFIX, rettype, op)
+#define PVOP_VCALL0(op)							\
+	_PVOP_VCALL0(PV_SUFFIX, op)
+
+#define PVOP_CALLEE0(rettype, op)					\
+	_PVOP_CALLEE0(PV_SUFFIX, rettype, op)
+#define PVOP_VCALLEE0(op)						\
+	_PVOP_VCALLEE0(PV_SUFFIX, op)
+
+#define PVOP_CALL1(rettype, op, arg1)					\
+	_PVOP_CALL1(PV_SUFFIX, rettype, op, arg1)
+#define PVOP_VCALL1(op, arg1)						\
+	_PVOP_VCALL1(PV_SUFFIX, op, arg1)
+
+#define PVOP_CALLEE1(rettype, op, arg1)					\
+	_PVOP_CALLEE1(PV_SUFFIX, rettype, op, arg1)
+#define PVOP_VCALLEE1(op, arg1)						\
+	_PVOP_VCALLEE1(PV_SUFFIX, op, arg1)
+
+#define PVOP_CALL2(rettype, op, arg1, arg2)				\
+	_PVOP_CALL2(PV_SUFFIX, rettype, op, arg1, arg2)
+#define PVOP_VCALL2(op, arg1, arg2)					\
+	_PVOP_VCALL2(PV_SUFFIX, op, arg1, arg2)
+
+#define PVOP_CALLEE2(rettype, op, arg1, arg2)				\
+	_PVOP_CALLEE2(PV_SUFFIX, rettype, op, arg1, arg2)
+#define PVOP_VCALLEE2(op, arg1, arg2)					\
+	_PVOP_VCALLEE2(PV_SUFFIX, op, arg1, arg2)
+
+#define PVOP_CALL3(rettype, op, arg1, arg2, arg3)			\
+	_PVOP_CALL3(PV_SUFFIX, rettype, op, arg1, arg2, arg3)
+#define PVOP_VCALL3(op, arg1, arg2, arg3)				\
+	_PVOP_VCALL3(PV_SUFFIX, op, arg1, arg2, arg3)
+
+#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4)			\
+	_PVOP_CALL4(PV_SUFFIX, rettype, op, arg1, arg2, arg3, arg4)
+#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4)				\
+	_PVOP_VCALL4(PV_SUFFIX, op, arg1, arg2, arg3, arg4)
+
 /* Lazy mode for batching updates / context switch */
 enum paravirt_lazy_mode {
 	PARAVIRT_LAZY_NONE,
@@ -667,7 +713,7 @@ u64 _paravirt_ident_64(u64);
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
-/* These all sit in the .parainstructions section to tell us what to patch. */
+/* These all sit in .parainstructions* sections to tell us what to patch. */
 struct paravirt_patch_site {
 	u8 *instr;		/* original instructions */
 	u8 type;		/* type of this instruction */
-- 
2.20.1


* [RFC PATCH 02/26] x86/paravirt: Allow paravirt patching post-init
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:02   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:02 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Paravirt-ops are patched at init to convert indirect calls into
direct calls and in some cases, to inline the target at the call-site.
This is done by way of PVOP* macros which save the call-site
information via compile time annotations.

Pull this state out into .parainstructions.runtime for some pv-ops such
that they can be used for runtime patching.
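
A minimal sketch of the intended consumer (walk_runtime_sites() is an
assumed name; the real interfaces come in later patches):

	#ifdef CONFIG_PARAVIRT_RUNTIME
	static void walk_runtime_sites(void)
	{
		struct paravirt_patch_site *p;

		for (p = __parainstructions_runtime;
		     p < __parainstructions_runtime_end; p++) {
			/* p->instr is the call site; p->type selects the pv-op. */
		}
	}
	#endif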

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig                      | 12 ++++++++++++
 arch/x86/include/asm/paravirt_types.h |  5 +++++
 arch/x86/include/asm/text-patching.h  |  5 +++++
 arch/x86/kernel/alternative.c         |  2 ++
 arch/x86/kernel/module.c              | 10 +++++++++-
 arch/x86/kernel/vmlinux.lds.S         | 16 ++++++++++++++++
 include/asm-generic/vmlinux.lds.h     |  8 ++++++++
 7 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1edf788d301c..605619938f08 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -764,6 +764,18 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_RUNTIME
+	bool "Enable paravirtualized ops to be patched at runtime"
+	depends on PARAVIRT
+	help
+	  Enable the paravirtualized guest kernel to switch pv-ops based on
+	  changed host conditions, potentially improving performance
+	  significantly.
+
+	  This would increase the memory footprint of the running kernel
+	  slightly (depending mostly on whether lock and unlock are inlined
+	  or not.)
+
 config PARAVIRT_XXL
 	bool
 
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 37e8f27a3b9d..00e4a062ca10 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -723,6 +723,11 @@ struct paravirt_patch_site {
 extern struct paravirt_patch_site __parainstructions[],
 	__parainstructions_end[];
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+extern struct paravirt_patch_site __parainstructions_runtime[],
+	__parainstructions_runtime_end[];
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 #endif	/* _ASM_X86_PARAVIRT_TYPES_H */
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 67315fa3956a..e2ef241c261e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,11 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
+#ifndef CONFIG_PARAVIRT_RUNTIME
+#define __parainstructions_runtime	NULL
+#define __parainstructions_runtime_end	NULL
+#endif
+
 /*
  * Currently, the max observed size in the kernel code is
  * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7867dfb3963e..fdfda1375f82 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -740,6 +740,8 @@ void __init alternative_instructions(void)
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
+	apply_paravirt(__parainstructions_runtime,
+		       __parainstructions_runtime_end);
 
 	restart_nmi();
 	alternatives_patched = 1;
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index d5c72cb877b3..658ea60ce324 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -222,7 +222,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    struct module *me)
 {
 	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
-		*para = NULL, *orc = NULL, *orc_ip = NULL;
+		*para = NULL, *para_run = NULL, *orc = NULL, *orc_ip = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -234,6 +234,9 @@ int module_finalize(const Elf_Ehdr *hdr,
 			locks = s;
 		if (!strcmp(".parainstructions", secstrings + s->sh_name))
 			para = s;
+		if (!strcmp(".parainstructions.runtime",
+			    secstrings + s->sh_name))
+			para_run = s;
 		if (!strcmp(".orc_unwind", secstrings + s->sh_name))
 			orc = s;
 		if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
@@ -257,6 +260,11 @@ int module_finalize(const Elf_Ehdr *hdr,
 		void *pseg = (void *)para->sh_addr;
 		apply_paravirt(pseg, pseg + para->sh_size);
 	}
+	if (para_run) {
+		void *pseg = (void *)para_run->sh_addr;
+
+		apply_paravirt(pseg, pseg + para_run->sh_size);
+	}
 
 	/* make jump label nops */
 	jump_label_apply_nops(me);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1bf7e312361f..7f5b8f6ab96e 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -269,6 +269,7 @@ SECTIONS
 	.parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
 		__parainstructions = .;
 		*(.parainstructions)
+		PARAVIRT_DISCARD(.parainstructions.runtime)
 		__parainstructions_end = .;
 	}
 
@@ -348,6 +349,21 @@ SECTIONS
 		__smp_locks_end = .;
 	}
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+	/*
+	 * .parainstructions.runtime sticks around in memory after
+	 * init so it doesn't need to be page-aligned but everything
+	 * around us is so we will be too.
+	 */
+	. = ALIGN(8);
+	.parainstructions.runtime : AT(ADDR(.parainstructions.runtime) - \
+								LOAD_OFFSET) {
+		__parainstructions_runtime = .;
+		PARAVIRT_KEEP(.parainstructions.runtime)
+		__parainstructions_runtime_end = .;
+	}
+#endif
+
 #ifdef CONFIG_X86_64
 	.data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
 		NOSAVE_DATA
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 71e387a5fe90..6b009d5ce51f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -135,6 +135,14 @@
 #define MEM_DISCARD(sec) *(.mem##sec)
 #endif
 
+#if defined(CONFIG_PARAVIRT_RUNTIME)
+#define PARAVIRT_KEEP(sec)	*(sec)
+#define PARAVIRT_DISCARD(sec)
+#else
+#define PARAVIRT_KEEP(sec)
+#define PARAVIRT_DISCARD(sec)	*(sec)
+#endif
+
 #ifdef CONFIG_FTRACE_MCOUNT_RECORD
 /*
  * The ftrace call sites are logged to a section whose name depends on the
-- 
2.20.1


* [RFC PATCH 03/26] x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Define PVRT* macros which can be used to put pv-ops in
.parainstructions.runtime.
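
As an illustration (not part of this patch), a pv-op wrapper that
should be runtime patchable would use the PVRTOP variant of the usual
PVOP macro; a lock op might look like:

  static __always_inline bool pv_vcpu_is_preempted(long cpu)
  {
  	/* Emits its patch-site record into .parainstructions.runtime */
  	return PVRTOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
  }

Later patches in the series presumably switch the pv_lock_ops
call-sites over to these macros.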

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/paravirt_types.h | 49 +++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 00e4a062ca10..f1153f53c529 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -337,6 +337,12 @@ struct paravirt_patch_template {
 extern struct pv_info pv_info;
 extern struct paravirt_patch_template pv_ops;
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+#define PVRT_SUFFIX ".runtime"
+#else
+#define PVRT_SUFFIX ""
+#endif
+
 /* Sub-section for .parainstructions */
 #define PV_SUFFIX ""
 
@@ -693,6 +699,49 @@ int paravirt_disable_iospace(void);
 #define PVOP_VCALL4(op, arg1, arg2, arg3, arg4)				\
 	_PVOP_VCALL4(PV_SUFFIX, op, arg1, arg2, arg3, arg4)
 
+/*
+ * PVRTOP macros for .parainstructions.runtime
+ */
+#define PVRTOP_CALL0(rettype, op)					\
+	_PVOP_CALL0(PVRT_SUFFIX, rettype, op)
+#define PVRTOP_VCALL0(op)						\
+	_PVOP_VCALL0(PVRT_SUFFIX, op)
+
+#define PVRTOP_CALLEE0(rettype, op)					\
+	_PVOP_CALLEE0(PVRT_SUFFIX, rettype, op)
+#define PVRTOP_VCALLEE0(op)						\
+	_PVOP_VCALLEE0(PVRT_SUFFIX, op)
+
+#define PVRTOP_CALL1(rettype, op, arg1)					\
+	_PVOP_CALL1(PVRT_SUFFIX, rettype, op, arg1)
+#define PVRTOP_VCALL1(op, arg1)						\
+	_PVOP_VCALL1(PVRT_SUFFIX, op, arg1)
+
+#define PVRTOP_CALLEE1(rettype, op, arg1)				\
+	_PVOP_CALLEE1(PVRT_SUFFIX, rettype, op, arg1)
+#define PVRTOP_VCALLEE1(op, arg1)					\
+	_PVOP_VCALLEE1(PVRT_SUFFIX, op, arg1)
+
+#define PVRTOP_CALL2(rettype, op, arg1, arg2)				\
+	_PVOP_CALL2(PVRT_SUFFIX, rettype, op, arg1, arg2)
+#define PVRTOP_VCALL2(op, arg1, arg2)					\
+	_PVOP_VCALL2(PVRT_SUFFIX, op, arg1, arg2)
+
+#define PVRTOP_CALLEE2(rettype, op, arg1, arg2)				\
+	_PVOP_CALLEE2(PVRT_SUFFIX, rettype, op, arg1, arg2)
+#define PVRTOP_VCALLEE2(op, arg1, arg2)					\
+	_PVOP_VCALLEE2(PVRT_SUFFIX, op, arg1, arg2)
+
+#define PVRTOP_CALL3(rettype, op, arg1, arg2, arg3)			\
+	_PVOP_CALL3(PVRT_SUFFIX, rettype, op, arg1, arg2, arg3)
+#define PVRTOP_VCALL3(op, arg1, arg2, arg3)				\
+	_PVOP_VCALL3(PVRT_SUFFIX, op, arg1, arg2, arg3)
+
+#define PVRTOP_CALL4(rettype, op, arg1, arg2, arg3, arg4)		\
+	_PVOP_CALL4(PVRT_SUFFIX, rettype, op, arg1, arg2, arg3, arg4)
+#define PVRTOP_VCALL4(op, arg1, arg2, arg3, arg4)			\
+	_PVOP_VCALL4(PVRT_SUFFIX, op, arg1, arg2, arg3, arg4)
+
 /* Lazy mode for batching updates / context switch */
 enum paravirt_lazy_mode {
 	PARAVIRT_LAZY_NONE,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 04/26] x86/alternatives: Refactor alternatives_smp_module*
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Refactor alternatives_smp_module* logic to make it available for
holding generic late patching state.

Most of the changes pull the module handling logic out from under
CONFIG_SMP. In addition, alternatives_smp_module_add() is now called
unconditionally and the decision on whether to patch for UP is made
there.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/alternative.h | 13 ++-----
 arch/x86/kernel/alternative.c      | 55 ++++++++++++++++--------------
 2 files changed, 32 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 13adca37c99a..8235bbb746d9 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -75,24 +75,15 @@ extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 
 struct module;
 
-#ifdef CONFIG_SMP
 extern void alternatives_smp_module_add(struct module *mod, char *name,
 					void *locks, void *locks_end,
 					void *text, void *text_end);
 extern void alternatives_smp_module_del(struct module *mod);
-extern void alternatives_enable_smp(void);
 extern int alternatives_text_reserved(void *start, void *end);
-extern bool skip_smp_alternatives;
+#ifdef CONFIG_SMP
+extern void alternatives_enable_smp(void);
 #else
-static inline void alternatives_smp_module_add(struct module *mod, char *name,
-					       void *locks, void *locks_end,
-					       void *text, void *text_end) {}
-static inline void alternatives_smp_module_del(struct module *mod) {}
 static inline void alternatives_enable_smp(void) {}
-static inline int alternatives_text_reserved(void *start, void *end)
-{
-	return 0;
-}
 #endif	/* CONFIG_SMP */
 
 #define b_replacement(num)	"664"#num
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index fdfda1375f82..32aa1ddf441d 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -470,6 +470,13 @@ static void alternatives_smp_unlock(const s32 *start, const s32 *end,
 	}
 }
 
+static bool uniproc_patched;	/* protected by text_mutex */
+#else	/* !CONFIG_SMP */
+#define uniproc_patched false
+static inline void alternatives_smp_unlock(const s32 *start, const s32 *end,
+					   u8 *text, u8 *text_end) { }
+#endif	/* CONFIG_SMP */
+
 struct smp_alt_module {
 	/* what is this ??? */
 	struct module	*mod;
@@ -486,7 +493,6 @@ struct smp_alt_module {
 	struct list_head next;
 };
 static LIST_HEAD(smp_alt_modules);
-static bool uniproc_patched = false;	/* protected by text_mutex */
 
 void __init_or_module alternatives_smp_module_add(struct module *mod,
 						  char *name,
@@ -495,23 +501,27 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
 {
 	struct smp_alt_module *smp;
 
-	mutex_lock(&text_mutex);
+#ifdef CONFIG_SMP
+	/* Patch to UP if other cpus not imminent. */
+	if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1))
+		uniproc_patched = true;
+#endif
 	if (!uniproc_patched)
-		goto unlock;
+		return;
 
-	if (num_possible_cpus() == 1)
-		/* Don't bother remembering, we'll never have to undo it. */
-		goto smp_unlock;
+	mutex_lock(&text_mutex);
 
-	smp = kzalloc(sizeof(*smp), GFP_KERNEL);
-	if (NULL == smp)
-		/* we'll run the (safe but slow) SMP code then ... */
-		goto unlock;
+	smp = kzalloc(sizeof(*smp), GFP_KERNEL | __GFP_NOFAIL);
 
 	smp->mod	= mod;
 	smp->name	= name;
-	smp->locks	= locks;
-	smp->locks_end	= locks_end;
+
+	if (num_possible_cpus() != 1 || uniproc_patched) {
+		/* Remember only if we'll need to undo it. */
+		smp->locks	= locks;
+		smp->locks_end	= locks_end;
+	}
+
 	smp->text	= text;
 	smp->text_end	= text_end;
 	DPRINTK("locks %p -> %p, text %p -> %p, name %s\n",
@@ -519,9 +529,9 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
 		smp->text, smp->text_end, smp->name);
 
 	list_add_tail(&smp->next, &smp_alt_modules);
-smp_unlock:
-	alternatives_smp_unlock(locks, locks_end, text, text_end);
-unlock:
+
+	if (uniproc_patched)
+		alternatives_smp_unlock(locks, locks_end, text, text_end);
 	mutex_unlock(&text_mutex);
 }
 
@@ -540,6 +550,7 @@ void __init_or_module alternatives_smp_module_del(struct module *mod)
 	mutex_unlock(&text_mutex);
 }
 
+#ifdef CONFIG_SMP
 void alternatives_enable_smp(void)
 {
 	struct smp_alt_module *mod;
@@ -561,6 +572,7 @@ void alternatives_enable_smp(void)
 	}
 	mutex_unlock(&text_mutex);
 }
+#endif /* CONFIG_SMP */
 
 /*
  * Return 1 if the address range is reserved for SMP-alternatives.
@@ -588,7 +600,6 @@ int alternatives_text_reserved(void *start, void *end)
 
 	return 0;
 }
-#endif /* CONFIG_SMP */
 
 #ifdef CONFIG_PARAVIRT
 void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
@@ -723,21 +734,15 @@ void __init alternative_instructions(void)
 
 	apply_alternatives(__alt_instructions, __alt_instructions_end);
 
-#ifdef CONFIG_SMP
-	/* Patch to UP if other cpus not imminent. */
-	if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1)) {
-		uniproc_patched = true;
-		alternatives_smp_module_add(NULL, "core kernel",
-					    __smp_locks, __smp_locks_end,
-					    _text, _etext);
-	}
+	alternatives_smp_module_add(NULL, "core kernel",
+				    __smp_locks, __smp_locks_end,
+				    _text, _etext);
 
 	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
 	}
-#endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
 	apply_paravirt(__parainstructions_runtime,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 05/26] x86/alternatives: Rename alternatives_smp*, smp_alt_module
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Rename alternatives_smp_module_*() and smp_alt_module to reflect
their new purpose.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/alternative.h | 10 +++---
 arch/x86/kernel/alternative.c      | 54 +++++++++++++++---------------
 arch/x86/kernel/module.c           |  8 ++---
 3 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 8235bbb746d9..db91a7731d87 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -75,11 +75,11 @@ extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 
 struct module;
 
-extern void alternatives_smp_module_add(struct module *mod, char *name,
-					void *locks, void *locks_end,
-					void *text, void *text_end);
-extern void alternatives_smp_module_del(struct module *mod);
-extern int alternatives_text_reserved(void *start, void *end);
+void alternatives_module_add(struct module *mod, char *name,
+			     void *locks, void *locks_end,
+			     void *text, void *text_end);
+void alternatives_module_del(struct module *mod);
+int alternatives_text_reserved(void *start, void *end);
 #ifdef CONFIG_SMP
 extern void alternatives_enable_smp(void);
 #else
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 32aa1ddf441d..4157f848b537 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -477,7 +477,7 @@ static inline void alternatives_smp_unlock(const s32 *start, const s32 *end,
 					   u8 *text, u8 *text_end) { }
 #endif	/* CONFIG_SMP */
 
-struct smp_alt_module {
+struct alt_module {
 	/* what is this ??? */
 	struct module	*mod;
 	char		*name;
@@ -492,14 +492,14 @@ struct smp_alt_module {
 
 	struct list_head next;
 };
-static LIST_HEAD(smp_alt_modules);
 
-void __init_or_module alternatives_smp_module_add(struct module *mod,
-						  char *name,
-						  void *locks, void *locks_end,
-						  void *text,  void *text_end)
+static LIST_HEAD(alt_modules);
+
+void __init_or_module alternatives_module_add(struct module *mod, char *name,
+					      void *locks, void *locks_end,
+					      void *text,  void *text_end)
 {
-	struct smp_alt_module *smp;
+	struct alt_module *alt;
 
 #ifdef CONFIG_SMP
 	/* Patch to UP if other cpus not imminent. */
@@ -511,36 +511,36 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
 
 	mutex_lock(&text_mutex);
 
-	smp = kzalloc(sizeof(*smp), GFP_KERNEL | __GFP_NOFAIL);
+	alt = kzalloc(sizeof(*alt), GFP_KERNEL | __GFP_NOFAIL);
 
-	smp->mod	= mod;
-	smp->name	= name;
+	alt->mod	= mod;
+	alt->name	= name;
 
 	if (num_possible_cpus() != 1 || uniproc_patched) {
 		/* Remember only if we'll need to undo it. */
-		smp->locks	= locks;
-		smp->locks_end	= locks_end;
+		alt->locks	= locks;
+		alt->locks_end	= locks_end;
 	}
 
-	smp->text	= text;
-	smp->text_end	= text_end;
+	alt->text	= text;
+	alt->text_end	= text_end;
 	DPRINTK("locks %p -> %p, text %p -> %p, name %s\n",
-		smp->locks, smp->locks_end,
-		smp->text, smp->text_end, smp->name);
+		alt->locks, alt->locks_end,
+		alt->text, alt->text_end, alt->name);
 
-	list_add_tail(&smp->next, &smp_alt_modules);
+	list_add_tail(&alt->next, &alt_modules);
 
 	if (uniproc_patched)
 		alternatives_smp_unlock(locks, locks_end, text, text_end);
 	mutex_unlock(&text_mutex);
 }
 
-void __init_or_module alternatives_smp_module_del(struct module *mod)
+void __init_or_module alternatives_module_del(struct module *mod)
 {
-	struct smp_alt_module *item;
+	struct alt_module *item;
 
 	mutex_lock(&text_mutex);
-	list_for_each_entry(item, &smp_alt_modules, next) {
+	list_for_each_entry(item, &alt_modules, next) {
 		if (mod != item->mod)
 			continue;
 		list_del(&item->next);
@@ -553,7 +553,7 @@ void __init_or_module alternatives_smp_module_del(struct module *mod)
 #ifdef CONFIG_SMP
 void alternatives_enable_smp(void)
 {
-	struct smp_alt_module *mod;
+	struct alt_module *mod;
 
 	/* Why bother if there are no other CPUs? */
 	BUG_ON(num_possible_cpus() == 1);
@@ -565,7 +565,7 @@ void alternatives_enable_smp(void)
 		BUG_ON(num_online_cpus() != 1);
 		clear_cpu_cap(&boot_cpu_data, X86_FEATURE_UP);
 		clear_cpu_cap(&cpu_data(0), X86_FEATURE_UP);
-		list_for_each_entry(mod, &smp_alt_modules, next)
+		list_for_each_entry(mod, &alt_modules, next)
 			alternatives_smp_lock(mod->locks, mod->locks_end,
 					      mod->text, mod->text_end);
 		uniproc_patched = false;
@@ -580,14 +580,14 @@ void alternatives_enable_smp(void)
  */
 int alternatives_text_reserved(void *start, void *end)
 {
-	struct smp_alt_module *mod;
+	struct alt_module *mod;
 	const s32 *poff;
 	u8 *text_start = start;
 	u8 *text_end = end;
 
 	lockdep_assert_held(&text_mutex);
 
-	list_for_each_entry(mod, &smp_alt_modules, next) {
+	list_for_each_entry(mod, &alt_modules, next) {
 		if (mod->text > text_end || mod->text_end < text_start)
 			continue;
 		for (poff = mod->locks; poff < mod->locks_end; poff++) {
@@ -734,9 +734,9 @@ void __init alternative_instructions(void)
 
 	apply_alternatives(__alt_instructions, __alt_instructions_end);
 
-	alternatives_smp_module_add(NULL, "core kernel",
-				    __smp_locks, __smp_locks_end,
-				    _text, _etext);
+	alternatives_module_add(NULL, "core kernel",
+				__smp_locks, __smp_locks_end,
+				_text, _etext);
 
 	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 658ea60ce324..fc3d35198b09 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -251,9 +251,9 @@ int module_finalize(const Elf_Ehdr *hdr,
 	if (locks && text) {
 		void *lseg = (void *)locks->sh_addr;
 		void *tseg = (void *)text->sh_addr;
-		alternatives_smp_module_add(me, me->name,
-					    lseg, lseg + locks->sh_size,
-					    tseg, tseg + text->sh_size);
+		alternatives_module_add(me, me->name,
+					lseg, lseg + locks->sh_size,
+					tseg, tseg + text->sh_size);
 	}
 
 	if (para) {
@@ -278,5 +278,5 @@ int module_finalize(const Elf_Ehdr *hdr,
 
 void module_arch_cleanup(struct module *mod)
 {
-	alternatives_smp_module_del(mod);
+	alternatives_module_del(mod);
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 06/26] x86/alternatives: Remove stale symbols
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

__start_parainstructions and __stop_parainstructions aren't
defined; remove them.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4157f848b537..09e4ee0e09a2 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -623,8 +623,6 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 		text_poke_early(p->instr, insn_buff, p->len);
 	}
 }
-extern struct paravirt_patch_site __start_parainstructions[],
-	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 07/26] x86/paravirt: Persist .parainstructions.runtime
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Persist .parainstructions.runtime in memory. We will use it to
patch paravirt-ops at runtime.

The extra memory footprint depends on the chosen config options, but the
inlined queued_spin_unlock() presents an edge case:

 $ objdump -h vmlinux|grep .parainstructions
 Idx Name                       Size      VMA               LMA               File-off  Algn
  27 .parainstructions          0001013c  ffffffff82895000  0000000002895000  01c95000  2**3
  28 .parainstructions.runtime  0000cd2c  ffffffff828a5140  00000000028a5140  01ca5140  2**3

(The added footprint is the size of the .parainstructions.runtime
section.)

  $ size vmlinux
  text       data       bss        dec      hex       filename
  13726196   12302814   14094336   40123346 2643bd2   vmlinux
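
For illustration, a consumer of the persisted state could walk every
module's runtime sites along these lines (sketch only with a
hypothetical helper name; assumes CONFIG_PARAVIRT_RUNTIME, and the
actual consumer arrives in later patches):

  /* Sketch: visit each persisted runtime patch-site record. */
  static void for_each_runtime_site(void (*fn)(struct paravirt_patch_site *))
  {
  	struct alt_module *am;
  	struct paravirt_patch_site *p;

  	lockdep_assert_held(&text_mutex);
  	list_for_each_entry(am, &alt_modules, next)
  		for (p = am->para; p && p < am->para_end; p++)
  			fn(p);
  }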

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/alternative.h |  1 +
 arch/x86/kernel/alternative.c      | 16 +++++++++++++++-
 arch/x86/kernel/module.c           | 28 +++++++++++++++++++++++-----
 3 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index db91a7731d87..d19546c14ff6 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -76,6 +76,7 @@ extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 struct module;
 
 void alternatives_module_add(struct module *mod, char *name,
+			     void *para, void *para_end,
 			     void *locks, void *locks_end,
 			     void *text, void *text_end);
 void alternatives_module_del(struct module *mod);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 09e4ee0e09a2..8189ac21624c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -482,6 +482,12 @@ struct alt_module {
 	struct module	*mod;
 	char		*name;
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+	/* ptrs to paravirt sites */
+	struct paravirt_patch_site *para;
+	struct paravirt_patch_site *para_end;
+#endif
+
 	/* ptrs to lock prefixes */
 	const s32	*locks;
 	const s32	*locks_end;
@@ -496,6 +502,7 @@ struct alt_module {
 static LIST_HEAD(alt_modules);
 
 void __init_or_module alternatives_module_add(struct module *mod, char *name,
+					      void *para, void *para_end,
 					      void *locks, void *locks_end,
 					      void *text,  void *text_end)
 {
@@ -506,7 +513,7 @@ void __init_or_module alternatives_module_add(struct module *mod, char *name,
 	if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1))
 		uniproc_patched = true;
 #endif
-	if (!uniproc_patched)
+	if (!IS_ENABLED(CONFIG_PARAVIRT_RUNTIME) && !uniproc_patched)
 		return;
 
 	mutex_lock(&text_mutex);
@@ -516,6 +523,11 @@ void __init_or_module alternatives_module_add(struct module *mod, char *name,
 	alt->mod	= mod;
 	alt->name	= name;
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+	alt->para	= para;
+	alt->para_end	= para_end;
+#endif
+
 	if (num_possible_cpus() != 1 || uniproc_patched) {
 		/* Remember only if we'll need to undo it. */
 		alt->locks	= locks;
@@ -733,6 +745,8 @@ void __init alternative_instructions(void)
 	apply_alternatives(__alt_instructions, __alt_instructions_end);
 
 	alternatives_module_add(NULL, "core kernel",
+				__parainstructions_runtime,
+				__parainstructions_runtime_end,
 				__smp_locks, __smp_locks_end,
 				_text, _etext);
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index fc3d35198b09..7b2632184c11 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -248,12 +248,30 @@ int module_finalize(const Elf_Ehdr *hdr,
 		void *aseg = (void *)alt->sh_addr;
 		apply_alternatives(aseg, aseg + alt->sh_size);
 	}
-	if (locks && text) {
-		void *lseg = (void *)locks->sh_addr;
-		void *tseg = (void *)text->sh_addr;
+	if (para_run || (locks && text)) {
+		void *pseg, *pseg_end;
+		void *lseg, *lseg_end;
+		void *tseg, *tseg_end;
+
+		pseg = pseg_end = NULL;
+		lseg = lseg_end = NULL;
+		tseg = tseg_end = NULL;
+		if (para_run) {
+			pseg = (void *)para_run->sh_addr;
+			pseg_end = pseg + para_run->sh_size;
+		}
+
+		if (locks && text) {
+			tseg = (void *)text->sh_addr;
+			tseg_end = tseg + text->sh_size;
+
+			lseg = (void *)locks->sh_addr;
+			lseg_end = lseg + locks->sh_size;
+		}
 		alternatives_module_add(me, me->name,
-					lseg, lseg + locks->sh_size,
-					tseg, tseg + text->sh_size);
+					pseg, pseg_end,
+					lseg, lseg_end,
+					tseg, tseg_end);
 	}
 
 	if (para) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 08/26] x86/paravirt: Stash native pv-ops
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Introduce native_pv_ops, where we stash the pv_ops structure before
hypervisor-specific hooks have modified it.

native_pv_ops is used when switching between paravirt and native
pv-ops at runtime.
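
For illustration, with both copies available, reverting an individual
op is a plain member copy (sketch only; lock.queued_spin_unlock assumes
CONFIG_PARAVIRT_SPINLOCKS, and the series drives this through the
runtime patching machinery rather than open-coding it):

  /* Sketch: restore one op from the stashed native copy. */
  static void restore_native_unlock(void)
  {
  	pv_ops.lock.queued_spin_unlock =
  			native_pv_ops.lock.queued_spin_unlock;
  }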

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/paravirt_types.h |  4 ++++
 arch/x86/kernel/paravirt.c            | 10 ++++++++++
 arch/x86/kernel/setup.c               |  2 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index f1153f53c529..bc935eec7ec6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -339,6 +339,7 @@ extern struct paravirt_patch_template pv_ops;
 
 #ifdef CONFIG_PARAVIRT_RUNTIME
 #define PVRT_SUFFIX ".runtime"
+extern struct paravirt_patch_template native_pv_ops;
 #else
 #define PVRT_SUFFIX ""
 #endif
@@ -775,6 +776,9 @@ extern struct paravirt_patch_site __parainstructions[],
 #ifdef CONFIG_PARAVIRT_RUNTIME
 extern struct paravirt_patch_site __parainstructions_runtime[],
 	__parainstructions_runtime_end[];
+void paravirt_ops_init(void);
+#else
+static inline void paravirt_ops_init(void) { }
 #endif
 
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c131ba4e70ef..8c511cc4d4f4 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -458,5 +458,15 @@ NOKPROBE_SYMBOL(native_set_debugreg);
 NOKPROBE_SYMBOL(native_load_idt);
 #endif
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+__ro_after_init struct paravirt_patch_template native_pv_ops;
+
+void __init paravirt_ops_init(void)
+{
+	native_pv_ops = pv_ops;
+}
+EXPORT_SYMBOL(native_pv_ops);
+#endif
+
 EXPORT_SYMBOL(pv_ops);
 EXPORT_SYMBOL_GPL(pv_info);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e6b545047f38..2746a6a78fe7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -43,6 +43,7 @@
 #include <asm/unwind.h>
 #include <asm/vsyscall.h>
 #include <linux/vmalloc.h>
+#include <asm/paravirt_types.h>
 
 /*
  * max_low_pfn_mapped: highest directly mapped pfn < 4 GB
@@ -831,6 +832,7 @@ void __init setup_arch(char **cmdline_p)
 	boot_cpu_data.x86_phys_bits = MAX_PHYSMEM_BITS;
 #endif
 
+	paravirt_ops_init();
 	/*
 	 * If we have OLPC OFW, we might end up relocating the fixmap due to
 	 * reserve_top(), so do this before touching the ioremap area.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 09/26] x86/paravirt: Add runtime_patch()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

runtime_patch() generates insn sequences for patching supported pv-ops.
It does this by calling paravirt_patch_default() or native_patch(),
depending on whether the target is a paravirt or native pv-op.

In addition, runtime_patch() whitelists pv-ops that are safe to patch
at runtime.
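
Nothing is whitelisted in this patch; as a sketch (the case labels are
assumptions based on the lock ops this series targets, and need
CONFIG_PARAVIRT_SPINLOCKS), the whitelist inside runtime_patch() might
eventually look like:

  	switch (type) {
  	case PARAVIRT_PATCH(lock.queued_spin_lock_slowpath):
  	case PARAVIRT_PATCH(lock.queued_spin_unlock):
  	case PARAVIRT_PATCH(lock.vcpu_is_preempted):
  		break;
  	default:
  		pr_warn("type=%d unsuitable for runtime-patching\n", type);
  		return -EINVAL;
  	}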

The static conditions that need to be satisfied to patch safely:
 - Insn sequences under replacement need to execute without preemption.
   This is meant to avoid scenarios where a call-site (ex.
   lock.vcpu_is_preempted) switches between the following sequences:

  lock.vcpu_is_preempted = __raw_callee_save___kvm_vcpu_is_preempted
    0: e8 31 e6 ff ff		callq  0xffffffffffffe636
    4: 66 90			xchg   %ax,%ax      # NOP2

  lock.vcpu_is_preempted = __raw_callee_save___native_vcpu_is_preempted
    0: 31 c0			xor    %eax,%eax
    2: 0f 1f 44 00 00		nopl   0x0(%rax,%rax,1)    # NOP5

   If kvm_vcpu_is_preempted() were preemptible, then, post-patching,
   we would return to address 4 above, which is in the middle of an
   instruction for native_vcpu_is_preempted().

   Even if this were made safe (ex. by changing the NOP2 to be a
   prefix instead of a suffix), it would still not be enough, since
   we do not want any code from the switched-out pv-op to execute
   after the pv-op has been switched out.

 - Insn sequences under replacement are entered only at the beginning:
   this allows us to use text_poke(), which uses INT3 as a barrier.

We don't store the address inside any call-sites, so the second
condition can be assumed.

Guaranteeing the first condition boils down to stating that any pv-op
being patched cannot be present/referenced from any call-stack in the
system. pv-ops that are not obviously non-preemptible need to be
enclosed in preempt_disable_runtime_patch()/preempt_enable_runtime_patch().
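
As an illustration (hypothetical call-site, not taken from this
series), a preemptible caller of a runtime-patchable op would be
bracketed like this so no stale insn sequence can be live on its stack
while patching runs:

  	preempt_disable_runtime_patch();
  	ret = pv_vcpu_is_preempted(cpu);	/* runtime-patchable pv-op */
  	preempt_enable_runtime_patch();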

This should be sufficient because runtime_patch() itself is called from
a stop_machine() context, which is enough to flush out any
non-preemptible sequences.

Note that preemption in the host is okay: stop_machine() would unwind
any pv-ops sleeping in the host.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/paravirt_types.h |  8 +++++
 arch/x86/kernel/paravirt.c            |  6 +---
 arch/x86/kernel/paravirt_patch.c      | 49 +++++++++++++++++++++++++++
 include/linux/preempt.h               | 17 ++++++++++
 4 files changed, 75 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index bc935eec7ec6..3b9f6c105397 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -350,6 +350,12 @@ extern struct paravirt_patch_template native_pv_ops;
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
+/*
+ * Neat trick to map patch type back to the call within the
+ * corresponding structure.
+ */
+#define PARAVIRT_PATCH_OP(ops, type) (*(long *)(&((long **)&(ops))[type]))
+
 #define paravirt_type(op)				\
 	[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),	\
 	[paravirt_opptr] "i" (&(pv_ops.op))
@@ -383,6 +389,8 @@ unsigned paravirt_patch_default(u8 type, void *insn_buff, unsigned long addr, un
 unsigned paravirt_patch_insns(void *insn_buff, unsigned len, const char *start, const char *end);
 
 unsigned native_patch(u8 type, void *insn_buff, unsigned long addr, unsigned len);
+int runtime_patch(u8 type, void *insn_buff, void *op, unsigned long addr,
+		  unsigned int len);
 
 int paravirt_disable_iospace(void);
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 8c511cc4d4f4..c4128436b05a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -117,11 +117,7 @@ void __init native_pv_lock_init(void)
 unsigned paravirt_patch_default(u8 type, void *insn_buff,
 				unsigned long addr, unsigned len)
 {
-	/*
-	 * Neat trick to map patch type back to the call within the
-	 * corresponding structure.
-	 */
-	void *opfunc = *((void **)&pv_ops + type);
+	void *opfunc = (void *)PARAVIRT_PATCH_OP(pv_ops, type);
 	unsigned ret;
 
 	if (opfunc == NULL)
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index 3eff63c090d2..3eb8c0e720b4 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/stringify.h>
+#include <linux/errno.h>
 
 #include <asm/paravirt.h>
 #include <asm/asm-offsets.h>
@@ -124,3 +125,51 @@ unsigned int native_patch(u8 type, void *insn_buff, unsigned long addr,
 
 	return paravirt_patch_default(type, insn_buff, addr, len);
 }
+
+#ifdef CONFIG_PARAVIRT_RUNTIME
+/**
+ * runtime_patch - Generate patching code for a native/paravirt op
+ * @type: op type to generate code for
+ * @insn_buff: destination buffer
+ * @op: op target
+ * @addr: call site address
+ * @len: length of insn_buff
+ *
+ * Note that pv-ops are only suitable for runtime patching if they are
+ * non-preemptible. This is necessary for two reasons: we don't want to
+ * be overwriting insn sequences which might be referenced from call-stacks
+ * (and thus would be returned to), and we want patching to act as a barrier
+ * so no code from now stale paravirt ops should execute after an op has
+ * changed.
+ *
+ * Return: size of insn sequence on success, -EINVAL on error.
+ */
+int runtime_patch(u8 type, void *insn_buff, void *op,
+		  unsigned long addr, unsigned int len)
+{
+	void *native_op;
+	int used = 0;
+
+	/* Nothing whitelisted for now. */
+	switch (type) {
+	default:
+		pr_warn("type=%d unsuitable for runtime-patching\n", type);
+		return -EINVAL;
+	}
+
+	if (PARAVIRT_PATCH_OP(pv_ops, type) != (long)op)
+		PARAVIRT_PATCH_OP(pv_ops, type) = (long)op;
+
+	native_op = (void *)PARAVIRT_PATCH_OP(native_pv_ops, type);
+
+	/*
+	 * Use native_patch() to get the right insns if we are switching
+	 * back to a native_op.
+	 */
+	if (op == native_op)
+		used = native_patch(type, insn_buff, addr, len);
+	else
+		used = paravirt_patch_default(type, insn_buff, addr, len);
+	return used;
+}
+#endif /* CONFIG_PARAVIRT_RUNTIME */
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index bc3f1aecaa19..c569d077aab2 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -203,6 +203,13 @@ do { \
 		__preempt_schedule(); \
 } while (0)
 
+/*
+ * preempt_enable_no_resched() so we don't add any preemption points until
+ * after the caller has returned.
+ */
+#define preempt_enable_runtime_patch()	preempt_enable_no_resched()
+#define preempt_disable_runtime_patch()	preempt_disable()
+
 #else /* !CONFIG_PREEMPTION */
 #define preempt_enable() \
 do { \
@@ -217,6 +224,12 @@ do { \
 } while (0)
 
 #define preempt_check_resched() do { } while (0)
+
+/*
+ * NOP, if there's no preemption.
+ */
+#define preempt_disable_runtime_patch()	do { } while (0)
+#define preempt_enable_runtime_patch()	do { } while (0)
 #endif /* CONFIG_PREEMPTION */
 
 #define preempt_disable_notrace() \
@@ -250,6 +263,8 @@ do { \
 #define preempt_enable_notrace()		barrier()
 #define preemptible()				0
 
+#define preempt_disable_runtime_patch()	do { } while (0)
+#define preempt_enable_runtime_patch()	do { } while (0)
 #endif /* CONFIG_PREEMPT_COUNT */
 
 #ifdef MODULE
@@ -260,6 +275,8 @@ do { \
 #undef preempt_enable_no_resched
 #undef preempt_enable_no_resched_notrace
 #undef preempt_check_resched
+#undef preempt_disable_runtime_patch
+#undef preempt_enable_runtime_patch
 #endif
 
 #define preempt_set_need_resched() \
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread
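
As a stand-alone sketch of the PARAVIRT_PATCH_OP() indexing trick used
in the patch above (made-up struct and field names, written as plain
userspace C rather than kernel code):

	#include <stddef.h>
	#include <stdio.h>

	struct ops {
		void *alloc;
		void *free;
		void *flush;
	};

	/* Same shape as PARAVIRT_PATCH()/PARAVIRT_PATCH_OP(). */
	#define OP_INDEX(x)	(offsetof(struct ops, x) / sizeof(void *))
	#define OP_SLOT(o, idx)	(*(long *)(&((long **)&(o))[idx]))

	int main(void)
	{
		struct ops o = {
			.alloc = (void *)0x1,
			.free  = (void *)0x2,
			.flush = (void *)0x3,
		};

		/* OP_INDEX(free) == 1; both lines print 0x2. */
		printf("%p\n", (void *)OP_SLOT(o, OP_INDEX(free)));
		printf("%p\n", o.free);
		return 0;
	}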

* [RFC PATCH 10/26] x86/paravirt: Add primitives to stage pv-ops
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Add paravirt_stage_alt() which conditionally selects between a paravirt
or native pv-op and then stages it for later patching.
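
As a rough usage sketch (enable_pv and kvm_wait are placeholders here;
the real callers show up in later patches), staging and switching the
lock ops would look something like:

	/* Under text_mutex: stage lock.wait as either paravirt or native. */
	mutex_lock(&text_mutex);

	paravirt_stage_zero();
	paravirt_stage_alt(enable_pv, lock.wait, kvm_wait);
	/* ... stage the remaining, interdependent lock ops ... */

	/* ... then hand the staged set to the patching machinery ... */
	mutex_unlock(&text_mutex);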

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/paravirt_types.h |  6 +++
 arch/x86/include/asm/text-patching.h  |  3 ++
 arch/x86/kernel/alternative.c         | 58 +++++++++++++++++++++++++++
 3 files changed, 67 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3b9f6c105397..0c4ca7ad719c 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -350,6 +350,12 @@ extern struct paravirt_patch_template native_pv_ops;
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
+#define paravirt_stage_alt(do_stage, op, opfn)				\
+	(text_poke_pv_stage(PARAVIRT_PATCH(op),				\
+			    (do_stage) ? (opfn) : (native_pv_ops.op)))
+
+#define paravirt_stage_zero() text_poke_pv_stage_zero()
+
 /*
  * Neat trick to map patch type back to the call within the
  * corresponding structure.
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e2ef241c261e..706e61e6967d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -55,6 +55,9 @@ extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void
 extern void text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate);
 extern void text_poke_finish(void);
 
+bool text_poke_pv_stage(u8 type, void *opfn);
+void text_poke_pv_stage_zero(void);
+
 #define INT3_INSN_SIZE		1
 #define INT3_INSN_OPCODE	0xCC
 
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8189ac21624c..0c335af9ee28 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1307,3 +1307,61 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 	text_poke_loc_init(&tp, addr, opcode, len, emulate);
 	text_poke_bp_batch(&tp, 1);
 }
+
+#ifdef CONFIG_PARAVIRT_RUNTIME
+struct paravirt_stage_entry {
+	void *dest;	/* pv_op destination */
+	u8 type;	/* pv_op type */
+};
+
+/*
+ * We don't anticipate many pv-ops being written at runtime.
+ */
+#define PARAVIRT_STAGE_MAX 8
+struct paravirt_stage {
+	struct paravirt_stage_entry ops[PARAVIRT_STAGE_MAX];
+	u32 count;
+};
+
+/* Protected by text_mutex */
+static struct paravirt_stage pv_stage;
+
+/**
+ * text_poke_pv_stage - Stage paravirt-op for poking.
+ *
+ * @type: pv-op type
+ * @opfn: destination of the pv-op
+ *
+ * Return: staging status.
+ */
+bool text_poke_pv_stage(u8 type, void *opfn)
+{
+	if (system_state == SYSTEM_BOOTING) { /* Passthrough */
+		PARAVIRT_PATCH_OP(pv_ops, type) = (long)opfn;
+		goto out;
+	}
+
+	lockdep_assert_held(&text_mutex);
+
+	if (PARAVIRT_PATCH_OP(pv_ops, type) == (long)opfn)
+		goto out;
+
+	if (pv_stage.count >= PARAVIRT_STAGE_MAX)
+		goto out;
+
+	pv_stage.ops[pv_stage.count].type = type;
+	pv_stage.ops[pv_stage.count].dest = opfn;
+
+	pv_stage.count++;
+
+	return true;
+out:
+	return false;
+}
+
+void text_poke_pv_stage_zero(void)
+{
+	lockdep_assert_held(&text_mutex);
+	pv_stage.count = 0;
+}
+#endif /* CONFIG_PARAVIRT_RUNTIME */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 11/26] x86/alternatives: Remove return value of text_poke*()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Various text_poke() variants don't return a useful value. Remove it.
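
Callers that did look at the return value only got back the address
they passed in, so the conversion is mechanical (sketch):

	/* Was: addr = text_poke(addr, opcode, len); now simply: */
	text_poke(addr, opcode, len);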

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/text-patching.h |  4 ++--
 arch/x86/kernel/alternative.c        | 11 +++++------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 706e61e6967d..04778c2bc34e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -46,9 +46,9 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
  * On the local CPU you need to be protected against NMI or MCE handlers seeing
  * an inconsistent instruction while you patch.
  */
-extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
-extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
+extern void text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
 
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0c335af9ee28..8c79a3dc5e72 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -805,7 +805,7 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
-static void *__text_poke(void *addr, const void *opcode, size_t len)
+static void __text_poke(void *addr, const void *opcode, size_t len)
 {
 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
 	struct page *pages[2] = {NULL};
@@ -906,7 +906,6 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
 
 	pte_unmap_unlock(ptep, ptl);
 	local_irq_restore(flags);
-	return addr;
 }
 
 /**
@@ -925,11 +924,11 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
  * by registering a module notifier, and ordering module removal and patching
  * trough a mutex.
  */
-void *text_poke(void *addr, const void *opcode, size_t len)
+void text_poke(void *addr, const void *opcode, size_t len)
 {
 	lockdep_assert_held(&text_mutex);
 
-	return __text_poke(addr, opcode, len);
+	__text_poke(addr, opcode, len);
 }
 
 /**
@@ -946,9 +945,9 @@ void *text_poke(void *addr, const void *opcode, size_t len)
  * Context: should only be used by kgdb, which ensures no other core is running,
  *	    despite the fact it does not hold the text_mutex.
  */
-void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
+void text_poke_kgdb(void *addr, const void *opcode, size_t len)
 {
-	return __text_poke(addr, opcode, len);
+	__text_poke(addr, opcode, len);
 }
 
 static void do_sync_core(void *info)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 12/26] x86/alternatives: Use __get_unlocked_pte() in text_poke()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

text_poke() uses get_locked_pte() to map poking_addr. However, this
introduces a dependency on locking code which precludes using
text_poke() to modify qspinlock primitives.

Accesses to this pte (and poking_addr) are protected by text_mutex,
so we can safely switch to __get_unlocked_pte() here. Note that
we do need to be careful that we do not try to modify the poking_addr
from multiple contexts simultaneously (ex. INT3 or NMI context.)
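
A before/after sketch of the mapping step (mirroring the hunk below):

	/* Previously: takes the page-table lock. */
	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);

	/* Now: serialized by text_mutex instead, no ptl. */
	ptep = __get_unlocked_pte(poking_mm, poking_addr);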

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c |  9 ++++-----
 include/linux/mm.h            | 16 ++++++++++++++--
 mm/memory.c                   |  9 ++++++---
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8c79a3dc5e72..0344e49a4ade 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -812,7 +812,6 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	temp_mm_state_t prev;
 	unsigned long flags;
 	pte_t pte, *ptep;
-	spinlock_t *ptl;
 	pgprot_t pgprot;
 
 	/*
@@ -846,10 +845,11 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
 
 	/*
-	 * The lock is not really needed, but this allows to avoid open-coding.
+	 * text_poke() might be used to poke spinlock primitives so do this
+	 * unlocked. This does mean that we need to be careful that no other
+	 * context (ex. INT3 handler) is simultaneously writing to this pte.
 	 */
-	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
-
+	ptep = __get_unlocked_pte(poking_mm, poking_addr);
 	/*
 	 * This must not fail; preallocated in poking_init().
 	 */
@@ -904,7 +904,6 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	 */
 	BUG_ON(memcmp(addr, opcode, len));
 
-	pte_unmap_unlock(ptep, ptl);
 	local_irq_restore(flags);
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7dd5c4ccbf85..d4a652c2e269 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1895,8 +1895,20 @@ static inline int pte_devmap(pte_t pte)
 
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 
-extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
-			       spinlock_t **ptl);
+pte_t *__get_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl);
+
+static inline pte_t *__get_unlocked_pte(struct mm_struct *mm,
+					unsigned long addr)
+{
+	return __get_pte(mm, addr, NULL);
+}
+
+static inline pte_t *__get_locked_pte(struct mm_struct *mm,
+				      unsigned long addr, spinlock_t **ptl)
+{
+	return __get_pte(mm, addr, ptl);
+}
+
 static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
 				    spinlock_t **ptl)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 586271f3efc6..7acfe9512084 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1407,8 +1407,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
-pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
-			spinlock_t **ptl)
+pte_t *__get_pte(struct mm_struct *mm, unsigned long addr,
+		 spinlock_t **ptl)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -1427,7 +1427,10 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 		return NULL;
 
 	VM_BUG_ON(pmd_trans_huge(*pmd));
-	return pte_alloc_map_lock(mm, pmd, addr, ptl);
+	if (likely(ptl))
+		return pte_alloc_map_lock(mm, pmd, addr, ptl);
+	else
+		return pte_alloc_map(mm, pmd, addr);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 13/26] x86/alternatives: Split __text_poke()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Separate __text_poke() into map, memcpy and unmap portions
(__text_poke_map(), __text_do_poke() and __text_poke_unmap()).

Do this to separate the non-reentrant bits from the reentrant
__text_do_poke(). __text_poke_map()/_unmap() modify poking_mm,
poking_addr and do the pte-mapping and thus are non-reentrant.

This allows __text_do_poke() to be safely called from an INT3
context, with __text_poke_map()/_unmap() called once at the start
and end of patching a call-site instead of once for each of the
three patching stages.
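
A sketch of the intended pattern (the actual multi-stage caller comes
in later patches; pipeline syncs between stages are omitted here):

	unsigned char int3 = INT3_INSN_OPCODE;
	temp_mm_state_t prev_mm;
	unsigned long flags;
	pte_t *ptep;

	local_irq_save(flags);
	__text_poke_map(addr, len, &prev_mm, &ptep);

	/* text_poke_bp()-style stages against a single mapping. */
	__text_do_poke(offset_in_page(addr), &int3, INT3_INSN_SIZE);
	__text_do_poke(offset_in_page(addr) + INT3_INSN_SIZE,
		       opcode + INT3_INSN_SIZE, len - INT3_INSN_SIZE);
	__text_do_poke(offset_in_page(addr), opcode, INT3_INSN_SIZE);

	__text_poke_unmap(addr, opcode, len, &prev_mm, ptep);
	local_irq_restore(flags);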

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c | 46 +++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0344e49a4ade..337aad8c2521 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -805,13 +805,12 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
-static void __text_poke(void *addr, const void *opcode, size_t len)
+static void __text_poke_map(void *addr, size_t len,
+			    temp_mm_state_t *prev_mm, pte_t **ptep)
 {
 	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
 	struct page *pages[2] = {NULL};
-	temp_mm_state_t prev;
-	unsigned long flags;
-	pte_t pte, *ptep;
+	pte_t pte;
 	pgprot_t pgprot;
 
 	/*
@@ -836,8 +835,6 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	 */
 	BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));
 
-	local_irq_save(flags);
-
 	/*
 	 * Map the page without the global bit, as TLB flushing is done with
 	 * flush_tlb_mm_range(), which is intended for non-global PTEs.
@@ -849,30 +846,42 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	 * unlocked. This does mean that we need to be careful that no other
 	 * context (ex. INT3 handler) is simultaneously writing to this pte.
 	 */
-	ptep = __get_unlocked_pte(poking_mm, poking_addr);
+	*ptep = __get_unlocked_pte(poking_mm, poking_addr);
 	/*
 	 * This must not fail; preallocated in poking_init().
 	 */
-	VM_BUG_ON(!ptep);
+	VM_BUG_ON(!*ptep);
 
 	pte = mk_pte(pages[0], pgprot);
-	set_pte_at(poking_mm, poking_addr, ptep, pte);
+	set_pte_at(poking_mm, poking_addr, *ptep, pte);
 
 	if (cross_page_boundary) {
 		pte = mk_pte(pages[1], pgprot);
-		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, *ptep + 1, pte);
 	}
 
 	/*
 	 * Loading the temporary mm behaves as a compiler barrier, which
 	 * guarantees that the PTE will be set at the time memcpy() is done.
 	 */
-	prev = use_temporary_mm(poking_mm);
+	*prev_mm = use_temporary_mm(poking_mm);
+}
 
+/*
+ * Do the actual poke. Needs to be re-entrant as this can be called
+ * via INT3 context as well.
+ */
+static void __text_do_poke(unsigned long offset, const void *opcode, size_t len)
+{
 	kasan_disable_current();
-	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+	memcpy((u8 *)poking_addr + offset, opcode, len);
 	kasan_enable_current();
+}
 
+static void __text_poke_unmap(void *addr, const void *opcode, size_t len,
+			      temp_mm_state_t *prev_mm, pte_t *ptep)
+{
+	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
 	/*
 	 * Ensure that the PTE is only cleared after the instructions of memcpy
 	 * were issued by using a compiler barrier.
@@ -888,7 +897,7 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	 * instruction that already allows the core to see the updated version.
 	 * Xen-PV is assumed to serialize execution in a similar manner.
 	 */
-	unuse_temporary_mm(prev);
+	unuse_temporary_mm(*prev_mm);
 
 	/*
 	 * Flushing the TLB might involve IPIs, which would require enabled
@@ -903,7 +912,18 @@ static void __text_poke(void *addr, const void *opcode, size_t len)
 	 * fundamentally screwy; there's nothing we can really do about that.
 	 */
 	BUG_ON(memcmp(addr, opcode, len));
+}
 
+static void __text_poke(void *addr, const void *opcode, size_t len)
+{
+	temp_mm_state_t prev_mm;
+	unsigned long flags;
+	pte_t *ptep;
+
+	local_irq_save(flags);
+	__text_poke_map(addr, len, &prev_mm, &ptep);
+	__text_do_poke(offset_in_page(addr), opcode, len);
+	__text_poke_unmap(addr, opcode, len, &prev_mm, ptep);
 	local_irq_restore(flags);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 14/26] x86/alternatives: Handle native insns in text_poke_loc*()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Intended to handle scenarios where we might want to patch arbitrary
instructions (ex. inlined opcodes in pv_lock_ops.)

Users for native mode (as opposed to emulated) are introduced in
later patches.
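
A sketch of what such a user looks like (addr/opcode/len here stand in
for an arbitrary, possibly multi-insn, native sequence):

	struct text_poke_loc tp;

	/*
	 * Native mode: only the raw bytes and their length are recorded;
	 * poke_int3_handler() will not try to emulate them as CALL/JMP/NOP.
	 */
	text_poke_loc_init(&tp, addr, opcode, len, NULL, true);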

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/text-patching.h |  4 +-
 arch/x86/kernel/alternative.c        | 61 ++++++++++++++++++++--------
 2 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 04778c2bc34e..c4b2814f2f9d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -25,10 +25,10 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 
 /*
  * Currently, the max observed size in the kernel code is
- * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
+ * NOP7 for indirect call, which is 7.
  * Raise it if needed.
  */
-#define POKE_MAX_OPCODE_SIZE	5
+#define POKE_MAX_OPCODE_SIZE	7
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
 
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 337aad8c2521..004fe86f463f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -981,8 +981,15 @@ void text_poke_sync(void)
 
 struct text_poke_loc {
 	s32 rel_addr; /* addr := _stext + rel_addr */
-	s32 rel32;
-	u8 opcode;
+	union {
+		struct {
+			s32 rel32;
+			u8 opcode;
+		} emulated;
+		struct {
+			u8 len;
+		} native;
+	};
 	const u8 text[POKE_MAX_OPCODE_SIZE];
 };
 
@@ -990,6 +997,7 @@ struct bp_patching_desc {
 	struct text_poke_loc *vec;
 	int nr_entries;
 	atomic_t refs;
+	bool native;
 };
 
 static struct bp_patching_desc *bp_desc;
@@ -1071,10 +1079,13 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 			goto out_put;
 	}
 
-	len = text_opcode_size(tp->opcode);
+	if (desc->native)
+		BUG();
+
+	len = text_opcode_size(tp->emulated.opcode);
 	ip += len;
 
-	switch (tp->opcode) {
+	switch (tp->emulated.opcode) {
 	case INT3_INSN_OPCODE:
 		/*
 		 * Someone poked an explicit INT3, they'll want to handle it,
@@ -1083,12 +1094,12 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 		goto out_put;
 
 	case CALL_INSN_OPCODE:
-		int3_emulate_call(regs, (long)ip + tp->rel32);
+		int3_emulate_call(regs, (long)ip + tp->emulated.rel32);
 		break;
 
 	case JMP32_INSN_OPCODE:
 	case JMP8_INSN_OPCODE:
-		int3_emulate_jmp(regs, (long)ip + tp->rel32);
+		int3_emulate_jmp(regs, (long)ip + tp->emulated.rel32);
 		break;
 
 	default:
@@ -1134,6 +1145,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		.vec = tp,
 		.nr_entries = nr_entries,
 		.refs = ATOMIC_INIT(1),
+		.native = false,
 	};
 	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
@@ -1161,7 +1173,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * Second step: update all but the first byte of the patched range.
 	 */
 	for (do_sync = 0, i = 0; i < nr_entries; i++) {
-		int len = text_opcode_size(tp[i].opcode);
+		int len = text_opcode_size(tp[i].emulated.opcode);
 
 		if (len - INT3_INSN_SIZE > 0) {
 			text_poke(text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
@@ -1205,11 +1217,25 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 }
 
 static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
-			       const void *opcode, size_t len, const void *emulate)
+			       const void *opcode, size_t len,
+			       const void *emulate, bool native)
 {
 	struct insn insn;
 
+	memset((void *)tp, 0, sizeof(*tp));
 	memcpy((void *)tp->text, opcode, len);
+
+	tp->rel_addr = addr - (void *)_stext;
+
+	/*
+	 * Native mode: we might be poking arbitrary, and possibly
+	 * multiple, instructions.
+	 */
+	if (native) {
+		tp->native.len = (u8)len;
+		return;
+	}
+
 	if (!emulate)
 		emulate = opcode;
 
@@ -1219,31 +1245,30 @@ static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
 	BUG_ON(!insn_complete(&insn));
 	BUG_ON(len != insn.length);
 
-	tp->rel_addr = addr - (void *)_stext;
-	tp->opcode = insn.opcode.bytes[0];
+	tp->emulated.opcode = insn.opcode.bytes[0];
 
-	switch (tp->opcode) {
+	switch (tp->emulated.opcode) {
 	case INT3_INSN_OPCODE:
 		break;
 
 	case CALL_INSN_OPCODE:
 	case JMP32_INSN_OPCODE:
 	case JMP8_INSN_OPCODE:
-		tp->rel32 = insn.immediate.value;
+		tp->emulated.rel32 = insn.immediate.value;
 		break;
 
 	default: /* assume NOP */
 		switch (len) {
 		case 2: /* NOP2 -- emulate as JMP8+0 */
 			BUG_ON(memcmp(emulate, ideal_nops[len], len));
-			tp->opcode = JMP8_INSN_OPCODE;
-			tp->rel32 = 0;
+			tp->emulated.opcode = JMP8_INSN_OPCODE;
+			tp->emulated.rel32 = 0;
 			break;
 
 		case 5: /* NOP5 -- emulate as JMP32+0 */
 			BUG_ON(memcmp(emulate, ideal_nops[NOP_ATOMIC5], len));
-			tp->opcode = JMP32_INSN_OPCODE;
-			tp->rel32 = 0;
+			tp->emulated.opcode = JMP32_INSN_OPCODE;
+			tp->emulated.rel32 = 0;
 			break;
 
 		default: /* unknown instruction */
@@ -1299,7 +1324,7 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
 	text_poke_flush(addr);
 
 	tp = &tp_vec[tp_vec_nr++];
-	text_poke_loc_init(tp, addr, opcode, len, emulate);
+	text_poke_loc_init(tp, addr, opcode, len, emulate, false);
 }
 
 /**
@@ -1322,7 +1347,7 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 		return;
 	}
 
-	text_poke_loc_init(&tp, addr, opcode, len, emulate);
+	text_poke_loc_init(&tp, addr, opcode, len, emulate, false);
 	text_poke_bp_batch(&tp, 1);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 15/26] x86/alternatives: Non-emulated text poking
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Patching at runtime needs to handle interdependent pv-ops: as an example,
lock.queued_lock_slowpath(), lock.queued_lock_unlock() and the other
pv_lock_ops are paired and so need to be updated atomically. This is
difficult with emulation because non-patching CPUs could be executing in
critical sections.
(We could apply INT3 everywhere first and then use RCU to force a
barrier but given that spinlocks are everywhere, it still might mean a
lot of time in emulation.)

Second, locking operations can be called from interrupt handlers which
means we cannot trivially use IPIs to introduce a pipeline sync step on
non-patching CPUs.

Third, some pv-ops can be inlined and so we would need to emulate a
broader set of operations than CALL, JMP, NOP*.

Introduce the core state-machine with the actual poking and pipeline
sync stubbed out. This executes via stop_machine() with the primary CPU
carrying out a text_poke_bp()-style three-stage algorithm.

The control flow diagram below shows CPU0 as the primary which does the
patching, while the rest of the CPUs (CPUx) execute the sync loop in
text_poke_sync_finish().

 CPU0				    CPUx
 ----                               ----

 patch_worker()			    patch_worker()

   /* Traversal, insn-gen */	      text_poke_sync_finish()
   tps.patch_worker()		      /*
  				       * wait until:
     /* for each patch-site */ 	       *  tps->state == PATCH_DONE
     text_poke_site()		       */
       poke_sync()

  	   ...				       ...

   smp_store_release(&tps->state, PATCH_DONE)

Commits further on flesh out the rest of the code.
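
For orientation, here is a rough usage sketch -- not part of this patch;
my_worker() and my_late_patch() are placeholders standing in for a real
insn-generating worker such as the paravirt one added later in the series:

	static void my_worker(struct text_poke_state *tps)
	{
		/* Generate insns per call-site and hand each to text_poke_site(). */
	}

	static int my_late_patch(void)
	{
		int ret;

		mutex_lock(&text_mutex);
		ret = text_poke_late(my_worker, NULL);	/* stage unused here */
		mutex_unlock(&text_mutex);

		return ret;
	}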

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
sync_one() uses the following for pipeline synchronization:

+       if (in_nmi())
+               cpuid_eax(1);
+       else
+               sync_core();

The if (in_nmi()) clause is meant to be executed from NMI contexts.
Reading through past LKML discussions, cpuid_eax() is probably a
bad choice -- at least insofar as Xen PV is concerned. What
would be a good primitive to use instead?

Also, given that we do handle the nested NMI case, does it make sense
to just use native_iret() (via sync_core()) in NMI contexts as well?

---
 arch/x86/kernel/alternative.c | 247 ++++++++++++++++++++++++++++++++++
 1 file changed, 247 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 004fe86f463f..452d4081eded 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -979,6 +979,26 @@ void text_poke_sync(void)
 	on_each_cpu(do_sync_core, NULL, 1);
 }
 
+static void __maybe_unused sync_one(void)
+{
+	/*
+	 * We might be executing in NMI context, and so cannot use
+	 * IRET as a synchronizing instruction.
+	 *
+	 * We could use native_write_cr2() but that is not guaranteed
+	 * to work on Xen-PV -- it is emulated by Xen and might not
+	 * execute an iret (or similar synchronizing instruction)
+	 * internally.
+	 *
+	 * cpuid() would trap as well. Unclear if that's a solution
+	 * either.
+	 */
+	if (in_nmi())
+		cpuid_eax(1);
+	else
+		sync_core();
+}
+
 struct text_poke_loc {
 	s32 rel_addr; /* addr := _stext + rel_addr */
 	union {
@@ -1351,6 +1371,233 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 	text_poke_bp_batch(&tp, 1);
 }
 
+struct text_poke_state;
+typedef void (*patch_worker_t)(struct text_poke_state *tps);
+
+/*
+ *                        +-----------possible-BP----------+
+ *                        |                                |
+ *         +--write-INT3--+   +--suffix--+   +-insn-prefix-+
+ *        /               | _/           |__/              |
+ *       /                v'             v                 v
+ * PATCH_SYNC_0    PATCH_SYNC_1    PATCH_SYNC_2   *PATCH_SYNC_DONE*
+ *       \                                                    |`----> PATCH_DONE
+ *        `----------<---------<---------<---------<----------+
+ *
+ * We start in state PATCH_SYNC_DONE and loop through PATCH_SYNC_* states
+ * to end at PATCH_DONE. The primary drives these in text_poke_site()
+ * with patch_worker() making the final transition to PATCH_DONE.
+ * All transitions but the last iteration need to be globally observed.
+ *
+ * On secondary CPUs, text_poke_sync_finish() waits in a cpu_relax()
+ * loop waiting for a transition to PATCH_SYNC_0 at which point it would
+ * start observing transitions until PATCH_SYNC_DONE.
+ * Eventually the primary moves to PATCH_DONE and secondary CPUs finish.
+ */
+enum patch_state {
+	/*
+	 * Add an artificial state so that we can do a bitwise operation
+	 * over all the PATCH_SYNC_* states.
+	 */
+	PATCH_SYNC_x = 4,
+	PATCH_SYNC_0 = PATCH_SYNC_x | 0,	/* Serialize INT3 */
+	PATCH_SYNC_1 = PATCH_SYNC_x | 1,	/* Serialize rest */
+	PATCH_SYNC_2 = PATCH_SYNC_x | 2,	/* Serialize first opcode */
+	PATCH_SYNC_DONE = PATCH_SYNC_x | 3,	/* Site done, and start state */
+
+	PATCH_DONE = 8,				/* End state */
+};
+
+/*
+ * State for driving text-poking via stop_machine().
+ */
+struct text_poke_state {
+	/* Whatever we are poking */
+	void *stage;
+
+	/* Modules to be processed. */
+	struct list_head *head;
+
+	/*
+	 * Accesses to sync_ack_map are ordered by the primary
+	 * via tps.state.
+	 */
+	struct cpumask sync_ack_map;
+
+	/*
+	 * Generates insn sequences for call-sites to be patched and
+	 * calls text_poke_site() to do the actual poking.
+	 */
+	patch_worker_t	patch_worker;
+
+	/*
+	 * Where are we in the patching state-machine.
+	 */
+	enum patch_state state;
+
+	unsigned int primary_cpu; /* CPU doing the patching. */
+	unsigned int num_acks; /* Number of Acks needed. */
+};
+
+static struct text_poke_state text_poke_state;
+
+/**
+ * poke_sync() - transitions to the specified state.
+ *
+ * @tps - struct text_poke_state *
+ * @state - one of PATCH_SYNC_* states
+ * @offset - offset to be patched
+ * @insns - insns to write
+ * @len - length of insn sequence
+ */
+static void poke_sync(struct text_poke_state *tps, int state, int offset,
+		      const char *insns, int len)
+{
+	/*
+	 * STUB: no patching or synchronization, just go through the
+	 * motions.
+	 */
+	smp_store_release(&tps->state, state);
+}
+
+/**
+ * text_poke_site() - called on the primary to patch a single call site.
+ *
+ * Returns after switching tps->state to PATCH_SYNC_DONE.
+ */
+static void __maybe_unused text_poke_site(struct text_poke_state *tps,
+					  struct text_poke_loc *tp)
+{
+	const unsigned char int3 = INT3_INSN_OPCODE;
+	temp_mm_state_t prev_mm;
+	pte_t *ptep;
+	int offset;
+
+	__text_poke_map(text_poke_addr(tp), tp->native.len, &prev_mm, &ptep);
+
+	offset = offset_in_page(text_poke_addr(tp));
+
+	/*
+	 * All secondary CPUs are waiting in tps->state == PATCH_SYNC_DONE
+	 * to move to PATCH_SYNC_0. Poke the INT3 and wait until all CPUs
+	 * are known to have observed PATCH_SYNC_0.
+	 *
+	 * The earliest we can hit an INT3 is just after the first poke.
+	 */
+	poke_sync(tps, PATCH_SYNC_0, offset, &int3, INT3_INSN_SIZE);
+
+	/* Poke remaining */
+	poke_sync(tps, PATCH_SYNC_1, offset + INT3_INSN_SIZE,
+		  tp->text + INT3_INSN_SIZE, tp->native.len - INT3_INSN_SIZE);
+
+	/*
+	 * Replace the INT3 with the first opcode and force the serializing
+	 * instruction for the last time. Any secondaries in the BP
+	 * handler should be able to move past the INT3 handler after this.
+	 * (See poke_int3_native() for details on this.)
+	 */
+	poke_sync(tps, PATCH_SYNC_2, offset, tp->text, INT3_INSN_SIZE);
+
+	/*
+	 * Force all CPUS to observe PATCH_SYNC_DONE (in the BP handler or
+	 * in text_poke_site()), so they know that this iteration is done
+	 * and it is safe to exit the wait-until-a-sync-is-required loop.
+	 */
+	poke_sync(tps, PATCH_SYNC_DONE, 0, NULL, 0);
+
+	/*
+	 * Unmap the poking_addr, poking_mm.
+	 */
+	__text_poke_unmap(text_poke_addr(tp), tp->text, tp->native.len,
+			  &prev_mm, ptep);
+}
+
+/**
+ * text_poke_sync_finish() -- called to synchronize the CPU pipeline
+ * on secondary CPUs for all patch sites.
+ *
+ * Called in thread context with tps->state == PATCH_SYNC_DONE.
+ * Returns with tps->state == PATCH_DONE.
+ */
+static void text_poke_sync_finish(struct text_poke_state *tps)
+{
+	while (true) {
+		enum patch_state state;
+
+		state = READ_ONCE(tps->state);
+
+		/*
+		 * We aren't doing any actual poking yet, so we don't
+		 * handle any other states.
+		 */
+		if (state == PATCH_DONE)
+			break;
+
+		/*
+		 * Relax here while the primary makes up its mind on
+		 * whether it is done or not.
+		 */
+		cpu_relax();
+	}
+}
+
+static int patch_worker(void *t)
+{
+	int cpu = smp_processor_id();
+	struct text_poke_state *tps = t;
+
+	if (cpu == tps->primary_cpu) {
+		/*
+		 * Generates insns and calls text_poke_site() to do the poking
+		 * and sync.
+		 */
+		tps->patch_worker(tps);
+
+		/*
+		 * We are done patching. Switch the state to PATCH_DONE
+		 * so the secondaries can exit.
+		 */
+		smp_store_release(&tps->state, PATCH_DONE);
+	} else {
+		/* Secondary CPUs spin in a sync_core() state-machine. */
+		text_poke_sync_finish(tps);
+	}
+	return 0;
+}
+
+/**
+ * text_poke_late() -- late patching via stop_machine().
+ *
+ * Called holding the text_mutex.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
+{
+	int ret;
+
+	lockdep_assert_held(&text_mutex);
+
+	if (system_state != SYSTEM_RUNNING)
+		return -EINVAL;
+
+	text_poke_state.stage = stage;
+	text_poke_state.num_acks = cpumask_weight(cpu_online_mask);
+	text_poke_state.head = &alt_modules;
+
+	text_poke_state.patch_worker = worker;
+	text_poke_state.state = PATCH_SYNC_DONE; /* Start state */
+	text_poke_state.primary_cpu = smp_processor_id();
+
+	/*
+	 * Run the worker on all online CPUs. Don't need to do anything
+	 * for offline CPUs as they come back online with a clean cache.
+	 */
+	ret = stop_machine(patch_worker, &text_poke_state, cpu_online_mask);
+
+	return ret;
+}
+
 #ifdef CONFIG_PARAVIRT_RUNTIME
 struct paravirt_stage_entry {
 	void *dest;	/* pv_op destination */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 16/26] x86/alternatives: Add paravirt patching at runtime
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Add paravirt_runtime_patch(), which uses text_poke_late() to patch
paravirt sites.

Also add paravirt_worker(), which drives the actual insn generation
via generate_paravirt() (which uses runtime_patch() to generate the
appropriate native or paravirt insn sequences) and then calls
text_poke_site() to do the actual poking.
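
As a rough usage sketch (not part of this patch -- the staging step is
added elsewhere in the series and stage_lock_ops() below is only a
placeholder for it), a pv-ops switch would look something like:

	static int repatch_pv_lock_ops(void)
	{
		int ret;

		mutex_lock(&text_mutex);

		stage_lock_ops();		/* placeholder: populate pv_stage */
		ret = paravirt_runtime_patch();	/* -EINVAL if nothing is staged */

		mutex_unlock(&text_mutex);

		return ret;
	}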

 CPU0                                CPUx
 ----                                ----

 patch_worker()                      patch_worker()

   /* Traversal, insn-gen */           text_poke_sync_finish()
   tps.patch_worker()
     /* = paravirt_worker() */         /*
                                        * wait until:
     /* for each patch-site */          *  tps->state == PATCH_DONE
     generate_paravirt()                */
       runtime_patch()
     text_poke_site()
       poke_sync()

           ...                                 ...

   smp_store_release(&tps->state, PATCH_DONE)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/text-patching.h |  2 +
 arch/x86/kernel/alternative.c        | 98 +++++++++++++++++++++++++++-
 2 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index c4b2814f2f9d..e86709a8287e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -21,6 +21,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #ifndef CONFIG_PARAVIRT_RUNTIME
 #define __parainstructions_runtime	NULL
 #define __parainstructions_runtime_end	NULL
+#else
+int paravirt_runtime_patch(void);
 #endif
 
 /*
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 452d4081eded..1c5acdc4f349 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1463,7 +1463,9 @@ static void poke_sync(struct text_poke_state *tps, int state, int offset,
 /**
  * text_poke_site() - called on the primary to patch a single call site.
  *
- * Returns after switching tps->state to PATCH_SYNC_DONE.
+ * Called in thread context with tps->state == PATCH_SYNC_DONE where it
+ * takes tps->state through different PATCH_SYNC_* states, returning
+ * after having switched the tps->state back to PATCH_SYNC_DONE.
  */
 static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 					  struct text_poke_loc *tp)
@@ -1598,6 +1600,16 @@ static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
 	return ret;
 }
 
+/*
+ * Check if this address is still in scope of this module's .text section.
+ */
+static bool __maybe_unused stale_address(struct alt_module *am, u8 *p)
+{
+	if (p < am->text || p >= am->text_end)
+		return true;
+	return false;
+}
+
 #ifdef CONFIG_PARAVIRT_RUNTIME
 struct paravirt_stage_entry {
 	void *dest;	/* pv_op destination */
@@ -1654,4 +1666,88 @@ void text_poke_pv_stage_zero(void)
 	lockdep_assert_held(&text_mutex);
 	pv_stage.count = 0;
 }
+
+/**
+ * generate_paravirt - fill up the insn sequence for a pv-op.
+ *
+ * @tp - address of struct text_poke_loc
+ * @op - the pv-op entry for this location
+ * @site - patch site (kernel or module text)
+ */
+static void generate_paravirt(struct text_poke_loc *tp,
+			      struct paravirt_stage_entry *op,
+			      struct paravirt_patch_site *site)
+{
+	int used;	/* runtime_patch() can return -errno */
+
+	BUG_ON(site->len > POKE_MAX_OPCODE_SIZE);
+
+	text_poke_loc_init(tp, site->instr, site->instr, site->len, NULL, true);
+
+	/*
+	 * Paravirt patches can patch calls (ex. mmu.tlb_flush) and
+	 * callee_saves (ex. queued_spin_unlock).
+	 *
+	 * runtime_patch() calls native_patch(), or paravirt_patch()
+	 * based on the destination.
+	 */
+	used = runtime_patch(site->type, (void *)tp->text, op->dest,
+			     (unsigned long)site->instr, site->len);
+
+	/* No good way to recover. */
+	BUG_ON(used < 0);
+
+	/* Pad the rest with nops */
+	add_nops((void *)tp->text + used, site->len - used);
+}
+
+/**
+ * paravirt_worker - generates the paravirt patching
+ * insns and calls text_poke_site() to do the actual patching.
+ */
+static void paravirt_worker(struct text_poke_state *tps)
+{
+	struct paravirt_patch_site *site;
+	struct paravirt_stage *stage = tps->stage;
+	struct paravirt_stage_entry *op = &stage->ops[0];
+	struct alt_module *am;
+	struct text_poke_loc tp;
+	int i;
+
+	list_for_each_entry(am, tps->head, next) {
+		for (site = am->para; site < am->para_end; site++) {
+			if (stale_address(am, site->instr))
+				continue;
+
+			for (i = 0;  i < stage->count; i++) {
+				if (op[i].type != site->type)
+					continue;
+
+				generate_paravirt(&tp, &op[i], site);
+
+				text_poke_site(tps, &tp);
+			}
+		}
+	}
+}
+
+/**
+ * paravirt_runtime_patch() -- patch pv-ops, including paired ops.
+ *
+ * Called holding the text_mutex.
+ *
+ * Modify possibly multiple mutually-dependent pv-op callsites
+ * (ex. pv_lock_ops) using stop_machine().
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int paravirt_runtime_patch(void)
+{
+	lockdep_assert_held(&text_mutex);
+
+	if (!pv_stage.count)
+		return -EINVAL;
+
+	return text_poke_late(paravirt_worker, &pv_stage);
+}
 #endif /* CONFIG_PARAVIRT_RUNTIME */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 17/26] x86/alternatives: Add patching logic in text_poke_site()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Add actual poking and pipeline sync logic in poke_sync(). This is called
from text_poke_site().

The patching logic is similar to that in text_poke_bp_batch() where we
patch the first byte with an INT3, which serves as a barrier, then patch
the remaining bytes and then come back and fix up the first byte.

The first and the last steps are single byte writes and are thus
atomic, and the second step is protected because the INT3 serves
as a barrier.

Between each of these steps is a global pipeline sync which ensures that
remote pipelines flush out any stale opcodes that they might have cached.
This is driven from poke_sync() where the primary introduces a sync_core()
on secondary CPUs for every PATCH_SYNC_* state change. The corresponding
loop on the secondary executes in text_poke_sync_site().

Note that breakpoints are not handled yet.
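
To make the sequencing concrete, here is an illustrative view (not part
of the diff) of the three write steps for a single 5-byte CALL site,
using the poke_sync() calls from text_poke_site():

	/* Step 1: INT3 barrier into the first byte:       cc ?? ?? ?? ?? */
	poke_sync(tps, PATCH_SYNC_0, offset, &int3, INT3_INSN_SIZE);

	/* Step 2: write the tail behind the INT3:         cc xx xx xx xx */
	poke_sync(tps, PATCH_SYNC_1, offset + INT3_INSN_SIZE,
		  tp->text + INT3_INSN_SIZE, tp->native.len - INT3_INSN_SIZE);

	/* Step 3: restore the first opcode byte:          e8 xx xx xx xx */
	poke_sync(tps, PATCH_SYNC_2, offset, tp->text, INT3_INSN_SIZE);

	/* Mark the site done so the secondaries stop syncing for it. */
	poke_sync(tps, PATCH_SYNC_DONE, 0, NULL, 0);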

 CPU0                                CPUx
 ----                                ----

 patch_worker()                      patch_worker()

   /* Traversal, insn-gen */           text_poke_sync_finish()
   tps.patch_worker()                    /* wait until:
     /* = paravirt_worker() */            *  tps->state == PATCH_DONE
                                          */
                  /* for each patch-site */
     generate_paravirt()
       runtime_patch()
     text_poke_site()                    text_poke_sync_site()
        poke_sync()                       /* for each:
          __text_do_poke()                 *  PATCH_SYNC_[012] */
          sync_one()                       sync_one()
          ack()                            ack()
          wait_for_acks()

           ...                                 ...

  smp_store_release(&tps->state, PATCH_DONE)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c | 103 +++++++++++++++++++++++++++++++---
 1 file changed, 95 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 1c5acdc4f349..7fdaae9edbf0 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1441,27 +1441,57 @@ struct text_poke_state {
 
 static struct text_poke_state text_poke_state;
 
+static void wait_for_acks(struct text_poke_state *tps)
+{
+	int cpu = smp_processor_id();
+
+	cpumask_set_cpu(cpu, &tps->sync_ack_map);
+
+	/* Wait until all CPUs are known to have observed the state change. */
+	while (cpumask_weight(&tps->sync_ack_map) < tps->num_acks)
+		cpu_relax();
+}
+
 /**
- * poke_sync() - transitions to the specified state.
+ * poke_sync() - carries out one poke-step for a single site and
+ * transitions to the specified state.
+ * Called with the target populated in poking_mm and poking_addr.
  *
  * @tps - struct text_poke_state *
  * @state - one of PATCH_SYNC_* states
  * @offset - offset to be patched
  * @insns - insns to write
  * @len - length of insn sequence
+ *
+ * Returns after all CPUs have observed the state change and called
+ * sync_core().
  */
 static void poke_sync(struct text_poke_state *tps, int state, int offset,
 		      const char *insns, int len)
 {
+	if (len)
+		__text_do_poke(offset, insns, len);
 	/*
-	 * STUB: no patching or synchronization, just go through the
-	 * motions.
+	 * Stores to tps.sync_ack_map are ordered with
+	 * smp_load_acquire(tps->state) in text_poke_sync_site()
+	 * so we can safely clear the cpumask.
 	 */
 	smp_store_release(&tps->state, state);
+
+	cpumask_clear(&tps->sync_ack_map);
+
+	/*
+	 * Introduce a synchronizing instruction in local and remote insn
+	 * streams. This flushes any stale cached uops from CPU pipelines.
+	 */
+	sync_one();
+
+	wait_for_acks(tps);
 }
 
 /**
  * text_poke_site() - called on the primary to patch a single call site.
+ * The interlocking sync work on the secondary is done in text_poke_sync_site().
  *
  * Called in thread context with tps->state == PATCH_SYNC_DONE where it
  * takes tps->state through different PATCH_SYNC_* states, returning
@@ -1514,6 +1544,43 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 			  &prev_mm, ptep);
 }
 
+/**
+ * text_poke_sync_site() -- called to synchronize the CPU pipeline
+ * on secondary CPUs for each patch site.
+ *
+ * Called in thread context with tps->state == PATCH_SYNC_0.
+ *
+ * Returns after having observed tps->state == PATCH_SYNC_DONE.
+ */
+static void text_poke_sync_site(struct text_poke_state *tps)
+{
+	int cpu = smp_processor_id();
+	int prevstate = -1;
+	int acked;
+
+	/*
+	 * In thread context we arrive here expecting tps->state to move
+	 * in-order from PATCH_SYNC_{0 -> 1 -> 2} -> PATCH_SYNC_DONE.
+	 */
+	do {
+		/*
+		 * Wait until there's some work for us to do.
+		 */
+		smp_cond_load_acquire(&tps->state,
+				      prevstate != VAL);
+
+		prevstate = READ_ONCE(tps->state);
+
+		if (prevstate < PATCH_SYNC_DONE) {
+			acked = cpumask_test_cpu(cpu, &tps->sync_ack_map);
+
+			BUG_ON(acked);
+			sync_one();
+			cpumask_set_cpu(cpu, &tps->sync_ack_map);
+		}
+	} while (prevstate < PATCH_SYNC_DONE);
+}
+
 /**
  * text_poke_sync_finish() -- called to synchronize the CPU pipeline
  * on secondary CPUs for all patch sites.
@@ -1525,6 +1592,7 @@ static void text_poke_sync_finish(struct text_poke_state *tps)
 {
 	while (true) {
 		enum patch_state state;
+		int cpu = smp_processor_id();
 
 		state = READ_ONCE(tps->state);
 
@@ -1535,11 +1603,24 @@ static void text_poke_sync_finish(struct text_poke_state *tps)
 		if (state == PATCH_DONE)
 			break;
 
-		/*
-		 * Relax here while the primary makes up its mind on
-		 * whether it is done or not.
-		 */
-		cpu_relax();
+		if (state == PATCH_SYNC_DONE) {
+			/*
+			 * Ack that we've seen the end of this iteration
+			 * and then wait until everybody's ready to move
+			 * to the next iteration or exit.
+			 */
+			cpumask_set_cpu(cpu, &tps->sync_ack_map);
+			smp_cond_load_acquire(&tps->state,
+					      (state != VAL));
+		} else if (state == PATCH_SYNC_0) {
+			/*
+			 * PATCH_SYNC_1, PATCH_SYNC_2 are handled
+			 * inside text_poke_sync_site().
+			 */
+			text_poke_sync_site(tps);
+		} else {
+			BUG();
+		}
 	}
 }
 
@@ -1549,6 +1630,12 @@ static int patch_worker(void *t)
 	struct text_poke_state *tps = t;
 
 	if (cpu == tps->primary_cpu) {
+		/*
+		 * The init state is PATCH_SYNC_DONE. Wait until the
+		 * secondaries have assembled before we start patching.
+		 */
+		wait_for_acks(tps);
+
 		/*
 		 * Generates insns and calls text_poke_site() to do the poking
 		 * and sync.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 18/26] x86/alternatives: Handle BP in non-emulated text poking
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Handle breakpoints if we hit an INT3 either by way of an NMI while
patching a site in the NMI handling path, or if we are patching text
in text_poke_site() (executes on the primary), or in the pipeline sync
path in text_poke_sync_site() (executes on secondary CPUs.)
(The last two are not expected to happen, but see below.)

The handling on the primary CPU is to update the insn stream locally
such that we can return to the primary patching loop but not force
the secondary CPUs to execute sync_core().

From my reading of the Intel spec and the thread which laid down the
INT3 approach: https://lore.kernel.org/lkml/4B4D02B8.5020801@zytor.com,
skipping the sync_core() would mean that remote pipelines -- if they
have relevant uops cached -- would not see the updated instruction and
would continue to execute stale uops.

This is safe because the primary eventually gets back to the patching
loop in text_poke_site() and resumes the state-machine, re-writing
some of the insn sequences just written in the BP handling and forcing
the secondary CPUs to execute sync_core().

The handling on the secondary is to call text_poke_sync_site() just as
in thread-context; this takes care of acking the patch states so that
the primary can continue making forward progress. text_poke_sync_site()
can be called in a re-entrant fashion.

Note that this does mean that we cannot handle any patches in
text_poke_sync_site() itself since that would end up being called
recursively in the BP handler.
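
In condensed form, the BP handling (a paraphrase of poke_int3_native()
below, for orientation only) is:

	if (cpu != tps->primary_cpu) {
		/* Keep the primary's state-machine moving forward. */
		text_poke_sync_site(tps);
	} else {
		/* Finish this site locally; no remote sync_core() forced. */
		__text_do_poke(offset + INT3_INSN_SIZE, tp->text + INT3_INSN_SIZE,
			       tp->native.len - INT3_INSN_SIZE);
		__text_do_poke(offset, tp->text, INT3_INSN_SIZE);
		sync_one();
	}

	regs->ip -= INT3_INSN_SIZE;	/* re-execute from the first byte */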

Control flow diagram with the BP handler:

 CPU0-BP                             CPUx-BP
 -------                             -------

 poke_int3_native()                  poke_int3_native()
   __text_do_poke()         	       text_poke_sync_site()
   sync_one()               	        /* for state in:
                                         *  [PATCH_SYNC_y.._SYNC_DONE) */
                                         sync_one()
                                         ack()


 CPU0                                CPUx
 ----                                ----

 patch_worker()                      patch_worker()

   /* Traversal, insn-gen */           text_poke_sync_finish()
   tps.patch_worker()                    /* wait until:
     /* = paravirt_worker() */            *  tps->state == PATCH_DONE
                                          */
                 /* for each patch-site */
     generate_paravirt()
       runtime_patch()
     text_poke_site()                    text_poke_sync_site()
        poke_sync()                       /* for state in:
          __text_do_poke()                 *  [PATCH_SYNC_0..PATCH_SYNC_y]
          sync_one()                       */
          ack()                            sync_one()
          wait_for_acks()                  ack()

           ...                                 ...

   smp_store_release(&tps->state, PATCH_DONE)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c | 145 ++++++++++++++++++++++++++++++++--
 1 file changed, 137 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7fdaae9edbf0..c68d940356a2 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1055,6 +1055,8 @@ static int notrace patch_cmp(const void *key, const void *elt)
 }
 NOKPROBE_SYMBOL(patch_cmp);
 
+static void poke_int3_native(struct pt_regs *regs,
+			     struct text_poke_loc *tp);
 int notrace poke_int3_handler(struct pt_regs *regs)
 {
 	struct bp_patching_desc *desc;
@@ -1099,8 +1101,11 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 			goto out_put;
 	}
 
-	if (desc->native)
-		BUG();
+	if (desc->native) {
+		poke_int3_native(regs, tp);
+		ret = 1; /* handled */
+		goto out_put;
+	}
 
 	len = text_opcode_size(tp->emulated.opcode);
 	ip += len;
@@ -1469,8 +1474,15 @@ static void wait_for_acks(struct text_poke_state *tps)
 static void poke_sync(struct text_poke_state *tps, int state, int offset,
 		      const char *insns, int len)
 {
-	if (len)
+	if (len) {
+		/*
+		 * Note that we could hit a BP right after patching memory
+		 * below. This could happen before the state change further
+		 * down. The primary BP handler allows us to make
+		 * forward-progress in that case.
+		 */
 		__text_do_poke(offset, insns, len);
+	}
 	/*
 	 * Stores to tps.sync_ack_map are ordered with
 	 * smp_load_acquire(tps->state) in text_poke_sync_site()
@@ -1504,11 +1516,22 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 	temp_mm_state_t prev_mm;
 	pte_t *ptep;
 	int offset;
+	struct bp_patching_desc desc = {
+		.vec = tp,
+		.nr_entries = 1,
+		.native = true,
+		.refs = ATOMIC_INIT(1),
+	};
 
 	__text_poke_map(text_poke_addr(tp), tp->native.len, &prev_mm, &ptep);
 
 	offset = offset_in_page(text_poke_addr(tp));
 
+	/*
+	 * For INT3 use the same exclusion logic as BP emulation path.
+	 */
+	smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */
+
 	/*
 	 * All secondary CPUs are waiting in tps->state == PATCH_SYNC_DONE
 	 * to move to PATCH_SYNC_0. Poke the INT3 and wait until all CPUs
@@ -1537,6 +1560,19 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 	 */
 	poke_sync(tps, PATCH_SYNC_DONE, 0, NULL, 0);
 
+	/*
+	 * All CPUs have ack'd PATCH_SYNC_DONE. So there can be no
+	 * laggard CPUs executing BP handlers. Reset bp_desc.
+	 */
+	WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */
+
+	/*
+	 * We've already done the synchronization so this should not
+	 * race.
+	 */
+	if (!atomic_dec_and_test(&desc.refs))
+		atomic_cond_read_acquire(&desc.refs, !VAL);
+
 	/*
 	 * Unmap the poking_addr, poking_mm.
 	 */
@@ -1548,7 +1584,8 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
  * text_poke_sync_site() -- called to synchronize the CPU pipeline
  * on secondary CPUs for each patch site.
  *
- * Called in thread context with tps->state == PATCH_SYNC_0.
+ * Called in thread context with tps->state == PATCH_SYNC_0 and in
+ * BP context with tps->state < PATCH_SYNC_DONE.
  *
  * Returns after having observed tps->state == PATCH_SYNC_DONE.
  */
@@ -1561,6 +1598,26 @@ static void text_poke_sync_site(struct text_poke_state *tps)
 	/*
 	 * In thread context we arrive here expecting tps->state to move
 	 * in-order from PATCH_SYNC_{0 -> 1 -> 2} -> PATCH_SYNC_DONE.
+	 *
+	 * We could also arrive here in BP-context some point after having
+	 * observed bp_patching.nr_entries (and after poking the first INT3.)
+	 * This could happen by way of an NMI while we are patching a site
+	 * that'll get executed in the NMI handler, or if we hit a site
+	 * being patched in text_poke_sync_site().
+	 *
+	 * Just as thread-context, the BP handler calls text_poke_sync_site()
+	 * to keep the primary's state-machine moving forward until it has
+	 * finished patching the call-site. At that point it is safe to
+	 * unwind the contexts.
+	 *
+	 * The second case, where we are patching a site in
+	 * text_poke_sync_site(), could end up in recursive BP handlers
+	 * and is not handled.
+	 *
+	 * Note that unlike thread-context where the start state can only
+	 * be PATCH_SYNC_0, in the BP-context, the start state could be any
+	 * PATCH_SYNC_x, so long as (state < PATCH_SYNC_DONE) since once a
+	 * CPU has acked PATCH_SYNC_2, there is no INT3 left for it to observe.
 	 */
 	do {
 		/*
@@ -1571,16 +1628,88 @@ static void text_poke_sync_site(struct text_poke_state *tps)
 
 		prevstate = READ_ONCE(tps->state);
 
-		if (prevstate < PATCH_SYNC_DONE) {
-			acked = cpumask_test_cpu(cpu, &tps->sync_ack_map);
-
-			BUG_ON(acked);
+		/*
+		 * As described above, text_poke_sync_site() gets called
+		 * from both thread-context and potentially in a re-entrant
+		 * fashion in BP-context. Accordingly expect to potentially
+		 * enter and exit this loop twice.
+		 *
+		 * Concretely, this means we need to handle the case where we
+		 * see an already acked state at BP/NMI entry and, see a
+		 * state discontinuity when returning to thread-context from
+		 * BP-context which would return after having observed
+		 * tps->state == PATCH_SYNC_DONE.
+		 *
+		 * Help this along by always exiting with tps->state ==
+		 * PATCH_SYNC_DONE but without acking it. Not acking it in
+		 * text_poke_sync_site(), guarantees that the state can only
+		 * forward once all secondary CPUs have exited both thread
+		 * and BP-contexts.
+		 */
+		acked = cpumask_test_cpu(cpu, &tps->sync_ack_map);
+		if (prevstate < PATCH_SYNC_DONE && !acked) {
 			sync_one();
 			cpumask_set_cpu(cpu, &tps->sync_ack_map);
 		}
 	} while (prevstate < PATCH_SYNC_DONE);
 }
 
+static void poke_int3_native(struct pt_regs *regs,
+			     struct text_poke_loc *tp)
+{
+	int cpu = smp_processor_id();
+	struct text_poke_state *tps = &text_poke_state;
+
+	if (cpu != tps->primary_cpu) {
+		/*
+		 * We came here from the sync loop in text_poke_sync_site().
+		 * Continue syncing. The primary is waiting.
+		 */
+		text_poke_sync_site(tps);
+	} else {
+		int offset = offset_in_page(text_poke_addr(tp));
+
+		/*
+		 * We are in the primary context and have hit the INT3 barrier
+		 * either ourselves or via an NMI.
+		 *
+		 * The secondary CPUs at this time are either in the original
+		 * text_poke_sync_site() loop or after having hit an NMI->INT3
+		 * themselves in the BP text_poke_sync_site() loop.
+		 *
+		 * The minimum that we need to do here is to update the local
+		 * insn stream such that we can return to the primary loop.
+		 * Without executing sync_core() on the secondary CPUs it is
+		 * possible that some of them might be executing stale uops in
+		 * their respective pipelines.
+		 *
+		 * This should be safe because we will get back to the patching
+		 * loop in text_poke_site() in due course and will resume
+		 * the state-machine where we left off including by re-writing
+		 * some of the insns sequences just written here.
+		 *
+		 * Note that we continue to be in poking_mm context and so can
+		 * safely call __text_do_poke() here.
+		 */
+		__text_do_poke(offset + INT3_INSN_SIZE,
+			       tp->text + INT3_INSN_SIZE,
+			       tp->native.len - INT3_INSN_SIZE);
+		__text_do_poke(offset, tp->text, INT3_INSN_SIZE);
+
+		/*
+		 * We only introduce a serializing instruction locally. As
+		 * noted above, the secondary CPUs can stay where they are --
+		 * potentially executing in the now stale INT3. This is fine
+		 * because the primary will force the sync_core() on the
+		 * secondary CPUs once it returns.
+		 */
+		sync_one();
+	}
+
+	/* A new start */
+	regs->ip -= INT3_INSN_SIZE;
+}
+
 /**
  * text_poke_sync_finish() -- called to synchronize the CPU pipeline
  * on secondary CPUs for all patch sites.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 18/26] x86/alternatives: Handle BP in non-emulated text poking
@ 2020-04-08  5:03   ` Ankur Arora
  0 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: jgross, xen-devel, kvm, peterz, hpa, Ankur Arora, virtualization,
	pbonzini, namit, mhiramat, jpoimboe, mihai.carabas, bp, vkuznets,
	boris.ostrovsky

Handle breakpoints when we hit an INT3: either by way of an NMI while
patching a site used in the NMI handling path, or while patching text
in text_poke_site() (which executes on the primary), or in the pipeline
sync path in text_poke_sync_site() (which executes on secondary CPUs).
(The last two are not expected to happen, but see below.)

The handling on the primary CPU is to update the insn stream locally
such that we can return to the primary patching loop, without forcing
the secondary CPUs to execute sync_core().

From my reading of the Intel spec and the thread which laid down the
INT3 approach: https://lore.kernel.org/lkml/4B4D02B8.5020801@zytor.com,
skipping the sync_core() would mean that remote pipelines -- if they
have relevant uops cached -- would not see the updated instruction and
would continue to execute stale uops.

This is safe because the primary eventually gets back to the patching
loop in text_poke_site() and resumes the state-machine, re-writing
some of the insn sequences just written in the BP handling and forcing
the secondary CPUs to execute sync_core().
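
For reference, the per-site state-machine that text_poke_site() drives
looks roughly like the following. This is a condensed sketch pieced
together from this and the preceding patches (the PATCH_SYNC_2 step is
inferred from the state names), not the verbatim function:

  offset = offset_in_page(text_poke_addr(tp));

  /* 1) Make the site trap: INT3 over the first byte. */
  poke_sync(tps, PATCH_SYNC_0, offset, &int3, INT3_INSN_SIZE);

  /* 2) Rewrite everything after the INT3. */
  poke_sync(tps, PATCH_SYNC_1, offset + INT3_INSN_SIZE,
            tp->text + INT3_INSN_SIZE, tp->native.len - INT3_INSN_SIZE);

  /* 3) Replace the INT3 with the first byte of the target sequence. */
  poke_sync(tps, PATCH_SYNC_2, offset, tp->text, INT3_INSN_SIZE);

  /* 4) Done with this site; secondaries can leave text_poke_sync_site(). */
  poke_sync(tps, PATCH_SYNC_DONE, 0, NULL, 0);

Each poke_sync() writes the given bytes, publishes the next state and
waits for every secondary CPU to execute sync_core() and ack it before
moving on; the BP handling below only has to keep that loop going.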

The handling on the secondary is to call text_poke_sync_site() just as
in thread-context; this includes acking the patch states so that the
primary can continue making forward progress. This can be called in a
re-entrant fashion.

Note that this does mean that we cannot handle any patches in
text_poke_sync_site() itself since that would end up being called
recursively in the BP handler.

Control flow diagram with the BP handler:

 CPU0-BP                             CPUx-BP
 -------                             -------

 poke_int3_native()                  poke_int3_native()
   __text_do_poke()         	       text_poke_sync_site()
   sync_one()               	        /* for state in:
                                         *  [PATCH_SYNC_y.._SYNC_DONE) */
                                         sync_one()
                                         ack()


 CPU0                                CPUx
 ----                                ----

 patch_worker()                      patch_worker()

   /* Traversal, insn-gen */           text_poke_sync_finish()
   tps.patch_worker()                    /* wait until:
     /* = paravirt_worker() */            *  tps->state == PATCH_DONE
                                          */
                 /* for each patch-site */
     generate_paravirt()
       runtime_patch()
     text_poke_site()                    text_poke_sync_site()
        poke_sync()                       /* for state in:
          __text_do_poke()                 *  [PATCH_SYNC_0..PATCH_SYNC_y]
          sync_one()                       */
          ack()                            sync_one()
          wait_for_acks()                  ack()

           ...                                 ...

   smp_store_release(&tps->state, PATCH_DONE)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/alternative.c | 145 ++++++++++++++++++++++++++++++++--
 1 file changed, 137 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7fdaae9edbf0..c68d940356a2 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1055,6 +1055,8 @@ static int notrace patch_cmp(const void *key, const void *elt)
 }
 NOKPROBE_SYMBOL(patch_cmp);
 
+static void poke_int3_native(struct pt_regs *regs,
+			     struct text_poke_loc *tp);
 int notrace poke_int3_handler(struct pt_regs *regs)
 {
 	struct bp_patching_desc *desc;
@@ -1099,8 +1101,11 @@ int notrace poke_int3_handler(struct pt_regs *regs)
 			goto out_put;
 	}
 
-	if (desc->native)
-		BUG();
+	if (desc->native) {
+		poke_int3_native(regs, tp);
+		ret = 1; /* handled */
+		goto out_put;
+	}
 
 	len = text_opcode_size(tp->emulated.opcode);
 	ip += len;
@@ -1469,8 +1474,15 @@ static void wait_for_acks(struct text_poke_state *tps)
 static void poke_sync(struct text_poke_state *tps, int state, int offset,
 		      const char *insns, int len)
 {
-	if (len)
+	if (len) {
+		/*
+		 * Note that we could hit a BP right after patching memory
+		 * below. This could happen before the state change further
+		 * down. The primary BP handler allows us to make
+		 * forward progress in that case.
+		 */
 		__text_do_poke(offset, insns, len);
+	}
 	/*
 	 * Stores to tps.sync_ack_map are ordered with
 	 * smp_load_acquire(tps->state) in text_poke_sync_site()
@@ -1504,11 +1516,22 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 	temp_mm_state_t prev_mm;
 	pte_t *ptep;
 	int offset;
+	struct bp_patching_desc desc = {
+		.vec = tp,
+		.nr_entries = 1,
+		.native = true,
+		.refs = ATOMIC_INIT(1),
+	};
 
 	__text_poke_map(text_poke_addr(tp), tp->native.len, &prev_mm, &ptep);
 
 	offset = offset_in_page(text_poke_addr(tp));
 
+	/*
+	 * For INT3 use the same exclusion logic as BP emulation path.
+	 */
+	smp_store_release(&bp_desc, &desc); /* rcu_assign_pointer */
+
 	/*
 	 * All secondary CPUs are waiting in tps->state == PATCH_SYNC_DONE
 	 * to move to PATCH_SYNC_0. Poke the INT3 and wait until all CPUs
@@ -1537,6 +1560,19 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 	 */
 	poke_sync(tps, PATCH_SYNC_DONE, 0, NULL, 0);
 
+	/*
+	 * All CPUs have ack'd PATCH_SYNC_DONE. So there can be no
+	 * laggard CPUs executing BP handlers. Reset bp_desc.
+	 */
+	WRITE_ONCE(bp_desc, NULL); /* RCU_INIT_POINTER */
+
+	/*
+	 * We've already done the synchronization so this should not
+	 * race.
+	 */
+	if (!atomic_dec_and_test(&desc.refs))
+		atomic_cond_read_acquire(&desc.refs, !VAL);
+
 	/*
 	 * Unmap the poking_addr, poking_mm.
 	 */
@@ -1548,7 +1584,8 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
  * text_poke_sync_site() -- called to synchronize the CPU pipeline
  * on secondary CPUs for each patch site.
  *
- * Called in thread context with tps->state == PATCH_SYNC_0.
+ * Called in thread context with tps->state == PATCH_SYNC_0 and in
+ * BP context with tps->state < PATCH_SYNC_DONE.
  *
  * Returns after having observed tps->state == PATCH_SYNC_DONE.
  */
@@ -1561,6 +1598,26 @@ static void text_poke_sync_site(struct text_poke_state *tps)
 	/*
 	 * In thread context we arrive here expecting tps->state to move
 	 * in-order from PATCH_SYNC_{0 -> 1 -> 2} -> PATCH_SYNC_DONE.
+	 *
+	 * We could also arrive here in BP-context some point after having
+	 * observed bp_patching.nr_entries (and after poking the first INT3.)
+	 * This could happen by way of an NMI while we are patching a site
+	 * that'll get executed in the NMI handler, or if we hit a site
+	 * being patched in text_poke_sync_site().
+	 *
+	 * Just as in thread-context, the BP handler calls text_poke_sync_site()
+	 * to keep the primary's state-machine moving forward until it has
+	 * finished patching the call-site. At that point it is safe to
+	 * unwind the contexts.
+	 *
+	 * The second case, where we are patching a site in
+	 * text_poke_sync_site(), could end up in recursive BP handlers
+	 * and is not handled.
+	 *
+	 * Note that unlike thread-context where the start state can only
+	 * be PATCH_SYNC_0, in the BP-context, the start state could be any
+	 * PATCH_SYNC_x, so long as (state < PATCH_SYNC_DONE) since once a
+	 * CPU has acked PATCH_SYNC_2, there is no INT3 left for it to observe.
 	 */
 	do {
 		/*
@@ -1571,16 +1628,88 @@ static void text_poke_sync_site(struct text_poke_state *tps)
 
 		prevstate = READ_ONCE(tps->state);
 
-		if (prevstate < PATCH_SYNC_DONE) {
-			acked = cpumask_test_cpu(cpu, &tps->sync_ack_map);
-
-			BUG_ON(acked);
+		/*
+		 * As described above, text_poke_sync_site() gets called
+		 * from both thread-context and potentially in a re-entrant
+		 * fashion in BP-context. Accordingly expect to potentially
+		 * enter and exit this loop twice.
+		 *
+		 * Concretely, this means we need to handle the case where we
+		 * see an already acked state at BP/NMI entry and see a
+		 * state discontinuity when returning to thread-context from
+		 * BP-context which would return after having observed
+		 * tps->state == PATCH_SYNC_DONE.
+		 *
+		 * Help this along by always exiting with tps->state ==
+		 * PATCH_SYNC_DONE but without acking it. Not acking it in
+		 * text_poke_sync_site() guarantees that the state can only
+		 * move forward once all secondary CPUs have exited both thread
+		 * and BP-contexts.
+		 */
+		acked = cpumask_test_cpu(cpu, &tps->sync_ack_map);
+		if (prevstate < PATCH_SYNC_DONE && !acked) {
 			sync_one();
 			cpumask_set_cpu(cpu, &tps->sync_ack_map);
 		}
 	} while (prevstate < PATCH_SYNC_DONE);
 }
 
+static void poke_int3_native(struct pt_regs *regs,
+			     struct text_poke_loc *tp)
+{
+	int cpu = smp_processor_id();
+	struct text_poke_state *tps = &text_poke_state;
+
+	if (cpu != tps->primary_cpu) {
+		/*
+		 * We came here from the sync loop in text_poke_sync_site().
+		 * Continue syncing. The primary is waiting.
+		 */
+		text_poke_sync_site(tps);
+	} else {
+		int offset = offset_in_page(text_poke_addr(tp));
+
+		/*
+		 * We are in the primary context and have hit the INT3 barrier
+		 * either ourselves or via an NMI.
+		 *
+		 * The secondary CPUs at this time are either in the original
+		 * text_poke_sync_site() loop or after having hit an NMI->INT3
+		 * themselves in the BP text_poke_sync_site() loop.
+		 *
+		 * The minimum that we need to do here is to update the local
+		 * insn stream such that we can return to the primary loop.
+		 * Without executing sync_core() on the secondary CPUs it is
+		 * possible that some of them might be executing stale uops in
+		 * their respective pipelines.
+		 *
+		 * This should be safe because we will get back to the patching
+		 * loop in text_poke_site() in due course and will resume
+		 * the state-machine where we left off, including re-writing
+		 * some of the insn sequences just written here.
+		 *
+		 * Note that we continue to be in poking_mm context and so can
+		 * safely call __text_do_poke() here.
+		 */
+		__text_do_poke(offset + INT3_INSN_SIZE,
+			       tp->text + INT3_INSN_SIZE,
+			       tp->native.len - INT3_INSN_SIZE);
+		__text_do_poke(offset, tp->text, INT3_INSN_SIZE);
+
+		/*
+		 * We only introduce a serializing instruction locally. As
+		 * noted above, the secondary CPUs can stay where they are --
+		 * potentially executing in the now stale INT3. This is fine
+		 * because the primary will force the sync_core() on the
+		 * secondary CPUs once it returns.
+		 */
+		sync_one();
+	}
+
+	/* A new start */
+	regs->ip -= INT3_INSN_SIZE;
+}
+
 /**
  * text_poke_sync_finish() -- called to synchronize the CPU pipeline
  * on secondary CPUs for all patch sites.
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 19/26] x86/alternatives: NMI safe runtime patching
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Runtime patching can deadlock with multiple simultaneous NMIs.
This can happen while patching inter-dependent pv-ops which are
used in the NMI path (ex pv_lock_ops):

 CPU0   			    CPUx
 ----                               ----

 patch_worker()                     patch_worker()

   /* Traversal, insn-gen */          text_poke_sync_finish()
   tps.patch_worker()                   /* wait until:
     /* = paravirt_worker() */           *  tps->state == PATCH_DONE
                                         */
           /* start-patching:lock.spin_unlock */
      generate_paravirt()
        runtime_patch()

      text_poke_site()                  text_poke_sync_site()
        poke_sync()                      /* for state in:
          __text_do_poke()                *  PATCH_SYNC_[012]
	  ==NMI==                         */
	                                 ==NMI==
         tries-to-acquire:nmi_lock       acquires:nmi_lock
                                         tries-to-release:nmi_lock
					 ==BP==
   				         text_poke_sync_site()

      /* waiting-for:nmi_lock */    /* waiting-for:patched-spin_unlock() */

A similar deadlock exists if two secondary CPUs get an NMI as well.

Fix this by patching NMI-unsafe ops in an NMI context. Given that the
NMI entry code ensures that NMIs do not nest, we are guaranteed that
this can be done atomically.

We do this by registering a local NMI handler (text_poke_nmi()) and
triggering a local NMI on the primary (via patch_worker_nmi()) which
then calls the same worker (tps->patch_worker()) as in thread-context.

On the secondary, we continue with the pipeline sync loop (via
text_poke_sync_finish()) in thread-context; however, if there is an
NMI on the secondary, we call text_poke_sync_finish() in the handler
which continues the work that was being done in thread-context.

Also note that text_poke_nmi() always executes first so we know that
it takes priority over any arbitrary code executing in the installed
NMI handlers.

 CPU0                                CPUx
 ----                                ----

 patch_worker(nmi=true)              patch_worker(nmi=true)

   patch_worker_nmi() -> triggers NMI   text_poke_sync_finish()
   /* wait for return from NMI */         /* wait until:
            ...                            *  tps->state == PATCH_DONE
                                           */

   smp_store_release(&tps->state,
                     PATCH_DONE)
                                          /* for each patch-site */

                                          text_poke_sync_site()
 CPU0-NMI                                 /* for each:
 --------                                  *  PATCH_SYNC_[012]
                                           */
 text_poke_nmi()                            sync_one()
   /* Traversal, insn-gen */                ack()
   tps.patch_worker()
   /* = paravirt_worker() */                ...

   /* for each patch-site */

     generate_paravirt()
       runtime_patch()

     text_poke_site()
       poke_sync()
         __text_do_poke()
         sync_one()
         ack()
         wait_for_acks()

          ...
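
Condensed into code, the nmi=true flow on the primary amounts to roughly
the following (a sketch assembled from text_poke_late(), patch_worker_nmi()
and text_poke_nmi() below; error handling and the stop_machine() plumbing
are omitted):

  /* text_poke_late(), before stop_machine() */
  register_nmi_handler(NMI_LOCAL, text_poke_nmi, NMI_FLAG_FIRST,
                       "text_poke_nmi");

  /* patch_worker_nmi(), on the primary CPU */
  atomic_set(&text_poke_state.nmi_work, 1);
  apic->send_IPI(smp_processor_id(), NMI_VECTOR);       /* self NMI */

  /* text_poke_nmi() runs tps->patch_worker() and drops nmi_work */
  atomic_cond_read_acquire(&text_poke_state.nmi_work, !VAL);

  /* text_poke_late(), after stop_machine() returns */
  unregister_nmi_handler(NMI_LOCAL, "text_poke_nmi");

Secondary CPUs keep running text_poke_sync_finish() in thread context; if
one of them takes an unrelated NMI, text_poke_nmi() continues the same
sync work from the NMI handler and then returns NMI_DONE so the NMI is
still seen by its real handler.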


Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/text-patching.h |   2 +-
 arch/x86/kernel/alternative.c        | 120 ++++++++++++++++++++++++++-
 2 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e86709a8287e..9ba329bf9479 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -22,7 +22,7 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_runtime	NULL
 #define __parainstructions_runtime_end	NULL
 #else
-int paravirt_runtime_patch(void);
+int paravirt_runtime_patch(bool nmi);
 #endif
 
 /*
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index c68d940356a2..385c3e6ea925 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1442,6 +1442,14 @@ struct text_poke_state {
 
 	unsigned int primary_cpu; /* CPU doing the patching. */
 	unsigned int num_acks; /* Number of Acks needed. */
+
+	/*
+	 * To synchronize with the NMI handler.
+	 */
+	atomic_t nmi_work;
+
+	/* Ensure this is patched atomically against NMIs. */
+	bool nmi_context;
 };
 
 static struct text_poke_state text_poke_state;
@@ -1715,6 +1723,7 @@ static void poke_int3_native(struct pt_regs *regs,
  * on secondary CPUs for all patch sites.
  *
  * Called in thread context with tps->state == PATCH_SYNC_DONE.
+ * Also might be called from NMI context with an arbitrary tps->state.
  * Returns with tps->state == PATCH_DONE.
  */
 static void text_poke_sync_finish(struct text_poke_state *tps)
@@ -1741,6 +1750,12 @@ static void text_poke_sync_finish(struct text_poke_state *tps)
 			cpumask_set_cpu(cpu, &tps->sync_ack_map);
 			smp_cond_load_acquire(&tps->state,
 					      (state != VAL));
+		} else if (in_nmi() && (state & PATCH_SYNC_x)) {
+			/*
+			 * Called in case of NMI so we should be ready
+			 * to be called with any PATCH_SYNC_x.
+			 */
+			text_poke_sync_site(tps);
 		} else if (state == PATCH_SYNC_0) {
 			/*
 			 * PATCH_SYNC_1, PATCH_SYNC_2 are handled
@@ -1753,6 +1768,91 @@ static void text_poke_sync_finish(struct text_poke_state *tps)
 	}
 }
 
+/*
+ * text_poke_nmi() - the primary CPU comes here (via a self NMI); secondary
+ * CPUs come here only if they take an NMI.
+ *
+ * By placing this NMI handler first, we restrict execution of any other
+ * NMI code that might be under patching.
+ * Local NMI handling also does not go through any locking code so it
+ * should be safe to install one.
+ *
+ * In both these roles the state-machine is identical to the one that
+ * we had in task context.
+ */
+static int text_poke_nmi(unsigned int val, struct pt_regs *regs)
+{
+	int ret, cpu = smp_processor_id();
+	struct text_poke_state *tps = &text_poke_state;
+
+	/*
+	 * We came here because there's a text-poke handler
+	 * installed. Get out if there's no work assigned yet.
+	 */
+	if (atomic_read(&tps->nmi_work) == 0)
+		return NMI_DONE;
+
+	if (cpu == tps->primary_cpu) {
+		/*
+		 * Do what we came here for. We can safely patch: any
+		 * secondary CPUs executing in NMI context have been
+		 * captured in the code below and are doing useful
+		 * work.
+		 */
+		tps->patch_worker(tps);
+
+		/*
+		 * Both the primary and the secondary CPUs are done (in NMI
+		 * or thread context.) Mark work done so any future NMIs can
+		 * skip this and go to the real handler.
+		 */
+		atomic_dec(&tps->nmi_work);
+
+		/*
+		 * The NMI was self-induced, consume it.
+		 */
+		ret = NMI_HANDLED;
+	} else {
+		/*
+		 * Unexpected NMI on a secondary CPU: do sync_core()
+		 * work until done.
+		 */
+		text_poke_sync_finish(tps);
+
+		/*
+		 * The NMI was spontaneous, not self-induced.
+		 * Don't consume it.
+		 */
+		ret = NMI_DONE;
+	}
+
+	return ret;
+}
+
+/*
+ * patch_worker_nmi() - hands the patching work off to the NMI handler
+ * (registered in text_poke_late()) via a self NMI.
+ * This stops any NMIs from interrupting any code that might
+ * be getting patched.
+ */
+static void __maybe_unused patch_worker_nmi(void)
+{
+	atomic_set(&text_poke_state.nmi_work, 1);
+	/*
+	 * We could just use apic->send_IPI_self here. However, for reasons
+	 * that I don't understand, apic->send_IPI() or apic->send_IPI_mask()
+	 * work but apic->send_IPI_self (which internally does apic_write())
+	 * does not.
+	 */
+	apic->send_IPI(smp_processor_id(), NMI_VECTOR);
+
+	/*
+	 * Barrier to ensure that we do actually execute the NMI
+	 * before exiting.
+	 */
+	atomic_cond_read_acquire(&text_poke_state.nmi_work, !VAL);
+}
+
 static int patch_worker(void *t)
 {
 	int cpu = smp_processor_id();
@@ -1769,7 +1869,10 @@ static int patch_worker(void *t)
 		 * Generates insns and calls text_poke_site() to do the poking
 		 * and sync.
 		 */
-		tps->patch_worker(tps);
+		if (!tps->nmi_context)
+			tps->patch_worker(tps);
+		else
+			patch_worker_nmi();
 
 		/*
 		 * We are done patching. Switch the state to PATCH_DONE
@@ -1790,7 +1893,8 @@ static int patch_worker(void *t)
  *
  * Return: 0 on success, -errno on failure.
  */
-static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
+static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage,
+					 bool nmi)
 {
 	int ret;
 
@@ -1807,12 +1911,20 @@ static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
 	text_poke_state.state = PATCH_SYNC_DONE; /* Start state */
 	text_poke_state.primary_cpu = smp_processor_id();
 
+	text_poke_state.nmi_context = nmi;
+
+	if (nmi)
+		register_nmi_handler(NMI_LOCAL, text_poke_nmi,
+				     NMI_FLAG_FIRST, "text_poke_nmi");
 	/*
 	 * Run the worker on all online CPUs. Don't need to do anything
 	 * for offline CPUs as they come back online with a clean cache.
 	 */
 	ret = stop_machine(patch_worker, &text_poke_state, cpu_online_mask);
 
+	if (nmi)
+		unregister_nmi_handler(NMI_LOCAL, "text_poke_nmi");
+
 	return ret;
 }
 
@@ -1957,13 +2069,13 @@ static void paravirt_worker(struct text_poke_state *tps)
  *
  * Return: 0 on success, -errno on failure.
  */
-int paravirt_runtime_patch(void)
+int paravirt_runtime_patch(bool nmi)
 {
 	lockdep_assert_held(&text_mutex);
 
 	if (!pv_stage.count)
 		return -EINVAL;
 
-	return text_poke_late(paravirt_worker, &pv_stage);
+	return text_poke_late(paravirt_worker, &pv_stage, nmi);
 }
 #endif /* CONFIG_PARAVIRT_RUNTIME */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 20/26] x86/paravirt: Enable pv-spinlocks in runtime_patch()
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Enable runtime patching of paravirt spinlocks. These can be trivially
enabled because pv_lock_ops are never preemptible -- preemption is
disabled at entry to spin_lock*().
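
To recap why that holds: the generic lock entry path disables preemption
before any pv-op can run. Roughly (paraphrased from include/linux/spinlock.h
and include/linux/spinlock_api_smp.h; shown for reference, not part of this
patch):

  static __always_inline void spin_lock(spinlock_t *lock)
  {
          raw_spin_lock(&lock->rlock);    /* -> __raw_spin_lock() */
  }

  static inline void __raw_spin_lock(raw_spinlock_t *lock)
  {
          preempt_disable();              /* pv_lock_ops run with preemption off */
          spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
          LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
  }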

Note that a particular CPU might get preempted in the host, but because
runtime patching is done via stop_machine(), the migration thread would
flush out any kernel threads preempted in the host.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/paravirt.h  | 10 +++++-----
 arch/x86/kernel/paravirt_patch.c | 12 ++++++++++++
 kernel/locking/lock_events.c     |  2 +-
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 694d8daf4983..cb3d0a91c060 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -642,27 +642,27 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock,
 							u32 val)
 {
-	PVOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val);
+	PVRTOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val);
 }
 
 static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	PVOP_VCALLEE1(lock.queued_spin_unlock, lock);
+	PVRTOP_VCALLEE1(lock.queued_spin_unlock, lock);
 }
 
 static __always_inline void pv_wait(u8 *ptr, u8 val)
 {
-	PVOP_VCALL2(lock.wait, ptr, val);
+	PVRTOP_VCALL2(lock.wait, ptr, val);
 }
 
 static __always_inline void pv_kick(int cpu)
 {
-	PVOP_VCALL1(lock.kick, cpu);
+	PVRTOP_VCALL1(lock.kick, cpu);
 }
 
 static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
-	return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
+	return PVRTOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
 }
 
 void __raw_callee_save___native_queued_spin_unlock(struct qspinlock *lock);
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index 3eb8c0e720b4..3f8606f2811c 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -152,6 +152,18 @@ int runtime_patch(u8 type, void *insn_buff, void *op,
 
 	/* Nothing whitelisted for now. */
 	switch (type) {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+	/*
+	 * Preemption is always disabled in the lifetime of a spinlock
+	 * (whether held or while waiting to acquire.)
+	 */
+	case PARAVIRT_PATCH(lock.queued_spin_lock_slowpath):
+	case PARAVIRT_PATCH(lock.queued_spin_unlock):
+	case PARAVIRT_PATCH(lock.wait):
+	case PARAVIRT_PATCH(lock.kick):
+	case PARAVIRT_PATCH(lock.vcpu_is_preempted):
+		break;
+#endif
 	default:
 		pr_warn("type=%d unsuitable for runtime-patching\n", type);
 		return -EINVAL;
diff --git a/kernel/locking/lock_events.c b/kernel/locking/lock_events.c
index fa2c2f951c6b..c3057e82e6f9 100644
--- a/kernel/locking/lock_events.c
+++ b/kernel/locking/lock_events.c
@@ -115,7 +115,7 @@ static const struct file_operations fops_lockevent = {
 	.llseek = default_llseek,
 };
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && !defined(CONFIG_PARAVIRT_RUNTIME)
 #include <asm/paravirt.h>
 
 static bool __init skip_lockevent(const char *name)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 21/26] x86/alternatives: Paravirt runtime selftest
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Add a selftest that triggers paravirt_runtime_patch() which
toggles between the paravirt and native pv_lock_ops.

The selftest also registers an NMI handler, which exercises the
patched pv-ops via spin-lock operations. These are triggered via
artificially sent NMIs.

And last, introduce patch sites in the primary and secondary
patching code which are hit during the patching process.
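
For reference, the test is driven from debugfs (assuming it is mounted at
the usual /sys/kernel/debug): write a value of 0-3 to pv_selftest/nmi to
select how aggressively the NMI handler exercises the patched ops, then
write anything to pv_selftest/toggle to run a single native <-> paravirt
switch. Each write to toggle flips cond_state, so successive writes switch
back and forth.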

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig.debug        |  13 ++
 arch/x86/kernel/Makefile      |   1 +
 arch/x86/kernel/alternative.c |  20 +++
 arch/x86/kernel/kvm.c         |   4 +-
 arch/x86/kernel/pv_selftest.c | 264 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/pv_selftest.h |  15 ++
 6 files changed, 315 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/pv_selftest.c
 create mode 100644 arch/x86/kernel/pv_selftest.h

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 2e74690b028a..82a8e3fa68c7 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -252,6 +252,19 @@ config X86_DEBUG_FPU
 
 	  If unsure, say N.
 
+config DEBUG_PARAVIRT_SELFTEST
+	bool "Enable paravirt runtime selftest"
+	depends on PARAVIRT
+	depends on PARAVIRT_RUNTIME
+	depends on PARAVIRT_SPINLOCKS
+	depends on KVM_GUEST
+	help
+	  This option enables sanity testing of the runtime paravirtualized
+	  patching code. Triggered via debugfs.
+
+	  Might help diagnose patching problems in different
+	  configurations and loads.
+
 config PUNIT_ATOM_DEBUG
 	tristate "ATOM Punit debug driver"
 	depends on PCI
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ba89cabe5fcf..ed3c93681f12 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_APB_TIMER)		+= apb_timer.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
 obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
+obj-$(CONFIG_DEBUG_PARAVIRT_SELFTEST) += pv_selftest.o
 
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch.o
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 385c3e6ea925..26407d7a54db 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -26,6 +26,7 @@
 #include <asm/insn.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include "pv_selftest.h"
 
 int __read_mostly alternatives_patched;
 
@@ -1549,6 +1550,12 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 	 */
 	poke_sync(tps, PATCH_SYNC_0, offset, &int3, INT3_INSN_SIZE);
 
+	/*
+	 * We have an INT3 in place; execute a contrived selftest that
+	 * has an insn sequence that is under patching.
+	 */
+	pv_selftest_primary();
+
 	/* Poke remaining */
 	poke_sync(tps, PATCH_SYNC_1, offset + INT3_INSN_SIZE,
 		  tp->text + INT3_INSN_SIZE, tp->native.len - INT3_INSN_SIZE);
@@ -1634,6 +1641,19 @@ static void text_poke_sync_site(struct text_poke_state *tps)
 		smp_cond_load_acquire(&tps->state,
 				      prevstate != VAL);
 
+		/*
+		 * Send an NMI to one of the other CPUs.
+		 */
+		pv_selftest_send_nmi();
+
+		/*
+		 * We have an INT3 in place; execute a contrived selftest that
+		 * has an insn sequence that is under patching.
+		 *
+		 * Note that this function is also called from BP fixup but
+		 * is just an NOP when called from there.
+		 */
+		pv_selftest_secondary();
 		prevstate = READ_ONCE(tps->state);
 
 		/*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6efe0410fb72..e56d263159d7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -779,7 +779,7 @@ arch_initcall(kvm_alloc_cpumask);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
-static void kvm_kick_cpu(int cpu)
+void kvm_kick_cpu(int cpu)
 {
 	int apicid;
 	unsigned long flags = 0;
@@ -790,7 +790,7 @@ static void kvm_kick_cpu(int cpu)
 
 #include <asm/qspinlock.h>
 
-static void kvm_wait(u8 *ptr, u8 val)
+void kvm_wait(u8 *ptr, u8 val)
 {
 	unsigned long flags;
 
diff --git a/arch/x86/kernel/pv_selftest.c b/arch/x86/kernel/pv_selftest.c
new file mode 100644
index 000000000000..e522f444bd6e
--- /dev/null
+++ b/arch/x86/kernel/pv_selftest.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/delay.h>
+#include <linux/irq.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/memory.h>
+#include <linux/nmi.h>
+#include <linux/uaccess.h>
+#include <asm/apic.h>
+#include <asm/text-patching.h>
+#include <asm/paravirt.h>
+#include <asm/paravirt_types.h>
+#include "pv_selftest.h"
+
+static int nmi_selftest;
+static bool cond_state;
+
+#define SELFTEST_PARAVIRT	1
+static int test_mode;
+
+/*
+ * Mark this and the following functions __always_inline to ensure
+ * we generate multiple patch sites that can be hit independently
+ * in thread, NMI etc contexts.
+ */
+static __always_inline void selftest_pv(void)
+{
+	struct qspinlock test;
+
+	memset(&test, 0, sizeof(test));
+
+	test.locked = _Q_LOCKED_VAL;
+
+	/*
+	 * Sits directly in the path of the test.
+	 *
+	 * The primary sets up an INT3 instruction at pv_queued_spin_unlock().
+	 * Both the primary and secondary CPUs should hit that in both
+	 * thread and NMI contexts.
+	 *
+	 * Additionally, this also gets inlined in nmi_callback() so we
+	 * should hit this with nmi_selftest.
+	 *
+	 * The fixup takes place in poke_int3_native().
+	 */
+	pv_queued_spin_unlock(&test);
+}
+
+static __always_inline void patch_selftest(void)
+{
+	if (test_mode == SELFTEST_PARAVIRT)
+		selftest_pv();
+}
+
+static DEFINE_PER_CPU(int, selftest_count);
+void pv_selftest_secondary(void)
+{
+	/*
+	 * On the secondary we execute the same code in both the
+	 * thread-context and the BP-context and so would hit this
+	 * recursively if we ran it from the fixup context.
+	 *
+	 * So we trigger the selftest only if it's not ongoing already
+	 * (thus allowing the thread or NMI context, but excluding
+	 * the INT3 handling path.)
+	 */
+	if (this_cpu_read(selftest_count))
+		return;
+
+	this_cpu_inc(selftest_count);
+
+	patch_selftest();
+
+	this_cpu_dec(selftest_count);
+}
+
+void pv_selftest_primary(void)
+{
+	patch_selftest();
+}
+
+/*
+ * We only come here if nmi_selftest > 0.
+ *  - nmi_selftest >= 1: execute a pv-op that will be patched
+ *  - nmi_selftest >= 2: execute a paired pv-op that is also contended
+ *  - nmi_selftest >= 3: add lock contention
+ */
+static int nmi_callback(unsigned int val, struct pt_regs *regs)
+{
+	static DEFINE_SPINLOCK(nmi_spin);
+
+	if (!nmi_selftest)
+		goto out;
+
+	patch_selftest();
+
+	if (nmi_selftest >= 2) {
+		/*
+		 * Depending on whether CONFIG_[UN]INLINE_SPIN_* are
+		 * defined or not, these would get patched or just
+		 * create race conditions between NMIs.
+		 */
+		spin_lock(&nmi_spin);
+
+		/* Dilate the critical section to force contention. */
+		if (nmi_selftest >= 3)
+			udelay(1);
+
+		spin_unlock(&nmi_spin);
+	}
+
+	/*
+	 * nmi_selftest > 0, but we should really have a bitmap to check
+	 * whether this NMI was really destined for us or not.
+	 */
+	return NMI_HANDLED;
+out:
+	return NMI_DONE;
+}
+
+void pv_selftest_register(void)
+{
+	register_nmi_handler(NMI_LOCAL, nmi_callback,
+			     0, "paravirt_nmi_selftest");
+}
+
+void pv_selftest_unregister(void)
+{
+	unregister_nmi_handler(NMI_LOCAL, "paravirt_nmi_selftest");
+}
+
+void pv_selftest_send_nmi(void)
+{
+	int cpu = smp_processor_id();
+	/* Only send from thread context, not from NMI or INT3 context. */
+	if (nmi_selftest && !in_interrupt())
+		apic->send_IPI((cpu + 1) % num_online_cpus(), NMI_VECTOR);
+}
+
+/*
+ * Just declare these locally here instead of having them be
+ * exposed to the whole world.
+ */
+void kvm_wait(u8 *ptr, u8 val);
+void kvm_kick_cpu(int cpu);
+bool __raw_callee_save___kvm_vcpu_is_preempted(long cpu);
+static void pv_spinlocks(void)
+{
+	paravirt_stage_alt(cond_state,
+			   lock.queued_spin_lock_slowpath,
+			   __pv_queued_spin_lock_slowpath);
+	paravirt_stage_alt(cond_state, lock.queued_spin_unlock.func,
+			   PV_CALLEE_SAVE(__pv_queued_spin_unlock).func);
+	paravirt_stage_alt(cond_state, lock.wait, kvm_wait);
+	paravirt_stage_alt(cond_state, lock.kick, kvm_kick_cpu);
+
+	paravirt_stage_alt(cond_state,
+			   lock.vcpu_is_preempted.func,
+			   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted).func);
+}
+
+void pv_trigger(void)
+{
+	bool nmi_mode = nmi_selftest ? true : false;
+	int ret;
+
+	pr_debug("%s: nmi=%d; NMI-mode=%d\n", __func__, nmi_selftest, nmi_mode);
+
+	mutex_lock(&text_mutex);
+
+	paravirt_stage_zero();
+	pv_spinlocks();
+
+	/*
+	 * paravirt patching for pv_locks can potentially deadlock
+	 * if we are running with nmi_mode=false and we get an NMI.
+	 *
+	 * For the sake of testing that path, we risk it. However, if
+	 * we are generating synthetic NMIs (nmi_selftest > 0) then
+	 * run with nmi_mode=true.
+	 */
+	ret = paravirt_runtime_patch(nmi_mode);
+
+	/*
+	 * Flip the state so we switch the pv_lock_ops on the next test.
+	 */
+	cond_state = !cond_state;
+
+	mutex_unlock(&text_mutex);
+
+	pr_debug("%s: nmi=%d; NMI-mode=%d, ret=%d\n", __func__, nmi_selftest,
+		 nmi_mode, ret);
+}
+
+static void pv_selftest_trigger(void)
+{
+	test_mode = SELFTEST_PARAVIRT;
+	pv_trigger();
+}
+
+static ssize_t pv_selftest_write(struct file *file, const char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	pv_selftest_register();
+	pv_selftest_trigger();
+	pv_selftest_unregister();
+
+	return count;
+}
+
+static ssize_t pv_nmi_read(struct file *file, char __user *ubuf,
+			   size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = snprintf(buf, sizeof(buf), "%d\n", nmi_selftest);
+	return simple_read_from_buffer(ubuf, count, ppos, buf, len);
+}
+
+static ssize_t pv_nmi_write(struct file *file, const char __user *ubuf,
+			    size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+	unsigned int enabled;
+
+	len = min(sizeof(buf) - 1, count);
+	if (copy_from_user(buf, ubuf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtouint(buf, 0, &enabled))
+		return -EINVAL;
+
+	nmi_selftest = enabled > 3 ? 3 : enabled;
+
+	return count;
+}
+
+static const struct file_operations pv_selftest_fops = {
+	.read = NULL,
+	.write = pv_selftest_write,
+	.llseek = default_llseek,
+};
+
+static const struct file_operations pv_nmi_fops = {
+	.read = pv_nmi_read,
+	.write = pv_nmi_write,
+	.llseek = default_llseek,
+};
+
+static int __init pv_selftest_init(void)
+{
+	struct dentry *d = debugfs_create_dir("pv_selftest", NULL);
+
+	debugfs_create_file("toggle", 0600, d, NULL, &pv_selftest_fops);
+	debugfs_create_file("nmi", 0600, d, NULL, &pv_nmi_fops);
+
+	return 0;
+}
+
+late_initcall(pv_selftest_init);
diff --git a/arch/x86/kernel/pv_selftest.h b/arch/x86/kernel/pv_selftest.h
new file mode 100644
index 000000000000..5afa0f7db5cc
--- /dev/null
+++ b/arch/x86/kernel/pv_selftest.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PVR_SELFTEST_H
+#define _PVR_SELFTEST_H
+
+#ifdef CONFIG_DEBUG_PARAVIRT_SELFTEST
+void pv_selftest_send_nmi(void);
+void pv_selftest_primary(void);
+void pv_selftest_secondary(void);
+#else
+static inline void pv_selftest_send_nmi(void) { }
+static inline void pv_selftest_primary(void) { }
+static inline void pv_selftest_secondary(void) { }
+#endif /*! CONFIG_DEBUG_PARAVIRT_SELFTEST */
+
+#endif /* _PVR_SELFTEST_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 21/26] x86/alternatives: Paravirt runtime selftest
@ 2020-04-08  5:03   ` Ankur Arora
  0 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: jgross, xen-devel, kvm, peterz, hpa, Ankur Arora, virtualization,
	pbonzini, namit, mhiramat, jpoimboe, mihai.carabas, bp, vkuznets,
	boris.ostrovsky

Add a selftest that triggers paravirt_runtime_patch() which
toggles between the paravirt and native pv_lock_ops.

The selftest also registers an NMI handler, which exercises the
patched pv-ops via spin-lock operations. These are triggered via
artificially sent NMIs.

And last, introduce patch sites in the primary and secondary
patching code which are hit during the patching process.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig.debug        |  13 ++
 arch/x86/kernel/Makefile      |   1 +
 arch/x86/kernel/alternative.c |  20 +++
 arch/x86/kernel/kvm.c         |   4 +-
 arch/x86/kernel/pv_selftest.c | 264 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/pv_selftest.h |  15 ++
 6 files changed, 315 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/pv_selftest.c
 create mode 100644 arch/x86/kernel/pv_selftest.h

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 2e74690b028a..82a8e3fa68c7 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -252,6 +252,19 @@ config X86_DEBUG_FPU
 
 	  If unsure, say N.
 
+config DEBUG_PARAVIRT_SELFTEST
+	bool "Enable paravirt runtime selftest"
+	depends on PARAVIRT
+	depends on PARAVIRT_RUNTIME
+	depends on PARAVIRT_SPINLOCKS
+	depends on KVM_GUEST
+	help
+	  This option enables sanity testing of the runtime paravirtualized
+	  patching code. Triggered via debugfs.
+
+	  Might help diagnose patching problems in different
+	  configurations and loads.
+
 config PUNIT_ATOM_DEBUG
 	tristate "ATOM Punit debug driver"
 	depends on PCI
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ba89cabe5fcf..ed3c93681f12 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_APB_TIMER)		+= apb_timer.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
 obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
+obj-$(CONFIG_DEBUG_PARAVIRT_SELFTEST) += pv_selftest.o
 
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch.o
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 385c3e6ea925..26407d7a54db 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -26,6 +26,7 @@
 #include <asm/insn.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include "pv_selftest.h"
 
 int __read_mostly alternatives_patched;
 
@@ -1549,6 +1550,12 @@ static void __maybe_unused text_poke_site(struct text_poke_state *tps,
 	 */
 	poke_sync(tps, PATCH_SYNC_0, offset, &int3, INT3_INSN_SIZE);
 
+	/*
+	 * We have an INT3 in place; execute a contrived selftest that
+	 * has an insn sequence that is under patching.
+	 */
+	pv_selftest_primary();
+
 	/* Poke remaining */
 	poke_sync(tps, PATCH_SYNC_1, offset + INT3_INSN_SIZE,
 		  tp->text + INT3_INSN_SIZE, tp->native.len - INT3_INSN_SIZE);
@@ -1634,6 +1641,19 @@ static void text_poke_sync_site(struct text_poke_state *tps)
 		smp_cond_load_acquire(&tps->state,
 				      prevstate != VAL);
 
+		/*
+		 * Send an NMI to one of the other CPUs.
+		 */
+		pv_selftest_send_nmi();
+
+		/*
+		 * We have an INT3 in place; execute a contrived selftest that
+		 * has an insn sequence that is under patching.
+		 *
+		 * Note that this function is also called from BP fixup but
+		 * is just an NOP when called from there.
+		 */
+		pv_selftest_secondary();
 		prevstate = READ_ONCE(tps->state);
 
 		/*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6efe0410fb72..e56d263159d7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -779,7 +779,7 @@ arch_initcall(kvm_alloc_cpumask);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
-static void kvm_kick_cpu(int cpu)
+void kvm_kick_cpu(int cpu)
 {
 	int apicid;
 	unsigned long flags = 0;
@@ -790,7 +790,7 @@ static void kvm_kick_cpu(int cpu)
 
 #include <asm/qspinlock.h>
 
-static void kvm_wait(u8 *ptr, u8 val)
+void kvm_wait(u8 *ptr, u8 val)
 {
 	unsigned long flags;
 
diff --git a/arch/x86/kernel/pv_selftest.c b/arch/x86/kernel/pv_selftest.c
new file mode 100644
index 000000000000..e522f444bd6e
--- /dev/null
+++ b/arch/x86/kernel/pv_selftest.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/delay.h>
+#include <linux/irq.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/memory.h>
+#include <linux/nmi.h>
+#include <linux/uaccess.h>
+#include <asm/apic.h>
+#include <asm/text-patching.h>
+#include <asm/paravirt.h>
+#include <asm/paravirt_types.h>
+#include "pv_selftest.h"
+
+static int nmi_selftest;
+static bool cond_state;
+
+#define SELFTEST_PARAVIRT	1
+static int test_mode;
+
+/*
+ * Mark this and the following functions __always_inline to ensure
+ * we generate multiple patch sites that can be hit independently
+ * in thread, NMI etc contexts.
+ */
+static __always_inline void selftest_pv(void)
+{
+	struct qspinlock test;
+
+	memset(&test, 0, sizeof(test));
+
+	test.locked = _Q_LOCKED_VAL;
+
+	/*
+	 * Sits directly in the path of the test.
+	 *
+	 * The primary sets up an INT3 instruction at pv_queued_spin_unlock().
+	 * Both the primary and secondary CPUs should hit that in both
+	 * thread and NMI contexts.
+	 *
+	 * Additionally, this also gets inlined in nmi_callback() so we
+	 * should hit this with nmi_selftest.
+	 *
+	 * The fixup takes place in poke_int3_native().
+	 */
+	pv_queued_spin_unlock(&test);
+}
+
+static __always_inline void patch_selftest(void)
+{
+	if (test_mode == SELFTEST_PARAVIRT)
+		selftest_pv();
+}
+
+static DEFINE_PER_CPU(int, selftest_count);
+void pv_selftest_secondary(void)
+{
+	/*
+	 * On the secondary we execute the same code in both the
+	 * thread-context and the BP-context and so would hit this
+	 * recursively if we ran it from the fixup context.
+	 *
+	 * So we trigger the selftest only if it's not ongoing already
+	 * (thus allowing the thread or NMI context, but excluding
+	 * the INT3 handling path.)
+	 */
+	if (this_cpu_read(selftest_count))
+		return;
+
+	this_cpu_inc(selftest_count);
+
+	patch_selftest();
+
+	this_cpu_dec(selftest_count);
+}
+
+void pv_selftest_primary(void)
+{
+	patch_selftest();
+}
+
+/*
+ * We only come here if nmi_selftest > 0.
+ *  - nmi_selftest >= 1: execute a pv-op that will be patched
+ *  - nmi_selftest >= 2: execute a paired pv-op that is also contended
+ *  - nmi_selftest >= 3: add lock contention
+ */
+static int nmi_callback(unsigned int val, struct pt_regs *regs)
+{
+	static DEFINE_SPINLOCK(nmi_spin);
+
+	if (!nmi_selftest)
+		goto out;
+
+	patch_selftest();
+
+	if (nmi_selftest >= 2) {
+		/*
+		 * Depending on whether CONFIG_[UN]INLINE_SPIN_* are
+		 * defined or not, these would get patched or just
+		 * create race conditions between NMIs.
+		 */
+		spin_lock(&nmi_spin);
+
+		/* Dilate the critical section to force contention. */
+		if (nmi_selftest >= 3)
+			udelay(1);
+
+		spin_unlock(&nmi_spin);
+	}
+
+	/*
+	 * nmi_selftest > 0, but we should really have a bitmap to check
+	 * whether this NMI was really destined for us or not.
+	 */
+	return NMI_HANDLED;
+out:
+	return NMI_DONE;
+}
+
+void pv_selftest_register(void)
+{
+	register_nmi_handler(NMI_LOCAL, nmi_callback,
+			     0, "paravirt_nmi_selftest");
+}
+
+void pv_selftest_unregister(void)
+{
+	unregister_nmi_handler(NMI_LOCAL, "paravirt_nmi_selftest");
+}
+
+void pv_selftest_send_nmi(void)
+{
+	int cpu = smp_processor_id();
+	/* NMI or INT3 */
+	if (nmi_selftest && !in_interrupt())
+		apic->send_IPI((cpu + 1) % num_online_cpus(), NMI_VECTOR);
+}
+
+/*
+ * Just declare these locally here instead of having them be
+ * exposed to the whole world.
+ */
+void kvm_wait(u8 *ptr, u8 val);
+void kvm_kick_cpu(int cpu);
+bool __raw_callee_save___kvm_vcpu_is_preempted(long cpu);
+static void pv_spinlocks(void)
+{
+	paravirt_stage_alt(cond_state,
+			   lock.queued_spin_lock_slowpath,
+			   __pv_queued_spin_lock_slowpath);
+	paravirt_stage_alt(cond_state, lock.queued_spin_unlock.func,
+			   PV_CALLEE_SAVE(__pv_queued_spin_unlock).func);
+	paravirt_stage_alt(cond_state, lock.wait, kvm_wait);
+	paravirt_stage_alt(cond_state, lock.kick, kvm_kick_cpu);
+
+	paravirt_stage_alt(cond_state,
+			   lock.vcpu_is_preempted.func,
+			   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted).func);
+}
+
+void pv_trigger(void)
+{
+	bool nmi_mode = nmi_selftest ? true : false;
+	int ret;
+
+	pr_debug("%s: nmi=%d; NMI-mode=%d\n", __func__, nmi_selftest, nmi_mode);
+
+	mutex_lock(&text_mutex);
+
+	paravirt_stage_zero();
+	pv_spinlocks();
+
+	/*
+	 * paravirt patching for pv_locks can potentially deadlock
+	 * if we are running with nmi_mode=false and we get an NMI.
+	 *
+	 * For the sake of testing that path, we risk it. However, if
+	 * we are generating synthetic NMIs (nmi_selftest > 0) then
+	 * run with nmi_mode=true.
+	 */
+	ret = paravirt_runtime_patch(nmi_mode);
+
+	/*
+	 * Flip the state so we switch the pv_lock_ops on the next test.
+	 */
+	cond_state = !cond_state;
+
+	mutex_unlock(&text_mutex);
+
+	pr_debug("%s: nmi=%d; NMI-mode=%d, ret=%d\n", __func__, nmi_selftest,
+		 nmi_mode, ret);
+}
+
+static void pv_selftest_trigger(void)
+{
+	test_mode = SELFTEST_PARAVIRT;
+	pv_trigger();
+}
+
+static ssize_t pv_selftest_write(struct file *file, const char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	pv_selftest_register();
+	pv_selftest_trigger();
+	pv_selftest_unregister();
+
+	return count;
+}
+
+static ssize_t pv_nmi_read(struct file *file, char __user *ubuf,
+			   size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = snprintf(buf, sizeof(buf), "%d\n", nmi_selftest);
+	return simple_read_from_buffer(ubuf, count, ppos, buf, len);
+}
+
+static ssize_t pv_nmi_write(struct file *file, const char __user *ubuf,
+			    size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+	unsigned int enabled;
+
+	len = min(sizeof(buf) - 1, count);
+	if (copy_from_user(buf, ubuf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtoint(buf, 0, &enabled))
+		return -EINVAL;
+
+	nmi_selftest = enabled > 3 ? 3 : enabled;
+
+	return count;
+}
+
+static const struct file_operations pv_selftest_fops = {
+	.read = NULL,
+	.write = pv_selftest_write,
+	.llseek = default_llseek,
+};
+
+static const struct file_operations pv_nmi_fops = {
+	.read = pv_nmi_read,
+	.write = pv_nmi_write,
+	.llseek = default_llseek,
+};
+
+static int __init pv_selftest_init(void)
+{
+	struct dentry *d = debugfs_create_dir("pv_selftest", NULL);
+
+	debugfs_create_file("toggle", 0600, d, NULL, &pv_selftest_fops);
+	debugfs_create_file("nmi", 0600, d, NULL, &pv_nmi_fops);
+
+	return 0;
+}
+
+late_initcall(pv_selftest_init);
diff --git a/arch/x86/kernel/pv_selftest.h b/arch/x86/kernel/pv_selftest.h
new file mode 100644
index 000000000000..5afa0f7db5cc
--- /dev/null
+++ b/arch/x86/kernel/pv_selftest.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PVR_SELFTEST_H
+#define _PVR_SELFTEST_H
+
+#ifdef CONFIG_DEBUG_PARAVIRT_SELFTEST
+void pv_selftest_send_nmi(void);
+void pv_selftest_primary(void);
+void pv_selftest_secondary(void);
+#else
+static inline void pv_selftest_send_nmi(void) { }
+static inline void pv_selftest_primary(void) { }
+static inline void pv_selftest_secondary(void) { }
+#endif /*! CONFIG_DEBUG_PARAVIRT_SELFTEST */
+
+#endif /* _PVR_SELFTEST_H */
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 22/26] kvm/paravirt: Encapsulate KVM pv switching logic
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

KVM pv-ops are dependent on KVM features/hints which are exposed
via CPUID. Decouple the probing and the enabling of these ops from
__init so they can be called post-init as well.
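
The helpers introduced below all follow the same shape: probe the CPUID
feature/hint, stage either the native or the KVM variant of the pv-op,
and return whether the KVM variant was chosen. As a rough sketch (not
part of this patch; it assumes the paravirt_stage_alt()/
paravirt_runtime_patch() interfaces added earlier in this series), the
intended call pattern looks like:

  /* Boot: probe and stage from the usual __init paths. */
  kvm_pv_io_delay();

  /* Post-init: re-run the same probes, then patch the staged ops in. */
  mutex_lock(&text_mutex);
  paravirt_stage_zero();
  kvm_pv_steal_clock();
  kvm_pv_tlb();
  paravirt_runtime_patch(false);
  mutex_unlock(&text_mutex);

The post-init half is what the next patch wires into a worker.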

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig      |  1 +
 arch/x86/kernel/kvm.c | 87 ++++++++++++++++++++++++++++++-------------
 2 files changed, 63 insertions(+), 25 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 605619938f08..e0629558b6b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -809,6 +809,7 @@ config KVM_GUEST
 	depends on PARAVIRT
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
+	select PARAVIRT_RUNTIME
 	default y
 	---help---
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e56d263159d7..31f5ecfd3907 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -24,6 +24,7 @@
 #include <linux/debugfs.h>
 #include <linux/nmi.h>
 #include <linux/swait.h>
+#include <linux/memory.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -262,12 +263,20 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned lon
 }
 NOKPROBE_SYMBOL(do_async_page_fault);
 
+static bool kvm_pv_io_delay(void)
+{
+	bool cond = kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY);
+
+	paravirt_stage_alt(cond, cpu.io_delay, kvm_io_delay);
+
+	return cond;
+}
+
 static void __init paravirt_ops_setup(void)
 {
 	pv_info.name = "KVM";
 
-	if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
-		pv_ops.cpu.io_delay = kvm_io_delay;
+	kvm_pv_io_delay();
 
 #ifdef CONFIG_X86_IO_APIC
 	no_timer_check = 1;
@@ -432,6 +441,15 @@ static bool pv_tlb_flush_supported(void)
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
 }
 
+static bool kvm_pv_steal_clock(void)
+{
+	bool cond = kvm_para_has_feature(KVM_FEATURE_STEAL_TIME);
+
+	paravirt_stage_alt(cond, time.steal_clock, kvm_steal_clock);
+
+	return cond;
+}
+
 static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 #ifdef CONFIG_SMP
@@ -624,6 +642,17 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	native_flush_tlb_others(flushmask, info);
 }
 
+static bool kvm_pv_tlb(void)
+{
+	bool cond = pv_tlb_flush_supported();
+
+	paravirt_stage_alt(cond, mmu.flush_tlb_others,
+			   kvm_flush_tlb_others);
+	paravirt_stage_alt(cond, mmu.tlb_remove_table,
+			   tlb_remove_table);
+	return cond;
+}
+
 static void __init kvm_guest_init(void)
 {
 	int i;
@@ -635,16 +664,11 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
 		x86_init.irqs.trap_init = kvm_apf_trap_init;
 
-	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+	if (kvm_pv_steal_clock())
 		has_steal_clock = 1;
-		pv_ops.time.steal_clock = kvm_steal_clock;
-	}
 
-	if (pv_tlb_flush_supported()) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
-		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+	if (kvm_pv_tlb())
 		pr_info("KVM setup pv remote TLB flush\n");
-	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
@@ -849,33 +873,46 @@ asm(
 
 #endif
 
+static inline bool kvm_para_lock_ops(void)
+{
+	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
+	return kvm_para_has_feature(KVM_FEATURE_PV_UNHALT) &&
+		!kvm_para_has_hint(KVM_HINTS_REALTIME);
+}
+
+static bool kvm_pv_spinlock(void)
+{
+	bool cond = kvm_para_lock_ops();
+	bool preempt_cond = cond &&
+			kvm_para_has_feature(KVM_FEATURE_STEAL_TIME);
+
+	paravirt_stage_alt(cond, lock.queued_spin_lock_slowpath,
+			   __pv_queued_spin_lock_slowpath);
+	paravirt_stage_alt(cond, lock.queued_spin_unlock.func,
+			   PV_CALLEE_SAVE(__pv_queued_spin_unlock).func);
+	paravirt_stage_alt(cond, lock.wait, kvm_wait);
+	paravirt_stage_alt(cond, lock.kick, kvm_kick_cpu);
+
+	paravirt_stage_alt(preempt_cond,
+			   lock.vcpu_is_preempted.func,
+			   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted).func);
+	return cond;
+}
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
 void __init kvm_spinlock_init(void)
 {
-	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
-	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
-		return;
-
-	if (kvm_para_has_hint(KVM_HINTS_REALTIME))
-		return;
 
 	/* Don't use the pvqspinlock code if there is only 1 vCPU. */
 	if (num_possible_cpus() == 1)
 		return;
 
-	__pv_init_lock_hash();
-	pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
-	pv_ops.lock.queued_spin_unlock =
-		PV_CALLEE_SAVE(__pv_queued_spin_unlock);
-	pv_ops.lock.wait = kvm_wait;
-	pv_ops.lock.kick = kvm_kick_cpu;
+	if (!kvm_pv_spinlock())
+		return;
 
-	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.lock.vcpu_is_preempted =
-			PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
-	}
+	__pv_init_lock_hash();
 }
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 23/26] x86/kvm: Add worker to trigger runtime patching
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Make __pv_init_lock_hash() conditional on either paravirt spinlocks
being enabled (via kvm_pv_spinlock()) or if paravirt spinlocks
might get enabled (runtime patching via CONFIG_PARAVIRT_RUNTIME.)

Also add a handler for CPUID reprobe which can trigger this patching.
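
The worker added here only does the CPUID reprobe and repatching;
scheduling it is left to a later patch. A minimal sketch of the intended
call site (this mirrors what patch 25 adds and is shown only for
illustration):

  static DECLARE_WORK(trigger_reprobe, kvm_trigger_reprobe_cpuid);

  /* From the hint-change interrupt handler (or any atomic context): */
  schedule_work(&trigger_reprobe);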

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/kvm.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 31f5ecfd3907..1cb7eab805a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -35,6 +35,7 @@
 #include <asm/hypervisor.h>
 #include <asm/tlb.h>
 #include <asm/cpuidle_haltpoll.h>
+#include <asm/text-patching.h>
 
 static int kvmapf = 1;
 
@@ -909,12 +910,15 @@ void __init kvm_spinlock_init(void)
 	if (num_possible_cpus() == 1)
 		return;
 
-	if (!kvm_pv_spinlock())
-		return;
-
-	__pv_init_lock_hash();
+	/*
+	 * Allocate if pv_spinlocks are enabled or if we might
+	 * end up patching them in later.
+	 */
+	if (kvm_pv_spinlock() || IS_ENABLED(CONFIG_PARAVIRT_RUNTIME))
+		__pv_init_lock_hash();
 }
-
+#else	/* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline bool kvm_pv_spinlock(void) { return false; }
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
 #ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
@@ -952,3 +956,23 @@ void arch_haltpoll_disable(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
 #endif
+
+#ifdef CONFIG_PARAVIRT_RUNTIME
+void kvm_trigger_reprobe_cpuid(struct work_struct *work)
+{
+	mutex_lock(&text_mutex);
+
+	paravirt_stage_zero();
+
+	kvm_pv_steal_clock();
+	kvm_pv_tlb();
+	paravirt_runtime_patch(false);
+
+	paravirt_stage_zero();
+
+	kvm_pv_spinlock();
+	paravirt_runtime_patch(true);
+
+	mutex_unlock(&text_mutex);
+}
+#endif /* CONFIG_PARAVIRT_RUNTIME */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 24/26] x86/kvm: Support dynamic CPUID hints
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Change in the state of a KVM hint like KVM_HINTS_REALTIME can lead
to significant performance impact. Given that the hint might not be
stable across the lifetime of a guest, dynamic hints allow the host
to inform the guest if the hint changes.

Expose the currently active hints via %ecx of the KVM CPUID features
leaf.  If the guest has registered a callback vector via
MSR_KVM_HINT_VECTOR, notify it of a hint change by injecting that
callback, triggered via the vcpu ioctl KVM_CALLBACK.
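
For illustration, a hypothetical VMM-side sequence (not part of this
patch; it assumes the VMM is allowed to refresh the guest's CPUID
entries at runtime) would look roughly like:

  /*
   * Recompute the hint bits in leaf 0x40000001 (%ecx = active hints),
   * push the updated CPUID, and queue the callback so that the guest
   * re-evaluates its pv-ops.
   */
  if (ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid) < 0)
          err(1, "KVM_SET_CPUID2");

  if (ioctl(vcpu_fd, KVM_CALLBACK) < 0)
          err(1, "KVM_CALLBACK");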

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
The callback vector is currently tied in with the hint notification
and can (should) be made more generic such that we could deliver
arbitrary callbacks on it.

One use might be TSC frequency switching notifications for emulated
Hyper-V guests.

---
 Documentation/virt/kvm/api.rst       | 17 ++++++++++++
 Documentation/virt/kvm/cpuid.rst     |  9 +++++--
 arch/x86/include/asm/kvm_host.h      |  6 +++++
 arch/x86/include/uapi/asm/kvm_para.h |  2 ++
 arch/x86/kvm/cpuid.c                 |  3 ++-
 arch/x86/kvm/x86.c                   | 39 ++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h             |  4 +++
 7 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..40a9b22d6979 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4690,6 +4690,17 @@ KVM_PV_VM_VERIFY
   Verify the integrity of the unpacked image. Only if this succeeds,
   KVM is allowed to start protected VCPUs.
 
+4.126 KVM_CALLBACK
+------------------
+
+:Capability: KVM_CAP_CALLBACK
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
+
+Queues a callback on the guest's vcpu if a callback has been registered.
+
 
 5. The kvm_run structure
 ========================
@@ -6109,3 +6120,9 @@ KVM can therefore start protected VMs.
 This capability governs the KVM_S390_PV_COMMAND ioctl and the
 KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
 guests when the state change is invalid.
+
+8.24 KVM_CAP_CALLBACK
+
+Architectures: x86_64
+
+This capability indicates that the ioctl KVM_CALLBACK is available.
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index 01b081f6e7ea..5a997c9e74c0 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -86,6 +86,9 @@ KVM_FEATURE_PV_SCHED_YIELD        13          guest checks this feature bit
                                               before using paravirtualized
                                               sched yield.
 
+KVM_FEATURE_DYNAMIC_HINTS	  14	      guest handles feature hints
+					      changing under it.
+
 KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expeced in
                                               kvmclock
@@ -93,9 +96,11 @@ KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
 
 ::
 
-      edx = an OR'ed group of (1 << flag)
+      ecx, edx = an OR'ed group of (1 << flag)
 
-Where ``flag`` here is defined as below:
+Where the ``flag`` in ecx is the set of currently applicable hints, and ``flag`` in
+edx is the union of all hints ever provided to the guest, both drawn from
+the set listed below:
 
 ================== ============ =================================
 flag               value        meaning
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42a2d0d3984a..4f061550274d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -723,6 +723,8 @@ struct kvm_vcpu_arch {
 	bool nmi_injected;    /* Trying to inject an NMI this entry */
 	bool smi_pending;    /* SMI queued after currently running handler */
 
+	bool callback_pending;	/* Callback queued after running handler */
+
 	struct kvm_mtrr mtrr_state;
 	u64 pat;
 
@@ -982,6 +984,10 @@ struct kvm_arch {
 
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
+
+	struct {
+		u8 vector;
+	} callback;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..bf016e232f2f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -31,6 +31,7 @@
 #define KVM_FEATURE_PV_SEND_IPI	11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
+#define KVM_FEATURE_DYNAMIC_HINTS	14
 
 #define KVM_HINTS_REALTIME      0
 
@@ -50,6 +51,7 @@
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
+#define MSR_KVM_HINT_VECTOR	0x4b564d06
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 901cd1fdecd9..db6a4c4d9430 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
-			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
+			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
+			     (1 << KVM_FEATURE_DYNAMIC_HINTS);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b8124b562dea..838d033bf5ba 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1282,6 +1282,7 @@ static const u32 emulated_msrs_all[] = {
 
 	MSR_K7_HWCR,
 	MSR_KVM_POLL_CONTROL,
+	MSR_KVM_HINT_VECTOR,
 };
 
 static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -2910,7 +2911,15 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		vcpu->arch.msr_kvm_poll_control = data;
 		break;
+	case MSR_KVM_HINT_VECTOR: {
+		u8 vector = (u8)data;
 
+		if ((u64)data > 0xffUL)
+			return 1;
+
+		vcpu->kvm->arch.callback.vector = vector;
+		break;
+	}
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3156,6 +3165,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_POLL_CONTROL:
 		msr_info->data = vcpu->arch.msr_kvm_poll_control;
 		break;
+	case MSR_KVM_HINT_VECTOR:
+		msr_info->data = vcpu->kvm->arch.callback.vector;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -3373,6 +3385,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_GET_MSR_FEATURES:
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
+	case KVM_CAP_CALLBACK:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -3721,6 +3734,20 @@ static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+static int kvm_vcpu_ioctl_callback(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Has the guest setup a callback?
+	 */
+	if (vcpu->kvm->arch.callback.vector) {
+		vcpu->arch.callback_pending = true;
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+}
+
 static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
 {
 	kvm_inject_nmi(vcpu);
@@ -4611,6 +4638,10 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_CALLBACK: {
+		r = kvm_vcpu_ioctl_callback(vcpu);
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
@@ -7737,6 +7768,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
 		--vcpu->arch.nmi_pending;
 		vcpu->arch.nmi_injected = true;
 		kvm_x86_ops.set_nmi(vcpu);
+	} else if (vcpu->arch.callback_pending) {
+		if (kvm_x86_ops.interrupt_allowed(vcpu)) {
+			vcpu->arch.callback_pending = false;
+			kvm_queue_interrupt(vcpu,
+					    vcpu->kvm->arch.callback.vector,
+					    false);
+			kvm_x86_ops.set_irq(vcpu);
+		}
 	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
 		/*
 		 * Because interrupts can be injected asynchronously, we are
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 428c7dde6b4b..5401c056742c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_VCPU_RESETS 179
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_CALLBACK	182
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1518,6 +1519,9 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_S390_PROTECTED */
 #define KVM_S390_PV_COMMAND		_IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
 
+/* Available with  KVM_CAP_CALLBACK */
+#define KVM_CALLBACK		  _IO(KVMIO,  0xc6)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 25/26] x86/kvm: Guest support for dynamic hints
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

If the hypervisor supports KVM_FEATURE_DYNAMIC_HINTS, then register a
callback vector (currently chosen to be HYPERVISOR_CALLBACK_VECTOR.)
The callback triggers on a change in the active hints, which are
exported via KVM CPUID in %ecx.

Trigger re-evaluation of KVM_HINTS based on change in their active
status.
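
To illustrate the difference between the two (a sketch only;
kvm_arch_para_hints() reads %edx, kvm_arch_para_active_hints() reads
%ecx of the KVM CPUID features leaf):

  unsigned int ever = kvm_arch_para_hints();        /* hints ever given */
  unsigned int now  = kvm_arch_para_active_hints(); /* hints active now */

  /* Host was dedicated at some point but is currently oversubscribed. */
  if ((ever & BIT(KVM_HINTS_REALTIME)) && !(now & BIT(KVM_HINTS_REALTIME)))
          pr_info("KVM_HINTS_REALTIME was withdrawn\n");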

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/entry/entry_64.S       |  5 +++
 arch/x86/include/asm/kvm_para.h |  7 ++++
 arch/x86/kernel/kvm.c           | 58 ++++++++++++++++++++++++++++++---
 include/asm-generic/kvm_para.h  |  4 +++
 include/linux/kvm_para.h        |  5 +++
 6 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e0629558b6b5..23b239d184fc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -810,6 +810,7 @@ config KVM_GUEST
 	select PARAVIRT_CLOCK
 	select ARCH_CPUIDLE_HALTPOLL
 	select PARAVIRT_RUNTIME
+	select X86_HV_CALLBACK_VECTOR
 	default y
 	---help---
 	  This option enables various optimizations for running under the KVM
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0e9504fabe52..96b2a243c54f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1190,6 +1190,11 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 	acrn_hv_callback_vector acrn_hv_vector_handler
 #endif
 
+#if IS_ENABLED(CONFIG_KVM_GUEST)
+apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
+	kvm_callback_vector kvm_do_callback
+#endif
+
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
 idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9b4df6eaa11a..5a7ca5639c2e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,11 +88,13 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 unsigned int kvm_arch_para_hints(void);
+unsigned int kvm_arch_para_active_hints(void);
 void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
 void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
+void kvm_callback_vector(struct pt_regs *regs);
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
@@ -121,6 +123,11 @@ static inline unsigned int kvm_arch_para_hints(void)
 	return 0;
 }
 
+static inline unsigned int kvm_arch_para_active_hints(void)
+{
+	return 0;
+}
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
 	return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cb7eab805a6..163b7a7ec5f9 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -25,6 +25,8 @@
 #include <linux/nmi.h>
 #include <linux/swait.h>
 #include <linux/memory.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -438,7 +440,7 @@ static void __init sev_map_percpu_data(void)
 static bool pv_tlb_flush_supported(void)
 {
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
-		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
+		!kvm_para_has_active_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
 }
 
@@ -463,7 +465,7 @@ static bool pv_ipi_supported(void)
 static bool pv_sched_yield_supported(void)
 {
 	return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
-		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
+		!kvm_para_has_active_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
 }
 
@@ -568,7 +570,7 @@ static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
 static void __init kvm_smp_prepare_cpus(unsigned int max_cpus)
 {
 	native_smp_prepare_cpus(max_cpus);
-	if (kvm_para_has_hint(KVM_HINTS_REALTIME))
+	if (kvm_para_has_active_hint(KVM_HINTS_REALTIME))
 		static_branch_disable(&virt_spin_lock_key);
 }
 
@@ -654,6 +656,13 @@ static bool kvm_pv_tlb(void)
 	return cond;
 }
 
+#ifdef CONFIG_PARAVIRT_RUNTIME
+static bool has_dynamic_hint;
+static void __init kvm_register_callback_vector(void);
+#else
+#define has_dynamic_hint false
+#endif /* CONFIG_PARAVIRT_RUNTIME */
+
 static void __init kvm_guest_init(void)
 {
 	int i;
@@ -674,6 +683,12 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+	if (IS_ENABLED(CONFIG_PARAVIRT_RUNTIME) &&
+	    kvm_para_has_feature(KVM_FEATURE_DYNAMIC_HINTS)) {
+		kvm_register_callback_vector();
+		has_dynamic_hint = true;
+	}
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
@@ -729,12 +744,27 @@ unsigned int kvm_arch_para_features(void)
 	return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
 }
 
+/*
+ * Universe of hints that's ever been given to this guest.
+ */
 unsigned int kvm_arch_para_hints(void)
 {
 	return cpuid_edx(kvm_cpuid_base() | KVM_CPUID_FEATURES);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_para_hints);
 
+/*
+ * Currently active set of hints. Reading can race with modifications.
+ */
+unsigned int kvm_arch_para_active_hints(void)
+{
+	if (has_dynamic_hint)
+		return cpuid_ecx(kvm_cpuid_base() | KVM_CPUID_FEATURES);
+	else
+		return kvm_arch_para_hints();
+}
+EXPORT_SYMBOL_GPL(kvm_arch_para_active_hints);
+
 static uint32_t __init kvm_detect(void)
 {
 	return kvm_cpuid_base();
@@ -878,7 +908,7 @@ static inline bool kvm_para_lock_ops(void)
 {
 	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
 	return kvm_para_has_feature(KVM_FEATURE_PV_UNHALT) &&
-		!kvm_para_has_hint(KVM_HINTS_REALTIME);
+		!kvm_para_has_active_hint(KVM_HINTS_REALTIME);
 }
 
 static bool kvm_pv_spinlock(void)
@@ -975,4 +1005,24 @@ void kvm_trigger_reprobe_cpuid(struct work_struct *work)
 
 	mutex_unlock(&text_mutex);
 }
+
+static DECLARE_WORK(trigger_reprobe, kvm_trigger_reprobe_cpuid);
+
+void __irq_entry kvm_do_callback(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	irq_enter();
+	inc_irq_stat(irq_hv_callback_count);
+
+	schedule_work(&trigger_reprobe);
+	irq_exit();
+	set_irq_regs(old_regs);
+}
+
+static void __init kvm_register_callback_vector(void)
+{
+	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, kvm_callback_vector);
+	wrmsrl(MSR_KVM_HINT_VECTOR, HYPERVISOR_CALLBACK_VECTOR);
+}
 #endif /* CONFIG_PARAVIRT_RUNTIME */
diff --git a/include/asm-generic/kvm_para.h b/include/asm-generic/kvm_para.h
index 728e5c5706c4..4a575299ad62 100644
--- a/include/asm-generic/kvm_para.h
+++ b/include/asm-generic/kvm_para.h
@@ -24,6 +24,10 @@ static inline unsigned int kvm_arch_para_hints(void)
 	return 0;
 }
 
+static inline unsigned int kvm_arch_para_active_hints(void)
+{
+	return 0;
+}
 static inline bool kvm_para_available(void)
 {
 	return false;
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index f23b90b02898..c98d3944d25a 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -14,4 +14,9 @@ static inline bool kvm_para_has_hint(unsigned int feature)
 {
 	return !!(kvm_arch_para_hints() & (1UL << feature));
 }
+
+static inline bool kvm_para_has_active_hint(unsigned int feature)
+{
+	return !!(kvm_arch_para_active_hints() & BIT(feature));
+}
 #endif /* __LINUX_KVM_PARA_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [RFC PATCH 26/26] x86/kvm: Add hint change notifier for KVM_HINT_REALTIME
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08  5:03   ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-08  5:03 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora, Joao Martins

Add a blocking notifier that triggers when the host sends a hint
change notification.
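
A minimal sketch of an in-kernel consumer (the names below are
illustrative, not part of this patch):

  static int realtime_hint_changed(struct notifier_block *nb,
                                   unsigned long action, void *data)
  {
          if (kvm_para_has_active_hint(KVM_HINTS_REALTIME))
                  pr_info("host CPUs are dedicated again\n");
          else
                  pr_info("host CPUs are now oversubscribed\n");

          return NOTIFY_OK;
  }

  static struct notifier_block realtime_nb = {
          .notifier_call = realtime_hint_changed,
  };

  /*
   * kvm_realtime_notifier_register(&realtime_nb) at init;
   * kvm_realtime_notifier_unregister(&realtime_nb) on teardown.
   */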

Suggested-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/kvm_para.h | 10 ++++++++++
 arch/x86/kernel/kvm.c           | 16 ++++++++++++++++
 include/asm-generic/kvm_para.h  |  8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 5a7ca5639c2e..54c3c7a3225e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -2,6 +2,7 @@
 #ifndef _ASM_X86_KVM_PARA_H
 #define _ASM_X86_KVM_PARA_H
 
+#include <linux/notifier.h>
 #include <asm/processor.h>
 #include <asm/alternative.h>
 #include <uapi/asm/kvm_para.h>
@@ -96,6 +97,9 @@ extern void kvm_disable_steal_time(void);
 void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 void kvm_callback_vector(struct pt_regs *regs);
 
+void kvm_realtime_notifier_register(struct notifier_block *nb);
+void kvm_realtime_notifier_unregister(struct notifier_block *nb);
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
 #else /* !CONFIG_PARAVIRT_SPINLOCKS */
@@ -137,6 +141,14 @@ static inline void kvm_disable_steal_time(void)
 {
 	return;
 }
+
+static inline void kvm_realtime_notifier_register(struct notifier_block *nb)
+{
+}
+
+static inline void kvm_realtime_notifier_unregister(struct notifier_block *nb)
+{
+}
 #endif
 
 #endif /* _ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 163b7a7ec5f9..35ba4a837027 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -951,6 +951,20 @@ void __init kvm_spinlock_init(void)
 static inline bool kvm_pv_spinlock(void) { return false; }
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
+static BLOCKING_NOTIFIER_HEAD(realtime_notifier);
+
+void kvm_realtime_notifier_register(struct notifier_block *nb)
+{
+	blocking_notifier_chain_register(&realtime_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(kvm_realtime_notifier_register);
+
+void kvm_realtime_notifier_unregister(struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&realtime_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(kvm_realtime_notifier_unregister);
+
 #ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
 
 static void kvm_disable_host_haltpoll(void *i)
@@ -1004,6 +1018,8 @@ void kvm_trigger_reprobe_cpuid(struct work_struct *work)
 	paravirt_runtime_patch(true);
 
 	mutex_unlock(&text_mutex);
+
+	blocking_notifier_call_chain(&realtime_notifier, 0, NULL);
 }
 
 static DECLARE_WORK(trigger_reprobe, kvm_trigger_reprobe_cpuid);
diff --git a/include/asm-generic/kvm_para.h b/include/asm-generic/kvm_para.h
index 4a575299ad62..d443531b49ac 100644
--- a/include/asm-generic/kvm_para.h
+++ b/include/asm-generic/kvm_para.h
@@ -33,4 +33,12 @@ static inline bool kvm_para_available(void)
 	return false;
 }
 
+static inline void kvm_realtime_notifier_register(struct notifier_block *nb)
+{
+}
+
+static inline void kvm_realtime_notifier_unregister(struct notifier_block *nb)
+{
+}
+
 #endif
-- 
2.20.1
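
A minimal consumer of the notifier interface added here might look like the
sketch below. Only kvm_realtime_notifier_register()/_unregister() and the fact
that the chain is invoked with (0, NULL) come from this patch; the callback and
module boilerplate are illustrative.

#include <linux/module.h>
#include <linux/notifier.h>
#include <asm/kvm_para.h>

static int realtime_hint_changed(struct notifier_block *nb,
				 unsigned long action, void *data)
{
	/* Re-evaluate any spin-vs-sleep style policy here. */
	pr_info("KVM_HINTS_REALTIME may have changed\n");
	return NOTIFY_OK;
}

static struct notifier_block realtime_nb = {
	.notifier_call = realtime_hint_changed,
};

static int __init rt_hint_example_init(void)
{
	kvm_realtime_notifier_register(&realtime_nb);
	return 0;
}

static void __exit rt_hint_example_exit(void)
{
	kvm_realtime_notifier_unregister(&realtime_nb);
}

module_init(rt_hint_example_init);
module_exit(rt_hint_example_exit);
MODULE_LICENSE("GPL");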


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 09/26] x86/paravirt: Add runtime_patch()
  2020-04-08  5:03   ` Ankur Arora
  (?)
@ 2020-04-08 11:05     ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 11:05 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Tue, Apr 07, 2020 at 10:03:06PM -0700, Ankur Arora wrote:
> +/*
> + * preempt_enable_no_resched() so we don't add any preemption points until
> + * after the caller has returned.
> + */
> +#define preempt_enable_runtime_patch()	preempt_enable_no_resched()
> +#define preempt_disable_runtime_patch()	preempt_disable()

NAK, this is probably a straight preemption bug, also, afaict, there
aren't actually any users of this in the patch-set.
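
Spelling the objection out (a sketch of generic preemption semantics, not code
from this thread): preempt_enable_no_resched() drops the preempt count without
checking need_resched(), so it is only safe when the caller is guaranteed to
reschedule immediately afterwards -- otherwise a wakeup that happened inside
the critical section is delayed past the point where a plain preempt_enable()
would have rescheduled.

#include <linux/preempt.h>

/* Illustrative only; the function and comments are not from the patch. */
static void lost_preemption_sketch(void)
{
	preempt_disable();
	/*
	 * If a higher-priority task is woken here, TIF_NEED_RESCHED gets
	 * set but cannot be acted upon while preemption is disabled.
	 */
	preempt_enable_no_resched();
	/*
	 * The preempt count is zero again, but no reschedule was performed
	 * and none is forced here.  Unless this path returns straight into
	 * schedule(), the woken task keeps waiting until some other
	 * preemption point -- preempt_enable() would have preempted here.
	 */
}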

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 14/26] x86/alternatives: Handle native insns in text_poke_loc*()
  2020-04-08  5:03   ` Ankur Arora
  (?)
@ 2020-04-08 11:11     ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 11:11 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Tue, Apr 07, 2020 at 10:03:11PM -0700, Ankur Arora wrote:
>  struct text_poke_loc {
>  	s32 rel_addr; /* addr := _stext + rel_addr */
> -	s32 rel32;
> -	u8 opcode;
> +	union {
> +		struct {
> +			s32 rel32;
> +			u8 opcode;
> +		} emulated;
> +		struct {
> +			u8 len;
> +		} native;
> +	};
>  	const u8 text[POKE_MAX_OPCODE_SIZE];
>  };

NAK, this grows the structure from 16 to 20 bytes.
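
For reference, the size arithmetic is easy to reproduce in user space (this
assumes POKE_MAX_OPCODE_SIZE == 5, as in the kernel at the time, and C11
anonymous unions): the union is sized and aligned for its larger 'emulated'
member, so the struct grows from 16 to 20 bytes.

#include <stdint.h>
#include <stdio.h>

#define POKE_MAX_OPCODE_SIZE 5

struct tp_old {
	int32_t rel_addr;
	int32_t rel32;
	uint8_t opcode;
	uint8_t text[POKE_MAX_OPCODE_SIZE];
};				/* 4 + 4 + 1 + 5 = 14, padded to 16 */

struct tp_new {
	int32_t rel_addr;
	union {
		struct { int32_t rel32; uint8_t opcode; } emulated;
		struct { uint8_t len; } native;
	};			/* sized for 'emulated': 8 bytes */
	uint8_t text[POKE_MAX_OPCODE_SIZE];
};				/* 4 + 8 + 5 = 17, padded to 20 */

int main(void)
{
	printf("old: %zu, new: %zu\n",
	       sizeof(struct tp_old), sizeof(struct tp_new));
	return 0;
}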

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 15/26] x86/alternatives: Non-emulated text poking
  2020-04-08  5:03   ` Ankur Arora
  (?)
@ 2020-04-08 11:13     ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 11:13 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Tue, Apr 07, 2020 at 10:03:12PM -0700, Ankur Arora wrote:
> +static void __maybe_unused sync_one(void)
> +{
> +	/*
> +	 * We might be executing in NMI context, and so cannot use
> +	 * IRET as a synchronizing instruction.
> +	 *
> +	 * We could use native_write_cr2() but that is not guaranteed
> +	 * to work on Xen-PV -- it is emulated by Xen and might not
> +	 * execute an iret (or similar synchronizing instruction)
> +	 * internally.
> +	 *
> +	 * cpuid() would trap as well. Unclear if that's a solution
> +	 * either.
> +	 */
> +	if (in_nmi())
> +		cpuid_eax(1);
> +	else
> +		sync_core();
> +}

That's not thinking straight; what do you think the INT3 does when it
happens inside an NMI?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 14/26] x86/alternatives: Handle native insns in text_poke_loc*()
  2020-04-08  5:03   ` Ankur Arora
  (?)
@ 2020-04-08 11:17     ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 11:17 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization


On Tue, Apr 07, 2020 at 10:03:11PM -0700, Ankur Arora wrote:
> @@ -1071,10 +1079,13 @@ int notrace poke_int3_handler(struct pt_regs *regs)
>  			goto out_put;
>  	}
>  
> -	len = text_opcode_size(tp->opcode);
> +	if (desc->native)
> +		BUG();
> +

Subject: x86/alternatives: Handle native insns in text_poke_loc*()

That's not really handling, is it..

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 15/26] x86/alternatives: Non-emulated text poking
  2020-04-08  5:03   ` Ankur Arora
  (?)
@ 2020-04-08 11:23     ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 11:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Tue, Apr 07, 2020 at 10:03:12PM -0700, Ankur Arora wrote:
> +static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
> +{
> +	int ret;
> +
> +	lockdep_assert_held(&text_mutex);
> +
> +	if (system_state != SYSTEM_RUNNING)
> +		return -EINVAL;
> +
> +	text_poke_state.stage = stage;
> +	text_poke_state.num_acks = cpumask_weight(cpu_online_mask);
> +	text_poke_state.head = &alt_modules;
> +
> +	text_poke_state.patch_worker = worker;
> +	text_poke_state.state = PATCH_SYNC_DONE; /* Start state */
> +	text_poke_state.primary_cpu = smp_processor_id();
> +
> +	/*
> +	 * Run the worker on all online CPUs. Don't need to do anything
> +	 * for offline CPUs as they come back online with a clean cache.
> +	 */
> +	ret = stop_machine(patch_worker, &text_poke_state, cpu_online_mask);

This.. that on its own is almost a reason to NAK the entire thing. We're
all working very hard to get rid of stop_machine() and you're adding
one.

Worse, stop_machine() is notoriously crap on over-committed virt, the
exact scenario where you want it.

> +
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 19/26] x86/alternatives: NMI safe runtime patching
  2020-04-08  5:03   ` Ankur Arora
  (?)
@ 2020-04-08 11:36     ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 11:36 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Tue, Apr 07, 2020 at 10:03:16PM -0700, Ankur Arora wrote:
> @@ -1807,12 +1911,20 @@ static int __maybe_unused text_poke_late(patch_worker_t worker, void *stage)
>  	text_poke_state.state = PATCH_SYNC_DONE; /* Start state */
>  	text_poke_state.primary_cpu = smp_processor_id();
>  
> +	text_poke_state.nmi_context = nmi;
> +
> +	if (nmi)
> +		register_nmi_handler(NMI_LOCAL, text_poke_nmi,
> +				     NMI_FLAG_FIRST, "text_poke_nmi");
>  	/*
>  	 * Run the worker on all online CPUs. Don't need to do anything
>  	 * for offline CPUs as they come back online with a clean cache.
>  	 */
>  	ret = stop_machine(patch_worker, &text_poke_state, cpu_online_mask);
>  
> +	if (nmi)
> +		unregister_nmi_handler(NMI_LOCAL, "text_poke_nmi");
> +
>  	return ret;
>  }

This is completely bonghits crazy.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08  5:02 ` Ankur Arora
  (?)
@ 2020-04-08 12:08   ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 12:08 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.

So what, the paravirt spinlock stuff works just fine when you're not
oversubscribed.

> We keep an interesting subset of pv-ops (pv_lock_ops only for now,
> but PV-TLB ops are also good candidates)

The PV-TLB ops also work just fine when not oversubscribed. IIRC
kvm_flush_tlb_others() is pretty much the same in that case.

> in .parainstructions.runtime,
> while discarding the .parainstructions as usual at init. This is then
> used for switching back and forth between native and paravirt mode.
> ([1] lists some representative numbers of the increased memory
> footprint.)
> 
> Mechanism: the patching itself is done using stop_machine(). That is
> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
> via text_poke_bp(), but I'm using this to address two issues:
>  1) emulation in text_poke() can only easily handle a small set
>  of instructions and this is problematic for inlined pv-ops (and see
>  a possible alternatives use-case below.)
>  2) paravirt patching might have inter-dependendent ops (ex.
>  lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>  need to be updated atomically.)

And then you hope that the spinlock state transfers.. That is, that both
implementations agree on what an unlocked spinlock looks like.

Suppose the native one was a ticket spinlock, where unlocked means 'head
== tail' while the paravirt one is a test-and-set spinlock, where
unlocked means 'val == 0'.

That just happens to not be the case now, but it was for a fair while.
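
A toy illustration of that hazard (deliberately simplified layouts, not the
kernel's actual ticket or qspinlock structures): a lock released by one
implementation must still look released to the other.

#include <stdio.h>
#include <string.h>

/* unlocked means head == tail; both counters grow over the lock's lifetime */
struct ticket_lock { unsigned short head, tail; };

/* unlocked means val == 0 */
struct tas_lock { unsigned int val; };

int main(void)
{
	/* A ticket lock after three lock/unlock cycles: free, but non-zero. */
	struct ticket_lock t = { .head = 3, .tail = 3 };
	struct tas_lock as_tas;

	memcpy(&as_tas, &t, sizeof(as_tas));

	printf("ticket view: %s, TAS view: %s (val=%#x)\n",
	       t.head == t.tail ? "unlocked" : "locked",
	       as_tas.val == 0 ? "unlocked" : "locked",
	       as_tas.val);
	return 0;
}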

> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.

The whole late-microcode loading stuff is crazy already; you're making
it take double bonghits.

> Also, there are points of similarity with the ongoing static_call work
> which does rewriting of indirect calls.

Only insofar as code patching is involved. An analogy would be
comparing having a beer with shooting dope. They're both 'drugs'.

> The difference here is that
> we need to switch a group of calls atomically and given that
> some of them can be inlined, need to handle a wider variety of opcodes.
> 
> To patch safely we need to satisfy these constraints:
> 
>  - No references to insn sequences under replacement on any kernel stack
>    once replacement is in progress. Without this constraint we might end
>    up returning to an address that is in the middle of an instruction.

Both ftrace and optprobes have that issue; neither of them is quite as
crazy as this.

>  - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
>    lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
>    a good example.

While I'm sure this is a fun problem, why are we solving it?

>  - handle a broader set of insns than CALL and JMP: some pv-ops end up
>    getting inlined. Alternatives can contain arbitrary instructions.

So can optprobes.

>  - locking operations can be called from interrupt handlers which means
>    we cannot trivially use IPIs for flushing.

Heck, some NMI handlers use locks..

> Handling these, necessitates that target pv-ops not be preemptible.

I don't think that is a correct inference.

> Once that is a given (for safety these need to be explicitly whitelisted
> in runtime_patch()), use a state-machine with the primary CPU doing the
> patching and secondary CPUs in a sync_core() loop. 
> 
> In case we hit an INT3/BP (in NMI or thread-context) we makes forward
> progress by continuing the patching instead of emulating.
> 
> One remaining issue is inter-dependent pv-ops which are also executed in
> the NMI handler -- patching can potentially deadlock in case of multiple
> NMIs. Handle these by pushing some of this work in the NMI handler where
> we know it will be uninterrupted.

I'm just seeing a lot of bonghits without sane rationale. Why is any of
this important?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08  5:02 ` Ankur Arora
@ 2020-04-08 12:28   ` Jürgen Groß
  -1 siblings, 0 replies; 93+ messages in thread
From: Jürgen Groß @ 2020-04-08 12:28 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, bp, vkuznets, pbonzini,
	boris.ostrovsky, mihai.carabas, kvm, xen-devel, virtualization

On 08.04.20 07:02, Ankur Arora wrote:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the

Then this hint is wrong if it can't be guaranteed.

> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.

I think using pvops for such a feature change is just wrong.

What comes next? Using pvops for being able to migrate a guest from an
Intel to an AMD machine?

...

> There are four main sets of patches in this series:
> 
>   1. PV-ops management (patches 1-10, 20): mostly infrastructure and
>   refactoring pieces to make paravirt patching usable at runtime. For the
>   most part scoped under CONFIG_PARAVIRT_RUNTIME.
> 
>   Patches 1-7, to persist part of parainstructions in memory:
>    "x86/paravirt: Specify subsection in PVOP macros"
>    "x86/paravirt: Allow paravirt patching post-init"
>    "x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME"
>    "x86/alternatives: Refactor alternatives_smp_module*
>    "x86/alternatives: Rename alternatives_smp*, smp_alt_module
>    "x86/alternatives: Remove stale symbols
>    "x86/paravirt: Persist .parainstructions.runtime"
> 
>   Patches 8-10, develop the inerfaces to safely switch pv-ops:
>    "x86/paravirt: Stash native pv-ops"
>    "x86/paravirt: Add runtime_patch()"
>    "x86/paravirt: Add primitives to stage pv-ops"
> 
>   Patch 20 enables switching of pv_lock_ops:
>    "x86/paravirt: Enable pv-spinlocks in runtime_patch()"
> 
>   2. Non-emulated text poking (patches 11-19)
> 
>   Patches 11-13 are mostly refactoring to split __text_poke() into map,
>   unmap and poke/memcpy phases with the poke portion being re-entrant
>    "x86/alternatives: Remove return value of text_poke*()"
>    "x86/alternatives: Use __get_unlocked_pte() in text_poke()"
>    "x86/alternatives: Split __text_poke()"
> 
>   Patches 15, 17 add the actual poking state-machine:
>    "x86/alternatives: Non-emulated text poking"
>    "x86/alternatives: Add patching logic in text_poke_site()"
> 
>   with patches 14 and 18 containing the pieces for BP handling:
>    "x86/alternatives: Handle native insns in text_poke_loc*()"
>    "x86/alternatives: Handle BP in non-emulated text poking"
> 
>   and patch 19 provides the ability to use the state-machine above in an
>   NMI context (fixes some potential deadlocks when handling inter-
>   dependent operations and multiple NMIs):
>    "x86/alternatives: NMI safe runtime patching".
> 
>   Patch 16 provides the interface (paravirt_runtime_patch()) to use the
>   poking mechanism developed above and patch 21 adds a selftest:
>    "x86/alternatives: Add paravirt patching at runtime"
>    "x86/alternatives: Paravirt runtime selftest"
> 
>   3. KVM guest changes to be able to use this (patches 22-23,25-26):
>    "kvm/paravirt: Encapsulate KVM pv switching logic"
>    "x86/kvm: Add worker to trigger runtime patching"
>    "x86/kvm: Guest support for dynamic hints"
>    "x86/kvm: Add hint change notifier for KVM_HINT_REALTIME".
> 
>   4. KVM host changes to notify the guest of a change (patch 24):
>    "x86/kvm: Support dynamic CPUID hints"
> 
> Testing:
> With paravirt patching, the code is mostly stable on Intel and AMD
> systems under kernbench and locktorture with paravirt toggling (with,
> without synthetic NMIs) in the background.
> 
> Queued spinlock performance for locktorture is also on expected lines:
>   [ 1533.221563] Writes:  Total: 1048759000  Max/Min: 0/0   Fail: 0
>   # toggle PV spinlocks
> 
>   [ 1594.713699] Writes:  Total: 1111660545  Max/Min: 0/0   Fail: 0
>   # PV spinlocks (in ~60 seconds) = 62,901,545
> 
>   # toggle native spinlocks
>   [ 1656.117175] Writes:  Total: 1113888840  Max/Min: 0/0   Fail: 0
>    # native spinlocks (in ~60 seconds) = 2,228,295
> 
> The alternatives testing is more limited with it being used to rewrite
> mostly harmless X86_FEATUREs with load in the background.
> 
> Patches also at:
> 
> ssh://git@github.com/terminus/linux.git alternatives-rfc-upstream-v1
> 
> Please review.
> 
> Thanks
> Ankur
> 
> [1] The precise change in memory footprint depends on config options
> but the following example inlines queued_spin_unlock() (which forms
> the bulk of the added state). The added footprint is the size of the
> .parainstructions.runtime section:
> 
>   $ objdump -h vmlinux|grep .parainstructions
>   Idx Name              		Size      VMA
>   	LMA                File-off  Algn
>    27 .parainstructions 		0001013c  ffffffff82895000
>    	0000000002895000   01c95000  2**3
>    28 .parainstructions.runtime  0000cd2c  ffffffff828a5140
>    	00000000028a5140   01ca5140  2**3
> 
>    $ size vmlinux
>    text       data       bss        dec      hex       filename
>    13726196   12302814   14094336   40123346 2643bd2   vmlinux
> 
> Ankur Arora (26):
>    x86/paravirt: Specify subsection in PVOP macros
>    x86/paravirt: Allow paravirt patching post-init
>    x86/paravirt: PVRTOP macros for PARAVIRT_RUNTIME
>    x86/alternatives: Refactor alternatives_smp_module*
>    x86/alternatives: Rename alternatives_smp*, smp_alt_module
>    x86/alternatives: Remove stale symbols
>    x86/paravirt: Persist .parainstructions.runtime
>    x86/paravirt: Stash native pv-ops
>    x86/paravirt: Add runtime_patch()
>    x86/paravirt: Add primitives to stage pv-ops
>    x86/alternatives: Remove return value of text_poke*()
>    x86/alternatives: Use __get_unlocked_pte() in text_poke()
>    x86/alternatives: Split __text_poke()
>    x86/alternatives: Handle native insns in text_poke_loc*()
>    x86/alternatives: Non-emulated text poking
>    x86/alternatives: Add paravirt patching at runtime
>    x86/alternatives: Add patching logic in text_poke_site()
>    x86/alternatives: Handle BP in non-emulated text poking
>    x86/alternatives: NMI safe runtime patching
>    x86/paravirt: Enable pv-spinlocks in runtime_patch()
>    x86/alternatives: Paravirt runtime selftest
>    kvm/paravirt: Encapsulate KVM pv switching logic
>    x86/kvm: Add worker to trigger runtime patching
>    x86/kvm: Support dynamic CPUID hints
>    x86/kvm: Guest support for dynamic hints
>    x86/kvm: Add hint change notifier for KVM_HINT_REALTIME
> 
>   Documentation/virt/kvm/api.rst        |  17 +
>   Documentation/virt/kvm/cpuid.rst      |   9 +-
>   arch/x86/Kconfig                      |  14 +
>   arch/x86/Kconfig.debug                |  13 +
>   arch/x86/entry/entry_64.S             |   5 +
>   arch/x86/include/asm/alternative.h    |  20 +-
>   arch/x86/include/asm/kvm_host.h       |   6 +
>   arch/x86/include/asm/kvm_para.h       |  17 +
>   arch/x86/include/asm/paravirt.h       |  10 +-
>   arch/x86/include/asm/paravirt_types.h | 230 ++++--
>   arch/x86/include/asm/text-patching.h  |  18 +-
>   arch/x86/include/uapi/asm/kvm_para.h  |   2 +
>   arch/x86/kernel/Makefile              |   1 +
>   arch/x86/kernel/alternative.c         | 987 +++++++++++++++++++++++---
>   arch/x86/kernel/kvm.c                 | 191 ++++-
>   arch/x86/kernel/module.c              |  42 +-
>   arch/x86/kernel/paravirt.c            |  16 +-
>   arch/x86/kernel/paravirt_patch.c      |  61 ++
>   arch/x86/kernel/pv_selftest.c         | 264 +++++++
>   arch/x86/kernel/pv_selftest.h         |  15 +
>   arch/x86/kernel/setup.c               |   2 +
>   arch/x86/kernel/vmlinux.lds.S         |  16 +
>   arch/x86/kvm/cpuid.c                  |   3 +-
>   arch/x86/kvm/x86.c                    |  39 +
>   include/asm-generic/kvm_para.h        |  12 +
>   include/asm-generic/vmlinux.lds.h     |   8 +
>   include/linux/kvm_para.h              |   5 +
>   include/linux/mm.h                    |  16 +-
>   include/linux/preempt.h               |  17 +
>   include/uapi/linux/kvm.h              |   4 +
>   kernel/locking/lock_events.c          |   2 +-
>   mm/memory.c                           |   9 +-
>   32 files changed, 1850 insertions(+), 221 deletions(-)
>   create mode 100644 arch/x86/kernel/pv_selftest.c
>   create mode 100644 arch/x86/kernel/pv_selftest.h
> 

Quite a lot of code churn and hacks for a problem which should not
occur on a well-administered machine.

Especially the NMI dependencies make me not want to Ack this series.


Juergen

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08 12:08   ` Peter Zijlstra
@ 2020-04-08 13:33     ` Jürgen Groß
  -1 siblings, 0 replies; 93+ messages in thread
From: Jürgen Groß @ 2020-04-08 13:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ankur Arora
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization

On 08.04.20 14:08, Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>> Mechanism: the patching itself is done using stop_machine(). That is
>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>> via text_poke_bp(), but I'm using this to address two issues:
>>   1) emulation in text_poke() can only easily handle a small set
>>   of instructions and this is problematic for inlined pv-ops (and see
>>   a possible alternatives use-case below.)
>>   2) paravirt patching might have inter-dependent ops (ex.
>>   lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>>   need to be updated atomically.)
> 
> And then you hope that the spinlock state transfers.. That is that both
> implementations agree what an unlocked spinlock looks like.
> 
> Suppose the native one was a ticket spinlock, where unlocked means 'head
> == tail' while the paravirt one is a test-and-set spinlock, where
> unlocked means 'val == 0'.
> 
> That just happens to not be the case now, but it was for a fair while.

Sure? That would mean that no lock is allowed to be used in the kernel
before the spinlock pv-ops are set, because otherwise it would block the
boot-time transition to the chosen lock variant.
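
(To make the layout concern concrete, here is a minimal sketch -- with
made-up struct layouts, not the kernel's real lock types -- of why both
variants have to agree on the in-memory "unlocked" state before the ops
can be switched under live locks:)

  #include <stdbool.h>

  /* Hypothetical stand-ins for the two lock flavours. */
  struct tas_lock    { unsigned int val; };          /* unlocked: val == 0     */
  struct ticket_lock { unsigned short head, tail; }; /* unlocked: head == tail */

  /*
   * A ticket lock that has been taken and released three times sits at
   * head == tail == 3, i.e. "unlocked".  Reinterpreting that same memory
   * with the test-and-set rules sees val == 0x00030003 != 0, i.e. a lock
   * that is never released.  Switching ops at runtime is only safe if the
   * two representations coincide for every reachable lock state.
   */
  static bool tas_is_unlocked(struct tas_lock *l)       { return l->val == 0; }
  static bool ticket_is_unlocked(struct ticket_lock *t) { return t->head == t->tail; }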

Another problem I'm seeing is that runtime pvops patching would rely on
the fact that stop_machine() isn't guarded by a spinlock.


Juergen

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08  5:02 ` Ankur Arora
  (?)
@ 2020-04-08 14:12   ` Thomas Gleixner
  -1 siblings, 0 replies; 93+ messages in thread
From: Thomas Gleixner @ 2020-04-08 14:12 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization, Ankur Arora

Ankur Arora <ankur.a.arora@oracle.com> writes:
> A KVM host (or another hypervisor) might advertise paravirtualized
> features and optimization hints (ex KVM_HINTS_REALTIME) which might
> become stale over the lifetime of the guest. For instance, the
> host might go from being undersubscribed to being oversubscribed
> (or the other way round) and it would make sense for the guest
> switch pv-ops based on that.

If your host changes its advertised behaviour then you want to fix the
host setup or find a competent admin.

> This locktorture splat that I saw on the guest while testing this is
> indicative of the problem:
>
>   [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
>   [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
>   [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>
> (Caused by an oversubscribed host but using mismatched native pv_lock_ops
> on the guest.)

And this illustrates what? The fact that you used a misconfigured setup.

> This series addresses the problem by doing paravirt switching at
> runtime.

You're not addressing the problem. You're fixing the symptom, which is
wrong to begin with.

> The alternative use-case is a runtime version of apply_alternatives()
> (not posted with this patch-set) that can be used for some safe subset
> of X86_FEATUREs. This could be useful in conjunction with the ongoing
> late microcode loading work that Mihai Carabas and others have been
> working on.

This has been discussed to death before and there is no safe subset as
long as this hasn't been resolved:

  https://lore.kernel.org/lkml/alpine.DEB.2.21.1909062237580.1902@nanos.tec.linutronix.de/

Thanks,

        tglx

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08 13:33     ` Jürgen Groß
  (?)
@ 2020-04-08 14:49       ` Peter Zijlstra
  -1 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2020-04-08 14:49 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Ankur Arora, linux-kernel, x86, hpa, jpoimboe, namit, mhiramat,
	bp, vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On Wed, Apr 08, 2020 at 03:33:52PM +0200, Jürgen Groß wrote:
> On 08.04.20 14:08, Peter Zijlstra wrote:
> > On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
> > > Mechanism: the patching itself is done using stop_machine(). That is
> > > not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
> > > via text_poke_bp(), but I'm using this to address two issues:
> > >   1) emulation in text_poke() can only easily handle a small set
> > >   of instructions and this is problematic for inlined pv-ops (and see
> > >   a possible alternatives use-case below.)
> > >   2) paravirt patching might have inter-dependent ops (ex.
> > >   lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
> > >   need to be updated atomically.)
> > 
> > And then you hope that the spinlock state transfers.. That is that both
> > implementations agree what an unlocked spinlock looks like.
> > 
> > Suppose the native one was a ticket spinlock, where unlocked means 'head
> > == tail' while the paravirt one is a test-and-set spinlock, where
> > unlocked means 'val == 0'.
> > 
> > That just happens to not be the case now, but it was for a fair while.
> 
> Sure? That would mean that no lock is allowed to be used in the kernel
> before the spinlock pv-ops are set, because otherwise it would block the
> boot-time transition to the chosen lock variant.

Hurm.. true. I suppose I completely forgot how paravirt spinlocks looked
before they got rewritten.

> Another problem I'm seeing is that runtime pvops patching would rely on
> the fact that stop_machine() isn't guarded by a spinlock.

It can't be, stop_machine() relies on scheduling. But yes, that's another
variation of 'stuff uses spinlocks'.

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08 12:28   ` Jürgen Groß
@ 2020-04-10  7:56     ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-10  7:56 UTC (permalink / raw)
  To: Jürgen Groß, linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, bp, vkuznets, pbonzini,
	boris.ostrovsky, mihai.carabas, kvm, xen-devel, virtualization

[-- Attachment #1: Type: text/plain, Size: 3601 bytes --]

So, first thanks for the quick comments even though some of my choices
were straight NAKs (or maybe because of that!)

Second, I clearly did a bad job of motivating the series. Let me try
to address the motivation comments first and then I can address the
technical concerns separately.

[ I'm collating all the motivation comments below. ]


>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the 

Thomas> If your host changes his advertised behaviour then you want to
Thomas> fix the host setup or find a competent admin.

Juergen> Then this hint is wrong if it can't be guaranteed.

I agree, the hint behaviour is wrong and the host shouldn't be giving
hints it can only temporarily honor.
The host problem is hard to fix though: the behaviour change is
either because of a guest migration or, in the case of a hosted guest,
cloud economics -- customers want to go to a 2-1 or worse VCPU-CPU
ratio at times of low load.

I had an offline discussion with Paolo Bonzini where he agreed that
it makes sense to make KVM_HINTS_REALTIME a dynamic hint rather than
static as it is now. (That was really the starting point for this
series.)

>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> switch pv-ops based on that.

Juergen> I think using pvops for such a feature change is just wrong.
Juergen> What comes next? Using pvops for being able to migrate a guest
Juergen> from an Intel to an AMD machine?

My statement about switching pv-ops was too broadly worded. What
I meant to say was that KVM guests choose pv_lock_ops to be native
or paravirt based on the undersubscribed/oversubscribed hint at boot,
and this choice should be available at run-time as well.

KVM chooses between native/paravirt spinlocks at boot based on this
reasoning (from commit b2798ba0b8):
"Waiman Long mentioned that:
> Generally speaking, unfair lock performs well for VMs with a small
> number of vCPUs. Native qspinlock may perform better than pvqspinlock
> if there is vCPU pinning and there is no vCPU over-commitment.
"

PeterZ> So what, the paravirt spinlock stuff works just fine when
PeterZ> you're not oversubscribed.
Yeah, the paravirt spinlocks work fine for both under- and oversubscribed
hosts, but they are more expensive and that extra cost provides no benefit
when CPUs are pinned.
For instance, pvqueued spin_unlock() is a call+locked cmpxchg as opposed
to just a movb $0, (%rdi).
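
(Roughly -- a simplified sketch of the two unlock paths, paraphrased from
what native_queued_spin_unlock() and __pv_queued_spin_unlock() do, not the
literal kernel code:)

  /* Native unlock: one release store, compiled to movb $0,(%rdi). */
  static inline void native_unlock_sketch(struct qspinlock *lock)
  {
          smp_store_release(&lock->locked, 0);
  }

  /*
   * Paravirt unlock: a locked cmpxchg on every unlock, plus a slowpath
   * (and ultimately a hypercall) if a waiter parked itself in pv_wait().
   */
  static inline void pv_unlock_sketch(struct qspinlock *lock)
  {
          u8 locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);

          if (unlikely(locked != _Q_LOCKED_VAL))
                  __pv_queued_spin_unlock_slowpath(lock, locked);
  }

This also shows why the slowpath/unlock pair has to be switched atomically:
a vCPU halted in pv_wait() is only woken by the kick issued from the
paravirt unlock slowpath.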

This difference shows up in kernbench running on a KVM guest with native
and paravirt spinlocks. I ran with 8 and 64 CPU guests with CPUs pinned.

The native version performs the same or better (the numbers below are
kernbench system time, in seconds).

8 CPU       Native  (std-dev)  Paravirt (std-dev)
             -----------------  -----------------
-j  4: sys  151.89  ( 0.2462)  160.14   ( 4.8366)    +5.4%
-j 32: sys  162.715 (11.4129)  170.225  (11.1138)    +4.6%
-j  0: sys  164.193 ( 9.4063)  170.843  ( 8.9651)    +4.0%


64 CPU       Native  (std-dev)  Paravirt (std-dev)
             -----------------  -----------------
-j  32: sys 209.448 (0.37009)  210.976   (0.4245)    +0.7%
-j 256: sys 267.401 (61.0928)  285.73   (78.8021)    +6.8%
-j   0: sys 286.313 (56.5978)  307.721  (70.9758)    +7.4%

In all cases the pv_kick, pv_wait numbers were minimal as expected.
The lock_slowpath counts were higher with PV but AFAICS the native
and paravirt lock_slowpath are not directly comparable.

Detailed kernbench numbers attached.

Thanks
Ankur

[-- Attachment #2: 8-cpus.txt --]
[-- Type: text/plain, Size: 1340 bytes --]

8-cpu-pinned,native
==================

Average Half load -j 4 Run (std deviation):
Elapsed Time 303.686 (0.737652)
User Time 1032.24 (2.8133)
System Time 151.89 (0.246272)
Percent CPU 389.2 (0.447214)
Context Switches 19350.4 (82.1785)
Sleeps 125885 (148.338)

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 187.068 (0.358427)
User Time 1130.33 (103.405)
System Time 162.715 (11.4129)
Percent CPU 569.1 (189.633)
Context Switches 143301 (130656)
Sleeps 126938 (1132.83)

Average Maximal load -j Run (std deviation):
Elapsed Time 189.098 (0.316812)
User Time 1166.59 (98.4454)
System Time 164.193 (9.4063)
Percent CPU 627.133 (174.169)
Context Switches 222270 (156005)
Sleeps 122562 (6470.93)

8-cpu-pinned, pv
================

Average Half load -j 4 Run (std deviation):
Elapsed Time 309.872 (5.882)
User Time 1045.8 (18.5295)
System Time 160.14 (4.83669)
Percent CPU 388.8 (0.447214)
Context Switches 41215.4 (679.522)
Sleeps 122369 (477.593)

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 190.1 (0.377823)
User Time 1144 (104.248)
System Time 170.225 (11.1138)
Percent CPU 568.2 (189.107)

Average Maximal load -j Run (std deviation):
Elapsed Time 191.606 (0.108305)
User Time 1178.83 (97.908)
System Time 170.843 (8.9651)
Percent CPU 625.8 (173.49)
Context Switches 234878 (149479)
Sleeps 120542 (6073.79)

[-- Attachment #3: 64-cpus.txt --]
[-- Type: text/plain, Size: 1414 bytes --]

64-cpu-pinned, native
=====================

Average Half load -j 32 Run (std deviation):
Elapsed Time 54.306 (0.134833)
User Time 1072.75 (1.34598)
System Time 209.448 (0.370095)
Percent CPU 2360.4 (4.03733)
Context Switches 26999 (99.5414)
Sleeps 122408 (184.87)

Average Optimal load -j 256 Run (std deviation):
Elapsed Time 39.424 (0.150599)
User Time 1140.91 (71.8722)
System Time 267.401 (61.0928)
Percent CPU 3125.9 (806.96)
Context Switches 129662 (108217)
Sleeps 121767 (699.198)

Average Maximal load -j Run (std deviation):
Elapsed Time 41.562 (0.206083)
User Time 1174.68 (75.9342)
System Time 286.313 (56.5978)
Percent CPU 3339.87 (719.062)
Context Switches 203428 (138536)
Sleeps 119066 (3993.58)

64-cpu-pinned, pv
================
Average Half load -j 32 Run (std deviation):
Elapsed Time 55.14 (0.0894427)
User Time 1071.99 (1.43335)
System Time 210.976 (0.424594)
Percent CPU 2326 (4.52769)
Context Switches 37544.8 (220.969)
Sleeps 115527 (94.7138)

Average Optimal load -j 256 Run (std deviation):
Elapsed Time 40.54 (0.246779)
User Time 1137.41 (68.9773)
System Time 285.73 (78.8021)
Percent CPU 3090.7 (806.218)
Context Switches 139059 (107006)
Sleeps 116962 (1518.56)

Average Maximal load -j Run (std deviation):
Elapsed Time 42.682 (0.170939)
User Time 1171.64 (74.6663)
System Time 307.721 (70.9758)
Percent CPU 3303.27 (717.418)
Context Switches 213430 (138616)
Sleeps 115143 (2930.03)


* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08 12:08   ` Peter Zijlstra
@ 2020-04-10  9:18     ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-10  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, hpa, jpoimboe, namit, mhiramat, jgross, bp,
	vkuznets, pbonzini, boris.ostrovsky, mihai.carabas, kvm,
	xen-devel, virtualization

On 2020-04-08 5:08 a.m., Peter Zijlstra wrote:
> On Tue, Apr 07, 2020 at 10:02:57PM -0700, Ankur Arora wrote:
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> switch pv-ops based on that.
> 
> So what, the paravirt spinlock stuff works just fine when you're not
> oversubscribed.
> 
>> We keep an interesting subset of pv-ops (pv_lock_ops only for now,
>> but PV-TLB ops are also good candidates)
> 
> The PV-TLB ops also work just fine when not oversubscribed. IIRC
> kvm_flush_tlb_others() is pretty much the same in that case.
> 
>> in .parainstructions.runtime,
>> while discarding the .parainstructions as usual at init. This is then
>> used for switching back and forth between native and paravirt mode.
>> ([1] lists some representative numbers of the increased memory
>> footprint.)
>>
>> Mechanism: the patching itself is done using stop_machine(). That is
>> not ideal -- text_poke_stop_machine() was replaced with INT3+emulation
>> via text_poke_bp(), but I'm using this to address two issues:
>>   1) emulation in text_poke() can only easily handle a small set
>>   of instructions and this is problematic for inlined pv-ops (and see
>>   a possible alternatives use-case below.)
>>   2) paravirt patching might have inter-dependent ops (ex.
>>   lock.queued_lock_slowpath, lock.queued_lock_unlock are paired and
>>   need to be updated atomically.)
> 
> And then you hope that the spinlock state transfers.. That is that both
> implementations agree what an unlocked spinlock looks like.
> 
> Suppose the native one was a ticket spinlock, where unlocked means 'head
> == tail' while the paravirt one is a test-and-set spinlock, where
> unlocked means 'val == 0'.
> 
> That just happens to not be the case now, but it was for a fair while.
> 
>> The alternative use-case is a runtime version of apply_alternatives()
>> (not posted with this patch-set) that can be used for some safe subset
>> of X86_FEATUREs. This could be useful in conjunction with the ongoing
>> late microcode loading work that Mihai Carabas and others have been
>> working on.
> 
> The whole late-microcode loading stuff is crazy already; you're making
> it take double bonghits.
That's fair. I was talking in a fairly limited sense, e.g. making
static_cpu_has() catch up with boot_cpu_has() after a microcode update,
but I should have specified that.
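
(For illustration only -- a kernel-context sketch, not part of this series,
with X86_FEATURE_ERMS as a stand-in feature -- of the distinction I mean:)

  /*
   * static_cpu_has() is resolved by apply_alternatives() during boot, so
   * every call site keeps the boot-time answer.  boot_cpu_has() tests the
   * capability bitmap at call time, so it could follow a bit that only
   * gets re-derived after a late microcode update -- which is the gap a
   * runtime re-patch would close.
   */
  static bool boot_time_answer(void)
  {
          return static_cpu_has(X86_FEATURE_ERMS);  /* patched once, at boot */
  }

  static bool runtime_answer(void)
  {
          return boot_cpu_has(X86_FEATURE_ERMS);    /* re-read on every call */
  }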

> 
>> Also, there are points of similarity with the ongoing static_call work
>> which does rewriting of indirect calls.
> 
> Only in so far as that code patching is involved. An analogy would be
> comparing having a beer with shooting dope. They're both 'drugs'.
I meant closer to updating indirect pointers, like static_call_update()
semantics. But of course I don't know static_call code well enough.

> 
>> The difference here is that
>> we need to switch a group of calls atomically and given that
>> some of them can be inlined, need to handle a wider variety of opcodes.
>>
>> To patch safely we need to satisfy these constraints:
>>
>>   - No references to insn sequences under replacement on any kernel stack
>>     once replacement is in progress. Without this constraint we might end
>>     up returning to an address that is in the middle of an instruction.
> 
> Both ftrace and optprobes have that issue, neither of them are quite as
> crazy as this.
I did look at ftrace. Will look at optprobes. Thanks.

> 
>>   - handle inter-dependent ops: as above, lock.queued_lock_unlock(),
>>     lock.queued_lock_slowpath() and the rest of the pv_lock_ops are
>>     a good example.
> 
> While I'm sure this is a fun problem, why are we solving it?
> 
>>   - handle a broader set of insns than CALL and JMP: some pv-ops end up
>>     getting inlined. Alternatives can contain arbitrary instructions.
> 
> So can optprobes.
> 
>>   - locking operations can be called from interrupt handlers which means
>>     we cannot trivially use IPIs for flushing.
> 
> Heck, some NMI handlers use locks..
This does handle the NMI locking problem. The solution -- doing it
in the NMI handler -- was of course pretty ugly.

>> Handling these, necessitates that target pv-ops not be preemptible.
> 
> I don't think that is a correct inference.
The non-preemptibility requirement was to ensure that any pv-op under
replacement not be under execution after it is patched out.
(Not a concern for pv_lock_ops.)

Ensuring that we don't return to an address in the middle of an instruction
could be done by moving the NOPs in the prefix, but I couldn't think of
any other way to ensure that a function not be under execution.

Thanks
Ankur

>> Once that is a given (for safety these need to be explicitly whitelisted
>> in runtime_patch()), use a state-machine with the primary CPU doing the
>> patching and secondary CPUs in a sync_core() loop.
>>
>> In case we hit an INT3/BP (in NMI or thread-context) we make forward
>> progress by continuing the patching instead of emulating.
>>
>> One remaining issue is inter-dependent pv-ops which are also executed in
>> the NMI handler -- patching can potentially deadlock in case of multiple
>> NMIs. Handle these by pushing some of this work in the NMI handler where
>> we know it will be uninterrupted.
> 
> I'm just seeing a lot of bonghits without sane rationale. Why is any of
> this important?
> 

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08 12:28   ` Jürgen Groß
@ 2020-04-10  9:32     ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-10  9:32 UTC (permalink / raw)
  To: Jürgen Groß, linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, bp, vkuznets, pbonzini,
	boris.ostrovsky, mihai.carabas, kvm, xen-devel, virtualization

On 2020-04-08 5:28 a.m., Jürgen Groß wrote:
> On 08.04.20 07:02, Ankur Arora wrote:
[ snip ]
> 
> Quite a lot of code churn and hacks for a problem which should not
> occur on a well administered machine.
Yeah, I agree the patch set is pretty large, and clearly the NMI and
stop_machine() approaches are out. That said, as I wrote in my
other mail, I think the problem is still worth solving.

> Especially the NMI dependencies make me not want to Ack this series.
The NMI solution did turn out to be pretty ugly.

I was using it to solve two problems: avoiding a deadlock where an NMI
handler could use a lock while the stop_machine() thread is trying to
rewrite the corresponding call-sites, and ensuring that we don't lock
and unlock using mismatched primitives.


Thanks
Ankur

> 
> 
> Juergen

* Re: [RFC PATCH 00/26] Runtime paravirt patching
  2020-04-08 14:12   ` Thomas Gleixner
@ 2020-04-10  9:55     ` Ankur Arora
  -1 siblings, 0 replies; 93+ messages in thread
From: Ankur Arora @ 2020-04-10  9:55 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, x86
  Cc: peterz, hpa, jpoimboe, namit, mhiramat, jgross, bp, vkuznets,
	pbonzini, boris.ostrovsky, mihai.carabas, kvm, xen-devel,
	virtualization

On 2020-04-08 7:12 a.m., Thomas Gleixner wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
>> A KVM host (or another hypervisor) might advertise paravirtualized
>> features and optimization hints (ex KVM_HINTS_REALTIME) which might
>> become stale over the lifetime of the guest. For instance, the
>> host might go from being undersubscribed to being oversubscribed
>> (or the other way round) and it would make sense for the guest
>> switch pv-ops based on that.
> 
> If your host changes its advertised behaviour then you want to fix the
> host setup or find a competent admin.
> 
>> This locktorture splat that I saw on the guest while testing this is
>> indicative of the problem:
>>
>>    [ 1136.461522] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [lock_torture_wr:12865]
>>    [ 1136.461542] CPU: 8 PID: 12865 Comm: lock_torture_wr Tainted: G W L 5.4.0-rc7+ #77
>>    [ 1136.461546] RIP: 0010:native_queued_spin_lock_slowpath+0x15/0x220
>>
>> (Caused by an oversubscribed host but using mismatched native pv_lock_ops
>> on the guest.)
> 
> And this illustrates what? The fact that you used a misconfigured setup.
> 
>> This series addresses the problem by doing paravirt switching at
>> runtime.
> 
> You're not addressing the problem. You're fixing the symptom, which is
> wrong to begin with.
> 
>> The alternative use-case is a runtime version of apply_alternatives()
>> (not posted with this patch-set) that can be used for some safe subset
>> of X86_FEATUREs. This could be useful in conjunction with the ongoing
>> late microcode loading work that Mihai Carabas and others have been
>> working on.
> 
> This has been discussed to death before and there is no safe subset as
> long as this hasn't been resolved:
> 
>    https://lore.kernel.org/lkml/alpine.DEB.2.21.1909062237580.1902@nanos.tec.linutronix.de/
Thanks. I was thinking of a fairly limited subset, e.g. re-evaluating
X86_FEATURE_ALWAYS to make sure static_cpu_has() reflects reality,
but I guess that has second-order effects here.

Ankur

> 
> Thanks,
> 
>          tglx
> 
