* [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they do not necessarily have
to execute the callbacks immediately. Anything that only affects kernelspace
can wait until the next user->kernel transition, provided it can be executed
"early enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
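
To illustrate the idea, here is a rough sketch of such a scheme. The
identifiers below are made up for this sketch and do not match the actual
patches (which squash the pending bits into context_tracking's existing .state
field, see the v1->v2 changelog): the sender sets a bit identifying the IPI
type in a per-CPU mask rather than queueing a CSD, and the target CPU runs the
matching callback on its next kernel entry. Setting an already-set bit is a
no-op, which is what makes the deferral coalescing:

  /* Illustrative sketch only; identifiers don't match the series' code. */
  enum deferred_work {
          DEFERRED_WORK_SYNC_CORE,        /* deferred text_poke() sync */
          DEFERRED_WORK_TLB_FLUSH,        /* deferred kernel TLB flush */
          DEFERRED_WORK_MAX,
  };

  static DEFINE_PER_CPU(unsigned long, deferred_work_pending);

  static void do_sync_core_work(void) { sync_core(); }
  static void do_tlb_flush_work(void) { __flush_tlb_all(); }

  static void (*deferred_work_fns[DEFERRED_WORK_MAX])(void) = {
          [DEFERRED_WORK_SYNC_CORE] = do_sync_core_work,
          [DEFERRED_WORK_TLB_FLUSH] = do_tlb_flush_work,
  };

  /*
   * Sender side: mark the work as pending instead of queueing a CSD.
   * The real series additionally has to atomically check that the target
   * CPU really is in userspace, and fall back to a regular IPI otherwise.
   */
  static void defer_ipi(int cpu, enum deferred_work work)
  {
          set_bit(work, per_cpu_ptr(&deferred_work_pending, cpu));
  }

  /* Target side: runs early in the next user->kernel transition */
  static void deferred_work_flush(void)
  {
          unsigned long work = xchg(this_cpu_ptr(&deferred_work_pending), 0);
          int bit;

          for_each_set_bit(bit, &work, DEFERRED_WORK_MAX)
                  deferred_work_fns[bit]();
  }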

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
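
To make that concrete, here is a hypothetical sketch (not what this series
implements) of how a parameter-carrying IPI such as a ranged kernel TLB flush
could be handled, reusing the made-up defer_ipi() helper from the sketch
above and the existing do_kernel_range_flush() helper visible in the trace
above. The argument is simply over-approximated away so that the deferred
callback stays argument-less, which is in spirit what patches 19-20 do:

  /* Hypothetical sketch, not the series' implementation. */
  static void flush_tlb_kernel_range_deferrable(unsigned long start,
                                                unsigned long end)
  {
          struct flush_tlb_info info = { .start = start, .end = end };
          int cpu;

          for_each_online_cpu(cpu) {
                  if (!tick_nohz_full_cpu(cpu)) {
                          /* Housekeeping CPUs: precise ranged flush, now */
                          smp_call_function_single(cpu, do_kernel_range_flush,
                                                   &info, 1);
                  } else {
                          /*
                           * Isolated CPUs: drop the range; the deferred
                           * callback over-approximates with a full kernel
                           * TLB flush on the next kernel entry.
                           */
                          defer_ipi(cpu, DEFERRED_WORK_TLB_FLUSH);
                  }
          }
  }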

Kernel entry vs execution of the deferred operation
===================================================

There is a non-zero length of code that is executed upon kernel entry before
the deferred operation can itself be executed (i.e. before we start getting
into context_tracking.c proper).

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't
want to hit instructions that have been modified by a remote text_poke() while
we're on our way to execute a deferred sync_core().
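
In other words, the flush of deferred work must itself be noinstr, and nothing
executed before it may depend on state that the deferred operations are meant
to fix up. A sketch of what such a flush looks like for the two operations
deferred by this series (names made up; see also the __flush_tlb_all() noinstr
warning mentioned below):

  /*
   * Sketch: runs in the noinstr entry path, before any instrumentation.
   * Until this has run, the CPU must not execute any patchable
   * instruction (static branch, static call, ...), hence the objtool
   * checks of patches 10-14.
   */
  static noinstr void deferred_work_flush_all(void)
  {
          sync_core();            /* serialise against remote text_poke() */
          __flush_tlb_all();      /* catch up with deferred kernel TLBIs */
  }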

Patches
=======

o Patches 1-9 have been submitted separately and are included for the sake of
  testing.

o Patches 10-14 focus on having objtool detect problematic static key usage in
  early entry.

o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure.
o Patch 18 adds text_poke() IPI deferral.

o Patches 19-20 add vunmap()/flush_tlb_kernel_range() IPI deferral.

  These ones I'm a lot less confident about, mostly due to lacking
  instrumentation/verification.
  
  The actual deferred callback is also incomplete as it's not properly noinstr:
    vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
  and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
  have *no idea* what a sane implementation would be for Xen so I haven't
  touched that yet.

Patches are also available at:

https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2

Testing
=======

Note: this is a different machine than the one used for v1, because that
machine decided to act difficult.

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu"    -f "cpu & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
                   rteval --onlyload --loads-cpulist=$HK_CPUS \
                          --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 30 minutes.

v6.5-rc1 (+ cpumask filtering patches):
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    338 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   9207 func=do_flush_tlb_all
   1116 func=do_sync_core
     62 func=do_kernel_range_flush
      3 func=nohz_full_kick_func

v6.5-rc1 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
      2 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
      2 func=nohz_full_kick_func

The incriminating IPIs are all gone, but note that on the machine I used to test
v1 there were still some do_flush_tlb_all() IPIs caused by
pcpu_balance_workfn(), since only vmalloc is affected by the deferral
mechanism.


Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
  .data..ro_after_init section.

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE

Revisions
=========

RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul)

  The new TREE11 case with a 2-bit dynticks counter seems to pass when run
  against this series.

o Fixed flush_tlb_kernel_range_deferrable() definition

Peter Zijlstra (1):
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys

Valentin Schneider (19):
  tracing/filters: Dynamically allocate filter_pred.regex
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Optimise cpumask vs cpumask filtering when user mask
    is a single CPU
  tracing/filters: Optimise scalar vs cpumask filtering when the user
    mask is a single CPU
  tracing/filters: Optimise CPU vs cpumask filtering when the user mask
    is a single CPU
  tracing/filters: Further optimise scalar vs cpumask comparison
  tracing/filters: Document cpumask filtering
  objtool: Flesh out warning related to pv_ops[] calls
  objtool: Warn about non __ro_after_init static key usage in .noinstr
  context_tracking: Make context_tracking_key __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context-tracking: Introduce work deferral infrastructure
  rcu: Make RCU dynticks counter size configurable
  rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  context_tracking,x86: Defer kernel text patching IPIs
  context_tracking,x86: Add infrastructure to defer kernel TLBI
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
    CPUs

 Documentation/trace/events.rst                |  14 +
 arch/Kconfig                                  |   9 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/context_tracking_work.h  |  20 ++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   2 +
 arch/x86/kernel/alternative.c                 |  24 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/kvm.c                         |   2 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/mm/tlb.c                             |  40 ++-
 include/asm-generic/sections.h                |   5 +
 include/linux/context_tracking.h              |  26 ++
 include/linux/context_tracking_state.h        |  65 +++-
 include/linux/context_tracking_work.h         |  28 ++
 include/linux/jump_label.h                    |   1 +
 include/linux/trace_events.h                  |   1 +
 init/main.c                                   |   1 +
 kernel/context_tracking.c                     |  53 ++-
 kernel/jump_label.c                           |  49 +++
 kernel/rcu/Kconfig                            |  33 ++
 kernel/time/Kconfig                           |   5 +
 kernel/trace/trace_events_filter.c            | 302 ++++++++++++++++--
 mm/vmalloc.c                                  |  19 +-
 tools/objtool/check.c                         |  22 +-
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/special.h       |   2 +
 tools/objtool/special.c                       |   3 +
 .../selftests/rcutorture/configs/rcu/TREE11   |  19 ++
 .../rcutorture/configs/rcu/TREE11.boot        |   1 +
 31 files changed, 695 insertions(+), 64 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

--
2.31.1



* [RFC PATCH v2 01/20] tracing/filters: Dynamically allocate filter_pred.regex
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array
in the regex field, even if the predicate function does not use the field.

A later commit will introduce a dynamically allocated cpumask to struct
filter_pred, which will require a dedicated freeing function. Bite the
bullet and make filter_pred.regex dynamically allocated.

While at it, reorder the fields of filter_pred to fill in the byte
holes. The struct now fits on a single cacheline.

No change in behaviour intended.

The kfree() calls were patched via Coccinelle:
  @@
  struct filter_pred *pred;
  @@

  -kfree(pred);
  +free_predicate(pred);

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 64 ++++++++++++++++++------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 1dad64267878c..91fc9990107f1 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -70,15 +70,15 @@ enum filter_pred_fn {
 };
 
 struct filter_pred {
-	enum filter_pred_fn 	fn_num;
-	u64 			val;
-	u64 			val2;
-	struct regex		regex;
+	struct regex		*regex;
 	unsigned short		*ops;
 	struct ftrace_event_field *field;
-	int 			offset;
+	u64			val;
+	u64			val2;
+	enum filter_pred_fn	fn_num;
+	int			offset;
 	int			not;
-	int 			op;
+	int			op;
 };
 
 /*
@@ -186,6 +186,14 @@ enum {
 	PROCESS_OR	= 4,
 };
 
+static void free_predicate(struct filter_pred *pred)
+{
+	if (pred) {
+		kfree(pred->regex);
+		kfree(pred);
+	}
+}
+
 /*
  * Without going into a formal proof, this explains the method that is used in
  * parsing the logical expressions.
@@ -623,7 +631,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
 	kfree(inverts);
 	if (prog_stack) {
 		for (i = 0; prog_stack[i].pred; i++)
-			kfree(prog_stack[i].pred);
+			free_predicate(prog_stack[i].pred);
 		kfree(prog_stack);
 	}
 	return ERR_PTR(ret);
@@ -750,7 +758,7 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
 	char *addr = (char *)(event + pred->offset);
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, pred->regex.field_len);
+	cmp = pred->regex->match(addr, pred->regex, pred->regex->field_len);
 
 	match = cmp ^ pred->not;
 
@@ -763,7 +771,7 @@ static __always_inline int filter_pchar(struct filter_pred *pred, char *str)
 	int len;
 
 	len = strlen(str) + 1;	/* including tailing '\0' */
-	cmp = pred->regex.match(str, &pred->regex, len);
+	cmp = pred->regex->match(str, pred->regex, len);
 
 	match = cmp ^ pred->not;
 
@@ -813,7 +821,7 @@ static int filter_pred_strloc(struct filter_pred *pred, void *event)
 	char *addr = (char *)(event + str_loc);
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, str_len);
+	cmp = pred->regex->match(addr, pred->regex, str_len);
 
 	match = cmp ^ pred->not;
 
@@ -836,7 +844,7 @@ static int filter_pred_strrelloc(struct filter_pred *pred, void *event)
 	char *addr = (char *)(&item[1]) + str_loc;
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, str_len);
+	cmp = pred->regex->match(addr, pred->regex, str_len);
 
 	match = cmp ^ pred->not;
 
@@ -874,7 +882,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
 	int cmp;
 
-	cmp = pred->regex.match(current->comm, &pred->regex,
+	cmp = pred->regex->match(current->comm, pred->regex,
 				TASK_COMM_LEN);
 	return cmp ^ pred->not;
 }
@@ -1004,7 +1012,7 @@ enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not)
 
 static void filter_build_regex(struct filter_pred *pred)
 {
-	struct regex *r = &pred->regex;
+	struct regex *r = pred->regex;
 	char *search;
 	enum regex_type type = MATCH_FULL;
 
@@ -1169,7 +1177,7 @@ static void free_prog(struct event_filter *filter)
 		return;
 
 	for (i = 0; prog[i].pred; i++)
-		kfree(prog[i].pred);
+		free_predicate(prog[i].pred);
 	kfree(prog);
 }
 
@@ -1553,9 +1561,12 @@ static int parse_pred(const char *str, void *data,
 			goto err_free;
 		}
 
-		pred->regex.len = len;
-		strncpy(pred->regex.pattern, str + s, len);
-		pred->regex.pattern[len] = 0;
+		pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+		if (!pred->regex)
+			goto err_mem;
+		pred->regex->len = len;
+		strncpy(pred->regex->pattern, str + s, len);
+		pred->regex->pattern[len] = 0;
 
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
@@ -1597,9 +1608,12 @@ static int parse_pred(const char *str, void *data,
 			goto err_free;
 		}
 
-		pred->regex.len = len;
-		strncpy(pred->regex.pattern, str + s, len);
-		pred->regex.pattern[len] = 0;
+		pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+		if (!pred->regex)
+			goto err_mem;
+		pred->regex->len = len;
+		strncpy(pred->regex->pattern, str + s, len);
+		pred->regex->pattern[len] = 0;
 
 		filter_build_regex(pred);
 
@@ -1608,7 +1622,7 @@ static int parse_pred(const char *str, void *data,
 
 		} else if (field->filter_type == FILTER_STATIC_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRING;
-			pred->regex.field_len = field->size;
+			pred->regex->field_len = field->size;
 
 		} else if (field->filter_type == FILTER_DYN_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRLOC;
@@ -1691,10 +1705,10 @@ static int parse_pred(const char *str, void *data,
 	return i;
 
 err_free:
-	kfree(pred);
+	free_predicate(pred);
 	return -EINVAL;
 err_mem:
-	kfree(pred);
+	free_predicate(pred);
 	return -ENOMEM;
 }
 
@@ -2287,8 +2301,8 @@ static int ftrace_function_set_filter_pred(struct filter_pred *pred,
 		return ret;
 
 	return __ftrace_function_set_filter(pred->op == OP_EQ,
-					    pred->regex.pattern,
-					    pred->regex.len,
+					    pred->regex->pattern,
+					    pred->regex->len,
 					    data);
 }
 
-- 
2.31.1



* [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

The recently introduced ipi_send_cpumask trace event contains a cpumask
field, but it currently cannot be used in filter expressions.

Make event filtering aware of cpumask fields, and allow these to be
filtered by a user-provided cpumask.

The user-provided cpumask is to be given in cpulist format and wrapped as:
"CPUS{$cpulist}". The use of curly braces instead of parentheses is to
prevent predicate_parse() from parsing the contents of CPUS{...} as a
full-fledged predicate subexpression.

This enables e.g.:

$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/trace_events.h       |  1 +
 kernel/trace/trace_events_filter.c | 97 +++++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 3930e676436c9..302be73918336 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -807,6 +807,7 @@ enum {
 	FILTER_RDYN_STRING,
 	FILTER_PTR_STRING,
 	FILTER_TRACE_FN,
+	FILTER_CPUMASK,
 	FILTER_COMM,
 	FILTER_CPU,
 	FILTER_STACKTRACE,
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 91fc9990107f1..cb1863dfa280b 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -64,6 +64,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_PCHAR_USER,
 	FILTER_PRED_FN_PCHAR,
 	FILTER_PRED_FN_CPU,
+	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
 	FILTER_PRED_TEST_VISITED,
@@ -71,6 +72,7 @@ enum filter_pred_fn {
 
 struct filter_pred {
 	struct regex		*regex;
+	struct cpumask          *mask;
 	unsigned short		*ops;
 	struct ftrace_event_field *field;
 	u64			val;
@@ -94,6 +96,8 @@ struct filter_pred {
 	C(TOO_MANY_OPEN,	"Too many '('"),			\
 	C(TOO_MANY_CLOSE,	"Too few '('"),				\
 	C(MISSING_QUOTE,	"Missing matching quote"),		\
+	C(MISSING_BRACE_OPEN,   "Missing '{'"),				\
+	C(MISSING_BRACE_CLOSE,  "Missing '}'"),				\
 	C(OPERAND_TOO_LONG,	"Operand too long"),			\
 	C(EXPECT_STRING,	"Expecting string field"),		\
 	C(EXPECT_DIGIT,		"Expecting numeric field"),		\
@@ -103,6 +107,7 @@ struct filter_pred {
 	C(BAD_SUBSYS_FILTER,	"Couldn't find or set field in one of a subsystem's events"), \
 	C(TOO_MANY_PREDS,	"Too many terms in predicate expression"), \
 	C(INVALID_FILTER,	"Meaningless filter expression"),	\
+	C(INVALID_CPULIST,	"Invalid cpulist"),	\
 	C(IP_FIELD_ONLY,	"Only 'ip' field is supported for function trace"), \
 	C(INVALID_VALUE,	"Invalid value (did you forget quotes)?"), \
 	C(NO_FUNCTION,		"Function not found"),			\
@@ -190,6 +195,7 @@ static void free_predicate(struct filter_pred *pred)
 {
 	if (pred) {
 		kfree(pred->regex);
+		kfree(pred->mask);
 		kfree(pred);
 	}
 }
@@ -877,6 +883,26 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
 	}
 }
 
+/* Filter predicate for cpumask field vs user-provided cpumask */
+static int filter_pred_cpumask(struct filter_pred *pred, void *event)
+{
+	u32 item = *(u32 *)(event + pred->offset);
+	int loc = item & 0xffff;
+	const struct cpumask *mask = (event + loc);
+	const struct cpumask *cmp = pred->mask;
+
+	switch (pred->op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cmp);
+	case OP_NE:
+		return !cpumask_equal(mask, cmp);
+	case OP_BAND:
+		return cpumask_intersects(mask, cmp);
+	default:
+		return 0;
+	}
+}
+
 /* Filter predicate for COMM. */
 static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
@@ -1244,8 +1270,12 @@ static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
 
 int filter_assign_type(const char *type)
 {
-	if (strstr(type, "__data_loc") && strstr(type, "char"))
-		return FILTER_DYN_STRING;
+	if (strstr(type, "__data_loc")) {
+		if (strstr(type, "char"))
+			return FILTER_DYN_STRING;
+		if (strstr(type, "cpumask_t"))
+			return FILTER_CPUMASK;
+		}
 
 	if (strstr(type, "__rel_loc") && strstr(type, "char"))
 		return FILTER_RDYN_STRING;
@@ -1357,6 +1387,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_pchar(pred, event);
 	case FILTER_PRED_FN_CPU:
 		return filter_pred_cpu(pred, event);
+	case FILTER_PRED_FN_CPUMASK:
+		return filter_pred_cpumask(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
 		return filter_pred_function(pred, event);
 	case FILTER_PRED_TEST_VISITED:
@@ -1568,6 +1600,67 @@ static int parse_pred(const char *str, void *data,
 		strncpy(pred->regex->pattern, str + s, len);
 		pred->regex->pattern[len] = 0;
 
+	} else if (!strncmp(str + i, "CPUS", 4)) {
+		unsigned int maskstart;
+		char *tmp;
+
+		switch (field->filter_type) {
+		case FILTER_CPUMASK:
+			break;
+		default:
+			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+			goto err_free;
+		}
+
+		switch (op) {
+		case OP_EQ:
+		case OP_NE:
+		case OP_BAND:
+			break;
+		default:
+			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+			goto err_free;
+		}
+
+		/* Skip CPUS */
+		i += 4;
+		if (str[i++] != '{') {
+			parse_error(pe, FILT_ERR_MISSING_BRACE_OPEN, pos + i);
+			goto err_free;
+		}
+		maskstart = i;
+
+		/* Walk the cpulist until closing } */
+		for (; str[i] && str[i] != '}'; i++);
+		if (str[i] != '}') {
+			parse_error(pe, FILT_ERR_MISSING_BRACE_CLOSE, pos + i);
+			goto err_free;
+		}
+
+		if (maskstart == i) {
+			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+			goto err_free;
+		}
+
+		/* Copy the cpulist between { and } */
+		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
+		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
+
+		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
+		if (!pred->mask)
+			goto err_mem;
+
+		/* Now parse it */
+		if (cpulist_parse(tmp, pred->mask)) {
+			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+			goto err_free;
+		}
+
+		/* Move along */
+		i++;
+		if (field->filter_type == FILTER_CPUMASK)
+			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
 		char q = str[i];
-- 
2.31.1



* [RFC PATCH v2 03/20] tracing/filters: Enable filtering a scalar field by a cpumask
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...

Filtering these currently requires using arithmetic comparison functions,
which can be tedious when dealing with interleaved SMT or NUMA CPU ids.

Allow these to be filtered by a user-provided cpumask, which enables e.g.:

$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
NOTE: I went with an implicit cpumask conversion of the event field, as
AFAICT predicate_parse() does not support parsing the application of a
function to a field (e.g. 'CPUS(target_cpu) & CPUS{2,4,6,8-32}')
---
 kernel/trace/trace_events_filter.c | 92 ++++++++++++++++++++++++++----
 1 file changed, 81 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index cb1863dfa280b..1e14f801685a8 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -46,15 +46,19 @@ static const char * ops[] = { OPS };
 enum filter_pred_fn {
 	FILTER_PRED_FN_NOP,
 	FILTER_PRED_FN_64,
+	FILTER_PRED_FN_64_CPUMASK,
 	FILTER_PRED_FN_S64,
 	FILTER_PRED_FN_U64,
 	FILTER_PRED_FN_32,
+	FILTER_PRED_FN_32_CPUMASK,
 	FILTER_PRED_FN_S32,
 	FILTER_PRED_FN_U32,
 	FILTER_PRED_FN_16,
+	FILTER_PRED_FN_16_CPUMASK,
 	FILTER_PRED_FN_S16,
 	FILTER_PRED_FN_U16,
 	FILTER_PRED_FN_8,
+	FILTER_PRED_FN_8_CPUMASK,
 	FILTER_PRED_FN_S8,
 	FILTER_PRED_FN_U8,
 	FILTER_PRED_FN_COMM,
@@ -643,6 +647,39 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
 	return ERR_PTR(ret);
 }
 
+static inline int
+do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp)
+{
+	switch (op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cmp);
+	case OP_NE:
+		return !cpumask_equal(mask, cmp);
+	case OP_BAND:
+		return cpumask_intersects(mask, cmp);
+	default:
+		return 0;
+	}
+}
+
+/* Optimisation of do_filter_cpumask() for scalar fields */
+static inline int
+do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask)
+{
+	switch (op) {
+	case OP_EQ:
+		return cpumask_test_cpu(cpu, mask) &&
+			cpumask_nth(1, mask) >= nr_cpu_ids;
+	case OP_NE:
+		return !cpumask_test_cpu(cpu, mask) ||
+			cpumask_nth(1, mask) < nr_cpu_ids;
+	case OP_BAND:
+		return cpumask_test_cpu(cpu, mask);
+	default:
+		return 0;
+	}
+}
+
 enum pred_cmp_types {
 	PRED_CMP_TYPE_NOP,
 	PRED_CMP_TYPE_LT,
@@ -686,6 +723,18 @@ static int filter_pred_##type(struct filter_pred *pred, void *event)	\
 	}								\
 }
 
+#define DEFINE_CPUMASK_COMPARISON_PRED(size)					\
+static int filter_pred_##size##_cpumask(struct filter_pred *pred, void *event)	\
+{										\
+	u##size *addr = (u##size *)(event + pred->offset);			\
+	unsigned int cpu = *addr;						\
+										\
+	if (cpu >= nr_cpu_ids)							\
+		return 0;							\
+										\
+	return do_filter_scalar_cpumask(pred->op, cpu, pred->mask);		\
+}
+
 #define DEFINE_EQUALITY_PRED(size)					\
 static int filter_pred_##size(struct filter_pred *pred, void *event)	\
 {									\
@@ -707,6 +756,11 @@ DEFINE_COMPARISON_PRED(u16);
 DEFINE_COMPARISON_PRED(s8);
 DEFINE_COMPARISON_PRED(u8);
 
+DEFINE_CPUMASK_COMPARISON_PRED(64);
+DEFINE_CPUMASK_COMPARISON_PRED(32);
+DEFINE_CPUMASK_COMPARISON_PRED(16);
+DEFINE_CPUMASK_COMPARISON_PRED(8);
+
 DEFINE_EQUALITY_PRED(64);
 DEFINE_EQUALITY_PRED(32);
 DEFINE_EQUALITY_PRED(16);
@@ -891,16 +945,7 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 	const struct cpumask *mask = (event + loc);
 	const struct cpumask *cmp = pred->mask;
 
-	switch (pred->op) {
-	case OP_EQ:
-		return cpumask_equal(mask, cmp);
-	case OP_NE:
-		return !cpumask_equal(mask, cmp);
-	case OP_BAND:
-		return cpumask_intersects(mask, cmp);
-	default:
-		return 0;
-	}
+	return do_filter_cpumask(pred->op, mask, cmp);
 }
 
 /* Filter predicate for COMM. */
@@ -1351,24 +1396,32 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 	switch (pred->fn_num) {
 	case FILTER_PRED_FN_64:
 		return filter_pred_64(pred, event);
+	case FILTER_PRED_FN_64_CPUMASK:
+		return filter_pred_64_cpumask(pred, event);
 	case FILTER_PRED_FN_S64:
 		return filter_pred_s64(pred, event);
 	case FILTER_PRED_FN_U64:
 		return filter_pred_u64(pred, event);
 	case FILTER_PRED_FN_32:
 		return filter_pred_32(pred, event);
+	case FILTER_PRED_FN_32_CPUMASK:
+		return filter_pred_32_cpumask(pred, event);
 	case FILTER_PRED_FN_S32:
 		return filter_pred_s32(pred, event);
 	case FILTER_PRED_FN_U32:
 		return filter_pred_u32(pred, event);
 	case FILTER_PRED_FN_16:
 		return filter_pred_16(pred, event);
+	case FILTER_PRED_FN_16_CPUMASK:
+		return filter_pred_16_cpumask(pred, event);
 	case FILTER_PRED_FN_S16:
 		return filter_pred_s16(pred, event);
 	case FILTER_PRED_FN_U16:
 		return filter_pred_u16(pred, event);
 	case FILTER_PRED_FN_8:
 		return filter_pred_8(pred, event);
+	case FILTER_PRED_FN_8_CPUMASK:
+		return filter_pred_8_cpumask(pred, event);
 	case FILTER_PRED_FN_S8:
 		return filter_pred_s8(pred, event);
 	case FILTER_PRED_FN_U8:
@@ -1606,6 +1659,7 @@ static int parse_pred(const char *str, void *data,
 
 		switch (field->filter_type) {
 		case FILTER_CPUMASK:
+		case FILTER_OTHER:
 			break;
 		default:
 			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
@@ -1658,8 +1712,24 @@ static int parse_pred(const char *str, void *data,
 
 		/* Move along */
 		i++;
-		if (field->filter_type == FILTER_CPUMASK)
+		if (field->filter_type == FILTER_CPUMASK) {
 			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+		} else {
+			switch (field->size) {
+			case 8:
+				pred->fn_num = FILTER_PRED_FN_64_CPUMASK;
+				break;
+			case 4:
+				pred->fn_num = FILTER_PRED_FN_32_CPUMASK;
+				break;
+			case 2:
+				pred->fn_num = FILTER_PRED_FN_16_CPUMASK;
+				break;
+			case 1:
+				pred->fn_num = FILTER_PRED_FN_8_CPUMASK;
+				break;
+			}
+		}
 
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
-- 
2.31.1



* [RFC PATCH v2 04/20] tracing/filters: Enable filtering the CPU common field by a cpumask
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

The tracing_cpumask lets us specify which CPUs are traced in a buffer
instance, but doesn't let us do this on a per-event basis (unless one
creates an instance per event).

A previous commit added filtering scalar fields by a user-given cpumask;
make this work with the CPU common field as well.

This enables doing things like

$ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
		   -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 1e14f801685a8..3009d0c61b532 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -68,6 +68,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_PCHAR_USER,
 	FILTER_PRED_FN_PCHAR,
 	FILTER_PRED_FN_CPU,
+	FILTER_PRED_FN_CPU_CPUMASK,
 	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
@@ -937,6 +938,14 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
 	}
 }
 
+/* Filter predicate for current CPU vs user-provided cpumask */
+static int filter_pred_cpu_cpumask(struct filter_pred *pred, void *event)
+{
+	int cpu = raw_smp_processor_id();
+
+	return do_filter_scalar_cpumask(pred->op, cpu, pred->mask);
+}
+
 /* Filter predicate for cpumask field vs user-provided cpumask */
 static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 {
@@ -1440,6 +1449,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_pchar(pred, event);
 	case FILTER_PRED_FN_CPU:
 		return filter_pred_cpu(pred, event);
+	case FILTER_PRED_FN_CPU_CPUMASK:
+		return filter_pred_cpu_cpumask(pred, event);
 	case FILTER_PRED_FN_CPUMASK:
 		return filter_pred_cpumask(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
@@ -1659,6 +1670,7 @@ static int parse_pred(const char *str, void *data,
 
 		switch (field->filter_type) {
 		case FILTER_CPUMASK:
+		case FILTER_CPU:
 		case FILTER_OTHER:
 			break;
 		default:
@@ -1714,6 +1726,8 @@ static int parse_pred(const char *str, void *data,
 		i++;
 		if (field->filter_type == FILTER_CPUMASK) {
 			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+		} else if (field->filter_type == FILTER_CPU) {
+			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
 		} else {
 			switch (field->size) {
 			case 8:
-- 
2.31.1



* [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

Reuse do_filter_scalar_cpumask() when the input mask has a weight of one.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 35 +++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 3009d0c61b532..2fe65ddeb34ef 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -70,6 +70,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_CPU,
 	FILTER_PRED_FN_CPU_CPUMASK,
 	FILTER_PRED_FN_CPUMASK,
+	FILTER_PRED_FN_CPUMASK_CPU,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
 	FILTER_PRED_TEST_VISITED,
@@ -957,6 +958,22 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 	return do_filter_cpumask(pred->op, mask, cmp);
 }
 
+/* Filter predicate for cpumask field vs user-provided scalar  */
+static int filter_pred_cpumask_cpu(struct filter_pred *pred, void *event)
+{
+	u32 item = *(u32 *)(event + pred->offset);
+	int loc = item & 0xffff;
+	const struct cpumask *mask = (event + loc);
+	unsigned int cpu = pred->val;
+
+	/*
+	 * This inverts the usual usage of the function (field is first element,
+	 * user parameter is second), but that's fine because the (scalar, mask)
+	 * operations used are symmetric.
+	 */
+	return do_filter_scalar_cpumask(pred->op, cpu, mask);
+}
+
 /* Filter predicate for COMM. */
 static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
@@ -1453,6 +1470,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_cpu_cpumask(pred, event);
 	case FILTER_PRED_FN_CPUMASK:
 		return filter_pred_cpumask(pred, event);
+	case FILTER_PRED_FN_CPUMASK_CPU:
+		return filter_pred_cpumask_cpu(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
 		return filter_pred_function(pred, event);
 	case FILTER_PRED_TEST_VISITED:
@@ -1666,6 +1685,7 @@ static int parse_pred(const char *str, void *data,
 
 	} else if (!strncmp(str + i, "CPUS", 4)) {
 		unsigned int maskstart;
+		bool single;
 		char *tmp;
 
 		switch (field->filter_type) {
@@ -1724,8 +1744,21 @@ static int parse_pred(const char *str, void *data,
 
 		/* Move along */
 		i++;
+
+		/*
+		 * Optimisation: if the user-provided mask has a weight of one
+		 * then we can treat it as a scalar input.
+		 */
+		single = cpumask_weight(pred->mask) == 1;
+		if (single && field->filter_type == FILTER_CPUMASK) {
+			pred->val = cpumask_first(pred->mask);
+			kfree(pred->mask);
+		}
+
 		if (field->filter_type == FILTER_CPUMASK) {
-			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+			pred->fn_num = single ?
+				FILTER_PRED_FN_CPUMASK_CPU :
+				FILTER_PRED_FN_CPUMASK;
 		} else if (field->filter_type == FILTER_CPU) {
 			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
 		} else {
-- 
2.31.1



* [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

When the mask contains a single CPU, directly re-use the unsigned field
predicate functions. Transform '&' into '==' beforehand.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 2fe65ddeb34ef..54d642fabb7f1 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
 		 * then we can treat it as a scalar input.
 		 */
 		single = cpumask_weight(pred->mask) == 1;
-		if (single && field->filter_type == FILTER_CPUMASK) {
+		if (single && field->filter_type != FILTER_CPU) {
 			pred->val = cpumask_first(pred->mask);
 			kfree(pred->mask);
 		}
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
 				FILTER_PRED_FN_CPUMASK;
 		} else if (field->filter_type == FILTER_CPU) {
 			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+		} else if (single) {
+			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
+			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
+			if (pred->op == OP_NE)
+				pred->not = 1;
 		} else {
 			switch (field->size) {
 			case 8:
-- 
2.31.1



* [RFC PATCH v2 07/20] tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

In this case we can directly re-use filter_pred_cpu(), we just need to
transform '&' into '==' before executing it.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 54d642fabb7f1..fd72dacc5d1b8 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
 		 * then we can treat it as a scalar input.
 		 */
 		single = cpumask_weight(pred->mask) == 1;
-		if (single && field->filter_type != FILTER_CPU) {
+		if (single) {
 			pred->val = cpumask_first(pred->mask);
 			kfree(pred->mask);
 		}
@@ -1760,7 +1760,12 @@ static int parse_pred(const char *str, void *data,
 				FILTER_PRED_FN_CPUMASK_CPU :
 				FILTER_PRED_FN_CPUMASK;
 		} else if (field->filter_type == FILTER_CPU) {
-			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+			if (single) {
+				pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
+				pred->fn_num = FILTER_PRED_FN_CPU;
+			} else {
+				pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+			}
 		} else if (single) {
 			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
 			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
-- 
2.31.1



* [RFC PATCH v2 08/20] tracing/filters: Further optimise scalar vs cpumask comparison
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Per the previous commits, we now only enter do_filter_scalar_cpumask() with
a mask of weight greater than one. Optimise the equality checks.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index fd72dacc5d1b8..3a529214a21b7 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -667,6 +667,25 @@ do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp)
 /* Optimisation of do_filter_cpumask() for scalar fields */
 static inline int
 do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask)
+{
+	/*
+	 * Per the weight-of-one cpumask optimisations, the mask passed in this
+	 * function has a weight >= 2, so it is never equal to a single scalar.
+	 */
+	switch (op) {
+	case OP_EQ:
+		return false;
+	case OP_NE:
+		return true;
+	case OP_BAND:
+		return cpumask_test_cpu(cpu, mask);
+	default:
+		return 0;
+	}
+}
+
+static inline int
+do_filter_cpumask_scalar(int op, const struct cpumask *mask, unsigned int cpu)
 {
 	switch (op) {
 	case OP_EQ:
@@ -966,12 +985,7 @@ static int filter_pred_cpumask_cpu(struct filter_pred *pred, void *event)
 	const struct cpumask *mask = (event + loc);
 	unsigned int cpu = pred->val;
 
-	/*
-	 * This inverts the usual usage of the function (field is first element,
-	 * user parameter is second), but that's fine because the (scalar, mask)
-	 * operations used are symmetric.
-	 */
-	return do_filter_scalar_cpumask(pred->op, cpu, mask);
+	return do_filter_cpumask_scalar(pred->op, mask, cpu);
 }
 
 /* Filter predicate for COMM. */
-- 
2.31.1



* [RFC PATCH v2 09/20] tracing/filters: Document cpumask filtering
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Cpumask, scalar and CPU fields can now be filtered by a user-provided
cpumask; document the syntax.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 Documentation/trace/events.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f5fcb8e1218f6..34108d5a55b41 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -219,6 +219,20 @@ the function "security_prepare_creds" and less than the end of that function.
 The ".function" postfix can only be attached to values of size long, and can only
 be compared with "==" or "!=".
 
+Cpumask fields or scalar fields that encode a CPU number can be filtered using
+a user-provided cpumask in cpulist format. The format is as follows::
+
+  CPUS{$cpulist}
+
+Operators available to cpumask filtering are:
+
+& (intersection), ==, !=
+
+For example, this will filter events that have their .target_cpu field present
+in the given cpumask::
+
+  target_cpu & CPUS{17-42}
+
 5.2 Setting filters
 -------------------
 
-- 
2.31.1



* [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Peter Zijlstra, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

From: Peter Zijlstra <peterz@infradead.org>

When a static_key is marked ro_after_init, its state will never change
(after init), therefore jump_label_update() will never need to iterate
the entries, and thus module load won't actually need to track this --
avoiding the static_key::next write.

Therefore, mark these keys such that jump_label_add_module() can
recognise them and avoid the modification.

Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
to denote such keys.
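
For reference, a key that qualifies is one defined via the *_RO helpers and
(at most) enabled during __init, along the lines of this sketch (my_key,
some_boot_condition and the initcall are made up):

  /* Lives in .data..ro_after_init; may only be flipped during init. */
  DEFINE_STATIC_KEY_FALSE_RO(my_key);

  static int __init my_key_setup(void)
  {
          if (some_boot_condition)
                  static_branch_enable(&my_key);
          return 0;
  }
  early_initcall(my_key_setup);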

Link: http://lore.kernel.org/r/20230705204142.GB2813335@hirez.programming.kicks-ass.net
NOT-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
@Peter: I've barely touched this patch; all I've done is write a comment
and fix benign compilation issues, so the credit's all yours really!
---
 include/asm-generic/sections.h |  5 ++++
 include/linux/jump_label.h     |  1 +
 init/main.c                    |  1 +
 kernel/jump_label.c            | 49 ++++++++++++++++++++++++++++++++++
 4 files changed, 56 insertions(+)

diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index db13bb620f527..c768de6f19a9a 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -180,6 +180,11 @@ static inline bool is_kernel_rodata(unsigned long addr)
 	       addr < (unsigned long)__end_rodata;
 }
 
+static inline bool is_kernel_ro_after_init(unsigned long addr)
+{
+	return addr >= (unsigned long)__start_ro_after_init &&
+	       addr < (unsigned long)__end_ro_after_init;
+}
 /**
  * is_kernel_inittext - checks if the pointer address is located in the
  *                      .init.text section
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f0a949b7c9733..88ef9e776af8d 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -216,6 +216,7 @@ extern struct jump_entry __start___jump_table[];
 extern struct jump_entry __stop___jump_table[];
 
 extern void jump_label_init(void);
+extern void jump_label_ro(void);
 extern void jump_label_lock(void);
 extern void jump_label_unlock(void);
 extern void arch_jump_label_transform(struct jump_entry *entry,
diff --git a/init/main.c b/init/main.c
index ad920fac325c3..cb5304ca18f4d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1403,6 +1403,7 @@ static void mark_readonly(void)
 		 * insecure pages which are W+X.
 		 */
 		rcu_barrier();
+		jump_label_ro();
 		mark_rodata_ro();
 		rodata_test();
 	} else
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index d9c822bbffb8d..661ef74dee9b7 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -530,6 +530,45 @@ void __init jump_label_init(void)
 	cpus_read_unlock();
 }
 
+static inline bool static_key_sealed(struct static_key *key)
+{
+	return (key->type & JUMP_TYPE_LINKED) && !(key->type & ~JUMP_TYPE_MASK);
+}
+
+static inline void static_key_seal(struct static_key *key)
+{
+	unsigned long type = key->type & JUMP_TYPE_TRUE;
+	key->type = JUMP_TYPE_LINKED | type;
+}
+
+void jump_label_ro(void)
+{
+	struct jump_entry *iter_start = __start___jump_table;
+	struct jump_entry *iter_stop = __stop___jump_table;
+	struct jump_entry *iter;
+
+	if (WARN_ON_ONCE(!static_key_initialized))
+		return;
+
+	cpus_read_lock();
+	jump_label_lock();
+
+	for (iter = iter_start; iter < iter_stop; iter++) {
+		struct static_key *iterk = jump_entry_key(iter);
+
+		if (!is_kernel_ro_after_init((unsigned long)iterk))
+			continue;
+
+		if (static_key_sealed(iterk))
+			continue;
+
+		static_key_seal(iterk);
+	}
+
+	jump_label_unlock();
+	cpus_read_unlock();
+}
+
 #ifdef CONFIG_MODULES
 
 enum jump_label_type jump_label_init_type(struct jump_entry *entry)
@@ -650,6 +689,15 @@ static int jump_label_add_module(struct module *mod)
 			static_key_set_entries(key, iter);
 			continue;
 		}
+
+		/*
+		 * If the key was sealed at init, then there's no need to keep
+		 * a reference to its module entries - just patch them now and
+		 * be done with it.
+		 */
+		if (static_key_sealed(key))
+			goto do_poke;
+
 		jlm = kzalloc(sizeof(struct static_key_mod), GFP_KERNEL);
 		if (!jlm)
 			return -ENOMEM;
@@ -675,6 +723,7 @@ static int jump_label_add_module(struct module *mod)
 		static_key_set_linked(key);
 
 		/* Only update if we've changed from our initial state */
+do_poke:
 		if (jump_label_type(iter) != jump_label_init_type(iter))
 			__jump_label_update(key, iter, iter_stop, true);
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (9 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-28 15:33   ` Josh Poimboeuf
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 8936a05f0e5ac..d308330f2910e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (10 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-28 15:35   ` Josh Poimboeuf
  2023-07-28 16:02   ` Josh Poimboeuf
  2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
                   ` (7 subsequent siblings)
  19 siblings, 2 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Josh Poimboeuf, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Later commits will depend on having no runtime-mutable text in early entry
code. (Ab)use the .noinstr section as a marker of early entry code and warn
about static keys used in it that can be flipped at runtime.
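
The resulting warning looks like so (real instances of it appear in the
next two patches):

  vmlinux.o: warning: objtool: <func>+0x<off>: Non __ro_after_init static key "<key>" in .noinstr section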

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c                   | 20 ++++++++++++++++++++
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/special.h |  2 ++
 tools/objtool/special.c                 |  3 +++
 4 files changed, 26 insertions(+)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index d308330f2910e..d973bb4df4341 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1968,6 +1968,9 @@ static int add_special_section_alts(struct objtool_file *file)
 		alt->next = orig_insn->alts;
 		orig_insn->alts = alt;
 
+		if (special_alt->key_sym)
+			orig_insn->key_sym = special_alt->key_sym;
+
 		list_del(&special_alt->list);
 		free(special_alt);
 	}
@@ -3476,6 +3479,20 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }
 
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0) {
+		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
+			WARN_INSN(insn,
+				  "Non __ro_after_init static key \"%s\" in .noinstr section",
+				  insn->key_sym->name);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3625,6 +3642,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;
 
+		if (insn->key_sym)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {
 
 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index daa46f1f0965a..35dd21f8f41e1 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -77,6 +77,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key_sym;
 };
 
 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 86d4af9c5aa9d..0e61f34fe3a28 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -27,6 +27,8 @@ struct special_alt {
 	struct section *new_sec;
 	unsigned long new_off;
 
+	struct symbol *key_sym;
+
 	unsigned int orig_len, new_len; /* group only */
 };
 
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 91b1950f5bd8a..1f76cfd815bf3 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 			return -1;
 		}
 		alt->key_addend = reloc_addend(key_reloc);
+
+		reloc_to_sec_off(key_reloc, &sec, &offset);
+		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
 	}
 
 	return 0;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (11 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-28 16:00   ` Josh Poimboeuf
  2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  [...]

The key can only be enabled (and not disabled) in the __init function
ct_cpu_tracker_user(), so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/context_tracking.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 6ef0b35fc28c5..cc4f3a57f848c 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -432,7 +432,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
 
-DEFINE_STATIC_KEY_FALSE(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
 EXPORT_SYMBOL_GPL(context_tracking_key);
 
 static noinstr bool context_tracking_recursion_enter(void)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (12 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-10-09 16:40   ` Maxim Levitsky
  2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section

The key can only be enabled (and not disabled) in the __init function
kvm_guest_init(), so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984daa..319460090a836 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -44,7 +44,7 @@
 #include <asm/svm.h>
 #include <asm/e820/api.h>
 
-DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (13 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-24 14:52   ` Frederic Weisbecker
  2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, as done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of it.

Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.

Presence of CONTEXT_KERNEL in the ct_state prevents queuing deferred work.

Later commits introduce the bit:callback mappings.
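
As a usage sketch, a caller first tries to queue its work and only IPIs the
CPUs for which deferral failed (CONTEXT_WORK_FOO and do_foo_ipi() are
hypothetical; the text patching patch later in the series follows this
exact pattern):

  static bool do_foo_defer_cond(int cpu, void *info)
  {
          /* true: deferral failed, the IPI must be sent after all */
          return !ct_set_cpu_work(cpu, CONTEXT_WORK_FOO);
  }

  on_each_cpu_cond(do_foo_defer_cond, do_foo_ipi, NULL, 1);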

Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 +++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 14 +++++
 include/linux/context_tracking.h             | 25 ++++++++
 include/linux/context_tracking_state.h       | 62 +++++++++++++++-----
 include/linux/context_tracking_work.h        | 26 ++++++++
 kernel/context_tracking.c                    | 51 +++++++++++++++-
 kernel/time/Kconfig                          |  5 ++
 8 files changed, 176 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index aff2746c8af28..1bcb3bbdddaad 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -871,6 +871,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db4097701..71481a80774f6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -198,6 +198,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5bc29e6b2ed38
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+	switch (work) {
+	case CONTEXT_WORK_n:
+		// Do work...
+		break;
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6e76b9dba00e7..8aee086d0a25f 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,10 +5,15 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START,
+	      "Not enough bits for CONTEXT_WORK");
+#endif
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 extern void ct_cpu_track_user(int cpu);
@@ -131,6 +136,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
 	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long new, old, state;
+
+	state = arch_atomic_read(&ct->state);
+	do {
+		old = state;
+		new = old & ~CONTEXT_WORK_MASK;
+		new += incby;
+		state = arch_atomic_cmpxchg(&ct->state, old, new);
+	} while (old != state);
+
+	return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
 static __always_inline bool warn_rcu_enter(void)
 {
 	bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index bbff5f7f88030..828fcdb801f73 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -9,21 +9,6 @@
 /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
 #define DYNTICK_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
 
-enum ctx_state {
-	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
-	CONTEXT_KERNEL		= 0,
-	CONTEXT_IDLE		= 1,
-	CONTEXT_USER		= 2,
-	CONTEXT_GUEST		= 3,
-	CONTEXT_MAX		= 4,
-};
-
-/* Even value for idle, else odd. */
-#define RCU_DYNTICKS_IDX CONTEXT_MAX
-
-#define CT_STATE_MASK (CONTEXT_MAX - 1)
-#define CT_DYNTICKS_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,6 +29,53 @@ struct context_tracking {
 #endif
 };
 
+enum ctx_state {
+	/* Following are values */
+	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
+	CONTEXT_KERNEL		= 0,
+	CONTEXT_IDLE		= 1,
+	CONTEXT_USER		= 2,
+	CONTEXT_GUEST		= 3,
+	CONTEXT_MAX             = 4,
+};
+
+/*
+ * We cram three different things within the same atomic variable:
+ *
+ *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
+ *                         |       CONTEXT_WORK_END                |
+ *                         |               |                       |
+ *                         v               v                       v
+ *         [ context_state ][ context work ][ RCU dynticks counter ]
+ *         ^                ^               ^
+ *         |                |               |
+ *         |        CONTEXT_WORK_START      |
+ * CONTEXT_STATE_START              RCU_DYNTICKS_START
+ */
+
+#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CONTEXT_STATE_START 0
+#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
+
+#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
+#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
+#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
+#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
+
+#define	CONTEXT_WORK_START (CONTEXT_STATE_END + 1)
+#define CONTEXT_WORK_END   (RCU_DYNTICKS_START - 1)
+
+/* Make sure all our bits are accounted for */
+static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
+	      (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
+	      (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
+	      CT_STATE_SIZE);
+
+#define CT_STATE_MASK GENMASK(CONTEXT_STATE_END, CONTEXT_STATE_START)
+#define CT_WORK_MASK  GENMASK(CONTEXT_WORK_END, CONTEXT_WORK_START)
+#define CT_DYNTICKS_MASK GENMASK(RCU_DYNTICKS_END, RCU_DYNTICKS_START)
+
 #ifdef CONFIG_CONTEXT_TRACKING
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 #endif
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..fb74db8876dd2
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index cc4f3a57f848c..1a3f6e355826d 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,51 @@ static __always_inline void rcu_dynticks_task_trace_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+	int bit;
+
+	seq = (seq & CT_WORK_MASK) >> CONTEXT_WORK_START;
+
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &seq, CONTEXT_WORK_MAX)
+		arch_context_tracking_work(BIT(bit));
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old;
+	bool ret = false;
+
+	preempt_disable();
+
+	old = atomic_read(&ct->state);
+	/*
+	 * Try setting the work until either
+	 * - the target CPU no longer accepts any more deferred work
+	 * - the work has been set
+	 *
+	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
+	 * as they are regular integers rather than bits, but that doesn't
+	 * matter here: if any of the context state bit is set, the CPU isn't
+	 * in kernel context.
+	 */
+	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
+		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CONTEXT_WORK_START));
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -88,7 +133,8 @@ static noinstr void ct_kernel_exit_state(int offset)
 	 * next idle sojourn.
 	 */
 	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
-	seq = ct_state_inc(offset);
+	seq = ct_state_inc_clear_work(offset);
+
 	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
 }
@@ -100,7 +146,7 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
-	int seq;
+	unsigned long seq;
 
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
@@ -108,6 +154,7 @@ static noinstr void ct_kernel_enter_state(int offset)
 	 * critical section.
 	 */
 	seq = ct_state_inc(offset);
+	ct_work_flush(seq);
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index bae8f11070bef..fdb266f2d774b 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (14 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-21  8:17   ` Valentin Schneider
  2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Paul E . McKenney, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

CONTEXT_TRACKING_WORK reduces the size of the dynticks counter to free up
some bits for work deferral. Paul suggested making the actual counter size
configurable for rcutorture to poke at, so do that.

Make it only configurable under RCU_EXPERT. Previous commits have added
build-time checks that ensure a kernel with problematic dynticks counter
width can't be built.
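
As a worked example with a 32-bit ct->state: the context state occupies
bits 0-1, so the default CONFIG_RCU_DYNTICKS_BITS=16 leaves bits 2-15 for
deferred work and bits 16-31 for the dynticks counter. At the RCU_EXPERT
minimum of 2 bits, the counter shrinks to bits 30-31 and the work area
grows to bits 2-29.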

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/context_tracking.h       |  3 ++-
 include/linux/context_tracking_state.h |  3 +--
 kernel/rcu/Kconfig                     | 33 ++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 8aee086d0a25f..9c0c622bc27bb 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -12,7 +12,8 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING_WORK
 static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START,
-	      "Not enough bits for CONTEXT_WORK");
+	      "Not enough bits for CONTEXT_WORK, "
+	      "CONFIG_RCU_DYNTICKS_BITS might be too high");
 #endif
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 828fcdb801f73..292a0b7c06948 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -58,8 +58,7 @@ enum ctx_state {
 #define CONTEXT_STATE_START 0
 #define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
 
-#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
-#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
+#define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
 #define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
 #define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
 
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bdd7eadb33d8f..1ff2aab24e964 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
 	  Say Y here if you need tighter callback-limit enforcement.
 	  Say N here if you are unsure.
 
+config RCU_DYNTICKS_RANGE_BEGIN
+	int
+	depends on !RCU_EXPERT
+	default 31 if !CONTEXT_TRACKING_WORK
+	default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_RANGE_BEGIN
+	int
+	depends on RCU_EXPERT
+	default 2
+
+config RCU_DYNTICKS_RANGE_END
+	int
+	default 31 if !CONTEXT_TRACKING_WORK
+	default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_BITS_DEFAULT
+       int
+       default 31 if !CONTEXT_TRACKING_WORK
+       default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_BITS
+	int "Dynticks counter width" if CONTEXT_TRACKING_WORK
+	range RCU_DYNTICKS_RANGE_BEGIN RCU_DYNTICKS_RANGE_END
+	default RCU_DYNTICKS_BITS_DEFAULT
+	help
+	  This option controls the width of the dynticks counter.
+
+	  Lower values will make overflows more frequent, which will increase
+	  the likelihood of extending grace periods.
+
+	  Don't touch this unless you are running some tests.
+
 endmenu # "RCU Subsystem"
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (15 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-20 19:53   ` Paul E. McKenney
  2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Paul E . McKenney, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

We now have an RCU_EXPERT knob for configuring the size of the dynticks
counter: CONFIG_RCU_DYNTICKS_BITS.

Add a torture config for a ridiculously small counter (2 bits). This is a
copy of TREE4 with the added counter size restriction.

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
 .../rcutorture/configs/rcu/TREE11.boot        |  1 +
 2 files changed, 20 insertions(+)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
new file mode 100644
index 0000000000000..aa7274efd9819
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
@@ -0,0 +1,19 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=8
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT=n
+CONFIG_PREEMPT_DYNAMIC=n
+#CHECK#CONFIG_TREE_RCU=y
+CONFIG_HZ_PERIODIC=n
+CONFIG_NO_HZ_IDLE=n
+CONFIG_NO_HZ_FULL=y
+CONFIG_RCU_TRACE=y
+CONFIG_RCU_FANOUT=4
+CONFIG_RCU_FANOUT_LEAF=3
+CONFIG_DEBUG_LOCK_ALLOC=n
+CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
+CONFIG_RCU_EXPERT=y
+CONFIG_RCU_EQS_DEBUG=y
+CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
new file mode 100644
index 0000000000000..a8d94caf7d2fd
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
@@ -0,0 +1 @@
+rcutree.rcu_fanout_leaf=4 nohz_full=1-N
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (16 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-25 10:49   ` Joel Fernandes
  2023-07-20 16:30 ` [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
  2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Peter Zijlstra, Nicolas Saenz Julienne, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys. Any static key used in early entry code
should only ever be enabled at boot time and never change afterwards, IOW
__ro_after_init (pretty much like alternatives). Objtool is now able to
point at static keys that don't respect this, and all static keys used in
early entry code have now been verified to behave accordingly.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 +++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
 arch/x86/kernel/kprobes/core.c               |  4 ++--
 arch/x86/kernel/kprobes/opt.c                |  4 ++--
 arch/x86/kernel/module.c                     |  2 +-
 include/linux/context_tracking_work.h        |  4 ++--
 7 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(int work)
 {
 	switch (work) {
-	case CONTEXT_WORK_n:
-		// Do work...
+	case CONTEXT_WORK_SYNC:
+		sync_core();
 		break;
 	}
 }
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 29832c338cdc5..b6939e965e69d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 72646d75b6ffe..fcce480e1919e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+	__text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
 	}
 
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		text_poke_sync();
+		text_poke_sync_deferrable();
 	}
 
 	/*
@@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	}
 
 	if (do_sync)
-		text_poke_sync();
+		text_poke_sync_deferrable();
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6c..a38c914753397 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 57b0037d0a996..88451a744ceda 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344b..8b4542dc51b6d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		text_poke_sync();
+		text_poke_sync_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index fb74db8876dd2..13fc97b395030 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_SYNC_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (17 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
  19 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given that NOHZ_FULL CPUs executing in userspace do not
access any kernel addresses, these invalidations do not need to happen
immediately and can be deferred until the next user->kernel transition.

Rather than make __flush_tlb_all() noinstr, add a minimal noinstr
variant that doesn't try to leverage INVPCID.

FIXME: not fully noinstr compliant
XXX: same issue as with ins patching, when do we access data that should be
invalidated?

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 ++++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 17 +++++++++++++++++
 include/linux/context_tracking_state.h       |  4 ++++
 include/linux/context_tracking_work.h        |  2 ++
 5 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 80450e1d5385a..323b971987af7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,7 @@
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL	-1UL
 #define TLB_GENERATION_INVALID	0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf27480af..631df9189ded4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1237,6 +1237,23 @@ void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * TODO: paravirt
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		flush_tlb_local();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 292a0b7c06948..3571c62cbb9cd 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -62,6 +62,10 @@ enum ctx_state {
 #define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
 #define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
 
+/*
+ * When CONFIG_CONTEXT_TRACKING_WORK=n, _END is 1 behind _START, which makes
+ * the CONTEXT_WORK size computation below yield 0, which is what we want!
+ */
 #define	CONTEXT_WORK_START (CONTEXT_STATE_END + 1)
 #define CONTEXT_WORK_END   (RCU_DYNTICKS_START - 1)
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
 
 enum {
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI     = BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (18 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-21 18:15   ` Nadav Amit
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

vunmap()s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

This does require a guarantee that nothing in the vmalloc range can be
accessed in early entry code. vmalloc'd kernel stacks (VMAP_STACK) are
AFAICT a safe exception, as a task running in userspace needs to enter
kernelspace to execute do_exit() before its stack can be vfree'd.

XXX: Validation that nothing in the vmalloc range is accessed in .noinstr or
  somesuch?

Blindly deferring any and all flushes of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when the invalidation
range crosses a threshold (tlb_single_page_flush_ceiling, 33 pages by
default) past which a full flush is cheaper than invalidating each page
in the range individually via INVLPG. IOW, it doesn't *require*
invalidating user mappings, and thus remains safe to defer until a later
kernel entry.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 23 ++++++++++++++++++++---
 mm/vmalloc.c                    | 19 ++++++++++++++-----
 3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 323b971987af7..0b9b1f040c476 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 631df9189ded4..bb18b35e61b4a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
 #include <linux/debugfs.h>
 #include <linux/sched/smt.h>
 #include <linux/task_work.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1045,6 +1046,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1061,12 +1067,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1074,13 +1081,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 93cf99aba335b..e08b6c7d22fb6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -439,6 +439,15 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -452,7 +461,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -1746,7 +1755,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		list_last_entry(&local_purge_list,
 			struct vmap_area, list)->va_end);
 
-	flush_tlb_kernel_range(start, end);
+	flush_tlb_kernel_range_deferrable(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
@@ -1849,7 +1858,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2239,7 +2248,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2304,7 +2313,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
@ 2023-07-20 19:53   ` Paul E. McKenney
  2023-07-21  4:00     ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-20 19:53 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> We now have an RCU_EXPERT knob for configuring the size of the dynticks
> counter: CONFIG_RCU_DYNTICKS_BITS.
> 
> Add a torture config for a ridiculously small counter (2 bits). This is a
> copy of TREE04 with the added counter size restriction.
> 
> Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
>  .../rcutorture/configs/rcu/TREE11.boot        |  1 +
>  2 files changed, 20 insertions(+)
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> 
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> new file mode 100644
> index 0000000000000..aa7274efd9819
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> @@ -0,0 +1,19 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=8
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=y
> +CONFIG_PREEMPT=n
> +CONFIG_PREEMPT_DYNAMIC=n
> +#CHECK#CONFIG_TREE_RCU=y
> +CONFIG_HZ_PERIODIC=n
> +CONFIG_NO_HZ_IDLE=n
> +CONFIG_NO_HZ_FULL=y
> +CONFIG_RCU_TRACE=y
> +CONFIG_RCU_FANOUT=4
> +CONFIG_RCU_FANOUT_LEAF=3
> +CONFIG_DEBUG_LOCK_ALLOC=n
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_EQS_DEBUG=y
> +CONFIG_RCU_LAZY=y
> +CONFIG_RCU_DYNTICKS_BITS=2

Why not just add this last line to the existing TREE04 scenario?
That would ensure that it gets tested regularly without extending the
time required to run a full set of rcutorture tests.

> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> new file mode 100644
> index 0000000000000..a8d94caf7d2fd
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> @@ -0,0 +1 @@
> +rcutree.rcu_fanout_leaf=4 nohz_full=1-N
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-20 19:53   ` Paul E. McKenney
@ 2023-07-21  4:00     ` Paul E. McKenney
  2023-07-21  7:58       ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21  4:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> > We now have an RCU_EXPERT knob for configuring the size of the dynticks
> > counter: CONFIG_RCU_DYNTICKS_BITS.
> > 
> > Add a torture config for a ridiculously small counter (2 bits). This is a
> > copy of TREE04 with the added counter size restriction.
> > 
> > Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > ---
> >  .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
> >  .../rcutorture/configs/rcu/TREE11.boot        |  1 +
> >  2 files changed, 20 insertions(+)
> >  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> > 
> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > new file mode 100644
> > index 0000000000000..aa7274efd9819
> > --- /dev/null
> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > @@ -0,0 +1,19 @@
> > +CONFIG_SMP=y
> > +CONFIG_NR_CPUS=8
> > +CONFIG_PREEMPT_NONE=n
> > +CONFIG_PREEMPT_VOLUNTARY=y
> > +CONFIG_PREEMPT=n
> > +CONFIG_PREEMPT_DYNAMIC=n
> > +#CHECK#CONFIG_TREE_RCU=y
> > +CONFIG_HZ_PERIODIC=n
> > +CONFIG_NO_HZ_IDLE=n
> > +CONFIG_NO_HZ_FULL=y
> > +CONFIG_RCU_TRACE=y
> > +CONFIG_RCU_FANOUT=4
> > +CONFIG_RCU_FANOUT_LEAF=3
> > +CONFIG_DEBUG_LOCK_ALLOC=n
> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> > +CONFIG_RCU_EXPERT=y
> > +CONFIG_RCU_EQS_DEBUG=y
> > +CONFIG_RCU_LAZY=y
> > +CONFIG_RCU_DYNTICKS_BITS=2
> 
> Why not just add this last line to the existing TREE04 scenario?
> That would ensure that it gets tested regularly without extending the
> time required to run a full set of rcutorture tests.

Please see below for the version of this patch that I am running overnight
tests with.  Does this one work for you?

							Thanx, Paul

------------------------------------------------------------------------

commit 1aa13731e665193cd833edac5ebc86a9c3fea2b7
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Jul 20 20:58:41 2023 -0700

    rcutorture: Add a test config to torture test low RCU_DYNTICKS width
    
    We now have an RCU_EXPERT knob for configuring the size of the dynticks
    counter: CONFIG_RCU_DYNTICKS_BITS.
    
    Modify scenario TREE04 to exercise a ridiculously small counter
    (2 bits).
    
    Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3a..aa7274efd981 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-21  4:00     ` Paul E. McKenney
@ 2023-07-21  7:58       ` Valentin Schneider
  2023-07-21 14:07         ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21  7:58 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 20/07/23 21:00, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> > new file mode 100644
>> > index 0000000000000..aa7274efd9819
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> > @@ -0,0 +1,19 @@
>> > +CONFIG_SMP=y
>> > +CONFIG_NR_CPUS=8
>> > +CONFIG_PREEMPT_NONE=n
>> > +CONFIG_PREEMPT_VOLUNTARY=y
>> > +CONFIG_PREEMPT=n
>> > +CONFIG_PREEMPT_DYNAMIC=n
>> > +#CHECK#CONFIG_TREE_RCU=y
>> > +CONFIG_HZ_PERIODIC=n
>> > +CONFIG_NO_HZ_IDLE=n
>> > +CONFIG_NO_HZ_FULL=y
>> > +CONFIG_RCU_TRACE=y
>> > +CONFIG_RCU_FANOUT=4
>> > +CONFIG_RCU_FANOUT_LEAF=3
>> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> > +CONFIG_RCU_EXPERT=y
>> > +CONFIG_RCU_EQS_DEBUG=y
>> > +CONFIG_RCU_LAZY=y
>> > +CONFIG_RCU_DYNTICKS_BITS=2
>>
>> Why not just add this last line to the existing TREE04 scenario?
>> That would ensure that it gets tested regularly without extending the
>> time required to run a full set of rcutorture tests.
>
> Please see below for the version of this patch that I am running overnight
> tests with.  Does this one work for you?
>

Yep that's fine with me. I only went with a separate test file as I wasn't
sure how new test options should be handled (merged into existing tests vs
new tests created), and didn't want to negatively impact TREE04 or
TREE06. If merging into TREE04 is preferred, then I'll do just that and
carry this patch moving forwards.

Thanks!


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
@ 2023-07-21  8:17   ` Valentin Schneider
  2023-07-21 14:10     ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21  8:17 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Paul E . McKenney, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 20/07/23 17:30, Valentin Schneider wrote:
> index bdd7eadb33d8f..1ff2aab24e964 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
>         Say Y here if you need tighter callback-limit enforcement.
>         Say N here if you are unsure.
>
> +config RCU_DYNTICKS_RANGE_BEGIN
> +	int
> +	depends on !RCU_EXPERT
> +	default 31 if !CONTEXT_TRACKING_WORK

You'll note that this should be 30 really, because the lower *2* bits are
taken by the context state (CONTEXT_GUEST has a value of 3).

This highlights the fragile part of this: the Kconfig values are hardcoded,
but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
static_assert() will at least capture any misconfiguration, but having that
enforced by the actual Kconfig ranges would be less awkward.

Do we currently have a way of e.g. making a Kconfig file depend on and use
values generated by a C header?
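
For reference, the cross-check in question is shaped like this - a sketch
reconstructed from context_tracking_state.h with the field macros elided;
the point is that the three regions carved out of ct->state must exactly
fill the atomic variable:

#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)

static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
	      (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
	      (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
	      CT_STATE_SIZE);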


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-21  7:58       ` Valentin Schneider
@ 2023-07-21 14:07         ` Paul E. McKenney
  2023-07-21 15:08           ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21 14:07 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
> On 20/07/23 21:00, Paul E. McKenney wrote:
> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >> > new file mode 100644
> >> > index 0000000000000..aa7274efd9819
> >> > --- /dev/null
> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >> > @@ -0,0 +1,19 @@
> >> > +CONFIG_SMP=y
> >> > +CONFIG_NR_CPUS=8
> >> > +CONFIG_PREEMPT_NONE=n
> >> > +CONFIG_PREEMPT_VOLUNTARY=y
> >> > +CONFIG_PREEMPT=n
> >> > +CONFIG_PREEMPT_DYNAMIC=n
> >> > +#CHECK#CONFIG_TREE_RCU=y
> >> > +CONFIG_HZ_PERIODIC=n
> >> > +CONFIG_NO_HZ_IDLE=n
> >> > +CONFIG_NO_HZ_FULL=y
> >> > +CONFIG_RCU_TRACE=y
> >> > +CONFIG_RCU_FANOUT=4
> >> > +CONFIG_RCU_FANOUT_LEAF=3
> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> >> > +CONFIG_RCU_EXPERT=y
> >> > +CONFIG_RCU_EQS_DEBUG=y
> >> > +CONFIG_RCU_LAZY=y
> >> > +CONFIG_RCU_DYNTICKS_BITS=2
> >>
> >> Why not just add this last line to the existing TREE04 scenario?
> >> That would ensure that it gets tested regularly without extending the
> >> time required to run a full set of rcutorture tests.
> >
> > Please see below for the version of this patch that I am running overnight
> > tests with.  Does this one work for you?
> 
> Yep that's fine with me. I only went with a separate test file as I wasn't
> sure how new test options should be handled (merged into existing tests vs
> new tests created), and didn't want to negatively impact TREE04 or
> TREE06. If merging into TREE04 is preferred, then I'll do just that and
> carry this patch moving forwards.

Things worked fine for this one-hour-per-scenario test run on my laptop,
except for the CONFIG_SMP=n runs, which all got build errors like the
following.

							Thanx, Paul

------------------------------------------------------------------------

In file included from ./include/linux/container_of.h:5,
                 from ./include/linux/list.h:5,
                 from ./include/linux/swait.h:5,
                 from ./include/linux/completion.h:12,
                 from ./include/linux/crypto.h:15,
                 from arch/x86/kernel/asm-offsets.c:9:
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
   61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
      |                             ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:70:29: note: in expansion of macro ‘RCU_DYNTICKS_START’
   70 | #define CONTEXT_WORK_END   (RCU_DYNTICKS_START - 1)
      |                             ^~~~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:74:16: note: in expansion of macro ‘CONTEXT_WORK_END’
   74 |               (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
      |                ^~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:62:29: note: in expansion of macro ‘CT_STATE_SIZE’
   62 | #define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
      |                             ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:75:16: note: in expansion of macro ‘RCU_DYNTICKS_END’
   75 |               (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
      |                ^~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
   61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
      |                             ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:75:40: note: in expansion of macro ‘RCU_DYNTICKS_START’
   75 |               (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
      |                                        ^~~~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:76:15: note: in expansion of macro ‘CT_STATE_SIZE’
   76 |               CT_STATE_SIZE);
      |               ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:73:15: error: expression in static assertion is not an integer
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      |               ^
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:116: arch/x86/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/git/linux-rcu-1/Makefile:1275: prepare0] Error 2
make[1]: *** Waiting for unfinished jobs....
  LD      /home/git/linux-rcu-1/tools/objtool/objtool-in.o
  LINK    /home/git/linux-rcu-1/tools/objtool/objtool
make: *** [Makefile:234: __sub-make] Error 2

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-21  8:17   ` Valentin Schneider
@ 2023-07-21 14:10     ` Paul E. McKenney
  2023-07-21 15:08       ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21 14:10 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
> On 20/07/23 17:30, Valentin Schneider wrote:
> > index bdd7eadb33d8f..1ff2aab24e964 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> >         Say Y here if you need tighter callback-limit enforcement.
> >         Say N here if you are unsure.
> >
> > +config RCU_DYNTICKS_RANGE_BEGIN
> > +	int
> > +	depends on !RCU_EXPERT
> > +	default 31 if !CONTEXT_TRACKING_WORK
> 
> You'll note that this should be 30 really, because the lower *2* bits are
> taken by the context state (CONTEXT_GUEST has a value of 3).
> 
> This highlights the fragile part of this: the Kconfig values are hardcoded,
> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
> static_assert() will at least capture any misconfiguration, but having that
> enforced by the actual Kconfig ranges would be less awkward.
> 
> Do we currently have a way of e.g. making a Kconfig file depend on and use
> values generated by a C header?

Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
option and let the C code work out what the number of bits should be?

I suppose that there might be a failure whose frequency depended on
the number of bits, which might be an argument for keeping something
like RCU_DYNTICKS_RANGE_BEGIN for fault isolation.  But still using
RCU_DYNTICKS_TORTURE for normal testing.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-21 14:07         ` Paul E. McKenney
@ 2023-07-21 15:08           ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21 15:08 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 21/07/23 07:07, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 21:00, Paul E. McKenney wrote:
>> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > new file mode 100644
>> >> > index 0000000000000..aa7274efd9819
>> >> > --- /dev/null
>> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > @@ -0,0 +1,19 @@
>> >> > +CONFIG_SMP=y
>> >> > +CONFIG_NR_CPUS=8
>> >> > +CONFIG_PREEMPT_NONE=n
>> >> > +CONFIG_PREEMPT_VOLUNTARY=y
>> >> > +CONFIG_PREEMPT=n
>> >> > +CONFIG_PREEMPT_DYNAMIC=n
>> >> > +#CHECK#CONFIG_TREE_RCU=y
>> >> > +CONFIG_HZ_PERIODIC=n
>> >> > +CONFIG_NO_HZ_IDLE=n
>> >> > +CONFIG_NO_HZ_FULL=y
>> >> > +CONFIG_RCU_TRACE=y
>> >> > +CONFIG_RCU_FANOUT=4
>> >> > +CONFIG_RCU_FANOUT_LEAF=3
>> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> >> > +CONFIG_RCU_EXPERT=y
>> >> > +CONFIG_RCU_EQS_DEBUG=y
>> >> > +CONFIG_RCU_LAZY=y
>> >> > +CONFIG_RCU_DYNTICKS_BITS=2
>> >>
>> >> Why not just add this last line to the existing TREE04 scenario?
>> >> That would ensure that it gets tested regularly without extending the
>> >> time required to run a full set of rcutorture tests.
>> >
>> > Please see below for the version of this patch that I am running overnight
>> > tests with.  Does this one work for you?
>>
>> Yep that's fine with me. I only went with a separate test file as wasn't
>> sure how new test options should be handled (merged into existing tests vs
>> new tests created), and didn't want to negatively impact TREE04 or
>> TREE06. If merging into TREE04 is preferred, then I'll do just that and
>> carry this path moving forwards.
>
> Things worked fine for this one-hour-per-scenario test run on my laptop,

Many thanks for testing!

> except for the CONFIG_SMP=n runs, which all got build errors like the
> following.
>

Harumph, yes !SMP (and !CONTEXT_TRACKING_WORK) doesn't compile nicely, I'll
fix that for v3.
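
Presumably the layout assertion just shouldn't be emitted when ct->state
doesn't exist; something like the below sketch, maybe (untested, exact
guard TBD):

/* The split layout only makes sense when ct->state is actually there */
#ifdef CONFIG_CONTEXT_TRACKING
static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
	      (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
	      (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
	      CT_STATE_SIZE);
#endif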


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-21 14:10     ` Paul E. McKenney
@ 2023-07-21 15:08       ` Valentin Schneider
  2023-07-21 16:09         ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21 15:08 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 21/07/23 07:10, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 17:30, Valentin Schneider wrote:
>> > index bdd7eadb33d8f..1ff2aab24e964 100644
>> > --- a/kernel/rcu/Kconfig
>> > +++ b/kernel/rcu/Kconfig
>> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
>> >         Say Y here if you need tighter callback-limit enforcement.
>> >         Say N here if you are unsure.
>> >
>> > +config RCU_DYNTICKS_RANGE_BEGIN
>> > +	int
>> > +	depends on !RCU_EXPERT
>> > +	default 31 if !CONTEXT_TRACKING_WORK
>>
>> You'll note that this should be 30 really, because the lower *2* bits are
>> taken by the context state (CONTEXT_GUEST has a value of 3).
>>
>> This highlights the fragile part of this: the Kconfig values are hardcoded,
>> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
>> static_assert() will at least capture any misconfiguration, but having that
>> enforced by the actual Kconfig ranges would be less awkward.
>>
>> Do we currently have a way of e.g. making a Kconfig file depend on and use
>> values generated by a C header?
>
> Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
> option and let the C code work out what the number of bits should be?
>
> I suppose that there might be a failure whose frequency depended on
> the number of bits, which might be an argument for keeping something
> like RCU_DYNTICKS_RANGE_BEGIN for fault isolation.  But still using
> RCU_DYNTICKS_TORTURE for normal testing.
>
> Thoughts?
>

AFAICT if we run tests with the minimum possible width, then intermediate
widths shouldn't add much value.

Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I
came up with, as we can let the context tracking code figure out the widths
itself and not expose any of that to Kconfig.
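
Something like the below, say - a sketch with a hypothetical
RCU_DYNTICKS_TORTURE Kconfig bool; the non-torture widths are the ones
this series currently hardcodes:

/* Torture mode: squeeze the dynticks counter down to its minimum width */
#ifdef CONFIG_RCU_DYNTICKS_TORTURE
# define RCU_DYNTICKS_BITS 2
#else
# define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
#endif
#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)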

>                                                       Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-21 15:08       ` Valentin Schneider
@ 2023-07-21 16:09         ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21 16:09 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Fri, Jul 21, 2023 at 04:08:10PM +0100, Valentin Schneider wrote:
> On 21/07/23 07:10, Paul E. McKenney wrote:
> > On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
> >> On 20/07/23 17:30, Valentin Schneider wrote:
> >> > index bdd7eadb33d8f..1ff2aab24e964 100644
> >> > --- a/kernel/rcu/Kconfig
> >> > +++ b/kernel/rcu/Kconfig
> >> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> >> >         Say Y here if you need tighter callback-limit enforcement.
> >> >         Say N here if you are unsure.
> >> >
> >> > +config RCU_DYNTICKS_RANGE_BEGIN
> >> > +	int
> >> > +	depends on !RCU_EXPERT
> >> > +	default 31 if !CONTEXT_TRACKING_WORK
> >>
> >> You'll note that this should be 30 really, because the lower *2* bits are
> >> taken by the context state (CONTEXT_GUEST has a value of 3).
> >>
> >> This highlights the fragile part of this: the Kconfig values are hardcoded,
> >> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
> >> static_assert() will at least capture any misconfiguration, but having that
> >> enforced by the actual Kconfig ranges would be less awkward.
> >>
> >> Do we currently have a way of e.g. making a Kconfig file depend on and use
> >> values generated by a C header?
> >
> > Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
> > option and let the C code work out what the number of bits should be?
> >
> > I suppose that there might be a failure whose frequency depended on
> > the number of bits, which might be an argument for keeping something
> > like RCU_DYNTICKS_RANGE_BEGIN for fault isolation.  But still using
> > RCU_DYNTICKS_TORTURE for normal testing.
> >
> > Thoughts?
> >
> 
> AFAICT if we run tests with the minimum possible width, then intermediate
> widths shouldn't add much value.
> 
> Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I
> came up with, as we can let the context tracking code figure out the widths
> itself and not expose any of that to Kconfig.

Agreed.  If a need for variable numbers of bits ever does arise, we can
worry about it at that time.  And then we would have more information
on what a variable-bit facility should look like.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
@ 2023-07-21 18:15   ` Nadav Amit
  2023-07-24 11:32     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Nadav Amit @ 2023-07-21 18:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, rcu, linux-kselftest,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky



> On Jul 20, 2023, at 9:30 AM, Valentin Schneider <vschneid@redhat.com> wrote:
> 
> vunmap()s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
> 
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.

So I think there are a few assumptions here that it seems suitable to
confirm, with the major one acknowledged in the commit log (assuming they
hold).

There is an assumption that VMAP page-tables are not freed. I actually
never paid attention to that, but skimming the code it does seem so. To
clarify the issue: if page-tables were freed and their pages were reused,
there would be a problem that page-walk caches for instance would be used
and “junk” entries from the reused pages would be used. See [1].

I would also assume that memory hot-unplug of some sort is not an issue
(i.e., you cannot have a stale TLB entry pointing to memory that was
unplugged).

I also think that there might be speculative code execution using stale
TLB entries that would point to memory that has been reused and perhaps
controllable by the user. If somehow the CPU/OS is tricked into using the
stale executable TLB entries early enough on kernel entry, that might be
an issue. I guess it is probably a theoretical issue, but it would be
helpful to confirm.

In general, deferring TLB flushes can be done safely. This patch, I think,
takes it one step forward and allows the reuse of the memory before the TLB
flush is actually done. This is more dangerous.

[1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-21 18:15   ` Nadav Amit
@ 2023-07-24 11:32     ` Valentin Schneider
  2023-07-24 17:40       ` Dave Hansen
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-24 11:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, rcu, linux-kselftest,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

On 21/07/23 18:15, Nadav Amit wrote:
>> On Jul 20, 2023, at 9:30 AM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> vunmap()s issued from housekeeping CPUs are a relatively common source of
>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> flush_tlb_kernel_range() IPIs.
>>
>> Given that CPUs executing in userspace do not access data in the vmalloc
>> range, these IPIs could be deferred until their next kernel entry.
>
> So I think there are a few assumptions here that it seems suitable to confirm
> and acknowledge the major one in the commit log (assuming they hold).
>
> There is an assumption that VMAP page-tables are not freed. I actually
> never paid attention to that, but skimming the code it does seem so. To
> clarify the issue: if page-tables were freed and their pages were reused,
> there would be a problem that page-walk caches for instance would be used
> and “junk” entries from the reused pages would be used. See [1].
>

Thanks for looking into this and sharing context. This is an area I don't
have much experience with, so help is much appreciated!

Indeed, accessing addresses that should be impacted by a TLB flush *before*
executing the deferred flush is an issue. Deferring sync_core() for
instruction patching is a similar problem - it's all in the shape of
"access @addr impacted by @operation during kernel entry, before actually
executing @operation".

AFAICT the only reasonable way to go about the deferral is to prove that no
such access happens before the deferred @operation is done. We got to prove
that for sync_core() deferral, cf. PATCH 18.

I'd like to reason about it for deferring vunmap TLB flushes:

What addresses in the VMAP range, other than the stack, can early entry code
access? Yes, the ranges can be checked at runtime, but is there any chance
of figuring this out e.g. at build-time?
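
For the runtime angle, I'm picturing something like the below debug helper -
entirely hypothetical, made-up name, just to illustrate; it leans on the
VMAP_STACK reasoning from the changelog by exempting the task's own stack:

/*
 * Hypothetical check: scream if early entry code dereferences a vmalloc'd
 * address that isn't part of the current (VMAP_STACK) stack.
 */
static __always_inline void check_early_entry_access(const void *addr)
{
	WARN_ON_ONCE(is_vmalloc_addr(addr) && !object_is_on_stack(addr));
}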

> I would also assume the memory-hot-unplug of some sorts is not an issue,
> (i.e., you cannot have a stale TLB entry pointing to memory that was
> unplugged).
>
> I also think that there might be speculative code execution using stale
> TLB entries that would point to memory that has been reused and perhaps
> controllable by the user. If somehow the CPU/OS is tricked to use the
> stale executable TLB entries early enough on kernel entry that might be
> an issue. I guess it is probably theoretical issue, but it would be helpful
> to confirm.
>
> In general, deferring TLB flushes can be done safely. This patch, I think,
> takes it one step forward and allows the reuse of the memory before the TLB
> flush is actually done. This is more dangerous.
>
> [1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2023-07-24 14:52   ` Frederic Weisbecker
  2023-07-24 16:55     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Frederic Weisbecker @ 2023-07-24 14:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider wrote:
> +enum ctx_state {
> +	/* Following are values */
> +	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
> +	CONTEXT_KERNEL		= 0,
> +	CONTEXT_IDLE		= 1,
> +	CONTEXT_USER		= 2,
> +	CONTEXT_GUEST		= 3,
> +	CONTEXT_MAX             = 4,
> +};
> +
> +/*
> + * We cram three different things within the same atomic variable:
> + *
> + *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
> + *                         |       CONTEXT_WORK_END                |
> + *                         |               |                       |
> + *                         v               v                       v
> + *         [ context_state ][ context work ][ RCU dynticks counter ]
> + *         ^                ^               ^
> + *         |                |               |
> + *         |        CONTEXT_WORK_START      |
> + * CONTEXT_STATE_START              RCU_DYNTICKS_START

Should the layout be displayed in reverse? Well, at least I always picture
bitmaps in reverse; that's probably due to the direction of the shift arrows.
Not sure what the usual way to picture it is though...

> + */
> +
> +#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
> +
> +#define CONTEXT_STATE_START 0
> +#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)

Since you have non overlapping *_START symbols, perhaps the *_END
are superfluous?

> +
> +#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
> +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
> +#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
> +#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)

Might be the right time to standardize and fix our naming:

CT_STATE_START,
CT_STATE_KERNEL,
CT_STATE_USER,
...
CT_WORK_START,
CT_WORK_*,
...
CT_RCU_DYNTICKS_START,
CT_RCU_DYNTICKS_IDX

> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old;
> +	bool ret = false;
> +
> +	preempt_disable();
> +
> +	old = atomic_read(&ct->state);
> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU no longer accepts any more deferred work
> +	 * - the work has been set
> +	 *
> +	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
> +	 * as they are regular integers rather than bits, but that doesn't
> +	 * matter here: if any of the context state bits is set, the CPU isn't
> +	 * in kernel context.
> +	 */
> +	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)

That may still miss a recent entry to userspace due to the first plain read, ending
with an undesired interrupt.

You need at least one cmpxchg. Well, of course that stays racy by nature because
between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI being raised
and received, the remote CPU may have gone to userspace already. But it still
narrows the window a little.
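
Something in this shape, perhaps - just a sketch, untested, and assuming a
CONTEXT_MASK covering the context state bits: guess a userspace state so
that the very first attempt is already a cmpxchg, and rely on a failed
atomic_try_cmpxchg() refreshing @old with the current value:

	old = atomic_read(&ct->state);
	/* Guess a userspace state so that the first pass is a cmpxchg */
	old = (old & ~CONTEXT_MASK) | CONTEXT_USER;

	do {
		ret = atomic_try_cmpxchg(&ct->state, &old,
					 old | (work << CONTEXT_WORK_START));
		/* A failed cmpxchg wrote the fresh value back into @old */
	} while (!ret && (old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)));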

Thanks.

> +		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CONTEXT_WORK_START));
> +
> +	preempt_enable();
> +	return ret;
> +}
> +#else
> +static __always_inline void ct_work_flush(unsigned long work) { }
> +static __always_inline void ct_work_clear(struct context_tracking *ct) { }
> +#endif
> +
>  /*
>   * Record entry into an extended quiescent state.  This is only to be
>   * called when not already in an extended quiescent state, that is,
> @@ -88,7 +133,8 @@ static noinstr void ct_kernel_exit_state(int offset)
>  	 * next idle sojourn.
>  	 */
>  	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
> -	seq = ct_state_inc(offset);
> +	seq = ct_state_inc_clear_work(offset);
> +
>  	// RCU is no longer watching.  Better be in extended quiescent state!
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
>  }
> @@ -100,7 +146,7 @@ static noinstr void ct_kernel_exit_state(int offset)
>   */
>  static noinstr void ct_kernel_enter_state(int offset)
>  {
> -	int seq;
> +	unsigned long seq;
>  
>  	/*
>  	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
> @@ -108,6 +154,7 @@ static noinstr void ct_kernel_enter_state(int offset)
>  	 * critical section.
>  	 */
>  	seq = ct_state_inc(offset);
> +	ct_work_flush(seq);
>  	// RCU is now watching.  Better not be in an extended quiescent state!
>  	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index bae8f11070bef..fdb266f2d774b 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
>  	  Say N otherwise, this option brings an overhead that you
>  	  don't want in production.
>  
> +config CONTEXT_TRACKING_WORK
> +	bool
> +	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
> +	default y
> +
>  config NO_HZ
>  	bool "Old Idle dynticks config"
>  	help
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-24 14:52   ` Frederic Weisbecker
@ 2023-07-24 16:55     ` Valentin Schneider
  2023-07-24 19:18       ` Frederic Weisbecker
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-24 16:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 24/07/23 16:52, Frederic Weisbecker wrote:
> Le Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider a écrit :
>> +enum ctx_state {
>> +	/* Following are values */
>> +	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
>> +	CONTEXT_KERNEL		= 0,
>> +	CONTEXT_IDLE		= 1,
>> +	CONTEXT_USER		= 2,
>> +	CONTEXT_GUEST		= 3,
>> +	CONTEXT_MAX             = 4,
>> +};
>> +
>> +/*
>> + * We cram three different things within the same atomic variable:
>> + *
>> + *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
>> + *                         |       CONTEXT_WORK_END                |
>> + *                         |               |                       |
>> + *                         v               v                       v
>> + *         [ context_state ][ context work ][ RCU dynticks counter ]
>> + *         ^                ^               ^
>> + *         |                |               |
>> + *         |        CONTEXT_WORK_START      |
>> + * CONTEXT_STATE_START              RCU_DYNTICKS_START
>
> Should the layout be displayed in reverse? Well at least I always picture
> bitmaps in reverse, that's probably due to the direction of the shift arrows.
> Not sure what is the usual way to picture it though...
>

Surprisingly, I managed to confuse myself with that comment :-)  I think I
am subconsciously more used to the reverse as well. I've flipped that and
put "MSB" / "LSB" at either end.
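
Roughly, flipped (a sketch of the v3 comment, not its actual final text):

     MSB                                                      LSB
     [ RCU dynticks counter ][ context work ][ context_state ]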

>> + */
>> +
>> +#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
>> +
>> +#define CONTEXT_STATE_START 0
>> +#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
>
> Since you have non overlapping *_START symbols, perhaps the *_END
> are superfluous?
>

They're only really there to tidy up the GENMASK() further down - it keeps
the range and index definitions in one hunk. I tried defining the ranges directly
within the GENMASK() invocations themselves, but it got too ugly IMO.
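
i.e. they let the masks read as plain ranges (sketch; the *_MASK names are
mine, not from the patch):

	#define CONTEXT_STATE_MASK  GENMASK(CONTEXT_STATE_END, CONTEXT_STATE_START)
	#define CONTEXT_WORK_MASK   GENMASK(CONTEXT_WORK_END, CONTEXT_WORK_START)
	#define RCU_DYNTICKS_MASK   GENMASK(RCU_DYNTICKS_END, RCU_DYNTICKS_START)

rather than open-coding the widths inside each GENMASK().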

>> +
>> +#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
>> +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
>> +#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
>> +#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
>
> Might be the right time to standardize and fix our naming:
>
> CT_STATE_START,
> CT_STATE_KERNEL,
> CT_STATE_USER,
> ...
> CT_WORK_START,
> CT_WORK_*,
> ...
> CT_RCU_DYNTICKS_START,
> CT_RCU_DYNTICKS_IDX
>

Heh, I have actually already done this for v3, though I hadn't touched the
RCU_DYNTICKS* family. I'll fold that in.

>> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
>> +{
>> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> +	unsigned int old;
>> +	bool ret = false;
>> +
>> +	preempt_disable();
>> +
>> +	old = atomic_read(&ct->state);
>> +	/*
>> +	 * Try setting the work until either
>> +	 * - the target CPU no longer accepts any more deferred work
>> +	 * - the work has been set
>> +	 *
>> +	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
>> +	 * as they are regular integers rather than bits, but that doesn't
>> +	 * matter here: if any of the context state bit is set, the CPU isn't
>> +	 * in kernel context.
>> +	 */
>> +	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
>
> That may still miss a recent entry to userspace due to the first plain read, ending
> with an undesired interrupt.
>
> You need at least one cmpxchg. Well, of course that stays racy by nature because
> between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI being raised and
> received, the remote CPU may have gone to userspace already. But it still narrows
> the window a little.
>

I can make that a 'do {} while ()' instead to force at least one execution
of the cmpxchg().

This is only about reducing the race window, right? If we're executing this
just as the target CPU is about to enter userspace, we're going to be in
racy territory anyway. Regardless, I'm happy to do that change.
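
For the curious reader, the retry narrows the window because a failed
atomic_try_cmpxchg() writes the current value back into @old, so every
attempt after the first re-evaluates fresh state - only the initial plain
read can be stale. A minimal sketch of the pattern (not the actual v3 hunk):

	old = atomic_read(&ct->state);		/* may be stale */
	do {
		/* Bail out once the target is seen in kernel context */
		if (!(old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)))
			break;
		/* On failure, @old is updated with the current value */
		ret = atomic_try_cmpxchg(&ct->state, &old,
					 old | (work << CONTEXT_WORK_START));
	} while (!ret);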


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-24 11:32     ` Valentin Schneider
@ 2023-07-24 17:40       ` Dave Hansen
  2023-07-25 13:21         ` Peter Zijlstra
  2023-07-25 16:37         ` Marcelo Tosatti
  0 siblings, 2 replies; 76+ messages in thread
From: Dave Hansen @ 2023-07-24 17:40 UTC (permalink / raw)
  To: Valentin Schneider, Nadav Amit
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, rcu, linux-kselftest,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

On 7/24/23 04:32, Valentin Schneider wrote:
> AFAICT the only reasonable way to go about the deferral is to prove that no
> such access happens before the deferred @operation is done. We got to prove
> that for sync_core() deferral, cf. PATCH 18.
> 
> I'd like to reason about it for deferring vunmap TLB flushes:
> 
> What addresses in VMAP range, other than the stack, can early entry code
> access? Yes, the ranges can be checked at runtime, but is there any chance
> of figuring this out e.g. at build-time?

Nadav was touching on a very important point: TLB flushes for addresses
are relatively easy to defer.  You just need to ensure that the CPU
deferring the flush does an actual flush before it might architecturally
consume the contents of the flushed entry.
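
Translated to this series' mechanism, the deferred flush would run on the
target's next kernel entry, e.g. as another case in the quoted
arch_context_tracking_work() (sketch only; the CONTEXT_WORK_TLBI bit and the
choice of a global flush are my assumptions, not the actual patch):

	case CONTEXT_WORK_TLBI:
		/*
		 * Runs noinstr, early on kernel entry: flush before anything
		 * can architecturally consume a stale kernel TLB entry.
		 * Kernel mappings are typically global, so a plain CR3 write
		 * wouldn't evict them.
		 */
		flush_tlb_global();
		break;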

TLB flushes for freed page tables are another game entirely.  The CPU is
free to cache any part of the paging hierarchy it wants at any time.
It's also free to set accessed and dirty bits at any time, even for
instructions that may never execute architecturally.

That basically means that if you have *ANY* freed page table page
*ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
screwed.

There's no reasoning about accesses or ordering.  As soon as the CPU
does *anything*, it's out to get you.

You're going to need to do something a lot more radical to deal with
free page table pages.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-24 16:55     ` Valentin Schneider
@ 2023-07-24 19:18       ` Frederic Weisbecker
  2023-07-25 10:10         ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Frederic Weisbecker @ 2023-07-24 19:18 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
> I can make that a 'do {} while ()' instead to force at least one execution
> of the cmpxchg().
> 
> This is only about reducing the race window, right? If we're executing this
> just as the target CPU is about to enter userspace, we're going to be in
> racy territory anyway. Regardless, I'm happy to do that change.

Right, it's only about narrowing down the race window. It probably doesn't matter
in practice, but it's one less thing to consider for the brain :-)

Also, why bothering with handling CONTEXT_IDLE?

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-24 19:18       ` Frederic Weisbecker
@ 2023-07-25 10:10         ` Valentin Schneider
  2023-07-25 11:22           ` Frederic Weisbecker
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 10:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 24/07/23 21:18, Frederic Weisbecker wrote:
> On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
>> I can make that a 'do {} while ()' instead to force at least one execution
>> of the cmpxchg().
>>
>> This is only about reducing the race window, right? If we're executing this
>> just as the target CPU is about to enter userspace, we're going to be in
>> racy territory anyway. Regardless, I'm happy to do that change.
>
> Right, it's only about narrowing down the race window. It probably doesn't matter
> in practice, but it's one less thing to consider for the brain :-)
>

Ack

> Also, why bothering with handling CONTEXT_IDLE?
>

I have reasons! I just swept them under the rug and didn't mention them :D
Also, looking at the config dependencies again, I realize I got them wrong, but
nevertheless that means I get to ramble about it.

With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
transitions:

  ct_idle_enter()
    ct_kernel_exit()
      ct_state_inc_clear_work()

  ct_idle_exit()
    ct_kernel_enter()
      ct_work_flush()

Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.

It's a completely different argument than reducing interference for
NOHZ_FULL userspace applications and I should have at the very least
mentioned it in the cover letter, but it's the exact same backing
mechanism.

Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
patch with a proper changelog.
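
IOW, relative to the Kconfig hunk quoted earlier, something like this
(sketch, not the actual v3 hunk):

	config CONTEXT_TRACKING_WORK
		bool
		depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE
		default y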


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2023-07-25 10:49   ` Joel Fernandes
  2023-07-25 13:36     ` Valentin Schneider
  2023-07-25 13:39     ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-07-25 10:49 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Peter Zijlstra,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Interesting series Valentin. Some high-level question/comments on this one:

> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
> 
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.

Does the amount of harm not correspond to practical frequency of text_poke? 
How often does instruction patching really happen? If it is very infrequent
then I am not sure if it is that harmful.

> 
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

If it is not infrequent, then are you handling the case where userland
spends multiple seconds before entering the kernel, and all this while
the blocking call waits? Perhaps in such situation you want the real IPI
to be sent out instead of the deferred one?

thanks,

 - Joel


> 
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - all sorts of probes (kprobes/ftrace/bpf/???)
> 
> The early entry code leading to ct_work_flush() is noinstr, which gets rid
> of the probes.
> 
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
> 
> This leaves us with static keys. Any static key used in early entry code
> should be only forever-enabled at boot time, IOW __ro_after_init (pretty
> much like alternatives). Objtool is now able to point at static keys that
> don't respect this, and all static keys used in early entry code have now
> been verified as behaving like so.
> 
> Leverage the new context_tracking infrastructure to defer sync_core() IPIs
> to a target CPU's next kernel entry.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> arch/x86/include/asm/context_tracking_work.h |  6 +++--
> arch/x86/include/asm/text-patching.h         |  1 +
> arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
> arch/x86/kernel/kprobes/core.c               |  4 ++--
> arch/x86/kernel/kprobes/opt.c                |  4 ++--
> arch/x86/kernel/module.c                     |  2 +-
> include/linux/context_tracking_work.h        |  4 ++--
> 7 files changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
> index 5bc29e6b2ed38..2c66687ce00e2 100644
> --- a/arch/x86/include/asm/context_tracking_work.h
> +++ b/arch/x86/include/asm/context_tracking_work.h
> @@ -2,11 +2,13 @@
> #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
> #define _ASM_X86_CONTEXT_TRACKING_WORK_H
> 
> +#include <asm/sync_core.h>
> +
> static __always_inline void arch_context_tracking_work(int work)
> {
>    switch (work) {
> -    case CONTEXT_WORK_n:
> -        // Do work...
> +    case CONTEXT_WORK_SYNC:
> +        sync_core();
>        break;
>    }
> }
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index 29832c338cdc5..b6939e965e69d 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
>  */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void text_poke_sync(void);
> +extern void text_poke_sync_deferrable(void);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 72646d75b6ffe..fcce480e1919e 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -18,6 +18,7 @@
> #include <linux/mmu_context.h>
> #include <linux/bsearch.h>
> #include <linux/sync_core.h>
> +#include <linux/context_tracking.h>
> #include <asm/text-patching.h>
> #include <asm/alternative.h>
> #include <asm/sections.h>
> @@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
>    sync_core();
> }
> 
> +static bool do_sync_core_defer_cond(int cpu, void *info)
> +{
> +    return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
> +}
> +
> +static void __text_poke_sync(smp_cond_func_t cond_func)
> +{
> +    on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
> +}
> +
> void text_poke_sync(void)
> {
> -    on_each_cpu(do_sync_core, NULL, 1);
> +    __text_poke_sync(NULL);
> +}
> +
> +void text_poke_sync_deferrable(void)
> +{
> +    __text_poke_sync(do_sync_core_defer_cond);
> }
> 
> /*
> @@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>        text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
>    }
> 
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
> 
>    /*
>     * Second step: update all but the first byte of the patched range.
> @@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>         * not necessary and we'd be safe even without it. But
>         * better safe than sorry (plus there's not only Intel).
>         */
> -        text_poke_sync();
> +        text_poke_sync_deferrable();
>    }
> 
>    /*
> @@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>    }
> 
>    if (do_sync)
> -        text_poke_sync();
> +        text_poke_sync_deferrable();
> 
>    /*
>     * Remove and wait for refs to be zero.
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index f7f6042eb7e6c..a38c914753397 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
>    u8 int3 = INT3_INSN_OPCODE;
> 
>    text_poke(p->addr, &int3, 1);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
>    perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
> }
> 
> @@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
> 
>    perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
>    text_poke(p->addr, &p->opcode, 1);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
> }
> 
> void arch_remove_kprobe(struct kprobe *p)
> diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
> index 57b0037d0a996..88451a744ceda 100644
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
>           JMP32_INSN_SIZE - INT3_INSN_SIZE);
> 
>    text_poke(addr, new, INT3_INSN_SIZE);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
>    text_poke(addr + INT3_INSN_SIZE,
>          new + INT3_INSN_SIZE,
>          JMP32_INSN_SIZE - INT3_INSN_SIZE);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
> 
>    perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
> }
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index b05f62ee2344b..8b4542dc51b6d 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
>                   write, apply);
> 
>    if (!early) {
> -        text_poke_sync();
> +        text_poke_sync_deferrable();
>        mutex_unlock(&text_mutex);
>    }
> 
> diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
> index fb74db8876dd2..13fc97b395030 100644
> --- a/include/linux/context_tracking_work.h
> +++ b/include/linux/context_tracking_work.h
> @@ -5,12 +5,12 @@
> #include <linux/bitops.h>
> 
> enum {
> -    CONTEXT_WORK_n_OFFSET,
> +    CONTEXT_WORK_SYNC_OFFSET,
>    CONTEXT_WORK_MAX_OFFSET
> };
> 
> enum ct_work {
> -    CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
> +    CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
>    CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
> };
> 
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-25 10:10         ` Valentin Schneider
@ 2023-07-25 11:22           ` Frederic Weisbecker
  2023-07-25 13:05             ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Frederic Weisbecker @ 2023-07-25 11:22 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
> I have reasons! I just swept them under the rug and didn't mention them :D
> Also, looking at the config dependencies again, I realize I got them wrong, but
> nevertheless that means I get to ramble about it.
> 
> With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
> transitions:
> 
>   ct_idle_enter()
>     ct_kernel_exit()
>       ct_state_inc_clear_work()
> 
>   ct_idle_exit()
>     ct_kernel_enter()
>       ct_work_flush()
> 
> Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
> rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
> NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
> 
> It's a completely different argument than reducing interference for
> NOHZ_FULL userspace applications and I should have at the very least
> mentioned it in the cover letter, but it's the exact same backing
> mechanism.
> 
> Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
> patch with a proper changelog.

Ok, should that be a separate Kconfig? This indeed can bring a power improvement,
but at the cost of more overhead on the sender side. A balance to be measured...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-25 11:22           ` Frederic Weisbecker
@ 2023-07-25 13:05             ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 13:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 25/07/23 13:22, Frederic Weisbecker wrote:
> On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
>> I have reasons! I just swept them under the rug and didn't mention them :D
>> Also, looking at the config dependencies again, I realize I got them wrong, but
>> nevertheless that means I get to ramble about it.
>>
>> With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
>> transitions:
>>
>>   ct_idle_enter()
>>     ct_kernel_exit()
>>       ct_state_inc_clear_work()
>>
>>   ct_idle_exit()
>>     ct_kernel_enter()
>>       ct_work_flush()
>>
>> Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
>> rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
>> NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
>>
>> It's a completely different argument than reducing interference for
>> NOHZ_FULL userspace applications and I should have at the very least
>> mentioned it in the cover letter, but it's the exact same backing
>> mechanism.
>>
>> Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
>> patch with a proper changelog.
>
> Ok, should that be a separate Kconfig? This indeed can bring a power improvement,
> but at the cost of more overhead on the sender side. A balance to be measured...

Yep agreed, I'll make that an optional config.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-24 17:40       ` Dave Hansen
@ 2023-07-25 13:21         ` Peter Zijlstra
  2023-07-25 14:03           ` Valentin Schneider
  2023-07-25 16:37         ` Marcelo Tosatti
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-25 13:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
	linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	the arch/x86 maintainers, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:

> TLB flushes for freed page tables are another game entirely.  The CPU is
> free to cache any part of the paging hierarchy it wants at any time.
> It's also free to set accessed and dirty bits at any time, even for
> instructions that may never execute architecturally.
> 
> That basically means that if you have *ANY* freed page table page
> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
> screwed.
> 
> There's no reasoning about accesses or ordering.  As soon as the CPU
> does *anything*, it's out to get you.
> 
> You're going to need to do something a lot more radical to deal with
> free page table pages.

Ha! IIRC the only thing we can reasonably do there is to have strict
per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
as long as the per-cpu tables do not contain -- and have never contained --
a particular table page, we can avoid flushing it. Because if it was never
there, the CPU also couldn't have speculatively loaded it.

Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
have done them ages ago) and doing them is going to be *major* surgery
and pain.

Other than that, we must take the TLBI-IPI when freeing
page-table-pages.


But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
at least, I couldn't find it in a hurry either), but if we're going to
be doing this, then that file must include a very prominent comment
explaining it must never actually do so either.

Not being able to free page-tables might be a 'problem' if we're going
to be doing more of HUGE_VMALLOC, because that means it becomes rather
hard to swizzle from small to large pages.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 10:49   ` Joel Fernandes
@ 2023-07-25 13:36     ` Valentin Schneider
  2023-07-25 17:41       ` Joel Fernandes
  2023-07-25 13:39     ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 13:36 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Peter Zijlstra,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 25/07/23 06:49, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
>
>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>> them vs the newly patched instruction. CPUs that are executing in userspace
>> do not need this synchronization to happen immediately, and this is
>> actually harmful interference for NOHZ_FULL CPUs.
>
> Does the amount of harm not correspond to practical frequency of text_poke?
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.
>

Being pushed over a latency threshold *once* is enough to impact the
latency evaluation of your given system/application.

It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the
housekeeping CPUs may be up to (flipping static keys, loading kprobes,
using ftrace...) - frequency of the interference isn't such a big part of
the reasoning.

>>
>> As the synchronization IPIs are sent using a blocking call, returning from
>> text_poke_bp_batch() implies all CPUs will observe the patched
>> instruction(s), and this should be preserved even if the IPI is deferred.
>> In other words, to safely defer this synchronization, any kernel
>> instruction leading to the execution of the deferred instruction
>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such situation you want the real IPI
> to be sent out instead of the deferred one?
>

The blocking call only waits for CPUs for which it queued a CSD. Deferred
calls do not queue a CSD and thus do not impact the waiting at all. See
smp_call_function_many_cond().
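
Roughly (heavily simplified from kernel/smp.c, not the real code):

	for_each_cpu(cpu, mask) {
		/* A deferred CPU fails the condition: no CSD is queued */
		if (cond_func && !cond_func(cpu, info))
			continue;

		/* ...queue the CSD for @cpu, maybe ring the IPI... */
	}
	/* ...and the wait loop below only spins on CSDs queued above */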


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 10:49   ` Joel Fernandes
  2023-07-25 13:36     ` Valentin Schneider
@ 2023-07-25 13:39     ` Peter Zijlstra
  2023-07-25 17:47       ` Joel Fernandes
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-25 13:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
> 
> > On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
> > 
> > text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> > them vs the newly patched instruction. CPUs that are executing in userspace
> > do not need this synchronization to happen immediately, and this is
> > actually harmful interference for NOHZ_FULL CPUs.
> 
> Does the amount of harm not correspond to practical frequency of text_poke? 
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.

Well, it can happen quite a bit, also from things people would not
typically 'expect' it from.

For instance, the moment you create the first per-task perf event we
frob some jump-labels (and again a second or so after the last one goes
away).

The same for a bunch of runtime network configurations.
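
A simplified sketch of that first case (identifiers illustrative, not the
exact perf internals):

	/*
	 * Creating the first per-task event enables a static key; flipping
	 * it rewrites kernel text via text_poke_bp(), which ends in the
	 * sync IPIs discussed here.
	 */
	if (atomic_inc_return(&nr_per_task_events) == 1)
		static_branch_enable(&per_task_events_key);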

> > As the synchronization IPIs are sent using a blocking call, returning from
> > text_poke_bp_batch() implies all CPUs will observe the patched
> > instruction(s), and this should be preserved even if the IPI is deferred.
> > In other words, to safely defer this synchronization, any kernel
> > instruction leading to the execution of the deferred instruction
> > sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
> 
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such situation you want the real IPI
> to be sent out instead of the deferred one?

Please re-read what Valentin wrote -- nobody is waiting on anything.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-25 13:21         ` Peter Zijlstra
@ 2023-07-25 14:03           ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 14:03 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Hansen
  Cc: Nadav Amit, Linux Kernel Mailing List, linux-trace-kernel,
	linux-doc, kvm, linux-mm, bpf, the arch/x86 maintainers, rcu,
	linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky


Sorry, I missed Dave's email, so now I'm taking my time to page (hah!)
all of this in.

On 25/07/23 15:21, Peter Zijlstra wrote:
> On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
>
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
>> It's also free to set accessed and dirty bits at any time, even for
>> instructions that may never execute architecturally.
>>
>> That basically means that if you have *ANY* freed page table page
>> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
>> screwed.
>>
>> There's no reasoning about accesses or ordering.  As soon as the CPU
>> does *anything*, it's out to get you.
>>

OK, I feel like I need to go back and do some more reading now, but I think I
get the difference. Thanks for spelling it out.

>> You're going to need to do something a lot more radical to deal with
>> free page table pages.
>
> Ha! IIRC the only thing we can reasonably do there is to have strict
> per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
> as long as the per-cpu tables do not contain -- and have never contained --
> a particular table page, we can avoid flushing it. Because if it was never
> there, the CPU also couldn't have speculatively loaded it.
>
> Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
> have done them ages ago) and doing them is going to be *major* surgery
> and pain.
>
> Other than that, we must take the TLBI-IPI when freeing
> page-table-pages.
>
>
> But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
> at least, I couldn't find it in a hurry either), but if we're going to
> be doing this, then that file must include a very prominent comment
> explaining it must never actually do so either.
>

I also couldn't find any freeing of the page-table-pages, I'll do another
pass and sharpen my quill for a big fat comment.
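
Something along these lines, presumably (wording mine, not from the series):

	/*
	 * IMPORTANT: vmalloc/vunmap must never free page-table pages.
	 * NOHZ_FULL CPUs can defer kernel-range TLB flushes to their next
	 * kernel entry, which is only sound for stale *leaf* entries; a
	 * CPU may cache (and speculatively walk) freed page-table pages
	 * at any time, leaving no flush-before-use point to reason about.
	 */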

> Not being able to free page-tables might be a 'problem' if we're going
> to be doing more of HUGE_VMALLOC, because that means it becomes rather
> hard to swizzle from small to large pages.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-24 17:40       ` Dave Hansen
  2023-07-25 13:21         ` Peter Zijlstra
@ 2023-07-25 16:37         ` Marcelo Tosatti
  2023-07-25 17:12           ` Dave Hansen
  1 sibling, 1 reply; 76+ messages in thread
From: Marcelo Tosatti @ 2023-07-25 16:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
	linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	the arch/x86 maintainers, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Yair Podemsky

On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
> On 7/24/23 04:32, Valentin Schneider wrote:
> > AFAICT the only reasonable way to go about the deferral is to prove that no
> > such access happens before the deferred @operation is done. We got to prove
> > that for sync_core() deferral, cf. PATCH 18.
> > 
> > I'd like to reason about it for deferring vunmap TLB flushes:
> > 
> > What addresses in VMAP range, other than the stack, can early entry code
> > access? Yes, the ranges can be checked at runtime, but is there any chance
> > of figuring this out e.g. at build-time?
> 
> Nadav was touching on a very important point: TLB flushes for addresses
> are relatively easy to defer.  You just need to ensure that the CPU
> deferring the flush does an actual flush before it might architecturally
> consume the contents of the flushed entry.
> 
> TLB flushes for freed page tables are another game entirely.  The CPU is
> free to cache any part of the paging hierarchy it wants at any time.

Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes the TLB (and page
table caches) on user->kernel and kernel->user context switches?

So freeing a kernel pagetable page does not require interrupting a CPU
which is in userspace (and which therefore has no visibility into kernel
pagetables).

> It's also free to set accessed and dirty bits at any time, even for
> instructions that may never execute architecturally.
> 
> That basically means that if you have *ANY* freed page table page
> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
> screwed.
> 
> There's no reasoning about accesses or ordering.  As soon as the CPU
> does *anything*, it's out to get you.
> 
> You're going to need to do something a lot more radical to deal with
> free page table pages.




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-25 16:37         ` Marcelo Tosatti
@ 2023-07-25 17:12           ` Dave Hansen
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Hansen @ 2023-07-25 17:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
	linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	the arch/x86 maintainers, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Yair Podemsky

On 7/25/23 09:37, Marcelo Tosatti wrote:
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
> Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes the TLB (and page
> table caches) on user->kernel and kernel->user context switches?

Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at
all on user<->kernel switches when PCIDs are enabled.

Second, even if it did, the CPU is still free to cache any portion of
the paging hierarchy at any time.  Without LASS[1], userspace can even
_compel_ walks of the kernel portion of the address space, and we don't
have any infrastructure to tell if a freed kernel page is exposed in the
user copy of the page tables with PTI.

Third, (also ignoring PCIDs) there are plenty of instructions between
kernel entry and the MOV-to-CR3 that can flush the TLB.  All those
instructions are architecturally permitted to speculatively set Accessed or
Dirty bits in any part of the address space.  If they run into a freed
page table page, things get ugly.

These accesses are not _likely_.  There probably isn't a predictor out
there that's going to see a:

	movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

and go off trying to dirty memory in the vmalloc() area.  But we'd need
some backward *and* forward-looking guarantees from our intrepid CPU
designers to promise that this kind of thing is safe yesterday, today
and tomorrow.  I suspect such a guarantee is going to be hard to obtain.

1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@intel.com



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 13:36     ` Valentin Schneider
@ 2023-07-25 17:41       ` Joel Fernandes
  0 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-07-25 17:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Peter Zijlstra,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky



On 7/25/23 09:36, Valentin Schneider wrote:
> On 25/07/23 06:49, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to practical frequency of text_poke?
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
>>
> 
> Being pushed over a latency threshold *once* is enough to impact the
> latency evaluation of your given system/application.
> 
> It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the
> housekeeping CPUs may be up to (flipping static keys, loading kprobes,
> using ftrace...) - frequency of the interference isn't such a big part of
> the reasoning.

Makes sense.

>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such situation you want the real IPI
>> to be sent out instead of the deferred one?
>>
> 
> The blocking call only waits for CPUs for which it queued a CSD. Deferred
> calls do not queue a CSD and thus do not impact the waiting at all. See
> smp_call_function_many_cond().

Ah I see you are using on_each_cpu_cond(). I should have gone through
the other patch before making noise.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 13:39     ` Peter Zijlstra
@ 2023-07-25 17:47       ` Joel Fernandes
  0 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-07-25 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky



On 7/25/23 09:39, Peter Zijlstra wrote:
> On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to practical frequency of text_poke? 
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
> 
> Well, it can happen quite a bit, also from things people would not
> typically 'expect' it from.
> 
> For instance, the moment you create the first per-task perf event we
> frob some jump-labels (and again a second or so after the last one goes
> away).
> 
> The same for a bunch of runtime network configurations.

Ok cool. I guess I still have memories of that old ARM device I had
where modifications to kernel text were forbidden by hardware (it was a
security feature). That made kprobes unusable...

>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such situation you want the real IPI
>> to be sent out instead of the deferred one?
> 
> Please re-read what Valentin wrote -- nobody is waiting on anything.

Makes sense. To be fair I received his email 3 minutes before yours ;-).
But thank you both for clarifying!

 - Joel



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-20 16:30 ` [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
@ 2023-07-26 19:41   ` Josh Poimboeuf
  2023-07-27  9:46     ` Valentin Schneider
  2023-07-29 19:09     ` Steven Rostedt
  0 siblings, 2 replies; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-26 19:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>  int filter_assign_type(const char *type)
>  {
> -	if (strstr(type, "__data_loc") && strstr(type, "char"))
> -		return FILTER_DYN_STRING;
> +	if (strstr(type, "__data_loc")) {
> +		if (strstr(type, "char"))
> +			return FILTER_DYN_STRING;
> +		if (strstr(type, "cpumask_t"))
> +			return FILTER_CPUMASK;
> +		}

The closing bracket has the wrong indentation.
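
i.e. the brace wants to sit at the depth of the if () it closes:

	if (strstr(type, "__data_loc")) {
		if (strstr(type, "char"))
			return FILTER_DYN_STRING;
		if (strstr(type, "cpumask_t"))
			return FILTER_CPUMASK;
	}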

> +		/* Copy the cpulist between { and } */
> +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);

Need to check kmalloc() failure?  And also free tmp?
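
A minimal fix could look like this (sketch; the err_mem/err_free labels are
assumed from the surrounding function):

		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
		if (!tmp)
			goto err_mem;
		strscpy(tmp, str + maskstart, (i - maskstart) + 1);

		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
		if (!pred->mask) {
			kfree(tmp);
			goto err_mem;
		}

		/* Now parse it */
		if (cpulist_parse(tmp, pred->mask)) {
			kfree(tmp);
			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
			goto err_free;
		}
		kfree(tmp);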

> +
> +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> +		if (!pred->mask)
> +			goto err_mem;
> +
> +		/* Now parse it */
> +		if (cpulist_parse(tmp, pred->mask)) {
> +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> +			goto err_free;
> +		}
> +
> +		/* Move along */
> +		i++;
> +		if (field->filter_type == FILTER_CPUMASK)
> +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
> +

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-26 19:41   ` Josh Poimboeuf
@ 2023-07-27  9:46     ` Valentin Schneider
  2023-07-29 19:09     ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-27  9:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 26/07/23 12:41, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>>  int filter_assign_type(const char *type)
>>  {
>> -	if (strstr(type, "__data_loc") && strstr(type, "char"))
>> -		return FILTER_DYN_STRING;
>> +	if (strstr(type, "__data_loc")) {
>> +		if (strstr(type, "char"))
>> +			return FILTER_DYN_STRING;
>> +		if (strstr(type, "cpumask_t"))
>> +			return FILTER_CPUMASK;
>> +		}
>
> The closing bracket has the wrong indentation.
>
>> +		/* Copy the cpulist between { and } */
>> +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>
> Need to check kmalloc() failure?  And also free tmp?
>

Heh, indeed, shoddy that :-)

Thanks!

>> +
>> +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> +		if (!pred->mask)
>> +			goto err_mem;
>> +
>> +		/* Now parse it */
>> +		if (cpulist_parse(tmp, pred->mask)) {
>> +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> +			goto err_free;
>> +		}
>> +
>> +		/* Move along */
>> +		i++;
>> +		if (field->filter_type == FILTER_CPUMASK)
>> +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> +
>
> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-20 16:30 ` [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2023-07-28 15:33   ` Josh Poimboeuf
  2023-07-31 11:16     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 15:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
> I had to look into objtool itself to understand what this warning was
> about; make it more explicit.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  tools/objtool/check.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 8936a05f0e5ac..d308330f2910e 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
>  
>  	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
>  		if (!target->sec->noinstr) {
> -			WARN("pv_ops[%d]: %s", idx, target->name);
> +			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
>  			file->pv_ops[idx].clean = false;

This is an improvement, though I think it still results in two warnings,
with the second not-so-useful warning happening in validate_call().

Ideally it would only show a single warning, I guess that would need a
little bit of restructuring the code.

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
@ 2023-07-28 15:35   ` Josh Poimboeuf
  2023-07-31 11:18     ` Valentin Schneider
  2023-07-28 16:02   ` Josh Poimboeuf
  1 sibling, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 15:35 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
> +static int validate_static_key(struct instruction *insn, struct insn_state *state)
> +{
> +	if (state->noinstr && state->instr <= 0) {
> +		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
> +			WARN_INSN(insn,
> +				  "Non __ro_after_init static key \"%s\" in .noinstr section",

For consistency with other warnings, this should start with a lowercase
"n" and the string literal should be on the same line as the WARN_INSN,
like

			WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section",
				  ...

> diff --git a/tools/objtool/special.c b/tools/objtool/special.c
> index 91b1950f5bd8a..1f76cfd815bf3 100644
> --- a/tools/objtool/special.c
> +++ b/tools/objtool/special.c
> @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
>  			return -1;
>  		}
>  		alt->key_addend = reloc_addend(key_reloc);
> +
> +		reloc_to_sec_off(key_reloc, &sec, &offset);
> +		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);

Bits 0 and 1 can both store data, should be ~3?

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init
  2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
@ 2023-07-28 16:00   ` Josh Poimboeuf
  2023-07-31 11:16     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 16:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
> objtool now warns about it:
> 
>   vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   [...]
> 
> The key can only be enabled (and not disabled) in the __init function
> ct_cpu_tracker_user(), so mark it as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

It's best to avoid temporarily introducing warnings.  Bots will
rightfully complain about that.  This patch and the next one should come
before the objtool patches.

Also it would be helpful for the commit log to have a brief
justification for the patch beyond "fix the objtool warning".  Something
roughly like:

  Soon, runtime-mutable text won't be allowed in .noinstr sections, so
  that a code patching IPI to a userspace-bound CPU can be safely
  deferred to the next kernel entry.

  'context_tracking_key' is only enabled in __init ct_cpu_tracker_user().
  Mark it as __ro_after_init.

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
  2023-07-28 15:35   ` Josh Poimboeuf
@ 2023-07-28 16:02   ` Josh Poimboeuf
  2023-07-31 11:18     ` Valentin Schneider
  1 sibling, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 16:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
> Later commits will depend on having no runtime-mutable text in early entry
> code. (ab)use the .noinstr section as a marker of early entry code and warn
> about static keys used in it that can be flipped at runtime.

Similar to my comment on patch 13, this could also use a short
justification for adding the feature, i.e. why runtime-mutable text
isn't going to be allowed in .noinstr.

Also, please add a short description of the warning (and why it exists)
to tools/objtool/Documentation/objtool.txt.

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
  2023-07-20 16:30 ` [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider
@ 2023-07-28 22:04   ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-28 22:04 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:46PM +0100, Valentin Schneider wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> When a static_key is marked ro_after_init, its state will never change
> (after init), therefore jump_label_update() will never need to iterate
> the entries, and thus module load won't actually need to track this --
> avoiding the static_key::next write.
> 
> Therefore, mark these keys such that jump_label_add_module() might
> recognise them and avoid the modification.
> 
> Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
> to denote such keys.
> 
> Link: http://lore.kernel.org/r/20230705204142.GB2813335@hirez.programming.kicks-ass.net
> NOT-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> @Peter: I've barely touched this patch, it's just been writing a comment
> and fixing benign compilation issues, so credit's all yours really!

Ah, it works? Excellent! You can remove the NOT from the SoB then ;-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-26 19:41   ` Josh Poimboeuf
  2023-07-27  9:46     ` Valentin Schneider
@ 2023-07-29 19:09     ` Steven Rostedt
  2023-07-31 11:19       ` Valentin Schneider
  1 sibling, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-07-29 19:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed, 26 Jul 2023 12:41:48 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
> >  int filter_assign_type(const char *type)
> >  {
> > -	if (strstr(type, "__data_loc") && strstr(type, "char"))
> > -		return FILTER_DYN_STRING;
> > +	if (strstr(type, "__data_loc")) {
> > +		if (strstr(type, "char"))
> > +			return FILTER_DYN_STRING;
> > +		if (strstr(type, "cpumask_t"))
> > +			return FILTER_CPUMASK;
> > +		}  
> 
> The closing bracket has the wrong indentation.
> 
> > +		/* Copy the cpulist between { and } */
> > +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> > +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);  
> 
> Need to check kmalloc() failure?  And also free tmp?

I came to state the same thing.

Also, when you do an empty for loop:

	for (; str[i] && str[i] != '}'; i++);

Always put the semicolon on the next line, otherwise it is really easy
to think that the next line is part of the for loop. That is, instead
of the above, do:

	for (; str[i] && str[i] != '}'; i++)
		;


-- Steve


> 
> > +
> > +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> > +		if (!pred->mask)
> > +			goto err_mem;
> > +
> > +		/* Now parse it */
> > +		if (cpulist_parse(tmp, pred->mask)) {
> > +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> > +			goto err_free;
> > +		}
> > +
> > +		/* Move along */
> > +		i++;
> > +		if (field->filter_type == FILTER_CPUMASK)
> > +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
> > +  
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  2023-07-20 16:30 ` [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU Valentin Schneider
@ 2023-07-29 19:34   ` Steven Rostedt
  2023-07-31 11:20     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-07-29 19:34 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, 20 Jul 2023 17:30:41 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

>  		/* Move along */
>  		i++;
> +
> +		/*
> +		 * Optimisation: if the user-provided mask has a weight of one
> +		 * then we can treat it as a scalar input.
> +		 */
> +		single = cpumask_weight(pred->mask) == 1;
> +		if (single && field->filter_type == FILTER_CPUMASK) {
> +			pred->val = cpumask_first(pred->mask);
> +			kfree(pred->mask);

Don't we need:
			pred->mask = NULL;

here, or the free_predicate() will cause a double free?
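
I.e. the hunk would roughly become (same code as above, plus the NULL
assignment so free_predicate() doesn't kfree() the mask a second time):

		single = cpumask_weight(pred->mask) == 1;
		if (single && field->filter_type == FILTER_CPUMASK) {
			pred->val = cpumask_first(pred->mask);
			kfree(pred->mask);
			pred->mask = NULL;
		}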

-- Steve

> +		}
> +

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-20 16:30 ` [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the " Valentin Schneider
@ 2023-07-29 19:55   ` Steven Rostedt
  2023-07-31 11:20     ` Valentin Schneider
  2023-07-31 12:07     ` Dan Carpenter
  0 siblings, 2 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-07-29 19:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, 20 Jul 2023 17:30:42 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> Steven noted that when the user-provided cpumask contains a single CPU,
> then the filtering function can use a scalar as input instead of a
> full-fledged cpumask.
> 
> When the mask contains a single CPU, directly re-use the unsigned field
> predicate functions. Transform '&' into '==' beforehand.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/trace/trace_events_filter.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 2fe65ddeb34ef..54d642fabb7f1 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
>  		 * then we can treat it as a scalar input.
>  		 */
>  		single = cpumask_weight(pred->mask) == 1;
> -		if (single && field->filter_type == FILTER_CPUMASK) {
> +		if (single && field->filter_type != FILTER_CPU) {
>  			pred->val = cpumask_first(pred->mask);
>  			kfree(pred->mask);
>  		}
> @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
>  				FILTER_PRED_FN_CPUMASK;
>  		} else if (field->filter_type == FILTER_CPU) {
>  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> +		} else if (single) {
> +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;

Nit, the above can be written as:

			pred->op = pret->op != OP_BAND ? : OP_EQ;

-- Steve


> +			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
> +			if (pred->op == OP_NE)
> +				pred->not = 1;
>  		} else {
>  			switch (field->size) {
>  			case 8:


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init
  2023-07-28 16:00   ` Josh Poimboeuf
@ 2023-07-31 11:16     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 11:00, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
>> objtool now warns about it:
>>
>>   vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   [...]
>>
>> The key can only be enabled (and not disabled) in the __init function
>> ct_cpu_tracker_user(), so mark it as __ro_after_init.
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>
> It's best to avoid temporarily introducing warnings.  Bots will
> rightfully complain about that.  This patch and the next one should come
> before the objtool patches.
>

Ack, I'll reverse the order of these.

> Also it would be helpful for the commit log to have a brief
> justification for the patch beyond "fix the objtool warning".  Something
> roughly like:
>
>   Soon, runtime-mutable text won't be allowed in .noinstr sections, so
>   that a code patching IPI to a userspace-bound CPU can be safely
>   deferred to the next kernel entry.
>
>   'context_tracking_key' is only enabled in __init ct_cpu_tracker_user().
>   Mark it as __ro_after_init.
>

Looks better indeed, thanks!

> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-28 15:33   ` Josh Poimboeuf
@ 2023-07-31 11:16     ` Valentin Schneider
  2023-07-31 21:36       ` Josh Poimboeuf
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 10:33, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
>> I had to look into objtool itself to understand what this warning was
>> about; make it more explicit.
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  tools/objtool/check.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
>> index 8936a05f0e5ac..d308330f2910e 100644
>> --- a/tools/objtool/check.c
>> +++ b/tools/objtool/check.c
>> @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
>>
>>      list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
>>              if (!target->sec->noinstr) {
>> -			WARN("pv_ops[%d]: %s", idx, target->name);
>> +			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
>>                      file->pv_ops[idx].clean = false;
>
> This is an improvement, though I think it still results in two warnings,
> with the second not-so-useful warning happening in validate_call().
>
> Ideally it would only show a single warning, I guess that would need a
> little bit of restructuring the code.

You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():

  vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section

Interestingly the second one doesn't seem to have triggered the "pv_ops"
bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will
return NULL.

Trickling down the file yields:

  vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section

In my case (!PARAVIRT_XXL) pv_ops should look like:
  [0]: .cpu.io_delay
  [1]: .mmu.flush_tlb_user()

so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because
it uses arch_dest_reloc_offset().

If I use the above to fix up validate_call(), would we still need
pv_call_dest() & co?

>
> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-28 16:02   ` Josh Poimboeuf
@ 2023-07-31 11:18     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 11:02, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
>> Later commits will depend on having no runtime-mutable text in early entry
>> code. (ab)use the .noinstr section as a marker of early entry code and warn
>> about static keys used in it that can be flipped at runtime.
>
> Similar to my comment on patch 13, this could also use a short
> justification for adding the feature, i.e. why runtime-mutable text
> isn't going to be allowed in .noinstr.
>
> Also, please add a short description of the warning (and why it exists)
> to tools/objtool/Documentation/objtool.txt.
>

I had missed this, thanks for the pointer.

> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-28 15:35   ` Josh Poimboeuf
@ 2023-07-31 11:18     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 10:35, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
>> +static int validate_static_key(struct instruction *insn, struct insn_state *state)
>> +{
>> +	if (state->noinstr && state->instr <= 0) {
>> +		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
>> +			WARN_INSN(insn,
>> +				  "Non __ro_after_init static key \"%s\" in .noinstr section",
>
> For consistency with other warnings, this should start with a lowercase
> "n" and the string literal should be on the same line as the WARN_INSN,
> like
>
>                       WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section",
>                                 ...
>
>> diff --git a/tools/objtool/special.c b/tools/objtool/special.c
>> index 91b1950f5bd8a..1f76cfd815bf3 100644
>> --- a/tools/objtool/special.c
>> +++ b/tools/objtool/special.c
>> @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
>>                      return -1;
>>              }
>>              alt->key_addend = reloc_addend(key_reloc);
>> +
>> +		reloc_to_sec_off(key_reloc, &sec, &offset);
>> +		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
>
> Bits 0 and 1 can both store data, should be ~3?
>

Quite so, that needs to be the same as jump_entry_key().
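
Something along these lines, as a sketch (~3 mirroring JUMP_TYPE_MASK,
which covers bits 0-1):

	reloc_to_sec_off(key_reloc, &sec, &offset);
	alt->key_sym = find_symbol_by_offset(sec, offset & ~3);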

> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-29 19:09     ` Steven Rostedt
@ 2023-07-31 11:19       ` Valentin Schneider
  2023-07-31 15:48         ` Steven Rostedt
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:19 UTC (permalink / raw)
  To: Steven Rostedt, Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 29/07/23 15:09, Steven Rostedt wrote:
> On Wed, 26 Jul 2023 12:41:48 -0700
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
>> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>> >  int filter_assign_type(const char *type)
>> >  {
>> > -	if (strstr(type, "__data_loc") && strstr(type, "char"))
>> > -		return FILTER_DYN_STRING;
>> > +	if (strstr(type, "__data_loc")) {
>> > +		if (strstr(type, "char"))
>> > +			return FILTER_DYN_STRING;
>> > +		if (strstr(type, "cpumask_t"))
>> > +			return FILTER_CPUMASK;
>> > +		}
>>
>> The closing bracket has the wrong indentation.
>>
>> > +		/* Copy the cpulist between { and } */
>> > +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> > +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>>
>> Need to check kmalloc() failure?  And also free tmp?
>
> I came to state the same thing.
>
> Also, when you do an empty for loop:
>
>       for (; str[i] && str[i] != '}'; i++);
>
> Always put the semicolon on the next line, otherwise it is really easy
> to think that the next line is part of the for loop. That is, instead
> of the above, do:
>
>       for (; str[i] && str[i] != '}'; i++)
>               ;
>

Interestingly, I don't think I've ever encountered that variant; usually
an empty line (which this lacks) plus the indentation level is enough to
identify these - regardless, I'll change it.

>
> -- Steve
>
>
>>
>> > +
>> > +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> > +		if (!pred->mask)
>> > +			goto err_mem;
>> > +
>> > +		/* Now parse it */
>> > +		if (cpulist_parse(tmp, pred->mask)) {
>> > +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> > +			goto err_free;
>> > +		}
>> > +
>> > +		/* Move along */
>> > +		i++;
>> > +		if (field->filter_type == FILTER_CPUMASK)
>> > +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> > +
>>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  2023-07-29 19:34   ` Steven Rostedt
@ 2023-07-31 11:20     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 29/07/23 15:34, Steven Rostedt wrote:
> On Thu, 20 Jul 2023 17:30:41 +0100
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>>              /* Move along */
>>              i++;
>> +
>> +		/*
>> +		 * Optimisation: if the user-provided mask has a weight of one
>> +		 * then we can treat it as a scalar input.
>> +		 */
>> +		single = cpumask_weight(pred->mask) == 1;
>> +		if (single && field->filter_type == FILTER_CPUMASK) {
>> +			pred->val = cpumask_first(pred->mask);
>> +			kfree(pred->mask);
>
> Don't we need:
>                       pred->mask = NULL;
>
> here, or the free_predicate() will cause a double free?
>

We do, thanks for spotting this.

> -- Steve
>
>> +		}
>> +


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-29 19:55   ` Steven Rostedt
@ 2023-07-31 11:20     ` Valentin Schneider
  2023-07-31 12:07     ` Dan Carpenter
  1 sibling, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 29/07/23 15:55, Steven Rostedt wrote:
> On Thu, 20 Jul 2023 17:30:42 +0100
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>> Steven noted that when the user-provided cpumask contains a single CPU,
>> then the filtering function can use a scalar as input instead of a
>> full-fledged cpumask.
>>
>> When the mask contains a single CPU, directly re-use the unsigned field
>> predicate functions. Transform '&' into '==' beforehand.
>>
>> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  kernel/trace/trace_events_filter.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
>> index 2fe65ddeb34ef..54d642fabb7f1 100644
>> --- a/kernel/trace/trace_events_filter.c
>> +++ b/kernel/trace/trace_events_filter.c
>> @@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
>>               * then we can treat it as a scalar input.
>>               */
>>              single = cpumask_weight(pred->mask) == 1;
>> -		if (single && field->filter_type == FILTER_CPUMASK) {
>> +		if (single && field->filter_type != FILTER_CPU) {
>>                      pred->val = cpumask_first(pred->mask);
>>                      kfree(pred->mask);
>>              }
>> @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
>>                              FILTER_PRED_FN_CPUMASK;
>>              } else if (field->filter_type == FILTER_CPU) {
>>                      pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
>> +		} else if (single) {
>> +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
>
> Nit, the above can be written as:
>
>                       pred->op = pret->op != OP_BAND ? : OP_EQ;
>

That's neater, thanks!

> -- Steve
>
>
>> +			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
>> +			if (pred->op == OP_NE)
>> +				pred->not = 1;
>>              } else {
>>                      switch (field->size) {
>>                      case 8:


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-29 19:55   ` Steven Rostedt
  2023-07-31 11:20     ` Valentin Schneider
@ 2023-07-31 12:07     ` Dan Carpenter
  2023-07-31 15:54       ` Steven Rostedt
  1 sibling, 1 reply; 76+ messages in thread
From: Dan Carpenter @ 2023-07-31 12:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
> > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
> >  				FILTER_PRED_FN_CPUMASK;
> >  		} else if (field->filter_type == FILTER_CPU) {
> >  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> > +		} else if (single) {
> > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
> 
> Nit, the above can be written as:
> 
> 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> 

Heh.  Those are not equivalent.  The right way to write this is:

	if (pred->op == OP_BAND)
		pred->op = OP_EQ;

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-31 11:19       ` Valentin Schneider
@ 2023-07-31 15:48         ` Steven Rostedt
  0 siblings, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-07-31 15:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Josh Poimboeuf, linux-kernel, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, 31 Jul 2023 12:19:51 +0100
Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > Also, when you do an empty for loop:
> >
> >       for (; str[i] && str[i] != '}'; i++);
> >
> > Always put the semicolon on the next line, otherwise it is really easy
> > to think that the next line is part of the for loop. That is, instead
> > of the above, do:
> >
> >       for (; str[i] && str[i] != '}'; i++)
> >               ;
> >  
> 
> Interestingly I don't think I've ever encountered that variant, usually
> having an empty line (which this lacks) and the indentation level is enough
> to identify these - regardless, I'll change it.


Do a "git grep -B1 -e '^\s*;\s*$'"

You'll find that it is quite common.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 12:07     ` Dan Carpenter
@ 2023-07-31 15:54       ` Steven Rostedt
  2023-07-31 16:03         ` Dan Carpenter
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-07-31 15:54 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, 31 Jul 2023 15:07:52 +0300
Dan Carpenter <dan.carpenter@linaro.org> wrote:

> On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
> > > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
> > >  				FILTER_PRED_FN_CPUMASK;
> > >  		} else if (field->filter_type == FILTER_CPU) {
> > >  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> > > +		} else if (single) {
> > > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;  
> > 
> > Nit, the above can be written as:
> > 
> > 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> >   
> 
> Heh.  Those are not equivalent.  The right way to write this is:

You mean because of my typo?

> 
> 	if (pred->op == OP_BAND)
> 		pred->op = OP_EQ;

But sure, I'm fine with that, and it's probably more readable too.

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 15:54       ` Steven Rostedt
@ 2023-07-31 16:03         ` Dan Carpenter
  2023-07-31 17:20           ` Valentin Schneider
  2023-07-31 18:16           ` Steven Rostedt
  0 siblings, 2 replies; 76+ messages in thread
From: Dan Carpenter @ 2023-07-31 16:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 11:54:53AM -0400, Steven Rostedt wrote:
> On Mon, 31 Jul 2023 15:07:52 +0300
> Dan Carpenter <dan.carpenter@linaro.org> wrote:
> 
> > On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
> > > > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
> > > >  				FILTER_PRED_FN_CPUMASK;
> > > >  		} else if (field->filter_type == FILTER_CPU) {
> > > >  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> > > > +		} else if (single) {
> > > > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;  
> > > 
> > > Nit, the above can be written as:
> > > 
> > > 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> > >   
> > 
> > Heh.  Those are not equivalent.  The right way to write this is:
> 
> You mean because of my typo?

No, I hadn't seen the s/pred/pret/ typo.  Your code does:

	if (pred->op != OP_BAND)
		pred->op = true;
	else
		pred->op = OP_EQ;

Really, we should probably trigger a static checker warning any time
someone does a compare operation as part of an "x = comparison ?: bar;"
construct.  Years ago, someone asked me to do that with regard to error
codes like:

	return ret < 0 ?: -EINVAL;

but I don't remember the results.
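
(For clarity: with GCC's elided middle operand, that expands to roughly

	return (ret < 0) ? (ret < 0) : -EINVAL;

i.e. it returns 1 in the error case rather than propagating the error
code.)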

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 16:03         ` Dan Carpenter
@ 2023-07-31 17:20           ` Valentin Schneider
  2023-07-31 18:16           ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 17:20 UTC (permalink / raw)
  To: Dan Carpenter, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 31/07/23 19:03, Dan Carpenter wrote:
> On Mon, Jul 31, 2023 at 11:54:53AM -0400, Steven Rostedt wrote:
>> On Mon, 31 Jul 2023 15:07:52 +0300
>> Dan Carpenter <dan.carpenter@linaro.org> wrote:
>>
>> > On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
>> > > > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
>> > > >                                FILTER_PRED_FN_CPUMASK;
>> > > >                } else if (field->filter_type == FILTER_CPU) {
>> > > >                        pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
>> > > > +		} else if (single) {
>> > > > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
>> > >
>> > > Nit, the above can be written as:
>> > >
>> > >                  pred->op = pret->op != OP_BAND ? : OP_EQ;
>> > >
>> >
>> > Heh.  Those are not equivalent.  The right way to write this is:
>>
>> You mean because of my typo?
>
> No, I hadn't seen the s/pred/pret/ typo.  Your code does:
>
>       if (pred->op != OP_BAND)
>               pred->op = true;
>       else
>               pred->op = OP_EQ;
>
> Really, we should probably trigger a static checker warning any time
> someone does a compare operation as part of an "x = comparison ?: bar;"
> construct.  Years ago, someone asked me to do that with regard to error
> codes like:
>
>       return ret < 0 ?: -EINVAL;
>
> but I don't remember the results.
>

FWIW this is caught by GCC:

     error: the omitted middle operand in ?: will always be ‘true’, suggest explicit middle operand [-Werror=parentheses]
     pred->op = pred->op != OP_BAND ? : OP_EQ;


> regards,
> dan carpenter


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 16:03         ` Dan Carpenter
  2023-07-31 17:20           ` Valentin Schneider
@ 2023-07-31 18:16           ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-07-31 18:16 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, 31 Jul 2023 19:03:04 +0300
Dan Carpenter <dan.carpenter@linaro.org> wrote:

> > > > Nit, the above can be written as:
> > > > 
> > > > 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> > > >     
> > > 
> > > Heh.  Those are not equivalent.  The right way to write this is:  
> > 
> > You mean because of my typo?  
> 
> No, I hadn't seen the s/pred/pret/ typo.  Your code does:
> 
> 	if (pred->op != OP_BAND)
> 		pred->op = true;
> 	else
> 		pred->op = OP_EQ;

Ah, for some reason I was thinking the ? : was just a nop, but I guess
it is there to assign the cond value :-/

But of course every place I've done that, it was the condition value I
wanted, which was the same as the value being assigned.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-31 11:16     ` Valentin Schneider
@ 2023-07-31 21:36       ` Josh Poimboeuf
  2023-07-31 21:46         ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-31 21:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 12:16:59PM +0100, Valentin Schneider wrote:
> You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
> 
>   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
> 
> Interestingly the second one doesn't seem to have triggered the "pv_ops"
> bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will
> return NULL.

Yeah, that's weird.

> Trickling the file pointer down (so insn_reloc() gets a non-NULL file)
> yields:
> 
>   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
> 
> In my case (!PARAVIRT_XXL) pv_ops should look like:
>   [0]: .cpu.io_delay
>   [1]: .mmu.flush_tlb_user()
> 
> so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because
> it uses arch_dest_reloc_offset().
> 
> If I use the above to fix up validate_call(), would we still need
> pv_call_dest() & co?

The functionality in pv_call_dest() is still needed because it goes
through all the possible targets for the .mmu.flush_tlb_user() pointer
-- xen_flush_tlb() and native_flush_tlb_local() -- and makes sure
they're noinstr.

Ideally it would only print a single warning for this case, something
like:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section

I left out "pv_ops[1]" because it's already long enough :-)

It would need a little bit of code shuffling.  But it's really a
preexisting problem so don't feel compelled to fix it with this patch
set.
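
For reference, the fix-up discussed above would look roughly like the
below -- a sketch only, assuming objtool's existing insn_reloc(),
insn_call_dest() and arch_dest_reloc_offset() helpers, with the
objtool_file pointer trickled down so insn_reloc() can actually resolve
the relocation:

  static char *call_dest_name(struct objtool_file *file,
                              struct instruction *insn)
  {
          static char pvname[19];
          struct reloc *reloc;
          int idx;

          if (insn_call_dest(insn))
                  return insn_call_dest(insn)->name;

          /* insn_reloc(NULL, insn) is what yielded "{dynamic}" above */
          reloc = insn_reloc(file, insn);
          if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
                  /*
                   * As in pv_call_dest(): account for the offset of the
                   * destination within the pv_ops entry, otherwise the
                   * addend truncates to the wrong slot (pv_ops[0]
                   * instead of pv_ops[1] above).
                   */
                  idx = arch_dest_reloc_offset(reloc->addend) / sizeof(void *);
                  snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
                  return pvname;
          }

          return "{dynamic}";
  }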

-- 
Josh


* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-31 21:36       ` Josh Poimboeuf
@ 2023-07-31 21:46         ` Peter Zijlstra
  2023-08-01 16:06           ` Josh Poimboeuf
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-31 21:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 04:36:31PM -0500, Josh Poimboeuf wrote:
> On Mon, Jul 31, 2023 at 12:16:59PM +0100, Valentin Schneider wrote:
> > You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
> > 
> >   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
> > 
> > Interestingly the second one doesn't seem to have triggered the "pv_ops"
> > bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will
> > return NULL.
> 
> Yeah, that's weird.
> 
> > Trickling the file pointer down (so insn_reloc() gets a non-NULL file)
> > yields:
> > 
> >   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
> > 
> > In my case (!PARAVIRT_XXL) pv_ops should look like:
> >   [0]: .cpu.io_delay
> >   [1]: .mmu.flush_tlb_user()
> > 
> > so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because
> > it uses arch_dest_reloc_offset().
> > 
> > If I use the above to fix up validate_call(), would we still need
> > pv_call_dest() & co?
> 
> The functionality in pv_call_dest() is still needed because it goes
> through all the possible targets for the .mmu.flush_tlb_user() pointer
> -- xen_flush_tlb() and native_flush_tlb_local() -- and makes sure
> they're noinstr.
> 
> Ideally it would only print a single warning for this case, something
> like:
> 
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section

But then what about the case where there are multiple implementations and
more than one isn't noinstr? IIRC that is where these double prints came
from. One is the callsite (always one) and the second is the offending
implementation (but there could be more).

> I left out "pv_ops[1]" because it's already long enough :-)

The index number is useful when also looking at the assembler, which
IIRC is an indexed indirect call.


* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-31 21:46         ` Peter Zijlstra
@ 2023-08-01 16:06           ` Josh Poimboeuf
  2023-08-01 18:12             ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-08-01 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 11:46:12PM +0200, Peter Zijlstra wrote:
> > Ideally it would only print a single warning for this case, something
> > like:
> > 
> >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> 
> But then what about the case where there are multiple implementations and
> more than one isn't noinstr?

The warning would be in the loop in pv_call_dest(), so it would
potentially print multiple warnings, one for each potential dest.

> IIRC that is where these double prints came from. One is the callsite
> (always one) and the second is the offending implementation (but there
> could be more).

It's confusing to warn about the call site and the destination in two
separate warnings.  That's why I'm proposing combining them into a
single warning (which still could end up as multiple warnings if there
are multiple affected dests).
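
Roughly -- a sketch only, assuming the per-entry .clean flag and target
list objtool already keeps for pv_ops[] -- the warning would move into
the pv_call_dest() loop, one line per offending destination:

  static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
  {
          struct symbol *target;
          struct reloc *reloc;
          int idx;

          reloc = insn_reloc(file, insn);
          if (!reloc || strcmp(reloc->sym->name, "pv_ops"))
                  return false;

          idx = arch_dest_reloc_offset(reloc->addend) / sizeof(void *);

          if (file->pv_ops[idx].clean)
                  return true;

          file->pv_ops[idx].clean = true;

          list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
                  if (!target->sec->noinstr) {
                          /* combined call-site + destination warning */
                          WARN_INSN(insn, "indirect call to pv_ops[%d] (%s) leaves .noinstr.text section",
                                    idx, target->name);
                          file->pv_ops[idx].clean = false;
                  }
          }

          /* callers no longer need to emit their own warning */
          return file->pv_ops[idx].clean;
  }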

> > I left out "pv_ops[1]" because it's already long enough :-)
> 
> The index number is useful when also looking at the assembler, which
> IIRC is an indexed indirect call.

Ok, so something like so?

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to pv_ops[1] (native_flush_tlb_local) leaves .noinstr.text section

-- 
Josh


* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-08-01 16:06           ` Josh Poimboeuf
@ 2023-08-01 18:12             ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-08-01 18:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Tue, Aug 01, 2023 at 11:06:36AM -0500, Josh Poimboeuf wrote:
> On Mon, Jul 31, 2023 at 11:46:12PM +0200, Peter Zijlstra wrote:
> > > Ideally it would only print a single warning for this case, something
> > > like:
> > > 
> > >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> > 
> > But then what about the case where there are multiple implementations and
> > more than one isn't noinstr?
> 
> The warning would be in the loop in pv_call_dest(), so it would
> potentially print multiple warnings, one for each potential dest.
> 
> > IIRC that is where these double prints came from. One is the callsite
> > (always one) and the second is the offending implementation (but there
> > could be more).
> 
> It's confusing to warn about the call site and the destination in two
> separate warnings.  That's why I'm proposing combining them into a
> single warning (which still could end up as multiple warnings if there
> are multiple affected dests).
> 
> > > I left out "pv_ops[1]" because it's already long enough :-)
> > 
> > The index number is useful when also looking at the assembler, which
> > IIRC is an indexed indirect call.
> 
> Ok, so something like so?
> 
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to pv_ops[1] (native_flush_tlb_local) leaves .noinstr.text section

Sure, that all would work I suppose.


* Re: [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
@ 2023-10-09 16:40   ` Maxim Levitsky
  0 siblings, 0 replies; 76+ messages in thread
From: Maxim Levitsky @ 2023-10-09 16:40 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, 2023-07-20 at 17:30 +0100, Valentin Schneider wrote:
> objtool now warns about it:
> 
>   vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section
> 
> The key can only be enabled (and not disabled) in the __init function
> kvm_guest_init(), so mark it as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/x86/kernel/kvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 1cceac5984daa..319460090a836 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -44,7 +44,7 @@
>  #include <asm/svm.h>
>  #include <asm/e820/api.h>
>  
> -DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
>  
>  static int kvmapf = 1;
>  
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
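
For context, the two macros differ only in the section annotation: per
include/linux/jump_label.h, DEFINE_STATIC_KEY_FALSE_RO() just tacks
__ro_after_init onto the key, so it becomes read-only once init
completes:

  #define DEFINE_STATIC_KEY_FALSE(name)                           \
          struct static_key_false name = STATIC_KEY_FALSE_INIT

  #define DEFINE_STATIC_KEY_FALSE_RO(name)                        \
          struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT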



end of thread

Thread overview: 76+ messages
2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 01/20] tracing/filters: Dynamically allocate filter_pred.regex Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
2023-07-26 19:41   ` Josh Poimboeuf
2023-07-27  9:46     ` Valentin Schneider
2023-07-29 19:09     ` Steven Rostedt
2023-07-31 11:19       ` Valentin Schneider
2023-07-31 15:48         ` Steven Rostedt
2023-07-20 16:30 ` [RFC PATCH v2 03/20] tracing/filters: Enable filtering a scalar field by a cpumask Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 04/20] tracing/filters: Enable filtering the CPU common " Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU Valentin Schneider
2023-07-29 19:34   ` Steven Rostedt
2023-07-31 11:20     ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the " Valentin Schneider
2023-07-29 19:55   ` Steven Rostedt
2023-07-31 11:20     ` Valentin Schneider
2023-07-31 12:07     ` Dan Carpenter
2023-07-31 15:54       ` Steven Rostedt
2023-07-31 16:03         ` Dan Carpenter
2023-07-31 17:20           ` Valentin Schneider
2023-07-31 18:16           ` Steven Rostedt
2023-07-20 16:30 ` [RFC PATCH v2 07/20] tracing/filters: Optimise CPU " Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 08/20] tracing/filters: Further optimise scalar vs cpumask comparison Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 09/20] tracing/filters: Document cpumask filtering Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider
2023-07-28 22:04   ` Peter Zijlstra
2023-07-20 16:30 ` [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2023-07-28 15:33   ` Josh Poimboeuf
2023-07-31 11:16     ` Valentin Schneider
2023-07-31 21:36       ` Josh Poimboeuf
2023-07-31 21:46         ` Peter Zijlstra
2023-08-01 16:06           ` Josh Poimboeuf
2023-08-01 18:12             ` Peter Zijlstra
2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
2023-07-28 15:35   ` Josh Poimboeuf
2023-07-31 11:18     ` Valentin Schneider
2023-07-28 16:02   ` Josh Poimboeuf
2023-07-31 11:18     ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
2023-07-28 16:00   ` Josh Poimboeuf
2023-07-31 11:16     ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
2023-10-09 16:40   ` Maxim Levitsky
2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2023-07-24 14:52   ` Frederic Weisbecker
2023-07-24 16:55     ` Valentin Schneider
2023-07-24 19:18       ` Frederic Weisbecker
2023-07-25 10:10         ` Valentin Schneider
2023-07-25 11:22           ` Frederic Weisbecker
2023-07-25 13:05             ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
2023-07-21  8:17   ` Valentin Schneider
2023-07-21 14:10     ` Paul E. McKenney
2023-07-21 15:08       ` Valentin Schneider
2023-07-21 16:09         ` Paul E. McKenney
2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
2023-07-20 19:53   ` Paul E. McKenney
2023-07-21  4:00     ` Paul E. McKenney
2023-07-21  7:58       ` Valentin Schneider
2023-07-21 14:07         ` Paul E. McKenney
2023-07-21 15:08           ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2023-07-25 10:49   ` Joel Fernandes
2023-07-25 13:36     ` Valentin Schneider
2023-07-25 17:41       ` Joel Fernandes
2023-07-25 13:39     ` Peter Zijlstra
2023-07-25 17:47       ` Joel Fernandes
2023-07-20 16:30 ` [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
2023-07-21 18:15   ` Nadav Amit
2023-07-24 11:32     ` Valentin Schneider
2023-07-24 17:40       ` Dave Hansen
2023-07-25 13:21         ` Peter Zijlstra
2023-07-25 14:03           ` Valentin Schneider
2023-07-25 16:37         ` Marcelo Tosatti
2023-07-25 17:12           ` Dave Hansen
