* [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition
@ 2023-07-05 18:12 Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 01/14] tracing/filters: Dynamically allocate filter_pred.regex Valentin Schneider
                   ` (15 more replies)
  0 siblings, 16 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the observation that, while we cannot remove
NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have
to execute the callbacks immediately. Anything that only affects kernelspace
can wait until the next user->kernel transition, provided it can be executed
"early enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned
out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
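
A rough sketch of what "coalesced" means here (the names below are purely
illustrative, not necessarily those introduced by the later patches): each
deferrable IPI gets a bit, and kernel entry walks the pending bits, invoking
each matching callback exactly once regardless of how many times it was
requested while the CPU sat in userspace.

  enum ct_work {
          CT_WORK_SYNC_CORE,      /* deferred text_poke() sync */
          CT_WORK_TLB_FLUSH,      /* deferred kernel TLB invalidation */
          CT_WORK_MAX,
  };

  static void (*ct_work_fn[CT_WORK_MAX])(void) = {
          [CT_WORK_SYNC_CORE] = do_sync_core,
          [CT_WORK_TLB_FLUSH] = do_flush_tlb_all,
  };

  /* Run early in the user->kernel transition */
  static void ct_work_flush(unsigned long work)
  {
          unsigned int bit;

          for_each_set_bit(bit, &work, CT_WORK_MAX)
                  ct_work_fn[bit]();
  }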

What about IPIs whose callback takes a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
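
As a purely illustrative example of that idea (none of these helpers exist in
this series, and locking/resetting of the range is ignored for brevity):
rather than embedding a flush range in the IPI payload, the housekeeping side
could publish it in global state and the isolated CPU could rebuild it upon
kernel entry.

  /* Written by housekeeping CPUs, read when the deferred callback runs */
  static struct {
          unsigned long start;
          unsigned long end;
  } deferred_range = { .start = ULONG_MAX, .end = 0 };

  /* Housekeeping CPU: widen the pending range instead of sending it via IPI */
  static void note_deferred_range(unsigned long start, unsigned long end)
  {
          deferred_range.start = min(deferred_range.start, start);
          deferred_range.end   = max(deferred_range.end, end);
  }

  /* Isolated CPU, upon kernel entry: reconstruct the argument locally */
  static void do_deferred_range_flush(void)
  {
          flush_tlb_kernel_range(deferred_range.start, deferred_range.end);
  }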

Kernel entry vs execution of the deferred operation
===================================================

There is a non-zero amount of code that is executed upon kernel entry before
the deferred operation can itself be executed (i.e. before we start getting
into context_tracking.c proper).

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen there. For instance, we really
don't want to hit instructions that have been modified by a remote text_poke()
while we're on our way to execute a deferred sync_core().

Patches
=======

o Patches 1-5 have been submitted previously and are included for the sake of
  testing

o Patches 6-10 focus on having objtool detect problematic static key usage in early
  entry

  The context_tracking_key one causes a page_fault_oops() on my RHEL userspace
  due to the KVM module, cf. changelog.
  
o Patch 11 adds the infrastructure for IPI deferral.  
o Patch 12 adds text_poke() IPI deferral.

  This one I'm fairly confident about; we "just" need to do something about the
  __ro_after_init key vs module loading issue.

o Patches 13-14 add vunmap() flush_tlb_kernel_range() IPI deferral

  These ones I'm a lot less confident about, mostly due to lacking
  instrumentation/verification.
  
  The actual deferred callback is also incomplete as it's not properly noinstr:
    vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
  and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
  have *no idea* what a sane implementation would be for Xen so I haven't
  touched that yet.

Patches are also available at:

https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v1

Testing
=======

Xeon Platinum 8380 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & MASK{$ISOL_CPUS}" \
	           -e "ipi_send_cpu"     -f "cpu & MASK{$ISOL_CPUS}" \
		   rteval --onlyload --loads-cpulist=$HK_CPUS \
		   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 20 minutes.
		   
v6.4 (+ cpumask filtering patches):
$ trace-cmd report | grep callback | awk '{ print $NF }' | sort | uniq -c
    236 callback=do_flush_tlb_all+0x0
    576 callback=do_sync_core+0x0
    814 callback=generic_smp_call_function_single_interrupt+0x0
    309 callback=nohz_full_kick_func+0x0

v6.4 + patches:
$ trace-cmd report | grep callback | awk '{ print $NF }' | sort | uniq -c
     22 callback=do_flush_tlb_all+0x0
     24 callback=generic_smp_call_function_single_interrupt+0x0
    307 callback=nohz_full_kick_func+0x0

o IPIs from instruction patching are entirely gone.

o Some TLB flushes remain as I only patched the vunmap cases:

  kworker/2:0-13856 [002]  3517.445719: ipi_send_cpumask:     cpumask=0-1,3-79 callsite=on_each_cpu_cond_mask+0x20 
  callback=do_flush_tlb_all+0x0
  kworker/2:0-13856 [002]  3517.445722: kernel_stack:         <stack trace >
  => trace_event_raw_event_ipi_send_cpumask (ffffffffa974a050)
  => smp_call_function_many_cond (ffffffffa97fb0a7)
  => on_each_cpu_cond_mask (ffffffffa97fb1f0)
  => pcpu_reclaim_populated (ffffffffa996b451)
  => pcpu_balance_workfn (ffffffffa996c399)
  => process_one_work (ffffffffa9730e14)
  => worker_thread (ffffffffa9731440)
  => kthread (ffffffffa973984e)

o The nohz_full_kick_func() ones seem to come from dev_watchdog(), but are in
  any case consistent across revisions

  <...>-3734  [042]   392.890491: ipi_send_cpu:         cpu=42 callsite=irq_work_queue_on+0x77 callback=nohz_full_kick_func+0x0
  <...>-3734  [042]   392.890497: kernel_stack:         <stack trace >
  => trace_event_raw_event_ipi_send_cpu (ffffffff901492d8)
  => __irq_work_queue_local (ffffffff902acb3d)
  => irq_work_queue_on (ffffffff902acc47)
  => __mod_timer (ffffffff901dcd81)
  => dev_watchdog (ffffffff90a75310)
  => call_timer_fn (ffffffff901dc174)
  => __run_timers.part.0 (ffffffff901dc47e)
  => run_timer_softirq (ffffffff901dc546)
  => __do_softirq (ffffffff90c52348)
  => __irq_exit_rcu (ffffffff90113329)
  => sysvec_apic_timer_interrupt (ffffffff90c3c895)
  => asm_sysvec_apic_timer_interrupt (ffffffff90e00d86)

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
  .data..ro_after_init section.

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE

Valentin Schneider (14):
  tracing/filters: Dynamically allocate filter_pred.regex
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Document cpumask filtering
  objtool: Flesh out warning related to pv_ops[] calls
  objtool: Warn about non __ro_after_init static key usage in .noinstr
  BROKEN: context_tracking: Make context_tracking_key __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  x86/sev-es: Make sev_es_enable_key __ro_after_init
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  context_tracking,x86: Add infrastructure to defer kernel TLBI
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
    CPUs

 Documentation/trace/events.rst               |  14 ++
 arch/Kconfig                                 |   9 +
 arch/x86/Kconfig                             |   1 +
 arch/x86/include/asm/context_tracking_work.h |  20 ++
 arch/x86/include/asm/text-patching.h         |   1 +
 arch/x86/include/asm/tlbflush.h              |   2 +
 arch/x86/kernel/alternative.c                |  24 +-
 arch/x86/kernel/kprobes/core.c               |   4 +-
 arch/x86/kernel/kprobes/opt.c                |   4 +-
 arch/x86/kernel/kvm.c                        |   2 +-
 arch/x86/kernel/module.c                     |   2 +-
 arch/x86/kernel/sev.c                        |   2 +-
 arch/x86/mm/tlb.c                            |  40 +++-
 include/linux/context_tracking.h             |   1 +
 include/linux/context_tracking_state.h       |   1 +
 include/linux/context_tracking_work.h        |  30 +++
 include/linux/trace_events.h                 |   1 +
 kernel/context_tracking.c                    |  65 +++++-
 kernel/time/Kconfig                          |   5 +
 kernel/trace/trace_events_filter.c           | 228 ++++++++++++++++---
 mm/vmalloc.c                                 |  15 +-
 tools/objtool/check.c                        |  22 +-
 tools/objtool/include/objtool/check.h        |   1 +
 tools/objtool/include/objtool/special.h      |   2 +
 tools/objtool/special.c                      |   3 +
 25 files changed, 451 insertions(+), 48 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.31.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH 01/14] tracing/filters: Dynamically allocate filter_pred.regex
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 02/14] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array
in the regex field, even if the predicate function does not use the field.

A later commit will introduce a dynamically allocated cpumask to struct
filter_pred, which will require a dedicated freeing function. Bite the
bullet and make filter_pred.regex dynamically allocated.

While at it, reorder the fields of filter_pred to fill in the byte
holes. The struct now fits on a single cacheline.

No change in behaviour intended.

The kfree()'s were patched via Coccinelle:
  @@
  struct filter_pred *pred;
  @@

  -kfree(pred);
  +free_predicate(pred);

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 62 ++++++++++++++++++------------
 1 file changed, 37 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 1dad64267878c..d999a218fe833 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -70,15 +70,15 @@ enum filter_pred_fn {
 };
 
 struct filter_pred {
-	enum filter_pred_fn 	fn_num;
-	u64 			val;
-	u64 			val2;
-	struct regex		regex;
+	struct regex		*regex;
 	unsigned short		*ops;
 	struct ftrace_event_field *field;
-	int 			offset;
+	u64			val;
+	u64			val2;
+	enum filter_pred_fn	fn_num;
+	int			offset;
 	int			not;
-	int 			op;
+	int			op;
 };
 
 /*
@@ -186,6 +186,12 @@ enum {
 	PROCESS_OR	= 4,
 };
 
+static void free_predicate(struct filter_pred *pred)
+{
+	kfree(pred->regex);
+	kfree(pred);
+}
+
 /*
  * Without going into a formal proof, this explains the method that is used in
  * parsing the logical expressions.
@@ -623,7 +629,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
 	kfree(inverts);
 	if (prog_stack) {
 		for (i = 0; prog_stack[i].pred; i++)
-			kfree(prog_stack[i].pred);
+			free_predicate(prog_stack[i].pred);
 		kfree(prog_stack);
 	}
 	return ERR_PTR(ret);
@@ -750,7 +756,7 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
 	char *addr = (char *)(event + pred->offset);
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, pred->regex.field_len);
+	cmp = pred->regex->match(addr, pred->regex, pred->regex->field_len);
 
 	match = cmp ^ pred->not;
 
@@ -763,7 +769,7 @@ static __always_inline int filter_pchar(struct filter_pred *pred, char *str)
 	int len;
 
 	len = strlen(str) + 1;	/* including tailing '\0' */
-	cmp = pred->regex.match(str, &pred->regex, len);
+	cmp = pred->regex->match(str, pred->regex, len);
 
 	match = cmp ^ pred->not;
 
@@ -813,7 +819,7 @@ static int filter_pred_strloc(struct filter_pred *pred, void *event)
 	char *addr = (char *)(event + str_loc);
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, str_len);
+	cmp = pred->regex->match(addr, pred->regex, str_len);
 
 	match = cmp ^ pred->not;
 
@@ -836,7 +842,7 @@ static int filter_pred_strrelloc(struct filter_pred *pred, void *event)
 	char *addr = (char *)(&item[1]) + str_loc;
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, str_len);
+	cmp = pred->regex->match(addr, pred->regex, str_len);
 
 	match = cmp ^ pred->not;
 
@@ -874,7 +880,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
 	int cmp;
 
-	cmp = pred->regex.match(current->comm, &pred->regex,
+	cmp = pred->regex->match(current->comm, pred->regex,
 				TASK_COMM_LEN);
 	return cmp ^ pred->not;
 }
@@ -1004,7 +1010,7 @@ enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not)
 
 static void filter_build_regex(struct filter_pred *pred)
 {
-	struct regex *r = &pred->regex;
+	struct regex *r = pred->regex;
 	char *search;
 	enum regex_type type = MATCH_FULL;
 
@@ -1169,7 +1175,7 @@ static void free_prog(struct event_filter *filter)
 		return;
 
 	for (i = 0; prog[i].pred; i++)
-		kfree(prog[i].pred);
+		free_predicate(prog[i].pred);
 	kfree(prog);
 }
 
@@ -1553,9 +1559,12 @@ static int parse_pred(const char *str, void *data,
 			goto err_free;
 		}
 
-		pred->regex.len = len;
-		strncpy(pred->regex.pattern, str + s, len);
-		pred->regex.pattern[len] = 0;
+		pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+		if (!pred->regex)
+			goto err_mem;
+		pred->regex->len = len;
+		strncpy(pred->regex->pattern, str + s, len);
+		pred->regex->pattern[len] = 0;
 
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
@@ -1597,9 +1606,12 @@ static int parse_pred(const char *str, void *data,
 			goto err_free;
 		}
 
-		pred->regex.len = len;
-		strncpy(pred->regex.pattern, str + s, len);
-		pred->regex.pattern[len] = 0;
+		pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+		if (!pred->regex)
+			goto err_mem;
+		pred->regex->len = len;
+		strncpy(pred->regex->pattern, str + s, len);
+		pred->regex->pattern[len] = 0;
 
 		filter_build_regex(pred);
 
@@ -1608,7 +1620,7 @@ static int parse_pred(const char *str, void *data,
 
 		} else if (field->filter_type == FILTER_STATIC_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRING;
-			pred->regex.field_len = field->size;
+			pred->regex->field_len = field->size;
 
 		} else if (field->filter_type == FILTER_DYN_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRLOC;
@@ -1691,10 +1703,10 @@ static int parse_pred(const char *str, void *data,
 	return i;
 
 err_free:
-	kfree(pred);
+	free_predicate(pred);
 	return -EINVAL;
 err_mem:
-	kfree(pred);
+	free_predicate(pred);
 	return -ENOMEM;
 }
 
@@ -2287,8 +2299,8 @@ static int ftrace_function_set_filter_pred(struct filter_pred *pred,
 		return ret;
 
 	return __ftrace_function_set_filter(pred->op == OP_EQ,
-					    pred->regex.pattern,
-					    pred->regex.len,
+					    pred->regex->pattern,
+					    pred->regex->len,
 					    data);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 02/14] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 01/14] tracing/filters: Dynamically allocate filter_pred.regex Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:54   ` Steven Rostedt
  2023-07-05 18:12 ` [RFC PATCH 03/14] tracing/filters: Enable filtering a scalar field by a cpumask Valentin Schneider
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

The recently introduced ipi_send_cpumask trace event contains a cpumask
field, but it currently cannot be used in filter expressions.

Make event filtering aware of cpumask fields, and allow these to be
filtered by a user-provided cpumask.

The user-provided cpumask is to be given in cpulist format and wrapped as:
"MASK{$cpulist}". The use of curly braces instead of parentheses is to
prevent predicate_parse() from parsing the contents of MASK{...} as a
full-fledged predicate subexpression.

This enables e.g.:

$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpu & MASK{2,4,6,8-32}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/trace_events.h       |  1 +
 kernel/trace/trace_events_filter.c | 85 +++++++++++++++++++++++++++++-
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 7c4a0b72334eb..974ef37a06c83 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -804,6 +804,7 @@ enum {
 	FILTER_RDYN_STRING,
 	FILTER_PTR_STRING,
 	FILTER_TRACE_FN,
+	FILTER_CPUMASK,
 	FILTER_COMM,
 	FILTER_CPU,
 	FILTER_STACKTRACE,
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index d999a218fe833..8af00caa363f7 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -64,6 +64,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_PCHAR_USER,
 	FILTER_PRED_FN_PCHAR,
 	FILTER_PRED_FN_CPU,
+	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
 	FILTER_PRED_TEST_VISITED,
@@ -71,6 +72,7 @@ enum filter_pred_fn {
 
 struct filter_pred {
 	struct regex		*regex;
+	struct cpumask          *mask;
 	unsigned short		*ops;
 	struct ftrace_event_field *field;
 	u64			val;
@@ -94,6 +96,8 @@ struct filter_pred {
 	C(TOO_MANY_OPEN,	"Too many '('"),			\
 	C(TOO_MANY_CLOSE,	"Too few '('"),				\
 	C(MISSING_QUOTE,	"Missing matching quote"),		\
+	C(MISSING_BRACE_OPEN,   "Missing '{'"),				\
+	C(MISSING_BRACE_CLOSE,  "Missing '}'"),				\
 	C(OPERAND_TOO_LONG,	"Operand too long"),			\
 	C(EXPECT_STRING,	"Expecting string field"),		\
 	C(EXPECT_DIGIT,		"Expecting numeric field"),		\
@@ -103,6 +107,7 @@ struct filter_pred {
 	C(BAD_SUBSYS_FILTER,	"Couldn't find or set field in one of a subsystem's events"), \
 	C(TOO_MANY_PREDS,	"Too many terms in predicate expression"), \
 	C(INVALID_FILTER,	"Meaningless filter expression"),	\
+	C(INVALID_CPULIST,	"Invalid cpulist"),	\
 	C(IP_FIELD_ONLY,	"Only 'ip' field is supported for function trace"), \
 	C(INVALID_VALUE,	"Invalid value (did you forget quotes)?"), \
 	C(NO_FUNCTION,		"Function not found"),			\
@@ -189,6 +194,7 @@ enum {
 static void free_predicate(struct filter_pred *pred)
 {
 	kfree(pred->regex);
+	kfree(pred->mask);
 	kfree(pred);
 }
 
@@ -875,6 +881,26 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
 	}
 }
 
+/* Filter predicate for cpumasks. */
+static int filter_pred_cpumask(struct filter_pred *pred, void *event)
+{
+	u32 item = *(u32 *)(event + pred->offset);
+	int loc = item & 0xffff;
+	const struct cpumask *mask = (event + loc);
+	const struct cpumask *cmp = pred->mask;
+
+	switch (pred->op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cmp);
+	case OP_NE:
+		return !cpumask_equal(mask, cmp);
+	case OP_BAND:
+		return cpumask_intersects(mask, cmp);
+	default:
+		return 0;
+	}
+}
+
 /* Filter predicate for COMM. */
 static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
@@ -1242,8 +1268,12 @@ static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
 
 int filter_assign_type(const char *type)
 {
-	if (strstr(type, "__data_loc") && strstr(type, "char"))
-		return FILTER_DYN_STRING;
+	if (strstr(type, "__data_loc")) {
+		if (strstr(type, "char"))
+			return FILTER_DYN_STRING;
+		if (strstr(type, "cpumask_t"))
+			return FILTER_CPUMASK;
+		}
 
 	if (strstr(type, "__rel_loc") && strstr(type, "char"))
 		return FILTER_RDYN_STRING;
@@ -1355,6 +1385,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_pchar(pred, event);
 	case FILTER_PRED_FN_CPU:
 		return filter_pred_cpu(pred, event);
+	case FILTER_PRED_FN_CPUMASK:
+		return filter_pred_cpumask(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
 		return filter_pred_function(pred, event);
 	case FILTER_PRED_TEST_VISITED:
@@ -1566,6 +1598,55 @@ static int parse_pred(const char *str, void *data,
 		strncpy(pred->regex->pattern, str + s, len);
 		pred->regex->pattern[len] = 0;
 
+	} else if (!strncmp(str + i, "MASK", 4)) {
+		unsigned int maskstart;
+		char *tmp;
+
+		if (field->filter_type != FILTER_CPUMASK) {
+			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+			goto err_free;
+		}
+
+		/* Skip MASK */
+		i += 4;
+		if (str[i++] != '{') {
+			parse_error(pe, FILT_ERR_MISSING_BRACE_OPEN, pos + i);
+			goto err_free;
+		}
+		maskstart = i;
+
+		/* Walk the cpulist until closing } */
+		for (; str[i] && str[i] != '}'; i++);
+		if (str[i] != '}') {
+			parse_error(pe, FILT_ERR_MISSING_BRACE_CLOSE, pos + i);
+			goto err_free;
+		}
+
+		if (maskstart == i) {
+			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+			goto err_free;
+		}
+
+		/* Copy the cpulist between { and } */
+		tmp = kmalloc(i - maskstart + 1, GFP_KERNEL);
+		strncpy(tmp, str + maskstart, i - maskstart);
+		tmp[i - maskstart] = '\0';
+
+		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
+		if (!pred->mask)
+			goto err_mem;
+
+		/* Now parse it */
+		if (cpulist_parse(tmp, pred->mask)) {
+			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+			goto err_free;
+		}
+
+		/* Move along */
+		i++;
+		if (field->filter_type == FILTER_CPUMASK)
+			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
 		char q = str[i];
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 03/14] tracing/filters: Enable filtering a scalar field by a cpumask
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 01/14] tracing/filters: Dynamically allocate filter_pred.regex Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 02/14] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 19:00   ` Steven Rostedt
  2023-07-05 18:12 ` [RFC PATCH 04/14] tracing/filters: Enable filtering the CPU common " Valentin Schneider
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...

Filtering these currently requires using arithmetic comparison functions,
which can be tedious when dealing with interleaved SMT or NUMA CPU ids.

Allow these to be filtered by a user-provided cpumask, which enables e.g.:

$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & MASK{2,4,6,8-32}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
NOTE: I went with an implicit cpumask conversion of the event field, as
AFAICT predicate_parse() does not support parsing the application of a
function to a field (e.g. 'MASK(target_cpu) & MASK{2,4,6,8-32}')
---
 kernel/trace/trace_events_filter.c | 92 ++++++++++++++++++++++++++----
 1 file changed, 80 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 8af00caa363f7..99e111c237a93 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -46,15 +46,19 @@ static const char * ops[] = { OPS };
 enum filter_pred_fn {
 	FILTER_PRED_FN_NOP,
 	FILTER_PRED_FN_64,
+	FILTER_PRED_FN_64_CPUMASK,
 	FILTER_PRED_FN_S64,
 	FILTER_PRED_FN_U64,
 	FILTER_PRED_FN_32,
+	FILTER_PRED_FN_32_CPUMASK,
 	FILTER_PRED_FN_S32,
 	FILTER_PRED_FN_U32,
 	FILTER_PRED_FN_16,
+	FILTER_PRED_FN_16_CPUMASK,
 	FILTER_PRED_FN_S16,
 	FILTER_PRED_FN_U16,
 	FILTER_PRED_FN_8,
+	FILTER_PRED_FN_8_CPUMASK,
 	FILTER_PRED_FN_S8,
 	FILTER_PRED_FN_U8,
 	FILTER_PRED_FN_COMM,
@@ -641,6 +645,37 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
 	return ERR_PTR(ret);
 }
 
+static inline int
+do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp)
+{
+	switch (op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cmp);
+	case OP_NE:
+		return !cpumask_equal(mask, cmp);
+	case OP_BAND:
+		return cpumask_intersects(mask, cmp);
+	default:
+		return 0;
+	}
+}
+
+/* Optimisation of do_filter_cpumask() for scalar values */
+static inline int
+do_filter_cpumask_scalar(int op, unsigned int cpu, const struct cpumask *mask)
+{
+	switch (op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cpumask_of(cpu));
+	case OP_NE:
+		return !cpumask_equal(mask, cpumask_of(cpu));
+	case OP_BAND:
+		return cpumask_test_cpu(cpu, mask);
+	default:
+		return 0;
+	}
+}
+
 enum pred_cmp_types {
 	PRED_CMP_TYPE_NOP,
 	PRED_CMP_TYPE_LT,
@@ -684,6 +719,18 @@ static int filter_pred_##type(struct filter_pred *pred, void *event)	\
 	}								\
 }
 
+#define DEFINE_CPUMASK_COMPARISON_PRED(size)					\
+static int filter_pred_##size##_cpumask(struct filter_pred *pred, void *event)	\
+{										\
+	u##size *addr = (u##size *)(event + pred->offset);			\
+	unsigned int cpu = *addr;						\
+										\
+	if (cpu >= nr_cpu_ids)							\
+		return 0;							\
+										\
+	return do_filter_cpumask_scalar(pred->op, cpu, pred->mask);		\
+}
+
 #define DEFINE_EQUALITY_PRED(size)					\
 static int filter_pred_##size(struct filter_pred *pred, void *event)	\
 {									\
@@ -705,6 +752,11 @@ DEFINE_COMPARISON_PRED(u16);
 DEFINE_COMPARISON_PRED(s8);
 DEFINE_COMPARISON_PRED(u8);
 
+DEFINE_CPUMASK_COMPARISON_PRED(64);
+DEFINE_CPUMASK_COMPARISON_PRED(32);
+DEFINE_CPUMASK_COMPARISON_PRED(16);
+DEFINE_CPUMASK_COMPARISON_PRED(8);
+
 DEFINE_EQUALITY_PRED(64);
 DEFINE_EQUALITY_PRED(32);
 DEFINE_EQUALITY_PRED(16);
@@ -889,16 +941,7 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 	const struct cpumask *mask = (event + loc);
 	const struct cpumask *cmp = pred->mask;
 
-	switch (pred->op) {
-	case OP_EQ:
-		return cpumask_equal(mask, cmp);
-	case OP_NE:
-		return !cpumask_equal(mask, cmp);
-	case OP_BAND:
-		return cpumask_intersects(mask, cmp);
-	default:
-		return 0;
-	}
+	return do_filter_cpumask(pred->op, mask, cmp);
 }
 
 /* Filter predicate for COMM. */
@@ -1349,24 +1392,32 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 	switch (pred->fn_num) {
 	case FILTER_PRED_FN_64:
 		return filter_pred_64(pred, event);
+	case FILTER_PRED_FN_64_CPUMASK:
+		return filter_pred_64_cpumask(pred, event);
 	case FILTER_PRED_FN_S64:
 		return filter_pred_s64(pred, event);
 	case FILTER_PRED_FN_U64:
 		return filter_pred_u64(pred, event);
 	case FILTER_PRED_FN_32:
 		return filter_pred_32(pred, event);
+	case FILTER_PRED_FN_32_CPUMASK:
+		return filter_pred_32_cpumask(pred, event);
 	case FILTER_PRED_FN_S32:
 		return filter_pred_s32(pred, event);
 	case FILTER_PRED_FN_U32:
 		return filter_pred_u32(pred, event);
 	case FILTER_PRED_FN_16:
 		return filter_pred_16(pred, event);
+	case FILTER_PRED_FN_16_CPUMASK:
+		return filter_pred_16_cpumask(pred, event);
 	case FILTER_PRED_FN_S16:
 		return filter_pred_s16(pred, event);
 	case FILTER_PRED_FN_U16:
 		return filter_pred_u16(pred, event);
 	case FILTER_PRED_FN_8:
 		return filter_pred_8(pred, event);
+	case FILTER_PRED_FN_8_CPUMASK:
+		return filter_pred_8_cpumask(pred, event);
 	case FILTER_PRED_FN_S8:
 		return filter_pred_s8(pred, event);
 	case FILTER_PRED_FN_U8:
@@ -1602,7 +1653,8 @@ static int parse_pred(const char *str, void *data,
 		unsigned int maskstart;
 		char *tmp;
 
-		if (field->filter_type != FILTER_CPUMASK) {
+		if (field->filter_type != FILTER_CPUMASK &&
+		    field->filter_type != FILTER_OTHER) {
 			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
 			goto err_free;
 		}
@@ -1644,8 +1696,24 @@ static int parse_pred(const char *str, void *data,
 
 		/* Move along */
 		i++;
-		if (field->filter_type == FILTER_CPUMASK)
+		if (field->filter_type == FILTER_CPUMASK) {
 			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+		} else {
+			switch (field->size) {
+			case 8:
+				pred->fn_num = FILTER_PRED_FN_64_CPUMASK;
+				break;
+			case 4:
+				pred->fn_num = FILTER_PRED_FN_32_CPUMASK;
+				break;
+			case 2:
+				pred->fn_num = FILTER_PRED_FN_16_CPUMASK;
+				break;
+			case 1:
+				pred->fn_num = FILTER_PRED_FN_8_CPUMASK;
+				break;
+			}
+		}
 
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 04/14] tracing/filters: Enable filtering the CPU common field by a cpumask
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (2 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 03/14] tracing/filters: Enable filtering a scalar field by a cpumask Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 05/14] tracing/filters: Document cpumask filtering Valentin Schneider
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

The tracing_cpumask lets us specify which CPUs are traced in a buffer
instance, but doesn't let us do this on a per-event basis (unless one
creates an instance per event).

A previous commit added filtering scalar fields by a user-given cpumask;
make this work with the CPU common field as well.

This enables doing things like

$ trace-cmd record -e 'sched_switch' -f 'CPU & MASK{12-52}' \
		   -e 'sched_wakeup' -f 'target_cpu & MASK{12-52}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 99e111c237a93..b3d2612d4670a 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -68,6 +68,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_PCHAR_USER,
 	FILTER_PRED_FN_PCHAR,
 	FILTER_PRED_FN_CPU,
+	FILTER_PRED_FN_CPU_CPUMASK,
 	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
@@ -933,6 +934,13 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
 	}
 }
 
+static int filter_pred_cpu_cpumask(struct filter_pred *pred, void *event)
+{
+	int cpu = raw_smp_processor_id();
+
+	return do_filter_cpumask_scalar(pred->op, cpu, pred->mask);
+}
+
 /* Filter predicate for cpumasks. */
 static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 {
@@ -1436,6 +1444,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_pchar(pred, event);
 	case FILTER_PRED_FN_CPU:
 		return filter_pred_cpu(pred, event);
+	case FILTER_PRED_FN_CPU_CPUMASK:
+		return filter_pred_cpu_cpumask(pred, event);
 	case FILTER_PRED_FN_CPUMASK:
 		return filter_pred_cpumask(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
@@ -1654,6 +1664,7 @@ static int parse_pred(const char *str, void *data,
 		char *tmp;
 
 		if (field->filter_type != FILTER_CPUMASK &&
+		    field->filter_type != FILTER_CPU &&
 		    field->filter_type != FILTER_OTHER) {
 			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
 			goto err_free;
@@ -1698,6 +1709,8 @@ static int parse_pred(const char *str, void *data,
 		i++;
 		if (field->filter_type == FILTER_CPUMASK) {
 			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+		} else if (field->filter_type == FILTER_CPU) {
+			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
 		} else {
 			switch (field->size) {
 			case 8:
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 05/14] tracing/filters: Document cpumask filtering
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (3 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 04/14] tracing/filters: Enable filtering the CPU common " Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 06/14] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Cpumask, scalar and CPU fields can now be filtered by a user-provided
cpumask; document the syntax.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 Documentation/trace/events.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f5fcb8e1218f6..e9bc9f23891a0 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -219,6 +219,20 @@ the function "security_prepare_creds" and less than the end of that function.
 The ".function" postfix can only be attached to values of size long, and can only
 be compared with "==" or "!=".
 
+Cpumask fields or scalar fields that encode a CPU number can be filtered using
+a user-provided cpumask in cpulist format. The format is as follows::
+
+  MASK{$cpulist}
+
+Operators available to cpumask filtering are:
+
+& (intersection), ==, !=
+
+For example, this will filter events that have their .target_cpu field present
+in the given cpumask::
+
+  target_cpu & MASK{17-42}
+
 5.2 Setting filters
 -------------------
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 06/14] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (4 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 05/14] tracing/filters: Document cpumask filtering Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 07/14] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0fcf99c914000..fe62232f218f9 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3404,7 +3404,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 07/14] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (5 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 06/14] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 08/14] BROKEN: context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Josh Poimboeuf, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
	Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
	Kirill A. Shutemov, Nadav Amit, Dan Carpenter, Chuang Wang,
	Yang Jihong, Petr Mladek, Jason A. Donenfeld, Song Liu,
	Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

Later commits will depend on having no runtime-mutable text in early entry
code. (ab)use the .noinstr section as a marker of early entry code and warn
about static keys used in it that can be flipped at runtime.
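
As an illustration, the pattern being flagged is a static branch evaluated in
noinstr code whose key can still be flipped at runtime; the key and functions
below are made up for this example:

  /* Flagged: the key can still be flipped at runtime */
  DEFINE_STATIC_KEY_FALSE(example_key);

  /* Accepted: the key lives in .data..ro_after_init, writable only during __init */
  DEFINE_STATIC_KEY_FALSE_RO(example_key);

  noinstr void example_entry_func(void)
  {
          if (static_branch_unlikely(&example_key))
                  example_slowpath();
  }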

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c                   | 20 ++++++++++++++++++++
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/special.h |  2 ++
 tools/objtool/special.c                 |  3 +++
 4 files changed, 26 insertions(+)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index fe62232f218f9..7f8d210ec88c3 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2032,6 +2032,9 @@ static int add_special_section_alts(struct objtool_file *file)
 		alt->next = orig_insn->alts;
 		orig_insn->alts = alt;
 
+		if (special_alt->key_sym)
+			orig_insn->key_sym = special_alt->key_sym;
+
 		list_del(&special_alt->list);
 		free(special_alt);
 	}
@@ -3520,6 +3523,20 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }
 
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0) {
+		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
+			WARN_INSN(insn,
+				  "Non __ro_after_init static key \"%s\" in .noinstr section",
+				  insn->key_sym->name);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3670,6 +3687,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;
 
+		if (insn->key_sym)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {
 
 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index daa46f1f0965a..35dd21f8f41e1 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -77,6 +77,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key_sym;
 };
 
 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 86d4af9c5aa9d..0e61f34fe3a28 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -27,6 +27,8 @@ struct special_alt {
 	struct section *new_sec;
 	unsigned long new_off;
 
+	struct symbol *key_sym;
+
 	unsigned int orig_len, new_len; /* group only */
 };
 
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index baa85c31526b3..830e6abf173a2 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 			return -1;
 		}
 		alt->key_addend = key_reloc->addend;
+
+		reloc_to_sec_off(key_reloc, &sec, &offset);
+		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
 	}
 
 	return 0;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 08/14] BROKEN: context_tracking: Make context_tracking_key __ro_after_init
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (6 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 07/14] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 20:41   ` Peter Zijlstra
  2023-07-05 18:12 ` [RFC PATCH 09/14] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  [...]

The key can only be enabled (and not disabled) in the __init function
ct_cpu_tracker_user(), so mark it as __ro_after_init.

BROKEN: the struct static_key lives in a read-only mapping after
mark_rodata_ro(), which falls apart when the KVM module is loaded after
init and a write to the struct happens due to e.g. guest_state_exit_irqoff()
relying on the static key:

  jump_label_add_module()
  `\
    static_key_set_mod()
    static_key_set_linked()

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/context_tracking.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a09f1c19336ae..4e6cb14272fcb 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -432,7 +432,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
 
-DEFINE_STATIC_KEY_FALSE(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
 EXPORT_SYMBOL_GPL(context_tracking_key);
 
 static noinstr bool context_tracking_recursion_enter(void)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 09/14] x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (7 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 08/14] BROKEN: context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 10/14] x86/sev-es: Make sev_es_enable_key __ro_after_init Valentin Schneider
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section

The key can only be enabled (and not disabled) in the __init function
kvm_guest_init(), so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984daa..319460090a836 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -44,7 +44,7 @@
 #include <asm/svm.h>
 #include <asm/e820/api.h>
 
-DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 10/14] x86/sev-es: Make sev_es_enable_key __ro_after_init
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (8 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 09/14] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure Valentin Schneider
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: exc_nmi+0xbf: Non __ro_after_init static key "sev_es_enable_key" in .noinstr section
  vmlinux.o: warning: objtool: exc_nmi+0x4d: Non __ro_after_init static key "sev_es_enable_key" in .noinstr section
  vmlinux.o: warning: objtool: exc_nmi+0xc: Non __ro_after_init static key "sev_es_enable_key" in .noinstr section

The key can only be enabled (and not disabled) in the __init function
sev_es_init_vc_handling(), so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/sev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index b031244d6d2df..4178e18383232 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -113,7 +113,7 @@ struct ghcb_state {
 };
 
 static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
-DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
+DEFINE_STATIC_KEY_FALSE_RO(sev_es_enable_key);
 
 static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (9 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 10/14] x86/sev-es: Make sev_es_enable_key __ro_after_init Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 22:23   ` Frederic Weisbecker
  2023-07-05 22:39   ` Peter Zijlstra
  2023-07-05 18:12 ` [RFC PATCH 12/14] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
	Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
	Kirill A. Shutemov, Nadav Amit, Dan Carpenter, Chuang Wang,
	Yang Jihong, Petr Mladek, Jason A. Donenfeld, Song Liu,
	Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, such as done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of this.

Add a field in struct context_tracking used as a bitmask to track deferred
callbacks to execute upon kernel entry. The LSB of that field is used as a
flag to prevent queueing deferred work when the CPU leaves userspace.

Later commits introduce the bit:callback mappings.

Note: A previous approach by PeterZ [1] used an extra bit in
context_tracking.state to flag the presence of deferred callbacks to
execute, and the actual callbacks were stored in a separate atomic
variable.

This meant that the atomic read of context_tracking.state was sufficient to
determine whether there are any deferred callbacks to execute.
Unfortunately, it presents a race window. Consider the work setting
function as:

  preempt_disable();
  seq = atomic_read(&ct->seq);
  if (__context_tracking_seq_in_user(seq)) {
	  /* ctrl-dep */
	  atomic_or(work, &ct->work);
	  ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
  }
  preempt_enable();

  return ret;

Then the following can happen:

  CPUx                                             CPUy
						     CT_SEQ_WORK \in context_tracking.state
    atomic_or(WORK_N, &ct->work);
						      ct_kernel_enter()
							ct_state_inc();
    atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);

The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
sent. Unfortunately, the work bit would remain set, and it can't be sanely
cleared in case another CPU set it concurrently - this would ultimately
lead to a double execution of the callback, one as a deferred callback and
one in the IPI. As not all IPI callbacks are idempotent, this is
undesirable.
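
For illustration, a housekeeping-side user of this infrastructure ends up
looking roughly like the below. This mirrors what a later patch in this series
does for text patching; CONTEXT_WORK_FOO and do_foo() are placeholder names:

  /* Ran on each targeted CPU to decide whether the IPI can be skipped */
  static bool do_foo_defer_cond(int cpu, void *info)
  {
	  /* ct_set_cpu_work() == true means the work was deferred: no IPI */
	  return !ct_set_cpu_work(cpu, CONTEXT_WORK_FOO);
  }

  static void foo_on_all_cpus(void)
  {
	  /* Only CPUs that refused the deferred work are sent an IPI */
	  on_each_cpu_cond(do_foo_defer_cond, do_foo, NULL, 1);
  }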

Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 +++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 14 +++++
 include/linux/context_tracking.h             |  1 +
 include/linux/context_tracking_state.h       |  1 +
 include/linux/context_tracking_work.h        | 28 +++++++++
 kernel/context_tracking.c                    | 63 ++++++++++++++++++++
 kernel/time/Kconfig                          |  5 ++
 8 files changed, 122 insertions(+)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cada..e453e9fb864be 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -851,6 +851,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee4..490c773105c0c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -197,6 +197,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5bc29e6b2ed38
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+	switch (work) {
+	case CONTEXT_WORK_n:
+		// Do work...
+		break;
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d3cbb6c16babf..80d571ddfc3a4 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index fdd537ea513ff..5af06ed26f858 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -36,6 +36,7 @@ struct context_tracking {
 	int recursion;
 #endif
 #ifdef CONFIG_CONTEXT_TRACKING
+	atomic_t work;
 	atomic_t state;
 #endif
 #ifdef CONFIG_CONTEXT_TRACKING_IDLE
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..0b06c3dab58c7
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CONTEXT_WORK_DISABLED_OFFSET,
+	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CONTEXT_WORK_DISABLED = BIT(CONTEXT_WORK_DISABLED_OFFSET),
+	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 4e6cb14272fcb..b6aee3d0c0528 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -32,6 +32,9 @@ DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 	.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
 #endif
 	.state = ATOMIC_INIT(RCU_DYNTICKS_IDX),
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+	.work = ATOMIC_INIT(CONTEXT_WORK_DISABLED),
+#endif
 };
 EXPORT_SYMBOL_GPL(context_tracking);
 
@@ -72,6 +75,57 @@ static __always_inline void rcu_dynticks_task_trace_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned int ct_work_fetch(struct context_tracking *ct)
+{
+	return arch_atomic_fetch_or(CONTEXT_WORK_DISABLED, &ct->work);
+}
+static __always_inline void ct_work_clear(struct context_tracking *ct)
+{
+	arch_atomic_set(&ct->work, 0);
+}
+
+static noinstr void ct_work_flush(unsigned long work)
+{
+	int bit;
+
+	/* DISABLED is never set while there are deferred works */
+	WARN_ON_ONCE(work & CONTEXT_WORK_DISABLED);
+
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &work, CONTEXT_WORK_MAX)
+		arch_context_tracking_work(BIT(bit));
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old_work;
+	bool ret = false;
+
+	preempt_disable();
+
+	old_work = atomic_read(&ct->work);
+	/*
+	 * Try setting the work until either
+	 * - the target CPU no longer accepts any more deferred work
+	 * - the work has been set
+	 */
+	while (!(old_work & CONTEXT_WORK_DISABLED) && !ret)
+		ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline unsigned int ct_work_fetch(struct context_tracking *ct) { return 0; }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -89,6 +143,10 @@ static noinstr void ct_kernel_exit_state(int offset)
 	 */
 	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
 	seq = ct_state_inc(offset);
+
+	/* Let this CPU allow deferred callbacks again */
+	ct_work_clear(this_cpu_ptr(&context_tracking));
+
 	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
 }
@@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
 	int seq;
+	unsigned int work;
 
+	work = ct_work_fetch(ct);
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
 	 * critical section.
 	 */
 	seq = ct_state_inc(offset);
+	if (work)
+		ct_work_flush(work);
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index bae8f11070bef..fdb266f2d774b 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 12/14] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (10 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 13/14] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Peter Zijlstra, Nicolas Saenz Julienne, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
	Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
	Kirill A. Shutemov, Nadav Amit, Dan Carpenter, Chuang Wang,
	Yang Jihong, Petr Mladek, Jason A. Donenfeld, Song Liu,
	Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys. Any static key used in early entry code
should be only forever-enabled at boot time, IOW __ro_after_init (pretty
much like alternatives). Objtool is now able to point at static keys that
don't respect this, and all static keys used in early entry code have now
been verified to behave this way.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.
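
As a concrete illustration of the static key requirement above, the pattern an
earlier patch in this series applies to sev_es_enable_key is the following (the
second key is a made-up counter-example, not something from this series):

  /* Patched at most once, during init: safe to use in noinstr entry code */
  DEFINE_STATIC_KEY_FALSE_RO(sev_es_enable_key);

  /* May be flipped at runtime: must not be used on the entry path leading
   * up to ct_work_flush() */
  DEFINE_STATIC_KEY_FALSE(some_runtime_toggled_key);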

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 +++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
 arch/x86/kernel/kprobes/core.c               |  4 ++--
 arch/x86/kernel/kprobes/opt.c                |  4 ++--
 arch/x86/kernel/module.c                     |  2 +-
 include/linux/context_tracking_work.h        |  4 ++--
 7 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(int work)
 {
 	switch (work) {
-	case CONTEXT_WORK_n:
-		// Do work...
+	case CONTEXT_WORK_SYNC:
+		sync_core();
 		break;
 	}
 }
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 29832c338cdc5..b6939e965e69d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d932..7770aef2204f3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -1765,9 +1766,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+	__text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -1967,7 +1983,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
 	}
 
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2029,7 +2045,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		text_poke_sync();
+		text_poke_sync_deferrable();
 	}
 
 	/*
@@ -2050,7 +2066,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	}
 
 	if (do_sync)
-		text_poke_sync();
+		text_poke_sync_deferrable();
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6c..a38c914753397 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 57b0037d0a996..88451a744ceda 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344b..8b4542dc51b6d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		text_poke_sync();
+		text_poke_sync_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 0b06c3dab58c7..b0c7463048b60 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,13 +6,13 @@
 
 enum {
 	CONTEXT_WORK_DISABLED_OFFSET,
-	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_SYNC_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_DISABLED = BIT(CONTEXT_WORK_DISABLED_OFFSET),
-	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 13/14] context_tracking,x86: Add infrastructure to defer kernel TLBI
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (11 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 12/14] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:12 ` [RFC PATCH 14/14] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately, and can be deferred until the next user->kernel transition.

Rather than make __flush_tlb_all() noinstr, add a minimal noinstr
variant that doesn't try to leverage INVPCID.

FIXME: not fully noinstr compliant
XXX: same issue as with ins patching, when do we access data that should be
invalidated?

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 ++++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 17 +++++++++++++++++
 include/linux/context_tracking_work.h        |  2 ++
 4 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 75bfaa4210303..9064aa0027d05 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -15,6 +15,7 @@
 #include <asm/pgtable.h>
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL	-1UL
 #define TLB_GENERATION_INVALID	0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf27480af..631df9189ded4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1237,6 +1237,23 @@ void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCIDE
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * TODO: paravirt
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		flush_tlb_local();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index b0c7463048b60..c084a60d2fec5 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -7,12 +7,14 @@
 enum {
 	CONTEXT_WORK_DISABLED_OFFSET,
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_DISABLED = BIT(CONTEXT_WORK_DISABLED_OFFSET),
 	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI     = BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 14/14] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (12 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 13/14] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
@ 2023-07-05 18:12 ` Valentin Schneider
  2023-07-05 18:48 ` [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Nadav Amit
  2023-07-05 19:03 ` Steven Rostedt
  15 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-05 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf, x86
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

vunmap()s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

This does require a guarantee that nothing in the vmalloc range can be
accessed in early entry code. vmalloc'd kernel stacks (VMAP_STACK) are
AFAICT a safe exception, as a task running in userspace needs to enter
kernelspace to execute do_exit() before its stack can be vfree'd.

XXX: Validation that nothing in the vmalloc range is accessed in .noinstr or
  somesuch?

Blindly deferring any and all flush of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching an
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.
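
To put a rough number on that threshold (assuming the current x86 default of
tlb_single_page_flush_ceiling == 33 and 4KiB pages), the deferrable variant
introduced below effectively does:

  if (end == TLB_FLUSH_ALL ||
      (end - start) > (33 << PAGE_SHIFT))                     /* > 132KiB    */
	on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1);      /* full flush  */
  else
	on_each_cpu_cond(cond, do_kernel_range_flush, info, 1); /* INVLPG loop */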

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 23 ++++++++++++++++++++---
 mm/vmalloc.c                    | 15 ++++++++++-----
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 9064aa0027d05..2b3605c09649c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -255,6 +255,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 631df9189ded4..bb18b35e61b4a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
 #include <linux/debugfs.h>
 #include <linux/sched/smt.h>
 #include <linux/task_work.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1045,6 +1046,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1061,12 +1067,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1074,13 +1081,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d13d71687d73..10f99224e12e0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -439,6 +439,11 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifndef flush_tlb_kernel_range_deferrable
+#define flush_tlb_kernel_range_deferrable(start, end) \
+	flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -452,7 +457,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -1747,7 +1752,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		list_last_entry(&local_purge_list,
 			struct vmap_area, list)->va_end);
 
-	flush_tlb_kernel_range(start, end);
+	flush_tlb_kernel_range_deferrable(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
@@ -1849,7 +1854,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2207,7 +2212,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2260,7 +2265,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	mutex_lock(&vmap_purge_lock);
 	purge_fragmented_blocks_allcpus();
 	if (!__purge_vmap_area_lazy(start, end) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (13 preceding siblings ...)
  2023-07-05 18:12 ` [RFC PATCH 14/14] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
@ 2023-07-05 18:48 ` Nadav Amit
  2023-07-06 11:29   ` Valentin Schneider
  2023-07-05 19:03 ` Steven Rostedt
  15 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2023-07-05 18:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
	Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
	Kirill A. Shutemov, Dan Carpenter, Chuang Wang, Yang Jihong,
	Petr Mladek, Jason A. Donenfeld, Song Liu, Julian Pidancet,
	Tom Lendacky, Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky



> On Jul 5, 2023, at 11:12 AM, Valentin Schneider <vschneid@redhat.com> wrote:
> 
> Deferral approach
> =================
> 
> Storing each and every callback, like a secondary call_single_queue turned out
> to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
> userspace for as long as possible - no signal of any form would be sent when
> deferring an IPI. This means that any form of queuing for deferred callbacks
> would end up as a convoluted memory leak.
> 
> Deferred IPIs must thus be coalesced, which this series achieves by assigning
> IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
> kernel entry.

I have some experience with a similar optimization. Overall, it can make
sense and, as you show, it can reduce the number of interrupts.

The main problem of such an approach might be in cases where a process
frequently enters and exits the kernel between deferred-IPIs, or even worse -
the IPI is sent while the remote CPU is inside the kernel. In such cases, you
pay the extra cost of synchronization and cache traffic, and might not even
get the benefit of reducing the number of IPIs.

In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
mechanism introduces while tracking whether a process is running or not. But
lazy-TLB would change is_lazy much less frequently than context tracking,
which means that deferring the IPIs as done in this patch-set has a
greater potential to hurt performance than lazy-TLB.

tl;dr - it would be beneficial to show some performance numbers for both a
“good” case where a process spends most of its time in userspace, and a “bad”
one where a process enters and exits the kernel very frequently. Reducing
the number of IPIs is good, but I don’t think it is a goal in its own right.

[ BTW: I did not go over the patches in detail. Obviously, there are
  various delicate points that need to be checked, as avoiding the
  deferring of IPIs if page-tables are freed. ]


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 02/14] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-05 18:12 ` [RFC PATCH 02/14] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
@ 2023-07-05 18:54   ` Steven Rostedt
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2023-07-05 18:54 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Peter Zijlstra, Frederic Weisbecker, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed,  5 Jul 2023 19:12:44 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> +		/* Copy the cpulist between { and } */
> +		tmp = kmalloc(i - maskstart + 1, GFP_KERNEL);
> +		strncpy(tmp, str + maskstart, i - maskstart);
> +		tmp[i - maskstart] = '\0';
> +

You can replace the above with:

	tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
	strscpy(tmp, str + maskstart, (i - maskstart) + 1);

As strscpy() adds the nul terminating character, not to mention that
strncpy() is deprecated (see Documentation/process/deprecated.rst).

-- Steve

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 03/14] tracing/filters: Enable filtering a scalar field by a cpumask
  2023-07-05 18:12 ` [RFC PATCH 03/14] tracing/filters: Enable filtering a scalar field by a cpumask Valentin Schneider
@ 2023-07-05 19:00   ` Steven Rostedt
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2023-07-05 19:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Peter Zijlstra, Frederic Weisbecker, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed,  5 Jul 2023 19:12:45 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> +/* Optimisation of do_filter_cpumask() for scalar values */
> +static inline int
> +do_filter_cpumask_scalar(int op, unsigned int cpu, const struct cpumask *mask)
> +{
> +	switch (op) {
> +	case OP_EQ:
> +		return cpumask_equal(mask, cpumask_of(cpu));
> +	case OP_NE:
> +		return !cpumask_equal(mask, cpumask_of(cpu));

The above two can be optimized much more if the cpu is a scalar. If mask is
anything but a single CPU, then EQ will always be false, and NE will always
be true. If the mask contains a single CPU, then we should convert the mask
into the scalar CPU and just do:

	case OP_EQ:
		return mask_cpu == cpu;
	case OP_NE:
		return mask_cpu != cpu;

-- Steve

> +	case OP_BAND:
> +		return cpumask_test_cpu(cpu, mask);
> +	default:
> +		return 0;
> +	}
> +}
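
One possible way to realize this, sketched for illustration (this is not code
from the thread; doing the single-CPU detection at filter-parse time, as
suggested above, would hoist the cpumask_weight()/cpumask_first() calls out of
the filtering hot path):

  static inline int
  do_filter_cpumask_scalar(int op, unsigned int cpu, const struct cpumask *mask)
  {
	switch (op) {
	case OP_EQ:
	case OP_NE:
		/* A mask holding zero or several CPUs can never equal one CPU */
		if (cpumask_weight(mask) != 1)
			return op == OP_NE;
		if (op == OP_EQ)
			return cpumask_first(mask) == cpu;
		return cpumask_first(mask) != cpu;
	case OP_BAND:
		return cpumask_test_cpu(cpu, mask);
	default:
		return 0;
	}
  }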

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (14 preceding siblings ...)
  2023-07-05 18:48 ` [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Nadav Amit
@ 2023-07-05 19:03 ` Steven Rostedt
  2023-07-06 11:30   ` Valentin Schneider
  15 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2023-07-05 19:03 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Peter Zijlstra, Frederic Weisbecker, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed,  5 Jul 2023 19:12:42 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> o Patches 1-5 have been submitted previously and are included for the sake of
>   testing

I should have commented on the previous set, but I did my review on this set ;-)

Anyway, I'm all for the patches. Care to send a new version covering my input?

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 08/14] BROKEN: context_tracking: Make context_tracking_key __ro_after_init
  2023-07-05 18:12 ` [RFC PATCH 08/14] BROKEN: context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
@ 2023-07-05 20:41   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2023-07-05 20:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Frederic Weisbecker, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed, Jul 05, 2023 at 07:12:50PM +0100, Valentin Schneider wrote:

> BROKEN: the struct static_key lives in a read-only mapping after
> mark_rodata_ro(), which falls apart when the KVM module is loaded after
> init and a write to the struct happens due to e.g. guest_state_exit_irqoff()
> relying on the static key:

Right.. so whoever added the whole ro_after_init jump_label support did
a very poor job of it.

That said; I think it is fixable. Since the key cannot be changed, we
don't actually need to track the entries list and can thus avoid the key
update.

Something like the completely untested below...

---
Subject: jump_label: Seal __ro_after_init keys

When a static_key is marked ro_after_init, its state will never change
(after init), therefore jump_label_update() will never need to iterate
the entries, and thus module load won't actually need to track this --
avoiding the static_key::next write.

Therefore, mark these keys such that jump_label_add_module() might
recognise them and avoid the modification.

Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
to denote such keys.

*UNTESTED*

NOT-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/asm-generic/sections.h |  5 +++++
 include/linux/jump_label.h     |  1 +
 init/main.c                    |  1 +
 kernel/jump_label.c            | 44 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 51 insertions(+)

diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index db13bb620f52..c768de6f19a9 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -180,6 +180,11 @@ static inline bool is_kernel_rodata(unsigned long addr)
 	       addr < (unsigned long)__end_rodata;
 }
 
+static inline bool is_kernel_ro_after_init(unsigned long addr)
+{
+	return addr >= (unsigned long)__start_ro_after_init &&
+	       addr < (unsigned long)__end_ro_after_init;
+}
 /**
  * is_kernel_inittext - checks if the pointer address is located in the
  *                      .init.text section
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f0a949b7c973..88ef9e776af8 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -216,6 +216,7 @@ extern struct jump_entry __start___jump_table[];
 extern struct jump_entry __stop___jump_table[];
 
 extern void jump_label_init(void);
+extern void jump_label_ro(void);
 extern void jump_label_lock(void);
 extern void jump_label_unlock(void);
 extern void arch_jump_label_transform(struct jump_entry *entry,
diff --git a/init/main.c b/init/main.c
index ad920fac325c..cb5304ca18f4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1403,6 +1403,7 @@ static void mark_readonly(void)
 		 * insecure pages which are W+X.
 		 */
 		rcu_barrier();
+		jump_label_ro();
 		mark_rodata_ro();
 		rodata_test();
 	} else
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index d9c822bbffb8..40fb72d79d7a 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -530,6 +530,46 @@ void __init jump_label_init(void)
 	cpus_read_unlock();
 }
 
+static inline bool static_key_sealed(struct static_key *key)
+{
+	return (key->type & JUMP_TYPE_LINKED) && !(key->type & ~JUMP_TYPE_MASK);
+}
+
+static inline void static_key_seal(struct static_key *key)
+{
+	unsigned long type = key->type & JUMP_TYPE_TRUE;
+	key->type = JUMP_TYPE_LINKED | type;
+}
+
+void jump_label_ro(void)
+{
+	struct jump_entry *iter_start = __start___jump_table;
+	struct jump_entry *iter_stop = __stop___jump_table;
+	struct static_key *key = NULL;
+	struct jump_entry *iter;
+
+	if (WARN_ON_ONCE(!static_key_initialized))
+		return;
+
+	cpus_read_lock();
+	jump_label_lock();
+
+	for (iter = iter_start; iter < iter_stop; iter++) {
+		struct static_key *iterk = jump_entry_key(iter);
+
+		if (!is_kernel_ro_after_init(iterk))
+			continue;
+
+		if (static_key_sealed(iterk))
+			continue;
+
+		static_key_seal(iterk);
+	}
+
+	jump_label_unlock();
+	cpus_read_unlock();
+}
+
 #ifdef CONFIG_MODULES
 
 enum jump_label_type jump_label_init_type(struct jump_entry *entry)
@@ -650,6 +690,9 @@ static int jump_label_add_module(struct module *mod)
 			static_key_set_entries(key, iter);
 			continue;
 		}
+		if (static_key_sealed(key))
+			goto do_poke;
+
 		jlm = kzalloc(sizeof(struct static_key_mod), GFP_KERNEL);
 		if (!jlm)
 			return -ENOMEM;
@@ -675,6 +718,7 @@ static int jump_label_add_module(struct module *mod)
 		static_key_set_linked(key);
 
 		/* Only update if we've changed from our initial state */
+do_poke:
 		if (jump_label_type(iter) != jump_label_init_type(iter))
 			__jump_label_update(key, iter, iter_stop, true);
 	}

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 18:12 ` [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2023-07-05 22:23   ` Frederic Weisbecker
  2023-07-05 22:41     ` Peter Zijlstra
                       ` (3 more replies)
  2023-07-05 22:39   ` Peter Zijlstra
  1 sibling, 4 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2023-07-05 22:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old_work;
> +	bool ret = false;
> +
> +	preempt_disable();
> +
> +	old_work = atomic_read(&ct->work);
> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU no longer accepts any more deferred work
> +	 * - the work has been set
> +	 */
> +	while (!(old_work & CONTEXT_WORK_DISABLED) && !ret)

Isn't there a race here where you may have missed a CPU that just entered
userspace, and you end up disturbing it anyway?

> +		ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> +
> +	preempt_enable();
> +	return ret;
> +}
[...]
> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
>   */
>  static noinstr void ct_kernel_enter_state(int offset)
>  {
> +	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
>  	int seq;
> +	unsigned int work;
>  
> +	work = ct_work_fetch(ct);

So this adds another fully ordered operation on user <-> kernel transition.
How many such IPIs can we expect?

If this is just about a dozen, can we stuff them in the state like in the
following? We can potentially add more of them: especially on 64 bits, we could
afford 30 different works, as this only shrinks the RCU extended quiescent
state counter space. The worst case that can happen is that RCU misses 65535
idle/user <-> kernel transitions and delays a grace period...
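
In other words, the intended ->state layout would then roughly be (assuming a
32-bit atomic_t):

  bits  0-1  : context state (CONTEXT_KERNEL / CONTEXT_IDLE / CONTEXT_USER)
  bits  2-15 : deferred work bits (CONTEXT_WORK_MASK)
  bits 16-31 : RCU extended quiescent state counter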

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..e453e9fb864b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -851,6 +851,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..490c773105c0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -197,6 +197,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 000000000000..5bc29e6b2ed3
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+	switch (work) {
+	case CONTEXT_WORK_n:
+		// Do work...
+		break;
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d3cbb6c16bab..333b26d7cbe5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
@@ -75,7 +76,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 static __always_inline bool context_tracking_guest_enter(void)
 {
 	if (context_tracking_enabled())
-		__ct_user_enter(CONTEXT_GUEST);
+		__ct_user_enter(CONTEXT_USER);
 
 	return context_tracking_enabled_this_cpu();
 }
@@ -83,7 +84,7 @@ static __always_inline bool context_tracking_guest_enter(void)
 static __always_inline void context_tracking_guest_exit(void)
 {
 	if (context_tracking_enabled())
-		__ct_user_exit(CONTEXT_GUEST);
+		__ct_user_exit(CONTEXT_USER);
 }
 
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
@@ -122,6 +123,26 @@ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
 	return !(arch_atomic_read(this_cpu_ptr(&context_tracking.state)) & RCU_DYNTICKS_IDX);
 }
 
+/*
+ * Increment the current CPU's context_tracking structure's ->state field
+ * with ordering and clear the work bits.  Return the new value.
+ */
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long new, old, state;
+
+	state = arch_atomic_read(&ct->state);
+	do {
+		old = state;
+		new = old & ~CONTEXT_WORK_MASK;
+		new += incby;
+		state = arch_atomic_cmpxchg(&ct->state, old, new);
+	} while (old != state);
+
+	return state;
+}
+
 /*
  * Increment the current CPU's context_tracking structure's ->state field
  * with ordering.  Return the new value.
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index fdd537ea513f..ec3d172601c5 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -10,14 +10,19 @@
 #define DYNTICK_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
 
 enum ctx_state {
+	/* Following are values */
 	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
 	CONTEXT_KERNEL		= 0,
 	CONTEXT_IDLE		= 1,
 	CONTEXT_USER		= 2,
-	CONTEXT_GUEST		= 3,
-	CONTEXT_MAX		= 4,
+	/* Following are bit numbers */
+	CONTEXT_WORK		= 2,
+	CONTEXT_MAX		= 16,
 };
 
+#define CONTEXT_MASK (BIT(CONTEXT_WORK) - 1)
+#define CONTEXT_WORK_MASK ((BIT(CONTEXT_MAX) - 1) & ~(BIT(CONTEXT_WORK) - 1))
+
 /* Even value for idle, else odd. */
 #define RCU_DYNTICKS_IDX CONTEXT_MAX
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 000000000000..fb74db8876dd
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a09f1c19336a..732042b9a7b7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,58 @@ static __always_inline void rcu_dynticks_task_trace_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+	unsigned int bit;
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &seq, CONTEXT_MAX)
+		arch_context_tracking_work(BIT(bit) >> CONTEXT_WORK);
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old, new, state;
+	bool ret = false;
+
+	preempt_disable();
+
+	work <<= CONTEXT_WORK;
+	state = atomic_read(&ct->state);
+	/*
+	 * Try setting the work until either
+	 * - the target CPU is in the kernel
+	 * - the work has been set
+	 */
+	for (;;) {
+		/* Only set if running in user/guest */
+		old = state;
+		old &= ~CONTEXT_MASK;
+		old |= CONTEXT_USER;
+
+		new = old | work;
+
+		state = atomic_cmpxchg(&ct->state, old, new);
+		if (state & work) {
+			ret = true;
+			break;
+		}
+
+		if ((state & CONTEXT_MASK) != CONTEXT_USER)
+			break;
+	}
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long seq) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -100,14 +152,18 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
-	int seq;
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long seq;
 
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
 	 * and we also must force ordering with the next RCU read-side
 	 * critical section.
 	 */
-	seq = ct_state_inc(offset);
+	seq = ct_state_inc_clear_work(offset);
+	if (seq & CONTEXT_WORK_MASK)
+		ct_work_flush(seq & CONTEXT_WORK_MASK);
+
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index bae8f11070be..fdb266f2d774 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.40.1



* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 18:12 ` [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure Valentin Schneider
  2023-07-05 22:23   ` Frederic Weisbecker
@ 2023-07-05 22:39   ` Peter Zijlstra
  2023-07-06 11:31     ` Valentin Schneider
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2023-07-05 22:39 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:

> Note: A previous approach by PeterZ [1] used an extra bit in
> context_tracking.state to flag the presence of deferred callbacks to
> execute, and the actual callbacks were stored in a separate atomic
> variable.
> 
> This meant that the atomic read of context_tracking.state was sufficient to
> determine whether there are any deferred callbacks to execute.
> Unfortunately, it presents a race window. Consider the work setting
> function as:
> 
>   preempt_disable();
>   seq = atomic_read(&ct->seq);
>   if (__context_tracking_seq_in_user(seq)) {
> 	  /* ctrl-dep */
> 	  atomic_or(work, &ct->work);
> 	  ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>   }
>   preempt_enable();
> 
>   return ret;
> 
> Then the following can happen:
> 
>   CPUx                                             CPUy
> 						     CT_SEQ_WORK \in context_tracking.state
>     atomic_or(WORK_N, &ct->work);
> 						      ct_kernel_enter()
> 							ct_state_inc();
>     atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
> 
> The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
> sent. Unfortunately, the work bit would remain set, and it can't be sanely
> cleared in case another CPU set it concurrently - this would ultimately
> lead to a double execution of the callback, one as a deferred callback and
> one in the IPI. As not all IPI callbacks are idempotent, this is
> undesirable.

So adding another atomic is arguably worse.

The thing is, if the NOHZ_FULL CPU is actually doing context transitions
(SYSCALLs etc..) then everything is fundamentally racy, there is no
winning that game, we could find the remote CPU is in-kernel, send an
IPI, the remote CPU does return-to-user and receives the IPI.

And then the USER is upset... because he got an IPI.
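
Schematically, that worst case looks like this (illustrative timeline, not
taken from the patch):

  CPUx (housekeeping)                      CPUy (nohz_full)
                                             // in the kernel (e.g. handling a syscall)
    finds CPUy in-kernel, cannot defer
    sends IPI
                                             return-to-user
                                             IPI arrives -> userspace disturbed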

The whole NOHZ_FULL thing really only works if userspace does not do
SYSCALLs.

But the sad sad state of affairs is that some people think it is
acceptable to do SYSCALLs while NOHZ_FULL and cry about how slow stuff
is.




* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 22:23   ` Frederic Weisbecker
@ 2023-07-05 22:41     ` Peter Zijlstra
  2023-07-06  9:53       ` Frederic Weisbecker
  2023-07-06 10:01     ` Frederic Weisbecker
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2023-07-05 22:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, Nicolas Saenz Julienne, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> If this is just about a dozen, can we stuff them in the state like in the
> following? We can potentially add more of them especially on 64 bits we could
> afford 30 different works, this is just shrinking the RCU extended quiescent
> state counter space. Worst case that can happen is that RCU misses 65535
> idle/user <-> kernel transitions and delays a grace period...

We can make all this a 64bit only feature and use atomic_long_t :-)


* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 22:41     ` Peter Zijlstra
@ 2023-07-06  9:53       ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2023-07-06  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, Nicolas Saenz Julienne, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 06, 2023 at 12:41:04AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> > If this is just about a dozen, can we stuff them in the state like in the
> > following? We can potentially add more of them especially on 64 bits we could
> > afford 30 different works, this is just shrinking the RCU extended quiescent
> > state counter space. Worst case that can happen is that RCU misses 65535
> > idle/user <-> kernel transitions and delays a grace period...
> 
> We can make all this a 64bit only feature and use atomic_long_t :-)

Works for me :)
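
For illustration, a 64-bit-only layout along those lines might look like the
below (a rough sketch under that assumption, not code from the series): the
context value stays in the low two bits, up to 30 deferred-work bits sit
above it, and the RCU extended quiescent state counter takes the upper half
of an atomic_long_t state.

	enum ctx_state {
		/* Values */
		CONTEXT_KERNEL	= 0,
		CONTEXT_IDLE	= 1,
		CONTEXT_USER	= 2,
		/* Bit numbers */
		CONTEXT_WORK	= 2,	/* deferred work bits: 2..31 */
		CONTEXT_MAX	= 32,	/* RCU counter: bits 32..63 */
	};

	#define RCU_DYNTICKS_IDX	BIT(CONTEXT_MAX)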


* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 22:23   ` Frederic Weisbecker
  2023-07-05 22:41     ` Peter Zijlstra
@ 2023-07-06 10:01     ` Frederic Weisbecker
  2023-07-06 11:30     ` Valentin Schneider
  2023-07-06 11:55     ` Frederic Weisbecker
  3 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2023-07-06 10:01 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index fdd537ea513f..ec3d172601c5 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -10,14 +10,19 @@
>  #define DYNTICK_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
>  
>  enum ctx_state {
> +	/* Following are values */
>  	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
>  	CONTEXT_KERNEL		= 0,
>  	CONTEXT_IDLE		= 1,
>  	CONTEXT_USER		= 2,
> -	CONTEXT_GUEST		= 3,
> -	CONTEXT_MAX		= 4,
> +	/* Following are bit numbers */
> +	CONTEXT_WORK		= 2,
> +	CONTEXT_MAX		= 16,
>  };
>  
> +#define CONTEXT_MASK (BIT(CONTEXT_WORK) - 1)
> +#define CONTEXT_WORK_MASK ((BIT(CONTEXT_MAX) - 1) & ~(BIT(CONTEXT_WORK) - 1))
> +
>  /* Even value for idle, else odd. */
>  #define RCU_DYNTICKS_IDX CONTEXT_MAX

And that should be:

#define RCU_DYNTICKS_IDX BIT(CONTEXT_MAX)

Did I mention it's not even build tested? :o)
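
For reference, the definitions with that fix folded in (same not-even-build-tested
caveat): the low two bits hold the context value, bits 2-15 hold the deferred
work, and RCU's part of the counter starts at bit 16.

	enum ctx_state {
		/* Following are values */
		CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
		CONTEXT_KERNEL		= 0,
		CONTEXT_IDLE		= 1,
		CONTEXT_USER		= 2,
		/* Following are bit numbers */
		CONTEXT_WORK		= 2,
		CONTEXT_MAX		= 16,
	};

	#define CONTEXT_MASK		(BIT(CONTEXT_WORK) - 1)
	#define CONTEXT_WORK_MASK	((BIT(CONTEXT_MAX) - 1) & ~(BIT(CONTEXT_WORK) - 1))

	/* Even value for idle, else odd. */
	#define RCU_DYNTICKS_IDX	BIT(CONTEXT_MAX)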


* Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2023-07-05 18:48 ` [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Nadav Amit
@ 2023-07-06 11:29   ` Valentin Schneider
  0 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-06 11:29 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
	Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
	Kirill A. Shutemov, Dan Carpenter, Chuang Wang, Yang Jihong,
	Petr Mladek, Jason A. Donenfeld, Song Liu, Julian Pidancet,
	Tom Lendacky, Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 05/07/23 18:48, Nadav Amit wrote:
>> On Jul 5, 2023, at 11:12 AM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> Deferral approach
>> =================
>>
>> Storing each and every callback, like a secondary call_single_queue turned out
>> to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
>> userspace for as long as possible - no signal of any form would be sent when
>> deferring an IPI. This means that any form of queuing for deferred callbacks
>> would end up as a convoluted memory leak.
>>
>> Deferred IPIs must thus be coalesced, which this series achieves by assigning
>> IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
>> kernel entry.
>
> I have some experience with a similar optimization. Overall, it can make
> sense and as you show, it can reduce the number of interrupts.
>
> The main problem of such an approach might be in cases where a process
> frequently enters and exits the kernel between deferred-IPIs, or even worse -
> the IPI is sent while the remote CPU is inside the kernel. In such cases, you
> pay the extra cost of synchronization and cache traffic, and might not even
> get the benefit of reducing the number of IPIs.
>
> In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
> mechanism introduces while tracking whether a process is running or not. But
> lazy-TLB would change is_lazy much less frequently than context tracking,
> which means that deferring the IPIs as done in this patch-set has a
> greater potential to hurt performance than lazy-TLB.
>
> tl;dr - it would be beneficial to show some performance numbers for both a
> “good” case where a process spends most of the time in userspace, and a “bad”
> one where a process enters and exits the kernel very frequently. Reducing
> the number of IPIs is good but I don’t think it is a goal in its own right.
>

There already is a significant overhead incurred on kernel entry for
nohz_full CPUs due to all the context_tracking faff; now I *am* making it
worse with that extra atomic, but I get the feeling it's not going to stay
:D

nohz_full CPUs that do context transitions very frequently are
unfortunately in the realm of "you shouldn't do that". Due to what's out
there I have to care about *occasional* transitions, but some folks
consider even that to be broken usage, so I don't believe getting numbers
for that is all that relevant.

> [ BTW: I did not go over the patches in detail. Obviously, there are
>   various delicate points that need to be checked, such as avoiding the
>   deferral of IPIs when page tables are freed. ]



* Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition
  2023-07-05 19:03 ` Steven Rostedt
@ 2023-07-06 11:30   ` Valentin Schneider
  0 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-06 11:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Peter Zijlstra, Frederic Weisbecker, Paul E. McKenney,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 05/07/23 15:03, Steven Rostedt wrote:
> On Wed,  5 Jul 2023 19:12:42 +0100
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>> o Patches 1-5 have been submitted previously and are included for the sake of
>>   testing
>
> I should have commented on the previous set, but I did my review on this set ;-)
>

Thanks for having a look!

> Anyway, I'm all for the patches. Care to send a new version covering my input?
>

Sure thing, I'll send a v2 of these patches soonish.

> Thanks,
>
> -- Steve



* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 22:23   ` Frederic Weisbecker
  2023-07-05 22:41     ` Peter Zijlstra
  2023-07-06 10:01     ` Frederic Weisbecker
@ 2023-07-06 11:30     ` Valentin Schneider
  2023-07-06 11:40       ` Frederic Weisbecker
  2023-07-06 11:55     ` Frederic Weisbecker
  3 siblings, 1 reply; 33+ messages in thread
From: Valentin Schneider @ 2023-07-06 11:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 06/07/23 00:23, Frederic Weisbecker wrote:
> On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
>> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
>> +{
>> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> +	unsigned int old_work;
>> +	bool ret = false;
>> +
>> +	preempt_disable();
>> +
>> +	old_work = atomic_read(&ct->work);
>> +	/*
>> +	 * Try setting the work until either
>> +	 * - the target CPU no longer accepts any more deferred work
>> +	 * - the work has been set
>> +	 */
>> +	while (!(old_work & CONTEXT_WORK_DISABLED) && !ret)
>
> Isn't there a race here where you may have missed a CPU that just entered in
> user and you eventually disturb it?
>

Yes, unfortunately.

>> +		ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
>> +
>> +	preempt_enable();
>> +	return ret;
>> +}
> [...]
>> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
>>   */
>>  static noinstr void ct_kernel_enter_state(int offset)
>>  {
>> +	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
>>      int seq;
>> +	unsigned int work;
>>
>> +	work = ct_work_fetch(ct);
>
> So this adds another fully ordered operation on user <-> kernel transition.
> How many such IPIs can we expect?
>

Despite having spent quite a lot of time on that question, I think I still
only have a hunch.

Poking around RHEL systems, I'd say 99% of the problematic IPIs are
instruction patching and TLB flushes.

Staring at the code, there's quite a lot of smp_calls for which it's hard
to say whether the target CPUs can actually be isolated or not (e.g. the
CPU comes from a cpumask shoved in a struct that was built using data from
another struct of uncertain origins), but then again some of them don't
need to hook into context_tracking.

Long story short: I /think/ we can consider that number to be fairly small,
but there could be more lurking in the shadows.

> If this is just about a dozen, can we stuff them in the state like in the
> following? We can potentially add more of them especially on 64 bits we could
> afford 30 different works, this is just shrinking the RCU extended quiescent
> state counter space. Worst case that can happen is that RCU misses 65535
> idle/user <-> kernel transitions and delays a grace period...
>

I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
only does a !=).

I'm rephrasing here to make sure I get it - is it then that the worst case
here is 2^(dynticks_counter_size) transitions happen between saving the
dynticks snapshot and checking it again, so RCU waits some more?



* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 22:39   ` Peter Zijlstra
@ 2023-07-06 11:31     ` Valentin Schneider
  0 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-06 11:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 06/07/23 00:39, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
>
>> Note: A previous approach by PeterZ [1] used an extra bit in
>> context_tracking.state to flag the presence of deferred callbacks to
>> execute, and the actual callbacks were stored in a separate atomic
>> variable.
>>
>> This meant that the atomic read of context_tracking.state was sufficient to
>> determine whether there are any deferred callbacks to execute.
>> Unfortunately, it presents a race window. Consider the work setting
>> function as:
>>
>>   preempt_disable();
>>   seq = atomic_read(&ct->seq);
>>   if (__context_tracking_seq_in_user(seq)) {
>>        /* ctrl-dep */
>>        atomic_or(work, &ct->work);
>>        ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>>   }
>>   preempt_enable();
>>
>>   return ret;
>>
>> Then the following can happen:
>>
>>   CPUx                                             CPUy
>>                                                   CT_SEQ_WORK \in context_tracking.state
>>     atomic_or(WORK_N, &ct->work);
>>                                                    ct_kernel_enter()
>>                                                      ct_state_inc();
>>     atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>>
>> The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
>> sent. Unfortunately, the work bit would remain set, and it can't be sanely
>> cleared in case another CPU set it concurrently - this would ultimately
>> lead to a double execution of the callback, one as a deferred callback and
>> one in the IPI. As not all IPI callbacks are idempotent, this is
>> undesirable.
>
> So adding another atomic is arguably worse.
>
> The thing is, if the NOHZ_FULL CPU is actually doing context transitions
> (SYSCALLs etc..) then everything is fundamentally racy, there is no
> winning that game, we could find the remote CPU is in-kernel, send an
> IPI, the remote CPU does return-to-user and receives the IPI.
>
> And then the USER is upset... because he got an IPI.
>

Yeah, that part is inevitably racy.

The thing I was especially worried about was the potential double
executions (once in IPI, again in deferred work). It's not /too/ bad as the
only two deferred callbacks I'm introducing here are costly-but-stateless,
but IMO it's a bad foundation.

But it seems like we can reuse the existing atomic and squeeze some bits in
there, so let's see how that goes :-)



* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-06 11:30     ` Valentin Schneider
@ 2023-07-06 11:40       ` Frederic Weisbecker
  2023-07-06 16:39         ` Paul E. McKenney
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2023-07-06 11:40 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
> >> +		ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> >> +
> >> +	preempt_enable();
> >> +	return ret;
> >> +}
> > [...]
> >> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
> >>   */
> >>  static noinstr void ct_kernel_enter_state(int offset)
> >>  {
> >> +	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> >>      int seq;
> >> +	unsigned int work;
> >>
> >> +	work = ct_work_fetch(ct);
> >
> > So this adds another fully ordered operation on user <-> kernel transition.
> > How many such IPIs can we expect?
> >
> 
> Despite having spent quite a lot of time on that question, I think I still
> only have a hunch.
> 
> Poking around RHEL systems, I'd say 99% of the problematic IPIs are
> instruction patching and TLB flushes.
> 
> Staring at the code, there's quite a lot of smp_calls for which it's hard
> to say whether the target CPUs can actually be isolated or not (e.g. the
> CPU comes from a cpumask shoved in a struct that was built using data from
> another struct of uncertain origins), but then again some of them don't
> need to hook into context_tracking.
> 
> Long story short: I /think/ we can consider that number to be fairly small,
> but there could be more lurking in the shadows.

I guess there will still be time to reconsider the design if we ever reach such a size.

> 
> > If this is just about a dozen, can we stuff them in the state like in the
> > following? We can potentially add more of them especially on 64 bits we could
> > afford 30 different works, this is just shrinking the RCU extended quiescent
> > state counter space. Worst case that can happen is that RCU misses 65535
> > idle/user <-> kernel transitions and delays a grace period...
> >
> 
> I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
> even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
> but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
> only does a !=).
> 
> I'm rephrasing here to make sure I get it - is it then that the worst case
> here is 2^(dynticks_counter_size) transitions happen between saving the
> dynticks snapshot and checking it again, so RCU waits some more?

That's my understanding as well but I have to defer to Paul to make sure I'm
not overlooking something.

Thanks.


* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-05 22:23   ` Frederic Weisbecker
                       ` (2 preceding siblings ...)
  2023-07-06 11:30     ` Valentin Schneider
@ 2023-07-06 11:55     ` Frederic Weisbecker
  3 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2023-07-06 11:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Paul E. McKenney, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old, new, state;
> +	bool ret = false;
> +
> +	preempt_disable();
> +
> +	work <<= CONTEXT_WORK;
> +	state = atomic_read(&ct->state);
> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU is in the kernel
> +	 * - the work has been set
> +	 */
> +	for (;;) {
> +		/* Only set if running in user/guest */
> +		old = state;
> +		old &= ~CONTEXT_MASK;
> +		old |= CONTEXT_USER;
> +
> +		new = old | work;
> +
> +		state = atomic_cmpxchg(&ct->state, old, new);
> +		if (state & work) {

And this should be "if (state == old)", otherwise there is
a risk that someone else had set the work but atomic_cmpxchg()
failed due to other modifications in the meantime. It's then
dangerous in that case to defer the work because atomic_cmpxchg()
failures don't imply full ordering. So there is a risk that the
target executes the work but doesn't see the most recent data.
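
For clarity, the retry loop with that fix applied would read as below (only
the success check changes from what was posted):

	for (;;) {
		/* Only set if running in user/guest */
		old = state;
		old &= ~CONTEXT_MASK;
		old |= CONTEXT_USER;

		new = old | work;

		state = atomic_cmpxchg(&ct->state, old, new);
		if (state == old) {
			/* cmpxchg() succeeded: the work is set and fully ordered */
			ret = true;
			break;
		}

		if ((state & CONTEXT_MASK) != CONTEXT_USER)
			break;
	}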

Thanks.


* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-06 11:40       ` Frederic Weisbecker
@ 2023-07-06 16:39         ` Paul E. McKenney
  2023-07-06 18:05           ` Valentin Schneider
  0 siblings, 1 reply; 33+ messages in thread
From: Paul E. McKenney @ 2023-07-06 16:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, Nicolas Saenz Julienne, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 06, 2023 at 01:40:14PM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
> > >> +		ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> > >> +
> > >> +	preempt_enable();
> > >> +	return ret;
> > >> +}
> > > [...]
> > >> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
> > >>   */
> > >>  static noinstr void ct_kernel_enter_state(int offset)
> > >>  {
> > >> +	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> > >>      int seq;
> > >> +	unsigned int work;
> > >>
> > >> +	work = ct_work_fetch(ct);
> > >
> > > So this adds another fully ordered operation on user <-> kernel transition.
> > > How many such IPIs can we expect?
> > >
> > 
> > Despite having spent quite a lot of time on that question, I think I still
> > only have a hunch.
> > 
> > Poking around RHEL systems, I'd say 99% of the problematic IPIs are
> > instruction patching and TLB flushes.
> > 
> > Staring at the code, there's quite a lot of smp_calls for which it's hard
> > to say whether the target CPUs can actually be isolated or not (e.g. the
> > CPU comes from a cpumask shoved in a struct that was built using data from
> > another struct of uncertain origins), but then again some of them don't
> > need to hook into context_tracking.
> > 
> > Long story short: I /think/ we can consider that number to be fairly small,
> > but there could be more lurking in the shadows.
> 
> I guess there will still be time to reconsider the design if we ever reach such a size.
> 
> > > If this is just about a dozen, can we stuff them in the state like in the
> > > following? We can potentially add more of them especially on 64 bits we could
> > > afford 30 different works, this is just shrinking the RCU extended quiescent
> > > state counter space. Worst case that can happen is that RCU misses 65535
> > > idle/user <-> kernel transitions and delays a grace period...
> > >
> > 
> > I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
> > even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
> > but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
> > only does a !=).
> > 
> > I'm rephrasing here to make sure I get it - is it then that the worst case
> > here is 2^(dynticks_counter_size) transitions happen between saving the
> > dynticks snapshot and checking it again, so RCU waits some more?
> 
> That's my understanding as well but I have to defer to Paul to make sure I'm
> not overlooking something.

That does look plausible to me.

And yes, RCU really cares about whether its part of this counter has
been a multiple of two during a given interval of time, because this
indicates that the CPU has no pre-existing RCU readers still active.
One way that this can happen is for that value to be a multiple of two
at some point in time.  The other way that this can happen is for the
value to have changed.  No matter what the start and end values, if they
are different, the counter must necessarily have at least passed through
a multiple of two in the meantime, again guaranteeing that any RCU readers
that were around when the count was first fetched have now finished.
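
A simplified sketch of those two checks (illustrative only, not the exact
kernel code; RCU_DYNTICKS_IDX marks the lowest bit of RCU's portion of the
counter, and the second helper is roughly what the "!=" mentioned above
boils down to):

	/* The snapshot shows the CPU in an extended quiescent state. */
	static bool snap_in_eqs(unsigned long snap)
	{
		return !(snap & RCU_DYNTICKS_IDX);
	}

	/* The counter moved, so it passed through an even value in between. */
	static bool snap_changed_since(unsigned long snap, unsigned long cur)
	{
		return snap != cur;
	}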

But we should take the machine's opinions much more seriously than we
take any of our own opinions.  Why not adjust RCU_DYNTICKS_IDX so as
to crank RCU's portion of this counter down to (say) two or three bits
and let rcutorture have at it on TREE04 or TREE07, both of which have
nohz_full CPUs?

Maybe also adjust mkinitrd.sh to make the user/kernel transitions more
frequent?

Please note that I do -not- recommend production use of a three-bit
(let alone a two-bit) RCU portion because this has a high probability
of excessively extending grace periods.  But it might be good to keep
a tiny counter as a debug option so that we regularly rcutorture it.

							Thanx, Paul


* Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure
  2023-07-06 16:39         ` Paul E. McKenney
@ 2023-07-06 18:05           ` Valentin Schneider
  0 siblings, 0 replies; 33+ messages in thread
From: Valentin Schneider @ 2023-07-06 18:05 UTC (permalink / raw)
  To: paulmck, Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
	Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
	Kirill A. Shutemov, Nadav Amit, Dan Carpenter, Chuang Wang,
	Yang Jihong, Petr Mladek, Jason A. Donenfeld, Song Liu,
	Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

On 06/07/23 09:39, Paul E. McKenney wrote:
> On Thu, Jul 06, 2023 at 01:40:14PM +0200, Frederic Weisbecker wrote:
>> On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
>> > I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
>> > even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
>> > but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
>> > only does a !=).
>> >
>> > I'm rephrasing here to make sure I get it - is it then that the worst case
>> > here is 2^(dynticks_counter_size) transitions happen between saving the
>> > dynticks snapshot and checking it again, so RCU waits some more?
>>
>> That's my understanding as well but I have to defer to Paul to make sure I'm
>> not overlooking something.
>
> That does look plausible to me.
>
> And yes, RCU really cares about whether its part of this counter has
> been a multiple of two during a given interval of time, because this
> indicates that the CPU has no pre-existing RCU readers still active.
> One way that this can happen is for that value to be a multiple of two
> at some point in time.  The other way that this can happen is for the
> value to have changed.  No matter what the start and end values, if they
> are different, the counter must necessarily have at least passed through
> a multiple of two in the meantime, again guaranteeing that any RCU readers
> that were around when the count was first fetched have now finished.
>

Thank you for the demystification!

> But we should take the machine's opinions much more seriously than we
> take any of our own opinions.

Heh :-)

> Why not adjust RCU_DYNTICKS_IDX so as
> to crank RCU's portion of this counter down to (say) two or three bits
> and let rcutorture have at it on TREE04 or TREE07, both of which have
> nohz_full CPUs?
>
> Maybe also adjust mkinitrd.sh to make the user/kernel transitions more
> frequent?
>
> Please note that I do -not- recommend production use of a three-bit
> (let alone a two-bit) RCU portion because this has a high probability
> of excessively extending grace periods.  But it might be good to keep
> a tiny counter as a debug option so that we regularly rcutorture it.
>

Sounds sensible, I'll add that to my v2 todolist.

Thanks!

>                                                       Thanx, Paul




Thread overview: 33+ messages
2023-07-05 18:12 [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 01/14] tracing/filters: Dynamically allocate filter_pred.regex Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 02/14] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
2023-07-05 18:54   ` Steven Rostedt
2023-07-05 18:12 ` [RFC PATCH 03/14] tracing/filters: Enable filtering a scalar field by a cpumask Valentin Schneider
2023-07-05 19:00   ` Steven Rostedt
2023-07-05 18:12 ` [RFC PATCH 04/14] tracing/filters: Enable filtering the CPU common " Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 05/14] tracing/filters: Document cpumask filtering Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 06/14] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 07/14] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 08/14] BROKEN: context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
2023-07-05 20:41   ` Peter Zijlstra
2023-07-05 18:12 ` [RFC PATCH 09/14] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 10/14] x86/sev-es: Make sev_es_enable_key __ro_after_init Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2023-07-05 22:23   ` Frederic Weisbecker
2023-07-05 22:41     ` Peter Zijlstra
2023-07-06  9:53       ` Frederic Weisbecker
2023-07-06 10:01     ` Frederic Weisbecker
2023-07-06 11:30     ` Valentin Schneider
2023-07-06 11:40       ` Frederic Weisbecker
2023-07-06 16:39         ` Paul E. McKenney
2023-07-06 18:05           ` Valentin Schneider
2023-07-06 11:55     ` Frederic Weisbecker
2023-07-05 22:39   ` Peter Zijlstra
2023-07-06 11:31     ` Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 12/14] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 13/14] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
2023-07-05 18:12 ` [RFC PATCH 14/14] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
2023-07-05 18:48 ` [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition Nadav Amit
2023-07-06 11:29   ` Valentin Schneider
2023-07-05 19:03 ` Steven Rostedt
2023-07-06 11:30   ` Valentin Schneider
