* [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they do not necessarily have
to execute the callbacks immediately. Anything that only affects kernelspace
can wait until the next user->kernel transition, provided it can be executed
"early enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
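
To illustrate the idea, here is a rough sketch of such a scheme. The
identifiers below are made up for this sketch and do not match the actual
patches (which squash the pending bits into context_tracking's existing .state
field, see the v1->v2 changelog): the sender sets a bit identifying the IPI
type in a per-CPU mask rather than queueing a CSD, and the target CPU runs the
matching callback on its next kernel entry. Setting an already-set bit is a
no-op, which is what makes the deferral coalescing:

  /* Illustrative sketch only; identifiers don't match the series' code. */
  enum deferred_work {
          DEFERRED_WORK_SYNC_CORE,        /* deferred text_poke() sync */
          DEFERRED_WORK_TLB_FLUSH,        /* deferred kernel TLB flush */
          DEFERRED_WORK_MAX,
  };

  static DEFINE_PER_CPU(unsigned long, deferred_work_pending);

  static void do_sync_core_work(void) { sync_core(); }
  static void do_tlb_flush_work(void) { __flush_tlb_all(); }

  static void (*deferred_work_fns[DEFERRED_WORK_MAX])(void) = {
          [DEFERRED_WORK_SYNC_CORE] = do_sync_core_work,
          [DEFERRED_WORK_TLB_FLUSH] = do_tlb_flush_work,
  };

  /*
   * Sender side: mark the work as pending instead of queueing a CSD.
   * The real series additionally has to atomically check that the target
   * CPU really is in userspace, and fall back to a regular IPI otherwise.
   */
  static void defer_ipi(int cpu, enum deferred_work work)
  {
          set_bit(work, per_cpu_ptr(&deferred_work_pending, cpu));
  }

  /* Target side: runs early in the next user->kernel transition */
  static void deferred_work_flush(void)
  {
          unsigned long work = xchg(this_cpu_ptr(&deferred_work_pending), 0);
          int bit;

          for_each_set_bit(bit, &work, DEFERRED_WORK_MAX)
                  deferred_work_fns[bit]();
  }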

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
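
To make that concrete, here is a hypothetical sketch (not what this series
implements) of how a parameter-carrying IPI such as a ranged kernel TLB flush
could be handled, reusing the made-up defer_ipi() helper from the sketch
above and the existing do_kernel_range_flush() helper visible in the trace
above. The argument is simply over-approximated away so that the deferred
callback stays argument-less, which is in spirit what patches 19-20 do:

  /* Hypothetical sketch, not the series' implementation. */
  static void flush_tlb_kernel_range_deferrable(unsigned long start,
                                                unsigned long end)
  {
          struct flush_tlb_info info = { .start = start, .end = end };
          int cpu;

          for_each_online_cpu(cpu) {
                  if (!tick_nohz_full_cpu(cpu)) {
                          /* Housekeeping CPUs: precise ranged flush, now */
                          smp_call_function_single(cpu, do_kernel_range_flush,
                                                   &info, 1);
                  } else {
                          /*
                           * Isolated CPUs: drop the range; the deferred
                           * callback over-approximates with a full kernel
                           * TLB flush on the next kernel entry.
                           */
                          defer_ipi(cpu, DEFERRED_WORK_TLB_FLUSH);
                  }
          }
  }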

Kernel entry vs execution of the deferred operation
===================================================

There is a non-zero length of code that is executed upon kernel entry before
the deferred operation can itself be executed (i.e. before we start getting
into context_tracking.c proper).

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't
want to hit instructions that have been modified by a remote text_poke() while
we're on our way to execute a deferred sync_core().
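
In other words, the flush of deferred work must itself be noinstr, and nothing
executed before it may depend on state that the deferred operations are meant
to fix up. A sketch of what such a flush looks like for the two operations
deferred by this series (names made up; see also the __flush_tlb_all() noinstr
warning mentioned below):

  /*
   * Sketch: runs in the noinstr entry path, before any instrumentation.
   * Until this has run, the CPU must not execute any patchable
   * instruction (static branch, static call, ...), hence the objtool
   * checks of patches 10-14.
   */
  static noinstr void deferred_work_flush_all(void)
  {
          sync_core();            /* serialise against remote text_poke() */
          __flush_tlb_all();      /* catch up with deferred kernel TLBIs */
  }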

Patches
=======

o Patches 1-9 have been submitted separately and are included for the sake of
  testing.

o Patches 10-14 focus on having objtool detect problematic static key usage in
  early entry.

o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure.
o Patch 18 adds text_poke() IPI deferral.

o Patches 19-20 add vunmap()/flush_tlb_kernel_range() IPI deferral.

  These ones I'm a lot less confident about, mostly due to lacking
  instrumentation/verification.
  
  The actual deferred callback is also incomplete as it's not properly noinstr:
    vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
  and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
  have *no idea* what a sane implementation would be for Xen so I haven't
  touched that yet.

Patches are also available at:

https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2

Testing
=======

Note: this is a different machine than the one used for v1, because that
machine decided to act difficult.

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu"    -f "cpu & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
                   rteval --onlyload --loads-cpulist=$HK_CPUS \
                          --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 30 minutes.

v6.5-rc1 (+ cpumask filtering patches):
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    338 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   9207 func=do_flush_tlb_all
   1116 func=do_sync_core
     62 func=do_kernel_range_flush
      3 func=nohz_full_kick_func

v6.5-rc1 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
      2 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
      2 func=nohz_full_kick_func

The incriminating IPIs are all gone, but note that on the machine I used to test
v1 there were still some do_flush_tlb_all() IPIs caused by
pcpu_balance_workfn(), since only vmalloc is affected by the deferral
mechanism.


Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
  .data..ro_after_init section.

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE

Revisions
=========

RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul)

  The new TREE11 case with a 2-bit dynticks counter seems to pass when run
  against this series.

o Fixed flush_tlb_kernel_range_deferrable() definition

Peter Zijlstra (1):
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys

Valentin Schneider (19):
  tracing/filters: Dynamically allocate filter_pred.regex
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Optimise cpumask vs cpumask filtering when user mask
    is a single CPU
  tracing/filters: Optimise scalar vs cpumask filtering when the user
    mask is a single CPU
  tracing/filters: Optimise CPU vs cpumask filtering when the user mask
    is a single CPU
  tracing/filters: Further optimise scalar vs cpumask comparison
  tracing/filters: Document cpumask filtering
  objtool: Flesh out warning related to pv_ops[] calls
  objtool: Warn about non __ro_after_init static key usage in .noinstr
  context_tracking: Make context_tracking_key __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context-tracking: Introduce work deferral infrastructure
  rcu: Make RCU dynticks counter size configurable
  rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  context_tracking,x86: Defer kernel text patching IPIs
  context_tracking,x86: Add infrastructure to defer kernel TLBI
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
    CPUs

 Documentation/trace/events.rst                |  14 +
 arch/Kconfig                                  |   9 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/context_tracking_work.h  |  20 ++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   2 +
 arch/x86/kernel/alternative.c                 |  24 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/kvm.c                         |   2 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/mm/tlb.c                             |  40 ++-
 include/asm-generic/sections.h                |   5 +
 include/linux/context_tracking.h              |  26 ++
 include/linux/context_tracking_state.h        |  65 +++-
 include/linux/context_tracking_work.h         |  28 ++
 include/linux/jump_label.h                    |   1 +
 include/linux/trace_events.h                  |   1 +
 init/main.c                                   |   1 +
 kernel/context_tracking.c                     |  53 ++-
 kernel/jump_label.c                           |  49 +++
 kernel/rcu/Kconfig                            |  33 ++
 kernel/time/Kconfig                           |   5 +
 kernel/trace/trace_events_filter.c            | 302 ++++++++++++++++--
 mm/vmalloc.c                                  |  19 +-
 tools/objtool/check.c                         |  22 +-
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/special.h       |   2 +
 tools/objtool/special.c                       |   3 +
 .../selftests/rcutorture/configs/rcu/TREE11   |  19 ++
 .../rcutorture/configs/rcu/TREE11.boot        |   1 +
 31 files changed, 695 insertions(+), 64 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

--
2.31.1



* [RFC PATCH v2 01/20] tracing/filters: Dynamically allocate filter_pred.regex
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array
in the regex field, even if the predicate function does not use the field.

A later commit will introduce a dynamically allocated cpumask to struct
filter_pred, which will require a dedicated freeing function. Bite the
bullet and make filter_pred.regex dynamically allocated.

While at it, reorder the fields of filter_pred to fill in the byte
holes. The struct now fits on a single cacheline.

No change in behaviour intended.

The kfree() calls were patched via Coccinelle:
  @@
  struct filter_pred *pred;
  @@

  -kfree(pred);
  +free_predicate(pred);

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 64 ++++++++++++++++++------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 1dad64267878c..91fc9990107f1 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -70,15 +70,15 @@ enum filter_pred_fn {
 };
 
 struct filter_pred {
-	enum filter_pred_fn 	fn_num;
-	u64 			val;
-	u64 			val2;
-	struct regex		regex;
+	struct regex		*regex;
 	unsigned short		*ops;
 	struct ftrace_event_field *field;
-	int 			offset;
+	u64			val;
+	u64			val2;
+	enum filter_pred_fn	fn_num;
+	int			offset;
 	int			not;
-	int 			op;
+	int			op;
 };
 
 /*
@@ -186,6 +186,14 @@ enum {
 	PROCESS_OR	= 4,
 };
 
+static void free_predicate(struct filter_pred *pred)
+{
+	if (pred) {
+		kfree(pred->regex);
+		kfree(pred);
+	}
+}
+
 /*
  * Without going into a formal proof, this explains the method that is used in
  * parsing the logical expressions.
@@ -623,7 +631,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
 	kfree(inverts);
 	if (prog_stack) {
 		for (i = 0; prog_stack[i].pred; i++)
-			kfree(prog_stack[i].pred);
+			free_predicate(prog_stack[i].pred);
 		kfree(prog_stack);
 	}
 	return ERR_PTR(ret);
@@ -750,7 +758,7 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
 	char *addr = (char *)(event + pred->offset);
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, pred->regex.field_len);
+	cmp = pred->regex->match(addr, pred->regex, pred->regex->field_len);
 
 	match = cmp ^ pred->not;
 
@@ -763,7 +771,7 @@ static __always_inline int filter_pchar(struct filter_pred *pred, char *str)
 	int len;
 
 	len = strlen(str) + 1;	/* including tailing '\0' */
-	cmp = pred->regex.match(str, &pred->regex, len);
+	cmp = pred->regex->match(str, pred->regex, len);
 
 	match = cmp ^ pred->not;
 
@@ -813,7 +821,7 @@ static int filter_pred_strloc(struct filter_pred *pred, void *event)
 	char *addr = (char *)(event + str_loc);
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, str_len);
+	cmp = pred->regex->match(addr, pred->regex, str_len);
 
 	match = cmp ^ pred->not;
 
@@ -836,7 +844,7 @@ static int filter_pred_strrelloc(struct filter_pred *pred, void *event)
 	char *addr = (char *)(&item[1]) + str_loc;
 	int cmp, match;
 
-	cmp = pred->regex.match(addr, &pred->regex, str_len);
+	cmp = pred->regex->match(addr, pred->regex, str_len);
 
 	match = cmp ^ pred->not;
 
@@ -874,7 +882,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
 	int cmp;
 
-	cmp = pred->regex.match(current->comm, &pred->regex,
+	cmp = pred->regex->match(current->comm, pred->regex,
 				TASK_COMM_LEN);
 	return cmp ^ pred->not;
 }
@@ -1004,7 +1012,7 @@ enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not)
 
 static void filter_build_regex(struct filter_pred *pred)
 {
-	struct regex *r = &pred->regex;
+	struct regex *r = pred->regex;
 	char *search;
 	enum regex_type type = MATCH_FULL;
 
@@ -1169,7 +1177,7 @@ static void free_prog(struct event_filter *filter)
 		return;
 
 	for (i = 0; prog[i].pred; i++)
-		kfree(prog[i].pred);
+		free_predicate(prog[i].pred);
 	kfree(prog);
 }
 
@@ -1553,9 +1561,12 @@ static int parse_pred(const char *str, void *data,
 			goto err_free;
 		}
 
-		pred->regex.len = len;
-		strncpy(pred->regex.pattern, str + s, len);
-		pred->regex.pattern[len] = 0;
+		pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+		if (!pred->regex)
+			goto err_mem;
+		pred->regex->len = len;
+		strncpy(pred->regex->pattern, str + s, len);
+		pred->regex->pattern[len] = 0;
 
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
@@ -1597,9 +1608,12 @@ static int parse_pred(const char *str, void *data,
 			goto err_free;
 		}
 
-		pred->regex.len = len;
-		strncpy(pred->regex.pattern, str + s, len);
-		pred->regex.pattern[len] = 0;
+		pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+		if (!pred->regex)
+			goto err_mem;
+		pred->regex->len = len;
+		strncpy(pred->regex->pattern, str + s, len);
+		pred->regex->pattern[len] = 0;
 
 		filter_build_regex(pred);
 
@@ -1608,7 +1622,7 @@ static int parse_pred(const char *str, void *data,
 
 		} else if (field->filter_type == FILTER_STATIC_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRING;
-			pred->regex.field_len = field->size;
+			pred->regex->field_len = field->size;
 
 		} else if (field->filter_type == FILTER_DYN_STRING) {
 			pred->fn_num = FILTER_PRED_FN_STRLOC;
@@ -1691,10 +1705,10 @@ static int parse_pred(const char *str, void *data,
 	return i;
 
 err_free:
-	kfree(pred);
+	free_predicate(pred);
 	return -EINVAL;
 err_mem:
-	kfree(pred);
+	free_predicate(pred);
 	return -ENOMEM;
 }
 
@@ -2287,8 +2301,8 @@ static int ftrace_function_set_filter_pred(struct filter_pred *pred,
 		return ret;
 
 	return __ftrace_function_set_filter(pred->op == OP_EQ,
-					    pred->regex.pattern,
-					    pred->regex.len,
+					    pred->regex->pattern,
+					    pred->regex->len,
 					    data);
 }
 
-- 
2.31.1



* [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

The recently introduced ipi_send_cpumask trace event contains a cpumask
field, but it currently cannot be used in filter expressions.

Make event filtering aware of cpumask fields, and allow these to be
filtered by a user-provided cpumask.

The user-provided cpumask is to be given in cpulist format and wrapped as:
"CPUS{$cpulist}". The use of curly braces instead of parentheses is to
prevent predicate_parse() from parsing the contents of CPUS{...} as a
full-fledged predicate subexpression.

This enables e.g.:

$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/trace_events.h       |  1 +
 kernel/trace/trace_events_filter.c | 97 +++++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 3930e676436c9..302be73918336 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -807,6 +807,7 @@ enum {
 	FILTER_RDYN_STRING,
 	FILTER_PTR_STRING,
 	FILTER_TRACE_FN,
+	FILTER_CPUMASK,
 	FILTER_COMM,
 	FILTER_CPU,
 	FILTER_STACKTRACE,
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 91fc9990107f1..cb1863dfa280b 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -64,6 +64,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_PCHAR_USER,
 	FILTER_PRED_FN_PCHAR,
 	FILTER_PRED_FN_CPU,
+	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
 	FILTER_PRED_TEST_VISITED,
@@ -71,6 +72,7 @@ enum filter_pred_fn {
 
 struct filter_pred {
 	struct regex		*regex;
+	struct cpumask          *mask;
 	unsigned short		*ops;
 	struct ftrace_event_field *field;
 	u64			val;
@@ -94,6 +96,8 @@ struct filter_pred {
 	C(TOO_MANY_OPEN,	"Too many '('"),			\
 	C(TOO_MANY_CLOSE,	"Too few '('"),				\
 	C(MISSING_QUOTE,	"Missing matching quote"),		\
+	C(MISSING_BRACE_OPEN,   "Missing '{'"),				\
+	C(MISSING_BRACE_CLOSE,  "Missing '}'"),				\
 	C(OPERAND_TOO_LONG,	"Operand too long"),			\
 	C(EXPECT_STRING,	"Expecting string field"),		\
 	C(EXPECT_DIGIT,		"Expecting numeric field"),		\
@@ -103,6 +107,7 @@ struct filter_pred {
 	C(BAD_SUBSYS_FILTER,	"Couldn't find or set field in one of a subsystem's events"), \
 	C(TOO_MANY_PREDS,	"Too many terms in predicate expression"), \
 	C(INVALID_FILTER,	"Meaningless filter expression"),	\
+	C(INVALID_CPULIST,	"Invalid cpulist"),	\
 	C(IP_FIELD_ONLY,	"Only 'ip' field is supported for function trace"), \
 	C(INVALID_VALUE,	"Invalid value (did you forget quotes)?"), \
 	C(NO_FUNCTION,		"Function not found"),			\
@@ -190,6 +195,7 @@ static void free_predicate(struct filter_pred *pred)
 {
 	if (pred) {
 		kfree(pred->regex);
+		kfree(pred->mask);
 		kfree(pred);
 	}
 }
@@ -877,6 +883,26 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
 	}
 }
 
+/* Filter predicate for cpumask field vs user-provided cpumask */
+static int filter_pred_cpumask(struct filter_pred *pred, void *event)
+{
+	u32 item = *(u32 *)(event + pred->offset);
+	int loc = item & 0xffff;
+	const struct cpumask *mask = (event + loc);
+	const struct cpumask *cmp = pred->mask;
+
+	switch (pred->op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cmp);
+	case OP_NE:
+		return !cpumask_equal(mask, cmp);
+	case OP_BAND:
+		return cpumask_intersects(mask, cmp);
+	default:
+		return 0;
+	}
+}
+
 /* Filter predicate for COMM. */
 static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
@@ -1244,8 +1270,12 @@ static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
 
 int filter_assign_type(const char *type)
 {
-	if (strstr(type, "__data_loc") && strstr(type, "char"))
-		return FILTER_DYN_STRING;
+	if (strstr(type, "__data_loc")) {
+		if (strstr(type, "char"))
+			return FILTER_DYN_STRING;
+		if (strstr(type, "cpumask_t"))
+			return FILTER_CPUMASK;
+		}
 
 	if (strstr(type, "__rel_loc") && strstr(type, "char"))
 		return FILTER_RDYN_STRING;
@@ -1357,6 +1387,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_pchar(pred, event);
 	case FILTER_PRED_FN_CPU:
 		return filter_pred_cpu(pred, event);
+	case FILTER_PRED_FN_CPUMASK:
+		return filter_pred_cpumask(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
 		return filter_pred_function(pred, event);
 	case FILTER_PRED_TEST_VISITED:
@@ -1568,6 +1600,67 @@ static int parse_pred(const char *str, void *data,
 		strncpy(pred->regex->pattern, str + s, len);
 		pred->regex->pattern[len] = 0;
 
+	} else if (!strncmp(str + i, "CPUS", 4)) {
+		unsigned int maskstart;
+		char *tmp;
+
+		switch (field->filter_type) {
+		case FILTER_CPUMASK:
+			break;
+		default:
+			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+			goto err_free;
+		}
+
+		switch (op) {
+		case OP_EQ:
+		case OP_NE:
+		case OP_BAND:
+			break;
+		default:
+			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+			goto err_free;
+		}
+
+		/* Skip CPUS */
+		i += 4;
+		if (str[i++] != '{') {
+			parse_error(pe, FILT_ERR_MISSING_BRACE_OPEN, pos + i);
+			goto err_free;
+		}
+		maskstart = i;
+
+		/* Walk the cpulist until closing } */
+		for (; str[i] && str[i] != '}'; i++);
+		if (str[i] != '}') {
+			parse_error(pe, FILT_ERR_MISSING_BRACE_CLOSE, pos + i);
+			goto err_free;
+		}
+
+		if (maskstart == i) {
+			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+			goto err_free;
+		}
+
+		/* Copy the cpulist between { and } */
+		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
+		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
+
+		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
+		if (!pred->mask)
+			goto err_mem;
+
+		/* Now parse it */
+		if (cpulist_parse(tmp, pred->mask)) {
+			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+			goto err_free;
+		}
+
+		/* Move along */
+		i++;
+		if (field->filter_type == FILTER_CPUMASK)
+			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
 		char q = str[i];
-- 
2.31.1



* [RFC PATCH v2 03/20] tracing/filters: Enable filtering a scalar field by a cpumask
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...

Filtering these currently requires using arithmetic comparison functions,
which can be tedious when dealing with interleaved SMT or NUMA CPU ids.

Allow these to be filtered by a user-provided cpumask, which enables e.g.:

$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
NOTE: I went with an implicit cpumask conversion of the event field, as
AFAICT predicate_parse() does not support parsing the application of a
function to a field (e.g. 'CPUS(target_cpu) & CPUS{2,4,6,8-32}')
---
 kernel/trace/trace_events_filter.c | 92 ++++++++++++++++++++++++++----
 1 file changed, 81 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index cb1863dfa280b..1e14f801685a8 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -46,15 +46,19 @@ static const char * ops[] = { OPS };
 enum filter_pred_fn {
 	FILTER_PRED_FN_NOP,
 	FILTER_PRED_FN_64,
+	FILTER_PRED_FN_64_CPUMASK,
 	FILTER_PRED_FN_S64,
 	FILTER_PRED_FN_U64,
 	FILTER_PRED_FN_32,
+	FILTER_PRED_FN_32_CPUMASK,
 	FILTER_PRED_FN_S32,
 	FILTER_PRED_FN_U32,
 	FILTER_PRED_FN_16,
+	FILTER_PRED_FN_16_CPUMASK,
 	FILTER_PRED_FN_S16,
 	FILTER_PRED_FN_U16,
 	FILTER_PRED_FN_8,
+	FILTER_PRED_FN_8_CPUMASK,
 	FILTER_PRED_FN_S8,
 	FILTER_PRED_FN_U8,
 	FILTER_PRED_FN_COMM,
@@ -643,6 +647,39 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
 	return ERR_PTR(ret);
 }
 
+static inline int
+do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp)
+{
+	switch (op) {
+	case OP_EQ:
+		return cpumask_equal(mask, cmp);
+	case OP_NE:
+		return !cpumask_equal(mask, cmp);
+	case OP_BAND:
+		return cpumask_intersects(mask, cmp);
+	default:
+		return 0;
+	}
+}
+
+/* Optimisation of do_filter_cpumask() for scalar fields */
+static inline int
+do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask)
+{
+	switch (op) {
+	case OP_EQ:
+		return cpumask_test_cpu(cpu, mask) &&
+			cpumask_nth(1, mask) >= nr_cpu_ids;
+	case OP_NE:
+		return !cpumask_test_cpu(cpu, mask) ||
+			cpumask_nth(1, mask) < nr_cpu_ids;
+	case OP_BAND:
+		return cpumask_test_cpu(cpu, mask);
+	default:
+		return 0;
+	}
+}
+
 enum pred_cmp_types {
 	PRED_CMP_TYPE_NOP,
 	PRED_CMP_TYPE_LT,
@@ -686,6 +723,18 @@ static int filter_pred_##type(struct filter_pred *pred, void *event)	\
 	}								\
 }
 
+#define DEFINE_CPUMASK_COMPARISON_PRED(size)					\
+static int filter_pred_##size##_cpumask(struct filter_pred *pred, void *event)	\
+{										\
+	u##size *addr = (u##size *)(event + pred->offset);			\
+	unsigned int cpu = *addr;						\
+										\
+	if (cpu >= nr_cpu_ids)							\
+		return 0;							\
+										\
+	return do_filter_scalar_cpumask(pred->op, cpu, pred->mask);		\
+}
+
 #define DEFINE_EQUALITY_PRED(size)					\
 static int filter_pred_##size(struct filter_pred *pred, void *event)	\
 {									\
@@ -707,6 +756,11 @@ DEFINE_COMPARISON_PRED(u16);
 DEFINE_COMPARISON_PRED(s8);
 DEFINE_COMPARISON_PRED(u8);
 
+DEFINE_CPUMASK_COMPARISON_PRED(64);
+DEFINE_CPUMASK_COMPARISON_PRED(32);
+DEFINE_CPUMASK_COMPARISON_PRED(16);
+DEFINE_CPUMASK_COMPARISON_PRED(8);
+
 DEFINE_EQUALITY_PRED(64);
 DEFINE_EQUALITY_PRED(32);
 DEFINE_EQUALITY_PRED(16);
@@ -891,16 +945,7 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 	const struct cpumask *mask = (event + loc);
 	const struct cpumask *cmp = pred->mask;
 
-	switch (pred->op) {
-	case OP_EQ:
-		return cpumask_equal(mask, cmp);
-	case OP_NE:
-		return !cpumask_equal(mask, cmp);
-	case OP_BAND:
-		return cpumask_intersects(mask, cmp);
-	default:
-		return 0;
-	}
+	return do_filter_cpumask(pred->op, mask, cmp);
 }
 
 /* Filter predicate for COMM. */
@@ -1351,24 +1396,32 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 	switch (pred->fn_num) {
 	case FILTER_PRED_FN_64:
 		return filter_pred_64(pred, event);
+	case FILTER_PRED_FN_64_CPUMASK:
+		return filter_pred_64_cpumask(pred, event);
 	case FILTER_PRED_FN_S64:
 		return filter_pred_s64(pred, event);
 	case FILTER_PRED_FN_U64:
 		return filter_pred_u64(pred, event);
 	case FILTER_PRED_FN_32:
 		return filter_pred_32(pred, event);
+	case FILTER_PRED_FN_32_CPUMASK:
+		return filter_pred_32_cpumask(pred, event);
 	case FILTER_PRED_FN_S32:
 		return filter_pred_s32(pred, event);
 	case FILTER_PRED_FN_U32:
 		return filter_pred_u32(pred, event);
 	case FILTER_PRED_FN_16:
 		return filter_pred_16(pred, event);
+	case FILTER_PRED_FN_16_CPUMASK:
+		return filter_pred_16_cpumask(pred, event);
 	case FILTER_PRED_FN_S16:
 		return filter_pred_s16(pred, event);
 	case FILTER_PRED_FN_U16:
 		return filter_pred_u16(pred, event);
 	case FILTER_PRED_FN_8:
 		return filter_pred_8(pred, event);
+	case FILTER_PRED_FN_8_CPUMASK:
+		return filter_pred_8_cpumask(pred, event);
 	case FILTER_PRED_FN_S8:
 		return filter_pred_s8(pred, event);
 	case FILTER_PRED_FN_U8:
@@ -1606,6 +1659,7 @@ static int parse_pred(const char *str, void *data,
 
 		switch (field->filter_type) {
 		case FILTER_CPUMASK:
+		case FILTER_OTHER:
 			break;
 		default:
 			parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
@@ -1658,8 +1712,24 @@ static int parse_pred(const char *str, void *data,
 
 		/* Move along */
 		i++;
-		if (field->filter_type == FILTER_CPUMASK)
+		if (field->filter_type == FILTER_CPUMASK) {
 			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+		} else {
+			switch (field->size) {
+			case 8:
+				pred->fn_num = FILTER_PRED_FN_64_CPUMASK;
+				break;
+			case 4:
+				pred->fn_num = FILTER_PRED_FN_32_CPUMASK;
+				break;
+			case 2:
+				pred->fn_num = FILTER_PRED_FN_16_CPUMASK;
+				break;
+			case 1:
+				pred->fn_num = FILTER_PRED_FN_8_CPUMASK;
+				break;
+			}
+		}
 
 	/* This is either a string, or an integer */
 	} else if (str[i] == '\'' || str[i] == '"') {
-- 
2.31.1



* [RFC PATCH v2 04/20] tracing/filters: Enable filtering the CPU common field by a cpumask
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

The tracing_cpumask lets us specify which CPUs are traced in a buffer
instance, but doesn't let us do this on a per-event basis (unless one
creates an instance per event).

A previous commit added filtering scalar fields by a user-given cpumask;
make this work with the CPU common field as well.

This enables doing things like

$ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
		   -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 1e14f801685a8..3009d0c61b532 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -68,6 +68,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_PCHAR_USER,
 	FILTER_PRED_FN_PCHAR,
 	FILTER_PRED_FN_CPU,
+	FILTER_PRED_FN_CPU_CPUMASK,
 	FILTER_PRED_FN_CPUMASK,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
@@ -937,6 +938,14 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
 	}
 }
 
+/* Filter predicate for current CPU vs user-provided cpumask */
+static int filter_pred_cpu_cpumask(struct filter_pred *pred, void *event)
+{
+	int cpu = raw_smp_processor_id();
+
+	return do_filter_scalar_cpumask(pred->op, cpu, pred->mask);
+}
+
 /* Filter predicate for cpumask field vs user-provided cpumask */
 static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 {
@@ -1440,6 +1449,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_pchar(pred, event);
 	case FILTER_PRED_FN_CPU:
 		return filter_pred_cpu(pred, event);
+	case FILTER_PRED_FN_CPU_CPUMASK:
+		return filter_pred_cpu_cpumask(pred, event);
 	case FILTER_PRED_FN_CPUMASK:
 		return filter_pred_cpumask(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
@@ -1659,6 +1670,7 @@ static int parse_pred(const char *str, void *data,
 
 		switch (field->filter_type) {
 		case FILTER_CPUMASK:
+		case FILTER_CPU:
 		case FILTER_OTHER:
 			break;
 		default:
@@ -1714,6 +1726,8 @@ static int parse_pred(const char *str, void *data,
 		i++;
 		if (field->filter_type == FILTER_CPUMASK) {
 			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+		} else if (field->filter_type == FILTER_CPU) {
+			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
 		} else {
 			switch (field->size) {
 			case 8:
-- 
2.31.1



* [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

Reuse do_filter_scalar_cpumask() when the input mask has a weight of one.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 35 +++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 3009d0c61b532..2fe65ddeb34ef 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -70,6 +70,7 @@ enum filter_pred_fn {
 	FILTER_PRED_FN_CPU,
 	FILTER_PRED_FN_CPU_CPUMASK,
 	FILTER_PRED_FN_CPUMASK,
+	FILTER_PRED_FN_CPUMASK_CPU,
 	FILTER_PRED_FN_FUNCTION,
 	FILTER_PRED_FN_,
 	FILTER_PRED_TEST_VISITED,
@@ -957,6 +958,22 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event)
 	return do_filter_cpumask(pred->op, mask, cmp);
 }
 
+/* Filter predicate for cpumask field vs user-provided scalar  */
+static int filter_pred_cpumask_cpu(struct filter_pred *pred, void *event)
+{
+	u32 item = *(u32 *)(event + pred->offset);
+	int loc = item & 0xffff;
+	const struct cpumask *mask = (event + loc);
+	unsigned int cpu = pred->val;
+
+	/*
+	 * This inverts the usual usage of the function (field is first element,
+	 * user parameter is second), but that's fine because the (scalar, mask)
+	 * operations used are symmetric.
+	 */
+	return do_filter_scalar_cpumask(pred->op, cpu, mask);
+}
+
 /* Filter predicate for COMM. */
 static int filter_pred_comm(struct filter_pred *pred, void *event)
 {
@@ -1453,6 +1470,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
 		return filter_pred_cpu_cpumask(pred, event);
 	case FILTER_PRED_FN_CPUMASK:
 		return filter_pred_cpumask(pred, event);
+	case FILTER_PRED_FN_CPUMASK_CPU:
+		return filter_pred_cpumask_cpu(pred, event);
 	case FILTER_PRED_FN_FUNCTION:
 		return filter_pred_function(pred, event);
 	case FILTER_PRED_TEST_VISITED:
@@ -1666,6 +1685,7 @@ static int parse_pred(const char *str, void *data,
 
 	} else if (!strncmp(str + i, "CPUS", 4)) {
 		unsigned int maskstart;
+		bool single;
 		char *tmp;
 
 		switch (field->filter_type) {
@@ -1724,8 +1744,21 @@ static int parse_pred(const char *str, void *data,
 
 		/* Move along */
 		i++;
+
+		/*
+		 * Optimisation: if the user-provided mask has a weight of one
+		 * then we can treat it as a scalar input.
+		 */
+		single = cpumask_weight(pred->mask) == 1;
+		if (single && field->filter_type == FILTER_CPUMASK) {
+			pred->val = cpumask_first(pred->mask);
+			kfree(pred->mask);
+		}
+
 		if (field->filter_type == FILTER_CPUMASK) {
-			pred->fn_num = FILTER_PRED_FN_CPUMASK;
+			pred->fn_num = single ?
+				FILTER_PRED_FN_CPUMASK_CPU :
+				FILTER_PRED_FN_CPUMASK;
 		} else if (field->filter_type == FILTER_CPU) {
 			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
 		} else {
-- 
2.31.1



* [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

When the mask contains a single CPU, directly re-use the unsigned field
predicate functions. Transform '&' into '==' beforehand.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 2fe65ddeb34ef..54d642fabb7f1 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
 		 * then we can treat it as a scalar input.
 		 */
 		single = cpumask_weight(pred->mask) == 1;
-		if (single && field->filter_type == FILTER_CPUMASK) {
+		if (single && field->filter_type != FILTER_CPU) {
 			pred->val = cpumask_first(pred->mask);
 			kfree(pred->mask);
 		}
@@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
 				FILTER_PRED_FN_CPUMASK;
 		} else if (field->filter_type == FILTER_CPU) {
 			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+		} else if (single) {
+			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
+			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
+			if (pred->op == OP_NE)
+				pred->not = 1;
 		} else {
 			switch (field->size) {
 			case 8:
-- 
2.31.1



* [RFC PATCH v2 07/20] tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Steven noted that when the user-provided cpumask contains a single CPU,
then the filtering function can use a scalar as input instead of a
full-fledged cpumask.

In this case we can directly re-use filter_pred_cpu(), we just need to
transform '&' into '==' before executing it.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 54d642fabb7f1..fd72dacc5d1b8 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
 		 * then we can treat it as a scalar input.
 		 */
 		single = cpumask_weight(pred->mask) == 1;
-		if (single && field->filter_type != FILTER_CPU) {
+		if (single) {
 			pred->val = cpumask_first(pred->mask);
 			kfree(pred->mask);
 		}
@@ -1760,7 +1760,12 @@ static int parse_pred(const char *str, void *data,
 				FILTER_PRED_FN_CPUMASK_CPU :
 				FILTER_PRED_FN_CPUMASK;
 		} else if (field->filter_type == FILTER_CPU) {
-			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+			if (single) {
+				pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
+				pred->fn_num = FILTER_PRED_FN_CPU;
+			} else {
+				pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
+			}
 		} else if (single) {
 			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
 			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
-- 
2.31.1



* [RFC PATCH v2 08/20] tracing/filters: Further optimise scalar vs cpumask comparison
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Per the previous commits, we now only enter do_filter_scalar_cpumask() with
a mask of weight greater than one. Optimise the equality checks.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/trace/trace_events_filter.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index fd72dacc5d1b8..3a529214a21b7 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -667,6 +667,25 @@ do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp)
 /* Optimisation of do_filter_cpumask() for scalar fields */
 static inline int
 do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask)
+{
+	/*
+	 * Per the weight-of-one cpumask optimisations, the mask passed in this
+	 * function has a weight >= 2, so it is never equal to a single scalar.
+	 */
+	switch (op) {
+	case OP_EQ:
+		return false;
+	case OP_NE:
+		return true;
+	case OP_BAND:
+		return cpumask_test_cpu(cpu, mask);
+	default:
+		return 0;
+	}
+}
+
+static inline int
+do_filter_cpumask_scalar(int op, const struct cpumask *mask, unsigned int cpu)
 {
 	switch (op) {
 	case OP_EQ:
@@ -966,12 +985,7 @@ static int filter_pred_cpumask_cpu(struct filter_pred *pred, void *event)
 	const struct cpumask *mask = (event + loc);
 	unsigned int cpu = pred->val;
 
-	/*
-	 * This inverts the usual usage of the function (field is first element,
-	 * user parameter is second), but that's fine because the (scalar, mask)
-	 * operations used are symmetric.
-	 */
-	return do_filter_scalar_cpumask(pred->op, cpu, mask);
+	return do_filter_cpumask_scalar(pred->op, mask, cpu);
 }
 
 /* Filter predicate for COMM. */
-- 
2.31.1



* [RFC PATCH v2 09/20] tracing/filters: Document cpumask filtering
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Cpumask, scalar and CPU fields can now be filtered by a user-provided
cpumask; document the syntax.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 Documentation/trace/events.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f5fcb8e1218f6..34108d5a55b41 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -219,6 +219,20 @@ the function "security_prepare_creds" and less than the end of that function.
 The ".function" postfix can only be attached to values of size long, and can only
 be compared with "==" or "!=".
 
+Cpumask fields or scalar fields that encode a CPU number can be filtered using
+a user-provided cpumask in cpulist format. The format is as follows::
+
+  CPUS{$cpulist}
+
+Operators available to cpumask filtering are:
+
+& (intersection), ==, !=
+
+For example, this will filter events that have their .target_cpu field present
+in the given cpumask::
+
+  target_cpu & CPUS{17-42}
+
 5.2 Setting filters
 -------------------
 
-- 
2.31.1



* [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
From: Valentin Schneider @ 2023-07-20 16:30 UTC
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Peter Zijlstra, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

From: Peter Zijlstra <peterz@infradead.org>

When a static_key is marked ro_after_init, its state will never change
(after init), therefore jump_label_update() will never need to iterate
the entries, and thus module load won't actually need to track this --
avoiding the static_key::next write.

Therefore, mark these keys such that jump_label_add_module() can
recognise them and avoid the modification.

Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
to denote such keys.
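
For reference, a key that qualifies is one defined via the *_RO helpers and
(at most) enabled during __init, along the lines of this sketch (my_key,
some_boot_condition and the initcall are made up):

  /* Lives in .data..ro_after_init; may only be flipped during init. */
  DEFINE_STATIC_KEY_FALSE_RO(my_key);

  static int __init my_key_setup(void)
  {
          if (some_boot_condition)
                  static_branch_enable(&my_key);
          return 0;
  }
  early_initcall(my_key_setup);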

Link: http://lore.kernel.org/r/20230705204142.GB2813335@hirez.programming.kicks-ass.net
NOT-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
@Peter: I've barely touched this patch; all I've done is write a comment
and fix benign compilation issues, so the credit's all yours really!
---
 include/asm-generic/sections.h |  5 ++++
 include/linux/jump_label.h     |  1 +
 init/main.c                    |  1 +
 kernel/jump_label.c            | 49 ++++++++++++++++++++++++++++++++++
 4 files changed, 56 insertions(+)

diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index db13bb620f527..c768de6f19a9a 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -180,6 +180,11 @@ static inline bool is_kernel_rodata(unsigned long addr)
 	       addr < (unsigned long)__end_rodata;
 }
 
+static inline bool is_kernel_ro_after_init(unsigned long addr)
+{
+	return addr >= (unsigned long)__start_ro_after_init &&
+	       addr < (unsigned long)__end_ro_after_init;
+}
 /**
  * is_kernel_inittext - checks if the pointer address is located in the
  *                      .init.text section
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f0a949b7c9733..88ef9e776af8d 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -216,6 +216,7 @@ extern struct jump_entry __start___jump_table[];
 extern struct jump_entry __stop___jump_table[];
 
 extern void jump_label_init(void);
+extern void jump_label_ro(void);
 extern void jump_label_lock(void);
 extern void jump_label_unlock(void);
 extern void arch_jump_label_transform(struct jump_entry *entry,
diff --git a/init/main.c b/init/main.c
index ad920fac325c3..cb5304ca18f4d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1403,6 +1403,7 @@ static void mark_readonly(void)
 		 * insecure pages which are W+X.
 		 */
 		rcu_barrier();
+		jump_label_ro();
 		mark_rodata_ro();
 		rodata_test();
 	} else
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index d9c822bbffb8d..661ef74dee9b7 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -530,6 +530,45 @@ void __init jump_label_init(void)
 	cpus_read_unlock();
 }
 
+static inline bool static_key_sealed(struct static_key *key)
+{
+	return (key->type & JUMP_TYPE_LINKED) && !(key->type & ~JUMP_TYPE_MASK);
+}
+
+static inline void static_key_seal(struct static_key *key)
+{
+	unsigned long type = key->type & JUMP_TYPE_TRUE;
+	key->type = JUMP_TYPE_LINKED | type;
+}
+
+void jump_label_ro(void)
+{
+	struct jump_entry *iter_start = __start___jump_table;
+	struct jump_entry *iter_stop = __stop___jump_table;
+	struct jump_entry *iter;
+
+	if (WARN_ON_ONCE(!static_key_initialized))
+		return;
+
+	cpus_read_lock();
+	jump_label_lock();
+
+	for (iter = iter_start; iter < iter_stop; iter++) {
+		struct static_key *iterk = jump_entry_key(iter);
+
+		if (!is_kernel_ro_after_init((unsigned long)iterk))
+			continue;
+
+		if (static_key_sealed(iterk))
+			continue;
+
+		static_key_seal(iterk);
+	}
+
+	jump_label_unlock();
+	cpus_read_unlock();
+}
+
 #ifdef CONFIG_MODULES
 
 enum jump_label_type jump_label_init_type(struct jump_entry *entry)
@@ -650,6 +689,15 @@ static int jump_label_add_module(struct module *mod)
 			static_key_set_entries(key, iter);
 			continue;
 		}
+
+		/*
+		 * If the key was sealed at init, then there's no need to keep
+		 * a reference to its module entries - just patch them now and
+		 * be done with it.
+		 */
+		if (static_key_sealed(key))
+			goto do_poke;
+
 		jlm = kzalloc(sizeof(struct static_key_mod), GFP_KERNEL);
 		if (!jlm)
 			return -ENOMEM;
@@ -675,6 +723,7 @@ static int jump_label_add_module(struct module *mod)
 		static_key_set_linked(key);
 
 		/* Only update if we've changed from our initial state */
+do_poke:
 		if (jump_label_type(iter) != jump_label_init_type(iter))
 			__jump_label_update(key, iter, iter_stop, true);
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (9 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-28 15:33   ` Josh Poimboeuf
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

I had to look into objtool itself to understand what this warning was
about; make it more explicit.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 8936a05f0e5ac..d308330f2910e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (10 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-28 15:35   ` Josh Poimboeuf
  2023-07-28 16:02   ` Josh Poimboeuf
  2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
                   ` (7 subsequent siblings)
  19 siblings, 2 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Josh Poimboeuf, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Later commits will depend on having no runtime-mutable text in early entry
code. (Ab)use the .noinstr section as a marker of early entry code and warn
about static keys used in it that can be flipped at runtime.
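
The resulting warning looks like so (real instances of it appear in the
next two patches):

  vmlinux.o: warning: objtool: <func>+0x<off>: Non __ro_after_init static key "<key>" in .noinstr section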

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 tools/objtool/check.c                   | 20 ++++++++++++++++++++
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/special.h |  2 ++
 tools/objtool/special.c                 |  3 +++
 4 files changed, 26 insertions(+)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index d308330f2910e..d973bb4df4341 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1968,6 +1968,9 @@ static int add_special_section_alts(struct objtool_file *file)
 		alt->next = orig_insn->alts;
 		orig_insn->alts = alt;
 
+		if (special_alt->key_sym)
+			orig_insn->key_sym = special_alt->key_sym;
+
 		list_del(&special_alt->list);
 		free(special_alt);
 	}
@@ -3476,6 +3479,20 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }
 
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0) {
+		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
+			WARN_INSN(insn,
+				  "Non __ro_after_init static key \"%s\" in .noinstr section",
+				  insn->key_sym->name);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3625,6 +3642,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;
 
+		if (insn->key_sym)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {
 
 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index daa46f1f0965a..35dd21f8f41e1 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -77,6 +77,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key_sym;
 };
 
 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 86d4af9c5aa9d..0e61f34fe3a28 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -27,6 +27,8 @@ struct special_alt {
 	struct section *new_sec;
 	unsigned long new_off;
 
+	struct symbol *key_sym;
+
 	unsigned int orig_len, new_len; /* group only */
 };
 
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 91b1950f5bd8a..1f76cfd815bf3 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 			return -1;
 		}
 		alt->key_addend = reloc_addend(key_reloc);
+
+		reloc_to_sec_off(key_reloc, &sec, &offset);
+		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
 	}
 
 	return 0;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (11 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-28 16:00   ` Josh Poimboeuf
  2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
  [...]

The key can only be enabled (and not disabled) in the __init function
ct_cpu_tracker_user(), so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/context_tracking.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 6ef0b35fc28c5..cc4f3a57f848c 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -432,7 +432,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
 
-DEFINE_STATIC_KEY_FALSE(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
 EXPORT_SYMBOL_GPL(context_tracking_key);
 
 static noinstr bool context_tracking_recursion_enter(void)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (12 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-10-09 16:40   ` Maxim Levitsky
  2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

objtool now warns about it:

  vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section

The key can only be enabled (and not disabled) in the __init function
kvm_guest_init(), so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984daa..319460090a836 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -44,7 +44,7 @@
 #include <asm/svm.h>
 #include <asm/e820/api.h>
 
-DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
 
 static int kvmapf = 1;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (13 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-24 14:52   ` Frederic Weisbecker
  2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, as done by commit:

  cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of it.

Reduce the RCU dynticks counter width to free up some bits to be used as a
deferred callback bitmask. Add some build-time checks to validate that
setup.

Presence of CONTEXT_KERNEL in the ct_state prevents queuing deferred work.

Later commits introduce the bit:callback mappings.
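
As a usage sketch, a caller first tries to queue its work and only IPIs the
CPUs for which deferral failed (CONTEXT_WORK_FOO and do_foo_ipi() are
hypothetical; the text patching patch later in the series follows this
exact pattern):

  static bool do_foo_defer_cond(int cpu, void *info)
  {
          /* true: deferral failed, the IPI must be sent after all */
          return !ct_set_cpu_work(cpu, CONTEXT_WORK_FOO);
  }

  on_each_cpu_cond(do_foo_defer_cond, do_foo_ipi, NULL, 1);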

Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 +++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 14 +++++
 include/linux/context_tracking.h             | 25 ++++++++
 include/linux/context_tracking_state.h       | 62 +++++++++++++++-----
 include/linux/context_tracking_work.h        | 26 ++++++++
 kernel/context_tracking.c                    | 51 +++++++++++++++-
 kernel/time/Kconfig                          |  5 ++
 8 files changed, 176 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index aff2746c8af28..1bcb3bbdddaad 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -871,6 +871,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
 	  - No use of instrumentation, unless instrumentation_begin() got
 	    called.
 
+config HAVE_CONTEXT_TRACKING_WORK
+	bool
+	help
+	  Architecture supports deferring work while not in kernel context.
+	  This is especially useful on setups with isolated CPUs that might
+	  want to avoid being interrupted to perform housekeeping tasks (for
+	  ex. TLB invalidation or icache invalidation). The housekeeping
+	  operations are performed upon re-entering the kernel.
+
 config HAVE_TIF_NOHZ
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db4097701..71481a80774f6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -198,6 +198,7 @@ config X86
 	select HAVE_CMPXCHG_LOCAL
 	select HAVE_CONTEXT_TRACKING_USER		if X86_64
 	select HAVE_CONTEXT_TRACKING_USER_OFFSTACK	if HAVE_CONTEXT_TRACKING_USER
+	select HAVE_CONTEXT_TRACKING_WORK		if X86_64
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if HAVE_OBJTOOL
 	select HAVE_OBJTOOL_NOP_MCOUNT		if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5bc29e6b2ed38
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+	switch (work) {
+	case CONTEXT_WORK_n:
+		// Do work...
+		break;
+	}
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6e76b9dba00e7..8aee086d0a25f 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,10 +5,15 @@
 #include <linux/sched.h>
 #include <linux/vtime.h>
 #include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
 #include <linux/instrumentation.h>
 
 #include <asm/ptrace.h>
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START,
+	      "Not enough bits for CONTEXT_WORK");
+#endif
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 extern void ct_cpu_track_user(int cpu);
@@ -131,6 +136,26 @@ static __always_inline unsigned long ct_state_inc(int incby)
 	return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+	struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+	unsigned long new, old, state;
+
+	state = arch_atomic_read(&ct->state);
+	do {
+		old = state;
+		new = old & ~CONTEXT_WORK_MASK;
+		new += incby;
+		state = arch_atomic_cmpxchg(&ct->state, old, new);
+	} while (old != state);
+
+	return new;
+}
+#else
+#define ct_state_inc_clear_work(x) ct_state_inc(x)
+#endif
+
 static __always_inline bool warn_rcu_enter(void)
 {
 	bool ret = false;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index bbff5f7f88030..828fcdb801f73 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -9,21 +9,6 @@
 /* Offset to allow distinguishing irq vs. task-based idle entry/exit. */
 #define DYNTICK_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
 
-enum ctx_state {
-	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
-	CONTEXT_KERNEL		= 0,
-	CONTEXT_IDLE		= 1,
-	CONTEXT_USER		= 2,
-	CONTEXT_GUEST		= 3,
-	CONTEXT_MAX		= 4,
-};
-
-/* Even value for idle, else odd. */
-#define RCU_DYNTICKS_IDX CONTEXT_MAX
-
-#define CT_STATE_MASK (CONTEXT_MAX - 1)
-#define CT_DYNTICKS_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,6 +29,53 @@ struct context_tracking {
 #endif
 };
 
+enum ctx_state {
+	/* Following are values */
+	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
+	CONTEXT_KERNEL		= 0,
+	CONTEXT_IDLE		= 1,
+	CONTEXT_USER		= 2,
+	CONTEXT_GUEST		= 3,
+	CONTEXT_MAX             = 4,
+};
+
+/*
+ * We cram three different things within the same atomic variable:
+ *
+ *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
+ *                         |       CONTEXT_WORK_END                |
+ *                         |               |                       |
+ *                         v               v                       v
+ *         [ context_state ][ context work ][ RCU dynticks counter ]
+ *         ^                ^               ^
+ *         |                |               |
+ *         |        CONTEXT_WORK_START      |
+ * CONTEXT_STATE_START              RCU_DYNTICKS_START
+ */
+
+#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CONTEXT_STATE_START 0
+#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
+
+#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
+#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
+#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
+#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
+
+#define	CONTEXT_WORK_START (CONTEXT_STATE_END + 1)
+#define CONTEXT_WORK_END   (RCU_DYNTICKS_START - 1)
+
+/* Make sure all our bits are accounted for */
+static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
+	      (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
+	      (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
+	      CT_STATE_SIZE);
+
+#define CT_STATE_MASK GENMASK(CONTEXT_STATE_END, CONTEXT_STATE_START)
+#define CT_WORK_MASK  GENMASK(CONTEXT_WORK_END, CONTEXT_WORK_START)
+#define CT_DYNTICKS_MASK GENMASK(RCU_DYNTICKS_END, RCU_DYNTICKS_START)
+
 #ifdef CONFIG_CONTEXT_TRACKING
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
 #endif
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..fb74db8876dd2
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index cc4f3a57f848c..1a3f6e355826d 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,51 @@ static __always_inline void rcu_dynticks_task_trace_exit(void)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+	int bit;
+
+	seq = (seq & CT_WORK_MASK) >> CONTEXT_WORK_START;
+
+	/*
+	 * arch_context_tracking_work() must be noinstr, non-blocking,
+	 * and NMI safe.
+	 */
+	for_each_set_bit(bit, &seq, CONTEXT_WORK_MAX)
+		arch_context_tracking_work(BIT(bit));
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+	unsigned int old;
+	bool ret = false;
+
+	preempt_disable();
+
+	old = atomic_read(&ct->state);
+	/*
+	 * Try setting the work until either
+	 * - the target CPU no longer accepts any more deferred work
+	 * - the work has been set
+	 *
+	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
+	 * as they are regular integers rather than bits, but that doesn't
+	 * matter here: if any of the context state bit is set, the CPU isn't
+	 * in kernel context.
+	 */
+	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
+		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CONTEXT_WORK_START));
+
+	preempt_enable();
+	return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
 /*
  * Record entry into an extended quiescent state.  This is only to be
  * called when not already in an extended quiescent state, that is,
@@ -88,7 +133,8 @@ static noinstr void ct_kernel_exit_state(int offset)
 	 * next idle sojourn.
 	 */
 	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
-	seq = ct_state_inc(offset);
+	seq = ct_state_inc_clear_work(offset);
+
 	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
 }
@@ -100,7 +146,7 @@ static noinstr void ct_kernel_exit_state(int offset)
  */
 static noinstr void ct_kernel_enter_state(int offset)
 {
-	int seq;
+	unsigned long seq;
 
 	/*
 	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
@@ -108,6 +154,7 @@ static noinstr void ct_kernel_enter_state(int offset)
 	 * critical section.
 	 */
 	seq = ct_state_inc(offset);
+	ct_work_flush(seq);
 	// RCU is now watching.  Better not be in an extended quiescent state!
 	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index bae8f11070bef..fdb266f2d774b 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
 	  Say N otherwise, this option brings an overhead that you
 	  don't want in production.
 
+config CONTEXT_TRACKING_WORK
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	default y
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (14 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-21  8:17   ` Valentin Schneider
  2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Paul E . McKenney, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

CONTEXT_TRACKING_WORK reduces the size of the dynticks counter to free up
some bits for work deferral. Paul suggested making the actual counter size
configurable for rcutorture to poke at, so do that.

Make it only configurable under RCU_EXPERT. Previous commits have added
build-time checks that ensure a kernel with problematic dynticks counter
width can't be built.
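
As a worked example with a 32-bit ct->state: the context state occupies
bits 0-1, so the default CONFIG_RCU_DYNTICKS_BITS=16 leaves bits 2-15 for
deferred work and bits 16-31 for the dynticks counter. At the RCU_EXPERT
minimum of 2 bits, the counter shrinks to bits 30-31 and the work area
grows to bits 2-29.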

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/context_tracking.h       |  3 ++-
 include/linux/context_tracking_state.h |  3 +--
 kernel/rcu/Kconfig                     | 33 ++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 8aee086d0a25f..9c0c622bc27bb 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -12,7 +12,8 @@
 
 #ifdef CONFIG_CONTEXT_TRACKING_WORK
 static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START,
-	      "Not enough bits for CONTEXT_WORK");
+	      "Not enough bits for CONTEXT_WORK, "
+	      "CONFIG_RCU_DYNTICKS_BITS might be too high");
 #endif
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 828fcdb801f73..292a0b7c06948 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -58,8 +58,7 @@ enum ctx_state {
 #define CONTEXT_STATE_START 0
 #define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
 
-#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
-#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
+#define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
 #define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
 #define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
 
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bdd7eadb33d8f..1ff2aab24e964 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
 	  Say Y here if you need tighter callback-limit enforcement.
 	  Say N here if you are unsure.
 
+config RCU_DYNTICKS_RANGE_BEGIN
+	int
+	depends on !RCU_EXPERT
+	default 31 if !CONTEXT_TRACKING_WORK
+	default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_RANGE_BEGIN
+	int
+	depends on RCU_EXPERT
+	default 2
+
+config RCU_DYNTICKS_RANGE_END
+	int
+	default 31 if !CONTEXT_TRACKING_WORK
+	default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_BITS_DEFAULT
+       int
+       default 31 if !CONTEXT_TRACKING_WORK
+       default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_BITS
+	int "Dynticks counter width" if CONTEXT_TRACKING_WORK
+	range RCU_DYNTICKS_RANGE_BEGIN RCU_DYNTICKS_RANGE_END
+	default RCU_DYNTICKS_BITS_DEFAULT
+	help
+	  This option controls the width of the dynticks counter.
+
+	  Lower values will make overflows more frequent, which will increase
+	  the likelihood of extending grace periods.
+
+	  Don't touch this unless you are running some tests.
+
 endmenu # "RCU Subsystem"
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (15 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-20 19:53   ` Paul E. McKenney
  2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Paul E . McKenney, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

We now have an RCU_EXPERT knob for configuring the size of the dynticks
counter: CONFIG_RCU_DYNTICKS_BITS.

Add a torture config for a ridiculously small counter (2 bits). This is a
copy of TREE4 with the added counter size restriction.

Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
 .../rcutorture/configs/rcu/TREE11.boot        |  1 +
 2 files changed, 20 insertions(+)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
new file mode 100644
index 0000000000000..aa7274efd9819
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
@@ -0,0 +1,19 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=8
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT=n
+CONFIG_PREEMPT_DYNAMIC=n
+#CHECK#CONFIG_TREE_RCU=y
+CONFIG_HZ_PERIODIC=n
+CONFIG_NO_HZ_IDLE=n
+CONFIG_NO_HZ_FULL=y
+CONFIG_RCU_TRACE=y
+CONFIG_RCU_FANOUT=4
+CONFIG_RCU_FANOUT_LEAF=3
+CONFIG_DEBUG_LOCK_ALLOC=n
+CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
+CONFIG_RCU_EXPERT=y
+CONFIG_RCU_EQS_DEBUG=y
+CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
new file mode 100644
index 0000000000000..a8d94caf7d2fd
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
@@ -0,0 +1 @@
+rcutree.rcu_fanout_leaf=4 nohz_full=1-N
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (16 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-25 10:49   ` Joel Fernandes
  2023-07-20 16:30 ` [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
  2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Peter Zijlstra, Nicolas Saenz Julienne, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys. Any static key used in early entry code
should only ever be enabled at boot time and never change afterwards, IOW
__ro_after_init (pretty much like alternatives). Objtool is now able to
point at static keys that don't respect this, and all static keys used in
early entry code have now been verified to behave accordingly.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 +++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
 arch/x86/kernel/kprobes/core.c               |  4 ++--
 arch/x86/kernel/kprobes/opt.c                |  4 ++--
 arch/x86/kernel/module.c                     |  2 +-
 include/linux/context_tracking_work.h        |  4 ++--
 7 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include <asm/sync_core.h>
+
 static __always_inline void arch_context_tracking_work(int work)
 {
 	switch (work) {
-	case CONTEXT_WORK_n:
-		// Do work...
+	case CONTEXT_WORK_SYNC:
+		sync_core();
 		break;
 	}
 }
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 29832c338cdc5..b6939e965e69d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 72646d75b6ffe..fcce480e1919e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+	__text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
 	}
 
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		text_poke_sync();
+		text_poke_sync_deferrable();
 	}
 
 	/*
@@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	}
 
 	if (do_sync)
-		text_poke_sync();
+		text_poke_sync_deferrable();
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6c..a38c914753397 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 57b0037d0a996..88451a744ceda 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344b..8b4542dc51b6d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		text_poke_sync();
+		text_poke_sync_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index fb74db8876dd2..13fc97b395030 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include <linux/bitops.h>
 
 enum {
-	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_SYNC_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (17 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
  19 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given that NOHZ_FULL CPUs executing in userspace do not
access any kernel addresses, these invalidations do not need to happen
immediately and can be deferred until the next user->kernel transition.

Rather than make __flush_tlb_all() noinstr, add a minimal noinstr
variant that doesn't try to leverage INVPCID.

FIXME: not fully noinstr compliant
XXX: same issue as with ins patching, when do we access data that should be
invalidated?

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 ++++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 17 +++++++++++++++++
 include/linux/context_tracking_state.h       |  4 ++++
 include/linux/context_tracking_work.h        |  2 ++
 5 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 80450e1d5385a..323b971987af7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,7 @@
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL	-1UL
 #define TLB_GENERATION_INVALID	0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf27480af..631df9189ded4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1237,6 +1237,23 @@ void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * TODO: paravirt
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		flush_tlb_local();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 292a0b7c06948..3571c62cbb9cd 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -62,6 +62,10 @@ enum ctx_state {
 #define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
 #define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
 
+/*
+ * When CONFIG_CONTEXT_TRACKING_WORK=n, _END is 1 behind _START, which makes
+ * the CONTEXT_WORK size computation below yield 0, which is what we want!
+ */
 #define	CONTEXT_WORK_START (CONTEXT_STATE_END + 1)
 #define CONTEXT_WORK_END   (RCU_DYNTICKS_START - 1)
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
 
 enum {
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI     = BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
                   ` (18 preceding siblings ...)
  2023-07-20 16:30 ` [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
@ 2023-07-20 16:30 ` Valentin Schneider
  2023-07-21 18:15   ` Nadav Amit
  19 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-20 16:30 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

vunmap()s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

This does require a guarantee that nothing in the vmalloc range can be
accessed in early entry code. vmalloc'd kernel stacks (VMAP_STACK) are
AFAICT a safe exception, as a task running in userspace needs to enter
kernelspace to execute do_exit() before its stack can be vfree'd.

XXX: Validation that nothing in the vmalloc range is accessed in .noinstr or
  somesuch?

Blindly deferring any and all flushes of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when the invalidation
range crosses a threshold (tlb_single_page_flush_ceiling, 33 pages by
default) past which a full flush is cheaper than invalidating each page
in the range individually via INVLPG. IOW, it doesn't *require*
invalidating user mappings, and thus remains safe to defer until a later
kernel entry.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 23 ++++++++++++++++++++---
 mm/vmalloc.c                    | 19 ++++++++++++++-----
 3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 323b971987af7..0b9b1f040c476 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 631df9189ded4..bb18b35e61b4a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
 #include <linux/debugfs.h>
 #include <linux/sched/smt.h>
 #include <linux/task_work.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1045,6 +1046,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1061,12 +1067,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1074,13 +1081,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 93cf99aba335b..e08b6c7d22fb6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -439,6 +439,15 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -452,7 +461,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -1746,7 +1755,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		list_last_entry(&local_purge_list,
 			struct vmap_area, list)->va_end);
 
-	flush_tlb_kernel_range(start, end);
+	flush_tlb_kernel_range_deferrable(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
@@ -1849,7 +1858,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2239,7 +2248,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2304,7 +2313,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
@ 2023-07-20 19:53   ` Paul E. McKenney
  2023-07-21  4:00     ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-20 19:53 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> We now have an RCU_EXPERT knob for configuring the size of the dynticks
> counter: CONFIG_RCU_DYNTICKS_BITS.
> 
> Add a torture config for a ridiculously small counter (2 bits). This is a
> copy of TREE04 with the added counter size restriction.
> 
> Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
>  .../rcutorture/configs/rcu/TREE11.boot        |  1 +
>  2 files changed, 20 insertions(+)
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> 
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> new file mode 100644
> index 0000000000000..aa7274efd9819
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> @@ -0,0 +1,19 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=8
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=y
> +CONFIG_PREEMPT=n
> +CONFIG_PREEMPT_DYNAMIC=n
> +#CHECK#CONFIG_TREE_RCU=y
> +CONFIG_HZ_PERIODIC=n
> +CONFIG_NO_HZ_IDLE=n
> +CONFIG_NO_HZ_FULL=y
> +CONFIG_RCU_TRACE=y
> +CONFIG_RCU_FANOUT=4
> +CONFIG_RCU_FANOUT_LEAF=3
> +CONFIG_DEBUG_LOCK_ALLOC=n
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_EQS_DEBUG=y
> +CONFIG_RCU_LAZY=y
> +CONFIG_RCU_DYNTICKS_BITS=2

Why not just add this last line to the existing TREE04 scenario?
That would ensure that it gets tested regularly without extending the
time required to run a full set of rcutorture tests.

> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> new file mode 100644
> index 0000000000000..a8d94caf7d2fd
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> @@ -0,0 +1 @@
> +rcutree.rcu_fanout_leaf=4 nohz_full=1-N
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-20 19:53   ` Paul E. McKenney
@ 2023-07-21  4:00     ` Paul E. McKenney
  2023-07-21  7:58       ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21  4:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> > We now have an RCU_EXPERT knob for configuring the size of the dynticks
> > counter: CONFIG_RCU_DYNTICKS_BITS.
> > 
> > Add a torture config for a ridiculously small counter (2 bits). This is a
> > copy of TREE04 with the added counter size restriction.
> > 
> > Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > ---
> >  .../selftests/rcutorture/configs/rcu/TREE11   | 19 +++++++++++++++++++
> >  .../rcutorture/configs/rcu/TREE11.boot        |  1 +
> >  2 files changed, 20 insertions(+)
> >  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> > 
> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > new file mode 100644
> > index 0000000000000..aa7274efd9819
> > --- /dev/null
> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > @@ -0,0 +1,19 @@
> > +CONFIG_SMP=y
> > +CONFIG_NR_CPUS=8
> > +CONFIG_PREEMPT_NONE=n
> > +CONFIG_PREEMPT_VOLUNTARY=y
> > +CONFIG_PREEMPT=n
> > +CONFIG_PREEMPT_DYNAMIC=n
> > +#CHECK#CONFIG_TREE_RCU=y
> > +CONFIG_HZ_PERIODIC=n
> > +CONFIG_NO_HZ_IDLE=n
> > +CONFIG_NO_HZ_FULL=y
> > +CONFIG_RCU_TRACE=y
> > +CONFIG_RCU_FANOUT=4
> > +CONFIG_RCU_FANOUT_LEAF=3
> > +CONFIG_DEBUG_LOCK_ALLOC=n
> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> > +CONFIG_RCU_EXPERT=y
> > +CONFIG_RCU_EQS_DEBUG=y
> > +CONFIG_RCU_LAZY=y
> > +CONFIG_RCU_DYNTICKS_BITS=2
> 
> Why not just add this last line to the existing TREE04 scenario?
> That would ensure that it gets tested regularly without extending the
> time required to run a full set of rcutorture tests.

Please see below for the version of this patch that I am running overnight
tests with.  Does this one work for you?

							Thanx, Paul

------------------------------------------------------------------------

commit 1aa13731e665193cd833edac5ebc86a9c3fea2b7
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Thu Jul 20 20:58:41 2023 -0700

    rcutorture: Add a test config to torture test low RCU_DYNTICKS width
    
    We now have an RCU_EXPERT knob for configuring the size of the dynticks
    counter: CONFIG_RCU_DYNTICKS_BITS.
    
    Modify scenario TREE04 to exercise a ridiculously small counter
    (2 bits).
    
    Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3a..aa7274efd981 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-21  4:00     ` Paul E. McKenney
@ 2023-07-21  7:58       ` Valentin Schneider
  2023-07-21 14:07         ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21  7:58 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 20/07/23 21:00, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> > new file mode 100644
>> > index 0000000000000..aa7274efd9819
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> > @@ -0,0 +1,19 @@
>> > +CONFIG_SMP=y
>> > +CONFIG_NR_CPUS=8
>> > +CONFIG_PREEMPT_NONE=n
>> > +CONFIG_PREEMPT_VOLUNTARY=y
>> > +CONFIG_PREEMPT=n
>> > +CONFIG_PREEMPT_DYNAMIC=n
>> > +#CHECK#CONFIG_TREE_RCU=y
>> > +CONFIG_HZ_PERIODIC=n
>> > +CONFIG_NO_HZ_IDLE=n
>> > +CONFIG_NO_HZ_FULL=y
>> > +CONFIG_RCU_TRACE=y
>> > +CONFIG_RCU_FANOUT=4
>> > +CONFIG_RCU_FANOUT_LEAF=3
>> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> > +CONFIG_RCU_EXPERT=y
>> > +CONFIG_RCU_EQS_DEBUG=y
>> > +CONFIG_RCU_LAZY=y
>> > +CONFIG_RCU_DYNTICKS_BITS=2
>>
>> Why not just add this last line to the existing TREE04 scenario?
>> That would ensure that it gets tested regularly without extending the
>> time required to run a full set of rcutorture tests.
>
> Please see below for the version of this patch that I am running overnight
> tests with.  Does this one work for you?
>

Yep that's fine with me. I only went with a separate test file as I wasn't
sure how new test options should be handled (merged into existing tests vs
new tests created), and didn't want to negatively impact TREE04 or
TREE06. If merging into TREE04 is preferred, then I'll do just that and
carry this patch moving forwards.

Thanks!


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
@ 2023-07-21  8:17   ` Valentin Schneider
  2023-07-21 14:10     ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21  8:17 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest
  Cc: Paul E . McKenney, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 20/07/23 17:30, Valentin Schneider wrote:
> index bdd7eadb33d8f..1ff2aab24e964 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
>         Say Y here if you need tighter callback-limit enforcement.
>         Say N here if you are unsure.
>
> +config RCU_DYNTICKS_RANGE_BEGIN
> +	int
> +	depends on !RCU_EXPERT
> +	default 31 if !CONTEXT_TRACKING_WORK

You'll note that this should be 30 really, because the lower *2* bits are
taken by the context state (CONTEXT_GUEST has a value of 3).

This highlights the fragile part of this: the Kconfig values are hardcoded,
but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
static_assert() will at least capture any misconfiguration, but having that
enforced by the actual Kconfig ranges would be less awkward.

Do we currently have a way of e.g. making a Kconfig file depend on and use
values generated by a C header?
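
For reference, the cross-check in question is shaped like this - a sketch
reconstructed from context_tracking_state.h with the field macros elided;
the point is that the three regions carved out of ct->state must exactly
fill the atomic variable:

#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)

static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
	      (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
	      (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
	      CT_STATE_SIZE);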


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-21  7:58       ` Valentin Schneider
@ 2023-07-21 14:07         ` Paul E. McKenney
  2023-07-21 15:08           ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21 14:07 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
> On 20/07/23 21:00, Paul E. McKenney wrote:
> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >> > new file mode 100644
> >> > index 0000000000000..aa7274efd9819
> >> > --- /dev/null
> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >> > @@ -0,0 +1,19 @@
> >> > +CONFIG_SMP=y
> >> > +CONFIG_NR_CPUS=8
> >> > +CONFIG_PREEMPT_NONE=n
> >> > +CONFIG_PREEMPT_VOLUNTARY=y
> >> > +CONFIG_PREEMPT=n
> >> > +CONFIG_PREEMPT_DYNAMIC=n
> >> > +#CHECK#CONFIG_TREE_RCU=y
> >> > +CONFIG_HZ_PERIODIC=n
> >> > +CONFIG_NO_HZ_IDLE=n
> >> > +CONFIG_NO_HZ_FULL=y
> >> > +CONFIG_RCU_TRACE=y
> >> > +CONFIG_RCU_FANOUT=4
> >> > +CONFIG_RCU_FANOUT_LEAF=3
> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> >> > +CONFIG_RCU_EXPERT=y
> >> > +CONFIG_RCU_EQS_DEBUG=y
> >> > +CONFIG_RCU_LAZY=y
> >> > +CONFIG_RCU_DYNTICKS_BITS=2
> >>
> >> Why not just add this last line to the existing TREE04 scenario?
> >> That would ensure that it gets tested regularly without extending the
> >> time required to run a full set of rcutorture tests.
> >
> > Please see below for the version of this patch that I am running overnight
> > tests with.  Does this one work for you?
> 
> Yep that's fine with me. I only went with a separate test file as I wasn't
> sure how new test options should be handled (merged into existing tests vs
> new tests created), and didn't want to negatively impact TREE04 or
> TREE06. If merging into TREE04 is preferred, then I'll do just that and
> carry this patch moving forwards.

Things worked fine for this one-hour-per-scenario test run on my laptop,
except for the CONFIG_SMP=n runs, which all got build errors like the
following.

							Thanx, Paul

------------------------------------------------------------------------

In file included from ./include/linux/container_of.h:5,
                 from ./include/linux/list.h:5,
                 from ./include/linux/swait.h:5,
                 from ./include/linux/completion.h:12,
                 from ./include/linux/crypto.h:15,
                 from arch/x86/kernel/asm-offsets.c:9:
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
   61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
      |                             ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:70:29: note: in expansion of macro ‘RCU_DYNTICKS_START’
   70 | #define CONTEXT_WORK_END   (RCU_DYNTICKS_START - 1)
      |                             ^~~~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:74:16: note: in expansion of macro ‘CONTEXT_WORK_END’
   74 |               (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
      |                ^~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:62:29: note: in expansion of macro ‘CT_STATE_SIZE’
   62 | #define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
      |                             ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:75:16: note: in expansion of macro ‘RCU_DYNTICKS_END’
   75 |               (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
      |                ^~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
   61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
      |                             ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:75:40: note: in expansion of macro ‘RCU_DYNTICKS_START’
   75 |               (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
      |                                        ^~~~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
   56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
      |                                                             ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:76:15: note: in expansion of macro ‘CT_STATE_SIZE’
   76 |               CT_STATE_SIZE);
      |               ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:73:15: error: expression in static assertion is not an integer
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      |               ^
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
   73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
      | ^~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:116: arch/x86/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/git/linux-rcu-1/Makefile:1275: prepare0] Error 2
make[1]: *** Waiting for unfinished jobs....
  LD      /home/git/linux-rcu-1/tools/objtool/objtool-in.o
  LINK    /home/git/linux-rcu-1/tools/objtool/objtool
make: *** [Makefile:234: __sub-make] Error 2

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-21  8:17   ` Valentin Schneider
@ 2023-07-21 14:10     ` Paul E. McKenney
  2023-07-21 15:08       ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21 14:10 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
> On 20/07/23 17:30, Valentin Schneider wrote:
> > index bdd7eadb33d8f..1ff2aab24e964 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> >         Say Y here if you need tighter callback-limit enforcement.
> >         Say N here if you are unsure.
> >
> > +config RCU_DYNTICKS_RANGE_BEGIN
> > +	int
> > +	depends on !RCU_EXPERT
> > +	default 31 if !CONTEXT_TRACKING_WORK
> 
> You'll note that this should be 30 really, because the lower *2* bits are
> taken by the context state (CONTEXT_GUEST has a value of 3).
> 
> This highlights the fragile part of this: the Kconfig values are hardcoded,
> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
> static_assert() will at least capture any misconfiguration, but having that
> enforced by the actual Kconfig ranges would be less awkward.
> 
> Do we currently have a way of e.g. making a Kconfig file depend on and use
> values generated by a C header?

Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
option and let the C code work out what the number of bits should be?

I suppose that there might be a failure whose frequency depended on
the number of bits, which might be an argument for keeping something
like RCU_DYNTICKS_RANGE_BEGIN for fault isolation.  But still using
RCU_DYNTICKS_TORTURE for normal testing.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width
  2023-07-21 14:07         ` Paul E. McKenney
@ 2023-07-21 15:08           ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21 15:08 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 21/07/23 07:07, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 21:00, Paul E. McKenney wrote:
>> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > new file mode 100644
>> >> > index 0000000000000..aa7274efd9819
>> >> > --- /dev/null
>> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > @@ -0,0 +1,19 @@
>> >> > +CONFIG_SMP=y
>> >> > +CONFIG_NR_CPUS=8
>> >> > +CONFIG_PREEMPT_NONE=n
>> >> > +CONFIG_PREEMPT_VOLUNTARY=y
>> >> > +CONFIG_PREEMPT=n
>> >> > +CONFIG_PREEMPT_DYNAMIC=n
>> >> > +#CHECK#CONFIG_TREE_RCU=y
>> >> > +CONFIG_HZ_PERIODIC=n
>> >> > +CONFIG_NO_HZ_IDLE=n
>> >> > +CONFIG_NO_HZ_FULL=y
>> >> > +CONFIG_RCU_TRACE=y
>> >> > +CONFIG_RCU_FANOUT=4
>> >> > +CONFIG_RCU_FANOUT_LEAF=3
>> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> >> > +CONFIG_RCU_EXPERT=y
>> >> > +CONFIG_RCU_EQS_DEBUG=y
>> >> > +CONFIG_RCU_LAZY=y
>> >> > +CONFIG_RCU_DYNTICKS_BITS=2
>> >>
>> >> Why not just add this last line to the existing TREE04 scenario?
>> >> That would ensure that it gets tested regularly without extending the
>> >> time required to run a full set of rcutorture tests.
>> >
>> > Please see below for the version of this patch that I am running overnight
>> > tests with.  Does this one work for you?
>>
>> Yep that's fine with me. I only went with a separate test file as wasn't
>> sure how new test options should be handled (merged into existing tests vs
>> new tests created), and didn't want to negatively impact TREE04 or
>> TREE06. If merging into TREE04 is preferred, then I'll do just that and
>> carry this path moving forwards.
>
> Things worked fine for this one-hour-per-scenario test run on my laptop,

Many thanks for testing!

> except for the CONFIG_SMP=n runs, which all got build errors like the
> following.
>

Harumph, yes !SMP (and !CONTEXT_TRACKING_WORK) doesn't compile nicely, I'll
fix that for v3.
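
Presumably the layout assertion just shouldn't be emitted when ct->state
doesn't exist; something like the below sketch, maybe (untested, exact
guard TBD):

/* The split layout only makes sense when ct->state is actually there */
#ifdef CONFIG_CONTEXT_TRACKING
static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
	      (CONTEXT_WORK_END  + 1 - CONTEXT_WORK_START) +
	      (RCU_DYNTICKS_END  + 1 - RCU_DYNTICKS_START) ==
	      CT_STATE_SIZE);
#endif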


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-21 14:10     ` Paul E. McKenney
@ 2023-07-21 15:08       ` Valentin Schneider
  2023-07-21 16:09         ` Paul E. McKenney
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-21 15:08 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 21/07/23 07:10, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 17:30, Valentin Schneider wrote:
>> > index bdd7eadb33d8f..1ff2aab24e964 100644
>> > --- a/kernel/rcu/Kconfig
>> > +++ b/kernel/rcu/Kconfig
>> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
>> >         Say Y here if you need tighter callback-limit enforcement.
>> >         Say N here if you are unsure.
>> >
>> > +config RCU_DYNTICKS_RANGE_BEGIN
>> > +	int
>> > +	depends on !RCU_EXPERT
>> > +	default 31 if !CONTEXT_TRACKING_WORK
>>
>> You'll note that this should be 30 really, because the lower *2* bits are
>> taken by the context state (CONTEXT_GUEST has a value of 3).
>>
>> This highlights the fragile part of this: the Kconfig values are hardcoded,
>> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
>> static_assert() will at least capture any misconfiguration, but having that
>> enforced by the actual Kconfig ranges would be less awkward.
>>
>> Do we currently have a way of e.g. making a Kconfig file depend on and use
>> values generated by a C header?
>
> Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
> option and let the C code work out what the number of bits should be?
>
> I suppose that there might be a failure whose frequency depended on
> the number of bits, which might be an argument for keeping something
> like RCU_DYNTICKS_RANGE_BEGIN for fault isolation.  But still using
> RCU_DYNTICKS_TORTURE for normal testing.
>
> Thoughts?
>

AFAICT if we run tests with the minimum possible width, then intermediate
widths shouldn't add much value.

Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I
came up with, as we can let the context tracking code figure out the widths
itself and not expose any of that to Kconfig.
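
Something like the below, say - a sketch with a hypothetical
RCU_DYNTICKS_TORTURE Kconfig bool; the non-torture widths are the ones
this series currently hardcodes:

/* Torture mode: squeeze the dynticks counter down to its minimum width */
#ifdef CONFIG_RCU_DYNTICKS_TORTURE
# define RCU_DYNTICKS_BITS 2
#else
# define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
#endif
#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)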

>                                                       Thanx, Paul


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable
  2023-07-21 15:08       ` Valentin Schneider
@ 2023-07-21 16:09         ` Paul E. McKenney
  0 siblings, 0 replies; 76+ messages in thread
From: Paul E. McKenney @ 2023-07-21 16:09 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Fri, Jul 21, 2023 at 04:08:10PM +0100, Valentin Schneider wrote:
> On 21/07/23 07:10, Paul E. McKenney wrote:
> > On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
> >> On 20/07/23 17:30, Valentin Schneider wrote:
> >> > index bdd7eadb33d8f..1ff2aab24e964 100644
> >> > --- a/kernel/rcu/Kconfig
> >> > +++ b/kernel/rcu/Kconfig
> >> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> >> >         Say Y here if you need tighter callback-limit enforcement.
> >> >         Say N here if you are unsure.
> >> >
> >> > +config RCU_DYNTICKS_RANGE_BEGIN
> >> > +	int
> >> > +	depends on !RCU_EXPERT
> >> > +	default 31 if !CONTEXT_TRACKING_WORK
> >>
> >> You'll note that this should be 30 really, because the lower *2* bits are
> >> taken by the context state (CONTEXT_GUEST has a value of 3).
> >>
> >> This highlights the fragile part of this: the Kconfig values are hardcoded,
> >> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
> >> static_assert() will at least capture any misconfiguration, but having that
> >> enforced by the actual Kconfig ranges would be less awkward.
> >>
> >> Do we currently have a way of e.g. making a Kconfig file depend on and use
> >> values generated by a C header?
> >
> > Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
> > option and let the C code work out what the number of bits should be?
> >
> > I suppose that there might be a failure whose frequency depended on
> > the number of bits, which might be an argument for keeping something
> > like RCU_DYNTICKS_RANGE_BEGIN for fault isolation.  But still using
> > RCU_DYNTICKS_TORTURE for normal testing.
> >
> > Thoughts?
> >
> 
> AFAICT if we run tests with the minimum possible width, then intermediate
> widths shouldn't add much value.
> 
> Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I
> came up with, as we can let the context tracking code figure out the widths
> itself and not expose any of that to Kconfig.

Agreed.  If a need for variable numbers of bits ever does arise, we can
worry about it at that time.  And then we would have more information
on what a variable-bit facility should look like.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
@ 2023-07-21 18:15   ` Nadav Amit
  2023-07-24 11:32     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Nadav Amit @ 2023-07-21 18:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, rcu, linux-kselftest,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky



> On Jul 20, 2023, at 9:30 AM, Valentin Schneider <vschneid@redhat.com> wrote:
> 
> vunmap()s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
> 
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.

So I think there are a few assumptions here that it seems suitable to
confirm, with the major one acknowledged in the commit log (assuming they
hold).

There is an assumption that VMAP page-tables are not freed. I actually
never paid attention to that, but skimming the code it does seem so. To
clarify the issue: if page-tables were freed and their pages were reused,
there would be a problem that page-walk caches for instance would be used
and “junk” entries from the reused pages would be used. See [1].

I would also assume that memory hot-unplug of some sort is not an issue
(i.e., you cannot have a stale TLB entry pointing to memory that was
unplugged).

I also think that there might be speculative code execution using stale
TLB entries that would point to memory that has been reused and perhaps
controllable by the user. If somehow the CPU/OS is tricked into using the
stale executable TLB entries early enough on kernel entry, that might be
an issue. I guess it is probably a theoretical issue, but it would be
helpful to confirm.

In general, deferring TLB flushes can be done safely. This patch, I think,
takes it one step forward and allows the reuse of the memory before the TLB
flush is actually done. This is more dangerous.

[1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-21 18:15   ` Nadav Amit
@ 2023-07-24 11:32     ` Valentin Schneider
  2023-07-24 17:40       ` Dave Hansen
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-24 11:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, rcu, linux-kselftest,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

On 21/07/23 18:15, Nadav Amit wrote:
>> On Jul 20, 2023, at 9:30 AM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> vunmap()s issued from housekeeping CPUs are a relatively common source of
>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> flush_tlb_kernel_range() IPIs.
>>
>> Given that CPUs executing in userspace do not access data in the vmalloc
>> range, these IPIs could be deferred until their next kernel entry.
>
> So I think there are a few assumptions here that it seems suitable to confirm
> and acknowledge the major one in the commit log (assuming they hold).
>
> There is an assumption that VMAP page-tables are not freed. I actually
> never paid attention to that, but skimming the code it does seem so. To
> clarify the issue: if page-tables were freed and their pages were reused,
> there would be a problem that page-walk caches for instance would be used
> and “junk” entries from the reused pages would be used. See [1].
>

Thanks for looking into this and sharing context. This is an area I don't
have much experience with, so help is much appreciated!

Indeed, accessing addresses that should be impacted by a TLB flush *before*
executing the deferred flush is an issue. Deferring sync_core() for
instruction patching is a similar problem - it's all in the shape of
"access @addr impacted by @operation during kernel entry, before actually
executing @operation".

AFAICT the only reasonable way to go about the deferral is to prove that no
such access happens before the deferred @operation is done. We got to prove
that for sync_core() deferral, cf. PATCH 18.

I'd like to reason about it for deferring vunmap TLB flushes:

What addresses in the VMAP range, other than the stack, can early entry code
access? Yes, the ranges can be checked at runtime, but is there any chance
of figuring this out e.g. at build-time?
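
For the runtime angle, I'm picturing something like the below debug helper -
entirely hypothetical, made-up name, just to illustrate; it leans on the
VMAP_STACK reasoning from the changelog by exempting the task's own stack:

/*
 * Hypothetical check: scream if early entry code dereferences a vmalloc'd
 * address that isn't part of the current (VMAP_STACK) stack.
 */
static __always_inline void check_early_entry_access(const void *addr)
{
	WARN_ON_ONCE(is_vmalloc_addr(addr) && !object_is_on_stack(addr));
}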

> I would also assume the memory-hot-unplug of some sorts is not an issue,
> (i.e., you cannot have a stale TLB entry pointing to memory that was
> unplugged).
>
> I also think that there might be speculative code execution using stale
> TLB entries that would point to memory that has been reused and perhaps
> controllable by the user. If somehow the CPU/OS is tricked to use the
> stale executable TLB entries early enough on kernel entry that might be
> an issue. I guess it is probably theoretical issue, but it would be helpful
> to confirm.
>
> In general, deferring TLB flushes can be done safely. This patch, I think,
> takes it one step forward and allows the reuse of the memory before the TLB
> flush is actually done. This is more dangerous.
>
> [1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
@ 2023-07-24 14:52   ` Frederic Weisbecker
  2023-07-24 16:55     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Frederic Weisbecker @ 2023-07-24 14:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider wrote:
> +enum ctx_state {
> +	/* Following are values */
> +	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
> +	CONTEXT_KERNEL		= 0,
> +	CONTEXT_IDLE		= 1,
> +	CONTEXT_USER		= 2,
> +	CONTEXT_GUEST		= 3,
> +	CONTEXT_MAX             = 4,
> +};
> +
> +/*
> + * We cram three different things within the same atomic variable:
> + *
> + *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
> + *                         |       CONTEXT_WORK_END                |
> + *                         |               |                       |
> + *                         v               v                       v
> + *         [ context_state ][ context work ][ RCU dynticks counter ]
> + *         ^                ^               ^
> + *         |                |               |
> + *         |        CONTEXT_WORK_START      |
> + * CONTEXT_STATE_START              RCU_DYNTICKS_START

Should the layout be displayed in reverse? Well, at least I always picture
bitmaps in reverse; that's probably due to the direction of the shift arrows.
Not sure what the usual way to picture it is though...

> + */
> +
> +#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
> +
> +#define CONTEXT_STATE_START 0
> +#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)

Since you have non overlapping *_START symbols, perhaps the *_END
are superfluous?

> +
> +#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
> +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
> +#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
> +#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)

Might be the right time to standardize and fix our naming:

CT_STATE_START,
CT_STATE_KERNEL,
CT_STATE_USER,
...
CT_WORK_START,
CT_WORK_*,
...
CT_RCU_DYNTICKS_START,
CT_RCU_DYNTICKS_IDX

> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old;
> +	bool ret = false;
> +
> +	preempt_disable();
> +
> +	old = atomic_read(&ct->state);
> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU no longer accepts any more deferred work
> +	 * - the work has been set
> +	 *
> +	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
> +	 * as they are regular integers rather than bits, but that doesn't
> +	 * matter here: if any of the context state bits is set, the CPU isn't
> +	 * in kernel context.
> +	 */
> +	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)

That may still miss a recent entry to userspace due to the first plain read, ending
with an undesired interrupt.

You need at least one cmpxchg. Well, of course that stays racy by nature because
between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI being raised
and received, the remote CPU may have gone to userspace already. But it still
narrows the window a little.
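
Something in this shape, perhaps - just a sketch, untested, and assuming a
CONTEXT_MASK covering the context state bits: guess a userspace state so
that the very first attempt is already a cmpxchg, and rely on a failed
atomic_try_cmpxchg() refreshing @old with the current value:

	old = atomic_read(&ct->state);
	/* Guess a userspace state so that the first pass is a cmpxchg */
	old = (old & ~CONTEXT_MASK) | CONTEXT_USER;

	do {
		ret = atomic_try_cmpxchg(&ct->state, &old,
					 old | (work << CONTEXT_WORK_START));
		/* A failed cmpxchg wrote the fresh value back into @old */
	} while (!ret && (old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)));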

Thanks.

> +		ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CONTEXT_WORK_START));
> +
> +	preempt_enable();
> +	return ret;
> +}
> +#else
> +static __always_inline void ct_work_flush(unsigned long work) { }
> +static __always_inline void ct_work_clear(struct context_tracking *ct) { }
> +#endif
> +
>  /*
>   * Record entry into an extended quiescent state.  This is only to be
>   * called when not already in an extended quiescent state, that is,
> @@ -88,7 +133,8 @@ static noinstr void ct_kernel_exit_state(int offset)
>  	 * next idle sojourn.
>  	 */
>  	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
> -	seq = ct_state_inc(offset);
> +	seq = ct_state_inc_clear_work(offset);
> +
>  	// RCU is no longer watching.  Better be in extended quiescent state!
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
>  }
> @@ -100,7 +146,7 @@ static noinstr void ct_kernel_exit_state(int offset)
>   */
>  static noinstr void ct_kernel_enter_state(int offset)
>  {
> -	int seq;
> +	unsigned long seq;
>  
>  	/*
>  	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
> @@ -108,6 +154,7 @@ static noinstr void ct_kernel_enter_state(int offset)
>  	 * critical section.
>  	 */
>  	seq = ct_state_inc(offset);
> +	ct_work_flush(seq);
>  	// RCU is now watching.  Better not be in an extended quiescent state!
>  	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index bae8f11070bef..fdb266f2d774b 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
>  	  Say N otherwise, this option brings an overhead that you
>  	  don't want in production.
>  
> +config CONTEXT_TRACKING_WORK
> +	bool
> +	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
> +	default y
> +
>  config NO_HZ
>  	bool "Old Idle dynticks config"
>  	help
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-24 14:52   ` Frederic Weisbecker
@ 2023-07-24 16:55     ` Valentin Schneider
  2023-07-24 19:18       ` Frederic Weisbecker
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-24 16:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 24/07/23 16:52, Frederic Weisbecker wrote:
> Le Thu, Jul 20, 2023 at 05:30:51PM +0100, Valentin Schneider a écrit :
>> +enum ctx_state {
>> +	/* Following are values */
>> +	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
>> +	CONTEXT_KERNEL		= 0,
>> +	CONTEXT_IDLE		= 1,
>> +	CONTEXT_USER		= 2,
>> +	CONTEXT_GUEST		= 3,
>> +	CONTEXT_MAX             = 4,
>> +};
>> +
>> +/*
>> + * We cram three different things within the same atomic variable:
>> + *
>> + *                CONTEXT_STATE_END                        RCU_DYNTICKS_END
>> + *                         |       CONTEXT_WORK_END                |
>> + *                         |               |                       |
>> + *                         v               v                       v
>> + *         [ context_state ][ context work ][ RCU dynticks counter ]
>> + *         ^                ^               ^
>> + *         |                |               |
>> + *         |        CONTEXT_WORK_START      |
>> + * CONTEXT_STATE_START              RCU_DYNTICKS_START
>
> Should the layout be displayed in reverse? Well at least I always picture
> bitmaps in reverse, that's probably due to the direction of the shift arrows.
> Not sure what is the usual way to picture it though...
>

Surprisingly, I managed to confuse myself with that comment :-)  I think I
am subconsciously more used to the reverse as well. I've flipped that and
put "MSB" / "LSB" at either end.
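
Roughly, flipped (a sketch of the v3 comment, not its actual final text):

     MSB                                                      LSB
     [ RCU dynticks counter ][ context work ][ context_state ]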

>> + */
>> +
>> +#define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
>> +
>> +#define CONTEXT_STATE_START 0
>> +#define CONTEXT_STATE_END   (bits_per(CONTEXT_MAX - 1) - 1)
>
> Since you have non overlapping *_START symbols, perhaps the *_END
> are superfluous?
>

They're only really there to tidy up the GENMASK() further down - it keeps
the range and index definitions in one hunk. I tried defining the ranges directly
within the GENMASK() invocations themselves, but it got too ugly IMO.
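
i.e. they let the masks read as plain ranges (sketch; the *_MASK names are
mine, not from the patch):

	#define CONTEXT_STATE_MASK  GENMASK(CONTEXT_STATE_END, CONTEXT_STATE_START)
	#define CONTEXT_WORK_MASK   GENMASK(CONTEXT_WORK_END, CONTEXT_WORK_START)
	#define RCU_DYNTICKS_MASK   GENMASK(RCU_DYNTICKS_END, RCU_DYNTICKS_START)

rather than open-coding the widths inside each GENMASK().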

>> +
>> +#define RCU_DYNTICKS_BITS  (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
>> +#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
>> +#define RCU_DYNTICKS_END   (CT_STATE_SIZE - 1)
>> +#define RCU_DYNTICKS_IDX   BIT(RCU_DYNTICKS_START)
>
> Might be the right time to standardize and fix our naming:
>
> CT_STATE_START,
> CT_STATE_KERNEL,
> CT_STATE_USER,
> ...
> CT_WORK_START,
> CT_WORK_*,
> ...
> CT_RCU_DYNTICKS_START,
> CT_RCU_DYNTICKS_IDX
>

Heh, I have actually already done this for v3, though I hadn't touched the
RCU_DYNTICKS* family. I'll fold that in.

>> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
>> +{
>> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> +	unsigned int old;
>> +	bool ret = false;
>> +
>> +	preempt_disable();
>> +
>> +	old = atomic_read(&ct->state);
>> +	/*
>> +	 * Try setting the work until either
>> +	 * - the target CPU no longer accepts any more deferred work
>> +	 * - the work has been set
>> +	 *
>> +	 * NOTE: CONTEXT_GUEST intersects with CONTEXT_USER and CONTEXT_IDLE
>> +	 * as they are regular integers rather than bits, but that doesn't
>> +	 * matter here: if any of the context state bit is set, the CPU isn't
>> +	 * in kernel context.
>> +	 */
>> +	while ((old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)) && !ret)
>
> That may still miss a recent entry to userspace due to the first plain read, ending
> with an undesired interrupt.
>
> You need at least one cmpxchg. Well, of course that stays racy by nature because
> between the cmpxchg() returning CONTEXT_KERNEL and the actual IPI being raised and
> received, the remote CPU may have gone to userspace already. But it still narrows
> the window a little.
>

I can make that a 'do {} while ()' instead to force at least one execution
of the cmpxchg().

This is only about reducing the race window, right? If we're executing this
just as the target CPU is about to enter userspace, we're going to be in
racy territory anyway. Regardless, I'm happy to do that change.
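
For the curious reader, the retry narrows the window because a failed
atomic_try_cmpxchg() writes the current value back into @old, so every
attempt after the first re-evaluates fresh state - only the initial plain
read can be stale. A minimal sketch of the pattern (not the actual v3 hunk):

	old = atomic_read(&ct->state);		/* may be stale */
	do {
		/* Bail out once the target is seen in kernel context */
		if (!(old & (CONTEXT_GUEST | CONTEXT_USER | CONTEXT_IDLE)))
			break;
		/* On failure, @old is updated with the current value */
		ret = atomic_try_cmpxchg(&ct->state, &old,
					 old | (work << CONTEXT_WORK_START));
	} while (!ret);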


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-24 11:32     ` Valentin Schneider
@ 2023-07-24 17:40       ` Dave Hansen
  2023-07-25 13:21         ` Peter Zijlstra
  2023-07-25 16:37         ` Marcelo Tosatti
  0 siblings, 2 replies; 76+ messages in thread
From: Dave Hansen @ 2023-07-24 17:40 UTC (permalink / raw)
  To: Valentin Schneider, Nadav Amit
  Cc: Linux Kernel Mailing List, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, the arch/x86 maintainers, rcu, linux-kselftest,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky

On 7/24/23 04:32, Valentin Schneider wrote:
> AFAICT the only reasonable way to go about the deferral is to prove that no
> such access happens before the deferred @operation is done. We got to prove
> that for sync_core() deferral, cf. PATCH 18.
> 
> I'd like to reason about it for deferring vunmap TLB flushes:
> 
> What addresses in VMAP range, other than the stack, can early entry code
> access? Yes, the ranges can be checked at runtime, but is there any chance
> of figuring this out e.g. at build-time?

Nadav was touching on a very important point: TLB flushes for addresses
are relatively easy to defer.  You just need to ensure that the CPU
deferring the flush does an actual flush before it might architecturally
consume the contents of the flushed entry.
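
Translated to this series' mechanism, the deferred flush would run on the
target's next kernel entry, e.g. as another case in the quoted
arch_context_tracking_work() (sketch only; the CONTEXT_WORK_TLBI bit and the
choice of a global flush are my assumptions, not the actual patch):

	case CONTEXT_WORK_TLBI:
		/*
		 * Runs noinstr, early on kernel entry: flush before anything
		 * can architecturally consume a stale kernel TLB entry.
		 * Kernel mappings are typically global, so a plain CR3 write
		 * wouldn't evict them.
		 */
		flush_tlb_global();
		break;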

TLB flushes for freed page tables are another game entirely.  The CPU is
free to cache any part of the paging hierarchy it wants at any time.
It's also free to set accessed and dirty bits at any time, even for
instructions that may never execute architecturally.

That basically means that if you have *ANY* freed page table page
*ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
screwed.

There's no reasoning about accesses or ordering.  As soon as the CPU
does *anything*, it's out to get you.

You're going to need to do something a lot more radical to deal with
free page table pages.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-24 16:55     ` Valentin Schneider
@ 2023-07-24 19:18       ` Frederic Weisbecker
  2023-07-25 10:10         ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Frederic Weisbecker @ 2023-07-24 19:18 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
> I can make that a 'do {} while ()' instead to force at least one execution
> of the cmpxchg().
> 
> This is only about reducing the race window, right? If we're executing this
> just as the target CPU is about to enter userspace, we're going to be in
> racy territory anyway. Regardless, I'm happy to do that change.

Right, it's only about narrowing down the race window. It probably doesn't matter
in practice, but it's one less thing to consider for the brain :-)

Also, why bothering with handling CONTEXT_IDLE?

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-24 19:18       ` Frederic Weisbecker
@ 2023-07-25 10:10         ` Valentin Schneider
  2023-07-25 11:22           ` Frederic Weisbecker
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 10:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 24/07/23 21:18, Frederic Weisbecker wrote:
> On Mon, Jul 24, 2023 at 05:55:44PM +0100, Valentin Schneider wrote:
>> I can make that a 'do {} while ()' instead to force at least one execution
>> of the cmpxchg().
>>
>> This is only about reducing the race window, right? If we're executing this
>> just as the target CPU is about to enter userspace, we're going to be in
>> racy territory anyway. Regardless, I'm happy to do that change.
>
> Right, it's only about narrowing down the race window. It probably doesn't matter
> in practice, but it's one less thing to consider for the brain :-)
>

Ack

> Also, why bothering with handling CONTEXT_IDLE?
>

I have reasons! I just swept them under the rug and didn't mention them :D
Also, looking at the config dependencies again, I realize I got them wrong, but
nevertheless that means I get to ramble about it.

With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
transitions:

  ct_idle_enter()
    ct_kernel_exit()
      ct_state_inc_clear_work()

  ct_idle_exit()
    ct_kernel_enter()
      ct_work_flush()

Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.

It's a completely different argument than reducing interference for
NOHZ_FULL userspace applications and I should have at the very least
mentioned it in the cover letter, but it's the exact same backing
mechanism.

Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
patch with a proper changelog.
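
IOW, relative to the Kconfig hunk quoted earlier, something like this
(sketch, not the actual v3 hunk):

	config CONTEXT_TRACKING_WORK
		bool
		depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE
		default y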


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
@ 2023-07-25 10:49   ` Joel Fernandes
  2023-07-25 13:36     ` Valentin Schneider
  2023-07-25 13:39     ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-07-25 10:49 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Peter Zijlstra,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

Interesting series Valentin. Some high-level question/comments on this one:

> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
> 
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.

Does the amount of harm not correspond to practical frequency of text_poke? 
How often does instruction patching really happen? If it is very infrequent
then I am not sure if it is that harmful.

> 
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

If it is not infrequent, then are you handling the case where userland
spends multiple seconds before entering the kernel, and all this while
the blocking call waits? Perhaps in such situation you want the real IPI
to be sent out instead of the deferred one?

thanks,

 - Joel


> 
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - all sorts of probes (kprobes/ftrace/bpf/???)
> 
> The early entry code leading to ct_work_flush() is noinstr, which gets rid
> of the probes.
> 
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
> 
> This leaves us with static keys. Any static key used in early entry code
> should be only forever-enabled at boot time, IOW __ro_after_init (pretty
> much like alternatives). Objtool is now able to point at static keys that
> don't respect this, and all static keys used in early entry code have now
> been verified as behaving like so.
> 
> Leverage the new context_tracking infrastructure to defer sync_core() IPIs
> to a target CPU's next kernel entry.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> arch/x86/include/asm/context_tracking_work.h |  6 +++--
> arch/x86/include/asm/text-patching.h         |  1 +
> arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
> arch/x86/kernel/kprobes/core.c               |  4 ++--
> arch/x86/kernel/kprobes/opt.c                |  4 ++--
> arch/x86/kernel/module.c                     |  2 +-
> include/linux/context_tracking_work.h        |  4 ++--
> 7 files changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
> index 5bc29e6b2ed38..2c66687ce00e2 100644
> --- a/arch/x86/include/asm/context_tracking_work.h
> +++ b/arch/x86/include/asm/context_tracking_work.h
> @@ -2,11 +2,13 @@
> #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
> #define _ASM_X86_CONTEXT_TRACKING_WORK_H
> 
> +#include <asm/sync_core.h>
> +
> static __always_inline void arch_context_tracking_work(int work)
> {
>    switch (work) {
> -    case CONTEXT_WORK_n:
> -        // Do work...
> +    case CONTEXT_WORK_SYNC:
> +        sync_core();
>        break;
>    }
> }
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index 29832c338cdc5..b6939e965e69d 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
>  */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void text_poke_sync(void);
> +extern void text_poke_sync_deferrable(void);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 72646d75b6ffe..fcce480e1919e 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -18,6 +18,7 @@
> #include <linux/mmu_context.h>
> #include <linux/bsearch.h>
> #include <linux/sync_core.h>
> +#include <linux/context_tracking.h>
> #include <asm/text-patching.h>
> #include <asm/alternative.h>
> #include <asm/sections.h>
> @@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
>    sync_core();
> }
> 
> +static bool do_sync_core_defer_cond(int cpu, void *info)
> +{
> +    return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
> +}
> +
> +static void __text_poke_sync(smp_cond_func_t cond_func)
> +{
> +    on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
> +}
> +
> void text_poke_sync(void)
> {
> -    on_each_cpu(do_sync_core, NULL, 1);
> +    __text_poke_sync(NULL);
> +}
> +
> +void text_poke_sync_deferrable(void)
> +{
> +    __text_poke_sync(do_sync_core_defer_cond);
> }
> 
> /*
> @@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>        text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
>    }
> 
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
> 
>    /*
>     * Second step: update all but the first byte of the patched range.
> @@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>         * not necessary and we'd be safe even without it. But
>         * better safe than sorry (plus there's not only Intel).
>         */
> -        text_poke_sync();
> +        text_poke_sync_deferrable();
>    }
> 
>    /*
> @@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
>    }
> 
>    if (do_sync)
> -        text_poke_sync();
> +        text_poke_sync_deferrable();
> 
>    /*
>     * Remove and wait for refs to be zero.
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index f7f6042eb7e6c..a38c914753397 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
>    u8 int3 = INT3_INSN_OPCODE;
> 
>    text_poke(p->addr, &int3, 1);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
>    perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
> }
> 
> @@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
> 
>    perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
>    text_poke(p->addr, &p->opcode, 1);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
> }
> 
> void arch_remove_kprobe(struct kprobe *p)
> diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
> index 57b0037d0a996..88451a744ceda 100644
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
>           JMP32_INSN_SIZE - INT3_INSN_SIZE);
> 
>    text_poke(addr, new, INT3_INSN_SIZE);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
>    text_poke(addr + INT3_INSN_SIZE,
>          new + INT3_INSN_SIZE,
>          JMP32_INSN_SIZE - INT3_INSN_SIZE);
> -    text_poke_sync();
> +    text_poke_sync_deferrable();
> 
>    perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
> }
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index b05f62ee2344b..8b4542dc51b6d 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
>                   write, apply);
> 
>    if (!early) {
> -        text_poke_sync();
> +        text_poke_sync_deferrable();
>        mutex_unlock(&text_mutex);
>    }
> 
> diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
> index fb74db8876dd2..13fc97b395030 100644
> --- a/include/linux/context_tracking_work.h
> +++ b/include/linux/context_tracking_work.h
> @@ -5,12 +5,12 @@
> #include <linux/bitops.h>
> 
> enum {
> -    CONTEXT_WORK_n_OFFSET,
> +    CONTEXT_WORK_SYNC_OFFSET,
>    CONTEXT_WORK_MAX_OFFSET
> };
> 
> enum ct_work {
> -    CONTEXT_WORK_n        = BIT(CONTEXT_WORK_n_OFFSET),
> +    CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
>    CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
> };
> 
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-25 10:10         ` Valentin Schneider
@ 2023-07-25 11:22           ` Frederic Weisbecker
  2023-07-25 13:05             ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Frederic Weisbecker @ 2023-07-25 11:22 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
> I have reasons! I just swept them under the rug and didn't mention them :D
> Also, looking at the config dependencies again, I realize I got them wrong, but
> nevertheless that means I get to ramble about it.
> 
> With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
> transitions:
> 
>   ct_idle_enter()
>     ct_kernel_exit()
>       ct_state_inc_clear_work()
> 
>   ct_idle_exit()
>     ct_kernel_enter()
>       ct_work_flush()
> 
> Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
> rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
> NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
> 
> It's a completely different argument than reducing interference for
> NOHZ_FULL userspace applications and I should have at the very least
> mentioned it in the cover letter, but it's the exact same backing
> mechanism.
> 
> Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
> patch with a proper changelog.

Ok, should that be a separate Kconfig? This indeed can bring a power improvement,
but at the cost of more overhead on the sender side. A balance to be measured...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure
  2023-07-25 11:22           ` Frederic Weisbecker
@ 2023-07-25 13:05             ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 13:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Nicolas Saenz Julienne,
	Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 25/07/23 13:22, Frederic Weisbecker wrote:
> On Tue, Jul 25, 2023 at 11:10:31AM +0100, Valentin Schneider wrote:
>> I have reasons! I just swept them under the rug and didn't mention them :D
>> Also, looking at the config dependencies again, I realize I got them wrong, but
>> nevertheless that means I get to ramble about it.
>>
>> With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these
>> transitions:
>>
>>   ct_idle_enter()
>>     ct_kernel_exit()
>>       ct_state_inc_clear_work()
>>
>>   ct_idle_exit()
>>     ct_kernel_enter()
>>       ct_work_flush()
>>
>> Now, if we just make CONTEXT_TRACKING_WORK depend on CONTEXT_TRACKING_IDLE
>> rather than CONTEXT_TRACKING_USER, we get to leverage the IPI deferral for
>> NO_HZ_IDLE kernels - in other words, we get to keep idle CPUs idle longer.
>>
>> It's a completely different argument than reducing interference for
>> NOHZ_FULL userspace applications and I should have at the very least
>> mentioned it in the cover letter, but it's the exact same backing
>> mechanism.
>>
>> Looking at it again, I'll probably make the CONTEXT_IDLE thing a separate
>> patch with a proper changelog.
>
> Ok, should that be a separate Kconfig? This indeed can bring a power improvement,
> but at the cost of more overhead on the sender side. A balance to be measured...

Yep agreed, I'll make that an optional config.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-24 17:40       ` Dave Hansen
@ 2023-07-25 13:21         ` Peter Zijlstra
  2023-07-25 14:03           ` Valentin Schneider
  2023-07-25 16:37         ` Marcelo Tosatti
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-25 13:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
	linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	the arch/x86 maintainers, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:

> TLB flushes for freed page tables are another game entirely.  The CPU is
> free to cache any part of the paging hierarchy it wants at any time.
> It's also free to set accessed and dirty bits at any time, even for
> instructions that may never execute architecturally.
> 
> That basically means that if you have *ANY* freed page table page
> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
> screwed.
> 
> There's no reasoning about accesses or ordering.  As soon as the CPU
> does *anything*, it's out to get you.
> 
> You're going to need to do something a lot more radical to deal with
> free page table pages.

Ha! IIRC the only thing we can reasonably do there is to have strict
per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
as long as the per-cpu tables do not contain -- and have never contained --
a particular table page, we can avoid flushing it. Because if it was never
there, the CPU also couldn't have speculatively loaded it.

Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
have done them ages ago) and doing them is going to be *major* surgery
and pain.

Other than that, we must take the TLBI-IPI when freeing
page-table-pages.


But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
at least, I couldn't find it in a hurry either), but if we're going to
be doing this, then that file must include a very prominent comment
explaining it must never actually do so either.

Not being able to free page-tables might be a 'problem' if we're going
to be doing more of HUGE_VMALLOC, because that means it becomes rather
hard to swizzle from small to large pages.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 10:49   ` Joel Fernandes
@ 2023-07-25 13:36     ` Valentin Schneider
  2023-07-25 17:41       ` Joel Fernandes
  2023-07-25 13:39     ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 13:36 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Peter Zijlstra,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 25/07/23 06:49, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
>
>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>> them vs the newly patched instruction. CPUs that are executing in userspace
>> do not need this synchronization to happen immediately, and this is
>> actually harmful interference for NOHZ_FULL CPUs.
>
> Does the amount of harm not correspond to practical frequency of text_poke?
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.
>

Being pushed over a latency threshold *once* is enough to impact the
latency evaluation of your given system/application.

It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the
housekeeping CPUs may be up to (flipping static keys, loading kprobes,
using ftrace...) - frequency of the interference isn't such a big part of
the reasoning.

>>
>> As the synchronization IPIs are sent using a blocking call, returning from
>> text_poke_bp_batch() implies all CPUs will observe the patched
>> instruction(s), and this should be preserved even if the IPI is deferred.
>> In other words, to safely defer this synchronization, any kernel
>> instruction leading to the execution of the deferred instruction
>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such situation you want the real IPI
> to be sent out instead of the deferred one?
>

The blocking call only waits for CPUs for which it queued a CSD. Deferred
calls do not queue a CSD and thus do not impact the waiting at all. See
smp_call_function_many_cond().
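
Roughly (heavily simplified from kernel/smp.c, not the real code):

	for_each_cpu(cpu, mask) {
		/* A deferred CPU fails the condition: no CSD is queued */
		if (cond_func && !cond_func(cpu, info))
			continue;

		/* ...queue the CSD for @cpu, maybe ring the IPI... */
	}
	/* ...and the wait loop below only spins on CSDs queued above */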


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 10:49   ` Joel Fernandes
  2023-07-25 13:36     ` Valentin Schneider
@ 2023-07-25 13:39     ` Peter Zijlstra
  2023-07-25 17:47       ` Joel Fernandes
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-25 13:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
> Interesting series Valentin. Some high-level question/comments on this one:
> 
> > On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
> > 
> > text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> > them vs the newly patched instruction. CPUs that are executing in userspace
> > do not need this synchronization to happen immediately, and this is
> > actually harmful interference for NOHZ_FULL CPUs.
> 
> Does the amount of harm not correspond to practical frequency of text_poke? 
> How often does instruction patching really happen? If it is very infrequent
> then I am not sure if it is that harmful.

Well, it can happen quite a bit, also from things people would not
typically 'expect' it from.

For instance, the moment you create the first per-task perf event we
frob some jump-labels (and again a second or so after the last one goes
away).

The same for a bunch of runtime network configurations.
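
A simplified sketch of that first case (identifiers illustrative, not the
exact perf internals):

	/*
	 * Creating the first per-task event enables a static key; flipping
	 * it rewrites kernel text via text_poke_bp(), which ends in the
	 * sync IPIs discussed here.
	 */
	if (atomic_inc_return(&nr_per_task_events) == 1)
		static_branch_enable(&per_task_events_key);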

> > As the synchronization IPIs are sent using a blocking call, returning from
> > text_poke_bp_batch() implies all CPUs will observe the patched
> > instruction(s), and this should be preserved even if the IPI is deferred.
> > In other words, to safely defer this synchronization, any kernel
> > instruction leading to the execution of the deferred instruction
> > sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
> 
> If it is not infrequent, then are you handling the case where userland
> spends multiple seconds before entering the kernel, and all this while
> the blocking call waits? Perhaps in such situation you want the real IPI
> to be sent out instead of the deferred one?

Please re-read what Valentin wrote -- nobody is waiting on anything.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-25 13:21         ` Peter Zijlstra
@ 2023-07-25 14:03           ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-25 14:03 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Hansen
  Cc: Nadav Amit, Linux Kernel Mailing List, linux-trace-kernel,
	linux-doc, kvm, linux-mm, bpf, the arch/x86 maintainers, rcu,
	linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Dan Carpenter,
	Chuang Wang, Yang Jihong, Petr Mladek, Jason A. Donenfeld,
	Song Liu, Julian Pidancet, Tom Lendacky, Dionna Glaze,
	Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
	Marcelo Tosatti, Yair Podemsky


Sorry, I missed Dave's email, so now I'm taking my time to page (hah!)
all of this in.

On 25/07/23 15:21, Peter Zijlstra wrote:
> On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
>
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
>> It's also free to set accessed and dirty bits at any time, even for
>> instructions that may never execute architecturally.
>>
>> That basically means that if you have *ANY* freed page table page
>> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
>> screwed.
>>
>> There's no reasoning about accesses or ordering.  As soon as the CPU
>> does *anything*, it's out to get you.
>>

OK, I feel like I need to go back and do some more reading now, but I think I
get the difference. Thanks for spelling it out.

>> You're going to need to do something a lot more radical to deal with
>> free page table pages.
>
> Ha! IIRC the only thing we can reasonably do there is to have strict
> per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
> as long as the per-cpu tables do not contain -- and have never contained --
> a particular table page, we can avoid flushing it. Because if it was never
> there, the CPU also couldn't have speculatively loaded it.
>
> Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
> have done them ages ago) and doing them is going to be *major* surgery
> and pain.
>
> Other than that, we must take the TLBI-IPI when freeing
> page-table-pages.
>
>
> But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
> at least, I couldn't find it in a hurry either), but if we're going to
> be doing this, then that file must include a very prominent comment
> explaining it must never actually do so either.
>

I also couldn't find any freeing of the page-table-pages, I'll do another
pass and sharpen my quill for a big fat comment.
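
Something along these lines, presumably (wording mine, not from the series):

	/*
	 * IMPORTANT: vmalloc/vunmap must never free page-table pages.
	 * NOHZ_FULL CPUs can defer kernel-range TLB flushes to their next
	 * kernel entry, which is only sound for stale *leaf* entries; a
	 * CPU may cache (and speculatively walk) freed page-table pages
	 * at any time, leaving no flush-before-use point to reason about.
	 */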

> Not being able to free page-tables might be a 'problem' if we're going
> to be doing more of HUGE_VMALLOC, because that means it becomes rather
> hard to swizzle from small to large pages.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-24 17:40       ` Dave Hansen
  2023-07-25 13:21         ` Peter Zijlstra
@ 2023-07-25 16:37         ` Marcelo Tosatti
  2023-07-25 17:12           ` Dave Hansen
  1 sibling, 1 reply; 76+ messages in thread
From: Marcelo Tosatti @ 2023-07-25 16:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
	linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	the arch/x86 maintainers, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Yair Podemsky

On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
> On 7/24/23 04:32, Valentin Schneider wrote:
> > AFAICT the only reasonable way to go about the deferral is to prove that no
> > such access happens before the deferred @operation is done. We got to prove
> > that for sync_core() deferral, cf. PATCH 18.
> > 
> > I'd like to reason about it for deferring vunmap TLB flushes:
> > 
> > What addresses in VMAP range, other than the stack, can early entry code
> > access? Yes, the ranges can be checked at runtime, but is there any chance
> > of figuring this out e.g. at build-time?
> 
> Nadav was touching on a very important point: TLB flushes for addresses
> are relatively easy to defer.  You just need to ensure that the CPU
> deferring the flush does an actual flush before it might architecturally
> consume the contents of the flushed entry.
> 
> TLB flushes for freed page tables are another game entirely.  The CPU is
> free to cache any part of the paging hierarchy it wants at any time.

Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes the TLB (and page
table caches) on user->kernel and kernel->user context switches?

So freeing a kernel pagetable page does not require interrupting a CPU
which is in userspace (and which therefore has no visibility into kernel
pagetables).

> It's also free to set accessed and dirty bits at any time, even for
> instructions that may never execute architecturally.
> 
> That basically means that if you have *ANY* freed page table page
> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
> screwed.
> 
> There's no reasoning about accesses or ordering.  As soon as the CPU
> does *anything*, it's out to get you.
> 
> You're going to need to do something a lot more radical to deal with
> free page table pages.




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  2023-07-25 16:37         ` Marcelo Tosatti
@ 2023-07-25 17:12           ` Dave Hansen
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Hansen @ 2023-07-25 17:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Valentin Schneider, Nadav Amit, Linux Kernel Mailing List,
	linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	the arch/x86 maintainers, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Yair Podemsky

On 7/25/23 09:37, Marcelo Tosatti wrote:
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
> Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes the TLB (and page
> table caches) on user->kernel and kernel->user context switches?

Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at
all on user<->kernel switches when PCIDs are enabled.

Second, even if it did, the CPU is still free to cache any portion of
the paging hierarchy at any time.  Without LASS[1], userspace can even
_compel_ walks of the kernel portion of the address space, and we don't
have any infrastructure to tell if a freed kernel page is exposed in the
user copy of the page tables with PTI.

Third, (also ignoring PCIDs) there are plenty of instructions between
kernel entry and the MOV-to-CR3 that can flush the TLB.  All those
instructions are architecturally permitted to speculatively set Accessed or
Dirty bits in any part of the address space.  If they run into a freed
page table page, things get ugly.

These accesses are not _likely_.  There probably isn't a predictor out
there that's going to see a:

	movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

and go off trying to dirty memory in the vmalloc() area.  But we'd need
some backward *and* forward-looking guarantees from our intrepid CPU
designers to promise that this kind of thing is safe yesterday, today
and tomorrow.  I suspect such a guarantee is going to be hard to obtain.

1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@intel.com



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 13:36     ` Valentin Schneider
@ 2023-07-25 17:41       ` Joel Fernandes
  0 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-07-25 17:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Peter Zijlstra,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky



On 7/25/23 09:36, Valentin Schneider wrote:
> On 25/07/23 06:49, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to practical frequency of text_poke?
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
>>
> 
> Being pushed over a latency threshold *once* is enough to impact the
> latency evaluation of your given system/application.
> 
> It's mainly about shielding the isolated, NOHZ_FULL CPUs from whatever the
> housekeeping CPUs may be up to (flipping static keys, loading kprobes,
> using ftrace...) - frequency of the interference isn't such a big part of
> the reasoning.

Makes sense.

>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such situation you want the real IPI
>> to be sent out instead of the deferred one?
>>
> 
> The blocking call only waits for CPUs for which it queued a CSD. Deferred
> calls do not queue a CSD and thus do not impact the waiting at all. See
> smp_call_function_many_cond().

Ah I see you are using on_each_cpu_cond(). I should have gone through
the other patch before making noise.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
  2023-07-25 13:39     ` Peter Zijlstra
@ 2023-07-25 17:47       ` Joel Fernandes
  0 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-07-25 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest,
	Nicolas Saenz Julienne, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes,
	Josh Poimboeuf, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky



On 7/25/23 09:39, Peter Zijlstra wrote:
> On Tue, Jul 25, 2023 at 06:49:45AM -0400, Joel Fernandes wrote:
>> Interesting series Valentin. Some high-level question/comments on this one:
>>
>>> On Jul 20, 2023, at 12:34 PM, Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
>>> them vs the newly patched instruction. CPUs that are executing in userspace
>>> do not need this synchronization to happen immediately, and this is
>>> actually harmful interference for NOHZ_FULL CPUs.
>>
>> Does the amount of harm not correspond to practical frequency of text_poke? 
>> How often does instruction patching really happen? If it is very infrequent
>> then I am not sure if it is that harmful.
> 
> Well, it can happen quite a bit, also from things people would not
> typically 'expect' it from.
> 
> For instance, the moment you create the first per-task perf event we
> frob some jump-labels (and again a second or so after the last one goes
> away).
> 
> The same for a bunch of runtime network configurations.

Ok cool. I guess I still have memories of that old ARM device I had
where modifications to kernel text were forbidden by hardware (it was a
security feature). That made kprobes unusable...

>>> As the synchronization IPIs are sent using a blocking call, returning from
>>> text_poke_bp_batch() implies all CPUs will observe the patched
>>> instruction(s), and this should be preserved even if the IPI is deferred.
>>> In other words, to safely defer this synchronization, any kernel
>>> instruction leading to the execution of the deferred instruction
>>> sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
>>
>> If it is not infrequent, then are you handling the case where userland
>> spends multiple seconds before entering the kernel, and all this while
>> the blocking call waits? Perhaps in such situation you want the real IPI
>> to be sent out instead of the deferred one?
> 
> Please re-read what Valentin wrote -- nobody is waiting on anything.

Makes sense. To be fair I received his email 3 minutes before yours ;-).
But thank you both for clarifying!

 - Joel



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-20 16:30 ` [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
@ 2023-07-26 19:41   ` Josh Poimboeuf
  2023-07-27  9:46     ` Valentin Schneider
  2023-07-29 19:09     ` Steven Rostedt
  0 siblings, 2 replies; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-26 19:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>  int filter_assign_type(const char *type)
>  {
> -	if (strstr(type, "__data_loc") && strstr(type, "char"))
> -		return FILTER_DYN_STRING;
> +	if (strstr(type, "__data_loc")) {
> +		if (strstr(type, "char"))
> +			return FILTER_DYN_STRING;
> +		if (strstr(type, "cpumask_t"))
> +			return FILTER_CPUMASK;
> +		}

The closing bracket has the wrong indentation.
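
i.e. the brace wants to sit at the depth of the if () it closes:

	if (strstr(type, "__data_loc")) {
		if (strstr(type, "char"))
			return FILTER_DYN_STRING;
		if (strstr(type, "cpumask_t"))
			return FILTER_CPUMASK;
	}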

> +		/* Copy the cpulist between { and } */
> +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);

Need to check kmalloc() failure?  And also free tmp?
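
A minimal fix could look like this (sketch; the err_mem/err_free labels are
assumed from the surrounding function):

		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
		if (!tmp)
			goto err_mem;
		strscpy(tmp, str + maskstart, (i - maskstart) + 1);

		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
		if (!pred->mask) {
			kfree(tmp);
			goto err_mem;
		}

		/* Now parse it */
		if (cpulist_parse(tmp, pred->mask)) {
			kfree(tmp);
			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
			goto err_free;
		}
		kfree(tmp);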

> +
> +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> +		if (!pred->mask)
> +			goto err_mem;
> +
> +		/* Now parse it */
> +		if (cpulist_parse(tmp, pred->mask)) {
> +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> +			goto err_free;
> +		}
> +
> +		/* Move along */
> +		i++;
> +		if (field->filter_type == FILTER_CPUMASK)
> +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
> +

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-26 19:41   ` Josh Poimboeuf
@ 2023-07-27  9:46     ` Valentin Schneider
  2023-07-29 19:09     ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-27  9:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 26/07/23 12:41, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>>  int filter_assign_type(const char *type)
>>  {
>> -	if (strstr(type, "__data_loc") && strstr(type, "char"))
>> -		return FILTER_DYN_STRING;
>> +	if (strstr(type, "__data_loc")) {
>> +		if (strstr(type, "char"))
>> +			return FILTER_DYN_STRING;
>> +		if (strstr(type, "cpumask_t"))
>> +			return FILTER_CPUMASK;
>> +		}
>
> The closing bracket has the wrong indentation.
>
>> +		/* Copy the cpulist between { and } */
>> +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>
> Need to check kmalloc() failure?  And also free tmp?
>

Heh, indeed, shoddy that :-)

Thanks!

>> +
>> +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> +		if (!pred->mask)
>> +			goto err_mem;
>> +
>> +		/* Now parse it */
>> +		if (cpulist_parse(tmp, pred->mask)) {
>> +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> +			goto err_free;
>> +		}
>> +
>> +		/* Move along */
>> +		i++;
>> +		if (field->filter_type == FILTER_CPUMASK)
>> +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> +
>
> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-20 16:30 ` [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
@ 2023-07-28 15:33   ` Josh Poimboeuf
  2023-07-31 11:16     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 15:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
> I had to look into objtool itself to understand what this warning was
> about; make it more explicit.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  tools/objtool/check.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 8936a05f0e5ac..d308330f2910e 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
>  
>  	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
>  		if (!target->sec->noinstr) {
> -			WARN("pv_ops[%d]: %s", idx, target->name);
> +			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
>  			file->pv_ops[idx].clean = false;

This is an improvement, though I think it still results in two warnings,
with the second not-so-useful warning happening in validate_call().

Ideally it would only show a single warning, I guess that would need a
little bit of restructuring the code.

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
@ 2023-07-28 15:35   ` Josh Poimboeuf
  2023-07-31 11:18     ` Valentin Schneider
  2023-07-28 16:02   ` Josh Poimboeuf
  1 sibling, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 15:35 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
> +static int validate_static_key(struct instruction *insn, struct insn_state *state)
> +{
> +	if (state->noinstr && state->instr <= 0) {
> +		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
> +			WARN_INSN(insn,
> +				  "Non __ro_after_init static key \"%s\" in .noinstr section",

For consistency with other warnings, this should start with a lowercase
"n" and the string literal should be on the same line as the WARN_INSN,
like

			WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section",
				  ...

> diff --git a/tools/objtool/special.c b/tools/objtool/special.c
> index 91b1950f5bd8a..1f76cfd815bf3 100644
> --- a/tools/objtool/special.c
> +++ b/tools/objtool/special.c
> @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
>  			return -1;
>  		}
>  		alt->key_addend = reloc_addend(key_reloc);
> +
> +		reloc_to_sec_off(key_reloc, &sec, &offset);
> +		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);

Bits 0 and 1 can both store data, should be ~3?

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init
  2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
@ 2023-07-28 16:00   ` Josh Poimboeuf
  2023-07-31 11:16     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 16:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
> objtool now warns about it:
> 
>   vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>   [...]
> 
> The key can only be enabled (and not disabled) in the __init function
> ct_cpu_tracker_user(), so mark it as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

It's best to avoid temporarily introducing warnings.  Bots will
rightfully complain about that.  This patch and the next one should come
before the objtool patches.

Also it would be helpful for the commit log to have a brief
justification for the patch beyond "fix the objtool warning".  Something
roughly like:

  Soon, runtime-mutable text won't be allowed in .noinstr sections, so
  that a code patching IPI to a userspace-bound CPU can be safely
  deferred to the next kernel entry.

  'context_tracking_key' is only enabled in __init ct_cpu_tracker_user().
  Mark it as __ro_after_init.

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
  2023-07-28 15:35   ` Josh Poimboeuf
@ 2023-07-28 16:02   ` Josh Poimboeuf
  2023-07-31 11:18     ` Valentin Schneider
  1 sibling, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-28 16:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
> Later commits will depend on having no runtime-mutable text in early entry
> code. (ab)use the .noinstr section as a marker of early entry code and warn
> about static keys used in it that can be flipped at runtime.

Similar to my comment on patch 13, this could also use a short
justification for adding the feature, i.e. why runtime-mutable text
isn't going to be allowed in .noinstr.

Also, please add a short description of the warning (and why it exists)
to tools/objtool/Documentation/objtool.txt.

-- 
Josh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
  2023-07-20 16:30 ` [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider
@ 2023-07-28 22:04   ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-28 22:04 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, Jul 20, 2023 at 05:30:46PM +0100, Valentin Schneider wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> When a static_key is marked ro_after_init, its state will never change
> (after init), therefore jump_label_update() will never need to iterate
> the entries, and thus module load won't actually need to track this --
> avoiding the static_key::next write.
> 
> Therefore, mark these keys such that jump_label_add_module() might
> recognise them and avoid the modification.
> 
> Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
> to denote such keys.
> 
> Link: http://lore.kernel.org/r/20230705204142.GB2813335@hirez.programming.kicks-ass.net
> NOT-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> @Peter: I've barely touched this patch, it's just been writing a comment
> and fixing benign compilation issues, so credit's all yours really!

Ah, it works? Excellent! You can remove the NOT from the SoB then ;-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-26 19:41   ` Josh Poimboeuf
  2023-07-27  9:46     ` Valentin Schneider
@ 2023-07-29 19:09     ` Steven Rostedt
  2023-07-31 11:19       ` Valentin Schneider
  1 sibling, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-07-29 19:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Wed, 26 Jul 2023 12:41:48 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
> >  int filter_assign_type(const char *type)
> >  {
> > -	if (strstr(type, "__data_loc") && strstr(type, "char"))
> > -		return FILTER_DYN_STRING;
> > +	if (strstr(type, "__data_loc")) {
> > +		if (strstr(type, "char"))
> > +			return FILTER_DYN_STRING;
> > +		if (strstr(type, "cpumask_t"))
> > +			return FILTER_CPUMASK;
> > +		}  
> 
> The closing bracket has the wrong indentation.
> 
> > +		/* Copy the cpulist between { and } */
> > +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> > +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);  
> 
> Need to check kmalloc() failure?  And also free tmp?

I came to state the same thing.

Also, when you do an empty for loop:

	for (; str[i] && str[i] != '}'; i++);

Always put the semicolon on the next line, otherwise it is really easy
to think that the next line is part of the for loop. That is, instead
of the above, do:

	for (; str[i] && str[i] != '}'; i++)
		;


-- Steve


> 
> > +
> > +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> > +		if (!pred->mask)
> > +			goto err_mem;
> > +
> > +		/* Now parse it */
> > +		if (cpulist_parse(tmp, pred->mask)) {
> > +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> > +			goto err_free;
> > +		}
> > +
> > +		/* Move along */
> > +		i++;
> > +		if (field->filter_type == FILTER_CPUMASK)
> > +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
> > +  
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  2023-07-20 16:30 ` [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU Valentin Schneider
@ 2023-07-29 19:34   ` Steven Rostedt
  2023-07-31 11:20     ` Valentin Schneider
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-07-29 19:34 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, 20 Jul 2023 17:30:41 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

>  		/* Move along */
>  		i++;
> +
> +		/*
> +		 * Optimisation: if the user-provided mask has a weight of one
> +		 * then we can treat it as a scalar input.
> +		 */
> +		single = cpumask_weight(pred->mask) == 1;
> +		if (single && field->filter_type == FILTER_CPUMASK) {
> +			pred->val = cpumask_first(pred->mask);
> +			kfree(pred->mask);

Don't we need:
			pred->mask = NULL;

here, or the free_predicate() will cause a double free?
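
I.e. the hunk would roughly become (same code as above, plus the NULL
assignment so free_predicate() doesn't kfree() the mask a second time):

		single = cpumask_weight(pred->mask) == 1;
		if (single && field->filter_type == FILTER_CPUMASK) {
			pred->val = cpumask_first(pred->mask);
			kfree(pred->mask);
			pred->mask = NULL;
		}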

-- Steve

> +		}
> +

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-20 16:30 ` [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the " Valentin Schneider
@ 2023-07-29 19:55   ` Steven Rostedt
  2023-07-31 11:20     ` Valentin Schneider
  2023-07-31 12:07     ` Dan Carpenter
  0 siblings, 2 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-07-29 19:55 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, 20 Jul 2023 17:30:42 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> Steven noted that when the user-provided cpumask contains a single CPU,
> then the filtering function can use a scalar as input instead of a
> full-fledged cpumask.
> 
> When the mask contains a single CPU, directly re-use the unsigned field
> predicate functions. Transform '&' into '==' beforehand.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/trace/trace_events_filter.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 2fe65ddeb34ef..54d642fabb7f1 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
>  		 * then we can treat it as a scalar input.
>  		 */
>  		single = cpumask_weight(pred->mask) == 1;
> -		if (single && field->filter_type == FILTER_CPUMASK) {
> +		if (single && field->filter_type != FILTER_CPU) {
>  			pred->val = cpumask_first(pred->mask);
>  			kfree(pred->mask);
>  		}
> @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
>  				FILTER_PRED_FN_CPUMASK;
>  		} else if (field->filter_type == FILTER_CPU) {
>  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> +		} else if (single) {
> +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;

Nit, the above can be written as:

			pred->op = pret->op != OP_BAND ? : OP_EQ;

-- Steve


> +			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
> +			if (pred->op == OP_NE)
> +				pred->not = 1;
>  		} else {
>  			switch (field->size) {
>  			case 8:


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init
  2023-07-28 16:00   ` Josh Poimboeuf
@ 2023-07-31 11:16     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 11:00, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:49PM +0100, Valentin Schneider wrote:
>> objtool now warns about it:
>>
>>   vmlinux.o: warning: objtool: enter_from_user_mode+0x4e: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   vmlinux.o: warning: objtool: enter_from_user_mode+0x50: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x60: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0x62: Non __ro_after_init static key "context_tracking_key" in .noinstr section
>>   [...]
>>
>> The key can only be enabled (and not disabled) in the __init function
>> ct_cpu_tracker_user(), so mark it as __ro_after_init.
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>
> It's best to avoid temporarily introducing warnings.  Bots will
> rightfully complain about that.  This patch and the next one should come
> before the objtool patches.
>

Ack, I'll reverse the order of these.

> Also it would be helpful for the commit log to have a brief
> justification for the patch beyond "fix the objtool warning".  Something
> roughly like:
>
>   Soon, runtime-mutable text won't be allowed in .noinstr sections, so
>   that a code patching IPI to a userspace-bound CPU can be safely
>   deferred to the next kernel entry.
>
>   'context_tracking_key' is only enabled in __init ct_cpu_tracker_user().
>   Mark it as __ro_after_init.
>

Looks better indeed, thanks!

> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-28 15:33   ` Josh Poimboeuf
@ 2023-07-31 11:16     ` Valentin Schneider
  2023-07-31 21:36       ` Josh Poimboeuf
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 10:33, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:47PM +0100, Valentin Schneider wrote:
>> I had to look into objtool itself to understand what this warning was
>> about; make it more explicit.
>>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  tools/objtool/check.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
>> index 8936a05f0e5ac..d308330f2910e 100644
>> --- a/tools/objtool/check.c
>> +++ b/tools/objtool/check.c
>> @@ -3360,7 +3360,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
>>
>>      list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
>>              if (!target->sec->noinstr) {
>> -			WARN("pv_ops[%d]: %s", idx, target->name);
>> +			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
>>                      file->pv_ops[idx].clean = false;
>
> This is an improvement, though I think it still results in two warnings,
> with the second not-so-useful warning happening in validate_call().
>
> Ideally it would only show a single warning, I guess that would need a
> little bit of restructuring the code.

You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():

  vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section

Interestingly the second one doesn't seem to have triggered the "pv_ops"
bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will
return NULL.

Trickling down the file yields:

  vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section

In my case (!PARAVIRT_XXL) pv_ops should look like:
  [0]: .cpu.io_delay
  [1]: .mmu.flush_tlb_user()

so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because
it uses arch_dest_reloc_offset().

If I use the above to fix up validate_call(), would we still need
pv_call_dest() & co?

>
> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-28 16:02   ` Josh Poimboeuf
@ 2023-07-31 11:18     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 11:02, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
>> Later commits will depend on having no runtime-mutable text in early entry
>> code. (ab)use the .noinstr section as a marker of early entry code and warn
>> about static keys used in it that can be flipped at runtime.
>
> Similar to my comment on patch 13, this could also use a short
> justification for adding the feature, i.e. why runtime-mutable text
> isn't going to be allowed in .noinstr.
>
> Also, please add a short description of the warning (and why it exists)
> to tools/objtool/Documentation/objtool.txt.
>

I had missed this, thanks for the pointer.

> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr
  2023-07-28 15:35   ` Josh Poimboeuf
@ 2023-07-31 11:18     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:18 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Josh Poimboeuf, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 28/07/23 10:35, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:48PM +0100, Valentin Schneider wrote:
>> +static int validate_static_key(struct instruction *insn, struct insn_state *state)
>> +{
>> +	if (state->noinstr && state->instr <= 0) {
>> +		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
>> +			WARN_INSN(insn,
>> +				  "Non __ro_after_init static key \"%s\" in .noinstr section",
>
> For consistency with other warnings, this should start with a lowercase
> "n" and the string literal should be on the same line as the WARN_INSN,
> like
>
>                       WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section",
>                                 ...
>
>> diff --git a/tools/objtool/special.c b/tools/objtool/special.c
>> index 91b1950f5bd8a..1f76cfd815bf3 100644
>> --- a/tools/objtool/special.c
>> +++ b/tools/objtool/special.c
>> @@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
>>                      return -1;
>>              }
>>              alt->key_addend = reloc_addend(key_reloc);
>> +
>> +		reloc_to_sec_off(key_reloc, &sec, &offset);
>> +		alt->key_sym = find_symbol_by_offset(sec, offset & ~2);
>
> Bits 0 and 1 can both store data, should be ~3?
>

Quite so, that needs to be the same as jump_entry_key().
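
Something along these lines, as a sketch (~3 mirroring JUMP_TYPE_MASK,
which covers bits 0-1):

	reloc_to_sec_off(key_reloc, &sec, &offset);
	alt->key_sym = find_symbol_by_offset(sec, offset & ~3);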

> --
> Josh


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-29 19:09     ` Steven Rostedt
@ 2023-07-31 11:19       ` Valentin Schneider
  2023-07-31 15:48         ` Steven Rostedt
  0 siblings, 1 reply; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:19 UTC (permalink / raw)
  To: Steven Rostedt, Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Jason Baron, Kees Cook, Sami Tolvanen,
	Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 29/07/23 15:09, Steven Rostedt wrote:
> On Wed, 26 Jul 2023 12:41:48 -0700
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
>> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>> >  int filter_assign_type(const char *type)
>> >  {
>> > -	if (strstr(type, "__data_loc") && strstr(type, "char"))
>> > -		return FILTER_DYN_STRING;
>> > +	if (strstr(type, "__data_loc")) {
>> > +		if (strstr(type, "char"))
>> > +			return FILTER_DYN_STRING;
>> > +		if (strstr(type, "cpumask_t"))
>> > +			return FILTER_CPUMASK;
>> > +		}
>>
>> The closing bracket has the wrong indentation.
>>
>> > +		/* Copy the cpulist between { and } */
>> > +		tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> > +		strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>>
>> Need to check kmalloc() failure?  And also free tmp?
>
> I came to state the same thing.
>
> Also, when you do an empty for loop:
>
>       for (; str[i] && str[i] != '}'; i++);
>
> Always put the semicolon on the next line, otherwise it is really easy
> to think that the next line is part of the for loop. That is, instead
> of the above, do:
>
>       for (; str[i] && str[i] != '}'; i++)
>               ;
>

Interestingly, I don't think I've ever encountered that variant; usually
an empty line (which this lacks) plus the indentation level is enough to
identify these - regardless, I'll change it.

>
> -- Steve
>
>
>>
>> > +
>> > +		pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> > +		if (!pred->mask)
>> > +			goto err_mem;
>> > +
>> > +		/* Now parse it */
>> > +		if (cpulist_parse(tmp, pred->mask)) {
>> > +			parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> > +			goto err_free;
>> > +		}
>> > +
>> > +		/* Move along */
>> > +		i++;
>> > +		if (field->filter_type == FILTER_CPUMASK)
>> > +			pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> > +
>>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  2023-07-29 19:34   ` Steven Rostedt
@ 2023-07-31 11:20     ` Valentin Schneider
  0 siblings, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 29/07/23 15:34, Steven Rostedt wrote:
> On Thu, 20 Jul 2023 17:30:41 +0100
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>>              /* Move along */
>>              i++;
>> +
>> +		/*
>> +		 * Optimisation: if the user-provided mask has a weight of one
>> +		 * then we can treat it as a scalar input.
>> +		 */
>> +		single = cpumask_weight(pred->mask) == 1;
>> +		if (single && field->filter_type == FILTER_CPUMASK) {
>> +			pred->val = cpumask_first(pred->mask);
>> +			kfree(pred->mask);
>
> Don't we need:
>                       pred->mask = NULL;
>
> here, or the free_predicate() will cause a double free?
>

We do, thanks for spotting this.

> -- Steve
>
>> +		}
>> +


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-29 19:55   ` Steven Rostedt
@ 2023-07-31 11:20     ` Valentin Schneider
  2023-07-31 12:07     ` Dan Carpenter
  1 sibling, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 11:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 29/07/23 15:55, Steven Rostedt wrote:
> On Thu, 20 Jul 2023 17:30:42 +0100
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>> Steven noted that when the user-provided cpumask contains a single CPU,
>> then the filtering function can use a scalar as input instead of a
>> full-fledged cpumask.
>>
>> When the mask contains a single CPU, directly re-use the unsigned field
>> predicate functions. Transform '&' into '==' beforehand.
>>
>> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  kernel/trace/trace_events_filter.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
>> index 2fe65ddeb34ef..54d642fabb7f1 100644
>> --- a/kernel/trace/trace_events_filter.c
>> +++ b/kernel/trace/trace_events_filter.c
>> @@ -1750,7 +1750,7 @@ static int parse_pred(const char *str, void *data,
>>               * then we can treat it as a scalar input.
>>               */
>>              single = cpumask_weight(pred->mask) == 1;
>> -		if (single && field->filter_type == FILTER_CPUMASK) {
>> +		if (single && field->filter_type != FILTER_CPU) {
>>                      pred->val = cpumask_first(pred->mask);
>>                      kfree(pred->mask);
>>              }
>> @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
>>                              FILTER_PRED_FN_CPUMASK;
>>              } else if (field->filter_type == FILTER_CPU) {
>>                      pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
>> +		} else if (single) {
>> +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
>
> Nit, the above can be written as:
>
>                       pred->op = pret->op != OP_BAND ? : OP_EQ;
>

That's neater, thanks!

> -- Steve
>
>
>> +			pred->fn_num = select_comparison_fn(pred->op, field->size, false);
>> +			if (pred->op == OP_NE)
>> +				pred->not = 1;
>>              } else {
>>                      switch (field->size) {
>>                      case 8:


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-29 19:55   ` Steven Rostedt
  2023-07-31 11:20     ` Valentin Schneider
@ 2023-07-31 12:07     ` Dan Carpenter
  2023-07-31 15:54       ` Steven Rostedt
  1 sibling, 1 reply; 76+ messages in thread
From: Dan Carpenter @ 2023-07-31 12:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
> > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
> >  				FILTER_PRED_FN_CPUMASK;
> >  		} else if (field->filter_type == FILTER_CPU) {
> >  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> > +		} else if (single) {
> > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
> 
> Nit, the above can be written as:
> 
> 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> 

Heh.  Those are not equivalent.  The right way to write this is:

	if (pred->op == OP_BAND)
		pred->op = OP_EQ;

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask
  2023-07-31 11:19       ` Valentin Schneider
@ 2023-07-31 15:48         ` Steven Rostedt
  0 siblings, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-07-31 15:48 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Josh Poimboeuf, linux-kernel, linux-trace-kernel, linux-doc, kvm,
	linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, 31 Jul 2023 12:19:51 +0100
Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > Also, when you do an empty for loop:
> >
> >       for (; str[i] && str[i] != '}'; i++);
> >
> > Always put the semicolon on the next line, otherwise it is really easy
> > to think that the next line is part of the for loop. That is, instead
> > of the above, do:
> >
> >       for (; str[i] && str[i] != '}'; i++)
> >               ;
> >  
> 
> Interestingly I don't think I've ever encountered that variant, usually
> having an empty line (which this lacks) and the indentation level is enough
> to identify these - regardless, I'll change it.


Do a "git grep -B1 -e '^\s*;\s*$'"

You'll find that it is quite common.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 12:07     ` Dan Carpenter
@ 2023-07-31 15:54       ` Steven Rostedt
  2023-07-31 16:03         ` Dan Carpenter
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-07-31 15:54 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, 31 Jul 2023 15:07:52 +0300
Dan Carpenter <dan.carpenter@linaro.org> wrote:

> On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
> > > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
> > >  				FILTER_PRED_FN_CPUMASK;
> > >  		} else if (field->filter_type == FILTER_CPU) {
> > >  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> > > +		} else if (single) {
> > > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;  
> > 
> > Nit, the above can be written as:
> > 
> > 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> >   
> 
> Heh.  Those are not equivalent.  The right way to write this is:

You mean because of my typo?

> 
> 	if (pred->op == OP_BAND)
> 		pred->op = OP_EQ;

But sure, I'm fine with that, and it's probably more readable too.

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 15:54       ` Steven Rostedt
@ 2023-07-31 16:03         ` Dan Carpenter
  2023-07-31 17:20           ` Valentin Schneider
  2023-07-31 18:16           ` Steven Rostedt
  0 siblings, 2 replies; 76+ messages in thread
From: Dan Carpenter @ 2023-07-31 16:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 11:54:53AM -0400, Steven Rostedt wrote:
> On Mon, 31 Jul 2023 15:07:52 +0300
> Dan Carpenter <dan.carpenter@linaro.org> wrote:
> 
> > On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
> > > > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
> > > >  				FILTER_PRED_FN_CPUMASK;
> > > >  		} else if (field->filter_type == FILTER_CPU) {
> > > >  			pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
> > > > +		} else if (single) {
> > > > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;  
> > > 
> > > Nit, the above can be written as:
> > > 
> > > 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> > >   
> > 
> > Heh.  Those are not equivalent.  The right way to write this is:
> 
> You mean because of my typo?

No, I hadn't seen the s/pred/pret/ typo.  Your code does:

	if (pred->op != OP_BAND)
		pred->op = true;
	else
		pred->op = OP_EQ;

Really, we should probably trigger a static checker warning any time
someone does a compare operation as part of an "x = comparison ?: bar;"
construct.  Years ago, someone asked me to do that with regard to error
codes like:

	return ret < 0 ?: -EINVAL;

but I don't remember the results.
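
(For clarity: with GCC's elided middle operand, that expands to roughly

	return (ret < 0) ? (ret < 0) : -EINVAL;

i.e. it returns 1 in the error case rather than propagating the error
code.)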

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 16:03         ` Dan Carpenter
@ 2023-07-31 17:20           ` Valentin Schneider
  2023-07-31 18:16           ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Valentin Schneider @ 2023-07-31 17:20 UTC (permalink / raw)
  To: Dan Carpenter, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On 31/07/23 19:03, Dan Carpenter wrote:
> On Mon, Jul 31, 2023 at 11:54:53AM -0400, Steven Rostedt wrote:
>> On Mon, 31 Jul 2023 15:07:52 +0300
>> Dan Carpenter <dan.carpenter@linaro.org> wrote:
>>
>> > On Sat, Jul 29, 2023 at 03:55:47PM -0400, Steven Rostedt wrote:
>> > > > @@ -1761,6 +1761,11 @@ static int parse_pred(const char *str, void *data,
>> > > >                                FILTER_PRED_FN_CPUMASK;
>> > > >                } else if (field->filter_type == FILTER_CPU) {
>> > > >                        pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
>> > > > +		} else if (single) {
>> > > > +			pred->op = pred->op == OP_BAND ? OP_EQ : pred->op;
>> > >
>> > > Nit, the above can be written as:
>> > >
>> > >                  pred->op = pret->op != OP_BAND ? : OP_EQ;
>> > >
>> >
>> > Heh.  Those are not equivalent.  The right way to write this is:
>>
>> You mean because of my typo?
>
> No, I hadn't seen the s/pred/pret/ typo.  Your code does:
>
>       if (pred->op != OP_BAND)
>               pred->op = true;
>       else
>               pred->op = OP_EQ;
>
> Really, we should probably trigger a static checker warning any time
> someone does a compare operation as part of an "x = comparison ?: bar;"
> construct.  Years ago, someone asked me to do that with regard to error
> codes like:
>
>       return ret < 0 ?: -EINVAL;
>
> but I don't remember the results.
>

FWIW this is caught by GCC:

     error: the omitted middle operand in ?: will always be ‘true’, suggest explicit middle operand [-Werror=parentheses]
     pred->op = pred->op != OP_BAND ? : OP_EQ;


> regards,
> dan carpenter


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  2023-07-31 16:03         ` Dan Carpenter
  2023-07-31 17:20           ` Valentin Schneider
@ 2023-07-31 18:16           ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-07-31 18:16 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf, Jason Baron,
	Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
	Juerg Haefliger, Nicolas Saenz Julienne, Kirill A. Shutemov,
	Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, 31 Jul 2023 19:03:04 +0300
Dan Carpenter <dan.carpenter@linaro.org> wrote:

> > > > Nit, the above can be written as:
> > > > 
> > > > 			pred->op = pret->op != OP_BAND ? : OP_EQ;
> > > >     
> > > 
> > > Heh.  Those are not equivalent.  The right way to write this is:  
> > 
> > You mean because of my typo?  
> 
> No, I hadn't seen the s/pred/pret/ typo.  Your code does:
> 
> 	if (pred->op != OP_BAND)
> 		pred->op = true;
> 	else
> 		pred->op = OP_EQ;

Ah, for some reason I was thinking the ? : was just a nop, but I guess
it is there to assign the cond value :-/

But of course every place I've done that, it was the condition value I
wanted, which was the same as the value being assigned.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-31 11:16     ` Valentin Schneider
@ 2023-07-31 21:36       ` Josh Poimboeuf
  2023-07-31 21:46         ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-07-31 21:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-trace-kernel, linux-doc, kvm, linux-mm, bpf,
	x86, rcu, linux-kselftest, Steven Rostedt, Masami Hiramatsu,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Paolo Bonzini, Wanpeng Li,
	Vitaly Kuznetsov, Andy Lutomirski, Peter Zijlstra,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 12:16:59PM +0100, Valentin Schneider wrote:
> You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
> 
>   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
> 
> Interestingly the second one doesn't seem to have triggered the "pv_ops"
> bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will
> return NULL.

Yeah, that's weird.

> Trickling the file pointer down (so insn_reloc() gets a non-NULL file)
> yields:
> 
>   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
> 
> In my case (!PARAVIRT_XXL) pv_ops should look like:
>   [0]: .cpu.io_delay
>   [1]: .mmu.flush_tlb_user()
> 
> so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because
> it uses arch_dest_reloc_offset().
> 
> If I use the above to fix up validate_call(), would we still need
> pv_call_dest() & co?

The functionality in pv_call_dest() is still needed because it goes
through all the possible targets for the .mmu.flush_tlb_user() pointer
-- xen_flush_tlb() and native_flush_tlb_local() -- and makes sure
they're noinstr.

Ideally it would only print a single warning for this case, something
like:

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section

I left out "pv_ops[1]" because it's already long enough :-)

It would need a little bit of code shuffling.  But it's really a
preexisting problem so don't feel compelled to fix it with this patch
set.
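
For reference, the fix-up discussed above would look roughly like the
below -- a sketch only, assuming objtool's existing insn_reloc(),
insn_call_dest() and arch_dest_reloc_offset() helpers, with the
objtool_file pointer trickled down so insn_reloc() can actually resolve
the relocation:

  static char *call_dest_name(struct objtool_file *file,
                              struct instruction *insn)
  {
          static char pvname[19];
          struct reloc *reloc;
          int idx;

          if (insn_call_dest(insn))
                  return insn_call_dest(insn)->name;

          /* insn_reloc(NULL, insn) is what yielded "{dynamic}" above */
          reloc = insn_reloc(file, insn);
          if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
                  /*
                   * As in pv_call_dest(): account for the offset of the
                   * destination within the pv_ops entry, otherwise the
                   * addend truncates to the wrong slot (pv_ops[0]
                   * instead of pv_ops[1] above).
                   */
                  idx = arch_dest_reloc_offset(reloc->addend) / sizeof(void *);
                  snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
                  return pvname;
          }

          return "{dynamic}";
  }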

-- 
Josh


* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-31 21:36       ` Josh Poimboeuf
@ 2023-07-31 21:46         ` Peter Zijlstra
  2023-08-01 16:06           ` Josh Poimboeuf
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-07-31 21:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 04:36:31PM -0500, Josh Poimboeuf wrote:
> On Mon, Jul 31, 2023 at 12:16:59PM +0100, Valentin Schneider wrote:
> > You're quite right - fabricating an artificial warning with a call to __flush_tlb_local():
> > 
> >   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
> > 
> > Interestingly the second one doesn't seem to have triggered the "pv_ops"
> > bit of call_dest_name. Seems like any call to insn_reloc(NULL, x) will
> > return NULL.
> 
> Yeah, that's weird.
> 
> > Trickling the file pointer down (so insn_reloc() gets a non-NULL file)
> > yields:
> > 
> >   vmlinux.o: warning: objtool: pv_ops[1]: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[0]() leaves .noinstr.text section
> > 
> > In my case (!PARAVIRT_XXL) pv_ops should look like:
> >   [0]: .cpu.io_delay
> >   [1]: .mmu.flush_tlb_user()
> > 
> > so pv_ops[1] looks right. Seems like pv_call_dest() gets it right because
> > it uses arch_dest_reloc_offset().
> > 
> > If I use the above to fix up validate_call(), would we still need
> > pv_call_dest() & co?
> 
> The functionality in pv_call_dest() is still needed because it goes
> through all the possible targets for the .mmu.flush_tlb_user() pointer
> -- xen_flush_tlb() and native_flush_tlb_local() -- and makes sure
> they're noinstr.
> 
> Ideally it would only print a single warning for this case, something
> like:
> 
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section

But then what about the case where there are multiple implementations and
more than one isn't noinstr? IIRC that is where these double prints came
from. One is the callsite (always one) and the second is the offending
implementation (but there could be more).

> I left out "pv_ops[1]" because it's already long enough :-)

The index number is useful when also looking at the assembler, which
IIRC is an indexed indirect call.


* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-07-31 21:46         ` Peter Zijlstra
@ 2023-08-01 16:06           ` Josh Poimboeuf
  2023-08-01 18:12             ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Josh Poimboeuf @ 2023-08-01 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Mon, Jul 31, 2023 at 11:46:12PM +0200, Peter Zijlstra wrote:
> > Ideally it would only print a single warning for this case, something
> > like:
> > 
> >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> 
> But then what about the case where there are multiple implementations and
> more than one isn't noinstr?

The warning would be in the loop in pv_call_dest(), so it would
potentially print multiple warnings, one for each potential dest.

> IIRC that is where these double prints came from. One is the callsite
> (always one) and the second is the offending implementation (but there
> could be more).

It's confusing to warn about the call site and the destination in two
separate warnings.  That's why I'm proposing combining them into a
single warning (which still could end up as multiple warnings if there
are multiple affected dests).
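
Roughly -- a sketch only, assuming the per-entry .clean flag and target
list objtool already keeps for pv_ops[] -- the warning would move into
the pv_call_dest() loop, one line per offending destination:

  static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
  {
          struct symbol *target;
          struct reloc *reloc;
          int idx;

          reloc = insn_reloc(file, insn);
          if (!reloc || strcmp(reloc->sym->name, "pv_ops"))
                  return false;

          idx = arch_dest_reloc_offset(reloc->addend) / sizeof(void *);

          if (file->pv_ops[idx].clean)
                  return true;

          file->pv_ops[idx].clean = true;

          list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
                  if (!target->sec->noinstr) {
                          /* combined call-site + destination warning */
                          WARN_INSN(insn, "indirect call to pv_ops[%d] (%s) leaves .noinstr.text section",
                                    idx, target->name);
                          file->pv_ops[idx].clean = false;
                  }
          }

          /* callers no longer need to emit their own warning */
          return file->pv_ops[idx].clean;
  }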

> > I left out "pv_ops[1]" because it's already long enough :-)
> 
> The index number is useful when also looking at the assembler, which
> IIRC is an indexed indirect call.

Ok, so something like so?

  vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to pv_ops[1] (native_flush_tlb_local) leaves .noinstr.text section

-- 
Josh


* Re: [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls
  2023-08-01 16:06           ` Josh Poimboeuf
@ 2023-08-01 18:12             ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-08-01 18:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest, Steven Rostedt,
	Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Paolo Bonzini,
	Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Tue, Aug 01, 2023 at 11:06:36AM -0500, Josh Poimboeuf wrote:
> On Mon, Jul 31, 2023 at 11:46:12PM +0200, Peter Zijlstra wrote:
> > > Ideally it would only print a single warning for this case, something
> > > like:
> > > 
> > >   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to native_flush_tlb_local() leaves .noinstr.text section
> > 
> > But then what about the case where there are multiple implementations and
> > more than one isn't noinstr?
> 
> The warning would be in the loop in pv_call_dest(), so it would
> potentially print multiple warnings, one for each potential dest.
> 
> > IIRC that is where these double prints came from. One is the callsite
> > (always one) and the second is the offending implementation (but there
> > could be more).
> 
> It's confusing to warn about the call site and the destination in two
> separate warnings.  That's why I'm proposing combining them into a
> single warning (which still could end up as multiple warnings if there
> are multiple affected dests).
> 
> > > I left out "pv_ops[1]" because it's already long enough :-)
> > 
> > The index number is useful when also looking at the assembler, which
> > IIRC is an indexed indirect call.
> 
> Ok, so something like so?
> 
>   vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: indirect call to pv_ops[1] (native_flush_tlb_local) leaves .noinstr.text section

Sure, that all would work I suppose.


* Re: [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
@ 2023-10-09 16:40   ` Maxim Levitsky
  0 siblings, 0 replies; 76+ messages in thread
From: Maxim Levitsky @ 2023-10-09 16:40 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel, linux-trace-kernel, linux-doc,
	kvm, linux-mm, bpf, x86, rcu, linux-kselftest
  Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
	Andy Lutomirski, Peter Zijlstra, Frederic Weisbecker,
	Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
	Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	Lorenzo Stoakes, Josh Poimboeuf, Jason Baron, Kees Cook,
	Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin, Juerg Haefliger,
	Nicolas Saenz Julienne, Kirill A. Shutemov, Nadav Amit,
	Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
	Jason A. Donenfeld, Song Liu, Julian Pidancet, Tom Lendacky,
	Dionna Glaze, Thomas Weißschuh, Juri Lelli,
	Daniel Bristot de Oliveira, Marcelo Tosatti, Yair Podemsky

On Thu, 2023-07-20 at 17:30 +0100, Valentin Schneider wrote:
> objtool now warns about it:
> 
>   vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init static key "kvm_async_pf_enabled" in .noinstr section
> 
> The key can only be enabled (and not disabled) in the __init function
> kvm_guest_init(), so mark it as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  arch/x86/kernel/kvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 1cceac5984daa..319460090a836 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -44,7 +44,7 @@
>  #include <asm/svm.h>
>  #include <asm/e820/api.h>
>  
> -DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
>  
>  static int kvmapf = 1;
>  
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
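
For context, the two macros differ only in the section annotation: per
include/linux/jump_label.h, DEFINE_STATIC_KEY_FALSE_RO() just tacks
__ro_after_init onto the key, so it becomes read-only once init
completes:

  #define DEFINE_STATIC_KEY_FALSE(name)                           \
          struct static_key_false name = STATIC_KEY_FALSE_INIT

  #define DEFINE_STATIC_KEY_FALSE_RO(name)                        \
          struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT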



end of thread

Thread overview: 76+ messages
2023-07-20 16:30 [RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 01/20] tracing/filters: Dynamically allocate filter_pred.regex Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 02/20] tracing/filters: Enable filtering a cpumask field by another cpumask Valentin Schneider
2023-07-26 19:41   ` Josh Poimboeuf
2023-07-27  9:46     ` Valentin Schneider
2023-07-29 19:09     ` Steven Rostedt
2023-07-31 11:19       ` Valentin Schneider
2023-07-31 15:48         ` Steven Rostedt
2023-07-20 16:30 ` [RFC PATCH v2 03/20] tracing/filters: Enable filtering a scalar field by a cpumask Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 04/20] tracing/filters: Enable filtering the CPU common " Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 05/20] tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU Valentin Schneider
2023-07-29 19:34   ` Steven Rostedt
2023-07-31 11:20     ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 06/20] tracing/filters: Optimise scalar vs cpumask filtering when the " Valentin Schneider
2023-07-29 19:55   ` Steven Rostedt
2023-07-31 11:20     ` Valentin Schneider
2023-07-31 12:07     ` Dan Carpenter
2023-07-31 15:54       ` Steven Rostedt
2023-07-31 16:03         ` Dan Carpenter
2023-07-31 17:20           ` Valentin Schneider
2023-07-31 18:16           ` Steven Rostedt
2023-07-20 16:30 ` [RFC PATCH v2 07/20] tracing/filters: Optimise CPU " Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 08/20] tracing/filters: Further optimise scalar vs cpumask comparison Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 09/20] tracing/filters: Document cpumask filtering Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 10/20] jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider
2023-07-28 22:04   ` Peter Zijlstra
2023-07-20 16:30 ` [RFC PATCH v2 11/20] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2023-07-28 15:33   ` Josh Poimboeuf
2023-07-31 11:16     ` Valentin Schneider
2023-07-31 21:36       ` Josh Poimboeuf
2023-07-31 21:46         ` Peter Zijlstra
2023-08-01 16:06           ` Josh Poimboeuf
2023-08-01 18:12             ` Peter Zijlstra
2023-07-20 16:30 ` [RFC PATCH v2 12/20] objtool: Warn about non __ro_after_init static key usage in .noinstr Valentin Schneider
2023-07-28 15:35   ` Josh Poimboeuf
2023-07-31 11:18     ` Valentin Schneider
2023-07-28 16:02   ` Josh Poimboeuf
2023-07-31 11:18     ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 13/20] context_tracking: Make context_tracking_key __ro_after_init Valentin Schneider
2023-07-28 16:00   ` Josh Poimboeuf
2023-07-31 11:16     ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init Valentin Schneider
2023-10-09 16:40   ` Maxim Levitsky
2023-07-20 16:30 ` [RFC PATCH v2 15/20] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2023-07-24 14:52   ` Frederic Weisbecker
2023-07-24 16:55     ` Valentin Schneider
2023-07-24 19:18       ` Frederic Weisbecker
2023-07-25 10:10         ` Valentin Schneider
2023-07-25 11:22           ` Frederic Weisbecker
2023-07-25 13:05             ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 16/20] rcu: Make RCU dynticks counter size configurable Valentin Schneider
2023-07-21  8:17   ` Valentin Schneider
2023-07-21 14:10     ` Paul E. McKenney
2023-07-21 15:08       ` Valentin Schneider
2023-07-21 16:09         ` Paul E. McKenney
2023-07-20 16:30 ` [RFC PATCH v2 17/20] rcutorture: Add a test config to torture test low RCU_DYNTICKS width Valentin Schneider
2023-07-20 19:53   ` Paul E. McKenney
2023-07-21  4:00     ` Paul E. McKenney
2023-07-21  7:58       ` Valentin Schneider
2023-07-21 14:07         ` Paul E. McKenney
2023-07-21 15:08           ` Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2023-07-25 10:49   ` Joel Fernandes
2023-07-25 13:36     ` Valentin Schneider
2023-07-25 17:41       ` Joel Fernandes
2023-07-25 13:39     ` Peter Zijlstra
2023-07-25 17:47       ` Joel Fernandes
2023-07-20 16:30 ` [RFC PATCH v2 19/20] context_tracking,x86: Add infrastructure to defer kernel TLBI Valentin Schneider
2023-07-20 16:30 ` [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Valentin Schneider
2023-07-21 18:15   ` Nadav Amit
2023-07-24 11:32     ` Valentin Schneider
2023-07-24 17:40       ` Dave Hansen
2023-07-25 13:21         ` Peter Zijlstra
2023-07-25 14:03           ` Valentin Schneider
2023-07-25 16:37         ` Marcelo Tosatti
2023-07-25 17:12           ` Dave Hansen
