linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS
@ 2023-01-09 13:58 Mark Rutland
  2023-01-09 13:58 ` [PATCH 1/8] Compiler attributes: GCC function alignment workarounds Mark Rutland
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
enables support for this on arm64. This significantly reduces the
overhead of tracing when a callsite/tracee has a single associated
tracer, avoids a number of issues that make it undesirable and
infeasible to use dynamically-allocated trampolines (e.g. branch range
limitations), and makes it possible to implement support for
DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.

The main idea is to give each ftrace callsite an associated pointer to
an ftrace_ops. The architecture's ftrace_caller trampoline can recover
the ops pointer and invoke ops->func from this without needing to use
ftrace_ops_list_func, which has to iterate through all registered ops.

To do this, we use -fpatchable-function-entry=M,N, where N NOPs are
placed before the function entry point. On arm64 NOPs are always 4
bytes, so by allocating 2 per-function NOPs, we have enough space to
place a 64-bit value. So that we can manipulate the pointer atomically,
we need to align instrumented functions to at least 8 bytes.
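
For illustration, here's a minimal C sketch of the resulting dispatch
(illustrative only: the real arm64 trampoline in the final patch does
this in assembly, and 'callsite_dispatch_sketch' / 'func_entry' are
made-up names, not real kernel symbols):

	static void callsite_dispatch_sketch(unsigned long func_entry,
					     unsigned long ip,
					     unsigned long parent_ip,
					     struct ftrace_regs *fregs)
	{
		/* the two pre-function NOPs hold an 8-byte ops pointer literal */
		struct ftrace_ops *ops;

		ops = READ_ONCE(*(struct ftrace_ops **)(func_entry - 8));
		ops->func(ip, parent_ip, ops, fregs);
	}

Since the literal is at least 8-byte aligned, the ops pointer can be
published with a single atomic store.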

The first three patches enable this function alignment, requiring
changes to the ACPICA Makefile, and working around cases where GCC drops
alignment.

The fourth patch adds the core DYNAMIC_FTRACE_WITH_CALL_OPS mechanism,
and the final four patches implement support for arm64. As noted in the
final patch, this results in a significant reduction in overhead:

  Before this patch:

  Number of tracers     || Total time  | Per-call average time (ns)
  Relevant | Irrelevant || (ns)        | Total        | Overhead
  =========+============++=============+==============+============
         0 |          0 ||      94,583 |         0.95 |           -
         0 |          1 ||      93,709 |         0.94 |           -
         0 |          2 ||      93,666 |         0.94 |           -
         0 |         10 ||      93,709 |         0.94 |           -
         0 |        100 ||      93,792 |         0.94 |           -
  ---------+------------++-------------+--------------+------------
         1 |          1 ||   6,467,833 |        64.68 |       63.73
         1 |          2 ||   7,509,708 |        75.10 |       74.15
         1 |         10 ||  23,786,792 |       237.87 |      236.92
         1 |        100 || 106,432,500 |     1,064.43 |     1063.38
  ---------+------------++-------------+--------------+------------
         1 |          0 ||   1,431,875 |        14.32 |       13.37
         2 |          0 ||   6,456,334 |        64.56 |       63.62
        10 |          0 ||  22,717,000 |       227.17 |      226.22
       100 |          0 || 103,293,667 |      1032.94 |     1031.99
  ---------+------------++-------------+--------------+------------

  Note: per-call overhead is estimated relative to the baseline case
  with 0 relevant tracers and 0 irrelevant tracers.
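
  For example, taking the row with 1 relevant and 0 irrelevant tracers
  (the benchmark in the final patch makes 100,000 calls to a trivial
  instrumented function):

    (1,431,875 ns - 94,583 ns) / 100,000 calls ~= 13.37 ns overhead per call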

  After this patch:

  Number of tracers     || Total time  | Per-call average time (ns)
  Relevant | Irrelevant || (ns)        | Total        | Overhead
  =========+============++=============+==============+============
         0 |          0 ||      94,541 |         0.95 |           -
         0 |          1 ||      93,666 |         0.94 |           -
         0 |          2 ||      93,709 |         0.94 |           -
         0 |         10 ||      93,667 |         0.94 |           -
         0 |        100 ||      93,792 |         0.94 |           -
  ---------+------------++-------------+--------------+------------
         1 |          1 ||     281,000 |         2.81 |        1.86
         1 |          2 ||     281,042 |         2.81 |        1.87
         1 |         10 ||     280,958 |         2.81 |        1.86
         1 |        100 ||     281,250 |         2.81 |        1.87
  ---------+------------++-------------+--------------+------------
         1 |          0 ||     280,959 |         2.81 |        1.86
         2 |          0 ||   6,502,708 |        65.03 |       64.08
        10 |          0 ||  18,681,209 |       186.81 |      185.87
       100 |          0 || 103,550,458 |     1,035.50 |     1034.56
  ---------+------------++-------------+--------------+------------

  Note: per-call overhead is estimated relative to the baseline case
  with 0 relevant tracers and 0 irrelevant tracers.

Thanks,
Mark.

Mark Rutland (8):
  Compiler attributes: GCC function alignment workarounds
  ACPI: Don't build ACPICA with '-Os'
  arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
  arm64: insn: Add helpers for BTI
  arm64: patching: Add aarch64_insn_write_literal_u64()
  arm64: ftrace: Update stale comment
  arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

 arch/arm64/Kconfig                  |   3 +
 arch/arm64/Makefile                 |   5 +-
 arch/arm64/include/asm/ftrace.h     |  15 +--
 arch/arm64/include/asm/insn.h       |   1 +
 arch/arm64/include/asm/linkage.h    |  10 +-
 arch/arm64/include/asm/patching.h   |   2 +
 arch/arm64/kernel/asm-offsets.c     |   4 +
 arch/arm64/kernel/entry-ftrace.S    |  32 +++++-
 arch/arm64/kernel/ftrace.c          | 158 +++++++++++++++++++++++++++-
 arch/arm64/kernel/patching.c        |  17 +++
 drivers/acpi/acpica/Makefile        |   2 +-
 include/linux/compiler_attributes.h |  23 +++-
 include/linux/ftrace.h              |  15 ++-
 kernel/trace/Kconfig                |   7 ++
 kernel/trace/ftrace.c               | 109 ++++++++++++++++++-
 15 files changed, 371 insertions(+), 32 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-09 14:43   ` Miguel Ojeda
  2023-01-09 13:58 ` [PATCH 2/8] ACPI: Don't build ACPICA with '-Os' Mark Rutland
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

From local testing, contemporary versions of GCC (e.g. GCC 12.2.0)
don't respect '-falign-functions=N' in all cases. This is unfortunate,
as (for non-zero values of N) CONFIG_FUNCTION_ALIGNMENT=N will set
'-falign-functions=N', but this won't take effect for all functions.
LLVM appears to respect '-falign-functions=N' in all cases.

Today, for x86 this turns out to be functionally benign, though it does
somewhat undermine the CONFIG_FUNCTION_ALIGNMENT option, and it means
that CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not as robust as we'd
like.

On arm64 we'd like to use CONFIG_FUNCTION_ALIGNMENT to implement ftrace
functionality, and we'll require function alignment to be respected for
functional reasons.

As far as I can tell, GCC doesn't respect '-falign-functions=N':

* When the __weak__ attribute is used

  GCC seems to forget the alignment specified by '-falign-functions=N',
  but will respect the '__aligned__(N)' function attribute. Thus, we can
  work around this by explciitly setting the alignment for weak
  functions.

* When the __cold__ attribute is used

  GCC seems to forget the alignment specified by '-falign-functions=N',
  and also doesn't seem to respect the '__aligned__(N)' function
  attribute. The only way to work around this is to not use the __cold__
  attribute.

This patch implements workarounds for these two cases, using a function
attribute to set the alignment of __weak__ functions, and preventing the
use of the __cold__ attribute when CONFIG_FUNCTION_ALIGNMENT is
non-zero.
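
For illustration, when CONFIG_FUNCTION_ALIGNMENT is non-zero the
workaround amounts to the following expansion (simplified from the diff
below; 'foo' is a hypothetical weak function):

	#define __function_aligned	__aligned(CONFIG_FUNCTION_ALIGNMENT)
	#define __weak			__attribute__((__weak__)) __function_aligned

	/* now explicitly aligned via the attribute, even under GCC */
	void __weak foo(void)
	{
	}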

I've tested this by selecting CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
building and booting a kernel, and looking for misaligned text symbols:

* arm64:

  Before:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    4939

  After:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    908

* x86_64:

  Before:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    7969

  After:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    2057

With the patch applied, the remaining unaligned text labels are a
combination of static call trampolines, non-function labels in assembly,
and ACPICA functions, which will be dealt with in subsequent patches.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
---
 include/linux/compiler_attributes.h | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index 898b3458b24a0..dcb7ac67b764f 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -33,6 +33,17 @@
 #define __aligned(x)                    __attribute__((__aligned__(x)))
 #define __aligned_largest               __attribute__((__aligned__))
 
+/*
+ * Contemporary versions of GCC (e.g. 12.2.0) don't always respect
+ * '-falign-functions=N', and require alignment to be specificed via a function
+ * attribute in some cases.
+ */
+#if CONFIG_FUNCTION_ALIGNMENT > 0
+#define __function_aligned		__aligned(CONFIG_FUNCTION_ALIGNMENT)
+#else
+#define __function_aligned
+#endif
+
 /*
  * Note: do not use this directly. Instead, use __alloc_size() since it is conditionally
  * available and includes other attributes. For GCC < 9.1, __alloc_size__ gets undefined
@@ -78,8 +89,15 @@
 /*
  *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-cold-function-attribute
  *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Label-Attributes.html#index-cold-label-attribute
+ *
+ * GCC drops function alignment when the __cold__ attribute is used. Avoid the
+ * __cold__ attribute if function alignment is required.
  */
+#if !defined(CONFIG_CC_IS_GCC) || (CONFIG_FUNCTION_ALIGNMENT == 0)
 #define __cold                          __attribute__((__cold__))
+#else
+#define __cold
+#endif
 
 /*
  * Note the long name.
@@ -369,8 +387,11 @@
 /*
  *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-weak-function-attribute
  *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-weak-variable-attribute
+ *
+ * GCC drops function alignment when the __weak__ attribute is used. This can
+ * be restored with function attributes.
  */
-#define __weak                          __attribute__((__weak__))
+#define __weak                          __attribute__((__weak__)) __function_aligned
 
 /*
  * Used by functions that use '__builtin_return_address'. These function
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/8] ACPI: Don't build ACPICA with '-Os'
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
  2023-01-09 13:58 ` [PATCH 1/8] Compiler attributes: GCC function alignment workarounds Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-10 13:45   ` Rafael J. Wysocki
  2023-01-09 13:58 ` [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT Mark Rutland
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel, Len Brown, Rafael J . Wysocki, Robert Moore
  Cc: catalin.marinas, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, revest, rostedt, will

The ACPICA code has been built with '-Os' since the beginning of git
history, though there's no explanatory comment as to why.

This is unfortunate as building with '-Os' overrides -falign-functions,
which prevents CONFIG_FUNCTION_ALIGNMENT and
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B from having their expected effect
on the ACPICA code. This is doubly unfortunate as in subsequent patches
arm64 will depend upon CONFIG_FUNCTION_ALIGNMENT for its ftrace
implementation.

Drop the '-Os' flag when building the ACPICA code. With this removed,
the code builds cleanly and works correctly in testing so far.

I've tested this by selecting CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
building and booting a kernel using ACPI, and looking for misaligned
text symbols:

* arm64:

  Before:
    #  grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    908
    #  grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
    576

  After:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    322
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
    0

* x86_64:

  Before:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    2057
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
    706

  After:
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    1351
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
    0

With the patch applied, the remaining unaligned text labels are a
combination of static call trampolines and labels in assembly, which
will be dealt with in subsequent patches.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-acpi@vger.kernel.org
---
 drivers/acpi/acpica/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/acpica/Makefile b/drivers/acpi/acpica/Makefile
index 9e0d95d76fff7..30f3fc13c29d1 100644
--- a/drivers/acpi/acpica/Makefile
+++ b/drivers/acpi/acpica/Makefile
@@ -3,7 +3,7 @@
 # Makefile for ACPICA Core interpreter
 #
 
-ccflags-y			:= -Os -D_LINUX -DBUILDING_ACPICA
+ccflags-y			:= -D_LINUX -DBUILDING_ACPICA
 ccflags-$(CONFIG_ACPI_DEBUG)	+= -DACPI_DEBUG_OUTPUT
 
 # use acpi.o to put all files here into acpi.o modparam namespace
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
  2023-01-09 13:58 ` [PATCH 1/8] Compiler attributes: GCC function alignment workarounds Mark Rutland
  2023-01-09 13:58 ` [PATCH 2/8] ACPI: Don't build ACPICA with '-Os' Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-10 20:35   ` Peter Zijlstra
  2023-01-09 13:58 ` [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

On arm64 we don't align assembly functions in the same way as C
functions. This somewhat limits the utility of
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for testing.

Follow the example of x86, and align assembly functions in the same way
as C functions.

I've tested this by selecting CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
building and booting a kernel, and looking for misaligned text symbols:

Before:
  # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
  322

After:
  # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
  115

The remaining unaligned text symbols are a combination of non-function labels
in assembly and early position-independent code which is built with '-Os'.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/linkage.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
index 1436fa1cde24d..df18a3446ce82 100644
--- a/arch/arm64/include/asm/linkage.h
+++ b/arch/arm64/include/asm/linkage.h
@@ -5,8 +5,14 @@
 #include <asm/assembler.h>
 #endif
 
-#define __ALIGN		.align 2
-#define __ALIGN_STR	".align 2"
+#if CONFIG_FUNCTION_ALIGNMENT > 0
+#define ARM64_FUNCTION_ALIGNMENT	CONFIG_FUNCTION_ALIGNMENT
+#else
+#define ARM64_FUNCTION_ALIGNMENT	4
+#endif
+
+#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
+#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
 
 /*
  * When using in-kernel BTI we need to ensure that PCS-conformant
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (2 preceding siblings ...)
  2023-01-09 13:58 ` [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-12  6:48   ` Li Huafei
  2023-01-09 13:58 ` [PATCH 5/8] arm64: insn: Add helpers for BTI Mark Rutland
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

Architectures without dynamic ftrace trampolines incur an overhead when
multiple ftrace_ops are enabled with distinct filters. In these cases,
each call site calls a common trampoline which uses
ftrace_ops_list_func() to iterate over all enabled ftrace functions, and
so incurs an overhead relative to the size of this list (including RCU
protection overhead).
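
Conceptually, that list-processing path amounts to something like the
following sketch (simplified; not a verbatim copy of
ftrace_ops_list_func()):

	do_for_each_ftrace_op(op, ftrace_ops_list) {
		/* per-ops hash/filter check */
		if (ftrace_ops_test(op, ip, NULL))
			op->func(ip, parent_ip, op, fregs);
	} while_for_each_ftrace_op(op);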

Architectures with dynamic ftrace trampolines avoid this overhead for
call sites which have a single associated ftrace_ops. In these cases,
the dynamic trampoline is customized to branch directly to the relevant
ftrace function, avoiding the list overhead.

On some architectures it's impractical and/or undesirable to implement
dynamic ftrace trampolines. For example, arm64 has limited branch ranges
and cannot always directly branch from a call site to an arbitrary
address (e.g. from a kernel text address to an arbitrary module
address). Calls from modules to core kernel text can be indirected via
PLTs (allocated at module load time) to address this, but the same is
not possible for calls from core kernel text.

Using an indirect branch from a call site to an arbitrary trampoline is
possible, but requires several more instructions in the function
prologue (or immediately before it), and/or comes with far more complex
requirements for patching.

Instead, this patch adds a new option, where an architecture can
associate each call site with a pointer to an ftrace_ops, placed at a
fixed offset from the call site. A shared trampoline can recover this
pointer and call ftrace_ops::func() without needing to go via
ftrace_ops_list_func(), avoiding the associated overhead.

This avoids issues with branch range limitations, and avoids the need to
allocate and manipulate dynamic trampolines, making it far simpler to
implement and maintain, while having similar performance
characteristics.

Note that this allows for dynamic ftrace_ops to be invoked directly from
an architecture's ftrace_caller trampoline, whereas existing code forces
the use of ftrace_ops_get_list_func(), which is in part necessary to
permit the ftrace_ops to be freed once unregistered *and* to avoid
branch/address-generation range limitations on some architectures (e.g.
where ops->func is a module address, and may be outside of the direct
branch range for callsites within the main kernel image).

The CALL_OPS approach avoids these problems and is safe as:

* The existing synchronization in ftrace_shutdown() using
  synchronize_rcu_tasks_rude() (and
  synchronize_rcu_tasks()) ensures that no tasks hold a stale reference
  to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline,
  or while invoking ftrace_ops::func), when that ftrace_ops is
  unregistered.

  Arguably this could also be relied upon for the existing scheme,
  permitting dynamic ftrace_ops to be invoked directly when ops->func is
  in range, but this will require additional logic to handle branch
  range limitations, and is not handled by this patch.

* Each callsite's ftrace_ops pointer literal can hold any valid kernel
  address, and is updated atomically. As an architecture's ftrace_caller
  trampoline will atomically load the ops pointer then dereference
  ops->func, there is no risk of invoking ops->func with a mismatched
  ops pointer, and updates to the ops pointer do not require special
  care.

A subsequent patch will implement architecture support for arm64. There
should be no functional change as a result of this patch alone.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/ftrace.h |  15 +++++-
 kernel/trace/Kconfig   |   7 +++
 kernel/trace/ftrace.c  | 109 +++++++++++++++++++++++++++++++++++++++--
 3 files changed, 124 insertions(+), 7 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 99f1146614c04..5eeb2776124c5 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -57,6 +57,9 @@ void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip);
 void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 			       struct ftrace_ops *op, struct ftrace_regs *fregs);
 #endif
+extern const struct ftrace_ops ftrace_nop_ops;
+extern const struct ftrace_ops ftrace_list_ops;
+struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
 #endif /* CONFIG_FUNCTION_TRACER */
 
 /* Main tracing buffer and events set up */
@@ -563,6 +566,8 @@ bool is_ftrace_trampoline(unsigned long addr);
  *  IPMODIFY - the record allows for the IP address to be changed.
  *  DISABLED - the record is not ready to be touched yet
  *  DIRECT   - there is a direct function to call
+ *  CALL_OPS - the record can use callsite-specific ops
+ *  CALL_OPS_EN - the function is set up to use callsite-specific ops
  *
  * When a new ftrace_ops is registered and wants a function to save
  * pt_regs, the rec->flags REGS is set. When the function has been
@@ -580,9 +585,11 @@ enum {
 	FTRACE_FL_DISABLED	= (1UL << 25),
 	FTRACE_FL_DIRECT	= (1UL << 24),
 	FTRACE_FL_DIRECT_EN	= (1UL << 23),
+	FTRACE_FL_CALL_OPS	= (1UL << 22),
+	FTRACE_FL_CALL_OPS_EN	= (1UL << 21),
 };
 
-#define FTRACE_REF_MAX_SHIFT	23
+#define FTRACE_REF_MAX_SHIFT	21
 #define FTRACE_REF_MAX		((1UL << FTRACE_REF_MAX_SHIFT) - 1)
 
 #define ftrace_rec_count(rec)	((rec)->flags & FTRACE_REF_MAX)
@@ -820,7 +827,8 @@ static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
  */
 extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
 
-#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+#if defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS) || \
+	defined(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)
 /**
  * ftrace_modify_call - convert from one addr to another (no nop)
  * @rec: the call site record (e.g. mcount/fentry)
@@ -833,6 +841,9 @@ extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
  * what we expect it to be, and then on success of the compare,
  * it should write to the location.
  *
+ * When using call ops, this is called when the associated ops change, even
+ * when (addr == old_addr).
+ *
  * The code segment at @rec->ip should be a caller to @old_addr
  *
  * Return must be:
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 197545241ab83..5df427a2321df 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -42,6 +42,9 @@ config HAVE_DYNAMIC_FTRACE_WITH_REGS
 config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 	bool
 
+config HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
+	bool
+
 config HAVE_DYNAMIC_FTRACE_WITH_ARGS
 	bool
 	help
@@ -257,6 +260,10 @@ config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 	depends on DYNAMIC_FTRACE_WITH_REGS
 	depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 
+config DYNAMIC_FTRACE_WITH_CALL_OPS
+	def_bool y
+	depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
+
 config DYNAMIC_FTRACE_WITH_ARGS
 	def_bool y
 	depends on DYNAMIC_FTRACE
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 442438b93fe98..c0b219ab89d2f 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -125,6 +125,33 @@ struct ftrace_ops global_ops;
 void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 			  struct ftrace_ops *op, struct ftrace_regs *fregs);
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+/*
+ * Stub used to invoke the list ops without requiring a separate trampoline.
+ */
+const struct ftrace_ops ftrace_list_ops = {
+	.func	= ftrace_ops_list_func,
+	.flags	= FTRACE_OPS_FL_STUB,
+};
+
+static void ftrace_ops_nop_func(unsigned long ip, unsigned long parent_ip,
+				struct ftrace_ops *op,
+				struct ftrace_regs *fregs)
+{
+	/* do nothing */
+}
+
+/*
+ * Stub used when a call site is disabled. May be called transiently by threads
+ * which have made it into ftrace_caller but haven't yet recovered the ops at
+ * the point the call site is disabled.
+ */
+const struct ftrace_ops ftrace_nop_ops = {
+	.func	= ftrace_ops_nop_func,
+	.flags  = FTRACE_OPS_FL_STUB,
+};
+#endif
+
 static inline void ftrace_ops_init(struct ftrace_ops *ops)
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
@@ -1814,6 +1841,18 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
 			 * if rec count is zero.
 			 */
 		}
+
+		/*
+		 * If the rec has a single associated ops, and ops->func can be
+		 * called directly, allow the call site to call via the ops.
+		 */
+		if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) &&
+		    ftrace_rec_count(rec) == 1 &&
+		    ftrace_ops_get_func(ops) == ops->func)
+			rec->flags |= FTRACE_FL_CALL_OPS;
+		else
+			rec->flags &= ~FTRACE_FL_CALL_OPS;
+
 		count++;
 
 		/* Must match FTRACE_UPDATE_CALLS in ftrace_modify_all_code() */
@@ -2108,8 +2147,9 @@ void ftrace_bug(int failed, struct dyn_ftrace *rec)
 		struct ftrace_ops *ops = NULL;
 
 		pr_info("ftrace record flags: %lx\n", rec->flags);
-		pr_cont(" (%ld)%s", ftrace_rec_count(rec),
-			rec->flags & FTRACE_FL_REGS ? " R" : "  ");
+		pr_cont(" (%ld)%s%s", ftrace_rec_count(rec),
+			rec->flags & FTRACE_FL_REGS ? " R" : "  ",
+			rec->flags & FTRACE_FL_CALL_OPS ? " O" : "  ");
 		if (rec->flags & FTRACE_FL_TRAMP_EN) {
 			ops = ftrace_find_tramp_ops_any(rec);
 			if (ops) {
@@ -2177,6 +2217,7 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 		 * want the direct enabled (it will be done via the
 		 * direct helper). But if DIRECT_EN is set, and
 		 * the count is not one, we need to clear it.
+		 *
 		 */
 		if (ftrace_rec_count(rec) == 1) {
 			if (!(rec->flags & FTRACE_FL_DIRECT) !=
@@ -2185,6 +2226,19 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 		} else if (rec->flags & FTRACE_FL_DIRECT_EN) {
 			flag |= FTRACE_FL_DIRECT;
 		}
+
+		/*
+		 * Ops calls are special, as count matters.
+		 * As with direct calls, they must only be enabled when count
+		 * is one, otherwise they'll be handled via the list ops.
+		 */
+		if (ftrace_rec_count(rec) == 1) {
+			if (!(rec->flags & FTRACE_FL_CALL_OPS) !=
+			    !(rec->flags & FTRACE_FL_CALL_OPS_EN))
+				flag |= FTRACE_FL_CALL_OPS;
+		} else if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+			flag |= FTRACE_FL_CALL_OPS;
+		}
 	}
 
 	/* If the state of this record hasn't changed, then do nothing */
@@ -2229,6 +2283,21 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 					rec->flags &= ~FTRACE_FL_DIRECT_EN;
 				}
 			}
+
+			if (flag & FTRACE_FL_CALL_OPS) {
+				if (ftrace_rec_count(rec) == 1) {
+					if (rec->flags & FTRACE_FL_CALL_OPS)
+						rec->flags |= FTRACE_FL_CALL_OPS_EN;
+					else
+						rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
+				} else {
+					/*
+					 * Can only call directly if there's
+					 * only one set of associated ops.
+					 */
+					rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
+				}
+			}
 		}
 
 		/*
@@ -2258,7 +2327,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 			 * and REGS states. The _EN flags must be disabled though.
 			 */
 			rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
-					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
+					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
+					FTRACE_FL_CALL_OPS_EN);
 	}
 
 	ftrace_bug_type = FTRACE_BUG_NOP;
@@ -2431,6 +2501,25 @@ ftrace_find_tramp_ops_new(struct dyn_ftrace *rec)
 	return NULL;
 }
 
+struct ftrace_ops *
+ftrace_find_unique_ops(struct dyn_ftrace *rec)
+{
+	struct ftrace_ops *op, *found = NULL;
+	unsigned long ip = rec->ip;
+
+	do_for_each_ftrace_op(op, ftrace_ops_list) {
+
+		if (hash_contains_ip(ip, op->func_hash)) {
+			if (found)
+				return NULL;
+			found = op;
+		}
+
+	} while_for_each_ftrace_op(op);
+
+	return found;
+}
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 /* Protected by rcu_tasks for reading, and direct_mutex for writing */
 static struct ftrace_hash *direct_functions = EMPTY_HASH;
@@ -3780,11 +3869,12 @@ static int t_show(struct seq_file *m, void *v)
 	if (iter->flags & FTRACE_ITER_ENABLED) {
 		struct ftrace_ops *ops;
 
-		seq_printf(m, " (%ld)%s%s%s",
+		seq_printf(m, " (%ld)%s%s%s%s",
 			   ftrace_rec_count(rec),
 			   rec->flags & FTRACE_FL_REGS ? " R" : "  ",
 			   rec->flags & FTRACE_FL_IPMODIFY ? " I" : "  ",
-			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ");
+			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ",
+			   rec->flags & FTRACE_FL_CALL_OPS ? " O" : "  ");
 		if (rec->flags & FTRACE_FL_TRAMP_EN) {
 			ops = ftrace_find_tramp_ops_any(rec);
 			if (ops) {
@@ -3800,6 +3890,15 @@ static int t_show(struct seq_file *m, void *v)
 		} else {
 			add_trampoline_func(m, NULL, rec);
 		}
+		if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+			ops = ftrace_find_unique_ops(rec);
+			if (ops) {
+				seq_printf(m, "\tops: %pS (%pS)",
+					   ops, ops->func);
+			} else {
+				seq_puts(m, "\tops: ERROR!");
+			}
+		}
 		if (rec->flags & FTRACE_FL_DIRECT) {
 			unsigned long direct;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/8] arm64: insn: Add helpers for BTI
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (3 preceding siblings ...)
  2023-01-09 13:58 ` [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-09 13:58 ` [PATCH 6/8] arm64: patching: Add aarch64_insn_write_literal_u64() Mark Rutland
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

In subsequent patches we'd like to check whether an instruction is a
BTI. In preparation for this, add basic instruction helpers for BTI
instructions.

Per ARM DDI 0487H.a section C6.2.41, BTI is encoded in binary as
follows, MSB to LSB:

  1101 0101 000 0011 0010 0100 xx01 1111

Where the `xx` bits encode J/C/JC:

  00 : (omitted)
  01 : C
  10 : J
  11 : JC
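
For illustration, a minimal sketch of the resulting check (the hunk
below generates an equivalent aarch64_insn_is_bti() helper via
__AARCH64_INSN_FUNCS; 'insn_is_bti_sketch' is a made-up name):

	static bool insn_is_bti_sketch(u32 insn)
	{
		/* mask out the two 'xx' variant bits (bits [7:6]) */
		return (insn & 0xFFFFFF3F) == 0xD503241F;
	}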

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/insn.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index aaf1f52fbf3e0..139a88e4e8522 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -420,6 +420,7 @@ __AARCH64_INSN_FUNCS(sb,	0xFFFFFFFF, 0xD50330FF)
 __AARCH64_INSN_FUNCS(clrex,	0xFFFFF0FF, 0xD503305F)
 __AARCH64_INSN_FUNCS(ssbb,	0xFFFFFFFF, 0xD503309F)
 __AARCH64_INSN_FUNCS(pssbb,	0xFFFFFFFF, 0xD503349F)
+__AARCH64_INSN_FUNCS(bti,	0xFFFFFF3F, 0xD503241f)
 
 #undef	__AARCH64_INSN_FUNCS
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/8] arm64: patching: Add aarch64_insn_write_literal_u64()
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (4 preceding siblings ...)
  2023-01-09 13:58 ` [PATCH 5/8] arm64: insn: Add helpers for BTI Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-09 13:58 ` [PATCH 7/8] arm64: ftrace: Update stale comment Mark Rutland
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

In subsequent patches we'll need to atomically write to a
naturally-aligned 64-bit literal embedded within the kernel text.

Add a helper for this. For consistency with other text patching code we
use copy_to_kernel_nofault(), which is atomic for naturally-aligned
accesses up to 64-bits.
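
A sketch of the intended usage (illustrative only; the real caller is
added in the final patch of this series, and 'literal' / 'ops' are
placeholder names):

	u64 *literal;		/* naturally-aligned literal in kernel text */
	struct ftrace_ops *ops;	/* value to publish atomically */

	int ret = aarch64_insn_write_literal_u64((void *)literal,
						 (unsigned long)ops);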

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/patching.h |  2 ++
 arch/arm64/kernel/patching.c      | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/arch/arm64/include/asm/patching.h b/arch/arm64/include/asm/patching.h
index 6bf5adc562950..68908b82b168f 100644
--- a/arch/arm64/include/asm/patching.h
+++ b/arch/arm64/include/asm/patching.h
@@ -7,6 +7,8 @@
 int aarch64_insn_read(void *addr, u32 *insnp);
 int aarch64_insn_write(void *addr, u32 insn);
 
+int aarch64_insn_write_literal_u64(void *addr, u64 val);
+
 int aarch64_insn_patch_text_nosync(void *addr, u32 insn);
 int aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt);
 
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 33e0fabc0b79b..b4835f6d594bc 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -88,6 +88,23 @@ int __kprobes aarch64_insn_write(void *addr, u32 insn)
 	return __aarch64_insn_write(addr, cpu_to_le32(insn));
 }
 
+noinstr int aarch64_insn_write_literal_u64(void *addr, u64 val)
+{
+	u64 *waddr;
+	unsigned long flags;
+	int ret;
+
+	raw_spin_lock_irqsave(&patch_lock, flags);
+	waddr = patch_map(addr, FIX_TEXT_POKE0);
+
+	ret = copy_to_kernel_nofault(waddr, &val, sizeof(val));
+
+	patch_unmap(FIX_TEXT_POKE0);
+	raw_spin_unlock_irqrestore(&patch_lock, flags);
+
+	return ret;
+}
+
 int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
 {
 	u32 *tp = addr;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/8] arm64: ftrace: Update stale comment
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (5 preceding siblings ...)
  2023-01-09 13:58 ` [PATCH 6/8] arm64: patching: Add aarch64_insn_write_literal_u64() Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-09 13:58 ` [PATCH 8/8] arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
  2023-01-10  8:55 ` [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS David Laight
  8 siblings, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

In commit:

  26299b3f6ba26bfc ("ftrace: arm64: move from REGS to ARGS")

... we folded ftrace_regs_entry into ftrace_caller, and
ftrace_regs_entry no longer exists.

Update the comment accordingly.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/ftrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index b30b955a89211..38ebdf063255b 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -209,7 +209,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
  * | NOP      | MOV X9, LR | MOV X9, LR |
  * | NOP      | NOP        | BL <entry> |
  *
- * The LR value will be recovered by ftrace_regs_entry, and restored into LR
+ * The LR value will be recovered by ftrace_caller, and restored into LR
  * before returning to the regular function prologue. When a function is not
  * being traced, the MOV is not harmful given x9 is not live per the AAPCS.
  *
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 8/8] arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (6 preceding siblings ...)
  2023-01-09 13:58 ` [PATCH 7/8] arm64: ftrace: Update stale comment Mark Rutland
@ 2023-01-09 13:58 ` Mark Rutland
  2023-01-10  8:55 ` [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS David Laight
  8 siblings, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 13:58 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mark.rutland,
	mhiramat, ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64.
This allows each ftrace callsite to provide an ftrace_ops to the common
ftrace trampoline, allowing each callsite to invoke distinct tracer
functions without the need to fall back to list processing or to
allocate custom trampolines for each callsite. This significantly speeds
up cases where multiple distinct trace functions are used and callsites
are mostly traced by a single tracer.

The main idea is to place a pointer to the ftrace_ops as a literal at a
fixed offset from the function entry point, which can be recovered by
the common ftrace trampoline. Using a 64-bit literal avoids branch range
limitations, and permits the ops to be swapped atomically without
special considerations that apply to code-patching. In future this will
also allow for the implementation of DYNAMIC_FTRACE_WITH_DIRECT_CALLS
without branch range limitations by using additional fields in struct
ftrace_ops.
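
For example (assuming the callsite layout documented in
ftrace_call_adjust() in the diff below, with the 8-byte-aligned literal
at 'addr' and rec->ip adjusted to point at the to-be-patched BL), the
literal can be located from the ftrace record alone:

	ALIGN_DOWN(rec->ip - 12, 8)
		== ALIGN_DOWN((addr + 12) - 12, 8) == addr	/* no BTI   */
		== ALIGN_DOWN((addr + 16) - 12, 8) == addr	/* with BTI */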

As noted in the core patch adding support for
DYNAMIC_FTRACE_WITH_CALL_OPS, this approach allows for directly invoking
ftrace_ops::func even for ftrace_ops which are dynamically-allocated (or
part of a module), without going via ftrace_ops_list_func.

Currently, this approach is not compatible with CLANG_CFI, as the
presence/absence of pre-function NOPs changes the offset of the
pre-function type hash, and there's no existing mechanism to ensure a
consistent offset for instrumented and uninstrumented functions. When
CLANG_CFI is enabled, the existing scheme with a global ops->func
pointer is used, and there should be no functional change. I am
currently working with others to allow the two to work together in
future (though this will likely require updated compiler support).

I've benchmarked this with the ftrace_ops sample module [1], which is
not currently upstream, but available at:

  https://lore.kernel.org/lkml/20230103124912.2948963-1-mark.rutland@arm.com
  git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git ftrace-ops-sample-20230109

Using that module I measured the total time taken for 100,000 calls to a
trivial instrumented function, with a number of tracers enabled with
relevant filters (which would apply to the instrumented function) and a
number of tracers enabled with irrelevant filters (which would not apply
to the instrumented function). I tested on an M1 MacBook Pro, running
under a HVF-accelerated QEMU VM (i.e. on real hardware).

Before this patch:

  Number of tracers     || Total time  | Per-call average time (ns)
  Relevant | Irrelevant || (ns)        | Total        | Overhead
  =========+============++=============+==============+============
         0 |          0 ||      94,583 |         0.95 |           -
         0 |          1 ||      93,709 |         0.94 |           -
         0 |          2 ||      93,666 |         0.94 |           -
         0 |         10 ||      93,709 |         0.94 |           -
         0 |        100 ||      93,792 |         0.94 |           -
  ---------+------------++-------------+--------------+------------
         1 |          1 ||   6,467,833 |        64.68 |       63.73
         1 |          2 ||   7,509,708 |        75.10 |       74.15
         1 |         10 ||  23,786,792 |       237.87 |      236.92
         1 |        100 || 106,432,500 |     1,064.43 |     1063.38
  ---------+------------++-------------+--------------+------------
         1 |          0 ||   1,431,875 |        14.32 |       13.37
         2 |          0 ||   6,456,334 |        64.56 |       63.62
        10 |          0 ||  22,717,000 |       227.17 |      226.22
       100 |          0 || 103,293,667 |      1032.94 |     1031.99
  ---------+------------++-------------+--------------+------------

  Note: per-call overhead is estimated relative to the baseline case
  with 0 relevant tracers and 0 irrelevant tracers.

After this patch:

  Number of tracers     || Total time  | Per-call average time (ns)
  Relevant | Irrelevant || (ns)        | Total        | Overhead
  =========+============++=============+==============+============
         0 |          0 ||      94,541 |         0.95 |           -
         0 |          1 ||      93,666 |         0.94 |           -
         0 |          2 ||      93,709 |         0.94 |           -
         0 |         10 ||      93,667 |         0.94 |           -
         0 |        100 ||      93,792 |         0.94 |           -
  ---------+------------++-------------+--------------+------------
         1 |          1 ||     281,000 |         2.81 |        1.86
         1 |          2 ||     281,042 |         2.81 |        1.87
         1 |         10 ||     280,958 |         2.81 |        1.86
         1 |        100 ||     281,250 |         2.81 |        1.87
  ---------+------------++-------------+--------------+------------
         1 |          0 ||     280,959 |         2.81 |        1.86
         2 |          0 ||   6,502,708 |        65.03 |       64.08
        10 |          0 ||  18,681,209 |       186.81 |      185.87
       100 |          0 || 103,550,458 |     1,035.50 |     1034.56
  ---------+------------++-------------+--------------+------------

  Note: per-call overhead is estimated relative to the baseline case
  with 0 relevant tracers and 0 irrelevant tracers.

As can be seen from the above:

a) Whenever there is a single relevant tracer function associated with a
   tracee, the overhead of invoking the tracer is constant, and does not
   scale with the number of tracers which are *not* associated with that
   tracee.

b) The overhead for a single relevant tracer has dropped to ~1/7 of the
   overhead prior to this series (from 13.37ns to 1.86ns). This is
   largely due to permitting calls to dynamically-allocated ftrace_ops
   without going through ftrace_ops_list_func.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/Kconfig               |   3 +
 arch/arm64/Makefile              |   5 +-
 arch/arm64/include/asm/ftrace.h  |  15 +--
 arch/arm64/kernel/asm-offsets.c  |   4 +
 arch/arm64/kernel/entry-ftrace.S |  32 ++++++-
 arch/arm64/kernel/ftrace.c       | 156 +++++++++++++++++++++++++++++++
 6 files changed, 195 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 03934808b2ed0..7838f568fa158 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -123,6 +123,7 @@ config ARM64
 	select DMA_DIRECT_REMAP
 	select EDAC_SUPPORT
 	select FRAME_POINTER
+	select FUNCTION_ALIGNMENT_8B if DYNAMIC_FTRACE_WITH_CALL_OPS
 	select GENERIC_ALLOCATOR
 	select GENERIC_ARCH_TOPOLOGY
 	select GENERIC_CLOCKEVENTS_BROADCAST
@@ -186,6 +187,8 @@ config ARM64
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_ARGS \
 		if $(cc-option,-fpatchable-function-entry=2)
+	select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS \
+		if (DYNAMIC_FTRACE_WITH_ARGS && !CFI_CLANG)
 	select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
 		if DYNAMIC_FTRACE_WITH_ARGS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index d62bd221828f7..4c3be442fbb35 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -139,7 +139,10 @@ endif
 
 CHECKFLAGS	+= -D__aarch64__
 
-ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
+ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y)
+  KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
+  CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
+else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
   KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
   CC_FLAGS_FTRACE := -fpatchable-function-entry=2
 endif
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 5664729800ae1..1c2672bbbf379 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -62,20 +62,7 @@ extern unsigned long ftrace_graph_call;
 
 extern void return_to_handler(void);
 
-static inline unsigned long ftrace_call_adjust(unsigned long addr)
-{
-	/*
-	 * Adjust addr to point at the BL in the callsite.
-	 * See ftrace_init_nop() for the callsite sequence.
-	 */
-	if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
-		return addr + AARCH64_INSN_SIZE;
-	/*
-	 * addr is the address of the mcount call instruction.
-	 * recordmcount does the necessary offset calculation.
-	 */
-	return addr;
-}
+unsigned long ftrace_call_adjust(unsigned long addr);
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
 struct dyn_ftrace;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 2234624536d95..ae345b06e9f7e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -9,6 +9,7 @@
 
 #include <linux/arm_sdei.h>
 #include <linux/sched.h>
+#include <linux/ftrace.h>
 #include <linux/kexec.h>
 #include <linux/mm.h>
 #include <linux/dma-mapping.h>
@@ -193,6 +194,9 @@ int main(void)
   DEFINE(KIMAGE_HEAD,			offsetof(struct kimage, head));
   DEFINE(KIMAGE_START,			offsetof(struct kimage, start));
   BLANK();
+#endif
+#ifdef CONFIG_FUNCTION_TRACER
+  DEFINE(FTRACE_OPS_FUNC,		offsetof(struct ftrace_ops, func));
 #endif
   return 0;
 }
diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index 3b625f76ffbae..350ed81324ace 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -65,13 +65,35 @@ SYM_CODE_START(ftrace_caller)
 	stp	x29, x30, [sp, #FREGS_SIZE]
 	add	x29, sp, #FREGS_SIZE
 
-	sub	x0, x30, #AARCH64_INSN_SIZE	// ip (callsite's BL insn)
-	mov	x1, x9				// parent_ip (callsite's LR)
-	ldr_l	x2, function_trace_op		// op
-	mov	x3, sp				// regs
+	/* Prepare arguments for the tracer func */
+	sub	x0, x30, #AARCH64_INSN_SIZE		// ip (callsite's BL insn)
+	mov	x1, x9					// parent_ip (callsite's LR)
+	mov	x3, sp					// regs
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+	/*
+	 * The literal pointer to the ops is at an 8-byte aligned boundary
+	 * which is either 12 or 16 bytes before the BL instruction in the call
+	 * site. See ftrace_call_adjust() for details.
+	 *
+	 * Therefore here the LR points at `literal + 16` or `literal + 20`,
+	 * and we can find the address of the literal in either case by
+	 * aligning to an 8-byte boundary and subtracting 16. We do the
+	 * alignment first as this allows us to fold the subtraction into the
+	 * LDR.
+	 */
+	bic	x2, x30, 0x7
+	ldr	x2, [x2, #-16]				// op
+
+	ldr	x4, [x2, #FTRACE_OPS_FUNC]		// op->func
+	blr	x4					// op->func(ip, parent_ip, op, regs)
+
+#else
+	ldr_l   x2, function_trace_op			// op
 
 SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
-	bl	ftrace_stub
+	bl      ftrace_stub				// func(ip, parent_ip, op, regs)
+#endif
 
 /*
  * At the callsite x0-x8 and x19-x30 were live. Any C code will have preserved
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 38ebdf063255b..5545fe1a90125 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -60,6 +60,89 @@ int ftrace_regs_query_register_offset(const char *name)
 }
 #endif
 
+unsigned long ftrace_call_adjust(unsigned long addr)
+{
+	/*
+	 * When using mcount, addr is the address of the mcount call
+	 * instruction, and no adjustment is necessary.
+	 */
+	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
+		return addr;
+
+	/*
+	 * When using patchable-function-entry without pre-function NOPS, addr
+	 * is the address of the first NOP after the function entry point.
+	 *
+	 * The compiler has either generated:
+	 *
+	 * addr+00:	func:	NOP		// To be patched to MOV X9, LR
+	 * addr+04:		NOP		// To be patched to BL <caller>
+	 *
+	 * Or:
+	 *
+	 * addr-04:		BTI	C
+	 * addr+00:	func:	NOP		// To be patched to MOV X9, LR
+	 * addr+04:		NOP		// To be patched to BL <caller>
+	 *
+	 * We must adjust addr to the address of the NOP which will be patched
+	 * to `BL <caller>`, which is at `addr + 4` bytes in either case.
+	 *
+	 */
+	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
+		return addr + AARCH64_INSN_SIZE;
+
+	/*
+	 * When using patchable-function-entry with pre-function NOPs, addr is
+	 * the address of the first pre-function NOP.
+	 *
+	 * Starting from an 8-byte aligned base, the compiler has either
+	 * generated:
+	 *
+	 * addr+00:		NOP		// Literal (first 32 bits)
+	 * addr+04:		NOP		// Literal (last 32 bits)
+	 * addr+08:	func:	NOP		// To be patched to MOV X9, LR
+	 * addr+12:		NOP		// To be patched to BL <caller>
+	 *
+	 * Or:
+	 *
+	 * addr+00:		NOP		// Literal (first 32 bits)
+	 * addr+04:		NOP		// Literal (last 32 bits)
+	 * addr+08:	func:	BTI	C
+	 * addr+12:		NOP		// To be patched to MOV X9, LR
+	 * addr+16:		NOP		// To be patched to BL <caller>
+	 *
+	 * We must adjust addr to the address of the NOP which will be patched
+	 * to `BL <caller>`, which is at either addr+12 or addr+16 depending on
+	 * whether there is a BTI.
+	 */
+
+	if (!IS_ALIGNED(addr, sizeof(unsigned long))) {
+		WARN_RATELIMIT(1, "Misaligned patch-site %pS\n",
+			       (void *)(addr + 8));
+		return 0;
+	}
+
+	/* Skip the NOPs placed before the function entry point */
+	addr += 2 * AARCH64_INSN_SIZE;
+
+	/* Skip any BTI */
+	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) {
+		u32 insn = le32_to_cpu(*(__le32 *)addr);
+
+		if (aarch64_insn_is_bti(insn)) {
+			addr += AARCH64_INSN_SIZE;
+		} else if (insn != aarch64_insn_gen_nop()) {
+			WARN_RATELIMIT(1, "unexpected insn in patch-site %pS: 0x%08x\n",
+				       (void *)addr, insn);
+		}
+	}
+
+	/* Skip the first NOP after function entry */
+	addr += AARCH64_INSN_SIZE;
+
+	return addr;
+}
+
 /*
  * Replace a single instruction, which may be a branch or NOP.
  * If @validate == true, a replaced instruction is checked against 'old'.
@@ -98,6 +181,13 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned long pc;
 	u32 new;
 
+	/*
+	 * When using CALL_OPS, the function to call is associated with the
+	 * call site, and we don't have a global function pointer to update.
+	 */
+	if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
+		return 0;
+
 	pc = (unsigned long)ftrace_call;
 	new = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
 					  AARCH64_INSN_BRANCH_LINK);
@@ -176,6 +266,44 @@ static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
 	return true;
 }
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+static const struct ftrace_ops *arm64_rec_get_ops(struct dyn_ftrace *rec)
+{
+	const struct ftrace_ops *ops = NULL;
+
+	if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+		ops = ftrace_find_unique_ops(rec);
+		WARN_ON_ONCE(!ops);
+	}
+
+	if (!ops)
+		ops = &ftrace_list_ops;
+
+	return ops;
+}
+
+static int ftrace_rec_set_ops(const struct dyn_ftrace *rec,
+			      const struct ftrace_ops *ops)
+{
+	unsigned long literal = ALIGN_DOWN(rec->ip - 12, 8);
+	return aarch64_insn_write_literal_u64((void *)literal,
+					      (unsigned long)ops);
+}
+
+static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec)
+{
+	return ftrace_rec_set_ops(rec, &ftrace_nop_ops);
+}
+
+static int ftrace_rec_update_ops(struct dyn_ftrace *rec)
+{
+	return ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec));
+}
+#else
+static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec) { return 0; }
+static int ftrace_rec_update_ops(struct dyn_ftrace *rec) { return 0; }
+#endif
+
 /*
  * Turn on the call to ftrace_caller() in instrumented function
  */
@@ -183,6 +311,11 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 {
 	unsigned long pc = rec->ip;
 	u32 old, new;
+	int ret;
+
+	ret = ftrace_rec_update_ops(rec);
+	if (ret)
+		return ret;
 
 	if (!ftrace_find_callable_addr(rec, NULL, &addr))
 		return -EINVAL;
@@ -193,6 +326,19 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 	return ftrace_modify_code(pc, old, new, true);
 }
 
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
+		       unsigned long addr)
+{
+	if (WARN_ON_ONCE(old_addr != (unsigned long)ftrace_caller))
+		return -EINVAL;
+	if (WARN_ON_ONCE(addr != (unsigned long)ftrace_caller))
+		return -EINVAL;
+
+	return ftrace_rec_update_ops(rec);
+}
+#endif
+
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
 /*
  * The compiler has inserted two NOPs before the regular function prologue.
@@ -220,6 +366,11 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
 {
 	unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
 	u32 old, new;
+	int ret;
+
+	ret = ftrace_rec_set_nop_ops(rec);
+	if (ret)
+		return ret;
 
 	old = aarch64_insn_gen_nop();
 	new = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
@@ -237,9 +388,14 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
 {
 	unsigned long pc = rec->ip;
 	u32 old = 0, new;
+	int ret;
 
 	new = aarch64_insn_gen_nop();
 
+	ret = ftrace_rec_set_nop_ops(rec);
+	if (ret)
+		return ret;
+
 	/*
 	 * When using mcount, callsites in modules may have been initalized to
 	 * call an arbitrary module PLT (which redirects to the _mcount stub)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-09 13:58 ` [PATCH 1/8] Compiler attributes: GCC function alignment workarounds Mark Rutland
@ 2023-01-09 14:43   ` Miguel Ojeda
  2023-01-09 17:06     ` Mark Rutland
  2023-01-11 18:27     ` Mark Rutland
  0 siblings, 2 replies; 26+ messages in thread
From: Miguel Ojeda @ 2023-01-09 14:43 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> As far as I can tell, GCC doesn't respect '-falign-functions=N':
>
> * When the __weak__ attribute is used
>
>   GCC seems to forget the alignment specified by '-falign-functions=N',
>   but will respect the '__aligned__(N)' function attribute. Thus, we can
>   work around this by explciitly setting the alignment for weak
>   functions.
>
> * When the __cold__ attribute is used
>
>   GCC seems to forget the alignment specified by '-falign-functions=N',
>   and also doesn't seem to respect the '__aligned__(N)' function
>   attribute. The only way to work around this is to not use the __cold__
>   attibute.

If you happen to have a reduced case, then it would be nice to link it
in the commit. A bug report to GCC would also be nice.

I gave it a very quick try in Compiler Explorer, but I couldn't
reproduce it, so I guess it depends on flags, non-trivial functions or
something else.

> + * '-falign-functions=N', and require alignment to be specificed via a function

Nit: specificed -> specified

> +#if CONFIG_FUNCTION_ALIGNMENT > 0
> +#define __function_aligned             __aligned(CONFIG_FUNCTION_ALIGNMENT)
> +#else
> +#define __function_aligned
> +#endif

Currently, the file is intended for attributes that do not depend on
`CONFIG_*` options.

What I usually mention is that we could change that policy, but
otherwise these would go into e.g. `compiler_types.h`.

> +#if !defined(CONFIG_CC_IS_GCC) || (CONFIG_FUNCTION_ALIGNMENT == 0)
>  #define __cold                          __attribute__((__cold__))
> +#else
> +#define __cold
> +#endif

Similarly, in this case this could go into `compiler-gcc.h` /
`compiler-clang.h` etc., since the definition will be different for
each.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-09 14:43   ` Miguel Ojeda
@ 2023-01-09 17:06     ` Mark Rutland
  2023-01-09 22:35       ` Miguel Ojeda
  2023-01-11 18:27     ` Mark Rutland
  1 sibling, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-09 17:06 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> >
> > * When the __weak__ attribute is used
> >
> >   GCC seems to forget the alignment specified by '-falign-functions=N',
> >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> >   work around this by explciitly setting the alignment for weak

Whoops: s/explciitly/explicitly/ here too; I'll go re-proofread the series.

> >   functions.
> >
> > * When the __cold__ attribute is used
> >
> >   GCC seems to forget the alignment specified by '-falign-functions=N',
> >   and also doesn't seem to respect the '__aligned__(N)' function
> >   attribute. The only way to work around this is to not use the __cold__
> >   attibute.

Whoops: s/attibute/attribute/

> If you happen to have a reduced case, then it would be nice to link it
> in the commit. A bug report to GCC would also be nice.
> 
> I gave it a very quick try in Compiler Explorer, but I couldn't
> reproduce it, so I guess it depends on flags, non-trivial functions or
> something else.

Sorry, that is something I had intended to do but I hadn't extracted a
reproducer yet. I'll try to come up with something that can be included in the
commit message and reported to GCC folk (and double-check at the same time that
there's not another hidden cause).

With this series applied and this patch reverted, the problem is visible when
building defconfig + CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y, where scanning
/proc/kallsyms with:

  $ grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] '

... will show a bunch of cold functions (and their callees/callers), largely
init/exit functions (so I'll double-check whether section handling has an
effect), e.g.

  ffffdf08be173b8c t snd_soc_exit
  ffffdf08be173bc4 t apple_mca_driver_exit
  ffffdf08be173be8 t failover_exit
  ffffdf08be173c10 t inet_diag_exit
  ffffdf08be173c60 t tcp_diag_exit
  ffffdf08be173c84 t cubictcp_unregister
  ffffdf08be173cac t af_unix_exit
  ffffdf08be173cf4 t packet_exit
  ffffdf08be173d3c t cleanup_sunrpc
  ffffdf08be173d8c t exit_rpcsec_gss
  ffffdf08be173dc4 t exit_p9
  ffffdf08be173dec T p9_client_exit
  ffffdf08be173e10 t p9_trans_fd_exit
  ffffdf08be173e58 t p9_virtio_cleanup
  ffffdf08be173e90 t exit_dns_resolver

> > + * '-falign-functions=N', and require alignment to be specificed via a function
> 
> Nit: specificed -> specified

Thanks, fixed

> > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > +#define __function_aligned             __aligned(CONFIG_FUNCTION_ALIGNMENT)
> > +#else
> > +#define __function_aligned
> > +#endif
> 
> Currently, the file is intended for attributes that do not depend on
> `CONFIG_*` options.
> 
> What I usually mention is that we could change that policy, but
> otherwise these would go into e.g. `compiler_types.h`.

I'm happy to move these; I just wasn't sure what the policy would be w.r.t. the
existing __weak and __cold definitions, since those end up depending upon
__function_aligned.

I assume I should move them all? i.e. move __weak as well?

> > +#if !defined(CONFIG_CC_IS_GCC) || (CONFIG_FUNCTION_ALIGNMENT == 0)
> >  #define __cold                          __attribute__((__cold__))
> > +#else
> > +#define __cold
> > +#endif
> 
> Similarly, in this case this could go into `compiler-gcc.h` /
> `compiler-clang.h` etc., since the definition will be different for
> each.

Sure, can do.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-09 17:06     ` Mark Rutland
@ 2023-01-09 22:35       ` Miguel Ojeda
  0 siblings, 0 replies; 26+ messages in thread
From: Miguel Ojeda @ 2023-01-09 22:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Mon, Jan 9, 2023 at 6:06 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Sorry, that is something I had intended to do but I hadn't extracted a
> reproducer yet. I'll try to come up with something that can be included in the
> commit message and reported to GCC folk (and double-check at the same time that
> there's not another hidden cause).

Yeah, no worries :) I suggested it because from my quick test it
didn't appear to be reproducible trivially, so I thought having the
reproducer would be nice.

> I'm happy to move these; I just wasn't sure what the policy would be w.r.t. the
> existing __weak and __cold definitions, since those end up depending upon
> __function_aligned.
>
> I assume I should move them all? i.e. move __weak as well?

Yeah, with the current policy, all should be moved since their
behavior now depends on the config (eventually).

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
                   ` (7 preceding siblings ...)
  2023-01-09 13:58 ` [PATCH 8/8] arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
@ 2023-01-10  8:55 ` David Laight
  2023-01-10 10:31   ` Mark Rutland
  8 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2023-01-10  8:55 UTC (permalink / raw)
  To: 'Mark Rutland', linux-arm-kernel
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mhiramat,
	ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will

From: Mark Rutland
> Sent: 09 January 2023 13:58
> 
> This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
> enables support for this on arm64. This significantly reduces the
> overhead of tracing when a callsite/tracee has a single associated
> tracer, avoids a number of issues that make it undesireably and
> infeasible to use dynamically-allocated trampolines (e.g. branch range
> limitations), and makes it possible to implement support for
> DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.
> 
> The main idea is to give each ftrace callsite an associated pointer to
> an ftrace_ops. The architecture's ftrace_caller trampoline can recover
> the ops pointer and invoke ops->func from this without needing to use
> ftrace_ops_list_func, which has to iterate through all registered ops.
> 
> To do this, we use -fpatchable-function-entry=M,N, there N NOPs are
> placed before the function entry point...

Doesn't this bump the minimum gcc version up to something like 9.0 ?

How does it interact with the 'CFI stuff' that also uses the same area?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-10  8:55 ` [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS David Laight
@ 2023-01-10 10:31   ` Mark Rutland
  0 siblings, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-10 10:31 UTC (permalink / raw)
  To: David Laight
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Tue, Jan 10, 2023 at 08:55:58AM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 09 January 2023 13:58
> > 
> > This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
> > enables support for this on arm64. This significantly reduces the
> > overhead of tracing when a callsite/tracee has a single associated
> > tracer, avoids a number of issues that make it undesireably and
> > infeasible to use dynamically-allocated trampolines (e.g. branch range
> > limitations), and makes it possible to implement support for
> > DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.
> > 
> > The main idea is to give each ftrace callsite an associated pointer to
> > an ftrace_ops. The architecture's ftrace_caller trampoline can recover
> > the ops pointer and invoke ops->func from this without needing to use
> > ftrace_ops_list_func, which has to iterate through all registered ops.
> > 
> > To do this, we use -fpatchable-function-entry=M,N, there N NOPs are
> > placed before the function entry point...
> 
> Doesn't this bump the minimum gcc version up to something like 9.0 ?

This doesn't bump the minimum GCC version, but users of older toolchains
won't get the speedup.

We already support -fpatchable-function-entry based ftrace with GCC 8+ (and
this is necessary to play nicely with pointer authentication); for older GCC
versions we still support using -pg / mcount.
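
For context on how the NOPs end up being used: with this series, the
pre-function NOPs are patched to hold a pointer to a struct ftrace_ops, and
the common ftrace_caller trampoline recovers that pointer and calls
ops->func directly instead of going via ftrace_ops_list_func(). Very
roughly, in C terms (the real trampoline is assembly; this is just an
illustrative sketch, with 'ip' being the patched call site, i.e. rec->ip,
so the literal sits at ALIGN_DOWN(ip - 12, 8) as in ftrace_rec_set_ops()
in the arm64 patch):

	static void ftrace_caller_sketch(unsigned long ip,
					 unsigned long parent_ip,
					 struct ftrace_regs *fregs)
	{
		struct ftrace_ops *ops;

		/* Load the ops pointer from the pre-function literal ... */
		ops = *(struct ftrace_ops **)ALIGN_DOWN(ip - 12, 8);

		/* ... and invoke it directly; no list iteration needed. */
		ops->func(ip, parent_ip, ops, fregs);
	}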

> How does it interact with the 'CFI stuff' that also uses the same area?

There's some more detail in patch 8, but the summary is that they're mutually
exclusive for now (enforced by Kconfig), and I'm working with others to get the
improved compiler support necessary for them to play nicely together.

Currently LLVM will place the type-hash before the pre-function NOPs, which
works if everything has pre-function NOPs, but doesn't work for calls between
instrumented and non-instrumented functions, since the latter don't have
pre-function NOPs and the type hash is placed at a different offset. To make
them work better together we'll need some improved compiler support, and I'm
working with others for that currently.
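
To illustrate (rough sketch only; assuming the series ends up using
something like -fpatchable-function-entry=4,2, i.e. two NOPs before the
entry point and two at it), an instrumented function's entry looks roughly
like:

	func - 8:  8-byte literal holding the ftrace_ops pointer
	           (KCFI would currently place the type hash above this,
	            before the pre-function NOPs)
	func + 0:  NOP    (patched to MOV X9, X30)
	func + 4:  NOP    (patched to BL ftrace_caller)   <- rec->ip
	func + 8:  regular function prologue

... whereas a non-instrumented function has its type hash immediately
before its first instruction, so a fixed-offset check from the branch
target can't cover both cases.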

GCC doesn't currently have KCFI support, but the plan is to match whatever LLVM
does.

Atop that we'll need some trivial changes to the asm function macros, but
without the underlying compiler support there's not much point.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] ACPI: Don't build ACPICA with '-Os'
  2023-01-09 13:58 ` [PATCH 2/8] ACPI: Don't build ACPICA with '-Os' Mark Rutland
@ 2023-01-10 13:45   ` Rafael J. Wysocki
  0 siblings, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2023-01-10 13:45 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, Len Brown, Rafael J . Wysocki, Robert Moore,
	catalin.marinas, linux-acpi, linux-kernel, mhiramat,
	ndesaulniers, ojeda, peterz, revest, rostedt, will

On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> The ACPICA code has been built with '-Os' since the beginning of git
> history, though there's no explanatory comment as to why.
>
> This is unfortunate as building with '-Os' overrides -falign-functions,
> which prevents CONFIG_FUNCTION_ALIGNMENT and
> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B from having their expected effect
> on the ACPICA code. This is doubly unfortunate as in subsequent patches
> arm64 will depend upon CONFIG_FUNCTION_ALIGNMENT for its ftrace
> implementation.
>
> Drop the '-Os' flag when building the ACPICA code. With this removed,
> the code builds cleanly and works correctly in testing so far.

Fair enough.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

and please feel free to route this through the arch tree along with
the rest of the series.

Thanks!

> I've tested this by selecting CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
> building and booting a kernel using ACPI, and looking for misaligned
> text symbols:
>
> * arm64:
>
>   Before:
>     #  grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
>     908
>     #  grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
>     576
>
>   After:
>     # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
>     322
>     # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
>     0
>
> * x86_64:
>
>   Before:
>     # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
>     2057
>     # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
>     706
>
>   After:
>     # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
>     1351
>     # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
>     0
>
> With the patch applied, the remaining unaligned text labels are a
> combination of static call trampolines and labels in assembly, which
> will be dealt with in subsequent patches.
>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Florent Revest <revest@chromium.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Robert Moore <robert.moore@intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-acpi@vger.kernel.org
> ---
>  drivers/acpi/acpica/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/acpica/Makefile b/drivers/acpi/acpica/Makefile
> index 9e0d95d76fff7..30f3fc13c29d1 100644
> --- a/drivers/acpi/acpica/Makefile
> +++ b/drivers/acpi/acpica/Makefile
> @@ -3,7 +3,7 @@
>  # Makefile for ACPICA Core interpreter
>  #
>
> -ccflags-y                      := -Os -D_LINUX -DBUILDING_ACPICA
> +ccflags-y                      := -D_LINUX -DBUILDING_ACPICA
>  ccflags-$(CONFIG_ACPI_DEBUG)   += -DACPI_DEBUG_OUTPUT
>
>  # use acpi.o to put all files here into acpi.o modparam namespace
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  2023-01-09 13:58 ` [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT Mark Rutland
@ 2023-01-10 20:35   ` Peter Zijlstra
  2023-01-10 20:43     ` Will Deacon
  2023-01-11 11:36     ` Mark Rutland
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2023-01-10 20:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, rafael.j.wysocki,
	revest, robert.moore, rostedt, will

On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:

> diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> index 1436fa1cde24d..df18a3446ce82 100644
> --- a/arch/arm64/include/asm/linkage.h
> +++ b/arch/arm64/include/asm/linkage.h
> @@ -5,8 +5,14 @@
>  #include <asm/assembler.h>
>  #endif
>  
> -#define __ALIGN		.align 2
> -#define __ALIGN_STR	".align 2"
> +#if CONFIG_FUNCTION_ALIGNMENT > 0
> +#define ARM64_FUNCTION_ALIGNMENT	CONFIG_FUNCTION_ALIGNMENT
> +#else
> +#define ARM64_FUNCTION_ALIGNMENT	4
> +#endif
> +
> +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT

Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
and simply removing all these lines and relying on the default
behaviour?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  2023-01-10 20:35   ` Peter Zijlstra
@ 2023-01-10 20:43     ` Will Deacon
  2023-01-11 11:39       ` Mark Rutland
  2023-01-11 11:36     ` Mark Rutland
  1 sibling, 1 reply; 26+ messages in thread
From: Will Deacon @ 2023-01-10 20:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mark Rutland, linux-arm-kernel, catalin.marinas, lenb,
	linux-acpi, linux-kernel, mhiramat, ndesaulniers, ojeda,
	rafael.j.wysocki, revest, robert.moore, rostedt

On Tue, Jan 10, 2023 at 09:35:18PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:
> 
> > diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> > index 1436fa1cde24d..df18a3446ce82 100644
> > --- a/arch/arm64/include/asm/linkage.h
> > +++ b/arch/arm64/include/asm/linkage.h
> > @@ -5,8 +5,14 @@
> >  #include <asm/assembler.h>
> >  #endif
> >  
> > -#define __ALIGN		.align 2
> > -#define __ALIGN_STR	".align 2"
> > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > +#define ARM64_FUNCTION_ALIGNMENT	CONFIG_FUNCTION_ALIGNMENT
> > +#else
> > +#define ARM64_FUNCTION_ALIGNMENT	4
> > +#endif
> > +
> > +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> > +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
> 
> Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
> and simply removing all these lines and relying on the default
> behaviour?

There's a proposal (with some rough performance claims) to select
FUNCTION_ALIGNMENT_16B over at:

https://lore.kernel.org/r/20221208053649.540891-1-almasrymina@google.com

so we could just go with that?

Will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  2023-01-10 20:35   ` Peter Zijlstra
  2023-01-10 20:43     ` Will Deacon
@ 2023-01-11 11:36     ` Mark Rutland
  1 sibling, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-11 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, rafael.j.wysocki,
	revest, robert.moore, rostedt, will

On Tue, Jan 10, 2023 at 09:35:18PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:
> 
> > diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> > index 1436fa1cde24d..df18a3446ce82 100644
> > --- a/arch/arm64/include/asm/linkage.h
> > +++ b/arch/arm64/include/asm/linkage.h
> > @@ -5,8 +5,14 @@
> >  #include <asm/assembler.h>
> >  #endif
> >  
> > -#define __ALIGN		.align 2
> > -#define __ALIGN_STR	".align 2"
> > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > +#define ARM64_FUNCTION_ALIGNMENT	CONFIG_FUNCTION_ALIGNMENT
> > +#else
> > +#define ARM64_FUNCTION_ALIGNMENT	4
> > +#endif
> > +
> > +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> > +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
> 
> Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
> and simply removing all these lines and relying on the default
> behaviour?

Yes, it is.

I was focussed on obviously retaining the existing semantics by default, and I
missed that this was possible by selecting FUNCTION_ALIGNMENT_4B.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
  2023-01-10 20:43     ` Will Deacon
@ 2023-01-11 11:39       ` Mark Rutland
  0 siblings, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2023-01-11 11:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-arm-kernel, catalin.marinas, lenb,
	linux-acpi, linux-kernel, mhiramat, ndesaulniers, ojeda,
	rafael.j.wysocki, revest, robert.moore, rostedt

On Tue, Jan 10, 2023 at 08:43:20PM +0000, Will Deacon wrote:
> On Tue, Jan 10, 2023 at 09:35:18PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:
> > 
> > > diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> > > index 1436fa1cde24d..df18a3446ce82 100644
> > > --- a/arch/arm64/include/asm/linkage.h
> > > +++ b/arch/arm64/include/asm/linkage.h
> > > @@ -5,8 +5,14 @@
> > >  #include <asm/assembler.h>
> > >  #endif
> > >  
> > > -#define __ALIGN		.align 2
> > > -#define __ALIGN_STR	".align 2"
> > > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > > +#define ARM64_FUNCTION_ALIGNMENT	CONFIG_FUNCTION_ALIGNMENT
> > > +#else
> > > +#define ARM64_FUNCTION_ALIGNMENT	4
> > > +#endif
> > > +
> > > +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> > > +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
> > 
> > Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
> > and simply removing all these lines and relying on the default
> > behaviour?
> 
> There's a proposal (with some rough performance claims) to select
> FUNCTION_ALIGNMENT_16B over at:
> 
> https://lore.kernel.org/r/20221208053649.540891-1-almasrymina@google.com
> 
> so we could just go with that?

I reckon it'd be worth having that as a separate patch atop, to split the
infrastructure from the actual change, but I'm happy to go with 16B immediately
if you'd prefer.

It'd be nice if we could get some numbers...

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-09 14:43   ` Miguel Ojeda
  2023-01-09 17:06     ` Mark Rutland
@ 2023-01-11 18:27     ` Mark Rutland
  2023-01-12 11:38       ` Mark Rutland
  1 sibling, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-11 18:27 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> >
> > * When the __weak__ attribute is used
> >
> >   GCC seems to forget the alignment specified by '-falign-functions=N',
> >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> >   work around this by explciitly setting the alignment for weak
> >   functions.
> >
> > * When the __cold__ attribute is used
> >
> >   GCC seems to forget the alignment specified by '-falign-functions=N',
> >   and also doesn't seem to respect the '__aligned__(N)' function
> >   attribute. The only way to work around this is to not use the __cold__
> >   attibute.
> 
> If you happen to have a reduced case, then it would be nice to link it
> in the commit. A bug report to GCC would also be nice.
> 
> I gave it a very quick try in Compiler Explorer, but I couldn't
> reproduce it, so I guess it depends on flags, non-trivial functions or
> something else.

So having spent today coming up with tests, it turns out it's not quite as I
described above, but in a sense worse. I'm posting a summary here for
posterity; I'll try to get this to compiler folk shortly.

GCC appears to not align cold functions to the alignment specified by
`-falign-functions=N` when compiling at `-O1` or above. Alignment *can* be
restored with explicit attributes on each function, but due to some
interprocedural analysis, callees can be implicitly marked as cold (losing
their default alignment), which means we don't have a reliable mechanism to
ensure functions are always aligned short of annotating *every* function
explicitly (and I suspect that's not sufficient due to some interprocedural optimizations).

I've tested with the 12.1.0 binary release (from the kernel.org cross
toolchains page).

LLVM always seems to respect `-falign-functions=N` at both `-O1` and `-O2` (I
tested the 10.0.0, 11.0.0, 11.0.1, 15.0.6 binary releases from llvm.org).

For example:

| [mark@lakrids:/mnt/data/tests/gcc-alignment]% cat test-cold.c                                           
| #define __cold \
|         __attribute__((cold))
| 
| #define EXPORT_FUNC_PTR(func) \
|         typeof((func)) *__ptr_##func = (func)
| 
| __cold
| void cold_func_a(void) { }
| 
| __cold
| void cold_func_b(void) { }
| 
| __cold
| void cold_func_c(void) { }
| 
| static __cold
| void static_cold_func_a(void) { }
| EXPORT_FUNC_PTR(static_cold_func_a);
| 
| static __cold
| void static_cold_func_b(void) { }
| EXPORT_FUNC_PTR(static_cold_func_b);
| 
| static __cold
| void static_cold_func_c(void) { }
| EXPORT_FUNC_PTR(static_cold_func_c);
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-gcc -falign-functions=16 -c test-cold.c -O1
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -d test-cold.o                     
| 
| test-cold.o:     file format elf64-littleaarch64
| 
| 
| Disassembly of section .text:
| 
| 0000000000000000 <static_cold_func_a>:
|    0:   d65f03c0        ret
| 
| 0000000000000004 <static_cold_func_b>:
|    4:   d65f03c0        ret
| 
| 0000000000000008 <static_cold_func_c>:
|    8:   d65f03c0        ret
| 
| 000000000000000c <cold_func_a>:
|    c:   d65f03c0        ret
| 
| 0000000000000010 <cold_func_b>:
|   10:   d65f03c0        ret
| 
| 0000000000000014 <cold_func_c>:
|   14:   d65f03c0        ret
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -h test-cold.o
| 
| test-cold.o:     file format elf64-littleaarch64
| 
| Sections:
| Idx Name          Size      VMA               LMA               File off  Algn
|   0 .text         00000018  0000000000000000  0000000000000000  00000040  2**2
|                   CONTENTS, ALLOC, LOAD, READONLY, CODE
|   1 .data         00000018  0000000000000000  0000000000000000  00000058  2**3
|                   CONTENTS, ALLOC, LOAD, RELOC, DATA
|   2 .bss          00000000  0000000000000000  0000000000000000  00000070  2**0
|                   ALLOC
|   3 .comment      00000013  0000000000000000  0000000000000000  00000070  2**0
|                   CONTENTS, READONLY
|   4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00000083  2**0
|                   CONTENTS, READONLY
|   5 .eh_frame     00000090  0000000000000000  0000000000000000  00000088  2**3
|                   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

In simple cases, alignment *can* be restored if an explicit function attribute
is used. For example:

| [mark@lakrids:/mnt/data/tests/gcc-alignment]% cat test-aligned-cold.c                            
| #define __aligned(n) \
|         __attribute__((aligned(n)))
| 
| #define __cold \
|         __attribute__((cold)) __aligned(16)
| 
| #define EXPORT_FUNC_PTR(func) \
|         typeof((func)) *__ptr_##func = (func)
| 
| __cold
| void cold_func_a(void) { }
| 
| __cold
| void cold_func_b(void) { }
| 
| __cold
| void cold_func_c(void) { }
| 
| static __cold
| void static_cold_func_a(void) { }
| EXPORT_FUNC_PTR(static_cold_func_a);
| 
| static __cold
| void static_cold_func_b(void) { }
| EXPORT_FUNC_PTR(static_cold_func_b);
| 
| static __cold
| void static_cold_func_c(void) { }
| EXPORT_FUNC_PTR(static_cold_func_c);
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-gcc -falign-functions=16 -c test-aligned-cold.c -O1
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -d test-aligned-cold.o                     
| 
| test-aligned-cold.o:     file format elf64-littleaarch64
| 
| 
| Disassembly of section .text:
| 
| 0000000000000000 <static_cold_func_a>:
|    0:   d65f03c0        ret
|    4:   d503201f        nop
|    8:   d503201f        nop
|    c:   d503201f        nop
| 
| 0000000000000010 <static_cold_func_b>:
|   10:   d65f03c0        ret
|   14:   d503201f        nop
|   18:   d503201f        nop
|   1c:   d503201f        nop
| 
| 0000000000000020 <static_cold_func_c>:
|   20:   d65f03c0        ret
|   24:   d503201f        nop
|   28:   d503201f        nop
|   2c:   d503201f        nop
| 
| 0000000000000030 <cold_func_a>:
|   30:   d65f03c0        ret
|   34:   d503201f        nop
|   38:   d503201f        nop
|   3c:   d503201f        nop
| 
| 0000000000000040 <cold_func_b>:
|   40:   d65f03c0        ret
|   44:   d503201f        nop
|   48:   d503201f        nop
|   4c:   d503201f        nop
| 
| 0000000000000050 <cold_func_c>:
|   50:   d65f03c0        ret
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -h test-aligned-cold.o
| 
| test-aligned-cold.o:     file format elf64-littleaarch64
| 
| Sections:
| Idx Name          Size      VMA               LMA               File off  Algn
|   0 .text         00000054  0000000000000000  0000000000000000  00000040  2**4
|                   CONTENTS, ALLOC, LOAD, READONLY, CODE
|   1 .data         00000018  0000000000000000  0000000000000000  00000098  2**3
|                   CONTENTS, ALLOC, LOAD, RELOC, DATA
|   2 .bss          00000000  0000000000000000  0000000000000000  000000b0  2**0
|                   ALLOC
|   3 .comment      00000013  0000000000000000  0000000000000000  000000b0  2**0
|                   CONTENTS, READONLY
|   4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  000000c3  2**0
|                   CONTENTS, READONLY
|   5 .eh_frame     00000090  0000000000000000  0000000000000000  000000c8  2**3
|                   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA


Unfortunately it appears that some interprocedural analysis determines that if
a callee is only called/referenced from cold callers, the callee is marked as
cold, and the alignment it would have got from the command line option is
dropped. If it's given an explicit alignment attribute, the alignment is
retained.

For example:

| [mark@lakrids:/mnt/data/tests/gcc-alignment]% cat test-aligned-cold-caller.c                                    
| #define noinline \
|         __attribute__((noinline))
| 
| #define __aligned(n) \
|         __attribute__((aligned(n)))
| 
| #define __cold \
|         __attribute__((cold)) __aligned(16)
| 
| #define EXPORT_FUNC_PTR(func) \
|         typeof((func)) *__ptr_##func = (func)
| 
| static noinline void callee_a(void)
| {
|         asm volatile("// callee_a\n" ::: "memory");
| }
| 
| static noinline void callee_b(void)
| {
|         asm volatile("// callee_b\n" ::: "memory");
| }
| 
| static noinline void callee_c(void)
| {
|         asm volatile("// callee_c\n" ::: "memory");
| }
| __cold
| void cold_func_a(void) { callee_a(); }
| 
| __cold
| void cold_func_b(void) { callee_b(); }
| 
| __cold
| void cold_func_c(void) { callee_c(); }
| 
| static __cold
| void static_cold_func_a(void) { callee_a(); }
| EXPORT_FUNC_PTR(static_cold_func_a);
| 
| static __cold
| void static_cold_func_b(void) { callee_b(); }
| EXPORT_FUNC_PTR(static_cold_func_b);
| 
| static __cold
| void static_cold_func_c(void) { callee_c(); }
| EXPORT_FUNC_PTR(static_cold_func_c);
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-gcc -falign-functions=16 -c test-aligned-cold-caller.c -O1
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -d test-aligned-cold-caller.o                     
| 
| test-aligned-cold-caller.o:     file format elf64-littleaarch64
| 
| 
| Disassembly of section .text:
| 
| 0000000000000000 <callee_a>:
|    0:   d65f03c0        ret
| 
| 0000000000000004 <callee_b>:
|    4:   d65f03c0        ret
| 
| 0000000000000008 <callee_c>:
|    8:   d65f03c0        ret
|    c:   d503201f        nop
| 
| 0000000000000010 <static_cold_func_a>:
|   10:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
|   14:   910003fd        mov     x29, sp
|   18:   97fffffa        bl      0 <callee_a>
|   1c:   a8c17bfd        ldp     x29, x30, [sp], #16
|   20:   d65f03c0        ret
|   24:   d503201f        nop
|   28:   d503201f        nop
|   2c:   d503201f        nop
| 
| 0000000000000030 <static_cold_func_b>:
|   30:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
|   34:   910003fd        mov     x29, sp
|   38:   97fffff3        bl      4 <callee_b>
|   3c:   a8c17bfd        ldp     x29, x30, [sp], #16
|   40:   d65f03c0        ret
|   44:   d503201f        nop
|   48:   d503201f        nop
|   4c:   d503201f        nop
| 
| 0000000000000050 <static_cold_func_c>:
|   50:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
|   54:   910003fd        mov     x29, sp
|   58:   97ffffec        bl      8 <callee_c>
|   5c:   a8c17bfd        ldp     x29, x30, [sp], #16
|   60:   d65f03c0        ret
|   64:   d503201f        nop
|   68:   d503201f        nop
|   6c:   d503201f        nop
| 
| 0000000000000070 <cold_func_a>:
|   70:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
|   74:   910003fd        mov     x29, sp
|   78:   97ffffe2        bl      0 <callee_a>
|   7c:   a8c17bfd        ldp     x29, x30, [sp], #16
|   80:   d65f03c0        ret
|   84:   d503201f        nop
|   88:   d503201f        nop
|   8c:   d503201f        nop
| 
| 0000000000000090 <cold_func_b>:
|   90:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
|   94:   910003fd        mov     x29, sp
|   98:   97ffffdb        bl      4 <callee_b>
|   9c:   a8c17bfd        ldp     x29, x30, [sp], #16
|   a0:   d65f03c0        ret
|   a4:   d503201f        nop
|   a8:   d503201f        nop
|   ac:   d503201f        nop
| 
| 00000000000000b0 <cold_func_c>:
|   b0:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
|   b4:   910003fd        mov     x29, sp
|   b8:   97ffffd4        bl      8 <callee_c>
|   bc:   a8c17bfd        ldp     x29, x30, [sp], #16
|   c0:   d65f03c0        ret
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -h test-aligned-cold-caller.o
| 
| test-aligned-cold-caller.o:     file format elf64-littleaarch64
| 
| Sections:
| Idx Name          Size      VMA               LMA               File off  Algn
|   0 .text         000000c4  0000000000000000  0000000000000000  00000040  2**4
|                   CONTENTS, ALLOC, LOAD, READONLY, CODE
|   1 .data         00000018  0000000000000000  0000000000000000  00000108  2**3
|                   CONTENTS, ALLOC, LOAD, RELOC, DATA
|   2 .bss          00000000  0000000000000000  0000000000000000  00000120  2**0
|                   ALLOC
|   3 .comment      00000013  0000000000000000  0000000000000000  00000120  2**0
|                   CONTENTS, READONLY
|   4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00000133  2**0
|                   CONTENTS, READONLY
|   5 .eh_frame     00000110  0000000000000000  0000000000000000  00000138  2**3
|                   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-09 13:58 ` [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
@ 2023-01-12  6:48   ` Li Huafei
  2023-01-12 11:00     ` Mark Rutland
  0 siblings, 1 reply; 26+ messages in thread
From: Li Huafei @ 2023-01-12  6:48 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mhiramat,
	ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will, linux-arm-kernel



On 2023/1/9 21:58, Mark Rutland wrote:
> Architectures without dynamic ftrace trampolines incur an overhead when
> multiple ftrace_ops are enabled with distinct filters. In these cases,
> each call site calls a common trampoline which uses
> ftrace_ops_list_func() to iterate over all enabled ftrace functions, and
> so incurs an overhead relative to the size of this list (including RCU
> protection overhead).
> 
> Architectures with dynamic ftrace trampolines avoid this overhead for
> call sites which have a single associated ftrace_ops. In these cases,
> the dynamic trampoline is customized to branch directly to the relevant
> ftrace function, avoiding the list overhead.
> 
> On some architectures it's impractical and/or undesirable to implement
> dynamic ftrace trampolines. For example, arm64 has limited branch ranges
> and cannot always directly branch from a call site to an arbitrary
> address (e.g. from a kernel text address to an arbitrary module
> address). Calls from modules to core kernel text can be indirected via
> PLTs (allocated at module load time) to address this, but the same is
> not possible from calls from core kernel text.
> 
> Using an indirect branch from a call site to an arbitrary trampoline is
> possible, but requires several more instructions in the function
> prologue (or immediately before it), and/or comes with far more complex
> requirements for patching.
> 
> Instead, this patch adds a new option, where an architecture can
> associate each call site with a pointer to an ftrace_ops, placed at a
> fixed offset from the call site. A shared trampoline can recover this
> pointer and call ftrace_ops::func() without needing to go via
> ftrace_ops_list_func(), avoiding the associated overhead.
> 
> This avoids issues with branch range limitations, and avoids the need to
> allocate and manipulate dynamic trampolines, making it far simpler to
> implement and maintain, while having similar performance
> characteristics.
> 
> Note that this allows for dynamic ftrace_ops to be invoked directly from
> an architecture's ftrace_caller trampoline, whereas existing code forces
> the use of ftrace_ops_get_list_func(), which is in part necessary to
> permit the ftrace_ops to be freed once unregistered *and* to avoid
> branch/address-generation range limitation on some architectures (e.g.
> where ops->func is a module address, and may be outside of the direct
> branch range for callsites within the main kernel image).
> 
> The CALL_OPS approach avoids these problems and is safe as:
> 
> * The existing synchronization in ftrace_shutdown() using
>   synchronize_rcu_tasks_rude() (and
>   synchronize_rcu_tasks()) ensures that no tasks hold a stale reference
>   to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline,
>   or while invoking ftrace_ops::func), when that ftrace_ops is
>   unregistered.
> 
>   Arguably this could also be relied upon for the existing scheme,
>   permitting dynamic ftrace_ops to be invoked directly when ops->func is
>   in range, but this will require additional logic to handle branch
>   range limitations, and is not handled by this patch.
> 
> * Each callsite's ftrace_ops pointer literal can hold any valid kernel
>   address, and is updated atomically. As an architecture's ftrace_caller
>   trampoline will atomically load the ops pointer then dereference
>   ops->func, there is no risk of invoking ops->func with a mismatched
>   ops pointer, and updates to the ops pointer do not require special
>   care.
> 
> A subsequent patch will implement architecture support for arm64. There
> should be no functional change as a result of this patch alone.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Florent Revest <revest@chromium.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  include/linux/ftrace.h |  15 +++++-
>  kernel/trace/Kconfig   |   7 +++
>  kernel/trace/ftrace.c  | 109 +++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 124 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 99f1146614c04..5eeb2776124c5 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -57,6 +57,9 @@ void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip);
>  void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>  			       struct ftrace_ops *op, struct ftrace_regs *fregs);
>  #endif
> +extern const struct ftrace_ops ftrace_nop_ops;
> +extern const struct ftrace_ops ftrace_list_ops;
> +struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);

Hi Mark,

This patch has build issues on x86:

    CC      mm/readahead.o
  In file included from include/linux/perf_event.h:52:0,
                   from arch/x86/events/amd/lbr.c:2:
  include/linux/ftrace.h:62:50: error: ‘struct dyn_ftrace’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
   struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);

We need a 'struct dyn_ftrace' forward declaration here.
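
Something like the following on top of this patch should be enough (just a
sketch, untested):

	extern const struct ftrace_ops ftrace_nop_ops;
	extern const struct ftrace_ops ftrace_list_ops;
	struct dyn_ftrace;
	struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);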

Thanks,
Huafei

>  #endif /* CONFIG_FUNCTION_TRACER */
>  
>  /* Main tracing buffer and events set up */
> @@ -563,6 +566,8 @@ bool is_ftrace_trampoline(unsigned long addr);
>   *  IPMODIFY - the record allows for the IP address to be changed.
>   *  DISABLED - the record is not ready to be touched yet
>   *  DIRECT   - there is a direct function to call
> + *  CALL_OPS - the record can use callsite-specific ops
> + *  CALL_OPS_EN - the function is set up to use callsite-specific ops
>   *
>   * When a new ftrace_ops is registered and wants a function to save
>   * pt_regs, the rec->flags REGS is set. When the function has been
> @@ -580,9 +585,11 @@ enum {
>  	FTRACE_FL_DISABLED	= (1UL << 25),
>  	FTRACE_FL_DIRECT	= (1UL << 24),
>  	FTRACE_FL_DIRECT_EN	= (1UL << 23),
> +	FTRACE_FL_CALL_OPS	= (1UL << 22),
> +	FTRACE_FL_CALL_OPS_EN	= (1UL << 21),
>  };
>  
> -#define FTRACE_REF_MAX_SHIFT	23
> +#define FTRACE_REF_MAX_SHIFT	21
>  #define FTRACE_REF_MAX		((1UL << FTRACE_REF_MAX_SHIFT) - 1)
>  
>  #define ftrace_rec_count(rec)	((rec)->flags & FTRACE_REF_MAX)
> @@ -820,7 +827,8 @@ static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
>   */
>  extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
>  
> -#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +#if defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS) || \
> +	defined(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)
>  /**
>   * ftrace_modify_call - convert from one addr to another (no nop)
>   * @rec: the call site record (e.g. mcount/fentry)
> @@ -833,6 +841,9 @@ extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
>   * what we expect it to be, and then on success of the compare,
>   * it should write to the location.
>   *
> + * When using call ops, this is called when the associated ops change, even
> + * when (addr == old_addr).
> + *
>   * The code segment at @rec->ip should be a caller to @old_addr
>   *
>   * Return must be:
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 197545241ab83..5df427a2321df 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -42,6 +42,9 @@ config HAVE_DYNAMIC_FTRACE_WITH_REGS
>  config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>  	bool
>  
> +config HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
> +	bool
> +
>  config HAVE_DYNAMIC_FTRACE_WITH_ARGS
>  	bool
>  	help
> @@ -257,6 +260,10 @@ config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>  	depends on DYNAMIC_FTRACE_WITH_REGS
>  	depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>  
> +config DYNAMIC_FTRACE_WITH_CALL_OPS
> +	def_bool y
> +	depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
> +
>  config DYNAMIC_FTRACE_WITH_ARGS
>  	def_bool y
>  	depends on DYNAMIC_FTRACE
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 442438b93fe98..c0b219ab89d2f 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -125,6 +125,33 @@ struct ftrace_ops global_ops;
>  void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>  			  struct ftrace_ops *op, struct ftrace_regs *fregs);
>  
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> +/*
> + * Stub used to invoke the list ops without requiring a separate trampoline.
> + */
> +const struct ftrace_ops ftrace_list_ops = {
> +	.func	= ftrace_ops_list_func,
> +	.flags	= FTRACE_OPS_FL_STUB,
> +};
> +
> +static void ftrace_ops_nop_func(unsigned long ip, unsigned long parent_ip,
> +				struct ftrace_ops *op,
> +				struct ftrace_regs *fregs)
> +{
> +	/* do nothing */
> +}
> +
> +/*
> + * Stub used when a call site is disabled. May be called transiently by threads
> + * which have made it into ftrace_caller but haven't yet recovered the ops at
> + * the point the call site is disabled.
> + */
> +const struct ftrace_ops ftrace_nop_ops = {
> +	.func	= ftrace_ops_nop_func,
> +	.flags  = FTRACE_OPS_FL_STUB,
> +};
> +#endif
> +
>  static inline void ftrace_ops_init(struct ftrace_ops *ops)
>  {
>  #ifdef CONFIG_DYNAMIC_FTRACE
> @@ -1814,6 +1841,18 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
>  			 * if rec count is zero.
>  			 */
>  		}
> +
> +		/*
> +		 * If the rec has a single associated ops, and ops->func can be
> +		 * called directly, allow the call site to call via the ops.
> +		 */
> +		if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) &&
> +		    ftrace_rec_count(rec) == 1 &&
> +		    ftrace_ops_get_func(ops) == ops->func)
> +			rec->flags |= FTRACE_FL_CALL_OPS;
> +		else
> +			rec->flags &= ~FTRACE_FL_CALL_OPS;
> +
>  		count++;
>  
>  		/* Must match FTRACE_UPDATE_CALLS in ftrace_modify_all_code() */
> @@ -2108,8 +2147,9 @@ void ftrace_bug(int failed, struct dyn_ftrace *rec)
>  		struct ftrace_ops *ops = NULL;
>  
>  		pr_info("ftrace record flags: %lx\n", rec->flags);
> -		pr_cont(" (%ld)%s", ftrace_rec_count(rec),
> -			rec->flags & FTRACE_FL_REGS ? " R" : "  ");
> +		pr_cont(" (%ld)%s%s", ftrace_rec_count(rec),
> +			rec->flags & FTRACE_FL_REGS ? " R" : "  ",
> +			rec->flags & FTRACE_FL_CALL_OPS ? " O" : "  ");
>  		if (rec->flags & FTRACE_FL_TRAMP_EN) {
>  			ops = ftrace_find_tramp_ops_any(rec);
>  			if (ops) {
> @@ -2177,6 +2217,7 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  		 * want the direct enabled (it will be done via the
>  		 * direct helper). But if DIRECT_EN is set, and
>  		 * the count is not one, we need to clear it.
> +		 *
>  		 */
>  		if (ftrace_rec_count(rec) == 1) {
>  			if (!(rec->flags & FTRACE_FL_DIRECT) !=
> @@ -2185,6 +2226,19 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  		} else if (rec->flags & FTRACE_FL_DIRECT_EN) {
>  			flag |= FTRACE_FL_DIRECT;
>  		}
> +
> +		/*
> +		 * Ops calls are special, as count matters.
> +		 * As with direct calls, they must only be enabled when count
> +		 * is one, otherwise they'll be handled via the list ops.
> +		 */
> +		if (ftrace_rec_count(rec) == 1) {
> +			if (!(rec->flags & FTRACE_FL_CALL_OPS) !=
> +			    !(rec->flags & FTRACE_FL_CALL_OPS_EN))
> +				flag |= FTRACE_FL_CALL_OPS;
> +		} else if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
> +			flag |= FTRACE_FL_CALL_OPS;
> +		}
>  	}
>  
>  	/* If the state of this record hasn't changed, then do nothing */
> @@ -2229,6 +2283,21 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  					rec->flags &= ~FTRACE_FL_DIRECT_EN;
>  				}
>  			}
> +
> +			if (flag & FTRACE_FL_CALL_OPS) {
> +				if (ftrace_rec_count(rec) == 1) {
> +					if (rec->flags & FTRACE_FL_CALL_OPS)
> +						rec->flags |= FTRACE_FL_CALL_OPS_EN;
> +					else
> +						rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
> +				} else {
> +					/*
> +					 * Can only call directly if there's
> +					 * only one sets of associated ops.
> +					 */
> +					rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
> +				}
> +			}
>  		}
>  
>  		/*
> @@ -2258,7 +2327,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
>  			 * and REGS states. The _EN flags must be disabled though.
>  			 */
>  			rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
> -					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
> +					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
> +					FTRACE_FL_CALL_OPS_EN);
>  	}
>  
>  	ftrace_bug_type = FTRACE_BUG_NOP;
> @@ -2431,6 +2501,25 @@ ftrace_find_tramp_ops_new(struct dyn_ftrace *rec)
>  	return NULL;
>  }
>  
> +struct ftrace_ops *
> +ftrace_find_unique_ops(struct dyn_ftrace *rec)
> +{
> +	struct ftrace_ops *op, *found = NULL;
> +	unsigned long ip = rec->ip;
> +
> +	do_for_each_ftrace_op(op, ftrace_ops_list) {
> +
> +		if (hash_contains_ip(ip, op->func_hash)) {
> +			if (found)
> +				return NULL;
> +			found = op;
> +		}
> +
> +	} while_for_each_ftrace_op(op);
> +
> +	return found;
> +}
> +
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>  /* Protected by rcu_tasks for reading, and direct_mutex for writing */
>  static struct ftrace_hash *direct_functions = EMPTY_HASH;
> @@ -3780,11 +3869,12 @@ static int t_show(struct seq_file *m, void *v)
>  	if (iter->flags & FTRACE_ITER_ENABLED) {
>  		struct ftrace_ops *ops;
>  
> -		seq_printf(m, " (%ld)%s%s%s",
> +		seq_printf(m, " (%ld)%s%s%s%s",
>  			   ftrace_rec_count(rec),
>  			   rec->flags & FTRACE_FL_REGS ? " R" : "  ",
>  			   rec->flags & FTRACE_FL_IPMODIFY ? " I" : "  ",
> -			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ");
> +			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ",
> +			   rec->flags & FTRACE_FL_CALL_OPS ? " O" : "  ");
>  		if (rec->flags & FTRACE_FL_TRAMP_EN) {
>  			ops = ftrace_find_tramp_ops_any(rec);
>  			if (ops) {
> @@ -3800,6 +3890,15 @@ static int t_show(struct seq_file *m, void *v)
>  		} else {
>  			add_trampoline_func(m, NULL, rec);
>  		}
> +		if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
> +			ops = ftrace_find_unique_ops(rec);
> +			if (ops) {
> +				seq_printf(m, "\tops: %pS (%pS)",
> +					   ops, ops->func);
> +			} else {
> +				seq_puts(m, "\tops: ERROR!");
> +			}
> +		}
>  		if (rec->flags & FTRACE_FL_DIRECT) {
>  			unsigned long direct;
>  
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-12  6:48   ` Li Huafei
@ 2023-01-12 11:00     ` Mark Rutland
  2023-01-13  1:15       ` Li Huafei
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-12 11:00 UTC (permalink / raw)
  To: Li Huafei
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mhiramat,
	ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will, linux-arm-kernel

On Thu, Jan 12, 2023 at 02:48:45PM +0800, Li Huafei wrote:
> On 2023/1/9 21:58, Mark Rutland wrote:
> 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index 99f1146614c04..5eeb2776124c5 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -57,6 +57,9 @@ void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip);
> >  void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
> >  			       struct ftrace_ops *op, struct ftrace_regs *fregs);
> >  #endif
> > +extern const struct ftrace_ops ftrace_nop_ops;
> > +extern const struct ftrace_ops ftrace_list_ops;
> > +struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
> 
> Hi Mark,

Hi Huafei,

Thanks for the report.

> This patch has build issues on x86:
> 
>     CC      mm/readahead.o
>   In file included from include/linux/perf_event.h:52:0,
>                    from arch/x86/events/amd/lbr.c:2:
>   include/linux/ftrace.h:62:50: error: ‘struct dyn_ftrace’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
>    struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
> 
> Here we should need 'struct dyn_ftrace' forward declaration.

The build robot spotted this a couple of days ago (from my branch rather than
the list), and I've since fixed it there. The relevant messages were:

  https://lore.kernel.org/oe-kbuild-all/202301100944.E0mV3kSO-lkp@intel.com/
  https://lore.kernel.org/oe-kbuild-all/Y72TJ3qQuvx3gIOi@FVFF77S0Q05N/#t

... and the updated commit is at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/ftrace/per-callsite-ops&id=acab29d6ea2f20d8d156cdd301ad9790bd1d888f


... I'll post a v2 with that folded in once this has had a bit more time to gather comments.

Thanks,	
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-11 18:27     ` Mark Rutland
@ 2023-01-12 11:38       ` Mark Rutland
  2023-01-13 12:49         ` Mark Rutland
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-12 11:38 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Wed, Jan 11, 2023 at 06:27:53PM +0000, Mark Rutland wrote:
> On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> > On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> > >
> > > * When the __weak__ attribute is used
> > >
> > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> > >   work around this by explciitly setting the alignment for weak
> > >   functions.
> > >
> > > * When the __cold__ attribute is used
> > >
> > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > >   and also doesn't seem to respect the '__aligned__(N)' function
> > >   attribute. The only way to work around this is to not use the __cold__
> > >   attibute.
> > 
> > If you happen to have a reduced case, then it would be nice to link it
> > in the commit. A bug report to GCC would also be nice.
> > 
> > I gave it a very quick try in Compiler Explorer, but I couldn't
> > reproduce it, so I guess it depends on flags, non-trivial functions or
> > something else.
> 
> So having spent today coming up with tests, it turns out it's not quite as I
> described above, but in a sense worse. I'm posting a summary here for
> posterity; I'll try to get this to compiler folk shortly.

I've added the cold bits to an existing ticket:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

I have not been able to reproduce the issue with __weak__, so I'll go dig into
that some more; it's likely I was mistaken there.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
  2023-01-12 11:00     ` Mark Rutland
@ 2023-01-13  1:15       ` Li Huafei
  0 siblings, 0 replies; 26+ messages in thread
From: Li Huafei @ 2023-01-13  1:15 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, lenb, linux-acpi, linux-kernel, mhiramat,
	ndesaulniers, ojeda, peterz, rafael.j.wysocki, revest,
	robert.moore, rostedt, will, linux-arm-kernel



On 2023/1/12 19:00, Mark Rutland wrote:
> On Thu, Jan 12, 2023 at 02:48:45PM +0800, Li Huafei wrote:
>> On 2023/1/9 21:58, Mark Rutland wrote:
>>
>>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
>>> index 99f1146614c04..5eeb2776124c5 100644
>>> --- a/include/linux/ftrace.h
>>> +++ b/include/linux/ftrace.h
>>> @@ -57,6 +57,9 @@ void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip);
>>>  void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>>>  			       struct ftrace_ops *op, struct ftrace_regs *fregs);
>>>  #endif
>>> +extern const struct ftrace_ops ftrace_nop_ops;
>>> +extern const struct ftrace_ops ftrace_list_ops;
>>> +struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
>>
>> Hi Mark,
> 
> Hi Huafei,
> 
> Thanks for the report.
> 
>> This patch has build issues on x86:
>>
>>     CC      mm/readahead.o
>>   In file included from include/linux/perf_event.h:52:0,
>>                    from arch/x86/events/amd/lbr.c:2:
>>   include/linux/ftrace.h:62:50: error: ‘struct dyn_ftrace’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
>>    struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
>>
>> Here we should need 'struct dyn_ftrace' forward declaration.
> 
> The build robot spotted this a couple of days ago (from my branch rather than
> the list), and I fixed it there a couple of days ago. The relevant messages were:
> 
>   https://lore.kernel.org/oe-kbuild-all/202301100944.E0mV3kSO-lkp@intel.com/
>   https://lore.kernel.org/oe-kbuild-all/Y72TJ3qQuvx3gIOi@FVFF77S0Q05N/#t
> 
> ... and the updated commit is at:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/ftrace/per-callsite-ops&id=acab29d6ea2f20d8d156cdd301ad9790bd1d888f
> 

Okay, you've fixed it. Thanks for pointing it out.
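
For reference, the fix amounts to a forward declaration ahead of the
prototype; a sketch, not necessarily the exact hunk in the updated
commit linked above:

  /*
   * Forward-declare the struct so the prototype below doesn't introduce
   * a declaration-local 'struct dyn_ftrace', which is what triggers the
   * -Werror warning seen on x86.
   */
  struct dyn_ftrace;

  extern const struct ftrace_ops ftrace_nop_ops;
  extern const struct ftrace_ops ftrace_list_ops;
  struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);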

> 
> ... I'll post a v2 with that folded in once this has had a bit more time to gather comments.
> 
> Thanks,	
> Mark.
> 
> .
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-12 11:38       ` Mark Rutland
@ 2023-01-13 12:49         ` Mark Rutland
  2023-01-15 21:32           ` Miguel Ojeda
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Rutland @ 2023-01-13 12:49 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Thu, Jan 12, 2023 at 11:38:17AM +0000, Mark Rutland wrote:
> On Wed, Jan 11, 2023 at 06:27:53PM +0000, Mark Rutland wrote:
> > On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> > > On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> > > >
> > > > * When the __weak__ attribute is used
> > > >
> > > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > > >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> > > >   work around this by explicitly setting the alignment for weak
> > > >   functions.
> > > >
> > > > * When the __cold__ attribute is used
> > > >
> > > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > > >   and also doesn't seem to respect the '__aligned__(N)' function
> > > >   attribute. The only way to work around this is to not use the __cold__
> > > >   attribute.
> > > 
> > > If you happen to have a reduced case, then it would be nice to link it
> > > in the commit. A bug report to GCC would also be nice.
> > > 
> > > I gave it a very quick try in Compiler Explorer, but I couldn't
> > > reproduce it, so I guess it depends on flags, non-trivial functions or
> > > something else.
> > 
> > So having spent today coming up with tests, it turns out it's not quite as I
> > described above, but in a sense worse. I'm posting a summary here for
> > posterity; I'll try to get this to compiler folk shortly.
> 
> I've added the cold bits to an existing ticket:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345
> 
> I have not been able to reproduce the issue with __weak__, so I'll go dig into
> that some more; it's likely I was mistaken there.

It turns out that was a red herring; GCC is actually implicitly marking the
abort() function as cold, and as Linux's implementation happened to be marked
as weak I assumed that was the culprit.
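
For reference, the implicit-cold behaviour can be seen with something
like the below (illustrative only; with a recent GCC and e.g.
'-O2 -falign-functions=64', the locally-defined abort() may lose the
requested alignment and be grouped with other cold code):

  /* GCC treats abort() as noreturn and implicitly marks it cold. */
  void abort(void)
  {
          __builtin_trap();
  }

  /* Expected to get the alignment requested via -falign-functions. */
  void other_func(void)
  {
  }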

I'll drop the changes to weak and update our abort implementation specifically,
with a comment.

I'll also go update the ticket above.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] Compiler attributes: GCC function alignment workarounds
  2023-01-13 12:49         ` Mark Rutland
@ 2023-01-15 21:32           ` Miguel Ojeda
  0 siblings, 0 replies; 26+ messages in thread
From: Miguel Ojeda @ 2023-01-15 21:32 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, catalin.marinas, lenb, linux-acpi,
	linux-kernel, mhiramat, ndesaulniers, ojeda, peterz,
	rafael.j.wysocki, revest, robert.moore, rostedt, will

On Fri, Jan 13, 2023 at 1:49 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> It turns out that was a red herring; GCC is actually implicitly marking the
> abort() function as cold, and as Linux's implementation happened to be marked
> as weak I assumed that was the culprit.

That and your previous message probably explain why I couldn't
reproduce it.

Thanks a lot for all the details -- the `cold` issue is reproducible
since gcc 4.6 at least: https://godbolt.org/z/PoxazzT9T

The `abort` case appears to happen since gcc 8.1.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-01-15 21:32 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-09 13:58 [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
2023-01-09 13:58 ` [PATCH 1/8] Compiler attributes: GCC function alignment workarounds Mark Rutland
2023-01-09 14:43   ` Miguel Ojeda
2023-01-09 17:06     ` Mark Rutland
2023-01-09 22:35       ` Miguel Ojeda
2023-01-11 18:27     ` Mark Rutland
2023-01-12 11:38       ` Mark Rutland
2023-01-13 12:49         ` Mark Rutland
2023-01-15 21:32           ` Miguel Ojeda
2023-01-09 13:58 ` [PATCH 2/8] ACPI: Don't build ACPICA with '-Os' Mark Rutland
2023-01-10 13:45   ` Rafael J. Wysocki
2023-01-09 13:58 ` [PATCH 3/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT Mark Rutland
2023-01-10 20:35   ` Peter Zijlstra
2023-01-10 20:43     ` Will Deacon
2023-01-11 11:39       ` Mark Rutland
2023-01-11 11:36     ` Mark Rutland
2023-01-09 13:58 ` [PATCH 4/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
2023-01-12  6:48   ` Li Huafei
2023-01-12 11:00     ` Mark Rutland
2023-01-13  1:15       ` Li Huafei
2023-01-09 13:58 ` [PATCH 5/8] arm64: insn: Add helpers for BTI Mark Rutland
2023-01-09 13:58 ` [PATCH 6/8] arm64: patching: Add aarch64_insn_write_literal_u64() Mark Rutland
2023-01-09 13:58 ` [PATCH 7/8] arm64: ftrace: Update stale comment Mark Rutland
2023-01-09 13:58 ` [PATCH 8/8] arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS Mark Rutland
2023-01-10  8:55 ` [PATCH 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS David Laight
2023-01-10 10:31   ` Mark Rutland
