* [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Michael Jeanson, Steven Rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Joel Fernandes, bpf

Formerly known as “Sleepable tracepoints”.

Accessing user-space data is a common use case for tracers invoked from
system call enter/exit instrumentation. However, tracepoints currently
disable preemption around iteration over the registered tracepoint
probes and invocation of the probe callbacks, which prevents tracers
from handling page faults.

Extend the tracepoint and trace event APIs to allow specific tracer
probes to take page faults. Adapt ftrace, perf, and eBPF to allow being
called from sleepable context, and convert the system call enter/exit
instrumentation to faultable tracepoints.

This series only implements the tracepoint infrastructure required to
allow tracers to handle page faults. Modifying each tracer to handle
those page faults would be a next step after we all agree on this piece
of instrumentation infrastructure.
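
As a rough illustration of where this is headed, a tracer could
eventually register a probe on the faultable sys_enter tracepoint and
access user memory directly from the callback. A minimal sketch,
assuming the register_trace_mayfault_sys_enter() helper generated by
this series; the probe itself and its use of the first syscall argument
are hypothetical:

  #include <linux/uaccess.h>
  #include <asm/syscall.h>
  #include <trace/events/syscalls.h>

  /* Runs with preemption enabled and may take page faults. */
  static void my_sys_enter_probe(void *data, struct pt_regs *regs, long id)
  {
  	unsigned long args[6];
  	char buf[64];

  	syscall_get_arguments(current, regs, args);
  	/* Illustration only: treat the first argument as a user pointer. */
  	if (copy_from_user(buf, (const char __user *)args[0], sizeof(buf)))
  		return;	/* unresolvable fault: skip this event */
  	/* ... record buf into the tracer's own buffer ... */
  }

  static int __init my_tracer_init(void)
  {
  	return register_trace_mayfault_sys_enter(my_sys_enter_probe, NULL);
  }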

This patchset is based on v5.10.15.

Changes since v1 RFC:

  - Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to
    MAYFAULT, and use might_fault() rather than might_sleep(), to properly
    convey that the tracepoints are meant to be able to take a page fault,
    which requires to be able to sleep *and* to hold the mmap_sem.

Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org

Mathieu Desnoyers (1):
  tracing: use Tasks Trace RCU instead of SRCU for rcuidle tracepoints

Michael Jeanson (5):
  tracing: introduce faultable tracepoints (v2)
  tracing: ftrace: add support for faultable tracepoints
  tracing: bpf-trace: add support for faultable tracepoints
  tracing: perf: add support for faultable tracepoints
  tracing: convert sys_enter/exit to faultable tracepoints

 include/linux/tracepoint-defs.h |  11 ++++
 include/linux/tracepoint.h      | 111 +++++++++++++++++++++-----------
 include/trace/bpf_probe.h       |  23 ++++++-
 include/trace/define_trace.h    |   8 +++
 include/trace/events/syscalls.h |   4 +-
 include/trace/perf.h            |  26 ++++++--
 include/trace/trace_events.h    |  79 +++++++++++++++++++++--
 init/Kconfig                    |   1 +
 kernel/trace/bpf_trace.c        |   5 +-
 kernel/trace/trace_events.c     |  15 ++++-
 kernel/trace/trace_syscalls.c   |  84 ++++++++++++++++--------
 kernel/tracepoint.c             | 104 ++++++++++++++++++++++++------
 12 files changed, 373 insertions(+), 98 deletions(-)

-- 
2.25.1


* [RFC PATCH 1/6] tracing: introduce faultable tracepoints (v2)
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Michael Jeanson, Mathieu Desnoyers, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf,
	Joel Fernandes

Accessing user-space data is a common use case for tracers invoked from
system call enter/exit instrumentation. However, tracepoints currently
disable preemption around iteration over the registered tracepoint
probes and invocation of the probe callbacks, which prevents tracers
from handling page faults.

Extend the tracepoint and trace event APIs to allow defining a faultable
tracepoint which invokes its callback with preemption enabled.

Also extend the tracepoint API to allow tracers to connect specific
probes to those faultable tracepoints. When a probe is registered on a
tracepoint with the TRACEPOINT_MAYFAULT flag, the probe callback will be
called with preemption enabled, and is allowed to take page faults.
Faultable probes can only be registered on faultable tracepoints, and
non-faultable probes on non-faultable tracepoints.

The Tasks Trace RCU mechanism is used to synchronize read-side
marshalling of the registered probes with respect to unregistration and
teardown of faultable probes.
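
For example, with this patch a subsystem could declare and instantiate
its own faultable tracepoint as sketched below (the event name and probe
are hypothetical; DECLARE_TRACE_MAYFAULT, DEFINE_TRACE_FN_FLAGS and
register_trace_mayfault_*() come from this patch):

  /* In a header: */
  DECLARE_TRACE_MAYFAULT(myfault_event,
  	TP_PROTO(const char __user *ustr),
  	TP_ARGS(ustr)
  );

  /* In one compilation unit: */
  DEFINE_TRACE_FN_FLAGS(myfault_event, NULL, NULL,
  	PARAMS(const char __user *ustr), PARAMS(ustr),
  	TRACEPOINT_MAYFAULT);

  /* Registration fails with -EINVAL if the flag is missing: */
  ret = register_trace_mayfault_myfault_event(my_probe, NULL);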

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
Changes since v1:
- Cleanup __DO_TRACE() implementation.
- Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to
  MAYFAULT, and use might_fault() rather than might_sleep(), to properly
  convey that the tracepoints are meant to be able to take a page fault,
  which requires to be able to sleep *and* to hold the mmap_sem.
---
 include/linux/tracepoint-defs.h |  11 ++++
 include/linux/tracepoint.h      |  91 ++++++++++++++++++++++------
 include/trace/define_trace.h    |   8 +++
 include/trace/trace_events.h    |   6 ++
 init/Kconfig                    |   1 +
 kernel/tracepoint.c             | 103 ++++++++++++++++++++++++++------
 6 files changed, 186 insertions(+), 34 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index e7c2276be33e..012f1ec6730c 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -29,6 +29,16 @@ struct tracepoint_func {
 	int prio;
 };
 
+/**
+ * enum tracepoint_flags - Tracepoint flags
+ * @TRACEPOINT_MAYFAULT: The tracepoint probe callback will be called with
+ *                       preemption enabled, and is allowed to take page
+ *                       faults.
+ */
+enum tracepoint_flags {
+	TRACEPOINT_MAYFAULT = (1 << 0),
+};
+
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
 	struct static_key key;
@@ -38,6 +48,7 @@ struct tracepoint {
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
+	unsigned int flags;
 };
 
 #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 966ed8980327..04079cbd2015 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -18,6 +18,7 @@
 #include <linux/types.h>
 #include <linux/cpumask.h>
 #include <linux/rcupdate.h>
+#include <linux/rcupdate_trace.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/static_call.h>
 
@@ -38,9 +39,14 @@ extern struct srcu_struct tracepoint_srcu;
 extern int
 tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
 extern int
+tracepoint_probe_register_mayfault(struct tracepoint *tp, void *probe, void *data);
+extern int
 tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
 			       int prio);
 extern int
+tracepoint_probe_register_prio_mayfault(struct tracepoint *tp, void *probe, void *data,
+			       int prio);
+extern int
 tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
@@ -80,6 +86,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
 #ifdef CONFIG_TRACEPOINTS
 static inline void tracepoint_synchronize_unregister(void)
 {
+	synchronize_rcu_tasks_trace();
 	synchronize_srcu(&tracepoint_srcu);
 	synchronize_rcu();
 }
@@ -166,11 +173,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * has a "void" prototype, then it is invalid to declare a function
  * as "(void *, void)".
  */
-#define __DO_TRACE(name, proto, args, cond, rcuidle)			\
+#define __DO_TRACE(name, proto, args, cond, rcuidle, tp_flags)		\
 	do {								\
 		struct tracepoint_func *it_func_ptr;			\
 		int __maybe_unused __idx = 0;				\
 		void *__data;						\
+		bool mayfault = (tp_flags) & TRACEPOINT_MAYFAULT;	\
 									\
 		if (!(cond))						\
 			return;						\
@@ -178,9 +186,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		/* srcu can't be used from NMI */			\
 		WARN_ON_ONCE(rcuidle && in_nmi());			\
 									\
-		/* keep srcu and sched-rcu usage consistent */		\
-		preempt_disable_notrace();				\
-									\
+		if (mayfault) {						\
+			rcu_read_lock_trace();				\
+		} else {						\
+			/* keep srcu and sched-rcu usage consistent */	\
+			preempt_disable_notrace();			\
+		}							\
 		/*							\
 		 * For rcuidle callers, use srcu since sched-rcu	\
 		 * doesn't work from the idle path.			\
@@ -192,6 +203,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 									\
 		it_func_ptr =						\
 			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
+									\
 		if (it_func_ptr) {					\
 			__data = (it_func_ptr)->data;			\
 			__DO_TRACE_CALL(name)(args);			\
@@ -202,21 +214,24 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
 		}							\
 									\
-		preempt_enable_notrace();				\
+		if (mayfault)						\
+			rcu_read_unlock_trace();			\
+		else							\
+			preempt_enable_notrace();			\
 	} while (0)
 
 #ifndef MODULE
-#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args, tp_flags) \
 	static inline void trace_##name##_rcuidle(proto)		\
 	{								\
 		if (static_key_false(&__tracepoint_##name.key))		\
 			__DO_TRACE(name,				\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args),			\
-				TP_CONDITION(cond), 1);			\
+				TP_CONDITION(cond), 1, tp_flags);	\
 	}
 #else
-#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args)
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args, tp_flags)
 #endif
 
 /*
@@ -231,7 +246,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * even when this tracepoint is off. This code has no purpose other than
  * poking RCU a bit.
  */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args, tp_flags) \
 	extern int __traceiter_##name(data_proto);			\
 	DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name);	\
 	extern struct tracepoint __tracepoint_##name;			\
@@ -241,15 +256,17 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			__DO_TRACE(name,				\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args),			\
-				TP_CONDITION(cond), 0);			\
+				TP_CONDITION(cond), 0, tp_flags);	\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			rcu_read_lock_sched_notrace();			\
 			rcu_dereference_sched(__tracepoint_##name.funcs);\
 			rcu_read_unlock_sched_notrace();		\
 		}							\
+		if ((tp_flags) & TRACEPOINT_MAYFAULT)			\
+			might_fault();					\
 	}								\
 	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),		\
-		PARAMS(cond), PARAMS(data_proto), PARAMS(data_args))	\
+		PARAMS(cond), PARAMS(data_proto), PARAMS(data_args), tp_flags)	\
 	static inline int						\
 	register_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
@@ -257,6 +274,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 						(void *)probe, data);	\
 	}								\
 	static inline int						\
+	register_trace_mayfault_##name(void (*probe)(data_proto), void *data)	\
+	{								\
+		return tracepoint_probe_register_mayfault(&__tracepoint_##name,	\
+						(void *)probe, data);	\
+	}								\
+	static inline int						\
 	register_trace_prio_##name(void (*probe)(data_proto), void *data,\
 				   int prio)				\
 	{								\
@@ -264,6 +287,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 					      (void *)probe, data, prio); \
 	}								\
 	static inline int						\
+	register_trace_prio_mayfault_##name(void (*probe)(data_proto),	\
+			void *data, int prio)				\
+	{								\
+		return tracepoint_probe_register_prio_mayfault(&__tracepoint_##name, \
+					      (void *)probe, data, prio); \
+	}								\
+	static inline int						\
 	unregister_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
 		return tracepoint_probe_unregister(&__tracepoint_##name,\
@@ -284,7 +314,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * structures, so we create an array of pointers that will be used for iteration
  * on the tracepoints.
  */
-#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args)		\
+#define DEFINE_TRACE_FN_FLAGS(_name, _reg, _unreg, proto, args, tp_flags) \
 	static const char __tpstrtab_##_name[]				\
 	__section("__tracepoints_strings") = #_name;			\
 	extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name);	\
@@ -298,7 +328,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		.iterator = &__traceiter_##_name,			\
 		.regfunc = _reg,					\
 		.unregfunc = _unreg,					\
-		.funcs = NULL };					\
+		.funcs = NULL,						\
+		.flags = tp_flags };					\
 	__TRACEPOINT_ENTRY(_name);					\
 	int __traceiter_##_name(void *__data, proto)			\
 	{								\
@@ -318,8 +349,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	}								\
 	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
 
+#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args)		\
+	DEFINE_TRACE_FN_FLAGS(_name, _reg, _unreg, PARAMS(proto), PARAMS(args), 0)
+
 #define DEFINE_TRACE(name, proto, args)		\
-	DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));
+	DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args))
 
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)				\
 	EXPORT_SYMBOL_GPL(__tracepoint_##name);				\
@@ -332,7 +366,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 
 
 #else /* !TRACEPOINTS_ENABLED */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args, tp_flags) \
 	static inline void trace_##name(proto)				\
 	{ }								\
 	static inline void trace_##name##_rcuidle(proto)		\
@@ -344,6 +378,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		return -ENOSYS;						\
 	}								\
 	static inline int						\
+	register_trace_mayfault_##name(void (*probe)(data_proto),	\
+			      void *data)				\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline int						\
+	register_trace_prio_mayfault_##name(void (*probe)(data_proto),	\
+			void *data, int prio)				\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline int						\
 	unregister_trace_##name(void (*probe)(data_proto),		\
 				void *data)				\
 	{								\
@@ -358,6 +404,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		return false;						\
 	}
 
+#define DEFINE_TRACE_FN_FLAGS(name, reg, unreg, proto, args, tp_flags)
 #define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
 #define DEFINE_TRACE(name, proto, args)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
@@ -413,13 +460,20 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),		\
 			cpu_online(raw_smp_processor_id()),		\
 			PARAMS(void *__data, proto),			\
-			PARAMS(__data, args))
+			PARAMS(__data, args), 0)
+
+#define DECLARE_TRACE_MAYFAULT(name, proto, args)			\
+	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),		\
+			cpu_online(raw_smp_processor_id()),		\
+			PARAMS(void *__data, proto),			\
+			PARAMS(__data, args),				\
+			TRACEPOINT_MAYFAULT)
 
 #define DECLARE_TRACE_CONDITION(name, proto, args, cond)		\
 	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),		\
 			cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
 			PARAMS(void *__data, proto),			\
-			PARAMS(__data, args))
+			PARAMS(__data, args), 0)
 
 #define TRACE_EVENT_FLAGS(event, flag)
 
@@ -550,6 +604,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 #define TRACE_EVENT_FN(name, proto, args, struct,		\
 		assign, print, reg, unreg)			\
 	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_FN_MAYFAULT(name, proto, args, struct,	\
+		assign, print, reg, unreg)			\
+	DECLARE_TRACE_MAYFAULT(name, PARAMS(proto), PARAMS(args))
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, struct,		\
 		assign, print, reg, unreg)			\
 	DECLARE_TRACE_CONDITION(name, PARAMS(proto),	\
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 00723935dcc7..e08fd5b9a2ac 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -41,6 +41,13 @@
 		assign, print, reg, unreg)			\
 	DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
+/* Define a trace event with the MAYFAULT flag set */
+#undef TRACE_EVENT_FN_MAYFAULT
+#define TRACE_EVENT_FN_MAYFAULT(name, proto, args, tstruct,		\
+		assign, print, reg, unreg)			\
+	DEFINE_TRACE_FN_FLAGS(name, reg, unreg, PARAMS(proto), PARAMS(args), \
+		TRACEPOINT_MAYFAULT)
+
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,		\
 		assign, print, reg, unreg)			\
@@ -106,6 +113,7 @@
 
 #undef TRACE_EVENT
 #undef TRACE_EVENT_FN
+#undef TRACE_EVENT_FN_MAYFAULT
 #undef TRACE_EVENT_FN_COND
 #undef TRACE_EVENT_CONDITION
 #undef TRACE_EVENT_NOP
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 7785961d82ba..af9807251226 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -138,6 +138,12 @@ TRACE_MAKE_SYSTEM_STR();
 	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
 		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
 
+#undef TRACE_EVENT_FN_MAYFAULT
+#define TRACE_EVENT_FN_MAYFAULT(name, proto, args, tstruct,		\
+		assign, print, reg, unreg)				\
+	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
+
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,	\
 		assign, print, reg, unreg)				\
diff --git a/init/Kconfig b/init/Kconfig
index 0872a5a2e759..d3b88f3cdaca 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2032,6 +2032,7 @@ config PROFILING
 #
 config TRACEPOINTS
 	bool
+	select TASKS_TRACE_RCU
 
 endmenu		# General setup
 
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 3f659f855074..41fc9c6e17f6 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -60,11 +60,16 @@ static inline void *allocate_probes(int count)
 	return p == NULL ? NULL : p->probes;
 }
 
-static void srcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 {
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
+static void srcu_free_old_probes(struct rcu_head *head)
+{
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
 static void rcu_free_old_probes(struct rcu_head *head)
 {
 	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
@@ -85,7 +90,7 @@ static __init int release_early_probes(void)
 	return 0;
 }
 
-/* SRCU is initialized at core_initcall */
+/* SRCU and Tasks Trace RCU are initialized at core_initcall */
 postcore_initcall(release_early_probes);
 
 static inline void release_probes(struct tracepoint_func *old)
@@ -95,8 +100,9 @@ static inline void release_probes(struct tracepoint_func *old)
 			struct tp_probes, probes[0]);
 
 		/*
-		 * We can't free probes if SRCU is not initialized yet.
-		 * Postpone the freeing till after SRCU is initialized.
+		 * We can't free probes if SRCU and Tasks Trace RCU are not
+		 * initialized yet. Postpone the freeing till after both are
+		 * initialized.
 		 */
 		if (unlikely(!ok_to_free_tracepoints)) {
 			tp_probes->rcu.next = early_probes;
@@ -105,10 +111,9 @@ static inline void release_probes(struct tracepoint_func *old)
 		}
 
 		/*
-		 * Tracepoint probes are protected by both sched RCU and SRCU,
-		 * by calling the SRCU callback in the sched RCU callback we
-		 * cover both cases. So let us chain the SRCU and sched RCU
-		 * callbacks to wait for both grace periods.
+		 * Tracepoint probes are protected by sched RCU, SRCU and
+		 * Tasks Trace RCU. By chaining the callbacks we cover all
+		 * three cases and wait for all three grace periods.
 		 */
 		call_rcu(&tp_probes->rcu, rcu_free_old_probes);
 	}
@@ -316,6 +321,21 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 	return 0;
 }
 
+static int __tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
+				   void *data, int prio)
+{
+	struct tracepoint_func tp_func;
+	int ret;
+
+	mutex_lock(&tracepoints_mutex);
+	tp_func.func = probe;
+	tp_func.data = data;
+	tp_func.prio = prio;
+	ret = tracepoint_add_func(tp, &tp_func, prio);
+	mutex_unlock(&tracepoints_mutex);
+	return ret;
+}
+
 /**
  * tracepoint_probe_register_prio -  Connect a probe to a tracepoint with priority
  * @tp: tracepoint
@@ -323,6 +343,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
  * @data: tracepoint data
  * @prio: priority of this function over other registered functions
  *
+ * Non-faultable probes can only be registered on non-faultable tracepoints.
+ *
  * Returns 0 if ok, error value on error.
  * Note: if @tp is within a module, the caller is responsible for
  * unregistering the probe before the module is gone. This can be
@@ -332,25 +354,49 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
 				   void *data, int prio)
 {
-	struct tracepoint_func tp_func;
-	int ret;
+	if (tp->flags & TRACEPOINT_MAYFAULT)
+		return -EINVAL;
 
-	mutex_lock(&tracepoints_mutex);
-	tp_func.func = probe;
-	tp_func.data = data;
-	tp_func.prio = prio;
-	ret = tracepoint_add_func(tp, &tp_func, prio);
-	mutex_unlock(&tracepoints_mutex);
-	return ret;
+	return __tracepoint_probe_register_prio(tp, probe, data, prio);
 }
 EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
 
+/**
+ * tracepoint_probe_register_prio_mayfault - Connect a faultable probe to a tracepoint with priority
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ * @prio: priority of this function over other registered functions
+ *
+ * When registering a probe on a tracepoint that has the
+ * TRACEPOINT_MAYFAULT flag, the probe callback will be called with
+ * preemption enabled, and is allowed to take page faults. Faultable
+ * probes can only be registered on faultable tracepoints.
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_prio_mayfault(struct tracepoint *tp, void *probe,
+				   void *data, int prio)
+{
+	if (!(tp->flags & TRACEPOINT_MAYFAULT))
+		return -EINVAL;
+
+	return __tracepoint_probe_register_prio(tp, probe, data, prio);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_mayfault);
+
 /**
  * tracepoint_probe_register -  Connect a probe to a tracepoint
  * @tp: tracepoint
  * @probe: probe handler
  * @data: tracepoint data
  *
+ * Non-faultable probes can only be registered on non-faultable tracepoints.
+ *
  * Returns 0 if ok, error value on error.
  * Note: if @tp is within a module, the caller is responsible for
  * unregistering the probe before the module is gone. This can be
@@ -363,6 +409,29 @@ int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
 }
 EXPORT_SYMBOL_GPL(tracepoint_probe_register);
 
+/**
+ * tracepoint_probe_register_mayfault - Connect a faultable probe to a tracepoint
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ *
+ * When registering a probe on a tracepoint that has the
+ * TRACEPOINT_MAYFAULT flag, the probe callback will be called with
+ * preemption enabled, and is allowed to take page faults. Faultable
+ * probes can only be registered on faultable tracepoints.
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_mayfault(struct tracepoint *tp, void *probe, void *data)
+{
+	return tracepoint_probe_register_prio_mayfault(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_mayfault);
+
 /**
  * tracepoint_probe_unregister -  Disconnect a probe from a tracepoint
  * @tp: tracepoint
-- 
2.25.1


* [RFC PATCH 2/6] tracing: ftrace: add support for faultable tracepoints
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Michael Jeanson, Mathieu Desnoyers, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf,
	Joel Fernandes

In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that ftrace can handle registering
its probes to such tracepoints by explicitly disabling preemption within
the ftrace tracepoint probes, respecting the current expectations of the
ftrace ring buffer code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.
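
Concretely, for a MAYFAULT event class the generated
trace_event_raw_event_*() probe now brackets its body with an explicit
preemption-disabled section, and the early returns become jumps to a
label so the matching enable always runs. A simplified sketch of the
generated shape (event name and argument are placeholders):

  static notrace void trace_event_raw_event_sample(void *__data, long arg)
  {
  	struct trace_event_file *trace_file = __data;

  	/* Entered preemptible when connected to a MAYFAULT tracepoint. */
  	preempt_disable_notrace();

  	if (trace_trigger_soft_disabled(trace_file))
  		goto end;	/* was "return" before this change */

  	/* ... reserve ring buffer entry, assign fields, commit ... */
  end:
  	preempt_enable_notrace();
  }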

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/trace_events.h | 75 +++++++++++++++++++++++++++++++++---
 kernel/trace/trace_events.c  | 15 +++++++-
 2 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index af9807251226..5125d3fcf963 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -80,6 +80,16 @@ TRACE_MAKE_SYSTEM_STR();
 			     PARAMS(print));		       \
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_MAYFAULT
+#define TRACE_EVENT_MAYFAULT(name, proto, args, tstruct, assign, print)	\
+	DECLARE_EVENT_CLASS_MAYFAULT(name,		       \
+			     PARAMS(proto),		       \
+			     PARAMS(args),		       \
+			     PARAMS(tstruct),		       \
+			     PARAMS(assign),		       \
+			     PARAMS(print));		       \
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 
 #undef __field
 #define __field(type, item)		type	item;
@@ -118,6 +128,12 @@ TRACE_MAKE_SYSTEM_STR();
 									\
 	static struct trace_event_class event_class_##name;
 
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(name, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)	\
 	static struct trace_event_call	__used		\
@@ -141,7 +157,7 @@ TRACE_MAKE_SYSTEM_STR();
 #undef TRACE_EVENT_FN_MAYFAULT
 #define TRACE_EVENT_FN_MAYFAULT(name, proto, args, tstruct,		\
 		assign, print, reg, unreg)				\
-	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
+	TRACE_EVENT_MAYFAULT(name, PARAMS(proto), PARAMS(args),		\
 		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
 
 #undef TRACE_EVENT_FN_COND
@@ -212,6 +228,12 @@ TRACE_MAKE_SYSTEM_STR();
 		tstruct;						\
 	};
 
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
@@ -378,6 +400,12 @@ static struct trace_event_functions trace_event_type_funcs_##call = {	\
 	.trace			= trace_raw_output_##call,		\
 };
 
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, call, proto, args, print)		\
 static notrace enum print_line_t					\
@@ -448,6 +476,12 @@ static struct trace_event_fields trace_event_fields_##call[] = {	\
 	tstruct								\
 	{} };
 
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args,			\
+		tstruct, func, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(func), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
 
@@ -524,6 +558,12 @@ static inline notrace int trace_event_get_offsets_##call(		\
 	return __data_size;						\
 }
 
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 /*
@@ -673,8 +713,8 @@ static inline notrace int trace_event_get_offsets_##call(		\
 #undef __perf_task
 #define __perf_task(t)	(t)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags)	\
 									\
 static notrace void							\
 trace_event_raw_event_##call(void *__data, proto)			\
@@ -685,8 +725,11 @@ trace_event_raw_event_##call(void *__data, proto)			\
 	struct trace_event_raw_##call *entry;				\
 	int __data_size;						\
 									\
+	if ((tp_flags) & TRACEPOINT_MAYFAULT)				\
+		preempt_disable_notrace();				\
+									\
 	if (trace_trigger_soft_disabled(trace_file))			\
-		return;							\
+		goto end;						\
 									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
@@ -694,14 +737,30 @@ trace_event_raw_event_##call(void *__data, proto)			\
 				 sizeof(*entry) + __data_size);		\
 									\
 	if (!entry)							\
-		return;							\
+		goto end;						\
 									\
 	tstruct								\
 									\
 	{ assign; }							\
 									\
 	trace_event_buffer_commit(&fbuffer);				\
+end:									\
+	if ((tp_flags) & TRACEPOINT_MAYFAULT)				\
+		preempt_enable_notrace();				\
 }
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			PARAMS(tstruct), PARAMS(assign),		\
+			PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),			\
+			PARAMS(tstruct), PARAMS(assign),			\
+			PARAMS(print), TRACEPOINT_MAYFAULT)
+
 /*
  * The ftrace_test_probe is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the ftrace probe will
@@ -748,6 +807,12 @@ static struct trace_event_class __used __refdata event_class_##call = { \
 	_TRACE_PERF_INIT(call)						\
 };
 
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, call, proto, args)			\
 									\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 802f3e7d8b8b..ed2ca828311a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -291,9 +291,15 @@ int trace_event_reg(struct trace_event_call *call,
 	WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
 	switch (type) {
 	case TRACE_REG_REGISTER:
-		return tracepoint_probe_register(call->tp,
+		if (call->tp->flags & TRACEPOINT_MAYFAULT)
+			return tracepoint_probe_register_mayfault(call->tp,
 						 call->class->probe,
 						 file);
+		else
+			return tracepoint_probe_register(call->tp,
+						 call->class->probe,
+						 file);
+
 	case TRACE_REG_UNREGISTER:
 		tracepoint_probe_unregister(call->tp,
 					    call->class->probe,
@@ -302,7 +308,12 @@ int trace_event_reg(struct trace_event_call *call,
 
 #ifdef CONFIG_PERF_EVENTS
 	case TRACE_REG_PERF_REGISTER:
-		return tracepoint_probe_register(call->tp,
+		if (call->tp->flags & TRACEPOINT_MAYFAULT)
+			return tracepoint_probe_register_mayfault(call->tp,
+						 call->class->perf_probe,
+						 call);
+		else
+			return tracepoint_probe_register(call->tp,
 						 call->class->perf_probe,
 						 call);
 	case TRACE_REG_PERF_UNREGISTER:
-- 
2.25.1


* [RFC PATCH 3/6] tracing: bpf-trace: add support for faultable tracepoints
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Michael Jeanson, Mathieu Desnoyers, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf,
	Joel Fernandes

In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that bpf can handle registering
its probes to such tracepoints by explicitly disabling preemption within
the bpf tracepoint probes, respecting the current expectations of the
bpf tracing code.

This change does not yet allow bpf to take page faults per se within its
probe, but allows its existing probes to connect to faultable
tracepoints.
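
For a faultable event, the __bpf_trace_*() stub emitted by
_DECLARE_EVENT_CLASS() roughly expands as below (sketch for a
two-argument event such as sys_enter; the tp_flags test is
constant-folded away at compile time):

  static notrace void
  __bpf_trace_sys_enter(void *__data, struct pt_regs *regs, long id)
  {
  	struct bpf_prog *prog = __data;

  	/* TRACEPOINT_MAYFAULT: bpf programs still expect this. */
  	preempt_disable_notrace();
  	bpf_trace_run2(prog, (u64)(long)regs, (u64)id);
  	preempt_enable_notrace();
  }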

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/bpf_probe.h | 23 +++++++++++++++++++++--
 kernel/trace/bpf_trace.c  |  5 ++++-
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index cd74bffed5c6..1fc3afc49f37 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,15 +55,34 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags)	\
 static notrace void							\
 __bpf_trace_##call(void *__data, proto)					\
 {									\
 	struct bpf_prog *prog = __data;					\
+									\
+	if ((tp_flags) & TRACEPOINT_MAYFAULT)				\
+		preempt_disable_notrace();				\
+									\
 	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));	\
+									\
+	if ((tp_flags) & TRACEPOINT_MAYFAULT)				\
+		preempt_enable_notrace();				\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print),	0)
+
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args, tstruct,	\
+		assign, print)						\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print),		\
+		TRACEPOINT_MAYFAULT)
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0dde84b9d29f..eeeb3dafb01e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2117,7 +2117,10 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *
 	if (prog->aux->max_tp_access > btp->writable_size)
 		return -EINVAL;
 
-	return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog);
+	if (tp->flags & TRACEPOINT_MAYFAULT)
+		return tracepoint_probe_register_mayfault(tp, (void *)btp->bpf_func, prog);
+	else
+		return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog);
 }
 
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
-- 
2.25.1


* [RFC PATCH 4/6] tracing: perf: add support for faultable tracepoints
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Michael Jeanson, Mathieu Desnoyers, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf,
	Joel Fernandes

In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that perf can handle registering
its probes to such tracepoints by explicitly disabling preemption within
the perf tracepoint probes, respecting the current expectations of the
perf ring buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.
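
As in the ftrace and bpf cases, the explicit preempt_disable_notrace()
must be paired with an enable on every path, so the early returns in
perf_trace_*() become jumps to an end label. A distilled sketch of the
resulting control flow (not the actual generated function):

  static notrace void perf_trace_sample(void *__data, long arg)
  {
  	struct trace_event_call *event_call = __data;

  	/* Only emitted when TRACEPOINT_MAYFAULT is set. */
  	preempt_disable_notrace();

  	if (!bpf_prog_array_valid(event_call) &&
  	    hlist_empty(this_cpu_ptr(event_call->perf_events)))
  		goto end;	/* a bare return would leak the disable */

  	/* ... allocate perf trace buffer, fill it, submit ... */
  end:
  	preempt_enable_notrace();
  }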

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/perf.h | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index dbc6c74defc3..9659ef91fae1 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -27,8 +27,8 @@
 #undef __perf_task
 #define __perf_task(t)	(__task = (t))
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags)	\
 static notrace void							\
 perf_trace_##call(void *__data, proto)					\
 {									\
@@ -43,13 +43,16 @@ perf_trace_##call(void *__data, proto)					\
 	int __data_size;						\
 	int rctx;							\
 									\
+	if ((tp_flags) & TRACEPOINT_MAYFAULT)				\
+		preempt_disable_notrace();				\
+									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
 	if (!bpf_prog_array_valid(event_call) &&			\
 	    __builtin_constant_p(!__task) && !__task &&			\
 	    hlist_empty(head))						\
-		return;							\
+		goto end;						\
 									\
 	__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
 			     sizeof(u64));				\
@@ -57,7 +60,7 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx);	\
 	if (!entry)							\
-		return;							\
+		goto end;						\
 									\
 	perf_fetch_caller_regs(__regs);					\
 									\
@@ -68,8 +71,23 @@ perf_trace_##call(void *__data, proto)					\
 	perf_trace_run_bpf_submit(entry, __entry_size, rctx,		\
 				  event_call, __count, __regs,		\
 				  head, __task);			\
+end:									\
+	if ((tp_flags) & TRACEPOINT_MAYFAULT)				\
+		preempt_enable_notrace();				\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAYFAULT
+#define DECLARE_EVENT_CLASS_MAYFAULT(call, proto, args, tstruct,	\
+		assign, print)						\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print),		\
+		TRACEPOINT_MAYFAULT)
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
-- 
2.25.1


* [RFC PATCH 5/6] tracing: convert sys_enter/exit to faultable tracepoints
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Michael Jeanson, Mathieu Desnoyers, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf,
	Joel Fernandes

Convert the definition of the system call enter/exit tracepoints to
faultable tracepoints now that all upstream tracers handle it.

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/events/syscalls.h |  4 +-
 kernel/trace/trace_syscalls.c   | 84 +++++++++++++++++++++++----------
 2 files changed, 60 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..2bd2d94563a2 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
 
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAYFAULT(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
 
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAYFAULT(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index d85a2f0f316b..4ca9190e26b2 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -304,21 +304,27 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int syscall_nr;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (mayfault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
 	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
 	if (!trace_file)
-		return;
+		goto end;
 
 	if (trace_trigger_soft_disabled(trace_file))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
@@ -329,7 +335,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	event = trace_buffer_lock_reserve(buffer,
 			sys_data->enter_event->event.type, size, irq_flags, pc);
 	if (!event)
-		return;
+		goto end;
 
 	entry = ring_buffer_event_data(event);
 	entry->nr = syscall_nr;
@@ -338,6 +344,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 
 	event_trigger_unlock_commit(trace_file, buffer, event, entry,
 				    irq_flags, pc);
+end:
+	preempt_enable_notrace();
 }
 
 static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
@@ -352,21 +360,27 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	int pc;
 	int syscall_nr;
 
+	/*
+	 * Probe called with preemption enabled (mayfault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
 	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
 	if (!trace_file)
-		return;
+		goto end;
 
 	if (trace_trigger_soft_disabled(trace_file))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	local_save_flags(irq_flags);
 	pc = preempt_count();
@@ -376,7 +390,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 			sys_data->exit_event->event.type, sizeof(*entry),
 			irq_flags, pc);
 	if (!event)
-		return;
+		goto end;
 
 	entry = ring_buffer_event_data(event);
 	entry->nr = syscall_nr;
@@ -384,6 +398,8 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 
 	event_trigger_unlock_commit(trace_file, buffer, event, entry,
 				    irq_flags, pc);
+end:
+	preempt_enable_notrace();
 }
 
 static int reg_event_syscall_enter(struct trace_event_file *file,
@@ -398,7 +414,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+		ret = register_trace_mayfault_sys_enter(ftrace_syscall_enter, tr);
 	if (!ret) {
 		rcu_assign_pointer(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
@@ -436,7 +452,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+		ret = register_trace_mayfault_sys_exit(ftrace_syscall_exit, tr);
 	if (!ret) {
 		rcu_assign_pointer(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
@@ -600,20 +616,26 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int rctx;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (mayfault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 	if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
 	valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
 	if (!valid_prog_array && hlist_empty(head))
-		return;
+		goto end;
 
 	/* get the size after alignment with the u32 buffer size field */
 	size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
@@ -622,7 +644,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 
 	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
-		return;
+		goto end;
 
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, args);
@@ -632,12 +654,14 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	     !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
-		return;
+		goto end;
 	}
 
 	perf_trace_buf_submit(rec, size, rctx,
 			      sys_data->enter_event->event.type, 1, regs,
 			      head, NULL);
+end:
+	preempt_enable_notrace();
 }
 
 static int perf_sysenter_enable(struct trace_event_call *call)
@@ -649,7 +673,7 @@ static int perf_sysenter_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_enter)
-		ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+		ret = register_trace_mayfault_sys_enter(perf_syscall_enter, NULL);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall entry trace point");
 	} else {
@@ -699,20 +723,26 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	int rctx;
 	int size;
 
+	/*
+	 * Probe called with preemption enabled (mayfault), but ring buffer and
+	 * per-cpu data require preemption to be disabled.
+	 */
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 	if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
 	valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
 	if (!valid_prog_array && hlist_empty(head))
-		return;
+		goto end;
 
 	/* We can probably do that at build time */
 	size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
@@ -720,7 +750,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 
 	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
-		return;
+		goto end;
 
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
@@ -729,11 +759,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	     !perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
-		return;
+		goto end;
 	}
 
 	perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
 			      1, regs, head, NULL);
+end:
+	preempt_enable_notrace();
 }
 
 static int perf_sysexit_enable(struct trace_event_call *call)
@@ -745,7 +777,7 @@ static int perf_sysexit_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_exit)
-		ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+		ret = register_trace_mayfault_sys_exit(perf_syscall_exit, NULL);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall exit trace point");
 	} else {
-- 
2.25.1


* [RFC PATCH 6/6] tracing: use Tasks Trace RCU instead of SRCU for rcuidle tracepoints
From: Michael Jeanson @ 2021-02-18 22:21 UTC
  To: linux-kernel
  Cc: Mathieu Desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf,
	Joel Fernandes

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Similarly to SRCU, Tasks Trace RCU can be used for rcuidle tracepoints.
It has the advantage of providing a faster RCU read-side. Similarly to
SRCU, Tasks Trace RCU grace periods are ready after core_initcall.

Now that Tasks Trace RCU is used for faultable tracepoints, using it for
rcuidle tracepoints is an overall simplification.

Some distinctions between SRCU and Tasks Trace RCU:

- Tasks Trace RCU can be used from NMI context, which was not possible
  with SRCU,
- Tree SRCU has more scalable grace periods than Tasks Trace RCU, but this
  should not matter for tracing use cases,
- Tasks Trace RCU has slower grace periods than SRCU (similar to those
  of RCU in upcoming kernels, but similar to Tasks RCU in current
  kernels). This should also be OK for tracing,
- SRCU readers can be used in places where Tasks Trace RCU readers cannot,
  but these places are also all places where tracing is prohibited.
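
The net effect on the __DO_TRACE() read-side, summarized from the diff
below:

  bool mayfault = tp_flags & TRACEPOINT_MAYFAULT;
  bool tasks_trace_rcu = mayfault || rcuidle;

  if (!mayfault)
  	preempt_disable_notrace();
  if (tasks_trace_rcu)
  	rcu_read_lock_trace();		/* NMI-safe, unlike SRCU */
  if (rcuidle)
  	rcu_irq_enter_irqson();

  /* ... iterate over and invoke the registered probes ... */

  if (rcuidle)
  	rcu_irq_exit_irqson();
  if (tasks_trace_rcu)
  	rcu_read_unlock_trace();
  if (!mayfault)
  	preempt_enable_notrace();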

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/linux/tracepoint.h | 34 ++++++++--------------------------
 kernel/tracepoint.c        | 25 +++++++++----------------
 2 files changed, 17 insertions(+), 42 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 04079cbd2015..c22a87c34a22 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -13,7 +13,6 @@
  */
 
 #include <linux/smp.h>
-#include <linux/srcu.h>
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/cpumask.h>
@@ -34,8 +33,6 @@ struct trace_eval_map {
 
 #define TRACEPOINT_DEFAULT_PRIO	10
 
-extern struct srcu_struct tracepoint_srcu;
-
 extern int
 tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
 extern int
@@ -87,7 +84,6 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
 static inline void tracepoint_synchronize_unregister(void)
 {
 	synchronize_rcu_tasks_trace();
-	synchronize_srcu(&tracepoint_srcu);
 	synchronize_rcu();
 }
 #else
@@ -176,30 +172,19 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 #define __DO_TRACE(name, proto, args, cond, rcuidle, tp_flags)		\
 	do {								\
 		struct tracepoint_func *it_func_ptr;			\
-		int __maybe_unused __idx = 0;				\
 		void *__data;						\
 		bool mayfault = (tp_flags) & TRACEPOINT_MAYFAULT;	\
+		bool tasks_trace_rcu = mayfault || (rcuidle);		\
 									\
 		if (!(cond))						\
 			return;						\
 									\
-		/* srcu can't be used from NMI */			\
-		WARN_ON_ONCE(rcuidle && in_nmi());			\
-									\
-		if (mayfault) {						\
-			rcu_read_lock_trace();				\
-		} else {						\
-			/* keep srcu and sched-rcu usage consistent */	\
+		if (!mayfault)						\
 			preempt_disable_notrace();			\
-		}							\
-		/*							\
-		 * For rcuidle callers, use srcu since sched-rcu	\
-		 * doesn't work from the idle path.			\
-		 */							\
-		if (rcuidle) {						\
-			__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
+		if (tasks_trace_rcu)					\
+			rcu_read_lock_trace();				\
+		if (rcuidle)						\
 			rcu_irq_enter_irqson();				\
-		}							\
 									\
 		it_func_ptr =						\
 			rcu_dereference_raw((&__tracepoint_##name)->funcs); \
@@ -209,14 +194,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			__DO_TRACE_CALL(name)(args);			\
 		}							\
 									\
-		if (rcuidle) {						\
+		if (rcuidle)						\
 			rcu_irq_exit_irqson();				\
-			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
-		}							\
-									\
-		if (mayfault)						\
+		if (tasks_trace_rcu)					\
 			rcu_read_unlock_trace();			\
-		else							\
+		if (!mayfault)						\
 			preempt_enable_notrace();			\
 	} while (0)
 
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 41fc9c6e17f6..efa49f22d435 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -18,9 +18,6 @@
 extern tracepoint_ptr_t __start___tracepoints_ptrs[];
 extern tracepoint_ptr_t __stop___tracepoints_ptrs[];
 
-DEFINE_SRCU(tracepoint_srcu);
-EXPORT_SYMBOL_GPL(tracepoint_srcu);
-
 /* Set to 1 to enable tracepoint debug output */
 static const int tracepoint_debug;
 
@@ -65,14 +62,9 @@ static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
-static void srcu_free_old_probes(struct rcu_head *head)
-{
-	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
-}
-
 static void rcu_free_old_probes(struct rcu_head *head)
 {
-	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
 }
 
 static __init int release_early_probes(void)
@@ -90,7 +82,7 @@ static __init int release_early_probes(void)
 	return 0;
 }
 
-/* SRCU and Tasks Trace RCU are initialized at core_initcall */
+/* Tasks Trace RCU is initialized at core_initcall */
 postcore_initcall(release_early_probes);
 
 static inline void release_probes(struct tracepoint_func *old)
@@ -100,9 +92,8 @@ static inline void release_probes(struct tracepoint_func *old)
 			struct tp_probes, probes[0]);
 
 		/*
-		 * We can't free probes if SRCU and Tasks Trace RCU are not
-		 * initialized yet. Postpone the freeing till after both are
-		 * initialized.
+		 * We can't free probes if Tasks Trace RCU is not initialized yet.
+		 * Postpone the freeing till after Tasks Trace RCU is initialized.
 		 */
 		if (unlikely(!ok_to_free_tracepoints)) {
 			tp_probes->rcu.next = early_probes;
@@ -111,9 +102,11 @@ static inline void release_probes(struct tracepoint_func *old)
 		}
 
 		/*
-		 * Tracepoint probes are protected by sched RCU, SRCU and
-		 * Tasks Trace RCU by chaining the callbacks we cover all three
-		 * cases and wait for all three grace periods.
+		 * Tracepoint probes are protected by both sched RCU and
+		 * Tasks Trace RCU, by calling the Tasks Trace RCU callback in
+		 * the sched RCU callback we cover both cases. So let us chain
+		 * the Tasks Trace RCU and sched RCU callbacks to wait for both
+		 * grace periods.
 		 */
 		call_rcu(&tp_probes->rcu, rcu_free_old_probes);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-18 22:21 [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2) Michael Jeanson
                   ` (5 preceding siblings ...)
  2021-02-18 22:21 ` [RFC PATCH 6/6] tracing: use Tasks Trace RCU instead of SRCU for rcuidle tracepoints Michael Jeanson
@ 2021-02-24  2:16 ` Steven Rostedt
  2021-02-24 16:22   ` Michael Jeanson
  6 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2021-02-24  2:16 UTC (permalink / raw)
  To: Michael Jeanson
  Cc: linux-kernel, Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

On Thu, 18 Feb 2021 17:21:19 -0500
Michael Jeanson <mjeanson@efficios.com> wrote:

> This series only implements the tracepoint infrastructure required to
> allow tracers to handle page faults. Modifying each tracer to handle
> those page faults would be a next step after we all agree on this piece
> of instrumentation infrastructure.

I started taking a quick look at this, and came up with the question: how
do you allow preemption when dealing with per-cpu buffers or storage to
record the data?

That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
this is the reason for the need to disable preemption. What's the solution
that LTTng is using for this? I know it has per-cpu buffers too, but does
it have some kind of "per task" buffer that is being used to extract the
data that can fault?

-- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-24  2:16 ` [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2) Steven Rostedt
@ 2021-02-24 16:22   ` Michael Jeanson
  2021-02-24 16:59     ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Jeanson @ 2021-02-24 16:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf, Mathieu Desnoyers

[ Adding Mathieu Desnoyers in CC ]

On 2021-02-23 21 h 16, Steven Rostedt wrote:
> On Thu, 18 Feb 2021 17:21:19 -0500
> Michael Jeanson <mjeanson@efficios.com> wrote:
> 
>> This series only implements the tracepoint infrastructure required to
>> allow tracers to handle page faults. Modifying each tracer to handle
>> those page faults would be a next step after we all agree on this piece
>> of instrumentation infrastructure.
> 
> I started taking a quick look at this, and came up with the question: how
> do you allow preemption when dealing with per-cpu buffers or storage to
> record the data?
> 
> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> this is the reason for the need to disable preemption. What's the solution
> that LTTng is using for this? I know it has per-cpu buffers too, but does
> it have some kind of "per task" buffer that is being used to extract the
> data that can fault?
> 
> -- Steve
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-24 16:22   ` Michael Jeanson
@ 2021-02-24 16:59     ` Mathieu Desnoyers
  2021-02-24 18:14       ` Steven Rostedt
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2021-02-24 16:59 UTC (permalink / raw)
  To: Michael Jeanson, rostedt
  Cc: linux-kernel, Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	paulmck, Ingo Molnar, acme, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Joel Fernandes (Google), bpf

----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjeanson@efficios.com wrote:

> [ Adding Mathieu Desnoyers in CC ]
> 
> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>> On Thu, 18 Feb 2021 17:21:19 -0500
>> Michael Jeanson <mjeanson@efficios.com> wrote:
>> 
>>> This series only implements the tracepoint infrastructure required to
>>> allow tracers to handle page faults. Modifying each tracer to handle
>>> those page faults would be a next step after we all agree on this piece
>>> of instrumentation infrastructure.
>> 
>> I started taking a quick look at this, and came up with the question: how
>> do you allow preemption when dealing with per-cpu buffers or storage to
>> record the data?
>> 
>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>> this is the reason for the need to disable preemption. What's the solution
>> that LTTng is using for this? I know it has per-cpu buffers too, but does
>> it have some kind of "per task" buffer that is being used to extract the
>> data that can fault?

As a prototype solution, what I've done currently is to copy the user-space
data into a kmalloc'd buffer in a preparation step before disabling preemption
and copying data over into the per-cpu buffers. It works, but I think we should
be able to do it without the needless copy.
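
In pseudo-C, the prototype does roughly this (probe_copy_user_string()
and the trace_buf_*() helpers are invented names standing in for the
real per-cpu buffer code):

static int probe_copy_user_string(const char __user *ustr, size_t len)
{
	void *tmp, *slot;

	/* Bounce buffer: this is the extra copy I would like to avoid. */
	tmp = kmalloc(len, GFP_KERNEL);
	if (!tmp)
		return -ENOMEM;
	/* May fault and sleep: preemption is still enabled here. */
	if (copy_from_user(tmp, ustr, len)) {
		kfree(tmp);
		return -EFAULT;
	}

	/* The per-cpu buffer requires preemption to be disabled. */
	preempt_disable_notrace();
	slot = trace_buf_reserve(len);		/* invented */
	if (slot) {
		memcpy(slot, tmp, len);		/* second copy */
		trace_buf_commit(slot);		/* invented */
	}
	preempt_enable_notrace();

	kfree(tmp);
	return slot ? 0 : -ENOBUFS;
}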

What I have in mind as an efficient solution (not implemented yet) for the LTTng
kernel tracer goes as follows:

#define COMMIT_LOCAL 0
#define COMMIT_REMOTE 1

- faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
  - probe code calculates the length that needs to be reserved to store the event
    (e.g. user strlen),

  - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
    - reserve_cpu = smp_processor_id()
    - reserve space in the ring buffer for reserve_cpu
      [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
        from any cpu until we commit. ]
  - preempt enable -> [ preemption/blocking/migration is allowed from here ]

  - copy data from user-space to the ring buffer "slot",

  - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
    commit_cpu = smp_processor_id()
    if (commit_cpu == reserve_cpu)
       use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
    else
       use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
  - preempt enable -> [ preemption/blocking/migration is allowed from here ]

Given that LTTng uses per-buffer/per-sub-buffer commit counters as simple free-running
accumulators, the trick here is to use two commit counters rather than a single one for each
sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
LOCAL and REMOTE counters.

This allows dealing with migration between reserve and commit without requiring the overhead
of an atomic operation on the fast-path (LOCAL case).
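
Condensed into a sketch (the types and helper names are invented; the
real counters accumulate a record size in bytes within each sub-buffer):

struct subbuf_commit {
	local_t		local;	/* only incremented from the reserving CPU */
	atomic_long_t	remote;	/* incremented after migrating to another CPU */
};

/* sc points into the sub-buffer where the slot was reserved. */
static void commit_slot(struct subbuf_commit *sc, int reserve_cpu, size_t len)
{
	preempt_disable_notrace();
	if (smp_processor_id() == reserve_cpu)
		local_add(len, &sc->local);		/* fast path, no atomic op */
	else
		atomic_long_add(len, &sc->remote);	/* migration: cross-CPU atomic */
	preempt_enable_notrace();
}

/* Readers always sum both halves to get the logical commit count. */
static unsigned long read_commit_count(struct subbuf_commit *sc)
{
	return local_read(&sc->local) + atomic_long_read(&sc->remote);
}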

I had to design this kind of dual-counter trick in the context of user-space use of restartable
sequences. It looks like it may have a role to play in the kernel as well. :)

Or am I missing something important that would not survive real life?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-24 16:59     ` Mathieu Desnoyers
@ 2021-02-24 18:14       ` Steven Rostedt
  2021-02-25 21:46         ` Mathieu Desnoyers
  2021-02-24 23:54       ` Mathieu Desnoyers
  2021-02-26  5:28       ` Lai Jiangshan
  2 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2021-02-24 18:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Jeanson, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf

On Wed, 24 Feb 2021 11:59:35 -0500 (EST)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
> 
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
> 
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
> 
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
>   - probe code calculates the length that needs to be reserved to store the event
>     (e.g. user strlen),
> 
>   - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>     - reserve_cpu = smp_processor_id()
>     - reserve space in the ring buffer for reserve_cpu
>       [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>         from any cpu until we commit. ]
>   - preempt enable -> [ preemption/blocking/migration is allowed from here ]
> 

So basically the commit position here doesn't move until this task is
scheduled back in and the commit (remote or local) is updated.

To put it in terms of the ftrace ring buffer: we have both a commit
page and a commit index, and they only get moved by the first writer to
start a commit stack (that is, interrupts that interrupted a write will
not increment the commit).
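
Very roughly, in pseudo-C (an illustration of the invariant only, not the
real ring buffer code, which also has to re-check for late nesting):

struct cpu_buffer {
	local_t	committing;	/* nesting depth of writers on this CPU */
	local_t	write;		/* reserve index */
	local_t	commit;		/* only moved by the outermost writer */
};

static void rb_start_commit(struct cpu_buffer *cb)
{
	local_inc(&cb->committing);
}

static void rb_end_commit(struct cpu_buffer *cb)
{
	/*
	 * Nested (interrupting) writers only drop the nesting count;
	 * the outermost writer publishes everything reserved so far.
	 */
	if (local_read(&cb->committing) == 1)
		local_set(&cb->commit, local_read(&cb->write));
	local_dec(&cb->committing);
}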

Now, I'm not sure how LTTng does it, but I could see issues for ftrace to
try to move the commit pointer (the pointer to the new commit page), as the
design is currently dependent on the fact that it can't happen while
commits are taking place.

Are the pages of the LTTng buffer indexed by an array of pages?

-- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-24 16:59     ` Mathieu Desnoyers
  2021-02-24 18:14       ` Steven Rostedt
@ 2021-02-24 23:54       ` Mathieu Desnoyers
  2021-02-26  5:28       ` Lai Jiangshan
  2 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2021-02-24 23:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Jeanson, rostedt, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf


----- Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> ----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjeanson@efficios.com wrote:
> 
> > [ Adding Mathieu Desnoyers in CC ]
> > 
> > On 2021-02-23 21 h 16, Steven Rostedt wrote:
> >> On Thu, 18 Feb 2021 17:21:19 -0500
> >> Michael Jeanson <mjeanson@efficios.com> wrote:
> >> 
> >>> This series only implements the tracepoint infrastructure required to
> >>> allow tracers to handle page faults. Modifying each tracer to handle
> >>> those page faults would be a next step after we all agree on this piece
> >>> of instrumentation infrastructure.
> >> 
> >> I started taking a quick look at this, and came up with the question: how
> >> do you allow preemption when dealing with per-cpu buffers or storage to
> >> record the data?
> >> 
> >> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> >> this is the reason for the need to disable preemption. What's the solution
> >> that LTTng is using for this? I know it has per-cpu buffers too, but does
> >> it have some kind of "per task" buffer that is being used to extract the
> >> data that can fault?
> 
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
> 
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
> 
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
> 
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
>   - probe code calculates the length that needs to be reserved to store the event
>     (e.g. user strlen),
> 
>   - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>     - reserve_cpu = smp_processor_id()
>     - reserve space in the ring buffer for reserve_cpu
>       [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>         from any cpu until we commit. ]
>   - preempt enable -> [ preemption/blocking/migration is allowed from here ]
> 
>   - copy data from user-space to the ring buffer "slot",
> 
>   - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>     commit_cpu = smp_processor_id()
>     if (commit_cpu == reserve_cpu)
>        use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
>     else
>        use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]

The line above should increment reserve_cpu's buffer commit count, of course.

Thanks,

Mathieu

>   - preempt enable -> [ preemption/blocking/migration is allowed from here ]
> 
> Given that LTTng uses per-buffer/per-sub-buffer commit counters as simple free-running
> accumulators, the trick here is to use two commit counters rather than a single one for each
> sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
> LOCAL and REMOTE counters.
> 
> This allows dealing with migration between reserve and commit without requiring the overhead
> of an atomic operation on the fast-path (LOCAL case).
> 
> I had to design this kind of dual-counter trick in the context of user-space use of restartable
> sequences. It looks like it may have a role to play in the kernel as well. :)
> 
> Or am I missing something important that would not survive real life?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-24 18:14       ` Steven Rostedt
@ 2021-02-25 21:46         ` Mathieu Desnoyers
  0 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2021-02-25 21:46 UTC (permalink / raw)
  To: rostedt
  Cc: Michael Jeanson, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf



----- On Feb 24, 2021, at 1:14 PM, rostedt rostedt@goodmis.org wrote:

> On Wed, 24 Feb 2021 11:59:35 -0500 (EST)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>> 
>> As a prototype solution, what I've done currently is to copy the user-space
>> data into a kmalloc'd buffer in a preparation step before disabling preemption
>> and copying data over into the per-cpu buffers. It works, but I think we should
>> be able to do it without the needless copy.
>> 
>> What I have in mind as an efficient solution (not implemented yet) for the LTTng
>> kernel tracer goes as follows:
>> 
>> #define COMMIT_LOCAL 0
>> #define COMMIT_REMOTE 1
>> 
>> - faultable probe is called from system call tracepoint [
>> preemption/blocking/migration is allowed ]
>>   - probe code calculates the length that needs to be reserved to store the event
>>     (e.g. user strlen),
>> 
>>   - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>>     - reserve_cpu = smp_processor_id()
>>     - reserve space in the ring buffer for reserve_cpu
>>       [ from that point on, we have _exclusive_ access to write into the ring buffer
>>       "slot"
>>         from any cpu until we commit. ]
>>   - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>> 
> 
> So basically the commit position here doesn't move until this task is
> scheduled back in and the commit (remote or local) is updated.

Indeed.

> To put it in terms of the ftrace ring buffer, where we have both a commit
> page and a commit index, and it only gets moved by the first one to start a
> commit stack (that is, interrupts that interrupted a write will not
> increment the commit).

The tricky part for ftrace is its reliance on the fact that the concurrent
users of the per-cpu ring buffer are all nested contexts. LTTng does not
assume that and has been designed to be used both in kernel and user-space:
lttng-modules and lttng-ust share a lot of ring buffer code. Therefore,
LTTng's ring buffer supports preemption/migration of concurrent contexts.

The fact that LTTng uses local-atomic-ops on its kernel ring buffers is just
an optimization on an overall ring buffer design meant to allow preemption.

> Now, I'm not sure how LTTng does it, but I could see issues for ftrace to
> try to move the commit pointer (the pointer to the new commit page), as the
> design is currently dependent on the fact that it can't happen while
> commits are taken place.

Indeed, what makes it easy for LTTng is that the ring buffer has been
designed to support preemption/migration from the ground up.

> Are the pages of the LTTng buffer indexed by an array of pages?

Yes, they are. Handling the initial page allocation and then the tracer copy of data
to/from the ring buffer pages is the responsibility of the LTTng lib ring buffer "backend".
The LTTng lib ring buffer backend is somewhat similar to a page table done in software, where
the top level of the page table can be dynamically updated when doing flight recorder tracing.
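
Schematically, a backend write resolves a linear offset through that page
array (all names invented, allocation and bounds handling elided):

struct subbuf_backend {
	struct page **pages;	/* top level, swappable for flight recorder */
};

static void backend_write(struct subbuf_backend *b, size_t offset,
			  const char *src, size_t len)
{
	while (len) {
		size_t off_in_page = offset & (PAGE_SIZE - 1);
		size_t chunk = min_t(size_t, len, PAGE_SIZE - off_in_page);
		char *dst = page_address(b->pages[offset >> PAGE_SHIFT]);

		/* A record may span several non-contiguous pages. */
		memcpy(dst + off_in_page, src, chunk);
		offset += chunk;
		src += chunk;
		len -= chunk;
	}
}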

It is however completely separate from the space reservation/commit scheme which is handled
by the lib ring buffer "frontend".

The algorithm I described in my prior email is specifically targeted at the frontend layer,
leaving the "backend" unchanged.

For some reason I suspect the ftrace ring buffer combined those two layers into a single
algorithm, which may have its advantages, but seems to strengthen its dependency on
only having nested contexts sharing a given per-cpu ring buffer.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
  2021-02-24 16:59     ` Mathieu Desnoyers
  2021-02-24 18:14       ` Steven Rostedt
  2021-02-24 23:54       ` Mathieu Desnoyers
@ 2021-02-26  5:28       ` Lai Jiangshan
  2 siblings, 0 replies; 14+ messages in thread
From: Lai Jiangshan @ 2021-02-26  5:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Jeanson, rostedt, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf

On Thu, Feb 25, 2021 at 9:15 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjeanson@efficios.com wrote:
>
> > [ Adding Mathieu Desnoyers in CC ]
> >
> > On 2021-02-23 21 h 16, Steven Rostedt wrote:
> >> On Thu, 18 Feb 2021 17:21:19 -0500
> >> Michael Jeanson <mjeanson@efficios.com> wrote:
> >>
> >>> This series only implements the tracepoint infrastructure required to
> >>> allow tracers to handle page faults. Modifying each tracer to handle
> >>> those page faults would be a next step after we all agree on this piece
> >>> of instrumentation infrastructure.
> >>
> >> I started taking a quick look at this, and came up with the question: how
> >> do you allow preemption when dealing with per-cpu buffers or storage to
> >> record the data?
> >>
> >> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> >> this is the reason for the need to disable preemption. What's the solution
> >> that LTTng is using for this? I know it has per-cpu buffers too, but does
> >> it have some kind of "per task" buffer that is being used to extract the
> >> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]

[ add a label here: ]
restart:

>   - probe code calculates the length that needs to be reserved to store the event
>     (e.g. user strlen),

Does "user strlen" makes the content fault in?

Is it possible to make the sleepable faulting happen only here, between
"restart" and the following "preempt disable"?  The code here should
do a prefetch operation like "user strlen".

And we can keep preemption disabled when copying the data.  If there
is a fault while copying, then we can restart from the label "restart".
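
Something like the following sketch, perhaps (copy_from_user_nofault()
fails with -EFAULT instead of handling the fault; the trace_buf_*()
helpers are made-up names):

static int record_user_string(const char __user *ustr)
{
	void *slot;
	long len;

restart:
	/* Faultable prefetch: may sleep, preemption still enabled. */
	len = strnlen_user(ustr, PAGE_SIZE);
	if (!len)
		return -EFAULT;

	preempt_disable_notrace();
	slot = trace_buf_reserve(len);			/* made up */
	if (slot) {
		if (copy_from_user_nofault(slot, ustr, len)) {
			/* The page went away again: discard and retry. */
			trace_buf_discard(slot);	/* made up */
			preempt_enable_notrace();
			goto restart;
		}
		trace_buf_commit(slot);			/* made up */
	}
	preempt_enable_notrace();
	return slot ? 0 : -ENOBUFS;
}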

Very immature thought.

Thanks
Lai

>
>   - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>     - reserve_cpu = smp_processor_id()
>     - reserve space in the ring buffer for reserve_cpu
>       [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
>         from any cpu until we commit. ]
>   - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
>   - copy data from user-space to the ring buffer "slot",
>
>   - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>     commit_cpu = smp_processor_id()
>     if (commit_cpu == reserve_cpu)
>        use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
>     else
>        use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
>   - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that LTTng uses per-buffer/per-sub-buffer commit counters as simple free-running
> accumulators, the trick here is to use two commit counters rather than a single one for each
> sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
> LOCAL and REMOTE counters.
>
> This allows dealing with migration between reserve and commit without requiring the overhead
> of an atomic operation on the fast-path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space use of restartable
> sequences. It looks like it may have a role to play in the kernel as well. :)
>
> Or am I missing something important that would not survive real life?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-02-26  5:29 UTC | newest]

Thread overview: 14+ messages
2021-02-18 22:21 [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2) Michael Jeanson
2021-02-18 22:21 ` [RFC PATCH 1/6] tracing: introduce faultable " Michael Jeanson
2021-02-18 22:21 ` [RFC PATCH 2/6] tracing: ftrace: add support for faultable tracepoints Michael Jeanson
2021-02-18 22:21 ` [RFC PATCH 3/6] tracing: bpf-trace: " Michael Jeanson
2021-02-18 22:21 ` [RFC PATCH 4/6] tracing: perf: " Michael Jeanson
2021-02-18 22:21 ` [RFC PATCH 5/6] tracing: convert sys_enter/exit to " Michael Jeanson
2021-02-18 22:21 ` [RFC PATCH 6/6] tracing: use Tasks Trace RCU instead of SRCU for rcuidle tracepoints Michael Jeanson
2021-02-24  2:16 ` [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2) Steven Rostedt
2021-02-24 16:22   ` Michael Jeanson
2021-02-24 16:59     ` Mathieu Desnoyers
2021-02-24 18:14       ` Steven Rostedt
2021-02-25 21:46         ` Mathieu Desnoyers
2021-02-24 23:54       ` Mathieu Desnoyers
2021-02-26  5:28       ` Lai Jiangshan
