linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/6] Sleepable tracepoints
@ 2020-10-23 19:53 Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 1/6] tracing: introduce sleepable tracepoints Michael Jeanson
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E. McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use case for tracers. However, tracepoints
currently disable preemption around iteration over the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow specific tracer
probes to take page faults. Adapt ftrace, perf, and eBPF to allow them
to be called from a sleepable context, and convert the system call
enter/exit instrumentation to sleepable tracepoints.

This series only implements the tracepoint infrastructure required to
allow tracers to handle page faults. Modifying each tracer to handle
those page faults would be a next step after we all agree on this piece
of instrumentation infrastructure.
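
As an illustration of the intended end state, here is a rough sketch of
a tracer probe attached to the sleepable sys_enter tracepoint. This is
not part of the series and no in-tree tracer does this yet; the probe
and its payload handling are hypothetical, but the registration helper
is the one generated by the macros from patch 1 once sys_enter is
converted in patch 5:

  #include <linux/uaccess.h>
  #include <asm/syscall.h>
  #include <trace/events/syscalls.h>

  /*
   * Called with preemption enabled under Tasks Trace RCU, so the
   * copy_from_user() below is allowed to take a page fault.
   */
  static void my_sys_enter_probe(void *data, struct pt_regs *regs, long id)
  {
          unsigned long args[6];
          char buf[64];

          syscall_get_arguments(current, regs, args);

          /* Treat the first argument as a user pointer; may fault. */
          if (copy_from_user(buf, (void __user *)args[0], sizeof(buf)))
                  return;
          /* ... hand buf off to the tracer's own buffering ... */
  }

  /* e.g. from a module init function: */
  ret = register_trace_maysleep_sys_enter(my_sys_enter_probe, NULL);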

This patchset is based on v5.9.1.

Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org

Mathieu Desnoyers (1):
  tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

Michael Jeanson (5):
  tracing: introduce sleepable tracepoints
  tracing: ftrace: add support for sleepable tracepoints
  tracing: bpf-trace: add support for sleepable tracepoints
  tracing: perf: add support for sleepable tracepoints
  tracing: convert sys_enter/exit to sleepable tracepoints

 include/linux/tracepoint-defs.h |  11 ++++
 include/linux/tracepoint.h      | 104 +++++++++++++++++++++-----------
 include/trace/bpf_probe.h       |  23 ++++++-
 include/trace/define_trace.h    |   7 +++
 include/trace/events/syscalls.h |   4 +-
 include/trace/perf.h            |  26 ++++++--
 include/trace/trace_events.h    |  79 ++++++++++++++++++++++--
 init/Kconfig                    |   1 +
 kernel/trace/bpf_trace.c        |   5 +-
 kernel/trace/trace_events.c     |  15 ++++-
 kernel/trace/trace_syscalls.c   |  68 +++++++++++++--------
 kernel/tracepoint.c             | 104 +++++++++++++++++++++++++-------
 12 files changed, 351 insertions(+), 96 deletions(-)

-- 
2.25.1



* [RFC PATCH 1/6] tracing: introduce sleepable tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
@ 2020-10-23 19:53 ` Michael Jeanson
  2020-10-26 22:43   ` Alexei Starovoitov
  2020-10-23 19:53 ` [RFC PATCH 2/6] tracing: ftrace: add support for " Michael Jeanson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E. McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use case for tracers. However, tracepoints
currently disable preemption around iteration over the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow defining sleepable
tracepoints, which invoke their probe callbacks with preemption enabled.

Also extend the tracepoint API to allow tracers to request specific
probes to be connected to those sleepable tracepoints. When the
TRACEPOINT_MAYSLEEP flag is provided on registration, the probe callback
will be called with preemption enabled, and is allowed to take page
faults. Sleepable probes can only be registered on sleepable
tracepoints and non-sleepable probes on non-sleepable tracepoints.

The Tasks Trace RCU mechanism is used to synchronize read-side
traversal of the registered probes with respect to the unregistration
and teardown of sleepable probes.
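
For example, with these changes a sleepable tracepoint could be
declared, defined and used as follows (sketch only; the "myevent"
tracepoint and probe are made up for illustration):

  /* In a header: */
  DECLARE_TRACE_MAYSLEEP(myevent,
          TP_PROTO(void __user *uptr),
          TP_ARGS(uptr));

  /* In a C file: */
  DEFINE_TRACE_FN_FLAGS(myevent, NULL, NULL, TRACEPOINT_MAYSLEEP);

  static void my_probe(void *data, void __user *uptr)
  {
          /* Preemption is enabled here; page faults are allowed. */
  }

  /* Only the _maysleep variants may attach to this tracepoint;
   * plain tracepoint_probe_register() would return -EINVAL. */
  ret = register_trace_maysleep_myevent(my_probe, NULL);

  trace_myevent(uptr);            /* runs my_probe() preemptibly */

  unregister_trace_myevent(my_probe, NULL);
  /* Waits for sched RCU, SRCU and Tasks Trace RCU grace periods. */
  tracepoint_synchronize_unregister();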

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org
---
 include/linux/tracepoint-defs.h |  11 ++++
 include/linux/tracepoint.h      |  85 +++++++++++++++++++++-----
 include/trace/define_trace.h    |   7 +++
 include/trace/trace_events.h    |   6 ++
 init/Kconfig                    |   1 +
 kernel/tracepoint.c             | 103 ++++++++++++++++++++++++++------
 6 files changed, 181 insertions(+), 32 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index b29950a19205..87ff40cf343f 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -27,12 +27,23 @@ struct tracepoint_func {
 	int prio;
 };
 
+/**
+ * enum tracepoint_flags - Tracepoint flags
+ * @TRACEPOINT_MAYSLEEP: The tracepoint probe callback will be called with
+ *                       preemption enabled, and is allowed to take page
+ *                       faults.
+ */
+enum tracepoint_flags {
+	TRACEPOINT_MAYSLEEP = (1 << 0),
+};
+
 struct tracepoint {
 	const char *name;		/* Tracepoint name */
 	struct static_key key;
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
+	unsigned int flags;
 };
 
 #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..0386b54cbcbb 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -18,6 +18,7 @@
 #include <linux/types.h>
 #include <linux/cpumask.h>
 #include <linux/rcupdate.h>
+#include <linux/rcupdate_trace.h>
 #include <linux/tracepoint-defs.h>
 
 struct module;
@@ -37,9 +38,14 @@ extern struct srcu_struct tracepoint_srcu;
 extern int
 tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
 extern int
+tracepoint_probe_register_maysleep(struct tracepoint *tp, void *probe, void *data);
+extern int
 tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
 			       int prio);
 extern int
+tracepoint_probe_register_prio_maysleep(struct tracepoint *tp, void *probe, void *data,
+			       int prio);
+extern int
 tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
@@ -79,6 +85,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
 #ifdef CONFIG_TRACEPOINTS
 static inline void tracepoint_synchronize_unregister(void)
 {
+	synchronize_rcu_tasks_trace();
 	synchronize_srcu(&tracepoint_srcu);
 	synchronize_rcu();
 }
@@ -157,12 +164,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * has a "void" prototype, then it is invalid to declare a function
  * as "(void *, void)".
  */
-#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
+#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)		\
 	do {								\
 		struct tracepoint_func *it_func_ptr;			\
 		void *it_func;						\
 		void *__data;						\
 		int __maybe_unused __idx = 0;				\
+		bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;	\
 									\
 		if (!(cond))						\
 			return;						\
@@ -170,8 +178,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		/* srcu can't be used from NMI */			\
 		WARN_ON_ONCE(rcuidle && in_nmi());			\
 									\
-		/* keep srcu and sched-rcu usage consistent */		\
-		preempt_disable_notrace();				\
+		if (maysleep) {						\
+			might_sleep();					\
+			rcu_read_lock_trace();				\
+		} else {						\
+			/* keep srcu and sched-rcu usage consistent */	\
+			preempt_disable_notrace();			\
+		}							\
 									\
 		/*							\
 		 * For rcuidle callers, use srcu since sched-rcu	\
@@ -197,21 +210,24 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
 		}							\
 									\
-		preempt_enable_notrace();				\
+		if (maysleep)						\
+			rcu_read_unlock_trace();			\
+		else							\
+			preempt_enable_notrace();			\
 	} while (0)
 
 #ifndef MODULE
-#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args, tp_flags) \
 	static inline void trace_##name##_rcuidle(proto)		\
 	{								\
 		if (static_key_false(&__tracepoint_##name.key))		\
 			__DO_TRACE(&__tracepoint_##name,		\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args),			\
-				TP_CONDITION(cond), 1);			\
+				TP_CONDITION(cond), 1, tp_flags);	\
 	}
 #else
-#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args)
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args, tp_flags)
 #endif
 
 /*
@@ -226,7 +242,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * even when this tracepoint is off. This code has no purpose other than
  * poking RCU a bit.
  */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args, tp_flags) \
 	extern struct tracepoint __tracepoint_##name;			\
 	static inline void trace_##name(proto)				\
 	{								\
@@ -234,7 +250,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			__DO_TRACE(&__tracepoint_##name,		\
 				TP_PROTO(data_proto),			\
 				TP_ARGS(data_args),			\
-				TP_CONDITION(cond), 0);			\
+				TP_CONDITION(cond), 0, tp_flags);	\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			rcu_read_lock_sched_notrace();			\
 			rcu_dereference_sched(__tracepoint_##name.funcs);\
@@ -242,7 +258,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		}							\
 	}								\
 	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),		\
-		PARAMS(cond), PARAMS(data_proto), PARAMS(data_args))	\
+		PARAMS(cond), PARAMS(data_proto), PARAMS(data_args), tp_flags)	\
 	static inline int						\
 	register_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
@@ -250,6 +266,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 						(void *)probe, data);	\
 	}								\
 	static inline int						\
+	register_trace_maysleep_##name(void (*probe)(data_proto), void *data)	\
+	{								\
+		return tracepoint_probe_register_maysleep(&__tracepoint_##name,	\
+						(void *)probe, data);	\
+	}								\
+	static inline int						\
 	register_trace_prio_##name(void (*probe)(data_proto), void *data,\
 				   int prio)				\
 	{								\
@@ -257,6 +279,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 					      (void *)probe, data, prio); \
 	}								\
 	static inline int						\
+	register_trace_prio_maysleep_##name(void (*probe)(data_proto),	\
+			void *data, int prio)				\
+	{								\
+		return tracepoint_probe_register_prio_maysleep(&__tracepoint_##name, \
+					      (void *)probe, data, prio); \
+	}								\
+	static inline int						\
 	unregister_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
 		return tracepoint_probe_unregister(&__tracepoint_##name,\
@@ -277,14 +306,17 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * structures, so we create an array of pointers that will be used for iteration
  * on the tracepoints.
  */
-#define DEFINE_TRACE_FN(name, reg, unreg)				 \
+#define DEFINE_TRACE_FN_FLAGS(name, reg, unreg, tp_flags)		 \
 	static const char __tpstrtab_##name[]				 \
 	__section(__tracepoints_strings) = #name;			 \
 	struct tracepoint __tracepoint_##name __used			 \
 	__section(__tracepoints) =					 \
-		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
+		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, tp_flags };\
 	__TRACEPOINT_ENTRY(name);
 
+#define DEFINE_TRACE_FN(name, reg, unreg)				\
+	DEFINE_TRACE_FN_FLAGS(name, reg, unreg, 0)
+
 #define DEFINE_TRACE(name)						\
 	DEFINE_TRACE_FN(name, NULL, NULL);
 
@@ -294,7 +326,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	EXPORT_SYMBOL(__tracepoint_##name)
 
 #else /* !TRACEPOINTS_ENABLED */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args, tp_flags) \
 	static inline void trace_##name(proto)				\
 	{ }								\
 	static inline void trace_##name##_rcuidle(proto)		\
@@ -306,6 +338,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		return -ENOSYS;						\
 	}								\
 	static inline int						\
+	register_trace_maysleep_##name(void (*probe)(data_proto),	\
+			      void *data)				\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline int						\
+	register_trace_prio_maysleep_##name(void (*probe)(data_proto),	\
+			void *data, int prio)				\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline int						\
 	unregister_trace_##name(void (*probe)(data_proto),		\
 				void *data)				\
 	{								\
@@ -320,6 +364,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		return false;						\
 	}
 
+#define DEFINE_TRACE_FN_FLAGS(name, reg, unreg, tp_flags)
 #define DEFINE_TRACE_FN(name, reg, unreg)
 #define DEFINE_TRACE(name)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
@@ -375,13 +420,20 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),		\
 			cpu_online(raw_smp_processor_id()),		\
 			PARAMS(void *__data, proto),			\
-			PARAMS(__data, args))
+			PARAMS(__data, args), 0)
+
+#define DECLARE_TRACE_MAYSLEEP(name, proto, args)			\
+	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),		\
+			cpu_online(raw_smp_processor_id()),		\
+			PARAMS(void *__data, proto),			\
+			PARAMS(__data, args),				\
+			TRACEPOINT_MAYSLEEP)
 
 #define DECLARE_TRACE_CONDITION(name, proto, args, cond)		\
 	__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),		\
 			cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
 			PARAMS(void *__data, proto),			\
-			PARAMS(__data, args))
+			PARAMS(__data, args), 0)
 
 #define TRACE_EVENT_FLAGS(event, flag)
 
@@ -512,6 +564,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 #define TRACE_EVENT_FN(name, proto, args, struct,		\
 		assign, print, reg, unreg)			\
 	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, struct,	\
+		assign, print, reg, unreg)			\
+	DECLARE_TRACE_MAYSLEEP(name, PARAMS(proto), PARAMS(args))
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, struct,		\
 		assign, print, reg, unreg)			\
 	DECLARE_TRACE_CONDITION(name, PARAMS(proto),	\
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index bd75f97867b9..2b6ae7c978b3 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -41,6 +41,12 @@
 		assign, print, reg, unreg)			\
 	DEFINE_TRACE_FN(name, reg, unreg)
 
+/* Define a trace event with the MAYSLEEP flag set */
+#undef TRACE_EVENT_FN_MAYSLEEP
+#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, tstruct,		\
+		assign, print, reg, unreg)			\
+	DEFINE_TRACE_FN_FLAGS(name, reg, unreg, TRACEPOINT_MAYSLEEP)
+
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,		\
 		assign, print, reg, unreg)			\
@@ -106,6 +112,7 @@
 
 #undef TRACE_EVENT
 #undef TRACE_EVENT_FN
+#undef TRACE_EVENT_FN_MAYSLEEP
 #undef TRACE_EVENT_FN_COND
 #undef TRACE_EVENT_CONDITION
 #undef TRACE_EVENT_NOP
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 1bc3e7bba9a4..8b3f4068a702 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -138,6 +138,12 @@ TRACE_MAKE_SYSTEM_STR();
 	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
 		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
 
+#undef TRACE_EVENT_FN_MAYSLEEP
+#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, tstruct,		\
+		assign, print, reg, unreg)				\
+	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
+
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,	\
 		assign, print, reg, unreg)				\
diff --git a/init/Kconfig b/init/Kconfig
index d6a0b31b13dc..857f57562490 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2018,6 +2018,7 @@ config PROFILING
 #
 config TRACEPOINTS
 	bool
+	select TASKS_TRACE_RCU
 
 endmenu		# General setup
 
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 73956eaff8a9..8d8e41c5d8a5 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -60,11 +60,16 @@ static inline void *allocate_probes(int count)
 	return p == NULL ? NULL : p->probes;
 }
 
-static void srcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 {
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
+static void srcu_free_old_probes(struct rcu_head *head)
+{
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
 static void rcu_free_old_probes(struct rcu_head *head)
 {
 	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
@@ -85,7 +90,7 @@ static __init int release_early_probes(void)
 	return 0;
 }
 
-/* SRCU is initialized at core_initcall */
+/* SRCU and Tasks Trace RCU are initialized at core_initcall */
 postcore_initcall(release_early_probes);
 
 static inline void release_probes(struct tracepoint_func *old)
@@ -95,8 +100,9 @@ static inline void release_probes(struct tracepoint_func *old)
 			struct tp_probes, probes[0]);
 
 		/*
-		 * We can't free probes if SRCU is not initialized yet.
-		 * Postpone the freeing till after SRCU is initialized.
+		 * We can't free probes if SRCU and Tasks Trace RCU are not
+		 * initialized yet. Postpone the freeing till after both are
+		 * initialized.
 		 */
 		if (unlikely(!ok_to_free_tracepoints)) {
 			tp_probes->rcu.next = early_probes;
@@ -105,10 +111,9 @@ static inline void release_probes(struct tracepoint_func *old)
 		}
 
 		/*
-		 * Tracepoint probes are protected by both sched RCU and SRCU,
-		 * by calling the SRCU callback in the sched RCU callback we
-		 * cover both cases. So let us chain the SRCU and sched RCU
-		 * callbacks to wait for both grace periods.
+		 * Tracepoint probes are protected by sched RCU, SRCU and
+		 * Tasks Trace RCU; by chaining the callbacks we cover all
+		 * three cases and wait for all three grace periods.
 		 */
 		call_rcu(&tp_probes->rcu, rcu_free_old_probes);
 	}
@@ -289,6 +294,21 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 	return 0;
 }
 
+static int __tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
+				   void *data, int prio)
+{
+	struct tracepoint_func tp_func;
+	int ret;
+
+	mutex_lock(&tracepoints_mutex);
+	tp_func.func = probe;
+	tp_func.data = data;
+	tp_func.prio = prio;
+	ret = tracepoint_add_func(tp, &tp_func, prio);
+	mutex_unlock(&tracepoints_mutex);
+	return ret;
+}
+
 /**
  * tracepoint_probe_register_prio -  Connect a probe to a tracepoint with priority
  * @tp: tracepoint
@@ -296,6 +316,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
  * @data: tracepoint data
  * @prio: priority of this function over other registered functions
  *
+ * Non-sleepable probes can only be registered on non-sleepable tracepoints.
+ *
  * Returns 0 if ok, error value on error.
  * Note: if @tp is within a module, the caller is responsible for
  * unregistering the probe before the module is gone. This can be
@@ -305,25 +327,49 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
 				   void *data, int prio)
 {
-	struct tracepoint_func tp_func;
-	int ret;
+	if (tp->flags & TRACEPOINT_MAYSLEEP)
+		return -EINVAL;
 
-	mutex_lock(&tracepoints_mutex);
-	tp_func.func = probe;
-	tp_func.data = data;
-	tp_func.prio = prio;
-	ret = tracepoint_add_func(tp, &tp_func, prio);
-	mutex_unlock(&tracepoints_mutex);
-	return ret;
+	return __tracepoint_probe_register_prio(tp, probe, data, prio);
 }
 EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
 
+/**
+ * tracepoint_probe_register_prio_maysleep - Connect a sleepable probe to a tracepoint with priority
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ * @prio: priority of this function over other registered functions
+ *
+ * When the TRACEPOINT_MAYSLEEP flag is provided on registration, the probe
+ * callback will be called with preemption enabled, and is allowed to take
+ * page faults. Sleepable probes can only be registered on sleepable
+ * tracepoints.
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_prio_maysleep(struct tracepoint *tp, void *probe,
+				   void *data, int prio)
+{
+	if (!(tp->flags & TRACEPOINT_MAYSLEEP))
+		return -EINVAL;
+
+	return __tracepoint_probe_register_prio(tp, probe, data, prio);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_maysleep);
+
 /**
  * tracepoint_probe_register -  Connect a probe to a tracepoint
  * @tp: tracepoint
  * @probe: probe handler
  * @data: tracepoint data
  *
+ * Non-sleepable probes can only be registered on non-sleepable tracepoints.
+ *
  * Returns 0 if ok, error value on error.
  * Note: if @tp is within a module, the caller is responsible for
  * unregistering the probe before the module is gone. This can be
@@ -336,6 +382,29 @@ int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
 }
 EXPORT_SYMBOL_GPL(tracepoint_probe_register);
 
+/**
+ * tracepoint_probe_register_maysleep - Connect a sleepable probe to a tracepoint
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ *
+ * When the TRACEPOINT_MAYSLEEP flag is provided on registration, the probe
+ * callback will be called with preemption enabled, and is allowed to take
+ * page faults. Sleepable probes can only be registered on sleepable
+ * tracepoints.
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_maysleep(struct tracepoint *tp, void *probe, void *data)
+{
+	return tracepoint_probe_register_prio_maysleep(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_maysleep);
+
 /**
  * tracepoint_probe_unregister -  Disconnect a probe from a tracepoint
  * @tp: tracepoint
-- 
2.25.1



* [RFC PATCH 2/6] tracing: ftrace: add support for sleepable tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 1/6] tracing: introduce sleepable tracepoints Michael Jeanson
@ 2020-10-23 19:53 ` Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 3/6] tracing: bpf-trace: " Michael Jeanson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E. McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

In preparation for converting system call enter/exit instrumentation
into sleepable tracepoints, make sure that ftrace can handle registering
probes to such tracepoints by explicitly disabling preemption within the
ftrace tracepoint probes, respecting the current expectations of the
ftrace ring buffer code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to connect to sleepable
tracepoints.
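
Concretely, for an event class declared with
DECLARE_EVENT_CLASS_MAYSLEEP, the generated ftrace probe wraps its
existing body in a preemption-disabled section and turns early returns
into jumps to a common exit. A simplified sketch of the expansion for a
hypothetical "myevent":

  static notrace void trace_event_raw_event_myevent(void *__data, ...)
  {
          preempt_disable_notrace();      /* MAYSLEEP classes only */

          if (trace_trigger_soft_disabled(trace_file))
                  goto end;               /* was 'return' */

          /* ... reserve, fill and commit the ring buffer event ... */
  end:
          preempt_enable_notrace();       /* MAYSLEEP classes only */
  }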

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org
---
 include/trace/trace_events.h | 75 +++++++++++++++++++++++++++++++++---
 kernel/trace/trace_events.c  | 15 +++++++-
 2 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 8b3f4068a702..b95a9c3d9405 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -80,6 +80,16 @@ TRACE_MAKE_SYSTEM_STR();
 			     PARAMS(print));		       \
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_MAYSLEEP
+#define TRACE_EVENT_MAYSLEEP(name, proto, args, tstruct, assign, print)	\
+	DECLARE_EVENT_CLASS_MAYSLEEP(name,		       \
+			     PARAMS(proto),		       \
+			     PARAMS(args),		       \
+			     PARAMS(tstruct),		       \
+			     PARAMS(assign),		       \
+			     PARAMS(print));		       \
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 
 #undef __field
 #define __field(type, item)		type	item;
@@ -118,6 +128,12 @@ TRACE_MAKE_SYSTEM_STR();
 									\
 	static struct trace_event_class event_class_##name;
 
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(name, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)	\
 	static struct trace_event_call	__used		\
@@ -141,7 +157,7 @@ TRACE_MAKE_SYSTEM_STR();
 #undef TRACE_EVENT_FN_MAYSLEEP
 #define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, tstruct,		\
 		assign, print, reg, unreg)				\
-	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
+	TRACE_EVENT_MAYSLEEP(name, PARAMS(proto), PARAMS(args),		\
 		PARAMS(tstruct), PARAMS(assign), PARAMS(print))		\
 
 #undef TRACE_EVENT_FN_COND
@@ -212,6 +228,12 @@ TRACE_MAKE_SYSTEM_STR();
 		tstruct;						\
 	};
 
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
@@ -378,6 +400,12 @@ static struct trace_event_functions trace_event_type_funcs_##call = {	\
 	.trace			= trace_raw_output_##call,		\
 };
 
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, call, proto, args, print)		\
 static notrace enum print_line_t					\
@@ -448,6 +476,12 @@ static struct trace_event_fields trace_event_fields_##call[] = {	\
 	tstruct								\
 	{} };
 
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args,			\
+		tstruct, func, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(func), PARAMS(print))
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
 
@@ -524,6 +558,12 @@ static inline notrace int trace_event_get_offsets_##call(		\
 	return __data_size;						\
 }
 
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 /*
@@ -673,8 +713,8 @@ static inline notrace int trace_event_get_offsets_##call(		\
 #undef __perf_task
 #define __perf_task(t)	(t)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags)	\
 									\
 static notrace void							\
 trace_event_raw_event_##call(void *__data, proto)			\
@@ -685,8 +725,11 @@ trace_event_raw_event_##call(void *__data, proto)			\
 	struct trace_event_raw_##call *entry;				\
 	int __data_size;						\
 									\
+	if ((tp_flags) & TRACEPOINT_MAYSLEEP)				\
+		preempt_disable_notrace();				\
+									\
 	if (trace_trigger_soft_disabled(trace_file))			\
-		return;							\
+		goto end;						\
 									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
@@ -694,14 +737,30 @@ trace_event_raw_event_##call(void *__data, proto)			\
 				 sizeof(*entry) + __data_size);		\
 									\
 	if (!entry)							\
-		return;							\
+		goto end;						\
 									\
 	tstruct								\
 									\
 	{ assign; }							\
 									\
 	trace_event_buffer_commit(&fbuffer);				\
+end:									\
+	if ((tp_flags) & TRACEPOINT_MAYSLEEP)				\
+		preempt_enable_notrace();				\
 }
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+			PARAMS(tstruct), PARAMS(assign),		\
+			PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),			\
+			PARAMS(tstruct), PARAMS(assign),			\
+			PARAMS(print), TRACEPOINT_MAYSLEEP)
+
 /*
  * The ftrace_test_probe is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the ftrace probe will
@@ -748,6 +807,12 @@ static struct trace_event_class __used __refdata event_class_##call = { \
 	_TRACE_PERF_INIT(call)						\
 };
 
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args,			\
+		tstruct, assign, print)					\
+	DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, call, proto, args)			\
 									\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index a85effb2373b..058fe2834f14 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -290,9 +290,15 @@ int trace_event_reg(struct trace_event_call *call,
 	WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
 	switch (type) {
 	case TRACE_REG_REGISTER:
-		return tracepoint_probe_register(call->tp,
+		if (call->tp->flags & TRACEPOINT_MAYSLEEP)
+			return tracepoint_probe_register_maysleep(call->tp,
 						 call->class->probe,
 						 file);
+		else
+			return tracepoint_probe_register(call->tp,
+						 call->class->probe,
+						 file);
+
 	case TRACE_REG_UNREGISTER:
 		tracepoint_probe_unregister(call->tp,
 					    call->class->probe,
@@ -301,7 +307,12 @@ int trace_event_reg(struct trace_event_call *call,
 
 #ifdef CONFIG_PERF_EVENTS
 	case TRACE_REG_PERF_REGISTER:
-		return tracepoint_probe_register(call->tp,
+		if (call->tp->flags & TRACEPOINT_MAYSLEEP)
+			return tracepoint_probe_register_maysleep(call->tp,
+						 call->class->perf_probe,
+						 call);
+		else
+			return tracepoint_probe_register(call->tp,
 						 call->class->perf_probe,
 						 call);
 	case TRACE_REG_PERF_UNREGISTER:
-- 
2.25.1



* [RFC PATCH 3/6] tracing: bpf-trace: add support for sleepable tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 1/6] tracing: introduce sleepable tracepoints Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 2/6] tracing: ftrace: add support for " Michael Jeanson
@ 2020-10-23 19:53 ` Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 4/6] tracing: perf: " Michael Jeanson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

In preparation for converting system call enter/exit instrumentation
into sleepable tracepoints, make sure that bpf can handle registering
probes to such tracepoints by explicitly disabling preemption within the
bpf tracepoint probes, respecting the current expectations of the bpf
tracing code.

This change does not yet allow bpf to take page faults per se within its
probe, but allows its existing probes to connect to sleepable
tracepoints.
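
For example, once sys_enter is converted to a sleepable tracepoint in
patch 5, the generated bpf stub would expand to roughly the following
(sketch; the CAST_TO_U64() argument casting is simplified):

  static notrace void
  __bpf_trace_sys_enter(void *__data, struct pt_regs *regs, long id)
  {
          struct bpf_prog *prog = __data;

          /* bpf_trace_run*() still expects preemption disabled. */
          preempt_disable_notrace();
          bpf_trace_run2(prog, (u64)(long)regs, (u64)id);
          preempt_enable_notrace();
  }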

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org
---
 include/trace/bpf_probe.h | 23 +++++++++++++++++++++--
 kernel/trace/bpf_trace.c  |  5 ++++-
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index 1ce3be63add1..d688cb9b32fe 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,15 +55,34 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags)	\
 static notrace void							\
 __bpf_trace_##call(void *__data, proto)					\
 {									\
 	struct bpf_prog *prog = __data;					\
+									\
+	if ((tp_flags) & TRACEPOINT_MAYSLEEP)				\
+		preempt_disable_notrace();				\
+									\
 	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));	\
+									\
+	if ((tp_flags) & TRACEPOINT_MAYSLEEP)				\
+		preempt_enable_notrace();				\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print),	0)
+
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, tstruct,	\
+		assign, print)						\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print),		\
+		TRACEPOINT_MAYSLEEP)
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a8d4f253ed77..54f8b320fe2f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1947,7 +1947,10 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *
 	if (prog->aux->max_tp_access > btp->writable_size)
 		return -EINVAL;
 
-	return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog);
+	if (tp->flags & TRACEPOINT_MAYSLEEP)
+		return tracepoint_probe_register_maysleep(tp, (void *)btp->bpf_func, prog);
+	else
+		return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog);
 }
 
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
-- 
2.25.1



* [RFC PATCH 4/6] tracing: perf: add support for sleepable tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
                   ` (2 preceding siblings ...)
  2020-10-23 19:53 ` [RFC PATCH 3/6] tracing: bpf-trace: " Michael Jeanson
@ 2020-10-23 19:53 ` Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 5/6] tracing: convert sys_enter/exit to " Michael Jeanson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E. McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

In preparation for converting system call enter/exit instrumentation
into sleepable tracepoints, make sure that perf can handle registering
probes to such tracepoints by explicitly disabling preemption within the
perf tracepoint probes, respecting the current expectations of the perf
ring buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to connect to sleepable
tracepoints.
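
The resulting nesting for a perf probe attached to a sleepable
tracepoint then looks roughly like this (sketch, using sys_enter from
patch 5 as the example):

  trace_sys_enter(regs, id)
      rcu_read_lock_trace()               /* __DO_TRACE, preemption on */
          perf_trace_sys_enter(...)
              preempt_disable_notrace()   /* added by this patch */
              /* perf_trace_buf_alloc() ... perf_trace_run_bpf_submit() */
              preempt_enable_notrace()
      rcu_read_unlock_trace()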

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org
---
 include/trace/perf.h | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index dbc6c74defc3..e1d866c3a076 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -27,8 +27,8 @@
 #undef __perf_task
 #define __perf_task(t)	(__task = (t))
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags)	\
 static notrace void							\
 perf_trace_##call(void *__data, proto)					\
 {									\
@@ -43,13 +43,16 @@ perf_trace_##call(void *__data, proto)					\
 	int __data_size;						\
 	int rctx;							\
 									\
+	if ((tp_flags) & TRACEPOINT_MAYSLEEP)				\
+		preempt_disable_notrace();				\
+									\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
 	if (!bpf_prog_array_valid(event_call) &&			\
 	    __builtin_constant_p(!__task) && !__task &&			\
 	    hlist_empty(head))						\
-		return;							\
+		goto end;						\
 									\
 	__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
 			     sizeof(u64));				\
@@ -57,7 +60,7 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx);	\
 	if (!entry)							\
-		return;							\
+		goto end;						\
 									\
 	perf_fetch_caller_regs(__regs);					\
 									\
@@ -68,8 +71,23 @@ perf_trace_##call(void *__data, proto)					\
 	perf_trace_run_bpf_submit(entry, __entry_size, rctx,		\
 				  event_call, __count, __regs,		\
 				  head, __task);			\
+end:									\
+	if ((tp_flags) & TRACEPOINT_MAYSLEEP)				\
+		preempt_enable_notrace();				\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, tstruct,	\
+		assign, print)						\
+	_DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args),		\
+		PARAMS(tstruct), PARAMS(assign), PARAMS(print),		\
+		TRACEPOINT_MAYSLEEP)
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
-- 
2.25.1



* [RFC PATCH 5/6] tracing: convert sys_enter/exit to sleepable tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
                   ` (3 preceding siblings ...)
  2020-10-23 19:53 ` [RFC PATCH 4/6] tracing: perf: " Michael Jeanson
@ 2020-10-23 19:53 ` Michael Jeanson
  2020-10-23 19:53 ` [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints Michael Jeanson
  2020-10-26 12:05 ` [RFC PATCH 0/6] Sleepable tracepoints peter enderborg
  6 siblings, 0 replies; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E. McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

Convert the definition of the system call enter/exit tracepoints to
sleepable tracepoints now that all upstream tracers can handle them.
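
The syscall enter/exit tracepoints are invoked from process context
with interrupts enabled, so the stricter calling context required by
sleepable tracepoints is satisfied. A sketch of what a call now does,
following the sleepable __DO_TRACE path from patch 1:

  trace_sys_enter(regs, id);
      might_sleep();              /* caller must be able to sleep */
      rcu_read_lock_trace();
      /* iterate and call probes with preemption enabled */
      rcu_read_unlock_trace();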

Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org
---
 include/trace/events/syscalls.h |  4 +-
 kernel/trace/trace_syscalls.c   | 68 ++++++++++++++++++++-------------
 2 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..fbb8d8b48f81 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
 
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAYSLEEP(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
 
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAYSLEEP(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index d85a2f0f316b..48d92d59fb92 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -304,21 +304,23 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int syscall_nr;
 	int size;
 
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
 	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
 	if (!trace_file)
-		return;
+		goto end;
 
 	if (trace_trigger_soft_disabled(trace_file))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
@@ -329,7 +331,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	event = trace_buffer_lock_reserve(buffer,
 			sys_data->enter_event->event.type, size, irq_flags, pc);
 	if (!event)
-		return;
+		goto end;
 
 	entry = ring_buffer_event_data(event);
 	entry->nr = syscall_nr;
@@ -338,6 +340,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 
 	event_trigger_unlock_commit(trace_file, buffer, event, entry,
 				    irq_flags, pc);
+end:
+	preempt_enable_notrace();
 }
 
 static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
@@ -352,21 +356,23 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	int pc;
 	int syscall_nr;
 
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
 	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
 	if (!trace_file)
-		return;
+		goto end;
 
 	if (trace_trigger_soft_disabled(trace_file))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	local_save_flags(irq_flags);
 	pc = preempt_count();
@@ -376,7 +382,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 			sys_data->exit_event->event.type, sizeof(*entry),
 			irq_flags, pc);
 	if (!event)
-		return;
+		goto end;
 
 	entry = ring_buffer_event_data(event);
 	entry->nr = syscall_nr;
@@ -384,6 +390,8 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 
 	event_trigger_unlock_commit(trace_file, buffer, event, entry,
 				    irq_flags, pc);
+end:
+	preempt_enable_notrace();
 }
 
 static int reg_event_syscall_enter(struct trace_event_file *file,
@@ -398,7 +406,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+		ret = register_trace_maysleep_sys_enter(ftrace_syscall_enter, tr);
 	if (!ret) {
 		rcu_assign_pointer(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
@@ -436,7 +444,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+		ret = register_trace_maysleep_sys_exit(ftrace_syscall_exit, tr);
 	if (!ret) {
 		rcu_assign_pointer(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
@@ -600,20 +608,22 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int rctx;
 	int size;
 
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 	if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
 	valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
 	if (!valid_prog_array && hlist_empty(head))
-		return;
+		goto end;
 
 	/* get the size after alignment with the u32 buffer size field */
 	size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
@@ -622,7 +632,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 
 	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
-		return;
+		goto end;
 
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, args);
@@ -632,12 +642,14 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	     !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
-		return;
+		goto end;
 	}
 
 	perf_trace_buf_submit(rec, size, rctx,
 			      sys_data->enter_event->event.type, 1, regs,
 			      head, NULL);
+end:
+	preempt_enable_notrace();
 }
 
 static int perf_sysenter_enable(struct trace_event_call *call)
@@ -649,7 +661,7 @@ static int perf_sysenter_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_enter)
-		ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+		ret = register_trace_maysleep_sys_enter(perf_syscall_enter, NULL);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall entry trace point");
 	} else {
@@ -699,20 +711,22 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	int rctx;
 	int size;
 
+	preempt_disable_notrace();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
-		return;
+		goto end;
 	if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
-		return;
+		goto end;
 
 	sys_data = syscall_nr_to_meta(syscall_nr);
 	if (!sys_data)
-		return;
+		goto end;
 
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
 	valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
 	if (!valid_prog_array && hlist_empty(head))
-		return;
+		goto end;
 
 	/* We can probably do that at build time */
 	size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
@@ -720,7 +734,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 
 	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
-		return;
+		goto end;
 
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
@@ -729,11 +743,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	     !perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
-		return;
+		goto end;
 	}
 
 	perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
 			      1, regs, head, NULL);
+end:
+	preempt_enable_notrace();
 }
 
 static int perf_sysexit_enable(struct trace_event_call *call)
@@ -745,7 +761,7 @@ static int perf_sysexit_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_exit)
-		ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+		ret = register_trace_maysleep_sys_exit(perf_syscall_exit, NULL);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall exit trace point");
 	} else {
-- 
2.25.1



* [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
                   ` (4 preceding siblings ...)
  2020-10-23 19:53 ` [RFC PATCH 5/6] tracing: convert sys_enter/exit to " Michael Jeanson
@ 2020-10-23 19:53 ` Michael Jeanson
  2020-10-23 21:13   ` Joel Fernandes
  2020-10-26 12:05 ` [RFC PATCH 0/6] Sleepable tracepoints peter enderborg
  6 siblings, 1 reply; 20+ messages in thread
From: Michael Jeanson @ 2020-10-23 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: mathieu.desnoyers, Michael Jeanson, Steven Rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
	Paul E. McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes, bpf

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Considering that tracer callbacks expect RCU to be watching (for
instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
no point in using SRCU anymore given that rcuidle tracepoints need to
ensure RCU is watching. Therefore, simply use sched-RCU like normal
tracepoints for rcuidle tracepoints.
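
The read-side of __DO_TRACE for an rcuidle, non-sleepable tracepoint
then reduces to roughly (sketch):

  rcu_irq_enter_irqson();         /* make RCU watch */
  preempt_disable_notrace();      /* sched-RCU read side */
  /* iterate and call probes */
  preempt_enable_notrace();
  rcu_irq_exit_irqson();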

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: bpf@vger.kernel.org
---
 include/linux/tracepoint.h | 33 +++++++--------------------------
 kernel/tracepoint.c        | 25 +++++++++----------------
 2 files changed, 16 insertions(+), 42 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 0386b54cbcbb..1414b11f864b 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -13,7 +13,6 @@
  */
 
 #include <linux/smp.h>
-#include <linux/srcu.h>
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/cpumask.h>
@@ -33,8 +32,6 @@ struct trace_eval_map {
 
 #define TRACEPOINT_DEFAULT_PRIO	10
 
-extern struct srcu_struct tracepoint_srcu;
-
 extern int
 tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
 extern int
@@ -86,7 +83,6 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
 static inline void tracepoint_synchronize_unregister(void)
 {
 	synchronize_rcu_tasks_trace();
-	synchronize_srcu(&tracepoint_srcu);
 	synchronize_rcu();
 }
 #else
@@ -175,25 +171,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		if (!(cond))						\
 			return;						\
 									\
-		/* srcu can't be used from NMI */			\
-		WARN_ON_ONCE(rcuidle && in_nmi());			\
-									\
-		if (maysleep) {						\
-			might_sleep();					\
+		might_sleep_if(maysleep);				\
+		if (rcuidle)						\
+			rcu_irq_enter_irqson();				\
+		if (maysleep)						\
 			rcu_read_lock_trace();				\
-		} else {						\
-			/* keep srcu and sched-rcu usage consistent */	\
+		else							\
 			preempt_disable_notrace();			\
-		}							\
-									\
-		/*							\
-		 * For rcuidle callers, use srcu since sched-rcu	\
-		 * doesn't work from the idle path.			\
-		 */							\
-		if (rcuidle) {						\
-			__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
-			rcu_irq_enter_irqson();				\
-		}							\
 									\
 		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
 									\
@@ -205,15 +189,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			} while ((++it_func_ptr)->func);		\
 		}							\
 									\
-		if (rcuidle) {						\
-			rcu_irq_exit_irqson();				\
-			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
-		}							\
-									\
 		if (maysleep)						\
 			rcu_read_unlock_trace();			\
 		else							\
 			preempt_enable_notrace();			\
+		if (rcuidle)						\
+			rcu_irq_exit_irqson();				\
 	} while (0)
 
 #ifndef MODULE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 8d8e41c5d8a5..68b4e50798b1 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -18,9 +18,6 @@
 extern tracepoint_ptr_t __start___tracepoints_ptrs[];
 extern tracepoint_ptr_t __stop___tracepoints_ptrs[];
 
-DEFINE_SRCU(tracepoint_srcu);
-EXPORT_SYMBOL_GPL(tracepoint_srcu);
-
 /* Set to 1 to enable tracepoint debug output */
 static const int tracepoint_debug;
 
@@ -65,14 +62,9 @@ static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
-static void srcu_free_old_probes(struct rcu_head *head)
-{
-	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
-}
-
 static void rcu_free_old_probes(struct rcu_head *head)
 {
-	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
 }
 
 static __init int release_early_probes(void)
@@ -90,7 +82,7 @@ static __init int release_early_probes(void)
 	return 0;
 }
 
-/* SRCU and Tasks Trace RCU are initialized at core_initcall */
+/* Tasks Trace RCU is initialized at core_initcall */
 postcore_initcall(release_early_probes);
 
 static inline void release_probes(struct tracepoint_func *old)
@@ -100,9 +92,8 @@ static inline void release_probes(struct tracepoint_func *old)
 			struct tp_probes, probes[0]);
 
 		/*
-		 * We can't free probes if SRCU and Tasks Trace RCU are not
-		 * initialized yet. Postpone the freeing till after both are
-		 * initialized.
+		 * We can't free probes if Tasks Trace RCU is not initialized yet.
+		 * Postpone the freeing till after Tasks Trace RCU is initialized.
 		 */
 		if (unlikely(!ok_to_free_tracepoints)) {
 			tp_probes->rcu.next = early_probes;
@@ -111,9 +102,11 @@ static inline void release_probes(struct tracepoint_func *old)
 		}
 
 		/*
-		 * Tracepoint probes are protected by sched RCU, SRCU and
-		 * Tasks Trace RCU by chaining the callbacks we cover all three
-		 * cases and wait for all three grace periods.
+		 * Tracepoint probes are protected by both sched RCU and
+		 * Tasks Trace RCU, by calling the Tasks Trace RCU callback in
+		 * the sched RCU callback we cover both cases. So let us chain
+		 * the Tasks Trace RCU and sched RCU callbacks to wait for both
+		 * grace periods.
 		 */
 		call_rcu(&tp_probes->rcu, rcu_free_old_probes);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-23 19:53 ` [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints Michael Jeanson
@ 2020-10-23 21:13   ` Joel Fernandes
  2020-10-26  8:20     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Joel Fernandes @ 2020-10-23 21:13 UTC (permalink / raw)
  To: Michael Jeanson
  Cc: linux-kernel, mathieu.desnoyers, Steven Rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf

On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
> From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
> Considering that tracer callbacks expect RCU to be watching (for
> instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
> rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
> no point in using SRCU anymore given that rcuidle tracepoints need to
> ensure RCU is watching. Therefore, simply use sched-RCU like normal
> tracepoints for rcuidle tracepoints.

High level question:

IIRC, doing this increases overhead for general tracing that does not use
perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
tracepoints. I remember adding SRCU because of this reason.

Can the 'rcuidle' information not be pushed down further, such that perf does
it because it requires RCU to be watching, so that it does not affect, say,
trace events?

thanks,

 - Joel

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: bpf@vger.kernel.org
> ---
>  include/linux/tracepoint.h | 33 +++++++--------------------------
>  kernel/tracepoint.c        | 25 +++++++++----------------
>  2 files changed, 16 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 0386b54cbcbb..1414b11f864b 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -13,7 +13,6 @@
>   */
>  
>  #include <linux/smp.h>
> -#include <linux/srcu.h>
>  #include <linux/errno.h>
>  #include <linux/types.h>
>  #include <linux/cpumask.h>
> @@ -33,8 +32,6 @@ struct trace_eval_map {
>  
>  #define TRACEPOINT_DEFAULT_PRIO	10
>  
> -extern struct srcu_struct tracepoint_srcu;
> -
>  extern int
>  tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
>  extern int
> @@ -86,7 +83,6 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
>  static inline void tracepoint_synchronize_unregister(void)
>  {
>  	synchronize_rcu_tasks_trace();
> -	synchronize_srcu(&tracepoint_srcu);
>  	synchronize_rcu();
>  }
>  #else
> @@ -175,25 +171,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		if (!(cond))						\
>  			return;						\
>  									\
> -		/* srcu can't be used from NMI */			\
> -		WARN_ON_ONCE(rcuidle && in_nmi());			\
> -									\
> -		if (maysleep) {						\
> -			might_sleep();					\
> +		might_sleep_if(maysleep);				\
> +		if (rcuidle)						\
> +			rcu_irq_enter_irqson();				\
> +		if (maysleep)						\
>  			rcu_read_lock_trace();				\
> -		} else {						\
> -			/* keep srcu and sched-rcu usage consistent */	\
> +		else							\
>  			preempt_disable_notrace();			\
> -		}							\
> -									\
> -		/*							\
> -		 * For rcuidle callers, use srcu since sched-rcu	\
> -		 * doesn't work from the idle path.			\
> -		 */							\
> -		if (rcuidle) {						\
> -			__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> -			rcu_irq_enter_irqson();				\
> -		}							\
>  									\
>  		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
>  									\
> @@ -205,15 +189,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  			} while ((++it_func_ptr)->func);		\
>  		}							\
>  									\
> -		if (rcuidle) {						\
> -			rcu_irq_exit_irqson();				\
> -			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> -		}							\
> -									\
>  		if (maysleep)						\
>  			rcu_read_unlock_trace();			\
>  		else							\
>  			preempt_enable_notrace();			\
> +		if (rcuidle)						\
> +			rcu_irq_exit_irqson();				\
>  	} while (0)
>  
>  #ifndef MODULE
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 8d8e41c5d8a5..68b4e50798b1 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -18,9 +18,6 @@
>  extern tracepoint_ptr_t __start___tracepoints_ptrs[];
>  extern tracepoint_ptr_t __stop___tracepoints_ptrs[];
>  
> -DEFINE_SRCU(tracepoint_srcu);
> -EXPORT_SYMBOL_GPL(tracepoint_srcu);
> -
>  /* Set to 1 to enable tracepoint debug output */
>  static const int tracepoint_debug;
>  
> @@ -65,14 +62,9 @@ static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
>  	kfree(container_of(head, struct tp_probes, rcu));
>  }
>  
> -static void srcu_free_old_probes(struct rcu_head *head)
> -{
> -	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
> -}
> -
>  static void rcu_free_old_probes(struct rcu_head *head)
>  {
> -	call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> +	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
>  }
>  
>  static __init int release_early_probes(void)
> @@ -90,7 +82,7 @@ static __init int release_early_probes(void)
>  	return 0;
>  }
>  
> -/* SRCU and Tasks Trace RCU are initialized at core_initcall */
> +/* Tasks Trace RCU is initialized at core_initcall */
>  postcore_initcall(release_early_probes);
>  
>  static inline void release_probes(struct tracepoint_func *old)
> @@ -100,9 +92,8 @@ static inline void release_probes(struct tracepoint_func *old)
>  			struct tp_probes, probes[0]);
>  
>  		/*
> -		 * We can't free probes if SRCU and Tasks Trace RCU are not
> -		 * initialized yet. Postpone the freeing till after both are
> -		 * initialized.
> +		 * We can't free probes if Tasks Trace RCU is not initialized yet.
> +		 * Postpone the freeing till after Tasks Trace RCU is initialized.
>  		 */
>  		if (unlikely(!ok_to_free_tracepoints)) {
>  			tp_probes->rcu.next = early_probes;
> @@ -111,9 +102,11 @@ static inline void release_probes(struct tracepoint_func *old)
>  		}
>  
>  		/*
> -		 * Tracepoint probes are protected by sched RCU, SRCU and
> -		 * Tasks Trace RCU by chaining the callbacks we cover all three
> -		 * cases and wait for all three grace periods.
> +		 * Tracepoint probes are protected by both sched RCU and
> +		 * Tasks Trace RCU, by calling the Tasks Trace RCU callback in
> +		 * the sched RCU callback we cover both cases. So let us chain
> +		 * the Tasks Trace RCU and sched RCU callbacks to wait for both
> +		 * grace periods.
>  		 */
>  		call_rcu(&tp_probes->rcu, rcu_free_old_probes);
>  	}
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-23 21:13   ` Joel Fernandes
@ 2020-10-26  8:20     ` Peter Zijlstra
  2020-10-26 14:28       ` Mathieu Desnoyers
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2020-10-26  8:20 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Michael Jeanson, linux-kernel, mathieu.desnoyers, Steven Rostedt,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf

On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
> > From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > 
> > Considering that tracer callbacks expect RCU to be watching (for
> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
> > no point in using SRCU anymore given that rcuidle tracepoints need to
> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
> > tracepoints for rcuidle tracepoints.
> 
> High level question:
> 
> IIRC, doing this increases overhead for general tracing that does not use
> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
> tracepoints. I remember adding SRCU because of this reason.
> 
> Can the 'rcuidle' information not be pushed down further, such that perf does
> it because it requires RCU to be watching, so that it does not affect, say,
> trace events?

There's very few trace_.*_rcuidle() users left. We should eradicate them
and remove the option. It's bugs to begin with.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/6] Sleepable tracepoints
  2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
                   ` (5 preceding siblings ...)
  2020-10-23 19:53 ` [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints Michael Jeanson
@ 2020-10-26 12:05 ` peter enderborg
  2020-10-26 14:59   ` Mathieu Desnoyers
  6 siblings, 1 reply; 20+ messages in thread
From: peter enderborg @ 2020-10-26 12:05 UTC (permalink / raw)
  To: Michael Jeanson, linux-kernel
  Cc: mathieu.desnoyers, Steven Rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Joel Fernandes, bpf

On 10/23/20 9:53 PM, Michael Jeanson wrote:
> When invoked from system call enter/exit instrumentation, accessing
> user-space data is a common use-case for tracers. However, tracepoints
> currently disable preemption around iteration on the registered
> tracepoint probes and invocation of the probe callbacks, which prevents
> tracers from handling page faults.
>
> Extend the tracepoint and trace event APIs to allow specific tracer
> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
> called from sleepable context, and convert the system call enter/exit
> instrumentation to sleepable tracepoints.

Will this not be a problem for analysis of the trace? It gets two
relevant times, one when it is called and one when it returns.

It makes it harder to correlate the order in which things happen.

And what about handling tracing of contexts that are already not preemptible?

E.g. the same tracepoint is used in different places and contexts.


> This series only implements the tracepoint infrastructure required to
> allow tracers to handle page faults. Modifying each tracer to handle
> those page faults would be a next step after we all agree on this piece
> of instrumentation infrastructure.
>
> This patchset is base on v5.9.1.
>
> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: bpf@vger.kernel.org
>
> Mathieu Desnoyers (1):
>   tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
>
> Michael Jeanson (5):
>   tracing: introduce sleepable tracepoints
>   tracing: ftrace: add support for sleepable tracepoints
>   tracing: bpf-trace: add support for sleepable tracepoints
>   tracing: perf: add support for sleepable tracepoints
>   tracing: convert sys_enter/exit to sleepable tracepoints
>
>  include/linux/tracepoint-defs.h |  11 ++++
>  include/linux/tracepoint.h      | 104 +++++++++++++++++++++-----------
>  include/trace/bpf_probe.h       |  23 ++++++-
>  include/trace/define_trace.h    |   7 +++
>  include/trace/events/syscalls.h |   4 +-
>  include/trace/perf.h            |  26 ++++++--
>  include/trace/trace_events.h    |  79 ++++++++++++++++++++++--
>  init/Kconfig                    |   1 +
>  kernel/trace/bpf_trace.c        |   5 +-
>  kernel/trace/trace_events.c     |  15 ++++-
>  kernel/trace/trace_syscalls.c   |  68 +++++++++++++--------
>  kernel/tracepoint.c             | 104 +++++++++++++++++++++++++-------
>  12 files changed, 351 insertions(+), 96 deletions(-)
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-26  8:20     ` Peter Zijlstra
@ 2020-10-26 14:28       ` Mathieu Desnoyers
  2020-10-26 20:44         ` Steven Rostedt
  2020-11-02 18:43         ` Joel Fernandes
  0 siblings, 2 replies; 20+ messages in thread
From: Mathieu Desnoyers @ 2020-10-26 14:28 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Joel Fernandes (Google)
  Cc: Michael Jeanson, linux-kernel, rostedt, Alexei Starovoitov,
	Yonghong Song, paulmck, Ingo Molnar, acme, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, bpf

----- On Oct 26, 2020, at 4:20 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
>> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
>> > From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> > 
>> > Considering that tracer callbacks expect RCU to be watching (for
>> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
>> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
>> > no point in using SRCU anymore given that rcuidle tracepoints need to
>> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
>> > tracepoints for rcuidle tracepoints.
>> 
>> High level question:
>> 
>> IIRC, doing this increases overhead for general tracing that does not use
>> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
>> tracepoints. I remember adding SRCU because of this reason.
>> 
>> Can the 'rcuidle' information not be pushed down further, such that perf does
>> it because it requires RCU to be watching, so that it does not affect, say,
>> trace events?
> 
> There's very few trace_.*_rcuidle() users left. We should eradicate them
> and remove the option. It's bugs to begin with.

I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
API and fixing all callers to ensure they trace from a context where RCU is
watching would simplify instrumentation of the Linux kernel, thus making it harder
for subtle bugs to hide and be unearthed only when tracing is enabled. This is
AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
think it is a good thing.

So if we consider this our target, and that the current state of things is that
we need to have RCU watching around callback invocation, then removing the
dependency on SRCU seems like an overall simplification which does not regress
feature-wise nor speed-wise compared with what we have upstream today. The next
steps would then be to audit all rcuidle tracepoints and make sure the context
where they are placed has RCU watching already, so we can remove the tracepoint
rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
from the tracepoint code.

This is however beyond the scope of the proposed patch set.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/6] Sleepable tracepoints
  2020-10-26 12:05 ` [RFC PATCH 0/6] Sleepable tracepoints peter enderborg
@ 2020-10-26 14:59   ` Mathieu Desnoyers
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Desnoyers @ 2020-10-26 14:59 UTC (permalink / raw)
  To: peter enderborg
  Cc: Michael Jeanson, linux-kernel, rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf

----- On Oct 26, 2020, at 8:05 AM, peter enderborg peter.enderborg@sony.com wrote:

> On 10/23/20 9:53 PM, Michael Jeanson wrote:
>> When invoked from system call enter/exit instrumentation, accessing
>> user-space data is a common use-case for tracers. However, tracepoints
>> currently disable preemption around iteration on the registered
>> tracepoint probes and invocation of the probe callbacks, which prevents
>> tracers from handling page faults.
>>
>> Extend the tracepoint and trace event APIs to allow specific tracer
>> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
>> called from sleepable context, and convert the system call enter/exit
>> instrumentation to sleepable tracepoints.
> 
> Will this not be a problem for analysis of the trace? It gets two
> relevant times, one when it is called and one when it returns.

It will depend on what the tracer chooses to do. If we call the side-effect
of what is being traced a "transaction" (e.g. actually opening a file
descriptor and adding it to a process' file descriptor table as the result
of an open(2) system call), we have to consider that already today the
timestamp which we get is either slightly before or after the actual
side-effect of the transaction in the kernel. That is true even without
being preemptible.

Sometimes it's not relevant to have a tracepoint before and after the
transaction, e.g. when all we care about is to know that the transaction
has successfully happened or not.

In the case of system calls, we have sys_enter and sys_exit to mark the
beginning and end of the "transaction". Whatever side-effects are done by
the system call happens in between.

I think the question here is whether it is relevant to know whether page
faults triggered by accessing system call input parameters need to
happen after we trace a "system call entry" event. If the tracers care,
then it would be up to them to first trace that "system call entry" event,
and have a separate event for the argument payload. But there are other
ways to identify whether page faults happen within the system call or
from user-space, for instance by using the instruction pointer associated
with the page fault. So when observing page faults happening before
sys_enter, but associated with a kernel instruction pointer, a trace analysis
tool could theoretically figure out who is to blame for that page fault,
*if* it cares.

> 
> It makes it harder to correlate the order in which things happen.

The alternative is to have partial payloads like LTTng does today for
system call arguments. If reading a string from userspace (e.g. open(2)
file name) requires to take a page fault, LTTng truncates the string. This
is pretty bad for automated analysis as well.

> 
> And what about handling tracing of contexts that are already not preemptible?

The sleepable tracepoints are only meant to be used in contexts which can sleep.
Tracepoints placed in non-preemptible contexts should never take a page fault
to begin with.

> 
> E.g. the same tracepoint is used in different places and contexts.

As far as considering that a given tracepoint "name" could be registered to
by both a sleepable and non-sleepable tracer probes, I would like to see an
actual use-case for this. I don't have any.

I can envision that some tracer code will want to be allowed to work in
both sleepable and non-sleepable contexts, e.g. take page faults in
sleepable context (when called from a sleepable tracepoint), but fall back
to truncation when called from a non-sleepable context. This can actually
be done by looking at the new "TRACEPOINT_MAYSLEEP" tp flag. Passing that
tp_flags to code shared between sleepable and non-sleepable probes would allow
the callee to know whether it can take a page fault or not.
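
As a purely illustrative sketch (the probe signature and the choice of
strncpy_from_user()/strncpy_from_user_nofault() are my assumptions here;
only the TRACEPOINT_MAYSLEEP flag comes from this series), such a shared
callee could look like:

	static void probe_record_user_string(void *data,
					     const char __user *ustr,
					     unsigned int tp_flags)
	{
		char buf[64];

		if (tp_flags & TRACEPOINT_MAYSLEEP) {
			/* Sleepable context: may fault in user pages. */
			if (strncpy_from_user(buf, ustr, sizeof(buf)) < 0)
				return;
		} else {
			/* Atomic context: best effort, truncate on fault. */
			if (strncpy_from_user_nofault(buf, ustr, sizeof(buf)) < 0)
				buf[0] = '\0';
		}
		/* ... record buf into the trace buffer ... */
	}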

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-26 14:28       ` Mathieu Desnoyers
@ 2020-10-26 20:44         ` Steven Rostedt
  2020-10-27 13:57           ` Mathieu Desnoyers
  2020-11-02 18:43         ` Joel Fernandes
  1 sibling, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2020-10-26 20:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Thomas Gleixner, Joel Fernandes (Google),
	Michael Jeanson, linux-kernel, Alexei Starovoitov, Yonghong Song,
	paulmck, Ingo Molnar, acme, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, bpf

On Mon, 26 Oct 2020 10:28:07 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
> API and fixing all callers to ensure they trace from a context where RCU is
> watching would simplify instrumentation of the Linux kernel, thus making it harder
> for subtle bugs to hide and be unearthed only when tracing is enabled. This is

Note, the lockdep RCU checking of a tracepoint happens regardless of whether
the tracepoint is enabled or disabled. So if a non-rcuidle() tracepoint is in
a location where RCU is not watching, it will complain loudly, even if you
don't enable that tracepoint.

> AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
> think it is a good thing.
> 
> So if we consider this our target, and that the current state of things is that
> we need to have RCU watching around callback invocation, then removing the
> dependency on SRCU seems like an overall simplification which does not regress
> feature-wise nor speed-wise compared with what we have upstream today. The next
> steps would then be to audit all rcuidle tracepoints and make sure the context
> where they are placed has RCU watching already, so we can remove the tracepoint

Just remove the _rcuidle() from them, and lockdep will complain if they are
being called without RCU watching.
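
For instance, a hypothetical conversion in the preempt/irq tracepoints
(one of the remaining _rcuidle users) would just be:

	-	trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
	+	trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1);

and lockdep would then flag any call site where RCU is not watching.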

-- Steve


> rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
> from the tracepoint code.
> 
> This is however beyond the scope of the proposed patch set.
> 
> Thanks,
> 
> Mathieu
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints
  2020-10-23 19:53 ` [RFC PATCH 1/6] tracing: introduce sleepable tracepoints Michael Jeanson
@ 2020-10-26 22:43   ` Alexei Starovoitov
  2020-10-27 13:37     ` Mathieu Desnoyers
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2020-10-26 22:43 UTC (permalink / raw)
  To: Michael Jeanson
  Cc: linux-kernel, mathieu.desnoyers, Steven Rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Joel Fernandes, bpf

On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
> -#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)		\
>  	do {								\
>  		struct tracepoint_func *it_func_ptr;			\
>  		void *it_func;						\
>  		void *__data;						\
>  		int __maybe_unused __idx = 0;				\
> +		bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;	\
>  									\
>  		if (!(cond))						\
>  			return;						\
> @@ -170,8 +178,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		/* srcu can't be used from NMI */			\
>  		WARN_ON_ONCE(rcuidle && in_nmi());			\
>  									\
> -		/* keep srcu and sched-rcu usage consistent */		\
> -		preempt_disable_notrace();				\
> +		if (maysleep) {						\
> +			might_sleep();					\

The main purpose of the patch set is to access user memory in tracepoints, right?
In such a case I suggest using the stronger might_fault() here.
We used might_sleep() in sleepable bpf and it wasn't enough to catch
a combination where a sleepable hook was invoked while mm->mmap_lock was
taken, which may cause a deadlock.

> +			rcu_read_lock_trace();				\
> +		} else {						\
> +			/* keep srcu and sched-rcu usage consistent */	\
> +			preempt_disable_notrace();			\
> +		}							\
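
To make the suggestion concrete, here is a sketch of what the quoted
branch would become (an illustration only, not an actual patch):

	if (maysleep) {
		/*
		 * might_fault() implies might_sleep() and, with lockdep
		 * enabled, also asserts that mm->mmap_lock could be
		 * taken for read, catching the deadlock described above.
		 */
		might_fault();
		rcu_read_lock_trace();
	} else {
		/* keep srcu and sched-rcu usage consistent */
		preempt_disable_notrace();
	}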

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints
  2020-10-26 22:43   ` Alexei Starovoitov
@ 2020-10-27 13:37     ` Mathieu Desnoyers
  2020-10-28 21:23       ` Alexei Starovoitov
  2020-11-02 18:51       ` Joel Fernandes
  0 siblings, 2 replies; 20+ messages in thread
From: Mathieu Desnoyers @ 2020-10-27 13:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michael Jeanson, linux-kernel, rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf


----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:

> On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
>> -#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
>> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)		\
>>  	do {								\
>>  		struct tracepoint_func *it_func_ptr;			\
>>  		void *it_func;						\
>>  		void *__data;						\
>>  		int __maybe_unused __idx = 0;				\
>> +		bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;	\
>>  									\
>>  		if (!(cond))						\
>>  			return;						\
>> @@ -170,8 +178,13 @@ static inline struct tracepoint
>> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>>  		/* srcu can't be used from NMI */			\
>>  		WARN_ON_ONCE(rcuidle && in_nmi());			\
>>  									\
>> -		/* keep srcu and sched-rcu usage consistent */		\
>> -		preempt_disable_notrace();				\
>> +		if (maysleep) {						\
>> +			might_sleep();					\
> 
> The main purpose of the patch set is to access user memory in tracepoints,
> right?

Yes, exactly.

> In such a case I suggest using the stronger might_fault() here.
> We used might_sleep() in sleepable bpf and it wasn't enough to catch
> a combination where a sleepable hook was invoked while mm->mmap_lock was
> taken, which may cause a deadlock.

Good point! We will do that for the next round.

By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive ?
(a "faultable" tracepoint sounds weird though)

Thanks,

Mathieu

> 
>> +			rcu_read_lock_trace();				\
>> +		} else {						\
>> +			/* keep srcu and sched-rcu usage consistent */	\
>> +			preempt_disable_notrace();			\
> > +		}							\

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-26 20:44         ` Steven Rostedt
@ 2020-10-27 13:57           ` Mathieu Desnoyers
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Desnoyers @ 2020-10-27 13:57 UTC (permalink / raw)
  To: rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, Joel Fernandes (Google),
	Michael Jeanson, linux-kernel, Alexei Starovoitov, Yonghong Song,
	paulmck, Ingo Molnar, acme, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, bpf

[-- Attachment #1: Type: text/plain, Size: 9630 bytes --]

----- On Oct 26, 2020, at 4:44 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 26 Oct 2020 10:28:07 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
>> API and fixing all callers to ensure they trace from a context where RCU is
>> watching would simplify instrumentation of the Linux kernel, thus making it
>> harder
>> for subtle bugs to hide and be unearthed only when tracing is enabled. This is
> 
> Note, the lockdep RCU checking of a tracepoint happens regardless of whether
> the tracepoint is enabled or disabled. So if a non-rcuidle() tracepoint is in
> a location where RCU is not watching, it will complain loudly, even if you
> don't enable that tracepoint.
> 
>> AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
>> think it is a good thing.
>> 
>> So if we consider this our target, and that the current state of things is that
>> we need to have RCU watching around callback invocation, then removing the
>> dependency on SRCU seems like an overall simplification which does not regress
>> feature-wise nor speed-wise compared with what we have upstream today. The next
>> steps would then be to audit all rcuidle tracepoints and make sure the context
>> where they are placed has RCU watching already, so we can remove the tracepoint
> 
> Just remove the _rcuidle() from them, and lockdep will complain if they are
> being called without RCU watching.

That's the easy part. The patch removing rcuidle is attached to this email,
and here are the splats I get just when booting the machine (see below). The
part which takes more time is fixing those splats and figuring out how to
restructure the code. For instance, what should we do about the rcuidle
tracepoint within printk() ?

Also, how do we test rcuidle tracepoints that are only triggered when printk is
called from printk warnings ? We'd also need to test on arm32 boards, because some arm
architecture code uses rcuidle tracepoints as well.

As this is beyond the scope of this patch set, I can either drop this patch
entirely (it's not required for sleepable tracepoints), or keep it as an
intermediate cleanup step. Removing rcuidle tracepoints entirely is beyond
the effort Michael and I can undertake now.

=============================
WARNING: suspicious RCU usage

5.9.1+ #72 Not tainted
-----------------------------
=============================
./include/trace/events/preemptirq.h:42 suspicious rcu_dereference_check() usage!
WARNING: suspicious RCU usage

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
5.9.1+ #72 Not tainted
RCU used illegally from extended quiescent state!
no locks held by swapper/0/0.
-----------------------------
./include/trace/events/preemptirq.h:38 suspicious rcu_dereference_check() usage!

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
RCU used illegally from extended quiescent state!
no locks held by swapper/1/0.
 dump_stack+0x8d/0xbb

stack backtrace:
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.1+ #72

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/trace/events/lock.h:37 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/0/0.

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 lock_acquire+0x346/0x3b0
 ? __lock_acquire+0x2e7/0x1da0
 _raw_spin_lock+0x2c/0x40
 ? vprintk_emit+0x9f/0x410
 vprintk_emit+0x9f/0x410
 printk+0x52/0x6e
 lockdep_rcu_suspicious+0x1b/0xe0
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/trace/events/lock.h:63 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
1 lock held by swapper/0/0:
 #0: ffffffff97a80158 (logbuf_lock){-...}-{2:2}, at: vprintk_emit+0x9f/0x410

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 lock_release+0x25a/0x280
 _raw_spin_unlock+0x17/0x30
 vprintk_emit+0xdf/0x410
 printk+0x52/0x6e
 lockdep_rcu_suspicious+0x1b/0xe0
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/trace/events/printk.h:33 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
2 locks held by swapper/0/0:
 #0: ffffffff97a801a0 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x126/0x410
 #1: ffffffff97a7fec0 (console_owner){....}-{0:0}, at: console_unlock+0x190/0x5d0

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 console_unlock+0x5ad/0x5d0
 vprintk_emit+0x133/0x410
 printk+0x52/0x6e
 lockdep_rcu_suspicious+0x1b/0xe0
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/linux/rcupdate.h:636 rcu_read_lock() used illegally while idle!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
4 locks held by swapper/0/0:
 #0: ffffffff97a801a0 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x126/0x410
 #1: ffffffff97a7fec0 (console_owner){....}-{0:0}, at: console_unlock+0x190/0x5d0
 #2: ffffffff97b7d598 (printing_lock){....}-{2:2}, at: vt_console_print+0x7d/0x3e0
 #3: ffffffff97a82d80 (rcu_read_lock){....}-{1:2}, at: __atomic_notifier_call_chain+0x5/0x110

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 __atomic_notifier_call_chain+0xd7/0x110
 vt_console_print+0x19e/0x3e0
 console_unlock+0x417/0x5d0
 vprintk_emit+0x133/0x410
 printk+0x52/0x6e
 lockdep_rcu_suspicious+0x1b/0xe0
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/linux/rcupdate.h:685 rcu_read_unlock() used illegally while idle!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
4 locks held by swapper/0/0:
 #0: ffffffff97a801a0 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x126/0x410
 #1: ffffffff97a7fec0 (console_owner){....}-{0:0}, at: console_unlock+0x190/0x5d0
 #2: ffffffff97b7d598 (printing_lock){....}-{2:2}, at: vt_console_print+0x7d/0x3e0
 #3: ffffffff97a82d80 (rcu_read_lock){....}-{1:2}, at: __atomic_notifier_call_chain+0x5/0x110

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 __atomic_notifier_call_chain+0x101/0x110
 vt_console_print+0x19e/0x3e0
 console_unlock+0x417/0x5d0
 vprintk_emit+0x133/0x410
 printk+0x52/0x6e
 lockdep_rcu_suspicious+0x1b/0xe0
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 ? rcu_idle_exit+0x32/0x40
 trace_hardirqs_off+0xe3/0xf0
 rcu_idle_exit+0x32/0x40
 default_idle_call+0x54/0x1e0
 do_idle+0x1ef/0x2c0
 ? lockdep_hardirqs_on_prepare+0xec/0x180
 cpu_startup_entry+0x19/0x20
 start_secondary+0x11c/0x160
 secondary_startup_64+0xa4/0xb0

Thanks,

Mathieu


> 
> -- Steve
> 
> 
>> rcuidle API. That would effectively remove the calls to
>> rcu_irq_{enter,exit}_irqson
>> from the tracepoint code.
>> 
>> This is however beyond the scope of the proposed patch set.
>> 
>> Thanks,
>> 
>> Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[-- Attachment #2: 0001-wip-remove-rcuidle-tracepoint-API.patch --]
[-- Type: application/mbox, Size: 15836 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints
  2020-10-27 13:37     ` Mathieu Desnoyers
@ 2020-10-28 21:23       ` Alexei Starovoitov
  2021-02-11 19:36         ` Mathieu Desnoyers
  2020-11-02 18:51       ` Joel Fernandes
  1 sibling, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2020-10-28 21:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Michael Jeanson, linux-kernel, rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf

On Tue, Oct 27, 2020 at 09:37:08AM -0400, Mathieu Desnoyers wrote:
> 
> ----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> 
> > On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
> >> -#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
> >> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)		\
> >>  	do {								\
> >>  		struct tracepoint_func *it_func_ptr;			\
> >>  		void *it_func;						\
> >>  		void *__data;						\
> >>  		int __maybe_unused __idx = 0;				\
> >> +		bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;	\
> >>  									\
> >>  		if (!(cond))						\
> >>  			return;						\
> >> @@ -170,8 +178,13 @@ static inline struct tracepoint
> >> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> >>  		/* srcu can't be used from NMI */			\
> >>  		WARN_ON_ONCE(rcuidle && in_nmi());			\
> >>  									\
> >> -		/* keep srcu and sched-rcu usage consistent */		\
> >> -		preempt_disable_notrace();				\
> >> +		if (maysleep) {						\
> >> +			might_sleep();					\
> > 
> > The main purpose of the patch set is to access user memory in tracepoints,
> > right?
> 
> Yes, exactly.
> 
> > In such a case I suggest using the stronger might_fault() here.
> > We used might_sleep() in sleepable bpf and it wasn't enough to catch
> > a combination where a sleepable hook was invoked while mm->mmap_lock was
> > taken, which may cause a deadlock.
> 
> Good point! We will do that for the next round.
> 
> By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
> but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive ?
> (a "faultable" tracepoint sounds weird though)

bpf kept 'sleepable' as a name. 'faultable' is too misleading.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
  2020-10-26 14:28       ` Mathieu Desnoyers
  2020-10-26 20:44         ` Steven Rostedt
@ 2020-11-02 18:43         ` Joel Fernandes
  1 sibling, 0 replies; 20+ messages in thread
From: Joel Fernandes @ 2020-11-02 18:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Thomas Gleixner, Michael Jeanson, linux-kernel,
	rostedt, Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar,
	acme, Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	bpf

On Mon, Oct 26, 2020 at 10:28:07AM -0400, Mathieu Desnoyers wrote:
> ----- On Oct 26, 2020, at 4:20 AM, Peter Zijlstra peterz@infradead.org wrote:
> 
> > On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
> >> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
> >> > From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> > 
> >> > Considering that tracer callbacks expect RCU to be watching (for
> >> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
> >> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
> >> > no point in using SRCU anymore given that rcuidle tracepoints need to
> >> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
> >> > tracepoints for rcuidle tracepoints.
> >> 
> >> High level question:
> >> 
> >> IIRC, doing this increases overhead for general tracing that does not use
> >> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
> >> tracepoints. I remember adding SRCU because of this reason.
> >> 
> >> Can the 'rcuidle' information not be pushed down further, such that perf does
> >> it because it requires RCU to be watching, so that it does not affect, say,
> >> trace events?
> > 
> > There's very few trace_.*_rcuidle() users left. We should eradicate them
> > and remove the option. It's bugs to begin with.
> 
> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
> API and fixing all callers to ensure they trace from a context where RCU is
> watching would simplify instrumentation of the Linux kernel, thus making it harder
> for subtle bugs to hide and be unearthed only when tracing is enabled. This is
> AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
> think it is a good thing.
> 
> So if we consider this our target, and that the current state of things is that
> we need to have RCU watching around callback invocation, then removing the
> dependency on SRCU seems like an overall simplification which does not regress
> feature-wise nor speed-wise compared with what we have upstream today. The next
> steps would then be to audit all rcuidle tracepoints and make sure the context
> where they are placed has RCU watching already, so we can remove the tracepoint
> rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
> from the tracepoint code.
> 
> This is however beyond the scope of the proposed patch set.

You are right, it doesn't regress speed-wise - I got confused since the code
was modified to call rcu_irq_enter_irqson() even for the rcuidle case (which I
had avoided when I added SRCU). So in the current code, SRCU is kind of
pointless. I think we should keep the patch in the series.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints
  2020-10-27 13:37     ` Mathieu Desnoyers
  2020-10-28 21:23       ` Alexei Starovoitov
@ 2020-11-02 18:51       ` Joel Fernandes
  1 sibling, 0 replies; 20+ messages in thread
From: Joel Fernandes @ 2020-11-02 18:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Alexei Starovoitov, Michael Jeanson, linux-kernel, rostedt,
	Peter Zijlstra, Alexei Starovoitov, Yonghong Song, paulmck,
	Ingo Molnar, acme, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, bpf

On Tue, Oct 27, 2020 at 09:37:08AM -0400, Mathieu Desnoyers wrote:
> 
> ----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> 
> > On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
> >> -#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
> >> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)		\
> >>  	do {								\
> >>  		struct tracepoint_func *it_func_ptr;			\
> >>  		void *it_func;						\
> >>  		void *__data;						\
> >>  		int __maybe_unused __idx = 0;				\
> >> +		bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;	\
> >>  									\
> >>  		if (!(cond))						\
> >>  			return;						\
> >> @@ -170,8 +178,13 @@ static inline struct tracepoint
> >> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> >>  		/* srcu can't be used from NMI */			\
> >>  		WARN_ON_ONCE(rcuidle && in_nmi());			\
> >>  									\
> >> -		/* keep srcu and sched-rcu usage consistent */		\
> >> -		preempt_disable_notrace();				\
> >> +		if (maysleep) {						\
> >> +			might_sleep();					\
> > 
> > The main purpose of the patch set is to access user memory in tracepoints,
> > right?
> 
> Yes, exactly.
> 
> > In such a case I suggest using the stronger might_fault() here.
> > We used might_sleep() in sleepable bpf and it wasn't enough to catch
> > a combination where a sleepable hook was invoked while mm->mmap_lock was
> > taken, which may cause a deadlock.
> 
> Good point! We will do that for the next round.
> 
> By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
> but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive ?
> (a "faultable" tracepoint sounds weird though)

What about keeping might_sleep() here and then adding might_fault() in the
probe handler? Since the probe handler knows that it may cause a page fault,
it could assert that itself.

One more thought: Should we make _all_ tracepoints sleepable, and then move
the preempt_disable() bit to the probe handler as needed? That could simplify
the tracepoint API as well. Steven said before that whoever registers probes
knows what they are doing so I am ok with that.
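
A rough sketch of the first idea (the probe signature is hypothetical):

	static void my_sleepable_probe(void *data, const char __user *uptr)
	{
		/* The probe, not __DO_TRACE(), asserts faultability. */
		might_fault();
		/* ... copy_from_user(), record the payload, etc. ... */
	}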

No strong feelings one way or the other, for either of these though.

thanks,

 - Joel

> 
> Thanks,
> 
> Mathieu
> 
> > 
> >> +			rcu_read_lock_trace();				\
> >> +		} else {						\
> >> +			/* keep srcu and sched-rcu usage consistent */	\
> >> +			preempt_disable_notrace();			\
> > > +		}							\
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints
  2020-10-28 21:23       ` Alexei Starovoitov
@ 2021-02-11 19:36         ` Mathieu Desnoyers
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Desnoyers @ 2021-02-11 19:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michael Jeanson, linux-kernel, rostedt, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, paulmck, Ingo Molnar, acme,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joel Fernandes (Google), bpf

----- On Oct 28, 2020, at 5:23 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:

> On Tue, Oct 27, 2020 at 09:37:08AM -0400, Mathieu Desnoyers wrote:
>> 
>> ----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov
>> alexei.starovoitov@gmail.com wrote:
>> 
>> > On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
>> >> -#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
>> >> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)		\
>> >>  	do {								\
>> >>  		struct tracepoint_func *it_func_ptr;			\
>> >>  		void *it_func;						\
>> >>  		void *__data;						\
>> >>  		int __maybe_unused __idx = 0;				\
>> >> +		bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;	\
>> >>  									\
>> >>  		if (!(cond))						\
>> >>  			return;						\
>> >> @@ -170,8 +178,13 @@ static inline struct tracepoint
>> >> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>> >>  		/* srcu can't be used from NMI */			\
>> >>  		WARN_ON_ONCE(rcuidle && in_nmi());			\
>> >>  									\
>> >> -		/* keep srcu and sched-rcu usage consistent */		\
>> >> -		preempt_disable_notrace();				\
>> >> +		if (maysleep) {						\
>> >> +			might_sleep();					\
>> > 
>> > The main purpose of the patch set is to access user memory in tracepoints,
>> > right?
>> 
>> Yes, exactly.
>> 
>> > In such a case I suggest using the stronger might_fault() here.
>> > We used might_sleep() in sleepable bpf and it wasn't enough to catch
>> > a combination where a sleepable hook was invoked while mm->mmap_lock was
>> > taken, which may cause a deadlock.
>> 
>> Good point! We will do that for the next round.
>> 
>> By the way, we named this "sleepable" tracepoint (with flag
>> TRACEPOINT_MAYSLEEP),
>> but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive
>> ?
>> (a "faultable" tracepoint sounds weird though)
> 
> bpf kept 'sleepable' as a name. 'faultable' is too misleading.

We're working on an updated patchset for those "sleepable tracepoints", and considering
that those are really "tracepoints allowing page faults", I must admit that I am
uncomfortable with the confusion between "sleep" and "fault" in the naming here.

I am tempted to do the following changes:

- Change the name from "sleepable tracepoints" to the better-suited "tracepoints allowing page faults",
- Use might_fault() rather than might_sleep() in __DO_TRACE(), effectively guaranteeing that all
  probes connecting to a tracepoint which allows page faults can indeed take page faults,
- Change TRACEPOINT_MAYSLEEP into TRACEPOINT_MAYFAULT (sketched below).
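
Concretely, the third point would be something like the following (the
flag's exact bit value is an assumption on my part, not taken from the
series):

	-#define TRACEPOINT_MAYSLEEP	(1 << 0)
	+#define TRACEPOINT_MAYFAULT	(1 << 0)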

Any objections ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2020-10-23 19:53 [RFC PATCH 0/6] Sleepable tracepoints Michael Jeanson
2020-10-23 19:53 ` [RFC PATCH 1/6] tracing: introduce sleepable tracepoints Michael Jeanson
2020-10-26 22:43   ` Alexei Starovoitov
2020-10-27 13:37     ` Mathieu Desnoyers
2020-10-28 21:23       ` Alexei Starovoitov
2021-02-11 19:36         ` Mathieu Desnoyers
2020-11-02 18:51       ` Joel Fernandes
2020-10-23 19:53 ` [RFC PATCH 2/6] tracing: ftrace: add support for " Michael Jeanson
2020-10-23 19:53 ` [RFC PATCH 3/6] tracing: bpf-trace: " Michael Jeanson
2020-10-23 19:53 ` [RFC PATCH 4/6] tracing: perf: " Michael Jeanson
2020-10-23 19:53 ` [RFC PATCH 5/6] tracing: convert sys_enter/exit to " Michael Jeanson
2020-10-23 19:53 ` [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints Michael Jeanson
2020-10-23 21:13   ` Joel Fernandes
2020-10-26  8:20     ` Peter Zijlstra
2020-10-26 14:28       ` Mathieu Desnoyers
2020-10-26 20:44         ` Steven Rostedt
2020-10-27 13:57           ` Mathieu Desnoyers
2020-11-02 18:43         ` Joel Fernandes
2020-10-26 12:05 ` [RFC PATCH 0/6] Sleepable tracepoints peter enderborg
2020-10-26 14:59   ` Mathieu Desnoyers
