linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint
@ 2019-09-03 15:43 Radim Krčmář
  2019-09-03 15:43 ` [PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h Radim Krčmář
  2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
  0 siblings, 2 replies; 31+ messages in thread
From: Radim Krčmář @ 2019-09-03 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

Add a tracepoint for monitoring nr_running values, which is helpful in
discovering scheduling imbalances.

More information and most of the code is in [2/2], while [1/2] fixes a
build issue that popped up because CREATE_TRACE_POINTS is now defined
for several includes.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jirka Hladký <jhladky@redhat.com>
Cc: Jiří Vozár <jvozar@redhat.com>
Cc: x86@kernel.org


Radim Krčmář (2):
  x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h
  sched/debug: add sched_update_nr_running tracepoint

 arch/x86/include/asm/mmu_context.h |  2 --
 arch/x86/mm/tlb.c                  |  2 ++
 include/trace/events/sched.h       | 22 ++++++++++++++++++++++
 kernel/sched/core.c                |  7 ++-----
 kernel/sched/fair.c                |  2 --
 kernel/sched/sched.h               |  7 +++++++
 6 files changed, 33 insertions(+), 9 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h
  2019-09-03 15:43 [PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
@ 2019-09-03 15:43 ` Radim Krčmář
  2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
  1 sibling, 0 replies; 31+ messages in thread
From: Radim Krčmář @ 2019-09-03 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

asm/mmu_context.h is a tree-wide include that will unnecessarily break
the build if CREATE_TRACE_POINTS is defined when including it.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/include/asm/mmu_context.h | 2 --
 arch/x86/mm/tlb.c                  | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 16ae821483c8..e59f230ff981 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -7,8 +7,6 @@
 #include <linux/mm_types.h>
 #include <linux/pkeys.h>
 
-#include <trace/events/tlb.h>
-
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e6a9edc5baaf..83cd66a0db99 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -18,6 +18,8 @@
 
 #include "mm_internal.h"
 
+#include <trace/events/tlb.h>
+
 /*
  *	TLB flushing, formerly SMP-only
  *		c/o Linus Torvalds.
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-03 15:43 [PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
  2019-09-03 15:43 ` [PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h Radim Krčmář
@ 2019-09-03 15:43 ` Radim Krčmář
  2019-09-03 16:05   ` Valentin Schneider
                     ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Radim Krčmář @ 2019-09-03 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
custom data gathering points to better understand what was going on in
the scheduler.
Red Hat adapted one of them for the tracepoint framework and created a
tool to plot a heatmap of nr_running, where the sched_update_nr_running
tracepoint is being used for fine grained monitoring of scheduling
imbalance.
The tool is available from https://github.com/jirvoz/plot-nr-running.

The best place for the tracepoints is inside the add/sub_nr_running,
which requires some shenanigans to make it work as they are defined
inside sched.h.
The tracepoints have to be included from sched.h, which means that
CREATE_TRACE_POINTS has to be defined for the whole header and this
might cause problems if tree-wide headers expose tracepoints in sched.h
dependencies, but I'd argue it's the other side's misuse of tracepoints.

Moving the import sched.h line lower would require fixes in s390 and ppc
headers, because they don't include dependecies properly and expect
sched.h to do it, so it is simpler to keep sched.h there and
preventively undefine CREATE_TRACE_POINTS right after.

Exports of the pelt tracepoints remain because they don't need to be
protected by CREATE_TRACE_POINTS and moving them closer would be
unsightly.

Tested-by: Jirka Hladký <jhladky@redhat.com>
Tested-by: Jiří Vozár <jvozar@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 include/trace/events/sched.h | 22 ++++++++++++++++++++++
 kernel/sched/core.c          |  7 ++-----
 kernel/sched/fair.c          |  2 --
 kernel/sched/sched.h         |  7 +++++++
 4 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..1527fc695609 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
 
+TRACE_EVENT(sched_update_nr_running,
+
+	TP_PROTO(int cpu, int change, unsigned int nr_running),
+
+	TP_ARGS(cpu, change, nr_running),
+
+	TP_STRUCT__entry(
+		__field(int,          cpu)
+		__field(int,          change)
+		__field(unsigned int, nr_running)
+	),
+
+	TP_fast_assign(
+		__entry->cpu        = cpu;
+		__entry->change     = change;
+		__entry->nr_running = nr_running;
+	),
+
+	TP_printk("cpu=%u nr_running=%u (%d)",
+			__entry->cpu, __entry->nr_running, __entry->change)
+);
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71981ce84231..31ac37b9f6f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,7 +6,9 @@
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+#define CREATE_TRACE_POINTS
 #include "sched.h"
+#undef CREATE_TRACE_POINTS
 
 #include <linux/nospec.h>
 
@@ -20,9 +22,6 @@
 
 #include "pelt.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
  /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
-
-#undef CREATE_TRACE_POINTS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84959d3285d1..421236d39902 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
  */
 #include "sched.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c4915f46035a..b89d7786109a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -75,6 +75,8 @@
 #include "cpupri.h"
 #include "cpudeadline.h"
 
+#include <trace/events/sched.h>
+
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
 #else
@@ -1887,6 +1889,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 	rq->nr_running = prev_nr + count;
 
+	trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
+
 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 		if (!READ_ONCE(rq->rd->overload))
@@ -1900,6 +1904,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
 	rq->nr_running -= count;
+
+	trace_sched_update_nr_running(cpu_of(rq), -count, rq->nr_running);
+
 	/* Check if we still need preemption */
 	sched_update_tick_dependency(rq);
 }
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
@ 2019-09-03 16:05   ` Valentin Schneider
  2019-09-04  4:23     ` Joel Fernandes
  2019-09-04  6:55     ` chengjian (D)
  2019-09-04 13:13   ` Peter Zijlstra
  2019-09-04 14:37   ` Qais Yousef
  2 siblings, 2 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-09-03 16:05 UTC (permalink / raw)
  To: Radim Krčmář, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86, Qais Yousef

On 03/09/2019 16:43, Radim Krčmář wrote:
> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> custom data gathering points to better understand what was going on in
> the scheduler.
> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.
> The tool is available from https://github.com/jirvoz/plot-nr-running.
> 
> The best place for the tracepoints is inside the add/sub_nr_running,
> which requires some shenanigans to make it work as they are defined
> inside sched.h.
> The tracepoints have to be included from sched.h, which means that
> CREATE_TRACE_POINTS has to be defined for the whole header and this
> might cause problems if tree-wide headers expose tracepoints in sched.h
> dependencies, but I'd argue it's the other side's misuse of tracepoints.
> 
> Moving the import sched.h line lower would require fixes in s390 and ppc
> headers, because they don't include dependecies properly and expect
> sched.h to do it, so it is simpler to keep sched.h there and
> preventively undefine CREATE_TRACE_POINTS right after.
> 
> Exports of the pelt tracepoints remain because they don't need to be
> protected by CREATE_TRACE_POINTS and moving them closer would be
> unsightly.
> 

Pure trace events are frowned upon in scheduler world, try going with
trace points. Qais did something very similar recently:

https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/

You'll have to implement the associated trace events in a module, which
lets you define your own event format and doesn't form an ABI :).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-03 16:05   ` Valentin Schneider
@ 2019-09-04  4:23     ` Joel Fernandes
  2019-09-04  8:14       ` Peter Zijlstra
  2019-09-04 10:43       ` Qais Yousef
  2019-09-04  6:55     ` chengjian (D)
  1 sibling, 2 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04  4:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86, Qais Yousef

On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> On 03/09/2019 16:43, Radim Krčmář wrote:
> > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > custom data gathering points to better understand what was going on in
> > the scheduler.
> > Red Hat adapted one of them for the tracepoint framework and created a
> > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > tracepoint is being used for fine grained monitoring of scheduling
> > imbalance.
> > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > 
> > The best place for the tracepoints is inside the add/sub_nr_running,
> > which requires some shenanigans to make it work as they are defined
> > inside sched.h.
> > The tracepoints have to be included from sched.h, which means that
> > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > might cause problems if tree-wide headers expose tracepoints in sched.h
> > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > 
> > Moving the import sched.h line lower would require fixes in s390 and ppc
> > headers, because they don't include dependecies properly and expect
> > sched.h to do it, so it is simpler to keep sched.h there and
> > preventively undefine CREATE_TRACE_POINTS right after.
> > 
> > Exports of the pelt tracepoints remain because they don't need to be
> > protected by CREATE_TRACE_POINTS and moving them closer would be
> > unsightly.
> > 
> 
> Pure trace events are frowned upon in scheduler world, try going with
> trace points. Qais did something very similar recently:
> 
> https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
> 
> You'll have to implement the associated trace events in a module, which
> lets you define your own event format and doesn't form an ABI :).

Is that really true? eBPF programs loaded from userspace can access
tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103

I don't have a strong opinion about considering tracepoints as ABI / API or
not, but just want to get the facts straight :)

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-03 16:05   ` Valentin Schneider
  2019-09-04  4:23     ` Joel Fernandes
@ 2019-09-04  6:55     ` chengjian (D)
  1 sibling, 0 replies; 31+ messages in thread
From: chengjian (D) @ 2019-09-04  6:55 UTC (permalink / raw)
  To: Valentin Schneider, Radim Krčmář, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86, Qais Yousef, chengjian (D),
	Xiexiuqi (Xie XiuQi),
	Li Bin, bobo.shaobowang


On 2019/9/4 0:05, Valentin Schneider wrote:
> On 03/09/2019 16:43, Radim Krčmář wrote:
>> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
>> custom data gathering points to better understand what was going on in
>> the scheduler.
>> Red Hat adapted one of them for the tracepoint framework and created a
>> tool to plot a heatmap of nr_running, where the sched_update_nr_running
>> tracepoint is being used for fine grained monitoring of scheduling
>> imbalance.
>> The tool is available from https://github.com/jirvoz/plot-nr-running.
>>
>> The best place for the tracepoints is inside the add/sub_nr_running,
>> which requires some shenanigans to make it work as they are defined
>> inside sched.h.
>> The tracepoints have to be included from sched.h, which means that
>> CREATE_TRACE_POINTS has to be defined for the whole header and this
>> might cause problems if tree-wide headers expose tracepoints in sched.h
>> dependencies, but I'd argue it's the other side's misuse of tracepoints.
>>
>> Moving the import sched.h line lower would require fixes in s390 and ppc
>> headers, because they don't include dependecies properly and expect
>> sched.h to do it, so it is simpler to keep sched.h there and
>> preventively undefine CREATE_TRACE_POINTS right after.
>>
>> Exports of the pelt tracepoints remain because they don't need to be
>> protected by CREATE_TRACE_POINTS and moving them closer would be
>> unsightly.
>>
> Pure trace events are frowned upon in scheduler world, try going with
> trace points. Qais did something very similar recently:
>
> https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
>
> You'll have to implement the associated trace events in a module, which
> lets you define your own event format and doesn't form an ABI :).
>
> .


Hi Radim,


Why not try Chrome Tracing

1--Record trace data, scheduling, irq, etc.
     You can Produce a JSON file with the format expected by Chrome
     OR
     just save the captured data directly as "xxx.html", it can still work.
     cat /sys/kernel/debug/tracing/trace > trace.html
2--Go to chrome://tracing in Chrome,
3--Click "Load" and open your file, or alternatively drag the file into 
Chrome,
4--Profit!


Systrace (Android System Trace) captures and displays execution times of 
your app's
processes and other Android system processes, fortunately, there are 
some ported
versions for Linux Desktop/Server.

https://github.com/gatieme/systrace



references--
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/trace-event-reading


Thanks,
     - Cheng Jian.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04  4:23     ` Joel Fernandes
@ 2019-09-04  8:14       ` Peter Zijlstra
  2019-09-04 10:52         ` Qais Yousef
  2019-09-04 13:11         ` Joel Fernandes
  2019-09-04 10:43       ` Qais Yousef
  1 sibling, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2019-09-04  8:14 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86, Qais Yousef

On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote:
> On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > custom data gathering points to better understand what was going on in
> > > the scheduler.
> > > Red Hat adapted one of them for the tracepoint framework and created a
> > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > tracepoint is being used for fine grained monitoring of scheduling
> > > imbalance.
> > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > 
> > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > which requires some shenanigans to make it work as they are defined
> > > inside sched.h.
> > > The tracepoints have to be included from sched.h, which means that
> > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > 
> > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > headers, because they don't include dependecies properly and expect
> > > sched.h to do it, so it is simpler to keep sched.h there and
> > > preventively undefine CREATE_TRACE_POINTS right after.
> > > 
> > > Exports of the pelt tracepoints remain because they don't need to be
> > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > unsightly.
> > > 
> > 
> > Pure trace events are frowned upon in scheduler world, try going with
> > trace points. 

Quite; I hate tracepoints for the API constraints they impose. Been
bitten by that, not want to ever have to deal with that again.

> >  Qais did something very similar recently:
> > 
> > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
> > 
> > You'll have to implement the associated trace events in a module, which
> > lets you define your own event format and doesn't form an ABI :).
> 
> Is that really true? eBPF programs loaded from userspace can access
> tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> 
> I don't have a strong opinion about considering tracepoints as ABI / API or
> not, but just want to get the facts straight :)

eBPF can access all sorts of kernel internals; if we were to deem eBPF
and API we'd be fscked.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04  4:23     ` Joel Fernandes
  2019-09-04  8:14       ` Peter Zijlstra
@ 2019-09-04 10:43       ` Qais Yousef
  2019-09-04 13:06         ` Joel Fernandes
  1 sibling, 1 reply; 31+ messages in thread
From: Qais Yousef @ 2019-09-04 10:43 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On 09/04/19 00:23, Joel Fernandes wrote:
> On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > custom data gathering points to better understand what was going on in
> > > the scheduler.
> > > Red Hat adapted one of them for the tracepoint framework and created a
> > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > tracepoint is being used for fine grained monitoring of scheduling
> > > imbalance.
> > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > 
> > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > which requires some shenanigans to make it work as they are defined
> > > inside sched.h.
> > > The tracepoints have to be included from sched.h, which means that
> > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > 
> > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > headers, because they don't include dependecies properly and expect
> > > sched.h to do it, so it is simpler to keep sched.h there and
> > > preventively undefine CREATE_TRACE_POINTS right after.
> > > 
> > > Exports of the pelt tracepoints remain because they don't need to be
> > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > unsightly.
> > > 
> > 
> > Pure trace events are frowned upon in scheduler world, try going with
> > trace points. Qais did something very similar recently:
> > 
> > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
> > 
> > You'll have to implement the associated trace events in a module, which
> > lets you define your own event format and doesn't form an ABI :).
> 
> Is that really true? eBPF programs loaded from userspace can access
> tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> 
> I don't have a strong opinion about considering tracepoints as ABI / API or
> not, but just want to get the facts straight :)

It is actually true. But you need to make the distinction between a tracepoint
and a trace event first. What Valentin is talking about here is the *bare*
tracepoint without any event associated with them like the one I added to the
scheduler recently. These ones are not accessible via eBPF, unless something
has changed since I last tried.

The current infrastructure needs to be expanded to allow eBPF to attach these
bare tracepoints. Something similar to what I have in [1] is needed - but
instead of creating a new macro it needs to expand the current macro. [2] might
give full context of when I was trying to come up with alternatives to using
trace events.

[1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
[2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45v6k@e107158-lin.cambridge.arm.com/

HTH

--
Qais Yousef

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04  8:14       ` Peter Zijlstra
@ 2019-09-04 10:52         ` Qais Yousef
  2019-09-04 13:11         ` Joel Fernandes
  1 sibling, 0 replies; 31+ messages in thread
From: Qais Yousef @ 2019-09-04 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

On 09/04/19 10:14, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > > 
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > > 
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > > 
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > > 
> > > 
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points. 
> 
> Quite; I hate tracepoints for the API constraints they impose. Been
> bitten by that, not want to ever have to deal with that again.

s/tracepoints/trace events/ ?

They used to be one and the same but I think using them interchangeably might
cause some confusion now since we have tracepoints without trace events
associated with them.

Not trying to be picky, but the missing distinction confused the hell out of
me when I first started looking at this :-)

--
Qais Yousef

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 10:43       ` Qais Yousef
@ 2019-09-04 13:06         ` Joel Fernandes
  2019-09-04 14:20           ` Qais Yousef
  2019-09-04 15:25           ` Alexei Starovoitov
  0 siblings, 2 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 13:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On Wed, Sep 04, 2019 at 11:43:33AM +0100, Qais Yousef wrote:
> On 09/04/19 00:23, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > > 
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > > 
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > > 
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > > 
> > > 
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points. Qais did something very similar recently:
> > > 
> > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
> > > 
> > > You'll have to implement the associated trace events in a module, which
> > > lets you define your own event format and doesn't form an ABI :).
> > 
> > Is that really true? eBPF programs loaded from userspace can access
> > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> > 
> > I don't have a strong opinion about considering tracepoints as ABI / API or
> > not, but just want to get the facts straight :)
> 
> It is actually true.
>
> But you need to make the distinction between a tracepoint
> and a trace event first.

I know this distinction well.

> What Valentin is talking about here is the *bare*
> tracepoint without any event associated with them like the one I added to the
> scheduler recently. These ones are not accessible via eBPF, unless something
> has changed since I last tried.

Can this tracepoint be registered on with tracepoint_probe_register()?
Quickly looking at these new tracepoint, they can be otherwise how would they
even work right? If so, then eBPF can very well access it. Look at
__bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
BPF_RAW_TRACEPOINT_OPEN.

> The current infrastructure needs to be expanded to allow eBPF to attach these
> bare tracepoints. Something similar to what I have in [1] is needed - but
> instead of creating a new macro it needs to expand the current macro. [2] might
> give full context of when I was trying to come up with alternatives to using
> trace events.
> 
> [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45v6k@e107158-lin.cambridge.arm.com/


As I was mentioning, tracepoints, not "trace events" can already be opened
directly with BPF. I don't see how these new tracepoints are different.

I wonder if this distinction of "tracepoint" being non-ABI can be documented
somewhere. I would be happy to do that if there is a place for the same. I
really want some general "policy" in the kernel on where we draw a line in
the sand with respect to tracepoints and ABI :).

For instance, perhaps VFS can also start having non-ABI tracepoints for the
benefit of people tracing the VFS.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04  8:14       ` Peter Zijlstra
  2019-09-04 10:52         ` Qais Yousef
@ 2019-09-04 13:11         ` Joel Fernandes
  2019-09-04 15:26           ` Alexei Starovoitov
  1 sibling, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 13:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86, Qais Yousef

On Wed, Sep 04, 2019 at 10:14:48AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote:
> > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote:
> > > On 03/09/2019 16:43, Radim Krčmář wrote:
> > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> > > > custom data gathering points to better understand what was going on in
> > > > the scheduler.
> > > > Red Hat adapted one of them for the tracepoint framework and created a
> > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > > > tracepoint is being used for fine grained monitoring of scheduling
> > > > imbalance.
> > > > The tool is available from https://github.com/jirvoz/plot-nr-running.
> > > > 
> > > > The best place for the tracepoints is inside the add/sub_nr_running,
> > > > which requires some shenanigans to make it work as they are defined
> > > > inside sched.h.
> > > > The tracepoints have to be included from sched.h, which means that
> > > > CREATE_TRACE_POINTS has to be defined for the whole header and this
> > > > might cause problems if tree-wide headers expose tracepoints in sched.h
> > > > dependencies, but I'd argue it's the other side's misuse of tracepoints.
> > > > 
> > > > Moving the import sched.h line lower would require fixes in s390 and ppc
> > > > headers, because they don't include dependecies properly and expect
> > > > sched.h to do it, so it is simpler to keep sched.h there and
> > > > preventively undefine CREATE_TRACE_POINTS right after.
> > > > 
> > > > Exports of the pelt tracepoints remain because they don't need to be
> > > > protected by CREATE_TRACE_POINTS and moving them closer would be
> > > > unsightly.
> > > > 
> > > 
> > > Pure trace events are frowned upon in scheduler world, try going with
> > > trace points. 
> 
> Quite; I hate tracepoints for the API constraints they impose. Been
> bitten by that, not want to ever have to deal with that again.

Your NACKs on trace patches over the years have spoken out loud about this
point ;-)

> > >  Qais did something very similar recently:
> > > 
> > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.yousef@arm.com/
> > > 
> > > You'll have to implement the associated trace events in a module, which
> > > lets you define your own event format and doesn't form an ABI :).
> > 
> > Is that really true? eBPF programs loaded from userspace can access
> > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI:
> > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103
> > 
> > I don't have a strong opinion about considering tracepoints as ABI / API or
> > not, but just want to get the facts straight :)
> 
> eBPF can access all sorts of kernel internals; if we were to deem eBPF
> and API we'd be fscked.

True. However, for kprobes-based BPF program - it does check for kernel
version to ensure that the BPF program is built against the right kernel
version (in order to ensure the program is built against the right set of
kernel headers). If it is not, then BPF refuses to load the program.

But, to your point bpf_probe_read() can away access kernel memory and
assume structure layouts; so I guess a badly written bpf program can still do
all sorts of ABI-unstable things.

Good to know that the raw tracepoint being Ok since it is non-ABI; is where
we draw the line.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
  2019-09-03 16:05   ` Valentin Schneider
@ 2019-09-04 13:13   ` Peter Zijlstra
  2019-09-04 13:21     ` Valentin Schneider
  2019-09-04 14:37   ` Qais Yousef
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-09-04 13:13 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

On Tue, Sep 03, 2019 at 05:43:40PM +0200, Radim Krčmář wrote:

> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.

You should be able to reconstruct this from wakeup and switch
tracepoints.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 13:13   ` Peter Zijlstra
@ 2019-09-04 13:21     ` Valentin Schneider
  0 siblings, 0 replies; 31+ messages in thread
From: Valentin Schneider @ 2019-09-04 13:21 UTC (permalink / raw)
  To: Peter Zijlstra, Radim Krčmář
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

On 04/09/2019 14:13, Peter Zijlstra wrote:
> On Tue, Sep 03, 2019 at 05:43:40PM +0200, Radim Krčmář wrote:
> 
>> Red Hat adapted one of them for the tracepoint framework and created a
>> tool to plot a heatmap of nr_running, where the sched_update_nr_running
>> tracepoint is being used for fine grained monitoring of scheduling
>> imbalance.
> 
> You should be able to reconstruct this from wakeup and switch
> tracepoints.
> 
> 

I never tried to do it myself but know some folks around me gave it a go
(Morten, Qais and maybe even Chris). I can't remember the exact details but
there was something about not getting the right input - duplicated wakeups,
or no migration information (I think sched_migrate doesn't cover all of the
cases) so the information you're building diverges as time progresses.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 13:06         ` Joel Fernandes
@ 2019-09-04 14:20           ` Qais Yousef
  2019-09-04 14:41             ` Joel Fernandes
  2019-09-04 15:25           ` Alexei Starovoitov
  1 sibling, 1 reply; 31+ messages in thread
From: Qais Yousef @ 2019-09-04 14:20 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On 09/04/19 09:06, Joel Fernandes wrote:
> > 
> > It is actually true.
> >
> > But you need to make the distinction between a tracepoint
> > and a trace event first.
> 
> I know this distinction well.
> 
> > What Valentin is talking about here is the *bare*
> > tracepoint without any event associated with them like the one I added to the
> > scheduler recently. These ones are not accessible via eBPF, unless something
> > has changed since I last tried.
> 
> Can this tracepoint be registered on with tracepoint_probe_register()?
> Quickly looking at these new tracepoint, they can be otherwise how would they
> even work right? If so, then eBPF can very well access it. Look at
> __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> BPF_RAW_TRACEPOINT_OPEN.

Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
maybe I missed something on the way it should be used. AFAICT it was missing
the bits that I implemented in [1]. Maybe the method you mention is lower level
than bcc.

> 
> > The current infrastructure needs to be expanded to allow eBPF to attach these
> > bare tracepoints. Something similar to what I have in [1] is needed - but
> > instead of creating a new macro it needs to expand the current macro. [2] might
> > give full context of when I was trying to come up with alternatives to using
> > trace events.
> > 
> > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45v6k@e107158-lin.cambridge.arm.com/
> 
> 
> As I was mentioning, tracepoints, not "trace events" can already be opened
> directly with BPF. I don't see how these new tracepoints are different.
> 
> I wonder if this distinction of "tracepoint" being non-ABI can be documented
> somewhere. I would be happy to do that if there is a place for the same. I
> really want some general "policy" in the kernel on where we draw a line in
> the sand with respect to tracepoints and ABI :).
> 
> For instance, perhaps VFS can also start having non-ABI tracepoints for the
> benefit of people tracing the VFS.

Good question. I did consider that but failed to come up with a place. AFAIU
the history moved from tracepoints to trace events and now moving back to
tracepoints. Something Steve is not very enthusiastic about.

LPC is coming, sounds like a good venue to discuss this :-)

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
  2019-09-03 16:05   ` Valentin Schneider
  2019-09-04 13:13   ` Peter Zijlstra
@ 2019-09-04 14:37   ` Qais Yousef
  2019-09-04 17:48     ` Peter Zijlstra
  2 siblings, 1 reply; 31+ messages in thread
From: Qais Yousef @ 2019-09-04 14:37 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On 09/03/19 17:43, Radim Krčmář wrote:
> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> custom data gathering points to better understand what was going on in
> the scheduler.
> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.
> The tool is available from https://github.com/jirvoz/plot-nr-running.
> 
> The best place for the tracepoints is inside the add/sub_nr_running,
> which requires some shenanigans to make it work as they are defined
> inside sched.h.

I managed to hook into sched_switch to get the nr_running of cfs tasks via
eBPF.

```
int on_switch(struct sched_switch_args *args) {
    struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
    struct cgroup *prev_cgroup = prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
    const char *prev_cgroup_name = prev_cgroup->kn->name;

    if (prev_cgroup->kn->parent) {
        bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
                         prev->se.cfs_rq->nr_running,
                         prev_cgroup_name);
    } else {
        bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
                         prev->se.cfs_rq->nr_running);
    }
    return 0;
};
```

You can do something similar by attaching to the sched_switch tracepoint from
a module and a create a new event to get the nr_running.

Now this is not as accurate as your proposed new tracepoint in terms where you
sample nr_running, but should be good enough?

Cheers

--
Qais Yousef

> The tracepoints have to be included from sched.h, which means that
> CREATE_TRACE_POINTS has to be defined for the whole header and this
> might cause problems if tree-wide headers expose tracepoints in sched.h
> dependencies, but I'd argue it's the other side's misuse of tracepoints.
> 
> Moving the import sched.h line lower would require fixes in s390 and ppc
> headers, because they don't include dependecies properly and expect
> sched.h to do it, so it is simpler to keep sched.h there and
> preventively undefine CREATE_TRACE_POINTS right after.
> 
> Exports of the pelt tracepoints remain because they don't need to be
> protected by CREATE_TRACE_POINTS and moving them closer would be
> unsightly.
> 
> Tested-by: Jirka Hladký <jhladky@redhat.com>
> Tested-by: Jiří Vozár <jvozar@redhat.com>
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  include/trace/events/sched.h | 22 ++++++++++++++++++++++
>  kernel/sched/core.c          |  7 ++-----
>  kernel/sched/fair.c          |  2 --
>  kernel/sched/sched.h         |  7 +++++++
>  4 files changed, 31 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 420e80e56e55..1527fc695609 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
>  	TP_PROTO(struct root_domain *rd, bool overutilized),
>  	TP_ARGS(rd, overutilized));
>  
> +TRACE_EVENT(sched_update_nr_running,
> +
> +	TP_PROTO(int cpu, int change, unsigned int nr_running),
> +
> +	TP_ARGS(cpu, change, nr_running),
> +
> +	TP_STRUCT__entry(
> +		__field(int,          cpu)
> +		__field(int,          change)
> +		__field(unsigned int, nr_running)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->cpu        = cpu;
> +		__entry->change     = change;
> +		__entry->nr_running = nr_running;
> +	),
> +
> +	TP_printk("cpu=%u nr_running=%u (%d)",
> +			__entry->cpu, __entry->nr_running, __entry->change)
> +);
> +
>  #endif /* _TRACE_SCHED_H */
>  
>  /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 71981ce84231..31ac37b9f6f7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6,7 +6,9 @@
>   *
>   *  Copyright (C) 1991-2002  Linus Torvalds
>   */
> +#define CREATE_TRACE_POINTS
>  #include "sched.h"
> +#undef CREATE_TRACE_POINTS
>  
>  #include <linux/nospec.h>
>  
> @@ -20,9 +22,6 @@
>  
>  #include "pelt.h"
>  
> -#define CREATE_TRACE_POINTS
> -#include <trace/events/sched.h>
> -
>  /*
>   * Export tracepoints that act as a bare tracehook (ie: have no trace event
>   * associated with them) to allow external modules to probe them.
> @@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
>   /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
>   /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
>  };
> -
> -#undef CREATE_TRACE_POINTS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84959d3285d1..421236d39902 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -22,8 +22,6 @@
>   */
>  #include "sched.h"
>  
> -#include <trace/events/sched.h>
> -
>  /*
>   * Targeted preemption latency for CPU-bound tasks:
>   *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c4915f46035a..b89d7786109a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -75,6 +75,8 @@
>  #include "cpupri.h"
>  #include "cpudeadline.h"
>  
> +#include <trace/events/sched.h>
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
>  #else
> @@ -1887,6 +1889,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>  
>  	rq->nr_running = prev_nr + count;
>  
> +	trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
> +
>  #ifdef CONFIG_SMP
>  	if (prev_nr < 2 && rq->nr_running >= 2) {
>  		if (!READ_ONCE(rq->rd->overload))
> @@ -1900,6 +1904,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>  static inline void sub_nr_running(struct rq *rq, unsigned count)
>  {
>  	rq->nr_running -= count;
> +
> +	trace_sched_update_nr_running(cpu_of(rq), -count, rq->nr_running);
> +
>  	/* Check if we still need preemption */
>  	sched_update_tick_dependency(rq);
>  }
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 14:20           ` Qais Yousef
@ 2019-09-04 14:41             ` Joel Fernandes
  2019-09-04 14:57               ` Qais Yousef
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 14:41 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> On 09/04/19 09:06, Joel Fernandes wrote:
> > > 
> > > It is actually true.
> > >
> > > But you need to make the distinction between a tracepoint
> > > and a trace event first.
> > 
> > I know this distinction well.
> > 
> > > What Valentin is talking about here is the *bare*
> > > tracepoint without any event associated with them like the one I added to the
> > > scheduler recently. These ones are not accessible via eBPF, unless something
> > > has changed since I last tried.
> > 
> > Can this tracepoint be registered on with tracepoint_probe_register()?
> > Quickly looking at these new tracepoint, they can be otherwise how would they
> > even work right? If so, then eBPF can very well access it. Look at
> > __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> > BPF_RAW_TRACEPOINT_OPEN.
> 
> Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
> maybe I missed something on the way it should be used. AFAICT it was missing
> the bits that I implemented in [1]. Maybe the method you mention is lower level
> than bcc.

Oh, Ok. Not sure about BCC. I know that facebook folks are using *existing*
tracepoints (not trace events) to probe context switches and such (probably
not through BCC but some other BPF tracing code). Peter had rejected trace
events they were trying to add IIRC, so they added BPF_RAW_TRACEPOINT_OPEN
then IIRC.

> > > The current infrastructure needs to be expanded to allow eBPF to attach these
> > > bare tracepoints. Something similar to what I have in [1] is needed - but
> > > instead of creating a new macro it needs to expand the current macro. [2] might
> > > give full context of when I was trying to come up with alternatives to using
> > > trace events.
> > > 
> > > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45v6k@e107158-lin.cambridge.arm.com/
> > 
> > 
> > As I was mentioning, tracepoints, not "trace events" can already be opened
> > directly with BPF. I don't see how these new tracepoints are different.
> > 
> > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > somewhere. I would be happy to do that if there is a place for the same. I
> > really want some general "policy" in the kernel on where we draw a line in
> > the sand with respect to tracepoints and ABI :).
> > 
> > For instance, perhaps VFS can also start having non-ABI tracepoints for the
> > benefit of people tracing the VFS.
> 
> Good question. I did consider that but failed to come up with a place. AFAIU
> the history moved from tracepoints to trace events and now moving back to
> tracepoints. Something Steve is not very enthusiastic about.

Yeah this is a bit of a mess. I think for every recent LPC this has come up.
But the DECLARE_TRACE approach you did is interesting in that it
reduces/removes the API surface for trace-events at least.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 14:41             ` Joel Fernandes
@ 2019-09-04 14:57               ` Qais Yousef
  2019-09-04 15:46                 ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Qais Yousef @ 2019-09-04 14:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On 09/04/19 10:41, Joel Fernandes wrote:
> On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> > On 09/04/19 09:06, Joel Fernandes wrote:
> > > > 
> > > > It is actually true.
> > > >
> > > > But you need to make the distinction between a tracepoint
> > > > and a trace event first.
> > > 
> > > I know this distinction well.
> > > 
> > > > What Valentin is talking about here is the *bare*
> > > > tracepoint without any event associated with them like the one I added to the
> > > > scheduler recently. These ones are not accessible via eBPF, unless something
> > > > has changed since I last tried.
> > > 
> > > Can this tracepoint be registered on with tracepoint_probe_register()?
> > > Quickly looking at these new tracepoint, they can be otherwise how would they
> > > even work right? If so, then eBPF can very well access it. Look at
> > > __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> > > BPF_RAW_TRACEPOINT_OPEN.
> > 
> > Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
> > maybe I missed something on the way it should be used. AFAICT it was missing
> > the bits that I implemented in [1]. Maybe the method you mention is lower level
> > than bcc.
> 
> Oh, Ok. Not sure about BCC. I know that facebook folks are using *existing*
> tracepoints (not trace events) to probe context switches and such (probably
> not through BCC but some other BPF tracing code). Peter had rejected trace
> events they were trying to add IIRC, so they added BPF_RAW_TRACEPOINT_OPEN
> then IIRC.

Looking at the history BPF_RAW_TRACEPOINT_OPEN was added with the support for
RAW_TRACEPOINT c4f6699dfcb8 (bpf: introduce BPF_RAW_TRACEPOINT).

Anyway, if you ever get a chance please try it and let me know. I might have
done something wrong and you're more of a eBPF guru than I am :-)

> 
> > > > The current infrastructure needs to be expanded to allow eBPF to attach these
> > > > bare tracepoints. Something similar to what I have in [1] is needed - but
> > > > instead of creating a new macro it needs to expand the current macro. [2] might
> > > > give full context of when I was trying to come up with alternatives to using
> > > > trace events.
> > > > 
> > > > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > > [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45v6k@e107158-lin.cambridge.arm.com/
> > > 
> > > 
> > > As I was mentioning, tracepoints, not "trace events" can already be opened
> > > directly with BPF. I don't see how these new tracepoints are different.
> > > 
> > > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > > somewhere. I would be happy to do that if there is a place for the same. I
> > > really want some general "policy" in the kernel on where we draw a line in
> > > the sand with respect to tracepoints and ABI :).
> > > 
> > > For instance, perhaps VFS can also start having non-ABI tracepoints for the
> > > benefit of people tracing the VFS.
> > 
> > Good question. I did consider that but failed to come up with a place. AFAIU
> > the history moved from tracepoints to trace events and now moving back to
> > tracepoints. Something Steve is not very enthusiastic about.
> 
> Yeah this is a bit of a mess. I think for every recent LPC this has come up.
> But the DECLARE_TRACE approach you did is interesting in that it
> reduces/removes the API surface for trace-events at least.

Yes. And you have the flexibility to add more info to the tracepoint without
worrying about breaking current users.

Another nice feat we discovered is that you can create several trace evenets
from the same tracepoint each exposing different set of info. You can achieve
the same with the current trace events if you use tracepoint_probe_register()
of course, but not if you use the macros. The tendency for in-kernel trace
events is to have 1:1 mapping even if the events can be extracted from
a single tracepoint.

--
Qais Yousef

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 13:06         ` Joel Fernandes
  2019-09-04 14:20           ` Qais Yousef
@ 2019-09-04 15:25           ` Alexei Starovoitov
  2019-09-04 15:40             ` Joel Fernandes
  1 sibling, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2019-09-04 15:25 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Qais Yousef, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, X86 ML

On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> I wonder if this distinction of "tracepoint" being non-ABI can be documented
> somewhere. I would be happy to do that if there is a place for the same. I
> really want some general "policy" in the kernel on where we draw a line in
> the sand with respect to tracepoints and ABI :).

It's been discussed millions times. tracepoints are not abi.
Example: android folks started abusing tracepoints inside bpf core
and we _deleted_ them.
Same thing can be done with _any_ tracepoint.
Do not abuse them and stop the fud about abi.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 13:11         ` Joel Fernandes
@ 2019-09-04 15:26           ` Alexei Starovoitov
  2019-09-04 15:33             ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2019-09-04 15:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML, Qais Yousef

On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> True. However, for kprobes-based BPF program - it does check for kernel
> version to ensure that the BPF program is built against the right kernel
> version (in order to ensure the program is built against the right set of
> kernel headers). If it is not, then BPF refuses to load the program.

This is not true anymore. Users found few ways to workaround that check
in practice. It became useless and it was deleted some time ago.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 15:26           ` Alexei Starovoitov
@ 2019-09-04 15:33             ` Joel Fernandes
  2019-09-04 15:37               ` Alexei Starovoitov
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 15:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML, Qais Yousef

On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > True. However, for kprobes-based BPF program - it does check for kernel
> > version to ensure that the BPF program is built against the right kernel
> > version (in order to ensure the program is built against the right set of
> > kernel headers). If it is not, then BPF refuses to load the program.
> 
> This is not true anymore. Users found few ways to workaround that check
> in practice. It became useless and it was deleted some time ago.

Wow, Ok! Interesting!

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 15:33             ` Joel Fernandes
@ 2019-09-04 15:37               ` Alexei Starovoitov
  2019-09-04 15:41                 ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2019-09-04 15:37 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML, Qais Yousef

On Wed, Sep 4, 2019 at 8:33 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > True. However, for kprobes-based BPF program - it does check for kernel
> > > version to ensure that the BPF program is built against the right kernel
> > > version (in order to ensure the program is built against the right set of
> > > kernel headers). If it is not, then BPF refuses to load the program.
> >
> > This is not true anymore. Users found few ways to workaround that check
> > in practice. It became useless and it was deleted some time ago.
>
> Wow, Ok! Interesting!

the other part of your email says about kernel header requirement.
This is not true any more as well :)
BTF relocations are already supported by the kernel, llvm, libbpf,
bpftool, pahole.
We'll be posting sample code soon.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 15:25           ` Alexei Starovoitov
@ 2019-09-04 15:40             ` Joel Fernandes
  2019-09-04 15:51               ` Alexei Starovoitov
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 15:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Qais Yousef, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, X86 ML

On Wed, Sep 04, 2019 at 08:25:27AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > somewhere. I would be happy to do that if there is a place for the same. I
> > really want some general "policy" in the kernel on where we draw a line in
> > the sand with respect to tracepoints and ABI :).
> 
> It's been discussed millions times. tracepoints are not abi.
> Example: android folks started abusing tracepoints inside bpf core
> and we _deleted_ them.

This is news to me, which ones?

> Same thing can be done with _any_ tracepoint.
> Do not abuse them and stop the fud about abi.

I don't know what FUD you are referring to. At least it is not coming from
me. This thread is dealing with the issue about ABI specifically, I jumped in
just now. As I was saying earlier, I don't have a strong opinion about this.
I just want to know what is the agreed upon approach so that we can stick to
it.

It sounds like the agreement here is tracepoints can be added and used
without ABI guarantees, however the same is not true with trace events.
Where's the FUD in that?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 15:37               ` Alexei Starovoitov
@ 2019-09-04 15:41                 ` Joel Fernandes
  0 siblings, 0 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML, Qais Yousef

On Wed, Sep 04, 2019 at 08:37:22AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 8:33 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > True. However, for kprobes-based BPF program - it does check for kernel
> > > > version to ensure that the BPF program is built against the right kernel
> > > > version (in order to ensure the program is built against the right set of
> > > > kernel headers). If it is not, then BPF refuses to load the program.
> > >
> > > This is not true anymore. Users found few ways to workaround that check
> > > in practice. It became useless and it was deleted some time ago.
> >
> > Wow, Ok! Interesting!
> 
> the other part of your email says about kernel header requirement.
> This is not true any more as well :)
> BTF relocations are already supported by the kernel, llvm, libbpf,
> bpftool, pahole.
> We'll be posting sample code soon.

Ok, this landscape seems to be changing quite a bit. I was going by what I
already know... Looking forward to catching up with the latest. Sorry,

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 14:57               ` Qais Yousef
@ 2019-09-04 15:46                 ` Joel Fernandes
  0 siblings, 0 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-09-04 15:46 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Valentin Schneider, Radim Krčmář,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, x86

On Wed, Sep 04, 2019 at 03:57:59PM +0100, Qais Yousef wrote:
> On 09/04/19 10:41, Joel Fernandes wrote:
> > On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> > > On 09/04/19 09:06, Joel Fernandes wrote:
> > > > > 
> > > > > It is actually true.
> > > > >
> > > > > But you need to make the distinction between a tracepoint
> > > > > and a trace event first.
> > > > 
> > > > I know this distinction well.
> > > > 
> > > > > What Valentin is talking about here is the *bare*
> > > > > tracepoint without any event associated with them like the one I added to the
> > > > > scheduler recently. These ones are not accessible via eBPF, unless something
> > > > > has changed since I last tried.
> > > > 
> > > > Can this tracepoint be registered on with tracepoint_probe_register()?
> > > > Quickly looking at these new tracepoint, they can be otherwise how would they
> > > > even work right? If so, then eBPF can very well access it. Look at
> > > > __bpf_probe_register() and bpf_raw_tracepoint_open() which implement the
> > > > BPF_RAW_TRACEPOINT_OPEN.
> > > 
> > > Humm okay. I tried to use raw tracepoint with bcc but failed to attach. But
> > > maybe I missed something on the way it should be used. AFAICT it was missing
> > > the bits that I implemented in [1]. Maybe the method you mention is lower level
> > > than bcc.
> > 
> > Oh, Ok. Not sure about BCC. I know that facebook folks are using *existing*
> > tracepoints (not trace events) to probe context switches and such (probably
> > not through BCC but some other BPF tracing code). Peter had rejected trace
> > events they were trying to add IIRC, so they added BPF_RAW_TRACEPOINT_OPEN
> > then IIRC.
> 
> Looking at the history BPF_RAW_TRACEPOINT_OPEN was added with the support for
> RAW_TRACEPOINT c4f6699dfcb8 (bpf: introduce BPF_RAW_TRACEPOINT).
> 
> Anyway, if you ever get a chance please try it and let me know. I might have
> done something wrong and you're more of a eBPF guru than I am :-)

eBPF guru and me? no way ;-) I have tried out BPF_RAW_TRACEPOINT_OPEN before
and it works as expected. Are there not any in-kernel samples? Perhaps Alexei
can post some if there are not.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 15:40             ` Joel Fernandes
@ 2019-09-04 15:51               ` Alexei Starovoitov
  2019-09-04 17:47                 ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2019-09-04 15:51 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Qais Yousef, Valentin Schneider, Radim Krčmář,
	LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, Steven Rostedt, H. Peter Anvin,
	Andy Lutomirski, Jirka Hladký,
	Jiří Vozár, X86 ML

On Wed, Sep 4, 2019 at 8:40 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Sep 04, 2019 at 08:25:27AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > I wonder if this distinction of "tracepoint" being non-ABI can be documented
> > > somewhere. I would be happy to do that if there is a place for the same. I
> > > really want some general "policy" in the kernel on where we draw a line in
> > > the sand with respect to tracepoints and ABI :).
> >
> > It's been discussed millions times. tracepoints are not abi.
> > Example: android folks started abusing tracepoints inside bpf core
> > and we _deleted_ them.
>
> This is news to me, which ones?

those that your android teammates abused!

> > Same thing can be done with _any_ tracepoint.
> > Do not abuse them and stop the fud about abi.
>
> I don't know what FUD you are referring to. At least it is not coming from
> me. This thread is dealing with the issue about ABI specifically, I jumped in
> just now. As I was saying earlier, I don't have a strong opinion about this.
> I just want to know what is the agreed upon approach so that we can stick to
> it.
>
> It sounds like the agreement here is tracepoints can be added and used
> without ABI guarantees, however the same is not true with trace events.
> Where's the FUD in that?

Anything in tracing can be deleted.
Tracing is about debugging and introspection.
When underlying kernel code changes the introspection points change as well.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 15:51               ` Alexei Starovoitov
@ 2019-09-04 17:47                 ` Peter Zijlstra
  2019-09-04 17:53                   ` Alexei Starovoitov
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-09-04 17:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joel Fernandes, Qais Yousef, Valentin Schneider,
	Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML

On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> Anything in tracing can be deleted.
> Tracing is about debugging and introspection.
> When underlying kernel code changes the introspection points change as well.

Right; except when it breaks widely used tools; like say powertop. Been
there, done that.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 14:37   ` Qais Yousef
@ 2019-09-04 17:48     ` Peter Zijlstra
  2019-09-09 10:59       ` Qais Yousef
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-09-04 17:48 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Radim Krčmář,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

On Wed, Sep 04, 2019 at 03:37:11PM +0100, Qais Yousef wrote:

> I managed to hook into sched_switch to get the nr_running of cfs tasks via
> eBPF.
> 
> ```
> int on_switch(struct sched_switch_args *args) {
>     struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
>     struct cgroup *prev_cgroup = prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
>     const char *prev_cgroup_name = prev_cgroup->kn->name;
> 
>     if (prev_cgroup->kn->parent) {
>         bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
>                          prev->se.cfs_rq->nr_running,
>                          prev_cgroup_name);
>     } else {
>         bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
>                          prev->se.cfs_rq->nr_running);
>     }
>     return 0;
> };
> ```
> 
> You can do something similar by attaching to the sched_switch tracepoint from
> a module and a create a new event to get the nr_running.
> 
> Now this is not as accurate as your proposed new tracepoint in terms where you
> sample nr_running, but should be good enough?

The above is after deactivate() and gives an up-to-date count for
decrements. Attach something to trace_sched_wakeup() to get the
increment update.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 17:47                 ` Peter Zijlstra
@ 2019-09-04 17:53                   ` Alexei Starovoitov
  2019-09-05  8:13                     ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2019-09-04 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Qais Yousef, Valentin Schneider,
	Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML

On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > Anything in tracing can be deleted.
> > Tracing is about debugging and introspection.
> > When underlying kernel code changes the introspection points change as well.
>
> Right; except when it breaks widely used tools; like say powertop. Been
> there, done that.

powertop was a lesson learned, but it's not a relevant example anymore.
There are more widely used tools today. Like bcc tools.
And bpftrace is quickly gaining momentum and large user base.
bcc tools did break already several times and people fixed them.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 17:53                   ` Alexei Starovoitov
@ 2019-09-05  8:13                     ` Ingo Molnar
  2019-09-05 16:49                       ` Alexei Starovoitov
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2019-09-05  8:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Valentin Schneider,
	Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML


* Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > > Anything in tracing can be deleted.
> > > Tracing is about debugging and introspection.
> > > When underlying kernel code changes the introspection points change as well.
> >
> > Right; except when it breaks widely used tools; like say powertop. Been
> > there, done that.
> 
> powertop was a lesson learned, but it's not a relevant example anymore.
> There are more widely used tools today. Like bcc tools.
> And bpftrace is quickly gaining momentum and large user base.
> bcc tools did break already several times and people fixed them.

Are these tools using libtraceevents?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-05  8:13                     ` Ingo Molnar
@ 2019-09-05 16:49                       ` Alexei Starovoitov
  0 siblings, 0 replies; 31+ messages in thread
From: Alexei Starovoitov @ 2019-09-05 16:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Valentin Schneider,
	Radim Krčmář,
	LKML, Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, X86 ML

On Thu, Sep 05, 2019 at 10:13:10AM +0200, Ingo Molnar wrote:
> 
> * Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > > > Anything in tracing can be deleted.
> > > > Tracing is about debugging and introspection.
> > > > When underlying kernel code changes the introspection points change as well.
> > >
> > > Right; except when it breaks widely used tools; like say powertop. Been
> > > there, done that.
> > 
> > powertop was a lesson learned, but it's not a relevant example anymore.
> > There are more widely used tools today. Like bcc tools.
> > And bpftrace is quickly gaining momentum and large user base.
> > bcc tools did break already several times and people fixed them.
> 
> Are these tools using libtraceevents?

bcc tools and bpftrace are using libbcc.
Which in turn is using libbpf.
libtraceevents is not used.

Interesting example is https://github.com/iovisor/bcc/blob/master/tools/tcplife.py
It's using "inet_sock_set_state" tracepoint when available on newer kernels
and kprobe in tcp_set_state() function on older kernels.
That tracepoint changed significantly over time.
It had different name 'tcp_set_state' and slightly different semantics.
Hence the tool was fixed when that change in tracepoint happened:
https://github.com/iovisor/bcc/commit/fd93dc0409b626b749b90f115d3d550a870ed125

Note that tcp:tcp_set_state tracepoint existed for full kernel release.
Yet people didn't make fuzz about the fact it disappeared in 4.16.
Though tcplife.py tool is simple there are more complex tools based
on this idea that are deployed in netflix and fb that went through the same
tcp_set_state->inet_sock_set_state fixes.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
  2019-09-04 17:48     ` Peter Zijlstra
@ 2019-09-09 10:59       ` Qais Yousef
  0 siblings, 0 replies; 31+ messages in thread
From: Qais Yousef @ 2019-09-09 10:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Radim Krčmář,
	linux-kernel, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, H. Peter Anvin, Andy Lutomirski,
	Jirka Hladký,
	Jiří Vozár, x86

On 09/04/19 19:48, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 03:37:11PM +0100, Qais Yousef wrote:
> 
> > I managed to hook into sched_switch to get the nr_running of cfs tasks via
> > eBPF.
> > 
> > ```
> > int on_switch(struct sched_switch_args *args) {
> >     struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
> >     struct cgroup *prev_cgroup = prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
> >     const char *prev_cgroup_name = prev_cgroup->kn->name;
> > 
> >     if (prev_cgroup->kn->parent) {
> >         bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
> >                          prev->se.cfs_rq->nr_running,
> >                          prev_cgroup_name);
> >     } else {
> >         bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
> >                          prev->se.cfs_rq->nr_running);
> >     }
> >     return 0;
> > };
> > ```
> > 
> > You can do something similar by attaching to the sched_switch tracepoint from
> > a module and a create a new event to get the nr_running.
> > 
> > Now this is not as accurate as your proposed new tracepoint in terms where you
> > sample nr_running, but should be good enough?
> 
> The above is after deactivate() and gives an up-to-date count for
> decrements. Attach something to trace_sched_wakeup() to get the
> increment update.

I just remembered that sched_switch and sched_wakeup aren't
EXPORT_TRACEPOINT*() so can't be attached to via out of tree module. But still
accessible via eBPF.

There has been several attempts to export these tracepoints but they were
NACKed because there was no in-kernel module that needed them.

https://lore.kernel.org/lkml/20150422130052.4996e231@gandalf.local.home/

--
Qais Yousef

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-09-09 10:59 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-03 15:43 [PATCH 0/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
2019-09-03 15:43 ` [PATCH 1/2] x86/mm/tlb: include tracepoints from tlb.c instead of mmu_context.h Radim Krčmář
2019-09-03 15:43 ` [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint Radim Krčmář
2019-09-03 16:05   ` Valentin Schneider
2019-09-04  4:23     ` Joel Fernandes
2019-09-04  8:14       ` Peter Zijlstra
2019-09-04 10:52         ` Qais Yousef
2019-09-04 13:11         ` Joel Fernandes
2019-09-04 15:26           ` Alexei Starovoitov
2019-09-04 15:33             ` Joel Fernandes
2019-09-04 15:37               ` Alexei Starovoitov
2019-09-04 15:41                 ` Joel Fernandes
2019-09-04 10:43       ` Qais Yousef
2019-09-04 13:06         ` Joel Fernandes
2019-09-04 14:20           ` Qais Yousef
2019-09-04 14:41             ` Joel Fernandes
2019-09-04 14:57               ` Qais Yousef
2019-09-04 15:46                 ` Joel Fernandes
2019-09-04 15:25           ` Alexei Starovoitov
2019-09-04 15:40             ` Joel Fernandes
2019-09-04 15:51               ` Alexei Starovoitov
2019-09-04 17:47                 ` Peter Zijlstra
2019-09-04 17:53                   ` Alexei Starovoitov
2019-09-05  8:13                     ` Ingo Molnar
2019-09-05 16:49                       ` Alexei Starovoitov
2019-09-04  6:55     ` chengjian (D)
2019-09-04 13:13   ` Peter Zijlstra
2019-09-04 13:21     ` Valentin Schneider
2019-09-04 14:37   ` Qais Yousef
2019-09-04 17:48     ` Peter Zijlstra
2019-09-09 10:59       ` Qais Yousef

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).