linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch
@ 2019-10-23  7:05 Alexey Budankov
  2019-10-23  7:11 ` [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu Alexey Budankov
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Alexey Budankov @ 2019-10-23  7:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


Restore Intel LBR call stack from cloned inactive task perf context on
a context switch. This change inherently addresses inconsistency in LBR 
call stack data provided on a sample in record profiling mode:

  $ perf record -N -B -T -R --call-graph lbr \
         -e cpu/period=0xcdfe60,event=0x3c,name=\'CPU_CLK_UNHALTED.THREAD\'/Duk \
         --clockid=monotonic_raw -- ./miniFE.x nx 25 ny 25 nz 25

Let's assume threads A, B, C belonging to the same process. 
B and C are siblings of A and their perf contexts are treated as equivalent.
At some point B blocks on a futex (non preempt context switch).
B's LBRs are preserved at B's perf context task_ctx_data and B's events 
are removed from PMU and disabled. B's perf context becomes inactive.

Later C gets on a cpu, runs, gets profiled and eventually switches to 
the awaken but not yet running B. The optimized context switch path is 
executed swapping B's and C's task_ctx_data pointers at perf event contexts.
So C's task_ctx_data will refer preserved B's LBRs on the following 
switch-in event.

However, as far B's perf context is inactive there is no enabled events
in there and B's task_ctx_data->lbr_callstack_users is equal to 0.
When B gets on the cpu B's events reviving is skipped following
the optimized context switch path and B's task_ctx_data->lbr_callstack_users
remains 0. Thus B's LBR's are not restored by pmu sched_task() code called 
in the end of perf context switch-in callback for B.

In the report that manifests as having short fragments of B's
call stack, still tracked by LBR's HW between adjacent samples,
but the whole thread call tree doesn't aggregate.

The fix has been evaluated when profiling miniFE [1] (C++, OpenMP)
workload running 64 threads on Intel Skylake EP(64 core, 2 sockets):

  $ perf report --call-graph callee,flat

5.3.0-rc6+ (tip perf/core) - fixed

-   92.66%    82.64%  miniFE.x  libiomp5.so         [.] _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
   - 69.14% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
   - 21.89% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
   - 1.63% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        main
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone

5.0.13-300.fc30.x86_64 - no fix

-   90.29%    81.01%  miniFE.x  libiomp5.so         [.] _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
   - 33.45% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
     87.63% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
   - 54.79% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
   - 9.18% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
   - 41.28% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
   - 15.77% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
   - 11.56% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
   - 2.33% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        main
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
     0.67% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_gather
     0.57% __kmp_hardware_timestamp

[1] https://www.hpcadvisorycouncil.com/pdf/miniFE_Analysis_and_Profiling.pdf

---
Alexey Budankov (4):
  perf/core,x86: introduce swap_task_ctx() method at struct pmu
  perf/x86: install platform specific swap_task_ctx adapter
  perf/x86/intel: implement LBR callstacks context synchronization
  perf/core,x86: synchronize PMU task contexts on optimized context switches

 arch/x86/events/core.c       |  8 ++++++++
 arch/x86/events/intel/core.c |  7 +++++++
 arch/x86/events/intel/lbr.c  | 23 +++++++++++++++++++++++
 arch/x86/events/perf_event.h | 11 +++++++++++
 include/linux/perf_event.h   |  9 +++++++++
 kernel/events/core.c         | 13 ++++++++++++-
 6 files changed, 70 insertions(+), 1 deletion(-)

---
Changes in v5:
- renamed sync_task_ctx to swap_task_ctx;
- converted type of prev and next swap_task_ctx() params to struct perf_event_context;
- implemented check on availability of swap_task_ctx() pointer in pmu type
  at perf core implementation;
- moved swap of ctx->task_ctx_data pointers to architecture specific
  intel_pmu_lbr_swap_task_ctx() implementation;

Changes in v4:
- moved check on simultaneous task_ctx_data objects availability 
  to the perf/core layer;
- marked sync_task_ctx() as the optional in code comments;
- renamed params of sync_task_ctx() to prev and next;

Changes in v3:
- replaced assignment with swap at intel_pmu_lbr_sync_task_ctx()

Changes in v2:
- implemented sync_task_ctx() method at perf,x86,intel pmu types;
- employed the method on the optimized context switch path between 
  equivalent perf event contexts;

-- 
2.20.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu
  2019-10-23  7:05 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
@ 2019-10-23  7:11 ` Alexey Budankov
  2019-10-28 12:43   ` [tip: perf/core] perf/core, perf/x86: Introduce swap_task_ctx() method at 'struct pmu' tip-bot2 for Alexey Budankov
  2019-10-23  7:11 ` [PATCH v5 2/4] perf/x86: install platform specific swap_task_ctx adapter Alexey Budankov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Alexey Budankov @ 2019-10-23  7:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


Declare swap_task_ctx() methods at the generic and x86 specific
pmu types to bridge calls to platform specific pmu code on optimized
context switch path between equivalent task perf event contexts.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 arch/x86/events/perf_event.h | 8 ++++++++
 include/linux/perf_event.h   | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ecacfbf4ebc1..5384317eaa16 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -682,6 +682,14 @@ struct x86_pmu {
 	 */
 	atomic_t	lbr_exclusive[x86_lbr_exclusive_max];
 
+	/*
+	 * perf task context (i.e. struct perf_event_context::task_ctx_data)
+	 * switch helper to bridge calls from perf/core to perf/x86.
+	 * See struct pmu::swap_task_ctx() usage for examples;
+	 */
+	void		(*swap_task_ctx)(struct perf_event_context *prev,
+					 struct perf_event_context *next);
+
 	/*
 	 * AMD bits
 	 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 587ae4d002f5..7887e4a3d487 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -410,6 +410,15 @@ struct pmu {
 	 */
 	size_t				task_ctx_size;
 
+	/*
+	 * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data)
+	 * can be synchronized using this function. See Intel LBR callstack support
+	 * implementation and Perf core context switch handling callbacks for usage
+	 * examples.
+	 */
+	void (*swap_task_ctx)		(struct perf_event_context *prev,
+					 struct perf_event_context *next);
+					/* optional */
 
 	/*
 	 * Set up pmu-private data structures for an AUX area
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 2/4] perf/x86: install platform specific swap_task_ctx adapter
  2019-10-23  7:05 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
  2019-10-23  7:11 ` [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu Alexey Budankov
@ 2019-10-23  7:11 ` Alexey Budankov
  2019-10-28 12:43   ` [tip: perf/core] perf/x86: Install platform specific ->swap_task_ctx() adapter tip-bot2 for Alexey Budankov
  2019-10-23  7:12 ` [PATCH v5 3/4] perf/x86/intel: implement LBR callstacks context synchronization Alexey Budankov
  2019-10-23  7:13 ` [PATCH v5 4/4] perf/core,x86: synchronize PMU task contexts on optimized context switches Alexey Budankov
  3 siblings, 1 reply; 10+ messages in thread
From: Alexey Budankov @ 2019-10-23  7:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


Bridge perf core and x86 swap_task_ctx() method calls.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 arch/x86/events/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7b21455d7504..6e3f0c18908e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2243,6 +2243,13 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 		x86_pmu.sched_task(ctx, sched_in);
 }
 
+static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
+				  struct perf_event_context *next)
+{
+	if (x86_pmu.swap_task_ctx)
+		x86_pmu.swap_task_ctx(prev, next);
+}
+
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -2297,6 +2304,7 @@ static struct pmu pmu = {
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
 	.task_ctx_size          = sizeof(struct x86_perf_task_context),
+	.swap_task_ctx		= x86_pmu_swap_task_ctx,
 	.check_period		= x86_pmu_check_period,
 
 	.aux_output_match	= x86_pmu_aux_output_match,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 3/4] perf/x86/intel: implement LBR callstacks context synchronization
  2019-10-23  7:05 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
  2019-10-23  7:11 ` [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu Alexey Budankov
  2019-10-23  7:11 ` [PATCH v5 2/4] perf/x86: install platform specific swap_task_ctx adapter Alexey Budankov
@ 2019-10-23  7:12 ` Alexey Budankov
  2019-10-28 12:43   ` [tip: perf/core] perf/x86/intel: Implement LBR callstack " tip-bot2 for Alexey Budankov
  2019-10-23  7:13 ` [PATCH v5 4/4] perf/core,x86: synchronize PMU task contexts on optimized context switches Alexey Budankov
  3 siblings, 1 reply; 10+ messages in thread
From: Alexey Budankov @ 2019-10-23  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


Implement intel_pmu_lbr_swap_task_ctx() method updating counters
of the events that requested LBR callstack data on a sample.

The counter can be zero for the case when task context belongs to
a thread that has just come from a block on a futex and the context
contains saved (lbr_stack_state == LBR_VALID) LBR register values.

For the values to be restored at LBR registers on the next thread's
switch-in event it swaps the counter value with the one that is
expected to be non zero at the previous equivalent task perf event
context.

Swap operation type ensures the previous task perf event context
stays consistent with the amount of events that requested LBR
callstack data on a sample.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 arch/x86/events/intel/lbr.c  | 23 +++++++++++++++++++++++
 arch/x86/events/perf_event.h |  3 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index ea54634eabf3..534c76606049 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -417,6 +417,29 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 	cpuc->last_log_id = ++task_ctx->log_id;
 }
 
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
+				 struct perf_event_context *next)
+{
+	struct x86_perf_task_context *prev_ctx_data, *next_ctx_data;
+
+	swap(prev->task_ctx_data, next->task_ctx_data);
+
+	/*
+	 * Architecture specific synchronization makes sense in
+	 * case both prev->task_ctx_data and next->task_ctx_data
+	 * pointers are allocated.
+	 */
+
+	prev_ctx_data = next->task_ctx_data;
+	next_ctx_data = prev->task_ctx_data;
+
+	if (!prev_ctx_data || !next_ctx_data)
+		return;
+
+	swap(prev_ctx_data->lbr_callstack_users,
+	     next_ctx_data->lbr_callstack_users);
+}
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5384317eaa16..930611db8f9a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1024,6 +1024,9 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
+				 struct perf_event_context *next);
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
 
 u64 lbr_from_signext_quirk_wr(u64 val);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 4/4] perf/core,x86: synchronize PMU task contexts on optimized context switches
  2019-10-23  7:05 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
                   ` (2 preceding siblings ...)
  2019-10-23  7:12 ` [PATCH v5 3/4] perf/x86/intel: implement LBR callstacks context synchronization Alexey Budankov
@ 2019-10-23  7:13 ` Alexey Budankov
  2019-10-28 12:43   ` [tip: perf/core] perf/x86: Synchronize " tip-bot2 for Alexey Budankov
  3 siblings, 1 reply; 10+ messages in thread
From: Alexey Budankov @ 2019-10-23  7:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


Install Intel specific PMU task context synchronization adapter and
extend optimized context switch path with PMU specific task context
synchronization to fix LBR callstack virtualization on context switches.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 arch/x86/events/intel/core.c |  7 +++++++
 kernel/events/core.c         | 13 ++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index bbf6588d47ee..dc64b16e6b71 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3820,6 +3820,12 @@ static void intel_pmu_sched_task(struct perf_event_context *ctx,
 	intel_pmu_lbr_sched_task(ctx, sched_in);
 }
 
+static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
+				    struct perf_event_context *next)
+{
+	intel_pmu_lbr_swap_task_ctx(prev, next);
+}
+
 static int intel_pmu_check_period(struct perf_event *event, u64 value)
 {
 	return intel_pmu_has_bts_period(event, value) ? -EINVAL : 0;
@@ -3955,6 +3961,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 
 	.guest_get_msrs		= intel_guest_get_msrs,
 	.sched_task		= intel_pmu_sched_task,
+	.swap_task_ctx		= intel_pmu_swap_task_ctx,
 
 	.check_period		= intel_pmu_check_period,
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9a5d4356562..ed31aa849161 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3204,10 +3204,21 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 		raw_spin_lock(&ctx->lock);
 		raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
 		if (context_equiv(ctx, next_ctx)) {
+			struct pmu *pmu = ctx->pmu;
+
 			WRITE_ONCE(ctx->task, next);
 			WRITE_ONCE(next_ctx->task, task);
 
-			swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+			/*
+			 * PMU specific parts of task perf context can require
+			 * additional synchronization. As an example of such
+			 * synchronization see implementation details of Intel
+			 * LBR call stack data profiling;
+			 */
+			if (pmu->swap_task_ctx)
+				pmu->swap_task_ctx(ctx, next_ctx);
+			else
+				swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: perf/core] perf/x86: Synchronize PMU task contexts on optimized context switches
  2019-10-23  7:13 ` [PATCH v5 4/4] perf/core,x86: synchronize PMU task contexts on optimized context switches Alexey Budankov
@ 2019-10-28 12:43   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-10-28 12:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Budankov, Peter Zijlstra (Intel),
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Kan Liang,
	Linus Torvalds, Mark Rutland, Namhyung Kim, Song Liu,
	Stephane Eranian, Thomas Gleixner, Vince Weaver, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     c2b98a8661514f29a44ebd0925cf4b1503beb48c
Gitweb:        https://git.kernel.org/tip/c2b98a8661514f29a44ebd0925cf4b1503beb48c
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Wed, 23 Oct 2019 10:13:56 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 28 Oct 2019 12:51:01 +01:00

perf/x86: Synchronize PMU task contexts on optimized context switches

Install Intel specific PMU task context synchronization adapter and
extend optimized context switch path with PMU specific task context
synchronization to fix LBR callstack virtualization on context switches.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/9c6445a9-bdba-ef03-3859-f1f91198f27a@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/core.c |  7 +++++++
 kernel/events/core.c         | 13 ++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index bbf6588..dc64b16 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3820,6 +3820,12 @@ static void intel_pmu_sched_task(struct perf_event_context *ctx,
 	intel_pmu_lbr_sched_task(ctx, sched_in);
 }
 
+static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
+				    struct perf_event_context *next)
+{
+	intel_pmu_lbr_swap_task_ctx(prev, next);
+}
+
 static int intel_pmu_check_period(struct perf_event *event, u64 value)
 {
 	return intel_pmu_has_bts_period(event, value) ? -EINVAL : 0;
@@ -3955,6 +3961,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 
 	.guest_get_msrs		= intel_guest_get_msrs,
 	.sched_task		= intel_pmu_sched_task,
+	.swap_task_ctx		= intel_pmu_swap_task_ctx,
 
 	.check_period		= intel_pmu_check_period,
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0940c88..f48d38b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3204,10 +3204,21 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 		raw_spin_lock(&ctx->lock);
 		raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
 		if (context_equiv(ctx, next_ctx)) {
+			struct pmu *pmu = ctx->pmu;
+
 			WRITE_ONCE(ctx->task, next);
 			WRITE_ONCE(next_ctx->task, task);
 
-			swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+			/*
+			 * PMU specific parts of task perf context can require
+			 * additional synchronization. As an example of such
+			 * synchronization see implementation details of Intel
+			 * LBR call stack data profiling;
+			 */
+			if (pmu->swap_task_ctx)
+				pmu->swap_task_ctx(ctx, next_ctx);
+			else
+				swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: perf/core] perf/x86/intel: Implement LBR callstack context synchronization
  2019-10-23  7:12 ` [PATCH v5 3/4] perf/x86/intel: implement LBR callstacks context synchronization Alexey Budankov
@ 2019-10-28 12:43   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-10-28 12:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Budankov, Peter Zijlstra (Intel),
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Kan Liang,
	Linus Torvalds, Mark Rutland, Namhyung Kim, Song Liu,
	Stephane Eranian, Thomas Gleixner, Vince Weaver, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     421ca868ea3b7c1ca1a541ed6dff3c101a563b95
Gitweb:        https://git.kernel.org/tip/421ca868ea3b7c1ca1a541ed6dff3c101a563b95
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Wed, 23 Oct 2019 10:12:54 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 28 Oct 2019 12:51:01 +01:00

perf/x86/intel: Implement LBR callstack context synchronization

Implement intel_pmu_lbr_swap_task_ctx() method updating counters
of the events that requested LBR callstack data on a sample.

The counter can be zero for the case when task context belongs to
a thread that has just come from a block on a futex and the context
contains saved (lbr_stack_state == LBR_VALID) LBR register values.

For the values to be restored at LBR registers on the next thread's
switch-in event it swaps the counter value with the one that is
expected to be non zero at the previous equivalent task perf event
context.

Swap operation type ensures the previous task perf event context
stays consistent with the amount of events that requested LBR
callstack data on a sample.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/261ac742-9022-c3f4-5885-1eae7415b091@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/lbr.c  | 23 +++++++++++++++++++++++
 arch/x86/events/perf_event.h |  3 +++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index ea54634..534c766 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -417,6 +417,29 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 	cpuc->last_log_id = ++task_ctx->log_id;
 }
 
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
+				 struct perf_event_context *next)
+{
+	struct x86_perf_task_context *prev_ctx_data, *next_ctx_data;
+
+	swap(prev->task_ctx_data, next->task_ctx_data);
+
+	/*
+	 * Architecture specific synchronization makes sense in
+	 * case both prev->task_ctx_data and next->task_ctx_data
+	 * pointers are allocated.
+	 */
+
+	prev_ctx_data = next->task_ctx_data;
+	next_ctx_data = prev->task_ctx_data;
+
+	if (!prev_ctx_data || !next_ctx_data)
+		return;
+
+	swap(prev_ctx_data->lbr_callstack_users,
+	     next_ctx_data->lbr_callstack_users);
+}
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5384317..930611d 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1024,6 +1024,9 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
+				 struct perf_event_context *next);
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
 
 u64 lbr_from_signext_quirk_wr(u64 val);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: perf/core] perf/x86: Install platform specific ->swap_task_ctx() adapter
  2019-10-23  7:11 ` [PATCH v5 2/4] perf/x86: install platform specific swap_task_ctx adapter Alexey Budankov
@ 2019-10-28 12:43   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-10-28 12:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Budankov, Peter Zijlstra (Intel),
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Kan Liang,
	Linus Torvalds, Mark Rutland, Namhyung Kim, Song Liu,
	Stephane Eranian, Thomas Gleixner, Vince Weaver, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     a44399703b4893de4eadb970867fd5efd4461514
Gitweb:        https://git.kernel.org/tip/a44399703b4893de4eadb970867fd5efd4461514
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Wed, 23 Oct 2019 10:11:54 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 28 Oct 2019 12:51:00 +01:00

perf/x86: Install platform specific ->swap_task_ctx() adapter

Bridge perf core and x86 swap_task_ctx() method calls.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/b157e97d-32c3-aeaf-13ba-47350c677906@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7b21455..6e3f0c1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2243,6 +2243,13 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 		x86_pmu.sched_task(ctx, sched_in);
 }
 
+static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
+				  struct perf_event_context *next)
+{
+	if (x86_pmu.swap_task_ctx)
+		x86_pmu.swap_task_ctx(prev, next);
+}
+
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -2297,6 +2304,7 @@ static struct pmu pmu = {
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
 	.task_ctx_size          = sizeof(struct x86_perf_task_context),
+	.swap_task_ctx		= x86_pmu_swap_task_ctx,
 	.check_period		= x86_pmu_check_period,
 
 	.aux_output_match	= x86_pmu_aux_output_match,

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: perf/core] perf/core, perf/x86: Introduce swap_task_ctx() method at 'struct pmu'
  2019-10-23  7:11 ` [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu Alexey Budankov
@ 2019-10-28 12:43   ` tip-bot2 for Alexey Budankov
  0 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Alexey Budankov @ 2019-10-28 12:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Budankov, Peter Zijlstra (Intel),
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Ian Rogers, Jiri Olsa, Kan Liang,
	Linus Torvalds, Mark Rutland, Namhyung Kim, Song Liu,
	Stephane Eranian, Thomas Gleixner, Vince Weaver, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     fc1adfe306b71e094df636012f8c0fed971cad45
Gitweb:        https://git.kernel.org/tip/fc1adfe306b71e094df636012f8c0fed971cad45
Author:        Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate:    Wed, 23 Oct 2019 10:11:04 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 28 Oct 2019 12:50:59 +01:00

perf/core, perf/x86: Introduce swap_task_ctx() method at 'struct pmu'

Declare swap_task_ctx() methods at the generic and x86 specific
pmu types to bridge calls to platform specific PMU code on optimized
context switch path between equivalent task perf event contexts.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/9a0aa84a-f062-9b64-3133-373658550c4b@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/perf_event.h |  8 ++++++++
 include/linux/perf_event.h   |  9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ecacfbf..5384317 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -683,6 +683,14 @@ struct x86_pmu {
 	atomic_t	lbr_exclusive[x86_lbr_exclusive_max];
 
 	/*
+	 * perf task context (i.e. struct perf_event_context::task_ctx_data)
+	 * switch helper to bridge calls from perf/core to perf/x86.
+	 * See struct pmu::swap_task_ctx() usage for examples;
+	 */
+	void		(*swap_task_ctx)(struct perf_event_context *prev,
+					 struct perf_event_context *next);
+
+	/*
 	 * AMD bits
 	 */
 	unsigned int	amd_nb_constraints : 1;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4f77b22..011dcbd 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -410,6 +410,15 @@ struct pmu {
 	 */
 	size_t				task_ctx_size;
 
+	/*
+	 * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data)
+	 * can be synchronized using this function. See Intel LBR callstack support
+	 * implementation and Perf core context switch handling callbacks for usage
+	 * examples.
+	 */
+	void (*swap_task_ctx)		(struct perf_event_context *prev,
+					 struct perf_event_context *next);
+					/* optional */
 
 	/*
 	 * Set up pmu-private data structures for an AUX area

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu
  2019-10-25  8:35 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
@ 2019-10-25  8:58 ` Alexey Budankov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexey Budankov @ 2019-10-25  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


Declare swap_task_ctx() methods at the generic and x86 specific
pmu types to bridge calls to platform specific pmu code on optimized
context switch path between equivalent task perf event contexts.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 arch/x86/events/perf_event.h | 8 ++++++++
 include/linux/perf_event.h   | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ecacfbf4ebc1..5384317eaa16 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -682,6 +682,14 @@ struct x86_pmu {
 	 */
 	atomic_t	lbr_exclusive[x86_lbr_exclusive_max];
 
+	/*
+	 * perf task context (i.e. struct perf_event_context::task_ctx_data)
+	 * switch helper to bridge calls from perf/core to perf/x86.
+	 * See struct pmu::swap_task_ctx() usage for examples;
+	 */
+	void		(*swap_task_ctx)(struct perf_event_context *prev,
+					 struct perf_event_context *next);
+
 	/*
 	 * AMD bits
 	 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 587ae4d002f5..7887e4a3d487 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -410,6 +410,15 @@ struct pmu {
 	 */
 	size_t				task_ctx_size;
 
+	/*
+	 * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data)
+	 * can be synchronized using this function. See Intel LBR callstack support
+	 * implementation and Perf core context switch handling callbacks for usage
+	 * examples.
+	 */
+	void (*swap_task_ctx)		(struct perf_event_context *prev,
+					 struct perf_event_context *next);
+					/* optional */
 
 	/*
 	 * Set up pmu-private data structures for an AUX area
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-10-28 12:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-23  7:05 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
2019-10-23  7:11 ` [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu Alexey Budankov
2019-10-28 12:43   ` [tip: perf/core] perf/core, perf/x86: Introduce swap_task_ctx() method at 'struct pmu' tip-bot2 for Alexey Budankov
2019-10-23  7:11 ` [PATCH v5 2/4] perf/x86: install platform specific swap_task_ctx adapter Alexey Budankov
2019-10-28 12:43   ` [tip: perf/core] perf/x86: Install platform specific ->swap_task_ctx() adapter tip-bot2 for Alexey Budankov
2019-10-23  7:12 ` [PATCH v5 3/4] perf/x86/intel: implement LBR callstacks context synchronization Alexey Budankov
2019-10-28 12:43   ` [tip: perf/core] perf/x86/intel: Implement LBR callstack " tip-bot2 for Alexey Budankov
2019-10-23  7:13 ` [PATCH v5 4/4] perf/core,x86: synchronize PMU task contexts on optimized context switches Alexey Budankov
2019-10-28 12:43   ` [tip: perf/core] perf/x86: Synchronize " tip-bot2 for Alexey Budankov
2019-10-25  8:35 [PATCH v5 0/4] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
2019-10-25  8:58 ` [PATCH v5 1/4] perf/core,x86: introduce swap_task_ctx() method at struct pmu Alexey Budankov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).