linux-kernel.vger.kernel.org archive mirror
* [PATCH v1] perf/core: fix restoring of Intel LBR call stack on a context switch
@ 2019-10-14  6:08 Alexey Budankov
  2019-10-15  9:16 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Alexey Budankov @ 2019-10-14  6:08 UTC (permalink / raw)
  To: Peter Zijlstra, Arnaldo Carvalho de Melo, Ingo Molnar
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen,
	Kan Liang, Stephane Eranian, Ian Rogers, Song Liu, linux-kernel


Restore the Intel LBR call stack from a cloned, inactive task perf
context on a context switch. This change addresses an inconsistency in
the LBR call stack data provided with samples in record profiling mode,
for example:

  $ perf record -N -B -T -R --call-graph lbr \
         -e cpu/period=0xcdfe60,event=0x3c,name=\'CPU_CLK_UNHALTED.THREAD\'/Duk \
         --clockid=monotonic_raw -- ./miniFE.x nx 25 ny 25 nz 25

Let's assume threads A, B and C belong to the same process.
B and C are siblings of A and their perf contexts are treated as equivalent.
At some point B blocks on a futex (a non-preempt context switch).
B's LBRs are preserved in B's perf context task_ctx_data and B's events
are removed from the PMU and disabled. B's perf context becomes inactive.
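
For reference, a simplified sketch of the per-task LBR state kept in
task_ctx_data (modelled on struct x86_perf_task_context in
arch/x86/events/perf_event.h; fields trimmed for brevity):

  struct x86_perf_task_context {
          u64 lbr_from[MAX_LBR_ENTRIES];  /* saved LBR 'from' addresses     */
          u64 lbr_to[MAX_LBR_ENTRIES];    /* saved LBR 'to' addresses       */
          int tos;                        /* saved LBR top-of-stack         */
          int lbr_callstack_users;        /* call-stack users of this ctx   */
          int lbr_stack_state;            /* LBR_NONE / LBR_VALID           */
  };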

Later C gets on a CPU, runs, gets profiled and eventually switches to
the awakened but not yet running B. The optimized context switch path is
executed, copying B's task_ctx_data to C's and updating B's perf context
pointer to refer to C's task_ctx_data, which contains B's preserved LBRs
after the copy.
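
A simplified sketch of the optimized path in perf_event_context_sched_out()
(kernel/events/core.c) that exchanges the cloned contexts' state, with
locking, clone checks and error handling trimmed:

  /* ctx belongs to the outgoing task, next_ctx to the incoming one */
  if (context_equiv(ctx, next_ctx)) {
          WRITE_ONCE(ctx->task, next);
          WRITE_ONCE(next_ctx->task, task);

          /* the preserved LBRs travel with the exchanged task_ctx_data */
          swap(ctx->task_ctx_data, next_ctx->task_ctx_data);

          /* no per-event sched_out/sched_in is performed on this path */
          do_switch = 0;
  }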

However, since B's perf context is inactive, there are no enabled events
in it and B's task_ctx_data->lbr_callstack_users is equal to 0.
When B gets back on the CPU, reviving B's events is skipped on
the optimized context switch path, so B's task_ctx_data->lbr_callstack_users
remains 0. Thus B's LBRs are not restored by the PMU sched_task() code called
at the end of the perf context switch sched_in callback for B.
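
The restore that ends up being skipped happens in the Intel LBR
sched_task() path; a simplified sketch of the relevant check (modelled
on __intel_pmu_lbr_restore() in arch/x86/events/intel/lbr.c, details
trimmed):

  static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
  {
          /*
           * With lbr_callstack_users still 0 (or no valid saved stack),
           * the saved LBRs are not written back to the LBR MSRs; the LBR
           * stack is reset instead and the call-stack history is lost.
           */
          if (task_ctx->lbr_callstack_users == 0 ||
              task_ctx->lbr_stack_state == LBR_NONE) {
                  intel_pmu_lbr_reset();
                  return;
          }

          /* ... otherwise write lbr_from/lbr_to entries back to the MSRs ... */
  }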

In the report this manifests as short fragments of B's call stack,
still tracked by the LBR hardware between adjacent samples, but the
thread's call tree as a whole does not aggregate.

The fix has been evaluated by profiling the miniFE [1] (C++, OpenMP)
workload running 64 threads on an Intel Skylake EP (64 cores, 2 sockets):

  $ perf report --call-graph callee,flat

5.3.0-rc6+ (tip perf/core) - fixed

-   92.66%    82.64%  miniFE.x  libiomp5.so         [.] _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
   - 69.14% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
   - 21.89% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
   - 1.63% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        main
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone

5.0.13-300.fc30.x86_64 - no fix

-   90.29%    81.01%  miniFE.x  libiomp5.so         [.] _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
   - 33.45% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
     87.63% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
   - 54.79% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
   - 9.18% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
   - 41.28% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_fork_barrier
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
   - 15.77% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
   - 11.56% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        miniFE::cg_solve<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, int>, miniFE::matvec_std<miniFE::CSRMatrix<double, int, int>, miniFE::Vector<double, int, in
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
   - 2.33% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_release
        __kmp_barrier
        __kmpc_reduce_nowait
        main
        __kmp_invoke_microtask
        __kmp_invoke_task_func
        __kmp_launch_thread
        _INTERNAL_24_______src_z_Linux_util_c_3e0095e6::__kmp_launch_worker
        start_thread
        __clone
     0.67% _INTERNAL_25_______src_kmp_barrier_cpp_1d20fae8::__kmp_hyper_barrier_gather
     0.57% __kmp_hardware_timestamp

[1] https://www.hpcadvisorycouncil.com/pdf/miniFE_Analysis_and_Profiling.pdf

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 kernel/events/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2aad959e6def..74c2ff38e079 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3181,7 +3181,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 
 	rcu_read_lock();
 	next_ctx = next->perf_event_ctxp[ctxn];
-	if (!next_ctx)
+	if (!next_ctx || !next_ctx->is_active)
 		goto unlock;
 
 	parent = rcu_dereference(ctx->parent_ctx);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v1] perf/core: fix restoring of Intel LBR call stack on a context switch
  2019-10-14  6:08 [PATCH v1] perf/core: fix restoring of Intel LBR call stack on a context switch Alexey Budankov
@ 2019-10-15  9:16 ` Peter Zijlstra
  2019-10-15 10:55   ` Alexey Budankov
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2019-10-15  9:16 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel

On Mon, Oct 14, 2019 at 09:08:34AM +0300, Alexey Budankov wrote:
> 
> Restore the Intel LBR call stack from a cloned, inactive task perf
> context on a context switch. This change addresses an inconsistency in
> the LBR call stack data provided with samples in record profiling mode,
> for example:
> 
>   $ perf record -N -B -T -R --call-graph lbr \
>          -e cpu/period=0xcdfe60,event=0x3c,name=\'CPU_CLK_UNHALTED.THREAD\'/Duk \
>          --clockid=monotonic_raw -- ./miniFE.x nx 25 ny 25 nz 25
> 
> Let's assume threads A, B and C belong to the same process.
> B and C are siblings of A and their perf contexts are treated as equivalent.
> At some point B blocks on a futex (a non-preempt context switch).
> B's LBRs are preserved in B's perf context task_ctx_data and B's events
> are removed from the PMU and disabled. B's perf context becomes inactive.
> 
> Later C gets on a CPU, runs, gets profiled and eventually switches to
> the awakened but not yet running B. The optimized context switch path is
> executed, copying B's task_ctx_data to C's and updating B's perf context
> pointer to refer to C's task_ctx_data, which contains B's preserved LBRs
> after the copy.
> 
> However, since B's perf context is inactive, there are no enabled events
> in it and B's task_ctx_data->lbr_callstack_users is equal to 0.
> When B gets back on the CPU, reviving B's events is skipped on
> the optimized context switch path, so B's task_ctx_data->lbr_callstack_users
> remains 0. Thus B's LBRs are not restored by the PMU sched_task() code called
> at the end of the perf context switch sched_in callback for B.
> 
> In the report this manifests as short fragments of B's call stack,
> still tracked by the LBR hardware between adjacent samples, but the
> thread's call tree as a whole does not aggregate.
> 

> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  kernel/events/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2aad959e6def..74c2ff38e079 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3181,7 +3181,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
>  
>  	rcu_read_lock();
>  	next_ctx = next->perf_event_ctxp[ctxn];
> -	if (!next_ctx)
> +	if (!next_ctx || !next_ctx->is_active)
>  		goto unlock;

AFAICT this completely kills off the optimization. next_ctx->is_active
cannot be set at this point.



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v1] perf/core: fix restoring of Intel LBR call stack on a context switch
  2019-10-15  9:16 ` Peter Zijlstra
@ 2019-10-15 10:55   ` Alexey Budankov
  0 siblings, 0 replies; 3+ messages in thread
From: Alexey Budankov @ 2019-10-15 10:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, Kan Liang, Stephane Eranian,
	Ian Rogers, Song Liu, linux-kernel


On 15.10.2019 12:16, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 09:08:34AM +0300, Alexey Budankov wrote:
>>
>> Restore the Intel LBR call stack from a cloned, inactive task perf
>> context on a context switch. This change addresses an inconsistency in
>> the LBR call stack data provided with samples in record profiling mode,
>> for example:
>>
>>   $ perf record -N -B -T -R --call-graph lbr \
>>          -e cpu/period=0xcdfe60,event=0x3c,name=\'CPU_CLK_UNHALTED.THREAD\'/Duk \
>>          --clockid=monotonic_raw -- ./miniFE.x nx 25 ny 25 nz 25
>>
>> Let's assume threads A, B and C belong to the same process.
>> B and C are siblings of A and their perf contexts are treated as equivalent.
>> At some point B blocks on a futex (a non-preempt context switch).
>> B's LBRs are preserved in B's perf context task_ctx_data and B's events
>> are removed from the PMU and disabled. B's perf context becomes inactive.
>>
>> Later C gets on a CPU, runs, gets profiled and eventually switches to
>> the awakened but not yet running B. The optimized context switch path is
>> executed, copying B's task_ctx_data to C's and updating B's perf context
>> pointer to refer to C's task_ctx_data, which contains B's preserved LBRs
>> after the copy.
>>
>> However, since B's perf context is inactive, there are no enabled events
>> in it and B's task_ctx_data->lbr_callstack_users is equal to 0.
>> When B gets back on the CPU, reviving B's events is skipped on
>> the optimized context switch path, so B's task_ctx_data->lbr_callstack_users
>> remains 0. Thus B's LBRs are not restored by the PMU sched_task() code called
>> at the end of the perf context switch sched_in callback for B.
>>
>> In the report this manifests as short fragments of B's call stack,
>> still tracked by the LBR hardware between adjacent samples, but the
>> thread's call tree as a whole does not aggregate.
>>
> 
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  kernel/events/core.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 2aad959e6def..74c2ff38e079 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -3181,7 +3181,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
>>  
>>  	rcu_read_lock();
>>  	next_ctx = next->perf_event_ctxp[ctxn];
>> -	if (!next_ctx)
>> +	if (!next_ctx || !next_ctx->is_active)
>>  		goto unlock;
> 
> AFAICT this completely kills off the optimization. next_ctx->is_active
> cannot be set at this point.

Hmm, the intention was to skip the optimization path only once, when
switching to the just-resumed thread. Thanks for the observation.

~Alexey

^ permalink raw reply	[flat|nested] 3+ messages in thread

