* [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: kan.liang @ 2018-09-13 20:08 UTC
  To: peterz, tglx, acme, mingo, linux-kernel; +Cc: ak, jolsa, namhyung, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

LBR can bring big overhead when the benchmark has a high context switch
rate, for example the avrora sub-benchmark of DaCapo.

Baseline: java -jar dacapo-9.12-MR1-bach.jar avrora -n 20
With LBR: perf record --branch-filter any,u -- java -jar
dacapo-9.12-MR1-bach.jar avrora -n 20

Baseline (ms)    With LBR (ms)    Overhead
6508		 19831		  205%

In principle the LBRs need to be flushed between threads, and the
current code does so.

However, in practice the LBRs are overwritten very quickly once any code
runs, so a small leak shortly after each context switch is unlikely to
be a functional problem for LBR sampling.
It is mainly a security concern: we don't want to leak anything to an
attacker.

Different threads in a process already must trust each other, so we can
safely leak in this case without opening security holes.

The same holds when switching to kernel threads (such as in the common
switch-to-idle case), which share the same mm and are guaranteed not to
be attackers.

For those cases, resetting the LBRs can safely be avoided.
Check the mm's ctx_id and only reset the LBRs when switching to a
different user process.

With the patch,
Baseline (ms)    With LBR (ms)    Overhead
6508		 10350            59%

Reported-by: Sandhya Viswanathan <sandhya.viswanathan@intel.com>
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/lbr.c  | 16 ++++++++++++++--
 arch/x86/events/perf_event.h |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index f3e006b..26344c4 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -444,9 +444,21 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 	 * are not tagged with an identifier, we need to wipe the LBR, even for
 	 * per-cpu events. You simply cannot resolve the branches from the old
 	 * address space.
+	 * We don't need to wipe the LBR for a kernel thread which shares
+	 * the same mm with the previous user thread.
 	 */
-	if (sched_in)
-		intel_pmu_lbr_reset();
+	if (!current || !current->mm)
+		return;
+	if (sched_in) {
+		/*
+		 * Only flush when switching to a user thread
+		 * and the mm context changed.
+		 */
+		if (current->mm->context.ctx_id != cpuc->last_ctx_id)
+			intel_pmu_lbr_reset();
+	} else {
+		cpuc->last_ctx_id = current->mm->context.ctx_id;
+	}
 }
 
 static inline bool branch_user_callstack(unsigned br_sel)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1562863..3aa3379 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -217,6 +217,7 @@ struct cpu_hw_events {
 	u64				br_sel;
 	struct x86_perf_task_context	*last_task_ctx;
 	int				last_log_id;
+	u64				last_ctx_id;
 
 	/*
 	 * Intel host/guest exclude bits
-- 
2.4.11



* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Alexey Budankov @ 2018-09-14  6:47 UTC
  To: linux-kernel-owner, peterz, tglx, acme, mingo, linux-kernel
  Cc: ak, jolsa, namhyung, Kan Liang

Hi,

On 13.09.2018 23:08, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> LBR can bring big overhead when the benchmark has a high context switch
> rate, for example the avrora sub-benchmark of DaCapo.
> 
> Baseline: java -jar dacapo-9.12-MR1-bach.jar avrora -n 20
> With LBR: perf record --branch-filter any,u -- java -jar
> dacapo-9.12-MR1-bach.jar avrora -n 20
> 
> Baseline (ms)    With LBR (ms)    Overhead
> 6508		 19831		  205%
> 
> In principle the LBRs need to be flushed between threads, and the
> current code does so.

IMHO, ideally, the LBR stack would be preserved and restored when
switching between execution contexts. That would allow implementing
a per-thread statistical call graph view in the perf tools, fully
based on HW capabilities. It could be advantageous in some cases,
compared with the traditional DWARF-based call graph.

To me such virtualization looks similar to e.g. the HW performance
counters, whose values are switched back and forth from the perf_event
object on context switches. But this is surely a bigger effort.
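
Something along these lines, as a rough sketch only (the function names
are mine; it reuses the lbr_from/lbr_to MSR bases and the
x86_perf_task_context fields that lbr.c already has, and real code
would also need to handle the TOS and lbr_info state):

static void lbr_ctx_save(struct x86_perf_task_context *task_ctx)
{
	int i;

	/* Read every LBR entry out of the MSRs into the task context. */
	for (i = 0; i < x86_pmu.lbr_nr; i++) {
		rdmsrl(x86_pmu.lbr_from + i, task_ctx->lbr_from[i]);
		rdmsrl(x86_pmu.lbr_to + i, task_ctx->lbr_to[i]);
	}
	task_ctx->lbr_stack_state = LBR_VALID;
}

static void lbr_ctx_restore(struct x86_perf_task_context *task_ctx)
{
	int i;

	/* Nothing saved for this task yet: fall back to a plain reset. */
	if (task_ctx->lbr_stack_state != LBR_VALID) {
		intel_pmu_lbr_reset();
		return;
	}

	/* Write the saved entries back into the LBR MSRs. */
	for (i = 0; i < x86_pmu.lbr_nr; i++) {
		wrmsrl(x86_pmu.lbr_from + i, task_ctx->lbr_from[i]);
		wrmsrl(x86_pmu.lbr_to + i, task_ctx->lbr_to[i]);
	}
	task_ctx->lbr_stack_state = LBR_NONE;
}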

Thanks,
Alexey

> 
> However, in practice the LBRs are overwritten very quickly once any code
> runs, so a small leak shortly after each context switch is unlikely to
> be a functional problem for LBR sampling.
> It is mainly a security concern: we don't want to leak anything to an
> attacker.
> 
> Different threads in a process already must trust each other, so we can
> safely leak in this case without opening security holes.
> 
> The same holds when switching to kernel threads (such as in the common
> switch-to-idle case), which share the same mm and are guaranteed not to
> be attackers.
> 
> For those cases, resetting the LBRs can safely be avoided.
> Check the mm's ctx_id and only reset the LBRs when switching to a
> different user process.
> 
> With the patch,
> Baseline (ms)    With LBR (ms)    Overhead
> 6508		 10350            59%
> 
> Reported-by: Sandhya Viswanathan <sandhya.viswanathan@intel.com>
> Suggested-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  arch/x86/events/intel/lbr.c  | 16 ++++++++++++++--
>  arch/x86/events/perf_event.h |  1 +
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index f3e006b..26344c4 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -444,9 +444,21 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>  	 * are not tagged with an identifier, we need to wipe the LBR, even for
>  	 * per-cpu events. You simply cannot resolve the branches from the old
>  	 * address space.
> +	 * We don't need to wipe the LBR for a kernel thread which shares
> +	 * the same mm with the previous user thread.
>  	 */
> -	if (sched_in)
> -		intel_pmu_lbr_reset();
> +	if (!current || !current->mm)
> +		return;
> +	if (sched_in) {
> +		/*
> +		 * Only flush when switching to a user thread
> +		 * and the mm context changed.
> +		 */
> +		if (current->mm->context.ctx_id != cpuc->last_ctx_id)
> +			intel_pmu_lbr_reset();
> +	} else {
> +		cpuc->last_ctx_id = current->mm->context.ctx_id;
> +	}
>  }
>  
>  static inline bool branch_user_callstack(unsigned br_sel)
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 1562863..3aa3379 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -217,6 +217,7 @@ struct cpu_hw_events {
>  	u64				br_sel;
>  	struct x86_perf_task_context	*last_task_ctx;
>  	int				last_log_id;
> +	u64				last_ctx_id;
>  
>  	/*
>  	 * Intel host/guest exclude bits
> 


* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Andi Kleen @ 2018-09-14  8:54 UTC
  To: Alexey Budankov
  Cc: linux-kernel-owner, peterz, tglx, acme, mingo, linux-kernel,
	jolsa, namhyung, Kan Liang

> > In principle the LBRs need to be flushed between threads, and the
> > current code does so.
> 
> IMHO, ideally, the LBR stack would be preserved and restored when
> switching between execution contexts. That would allow implementing
> a per-thread statistical call graph view in the perf tools, fully
> based on HW capabilities. It could be advantageous in some cases,
> compared with the traditional DWARF-based call graph.

This is already supported when you use LBR call stack mode
(perf record --call-graph lbr)

This change is only optimizing the case when call stack mode is not used.

Of course in call stack mode the context switch overhead is even higher,
because it not only writes, but also reads.
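
From memory, the relevant dispatch in intel_pmu_lbr_sched_task() looks
roughly like this (a paraphrase, not the exact code of any particular
kernel version):

void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
{
	struct x86_perf_task_context *task_ctx;

	/*
	 * Call stack mode: the LBR state lives in the task context, so
	 * it is saved (read) on sched-out and restored (written) on
	 * sched-in; that is twice the MSR traffic of a plain reset.
	 */
	task_ctx = ctx ? ctx->task_ctx_data : NULL;
	if (task_ctx) {
		if (sched_in)
			__intel_pmu_lbr_restore(task_ctx);
		else
			__intel_pmu_lbr_save(task_ctx);
		return;
	}

	/* Otherwise just wipe stale branches when scheduling in. */
	if (sched_in)
		intel_pmu_lbr_reset();
}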

-Andi


* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Alexey Budankov @ 2018-09-14  9:22 UTC
  To: Andi Kleen
  Cc: linux-kernel-owner, peterz, tglx, acme, mingo, linux-kernel,
	jolsa, namhyung, Kan Liang


Hi Andi,

On 14.09.2018 11:54, Andi Kleen wrote:
>>> In principle the LBRs need to be flushed between threads, and the
>>> current code does so.
>>
>> IMHO, ideally, the LBR stack would be preserved and restored when
>> switching between execution contexts. That would allow implementing
>> a per-thread statistical call graph view in the perf tools, fully
>> based on HW capabilities. It could be advantageous in some cases,
>> compared with the traditional DWARF-based call graph.
> 
> This is already supported when you use LBR call stack mode
> (perf record --call-graph lbr)

Which kernel versions does it make sense to try?

Thanks,
Alexey

> 
> This change is only optimizing the case when call stack mode is not used.
> 
> Of course in call stack mode the context switch overhead is even higher,
> because it not only writes, but also reads.
> 
> -Andi
> 


* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Liang, Kan @ 2018-09-14 12:39 UTC
  To: Alexey Budankov, Andi Kleen
  Cc: linux-kernel-owner, peterz, tglx, acme, mingo, linux-kernel,
	jolsa, namhyung



On 9/14/2018 5:22 AM, Alexey Budankov wrote:
> 
> Hi Andi,
> 
> On 14.09.2018 11:54, Andi Kleen wrote:
>>>> In principle the LBRs need to be flushed between threads, and the
>>>> current code does so.
>>>
>>> IMHO, ideally, the LBR stack would be preserved and restored when
>>> switching between execution contexts. That would allow implementing
>>> a per-thread statistical call graph view in the perf tools, fully
>>> based on HW capabilities. It could be advantageous in some cases,
>>> compared with the traditional DWARF-based call graph.
>>
>> This is already supported when you use LBR call stack mode
>> (perf record --call-graph lbr)
> 
> Which kernel versions does it make sense to try?
>

The optimization for LBR call stack has been merged into 4.19.
commit id: 8b077e4a69bef5c4121426e99497975860191e53
perf/x86/intel/lbr: Optimize context switches for the LBR call stack

Thanks,
Kan

> Thanks,
> Alexey
> 
>>
>> This change is only optimizing the case when call stack mode is not used.
>>
>> Of course in call stack mode the context switch overhead is even higher,
>> because it not only writes, but also reads.
>>
>> -Andi
>>


* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Andi Kleen @ 2018-09-14 14:27 UTC
  To: Liang, Kan
  Cc: Alexey Budankov, linux-kernel-owner, peterz, tglx, acme, mingo,
	linux-kernel, jolsa, namhyung

On Fri, Sep 14, 2018 at 08:39:36AM -0400, Liang, Kan wrote:
> 
> 
> On 9/14/2018 5:22 AM, Alexey Budankov wrote:
> > 
> > Hi Andi,
> > 
> > On 14.09.2018 11:54, Andi Kleen wrote:
> > > > > In principle the LBRs need to be flushed between threads, and the
> > > > > current code does so.
> > > > 
> > > > IMHO, ideally, the LBR stack would be preserved and restored when
> > > > switching between execution contexts. That would allow implementing
> > > > a per-thread statistical call graph view in the perf tools, fully
> > > > based on HW capabilities. It could be advantageous in some cases,
> > > > compared with the traditional DWARF-based call graph.
> > > 
> > > This is already supported when you use LBR call stack mode
> > > (perf record --call-graph lbr)
> > 
> > Which kernel versions does it make sense to try?
> > 
> 
> The optimization for LBR call stack has been merged into 4.19.
> commit id: 8b077e4a69bef5c4121426e99497975860191e53
> perf/x86/intel/lbr: Optimize context switches for the LBR call stack

I think he means support for the LBR call stack in general. It has been
there for a long time (since Haswell); any reasonable kernel version
should support it.

The commit Kan pointed out just optimizes it for cases where it is not
needed, like switching to the kernel, because we only use the LBR call
stack for ring 3.
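
The gist, going by the last_task_ctx/last_log_id fields you can see in
the perf_event.h hunk above (a sketch of the idea, not the literal
commit):

static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

	/*
	 * If this task is also the one whose LBR state was saved last
	 * on this CPU (e.g. only a kernel thread ran in between), the
	 * LBR MSRs still hold its branches; skip the wrmsrl() loop.
	 */
	if (task_ctx == cpuc->last_task_ctx &&
	    task_ctx->log_id == cpuc->last_log_id)
		return;

	/* ... otherwise write lbr_from/lbr_to back as usual ... */
}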

-Andi


* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Liang, Kan @ 2018-09-14 14:57 UTC
  To: Andi Kleen
  Cc: Alexey Budankov, linux-kernel-owner, peterz, tglx, acme, mingo,
	linux-kernel, jolsa, namhyung



On 9/14/2018 10:27 AM, Andi Kleen wrote:
> On Fri, Sep 14, 2018 at 08:39:36AM -0400, Liang, Kan wrote:
>>
>>
>> On 9/14/2018 5:22 AM, Alexey Budankov wrote:
>>>
>>> Hi Andi,
>>>
>>> On 14.09.2018 11:54, Andi Kleen wrote:
>>>>>> In principle the LBRs need to be flushed between threads, and the
>>>>>> current code does so.
>>>>>
>>>>> IMHO, ideally, the LBR stack would be preserved and restored when
>>>>> switching between execution contexts. That would allow implementing
>>>>> a per-thread statistical call graph view in the perf tools, fully
>>>>> based on HW capabilities. It could be advantageous in some cases,
>>>>> compared with the traditional DWARF-based call graph.
>>>>
>>>> This is already supported when you use LBR call stack mode
>>>> (perf record --call-graph lbr)
>>>
>>> Which kernel versions does it make sense to try?
>>>
>>
>> The optimization for LBR call stack has been merged into 4.19.
>> commit id: 8b077e4a69bef5c4121426e99497975860191e53
>> perf/x86/intel/lbr: Optimize context switches for the LBR call stack
> 
> I think he means support for the LBR call stack in general. It has been
> there for a long time (since Haswell); any reasonable kernel version
> should support it.
>

Oh, I see. Yes, the LBR call stack feature was added a long time ago.
But I still recommend 4.19, because it includes a recent bug fix for
the LBR call stack.

commit id: 0592e57b24e7e05ec1f4c50b9666c013abff7017
perf/x86/intel/lbr: Fix incomplete LBR call stack

Thanks,
Kan


* Re: [PATCH] perf/x86/intel/lbr: Optimize context switches for LBR
From: Alexey Budankov @ 2018-09-17  7:57 UTC
  To: linux-kernel-owner, Andi Kleen
  Cc: peterz, tglx, acme, mingo, linux-kernel, jolsa, namhyung

Hello Kan and Andi,

On 14.09.2018 17:57, kan.liang@linux.intel.com wrote:
> 
> 
> On 9/14/2018 10:27 AM, Andi Kleen wrote:
>> On Fri, Sep 14, 2018 at 08:39:36AM -0400, Liang, Kan wrote:
>>>
>>>
>>> On 9/14/2018 5:22 AM, Alexey Budankov wrote:
>>>>
>>>> Hi Andi,
>>>>
>>>> On 14.09.2018 11:54, Andi Kleen wrote:
>>>>>>> In principle the LBRs need to be flushed between threads, and the
>>>>>>> current code does so.
>>>>>>
>>>>>> IMHO, ideally, the LBR stack would be preserved and restored when
>>>>>> switching between execution contexts. That would allow implementing
>>>>>> a per-thread statistical call graph view in the perf tools, fully
>>>>>> based on HW capabilities. It could be advantageous in some cases,
>>>>>> compared with the traditional DWARF-based call graph.
>>>>>
>>>>> This is already supported when you use LBR call stack mode
>>>>> (perf record --call-graph lbr)
>>>>
>>>> Which kernel versions does it make sense to try?
>>>>
>>>
>>> The optimization for LBR call stack has been merged into 4.19.
>>> commit id: 8b077e4a69bef5c4121426e99497975860191e53
>>> perf/x86/intel/lbr: Optimize context switches for the LBR call stack
>>
>> I think he means support for the LBR call stack in general. It has been
>> there for a long time (since Haswell); any reasonable kernel version
>> should support it.
>>
> 
> Oh, I see. Yes, the LBR call stack feature was added a long time ago.
> But I still recommend 4.19, because it includes a recent bug fix for
> the LBR call stack.
> 
> commit id: 0592e57b24e7e05ec1f4c50b9666c013abff7017
> perf/x86/intel/lbr: Fix incomplete LBR call stack

Thanks for your support.

Best regards,
Alexey

> 
> Thanks,
> Kan
> 

