Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.

* Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
@ 2022-09-29 21:54 Nick Desaulniers
  2022-09-29 22:10 ` Slade Watkins
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Desaulniers @ 2022-09-29 21:54 UTC (permalink / raw)
  To: linux-perf-users; +Cc: LKML, Ian Rogers

So I recently moved from a dual-Xeon box to a Zen 2-based Threadripper
workstation.

My usual incantation for profiling compile times isn't working:

$ perf record -e cycles:pp --freq=128 --call-graph lbr -- make LLVM=1 -j$(nproc)
Error:
Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.

I've already set /proc/sys/kernel/perf_event_paranoid and
/proc/sys/kernel/kptr_restrict to 0.
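A minimal POSIX sh sketch for confirming the two sysctls mentioned above. SYSCTL_DIR is a hypothetical override added only so the function can be exercised against a scratch directory; by default it reads /proc/sys/kernel. The threshold follows the documented meaning of perf_event_paranoid, where a value of 0 or lower permits CPU-wide ('-a') events without CAP_PERFMON.

```shell
#!/bin/sh
# Report perf_event_paranoid and kptr_restrict, and succeed only if the
# paranoid level allows CPU-wide ('-a') profiling for unprivileged users.
# SYSCTL_DIR is a hypothetical override; it defaults to /proc/sys/kernel.
check_perf_sysctls() {
    dir=${SYSCTL_DIR:-/proc/sys/kernel}
    paranoid=$(cat "$dir/perf_event_paranoid") || return 2
    kptr=$(cat "$dir/kptr_restrict") || return 2
    echo "perf_event_paranoid=$paranoid kptr_restrict=$kptr"
    # 0 or lower: CPU event access is not restricted to CAP_PERFMON.
    [ "$paranoid" -le 0 ]
}

# Usage:
#   check_perf_sysctls || echo "consider lowering perf_event_paranoid"
```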

I remember hearing rumblings about issues with zen 2, LBR, vs zen 3.
Is this a known issue, or am I holding it wrong?
-- 
Thanks,
~Nick Desaulniers

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-09-29 21:54 Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a' Nick Desaulniers
@ 2022-09-29 22:10 ` Slade Watkins
  2022-09-30  3:23   ` Ian Rogers
  0 siblings, 1 reply; 17+ messages in thread
From: Slade Watkins @ 2022-09-29 22:10 UTC (permalink / raw)
  To: Nick Desaulniers; +Cc: linux-perf-users, LKML, Ian Rogers

Hey Nick,

> On Sep 29, 2022, at 5:54 PM, Nick Desaulniers <ndesaulniers@google.com> wrote:
> 
> I remember hearing rumblings about issues with zen 2, LBR, vs zen 3.
> Is this a known issue, or am I holding it wrong?

Hm… I also remember this. I have a Zen 2 based system that I can do testing on, so I will do so when I’m able.

If I discover something of note, I’ll get back to you.

Cheers,
-srw


* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-09-29 22:10 ` Slade Watkins
@ 2022-09-30  3:23   ` Ian Rogers
  2022-09-30  4:26     ` Namhyung Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Rogers @ 2022-09-30  3:23 UTC (permalink / raw)
  To: Slade Watkins, Ravi Bangoria; +Cc: Nick Desaulniers, linux-perf-users, LKML

On Thu, Sep 29, 2022 at 3:10 PM Slade Watkins <srw@sladewatkins.net> wrote:
>
> Hey Nick,
>
> > On Sep 29, 2022, at 5:54 PM, Nick Desaulniers <ndesaulniers@google.com> wrote:
> >
> > I remember hearing rumblings about issues with zen 2, LBR, vs zen 3.
> > Is this a known issue, or am I holding it wrong?
>
> Hm… I also remember this. I have a Zen 2 based system that I can do testing on, so I will do so when I’m able.
>
> If I discover something of note, I’ll get back to you.
>
> Cheers,
> -srw
>

LBR isn't yet supported for Zen but is coming:
https://lore.kernel.org/lkml/166155216401.401.5809694678609694438.tip-bot2@tip-bot2/
I'd recommend frame-pointers.
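A minimal sh sketch of the frame-pointer route: build the profiled workload with -fno-omit-frame-pointer, then record with perf's frame-pointer unwinder (--call-graph fp) instead of lbr. with_frame_pointers is a hypothetical helper, and it assumes the workload's build honours CFLAGS/CXXFLAGS from the environment, which many but not all build systems do.

```shell
#!/bin/sh
# Run a build command with -fno-omit-frame-pointer appended to any
# CFLAGS/CXXFLAGS the caller already set (hypothetical helper).
with_frame_pointers() {
    CFLAGS="${CFLAGS:+$CFLAGS }-fno-omit-frame-pointer" \
    CXXFLAGS="${CXXFLAGS:+$CXXFLAGS }-fno-omit-frame-pointer" \
        "$@"
}

# Usage (not run here):
#   with_frame_pointers make -j"$(nproc)"
#   perf record -e cycles --freq=128 --call-graph fp -- ./workload
```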

+Ravi who may be able to say if there are any issues with the precise
sampling on AMD.

Thanks,
Ian

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-09-30  3:23   ` Ian Rogers
@ 2022-09-30  4:26     ` Namhyung Kim
  2022-09-30  4:31       ` Ravi Bangoria
  0 siblings, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2022-09-30  4:26 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Slade Watkins, Ravi Bangoria, Nick Desaulniers, linux-perf-users, LKML

Hello,

On Thu, Sep 29, 2022 at 8:49 PM Ian Rogers <irogers@google.com> wrote:
>
> LBR isn't yet supported for Zen but is coming:
> https://lore.kernel.org/lkml/166155216401.401.5809694678609694438.tip-bot2@tip-bot2/
> I'd recommend frame-pointers.
>
> +Ravi who may be able to say if there are any issues with the precise
> sampling on AMD.

AFAIK cycles:pp will use IBS, but IBS doesn't support per-task profiling
since it has no task context. Ravi is working on it.
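Since cycles:pp is routed to IBS on these parts, one way to see whether an IBS op PMU is registered at all is to look in sysfs. This sh sketch assumes PMUs are listed under /sys/bus/event_source/devices and that the IBS op PMU is named ibs_op; PMU_DIR is a hypothetical override for testing. The check only shows that the PMU exists, not that it accepts per-thread events.

```shell
#!/bin/sh
# Succeed if an "ibs_op" PMU directory exists under the event_source bus.
# PMU_DIR is a hypothetical override; defaults to the real sysfs path.
has_ibs_op() {
    [ -d "${PMU_DIR:-/sys/bus/event_source/devices}/ibs_op" ]
}

# Usage:
#   has_ibs_op && echo "ibs_op present: cycles:pp may be serviced by IBS"
```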

Thanks,
Namhyung

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-09-30  4:26     ` Namhyung Kim
@ 2022-09-30  4:31       ` Ravi Bangoria
  2022-10-05 21:55         ` Nick Desaulniers
  0 siblings, 1 reply; 17+ messages in thread
From: Ravi Bangoria @ 2022-09-30  4:31 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Slade Watkins, linux-perf-users, LKML, Ian Rogers, Namhyung Kim,
	Ravi Bangoria

On 30-Sep-22 9:56 AM, Namhyung Kim wrote:
> Afaik cycles:pp will use IBS but it doesn't support per-task profiling
> since it has no task context.  Ravi is working on it..

Right.
https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com

Thanks,
Ravi

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-09-30  4:31       ` Ravi Bangoria
@ 2022-10-05 21:55         ` Nick Desaulniers
  2022-10-05 22:50           ` Stephane Eranian
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Desaulniers @ 2022-10-05 21:55 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Slade Watkins, linux-perf-users, LKML, Ian Rogers, Namhyung Kim,
	Stephane Eranian, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song

+ Stephane, Kees, Sandipan, Bill, ClangBuiltLinux mailing list, Yonghong
https://www.spinics.net/lists/linux-perf-users/msg23103.html
starts the thread, for context.

On Thu, Sep 29, 2022 at 9:32 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> On 30-Sep-22 9:56 AM, Namhyung Kim wrote:
> > Hello,
> >
> > On Thu, Sep 29, 2022 at 8:49 PM Ian Rogers <irogers@google.com> wrote:
> >>
> >> LBR isn't yet supported for Zen but is coming:
> >> https://lore.kernel.org/lkml/166155216401.401.5809694678609694438.tip-bot2@tip-bot2/
> >> I'd recommend frame-pointers.

Having to recompile is less than ideal for my workflow.  I have added a note to
https://github.com/ClangBuiltLinux/profiling/tree/main/perf#errors
Please let me know how I might improve the documentation.

> >>
> >> +Ravi who may be able to say if there are any issues with the precise
> >> sampling on AMD.
> >
> > Afaik cycles:pp will use IBS but it doesn't support per-task profiling
> > since it has no task context.  Ravi is working on it..
>
> Right.
> https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com

Cool, thanks for working on this Ravi.

I'm not sure yet whether I'm allowed to replace the kernel on my
corporate-provided workstation, so I may not be able to help test that patch.

Can you confirm that
$ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>

works with just that patch applied? Or is there more work required?
What is the status of that patch?

For context, we had difficulty upstreaming support for instrumentation-based
profile-guided optimizations in the Linux kernel.
https://lore.kernel.org/lkml/CAHk-=whqCT0BeqBQhW8D-YoLLgp_eFY=8Y=9ieREM5xx0ef08w@mail.gmail.com/
We'd like to be able to use either instrumentation or sampling to
optimize our builds. The major barriers to sample-based approaches are
architecture / microarchitecture issues with sample-based profile
data collection, and bitrot of data processing utilities.
https://github.com/google/autofdo/issues/144
-- 
Thanks,
~Nick Desaulniers

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-05 21:55         ` Nick Desaulniers
@ 2022-10-05 22:50           ` Stephane Eranian
  2022-10-07  3:56             ` Ravi Bangoria
  2022-10-11 21:38             ` Nick Desaulniers
  0 siblings, 2 replies; 17+ messages in thread
From: Stephane Eranian @ 2022-10-05 22:50 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Ravi Bangoria, Slade Watkins, linux-perf-users, LKML, Ian Rogers,
	Namhyung Kim, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song

On Wed, Oct 5, 2022 at 2:56 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> Can you confirm that
> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
>
> works with just that patch applied? Or is there more work required?
> What is the status of that patch?

On existing AMD Zen2, Zen3 the following cmdline:
$ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>

does not work. I see two reasons:

1. cycles:pp is likely converted into IBS op in cycle mode.
   Current kernels do not support IBS in per-thread mode.
   This is purely a kernel limitation.

2. call-graph lbr is not supported on AMD because those CPUs do
   not have LBR and therefore have no LBR callstack mode.

The best way to get what you want here today on AMD Zen2 and Zen3:

   $ perf record -e cycles --freq=128 -g -- <command to profile>
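The recommendation above can be folded into a small helper that prints a record command per CPU vendor. This is a sketch, not part of perf: perf_record_cmd is a hypothetical name, it only echoes the command (argument quoting is not preserved), and the vendor strings are the vendor_id values from /proc/cpuinfo. The Intel branch assumes LBR is available, as on the dual-Xeon box that started the thread; the fallback for other vendors is an assumption.

```shell
#!/bin/sh
# Print (not run) a perf record command suited to the given CPU vendor.
# perf_record_cmd is a hypothetical helper.
perf_record_cmd() {
    vendor=$1; shift
    case $vendor in
        AuthenticAMD)
            # Zen2/Zen3: no LBR call stacks and (as of this thread) no
            # per-thread IBS, so use plain cycles with frame-pointer stacks.
            echo "perf record -e cycles --freq=128 -g -- $*" ;;
        GenuineIntel)
            # The original incantation, which relies on Intel LBR.
            echo "perf record -e cycles:pp --freq=128 --call-graph lbr -- $*" ;;
        *)
            # Assumed conservative default for other vendors.
            echo "perf record -e cycles --freq=128 -g -- $*" ;;
    esac
}

# Usage:
#   vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' /proc/cpuinfo)
#   perf_record_cmd "$vendor" make LLVM=1 -j"$(nproc)"
```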

On AMD Zen3, there is a precursor to LBR with Branch Sampling (BRS),
and you can use it to sample taken branches but not for callstacks. I
mention the cmdline here for reference:

$ perf record -e cpu/branch-brs/ -c 1000037  -b  -- <command to profile>

Note that AMD Zen3 BRS is enough to get AutoFDO's use of LBR-style
branch samples working, using the cmdline above.
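Before trying the BRS cmdline, it may help to check that the running kernel advertises the event. This sketch assumes the event is exposed as cpu/events/branch-brs under /sys/bus/event_source/devices on Zen3 kernels built with BRS support; PMU_DIR is a hypothetical override for testing.

```shell
#!/bin/sh
# Succeed if the cpu PMU advertises a "branch-brs" event in sysfs.
# PMU_DIR is a hypothetical override; defaults to the real sysfs path.
has_branch_brs() {
    [ -e "${PMU_DIR:-/sys/bus/event_source/devices}/cpu/events/branch-brs" ]
}

# Usage:
#   if has_branch_brs; then
#       perf record -e cpu/branch-brs/ -c 1000037 -b -- <command to profile>
#   fi
```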

Hope this helps.

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-05 22:50           ` Stephane Eranian
@ 2022-10-07  3:56             ` Ravi Bangoria
  2022-10-11 21:32               ` Nick Desaulniers
  2022-10-11 21:38             ` Nick Desaulniers
  1 sibling, 1 reply; 17+ messages in thread
From: Ravi Bangoria @ 2022-10-07  3:56 UTC (permalink / raw)
  To: Stephane Eranian, Nick Desaulniers
  Cc: Slade Watkins, linux-perf-users, LKML, Ian Rogers, Namhyung Kim,
	Kees Cook, sandipan.das, Bill Wendling, clang-built-linux,
	Yonghong Song, Peter Zijlstra

+cc: PeterZ

> On existing AMD Zen2, Zen3 the following cmdline:
> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
> 
> does not work. I see two reasons:
> 
> 1. cycles:pp is likely converted into IBS op in cycle mode.
>     Current kernels do not support IBS in per-thread mode.
>     This is purely a kernel limitation

Right, it's purely a kernel limitation. And the simple patch below, on top
of the event-context rewrite patch[1], should be sufficient to make cycles:pp
work in per-process mode on AMD Zen.

---
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index c251bc44c088..de01b5d27e40 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -665,7 +665,7 @@ static struct perf_ibs perf_ibs_fetch = {
 
 static struct perf_ibs perf_ibs_op = {
 	.pmu = {
-		.task_ctx_nr	= perf_invalid_context,
+		.task_ctx_nr	= perf_hw_context,
 
 		.event_init	= perf_ibs_init,
 		.add		= perf_ibs_add,
---

[1]: https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com
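To check whether a given kernel (for example one with the rewrite and IBS patches applied) accepts cycles:pp in per-thread mode, a sketch that records a trivial command and looks for the error string quoted in this thread. PERF is a hypothetical override (default perf) used for testing, and the three outcomes are this sketch's own convention rather than anything perf reports.

```shell
#!/bin/sh
# Try a per-thread cycles:pp recording of `true` and classify the result:
# "accepted" (exit 0), "unsupported" (exit 1), "inconclusive" (exit 2).
# PERF is a hypothetical override for the perf binary.
probe_per_thread_pp() {
    data=$(mktemp) || return 2
    if out=$("${PERF:-perf}" record -e cycles:pp -o "$data" -- true 2>&1); then
        rm -f "$data"
        echo "accepted"
        return 0
    fi
    rm -f "$data"
    case $out in
        *"in per-thread mode"*) echo "unsupported"; return 1 ;;
        *) echo "inconclusive"; return 2 ;;
    esac
}
```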

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-07  3:56             ` Ravi Bangoria
@ 2022-10-11 21:32               ` Nick Desaulniers
  2022-10-12  4:06                 ` Ravi Bangoria
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Desaulniers @ 2022-10-11 21:32 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Namhyung Kim, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra

On Thu, Oct 6, 2022 at 8:56 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> +cc: PeterZ
>
> Right, it's purely a kernel limitation. And below simple patch on top
> of event-context rewrite patch[1] should be sufficient to make cycles:pp
> working in per-process mode on AMD Zen.
>
> ---
> diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
> index c251bc44c088..de01b5d27e40 100644
> --- a/arch/x86/events/amd/ibs.c
> +++ b/arch/x86/events/amd/ibs.c
> @@ -665,7 +665,7 @@ static struct perf_ibs perf_ibs_fetch = {
>
>  static struct perf_ibs perf_ibs_op = {
>         .pmu = {
> -               .task_ctx_nr    = perf_invalid_context,
> +               .task_ctx_nr    = perf_hw_context,
>
>                 .event_init     = perf_ibs_init,
>                 .add            = perf_ibs_add,
> ---
>
> [1]: https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com

Hi Ravi,
I didn't see the above diff in
https://lore.kernel.org/lkml/20221008062424.313-1-ravi.bangoria@amd.com/
Was there another distinct patch you were going to send for the above?

-- 
Thanks,
~Nick Desaulniers

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-11 21:32               ` Nick Desaulniers
@ 2022-10-12  4:06                 ` Ravi Bangoria
  2022-10-12  5:04                   ` Ravi Bangoria
  0 siblings, 1 reply; 17+ messages in thread
From: Ravi Bangoria @ 2022-10-12  4:06 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Namhyung Kim, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra, Ravi Bangoria

On 12-Oct-22 3:02 AM, Nick Desaulniers wrote:
> Hi Ravi,
> I didn't see the above diff in
> https://lore.kernel.org/lkml/20221008062424.313-1-ravi.bangoria@amd.com/
> Was there another distinct patch you were going to send for the above?

Yes, Nick. I was planning to send it once the rewrite stuff goes in.

Thanks,
Ravi

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-12  4:06                 ` Ravi Bangoria
@ 2022-10-12  5:04                   ` Ravi Bangoria
  2023-06-23 16:23                     ` Nick Desaulniers
  0 siblings, 1 reply; 17+ messages in thread
From: Ravi Bangoria @ 2022-10-12  5:04 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Namhyung Kim, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra, Ravi Bangoria

On 12-Oct-22 9:36 AM, Ravi Bangoria wrote:
> Yes Nick. I was planning to send it once the rewrite stuff goes in.

Hi Nick,

Since you have a practical use case, would it be possible to run your workflow
with the perf rewrite and IBS patches applied? It would help us find and fix
more bugs and upstream these changes.

Thanks,
Ravi

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-12  5:04                   ` Ravi Bangoria
@ 2023-06-23 16:23                     ` Nick Desaulniers
  2023-06-23 23:18                       ` Namhyung Kim
  2023-06-26  5:44                       ` Ravi Bangoria
  0 siblings, 2 replies; 17+ messages in thread
From: Nick Desaulniers @ 2023-06-23 16:23 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Namhyung Kim, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra

On Tue, Oct 11, 2022 at 10:05 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> Hi Nick,
>
> Since you have practical use case, would it be possible to run your workflow
> with perf rewrite and IBS patches applied? It will help us in finding/fixing
> more bugs and upstreaming these changes.

Hi Ravi,
Sorry, I'm not able to load a custom kernel image on my employer-provided
workstation, and I never got approval to expense hardware for
testing this otherwise.

Was there ever any update on this? I'm on 6.1.25 now and still can't run
$ perf record -e cycles:pp --call-graph lbr <any command to profile>
$ cat /proc/cpuinfo
...
model name      : AMD Ryzen Threadripper PRO 3995WX 64-Cores
...
-- 
Thanks,
~Nick Desaulniers

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2023-06-23 16:23                     ` Nick Desaulniers
@ 2023-06-23 23:18                       ` Namhyung Kim
  2023-06-26  5:44                       ` Ravi Bangoria
  1 sibling, 0 replies; 17+ messages in thread
From: Namhyung Kim @ 2023-06-23 23:18 UTC
  To: Nick Desaulniers
  Cc: Ravi Bangoria, Stephane Eranian, Slade Watkins, linux-perf-users,
	LKML, Ian Rogers, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra

Hi Nick,

On Fri, Jun 23, 2023 at 9:23 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Oct 11, 2022 at 10:05 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
> >
> > On 12-Oct-22 9:36 AM, Ravi Bangoria wrote:
> > > On 12-Oct-22 3:02 AM, Nick Desaulniers wrote:
> > >> On Thu, Oct 6, 2022 at 8:56 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
> > >>>
> > >>> +cc: PeterZ
> > >>>
> > >>>>>>>> +Ravi who may be able to say if there are any issues with the precise
> > >>>>>>>> sampling on AMD.
> > >>>>>>>
> > >>>>>>> Afaik cycles:pp will use IBS but it doesn't support per-task profiling
> > >>>>>>> since it has no task context.  Ravi is working on it..
> > >>>>>>
> > >>>>>> Right.
> > >>>>>> https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com
> > >>>>>
> > >>>>> Cool, thanks for working on this Ravi.
> > >>>>>
> > >>>>> I'm not sure yet whether I may replace the kernel on my corporate
> > >>>>> provided workstation, so I'm not sure yet I can help test that patch.
> > >>>>>
> > >>>>> Can you confirm that
> > >>>>> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
> > >>>>>
> > >>>>> works with just that patch applied? Or is there more work required?
> > >>>>> What is the status of that patch?
> > >>>>>
> > >>>>> For context, we had difficulty upstreaming support for instrumentation
> > >>>>> based profile guided optimizations in the Linux kernel.
> > >>>>> https://lore.kernel.org/lkml/CAHk-=whqCT0BeqBQhW8D-YoLLgp_eFY=8Y=9ieREM5xx0ef08w@mail.gmail.com/
> > >>>>> We'd like to be able to use either instrumentation or sampling to
> > >>>>> optimize our builds.  The major barriers to sample-based approaches are
> > >>>>> architecture / micro architecture issues with sample based profile
> > >>>>> data collection, and bitrot of data processing utilities.
> > >>>>> https://github.com/google/autofdo/issues/144
> > >>>>
> > >>>> On existing AMD Zen2, Zen3 the following cmdline:
> > >>>> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
> > >>>>
> > >>>> does not work. I see two reasons:
> > >>>>
> > >>>> 1. cycles:pp is likely converted into IBS op in cycle mode.
> > >>>>     Current kernels do not support IBS in per-thread mode.
> > >>>>     This is purely a kernel limitation.
> > >>>
> > >>> Right, it's purely a kernel limitation. And the simple patch below, on top
> > >>> of the event-context rewrite patch[1], should be sufficient to make
> > >>> cycles:pp work in per-process mode on AMD Zen.
> > >>>
> > >>> ---
> > >>> diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
> > >>> index c251bc44c088..de01b5d27e40 100644
> > >>> --- a/arch/x86/events/amd/ibs.c
> > >>> +++ b/arch/x86/events/amd/ibs.c
> > >>> @@ -665,7 +665,7 @@ static struct perf_ibs perf_ibs_fetch = {
> > >>>
> > >>>  static struct perf_ibs perf_ibs_op = {
> > >>>         .pmu = {
> > >>> -               .task_ctx_nr    = perf_invalid_context,
> > >>> +               .task_ctx_nr    = perf_hw_context,
> > >>>
> > >>>                 .event_init     = perf_ibs_init,
> > >>>                 .add            = perf_ibs_add,
> > >>> ---
> > >>>
> > >>> [1]: https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com
> > >>
> > >> Hi Ravi,
> > >> I didn't see the above diff in
> > >> https://lore.kernel.org/lkml/20221008062424.313-1-ravi.bangoria@amd.com/
> > >> Was there another distinct patch you were going to send for the above?
> > >
> > > Yes Nick. I was planning to send it once the rewrite stuff goes in.
> >
> > Hi Nick,
> >
> > Since you have a practical use case, would it be possible to run your workflow
> > with perf rewrite and IBS patches applied? It will help us in finding/fixing
> > more bugs and upstreaming these changes.
>
> Hi Ravi,
> Sorry, I'm not able to load a custom kernel image on my employer-provided
> workstation, and I never got approval to expense hardware for
> testing this otherwise.
>
> Was there ever any update on this? I'm on 6.1.25 now and still can't run
> $ perf record -e cycles:pp --call-graph lbr <any command to profile>
> $ cat /proc/cpuinfo
> ...
> model name      : AMD Ryzen Threadripper PRO 3995WX 64-Cores
> ...

The commit 30093056f7b2 ("perf/amd/ibs: Make IBS a core pmu") landed in v6.2.

  $ git name-rev --tags --refs=v[2-6].* 30093056f7b2
  30093056f7b2 v6.2-rc1~176^2~16

https://git.kernel.org/torvalds/c/30093056f7b2
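As a rough way to tell whether a given machine is likely to have this change, the release string from uname(2) can be compared against 6.2. The C sketch below is only a heuristic and assumes a mainline-style version string such as "6.1.25" or "6.2.0-rc1". The helper names release_at_least() and kernel_may_support_ibs_per_thread() are hypothetical. Distribution kernels may backport or omit patches, so a version check cannot prove that commit 30093056f7b2 is actually present.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: returns 1 if a kernel release string such as
 * "6.1.25" or "6.2.0-rc1" is at least want_major.want_minor, and 0
 * otherwise (including when the string cannot be parsed).
 */
int release_at_least(const char *release, int want_major, int want_minor)
{
    int major = 0, minor = 0;

    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return 0; /* unparseable: conservatively assume "too old" */
    if (major != want_major)
        return major > want_major;
    return minor >= want_minor;
}

/*
 * Hypothetical helper: heuristic hint that the running kernel is v6.2+
 * and therefore may allow per-thread IBS (cycles:pp) on AMD Zen.
 * Backports or vendor patches can make this answer wrong either way.
 */
int kernel_may_support_ibs_per_thread(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return 0;
    return release_at_least(u.release, 6, 2);
}
```

For example, the 6.1.25 kernel mentioned above gives release_at_least("6.1.25", 6, 2) == 0. A "6.2.0-rc1" release string gives 1.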

Thanks,
Namhyung

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2023-06-23 16:23                     ` Nick Desaulniers
  2023-06-23 23:18                       ` Namhyung Kim
@ 2023-06-26  5:44                       ` Ravi Bangoria
  2023-07-10 21:22                         ` Nick Desaulniers
  1 sibling, 1 reply; 17+ messages in thread
From: Ravi Bangoria @ 2023-06-26  5:44 UTC
  To: Nick Desaulniers
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Namhyung Kim, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra, Ravi Bangoria

Hi Nick,

On 23-Jun-23 9:53 PM, Nick Desaulniers wrote:
> On Tue, Oct 11, 2022 at 10:05 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>>
>> On 12-Oct-22 9:36 AM, Ravi Bangoria wrote:
>>> On 12-Oct-22 3:02 AM, Nick Desaulniers wrote:
>>>> On Thu, Oct 6, 2022 at 8:56 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>>>>>
>>>>> +cc: PeterZ
>>>>>
>>>>>>>>>> +Ravi who may be able to say if there are any issues with the precise
>>>>>>>>>> sampling on AMD.
>>>>>>>>>
>>>>>>>>> Afaik cycles:pp will use IBS but it doesn't support per-task profiling
>>>>>>>>> since it has no task context.  Ravi is working on it..
>>>>>>>>
>>>>>>>> Right.
>>>>>>>> https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com
>>>>>>>
>>>>>>> Cool, thanks for working on this Ravi.
>>>>>>>
>>>>>>> I'm not sure yet whether I may replace the kernel on my corporate
>>>>>>> provided workstation, so I'm not sure yet I can help test that patch.
>>>>>>>
>>>>>>> Can you confirm that
>>>>>>> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
>>>>>>>
>>>>>>> works with just that patch applied? Or is there more work required?
>>>>>>> What is the status of that patch?
>>>>>>>
>>>>>>> For context, we had difficulty upstreaming support for instrumentation
>>>>>>> based profile guided optimizations in the Linux kernel.
>>>>>>> https://lore.kernel.org/lkml/CAHk-=whqCT0BeqBQhW8D-YoLLgp_eFY=8Y=9ieREM5xx0ef08w@mail.gmail.com/
>>>>>>> We'd like to be able to use either instrumentation or sampling to
>>>>>>> optimize our builds.  The major barriers to sample-based approaches are
>>>>>>> architecture / micro architecture issues with sample based profile
>>>>>>> data collection, and bitrot of data processing utilities.
>>>>>>> https://github.com/google/autofdo/issues/144
>>>>>>
>>>>>> On existing AMD Zen2, Zen3 the following cmdline:
>>>>>> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
>>>>>>
>>>>>> does not work. I see two reasons:
>>>>>>
>>>>>> 1. cycles:pp is likely converted into IBS op in cycle mode.
>>>>>>     Current kernels do not support IBS in per-thread mode.
>>>>>>     This is purely a kernel limitation.
>>>>>
>>>>> Right, it's purely a kernel limitation. And the simple patch below, on top
>>>>> of the event-context rewrite patch[1], should be sufficient to make
>>>>> cycles:pp work in per-process mode on AMD Zen.
>>>>>
>>>>> ---
>>>>> diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
>>>>> index c251bc44c088..de01b5d27e40 100644
>>>>> --- a/arch/x86/events/amd/ibs.c
>>>>> +++ b/arch/x86/events/amd/ibs.c
>>>>> @@ -665,7 +665,7 @@ static struct perf_ibs perf_ibs_fetch = {
>>>>>
>>>>>  static struct perf_ibs perf_ibs_op = {
>>>>>         .pmu = {
>>>>> -               .task_ctx_nr    = perf_invalid_context,
>>>>> +               .task_ctx_nr    = perf_hw_context,
>>>>>
>>>>>                 .event_init     = perf_ibs_init,
>>>>>                 .add            = perf_ibs_add,
>>>>> ---
>>>>>
>>>>> [1]: https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com
>>>>
>>>> Hi Ravi,
>>>> I didn't see the above diff in
>>>> https://lore.kernel.org/lkml/20221008062424.313-1-ravi.bangoria@amd.com/
>>>> Was there another distinct patch you were going to send for the above?
>>>
>>> Yes Nick. I was planning to send it once the rewrite stuff goes in.
>>
>> Hi Nick,
>>
>> Since you have a practical use case, would it be possible to run your workflow
>> with perf rewrite and IBS patches applied? It will help us in finding/fixing
>> more bugs and upstreaming these changes.
> 
> Hi Ravi,
> Sorry, I'm not able to load a custom kernel image on my employer-provided
> workstation, and I never got approval to expense hardware for
> testing this otherwise.
> 
> Was there ever any update on this? I'm on 6.1.25 now and still can't run
> $ perf record -e cycles:pp --call-graph lbr <any command to profile>

Per-process precise sampling on AMD platforms should work from 6.2-rc1
onward. However, --call-graph=lbr is not supported on AMD (hw limitation).
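For illustration, here is a hedged C sketch of roughly what perf record -e cycles:pp --freq=128 asks the kernel for through perf_event_open(2). The helper names are hypothetical, and the real perf tool sets more fields.

- precise_ip = 2 corresponds to the :pp modifier.
- pid = 0 with cpu = -1 makes this a per-thread (task-context) event for the calling thread. perf record attaches to the forked workload's pid instead, which is also a task-context event.

Per this thread, on AMD Zen kernels before 6.2 such an open is expected to fail because the IBS op PMU had no task context. The error is presumably EINVAL, which the perf tool reports as "Invalid event ... in per-thread mode". From 6.2-rc1 onward it should succeed, subject to perf_event_paranoid. No branch stack or LBR call graph is requested, since that remains unsupported on AMD.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* glibc provides no perf_event_open() wrapper, so invoke the raw syscall. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: approximates "-e cycles:pp --freq=<freq>". */
void init_cycles_pp_attr(struct perf_event_attr *attr, unsigned long freq)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->precise_ip = 2;   /* the ":pp" modifier */
    attr->freq = 1;         /* treat sample_freq as Hz, not as a period */
    attr->sample_freq = freq;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr->inherit = 1;      /* follow child tasks, e.g. jobs spawned by make */
    attr->disabled = 1;     /* start disabled; enable later via ioctl */
}

/*
 * Hypothetical helper: opens the event in per-thread mode
 * (pid = 0: calling thread, cpu = -1: any CPU).
 * Returns a file descriptor on success or -errno on failure.
 */
int open_per_thread(struct perf_event_attr *attr)
{
    long fd = sys_perf_event_open(attr, 0, -1, -1, 0);

    return fd >= 0 ? (int)fd : -errno;
}
```

A caller would check for a negative return value, such as -EINVAL on an older AMD kernel. It could then fall back to a non-precise cycles event, or to system-wide mode as the perf error message suggests.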

Thanks,
Ravi

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2023-06-26  5:44                       ` Ravi Bangoria
@ 2023-07-10 21:22                         ` Nick Desaulniers
  2023-07-11  5:14                           ` Ravi Bangoria
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Desaulniers @ 2023-07-10 21:22 UTC
  To: Ravi Bangoria, Namhyung Kim
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra

On Sun, Jun 25, 2023 at 10:45 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> Hi Nick,
>
> On 23-Jun-23 9:53 PM, Nick Desaulniers wrote:
> >>> On 12-Oct-22 3:02 AM, Nick Desaulniers wrote:
> > Hi Ravi,
> > Sorry, I'm not able to load a custom kernel image on my employer-provided
> > workstation, and I never got approval to expense hardware for
> > testing this otherwise.
> >
> > Was there ever any update on this? I'm on 6.1.25 now and still can't run
> > $ perf record -e cycles:pp --call-graph lbr <any command to profile>
>
> Per-process precise sampling on AMD platforms should work from 6.2-rc1
> onward.

OK, I can wait for my employer to ship 6.2 on our workstations.

> However, --call-graph=lbr is not supported on AMD (hw limitation).

Is that true on all AMD uarches? Is there an equivalent? The LBR encoding is compact,
which makes working with it much faster than DWARF or stack frame
unwinding.
-- 
Thanks,
~Nick Desaulniers

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2023-07-10 21:22                         ` Nick Desaulniers
@ 2023-07-11  5:14                           ` Ravi Bangoria
  0 siblings, 0 replies; 17+ messages in thread
From: Ravi Bangoria @ 2023-07-11  5:14 UTC
  To: Nick Desaulniers, Namhyung Kim
  Cc: Stephane Eranian, Slade Watkins, linux-perf-users, LKML,
	Ian Rogers, Kees Cook, sandipan.das, Bill Wendling,
	clang-built-linux, Yonghong Song, Peter Zijlstra, Ravi Bangoria

On 11-Jul-23 2:52 AM, Nick Desaulniers wrote:
> On Sun, Jun 25, 2023 at 10:45 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>>
>> Hi Nick,
>>
>> On 23-Jun-23 9:53 PM, Nick Desaulniers wrote:
>>>>> On 12-Oct-22 3:02 AM, Nick Desaulniers wrote:
>>> Hi Ravi,
>>> Sorry, I'm not able to load a custom kernel image on my employer-provided
>>> workstation, and I never got approval to expense hardware for
>>> testing this otherwise.
>>>
>>> Was there ever any update on this? I'm on 6.1.25 now and still can't run
>>> $ perf record -e cycles:pp --call-graph lbr <any command to profile>
>>
>> Per-process precise sampling on AMD platforms should work from 6.2-rc1
>> onward.
> 
> OK, I can wait for my employer to ship 6.2 on our workstations.
> 
>> However, --call-graph=lbr is not supported on AMD (hw limitation).
> 
> Is that true on all AMD uarches? Is there an equivalent? The LBR encoding is compact,
> which makes working with it much faster than DWARF or stack frame
> unwinding.

I understand that LBR call-stack is the fastest option, but unfortunately
none of the current AMD uarches support it.

Thanks,
Ravi

* Re: Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a'.
  2022-10-05 22:50           ` Stephane Eranian
  2022-10-07  3:56             ` Ravi Bangoria
@ 2022-10-11 21:38             ` Nick Desaulniers
  1 sibling, 0 replies; 17+ messages in thread
From: Nick Desaulniers @ 2022-10-11 21:38 UTC
  To: Stephane Eranian, Ravi Bangoria
  Cc: Slade Watkins, linux-perf-users, LKML, Ian Rogers, Namhyung Kim,
	Kees Cook, sandipan.das, Bill Wendling, clang-built-linux,
	Yonghong Song

On Wed, Oct 5, 2022 at 3:50 PM Stephane Eranian <eranian@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 2:56 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
> >
> > + Stephane, Kees, Sandipan, Bill, ClangBuiltLinux mailing list, Yonghong
> > https://www.spinics.net/lists/linux-perf-users/msg23103.html
> > starts the thread, for context.
> >
> > On Thu, Sep 29, 2022 at 9:32 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
> > >
> > > On 30-Sep-22 9:56 AM, Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > On Thu, Sep 29, 2022 at 8:49 PM Ian Rogers <irogers@google.com> wrote:
> > > >>
> > > >> On Thu, Sep 29, 2022 at 3:10 PM Slade Watkins <srw@sladewatkins.net> wrote:
> > > >>>
> > > >>> Hey Nick,
> > > >>>
> > > >>>> On Sep 29, 2022, at 5:54 PM, Nick Desaulniers <ndesaulniers@google.com> wrote:
> > > >>>>
> > > >>>> I remember hearing rumblings about issues with zen 2, LBR, vs zen 3.
> > > >>>> Is this a known issue, or am I holding it wrong?
> > > >>>
> > > >>> Hm… I also remember this. I have a Zen 2 based system that I can do testing on, so I will do so when I’m able.
> > > >>>
> > > >>> If I discover something of note, I’ll get back to you.
> > > >>>
> > > >>> Cheers,
> > > >>> -srw
> > > >>>
> > > >>
> > > >> LBR isn't yet supported for Zen but is coming:
> > > >> https://lore.kernel.org/lkml/166155216401.401.5809694678609694438.tip-bot2@tip-bot2/
> > > >> I'd recommend frame pointers.
> >
> > Having to recompile is less than ideal for my workflow.  I have added a note to
> > https://github.com/ClangBuiltLinux/profiling/tree/main/perf#errors
> > Please let me know how I might improve the documentation.
> >
> > > >>
> > > >> +Ravi who may be able to say if there are any issues with the precise
> > > >> sampling on AMD.
> > > >
> > > > Afaik cycles:pp will use IBS but it doesn't support per-task profiling
> > > > since it has no task context.  Ravi is working on it..
> > >
> > > Right.
> > > https://lore.kernel.org/lkml/20220829113347.295-1-ravi.bangoria@amd.com
> >
> > Cool, thanks for working on this Ravi.
> >
> > I'm not sure yet whether I may replace the kernel on my corporate
> > provided workstation, so I'm not sure yet I can help test that patch.
> >
> > Can you confirm that
> > $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
> >
> > works with just that patch applied? Or is there more work required?
> > What is the status of that patch?
> >
> > For context, we had difficulty upstreaming support for instrumentation
> > based profile guided optimizations in the Linux kernel.
> > https://lore.kernel.org/lkml/CAHk-=whqCT0BeqBQhW8D-YoLLgp_eFY=8Y=9ieREM5xx0ef08w@mail.gmail.com/
> > We'd like to be able to use either instrumentation or sampling to
> > optimize our builds.  The major barriers to sample-based approaches are
> > architecture / micro architecture issues with sample based profile
> > data collection, and bitrot of data processing utilities.
> > https://github.com/google/autofdo/issues/144
>
> On existing AMD Zen2, Zen3 the following cmdline:
> $ perf record -e cycles:pp --freq=128 --call-graph lbr -- <command to profile>
>
> does not work. I see two reasons:
>
> 1. cycles:pp is likely converted into IBS op in cycle mode.
>     Current kernels do not support IBS in per-thread mode.
>     This is purely a kernel limitation.

Sounds like Ravi has a diff that might work?
https://lore.kernel.org/linux-perf-users/85822c3c-2254-52cc-e6b1-9c89adb63771@amd.com/

>
> 2. call-graph lbr is not supported on AMD because they do
>    not have LBR and therefore no LBR callstack mode
>
> The best way to get what you want here today on AMD Zen2 and Zen3:
>
>    $ perf record -e cycles --freq=128 -g -- <command to profile>

The code I'm looking to profile is built with -gmlt, so I get symbols,
but without recompiling to explicitly re-enable frame pointers, I
can't observe callers.

Sounds like I'd need to rebuild with -fno-omit-frame-pointer, which is
more painful than my prior LBR-based workflow.
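Stephane's suggested fallback, perf record -e cycles --freq=128 -g, sidesteps both problems. Per this thread, only the precise (:pp) variant is routed to IBS. In addition, -g requests frame-pointer call chains rather than LBR, which is also why callers disappear when the profiled code omits frame pointers.

A minimal C sketch of the corresponding perf_event_attr setup follows. The helper name init_cycles_fp_callchain_attr() is hypothetical, and the perf tool sets additional fields not shown here.

```c
#include <string.h>
#include <linux/perf_event.h>

/*
 * Hypothetical helper approximating "perf record -e cycles --freq=<freq> -g":
 * plain (non-precise) cycles plus a callchain with every sample.
 */
void init_cycles_fp_callchain_attr(struct perf_event_attr *attr,
                                   unsigned long freq)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->precise_ip = 0;   /* no ":pp", so the event is not forwarded to IBS */
    attr->freq = 1;         /* treat sample_freq as Hz */
    attr->sample_freq = freq;
    /*
     * PERF_SAMPLE_CALLCHAIN asks the kernel to record a callchain per sample.
     * On x86 the user-space part is gathered by following saved frame
     * pointers, so code compiled with frame pointers omitted (common when
     * optimizing) tends to yield truncated user callchains.
     * No PERF_SAMPLE_BRANCH_STACK is requested, so no LBR/BRS is needed.
     */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;
    attr->inherit = 1;      /* follow child tasks of the profiled command */
}
```

Rebuilding the profiled code with frame pointers enabled is what lets the kernel's frame-pointer walk reach the callers.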

>
> On AMD Zen3, there is a precursor to LBR with Branch Sampling (BRS),
> and you can use it to sample taken branches but not for callstacks. I
> mention the cmdline here for reference:
>
> $ perf record -e cpu/branch-brs/ -c 1000037  -b  -- <command to profile>
>
> Note that AMD Zen3 BRS is enough to get the autoFDO usage of an
> LBR working as per the cmdline above.

Interesting, but I'm stuck with a Zen2 box for a couple of years now
(corporate workstation).  This pretty much blows up all of the
profiling work I was doing, and any hope I had of contributing towards
building the kernel with AutoFDO profiles until this works on my box
with the kernel it has.

>
> Hope this helps.



-- 
Thanks,
~Nick Desaulniers

end of thread

Thread overview: 17+ messages
2022-09-29 21:54 Invalid event (cycles:pp) in per-thread mode, enable system wide with '-a' Nick Desaulniers
2022-09-29 22:10 ` Slade Watkins
2022-09-30  3:23   ` Ian Rogers
2022-09-30  4:26     ` Namhyung Kim
2022-09-30  4:31       ` Ravi Bangoria
2022-10-05 21:55         ` Nick Desaulniers
2022-10-05 22:50           ` Stephane Eranian
2022-10-07  3:56             ` Ravi Bangoria
2022-10-11 21:32               ` Nick Desaulniers
2022-10-12  4:06                 ` Ravi Bangoria
2022-10-12  5:04                   ` Ravi Bangoria
2023-06-23 16:23                     ` Nick Desaulniers
2023-06-23 23:18                       ` Namhyung Kim
2023-06-26  5:44                       ` Ravi Bangoria
2023-07-10 21:22                         ` Nick Desaulniers
2023-07-11  5:14                           ` Ravi Bangoria
2022-10-11 21:38             ` Nick Desaulniers