From: Madhavan Srinivasan <maddy@linux.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Kim Phillips <kim.phillips@amd.com>,
	Ravi Bangoria <ravi.bangoria@linux.ibm.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Andi Kleen <ak@linux.intel.com>, Jiri Olsa <jolsa@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Stephane Eranian <eranian@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	yao.jin@linux.intel.com, Ingo Molnar <mingo@redhat.com>,
	Paul Mackerras <paulus@samba.org>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Robert Richter <robert.richter@amd.com>,
	Namhyung Kim <namhyung@kernel.org>,
	linuxppc-dev@lists.ozlabs.org,
	Alexey Budankov <alexey.budankov@linux.intel.com>,
	"Liang, Kan" <kan.liang@linux.intel.com>
Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information
Date: Mon, 27 Apr 2020 12:48:05 +0530
Message-ID: <9aa2da24-3621-ba50-60c8-5084fa1041d7@linux.ibm.com>
In-Reply-To: <960e39ae-4d9a-05e5-9fbc-0a11706dce70@linux.ibm.com>

peterz,

Can you please help? Is it okay to use PERF_SAMPLE_RAW to expose the
pipeline stall details, and to add tool-side infrastructure to handle
PERF_SAMPLE_RAW for cpu-pmu samples?

Maddy

On 4/20/20 12:39 PM, Madhavan Srinivasan wrote:
>
>
> On 3/27/20 1:18 AM, Kim Phillips wrote:
>>
>> On 3/26/20 5:19 AM, maddy wrote:
>>>
>>> On 3/18/20 11:05 PM, Kim Phillips wrote:
>>>> Hi Maddy,
>>>>
>>>> On 3/17/20 1:50 AM, maddy wrote:
>>>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>>>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Implementation detail:
>>>>>>>>>>>>>
>>>>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>>>>> into generic format:
>>>>>>>>>>>>>
>>>>>>>>>>>>>        struct perf_pipeline_haz_data {
>>>>>>>>>>>>>               /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>>>>               __u8    itype;
>>>>>>>>>>>>>               /* Instruction Cache source */
>>>>>>>>>>>>>               __u8    icache;
>>>>>>>>>>>>>               /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>>>>               __u8    hazard_stage;
>>>>>>>>>>>>>               /* Hazard reason */
>>>>>>>>>>>>>               __u8    hazard_reason;
>>>>>>>>>>>>>               /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>>>>               __u8    stall_stage;
>>>>>>>>>>>>>               /* Stall reason */
>>>>>>>>>>>>>               __u8    stall_reason;
>>>>>>>>>>>>>               __u16   pad;
>>>>>>>>>>>>>        };
>>>>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>>>>> It's not really 1:1, we don't have these separations of stages
>>>>>>>>>> and reasons, for example: we have missed in L2 cache, for example.
>>>>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>>>>> IBM's AFAICT.
>>>>>>>>> AMD IBS captures pipeline latency data in case of Fetch sampling,
>>>>>>>>> like the Fetch latency, tag to retire latency, completion to retire
>>>>>>>>> latency and so on. Yes, Ops sampling does provide more data on
>>>>>>>>> load/store centric information. But it also captures more detailed
>>>>>>>>> data for Branch instructions. And we also looked at ARM SPE, which
>>>>>>>>> also captures more detailed pipeline data and latency information.
>>>>>>>>>
>>>>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>>>>> bad events.  IBS' PPR descriptions has one occurrence of the
>>>>>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>>>>> bikeshed.
>>>>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>>>>> NP. We thought pipeline is a generic hw term, so we proposed the
>>>>>>> word "pipeline". We are open to any term that is generic enough.
>>>>>>>
>>>>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>>>>> of information coming out of it, but the vast majority
>>>>>>>> are addresses, latencies of various components in the memory
>>>>>>>> hierarchy, and various component hit/miss bits.
>>>>>>> Yes, we should capture core pipeline specific details. For example,
>>>>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>>>>> part of perf-mem, IMO.
>>>>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>>>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>>>>
>>>>>> union perf_mem_data_src {
>>>>>> ...
>>>>>>                    __u64   mem_rsvd:24,
>>>>>>                            mem_snoopx:2,   /* snoop mode, ext */
>>>>>>                            mem_remote:1,   /* remote */
>>>>>>                            mem_lvl_num:4,  /* memory hierarchy level number */
>>>>>>                            mem_dtlb:7,     /* tlb access */
>>>>>>                            mem_lock:2,     /* lock instr */
>>>>>>                            mem_snoop:5,    /* snoop mode */
>>>>>>                            mem_lvl:14,     /* memory hierarchy level */
>>>>>>                            mem_op:5;       /* type of opcode */
>>>>>>
>>>>>>
>>>>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>>>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>>>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>>>>> be used to populate mem_snoop, right?
>>>>> Hi Kim,
>>>>>
>>>>> Yes. We do expose these data as part of perf-mem for POWER.
>>>> OK, I see relevant PERF_MEM_S bits in
>>>> arch/powerpc/perf/isa207-common.c:isa207_find_source now, thanks.
>>>>
>>>>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>>>>> used for the ld/st target addresses, too.
>>>>>>
>>>>>>>> What's needed here is a vendor-specific extended
>>>>>>>> sample information that all these technologies gather,
>>>>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>>>>> all should have in common.
>>>>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>>>>> latency, Instruction completion latency etc.) along with other
>>>>>>> pipeline details in the proposed structure.
>>>>>> Latency figures are just an example, and from what I
>>>>>> can tell, struct perf_sample_data already has a 'weight' member,
>>>>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>>>>> transfer memory access latency figures.  Granted, that's
>>>>>> a bad name given all other vendors don't call latency
>>>>>> 'weight'.
>>>>>>
>>>>>> I didn't see any latency figures coming out of POWER9,
>>>>>> and do not expect this patchseries to implement those
>>>>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>>>>> to amend perf to suit their own h/w output please.
>>>>> The reference structure proposed in this patchset did not have members
>>>>> to capture latency info for that exact reason. But the idea here is to
>>>>> abstract as much vendor specific data as possible. So if we include a
>>>>> u16 array, then this format can also capture data from IBS, since it
>>>>> provides a few latency details.
>>>> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
>>>> struct presented in this patchset.
>>>>
>>>> IBS Ops can report e.g.:
>>>>
>>>> 15 tag-to-retire cycles bits,
>>>> 15 completion to retire count bits,
>>>> 15 L1 DTLB refill latency bits,
>>>> 15 DC miss latency bits,
>>>> 5 outstanding memory requests on mem refill bits, and so on.
>>>>
>>>> IBS Fetch reports 15 bits of fetch latency, and another 16
>>>> for iTLB latency, among others.
>>>>
>>>> Some of these may/may not be valid simultaneously, and
>>>> there are IBS specific rules to establish validity.
>>>>
>>>>>> My main point there, however, was that each vendor should
>>>>>> use streamlined record-level code to just copy the data
>>>>>> in the proprietary format that their hardware produces,
>>>>>> and then perf tooling can synthesize the events
>>>>>> from the raw data at report/script/etc. time.
>>>>>>
>>>>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>>>>>> large volume of data needs to be captured as part of perf.data without
>>>>>>> frequent PMIs. But proposed type is to address the capture of pipeline
>>>>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>>>>> PMIs are, even though it may be used in those environments.
>>>>>>
>>>>>>> information on each sample using PMI at periodic intervals. Hence
>>>>>>> proposing PERF_SAMPLE_PIPELINE_HAZ.
>>>>>> And that's fine for any extra bits that POWER9 has to convey
>>>>>> to its users beyond things already represented by other sample
>>>>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>>>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>>>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>>>>> what IBS currently uses.
>>>>> My bad. Not sure what you mean by this. We are trying to abstract
>>>>> as much vendor specific data as possible with this (like perf-mem).
>>>> Perhaps if I say it this way: instead of doing all the
>>>> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
>>>> in patch 4/11, rather/instead just put the raw sier value in a
>>>> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
>>>> Specific SIER capabilities can be written as part of the perf.data
>>>> header.  Then synthesize the true pipe events from the raw SIER
>>>> values later, and in userspace.
>>> Hi Kim,
>>>
>>> I would like to stay away from the SAMPLE_RAW type because of these
>>> comments in perf_event.h:
>>>
>>> *      #
>>> *      # The RAW record below is opaque data wrt the ABI
>>> *      #
>>> *      # That is, the ABI doesn't make any promises wrt to
>>> *      # the stability of its content, it may vary depending
>>> *      # on event, hardware, kernel version and phase of
>>> *      # the moon.
>>> *      #
>>> *      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>>> *      #
>> The "it may vary depending on ... hardware" clause makes it sound
>> appropriate for the use-case where the raw hardware register contents
>> are copied directly into the user buffer.
>
>
> Hi Kim,
>
> Sorry for the delayed response.
>
> But the perf tool side needs infrastructure to handle the raw sample
> data from cpu-pmu (used by tracepoints). I am not sure whether
> this is the approach we should look at here.
>
> peterz any comments?
>
>>
>>> Secondly, sorry I didn't understand your suggestion about using
>>> PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX will go to AUX ring buffer, which is
>>> more memory and more challenging when correlating and presenting the
>>> pipeline details for each IP. IMO, having a new sample type can be
>>> useful to capture the pipeline data both in perf_sample_data and if
>>> _AUX is enabled, can be made to push to AUX buffer.
>> OK, I didn't think SAMPLE_AUX and the aux ring buffer were
>> interdependent, sorry.
>>
>> Thanks,
>>
>> Kim
>

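For reference, a minimal sketch of the kind of tool-side handling asked
about above, assuming the kernel places one raw 64-bit register image
(for instance the raw SIER value suggested earlier in the thread) in the
PERF_SAMPLE_RAW payload. The { u32 size; char data[size]; } layout is
the one documented for PERF_SAMPLE_RAW in perf_event.h. The helper name
raw_sample_read_u64 and the single-u64 payload are hypothetical and are
not part of this patch series or of perf; host byte order is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read one 64-bit register image out of the
 * PERF_SAMPLE_RAW part of a sample, laid out as
 *     { u32 size; char data[size]; }
 * Returns 0 on success, -1 if the buffer or payload is too short.
 */
static int raw_sample_read_u64(const unsigned char *buf, size_t len,
			       uint64_t *out)
{
	uint32_t size;

	assert(buf != NULL && out != NULL);

	/* The payload starts with its own length. */
	if (len < sizeof(size))
		return -1;
	memcpy(&size, buf, sizeof(size));

	/* The payload must hold a full u64 and fit inside the buffer. */
	if (size < sizeof(*out) || len - sizeof(size) < size)
		return -1;

	/* memcpy avoids unaligned access on the packed sample data. */
	memcpy(out, buf + sizeof(size), sizeof(*out));
	return 0;
}
```

Decoding the register bits into stage/reason strings would then happen
at report/script time, keyed on the capabilities recorded in the
perf.data header, as proposed in the quoted discussion.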

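As a side note on the "6 x u8's + 1 u16 padded struct" remark quoted
above, the sketch below mirrors the proposed struct in userspace and
checks that it occupies exactly 8 bytes. uint8_t/uint16_t stand in for
the kernel's __u8/__u16, and haz_data_pack is a hypothetical helper
used only to illustrate that one record fits in a single 64-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the struct proposed in the RFC cover letter. */
struct perf_pipeline_haz_data {
	uint8_t  itype;		/* Instruction/Opcode type */
	uint8_t  icache;	/* Instruction Cache source */
	uint8_t  hazard_stage;	/* Stage in which the hazard occurred */
	uint8_t  hazard_reason;	/* Hazard reason */
	uint8_t  stall_stage;	/* Stage in which the stall occurred */
	uint8_t  stall_reason;	/* Stall reason */
	uint16_t pad;
};

/* Six u8 fields plus one u16: 8 bytes, with pad at offset 6. */
static_assert(sizeof(struct perf_pipeline_haz_data) == 8,
	      "hazard record should be 8 bytes");
static_assert(offsetof(struct perf_pipeline_haz_data, pad) == 6,
	      "pad should follow the six u8 fields");

/* Hypothetical helper: copy one record into a single 64-bit word. */
static uint64_t haz_data_pack(const struct perf_pipeline_haz_data *d)
{
	uint64_t v;

	memcpy(&v, d, sizeof(v));
	return v;
}
```

This leaves no room for the 15-bit and 16-bit latency fields listed for
IBS, which is why a u16 array was floated as an extension.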