From: Jie Zhan <zhanjie9@hisilicon.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: <will@kernel.org>, <mark.rutland@arm.com>,
	<mathieu.poirier@linaro.org>, <suzuki.poulose@arm.com>,
	<mike.leach@linaro.org>, <leo.yan@linaro.org>,
	<john.g.garry@oracle.com>, <james.clark@arm.com>,
	<peterz@infradead.org>, <mingo@redhat.com>, <acme@kernel.org>,
	<corbet@lwn.net>, <shenyang39@huawei.com>, <hejunhao3@huawei.com>,
	<yangyicong@hisilicon.com>, <prime.zeng@huawei.com>,
	<suntao25@huawei.com>, <jiazhao4@hisilicon.com>,
	<linuxarm@huawei.com>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-perf-users@vger.kernel.org>
Subject: Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
Date: Sat, 25 Mar 2023 10:48:29 +0800
Message-ID: <26103329-9d00-226f-6b85-386766814618@hisilicon.com>
In-Reply-To: <20230324121431.000034c4@Huawei.com>



On 24/03/2023 20:14, Jonathan Cameron wrote:
> On Fri, 24 Mar 2023 17:32:15 +0800
> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
>> On 17/03/2023 21:37, Jonathan Cameron wrote:
>>> On Mon, 6 Feb 2023 14:51:43 +0800
>>> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>>>   
>>>> Document the overview and usage of HiSilicon PMCU.
>>>>
>>>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>>> PMU accesses from CPUs, handling the configuration, event switching, and
>>>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>>>> scheme may lose events or drop sampling frequency. With PMCU, users can
>>>> reliably obtain the data of up to 240 PMU events with the sample interval
>>>> of events down to 1ms, while the software overhead of accessing PMUs, as
>>>> well as its impact on target workloads, is reduced.
>>>>
>>>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
>>> Nice documentation. I've read this a few times before, but on this read
>>> through wondered if we could say anything about the skew between capture
>>> of the counters.  Not that important though so I'm happy to add
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>
>>> though this may of course need updating significantly as the interface
>>> is refined (the RFC question you raised for example in the cover letter).
>>>
>>> Thanks
>>>
>>> Jonathan
>>>   
>>>> ---
>>>>    Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>>>>    Documentation/admin-guide/perf/index.rst     |   1 +
>>>>    2 files changed, 184 insertions(+)
>>>>    create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>>>
>>>> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
>>>> new file mode 100644
>>>> index 000000000000..50d17cbd0049
>>>> --- /dev/null
>>>> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
>>>> @@ -0,0 +1,183 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0
>>>> +
>>>> +==========================================
>>>> +HiSilicon Performance Monitor Control Unit
>>>> +==========================================
>>>> +
>>>> +Introduction
>>>> +============
>>>> +
>>>> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>>> +PMU accesses from CPUs, handling the configuration, event switching, and
>>>> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>>> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
>>>> +scheme may lose events or drop sampling frequency. With PMCU, users can
>>>> +reliably obtain the data of up to 240 PMU events with the sample interval
>>>> +of events down to 1ms, while the software overhead of accessing PMUs, as
>>>> +well as its impact on target workloads, is reduced.
>>>> +
>>>> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
>>>> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
>>>> +CPU die ID. When triggered, PMCU reads event IDs and passes them to PMUs in all
>>>> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
>>>> +events, waits for a time interval, and stops them. The PMU counter readings are
>>>> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
>>>> +the ``perf.data`` file in the user space. PMCU automatically switches events
>>>> +(when there are more events than available PMU counters) and completes multiple
>>>> +rounds of PMU event counting in one trigger.
>>>> +
>>>> +Hardware overview
>>>> +=================
>>>> +
>>>> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
>>>> +assistant to access the core PMUs on its die and move the counter readings to
>>>> +memory. An overview of PMCU's hardware organization is shown below::
>>>> +
>>>> +                                +--------------------+
>>>> +                                |       Memory       |
>>>> +                                | +------+ +-------+ |
>>>> +                   +--------+   | |Events| |Samples| |
>>>> +                   |  PMCU  |   | +------+ +-------+ |
>>>> +                   +---|----+   +---------|----------+
>>>> +                       |                  |
>>>> +        =======================================================  Bus
>>>> +                   |                         |               |
>>>> +        +----------|----------+   +----------|----------+    |
>>>> +        | +------+ | +------+ |   | +------+ | +------+ |    |
>>>> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
>>>> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
>>>> +        |    +-----+----+     |   |    +-----+----+     |  clusters
>>>> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
>>>> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
>>>> +        | +------+   +------+ |   | +------+   +------+ |
>>>> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
>>>> +        +---------------------+   +---------------------+
>>>> +
>>>> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
>>>> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
>>>> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
>>>> +events for a while, and move the counter readings back to memory.
>>>> +
>>>> +Once triggered, PMCU performs a number of loops and processes a number of
>>>> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
>>>> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
>>>> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
>>>> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
>>>> +stops all the counters, and moves the counter readings to memory, before
>>>> +handling the next ``nr_pmu`` events if there are more events to process in this
>>>> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
>>>> +the number of events to process depends on user inputs. The counters are
>>>> +stopped when PMCU reads counters and switches events, so there is a tiny time
>>>> +window during which the events are not counted.
>>> I'm not clear from this description whether there is 'skew' between the counters
>>> (beyond the normal issues from uarch).  Does the PMCU stop all counters
>>> then read them all (minimizing skew) or does it stop each CPU's set of counters
>>> and read those, or stop each individual counter before reading?
>>>
>>> My impression is that this feature is meant to be left running over timescales
>>> much longer than the sampling period so it may not be necessary to align the
>>> different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.
>>>   
>> Thanks for pointing this out.
>>
>> The PMCU stops all the counters before reading any counters (i.e. the
>> first case you said).
>>
>> The basic procedure is:
>>       start counters -> wait -> stop counters -> read and reset counters
>>       -> switch events -> start counters -> ...
>> where each step applies to all CPUs and counters.
> Great. So this is across all cores on a die so skew should be minimized
> (at a cost of missing more events than a skew heavy approach).
>
>> The counters don't count during the tiny stop-start window.
>> I guess a small improvement would be: reset -> read -> switch -> reset
>> -> ..., while the counters keep running, but we would still lose some
>> event counts between read and reset, so there is no fundamental
>> difference.
> Lots of ways to reduce both skew and missed counts, but I think you are
> right in that none of them matter for the intended long term monitoring
> use case.
>
> Jonathan
Yeah, it focuses more on general workload characteristics than on
time-sensitive and precise program analysis.
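
For illustration, the procedure discussed above could be modelled roughly
as below. This is only a sketch under assumptions: `batches`,
`simulate_trigger`, the step labels, and the example event values are made
up for the example and are not the driver or hardware interface. It only
shows the ordering of steps when there are more events than `nr_pmu`
counters per CPU.

```python
from typing import Iterator, List, Sequence, Tuple


def batches(events: Sequence[int], nr_pmu: int) -> Iterator[List[int]]:
    """Split the event list into groups of at most nr_pmu events."""
    if nr_pmu <= 0:
        raise ValueError("nr_pmu must be positive")
    for i in range(0, len(events), nr_pmu):
        yield list(events[i:i + nr_pmu])


def simulate_trigger(
    events: Sequence[int], nr_pmu: int, nr_loops: int
) -> List[Tuple[str, List[int]]]:
    """Return the ordered steps of one trigger as (step, events) pairs.

    Each step is assumed to apply to all CPUs and counters on the die at
    once, so the counters are idle between "stop" and the next "start".
    """
    steps: List[Tuple[str, List[int]]] = []
    for _ in range(nr_loops):
        for group in batches(events, nr_pmu):
            steps.append(("switch events", group))
            steps.append(("start counters", group))
            steps.append(("wait", group))
            steps.append(("stop counters", group))
            steps.append(("read and reset counters", group))
    return steps


if __name__ == "__main__":
    # Arbitrary example event values, two counters per CPU, one loop.
    for step, group in simulate_trigger([1, 2, 3, 4, 5], nr_pmu=2, nr_loops=1):
        print(f"{step:<24} {group}")
```

With five events and `nr_pmu` = 2, the events are handled in three groups
per loop, and the uncounted window is the gap between each "stop counters"
and the following "start counters".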

Jie
