* [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit
From: Jie Zhan @ 2023-02-06  6:51 UTC
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	zhanjie9, suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
PMU accesses from CPUs, handling the configuration, event switching, and
counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained,
multi-event CPU profiling, a scenario in which the current 'perf' scheme
may lose events or reduce its sampling frequency. With PMCU, users can
reliably collect data for up to 240 PMU events at sample intervals down to
1 ms, while reducing the software overhead of PMU accesses and their
impact on target workloads.

This patchset contains the documentation, driver, and perf tool support
needed to use PMCU through the 'perf_event' framework.

Comments are requested on two key questions:

- How do we make it compatible with arm_pmu drivers?

  Hardware-wise, PMCU uses the existing core PMUs, so the PMUs can be
  accessed by the CPU and PMCU simultaneously. The current hardware can't
  guarantee mutually exclusive access. Hence, scheduling arm_pmu and PMCU
  events at the same time may disrupt the operation of the PMUs and deliver
  incorrect data for both kinds of events, e.g. unexpected events or sample
  periods. Software-wise, we probably need to prevent the two types of
  events from running at the same time, but currently there isn't a clear
  solution.

- Currently we rely on a sysfs file for users to input event numbers. Is
  there a better way to pass many events?

  The perf framework only allows three 64-bit config fields for custom PMU
  configs, which can't satisfy our need to pass many events at a time. As
  an event number is 16 bits wide, the config fields can take at most 12
  events at a time, or at most 192 events if used as a bitmap of events
  (and there are more than 192 available event numbers). Hence, the
  current design takes an array of event numbers from a sysfs file before
  profiling starts. However, this may go against the common way of
  scheduling perf events through perf commands.
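
  The capacity figures above can be checked with a short calculation. The
  Python sketch below is illustrative only; the constant and function names
  are made up for this note and do not appear in the patchset.

```python
# Illustrative arithmetic only; all names here are hypothetical.
CONFIG_FIELDS = 3        # custom 64-bit config fields, as stated above
CONFIG_FIELD_BITS = 64
EVENT_ID_BITS = 16       # width of one event number
MAX_PMCU_EVENTS = 240    # events PMCU can take, per the cover letter


def packed_event_capacity() -> int:
    """Event numbers that fit when each one is stored as a 16-bit value."""
    return CONFIG_FIELDS * CONFIG_FIELD_BITS // EVENT_ID_BITS


def bitmap_event_capacity() -> int:
    """Distinct event numbers addressable when each bit marks one event."""
    return CONFIG_FIELDS * CONFIG_FIELD_BITS


print(packed_event_capacity())                     # 12
print(bitmap_event_capacity())                     # 192
print(MAX_PMCU_EVENTS > bitmap_event_capacity())   # True
```

  Either encoding falls short of the 240 events PMCU can handle, which is
  the motivation for the sysfs array.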

Jie Zhan (4):
  docs: perf: Add documentation for HiSilicon PMCU
  drivers/perf: hisi: Add driver support for HiSilicon PMCU
  perf tool: Add HiSilicon PMCU data recording support
  perf tool: Add HiSilicon PMCU data decoding support

 Documentation/admin-guide/perf/hisi-pmcu.rst |  183 +++
 Documentation/admin-guide/perf/index.rst     |    1 +
 drivers/perf/hisilicon/Kconfig               |   15 +
 drivers/perf/hisilicon/Makefile              |    1 +
 drivers/perf/hisilicon/hisi_pmcu.c           | 1096 ++++++++++++++++++
 tools/perf/arch/arm/util/auxtrace.c          |   61 +
 tools/perf/arch/arm64/util/Build             |    2 +-
 tools/perf/arch/arm64/util/hisi-pmcu.c       |  145 +++
 tools/perf/util/Build                        |    1 +
 tools/perf/util/auxtrace.c                   |    4 +
 tools/perf/util/auxtrace.h                   |    1 +
 tools/perf/util/hisi-pmcu.c                  |  305 +++++
 tools/perf/util/hisi-pmcu.h                  |   19 +
 13 files changed, 1833 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
 create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
 create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
 create mode 100644 tools/perf/util/hisi-pmcu.c
 create mode 100644 tools/perf/util/hisi-pmcu.h


base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
-- 
2.30.0


* [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
From: Jie Zhan @ 2023-02-06  6:51 UTC
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	zhanjie9, suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

Document the overview and usage of HiSilicon PMCU.

HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
PMU accesses from CPUs, handling the configuration, event switching, and
counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained,
multi-event CPU profiling, a scenario in which the current 'perf' scheme
may lose events or reduce its sampling frequency. With PMCU, users can
reliably collect data for up to 240 PMU events at sample intervals down to
1 ms, while reducing the software overhead of PMU accesses and their
impact on target workloads.

Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
 Documentation/admin-guide/perf/index.rst     |   1 +
 2 files changed, 184 insertions(+)
 create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
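
Note on the usage steps documented in the file below: the 'user_events'
sysfs file takes whitespace-separated hexadecimal event IDs, and at most 240
of them are used. A minimal Python sketch of building such a string follows.
The helper name is hypothetical, the 16-bit range check relies on the event
number width given in the cover letter, and truncating to 240 entries on the
user side merely mirrors the documented behaviour of ignoring the excess.

```python
# Hypothetical helper, not part of the patchset.
MAX_EVENTS = 240  # limit stated in the documentation


def format_user_events(event_ids):
    """Return the whitespace-separated hex string expected by user_events."""
    for event_id in event_ids:
        if not 0 <= event_id <= 0xFFFF:  # event numbers are 16 bits wide
            raise ValueError(f"event ID out of 16-bit range: {event_id:#x}")
    # Events beyond the limit would be ignored anyway, so drop them here.
    return " ".join(f"{event_id:#x}" for event_id in event_ids[:MAX_EVENTS])


print(format_user_events([0x10, 0x11]))  # 0x10 0x11
```

The resulting string can be written to the 'user_events' file as root, in
the same way as the echo command shown in the documentation.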

diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
new file mode 100644
index 000000000000..50d17cbd0049
--- /dev/null
+++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+HiSilicon Performance Monitor Control Unit
+==========================================
+
+Introduction
+============
+
+HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
+PMU accesses from CPUs, handling the configuration, event switching, and
+counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained,
+multi-event CPU profiling, a scenario in which the current ``perf`` scheme
+may lose events or reduce its sampling frequency. With PMCU, users can
+reliably collect data for up to 240 PMU events at sample intervals down to
+1 ms, while reducing the software overhead of PMU accesses and their
+impact on target workloads.
+
+Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
+PMU device named ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding CPU
+die ID. When triggered, PMCU reads event IDs and passes them to the PMUs of all
+CPUs on its CPU die. PMCU then starts the counters for counting events, waits
+for a time interval, and stops them. The PMU counter readings are dumped from
+hardware to memory, i.e. perf AUX buffers, and further copied to the
+``perf.data`` file in user space. PMCU automatically switches events (when
+there are more events than available PMU counters) and completes multiple
+rounds of PMU event counting in one trigger.
+
+Hardware overview
+=================
+
+On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
+assistant to access the core PMUs on its die and move the counter readings to
+memory. An overview of PMCU's hardware organization is shown below::
+
+                                +--------------------+
+                                |       Memory       |
+                                | +------+ +-------+ |
+                   +--------+   | |Events| |Samples| |
+                   |  PMCU  |   | +------+ +-------+ |
+                   +---|----+   +---------|----------+
+                       |                  |
+        =======================================================  Bus
+                   |                         |               |
+        +----------|----------+   +----------|----------+    |
+        | +------+ | +------+ |   | +------+ | +------+ |    |
+        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
+        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
+        |    +-----+----+     |   |    +-----+----+     |  clusters
+        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
+        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
+        | +------+   +------+ |   | +------+   +------+ |
+        |    CPU Cluster 0    |   |    CPU Cluster 1    |
+        +---------------------+   +---------------------+
+
+On Kunpeng SoC, a CPU die consists of several CPU clusters, each containing
+several CPUs. PMCU is able to access the core PMUs in these CPUs. The main job
+of PMCU is to fetch PMU event IDs from memory, make the PMUs count the events
+for a while, and move the counter readings back to memory.
+
+Once triggered, PMCU performs a number of loops and processes a number of
+events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
+``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
+``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
+where PMCU resides. Then, PMCU starts all the counters, waits for a period,
+stops all the counters, and moves the counter readings to memory, before
+handling the next ``nr_pmu`` events if there are more events to process in this
+loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
+the number of events to process depends on user inputs. The counters are
+stopped when PMCU reads counters and switches events, so there is a tiny time
+window during which the events are not counted.
+
+Usage
+=====
+
+The PMCU driver is designed to operate with the kernel perf_event framework,
+specifically with perf AUX trace buffer to dump sample data faster. User space
+usage of PMCU is supported through the 'perf' tool and root access is required.
+
+Steps:
+
+1. Write PMU event IDs to PMCU's ``sysfs`` event interface. The event IDs should
+   be hexadecimal and separated by whitespace.
+
+   An example command can be::
+
+        echo "0x10 0x11" > /sys/devices/hisi_pmcu_sccl3/user_events
+
+   Alternatively, users can directly write the ``user_events`` file with a text
+   editor.
+
+   Please note that:
+
+   - As PMCU passes event IDs to core PMUs, any event IDs supported by the core
+     PMU are acceptable.
+   - Users can enter up to 240 events; any events beyond that are ignored.
+   - The event IDs remain unchanged until the next update of the file, such that
+     users do not have to enter the event IDs every time before issuing a
+     ``perf-record`` command for the same events.
+
+2. Profiling with ``perf-record``.
+
+   The command to start the sampling is::
+
+        perf record -e hisi_pmcu_sccl3/<configs>/
+
+   Users can pass the following optional parameters to ``<configs>``:
+
+   - nr_sample: number of samples to take. This defaults to 128.
+   - sample_period_ms: time interval in milliseconds for PMU counters to keep
+     counting for each event. This defaults to 3, i.e. 3 ms, and its max
+     value is 85,899, i.e. about 85.9 seconds.
+   - pmccfiltr: bits 31-24 of the sysreg PMCCFILTR_EL0, which controls how the
+     cycle counter increments. This defaults to 0x00. Please refer to the
+     "Performance Monitors external register descriptions" of *Arm Architecture
+     Reference Manual for A-profile architecture* on how to configure
+     PMCCFILTR_EL0.
+
+   An example command can be::
+
+        perf record -e hisi_pmcu_sccl3/nr_sample=1000,sample_period_ms=1000/
+
+3. Obtain the sample data
+
+   When the ``perf-record`` command finishes, data will be stored in the AUX
+   area of ``perf.data``. The data can be viewed with ``perf-report`` or
+   ``perf-script`` with the ``-D`` dump trace option, e.g.::
+
+        perf report -D
+
+   Users may search for the keyword ``HISI PMCU`` to navigate to the PMCU data
+   section.
+
+   PMCU samples are arranged in the following format::
+
+        +------------+  +- +--------+  +- +-----------+  +- +------------+
+        |AUX buffer 0|->|  |Sample 1|->|  |Subsample 1|->|  |CID1SR      |--+
+        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
+        |AUX buffer 1|  |  |Sample 2|  |  |Subsample 2|  |  |CID2SR      |  |
+        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
+        |...         |  |  |...     |  |  |...        |  |  |Event 0     |  |
+        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
+                        |  |  Gap   |  |  |Subsample N|  |  |Event 1     |  |
+                        +- +--------+  +- +-----------+  |  +------------+  |
+                                                         |  |...         |  |
+                                                         |  +------------+  |
+                                                         |  |Event nr_pmu|  |
+                                                         |  +------------+  |
+                                                         |  |Cycle count |  |
+                                                         +- +------------+  |
+        +-------------------------------------------------------------------+
+        |  +- +------------------+  +- +---------+
+        +->|  |CPU 0 in a cluster|->|  |Cluster 0|
+           |  +------------------+  |  +---------+
+           |  |CPU 1 in a cluster|  |  |Cluster 1|
+           |  +------------------+  |  +---------+
+           |  |CPU 2 in a cluster|  |  |Cluster 2|
+           |  +------------------+  |  +---------+
+           |  |...               |  |  |...      |
+           +- +------------------+  +- +---------+
+
+   The data may contain one or more AUX buffers. An AUX buffer contains many
+   samples, and may leave a gap at the buffer tail where there is not enough
+   space for a complete sample. The number of samples in all AUX buffers sums
+   up to the ``nr_sample`` parameter passed to the ``perf-record`` command.
+
+   A sample contains the events entered in the ``user_events`` sysfs file. A
+   sample may consist of multiple subsamples if the number of events is more
+   than the number of PMU counters used, i.e. ``nr_pmu``. The number of
+   subsamples in a sample, ``N``, equals the number of events divided by
+   ``nr_pmu``, rounded up.
+
+   A subsample consists of data fields of CID1SR, CID2SR, ``nr_pmu`` event
+   counter readings, and a cycle counter reading. CID1SR and CID2SR are copies
+   of PMCID1SR and PMCID2SR taken on capture of the event counters, which
+   reflect the process ID, provided that the kernel configuration option
+   ``CONFIG_PID_IN_CONTEXTIDR`` is enabled. The size of CID1SR or CID2SR is 4
+   bytes, whereas the size of an event or cycle count is 8 bytes. A data field
+   has the data from all CPUs. The order of CPUs in a data field is 'CPU ID in
+   a cluster' -> 'cluster ID'. For example, a CPU die with 32 CPUs in 4
+   clusters (8 CPUs per cluster) has the data field ordered as::
+
+       CPU [0,8,16,24],[1,9,17,25],[2,10,18,26],...,[7,15,23,31]
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 793e1970bc05..f132838145f9 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -8,6 +8,7 @@ Performance monitor support
    :maxdepth: 1
 
    hisi-pmu
+   hisi-pmcu
    hisi-pcie-pmu
    hns3-pmu
    imx-ddr
-- 
2.30.0
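
The sample layout described in the documentation above can be sketched in a
few lines of Python. This is an illustration under stated assumptions, not
the decoder from this patchset: the function names are hypothetical, the
value nr_pmu=8 in the usage lines is an arbitrary example, and the byte-size
calculation assumes that each data field holds one value per CPU with no
padding, which the text does not state explicitly.

```python
# Hypothetical helpers illustrating the documented layout; not the perf decoder.
CID_BYTES = 4       # size of a CID1SR or CID2SR value
COUNTER_BYTES = 8   # size of an event or cycle counter value


def subsamples_per_sample(nr_events: int, nr_pmu: int) -> int:
    """N = number of events divided by nr_pmu, rounded up."""
    return (nr_events + nr_pmu - 1) // nr_pmu


def cpu_order(nr_clusters: int, cpus_per_cluster: int) -> list:
    """CPU IDs in data-field order: CPU ID in a cluster first, then cluster ID."""
    return [cluster * cpus_per_cluster + cpu
            for cpu in range(cpus_per_cluster)
            for cluster in range(nr_clusters)]


def subsample_bytes(nr_cpus: int, nr_pmu: int) -> int:
    """Size of one subsample, assuming one value per CPU per field, no padding."""
    per_cpu = 2 * CID_BYTES + (nr_pmu + 1) * COUNTER_BYTES  # +1: cycle count
    return nr_cpus * per_cpu


print(cpu_order(4, 8)[:8])             # [0, 8, 16, 24, 1, 9, 17, 25]
print(subsamples_per_sample(240, 8))   # 30 (nr_pmu=8 is an example value)
```

The first line of output reproduces the start of the CPU ordering given in
the documentation for a die with 4 clusters of 8 CPUs.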


* [RFC PATCH v1 2/4] drivers/perf: hisi: Add driver support for HiSilicon PMCU
From: Jie Zhan @ 2023-02-06  6:51 UTC
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	zhanjie9, suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
PMU accesses from CPUs, handling the configuration, event switching, and
counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
and multi-PMU-event CPU profiling, in which scenario the current 'perf'
scheme may lose events or drop sampling frequency. With PMCU, users can
reliably obtain the data of up to 240 PMU events with the sample interval
of events down to 1ms, while the software overhead of accessing PMUs, as
well as its impact on target workloads, is reduced.

This driver enables the usage of PMCU through the perf_event framework.
PMCU is registered as a PMU device and utilises the AUX buffer to dump data
directly. Users can start PMCU sampling through 'perf-record'. Event
numbers are passed by a sysfs interface.

Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
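Reviewer-style note on the perf_event_attr encoding, as a minimal
userspace sketch. It assumes only what the format attributes and
hisi_pmcu_pmu_event_init() in this patch state: nr_sample sits in
config[31:0], sample_period_ms in config[63:32], a period of 0 selects
the default wait count (0x249F0, i.e. 150000 counts or 3 ms at 50000
counts per ms), and periods above U32_MAX / 50000 ms are rejected. The
helper names pmcu_config() and pmcu_wait_cnt() are hypothetical and do
not exist in the driver.

```c
#include <stdint.h>

#define MS_TO_WAIT_CNT		50000u		/* HISI_PMCU_PERF_MS_TO_WAIT_CNT */
#define WAIT_CNT_DEFAULT	0x249F0u	/* HISI_PMCU_WAIT_CNT_DEFAULT */

/* Hypothetical helper: pack nr_sample and sample_period_ms into attr.config */
static uint64_t pmcu_config(uint32_t nr_sample, uint32_t sample_period_ms)
{
	return ((uint64_t)sample_period_ms << 32) | nr_sample;
}

/*
 * Hypothetical helper: mirror how the driver turns config[63:32] into the
 * value written to the WAIT_CNT register. Returns 0 on success, or -1 if
 * the period would overflow the 32-bit register.
 */
static int pmcu_wait_cnt(uint64_t config, uint32_t *wait_cnt)
{
	uint32_t ms = (uint32_t)(config >> 32);

	if (ms > UINT32_MAX / MS_TO_WAIT_CNT)
		return -1;

	*wait_cnt = ms ? ms * MS_TO_WAIT_CNT : WAIT_CNT_DEFAULT;
	return 0;
}
```

For instance, 128 samples at a 1 ms period would be encoded as
pmcu_config(128, 1), which the second helper maps to a wait count of
50000.
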
 drivers/perf/hisilicon/Kconfig     |   15 +
 drivers/perf/hisilicon/Makefile    |    1 +
 drivers/perf/hisilicon/hisi_pmcu.c | 1096 ++++++++++++++++++++++++++++
 3 files changed, 1112 insertions(+)
 create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
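
Reviewer-style note on the sample layout, as a minimal userspace sketch
of the padding and size arithmetic in hisi_pmcu_setup_events(). The
struct and function names (struct pmcu_layout, pmcu_layout_init()) are
hypothetical. The nr_ev == 0 guard is an addition made for this sketch:
the driver code as posted computes nr_ev % nr_pmu without such a check,
which would be a modulo by zero if no events were written to
user_events.

```c
#include <stdint.h>

#define MAX_NR_PMU	8	/* HISI_PMCU_FSM_CFG_MAX_NR_PMU */
#define MAX_NR_PMU_C	5	/* HISI_PMCU_FSM_CFG_MAX_NR_PMU_C */

/* Hypothetical userspace-only view of the per-sample layout */
struct pmcu_layout {
	uint32_t nr_pmu;		/* events counted in parallel */
	uint32_t padding;		/* null events in the last subsample */
	uint32_t nr_ev_per_sample;	/* nr_ev + padding */
	uint32_t subsample_size;	/* bytes */
	uint32_t sample_size;		/* bytes */
};

/* Returns 0 on success, -1 if there are no events (guard assumed here) */
static int pmcu_layout_init(uint32_t nr_ev, uint32_t nr_cpu, int comp_mode,
			    struct pmcu_layout *l)
{
	uint32_t max_nr_pmu = comp_mode ? MAX_NR_PMU_C : MAX_NR_PMU;

	if (!nr_ev)
		return -1;

	l->nr_pmu = nr_ev < max_nr_pmu ? nr_ev : max_nr_pmu;
	l->padding = nr_ev % l->nr_pmu ? l->nr_pmu - nr_ev % l->nr_pmu : 0;
	l->nr_ev_per_sample = nr_ev + l->padding;

	/* Same formula as the driver: 8-byte slots per CPU */
	l->subsample_size = (l->nr_pmu + (comp_mode ? 1 : 2)) *
			    (uint32_t)sizeof(uint64_t) * nr_cpu;
	l->sample_size = l->nr_ev_per_sample / l->nr_pmu * l->subsample_size;

	return 0;
}
```

With 10 user events on a 32-CPU die and compatibility mode off, this
gives nr_pmu = 8, 6 padding events, two subsamples of 2560 bytes each,
and a 5120-byte sample.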

diff --git a/drivers/perf/hisilicon/Kconfig b/drivers/perf/hisilicon/Kconfig
index 171bfc1b6bc2..d7728fbe8519 100644
--- a/drivers/perf/hisilicon/Kconfig
+++ b/drivers/perf/hisilicon/Kconfig
@@ -24,3 +24,18 @@ config HNS3_PMU
 	  devices.
 	  Adds the HNS3 PMU into perf events system for monitoring latency,
 	  bandwidth etc.
+
+config HISI_PMCU
+	tristate "HiSilicon PMCU"
+	depends on ARM64 && PID_IN_CONTEXTIDR
+	help
+	  Support for HiSilicon Performance Monitor Control Unit (PMCU).
+	  The PMCU is a device that
+	  offloads PMU accesses from CPUs, handling the configuration, event
+	  switching, and counter reading of core PMUs on Kunpeng SoC. It
+	  facilitates fine-grained and multi-PMU-event CPU profiling, in which
+	  scenario the current 'perf' scheme may lose events or drop sampling
+	  frequency. With PMCU, users can reliably obtain the data of up to 240
+	  PMU events with the sample interval of events down to 1ms, while the
+	  software overhead of accessing PMUs, as well as its impact on target
+	  workloads, is reduced.
diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index 4d2c9abe3372..93e4e6f2816a 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o \
 
 obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
 obj-$(CONFIG_HNS3_PMU) += hns3_pmu.o
+obj-$(CONFIG_HISI_PMCU) += hisi_pmcu.o
diff --git a/drivers/perf/hisilicon/hisi_pmcu.c b/drivers/perf/hisilicon/hisi_pmcu.c
new file mode 100644
index 000000000000..6ec5d6c31e1f
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_pmcu.c
@@ -0,0 +1,1096 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * HiSilicon Performance Monitor Control Unit (PMCU) driver
+ *
+ * Copyright (C) 2022 HiSilicon Limited
+ * Author: Jie Zhan <zhanjie9@hisilicon.com>
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitfield.h>
+#include <linux/bits.h>
+#include <linux/cpumask.h>
+#include <linux/delay.h>
+#include <linux/dev_printk.h>
+#include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/errno.h>
+#include <linux/gfp_types.h>
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+#include <linux/mm_types.h>
+#include <linux/module.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/threads.h>
+#include <linux/vmalloc.h>
+
+#include <asm/cputype.h>
+#include <asm/sysreg.h>
+
+/* Registers */
+#define HISI_PMCU_REG_FSM_STATUS	0x0000
+#define HISI_PMCU_REG_FSM_CFG		0x0004
+#define HISI_PMCU_REG_EVENT_BASE_H	0x0008
+#define HISI_PMCU_REG_EVENT_BASE_L	0x000C
+#define HISI_PMCU_REG_KILL_BASE_H	0x0010
+#define HISI_PMCU_REG_KILL_BASE_L	0x0014
+#define HISI_PMCU_REG_STORE_BASE_H	0x0018
+#define HISI_PMCU_REG_STORE_BASE_L	0x001C
+#define HISI_PMCU_REG_WAIT_CNT		0x0020
+#define HISI_PMCU_REG_FSM_CTRL		0x0038
+#define HISI_PMCU_REG_FSM_BRK		0x003C
+#define HISI_PMCU_REG_COMP		0x0044
+#define HISI_PMCU_REG_INT_EN		0x0100
+#define HISI_PMCU_REG_INT_MSK		0x0104
+#define HISI_PMCU_REG_INT_STAT		0x0108
+#define HISI_PMCU_REG_INT_CLR		0x010C
+#define HISI_PMCU_REG_PMCR		0x0200
+#define HISI_PMCU_REG_PMCCFILTR		0x0204
+
+/* Register related configs */
+#define HISI_PMCU_FSM_CFG_EV_LEN_MSK	GENMASK(7, 0)
+#define HISI_PMCU_FSM_CFG_NR_LOOP_MSK	GENMASK(15, 8)
+#define HISI_PMCU_FSM_CFG_NR_PMU_MSK	GENMASK(19, 16)
+#define HISI_PMCU_FSM_CFG_MAX_EV_LEN	240
+#define HISI_PMCU_FSM_CFG_MAX_NR_LOOP	255
+#define HISI_PMCU_FSM_CFG_MAX_NR_PMU	8
+#define HISI_PMCU_FSM_CFG_MAX_NR_PMU_C	5
+#define HISI_PMCU_WAIT_CNT_DEFAULT	0x249F0
+#define HISI_PMCU_FSM_CTRL_TRIGGER	BIT(0)
+#define HISI_PMCU_FSM_BRK_BRK		BIT(0)
+#define HISI_PMCU_COMP_HPMN_THR		3
+#define HISI_PMCU_COMP_ENABLE		BIT(0)
+#define HISI_PMCU_INT_DONE		BIT(0)
+#define HISI_PMCU_INT_BRK		BIT(1)
+#define HISI_PMCU_INT_ALL		GENMASK(1, 0)
+#define HISI_PMCU_PMCR_DEFAULT		0xC1
+#define HISI_PMCU_PMCCFILTR_MSK		GENMASK(31, 24)
+
+/* User perf_event_attr configs */
+#define HISI_PMCU_PERF_ATTR_NR_SAMPLE		GENMASK(31, 0)
+#define HISI_PMCU_PERF_NR_SAMPLE_DEFAULT	0x80
+#define HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS	GENMASK(63, 32)
+#define HISI_PMCU_PERF_MS_TO_WAIT_CNT		50000
+#define HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS	(U32_MAX / \
+						 HISI_PMCU_PERF_MS_TO_WAIT_CNT)
+#define HISI_PMCU_PERF_ATTR_PMCCFILTR		GENMASK(7, 0)
+
+/* Others */
+#define HISI_PMCU_AUX_HEADER_ALIGN	0x10
+#define HISI_PMCU_BRK_DELAY_PERIOD_US	10
+#define HISI_PMCU_BRK_TIMEOUT_US	2000
+#define HISI_PMCU_DRV_NAME		"hisi-pmcu"
+#define NR_CPU_CLUSTER			8
+#define PMU_NULL_EVENT_ID		0xC000
+
+/**
+ * struct hisi_pmcu_sbuf - A single contiguous memory buffer
+ * @page:	starting page of this buffer
+ * @size:	size of this buffer
+ * @remain:	size of remaining space in this buffer
+ */
+struct hisi_pmcu_sbuf {
+	struct page *page;
+	u32 size;
+	u32 remain;
+};
+
+/**
+ * struct hisi_pmcu_buf - Management of multiple contiguous buffers
+ * @nr_buf:	number of buffers
+ * @cur_buf:	current working buffer
+ * @sbuf:	array of contiguous buffers
+ */
+struct hisi_pmcu_buf {
+	u32 nr_buf;
+	u32 cur_buf;
+	struct hisi_pmcu_sbuf sbuf[];
+};
+
+struct hisi_pmcu_auxtrace_header {
+	u32 buffer_size;
+	u32 nr_pmu;
+	u32 nr_cpu;
+	u32 comp_mode;
+	u32 subsample_size;
+	u32 nr_subsample_per_sample;
+	u32 nr_event;
+};
+
+/**
+ * struct hisi_pmcu_events - PMCU events and sampling configuration
+ * @nr_pmu:		number of core PMU counters that run in parallel
+ * @padding:		number of padding events in a sample
+ * @nr_ev:		number of events passed by users in a sample
+ * @nr_ev_per_sample:	number of events passed to hardware for a sample
+ *			This equals nr_ev + padding and should be evenly
+ *			divisible by nr_pmu.
+ * @max_sample_loop:	max number of samples that can be done in a loop
+ * @ev_len:		event length for hardware to read in a loop
+ * @nr_loop:		number of loops in one trigger
+ * @comp_mode:		compatibility mode
+ * @nr_sample:		number of samples that the current trigger takes
+ * @nr_pending_sample:	number of pending samples
+ * @subsample_size:	size of a subsample
+ * @sample_size:	size of a sample
+ * @output_size:	size of output from one trigger
+ * @sample_period:	sample period passed to hardware
+ * @nr_cpu:		number of hardware threads (logical CPUs)
+ * @events:		event IDs passed from users
+ */
+struct hisi_pmcu_events {
+	u8 nr_pmu;
+	u8 padding;
+	u8 nr_ev;
+	u8 nr_ev_per_sample;
+	u8 max_sample_loop;
+	u8 ev_len;
+	u8 nr_loop;
+	u8 comp_mode;
+	u32 nr_sample;
+	u32 nr_pending_sample;
+	u32 subsample_size;
+	u32 sample_size;
+	u32 output_size;
+	u32 sample_period;
+	u32 nr_cpu;
+	u32 events[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
+};
+
+enum hisi_pmcu_comp_mode {
+	HISI_PMCU_COMP_MODE_DISABLED,
+	HISI_PMCU_COMP_MODE_ENABLED,
+	HISI_PMCU_COMP_MODE_UNDEFINE,
+};
+
+/**
+ * struct hisi_pmcu_user_events - Data interacting with sysfs interface
+ * @nr_ev:	number of events written
+ * @ev:		event IDs
+ */
+struct hisi_pmcu_user_events {
+	u32 nr_ev;
+	u16 ev[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
+};
+
+/**
+ * struct hisi_pmcu - PMCU device data
+ * @pmu:	PMU device of this PMCU
+ * @dev:	device of this PMCU
+ * @regbase:	base IO address of registers
+ * @lock:	spinlock for serialising hardware operations
+ * @busy:	PMCU sampling running indicator
+ * @irq:	IRQ number
+ * @scclid:	CPU die (SCCL) ID where this PMCU is on
+ * @on_cpu:	CPU that handles perf_event and IRQ
+ * @cpus:	CPUs monitored by this PMCU
+ * @cpuhp_node:	CPU hotplug node
+ * @handle:	perf output handle for interacting with AUX buffers
+ * @ev:		PMCU events and sampling configuration
+ * @user_ev:	user events passed from sysfs
+ */
+struct hisi_pmcu {
+	struct pmu pmu;
+	struct device *dev;
+	void __iomem *regbase;
+	spinlock_t lock;
+	bool busy;
+	int irq;
+	int scclid;
+	int on_cpu;
+	cpumask_t cpus;
+	struct hlist_node cpuhp_node;
+	struct perf_output_handle handle;
+	struct hisi_pmcu_events ev;
+	struct hisi_pmcu_user_events user_ev;
+};
+
+#define to_hisi_pmcu(p) container_of(p, struct hisi_pmcu, pmu)
+
+static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr,
+						char *buf)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+
+	return sysfs_emit(buf, "%d\n", hisi_pmcu->on_cpu);
+}
+
+static DEVICE_ATTR_ADMIN_RO(cpumask);
+
+static struct attribute *hisi_pmcu_cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_cpumask_attr_group = {
+	.attrs = hisi_pmcu_cpumask_attrs,
+};
+
+PMU_FORMAT_ATTR(nr_sample, "config:0-31");
+PMU_FORMAT_ATTR(sample_period_ms, "config:32-63");
+PMU_FORMAT_ATTR(pmccfiltr, "config1:0-7");
+
+static struct attribute *hisi_pmcu_format_attrs[] = {
+	&format_attr_nr_sample.attr,
+	&format_attr_sample_period_ms.attr,
+	&format_attr_pmccfiltr.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_format_attr_group = {
+	.name = "format",
+	.attrs = hisi_pmcu_format_attrs,
+};
+
+static ssize_t monitored_cpus_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+
+	return sysfs_emit(buf, "%d-%d\n",
+			  cpumask_first(&hisi_pmcu->cpus),
+			  cpumask_last(&hisi_pmcu->cpus));
+}
+
+static DEVICE_ATTR_ADMIN_RO(monitored_cpus);
+
+static struct attribute *hisi_pmcu_monitored_cpus_attrs[] = {
+	&dev_attr_monitored_cpus.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_monitored_cpus_attr_group = {
+	.attrs = hisi_pmcu_monitored_cpus_attrs,
+};
+
+static ssize_t user_events_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
+	int at = 0;
+	int i;
+
+	for (i = 0; i < user_ev->nr_ev; i++)
+		at += sysfs_emit_at(buf, at, "0x%04x\n", user_ev->ev[i]);
+
+	return at;
+}
+
+static ssize_t user_events_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
+	u32 head = 0, tail = 0, nr_ev = 0;
+	char *line;
+	int err = 0;
+
+	line = kcalloc(count + 1, sizeof(*line), GFP_KERNEL);
+	if (!line)
+		return -ENOMEM;
+
+	while (nr_ev < HISI_PMCU_FSM_CFG_MAX_EV_LEN) {
+		while (head < count && isspace(buf[head]))
+			head++;
+		if (!isxdigit(buf[head]))
+			break;
+		tail = head + 1;
+
+		while (tail < count && isalnum(buf[tail]))
+			tail++;
+
+		strncpy(line, buf + head, tail - head);
+		line[tail - head] = '\0';
+		err = kstrtou16(line, 16, &user_ev->ev[nr_ev]);
+		if (err) {
+			nr_ev = 0;
+			break;
+		}
+		nr_ev++;
+		head = tail;
+	}
+	user_ev->nr_ev = nr_ev;
+	kfree(line);
+	return err ? err : count;
+}
+
+static DEVICE_ATTR_ADMIN_RW(user_events);
+
+static struct attribute *hisi_pmcu_user_events_attrs[] = {
+	&dev_attr_user_events.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_user_events_attr_group = {
+	.attrs = hisi_pmcu_user_events_attrs,
+};
+
+static const struct attribute_group *hisi_pmcu_attr_groups[] = {
+	&hisi_pmcu_cpumask_attr_group,
+	&hisi_pmcu_format_attr_group,
+	&hisi_pmcu_monitored_cpus_attr_group,
+	&hisi_pmcu_user_events_attr_group,
+	NULL
+};
+
+static int hisi_pmcu_pmu_event_init(struct perf_event *event)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
+	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
+	void __iomem *base = hisi_pmcu->regbase;
+	u64 cfg;
+	u32 val;
+
+	if (event->attr.type != hisi_pmcu->pmu.type)
+		return -ENOENT;
+
+	if (hisi_pmcu->busy)
+		return -EBUSY;
+
+	cfg = event->attr.config;
+
+	val = FIELD_GET(HISI_PMCU_PERF_ATTR_NR_SAMPLE, cfg);
+	ev->nr_pending_sample = val ? val : HISI_PMCU_PERF_NR_SAMPLE_DEFAULT;
+
+	val = FIELD_GET(HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS, cfg);
+	if (val > HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS) {
+		dev_err(hisi_pmcu->dev, "sample period too long (max=0x%x)\n",
+			HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS);
+		return -EINVAL;
+	}
+	ev->sample_period = val ? val * HISI_PMCU_PERF_MS_TO_WAIT_CNT :
+				  HISI_PMCU_WAIT_CNT_DEFAULT;
+
+	cfg = event->attr.config1;
+
+	val = FIELD_GET(HISI_PMCU_PERF_ATTR_PMCCFILTR, cfg);
+	val = FIELD_PREP(HISI_PMCU_PMCCFILTR_MSK, val);
+	writel(val, base + HISI_PMCU_REG_PMCCFILTR);
+
+	return 0;
+}
+
+static void *hisi_pmcu_pmu_setup_aux(struct perf_event *event, void **pages,
+				     int nr_pages, bool overwrite)
+{
+	int pg, nr_pg, nbuf;
+	struct hisi_pmcu_buf *buf;
+	struct page *page;
+
+	if (overwrite) {
+		dev_warn(event->pmu->dev, "Overwrite mode is not supported\n");
+		return NULL;
+	}
+
+	/* Count buffers */
+	nbuf = 0;
+	for (pg = 0; pg < nr_pages;) {
+		page = virt_to_page(pages[pg]);
+		pg += 1 << page_private(page);
+		nbuf++;
+	}
+
+	buf = kzalloc(struct_size(buf, sbuf, nbuf), GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	/* Set up buffers */
+	buf->nr_buf = nbuf;
+	buf->cur_buf = 0;
+	for (pg = 0, nbuf = 0; nbuf < buf->nr_buf; nbuf++) {
+		page = virt_to_page(pages[pg]);
+		nr_pg = 1 << page_private(page);
+		buf->sbuf[nbuf].page = page;
+		buf->sbuf[nbuf].size = nr_pg << PAGE_SHIFT;
+		buf->sbuf[nbuf].remain = nr_pg << PAGE_SHIFT;
+		pg += nr_pg;
+	}
+
+	return buf;
+}
+
+static void hisi_pmcu_pmu_free_aux(void *aux)
+{
+	kfree(aux);
+}
+
+static void hisi_pmcu_setup_events(struct hisi_pmcu_events *ev,
+				   struct hisi_pmcu_user_events *user_ev)
+{
+	u8 max_nr_pmu;
+	int i;
+
+	/* Copy events from user's sysfs interface */
+	ev->nr_ev = user_ev->nr_ev;
+	for (i = 0; i < ev->nr_ev; i++)
+		ev->events[i] = user_ev->ev[i];
+
+	/*
+	 * Set nr_pmu and pad events.
+	 *
+	 * PMCU takes nr_pmu events per "subsample", and nr_pmu is limited by
+	 * the number of available PMU counters (nr_pmu <= max_nr_pmu). If
+	 * nr_ev <= max_nr_pmu, we just set nr_pmu = ev->nr_ev and we do not
+	 * need to pad events.
+	 *
+	 * However, if nr_ev > max_nr_pmu, a "sample" of nr_ev events is
+	 * formed of multiple subsamples. In this case, we set nr_pmu =
+	 * max_nr_pmu and, if nr_ev % nr_pmu != 0, we pad null events, i.e.
+	 * reserved events that do not count, in the last subsample. Thus,
+	 * each subsample belongs to exactly one sample, making user space
+	 * data decoding easier.
+	 */
+	max_nr_pmu = ev->comp_mode ? HISI_PMCU_FSM_CFG_MAX_NR_PMU_C :
+				     HISI_PMCU_FSM_CFG_MAX_NR_PMU;
+
+	ev->nr_pmu = min(ev->nr_ev, max_nr_pmu);
+
+	ev->padding = ev->nr_ev % ev->nr_pmu ?
+		      ev->nr_pmu - ev->nr_ev % ev->nr_pmu : 0;
+
+	ev->nr_ev_per_sample = ev->nr_ev + ev->padding;
+
+	for (i = ev->nr_ev; i < ev->nr_ev_per_sample; i++)
+		ev->events[i] = PMU_NULL_EVENT_ID;
+
+	/*
+	 * Duplicate events in ev->events in case of needing many samples
+	 * (> MAX_NR_LOOP) in a trigger. See hisi_pmcu_config_sample().
+	 */
+	ev->max_sample_loop = HISI_PMCU_FSM_CFG_MAX_EV_LEN /
+			      ev->nr_ev_per_sample;
+	for (i = 1; i < ev->max_sample_loop; i++)
+		memcpy(ev->events + i * ev->nr_ev_per_sample,
+		       ev->events, ev->nr_ev_per_sample * sizeof(u32));
+
+	/* Update sample size */
+	ev->subsample_size = (ev->nr_pmu + (ev->comp_mode ? 1 : 2))
+			     * sizeof(u64) * ev->nr_cpu;
+	ev->sample_size = ev->nr_ev_per_sample / ev->nr_pmu
+			  * ev->subsample_size;
+}
+
+static int hisi_pmcu_config_sample(struct hisi_pmcu_events *ev, u32 buf_size)
+{
+	int nr_sample_loop, nr_max;
+
+	if (buf_size < ev->sample_size)
+		return 1;
+
+	/* Number of samples this buffer can hold, capped by pending samples */
+	nr_max = min(buf_size / ev->sample_size, ev->nr_pending_sample);
+
+	/*
+	 * Determine ev->ev_len and ev->nr_loop, update ev->nr_sample
+	 *
+	 * NOTE: We haven't implemented an algorithm to find a pair of
+	 * [nr_loop, nr_sample_loop] that exactly delivers nr_max samples.
+	 *
+	 * We use nr_loop to do multiple samples if nr_max <= MAX_NR_LOOP.
+	 * Otherwise, we utilise the duplicate events in the event buffer to
+	 * get more samples. If there are any pending samples not going to be
+	 * taken in this trigger, e.g. due to the limit of (max_sample_loop *
+	 * MAX_NR_LOOP) or the round down of division (nr_max / MAX_NR_LOOP),
+	 * they will be handled in the next trigger from ISR.
+	 */
+	if (nr_max <= HISI_PMCU_FSM_CFG_MAX_NR_LOOP) {
+		nr_sample_loop = 1;
+		ev->nr_loop = nr_max;
+		ev->nr_sample = ev->nr_loop;
+	} else {
+		nr_sample_loop = nr_max / HISI_PMCU_FSM_CFG_MAX_NR_LOOP;
+		if (nr_sample_loop > ev->max_sample_loop)
+			nr_sample_loop = ev->max_sample_loop;
+		ev->nr_loop = HISI_PMCU_FSM_CFG_MAX_NR_LOOP;
+		ev->nr_sample = nr_sample_loop * ev->nr_loop;
+	}
+
+	ev->ev_len = ev->nr_ev_per_sample * nr_sample_loop;
+
+	ev->output_size = ev->sample_size * ev->nr_sample;
+
+	return 0;
+}
+
+static void hisi_pmcu_hw_sample_start(struct hisi_pmcu *hisi_pmcu,
+				      struct hisi_pmcu_buf *buf)
+{
+	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
+	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
+	void __iomem *base = hisi_pmcu->regbase;
+	u64 addr, end;
+	u32 val;
+
+	/* FSM CFG */
+	val = FIELD_PREP(HISI_PMCU_FSM_CFG_EV_LEN_MSK, ev->ev_len);
+	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_LOOP_MSK, ev->nr_loop);
+	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_PMU_MSK, ev->nr_pmu);
+	writel(val, base + HISI_PMCU_REG_FSM_CFG);
+
+	/* Sample period */
+	writel(ev->sample_period, base + HISI_PMCU_REG_WAIT_CNT);
+
+	/* Event ID base */
+	addr = virt_to_phys(ev->events);
+	val = upper_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_EVENT_BASE_H);
+	val = lower_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_EVENT_BASE_L);
+
+	/* sbuf end */
+	end = page_to_phys(sbuf->page) + sbuf->size;
+
+	/* Data output address */
+	addr = end - sbuf->remain;
+	val = upper_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_STORE_BASE_H);
+	val = lower_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_STORE_BASE_L);
+
+	/* Stop data output if sbuf end is reached (abnormally) */
+	addr = end;
+	val = upper_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_KILL_BASE_H);
+	val = lower_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_KILL_BASE_L);
+
+	/* Trigger */
+	writel(HISI_PMCU_FSM_CTRL_TRIGGER, base + HISI_PMCU_REG_FSM_CTRL);
+}
+
+/*
+ * Break hardware sampling process and poll hisi_pmcu->busy. hisi_pmcu->busy
+ * will be cleared in ISR when hardware successfully handles the break request.
+ */
+static int hisi_pmcu_hw_sample_stop(struct hisi_pmcu *hisi_pmcu)
+{
+	ktime_t ddl;
+
+	writel(HISI_PMCU_FSM_BRK_BRK,
+	       hisi_pmcu->regbase + HISI_PMCU_REG_FSM_BRK);
+
+	ddl = ktime_add_us(ktime_get(), HISI_PMCU_BRK_TIMEOUT_US);
+
+	while (ktime_before(ktime_get(), ddl)) {
+		udelay(HISI_PMCU_BRK_DELAY_PERIOD_US);
+		if (!hisi_pmcu->busy)
+			return 0;
+	}
+
+	return -ETIMEDOUT;
+}
+
+static void hisi_pmcu_write_auxtrace_header(struct hisi_pmcu_events *ev,
+					    struct hisi_pmcu_buf *buf)
+{
+	struct hisi_pmcu_auxtrace_header header;
+	struct hisi_pmcu_sbuf *sbuf;
+	u32 *data;
+	u32 sz;
+
+	sbuf = &buf->sbuf[buf->cur_buf];
+
+	header.buffer_size = sbuf->size;
+	header.nr_pmu = ev->nr_pmu;
+	header.nr_cpu = ev->nr_cpu;
+	header.comp_mode = ev->comp_mode;
+	header.subsample_size = ev->subsample_size;
+	header.nr_subsample_per_sample = ev->nr_ev_per_sample / ev->nr_pmu;
+	header.nr_event = ev->nr_ev_per_sample;
+
+	data = page_to_virt(sbuf->page);
+	memcpy(data, &header, sizeof(header));
+	memcpy(data + sizeof(header) / sizeof(*data), ev->events,
+	       ev->nr_ev_per_sample * sizeof(u32));
+
+	sz = sizeof(header) + ev->nr_ev_per_sample * sizeof(u32);
+	sz = round_up(sz, HISI_PMCU_AUX_HEADER_ALIGN);
+
+	sbuf->remain -= sz;
+}
+
+static void hisi_pmcu_pmu_start(struct perf_event *event, int flags)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
+	struct perf_output_handle *handle = &hisi_pmcu->handle;
+	struct hw_perf_event *hwc = &event->hw;
+	struct hisi_pmcu_buf *buf;
+	int err;
+
+	spin_lock(&hisi_pmcu->lock);
+
+	if (hisi_pmcu->busy) {
+		dev_info(hisi_pmcu->dev,
+			 "Sampling is running, pmu->start() ignored\n");
+		goto out;
+	}
+
+	buf = perf_aux_output_begin(handle, event);
+	if (!buf) {
+		dev_err(hisi_pmcu->dev, "Failed to begin perf aux output\n");
+		goto out;
+	}
+
+	if (handle->head) {
+		dev_err(hisi_pmcu->dev, "got handle->head=0x%lx, should be 0\n",
+			handle->head);
+		goto out;
+	}
+
+	hisi_pmcu_setup_events(&hisi_pmcu->ev, &hisi_pmcu->user_ev);
+
+	hisi_pmcu_write_auxtrace_header(&hisi_pmcu->ev, buf);
+
+	err = hisi_pmcu_config_sample(&hisi_pmcu->ev,
+				      buf->sbuf[buf->cur_buf].remain);
+	if (err) {
+		dev_err(hisi_pmcu->dev,
+			"Failed to start sampling, buffer too small\n");
+		perf_aux_output_end(handle, 0);
+		goto out;
+	}
+
+	hisi_pmcu->busy = true;
+	hwc->state &= ~PERF_HES_STOPPED;
+
+	hisi_pmcu_hw_sample_start(hisi_pmcu, buf);
+
+	dev_dbg(hisi_pmcu->dev, "Sampling started\n");
+
+out:
+	spin_unlock(&hisi_pmcu->lock);
+}
+
+static void hisi_pmcu_pmu_stop(struct perf_event *event, int flags)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct perf_output_handle *handle;
+	struct hisi_pmcu_sbuf *sbuf;
+	struct hisi_pmcu_buf *buf;
+	int err;
+
+	spin_lock(&hisi_pmcu->lock);
+
+	handle = &hisi_pmcu->handle;
+
+	/* If PMCU is running, break it */
+	if (hisi_pmcu->busy) {
+		dev_info(hisi_pmcu->dev, "Stopping PMCU sampling\n");
+		err = hisi_pmcu_hw_sample_stop(hisi_pmcu);
+		if (err)
+			dev_err(hisi_pmcu->dev,
+				"Timed out for stopping PMCU!\n");
+	}
+
+	buf = perf_get_aux(handle);
+	sbuf = &buf->sbuf[buf->cur_buf];
+	perf_aux_output_end(handle, sbuf->size - sbuf->remain);
+
+	spin_unlock(&hisi_pmcu->lock);
+
+	hwc->state |= PERF_HES_STOPPED;
+	perf_event_update_userpage(event);
+}
+
+static int hisi_pmcu_pmu_add(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->state |= PERF_HES_STOPPED;
+
+	hisi_pmcu_pmu_start(event, flags);
+
+	if (hwc->state & PERF_HES_STOPPED)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void hisi_pmcu_pmu_del(struct perf_event *event, int flags)
+{
+	hisi_pmcu_pmu_stop(event, flags);
+}
+
+static int hisi_pmcu_init_data(struct platform_device *pdev,
+			       struct hisi_pmcu *hisi_pmcu)
+{
+	int ret;
+
+	hisi_pmcu->regbase = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(hisi_pmcu->regbase))
+		return dev_err_probe(&pdev->dev, PTR_ERR(hisi_pmcu->regbase),
+				     "Failed to map device register space\n");
+
+	ret = device_property_read_u32(&pdev->dev, "hisilicon,scl-id",
+				       &hisi_pmcu->scclid);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret,
+				     "Failed to read sccl-id!\n");
+
+	/*
+	 * Obtain the number of CPUs that contribute to the sample size.
+	 * NR_CPU_CLUSTER is now hard coded as the hardware accesses a certain
+	 * number of CPUs in a cluster regardless of how many CPUs are actually
+	 * implemented/available.
+	 */
+	ret = device_property_read_u32(&pdev->dev, "hisilicon,nr-cluster",
+				       &hisi_pmcu->ev.nr_cpu);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret,
+				     "Failed to read nr-cluster!\n");
+	hisi_pmcu->ev.nr_cpu *= NR_CPU_CLUSTER;
+
+	return 0;
+}
+
+static irqreturn_t hisi_pmcu_isr(int irq, void *data)
+{
+	struct hisi_pmcu *hisi_pmcu = data;
+	void __iomem *base = hisi_pmcu->regbase;
+	u32 irq_status;
+
+	irq_status = FIELD_GET(HISI_PMCU_INT_ALL,
+			       readl(base + HISI_PMCU_REG_INT_STAT));
+
+	if (!irq_status)
+		return IRQ_NONE;
+
+	if (irq_status & HISI_PMCU_INT_DONE) {
+		/*
+		 * Buffers and perf_output_handle should be up-to-date
+		 * for hisi_pmcu_pmu_stop() before exiting ISR
+		 */
+		struct perf_output_handle *handle = &hisi_pmcu->handle;
+		struct hisi_pmcu_buf *buf = perf_get_aux(handle);
+		struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
+		struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
+		int err;
+
+		spin_lock(&hisi_pmcu->lock);
+
+		ev->nr_pending_sample -= ev->nr_sample;
+		sbuf->remain -= ev->output_size;
+
+		if (!ev->nr_pending_sample) {
+			hisi_pmcu->busy = false;
+			dev_dbg(hisi_pmcu->dev, "Sampling finished\n");
+			goto skip;
+		}
+
+		err = hisi_pmcu_config_sample(ev, sbuf->remain);
+		if (err) {
+			/* This sbuf is full. Commit it, switch to the next. */
+			struct perf_event *event = handle->event;
+
+			perf_aux_output_end(handle, sbuf->size);
+
+			sbuf->remain = sbuf->size;
+
+			perf_aux_output_begin(handle, event);
+
+			if (++buf->cur_buf == buf->nr_buf)
+				buf->cur_buf = 0;
+			sbuf = &buf->sbuf[buf->cur_buf];
+
+			err = hisi_pmcu_config_sample(ev, sbuf->remain);
+			if (err) {
+				dev_err(hisi_pmcu->dev,
+					"Sampling stopped at AUX buffer %d, buffer size is probably tainted\n",
+					buf->cur_buf);
+				hisi_pmcu->busy = false;
+				goto skip;
+			}
+		}
+
+		hisi_pmcu_hw_sample_start(hisi_pmcu, buf);
+
+skip:
+		spin_unlock(&hisi_pmcu->lock);
+		writel(HISI_PMCU_INT_DONE, base + HISI_PMCU_REG_INT_CLR);
+	}
+
+	if (irq_status & HISI_PMCU_INT_BRK) {
+		hisi_pmcu->busy = false;
+		writel(HISI_PMCU_INT_BRK, base + HISI_PMCU_REG_INT_CLR);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static int hisi_pmcu_init_irq(struct platform_device *pdev,
+			      struct hisi_pmcu *hisi_pmcu)
+{
+	int irq, ret;
+
+	irq = platform_get_irq(pdev, 0);
+	if (irq < 0)
+		return irq;
+
+	ret = devm_request_irq(&pdev->dev, irq, hisi_pmcu_isr,
+			       IRQF_NOBALANCING | IRQF_NO_THREAD,
+			       dev_name(&pdev->dev), hisi_pmcu);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret,
+				     "Failed to request IRQ line %d\n", irq);
+
+	hisi_pmcu->irq = irq;
+
+	return 0;
+}
+
+static void hisi_pmcu_init_hw(struct hisi_pmcu *hisi_pmcu)
+{
+	void __iomem *base = hisi_pmcu->regbase;
+
+	writel(HISI_PMCU_INT_ALL, base + HISI_PMCU_REG_INT_EN);
+	writel(0, base + HISI_PMCU_REG_INT_MSK);
+	writel(HISI_PMCU_INT_ALL, base + HISI_PMCU_REG_INT_CLR);
+
+	writel(HISI_PMCU_PMCR_DEFAULT, base + HISI_PMCU_REG_PMCR);
+
+	if (hisi_pmcu->ev.comp_mode)
+		writel(HISI_PMCU_COMP_ENABLE, base + HISI_PMCU_REG_COMP);
+}
+
+static void hisi_pmcu_init_cpu_config(void *info)
+{
+	struct hisi_pmcu *hisi_pmcu = info;
+	u64 val, hpmn;
+	int cpu, sccl;
+
+	val = read_cpuid_mpidr();
+
+	if (FIELD_GET(MPIDR_MT_BITMASK, val))
+		sccl = MPIDR_AFFINITY_LEVEL(val, 3);
+	else
+		sccl = MPIDR_AFFINITY_LEVEL(val, 2);
+
+	if (sccl == hisi_pmcu->scclid) {
+		cpu = smp_processor_id();
+		cpumask_set_cpu(cpu, &hisi_pmcu->cpus);
+
+		val = read_sysreg(mdcr_el2);
+		hpmn = FIELD_GET(MDCR_EL2_HPMN_MASK, val);
+		if (hpmn > HISI_PMCU_COMP_HPMN_THR) {
+			hisi_pmcu->ev.comp_mode = HISI_PMCU_COMP_MODE_DISABLED;
+			dev_warn(hisi_pmcu->dev,
+				 "CPU%d MDCR_EL2.HPMN=%lld (> %d), PMCU may mess up VM's counter accesses\n",
+				 cpu, hpmn, HISI_PMCU_COMP_HPMN_THR);
+		}
+	}
+}
+
+static void hisi_pmcu_set_mdcr_el2_hpme(void *info)
+{
+	write_sysreg(read_sysreg(mdcr_el2) | MDCR_EL2_HPME, mdcr_el2);
+}
+
+static enum cpuhp_state hisi_pmcu_cpuhp_state;
+
+static void hisi_pmcu_remove_cpuhp_instance(void *cpuhp_node)
+{
+	cpuhp_state_remove_instance_nocalls(hisi_pmcu_cpuhp_state, cpuhp_node);
+}
+
+static int hisi_pmcu_init(struct platform_device *pdev,
+			  struct hisi_pmcu *hisi_pmcu)
+{
+	int ret;
+
+	hisi_pmcu->dev = &pdev->dev;
+	hisi_pmcu->on_cpu = -1;
+	spin_lock_init(&hisi_pmcu->lock);
+
+	ret = hisi_pmcu_init_data(pdev, hisi_pmcu);
+	if (ret)
+		return ret;
+
+	ret = hisi_pmcu_init_irq(pdev, hisi_pmcu);
+	if (ret)
+		return ret;
+
+	hisi_pmcu_init_hw(hisi_pmcu);
+
+	/*
+	 * ARM64 sysreg MDCR_EL2.HPMN defines the number of core PMU counters
+	 * that are accessible from VMs. If HPMN > 0, PMCU may access PMU
+	 * counters at the same time as VMs, messing up the counter control.
+	 * PMCU supports a "compatibility mode", where it restricts itself to
+	 * using counters starting from an index of HISI_PMCU_COMP_HPMN_THR.
+	 * Hence, if HPMN <= HISI_PMCU_COMP_HPMN_THR on all CPUs, PMCU enables
+	 * the "compatibility mode" to resolve the conflict with VMs;
+	 * otherwise, we print a message to warn of potential conflicts.
+	 */
+	hisi_pmcu->ev.comp_mode = HISI_PMCU_COMP_MODE_UNDEFINE;
+	on_each_cpu(hisi_pmcu_init_cpu_config, hisi_pmcu, 1);
+	if (hisi_pmcu->ev.comp_mode != HISI_PMCU_COMP_MODE_DISABLED) {
+		hisi_pmcu->ev.comp_mode = HISI_PMCU_COMP_MODE_ENABLED;
+		on_each_cpu(hisi_pmcu_set_mdcr_el2_hpme, NULL, 1);
+	}
+
+	ret = cpuhp_state_add_instance(hisi_pmcu_cpuhp_state,
+				       &hisi_pmcu->cpuhp_node);
+	if (ret)
+		return ret;
+
+	return devm_add_action_or_reset(hisi_pmcu->dev,
+					hisi_pmcu_remove_cpuhp_instance,
+					&hisi_pmcu->cpuhp_node);
+}
+
+static void hisi_pmcu_unregister_pmu(void *pmu)
+{
+	perf_pmu_unregister(pmu);
+}
+
+static int hisi_pmcu_register_pmu(struct hisi_pmcu *hisi_pmcu)
+{
+	char *name;
+	int ret;
+
+	hisi_pmcu->pmu = (struct pmu) {
+		.module		= THIS_MODULE,
+		.attr_groups	= hisi_pmcu_attr_groups,
+		.capabilities	= PERF_PMU_CAP_EXCLUSIVE |
+				  PERF_PMU_CAP_AUX_OUTPUT,
+		.task_ctx_nr	= perf_invalid_context,
+		.event_init	= hisi_pmcu_pmu_event_init,
+		.add		= hisi_pmcu_pmu_add,
+		.del		= hisi_pmcu_pmu_del,
+		.start		= hisi_pmcu_pmu_start,
+		.stop		= hisi_pmcu_pmu_stop,
+		.setup_aux	= hisi_pmcu_pmu_setup_aux,
+		.free_aux	= hisi_pmcu_pmu_free_aux,
+	};
+
+	name = devm_kasprintf(hisi_pmcu->dev, GFP_KERNEL, "hisi_pmcu_sccl%d",
+			      hisi_pmcu->scclid);
+	if (!name)
+		return -ENOMEM;
+
+	ret = perf_pmu_register(&hisi_pmcu->pmu, name, -1);
+	if (ret)
+		return dev_err_probe(hisi_pmcu->dev, ret,
+				     "Failed to register PMU\n");
+
+	return devm_add_action_or_reset(hisi_pmcu->dev,
+					hisi_pmcu_unregister_pmu,
+					&hisi_pmcu->pmu);
+}
+
+static int hisi_pmcu_probe(struct platform_device *pdev)
+{
+	struct hisi_pmcu *hisi_pmcu;
+	int ret;
+
+	hisi_pmcu = devm_kzalloc(&pdev->dev, sizeof(*hisi_pmcu), GFP_KERNEL);
+	if (!hisi_pmcu)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, hisi_pmcu);
+
+	ret = hisi_pmcu_init(pdev, hisi_pmcu);
+	if (ret)
+		return ret;
+
+	return hisi_pmcu_register_pmu(hisi_pmcu);
+}
+
+static const struct acpi_device_id hisi_pmcu_acpi_match[] = {
+	{ "HISI0451", },
+	{}
+};
+MODULE_DEVICE_TABLE(acpi, hisi_pmcu_acpi_match);
+
+static struct platform_driver hisi_pmcu_driver = {
+	.driver = {
+		.name = HISI_PMCU_DRV_NAME,
+		.acpi_match_table = hisi_pmcu_acpi_match,
+		/*
+		 * Unbinding driver is not yet supported as we have not worked
+		 * out a safe bind/unbind process.
+		 */
+		.suppress_bind_attrs = true,
+	},
+	.probe = hisi_pmcu_probe,
+};
+
+static int hisi_pmcu_cpuhp_startup(unsigned int cpu, struct hlist_node *node)
+{
+	struct hisi_pmcu *hisi_pmcu;
+
+	hisi_pmcu = hlist_entry_safe(node, struct hisi_pmcu, cpuhp_node);
+
+	if (hisi_pmcu->on_cpu != -1)
+		return 0;
+
+	if (!cpumask_test_cpu(cpu, &hisi_pmcu->cpus))
+		return 0;
+
+	WARN_ON(irq_set_affinity(hisi_pmcu->irq, cpumask_of(cpu)));
+	hisi_pmcu->on_cpu = cpu;
+
+	return 0;
+}
+
+static int hisi_pmcu_cpuhp_teardown(unsigned int cpu, struct hlist_node *node)
+{
+	struct hisi_pmcu *hisi_pmcu;
+	cpumask_t available_cpus;
+	unsigned int target;
+
+	hisi_pmcu = hlist_entry_safe(node, struct hisi_pmcu, cpuhp_node);
+
+	if (hisi_pmcu->on_cpu != cpu)
+		return 0;
+
+	hisi_pmcu->on_cpu = -1;
+
+	cpumask_and(&available_cpus, &hisi_pmcu->cpus, cpu_online_mask);
+	target = cpumask_any_but(&available_cpus, cpu);
+	if (target >= nr_cpu_ids)
+		return 0;
+	perf_pmu_migrate_context(&hisi_pmcu->pmu, cpu, target);
+	WARN_ON(irq_set_affinity(hisi_pmcu->irq, cpumask_of(target)));
+	hisi_pmcu->on_cpu = target;
+
+	return 0;
+}
+
+static int __init hisi_pmcu_module_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, HISI_PMCU_DRV_NAME,
+				      hisi_pmcu_cpuhp_startup,
+				      hisi_pmcu_cpuhp_teardown);
+	if (ret < 0)
+		return ret;
+	hisi_pmcu_cpuhp_state = ret;
+
+	ret = platform_driver_register(&hisi_pmcu_driver);
+	if (ret)
+		cpuhp_remove_multi_state(hisi_pmcu_cpuhp_state);
+
+	return ret;
+}
+
+static void __exit hisi_pmcu_module_exit(void)
+{
+	platform_driver_unregister(&hisi_pmcu_driver);
+	cpuhp_remove_multi_state(hisi_pmcu_cpuhp_state);
+}
+
+module_init(hisi_pmcu_module_init);
+module_exit(hisi_pmcu_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jie Zhan <zhanjie9@hisilicon.com>");
+MODULE_DESCRIPTION("HiSilicon PMCU driver");
-- 
2.30.0
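
The event padding done by hisi_pmcu_setup_events() in the patch above can be
checked with a small user-space sketch. It only mirrors the arithmetic in the
patch (nr_pmu, padding, subsample and sample sizes). The struct and function
names are hypothetical and are not part of the driver. The extra one or two
64-bit words per CPU are copied from the patch's formula, which does not say
what they hold.

```c
#include <assert.h>
#include <stdint.h>

/* Limits mirrored from the patch; the names here are illustrative only. */
#define PMCU_MAX_EV_LEN		240u	/* HISI_PMCU_FSM_CFG_MAX_EV_LEN */
#define PMCU_MAX_NR_PMU		8u	/* HISI_PMCU_FSM_CFG_MAX_NR_PMU */
#define PMCU_MAX_NR_PMU_C	5u	/* HISI_PMCU_FSM_CFG_MAX_NR_PMU_C */

/* Hypothetical container for the values derived per sample. */
struct pmcu_layout {
	uint32_t nr_pmu;		/* events taken per subsample */
	uint32_t padding;		/* null events appended */
	uint32_t nr_ev_per_sample;	/* nr_ev + padding */
	uint32_t subsample_size;	/* bytes per subsample */
	uint32_t sample_size;		/* bytes per sample */
};

/* Recompute the layout for nr_ev user events on nr_cpu logical CPUs. */
struct pmcu_layout pmcu_compute_layout(uint32_t nr_ev, uint32_t nr_cpu,
				       int comp_mode)
{
	uint32_t max_nr_pmu = comp_mode ? PMCU_MAX_NR_PMU_C : PMCU_MAX_NR_PMU;
	struct pmcu_layout l;

	/* nr_ev == 0 would make nr_pmu zero and the modulo undefined. */
	assert(nr_ev > 0 && nr_ev <= PMCU_MAX_EV_LEN);

	l.nr_pmu = nr_ev < max_nr_pmu ? nr_ev : max_nr_pmu;
	/* Round nr_ev up to the next multiple of nr_pmu. */
	l.padding = (l.nr_pmu - nr_ev % l.nr_pmu) % l.nr_pmu;
	l.nr_ev_per_sample = nr_ev + l.padding;
	l.subsample_size = (uint32_t)((l.nr_pmu + (comp_mode ? 1u : 2u)) *
				      sizeof(uint64_t) * nr_cpu);
	l.sample_size = l.nr_ev_per_sample / l.nr_pmu * l.subsample_size;

	return l;
}
```

For example, 10 events on 16 CPUs without compatibility mode give nr_pmu = 8,
6 padding events, and two subsamples of 1280 bytes per sample. The sketch
rejects nr_ev == 0 up front; the driver as posted does not appear to check
for an empty event list before the modulo.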


* [RFC PATCH v1 2/4] drivers/perf: hisi: Add driver support for HiSilicon PMCU
@ 2023-02-06  6:51   ` Jie Zhan
  0 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-02-06  6:51 UTC (permalink / raw)
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	zhanjie9, suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
PMU accesses from CPUs, handling the configuration, event switching, and
counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
and multi-PMU-event CPU profiling, in which scenario the current 'perf'
scheme may lose events or drop sampling frequency. With PMCU, users can
reliably obtain the data of up to 240 PMU events with the sample interval
of events down to 1ms, while the software overhead of accessing PMUs, as
well as its impact on target workloads, is reduced.

This driver enables the usage of PMCU through the perf_event framework.
PMCU is registered as a PMU device and utilises the AUX buffer to dump data
directly. Users can start PMCU sampling through 'perf-record'. Event
numbers are passed by a sysfs interface.
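
As a rough illustration of the perf_event_attr encoding this driver parses,
the user-space sketch below packs attr.config the way the format attributes
in this patch describe it (nr_sample in bits 0-31, sample_period_ms in bits
32-63) and reproduces the conversion to the hardware wait count done in
hisi_pmcu_pmu_event_init(). The helper names are hypothetical; only the
constants are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Constants mirrored from hisi_pmcu.c in this patch. */
#define PMCU_MS_TO_WAIT_CNT	50000u	 /* HISI_PMCU_PERF_MS_TO_WAIT_CNT */
#define PMCU_WAIT_CNT_DEFAULT	0x249F0u /* HISI_PMCU_WAIT_CNT_DEFAULT */
#define PMCU_MAX_PERIOD_MS	(UINT32_MAX / PMCU_MS_TO_WAIT_CNT)

/* Hypothetical helper: build attr.config from the two user fields. */
uint64_t pmcu_pack_config(uint32_t nr_sample, uint32_t period_ms)
{
	/* The driver rejects longer periods with -EINVAL. */
	assert(period_ms <= PMCU_MAX_PERIOD_MS);

	return ((uint64_t)period_ms << 32) | nr_sample;
}

/* Hypothetical helper: the wait count the driver would program. */
uint32_t pmcu_wait_cnt(uint64_t config)
{
	uint32_t period_ms = (uint32_t)(config >> 32);

	assert(period_ms <= PMCU_MAX_PERIOD_MS);

	/* A zero period selects the driver's default wait count. */
	return period_ms ? period_ms * PMCU_MS_TO_WAIT_CNT :
			   PMCU_WAIT_CNT_DEFAULT;
}
```

With these constants, the default wait count 0x249F0 (150000) corresponds to
a 3 ms period, and the largest accepted period is 85899 ms.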

Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 drivers/perf/hisilicon/Kconfig     |   15 +
 drivers/perf/hisilicon/Makefile    |    1 +
 drivers/perf/hisilicon/hisi_pmcu.c | 1096 ++++++++++++++++++++++++++++
 3 files changed, 1112 insertions(+)
 create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c

diff --git a/drivers/perf/hisilicon/Kconfig b/drivers/perf/hisilicon/Kconfig
index 171bfc1b6bc2..d7728fbe8519 100644
--- a/drivers/perf/hisilicon/Kconfig
+++ b/drivers/perf/hisilicon/Kconfig
@@ -24,3 +24,18 @@ config HNS3_PMU
 	  devices.
 	  Adds the HNS3 PMU into perf events system for monitoring latency,
 	  bandwidth etc.
+
+config HISI_PMCU
+	tristate "HiSilicon PMCU"
+	depends on ARM64 && PID_IN_CONTEXTIDR
+	help
+	  Support for HiSilicon Performance Monitor Control Unit (PMCU).
+	  HiSilicon Performance Monitor Control Unit (PMCU) is a device that
+	  offloads PMU accesses from CPUs, handling the configuration, event
+	  switching, and counter reading of core PMUs on Kunpeng SoC. It
+	  facilitates fine-grained and multi-PMU-event CPU profiling, in which
+	  scenario the current 'perf' scheme may lose events or drop sampling
+	  frequency. With PMCU, users can reliably obtain the data of up to 240
+	  PMU events with the sample interval of events down to 1ms, while the
+	  software overhead of accessing PMUs, as well as its impact on target
+	  workloads, is reduced.
diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index 4d2c9abe3372..93e4e6f2816a 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o \
 
 obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
 obj-$(CONFIG_HNS3_PMU) += hns3_pmu.o
+obj-$(CONFIG_HISI_PMCU) += hisi_pmcu.o
diff --git a/drivers/perf/hisilicon/hisi_pmcu.c b/drivers/perf/hisilicon/hisi_pmcu.c
new file mode 100644
index 000000000000..6ec5d6c31e1f
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_pmcu.c
@@ -0,0 +1,1096 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * HiSilicon Performance Monitor Control Unit (PMCU) driver
+ *
+ * Copyright (C) 2022 HiSilicon Limited
+ * Author: Jie Zhan <zhanjie9@hisilicon.com>
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitfield.h>
+#include <linux/bits.h>
+#include <linux/cpumask.h>
+#include <linux/delay.h>
+#include <linux/dev_printk.h>
+#include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/errno.h>
+#include <linux/gfp_types.h>
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+#include <linux/mm_types.h>
+#include <linux/module.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/threads.h>
+#include <linux/vmalloc.h>
+
+#include <asm/cputype.h>
+#include <asm/sysreg.h>
+
+/* Registers */
+#define HISI_PMCU_REG_FSM_STATUS	0x0000
+#define HISI_PMCU_REG_FSM_CFG		0x0004
+#define HISI_PMCU_REG_EVENT_BASE_H	0x0008
+#define HISI_PMCU_REG_EVENT_BASE_L	0x000C
+#define HISI_PMCU_REG_KILL_BASE_H	0x0010
+#define HISI_PMCU_REG_KILL_BASE_L	0x0014
+#define HISI_PMCU_REG_STORE_BASE_H	0x0018
+#define HISI_PMCU_REG_STORE_BASE_L	0x001C
+#define HISI_PMCU_REG_WAIT_CNT		0x0020
+#define HISI_PMCU_REG_FSM_CTRL		0x0038
+#define HISI_PMCU_REG_FSM_BRK		0x003C
+#define HISI_PMCU_REG_COMP		0x0044
+#define HISI_PMCU_REG_INT_EN		0x0100
+#define HISI_PMCU_REG_INT_MSK		0x0104
+#define HISI_PMCU_REG_INT_STAT		0x0108
+#define HISI_PMCU_REG_INT_CLR		0x010C
+#define HISI_PMCU_REG_PMCR		0x0200
+#define HISI_PMCU_REG_PMCCFILTR		0x0204
+
+/* Register related configs */
+#define HISI_PMCU_FSM_CFG_EV_LEN_MSK	GENMASK(7, 0)
+#define HISI_PMCU_FSM_CFG_NR_LOOP_MSK	GENMASK(15, 8)
+#define HISI_PMCU_FSM_CFG_NR_PMU_MSK	GENMASK(19, 16)
+#define HISI_PMCU_FSM_CFG_MAX_EV_LEN	240
+#define HISI_PMCU_FSM_CFG_MAX_NR_LOOP	255
+#define HISI_PMCU_FSM_CFG_MAX_NR_PMU	8
+#define HISI_PMCU_FSM_CFG_MAX_NR_PMU_C	5
+#define HISI_PMCU_WAIT_CNT_DEFAULT	0x249F0
+#define HISI_PMCU_FSM_CTRL_TRIGGER	BIT(0)
+#define HISI_PMCU_FSM_BRK_BRK		BIT(0)
+#define HISI_PMCU_COMP_HPMN_THR		3
+#define HISI_PMCU_COMP_ENABLE		BIT(0)
+#define HISI_PMCU_INT_DONE		BIT(0)
+#define HISI_PMCU_INT_BRK		BIT(1)
+#define HISI_PMCU_INT_ALL		GENMASK(1, 0)
+#define HISI_PMCU_PMCR_DEFAULT		0xC1
+#define HISI_PMCU_PMCCFILTR_MSK		GENMASK(31, 24)
+
+/* User perf_event_attr configs */
+#define HISI_PMCU_PERF_ATTR_NR_SAMPLE		GENMASK(31, 0)
+#define HISI_PMCU_PERF_NR_SAMPLE_DEFAULT	0x80
+#define HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS	GENMASK(63, 32)
+#define HISI_PMCU_PERF_MS_TO_WAIT_CNT		50000
+#define HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS	(U32_MAX / \
+						 HISI_PMCU_PERF_MS_TO_WAIT_CNT)
+#define HISI_PMCU_PERF_ATTR_PMCCFILTR		GENMASK(7, 0)
+
+/* Others */
+#define HISI_PMCU_AUX_HEADER_ALIGN	0x10
+#define HISI_PMCU_BRK_DELAY_PERIOD_US	10
+#define HISI_PMCU_BRK_TIMEOUT_US	2000
+#define HISI_PMCU_DRV_NAME		"hisi-pmcu"
+#define NR_CPU_CLUSTER			8
+#define PMU_NULL_EVENT_ID		0xC000
+
+/**
+ * struct hisi_pmcu_sbuf - A single contiguous memory buffer
+ * @page:	starting page of this buffer
+ * @size:	size of this buffer
+ * @remain:	size of remaining space in this buffer
+ */
+struct hisi_pmcu_sbuf {
+	struct page *page;
+	u32 size;
+	u32 remain;
+};
+
+/**
+ * struct hisi_pmcu_buf - Management of multiple contiguous buffers
+ * @nr_buf:	number of buffers
+ * @cur_buf:	current working buffer
+ * @sbuf:	array of contiguous buffers
+ */
+struct hisi_pmcu_buf {
+	u32 nr_buf;
+	u32 cur_buf;
+	struct hisi_pmcu_sbuf sbuf[];
+};
+
+struct hisi_pmcu_auxtrace_header {
+	u32 buffer_size;
+	u32 nr_pmu;
+	u32 nr_cpu;
+	u32 comp_mode;
+	u32 subsample_size;
+	u32 nr_subsample_per_sample;
+	u32 nr_event;
+};
+
+/**
+ * struct hisi_pmcu_events - PMCU events and sampling configuration
+ * @nr_pmu:		number of core PMU counters that run in parallel
+ * @padding:		number of padding events in a sample
+ * @nr_ev:		number of events passed by users in a sample
+ * @nr_ev_per_sample:	number of events passed to hardware for a sample
+ *			This equals nr_ev + padding and should be evenly
+ *			divisible by nr_pmu.
+ * @max_sample_loop:	max number of samples that can be done in a loop
+ * @ev_len:		event length for hardware to read in a loop
+ * @nr_loop:		number of loops in one trigger
+ * @comp_mode:		compatibility mode
+ * @nr_sample:		number of samples that the current trigger takes
+ * @nr_pending_sample:	number of pending samples
+ * @subsample_size:	size of a subsample
+ * @sample_size:	size of a sample
+ * @output_size:	size of output from one trigger
+ * @sample_period:	sample period passed to hardware
+ * @nr_cpu:		number of hardware threads (logical CPUs)
+ * @events:		event IDs passed from users
+ */
+struct hisi_pmcu_events {
+	u8 nr_pmu;
+	u8 padding;
+	u8 nr_ev;
+	u8 nr_ev_per_sample;
+	u8 max_sample_loop;
+	u8 ev_len;
+	u8 nr_loop;
+	u8 comp_mode;
+	u32 nr_sample;
+	u32 nr_pending_sample;
+	u32 subsample_size;
+	u32 sample_size;
+	u32 output_size;
+	u32 sample_period;
+	u32 nr_cpu;
+	u32 events[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
+};
+
+enum hisi_pmcu_comp_mode {
+	HISI_PMCU_COMP_MODE_DISABLED,
+	HISI_PMCU_COMP_MODE_ENABLED,
+	HISI_PMCU_COMP_MODE_UNDEFINE,
+};
+
+/**
+ * struct hisi_pmcu_user_events - Data interacting with sysfs interface
+ * @nr_ev:	number of events written
+ * @ev:		event IDs
+ */
+struct hisi_pmcu_user_events {
+	u32 nr_ev;
+	u16 ev[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
+};
+
+/**
+ * struct hisi_pmcu - PMCU device data
+ * @pmu:	PMU device of this PMCU
+ * @dev:	device of this PMCU
+ * @regbase:	base IO address of registers
+ * @lock:	spinlock for serialising hardware operations
+ * @busy:	PMCU sampling running indicator
+ * @irq:	IRQ number
+ * @scclid:	CPU die (SCCL) ID where this PMCU is on
+ * @on_cpu:	CPU that handles perf_event and IRQ
+ * @cpus:	CPUs monitored by this PMCU
+ * @cpuhp_node:	CPU hotplug node
+ * @handle:	perf output handle for interacting with AUX buffers
+ * @ev:		PMCU events and sampling configuration
+ * @user_ev:	user events passed from sysfs
+ */
+struct hisi_pmcu {
+	struct pmu pmu;
+	struct device *dev;
+	void __iomem *regbase;
+	spinlock_t lock;
+	bool busy;
+	int irq;
+	int scclid;
+	int on_cpu;
+	cpumask_t cpus;
+	struct hlist_node cpuhp_node;
+	struct perf_output_handle handle;
+	struct hisi_pmcu_events ev;
+	struct hisi_pmcu_user_events user_ev;
+};
+
+#define to_hisi_pmcu(p) container_of(p, struct hisi_pmcu, pmu)
+
+static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr,
+						char *buf)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+
+	return sysfs_emit(buf, "%d\n", hisi_pmcu->on_cpu);
+}
+
+static DEVICE_ATTR_ADMIN_RO(cpumask);
+
+static struct attribute *hisi_pmcu_cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_cpumask_attr_group = {
+	.attrs = hisi_pmcu_cpumask_attrs,
+};
+
+PMU_FORMAT_ATTR(nr_sample, "config:0-31");
+PMU_FORMAT_ATTR(sample_period_ms, "config:32-63");
+PMU_FORMAT_ATTR(pmccfiltr, "config1:0-7");
+
+static struct attribute *hisi_pmcu_format_attrs[] = {
+	&format_attr_nr_sample.attr,
+	&format_attr_sample_period_ms.attr,
+	&format_attr_pmccfiltr.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_format_attr_group = {
+	.name = "format",
+	.attrs = hisi_pmcu_format_attrs,
+};
+
+static ssize_t monitored_cpus_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+
+	return sysfs_emit(buf, "%d-%d\n",
+			  cpumask_first(&hisi_pmcu->cpus),
+			  cpumask_last(&hisi_pmcu->cpus));
+}
+
+static DEVICE_ATTR_ADMIN_RO(monitored_cpus);
+
+static struct attribute *hisi_pmcu_monitored_cpus_attrs[] = {
+	&dev_attr_monitored_cpus.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_monitored_cpus_attr_group = {
+	.attrs = hisi_pmcu_monitored_cpus_attrs,
+};
+
+static ssize_t user_events_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
+	int at = 0;
+	int i;
+
+	for (i = 0; i < user_ev->nr_ev; i++)
+		at += sysfs_emit_at(buf, at, "0x%04x\n", user_ev->ev[i]);
+
+	return at;
+}
+
+static ssize_t user_events_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
+	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
+	u32 head = 0, tail, nr_ev = 0;
+	char *line;
+	int err;
+
+	line = kcalloc(count + 1, sizeof(*line), GFP_KERNEL);
+	if (!line)
+		return -ENOMEM;
+
+	while (nr_ev < HISI_PMCU_FSM_CFG_MAX_EV_LEN) {
+		while (head < count && isspace(buf[head]))
+			head++;
+		if (!isxdigit(buf[head]))
+			break;
+		tail = head + 1;
+		while (tail < count && isalnum(buf[tail]))
+			tail++;
+		strncpy(line, buf + head, tail - head);
+		line[tail - head] = '\0';
+		err = kstrtou16(line, 16, &user_ev->ev[nr_ev]);
+		if (err) {
+			user_ev->nr_ev = 0;
+			kfree(line);
+			return err;
+		}
+		nr_ev++;
+		head = tail;
+	}
+	user_ev->nr_ev = nr_ev;
+	kfree(line);
+
+	return count;
+}
+
+static DEVICE_ATTR_ADMIN_RW(user_events);
+
+static struct attribute *hisi_pmcu_user_events_attrs[] = {
+	&dev_attr_user_events.attr,
+	NULL
+};
+
+static const struct attribute_group hisi_pmcu_user_events_attr_group = {
+	.attrs = hisi_pmcu_user_events_attrs,
+};
+
+static const struct attribute_group *hisi_pmcu_attr_groups[] = {
+	&hisi_pmcu_cpumask_attr_group,
+	&hisi_pmcu_format_attr_group,
+	&hisi_pmcu_monitored_cpus_attr_group,
+	&hisi_pmcu_user_events_attr_group,
+	NULL
+};
+
+static int hisi_pmcu_pmu_event_init(struct perf_event *event)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
+	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
+	void __iomem *base = hisi_pmcu->regbase;
+	u64 cfg;
+	u32 val;
+
+	if (event->attr.type != hisi_pmcu->pmu.type)
+		return -ENOENT;
+
+	if (hisi_pmcu->busy)
+		return -EBUSY;
+
+	cfg = event->attr.config;
+
+	val = FIELD_GET(HISI_PMCU_PERF_ATTR_NR_SAMPLE, cfg);
+	ev->nr_pending_sample = val ? val : HISI_PMCU_PERF_NR_SAMPLE_DEFAULT;
+
+	val = FIELD_GET(HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS, cfg);
+	if (val > HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS) {
+		dev_err(hisi_pmcu->dev, "sample period too long (max=0x%x)\n",
+			HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS);
+		return -EINVAL;
+	}
+	ev->sample_period = val ? val * HISI_PMCU_PERF_MS_TO_WAIT_CNT :
+				  HISI_PMCU_WAIT_CNT_DEFAULT;
+
+	cfg = event->attr.config1;
+
+	val = FIELD_GET(HISI_PMCU_PERF_ATTR_PMCCFILTR, cfg);
+	val = FIELD_PREP(HISI_PMCU_PMCCFILTR_MSK, val);
+	writel(val, base + HISI_PMCU_REG_PMCCFILTR);
+
+	return 0;
+}
+
+static void *hisi_pmcu_pmu_setup_aux(struct perf_event *event, void **pages,
+				     int nr_pages, bool overwrite)
+{
+	int pg, nr_pg, nbuf;
+	struct hisi_pmcu_buf *buf;
+	struct page *page;
+
+	if (overwrite) {
+		dev_warn(event->pmu->dev, "Overwrite mode is not supported\n");
+		return NULL;
+	}
+
+	/* Count buffers */
+	nbuf = 0;
+	for (pg = 0; pg < nr_pages;) {
+		page = virt_to_page(pages[pg]);
+		pg += 1 << page_private(page);
+		nbuf++;
+	}
+
+	buf = kzalloc(struct_size(buf, sbuf, nbuf), GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	/* Set up buffers */
+	buf->nr_buf = nbuf;
+	buf->cur_buf = 0;
+	for (pg = 0, nbuf = 0; nbuf < buf->nr_buf; nbuf++) {
+		page = virt_to_page(pages[pg]);
+		nr_pg = 1 << page_private(page);
+		buf->sbuf[nbuf].page = page;
+		buf->sbuf[nbuf].size = nr_pg << PAGE_SHIFT;
+		buf->sbuf[nbuf].remain = nr_pg << PAGE_SHIFT;
+		pg += nr_pg;
+	}
+
+	return buf;
+}
+
+static void hisi_pmcu_pmu_free_aux(void *aux)
+{
+	kfree(aux);
+}
+
+static void hisi_pmcu_setup_events(struct hisi_pmcu_events *ev,
+				   struct hisi_pmcu_user_events *user_ev)
+{
+	u8 max_nr_pmu;
+	int i;
+
+	/* Copy events from user's sysfs interface */
+	ev->nr_ev = user_ev->nr_ev;
+	for (i = 0; i < ev->nr_ev; i++)
+		ev->events[i] = user_ev->ev[i];
+
+	/*
+	 * Set nr_pmu and pad events.
+	 *
+	 * PMCU takes nr_pmu events per "subsample", and nr_pmu is limited by
+	 * the number of available PMU counters (nr_pmu <= max_nr_pmu). If
+	 * nr_ev <= max_nr_pmu, we just set nr_pmu = ev->nr_ev and we do not
+	 * need to pad events.
+	 *
+	 * However, if nr_ev > max_nr_pmu, a "sample" of nr_ev events is formed
+	 * of multiple subsamples. In this case, we set nr_pmu = max_nr_pmu
+	 * and, if nr_ev % nr_pmu != 0, we pad null events, i.e. reserved
+	 * events that do not count, in the last subsample. Thus, each
+	 * subsample belongs to exactly one sample, making user space data
+	 * decoding easier.
+	 */
+	max_nr_pmu = ev->comp_mode ? HISI_PMCU_FSM_CFG_MAX_NR_PMU_C :
+				     HISI_PMCU_FSM_CFG_MAX_NR_PMU;
+
+	ev->nr_pmu = min(ev->nr_ev, max_nr_pmu);
+
+	ev->padding = ev->nr_ev % ev->nr_pmu ?
+		      ev->nr_pmu - ev->nr_ev % ev->nr_pmu : 0;
+
+	ev->nr_ev_per_sample = ev->nr_ev + ev->padding;
+
+	for (i = ev->nr_ev; i < ev->nr_ev_per_sample; i++)
+		ev->events[i] = PMU_NULL_EVENT_ID;
+
+	/*
+	 * Duplicate events in ev->events in case of needing many samples
+	 * (> MAX_NR_LOOP) in a trigger. See hisi_pmcu_config_sample().
+	 */
+	ev->max_sample_loop = HISI_PMCU_FSM_CFG_MAX_EV_LEN /
+			      ev->nr_ev_per_sample;
+	for (i = 1; i < ev->max_sample_loop; i++)
+		memcpy(ev->events + i * ev->nr_ev_per_sample,
+		       ev->events, ev->nr_ev_per_sample * sizeof(u32));
+
+	/* Update sample size */
+	ev->subsample_size = (ev->nr_pmu + (ev->comp_mode ? 1 : 2))
+			     * sizeof(u64) * ev->nr_cpu;
+	ev->sample_size = ev->nr_ev_per_sample / ev->nr_pmu
+			  * ev->subsample_size;
+}
+
+static int hisi_pmcu_config_sample(struct hisi_pmcu_events *ev, u32 buf_size)
+{
+	int nr_sample_loop, nr_max;
+
+	if (buf_size < ev->sample_size)
+		return 1;
+
+	/* Number of samples this buf can hold, capped by pending samples */
+	nr_max = min(buf_size / ev->sample_size, ev->nr_pending_sample);
+
+	/*
+	 * Determine ev->ev_len and ev->nr_loop, update ev->nr_sample
+	 *
+	 * NOTE: We haven't implemented an algorithm to find a pair of
+	 * [nr_loop, nr_sample_loop] that exactly delivers nr_max samples.
+	 *
+	 * We use nr_loop to do multiple samples if nr_max <= MAX_NR_LOOP.
+	 * Otherwise, we utilise the duplicate events in the event buffer to
+	 * get more samples. If there are any pending samples not going to be
+	 * taken in this trigger, e.g. due to the limit of (max_sample_loop *
+	 * MAX_NR_LOOP) or the round down of division (nr_max / MAX_NR_LOOP),
+	 * they will be handled in the next trigger from ISR.
+	 */
+	if (nr_max <= HISI_PMCU_FSM_CFG_MAX_NR_LOOP) {
+		nr_sample_loop = 1;
+		ev->nr_loop = nr_max;
+		ev->nr_sample = ev->nr_loop;
+	} else {
+		nr_sample_loop = nr_max / HISI_PMCU_FSM_CFG_MAX_NR_LOOP;
+		if (nr_sample_loop > ev->max_sample_loop)
+			nr_sample_loop = ev->max_sample_loop;
+		ev->nr_loop = HISI_PMCU_FSM_CFG_MAX_NR_LOOP;
+		ev->nr_sample = nr_sample_loop * ev->nr_loop;
+	}
+
+	ev->ev_len = ev->nr_ev_per_sample * nr_sample_loop;
+
+	ev->output_size = ev->sample_size * ev->nr_sample;
+
+	return 0;
+}
+
+static void hisi_pmcu_hw_sample_start(struct hisi_pmcu *hisi_pmcu,
+				      struct hisi_pmcu_buf *buf)
+{
+	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
+	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
+	void __iomem *base = hisi_pmcu->regbase;
+	u64 addr, end;
+	u32 val;
+
+	/* FSM CFG */
+	val = FIELD_PREP(HISI_PMCU_FSM_CFG_EV_LEN_MSK, ev->ev_len);
+	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_LOOP_MSK, ev->nr_loop);
+	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_PMU_MSK, ev->nr_pmu);
+	writel(val, base + HISI_PMCU_REG_FSM_CFG);
+
+	/* Sample period */
+	writel(ev->sample_period, base + HISI_PMCU_REG_WAIT_CNT);
+
+	/* Event ID base */
+	addr = virt_to_phys(ev->events);
+	val = upper_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_EVENT_BASE_H);
+	val = lower_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_EVENT_BASE_L);
+
+	/* sbuf end */
+	end = page_to_phys(sbuf->page) + sbuf->size;
+
+	/* Data output address */
+	addr = end - sbuf->remain;
+	val = upper_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_STORE_BASE_H);
+	val = lower_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_STORE_BASE_L);
+
+	/* Stop data output if sbuf end is reached (abnormally) */
+	addr = end;
+	val = upper_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_KILL_BASE_H);
+	val = lower_32_bits(addr);
+	writel(val, base + HISI_PMCU_REG_KILL_BASE_L);
+
+	/* Trigger */
+	writel(HISI_PMCU_FSM_CTRL_TRIGGER, base + HISI_PMCU_REG_FSM_CTRL);
+}
+
+/*
+ * Break hardware sampling process and poll hisi_pmcu->busy. hisi_pmcu->busy
+ * will be cleared in ISR when hardware successfully handles the break request.
+ */
+static int hisi_pmcu_hw_sample_stop(struct hisi_pmcu *hisi_pmcu)
+{
+	ktime_t ddl;
+
+	writel(HISI_PMCU_FSM_BRK_BRK,
+	       hisi_pmcu->regbase + HISI_PMCU_REG_FSM_BRK);
+
+	ddl = ktime_add_us(ktime_get(), HISI_PMCU_BRK_TIMEOUT_US);
+
+	while (ktime_before(ktime_get(), ddl)) {
+		udelay(HISI_PMCU_BRK_DELAY_PERIOD_US);
+		if (!hisi_pmcu->busy)
+			return 0;
+	}
+
+	return -ETIMEDOUT;
+}
+
+static void hisi_pmcu_write_auxtrace_header(struct hisi_pmcu_events *ev,
+					    struct hisi_pmcu_buf *buf)
+{
+	struct hisi_pmcu_auxtrace_header header;
+	struct hisi_pmcu_sbuf *sbuf;
+	u32 *data;
+	u32 sz;
+
+	sbuf = &buf->sbuf[buf->cur_buf];
+
+	header.buffer_size = sbuf->size;
+	header.nr_pmu = ev->nr_pmu;
+	header.nr_cpu = ev->nr_cpu;
+	header.comp_mode = ev->comp_mode;
+	header.subsample_size = ev->subsample_size;
+	header.nr_subsample_per_sample = ev->nr_ev_per_sample / ev->nr_pmu;
+	header.nr_event = ev->nr_ev_per_sample;
+
+	data = page_to_virt(sbuf->page);
+	memcpy(data, &header, sizeof(header));
+	memcpy(data + sizeof(header) / sizeof(*data), ev->events,
+	       ev->nr_ev_per_sample * sizeof(u32));
+
+	sz = sizeof(header) + ev->nr_ev_per_sample * sizeof(u32);
+	sz = round_up(sz, HISI_PMCU_AUX_HEADER_ALIGN);
+
+	sbuf->remain -= sz;
+}
+
+static void hisi_pmcu_pmu_start(struct perf_event *event, int flags)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
+	struct perf_output_handle *handle = &hisi_pmcu->handle;
+	struct hw_perf_event *hwc = &event->hw;
+	struct hisi_pmcu_buf *buf;
+	int err;
+
+	spin_lock(&hisi_pmcu->lock);
+
+	if (hisi_pmcu->busy) {
+		dev_info(hisi_pmcu->dev,
+			 "Sampling is running, pmu->start() ignored\n");
+		goto out;
+	}
+
+	buf = perf_aux_output_begin(handle, event);
+	if (!buf) {
+		dev_err(hisi_pmcu->dev, "Failed to begin perf aux output\n");
+		goto out;
+	}
+
+	if (handle->head) {
+		dev_err(hisi_pmcu->dev, "got handle->head=0x%lx, should be 0\n",
+			handle->head);
+		goto out;
+	}
+
+	hisi_pmcu_setup_events(&hisi_pmcu->ev, &hisi_pmcu->user_ev);
+
+	hisi_pmcu_write_auxtrace_header(&hisi_pmcu->ev, buf);
+
+	err = hisi_pmcu_config_sample(&hisi_pmcu->ev,
+				      buf->sbuf[buf->cur_buf].remain);
+	if (err) {
+		dev_err(hisi_pmcu->dev,
+			"Failed to start sampling, buffer too small\n");
+		perf_aux_output_end(handle, 0);
+		goto out;
+	}
+
+	hisi_pmcu->busy = true;
+	hwc->state &= ~PERF_HES_STOPPED;
+
+	hisi_pmcu_hw_sample_start(hisi_pmcu, buf);
+
+	dev_dbg(hisi_pmcu->dev, "Sampling started\n");
+
+out:
+	spin_unlock(&hisi_pmcu->lock);
+}
+
+static void hisi_pmcu_pmu_stop(struct perf_event *event, int flags)
+{
+	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct perf_output_handle *handle;
+	struct hisi_pmcu_sbuf *sbuf;
+	struct hisi_pmcu_buf *buf;
+	int err;
+
+	spin_lock(&hisi_pmcu->lock);
+
+	handle = &hisi_pmcu->handle;
+
+	/* If PMCU is running, break it */
+	if (hisi_pmcu->busy) {
+		dev_info(hisi_pmcu->dev, "Stopping PMCU sampling\n");
+		err = hisi_pmcu_hw_sample_stop(hisi_pmcu);
+		if (err)
+			dev_err(hisi_pmcu->dev,
+				"Timed out for stopping PMCU!\n");
+	}
+
+	buf = perf_get_aux(handle);
+	sbuf = &buf->sbuf[buf->cur_buf];
+	perf_aux_output_end(handle, sbuf->size - sbuf->remain);
+
+	spin_unlock(&hisi_pmcu->lock);
+
+	hwc->state |= PERF_HES_STOPPED;
+	perf_event_update_userpage(event);
+}
+
+static int hisi_pmcu_pmu_add(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->state |= PERF_HES_STOPPED;
+
+	hisi_pmcu_pmu_start(event, flags);
+
+	if (hwc->state & PERF_HES_STOPPED)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void hisi_pmcu_pmu_del(struct perf_event *event, int flags)
+{
+	hisi_pmcu_pmu_stop(event, flags);
+}
+
+static int hisi_pmcu_init_data(struct platform_device *pdev,
+			       struct hisi_pmcu *hisi_pmcu)
+{
+	int ret;
+
+	hisi_pmcu->regbase = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(hisi_pmcu->regbase))
+		return dev_err_probe(&pdev->dev, PTR_ERR(hisi_pmcu->regbase),
+				     "Failed to map device register space\n");
+
+	ret = device_property_read_u32(&pdev->dev, "hisilicon,scl-id",
+				       &hisi_pmcu->scclid);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret,
+				     "Failed to read sccl-id!\n");
+
+	/*
+	 * Obtain the number of CPUs that contribute to the sample size.
+	 * NR_CPU_CLUSTER is now hard coded as the hardware accesses a certain
+	 * number of CPUs in a cluster regardless of how many CPUs are actually
+	 * implemented/available.
+	 */
+	ret = device_property_read_u32(&pdev->dev, "hisilicon,nr-cluster",
+				       &hisi_pmcu->ev.nr_cpu);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret,
+				     "Failed to read nr-cluster!\n");
+	hisi_pmcu->ev.nr_cpu *= NR_CPU_CLUSTER;
+
+	return 0;
+}
+
+static irqreturn_t hisi_pmcu_isr(int irq, void *data)
+{
+	struct hisi_pmcu *hisi_pmcu = data;
+	void __iomem *base = hisi_pmcu->regbase;
+	u32 irq_status;
+
+	irq_status = FIELD_GET(HISI_PMCU_INT_ALL,
+			       readl(base + HISI_PMCU_REG_INT_STAT));
+
+	if (!irq_status)
+		return IRQ_NONE;
+
+	if (irq_status & HISI_PMCU_INT_DONE) {
+		/*
+		 * Buffers and perf_output_handle should be up-to-date
+		 * for hisi_pmcu_pmu_stop() before exiting ISR
+		 */
+		struct perf_output_handle *handle = &hisi_pmcu->handle;
+		struct hisi_pmcu_buf *buf = perf_get_aux(handle);
+		struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
+		struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
+		int err;
+
+		spin_lock(&hisi_pmcu->lock);
+
+		ev->nr_pending_sample -= ev->nr_sample;
+		sbuf->remain -= ev->output_size;
+
+		if (!ev->nr_pending_sample) {
+			hisi_pmcu->busy = false;
+			dev_dbg(hisi_pmcu->dev, "Sampling finished\n");
+			goto skip;
+		}
+
+		err = hisi_pmcu_config_sample(ev, sbuf->remain);
+		if (err) {
+			/* This sbuf is full. Commit it, switch to the next. */
+			struct perf_event *event = handle->event;
+
+			perf_aux_output_end(handle, sbuf->size);
+
+			sbuf->remain = sbuf->size;
+
+			perf_aux_output_begin(handle, event);
+
+			if (++buf->cur_buf == buf->nr_buf)
+				buf->cur_buf = 0;
+			sbuf = &buf->sbuf[buf->cur_buf];
+
+			err = hisi_pmcu_config_sample(ev, sbuf->remain);
+			if (err) {
+				dev_err(hisi_pmcu->dev,
+					"Sampling stopped at AUX buffer %d, buffer size is probably tainted\n",
+					buf->cur_buf);
+				hisi_pmcu->busy = false;
+				goto skip;
+			}
+		}
+
+		hisi_pmcu_hw_sample_start(hisi_pmcu, buf);
+
+skip:
+		spin_unlock(&hisi_pmcu->lock);
+		writel(HISI_PMCU_INT_DONE, base + HISI_PMCU_REG_INT_CLR);
+	}
+
+	if (irq_status & HISI_PMCU_INT_BRK) {
+		hisi_pmcu->busy = false;
+		writel(HISI_PMCU_INT_BRK, base + HISI_PMCU_REG_INT_CLR);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static int hisi_pmcu_init_irq(struct platform_device *pdev,
+			      struct hisi_pmcu *hisi_pmcu)
+{
+	int irq, ret;
+
+	irq = platform_get_irq(pdev, 0);
+	if (irq < 0)
+		return irq;
+
+	ret = devm_request_irq(&pdev->dev, irq, hisi_pmcu_isr,
+			       IRQF_NOBALANCING | IRQF_NO_THREAD,
+			       dev_name(&pdev->dev), hisi_pmcu);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret,
+				     "Failed to request IRQ line %d\n", irq);
+
+	hisi_pmcu->irq = irq;
+
+	return 0;
+}
+
+static void hisi_pmcu_init_hw(struct hisi_pmcu *hisi_pmcu)
+{
+	void __iomem *base = hisi_pmcu->regbase;
+
+	writel(HISI_PMCU_INT_ALL, base + HISI_PMCU_REG_INT_EN);
+	writel(0, base + HISI_PMCU_REG_INT_MSK);
+	writel(HISI_PMCU_INT_ALL, base + HISI_PMCU_REG_INT_CLR);
+
+	writel(HISI_PMCU_PMCR_DEFAULT, base + HISI_PMCU_REG_PMCR);
+
+	if (hisi_pmcu->ev.comp_mode)
+		writel(HISI_PMCU_COMP_ENABLE, base + HISI_PMCU_REG_COMP);
+}
+
+static void hisi_pmcu_init_cpu_config(void *info)
+{
+	struct hisi_pmcu *hisi_pmcu = info;
+	u64 val, hpmn;
+	int cpu, sccl;
+
+	val = read_cpuid_mpidr();
+
+	if (FIELD_GET(MPIDR_MT_BITMASK, val))
+		sccl = MPIDR_AFFINITY_LEVEL(val, 3);
+	else
+		sccl = MPIDR_AFFINITY_LEVEL(val, 2);
+
+	if (sccl == hisi_pmcu->scclid) {
+		cpu = smp_processor_id();
+		cpumask_set_cpu(cpu, &hisi_pmcu->cpus);
+
+		val = read_sysreg(mdcr_el2);
+		hpmn = FIELD_GET(MDCR_EL2_HPMN_MASK, val);
+		if (hpmn > HISI_PMCU_COMP_HPMN_THR) {
+			hisi_pmcu->ev.comp_mode = HISI_PMCU_COMP_MODE_DISABLED;
+			dev_warn(hisi_pmcu->dev,
+				 "CPU%d MDCR_EL2.HPMN=%llu (> %d), PMCU may mess up VM's counter accesses\n",
+				 cpu, hpmn, HISI_PMCU_COMP_HPMN_THR);
+		}
+	}
+}
+
+static void hisi_pmcu_set_mdcr_el2_hpme(void *info)
+{
+	write_sysreg(read_sysreg(mdcr_el2) | MDCR_EL2_HPME, mdcr_el2);
+}
+
+static enum cpuhp_state hisi_pmcu_cpuhp_state;
+
+static void hisi_pmcu_remove_cpuhp_instance(void *cpuhp_node)
+{
+	cpuhp_state_remove_instance_nocalls(hisi_pmcu_cpuhp_state, cpuhp_node);
+}
+
+static int hisi_pmcu_init(struct platform_device *pdev,
+			  struct hisi_pmcu *hisi_pmcu)
+{
+	int ret;
+
+	hisi_pmcu->dev = &pdev->dev;
+
+	spin_lock_init(&hisi_pmcu->lock);
+
+	ret = hisi_pmcu_init_data(pdev, hisi_pmcu);
+	if (ret)
+		return ret;
+
+	ret = hisi_pmcu_init_irq(pdev, hisi_pmcu);
+	if (ret)
+		return ret;
+
+	hisi_pmcu_init_hw(hisi_pmcu);
+
+	/*
+	 * ARM64 sysreg MDCR_EL2.HPMN defines the number of core PMU counters
+	 * that are accessible from VMs. If HPMN > 0, PMCU may access PMU
+	 * counters at the same time as VMs, messing up the counter control.
+	 * PMCU supports a "compatibility mode", where it restricts itself to
+	 * use counters starting from an index of HISI_PMCU_COMP_HPMN_THR.
+	 * Hence, if HPMN <= HISI_PMCU_COMP_HPMN_THR on all CPUs, PMCU enables
+	 * the "compatibility mode" to resolve the conflict with VMs;
+	 * otherwise, we print a message to warn of potential conflicts.
+	 */
+	hisi_pmcu->ev.comp_mode = HISI_PMCU_COMP_MODE_UNDEFINE;
+	on_each_cpu(hisi_pmcu_init_cpu_config, hisi_pmcu, 1);
+	if (hisi_pmcu->ev.comp_mode != HISI_PMCU_COMP_MODE_DISABLED) {
+		hisi_pmcu->ev.comp_mode = HISI_PMCU_COMP_MODE_ENABLED;
+		on_each_cpu(hisi_pmcu_set_mdcr_el2_hpme, NULL, 1);
+	}
+
+	ret = cpuhp_state_add_instance(hisi_pmcu_cpuhp_state,
+				       &hisi_pmcu->cpuhp_node);
+	if (ret)
+		return ret;
+
+	return devm_add_action_or_reset(hisi_pmcu->dev,
+					hisi_pmcu_remove_cpuhp_instance,
+					&hisi_pmcu->cpuhp_node);
+}
+
+static void hisi_pmcu_unregister_pmu(void *pmu)
+{
+	perf_pmu_unregister(pmu);
+}
+
+static int hisi_pmcu_register_pmu(struct hisi_pmcu *hisi_pmcu)
+{
+	char *name;
+	int ret;
+
+	hisi_pmcu->pmu = (struct pmu) {
+		.module		= THIS_MODULE,
+		.attr_groups	= hisi_pmcu_attr_groups,
+		.capabilities	= PERF_PMU_CAP_EXCLUSIVE |
+				  PERF_PMU_CAP_AUX_OUTPUT,
+		.task_ctx_nr	= perf_invalid_context,
+		.event_init	= hisi_pmcu_pmu_event_init,
+		.add		= hisi_pmcu_pmu_add,
+		.del		= hisi_pmcu_pmu_del,
+		.start		= hisi_pmcu_pmu_start,
+		.stop		= hisi_pmcu_pmu_stop,
+		.setup_aux	= hisi_pmcu_pmu_setup_aux,
+		.free_aux	= hisi_pmcu_pmu_free_aux,
+	};
+
+	name = devm_kasprintf(hisi_pmcu->dev, GFP_KERNEL, "hisi_pmcu_sccl%d",
+			      hisi_pmcu->scclid);
+	if (!name)
+		return -ENOMEM;
+
+	ret = perf_pmu_register(&hisi_pmcu->pmu, name, -1);
+	if (ret)
+		return dev_err_probe(hisi_pmcu->dev, ret,
+				     "Failed to register PMU\n");
+
+	return devm_add_action_or_reset(hisi_pmcu->dev,
+					hisi_pmcu_unregister_pmu,
+					&hisi_pmcu->pmu);
+}
+
+static int hisi_pmcu_probe(struct platform_device *pdev)
+{
+	struct hisi_pmcu *hisi_pmcu;
+	int ret;
+
+	hisi_pmcu = devm_kzalloc(&pdev->dev, sizeof(*hisi_pmcu), GFP_KERNEL);
+	if (!hisi_pmcu)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, hisi_pmcu);
+
+	ret = hisi_pmcu_init(pdev, hisi_pmcu);
+	if (ret)
+		return ret;
+
+	return hisi_pmcu_register_pmu(hisi_pmcu);
+}
+
+static const struct acpi_device_id hisi_pmcu_acpi_match[] = {
+	{ "HISI0451", },
+	{}
+};
+MODULE_DEVICE_TABLE(acpi, hisi_pmcu_acpi_match);
+
+static struct platform_driver hisi_pmcu_driver = {
+	.driver = {
+		.name = HISI_PMCU_DRV_NAME,
+		.acpi_match_table = hisi_pmcu_acpi_match,
+		/*
+		 * Unbinding driver is not yet supported as we have not worked
+		 * out a safe bind/unbind process.
+		 */
+		.suppress_bind_attrs = true,
+	},
+	.probe = hisi_pmcu_probe,
+};
+
+static int hisi_pmcu_cpuhp_startup(unsigned int cpu, struct hlist_node *node)
+{
+	struct hisi_pmcu *hisi_pmcu;
+
+	hisi_pmcu = hlist_entry_safe(node, struct hisi_pmcu, cpuhp_node);
+
+	if (hisi_pmcu->on_cpu != -1)
+		return 0;
+
+	if (!cpumask_test_cpu(cpu, &hisi_pmcu->cpus))
+		return 0;
+
+	WARN_ON(irq_set_affinity(hisi_pmcu->irq, cpumask_of(cpu)));
+	hisi_pmcu->on_cpu = cpu;
+
+	return 0;
+}
+
+static int hisi_pmcu_cpuhp_teardown(unsigned int cpu, struct hlist_node *node)
+{
+	struct hisi_pmcu *hisi_pmcu;
+	cpumask_t available_cpus;
+	unsigned int target;
+
+	hisi_pmcu = hlist_entry_safe(node, struct hisi_pmcu, cpuhp_node);
+
+	if (hisi_pmcu->on_cpu != cpu)
+		return 0;
+
+	hisi_pmcu->on_cpu = -1;
+
+	cpumask_and(&available_cpus, &hisi_pmcu->cpus, cpu_online_mask);
+	target = cpumask_any_but(&available_cpus, cpu);
+	if (target >= nr_cpu_ids)
+		return 0;
+	perf_pmu_migrate_context(&hisi_pmcu->pmu, cpu, target);
+	WARN_ON(irq_set_affinity(hisi_pmcu->irq, cpumask_of(target)));
+	hisi_pmcu->on_cpu = target;
+
+	return 0;
+}
+
+static int __init hisi_pmcu_module_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, HISI_PMCU_DRV_NAME,
+				      hisi_pmcu_cpuhp_startup,
+				      hisi_pmcu_cpuhp_teardown);
+	if (ret < 0)
+		return ret;
+	hisi_pmcu_cpuhp_state = ret;
+
+	ret = platform_driver_register(&hisi_pmcu_driver);
+	if (ret)
+		cpuhp_remove_multi_state(hisi_pmcu_cpuhp_state);
+
+	return ret;
+}
+
+static void __exit hisi_pmcu_module_exit(void)
+{
+	platform_driver_unregister(&hisi_pmcu_driver);
+	cpuhp_remove_multi_state(hisi_pmcu_cpuhp_state);
+}
+
+module_init(hisi_pmcu_module_init);
+module_exit(hisi_pmcu_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jie Zhan <zhanjie9@hisilicon.com>");
+MODULE_DESCRIPTION("HiSilicon PMCU driver");
-- 
2.30.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v1 3/4] perf tool: Add HiSilicon PMCU data recording support
  2023-02-06  6:51 ` Jie Zhan
@ 2023-02-06  6:51   ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-02-06  6:51 UTC (permalink / raw)
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	zhanjie9, suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

Support for HiSilicon PMCU data recording using 'perf-record'.

Users can start PMCU profiling through 'perf-record'. Event numbers are
passed by a sysfs interface. The following optional parameters can be
passed through 'perf-record':
- nr_sample: number of samples to take
- sample_period_ms: time in ms for PMU counters to stay on for an event
- pmccfiltr: bits[31-24] of system register PMCCFILTR_EL0

Example usage:

1. Enter event numbers in the 'user_events' file:

	echo "0x10 0x11" > /sys/devices/hisi_pmcu_sccl3/user_events

2. Start the sampling with 'perf-record':

	perf record -e hisi_pmcu_sccl3/nr_sample=1000,sample_period_ms=1/

In this example, the PMCU takes 1000 samples of events 0x0010 and 0x0011
with a sampling period of 1ms. Data will be written to a 'perf.data' file.
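
For tools that want to fill in 'user_events' programmatically rather than
with 'echo', step 1 above could be sketched in C roughly as follows. This is
only an illustration and not part of the patch: the helper names
pmcu_format_user_events() and pmcu_write_user_events() are hypothetical, and
the space-separated hex format with a trailing newline is assumed from the
'echo' example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format event numbers the way the 'echo' example does, e.g. "0x10 0x11\n".
 * Returns the string length, or -1 if 'buf' is too small.
 * (Hypothetical helper; the exact format the driver accepts is an assumption.)
 */
int pmcu_format_user_events(char *buf, size_t len,
			    const unsigned int *events, size_t nr)
{
	size_t off = 0, i;
	int n;

	assert(buf && events);

	for (i = 0; i < nr; i++) {
		n = snprintf(buf + off, len - off, "%s0x%x",
			     i ? " " : "", events[i]);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += n;
	}

	n = snprintf(buf + off, len - off, "\n");
	if (n < 0 || (size_t)n >= len - off)
		return -1;

	return (int)(off + n);
}

/*
 * Write the formatted list to 'path', for instance
 * /sys/devices/hisi_pmcu_sccl3/user_events. Returns 0 on success, -1 on error.
 */
int pmcu_write_user_events(const char *path,
			   const unsigned int *events, size_t nr)
{
	char buf[256];
	FILE *f;
	int ret = -1;

	if (pmcu_format_user_events(buf, sizeof(buf), events, nr) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fwrite(buf, 1, strlen(buf), f) == strlen(buf))
		ret = 0;
	if (fclose(f) != 0)
		ret = -1;

	return ret;
}
```

Keeping the formatting separate from the file write makes the string easy to
check without the hardware present. Writing to sysfs would normally need
root privileges.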

Co-developed-by: Yang Shen <shenyang39@huawei.com>
Signed-off-by: Yang Shen <shenyang39@huawei.com>
Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 tools/perf/arch/arm/util/auxtrace.c    |  61 +++++++++++
 tools/perf/arch/arm64/util/Build       |   2 +-
 tools/perf/arch/arm64/util/hisi-pmcu.c | 145 +++++++++++++++++++++++++
 tools/perf/util/auxtrace.h             |   1 +
 tools/perf/util/hisi-pmcu.h            |  17 +++
 5 files changed, 225 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
 create mode 100644 tools/perf/util/hisi-pmcu.h

diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c
index deeb163999ce..05307c325137 100644
--- a/tools/perf/arch/arm/util/auxtrace.c
+++ b/tools/perf/arch/arm/util/auxtrace.c
@@ -17,6 +17,7 @@
 #include "cs-etm.h"
 #include "arm-spe.h"
 #include "hisi-ptt.h"
+#include "hisi-pmcu.h"
 
 static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err)
 {
@@ -99,6 +100,52 @@ static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err)
 	return hisi_ptt_pmus;
 }
 
+static struct perf_pmu **find_all_hisi_pmcu_pmus(int *nr_pmcus, int *err)
+{
+	const char *sysfs = sysfs__mountpoint();
+	struct perf_pmu **hisi_pmcu_pmus = NULL;
+	struct dirent *dent;
+	char path[PATH_MAX];
+	DIR *dir = NULL;
+	int idx = 0;
+
+	snprintf(path, PATH_MAX, "%s" EVENT_SOURCE_DEVICE_PATH, sysfs);
+	dir = opendir(path);
+	if (!dir) {
+		pr_err("can't read directory '%s'\n", EVENT_SOURCE_DEVICE_PATH);
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, HISI_PMCU_PMU_NAME))
+			(*nr_pmcus)++;
+	}
+
+	if (!(*nr_pmcus))
+		goto out;
+
+	hisi_pmcu_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_pmcus));
+	if (!hisi_pmcu_pmus) {
+		pr_err("hisi_pmcu alloc failed\n");
+		*err = -ENOMEM;
+		goto out;
+	}
+
+	rewinddir(dir);
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, HISI_PMCU_PMU_NAME) && idx < *nr_pmcus) {
+			hisi_pmcu_pmus[idx] = perf_pmu__find(dent->d_name);
+			if (hisi_pmcu_pmus[idx])
+				idx++;
+		}
+	}
+
+out:
+	closedir(dir);
+	return hisi_pmcu_pmus;
+}
+
 static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
 					   int pmu_nr, struct evsel *evsel)
 {
@@ -121,13 +168,16 @@ struct auxtrace_record
 	struct perf_pmu	*cs_etm_pmu = NULL;
 	struct perf_pmu **arm_spe_pmus = NULL;
 	struct perf_pmu **hisi_ptt_pmus = NULL;
+	struct perf_pmu **hisi_pmcu_pmus = NULL;
 	struct evsel *evsel;
 	struct perf_pmu *found_etm = NULL;
 	struct perf_pmu *found_spe = NULL;
 	struct perf_pmu *found_ptt = NULL;
+	struct perf_pmu *found_pmcu = NULL;
 	int auxtrace_event_cnt = 0;
 	int nr_spes = 0;
 	int nr_ptts = 0;
+	int nr_pmcus = 0;
 
 	if (!evlist)
 		return NULL;
@@ -135,6 +185,7 @@ struct auxtrace_record
 	cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);
 	arm_spe_pmus = find_all_arm_spe_pmus(&nr_spes, err);
 	hisi_ptt_pmus = find_all_hisi_ptt_pmus(&nr_ptts, err);
+	hisi_pmcu_pmus = find_all_hisi_pmcu_pmus(&nr_pmcus, err);
 
 	evlist__for_each_entry(evlist, evsel) {
 		if (cs_etm_pmu && !found_etm)
@@ -145,10 +196,14 @@ struct auxtrace_record
 
 		if (hisi_ptt_pmus && !found_ptt)
 			found_ptt = find_pmu_for_event(hisi_ptt_pmus, nr_ptts, evsel);
+
+		if (hisi_pmcu_pmus && !found_pmcu)
+			found_pmcu = find_pmu_for_event(hisi_pmcu_pmus, nr_pmcus, evsel);
 	}
 
 	free(arm_spe_pmus);
 	free(hisi_ptt_pmus);
+	free(hisi_pmcu_pmus);
 
 	if (found_etm)
 		auxtrace_event_cnt++;
@@ -159,6 +214,9 @@ struct auxtrace_record
 	if (found_ptt)
 		auxtrace_event_cnt++;
 
+	if (found_pmcu)
+		auxtrace_event_cnt++;
+
 	if (auxtrace_event_cnt > 1) {
 		pr_err("Concurrent AUX trace operation not currently supported\n");
 		*err = -EOPNOTSUPP;
@@ -174,6 +232,9 @@ struct auxtrace_record
 
 	if (found_ptt)
 		return hisi_ptt_recording_init(err, found_ptt);
+
+	if (found_pmcu)
+		return hisi_pmcu_recording_init(err, found_pmcu);
 #endif
 
 	/*
diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
index 337aa9bdf905..daba9e6ae054 100644
--- a/tools/perf/arch/arm64/util/Build
+++ b/tools/perf/arch/arm64/util/Build
@@ -11,4 +11,4 @@ perf-$(CONFIG_LIBDW_DWARF_UNWIND) += unwind-libdw.o
 perf-$(CONFIG_AUXTRACE) += ../../arm/util/pmu.o \
 			      ../../arm/util/auxtrace.o \
 			      ../../arm/util/cs-etm.o \
-			      arm-spe.o mem-events.o hisi-ptt.o
+			      arm-spe.o mem-events.o hisi-ptt.o hisi-pmcu.o
diff --git a/tools/perf/arch/arm64/util/hisi-pmcu.c b/tools/perf/arch/arm64/util/hisi-pmcu.c
new file mode 100644
index 000000000000..7c33abf1182d
--- /dev/null
+++ b/tools/perf/arch/arm64/util/hisi-pmcu.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * HiSilicon Performance Monitor Control Unit (PMCU) support
+ *
+ * Copyright (C) 2022 HiSilicon Limited
+ */
+
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/types.h>
+#include <linux/zalloc.h>
+#include <time.h>
+#include <math.h>
+#include <internal/lib.h>
+#include <internal/threadmap.h>
+
+#include "../../../util/auxtrace.h"
+#include "../../../util/debug.h"
+#include "../../../util/event.h"
+#include "../../../util/evlist.h"
+#include "../../../util/hisi-pmcu.h"
+#include "../../../util/pmu.h"
+#include "../../../util/record.h"
+#include "../../../util/session.h"
+#include "../../../util/thread_map.h"
+
+#define KiB(x) ((x) * 1024)
+#define MiB(x) ((x) * 1024 * 1024)
+#define HISI_PMCU_DATA_ALIGNMENT	4
+
+struct hisi_pmcu_record {
+	struct auxtrace_record itr;
+	struct perf_pmu *hisi_pmcu_pmu;
+	struct evlist *evlist;
+};
+
+static int hisi_pmcu_recording_options(struct auxtrace_record *itr,
+				       struct evlist *evlist,
+				       struct record_opts *opts)
+{
+	struct hisi_pmcu_record *pmcu_record =
+			container_of(itr, struct hisi_pmcu_record, itr);
+	struct perf_pmu *hisi_pmcu_pmu = pmcu_record->hisi_pmcu_pmu;
+	struct evsel *hisi_pmcu_evsel = NULL;
+	struct evsel *evsel;
+
+	if (!perf_event_paranoid_check(-1))
+		return -EPERM;
+
+	pmcu_record->evlist = evlist;
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->core.attr.type == hisi_pmcu_pmu->type) {
+			if (hisi_pmcu_evsel) {
+				pr_err("Only one event allowed on a PMCU\n");
+				return -EINVAL;
+			}
+			evsel->core.attr.sample_period = 1;
+			evsel->core.attr.freq = false;
+			evsel->needs_auxtrace_mmap = true;
+			opts->full_auxtrace = true;
+			hisi_pmcu_evsel = evsel;
+		}
+	}
+
+	opts->auxtrace_mmap_pages = MiB(16) / page_size;
+
+	/*
+	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
+	 * must come first.
+	 */
+	evlist__to_front(evlist, hisi_pmcu_evsel);
+	evsel__set_sample_bit(hisi_pmcu_evsel, TIME);
+
+	return 0;
+}
+
+static size_t hisi_pmcu_info_priv_size(struct auxtrace_record *itr __maybe_unused,
+				       struct evlist *evlist __maybe_unused)
+{
+	return HISI_PMCU_AUXTRACE_PRIV_SIZE;
+}
+
+static int hisi_pmcu_info_fill(struct auxtrace_record *itr,
+			       struct perf_session *session,
+			       struct perf_record_auxtrace_info *auxtrace_info,
+			       size_t priv_size)
+{
+	struct hisi_pmcu_record *pmcu_record =
+			container_of(itr, struct hisi_pmcu_record, itr);
+	struct perf_pmu *hisi_pmcu_pmu = pmcu_record->hisi_pmcu_pmu;
+
+	if (priv_size != HISI_PMCU_AUXTRACE_PRIV_SIZE)
+		return -EINVAL;
+
+	if (!session->evlist->core.nr_mmaps)
+		return -EINVAL;
+
+	auxtrace_info->type = PERF_AUXTRACE_HISI_PMCU;
+	auxtrace_info->priv[0] = hisi_pmcu_pmu->type;
+
+	return 0;
+}
+
+static void hisi_pmcu_record_free(struct auxtrace_record *itr)
+{
+	struct hisi_pmcu_record *pmcu_record =
+			container_of(itr, struct hisi_pmcu_record, itr);
+
+	free(pmcu_record);
+}
+
+static u64 hisi_pmcu_reference(struct auxtrace_record *itr __maybe_unused)
+{
+	return 0;
+}
+
+struct auxtrace_record *hisi_pmcu_recording_init(int *err,
+						 struct perf_pmu *hisi_pmcu_pmu)
+{
+	struct hisi_pmcu_record *pmcu_record;
+
+	if (!hisi_pmcu_pmu) {
+		*err = -ENODEV;
+		return NULL;
+	}
+
+	pmcu_record = zalloc(sizeof(*pmcu_record));
+	if (!pmcu_record) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+
+	pmcu_record->hisi_pmcu_pmu = hisi_pmcu_pmu;
+	pmcu_record->itr.recording_options = hisi_pmcu_recording_options;
+	pmcu_record->itr.info_priv_size = hisi_pmcu_info_priv_size;
+	pmcu_record->itr.info_fill = hisi_pmcu_info_fill;
+	pmcu_record->itr.free = hisi_pmcu_record_free;
+	pmcu_record->itr.reference = hisi_pmcu_reference;
+	pmcu_record->itr.read_finish = auxtrace_record__read_finish;
+	pmcu_record->itr.alignment = HISI_PMCU_DATA_ALIGNMENT;
+	pmcu_record->itr.pmu = hisi_pmcu_pmu;
+
+	*err = 0;
+	return &pmcu_record->itr;
+}
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 6a0f9b98f059..89b2b14407f5 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -49,6 +49,7 @@ enum auxtrace_type {
 	PERF_AUXTRACE_ARM_SPE,
 	PERF_AUXTRACE_S390_CPUMSF,
 	PERF_AUXTRACE_HISI_PTT,
+	PERF_AUXTRACE_HISI_PMCU,
 };
 
 enum itrace_period_type {
diff --git a/tools/perf/util/hisi-pmcu.h b/tools/perf/util/hisi-pmcu.h
new file mode 100644
index 000000000000..d46d523a3aee
--- /dev/null
+++ b/tools/perf/util/hisi-pmcu.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * HiSilicon Performance Monitor Control Unit (PMCU) support
+ *
+ * Copyright (C) 2022 HiSilicon Limited
+ */
+
+#ifndef INCLUDE__PERF_HISI_PMCU_H__
+#define INCLUDE__PERF_HISI_PMCU_H__
+
+#define HISI_PMCU_PMU_NAME		"hisi_pmcu"
+#define HISI_PMCU_AUXTRACE_PRIV_SIZE	sizeof(u64)
+
+struct auxtrace_record *hisi_pmcu_recording_init(int *err,
+					struct perf_pmu *hisi_pmcu_pmu);
+
+#endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v1 4/4] perf tool: Add HiSilicon PMCU data decoding support
  2023-02-06  6:51 ` Jie Zhan
@ 2023-02-06  6:51   ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-02-06  6:51 UTC (permalink / raw)
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	zhanjie9, suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

Support dumping the raw trace of HiSilicon PMCU data using 'perf-report'
or 'perf-script'.

Example usage:

 # perf report -D

Output will contain the raw PMCU data with notes, such as:

. ... HISI PMCU data: size 0x9630 bytes
. ... Header: size 0x30 bytes
.  00000000:  00 00 40 00 04 00 00 00 08 00 00 00 00 00 00 00
.  00000010:  80 01 00 00 01 00 00 00 04 00 00 00 10 00 00 00
.  00000020:  11 00 00 00 12 00 00 00 13 00 00 00 00 00 00 00
.  Auxtrace buffer max size: 0x400000
.  Number of PMU counters in parallel: 4
.  Number of monitored CPUs: 8
.  Compatible mode: no
.  Subsample size: 0x180
.  Number of subsamples per sample: 1
.  Number of events: 4
.  Event   0: 0x0010
.  Event   1: 0x0011
.  Event   2: 0x0012
.  Event   3: 0x0013
. ... Data: size 0x9600 bytes
.  Sample 0
.    Subsample 0
.    00000030:  00000000            PMCID0SR CPU 0
.    00000034:  00000000            PMCID0SR CPU 1
.    00000038:  00000000            PMCID0SR CPU 2
.    0000003c:  00000000            PMCID0SR CPU 3
.    00000040:  00000000            PMCID0SR CPU 4
.    00000044:  00000000            PMCID0SR CPU 5
.    00000048:  00000000            PMCID0SR CPU 6
.    0000004c:  00000000            PMCID0SR CPU 7
.    00000050:  000000ba            PMCID1SR CPU 0
.    00000054:  000056fe            PMCID1SR CPU 1
.    00000058:  00000000            PMCID1SR CPU 2
.    0000005c:  00000000            PMCID1SR CPU 3
.    00000060:  00000195            PMCID1SR CPU 4
.    00000064:  000056fc            PMCID1SR CPU 5
.    00000068:  00000000            PMCID1SR CPU 6
.    0000006c:  00000000            PMCID1SR CPU 7
.    00000070:  0000000000000000    Event 0010 CPU 0
.    00000078:  0000000000000000    Event 0010 CPU 1
.    00000080:  0000000000000000    Event 0010 CPU 2
.    00000088:  0000000000000000    Event 0010 CPU 3
.    00000090:  0000000000000000    Event 0010 CPU 4
.    00000098:  0000000000000001    Event 0010 CPU 5
.    000000a0:  0000000000000000    Event 0010 CPU 6
.    000000a8:  0000000000000000    Event 0010 CPU 7
.    000000b0:  0000000000000000    Event 0011 CPU 0
.    000000b8:  0000000000000000    Event 0011 CPU 1
.    000000c0:  0000000000000000    Event 0011 CPU 2
.    000000c8:  0000000000000000    Event 0011 CPU 3
.    000000d0:  000000000000d614    Event 0011 CPU 4
.    000000d8:  000000000000046b    Event 0011 CPU 5
.    000000e0:  0000000000000000    Event 0011 CPU 6
.    000000e8:  0000000000000000    Event 0011 CPU 7
.    000000f0:  0000000000000000    Event 0012 CPU 0
.    000000f8:  0000000000000000    Event 0012 CPU 1
.    00000100:  0000000000000000    Event 0012 CPU 2
.    00000108:  0000000000000000    Event 0012 CPU 3
.    00000110:  00000000000000f4    Event 0012 CPU 4
.    00000118:  0000000000000003    Event 0012 CPU 5
.    00000120:  0000000000000000    Event 0012 CPU 6
.    00000128:  0000000000000000    Event 0012 CPU 7
.    00000130:  0000000000000000    Event 0013 CPU 0
.    00000138:  0000000000000000    Event 0013 CPU 1
.    00000140:  0000000000000000    Event 0013 CPU 2
.    00000148:  0000000000000000    Event 0013 CPU 3
.    00000150:  00000000000000f4    Event 0013 CPU 4
.    00000158:  0000000000000004    Event 0013 CPU 5
.    00000160:  0000000000000000    Event 0013 CPU 6
.    00000168:  0000000000000000    Event 0013 CPU 7
.    00000170:  000000000000d614    Cycle count CPU 0
.    00000178:  000000000000d614    Cycle count CPU 1
.    00000180:  0000000000000000    Cycle count CPU 2
.    00000188:  0000000000000000    Cycle count CPU 3
.    00000190:  000000000000d614    Cycle count CPU 4
.    00000198:  000000000000d614    Cycle count CPU 5
.    000001a0:  0000000000000000    Cycle count CPU 6
.    000001a8:  0000000000000000    Cycle count CPU 7
(...more data follows)
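
As a reading aid for the dump above, the header layout can be sketched as a
small standalone C parser. This is an illustration under stated assumptions,
not the decoder added by this patch: the names pmcu_rd32(),
pmcu_parse_header() and pmcu_subsample_size() are hypothetical, the words are
assumed to be little-endian, the cap of 240 events is taken from the cover
letter, and the subsample size formula is inferred from the example output
only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMCU_HDR_FIXED_WORDS	7	/* 32-bit fields before the event list */
#define PMCU_HDR_ALIGN		0x10	/* header is padded to 16 bytes */
#define PMCU_MAX_EVENTS		240	/* assumed cap, from the cover letter */

struct pmcu_header {
	uint32_t buffer_size;
	uint32_t nr_pmu;
	uint32_t nr_cpu;
	uint32_t comp_mode;
	uint32_t subsample_size;
	uint32_t nr_subsample_per_sample;
	uint32_t nr_event;
	uint32_t events[PMCU_MAX_EVENTS];
};

/* Read one little-endian 32-bit word (the byte order is an assumption). */
uint32_t pmcu_rd32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns the padded header length in bytes, or 0 on malformed input. */
size_t pmcu_parse_header(const unsigned char *buf, size_t size,
			 struct pmcu_header *h)
{
	size_t len = PMCU_HDR_FIXED_WORDS * sizeof(uint32_t);
	uint32_t i;

	assert(buf && h);

	if (size < len)
		return 0;

	h->buffer_size = pmcu_rd32(buf);
	h->nr_pmu = pmcu_rd32(buf + 4);
	h->nr_cpu = pmcu_rd32(buf + 8);
	h->comp_mode = pmcu_rd32(buf + 12);
	h->subsample_size = pmcu_rd32(buf + 16);
	h->nr_subsample_per_sample = pmcu_rd32(buf + 20);
	h->nr_event = pmcu_rd32(buf + 24);

	/* Reject event counts that would overflow the fixed-size array. */
	if (h->nr_event > PMCU_MAX_EVENTS)
		return 0;

	len += h->nr_event * sizeof(uint32_t);
	if (size < len)
		return 0;

	for (i = 0; i < h->nr_event; i++)
		h->events[i] = pmcu_rd32(buf + 28 + 4 * i);

	/* Round up to the header alignment, as in "Header: size 0x30". */
	return (len + PMCU_HDR_ALIGN - 1) & ~(size_t)(PMCU_HDR_ALIGN - 1);
}

/*
 * Subsample size inferred from the dump: per CPU, two 32-bit PMCIDnSR
 * words, nr_pmu 64-bit event counters and one 64-bit cycle counter.
 */
size_t pmcu_subsample_size(uint32_t nr_cpu, uint32_t nr_pmu)
{
	return (size_t)nr_cpu * (2 * 4 + (size_t)nr_pmu * 8 + 8);
}
```

Fed the 0x30 header bytes shown above, this yields the same field values as
the annotated output, and 8 CPUs with 4 counters give a subsample size of
0x180, which is why Sample 0 spans offsets 0x30 to 0x1af.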

Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 tools/perf/util/Build       |   1 +
 tools/perf/util/auxtrace.c  |   4 +
 tools/perf/util/hisi-pmcu.c | 305 ++++++++++++++++++++++++++++++++++++
 tools/perf/util/hisi-pmcu.h |   2 +
 4 files changed, 312 insertions(+)
 create mode 100644 tools/perf/util/hisi-pmcu.c

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index e315ecaec323..e062a2c1b962 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -120,6 +120,7 @@ perf-$(CONFIG_AUXTRACE) += arm-spe.o
 perf-$(CONFIG_AUXTRACE) += arm-spe-decoder/
 perf-$(CONFIG_AUXTRACE) += hisi-ptt.o
 perf-$(CONFIG_AUXTRACE) += hisi-ptt-decoder/
+perf-$(CONFIG_AUXTRACE) += hisi-pmcu.o
 perf-$(CONFIG_AUXTRACE) += s390-cpumsf.o
 
 ifdef CONFIG_LIBOPENCSD
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 46ada5ec3f9a..ac19220d307e 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -53,6 +53,7 @@
 #include "intel-bts.h"
 #include "arm-spe.h"
 #include "hisi-ptt.h"
+#include "hisi-pmcu.h"
 #include "s390-cpumsf.h"
 #include "util/mmap.h"
 
@@ -1324,6 +1325,9 @@ int perf_event__process_auxtrace_info(struct perf_session *session,
 	case PERF_AUXTRACE_HISI_PTT:
 		err = hisi_ptt_process_auxtrace_info(event, session);
 		break;
+	case PERF_AUXTRACE_HISI_PMCU:
+		err = hisi_pmcu_process_auxtrace_info(event, session);
+		break;
 	case PERF_AUXTRACE_UNKNOWN:
 	default:
 		return -EINVAL;
diff --git a/tools/perf/util/hisi-pmcu.c b/tools/perf/util/hisi-pmcu.c
new file mode 100644
index 000000000000..7e0b41cd464d
--- /dev/null
+++ b/tools/perf/util/hisi-pmcu.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * HiSilicon Performance Monitor Control Unit (PMCU) support
+ *
+ * Copyright (C) 2022 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <linux/math.h>
+#include <linux/types.h>
+#include <linux/zalloc.h>
+#include <perf/event.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "auxtrace.h"
+#include "color.h"
+#include "debug.h"
+#include "event.h"
+#include "evsel.h"
+#include "hisi-pmcu.h"
+#include "session.h"
+#include "tool.h"
+#include <internal/lib.h>
+
+#define HISI_PMCU_AUX_HEADER_ALIGN	0x10
+#define HISI_PMCU_NR_CPU_CLUSTER	8
+#define dump_print(fmt, ...) \
+	color_fprintf(stdout, PERF_COLOR_BLUE, fmt, ##__VA_ARGS__)
+
+enum hisi_pmcu_auxtrace_header_index {
+	HISI_PMCU_HEADER_BUFFER_SIZE,
+	HISI_PMCU_HEADER_NR_PMU,
+	HISI_PMCU_HEADER_NR_CPU,
+	HISI_PMCU_HEADER_COMP_MODE,
+	HISI_PMCU_HEADER_SUBSAMPLE_SIZE,
+	HISI_PMCU_HEADER_NR_SUBSAMPLE_PER_SAMPLE,
+	HISI_PMCU_HEADER_NR_EVENT,
+	HISI_PMCU_HEADER_MAX
+};
+
+struct hisi_pmcu_aux_header_info {
+	u32 buffer_size;
+	u32 nr_pmu;
+	u32 nr_cpu;
+	u32 comp_mode;
+	u32 subsample_size;
+	u32 nr_subsample_per_sample;
+	u32 nr_event;
+	u32 events[];
+};
+
+struct hisi_pmcu_process {
+	u32 pmu_type;
+	struct auxtrace auxtrace;
+	struct hisi_pmcu_aux_header_info *header;
+};
+
+static int hisi_pmcu_process_event(struct perf_session *session __maybe_unused,
+				   union perf_event *event __maybe_unused,
+				   struct perf_sample *sample __maybe_unused,
+				   struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static int hisi_pmcu_process_header(struct hisi_pmcu_process *pmcu,
+				    const unsigned char *__data, u64 size)
+{
+	struct hisi_pmcu_aux_header_info *header;
+	const u32 *data = (const u32 *) __data;
+	unsigned int i, j;
+	u32 read_size;
+
+	read_size = HISI_PMCU_HEADER_MAX * sizeof(*data);
+	if (size < read_size)
+		return -EINVAL;
+
+	read_size += data[HISI_PMCU_HEADER_NR_EVENT] * sizeof(*data);
+	if (size < read_size)
+		return -EINVAL;
+
+	pmcu->header = malloc(read_size);
+	header = pmcu->header;
+	memcpy(header, data, read_size);
+	read_size = round_up(read_size, HISI_PMCU_AUX_HEADER_ALIGN);
+
+	dump_print(". ... Header: size 0x%x bytes\n", read_size);
+	for (i = 0; i < read_size; i += HISI_PMCU_AUX_HEADER_ALIGN) {
+		dump_print(".  %08x:  ", i);
+		for (j = 0; j < HISI_PMCU_AUX_HEADER_ALIGN; j++)
+			dump_print("%02x ", __data[i + j]);
+		dump_print("\n");
+	}
+
+	dump_print(".  Auxtrace buffer max size: 0x%x\n", header->buffer_size);
+	dump_print(".  Number of PMU counters in parallel: %d\n", header->nr_pmu);
+	dump_print(".  Number of monitored CPUs: %d\n", header->nr_cpu);
+	dump_print(".  Compatible mode: %s\n", header->comp_mode ? "yes" : "no");
+	dump_print(".  Subsample size: 0x%x\n", header->subsample_size);
+	dump_print(".  Number of subsamples per sample: %d\n", header->nr_subsample_per_sample);
+	dump_print(".  Number of events: %d\n", header->nr_event);
+
+	for (i = 0; i < header->nr_event; i++)
+		dump_print(".  Event %3d: 0x%04x\n", i, header->events[i]);
+
+	return read_size;
+}
+
+static int hisi_pmcu_dump_subsample(struct hisi_pmcu_aux_header_info *header,
+				    const unsigned char *data, u64 offset,
+				    u32 evoffset)
+{
+	int nr_cluster, core, cid, i;
+	u32 pos = 0, event;
+
+	nr_cluster = header->nr_cpu / HISI_PMCU_NR_CPU_CLUSTER;
+
+	for (cid = 0; cid < 2; cid++) {
+		for (core = 0; core < HISI_PMCU_NR_CPU_CLUSTER; core++) {
+			for (i = 0; i < nr_cluster; i++) {
+				dump_print(".    %08lx:  %08x            PMCID%dSR CPU %d\n",
+					   offset + pos, *(u32 *) (data + pos),
+					   cid,
+					   core + i * HISI_PMCU_NR_CPU_CLUSTER);
+				pos += sizeof(u32);
+			}
+		}
+	}
+
+	for (event = 0; event < header->nr_pmu; event++) {
+		for (core = 0; core < HISI_PMCU_NR_CPU_CLUSTER; core++) {
+			for (i = 0; i < nr_cluster; i++) {
+				dump_print(".    %08lx:  %016llx    Event %04x CPU %d\n",
+					   offset + pos, *(u64 *) (data + pos),
+					   header->events[event + evoffset],
+					   core + i * HISI_PMCU_NR_CPU_CLUSTER);
+				pos += sizeof(u64);
+			}
+		}
+	}
+
+	if (!header->comp_mode) {
+		for (core = 0; core < HISI_PMCU_NR_CPU_CLUSTER; core++) {
+			for (i = 0; i < nr_cluster; i++) {
+				dump_print(".    %08lx:  %016llx    Cycle count CPU %d\n",
+					   offset + pos, *(u64 *) (data + pos),
+					   core + i * HISI_PMCU_NR_CPU_CLUSTER);
+				pos += sizeof(u64);
+			}
+		}
+	}
+
+	return pos;
+}
+
+static int hisi_pmcu_dump_sample(struct hisi_pmcu_aux_header_info *header,
+				 const unsigned char *data, u64 offset)
+{
+	u32 pos = 0, i = 0;
+
+	while (i < header->nr_subsample_per_sample) {
+		dump_print(".    Subsample %d\n", i + 1);
+		pos += hisi_pmcu_dump_subsample(header, data + pos,
+						offset + pos,
+						i * header->nr_pmu);
+		i++;
+	}
+
+	return pos;
+}
+
+static int hisi_pmcu_dump_data(struct hisi_pmcu_process *pmcu,
+			       const unsigned char *data, u64 size)
+{
+	struct hisi_pmcu_aux_header_info *header;
+	u32 sample_size;
+	u32 nr_sample;
+	u64 pos = 0;
+	int ret;
+
+	dump_print(". ... HISI PMCU data: size 0x%lx bytes\n", size);
+
+	ret = hisi_pmcu_process_header(pmcu, data, size);
+	if (ret < 0)
+		return ret;
+
+	pos += ret;
+
+	header = pmcu->header;
+	sample_size = header->subsample_size * header->nr_subsample_per_sample;
+	nr_sample = 1;
+	dump_print(". ... Data: size 0x%lx bytes\n", size - pos);
+	while (pos < size) {
+		u32 buf_remain;
+
+		dump_print(".  Sample %d\n", nr_sample);
+		pos += hisi_pmcu_dump_sample(header, data + pos, pos);
+		nr_sample++;
+
+		// Skip gap at the end of an auxtrace buffer
+		buf_remain = header->buffer_size - pos % header->buffer_size;
+		if (buf_remain < sample_size)
+			pos += buf_remain;
+	}
+
+	return 0;
+}
+
+static int hisi_pmcu_process_auxtrace_event(struct perf_session *session,
+					    union perf_event *event,
+					    struct perf_tool *tool __maybe_unused)
+{
+	struct hisi_pmcu_process *pmcu_process;
+	void *data;
+	u64 size;
+	int fd;
+
+	if (!dump_trace)
+		return 0;
+
+	size = event->auxtrace.size;
+	if (!size)
+		return 0;
+
+	data = malloc(size);
+	if (!data)
+		return -errno;
+
+	fd = perf_data__fd(session->data);
+
+	if (readn(fd, data, size) < 0) {
+		free(data);
+		return -errno;
+	}
+
+	pmcu_process = container_of(session->auxtrace,
+				    struct hisi_pmcu_process, auxtrace);
+
+	return hisi_pmcu_dump_data(pmcu_process, data, size);
+}
+
+static int hisi_pmcu_flush_events(struct perf_session *session __maybe_unused,
+				  struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static void hisi_pmcu_free_events(struct perf_session *session __maybe_unused)
+{
+}
+
+static void hisi_pmcu_free(struct perf_session *session)
+{
+	struct hisi_pmcu_process *pmcu_process;
+
+	pmcu_process = container_of(session->auxtrace,
+				    struct hisi_pmcu_process, auxtrace);
+
+	session->auxtrace = NULL;
+	free(pmcu_process);
+}
+
+static bool hisi_pmcu_evsel_is_auxtrace(struct perf_session *session,
+					struct evsel *evsel)
+{
+	struct hisi_pmcu_process *pmcu_process;
+
+	pmcu_process = container_of(session->auxtrace,
+				    struct hisi_pmcu_process, auxtrace);
+
+	return evsel->core.attr.type == pmcu_process->pmu_type;
+}
+
+int hisi_pmcu_process_auxtrace_info(union perf_event *event,
+				    struct perf_session *session)
+{
+	struct perf_record_auxtrace_info *auxtrace_info;
+	struct hisi_pmcu_process *pmcu_process;
+
+	auxtrace_info = &event->auxtrace_info;
+
+	if (auxtrace_info->header.size < sizeof(*auxtrace_info) +
+					 HISI_PMCU_AUXTRACE_PRIV_SIZE)
+		return -EINVAL;
+
+	pmcu_process = zalloc(sizeof(*pmcu_process));
+	if (!pmcu_process)
+		return -ENOMEM;
+
+	pmcu_process->pmu_type = auxtrace_info->priv[0];
+
+	pmcu_process->auxtrace = (struct auxtrace) {
+		.process_event =  hisi_pmcu_process_event,
+		.process_auxtrace_event = hisi_pmcu_process_auxtrace_event,
+		.flush_events = hisi_pmcu_flush_events,
+		.free_events = hisi_pmcu_free_events,
+		.free = hisi_pmcu_free,
+		.evsel_is_auxtrace = hisi_pmcu_evsel_is_auxtrace,
+	};
+
+	session->auxtrace = &pmcu_process->auxtrace;
+
+	return 0;
+}
diff --git a/tools/perf/util/hisi-pmcu.h b/tools/perf/util/hisi-pmcu.h
index d46d523a3aee..8df74695164b 100644
--- a/tools/perf/util/hisi-pmcu.h
+++ b/tools/perf/util/hisi-pmcu.h
@@ -14,4 +14,6 @@
 struct auxtrace_record *hisi_pmcu_recording_init(int *err,
 					struct perf_pmu *hisi_pmcu_pmu);
 
+int hisi_pmcu_process_auxtrace_info(union perf_event *event,
+				    struct perf_session *session);
 #endif
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
  2023-02-06  6:51   ` Jie Zhan
@ 2023-02-07  3:03     ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-02-07  3:03 UTC (permalink / raw)
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users


On 06/02/2023 14:51, Jie Zhan wrote:

> +
> +2. Profiling with ``perf-record``.
> +
> +   The command to start the sampling is::
> +
> +        perf record -e hisi_pmcu_sccl3/<configs>/
> +
> +   Users can pass the following optional parameters to ``<configs>``:
> +
> +   - nr_sample: number of samples to take. This defaults to 128.
> +   - sample_period_ms: time interval in microseconds for PMU counters to keep

I spotted a typo and am flagging it before it causes any confusion. This
should be "milliseconds" rather than "microseconds".

Jie

> +     counting for each event. This defaults to 3, i.e. 3ms, and its max
> +     value is 85,899, i.e. 85 seconds.
> +   - pmccfiltr: bits 31-24 of the sysreg PMCCFILTR_EL0, which controls how the
> +     cycle counter increments. This defaults to 0x00. Please refer to the
> +     "Performance Monitors external register descriptions" of *Arm Architecture
> +     Reference Manual for A-profile architecture* on how to configure
> +     PMCCFILTR_EL0.
> +
> ...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit
  2023-02-06  6:51 ` Jie Zhan
@ 2023-02-27  8:49   ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-02-27  8:49 UTC (permalink / raw)
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

Could anyone please take a look at this PMCU patchset and provide some
comments?

It is closely related to the ARM PMU.

We look forward to your feedback.

Any relevant comments or questions on the software or hardware design,
use cases, or coding are welcome.

Kind regards,

Jie


On 06/02/2023 14:51, Jie Zhan wrote:
> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> PMU accesses from CPUs, handling the configuration, event switching, and
> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> scheme may lose events or drop sampling frequency. With PMCU, users can
> reliably obtain the data of up to 240 PMU events with the sample interval
> of events down to 1ms, while the software overhead of accessing PMUs, as
> well as its impact on target workloads, is reduced.
>
> This patchset contains the documentation, driver, and user perf tool
> support to enable using PMCU with the 'perf_event' framework.
>
> Here are two key questions requested for comments:
>
> - How do we make it compatible with arm_pmu drivers?
>
>    Hardware-wise, PMCU uses the existing core PMUs, so PMUs can be accessed
>    from CPU and PMCU simultaneously. The current hardware can't guarantee
>    mutually exclusive accesses. Hence, scheduling arm_pmu and PMCU events at
>    the same time may mess up the operation of PMUs, delivering incorrect
>    data for both events, e.g. unexpected events or sample periods.
>    Software-wise, we probably need to prevent the two types of events from
>    running at the same time, but currently there isn't a clear solution.
>
> - Currently we rely on a sysfs file for users to input event numbers. Is
>    there a better way to pass many events?
>
>    The perf framework only allows three 64-bit config fields for custom PMU
>    configs. Obviously, this can't satisfy our need for passing many events
>    at a time. As an event number is 16-bit wide, the config fields can only
>    take up to 12 events at a time, or up to 192 events even if we do a
>    bitmap of events (and there are more than 192 available event numbers).
>    Hence, the current design takes an array of event numbers from a sysfs
>    file before starting profiling. However, this may go against the common
>    way to schedule perf events through perf commands.
>
> Jie Zhan (4):
>    docs: perf: Add documentation for HiSilicon PMCU
>    drivers/perf: hisi: Add driver support for HiSilicon PMCU
>    perf tool: Add HiSilicon PMCU data recording support
>    perf tool: Add HiSilicon PMCU data decoding support
>
>   Documentation/admin-guide/perf/hisi-pmcu.rst |  183 +++
>   Documentation/admin-guide/perf/index.rst     |    1 +
>   drivers/perf/hisilicon/Kconfig               |   15 +
>   drivers/perf/hisilicon/Makefile              |    1 +
>   drivers/perf/hisilicon/hisi_pmcu.c           | 1096 ++++++++++++++++++
>   tools/perf/arch/arm/util/auxtrace.c          |   61 +
>   tools/perf/arch/arm64/util/Build             |    2 +-
>   tools/perf/arch/arm64/util/hisi-pmcu.c       |  145 +++
>   tools/perf/util/Build                        |    1 +
>   tools/perf/util/auxtrace.c                   |    4 +
>   tools/perf/util/auxtrace.h                   |    1 +
>   tools/perf/util/hisi-pmcu.c                  |  305 +++++
>   tools/perf/util/hisi-pmcu.h                  |   19 +
>   13 files changed, 1833 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>   create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
>   create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
>   create mode 100644 tools/perf/util/hisi-pmcu.c
>   create mode 100644 tools/perf/util/hisi-pmcu.h
>
>
> base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit
@ 2023-02-27  8:49   ` Jie Zhan
  0 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-02-27  8:49 UTC (permalink / raw)
  To: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	jonathan.cameron
  Cc: zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

Could anyone please take a look at this PMCU patchset and provide some
comments?

It is closely related to the ARM PMU.

We look forward to your feedback.

Any relevant comments or questions on the software or hardware design,
use cases, or coding are welcome.

Kind regards,

Jie


On 06/02/2023 14:51, Jie Zhan wrote:
> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> PMU accesses from CPUs, handling the configuration, event switching, and
> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> scheme may lose events or drop sampling frequency. With PMCU, users can
> reliably obtain the data of up to 240 PMU events with the sample interval
> of events down to 1ms, while the software overhead of accessing PMUs, as
> well as its impact on target workloads, is reduced.
>
> This patchset contains the documentation, driver, and user perf tool
> support to enable using PMCU with the 'perf_event' framework.
>
> Here are two key questions requested for comments:
>
> - How do we make it compatible with arm_pmu drivers?
>
>    Hardware-wise, PMCU uses the existing core PMUs, so PMUs can be accessed
>    from CPU and PMCU simultaneously. The current hardware can't guarantee
>    mutual exclusive accesses. Hence, scheduling arm_pmu and PMCU events at
>    the same time may mess up the operation of PMUs, delivering incorrect
>    data for both events, e.g. unexpected events or sample periods.
>    Software-wise, we probably need to prevent the two types of events from
>    running at the same time, but currently there isn't a clear solution.
>
> - Currently we rely on a sysfs file for users to input event numbers. Is
>    there a better way to pass many events?
>
>    The perf framework only allows three 64-bit config fields for custom PMU
>    configs. Obviously, this can't satisfy our need for passing many events
>    at a time. As an event number is 16-bit wide, the config fields can only
>    take up to 12 events at a time, or up to 192 events even if we do a
>    bitmap of events (and there are more than 192 available event numbers).
>    Hence, the current design takes an array of event numbers from a sysfs
>    file before starting profiling. However, this may go against the common
>    way to schedule perf events through perf commands.
>
> Jie Zhan (4):
>    docs: perf: Add documentation for HiSilicon PMCU
>    drivers/perf: hisi: Add driver support for HiSilicon PMCU
>    perf tool: Add HiSilicon PMCU data recording support
>    perf tool: Add HiSilicon PMCU data decoding support
>
>   Documentation/admin-guide/perf/hisi-pmcu.rst |  183 +++
>   Documentation/admin-guide/perf/index.rst     |    1 +
>   drivers/perf/hisilicon/Kconfig               |   15 +
>   drivers/perf/hisilicon/Makefile              |    1 +
>   drivers/perf/hisilicon/hisi_pmcu.c           | 1096 ++++++++++++++++++
>   tools/perf/arch/arm/util/auxtrace.c          |   61 +
>   tools/perf/arch/arm64/util/Build             |    2 +-
>   tools/perf/arch/arm64/util/hisi-pmcu.c       |  145 +++
>   tools/perf/util/Build                        |    1 +
>   tools/perf/util/auxtrace.c                   |    4 +
>   tools/perf/util/auxtrace.h                   |    1 +
>   tools/perf/util/hisi-pmcu.c                  |  305 +++++
>   tools/perf/util/hisi-pmcu.h                  |   19 +
>   13 files changed, 1833 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>   create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
>   create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
>   create mode 100644 tools/perf/util/hisi-pmcu.c
>   create mode 100644 tools/perf/util/hisi-pmcu.h
>
>
> base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit
  2023-02-27  8:49   ` Jie Zhan
@ 2023-03-17 13:11     ` Jonathan Cameron
  -1 siblings, 0 replies; 32+ messages in thread
From: Jonathan Cameron @ 2023-03-17 13:11 UTC (permalink / raw)
  To: Jie Zhan
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users, Rob Herring

On Mon, 27 Feb 2023 16:49:46 +0800
Jie Zhan <zhanjie9@hisilicon.com> wrote:

> Please can anyone have a look at this PMCU patchset and provide some 
> comments?
> 
> It is much related to the ARM PMU.
> 
> We are looking forward to the feedback.
> 
> Any relevant comments/questions, with respect to software or hardware 
> design, use cases, coding, are welcome.
> 
> Kind regards,
> 
> Jie
> 
> 
> On 06/02/2023 14:51, Jie Zhan wrote:
> > HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> > PMU accesses from CPUs, handling the configuration, event switching, and
> > counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> > and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> > scheme may lose events or drop sampling frequency. With PMCU, users can
> > reliably obtain the data of up to 240 PMU events with the sample interval
> > of events down to 1ms, while the software overhead of accessing PMUs, as
> > well as its impact on target workloads, is reduced.
> >
> > This patchset contains the documentation, driver, and user perf tool
> > support to enable using PMCU with the 'perf_event' framework.
> >
> > Here are two key questions requested for comments:
> >
> > - How do we make it compatible with arm_pmu drivers?
> >
> >    Hardware-wise, PMCU uses the existing core PMUs, so PMUs can be accessed
> >    from CPU and PMCU simultaneously. The current hardware can't guarantee
> >    mutual exclusive accesses. Hence, scheduling arm_pmu and PMCU events at
> >    the same time may mess up the operation of PMUs, delivering incorrect
> >    data for both events, e.g. unexpected events or sample periods.
> >    Software-wise, we probably need to prevent the two types of events from
> >    running at the same time, but currently there isn't a clear solution.

I've been thinking about this a bit and don't have a good answer yet.

So some thoughts that might get some discussion going (some are here
mostly to be shot down ;)

1. I suspect adding a hook into the specific pmu driver to reserve a counter is going
   to be controversial for this use case.  But maybe there is a more generic
   way...  There are lockup detectors that use PMU counters and ensure the counters
   aren't also used for other purposes, and that leads me to wonder if you can use
   https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L12700
   perf_event_create_kernel_counter()
   to do the same as opening a counter from userspace but then not use it.
   I have no idea if this will work though, or if enabling the event would be
   necessary to prevent it being used elsewhere.

2. It might be possible to reuse any of the infrastructure that exists
   for userspace PMU counter access or maybe Rob Herring (+CC) has a suggestion based on
   his work on that feature.

3. It's not nice, but maybe we could enforce this constraint just in userspace?
   We'd have to make sure that both drivers didn't do anything beyond not working
   correctly if the other driver is messing with the hardware.

4. We can't do the nasty trick of providing a second driver that binds to the
   PMU hardware to prevent it being used because I think the main arm PMU
   driver has suppress_bind_attrs = true.  Maybe we can make remove work?
   (original patch for this in 2018 added that line because of a crash on remove
    - not sure anyone looked at fixing the crash).

> >
> > - Currently we rely on a sysfs file for users to input event numbers. Is
> >    there a better way to pass many events?
> >
> >    The perf framework only allows three 64-bit config fields for custom PMU
> >    configs. Obviously, this can't satisfy our need for passing many events
> >    at a time. As an event number is 16-bit wide, the config fields can only
> >    take up to 12 events at a time, or up to 192 events even if we do a
> >    bitmap of events (and there are more than 192 available event numbers).
> >    Hence, the current design takes an array of event numbers from a sysfs
> >    file before starting profiling. However, this may go against the common
> >    way to schedule perf events through perf commands.
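
The capacity figures quoted above can be checked with a few lines of
arithmetic. This is only an illustrative sketch: the constants come from
the cover letter, the helper names are made up for this example, and
nothing here is part of the driver or the perf tool.

```python
# Capacity of the perf config fields for carrying PMCU event numbers.
# Constants are the figures from the cover letter; the helper names are
# hypothetical and exist only for this illustration.

CONFIG_FIELDS = 3      # the three 64-bit config fields (config, config1, config2)
FIELD_BITS = 64        # width of each config field
EVENT_ID_BITS = 16     # width of one event number
PMCU_MAX_EVENTS = 240  # events PMCU accepts per run


def packed_capacity(fields=CONFIG_FIELDS, field_bits=FIELD_BITS,
                    id_bits=EVENT_ID_BITS):
    """Events that fit when each event number is stored verbatim."""
    return fields * (field_bits // id_bits)


def bitmap_capacity(fields=CONFIG_FIELDS, field_bits=FIELD_BITS):
    """Distinct event numbers addressable with one bit per event."""
    return fields * field_bits


def shortfall(capacity, wanted=PMCU_MAX_EVENTS):
    """How many of the wanted events cannot be expressed."""
    return max(0, wanted - capacity)
```

With the defaults, the packed form holds 12 events and the bitmap form
covers 192 event numbers, so either way the config fields fall short of
the 240 events PMCU can take, which is the motivation given for the sysfs
file.
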
> >
> > Jie Zhan (4):
> >    docs: perf: Add documentation for HiSilicon PMCU
> >    drivers/perf: hisi: Add driver support for HiSilicon PMCU
> >    perf tool: Add HiSilicon PMCU data recording support
> >    perf tool: Add HiSilicon PMCU data decoding support
> >
> >   Documentation/admin-guide/perf/hisi-pmcu.rst |  183 +++
> >   Documentation/admin-guide/perf/index.rst     |    1 +
> >   drivers/perf/hisilicon/Kconfig               |   15 +
> >   drivers/perf/hisilicon/Makefile              |    1 +
> >   drivers/perf/hisilicon/hisi_pmcu.c           | 1096 ++++++++++++++++++
> >   tools/perf/arch/arm/util/auxtrace.c          |   61 +
> >   tools/perf/arch/arm64/util/Build             |    2 +-
> >   tools/perf/arch/arm64/util/hisi-pmcu.c       |  145 +++
> >   tools/perf/util/Build                        |    1 +
> >   tools/perf/util/auxtrace.c                   |    4 +
> >   tools/perf/util/auxtrace.h                   |    1 +
> >   tools/perf/util/hisi-pmcu.c                  |  305 +++++
> >   tools/perf/util/hisi-pmcu.h                  |   19 +
> >   13 files changed, 1833 insertions(+), 1 deletion(-)
> >   create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
> >   create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
> >   create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
> >   create mode 100644 tools/perf/util/hisi-pmcu.c
> >   create mode 100644 tools/perf/util/hisi-pmcu.h
> >
> >
> > base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476  


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
  2023-02-06  6:51   ` Jie Zhan
@ 2023-03-17 13:37     ` Jonathan Cameron
  -1 siblings, 0 replies; 32+ messages in thread
From: Jonathan Cameron @ 2023-03-17 13:37 UTC (permalink / raw)
  To: Jie Zhan
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

On Mon, 6 Feb 2023 14:51:43 +0800
Jie Zhan <zhanjie9@hisilicon.com> wrote:

> Document the overview and usage of HiSilicon PMCU.
> 
> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> PMU accesses from CPUs, handling the configuration, event switching, and
> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> scheme may lose events or drop sampling frequency. With PMCU, users can
> reliably obtain the data of up to 240 PMU events with the sample interval
> of events down to 1ms, while the software overhead of accessing PMUs, as
> well as its impact on target workloads, is reduced.
> 
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>

Nice documentation. I've read this a few times before, but on this
read-through I wondered if we could say anything about the skew between capture
of the counters.  Not that important though so I'm happy to add

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

though this may of course need updating significantly as the interface
is refined (the RFC question you raised for example in the cover letter).

Thanks

Jonathan

> ---
>  Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>  Documentation/admin-guide/perf/index.rst     |   1 +
>  2 files changed, 184 insertions(+)
>  create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
> 
> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
> new file mode 100644
> index 000000000000..50d17cbd0049
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
> @@ -0,0 +1,183 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================================
> +HiSilicon Performance Monitor Control Unit
> +==========================================
> +
> +Introduction
> +============
> +
> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> +PMU accesses from CPUs, handling the configuration, event switching, and
> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
> +scheme may lose events or drop sampling frequency. With PMCU, users can
> +reliably obtain the data of up to 240 PMU events with the sample interval
> +of events down to 1ms, while the software overhead of accessing PMUs, as
> +well as its impact on target workloads, is reduced.
> +
> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
> +CPU die ID. When triggered, PMCU reads event IDs and passes them to PMUs in all
> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
> +events, waits for a time interval, and stops them. The PMU counter readings are
> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
> +the ``perf.data`` file in the user space. PMCU automatically switches events
> +(when there are more events than available PMU counters) and completes multiple
> +rounds of PMU event counting in one trigger.
> +
> +Hardware overview
> +=================
> +
> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
> +assistant to access the core PMUs on its die and move the counter readings to
> +memory. An overview of PMCU's hardware organization is shown below::
> +
> +                                +--------------------+
> +                                |       Memory       |
> +                                | +------+ +-------+ |
> +                   +--------+   | |Events| |Samples| |
> +                   |  PMCU  |   | +------+ +-------+ |
> +                   +---|----+   +---------|----------+
> +                       |                  |
> +        =======================================================  Bus
> +                   |                         |               |
> +        +----------|----------+   +----------|----------+    |
> +        | +------+ | +------+ |   | +------+ | +------+ |    |
> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
> +        |    +-----+----+     |   |    +-----+----+     |  clusters
> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
> +        | +------+   +------+ |   | +------+   +------+ |
> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
> +        +---------------------+   +---------------------+
> +
> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
> +events for a while, and move the counter readings back to memory.
> +
> +Once triggered, PMCU performs a number of loops and processes a number of
> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
> +stops all the counters, and moves the counter readings to memory, before
> +handling the next ``nr_pmu`` events if there are more events to process in this
> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
> +the number of events to process depends on user inputs. The counters are
> +stopped when PMCU reads counters and switches events, so there is a tiny time
> +window during which the events are not counted.

I'm not clear from this description whether there is 'skew' between the counters
(beyond the normal issues from uarch).  Does the PMCU stop all counters
then read them all (minimizing skew) or does it stop each CPU's set of counters
and read those, or stop each individual counter before reading?

My impression is that this feature is meant to be left running over timescales
much longer than the sampling period so it may not be necessary to align the
different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.
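
For reference, the loop described in the quoted text can be modelled
roughly as below. This is a sketch of the documented sequence only, under
the assumption that every round follows program/start/wait/stop/dump in
that order; the function names and step labels are hypothetical, and the
model says nothing about how the hardware orders the stops and reads
across CPUs, which is the open question above.

```python
def pmcu_rounds(event_ids, nr_pmu):
    """Split the user's event list into the groups handled per round.

    Each group holds at most nr_pmu events, one per PMU counter in use.
    """
    if nr_pmu <= 0:
        raise ValueError("nr_pmu must be positive")
    return [event_ids[i:i + nr_pmu] for i in range(0, len(event_ids), nr_pmu)]


def pmcu_trace(event_ids, nr_pmu, nr_loops):
    """Return the ordered steps of one trigger as (loop, step, events)."""
    steps = []
    for loop in range(nr_loops):
        for group in pmcu_rounds(event_ids, nr_pmu):
            steps.append((loop, "program", tuple(group)))  # pass events to PMUs
            steps.append((loop, "start", None))            # start all counters
            steps.append((loop, "wait", None))             # count for one period
            steps.append((loop, "stop", None))             # counters idle from here
            steps.append((loop, "dump", None))             # readings moved to memory
    return steps
```

In this model the events are not counted between a "stop" and the next
"start", which corresponds to the small uncounted window mentioned at the
end of the quoted paragraph.
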

> +
> +Usage
> +=====
> +
> +The PMCU driver is designed to operate with the kernel perf_event framework,
> +specifically with perf AUX trace buffer to dump sample data faster. User space
> +usage of PMCU is supported through the 'perf' tool and root access is required.
> +
> +Steps:
> +
> +1. Write PMU event IDs to PMCU's ``sysfs`` event interface. The event IDs should
> +   be hexadecimal and separated by whitespaces.
> +
> +   An example command can be::
> +
> +        echo "0x10 0x11" > /sys/devices/hisi_pmcu_sccl3/user_events
> +
> +   Alternatively, users can directly write the ``user_events`` file with a text
> +   editor.
> +
> +   Please note that:
> +
> +   - As PMCU passes event IDs to core PMUs, any event IDs supported by the core
> +     PMU are acceptable.
> +   - Users can enter up to 240 events; any events beyond that are ignored.
> +   - The event IDs remain unchanged until the next update of the file, such that
> +     users do not have to enter the event IDs every time before issuing a
> +     ``perf-record`` command for the same events.
> +
> +2. Profiling with ``perf-record``.
> +
> +   The command to start the sampling is::
> +
> +        perf record -e hisi_pmcu_sccl3/<configs>/
> +
> +   Users can pass the following optional parameters to ``<configs>``:
> +
> +   - nr_sample: number of samples to take. This defaults to 128.
> +   - sample_period_ms: time interval in milliseconds for PMU counters to keep
> +     counting for each event. This defaults to 3, i.e. 3ms, and its max
> +     value is 85,899, i.e. 85 seconds.
> +   - pmccfiltr: bits 31-24 of the sysreg PMCCFILTR_EL0, which controls how the
> +     cycle counter increments. This defaults to 0x00. Please refer to the
> +     "Performance Monitors external register descriptions" of *Arm Architecture
> +     Reference Manual for A-profile architecture* on how to configure
> +     PMCCFILTR_EL0.
> +
> +   An example command can be::
> +
> +        perf record -e hisi_pmcu_sccl3/nr_sample=1000,sample_period_ms=1000/
> +
> +3. Obtain the sample data
> +
> +   When the ``perf-record`` command finishes, data will be stored in the AUX
> +   area of ``perf.data``. The data can be viewed with ``perf-report`` or
> +   ``perf-script`` with the ``-D`` dump trace option, e.g.::
> +
> +        perf report -D
> +
> +   Users may search the keyword ``HISI PMCU`` to navigate to the PMCU data
> +   section.
> +
> +   PMCU samples are arranged in the following format::
> +
> +        +------------+  +- +--------+  +- +-----------+  +- +------------+
> +        |AUX buffer 0|->|  |Sample 1|->|  |Subsample 1|->|  |CID1SR      |--+
> +        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
> +        |AUX buffer 1|  |  |Sample 2|  |  |Subsample 2|  |  |CID2SR      |  |
> +        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
> +        |...         |  |  |...     |  |  |...        |  |  |Event 0     |  |
> +        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
> +                        |  |  Gap   |  |  |Subsample N|  |  |Event 1     |  |
> +                        +- +--------+  +- +-----------+  |  +------------+  |
> +                                                         |  |...         |  |
> +                                                         |  +------------+  |
> +                                                         |  |Event nr_pmu|  |
> +                                                         |  +------------+  |
> +                                                         |  |Cycle count |  |
> +                                                         +- +------------+  |
> +        +-------------------------------------------------------------------+
> +        |  +- +------------------+  +- +---------+
> +        +->|  |CPU 0 in a cluster|->|  |Cluster 0|
> +           |  +------------------+  |  +---------+
> +           |  |CPU 1 in a cluster|  |  |Cluster 1|
> +           |  +------------------+  |  +---------+
> +           |  |CPU 2 in a cluster|  |  |Cluster 2|
> +           |  +------------------+  |  +---------+
> +           |  |...               |  |  |...      |
> +           +- +------------------+  +- +---------+
> +
> +   The data may contain one or more AUX buffers. An AUX buffer contains many
> +   samples, and may leave a gap at the buffer tail where there is no
> +   space for a complete sample. The number of samples in all AUX buffers sums
> +   up to the 'nr_sample' parameter passed from the 'perf-record' command.
> +
> +   A sample contains the events entered in the ``user_events`` sysfs file. A
> +   sample may consist of multiple subsamples if the number of events is more
> +   than the number of PMU counters used, i.e. ``nr_pmu``. The number of
> +   subsamples in a sample, ``N``, equals the number of events divided by
> +   ``nr_pmu``, rounded up.
> +
> +   A subsample consists of data fields of CID1SR, CID2SR, ``nr_pmu`` event
> +   counter readings, and a cycle counter reading. CID1SR and CID2SR are copies
> +   of PMCID1SR and PMCID2SR on capture of the event counters, which reflect
> +   the process ID, provided that the kernel configuration option
> +   ``CONFIG_PID_IN_CONTEXTIDR`` is enabled. The size of CID1SR or CID2SR is 4
> +   bytes, whereas the size of an event or cycle count is 8 bytes. A data field
> +   has the data from all CPUs. The order of CPUs in a data field is 'CPU ID in
> +   a cluster' -> 'cluster ID'. For example, a CPU die with 32 CPUs in 4
> +   clusters (8 CPUs per cluster) has the data field ordered in::
> +
> +       CPU [0,8,16,24],[1,9,17,25],[2,10,18,26],...,[7,15,23,31]
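
The sample layout described above can be sketched as follows, which may
help when writing a decoder or checking the documentation. The field sizes
and CPU ordering are taken from the text; the function names are
hypothetical, CPU numbers are assumed to be contiguous within a cluster
(as in the 32-CPU example), and the size calculation assumes the fields
are packed back to back with no padding, which the documentation does not
state.

```python
CID_BYTES = 4    # size of one CID1SR or CID2SR entry
COUNT_BYTES = 8  # size of one event or cycle counter reading


def subsamples_per_sample(nr_events, nr_pmu):
    """N: number of events divided by nr_pmu, rounded up."""
    return (nr_events + nr_pmu - 1) // nr_pmu


def cpu_order(nr_clusters, cpus_per_cluster):
    """Order of CPUs inside one data field.

    The CPU ID within a cluster varies slowest and the cluster ID fastest,
    assuming CPU numbers are contiguous within each cluster.
    """
    return [cluster * cpus_per_cluster + cpu
            for cpu in range(cpus_per_cluster)
            for cluster in range(nr_clusters)]


def subsample_bytes(nr_cpus, nr_pmu):
    """Size of one subsample, assuming no padding between data fields."""
    per_cpu = 2 * CID_BYTES + (nr_pmu + 1) * COUNT_BYTES  # CIDs, events, cycles
    return nr_cpus * per_cpu
```

For the documented example, `cpu_order(4, 8)` starts with 0, 8, 16, 24,
1, 9, 17, 25 and ends with 7, 15, 23, 31. The `nr_pmu` value is chosen by
the driver, so any number passed to `subsample_bytes()` here is only a
placeholder.
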
> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index 793e1970bc05..f132838145f9 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -8,6 +8,7 @@ Performance monitor support
>     :maxdepth: 1
>  
>     hisi-pmu
> +   hisi-pmcu
>     hisi-pcie-pmu
>     hns3-pmu
>     imx-ddr


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
@ 2023-03-17 13:37     ` Jonathan Cameron
  0 siblings, 0 replies; 32+ messages in thread
From: Jonathan Cameron @ 2023-03-17 13:37 UTC (permalink / raw)
  To: Jie Zhan
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

On Mon, 6 Feb 2023 14:51:43 +0800
Jie Zhan <zhanjie9@hisilicon.com> wrote:

> Document the overview and usage of HiSilicon PMCU.
> 
> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> PMU accesses from CPUs, handling the configuration, event switching, and
> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> scheme may lose events or drop sampling frequency. With PMCU, users can
> reliably obtain the data of up to 240 PMU events with the sample interval
> of events down to 1ms, while the software overhead of accessing PMUs, as
> well as its impact on target workloads, is reduced.
> 
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>

Nice documentation. I've read this a few times before, but on this read
through wondered if we could say anything about the skew between capture
of the counters.  Not that important though so I'm happy to add

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

though this may of course need updating significantly as the interface
is refined (the RFC question you raised for example in the cover letter).

Thanks

Jonathan

> ---
>  Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>  Documentation/admin-guide/perf/index.rst     |   1 +
>  2 files changed, 184 insertions(+)
>  create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
> 
> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
> new file mode 100644
> index 000000000000..50d17cbd0049
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
> @@ -0,0 +1,183 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================================
> +HiSilicon Performance Monitor Control Unit
> +==========================================
> +
> +Introduction
> +============
> +
> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> +PMU accesses from CPUs, handling the configuration, event switching, and
> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
> +scheme may lose events or drop sampling frequency. With PMCU, users can
> +reliably obtain the data of up to 240 PMU events with the sample interval
> +of events down to 1ms, while the software overhead of accessing PMUs, as
> +well as its impact on target workloads, is reduced.
> +
> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
> +CPU die ID. When triggered, PMCU reads event IDs and passes them to PMUs in all
> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
> +events, waits for a time interval, and stops them. The PMU counter readings are
> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
> +the ``perf.data`` file in the user space. PMCU automatically switches events
> +(when there are more events than available PMU counters) and completes multiple
> +rounds of PMU event counting in one trigger.
> +
> +Hardware overview
> +=================
> +
> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
> +assistant to access the core PMUs on its die and move the counter readings to
> +memory. An overview of PMCU's hardware organization is shown below::
> +
> +                                +--------------------+
> +                                |       Memory       |
> +                                | +------+ +-------+ |
> +                   +--------+   | |Events| |Samples| |
> +                   |  PMCU  |   | +------+ +-------+ |
> +                   +---|----+   +---------|----------+
> +                       |                  |
> +        =======================================================  Bus
> +                   |                         |               |
> +        +----------|----------+   +----------|----------+    |
> +        | +------+ | +------+ |   | +------+ | +------+ |    |
> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
> +        |    +-----+----+     |   |    +-----+----+     |  clusters
> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
> +        | +------+   +------+ |   | +------+   +------+ |
> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
> +        +---------------------+   +---------------------+
> +
> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
> +events for a while, and move the counter readings back to memory.
> +
> +Once triggered, PMCU performs a number of loops and processes a number of
> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
> +stops all the counters, and moves the counter readings to memory, before
> +handling the next ``nr_pmu`` events if there are more events to process in this
> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
> +the number of events to process depends on user inputs. The counters are
> +stopped when PMCU reads counters and switches events, so there is a tiny time
> +window during which the events are not counted.

I'm not clear from this description whether there is 'skew' between the counters
(beyond the normal issues from uarch).  Does the PMCU stop all counters
then read them all (minimizing skew) or does it stop each CPU's set of counters
and read those, or stop each individual counter before reading?

My impression is that this feature is meant to be left running over timescales
much longer than the sampling period so it may not be necessary to align the
different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.

> +
> +Usage
> +=====
> +
> +The PMCU driver is designed to operate with the kernel perf_event framework,
> +specifically with perf AUX trace buffer to dump sample data faster. User space
> +usage of PMCU is supported through the 'perf' tool and root access is required.
> +
> +Steps:
> +
> +1. Write PMU event IDs to PMCU's ``sysfs`` event interface. The event IDs should
> +   be hexadecimal and separated by whitespace.
> +
> +   An example command can be::
> +
> +        echo "0x10 0x11" > /sys/devices/hisi_pmcu_sccl3/user_events
> +
> +   Alternatively, users can directly write the ``user_events`` file with a text
> +   editor.
> +
> +   Please note that:
> +
> +   - As PMCU passes event IDs to core PMUs, any event IDs supported by the core
> +     PMU are acceptable.
> +   - Users can enter up to 240 events; any events beyond that are ignored.
> +   - The event IDs remain unchanged until the next update of the file, such that
> +     users do not have to enter the event IDs every time before issuing a
> +     ``perf-record`` command for the same events.
> +
> +2. Profiling with ``perf-record``.
> +
> +   The command to start the sampling is::
> +
> +        perf record -e hisi_pmcu_sccl3/<configs>/
> +
> +   Users can pass the following optional parameters to ``<configs>``:
> +
> +   - nr_sample: number of samples to take. This defaults to 128.
> +   - sample_period_ms: time interval in milliseconds for PMU counters to keep
> +     counting for each event. This defaults to 3, i.e. 3ms, and its max
> +     value is 85,899, i.e. 85 seconds.
> +   - pmccfiltr: bits 31-24 of the sysreg PMCCFILTR_EL0, which controls how the
> +     cycle counter increments. This defaults to 0x00. Please refer to the
> +     "Performance Monitors external register descriptions" of *Arm Architecture
> +     Reference Manual for A-profile architecture* on how to configure
> +     PMCCFILTR_EL0.
> +
> +   An example command can be::
> +
> +        perf record -e hisi_pmcu_sccl3/nr_sample=1000,sample_period_ms=1000/
> +
> +3. Obtain the sample data
> +
> +   When the ``perf-record`` command finishes, data will be stored in the AUX
> +   area of ``perf.data``. The data can be viewed with ``perf-report`` or
> +   ``perf-script`` with the ``-D`` dump trace option, e.g.::
> +
> +        perf report -D
> +
> +   Users may search the keyword ``HISI PMCU`` to navigate to the PMCU data
> +   section.
> +
> +   PMCU samples are arranged in the following format::
> +
> +        +------------+  +- +--------+  +- +-----------+  +- +------------+
> +        |AUX buffer 0|->|  |Sample 1|->|  |Subsample 1|->|  |CID1SR      |--+
> +        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
> +        |AUX buffer 1|  |  |Sample 2|  |  |Subsample 2|  |  |CID2SR      |  |
> +        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
> +        |...         |  |  |...     |  |  |...        |  |  |Event 0     |  |
> +        +------------+  |  +--------+  |  +-----------+  |  +------------+  |
> +                        |  |  Gap   |  |  |Subsample N|  |  |Event 1     |  |
> +                        +- +--------+  +- +-----------+  |  +------------+  |
> +                                                         |  |...         |  |
> +                                                         |  +------------+  |
> +                                                         |  |Event nr_pmu|  |
> +                                                         |  +------------+  |
> +                                                         |  |Cycle count |  |
> +                                                         +- +------------+  |
> +        +-------------------------------------------------------------------+
> +        |  +- +------------------+  +- +---------+
> +        +->|  |CPU 0 in a cluster|->|  |Cluster 0|
> +           |  +------------------+  |  +---------+
> +           |  |CPU 1 in a cluster|  |  |Cluster 1|
> +           |  +------------------+  |  +---------+
> +           |  |CPU 2 in a cluster|  |  |Cluster 2|
> +           |  +------------------+  |  +---------+
> +           |  |...               |  |  |...      |
> +           +- +------------------+  +- +---------+
> +
> +   The data may contain one or more AUX buffers. An AUX buffer contains many
> +   samples, and may leave a gap at the buffer tail where there is no
> +   space for a complete sample. The number of samples in all AUX buffers sums
> +   up to the 'nr_sample' parameter passed from the 'perf-record' command.
> +
> +   A sample contains the events entered in the ``user_events`` sysfs file. A
> +   sample may consist of multiple subsamples if the number of events is more
> +   than the number of PMU counters used, i.e. ``nr_pmu``. The number of
> +   subsamples in a sample, ``N``, equals the number of events divided by
> +   ``nr_pmu``, rounded up.
> +
> +   A subsample consists of data fields of CID1SR, CID2SR, ``nr_pmu`` event
> +   counter readings, and a cycle counter reading. CID1SR and CID2SR are a copy
> +   of PMCID1SR and PMCID2SR on capture of the event counters, which reflects
> +   the process ID, provided that the kernel compiling configuration
> +   ``CONFIG_PID_IN_CONTEXTIDR`` is enabled. The size of CID1SR or CID2SR is 4
> +   bytes, whereas the size of an event or cycle count is 8 bytes. A data field
> +   has the data from all CPUs. The order of CPUs in a data field is 'CPU ID in
> +   a cluster' -> 'cluster ID'. For example, a CPU die with 32 CPUs in 4
> +   clusters (8 CPUs per cluster) has the data field ordered in::
> +
> +       CPU [0,8,16,24],[1,9,17,25],[2,10,18,26],...,[7,15,23,31]
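
To check the layout arithmetic described above, here is a rough userspace sketch (not driver code). The helper names are made up for illustration, and it assumes there is no padding beyond the fields the text lists and that CPUs are numbered contiguously within a cluster, as in the 32-CPU example.

```c
#include <stddef.h>
#include <stdint.h>

/* Subsamples per sample: number of events divided by nr_pmu, rounded up. */
static uint32_t pmcu_nr_subsamples(uint32_t nr_ev, uint32_t nr_pmu)
{
	return (nr_ev + nr_pmu - 1) / nr_pmu;
}

/*
 * Bytes in one subsample. Each data field holds one value per CPU:
 * CID1SR and CID2SR are 4 bytes each, every event count and the cycle
 * count are 8 bytes each.
 */
static size_t pmcu_subsample_bytes(uint32_t nr_cpu, uint32_t nr_pmu)
{
	return (size_t)nr_cpu * (4 + 4 + 8 * (size_t)nr_pmu + 8);
}

/*
 * Slot of a CPU inside one data field, ordered by 'CPU ID in a cluster'
 * first and 'cluster ID' second.
 */
static uint32_t pmcu_cpu_slot(uint32_t cpu, uint32_t cpus_per_cluster,
			      uint32_t nr_cluster)
{
	uint32_t cluster = cpu / cpus_per_cluster;
	uint32_t idx = cpu % cpus_per_cluster;

	return idx * nr_cluster + cluster;
}
```

With 4 clusters of 8 CPUs, CPU 8 lands in slot 1 and CPU 1 in slot 4, which matches the ordering shown in the example.
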
> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index 793e1970bc05..f132838145f9 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -8,6 +8,7 @@ Performance monitor support
>     :maxdepth: 1
>  
>     hisi-pmu
> +   hisi-pmcu
>     hisi-pcie-pmu
>     hns3-pmu
>     imx-ddr



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 2/4] drivers/perf: hisi: Add driver support for HiSilicon PMCU
  2023-02-06  6:51   ` Jie Zhan
@ 2023-03-17 14:52     ` Jonathan Cameron
  -1 siblings, 0 replies; 32+ messages in thread
From: Jonathan Cameron @ 2023-03-17 14:52 UTC (permalink / raw)
  To: Jie Zhan
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

On Mon, 6 Feb 2023 14:51:44 +0800
Jie Zhan <zhanjie9@hisilicon.com> wrote:

> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> PMU accesses from CPUs, handling the configuration, event switching, and
> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> scheme may lose events or drop sampling frequency. With PMCU, users can
> reliably obtain the data of up to 240 PMU events with the sample interval
> of events down to 1ms, while the software overhead of accessing PMUs, as
> well as its impact on target workloads, is reduced.
> 
> This driver enables the usage of PMCU through the perf_event framework.
> PMCU is registered as a PMU device and utilises the AUX buffer to dump data
> directly. Users can start PMCU sampling through 'perf-record'. Event
> numbers are passed by a sysfs interface.
> 
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>

Hi Jie,

A few minor comments inline.
Whilst I looked at this internally, that was a while back so I've
found a few new things to point out in what I think is a pretty good/clean driver.
The main thing here is the RFC questions you've raised in the cover letter
of course - particularly the one around mediating who has the counters between
this and the normal PMU driver.

Thanks,

Jonathan

> ---
>  drivers/perf/hisilicon/Kconfig     |   15 +
>  drivers/perf/hisilicon/Makefile    |    1 +
>  drivers/perf/hisilicon/hisi_pmcu.c | 1096 ++++++++++++++++++++++++++++
>  3 files changed, 1112 insertions(+)
>  create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
> 
> diff --git a/drivers/perf/hisilicon/Kconfig b/drivers/perf/hisilicon/Kconfig
> index 171bfc1b6bc2..d7728fbe8519 100644
> --- a/drivers/perf/hisilicon/Kconfig
> +++ b/drivers/perf/hisilicon/Kconfig
> @@ -24,3 +24,18 @@ config HNS3_PMU
>  	  devices.
>  	  Adds the HNS3 PMU into perf events system for monitoring latency,
>  	  bandwidth etc.
> +
> +config HISI_PMCU
> +	tristate "HiSilicon PMCU"
> +	depends on ARM64 && PID_IN_CONTEXTIDR
> +	help
> +	  Support for HiSilicon Performance Monitor Control Unit (PMCU).
> +	  HiSilicon Performance Monitor Control Unit (PMCU) is a device that
> +	  offloads PMU accesses from CPUs, handling the configuration, event
> +	  switching, and counter reading of core PMUs on Kunpeng SoC. It
> +	  facilitates fine-grained and multi-PMU-event CPU profiling, in which
> +	  scenario the current 'perf' scheme may lose events or drop sampling
> +	  frequency. With PMCU, users can reliably obtain the data of up to 240
> +	  PMU events with the sample interval of events down to 1ms, while the
> +	  software overhead of accessing PMUs, as well as its impact on target
> +	  workloads, is reduced.
> diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
> index 4d2c9abe3372..93e4e6f2816a 100644
> --- a/drivers/perf/hisilicon/Makefile
> +++ b/drivers/perf/hisilicon/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o \
>  
>  obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
>  obj-$(CONFIG_HNS3_PMU) += hns3_pmu.o
> +obj-$(CONFIG_HISI_PMCU) += hisi_pmcu.o
> diff --git a/drivers/perf/hisilicon/hisi_pmcu.c b/drivers/perf/hisilicon/hisi_pmcu.c
> new file mode 100644
> index 000000000000..6ec5d6c31e1f
> --- /dev/null
> +++ b/drivers/perf/hisilicon/hisi_pmcu.c
> @@ -0,0 +1,1096 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * HiSilicon Performance Monitor Control Unit (PMCU) driver
> + *
> + * Copyright (C) 2022 HiSilicon Limited
> + * Author: Jie Zhan <zhanjie9@hisilicon.com>
> + */
> +
> +#include <linux/acpi.h>

Not seeing this used. Probably want mod_devicetable.h that
includes the struct acpi_device_id definition.

> +#include <linux/bitfield.h>
> +#include <linux/bits.h>
> +#include <linux/cpumask.h>
> +#include <linux/delay.h>
> +#include <linux/dev_printk.h>
> +#include <linux/device.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/errno.h>
> +#include <linux/gfp_types.h>

It is very very rare for this to be included directly.
Normally just rely on indirect inclusion from slab.h or similar.
I would drop this one.

> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/mm_types.h>
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/printk.h>

Also needs linux/property.h (for the device_property_read_u32() calls below).

> +#include <linux/slab.h>
> +#include <linux/smp.h>
> +#include <linux/threads.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/cputype.h>
> +#include <asm/sysreg.h>
> +
> +/* Registers */
> +#define HISI_PMCU_REG_FSM_STATUS	0x0000
> +#define HISI_PMCU_REG_FSM_CFG		0x0004
> +#define HISI_PMCU_REG_EVENT_BASE_H	0x0008
> +#define HISI_PMCU_REG_EVENT_BASE_L	0x000C
> +#define HISI_PMCU_REG_KILL_BASE_H	0x0010
> +#define HISI_PMCU_REG_KILL_BASE_L	0x0014
> +#define HISI_PMCU_REG_STORE_BASE_H	0x0018
> +#define HISI_PMCU_REG_STORE_BASE_L	0x001C
> +#define HISI_PMCU_REG_WAIT_CNT		0x0020
> +#define HISI_PMCU_REG_FSM_CTRL		0x0038
> +#define HISI_PMCU_REG_FSM_BRK		0x003C
> +#define HISI_PMCU_REG_COMP		0x0044
> +#define HISI_PMCU_REG_INT_EN		0x0100
> +#define HISI_PMCU_REG_INT_MSK		0x0104
> +#define HISI_PMCU_REG_INT_STAT		0x0108
> +#define HISI_PMCU_REG_INT_CLR		0x010C
> +#define HISI_PMCU_REG_PMCR		0x0200
> +#define HISI_PMCU_REG_PMCCFILTR		0x0204
> +
> +/* Register related configs */
> +#define HISI_PMCU_FSM_CFG_EV_LEN_MSK	GENMASK(7, 0)
> +#define HISI_PMCU_FSM_CFG_NR_LOOP_MSK	GENMASK(15, 8)
> +#define HISI_PMCU_FSM_CFG_NR_PMU_MSK	GENMASK(19, 16)
> +#define HISI_PMCU_FSM_CFG_MAX_EV_LEN	240

As this is used in various places that are only loosely associated
with this register, I'd just rename it HISI_PMCU_MAX_EV_LEN.
Similar probably applies to some of these others.

> +#define HISI_PMCU_FSM_CFG_MAX_NR_LOOP	255
> +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU	8
> +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU_C	5
> +#define HISI_PMCU_WAIT_CNT_DEFAULT	0x249F0
> +#define HISI_PMCU_FSM_CTRL_TRIGGER	BIT(0)
> +#define HISI_PMCU_FSM_BRK_BRK		BIT(0)
> +#define HISI_PMCU_COMP_HPMN_THR		3
> +#define HISI_PMCU_COMP_ENABLE		BIT(0)
> +#define HISI_PMCU_INT_DONE		BIT(0)
> +#define HISI_PMCU_INT_BRK		BIT(1)
> +#define HISI_PMCU_INT_ALL		GENMASK(1, 0)
> +#define HISI_PMCU_PMCR_DEFAULT		0xC1

How is this related to the architecture defined PMCR register?
Or just a coincidence of naming?

Either way, I'm assuming 0xC1 is probably multiple fields so if
possible can we break this down further with defines to show
where the value comes from.
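
If the register does mirror the architectural PMCR_EL0 bit layout (an assumption on my side, the patch does not say), 0xC1 would decompose as E | LC | LP. A sketch of the breakdown, with illustrative names that are not in the patch:

```c
#include <stdint.h>

/* Hypothetical field names, assuming the PMCR_EL0 bit layout applies. */
#define PMCU_PMCR_E	(UINT32_C(1) << 0)	/* enable counters */
#define PMCU_PMCR_LC	(UINT32_C(1) << 6)	/* long cycle counter enable */
#define PMCU_PMCR_LP	(UINT32_C(1) << 7)	/* long event counter enable */

#define PMCU_PMCR_DEFAULT	(PMCU_PMCR_E | PMCU_PMCR_LC | PMCU_PMCR_LP)
```

In the driver these would be BIT() definitions; the point is only that the default reads as named fields rather than a magic number.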


> +#define HISI_PMCU_PMCCFILTR_MSK		GENMASK(31, 24)
...

> +/**
> + * struct hisi_pmcu_events - PMCU events and sampling configuration
> + * @nr_pmu:		number of core PMU counters that run in parallel
> + * @padding:		number of padding events in a sample
> + * @nr_ev:		number of events passed by users in a sample
> + * @nr_ev_per_sample:	number of events passed to hardware for a sample
> + *			This equals nr_ev + padding and should be evenly
> + *			divisible by nr_pmu.
> + * @max_sample_loop:	max number of samples that can be done in a loop
> + * @ev_len:		event length for hardware to read in a loop
> + * @nr_loop:		number of loops in one trigger
> + * @comp_mode:		compatibility mode
> + * @nr_sample:		number of samples that the current trigger takes
> + * @nr_pending_sample:	number of pending samples
> + * @subsample_size:	size of a subsample
> + * @sample_size:	size of a sample
> + * @output_size:	size of output from one trigger
> + * @sample_period:	sample period passed to hardware
> + * @nr_cpu:		number of hardware threads (logical CPUs)
> + * @events:		event IDs passed from users

Maybe say what they are for rather than where they come from?
event IDs to sample.

> + */
> +struct hisi_pmcu_events {
> +	u8 nr_pmu;
> +	u8 padding;
> +	u8 nr_ev;
> +	u8 nr_ev_per_sample;
> +	u8 max_sample_loop;
> +	u8 ev_len;
> +	u8 nr_loop;
> +	u8 comp_mode;

Could you use the enum hisi_pmcu_comp_mode type for this?

> +	u32 nr_sample;
> +	u32 nr_pending_sample;
> +	u32 subsample_size;
> +	u32 sample_size;
> +	u32 output_size;
> +	u32 sample_period;
> +	u32 nr_cpu;
> +	u32 events[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
> +};
> +

...


> +static const struct attribute_group hisi_pmcu_format_attr_group = {
> +	.name = "format",
> +	.attrs = hisi_pmcu_format_attrs,
> +};
> +
> +static ssize_t monitored_cpus_show(struct device *dev,
> +				   struct device_attribute *attr, char *buf)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
> +
> +	return sysfs_emit(buf, "%d-%d\n",
> +			  cpumask_first(&hisi_pmcu->cpus),
> +			  cpumask_last(&hisi_pmcu->cpus));

What does this do about offline CPUs?
Should it include them or not?

> +}
> +
> +static DEVICE_ATTR_ADMIN_RO(monitored_cpus);



> +static ssize_t user_events_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
> +	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
> +	u32 head, tail, nr_ev;
> +	char *line;
> +	int err;
> +
> +	line = kcalloc(count + 1, sizeof(*line), GFP_KERNEL);

Doesn't seem to be freed anywhere.

> +	nr_ev = 0;
> +	head = 0;
> +	tail = 0;
> +	while (nr_ev < HISI_PMCU_FSM_CFG_MAX_EV_LEN) {
> +		while (head < count && isspace(buf[head]))
> +			head++;
> +		if (!isxdigit(buf[head]))
> +			break;
> +		tail = head + 1;
> +
> +		while (tail < count && isalnum(buf[tail]))
> +			tail++;
> +
> +		strncpy(line, buf + head, tail - head);
> +		line[tail - head] = '\0';
> +		err = kstrtou16(line, 16, &user_ev->ev[nr_ev]);
> +		if (err) {
> +			user_ev->nr_ev = 0;
> +			return err;
> +		}
> +		nr_ev++;
> +		head = tail;
> +	}
> +	user_ev->nr_ev = nr_ev;
> +
> +	return count;
> +}
> +
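
On the kcalloc() in user_events_store() above: a rough userspace analogue of the same tokenising loop, with the allocation checked and released on every exit path through a single label. This is a sketch only; parse_hex_events() is a made-up name, and strtoul() stands in for kstrtou16().

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns the number of events parsed, or a negative errno value. */
static int parse_hex_events(const char *buf, size_t count,
			    uint16_t *ev, size_t max_ev)
{
	size_t head = 0, tail, nr_ev = 0;
	char *line;
	int ret;

	line = calloc(count + 1, sizeof(*line));
	if (!line)
		return -ENOMEM;

	while (nr_ev < max_ev) {
		unsigned long val;
		char *end;

		while (head < count && isspace((unsigned char)buf[head]))
			head++;
		/* Stop at end of input or at the first non-hex token. */
		if (head >= count || !isxdigit((unsigned char)buf[head]))
			break;

		tail = head + 1;
		while (tail < count && isalnum((unsigned char)buf[tail]))
			tail++;

		memcpy(line, buf + head, tail - head);
		line[tail - head] = '\0';

		errno = 0;
		val = strtoul(line, &end, 16);
		if (errno || *end != '\0' || val > UINT16_MAX) {
			ret = -EINVAL;
			goto out;	/* error path still frees line */
		}
		ev[nr_ev++] = (uint16_t)val;
		head = tail;
	}
	ret = (int)nr_ev;
out:
	free(line);
	return ret;
}
```

The in-kernel version would use kfree() at the same label, and would also want the NULL check that the quoted code skips.
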
> +static int hisi_pmcu_pmu_event_init(struct perf_event *event)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
> +	void __iomem *base = hisi_pmcu->regbase;
> +	u64 cfg;
> +	u32 val;
> +
> +	if (event->attr.type != hisi_pmcu->pmu.type)
> +		return -ENOENT;
> +
> +	if (hisi_pmcu->busy)
> +		return -EBUSY;
> +
> +	cfg = event->attr.config;
> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_NR_SAMPLE, cfg);

val gets used for a lot of different things in this function.
I would use a set of new local variables with names that make it more
obvious what they are.

> +	ev->nr_pending_sample = val ? val : HISI_PMCU_PERF_NR_SAMPLE_DEFAULT;
local variable isn't that useful here and makes the whole reuse of value
issue worse.

	ev->nr_pending_sample = FIELD_GET(...);
	if (ev->nr_pending_sample == 0)
		ev->nr_pending_sample = HISI...

> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS, cfg);
> +	if (val > HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS) {
> +		dev_err(hisi_pmcu->dev, "sample period too long (max=0x%x)\n",
> +			HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS);
> +		return -EINVAL;
> +	}
> +	ev->sample_period = val ? val * HISI_PMCU_PERF_MS_TO_WAIT_CNT :
> +				  HISI_PMCU_WAIT_CNT_DEFAULT;
> +
> +	cfg = event->attr.config1;
> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_PMCCFILTR, cfg);
> +	val = FIELD_PREP(HISI_PMCU_PMCCFILTR_MSK, val);
> +	writel(val, base + HISI_PMCU_REG_PMCCFILTR);
> +
> +	return 0;
> +}
> +
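
The sample period handling above can be sketched in userspace with a named helper instead of the shared val. The value of HISI_PMCU_PERF_MS_TO_WAIT_CNT is not visible in this quote, so the 50000 counts per millisecond below is inferred from the default (0x249F0 = 150000 for 3 ms) and should be treated as an assumption; the PMCU_* names are illustrative.

```c
#include <errno.h>
#include <stdint.h>

#define PMCU_WAIT_CNT_DEFAULT	0x249F0u	/* from the patch: 150000 */
#define PMCU_DEFAULT_PERIOD_MS	3u		/* documented default */
/* Inferred, not taken from the patch: 150000 / 3 = 50000 per ms. */
#define PMCU_MS_TO_WAIT_CNT	(PMCU_WAIT_CNT_DEFAULT / PMCU_DEFAULT_PERIOD_MS)
/* Largest period whose wait count still fits the 32-bit register. */
#define PMCU_MAX_PERIOD_MS	(UINT32_MAX / PMCU_MS_TO_WAIT_CNT)

/* Convert a period in ms to a wait count; 0 selects the default. */
static int pmcu_period_to_wait_cnt(uint32_t period_ms, uint32_t *wait_cnt)
{
	if (period_ms > PMCU_MAX_PERIOD_MS)
		return -EINVAL;

	*wait_cnt = period_ms ? period_ms * PMCU_MS_TO_WAIT_CNT
			      : PMCU_WAIT_CNT_DEFAULT;
	return 0;
}
```

Under that assumption the limit works out to 85899 ms, the same maximum the documentation patch states, which suggests the cap exists to keep the multiplication within 32 bits.
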

...

> +static void hisi_pmcu_hw_sample_start(struct hisi_pmcu *hisi_pmcu,
> +				      struct hisi_pmcu_buf *buf)
> +{
> +	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
> +	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
> +	void __iomem *base = hisi_pmcu->regbase;
> +	u64 addr, end;
> +	u32 val;
> +
> +	/* FSM CFG */
> +	val = FIELD_PREP(HISI_PMCU_FSM_CFG_EV_LEN_MSK, ev->ev_len);
> +	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_LOOP_MSK, ev->nr_loop);
> +	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_PMU_MSK, ev->nr_pmu);
> +	writel(val, base + HISI_PMCU_REG_FSM_CFG);
> +
> +	/* Sample period */
> +	writel(ev->sample_period, base + HISI_PMCU_REG_WAIT_CNT);
> +
> +	/* Event ID base */
> +	addr = virt_to_phys(ev->events);
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_EVENT_BASE_H);

No point in using the local variable val here that I can see.
	writel(upper_32_bits(addr), base + ...)

> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_EVENT_BASE_L);
same with this one.

> +
> +	/* sbuf end */
> +	end = page_to_phys(sbuf->page) + sbuf->size;
> +
> +	/* Data output address */
> +	addr = end - sbuf->remain;
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_STORE_BASE_H);
and this one.

> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_STORE_BASE_L);
and another. etc..

> +
> +	/* Stop data output if sbuf end is reached (abnormally) */
> +	addr = end;
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_KILL_BASE_H);
> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_KILL_BASE_L);
> +
> +	/* Trigger */
> +	writel(HISI_PMCU_FSM_CTRL_TRIGGER, base + HISI_PMCU_REG_FSM_CTRL);
> +}
> +
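
Since the same high/low split is repeated for the event, store and kill base addresses, a small helper might read better than open-coding it each time. A userspace sketch, where fake_regs and fake_writel() are stand-ins for the MMIO window and writel(), and pmcu_write_addr() is a hypothetical name:

```c
#include <stdint.h>

#define PMCU_REG_EVENT_BASE_H	0x0008	/* offsets as in the patch */
#define PMCU_REG_EVENT_BASE_L	0x000C

/* Stand-in for the register window, indexed in 32-bit words. */
static uint32_t fake_regs[0x40 / 4];

static void fake_writel(uint32_t val, uint32_t off)
{
	fake_regs[off / 4] = val;
}

/* Write a 64-bit address into a pair of 32-bit high/low registers. */
static void pmcu_write_addr(uint64_t addr, uint32_t off_h, uint32_t off_l)
{
	fake_writel((uint32_t)(addr >> 32), off_h);
	fake_writel((uint32_t)addr, off_l);
}
```

In the driver the body would be two writel() calls using upper_32_bits() and lower_32_bits(), and each of the three call sites shrinks to one line.
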

> +
> +static void hisi_pmcu_write_auxtrace_header(struct hisi_pmcu_events *ev,
> +					    struct hisi_pmcu_buf *buf)
> +{
> +	struct hisi_pmcu_auxtrace_header header;
> +	struct hisi_pmcu_sbuf *sbuf;
> +	u32 *data;
> +	u32 sz;
> +
> +	sbuf = &buf->sbuf[buf->cur_buf];
> +
> +	header.buffer_size = sbuf->size;
> +	header.nr_pmu = ev->nr_pmu;
> +	header.nr_cpu = ev->nr_cpu;
> +	header.comp_mode = ev->comp_mode;
> +	header.subsample_size = ev->subsample_size;
> +	header.nr_subsample_per_sample = ev->nr_ev_per_sample / ev->nr_pmu;
> +	header.nr_event = ev->nr_ev_per_sample;

Might be nicer to read as

	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
	struct hisi_pmcu_auxtrace_header header = {
		.buffer_size = sbuf->size,
		.nr_pmu = ev->nr_pmu,
		.nr_cpu = ev->nr_cpu,
		...
	};

		
> +
> +	data = page_to_virt(sbuf->page);
> +	memcpy(data, &header, sizeof(header));
> +	memcpy(data + sizeof(header) / sizeof(*data), ev->events,
> +	       ev->nr_ev_per_sample * sizeof(u32));

I'm not sure why data is a u32 *
A few things that would make this neater.

* write the header directly.

	struct hisi_pmcu_auxtrace_header *header = page_to_virt(sbuf->page);

	*header = (struct hisi_pmcu_auxtrace_header) {
		.buffer_size = sbuf->size,
		.nr_pmu = ev->nr_pmu,
		.nr_cpu = ev->nr_cpu,
		...
	};
* Use header + 1 to get to the address just after it.
	memcpy(header + 1, ev->events, ev->nr_ev_per_sample * sizeof(u32));

* Instead add the data to the structure (maybe rename it)
struct hisi_pmcu_auxtrace_header {
	u32 buffer_size;
	u32 nr_pmu;
	u32 nr_cpu;
	u32 comp_mode;
	u32 subsample_size;
	u32 nr_subsample_per_sample;
	u32 nr_event;
	u32 data[];
};

> +
> +	sz = sizeof(header) + ev->nr_ev_per_sample * sizeof(u32);

with above augmented header structure, struct_size()
 
> +	sz = round_up(sz, HISI_PMCU_AUX_HEADER_ALIGN);
> +
> +	sbuf->remain -= sz;
> +}
> +
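
To make the struct_size() suggestion concrete, here is a userspace sketch of the size calculation with the flexible array member in place. offsetof() plays the role of struct_size(), the alignment is passed in because HISI_PMCU_AUX_HEADER_ALIGN is not shown in this quote, and pmcu_header_bytes() is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Header layout as proposed above, with the event IDs appended. */
struct pmcu_auxtrace_header {
	uint32_t buffer_size;
	uint32_t nr_pmu;
	uint32_t nr_cpu;
	uint32_t comp_mode;
	uint32_t subsample_size;
	uint32_t nr_subsample_per_sample;
	uint32_t nr_event;
	uint32_t data[];	/* nr_event event IDs follow the header */
};

/* Bytes consumed by the header plus event IDs, rounded up to align. */
static size_t pmcu_header_bytes(uint32_t nr_event, size_t align)
{
	size_t sz = offsetof(struct pmcu_auxtrace_header, data) +
		    (size_t)nr_event * sizeof(uint32_t);

	return (sz + align - 1) / align * align;
}
```

The fixed part is 7 u32 fields (28 bytes), so 4 events with a 16-byte alignment come to 48 bytes. In the kernel, struct_size(header, data, nr_event) also guards the multiplication against overflow.
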
> +static void hisi_pmcu_pmu_start(struct perf_event *event, int flags)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct perf_output_handle *handle = &hisi_pmcu->handle;
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct hisi_pmcu_buf *buf;
> +	int err;
> +
> +	spin_lock(&hisi_pmcu->lock);
> +
> +	if (hisi_pmcu->busy) {
> +		dev_info(hisi_pmcu->dev,
> +			 "Sampling is running, pmu->start() ignored\n");

I'm not sure on perf convention on this, but I'd have thought dev_dbg
enough for this.  If this is a normal thing to do then feel free to leave it.


> +		goto out;
> +	}
> +
...


> +
> +static void hisi_pmcu_pmu_stop(struct perf_event *event, int flags)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct perf_output_handle *handle;
> +	struct hisi_pmcu_sbuf *sbuf;
> +	struct hisi_pmcu_buf *buf;
> +	int err;
> +
> +	spin_lock(&hisi_pmcu->lock);
> +
> +	handle = &hisi_pmcu->handle;
> +
> +	/* If PMCU is running, break it */
> +	if (hisi_pmcu->busy) {
> +		dev_info(hisi_pmcu->dev, "Stopping PMCU sampling\n");

Is this useful?  dev_dbg maybe?

> +		err = hisi_pmcu_hw_sample_stop(hisi_pmcu);
> +		if (err)
> +			dev_err(hisi_pmcu->dev,
> +				"Timed out for stopping PMCU!\n");
> +	}
> +
> +	buf = perf_get_aux(handle);
> +	sbuf = &buf->sbuf[buf->cur_buf];
> +	perf_aux_output_end(handle, sbuf->size - sbuf->remain);
> +
> +	spin_unlock(&hisi_pmcu->lock);
> +
> +	hwc->state |= PERF_HES_STOPPED;
> +	perf_event_update_userpage(event);
> +}

...

> +static int hisi_pmcu_init_data(struct platform_device *pdev,
> +			       struct hisi_pmcu *hisi_pmcu)
> +{
> +	int ret;
> +
> +	hisi_pmcu->regbase = devm_platform_ioremap_resource(pdev, 0);
> +	if (IS_ERR(hisi_pmcu->regbase))
> +		return dev_err_probe(&pdev->dev, -ENODEV,
> +				     "Failed to map device register space\n");
> +
> +	ret = device_property_read_u32(&pdev->dev, "hisilicon,scl-id",
> +				       &hisi_pmcu->scclid);

These need linux/property.h to be included (mentioned above)

> +	if (ret < 0)
> +		return dev_err_probe(&pdev->dev, ret,
> +				     "Failed to read sccl-id!\n");
> +
> +	/*
> +	 * Obtain the number of CPUs that contributes to the sample size.
> +	 * NR_CPU_CLUSTER is now hard coded as the hardware accesses a certain
> +	 * number of CPUs in a cluster regardless of how many CPUs are actually
> +	 * implemented/available.
> +	 */
> +	ret = device_property_read_u32(&pdev->dev, "hisilicon,nr-cluster",
> +				       &hisi_pmcu->ev.nr_cpu);
> +	if (ret < 0)
> +		return dev_err_probe(&pdev->dev, ret,
> +				     "Failed to read nr-cluster!\n");
> +	hisi_pmcu->ev.nr_cpu *= NR_CPU_CLUSTER;
> +
> +	return 0;
> +}




> +static struct platform_driver hisi_pmcu_driver = {
> +	.driver = {
> +		.name = HISI_PMCU_DRV_NAME,
> +		.acpi_match_table = hisi_pmcu_acpi_match,
> +		/*
> +		 * Unbinding driver is not yet supported as we have not worked
> +		 * out a safe bind/unbind process.

I'd add a note on this to the cover letter and the above
patch description.  It's definitely something we should
make work for this driver.

> +		 */
> +		.suppress_bind_attrs = true,
> +	},
> +	.probe = hisi_pmcu_probe,
> +};
> +
..

I'm very interested to see what this hardware/driver gets used for.

Thanks,

Jonathan

^ permalink raw reply	[flat|nested] 32+ messages in thread

> +#include <linux/gfp_types.h>

It is very very rare for this to be included directly.
Normally just rely on indirect inclusion from slab.h or similar.
I would drop this one.

> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/mm_types.h>
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/printk.h>

Missing linux/property.h (needed for device_property_read_u32()).

> +#include <linux/slab.h>
> +#include <linux/smp.h>
> +#include <linux/threads.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/cputype.h>
> +#include <asm/sysreg.h>
> +
> +/* Registers */
> +#define HISI_PMCU_REG_FSM_STATUS	0x0000
> +#define HISI_PMCU_REG_FSM_CFG		0x0004
> +#define HISI_PMCU_REG_EVENT_BASE_H	0x0008
> +#define HISI_PMCU_REG_EVENT_BASE_L	0x000C
> +#define HISI_PMCU_REG_KILL_BASE_H	0x0010
> +#define HISI_PMCU_REG_KILL_BASE_L	0x0014
> +#define HISI_PMCU_REG_STORE_BASE_H	0x0018
> +#define HISI_PMCU_REG_STORE_BASE_L	0x001C
> +#define HISI_PMCU_REG_WAIT_CNT		0x0020
> +#define HISI_PMCU_REG_FSM_CTRL		0x0038
> +#define HISI_PMCU_REG_FSM_BRK		0x003C
> +#define HISI_PMCU_REG_COMP		0x0044
> +#define HISI_PMCU_REG_INT_EN		0x0100
> +#define HISI_PMCU_REG_INT_MSK		0x0104
> +#define HISI_PMCU_REG_INT_STAT		0x0108
> +#define HISI_PMCU_REG_INT_CLR		0x010C
> +#define HISI_PMCU_REG_PMCR		0x0200
> +#define HISI_PMCU_REG_PMCCFILTR		0x0204
> +
> +/* Register related configs */
> +#define HISI_PMCU_FSM_CFG_EV_LEN_MSK	GENMASK(7, 0)
> +#define HISI_PMCU_FSM_CFG_NR_LOOP_MSK	GENMASK(15, 8)
> +#define HISI_PMCU_FSM_CFG_NR_PMU_MSK	GENMASK(19, 16)
> +#define HISI_PMCU_FSM_CFG_MAX_EV_LEN	240

As this is used in various places that are only loosely associated
with this register, I'd just rename it HISI_PMCU_MAX_EVN_LEN.
The same probably applies to some of the others.

> +#define HISI_PMCU_FSM_CFG_MAX_NR_LOOP	255
> +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU	8
> +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU_C	5
> +#define HISI_PMCU_WAIT_CNT_DEFAULT	0x249F0
> +#define HISI_PMCU_FSM_CTRL_TRIGGER	BIT(0)
> +#define HISI_PMCU_FSM_BRK_BRK		BIT(0)
> +#define HISI_PMCU_COMP_HPMN_THR		3
> +#define HISI_PMCU_COMP_ENABLE		BIT(0)
> +#define HISI_PMCU_INT_DONE		BIT(0)
> +#define HISI_PMCU_INT_BRK		BIT(1)
> +#define HISI_PMCU_INT_ALL		GENMASK(1, 0)
> +#define HISI_PMCU_PMCR_DEFAULT		0xC1

How is this related to the architecture defined PMCR register?
Or just a coincidence of naming?

Either way, I'm assuming 0xC1 is probably multiple fields so if
possible can we break this down further with defines to show
where the value comes from.
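
To illustrate the kind of breakdown meant here, a minimal sketch. It
assumes the PMCU register uses the same bit positions as the architectural
PMCR_EL0 (E = bit 0, LC = bit 6, LP = bit 7), which nothing in this thread
confirms; the field names are made up for illustration.

```c
#include <stdint.h>

#define BIT(n)	(1U << (n))

/*
 * Assumption: same bit positions as the architectural PMCR_EL0.
 * Whether the PMCU register really follows that layout is exactly
 * the open question above, so treat these names as hypothetical.
 */
#define HISI_PMCU_PMCR_E	BIT(0)	/* enable counters */
#define HISI_PMCU_PMCR_LC	BIT(6)	/* long cycle counter */
#define HISI_PMCU_PMCR_LP	BIT(7)	/* long event counters */

#define HISI_PMCU_PMCR_DEFAULT \
	(HISI_PMCU_PMCR_E | HISI_PMCU_PMCR_LC | HISI_PMCU_PMCR_LP)

/* Returns the composed default so the value can be checked in one place. */
static inline uint32_t hisi_pmcu_pmcr_default(void)
{
	return HISI_PMCU_PMCR_DEFAULT;
}
```

Composed this way the default still evaluates to 0xC1, but a reader can
see which bits are being set.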


> +#define HISI_PMCU_PMCCFILTR_MSK		GENMASK(31, 24)
...

> +/**
> + * struct hisi_pmcu_events - PMCU events and sampling configuration
> + * @nr_pmu:		number of core PMU counters that run in parallel
> + * @padding:		number of padding events in a sample
> + * @nr_ev:		number of events passed by users in a sample
> + * @nr_ev_per_sample:	number of events passed to hardware for a sample
> + *			This equals nr_ev + padding and should be evenly
> + *			divisible by nr_pmu.
> + * @max_sample_loop:	max number of samples that can be done in a loop
> + * @ev_len:		event length for hardware to read in a loop
> + * @nr_loop:		number of loops in one trigger
> + * @comp_mode:		compatibility mode
> + * @nr_sample:		number of samples that the current trigger takes
> + * @nr_pending_sample:	number of pending samples
> + * @subsample_size:	size of a subsample
> + * @sample_size:	size of a sample
> + * @output_size:	size of output from one trigger
> + * @sample_period:	sample period passed to hardware
> + * @nr_cpu:		number of hardware threads (logical CPUs)
> + * @events:		event IDs passed from users

Maybe say what they are for rather than where they come from?
e.g. "event IDs to sample".

> + */
> +struct hisi_pmcu_events {
> +	u8 nr_pmu;
> +	u8 padding;
> +	u8 nr_ev;
> +	u8 nr_ev_per_sample;
> +	u8 max_sample_loop;
> +	u8 ev_len;
> +	u8 nr_loop;
> +	u8 comp_mode;

Could you use the enum hisi_pmcu_comp_mode type for this?

> +	u32 nr_sample;
> +	u32 nr_pending_sample;
> +	u32 subsample_size;
> +	u32 sample_size;
> +	u32 output_size;
> +	u32 sample_period;
> +	u32 nr_cpu;
> +	u32 events[HISI_PMCU_FSM_CFG_MAX_EV_LEN];
> +};
> +
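
For readers following the kernel-doc above: nr_ev_per_sample is nr_ev plus
padding and must divide evenly by nr_pmu. A minimal userspace sketch of
that arithmetic follows. The helper names are hypothetical and the driver
may compute this differently.

```c
#include <stdint.h>

/*
 * Hypothetical helper: smallest padding such that
 * (nr_ev + padding) is a multiple of nr_pmu. nr_pmu must be non-zero.
 */
static uint8_t pmcu_padding(uint8_t nr_ev, uint8_t nr_pmu)
{
	return (uint8_t)((nr_pmu - nr_ev % nr_pmu) % nr_pmu);
}

/* Number of subsamples (groups of nr_pmu events) in one sample. */
static uint8_t pmcu_subsamples_per_sample(uint8_t nr_ev, uint8_t nr_pmu)
{
	return (uint8_t)((nr_ev + pmcu_padding(nr_ev, nr_pmu)) / nr_pmu);
}
```

For example, 10 user events with 8 counters need 6 padding events and
give 2 subsamples per sample.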

...


> +static const struct attribute_group hisi_pmcu_format_attr_group = {
> +	.name = "format",
> +	.attrs = hisi_pmcu_format_attrs,
> +};
> +
> +static ssize_t monitored_cpus_show(struct device *dev,
> +				   struct device_attribute *attr, char *buf)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
> +
> +	return sysfs_emit(buf, "%d-%d\n",
> +			  cpumask_first(&hisi_pmcu->cpus),
> +			  cpumask_last(&hisi_pmcu->cpus));

What does this do about offline CPUs?
Should it include them or not?

> +}
> +
> +static DEVICE_ATTR_ADMIN_RO(monitored_cpus);
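
Related to the question above: a "first-last" range hides any holes in the
mask. Below is a userspace sketch of list-style formatting, with a 64-bit
mask standing in for a cpumask. In the kernel the usual route would be the
"%*pbl" format with cpumask_pr_args(); the function here is only an
illustration and does not decide whether offline CPUs belong in the mask.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format the set bits of mask as a list such as "0-3,8-11".
 * Returns the number of characters written, stopping early
 * (with buf still NUL-terminated) if buf is too small.
 */
static int format_cpu_list(uint64_t mask, char *buf, size_t len)
{
	size_t pos = 0;
	int cpu = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	while (cpu < 64) {
		int start, end, n;

		if (!(mask & (1ULL << cpu))) {
			cpu++;
			continue;
		}

		/* Found the start of a run of set bits; find its end. */
		start = cpu;
		while (cpu < 64 && (mask & (1ULL << cpu)))
			cpu++;
		end = cpu - 1;

		if (start == end)
			n = snprintf(buf + pos, len - pos, "%s%d",
				     pos ? "," : "", start);
		else
			n = snprintf(buf + pos, len - pos, "%s%d-%d",
				     pos ? "," : "", start, end);

		if (n < 0 || (size_t)n >= len - pos) {
			buf[pos] = '\0';	/* drop the truncated entry */
			break;
		}
		pos += (size_t)n;
	}

	return (int)pos;
}
```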



> +static ssize_t user_events_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
> +	struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev;
> +	u32 head, tail, nr_ev;
> +	char *line;
> +	int err;
> +
> +	line = kcalloc(count + 1, sizeof(*line), GFP_KERNEL);

Doesn't seem to be freed anywhere.

> +	nr_ev = 0;
> +	head = 0;
> +	tail = 0;
> +	while (nr_ev < HISI_PMCU_FSM_CFG_MAX_EV_LEN) {
> +		while (head < count && isspace(buf[head]))
> +			head++;
> +		if (!isxdigit(buf[head]))
> +			break;
> +		tail = head + 1;
> +
> +		while (tail < count && isalnum(buf[tail]))
> +			tail++;
> +
> +		strncpy(line, buf + head, tail - head);
> +		line[tail - head] = '\0';
> +		err = kstrtou16(line, 16, &user_ev->ev[nr_ev]);
> +		if (err) {
> +			user_ev->nr_ev = 0;
> +			return err;
> +		}
> +		nr_ev++;
> +		head = tail;
> +	}
> +	user_ev->nr_ev = nr_ev;
> +
> +	return count;
> +}
> +
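
On the unfreed 'line' buffer in user_events_store(): one way to sidestep
the leak is to avoid the temporary allocation altogether. Here is a
userspace sketch using strtoul(). The function name is hypothetical, and
unlike the quoted code it rejects malformed tokens rather than stopping
silently. An in-kernel version would more likely keep kstrtou16() with a
small on-stack token buffer, or kfree() on every exit path.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse whitespace-separated hex event numbers (with or without "0x")
 * from a NUL-terminated string. Returns the number of events stored,
 * or -EINVAL on a malformed or out-of-range token. Parsing stops
 * quietly once max_ev events have been stored.
 */
static int parse_events(const char *buf, uint16_t *ev, int max_ev)
{
	int nr_ev = 0;

	while (nr_ev < max_ev) {
		unsigned long val;
		char *end;

		while (isspace((unsigned char)*buf))
			buf++;
		if (*buf == '\0')
			break;

		errno = 0;
		val = strtoul(buf, &end, 16);
		if (end == buf || errno || val > UINT16_MAX)
			return -EINVAL;
		/* A token must be followed by whitespace or end of input. */
		if (*end != '\0' && !isspace((unsigned char)*end))
			return -EINVAL;

		ev[nr_ev++] = (uint16_t)val;
		buf = end;
	}

	return nr_ev;
}
```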
> +static int hisi_pmcu_pmu_event_init(struct perf_event *event)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
> +	void __iomem *base = hisi_pmcu->regbase;
> +	u64 cfg;
> +	u32 val;
> +
> +	if (event->attr.type != hisi_pmcu->pmu.type)
> +		return -ENOENT;
> +
> +	if (hisi_pmcu->busy)
> +		return -EBUSY;
> +
> +	cfg = event->attr.config;
> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_NR_SAMPLE, cfg);

val gets used for a lot of different things in this function.
I would use a set of new local variables with names that make it more
obvious what they are.

> +	ev->nr_pending_sample = val ? val : HISI_PMCU_PERF_NR_SAMPLE_DEFAULT;
The local variable isn't that useful here and makes the whole reuse of val
issue worse.

	ev->nr_pending_sample = FIELD_GET(...);
	if (ev->nr_pending_sample == 0)
		ev->nr_pending_sample = HISI...

> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS, cfg);
> +	if (val > HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS) {
> +		dev_err(hisi_pmcu->dev, "sample period too long (max=0x%x)\n",
> +			HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS);
> +		return -EINVAL;
> +	}
> +	ev->sample_period = val ? val * HISI_PMCU_PERF_MS_TO_WAIT_CNT :
> +				  HISI_PMCU_WAIT_CNT_DEFAULT;
> +
> +	cfg = event->attr.config1;
> +
> +	val = FIELD_GET(HISI_PMCU_PERF_ATTR_PMCCFILTR, cfg);
> +	val = FIELD_PREP(HISI_PMCU_PMCCFILTR_MSK, val);
> +	writel(val, base + HISI_PMCU_REG_PMCCFILTR);
> +
> +	return 0;
> +}
> +

...

> +static void hisi_pmcu_hw_sample_start(struct hisi_pmcu *hisi_pmcu,
> +				      struct hisi_pmcu_buf *buf)
> +{
> +	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
> +	struct hisi_pmcu_events *ev = &hisi_pmcu->ev;
> +	void __iomem *base = hisi_pmcu->regbase;
> +	u64 addr, end;
> +	u32 val;
> +
> +	/* FSM CFG */
> +	val = FIELD_PREP(HISI_PMCU_FSM_CFG_EV_LEN_MSK, ev->ev_len);
> +	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_LOOP_MSK, ev->nr_loop);
> +	val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_PMU_MSK, ev->nr_pmu);
> +	writel(val, base + HISI_PMCU_REG_FSM_CFG);
> +
> +	/* Sample period */
> +	writel(ev->sample_period, base + HISI_PMCU_REG_WAIT_CNT);
> +
> +	/* Event ID base */
> +	addr = virt_to_phys(ev->events);
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_EVENT_BASE_H);

No point in using the local variable val here that I can see.
	writel(upper_32_bits(addr), base + ...)

> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_EVENT_BASE_L);
same with this one.

> +
> +	/* sbuf end */
> +	end = page_to_phys(sbuf->page) + sbuf->size;
> +
> +	/* Data output address */
> +	addr = end - sbuf->remain;
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_STORE_BASE_H);
and this one.

> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_STORE_BASE_L);
and another. etc..

> +
> +	/* Stop data output if sbuf end is reached (abnormally) */
> +	addr = end;
> +	val = upper_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_KILL_BASE_H);
> +	val = lower_32_bits(addr);
> +	writel(val, base + HISI_PMCU_REG_KILL_BASE_L);
> +
> +	/* Trigger */
> +	writel(HISI_PMCU_FSM_CTRL_TRIGGER, base + HISI_PMCU_REG_FSM_CTRL);
> +}
> +
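
The repeated upper/lower writes in hisi_pmcu_hw_sample_start() could also
be folded into one small helper. A userspace sketch follows, with an array
standing in for the MMIO space. fake_writel() and pmcu_write_addr() are
hypothetical names; only the register offsets come from the patch.

```c
#include <stdint.h>

/* Stand-in for the MMIO register space, indexed in 32-bit words. */
static uint32_t fake_regs[0x40];

/* Stand-in for writel(); offset is in bytes, as in the driver. */
static void fake_writel(uint32_t val, uint32_t offset)
{
	fake_regs[offset / 4] = val;
}

#define HISI_PMCU_REG_STORE_BASE_H	0x0018
#define HISI_PMCU_REG_STORE_BASE_L	0x001C

/* Hypothetical helper: write a 64-bit address to an H/L register pair. */
static void pmcu_write_addr(uint64_t addr, uint32_t reg_h, uint32_t reg_l)
{
	fake_writel((uint32_t)(addr >> 32), reg_h);	/* upper_32_bits() */
	fake_writel((uint32_t)addr, reg_l);		/* lower_32_bits() */
}
```

With something like this, each of the three address pairs becomes a single
call and the 'val' temporary goes away.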

> +
> +static void hisi_pmcu_write_auxtrace_header(struct hisi_pmcu_events *ev,
> +					    struct hisi_pmcu_buf *buf)
> +{
> +	struct hisi_pmcu_auxtrace_header header;
> +	struct hisi_pmcu_sbuf *sbuf;
> +	u32 *data;
> +	u32 sz;
> +
> +	sbuf = &buf->sbuf[buf->cur_buf];
> +
> +	header.buffer_size = sbuf->size;
> +	header.nr_pmu = ev->nr_pmu;
> +	header.nr_cpu = ev->nr_cpu;
> +	header.comp_mode = ev->comp_mode;
> +	header.subsample_size = ev->subsample_size;
> +	header.nr_subsample_per_sample = ev->nr_ev_per_sample / ev->nr_pmu;
> +	header.nr_event = ev->nr_ev_per_sample;

Might be nicer to read as

	struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf];
	struct hisi_pmcu_auxtrace_header header = {
		.buffer_size = sbuf->size,
		.nr_pmu = ev->nr_pmu,
		.nr_cpu = ev->nr_cpu,
		...
	};

		
> +
> +	data = page_to_virt(sbuf->page);
> +	memcpy(data, &header, sizeof(header));
> +	memcpy(data + sizeof(header) / sizeof(*data), ev->events,
> +	       ev->nr_ev_per_sample * sizeof(u32));

I'm not sure why data is a u32 *
A few things that would make this neater.

* write the header directly.

	struct hisi_pmcu_auxtrace_header *header = page_to_virt(sbuf->page);

	*header = (struct hisi_pmcu_auxtrace_header) {
		.buffer_size = sbuf->size,
		.nr_pmu = ev->nr_pmu,
		.nr_cpu = ev->nr_cpu,
		...
	};
* Use header + 1 to get to the address just after it.
	memcpy(header + 1, ev->events, ev->nr_ev_per_sample * sizeof(u32));

* Instead add the data to the structure (maybe rename it)
struct hisi_pmcu_auxtrace_header {
	u32 buffer_size;
	u32 nr_pmu;
	u32 nr_cpu;
	u32 comp_mode;
	u32 subsample_size;
	u32 nr_subsample_per_sample;
	u32 nr_event;
	u32 data[];
};

> +
> +	sz = sizeof(header) + ev->nr_ev_per_sample * sizeof(u32);

With the augmented header structure above, use struct_size().
 
> +	sz = round_up(sz, HISI_PMCU_AUX_HEADER_ALIGN);
> +
> +	sbuf->remain -= sz;
> +}
> +
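
Putting the header suggestions together, a userspace sketch might look
like this. The struct mirrors the fields listed above with a 'pmcu_'
prefix to mark it as illustrative; pmcu_header_size() stands in for the
kernel's struct_size(), which additionally guards against overflow. The
alignment round-up is left out because the value of
HISI_PMCU_AUX_HEADER_ALIGN is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header with the event IDs as a trailing flexible array member. */
struct pmcu_auxtrace_header {
	uint32_t buffer_size;
	uint32_t nr_pmu;
	uint32_t nr_cpu;
	uint32_t comp_mode;
	uint32_t subsample_size;
	uint32_t nr_subsample_per_sample;
	uint32_t nr_event;
	uint32_t data[];
};

/* Userspace stand-in for struct_size(hdr, data, nr_event). */
static size_t pmcu_header_size(uint32_t nr_event)
{
	return sizeof(struct pmcu_auxtrace_header) +
	       (size_t)nr_event * sizeof(uint32_t);
}

/*
 * Write the fixed fields and the event IDs straight into dst, which
 * must be 4-byte aligned and at least pmcu_header_size() bytes long.
 * Returns the number of bytes written.
 */
static size_t pmcu_write_header(void *dst,
				const struct pmcu_auxtrace_header *fields,
				const uint32_t *events)
{
	struct pmcu_auxtrace_header *hdr = dst;

	*hdr = *fields;		/* copies the fixed part only */
	memcpy(hdr->data, events, fields->nr_event * sizeof(uint32_t));

	return pmcu_header_size(fields->nr_event);
}
```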
> +static void hisi_pmcu_pmu_start(struct perf_event *event, int flags)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct perf_output_handle *handle = &hisi_pmcu->handle;
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct hisi_pmcu_buf *buf;
> +	int err;
> +
> +	spin_lock(&hisi_pmcu->lock);
> +
> +	if (hisi_pmcu->busy) {
> +		dev_info(hisi_pmcu->dev,
> +			 "Sampling is running, pmu->start() ignored\n");

I'm not sure of the perf convention on this, but I'd have thought dev_dbg
would be enough here.  If this is a normal thing to do then feel free to leave it.


> +		goto out;
> +	}
> +
...


> +
> +static void hisi_pmcu_pmu_stop(struct perf_event *event, int flags)
> +{
> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu);
> +	struct hw_perf_event *hwc = &event->hw;
> +	struct perf_output_handle *handle;
> +	struct hisi_pmcu_sbuf *sbuf;
> +	struct hisi_pmcu_buf *buf;
> +	int err;
> +
> +	spin_lock(&hisi_pmcu->lock);
> +
> +	handle = &hisi_pmcu->handle;
> +
> +	/* If PMCU is running, break it */
> +	if (hisi_pmcu->busy) {
> +		dev_info(hisi_pmcu->dev, "Stopping PMCU sampling\n");

Is this useful?  dev_dbg maybe?

> +		err = hisi_pmcu_hw_sample_stop(hisi_pmcu);
> +		if (err)
> +			dev_err(hisi_pmcu->dev,
> +				"Timed out for stopping PMCU!\n");
> +	}
> +
> +	buf = perf_get_aux(handle);
> +	sbuf = &buf->sbuf[buf->cur_buf];
> +	perf_aux_output_end(handle, sbuf->size - sbuf->remain);
> +
> +	spin_unlock(&hisi_pmcu->lock);
> +
> +	hwc->state |= PERF_HES_STOPPED;
> +	perf_event_update_userpage(event);
> +}

...

> +static int hisi_pmcu_init_data(struct platform_device *pdev,
> +			       struct hisi_pmcu *hisi_pmcu)
> +{
> +	int ret;
> +
> +	hisi_pmcu->regbase = devm_platform_ioremap_resource(pdev, 0);
> +	if (IS_ERR(hisi_pmcu->regbase))
> +		return dev_err_probe(&pdev->dev, -ENODEV,
> +				     "Failed to map device register space\n");
> +
> +	ret = device_property_read_u32(&pdev->dev, "hisilicon,scl-id",
> +				       &hisi_pmcu->scclid);

These need linux/property.h to be included (mentioned above)

> +	if (ret < 0)
> +		return dev_err_probe(&pdev->dev, ret,
> +				     "Failed to read sccl-id!\n");
> +
> +	/*
> +	 * Obtain the number of CPUs that contributes to the sample size.
> +	 * NR_CPU_CLUSTER is now hard coded as the hardware accesses a certain
> +	 * number of CPUs in a cluster regardless of how many CPUs are actually
> +	 * implemented/available.
> +	 */
> +	ret = device_property_read_u32(&pdev->dev, "hisilicon,nr-cluster",
> +				       &hisi_pmcu->ev.nr_cpu);
> +	if (ret < 0)
> +		return dev_err_probe(&pdev->dev, ret,
> +				     "Failed to read nr-cluster!\n");
> +	hisi_pmcu->ev.nr_cpu *= NR_CPU_CLUSTER;
> +
> +	return 0;
> +}




> +static struct platform_driver hisi_pmcu_driver = {
> +	.driver = {
> +		.name = HISI_PMCU_DRV_NAME,
> +		.acpi_match_table = hisi_pmcu_acpi_match,
> +		/*
> +		 * Unbinding driver is not yet supported as we have not worked
> +		 * out a safe bind/unbind process.

I'd add a note on this to the cover letter and the above
patch description.  It's definitely something we should
make work for this driver.

> +		 */
> +		.suppress_bind_attrs = true,
> +	},
> +	.probe = hisi_pmcu_probe,
> +};
> +
..

I'm very interested to see what this hardware/driver gets used for.

Thanks,

Jonathan

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 3/4] perf tool: Add HiSilicon PMCU data recording support
  2023-02-06  6:51   ` Jie Zhan
@ 2023-03-17 15:13     ` Jonathan Cameron
  -1 siblings, 0 replies; 32+ messages in thread
From: Jonathan Cameron @ 2023-03-17 15:13 UTC (permalink / raw)
  To: Jie Zhan
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

On Mon, 6 Feb 2023 14:51:45 +0800
Jie Zhan <zhanjie9@hisilicon.com> wrote:

> Support for HiSilicon PMCU data recording using 'perf-record'.
> 
> Users can start PMCU profiling through 'perf-record'. Event numbers are
> passed by a sysfs interface. The following optional parameters can be
> passed through 'perf-record':
> - nr_sample: number of samples to take
> - sample_period_ms: time in ms for PMU counters to stay on for an event
> - pmccfiltr: bits[31-24] of system register PMCCFILTR_EL0
> 
> Example usage:
> 
> 1. Enter event numbers in the 'user_events' file:
> 
> 	echo "0x10 0x11" > /sys/devices/hisi_pmcu_sccl3/user_events
> 
> 2. Start the sampling with 'perf-record':
> 
> 	perf record -e hisi_pmcu_sccl3/nr_sample=1000,sample_period_ms=1/
> 
> In this example, the PMCU takes 1000 samples of event 0x0010 and 0x0011
> with a sampling period of 1ms. Data will be written to a 'perf.data' file.
> 
> Co-developed-by: Yang Shen <shenyang39@huawei.com>
> Signed-off-by: Yang Shen <shenyang39@huawei.com>
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>

I'm not particularly knowledgeable about perf tool so just some superficial comments
from me.

> ---

> diff --git a/tools/perf/arch/arm64/util/hisi-pmcu.c b/tools/perf/arch/arm64/util/hisi-pmcu.c
> new file mode 100644
> index 000000000000..7c33abf1182d
> --- /dev/null
> +++ b/tools/perf/arch/arm64/util/hisi-pmcu.c

> +struct hisi_pmcu_record {
> +	struct auxtrace_record itr;
> +	struct perf_pmu *hisi_pmcu_pmu;
> +	struct evlist *evlist;
> +};

...

> +struct auxtrace_record *hisi_pmcu_recording_init(int *err,
> +						 struct perf_pmu *hisi_pmcu_pmu)
> +{

...

> +	pmcu_record->hisi_pmcu_pmu = hisi_pmcu_pmu;
> +	pmcu_record->itr.recording_options = hisi_pmcu_recording_options;
> +	pmcu_record->itr.info_priv_size = hisi_pmcu_info_priv_size;
> +	pmcu_record->itr.info_fill = hisi_pmcu_info_fill;
> +	pmcu_record->itr.free = hisi_pmcu_record_free;
> +	pmcu_record->itr.reference = hisi_pmcu_reference;
> +	pmcu_record->itr.read_finish = auxtrace_record__read_finish;
> +	pmcu_record->itr.alignment = HISI_PMCU_DATA_ALIGNMENT;
> +	pmcu_record->itr.pmu = hisi_pmcu_pmu;

Maybe use a local variable for itr, or, if you can rely on C99 in perf tool,
a compound literal with designated initializers for the structure fields etc.

	pmcu_record->itr = (struct xxx){
		.recording_options = ,
etc


> +
> +	*err = 0;
> +	return &pmcu_record->itr;
> +}
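
A self-contained sketch of the compound literal form, using mock types
rather than the real struct auxtrace_record (all names below are
hypothetical). One thing worth knowing: members not named in the literal
are zeroed, so the assignment replaces the whole structure rather than
updating a few fields.

```c
#include <stddef.h>

/* Mock stand-in for struct auxtrace_record; not the real perf layout. */
struct mock_itr {
	int (*recording_options)(void);
	void (*free)(void);
	int (*read_finish)(void);
	unsigned int alignment;
	void *pmu;
};

struct mock_record {
	struct mock_itr itr;
	void *pmu;
};

static int mock_recording_options(void)
{
	return 0;
}

static void mock_free(void)
{
}

static void mock_record_init(struct mock_record *rec, void *pmu)
{
	rec->pmu = pmu;
	/* read_finish is not named here, so it ends up NULL. */
	rec->itr = (struct mock_itr){
		.recording_options = mock_recording_options,
		.free = mock_free,
		.alignment = 4,
		.pmu = pmu,
	};
}
```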

> diff --git a/tools/perf/util/hisi-pmcu.h b/tools/perf/util/hisi-pmcu.h
> new file mode 100644
> index 000000000000..d46d523a3aee
> --- /dev/null
> +++ b/tools/perf/util/hisi-pmcu.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * HiSilicon Performance Monitor Control Unit (PMCU) support
> + *
> + * Copyright (C) 2022 HiSilicon Limited

Probably want to update the dates if any substantial changes for v2.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
  2023-03-17 13:37     ` Jonathan Cameron
@ 2023-03-24  9:32       ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-03-24  9:32 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users



On 17/03/2023 21:37, Jonathan Cameron wrote:
> On Mon, 6 Feb 2023 14:51:43 +0800
> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
>> Document the overview and usage of HiSilicon PMCU.
>>
>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>> PMU accesses from CPUs, handling the configuration, event switching, and
>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>> scheme may lose events or drop sampling frequency. With PMCU, users can
>> reliably obtain the data of up to 240 PMU events with the sample interval
>> of events down to 1ms, while the software overhead of accessing PMUs, as
>> well as its impact on target workloads, is reduced.
>>
>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
> Nice documentation. I've read this a few times before, but on this read
> through wondered if we could say anything about the skew between capture
> of the counters.  Not that important though so I'm happy to add
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> though this may of course need updating significantly as the interface
> is refined (the RFC question you raised for example in the cover letter).
>
> Thanks
>
> Jonathan
>
>> ---
>>   Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>>   Documentation/admin-guide/perf/index.rst     |   1 +
>>   2 files changed, 184 insertions(+)
>>   create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>
>> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
>> new file mode 100644
>> index 000000000000..50d17cbd0049
>> --- /dev/null
>> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
>> @@ -0,0 +1,183 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==========================================
>> +HiSilicon Performance Monitor Control Unit
>> +==========================================
>> +
>> +Introduction
>> +============
>> +
>> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>> +PMU accesses from CPUs, handling the configuration, event switching, and
>> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
>> +scheme may lose events or drop sampling frequency. With PMCU, users can
>> +reliably obtain the data of up to 240 PMU events with the sample interval
>> +of events down to 1ms, while the software overhead of accessing PMUs, as
>> +well as its impact on target workloads, is reduced.
>> +
>> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
>> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
>> +CPU die ID. When triggered, PMCU reads event IDs and pass them to PMUs in all
>> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
>> +events, waits for a time interval, and stops them. The PMU counter readings are
>> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
>> +the ``perf.data`` file in the user space. PMCU automatically switches events
>> +(when there are more events than available PMU counters) and completes multiple
>> +rounds of PMU event counting in one trigger.
>> +
>> +Hardware overview
>> +=================
>> +
>> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
>> +assistant to access the core PMUs on its die and move the counter readings to
>> +memory. An overview of PMCU's hardware organization is shown below::
>> +
>> +                                +--------------------+
>> +                                |       Memory       |
>> +                                | +------+ +-------+ |
>> +                   +--------+   | |Events| |Samples| |
>> +                   |  PMCU  |   | +------+ +-------+ |
>> +                   +---|----+   +---------|----------+
>> +                       |                  |
>> +        =======================================================  Bus
>> +                   |                         |               |
>> +        +----------|----------+   +----------|----------+    |
>> +        | +------+ | +------+ |   | +------+ | +------+ |    |
>> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
>> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
>> +        |    +-----+----+     |   |    +-----+----+     |  clusters
>> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
>> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
>> +        | +------+   +------+ |   | +------+   +------+ |
>> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
>> +        +---------------------+   +---------------------+
>> +
>> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
>> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
>> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
>> +events for a while, and move the counter readings back to memory.
>> +
>> +Once triggered, PMCU performs a number of loops and processes a number of
>> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
>> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
>> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
>> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
>> +stops all the counters, and moves the counter readings to memory, before
>> +handling the next ``nr_pmu`` events if there are more events to process in this
>> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
>> +the number of events to process depends on user inputs. The counters are
>> +stopped when PMCU reads counters and switches events, so there is a tiny time
>> +window during which the events are not counted.
> I'm not clear from this description whether there is 'skew' between the counters
> (beyond the normal issues from uarch).  Does the PMCU stop all counters
> then read them all (minimizing skew) or does it stop each CPUs set of counters
> and read those, or stop each individual counter before reading?
>
> My impression is that this feature is meant to be left running over timescales
> much longer than the sampling period so it may not be necessary to align the
> different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.
>
Thanks for pointing this out.

The PMCU stops all the counters before reading any of them (i.e. the
first case you described).

The basic procedure is:
     start counters -> wait -> stop counters -> read and reset counters
     -> switch events -> start counters -> ...
where each step applies to all CPUs and counters.

The counters don't count during the tiny stop-start window.
I guess a small improvement would be: reset -> read -> switch -> reset
-> ..., while the counters keep running, but we would still lose some
event counts between read and reset, so there is no fundamental difference.
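
As a rough model of the procedure above (a simplification only, it says
nothing about the hardware's data layout or timing), one loop over the
events can be sketched like this. The struct and function names are
hypothetical.

```c
#include <stdint.h>

struct pmcu_sim {
	uint32_t nr_rounds;	/* start/wait/stop rounds completed */
	uint32_t nr_readings;	/* counter readings dumped to memory */
};

/*
 * Model one loop: events are handled nr_pmu at a time, and every step
 * applies to all counters on all nr_cpu CPUs before the next begins.
 */
static struct pmcu_sim pmcu_sim_loop(uint32_t nr_ev_per_sample,
				     uint32_t nr_pmu, uint32_t nr_cpu)
{
	struct pmcu_sim sim = { 0, 0 };
	uint32_t first;

	if (nr_pmu == 0)
		return sim;

	for (first = 0; first < nr_ev_per_sample; first += nr_pmu) {
		/* start all counters -> wait -> stop all counters */
		sim.nr_rounds++;
		/* read and reset: one reading per counter per CPU */
		sim.nr_readings += nr_pmu * nr_cpu;
		/* switch to the next nr_pmu events (counters not counting) */
	}

	return sim;
}
```

For instance, 16 events with 8 counters on 4 CPUs take 2 rounds and
produce 64 readings in this model, with one uncounted window per round.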

Regards,
Jie

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
  2023-03-24  9:32       ` Jie Zhan
@ 2023-03-24 12:14         ` Jonathan Cameron
  -1 siblings, 0 replies; 32+ messages in thread
From: Jonathan Cameron @ 2023-03-24 12:14 UTC (permalink / raw)
  To: Jie Zhan
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users

On Fri, 24 Mar 2023 17:32:15 +0800
Jie Zhan <zhanjie9@hisilicon.com> wrote:

> On 17/03/2023 21:37, Jonathan Cameron wrote:
> > On Mon, 6 Feb 2023 14:51:43 +0800
> > Jie Zhan <zhanjie9@hisilicon.com> wrote:
> >  
> >> Document the overview and usage of HiSilicon PMCU.
> >>
> >> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> >> PMU accesses from CPUs, handling the configuration, event switching, and
> >> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> >> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
> >> scheme may lose events or drop sampling frequency. With PMCU, users can
> >> reliably obtain the data of up to 240 PMU events with the sample interval
> >> of events down to 1ms, while the software overhead of accessing PMUs, as
> >> well as its impact on target workloads, is reduced.
> >>
> >> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>  
> > Nice documentation. I've read this a few times before, but on this read
> > through wondered if we could say anything about the skew between capture
> > of the counters.  Not that important though so I'm happy to add
> >
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >
> > though this may of course need updating significantly as the interface
> > is refined (the RFC question you raised for example in the cover letter).
> >
> > Thanks
> >
> > Jonathan
> >  
> >> ---
> >>   Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
> >>   Documentation/admin-guide/perf/index.rst     |   1 +
> >>   2 files changed, 184 insertions(+)
> >>   create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
> >>
> >> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
> >> new file mode 100644
> >> index 000000000000..50d17cbd0049
> >> --- /dev/null
> >> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
> >> @@ -0,0 +1,183 @@
> >> +.. SPDX-License-Identifier: GPL-2.0
> >> +
> >> +==========================================
> >> +HiSilicon Performance Monitor Control Unit
> >> +==========================================
> >> +
> >> +Introduction
> >> +============
> >> +
> >> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
> >> +PMU accesses from CPUs, handling the configuration, event switching, and
> >> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
> >> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
> >> +scheme may lose events or drop sampling frequency. With PMCU, users can
> >> +reliably obtain the data of up to 240 PMU events with the sample interval
> >> +of events down to 1ms, while the software overhead of accessing PMUs, as
> >> +well as its impact on target workloads, is reduced.
> >> +
> >> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
> >> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
> >> +CPU die ID. When triggered, PMCU reads event IDs and pass them to PMUs in all
> >> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
> >> +events, waits for a time interval, and stops them. The PMU counter readings are
> >> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
> >> +the ``perf.data`` file in the user space. PMCU automatically switches events
> >> +(when there are more events than available PMU counters) and completes multiple
> >> +rounds of PMU event counting in one trigger.
> >> +
> >> +Hardware overview
> >> +=================
> >> +
> >> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
> >> +assistant to access the core PMUs on its die and move the counter readings to
> >> +memory. An overview of PMCU's hardware organization is shown below::
> >> +
> >> +                                +--------------------+
> >> +                                |       Memory       |
> >> +                                | +------+ +-------+ |
> >> +                   +--------+   | |Events| |Samples| |
> >> +                   |  PMCU  |   | +------+ +-------+ |
> >> +                   +---|----+   +---------|----------+
> >> +                       |                  |
> >> +        =======================================================  Bus
> >> +                   |                         |               |
> >> +        +----------|----------+   +----------|----------+    |
> >> +        | +------+ | +------+ |   | +------+ | +------+ |    |
> >> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
> >> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
> >> +        |    +-----+----+     |   |    +-----+----+     |  clusters
> >> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
> >> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
> >> +        | +------+   +------+ |   | +------+   +------+ |
> >> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
> >> +        +---------------------+   +---------------------+
> >> +
> >> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
> >> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
> >> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
> >> +events for a while, and move the counter readings back to memory.
> >> +
> >> +Once triggered, PMCU performs a number of loops and processes a number of
> >> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
> >> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
> >> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
> >> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
> >> +stops all the counters, and moves the counter readings to memory, before
> >> +handling the next ``nr_pmu`` events if there are more events to process in this
> >> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
> >> +the number of events to process depends on user inputs. The counters are
> >> +stopped when PMCU reads counters and switches events, so there is a tiny time
> >> +window during which the events are not counted.  
> > I'm not clear from this description whether there is 'skew' between the counters
> > (beyond the normal issues from uarch).  Does the PMCU stop all counters
> > then read them all (minimizing skew) or does it stop each CPUs set of counters
> > and read those, or stop each individual counter before reading?
> >
> > My impression is that this feature is meant to be left running over timescales
> > much longer than the sampling period so it may not be necessary to align the
> > different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.
> >  
> Thanks for pointing this out.
> 
> The PMCU stops all the counters before reading any counters (i.e. the 
> first case you said).
> 
> The basic procedure is:
>      start counters -> wait -> stop counters -> read and reset counters
>      -> switch events -> start counters -> ...
> where each step applies to all CPUs and counters.

Great. So this applies across all cores on a die, and skew should be minimized
(at the cost of missing more events than a skew-heavy approach).

> 
> The counters don't count during the tiny stop-start window.
> I guess a small improvement would be: reset -> read -> switch -> reset 
> -> ..., while the counters keep running,  
> but we would still lose some event counts between read and reset, so there
> is no fundamental difference.

There are lots of ways to reduce both skew and missed counts, but I think you
are right that none of them matter for the intended long-term monitoring
use case.
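
As a rough illustration of the missed-count side of that trade-off, here
is a back-of-the-envelope estimate. The 1 ms interval is the minimum
quoted in the cover letter; the 10 us stop-start window is purely an
assumed figure, since the thread does not give the real one, and
uncounted_fraction() is a hypothetical helper.

```python
def uncounted_fraction(wait_us, gap_us):
    """Fraction of each round during which the counters are stopped."""
    if wait_us < 0 or gap_us < 0 or wait_us + gap_us == 0:
        raise ValueError("durations must be non-negative and not both zero")
    return gap_us / (wait_us + gap_us)


if __name__ == "__main__":
    # 1 ms counting interval; the 10 us gap is an assumption, not a
    # measured or documented value.
    fraction = uncounted_fraction(1000.0, 10.0)
    print(f"{fraction:.2%} of each round is not counted")
```

For intervals much longer than the stop-start window the fraction tends
to zero, which matches the point about long-term monitoring.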

Jonathan

> 
> Regards,
> Jie


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU
  2023-03-24 12:14         ` Jonathan Cameron
@ 2023-03-25  2:48           ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-03-25  2:48 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	shenyang39, hejunhao3, yangyicong, prime.zeng, suntao25,
	jiazhao4, linuxarm, linux-doc, linux-kernel, linux-arm-kernel,
	linux-perf-users



On 24/03/2023 20:14, Jonathan Cameron wrote:
> On Fri, 24 Mar 2023 17:32:15 +0800
> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
>> On 17/03/2023 21:37, Jonathan Cameron wrote:
>>> On Mon, 6 Feb 2023 14:51:43 +0800
>>> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>>>   
>>>> Document the overview and usage of HiSilicon PMCU.
>>>>
>>>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>>> PMU accesses from CPUs, handling the configuration, event switching, and
>>>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>>>> scheme may lose events or drop sampling frequency. With PMCU, users can
>>>> reliably obtain the data of up to 240 PMU events with the sample interval
>>>> of events down to 1ms, while the software overhead of accessing PMUs, as
>>>> well as its impact on target workloads, is reduced.
>>>>
>>>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
>>> Nice documentation. I've read this a few times before, but on this read
>>> through wondered if we could say anything about the skew between capture
>>> of the counters.  Not that important though so I'm happy to add
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>
>>> though this may of course need updating significantly as the interface
>>> is refined (the RFC question you raised for example in the cover letter).
>>>
>>> Thanks
>>>
>>> Jonathan
>>>   
>>>> ---
>>>>    Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
>>>>    Documentation/admin-guide/perf/index.rst     |   1 +
>>>>    2 files changed, 184 insertions(+)
>>>>    create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>>>
>>>> diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
>>>> new file mode 100644
>>>> index 000000000000..50d17cbd0049
>>>> --- /dev/null
>>>> +++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
>>>> @@ -0,0 +1,183 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0
>>>> +
>>>> +==========================================
>>>> +HiSilicon Performance Monitor Control Unit
>>>> +==========================================
>>>> +
>>>> +Introduction
>>>> +============
>>>> +
>>>> +HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>>> +PMU accesses from CPUs, handling the configuration, event switching, and
>>>> +counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>>> +and multi-PMU-event CPU profiling, in which scenario the current ``perf``
>>>> +scheme may lose events or drop sampling frequency. With PMCU, users can
>>>> +reliably obtain the data of up to 240 PMU events with the sample interval
>>>> +of events down to 1ms, while the software overhead of accessing PMUs, as
>>>> +well as its impact on target workloads, is reduced.
>>>> +
>>>> +Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
>>>> +PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
>>>> +CPU die ID. When triggered, PMCU reads event IDs and pass them to PMUs in all
>>>> +CPUs on the CPU die it is on. PMCU then starts the counters for counting
>>>> +events, waits for a time interval, and stops them. The PMU counter readings are
>>>> +dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
>>>> +the ``perf.data`` file in the user space. PMCU automatically switches events
>>>> +(when there are more events than available PMU counters) and completes multiple
>>>> +rounds of PMU event counting in one trigger.
>>>> +
>>>> +Hardware overview
>>>> +=================
>>>> +
>>>> +On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
>>>> +assistant to access the core PMUs on its die and move the counter readings to
>>>> +memory. An overview of PMCU's hardware organization is shown below::
>>>> +
>>>> +                                +--------------------+
>>>> +                                |       Memory       |
>>>> +                                | +------+ +-------+ |
>>>> +                   +--------+   | |Events| |Samples| |
>>>> +                   |  PMCU  |   | +------+ +-------+ |
>>>> +                   +---|----+   +---------|----------+
>>>> +                       |                  |
>>>> +        =======================================================  Bus
>>>> +                   |                         |               |
>>>> +        +----------|----------+   +----------|----------+    |
>>>> +        | +------+ | +------+ |   | +------+ | +------+ |    |
>>>> +        | |Core 0| | |Core 1| |   | |Core 0| | |Core 1| |    |
>>>> +        | +--|---+ | +--|---+ |   | +--|---+ | +--|---+ |  (More
>>>> +        |    +-----+----+     |   |    +-----+----+     |  clusters
>>>> +        | +--|---+   +--|---+ |   | +--|---+   +--|---+ |  ...)
>>>> +        | |Core 2|   |Core 3| |   | |Core 2|   |Core 3| |
>>>> +        | +------+   +------+ |   | +------+   +------+ |
>>>> +        |    CPU Cluster 0    |   |    CPU Cluster 1    |
>>>> +        +---------------------+   +---------------------+
>>>> +
>>>> +On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
>>>> +CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
>>>> +The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
>>>> +events for a while, and move the counter readings back to memory.
>>>> +
>>>> +Once triggered, PMCU performs a number of loops and processes a number of
>>>> +events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
>>>> +``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
>>>> +``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
>>>> +where PMCU resides. Then, PMCU starts all the counters, waits for a period,
>>>> +stops all the counters, and moves the counter readings to memory, before
>>>> +handling the next ``nr_pmu`` events if there are more events to process in this
>>>> +loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
>>>> +the number of events to process depends on user inputs. The counters are
>>>> +stopped when PMCU reads counters and switches events, so there is a tiny time
>>>> +window during which the events are not counted.
>>> I'm not clear from this description whether there is 'skew' between the counters
>>> (beyond the normal issues from uarch).  Does the PMCU stop all counters
>>> then read them all (minimizing skew) or does it stop each CPUs set of counters
>>> and read those, or stop each individual counter before reading?
>>>
>>> My impression is that this feature is meant to be left running over timescales
>>> much longer than the sampling period so it may not be necessary to align the
>>> different lines on the resulting graphs perfectly.  Hence maybe this doesn't matter.
>>>   
>> Thanks for pointing this out.
>>
>> The PMCU stops all the counters before reading any counters (i.e. the
>> first case you said).
>>
>> The basic procedure is:
>>       start counters -> wait -> stop counters -> read and reset counters
>>       -> switch events -> start counters -> ...
>> where each step applies to all CPUs and counters.
> Great. So this applies across all cores on a die, and skew should be minimized
> (at the cost of missing more events than a skew-heavy approach).
>
>> The counters don't count during the tiny stop-start window.
>> I guess a small improvement would be: reset -> read -> switch -> reset
>> -> ..., while the counters keep running,
>> but we still lose some event counts between read and reset, and thus, no
>> fundamental difference.
> Lots of ways to reduce both skew and missed counts, but I think you are
> right in that none of them matter for the intended long term monitoring
> usecase.
>
> Jonathan
Yeah, it focuses more on general workload characteristics than on
time-sensitive, precise program analysis.
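
As a rough illustration of why the stop window matters little here, the
procedure above can be modelled in a few lines of userspace C. This is only a
sketch: the function names are made up for this mail, and the counter count
and gap length used in the note below are assumptions, not values taken from
the hardware or the driver.

```c
#include <stdint.h>

/*
 * Number of start/stop passes needed to count every event once, given
 * nr_pmu counters per CPU (ceiling division). Hypothetical helper.
 */
uint32_t pmcu_passes_per_loop(uint32_t nr_events, uint32_t nr_pmu)
{
	if (nr_pmu == 0)
		return 0;
	return (nr_events + nr_pmu - 1) / nr_pmu;
}

/*
 * Share of wall time, in parts per million, during which the counters
 * are stopped, for a counting period and a stop-to-start gap that are
 * both given in microseconds. Hypothetical helper.
 */
uint64_t pmcu_uncounted_ppm(uint64_t period_us, uint64_t gap_us)
{
	uint64_t total = period_us + gap_us;

	if (total == 0)
		return 0;
	return gap_us * 1000000ULL / total;
}
```

With 240 events and an assumed 8 counters per CPU, one loop takes 30 passes.
With a counting period close to 1ms and an assumed 1us gap, roughly 0.1% of
the time goes uncounted, which is small next to the run lengths this feature
is aimed at.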

Jie

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 2/4] drivers/perf: hisi: Add driver support for HiSilicon PMCU
  2023-03-17 14:52     ` Jonathan Cameron
@ 2023-03-25 10:21       ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-03-25 10:21 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users



On 17/03/2023 22:52, Jonathan Cameron wrote:
> On Mon, 6 Feb 2023 14:51:44 +0800
> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>> PMU accesses from CPUs, handling the configuration, event switching, and
>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>> scheme may lose events or drop sampling frequency. With PMCU, users can
>> reliably obtain the data of up to 240 PMU events with the sample interval
>> of events down to 1ms, while the software overhead of accessing PMUs, as
>> well as its impact on target workloads, is reduced.
>>
>> This driver enables the usage of PMCU through the perf_event framework.
>> PMCU is registered as a PMU device and utilises the AUX buffer to dump data
>> directly. Users can start PMCU sampling through 'perf-record'. Event
>> numbers are passed by a sysfs interface.
>>
>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
> Hi Jie,
>
> A few minor comments inline.
> Whilst I looked at this internally, that was a while back so I've
> found a few new things to point out in what I think is a pretty good/clean driver.
> The main thing here is the RFC questions you've raised in the cover letter
> of course - particularly the one around mediating who has the counters between
> this and the normal PMU driver.
>
> Thanks,
>
> Jonathan
Hi Jonathan,

Many thanks for the review again.

Happy to accept all the comments. I have updated the driver based on them.

One reply below.

Jie


...
>> +static const struct attribute_group hisi_pmcu_format_attr_group = {
>> +	.name = "format",
>> +	.attrs = hisi_pmcu_format_attrs,
>> +};
>> +
>> +static ssize_t monitored_cpus_show(struct device *dev,
>> +				   struct device_attribute *attr, char *buf)
>> +{
>> +	struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev));
>> +
>> +	return sysfs_emit(buf, "%d-%d\n",
>> +			  cpumask_first(&hisi_pmcu->cpus),
>> +			  cpumask_last(&hisi_pmcu->cpus));
> What does this do about offline CPUs?
> Should it include them or not?
PMCU takes care of offline CPUs as well, and the event counts from offline
CPUs should show as zeroes in the output.

hisi_pmcu->cpus contains only the online CPUs monitored by the PMCU, so the
"monitored_cpus" interface needs improving here.

"monitored_cpus" should either show all the CPUs monitored, online and
offline, or, if it is meant to show only online CPUs, be a comma-separated
list representing the hisi_pmcu->cpus mask rather than a first-last range,
which silently covers any offline CPUs in the middle.
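
To show the difference between the two output styles, here is a small
userspace model of list-style formatting. It is a sketch only:
cpu_mask_to_list() is a made-up helper and a plain 64-bit word stands in for
the cpumask. In the driver itself, the existing "%*pbl" format with
cpumask_pr_args(), or cpumap_print_to_pagebuf(), would presumably be the
natural way to get the same output.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a CPU mask as a range list such as "0-1,3". Bit i set in
 * 'mask' means CPU i is present. Returns the number of characters
 * written, not counting the terminating NUL. Output that does not
 * fit in 'buf' is dropped at a range boundary.
 */
size_t cpu_mask_to_list(uint64_t mask, char *buf, size_t len)
{
	size_t off = 0;
	int cpu = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	while (cpu < 64) {
		int start, end, n;

		if (!(mask & (1ULL << cpu))) {
			cpu++;
			continue;
		}
		/* Found the start of a run of set bits; find its end. */
		start = cpu;
		while (cpu < 64 && (mask & (1ULL << cpu)))
			cpu++;
		end = cpu - 1;

		if (start == end)
			n = snprintf(buf + off, len - off, "%s%d",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, len - off, "%s%d-%d",
				     off ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0';	/* discard the partial range */
			break;
		}
		off += (size_t)n;
	}
	return off;
}
```

For a mask with CPUs 0, 1 and 3 set, this gives "0-1,3", whereas printing
first-last as in v1 gives "0-3" and hides the fact that CPU 2 is missing.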

Will fix this in V2.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit
  2023-03-17 13:11     ` Jonathan Cameron
@ 2023-04-19  8:01       ` Jie Zhan
  -1 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-04-19  8:01 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users, Rob Herring



On 17/03/2023 21:11, Jonathan Cameron wrote:
> On Mon, 27 Feb 2023 16:49:46 +0800
> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
>> Please can anyone have a look at this PMCU patchset and provide some
>> comments?
>>
>> It is much related to the ARM PMU.
>>
>> We are looking forward to the feedback.
>>
>> Any relevant comments/questions, with respect to software or hardware
>> design, use cases, coding, are welcome.
>>
>> Kind regards,
>>
>> Jie
>>
>>
>> On 06/02/2023 14:51, Jie Zhan wrote:
>>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>> PMU accesses from CPUs, handling the configuration, event switching, and
>>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>>> scheme may lose events or drop sampling frequency. With PMCU, users can
>>> reliably obtain the data of up to 240 PMU events with the sample interval
>>> of events down to 1ms, while the software overhead of accessing PMUs, as
>>> well as its impact on target workloads, is reduced.
>>>
>>> This patchset contains the documentation, driver, and user perf tool
>>> support to enable using PMCU with the 'perf_event' framework.
>>>
>>> Here are two key questions requested for comments:
>>>
>>> - How do we make it compatible with arm_pmu drivers?
>>>
>>>     Hardware-wise, PMCU uses the existing core PMUs, so PMUs can be accessed
>>>     from CPU and PMCU simultaneously. The current hardware can't guarantee
>>>     mutual exclusive accesses. Hence, scheduling arm_pmu and PMCU events at
>>>     the same time may mess up the operation of PMUs, delivering incorrect
>>>     data for both events, e.g. unexpected events or sample periods.
>>>     Software-wise, we probably need to prevent the two types of events from
>>>     running at the same time, but currently there isn't a clear solution.
Hi Jonathan,

Sorry for the late reply on this, but I have thought a bit more about this
issue recently.
> I've been thinking about this a bit and don't have a good answer yet.
>
> So some thoughts that might get some discussion going (some are here
> mostly to be shot down ;)
>
> 1. I suspect adding a hook into the specific pmu driver to reserve a counter is going
>     to be controversial for this usecase.  But maybe there is a more generic
>     way...  There are lock up detectors that use PMU counters and ensure the counters
>     aren't also used for other purposes and that leads me to wonder if you can use
> https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L12700
> perf_event_create_kernel_counter()
> to do the same as opening a counter from userspace but then not use it.
> I have no idea if this will work though or if enabling the event would be necessary
> to prevent it being used elsewhere.
KVM actually does a similar thing. It inserts a call in armpmu_register() to
save a reference to struct arm_pmu, so as to get some information about the
arm_pmu, e.g. its pmu type. With the pmu type, it can create arm_pmu events
through perf_event_create_kernel_counter(). We could turn this into a general
(read-only) interface, so that other kernel code, not just KVM, can get the
arm_pmu data.

In addition, PMCU needs to occupy specific counters, while the arm_pmu driver
currently takes the first free counter it finds in the counter bitmap (see
armv8pmu_get_event_idx()). Thus, we may have to add a mechanism for an event
to optionally specify the counter index it wants to use. Adding a config field
and adapting armv8pmu_get_event_idx() should work.
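
The idea can be sketched with a userspace model of a counter allocator. This
is not the arm_pmu code: alloc_counter() and PMCU_ANY_COUNTER are hypothetical
names, and the model ignores the cycle counter and chained events that the
real armv8pmu_get_event_idx() has to handle.

```c
#include <stdint.h>

#define PMCU_ANY_COUNTER (-1)	/* hypothetical: caller has no preference */

/*
 * '*used' is a bitmap of busy counters, at most 32 of them.
 * With want_idx == PMCU_ANY_COUNTER, take the first free counter,
 * which models today's behaviour. Otherwise succeed only if that
 * exact counter is free. Returns the index claimed, or -1.
 */
int alloc_counter(uint32_t *used, int nr_counters, int want_idx)
{
	int idx;

	if (nr_counters < 0 || nr_counters > 32)
		return -1;

	if (want_idx != PMCU_ANY_COUNTER) {
		if (want_idx < 0 || want_idx >= nr_counters)
			return -1;
		if (*used & (1U << want_idx))
			return -1;
		*used |= 1U << want_idx;
		return want_idx;
	}

	for (idx = 0; idx < nr_counters; idx++) {
		if (!(*used & (1U << idx))) {
			*used |= 1U << idx;
			return idx;
		}
	}
	return -1;
}
```

An "occupying" event would pass the index PMCU is about to use, while ordinary
events would keep passing PMCU_ANY_COUNTER and see no change in behaviour.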

A trickier part would be preventing the "occupying" events from being
scheduled out. I don't think this is a friendly thing to do, and the
perf_event framework doesn't seem to support it (even with the "pinned"
attribute, the event would still be switched out when another "pinned" event
comes along). However, any "occupying" event being scheduled out should cause
PMCU to stop, and I think this would undermine the advantage of PMCU.
> 2. It might be possible to reuse any of the infrastructure that exists
>     for userspace PMU counter access or maybe Rob Herring (+CC) has a suggestion based on
>     his work on that feature.
>
> 3. It's not nice, but maybe could enforce this constraint just in userspace?
>     We'd have to make sure that both drivers didn't do anything beyond not working
>     correctly if the other driver is messing with the hardware.
I actually think this is fine? So far, we haven't found any problem from
running PMCU and ARM_PMU simultaneously beyond getting wrong readings. PMCU is
designed for system administrative use only. PMCU can also use a subset of PMU
counters with the higher indices, and the rest of the counters, with the lower
indices, can still be exposed to EL0 or EL1. Thus, this approach should also
work, provided that: a) EL0 or EL1 can only access the subset of counters with
the lower indices, and b) system administrative programs don't use ARM_PMU and
PMCU at the same time, or don't do anything harmful when getting abnormal PMU
readings.
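
Condition a) amounts to splitting the counter indices into two disjoint masks.
A minimal userspace sketch follows; struct counter_split and split_counters()
are hypothetical names, and the counter numbers in the note below are
assumptions rather than the real configuration.

```c
#include <stdint.h>

/* Disjoint counter masks for the two users of the core PMU. */
struct counter_split {
	uint32_t arm_pmu_mask;	/* lower indices, usable from EL0/EL1 */
	uint32_t pmcu_mask;	/* higher indices, owned by PMCU */
};

/*
 * Give the top nr_pmcu of nr_counters counters (at most 32) to PMCU
 * and leave the rest to arm_pmu. Returns empty masks on bad input.
 */
struct counter_split split_counters(unsigned int nr_counters,
				    unsigned int nr_pmcu)
{
	struct counter_split s = { 0, 0 };
	unsigned int low;
	uint32_t all;

	if (nr_counters > 32 || nr_pmcu > nr_counters)
		return s;

	all = nr_counters == 32 ? UINT32_MAX : (1U << nr_counters) - 1;
	low = nr_counters - nr_pmcu;
	s.arm_pmu_mask = low == 32 ? UINT32_MAX : (1U << low) - 1;
	s.pmcu_mask = all & ~s.arm_pmu_mask;
	return s;
}
```

With 6 counters of which PMCU takes 4, arm_pmu keeps counters 0-1 (mask 0x03)
and PMCU gets counters 2-5 (mask 0x3c).
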
>
> 4. We can't do the nasty trick of providing a second driver that binds to the
>     PMU hardware to prevent it being used because I think the main arm PMU
>     driver has suppress_bind_attrs = true.  Maybe we can make remove work?
>     (original patch for this in 2018 added that line because of a crash on remove
>      - not sure anyone looked at fixing the crash).
We still prefer to keep at least part of the ARM PMU counters in service while
running PMCU in some scenarios. Unbinding the ARM PMU driver would go against
that.

Thanks!
Jie Zhan

>>> - Currently we rely on a sysfs file for users to input event numbers. Is
>>>     there a better way to pass many events?
>>>
>>>     The perf framework only allows three 64-bit config fields for custom PMU
>>>     configs. Obviously, this can't satisfy our need for passing many events
>>>     at a time. As an event number is 16-bit wide, the config fields can only
>>>     take up to 12 events at a time, or up to 192 events even if we do a
>>>     bitmap of events (and there are more than 192 available event numbers).
>>>     Hence, the current design takes an array of event numbers from a sysfs
>>>     file before starting profiling. However, this may go against the common
>>>     way to schedule perf events through perf commands.
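
The limits quoted above follow directly from the field sizes. The sketch below
spells out the arithmetic and shows what packing event numbers into the three
fields (config, config1 and config2 in struct perf_event_attr) could look
like. pack_events() and the enum names are hypothetical, and this is not how
the v1 driver passes events.

```c
#include <stdint.h>
#include <stddef.h>

enum {
	NR_CONFIG_FIELDS  = 3,	/* config, config1, config2 */
	CONFIG_FIELD_BITS = 64,
	EVENT_ID_BITS     = 16,
	EVENTS_PER_FIELD  = CONFIG_FIELD_BITS / EVENT_ID_BITS,	   /* 4 */
	MAX_PACKED_EVENTS = NR_CONFIG_FIELDS * EVENTS_PER_FIELD,   /* 12 */
	MAX_BITMAP_EVENTS = NR_CONFIG_FIELDS * CONFIG_FIELD_BITS,  /* 192 */
};

/*
 * Pack up to MAX_PACKED_EVENTS 16-bit event numbers into cfg[],
 * lowest 16 bits of cfg[0] first. Returns 0 on success, or -1 if
 * there are more events than the three fields can hold.
 */
int pack_events(const uint16_t *events, size_t nr,
		uint64_t cfg[NR_CONFIG_FIELDS])
{
	size_t i;

	if (nr > (size_t)MAX_PACKED_EVENTS)
		return -1;

	for (i = 0; i < (size_t)NR_CONFIG_FIELDS; i++)
		cfg[i] = 0;

	for (i = 0; i < nr; i++)
		cfg[i / EVENTS_PER_FIELD] |= (uint64_t)events[i]
			<< (EVENT_ID_BITS * (i % EVENTS_PER_FIELD));
	return 0;
}
```

A thirteenth event already fails to fit, which is why the v1 design falls back
to a sysfs file for the event list.
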
>>>
>>> Jie Zhan (4):
>>>     docs: perf: Add documentation for HiSilicon PMCU
>>>     drivers/perf: hisi: Add driver support for HiSilicon PMCU
>>>     perf tool: Add HiSilicon PMCU data recording support
>>>     perf tool: Add HiSilicon PMCU data decoding support
>>>
>>>    Documentation/admin-guide/perf/hisi-pmcu.rst |  183 +++
>>>    Documentation/admin-guide/perf/index.rst     |    1 +
>>>    drivers/perf/hisilicon/Kconfig               |   15 +
>>>    drivers/perf/hisilicon/Makefile              |    1 +
>>>    drivers/perf/hisilicon/hisi_pmcu.c           | 1096 ++++++++++++++++++
>>>    tools/perf/arch/arm/util/auxtrace.c          |   61 +
>>>    tools/perf/arch/arm64/util/Build             |    2 +-
>>>    tools/perf/arch/arm64/util/hisi-pmcu.c       |  145 +++
>>>    tools/perf/util/Build                        |    1 +
>>>    tools/perf/util/auxtrace.c                   |    4 +
>>>    tools/perf/util/auxtrace.h                   |    1 +
>>>    tools/perf/util/hisi-pmcu.c                  |  305 +++++
>>>    tools/perf/util/hisi-pmcu.h                  |   19 +
>>>    13 files changed, 1833 insertions(+), 1 deletion(-)
>>>    create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>>    create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
>>>    create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
>>>    create mode 100644 tools/perf/util/hisi-pmcu.c
>>>    create mode 100644 tools/perf/util/hisi-pmcu.h
>>>
>>>
>>> base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit
@ 2023-04-19  8:01       ` Jie Zhan
  0 siblings, 0 replies; 32+ messages in thread
From: Jie Zhan @ 2023-04-19  8:01 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: will, mark.rutland, mathieu.poirier, suzuki.poulose, mike.leach,
	leo.yan, john.g.garry, james.clark, peterz, mingo, acme, corbet,
	zhangshaokun, shenyang39, hejunhao3, yangyicong, prime.zeng,
	suntao25, jiazhao4, linuxarm, linux-doc, linux-kernel,
	linux-arm-kernel, linux-perf-users, Rob Herring



On 17/03/2023 21:11, Jonathan Cameron wrote:
> On Mon, 27 Feb 2023 16:49:46 +0800
> Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
>> Please can anyone have a look at this PMCU patchset and provide some
>> comments?
>>
>> It is much related to the ARM PMU.
>>
>> We are looking forward to the feedback.
>>
>> Any relevant comments/questions, with respect to software or hardware
>> design, use cases, coding, are welcome.
>>
>> Kind regards,
>>
>> Jie
>>
>>
>> On 06/02/2023 14:51, Jie Zhan wrote:
>>> HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
>>> PMU accesses from CPUs, handling the configuration, event switching, and
>>> counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
>>> and multi-PMU-event CPU profiling, in which scenario the current 'perf'
>>> scheme may lose events or drop sampling frequency. With PMCU, users can
>>> reliably obtain the data of up to 240 PMU events with the sample interval
>>> of events down to 1ms, while the software overhead of accessing PMUs, as
>>> well as its impact on target workloads, is reduced.
>>>
>>> This patchset contains the documentation, driver, and user perf tool
>>> support to enable using PMCU with the 'perf_event' framework.
>>>
>>> Here are two key questions requested for comments:
>>>
>>> - How do we make it compatible with arm_pmu drivers?
>>>
>>>     Hardware-wise, PMCU uses the existing core PMUs, so PMUs can be accessed
>>>     from CPU and PMCU simultaneously. The current hardware can't guarantee
>>>     mutual exclusive accesses. Hence, scheduling arm_pmu and PMCU events at
>>>     the same time may mess up the operation of PMUs, delivering incorrect
>>>     data for both events, e.g. unexpected events or sample periods.
>>>     Software-wise, we probably need to prevent the two types of events from
>>>     running at the same time, but currently there isn't a clear solution.
Hi Jonathan,

Sorry for a late reply on this, but I have thought a bit more on this 
issue recently.
> I've been thinking about this a bit and don't have a good answer yet.
>
> So some thoughts that might get some discussion going (some are here
> mostly to be shot down ;)
>
> 1. I suspect adding a hook into the specific pmu driver to reserve a counter is going
>     to be controversial for this usecase.  But maybe there is a more generic
>     way...  There are lock up detectors that use PMU counters and ensure the counters
>     aren't also used for other purposes and that leads me to wonder if you can use
> https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L12700
> perf_event_create_kernel_counter()
> to do the same as opening a counter from userspace but then not use it.
> I have no idea if this will work though or if enabling the event would be necessary
> to prevent it being used elsewhere.
KVM is actually doing a similar thing. KVM inserts a call in 
armpmu_register() to save
a reference to struct arm_pmu, so as to get some information of arm_pmu, 
e.g. its pmu
type. With the pmu type, it can issue arm_pmu events through 
perf_event_create_kernel_counter().
Now we can make a general interface of this (supposed to be read-only), 
enabling other
kernel code to get the data of arm_pmu, not just for kvm.

In addition, PMCU needs to occupy certain counters, while the arm_pmu 
driver currently
gets the first free counter it finds in the counter bitmap (see 
armv8pmu_get_event_idx()).
Thus, we may have to add a mechanism to optionally specify a counter 
index that an event
wants to use. Adding a config field and adapting 
armv8pmu_get_event_idx() should work.

A more tricky work would be preventing the "occupying" events from being 
scheduled out.
I don't think this is a friendly action, and the perf_event framework 
doesn't seem to
support so (even if we add the "pinned" attribute, the event would also 
be switched out
when there comes another "pinned" event). However, any "occupying" 
events being scheduled
out should cause PMCU to stop, and I think this would undermine the 
advantage of PMCU.
> 2. It might be possible to reuse any of the infrastructure that exists
>     for userspace PMU counter access or maybe Rob Herring (+CC) has a suggestion based on
>     his work on that feature.
>
> 3. It's not nice, but maybe could enforce this constraint just in userspace?
>     We'd have to make sure that both drivers didn't do anything beyond not working
>     correctly if the other driver is messing with the hardware.
I actually think this is fine? So far, we haven't identified or found 
any problem from
running PMCU and ARM_PMU simultaneously beyond getting wrong readings. 
PMCU is designed
for system administrative use only. PMCU can also use a subset of PMU 
counters with higher
indices, and the reset of counters with lower indices can still be 
exposed to EL0 or EL1.
Thus, this approach should also work, providing that: a) EL0 or EL1 can 
only access a subset
of counters with the lower indices, and b) system administrative 
programs don't use ARM_PMU
and PMCU at the same time, or don't do anything harmful when getting 
abnormal PMU readings.
>
> 4. We can't do the nasty trick of providing a second driver that binds to the
>     PMU hardware to prevent it being used because I think the main arm PMU
>     driver has suppress_bind_attrs = true.  Maybe we can make remove work?
>     (original patch for this in 2018 added that line because of a crash on remove
>      - not sure anyone looked at fixing the crash).
We still prefer to keep at least part of ARM PMU counters in service 
while running PMCU
in some scenarios. Unbinding the ARM PMU driver would go against that.

Thanks!
Jie Zhan

>>> - Currently we reply on a sysfs file for users to input event numbers. Is
>>>     there a better way to pass many events?
>>>
>>>     The perf framework only allows three 64-bit config fields for custom PMU
>>>     configs. Obviously, this can't satisfy our need for passing many events
>>>     at a time. As an event number is 16-bit wide, the config fields can only
>>>     take up to 12 events at a time, or up to 192 events even if we do a
>>>     bitmap of events (and there are more than 192 available event numbers).
>>>     Hence, the current design takes an array of event numbers from a sysfs
>>>     file before starting profiling. However, this may go against the common
>>>     way to schedule perf events through perf commands.
>>>
>>> Jie Zhan (4):
>>>     docs: perf: Add documentation for HiSilicon PMCU
>>>     drivers/perf: hisi: Add driver support for HiSilicon PMCU
>>>     perf tool: Add HiSilicon PMCU data recording support
>>>     perf tool: Add HiSilicon PMCU data decoding support
>>>
>>>    Documentation/admin-guide/perf/hisi-pmcu.rst |  183 +++
>>>    Documentation/admin-guide/perf/index.rst     |    1 +
>>>    drivers/perf/hisilicon/Kconfig               |   15 +
>>>    drivers/perf/hisilicon/Makefile              |    1 +
>>>    drivers/perf/hisilicon/hisi_pmcu.c           | 1096 ++++++++++++++++++
>>>    tools/perf/arch/arm/util/auxtrace.c          |   61 +
>>>    tools/perf/arch/arm64/util/Build             |    2 +-
>>>    tools/perf/arch/arm64/util/hisi-pmcu.c       |  145 +++
>>>    tools/perf/util/Build                        |    1 +
>>>    tools/perf/util/auxtrace.c                   |    4 +
>>>    tools/perf/util/auxtrace.h                   |    1 +
>>>    tools/perf/util/hisi-pmcu.c                  |  305 +++++
>>>    tools/perf/util/hisi-pmcu.h                  |   19 +
>>>    13 files changed, 1833 insertions(+), 1 deletion(-)
>>>    create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst
>>>    create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c
>>>    create mode 100644 tools/perf/arch/arm64/util/hisi-pmcu.c
>>>    create mode 100644 tools/perf/util/hisi-pmcu.c
>>>    create mode 100644 tools/perf/util/hisi-pmcu.h
>>>
>>>
>>> base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
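
The capacity figures in the quoted cover letter (12 packed events, or 192
events with a bitmap) follow directly from the field widths. The sketch below
reproduces that arithmetic and shows one possible way of packing 16-bit event
numbers into a single 64-bit config value. The macro and function names are
hypothetical, and the packing layout is only an assumption for illustration,
not the format the driver uses.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CONFIG_FIELDS	3	/* three 64-bit config fields, as stated above */
#define CONFIG_BITS		64
#define EVENT_BITS		16	/* an event number is 16 bits wide */
#define EVENTS_PER_FIELD	(CONFIG_BITS / EVENT_BITS)

/* Events that fit when each event number is stored verbatim. */
static unsigned int max_packed_events(void)
{
	return NR_CONFIG_FIELDS * EVENTS_PER_FIELD;
}

/* Events that fit when each bit selects one event. */
static unsigned int max_bitmap_events(void)
{
	return NR_CONFIG_FIELDS * CONFIG_BITS;
}

/* Assumed layout: event i occupies bits [16 * i, 16 * i + 15]. */
static uint64_t pack_events(const uint16_t *events, unsigned int n)
{
	uint64_t config = 0;
	unsigned int i;

	assert(n <= EVENTS_PER_FIELD);
	for (i = 0; i < n; i++)
		config |= (uint64_t)events[i] << (i * EVENT_BITS);
	return config;
}
```

Either encoding tops out well below the 240 events mentioned at the start of
the thread, which is why the current design falls back to a sysfs file.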




end of thread, other threads:[~2023-04-19  8:04 UTC | newest]

Thread overview: 32+ messages
2023-02-06  6:51 [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit Jie Zhan
2023-02-06  6:51 ` Jie Zhan
2023-02-06  6:51 ` [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU Jie Zhan
2023-02-06  6:51   ` Jie Zhan
2023-02-07  3:03   ` Jie Zhan
2023-02-07  3:03     ` Jie Zhan
2023-03-17 13:37   ` Jonathan Cameron
2023-03-17 13:37     ` Jonathan Cameron
2023-03-24  9:32     ` Jie Zhan
2023-03-24  9:32       ` Jie Zhan
2023-03-24 12:14       ` Jonathan Cameron
2023-03-24 12:14         ` Jonathan Cameron
2023-03-25  2:48         ` Jie Zhan
2023-03-25  2:48           ` Jie Zhan
2023-02-06  6:51 ` [RFC PATCH v1 2/4] drivers/perf: hisi: Add driver support " Jie Zhan
2023-02-06  6:51   ` Jie Zhan
2023-03-17 14:52   ` Jonathan Cameron
2023-03-17 14:52     ` Jonathan Cameron
2023-03-25 10:21     ` Jie Zhan
2023-03-25 10:21       ` Jie Zhan
2023-02-06  6:51 ` [RFC PATCH v1 3/4] perf tool: Add HiSilicon PMCU data recording support Jie Zhan
2023-02-06  6:51   ` Jie Zhan
2023-03-17 15:13   ` Jonathan Cameron
2023-03-17 15:13     ` Jonathan Cameron
2023-02-06  6:51 ` [RFC PATCH v1 4/4] perf tool: Add HiSilicon PMCU data decoding support Jie Zhan
2023-02-06  6:51   ` Jie Zhan
2023-02-27  8:49 ` [RFC PATCH v1 0/4] HiSilicon Performance Monitor Control Unit Jie Zhan
2023-02-27  8:49   ` Jie Zhan
2023-03-17 13:11   ` Jonathan Cameron
2023-03-17 13:11     ` Jonathan Cameron
2023-04-19  8:01     ` Jie Zhan
2023-04-19  8:01       ` Jie Zhan
