linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support
@ 2023-10-17  1:32 Shuai Xue
  2023-10-17  1:32 ` [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver Shuai Xue
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-17  1:32 UTC (permalink / raw)
  To: chengyou, kaishen, helgaas, yangyicong, will, Jonathan.Cameron,
	baolin.wang, robin.murphy
  Cc: linux-kernel, linux-arm-kernel, linux-pci, rdunlap, mark.rutland,
	zhuo.song, xueshuai, renyu.zj

Change Log
==========

changes since v7:
- add config help with alibaba name (per Bjorn)
- remove the ARM64 dependency (per Bjorn and Jonathan)
- fix typo and column warp (per Bjorn) 
- move list_del() after perf_pmu_unregister() (per Bjorn) 
- reorder the funtions by interests (per Bjorn)
- rewrite commit log about PMU counters, also update doc (per Bjorn)
- extend driver to support stat time-based analysis and lane event at the same time (per Bjorn and Jonathan)
Link: https://lore.kernel.org/linux-arm-kernel/20231012032856.2640-2-xueshuai@linux.alibaba.com/T/

changes since v6:
- improve editorial things in doc (Per Jonathan)
- change config help to generic text (Per Jonathan)
- remove macro to_dwc_pcie_pmu by moving pmu as the first member to struct dwc_pcie_pmu (Per Yicong)
- add event type check in dwc_pcie_event_show() to keep consistent with other function (Per Jonathan)
- remove intended blank line (Per Yicong)
- protect against lower 32 bits of counter overflow by try again trick (Per Jonathan)
- call pci_dev_put on all the return branch to keep the refcnt balance (Per Jonathan and Yicong)
- use devm_add_action_or_reset() to automatic unwind (Per Jonathan)
- fix picking numa-aware context cpu up when offline and offline cpu (Per Jonathan)
- simplify online cpu by init pcie_pmu->on_cpu as -1 (Per Jonathan)
- add bus_register_notifier() to handle rootport hotplug (Per Yicong)
- pick up Acked-by from Bjorn for patch 2/4 (Per Bjorn)
Link: https://lore.kernel.org/lkml/20230606074938.97724-1-xueshuai@linux.alibaba.com/T/

changes since v5:
- Rewrite the commit log to follow policy in pci_ids.h (Bjorn Helgaas)
- return error code when __dwc_pcie_pmu_probe failed (Baolin Wang)
- call 'cpuhp_remove_multi_state()' when exiting the driver. (Baolin Wang)
- pick up Review-by tag from Baolin for Patch 1 and 3
Link: https://lore.kernel.org/lkml/ZGuSimj1cuQl3W5L@bhelgaas/T/#mba3fa2572dde0deddb40b5b24a31f4df41004bdf

changes since v4:

1. addressing commens from Bjorn Helgaas:
- reorder the includes by alpha
- change all macros with upper-case hex
- change ras_des type into u16
- remove unnecessary outer "()"
- minor format changes

2. Address commensts from Jonathan Cameron:
- rewrite doc and add a example to show how to use lane event

3. fix compile error reported by: kernel test robot
- remove COMPILE_TEST and add depend on PCI in kconfig
- add Reported-by: kernel test robot <lkp@intel.com>

Changes since v3:

1. addressing comments from Robin Murphy:
- add a prepare patch to define pci id in linux/pci_ids.h
- remove unnecessary 64BIT dependency
- fix DWC_PCIE_PER_EVENT_OFF/ON macro
- remove dwc_pcie_pmu struct and move all its fileds into dwc_pcie_rp_info
- remove unnecessary format field show
- use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.
- remove unnecessary spaces and remove unnecessary cast to follow event show convention
- remove pcie_pmu_event_attr_is_visible
- fix a refcout leak on error branch when walk pci device in for_each_pci_dev
- remove bdf field from dwc_pcie_rp_info and calculate it at runtime
- finish all the checks before allocating rp_info to avoid hanging wasted memory
- remove some unused fields
- warp out control register configuration from sub function to .add()
- make function return type with a proper signature
- fix lane event count enable by clear DWC_PCIE_CNT_ENABLE field first
- pass rp_info directly to the read_*_counter helpers and in start, stop and add callbacks
- move event type validtion into .event_init()
- use is_sampling_event() to be consistent with everything else of pmu drivers
- remove unnecessary dev_err message in .event_init()
- return EINVAL instead EOPNOTSUPP for not a valid event 
- finish all the checks before start modifying the event
- fix sibling event check by comparing event->pmu with sibling->pmu
- probe PMU for each rootport independently
- use .update() as .read() directly
- remove dynamically generating symbolic name of lane event
- redefine static symbolic name of lane event and leave lane filed to user
- add CPU hotplug support

2. addressing comments from Baolin:
- add a mask to avoid possible overflow

Changes since v2 addressing comments from Baolin:
- remove redundant macro definitions
- use dev_err to print error message
- change pmu_is_register to boolean
- use PLATFORM_DEVID_NONE macro
- fix module author format

Changes since v1:

1. address comments from Jonathan:
- drop marco for PMU name and VSEC version
- simplify code with PCI standard marco
- simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
- name register filed with single _ instead double
- wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check 
- check vendor id while matching vesc with pci_find_vsec_capability()
- remove RP_NUM_MAX and use a list to organize PMU devices for rootports
- replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
- comments on riping register together

2. address comments from Bjorn:
- rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
- rename cap_pos to ras_des
- simplify declare of device_attribute with DEVICE_ATTR_RO
- simplify code with PCI standard macro and API like pcie_get_width_cap()
- fix some code style problem and typo
- drop meaningless snaity check of container_of

3. address comments from Yicong:
- use sysfs_emit() to replace sprintf()
- simplify iteration of pci device with for_each_pci_dev
- pick preferred CPUs on a near die and add comments
- unregister PMU drivers only for failed ones
- log on behalf PMU device and give more hint
- fix some code style problem

(Thanks for all comments and they are very valuable to me)

Cover Letter
==========

This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
Core controller IP which provides statistics feature.

Shuai Xue (4):
  docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  PCI: Add Alibaba Vendor ID to linux/pci_ids.h
  drivers/perf: add DesignWare PCIe PMU driver
  MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

 .../admin-guide/perf/dwc_pcie_pmu.rst         |  94 +++
 Documentation/admin-guide/perf/index.rst      |   1 +
 MAINTAINERS                                   |   7 +
 drivers/infiniband/hw/erdma/erdma_hw.h        |   2 -
 drivers/perf/Kconfig                          |   7 +
 drivers/perf/Makefile                         |   1 +
 drivers/perf/dwc_pcie_pmu.c                   | 756 ++++++++++++++++++
 include/linux/pci_ids.h                       |   2 +
 8 files changed, 868 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
 create mode 100644 drivers/perf/dwc_pcie_pmu.c

-- 
2.39.3


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-17  1:32 [PATCH v8 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support Shuai Xue
@ 2023-10-17  1:32 ` Shuai Xue
  2023-10-17  9:16   ` Jonathan Cameron
  2023-10-19  7:35   ` Yicong Yang
  2023-10-17  1:32 ` [PATCH v8 2/4] PCI: Add Alibaba Vendor ID to linux/pci_ids.h Shuai Xue
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-17  1:32 UTC (permalink / raw)
  To: chengyou, kaishen, helgaas, yangyicong, will, Jonathan.Cameron,
	baolin.wang, robin.murphy
  Cc: linux-kernel, linux-arm-kernel, linux-pci, rdunlap, mark.rutland,
	zhuo.song, xueshuai, renyu.zj

Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
controller which implements which implements PMU for performance and
functional debugging to facilitate system maintenance.

Document it to provide guidance on how to use it.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 .../admin-guide/perf/dwc_pcie_pmu.rst         | 94 +++++++++++++++++++
 Documentation/admin-guide/perf/index.rst      |  1 +
 2 files changed, 95 insertions(+)
 create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst

diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
new file mode 100644
index 000000000000..eac1b6f36450
--- /dev/null
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -0,0 +1,94 @@
+======================================================================
+Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
+======================================================================
+
+DesignWare Cores (DWC) PCIe PMU
+===============================
+
+The PMU is a PCIe configuration space register block provided by each PCIe Root
+Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
+injection, and Statistics).
+
+As the name indicates, the RAS DES capability supports system level
+debugging, AER error injection, and collection of statistics. To facilitate
+collection of statistics, Synopsys DesignWare Cores PCIe controller
+provides the following two features:
+
+- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
+  time spent in each low-power LTSSM state) and
+- one 32-bit counter for Event Counting (error and non-error events for
+  a specified lane)
+
+Note: There is no interrupt for counter overflow.
+
+Time Based Analysis
+-------------------
+
+Using this feature you can obtain information regarding RX/TX data
+throughput and time spent in each low-power LTSSM state by the controller.
+The PMU measures data in two categories:
+
+- Group#0: Percentage of time the controller stays in LTSSM states.
+- Group#1: Amount of data processed (Units of 16 bytes).
+
+Lane Event counters
+-------------------
+
+Using this feature you can obtain Error and Non-Error information in
+specific lane by the controller. The PMU event is select by:
+
+- Group i
+- Event j within the Group i
+- and Lane k
+
+Some of the event only exist for specific configurations.
+
+DesignWare Cores (DWC) PCIe PMU Driver
+=======================================
+
+This driver adds PMU devices for each PCIe Root Port named based on the BDF of
+the Root Port. For example,
+
+    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+
+the PMU device name for this Root Port is dwc_rootport_3018.
+
+The DWC PCIe PMU driver registers a perf PMU driver, which provides
+description of available events and configuration options in sysfs, see
+/sys/bus/event_source/devices/dwc_rootport_{bdf}.
+
+The "format" directory describes format of the config fields of the
+perf_event_attr structure. The "events" directory provides configuration
+templates for all documented events.  For example,
+"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
+
+The "perf list" command shall list the available events from sysfs, e.g.::
+
+    $# perf list | grep dwc_rootport
+    <...>
+    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
+    <...>
+    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
+
+Time Based Analysis Event Usage
+-------------------------------
+
+Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
+
+    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+
+The average RX/TX bandwidth can be calculated using the following formula:
+
+    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
+    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
+
+Lane Event Usage
+-------------------------------
+
+Each lane has the same event set and to avoid generating a list of hundreds
+of events, the user need to specify the lane ID explicitly, e.g.::
+
+    $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
+
+The driver does not support sampling, therefore "perf record" will not
+work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index f60be04e4e33..6bc7739fddb5 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -19,6 +19,7 @@ Performance monitor support
    arm_dsu_pmu
    thunderx2-pmu
    alibaba_pmu
+   dwc_pcie_pmu
    nvidia-pmu
    meson-ddr-pmu
    cxl
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v8 2/4] PCI: Add Alibaba Vendor ID to linux/pci_ids.h
  2023-10-17  1:32 [PATCH v8 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support Shuai Xue
  2023-10-17  1:32 ` [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver Shuai Xue
@ 2023-10-17  1:32 ` Shuai Xue
  2023-10-17  1:32 ` [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver Shuai Xue
  2023-10-17  1:32 ` [PATCH v8 4/4] MAINTAINERS: add maintainers for " Shuai Xue
  3 siblings, 0 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-17  1:32 UTC (permalink / raw)
  To: chengyou, kaishen, helgaas, yangyicong, will, Jonathan.Cameron,
	baolin.wang, robin.murphy
  Cc: linux-kernel, linux-arm-kernel, linux-pci, rdunlap, mark.rutland,
	zhuo.song, xueshuai, renyu.zj

The Alibaba Vendor ID (0x1ded) is now used by Alibaba elasticRDMA ("erdma")
and will be shared with the upcoming PCIe PMU ("dwc_pcie_pmu"). Move the
Vendor ID to linux/pci_ids.h so that it can shared by several drivers
later.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>	# pci_ids.h
---
 drivers/infiniband/hw/erdma/erdma_hw.h | 2 --
 include/linux/pci_ids.h                | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
index 9d316fdc6f9a..a155519a862f 100644
--- a/drivers/infiniband/hw/erdma/erdma_hw.h
+++ b/drivers/infiniband/hw/erdma/erdma_hw.h
@@ -11,8 +11,6 @@
 #include <linux/types.h>
 
 /* PCIe device related definition. */
-#define PCI_VENDOR_ID_ALIBABA 0x1ded
-
 #define ERDMA_PCI_WIDTH 64
 #define ERDMA_FUNC_BAR 0
 #define ERDMA_MISX_BAR 2
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 5fb3d4c393a9..d8760daf9e5a 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2601,6 +2601,8 @@
 #define PCI_VENDOR_ID_TEKRAM		0x1de1
 #define PCI_DEVICE_ID_TEKRAM_DC290	0xdc29
 
+#define PCI_VENDOR_ID_ALIBABA		0x1ded
+
 #define PCI_VENDOR_ID_TEHUTI		0x1fc9
 #define PCI_DEVICE_ID_TEHUTI_3009	0x3009
 #define PCI_DEVICE_ID_TEHUTI_3010	0x3010
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver
  2023-10-17  1:32 [PATCH v8 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support Shuai Xue
  2023-10-17  1:32 ` [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver Shuai Xue
  2023-10-17  1:32 ` [PATCH v8 2/4] PCI: Add Alibaba Vendor ID to linux/pci_ids.h Shuai Xue
@ 2023-10-17  1:32 ` Shuai Xue
  2023-10-17  9:39   ` Jonathan Cameron
  2023-10-17  1:32 ` [PATCH v8 4/4] MAINTAINERS: add maintainers for " Shuai Xue
  3 siblings, 1 reply; 15+ messages in thread
From: Shuai Xue @ 2023-10-17  1:32 UTC (permalink / raw)
  To: chengyou, kaishen, helgaas, yangyicong, will, Jonathan.Cameron,
	baolin.wang, robin.murphy
  Cc: linux-kernel, linux-arm-kernel, linux-pci, rdunlap, mark.rutland,
	zhuo.song, xueshuai, renyu.zj

This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
Core controller IP which provides statistics feature. The PMU is a PCIe
configuration space register block provided by each PCIe Root Port in a
Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
injection, and Statistics).

To facilitate collection of statistics the controller provides the
following two features for each Root Port:

- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
  time spent in each low-power LTSSM state) and
- one 32-bit counter for Event Counting (error and non-error events for
  a specified lane)

Note: There is no interrupt for counter overflow.

This driver adds PMU devices for each PCIe Root Port. And the PMU device is
named based the BDF of Root Port. For example,

    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)

the PMU device name for this Root Port is dwc_rootport_3018.

Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::

    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/

average RX bandwidth can be calculated like this:

    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/perf/Kconfig        |   7 +
 drivers/perf/Makefile       |   1 +
 drivers/perf/dwc_pcie_pmu.c | 756 ++++++++++++++++++++++++++++++++++++
 3 files changed, 764 insertions(+)
 create mode 100644 drivers/perf/dwc_pcie_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 273d67ecf6d2..ec6e0d9194a1 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -217,6 +217,13 @@ config MARVELL_CN10K_DDR_PMU
 	  Enable perf support for Marvell DDR Performance monitoring
 	  event on CN10K platform.
 
+config DWC_PCIE_PMU
+	tristate "Synopsys DesignWare PCIe PMU"
+	depends on PCI
+	help
+	  Enable perf support for Synopsys DesignWare PCIe PMU Performance
+	  monitoring event on platform including the Alibaba Yitian 710.
+
 source "drivers/perf/arm_cspmu/Kconfig"
 
 source "drivers/perf/amlogic/Kconfig"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 16b3ec4db916..a06338e3401c 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
 obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
 obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
 obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
+obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
 obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
 obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
 obj-$(CONFIG_CXL_PMU) += cxl_pmu.o
diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
new file mode 100644
index 000000000000..9a97ffca8719
--- /dev/null
+++ b/drivers/perf/dwc_pcie_pmu.c
@@ -0,0 +1,756 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Synopsys DesignWare PCIe PMU driver
+ *
+ * Copyright (C) 2021-2023 Alibaba Inc.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cpuhotplug.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/perf_event.h>
+#include <linux/pci.h>
+#include <linux/platform_device.h>
+#include <linux/smp.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+#define DWC_PCIE_VSEC_RAS_DES_ID		0x02
+#define DWC_PCIE_EVENT_CNT_CTL			0x8
+
+/*
+ * Event Counter Data Select includes two parts:
+ * - 27-24: Group number(4-bit: 0..0x7)
+ * - 23-16: Event number(8-bit: 0..0x13) within the Group
+ *
+ * Put them together as in TRM.
+ */
+#define DWC_PCIE_CNT_EVENT_SEL			GENMASK(27, 16)
+#define DWC_PCIE_CNT_LANE_SEL			GENMASK(11, 8)
+#define DWC_PCIE_CNT_STATUS			BIT(7)
+#define DWC_PCIE_CNT_ENABLE			GENMASK(4, 2)
+#define DWC_PCIE_PER_EVENT_OFF			0x1
+#define DWC_PCIE_PER_EVENT_ON			0x3
+#define DWC_PCIE_EVENT_CLEAR			GENMASK(1, 0)
+#define DWC_PCIE_EVENT_PER_CLEAR		0x1
+
+#define DWC_PCIE_EVENT_CNT_DATA			0xC
+
+#define DWC_PCIE_TIME_BASED_ANAL_CTL		0x10
+#define DWC_PCIE_TIME_BASED_REPORT_SEL		GENMASK(31, 24)
+#define DWC_PCIE_TIME_BASED_DURATION_SEL	GENMASK(15, 8)
+#define DWC_PCIE_DURATION_MANUAL_CTL		0x0
+#define DWC_PCIE_DURATION_1MS			0x1
+#define DWC_PCIE_DURATION_10MS			0x2
+#define DWC_PCIE_DURATION_100MS			0x3
+#define DWC_PCIE_DURATION_1S			0x4
+#define DWC_PCIE_DURATION_2S			0x5
+#define DWC_PCIE_DURATION_4S			0x6
+#define DWC_PCIE_DURATION_4US			0xFF
+#define DWC_PCIE_TIME_BASED_TIMER_START		BIT(0)
+#define DWC_PCIE_TIME_BASED_CNT_ENABLE		0x1
+
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW	0x14
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH	0x18
+
+/* Event attributes */
+#define DWC_PCIE_CONFIG_EVENTID			GENMASK(15, 0)
+#define DWC_PCIE_CONFIG_TYPE			GENMASK(19, 16)
+#define DWC_PCIE_CONFIG_LANE			GENMASK(27, 20)
+
+#define DWC_PCIE_EVENT_ID(event)	FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
+#define DWC_PCIE_EVENT_TYPE(event)	FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
+#define DWC_PCIE_EVENT_LANE(event)	FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
+
+enum dwc_pcie_event_type {
+	DWC_PCIE_TIME_BASE_EVENT,
+	DWC_PCIE_LANE_EVENT,
+	DWC_PCIE_EVENT_TYPE_MAX,
+};
+
+#define DWC_PCIE_LANE_EVENT_MAX_PERIOD		GENMASK_ULL(31, 0)
+#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD	GENMASK_ULL(63, 0)
+
+struct dwc_pcie_pmu {
+	struct pmu		pmu;
+	struct pci_dev		*pdev;		/* Root Port device */
+	u16			ras_des;	/* RAS DES capability offset */
+	u32			nr_lanes;
+
+	struct list_head	pmu_node;
+	struct hlist_node	cpuhp_node;
+	struct perf_event	*event[DWC_PCIE_EVENT_TYPE_MAX];
+	int			on_cpu;
+	bool			registered;
+};
+
+#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
+
+static struct platform_device *dwc_pcie_pmu_dev;
+static int dwc_pcie_pmu_hp_state;
+static struct list_head dwc_pcie_pmu_head = LIST_HEAD_INIT(dwc_pcie_pmu_head);
+static bool dwc_pcie_pmu_notify;
+
+static ssize_t cpumask_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
+
+	return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL
+};
+
+static struct attribute_group dwc_pcie_cpumask_attr_group = {
+	.attrs = dwc_pcie_pmu_cpumask_attrs,
+};
+
+struct dwc_pcie_format_attr {
+	struct device_attribute attr;
+	u64 field;
+	int config;
+};
+
+static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
+	int lo = __ffs(fmt->field), hi = __fls(fmt->field);
+
+	return sysfs_emit(buf, "config:%d-%d\n", lo, hi);
+}
+
+#define _dwc_pcie_format_attr(_name, _cfg, _fld)			    \
+	(&((struct dwc_pcie_format_attr[]) {{				    \
+		.attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL),\
+		.config = _cfg,						    \
+		.field = _fld,						    \
+	}})[0].attr.attr)
+
+#define dwc_pcie_format_attr(_name, _fld)	_dwc_pcie_format_attr(_name, 0, _fld)
+
+static struct attribute *dwc_pcie_format_attrs[] = {
+	dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
+	dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
+	dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
+	NULL,
+};
+
+static struct attribute_group dwc_pcie_format_attrs_group = {
+	.name = "format",
+	.attrs = dwc_pcie_format_attrs,
+};
+
+struct dwc_pcie_event_attr {
+	struct device_attribute attr;
+	enum dwc_pcie_event_type type;
+	u16 eventid;
+	u8 lane;
+};
+
+static ssize_t dwc_pcie_event_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct dwc_pcie_event_attr *eattr;
+
+	eattr = container_of(attr, typeof(*eattr), attr);
+
+	if (eattr->type == DWC_PCIE_LANE_EVENT)
+		return sysfs_emit(buf, "eventid=0x%x,type=0x%x,lane=?\n",
+				  eattr->eventid, eattr->type);
+	else if (eattr->type == DWC_PCIE_TIME_BASE_EVENT)
+		return sysfs_emit(buf, "eventid=0x%x,type=0x%x\n",
+				  eattr->eventid, eattr->type);
+
+	return 0;
+}
+
+#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane)		\
+	(&((struct dwc_pcie_event_attr[]) {{				\
+		.attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL),	\
+		.type = _type,						\
+		.eventid = _eventid,					\
+		.lane = _lane,						\
+	}})[0].attr.attr)
+
+#define DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(_name, _eventid)		\
+	DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
+#define DWC_PCIE_PMU_LANE_EVENT_ATTR(_name, _eventid)			\
+	DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_LANE_EVENT, _eventid, 0)
+
+static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
+	/* Group #0 */
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(one_cycle, 0x00),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_L0S, 0x01),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(RX_L0S, 0x02),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L0, 0x03),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1, 0x04),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_1, 0x05),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_2, 0x06),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(CFG_RCVRY, 0x07),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_RX_L0S, 0x08),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_AUX, 0x09),
+
+	/* Group #1 */
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
+	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
+
+	/*
+	 * Leave it to the user to specify the lane ID to avoid generating
+	 * a list of hundreds of events.
+	 */
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ack_dllp, 0x600),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_update_fc_dllp, 0x601),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ack_dllp, 0x602),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_update_fc_dllp, 0x603),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_nulified_tlp, 0x604),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_nulified_tlp, 0x605),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_duplicate_tl, 0x606),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_write, 0x700),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_read, 0x701),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_write, 0x702),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_read, 0x703),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_write, 0x704),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_read, 0x705),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_without_data, 0x706),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_with_data, 0x707),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_message_tlp, 0x708),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_atomic, 0x709),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_tlp_with_prefix, 0x70A),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_write, 0x70B),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_read, 0x70C),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_write, 0x70F),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_read, 0x710),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_without_data, 0x711),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_with_data, 0x712),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_message_tlp, 0x713),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_atomic, 0x714),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_tlp_with_prefix, 0x715),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ccix_tlp, 0x716),
+	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ccix_tlp, 0x717),
+	NULL
+};
+
+static const struct attribute_group dwc_pcie_event_attrs_group = {
+	.name = "events",
+	.attrs = dwc_pcie_pmu_time_event_attrs,
+};
+
+static const struct attribute_group *dwc_pcie_attr_groups[] = {
+	&dwc_pcie_event_attrs_group,
+	&dwc_pcie_format_attrs_group,
+	&dwc_pcie_cpumask_attr_group,
+	NULL
+};
+
+static void dwc_pcie_pmu_lane_event_enable(struct dwc_pcie_pmu *pcie_pmu,
+					   bool enable)
+{
+	struct pci_dev *pdev = pcie_pmu->pdev;
+	u16 ras_des = pcie_pmu->ras_des;
+	u32 val;
+
+	pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+	/* Clear DWC_PCIE_CNT_ENABLE field first */
+	val &= ~DWC_PCIE_CNT_ENABLE;
+	if (enable)
+		val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
+	else
+		val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);
+
+	pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_time_based_event_enable(struct dwc_pcie_pmu *pcie_pmu,
+					  bool enable)
+{
+	struct pci_dev *pdev = pcie_pmu->pdev;
+	u16 ras_des = pcie_pmu->ras_des;
+	u32 val;
+
+	pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+			      &val);
+
+	if (enable)
+		val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
+	else
+		val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
+
+	pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+			       val);
+}
+
+static u64 dwc_pcie_pmu_read_lane_event_counter(struct dwc_pcie_pmu *pcie_pmu)
+{
+	struct pci_dev *pdev = pcie_pmu->pdev;
+	u16 ras_des = pcie_pmu->ras_des;
+	u32 val;
+
+	pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
+
+	return val;
+}
+
+static u64 dwc_pcie_pmu_read_time_based_counter(struct dwc_pcie_pmu *pcie_pmu)
+{
+	struct pci_dev *pdev = pcie_pmu->pdev;
+	u16 ras_des = pcie_pmu->ras_des;
+	u32 lo, hi, ss;
+
+	/*
+	 * The 64-bit value of the data counter is spread across two
+	 * registers that are not synchronized. In order to read them
+	 * atomically, ensure that the high 32 bits match before and after
+	 * reading the low 32 bits.
+	 */
+	pci_read_config_dword(pdev, ras_des +
+		DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH, &hi);
+	do {
+		/* snapshot the high 32 bits */
+		ss = hi;
+
+		pci_read_config_dword(
+			pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
+			&lo);
+		pci_read_config_dword(
+			pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
+			&hi);
+	} while (hi != ss);
+
+	return ((u64)hi << 32) | lo;
+}
+
+static void dwc_pcie_pmu_event_update(struct perf_event *event)
+{
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+	u64 delta, prev, now;
+
+	do {
+		prev = local64_read(&hwc->prev_count);
+
+		if (type == DWC_PCIE_LANE_EVENT)
+			now = dwc_pcie_pmu_read_lane_event_counter(pcie_pmu);
+		else if (type == DWC_PCIE_TIME_BASE_EVENT)
+			now = dwc_pcie_pmu_read_time_based_counter(pcie_pmu);
+
+	} while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+	if (type == DWC_PCIE_LANE_EVENT)
+		delta = (now - prev) & DWC_PCIE_LANE_EVENT_MAX_PERIOD;
+	else if (type == DWC_PCIE_TIME_BASE_EVENT)
+		delta = (now - prev) & DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD;
+
+	local64_add(delta, &event->count);
+}
+
+static int dwc_pcie_pmu_event_init(struct perf_event *event)
+{
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+	struct perf_event *sibling;
+	u32 lane;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* We don't support sampling */
+	if (is_sampling_event(event))
+		return -EINVAL;
+
+	/* We cannot support task bound events */
+	if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK)
+		return -EINVAL;
+
+	if (event->group_leader != event &&
+	    !is_software_event(event->group_leader))
+		return -EINVAL;
+
+	for_each_sibling_event(sibling, event->group_leader) {
+		if (sibling->pmu != event->pmu && !is_software_event(sibling))
+			return -EINVAL;
+	}
+
+	if (type == DWC_PCIE_LANE_EVENT) {
+		lane = DWC_PCIE_EVENT_LANE(event);
+		if (lane < 0 || lane >= pcie_pmu->nr_lanes)
+			return -EINVAL;
+	}
+
+	event->cpu = pcie_pmu->on_cpu;
+
+	return 0;
+}
+
+static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
+{
+	local64_set(&hwc->prev_count, 0);
+}
+
+static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+	hwc->state = 0;
+	dwc_pcie_pmu_set_period(hwc);
+
+	if (type == DWC_PCIE_LANE_EVENT)
+		dwc_pcie_pmu_lane_event_enable(pcie_pmu, true);
+	else if (type == DWC_PCIE_TIME_BASE_EVENT)
+		dwc_pcie_pmu_time_based_event_enable(pcie_pmu, true);
+}
+
+static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
+{
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (event->hw.state & PERF_HES_STOPPED)
+		return;
+
+	if (type == DWC_PCIE_LANE_EVENT)
+		dwc_pcie_pmu_lane_event_enable(pcie_pmu, false);
+	else if (type == DWC_PCIE_TIME_BASE_EVENT)
+		dwc_pcie_pmu_time_based_event_enable(pcie_pmu, false);
+
+	dwc_pcie_pmu_event_update(event);
+	hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
+}
+
+static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
+{
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+	struct pci_dev *pdev = pcie_pmu->pdev;
+	struct hw_perf_event *hwc = &event->hw;
+	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+	int event_id = DWC_PCIE_EVENT_ID(event);
+	int lane = DWC_PCIE_EVENT_LANE(event);
+	u16 ras_des = pcie_pmu->ras_des;
+	u32 ctrl;
+
+	/* one counter for each type and it is in use */
+	if (pcie_pmu->event[type])
+		return -ENOSPC;
+
+	pcie_pmu->event[type] = event;
+	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+	if (type == DWC_PCIE_LANE_EVENT) {
+		/* EVENT_COUNTER_DATA_REG needs clear manually */
+		ctrl = FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id) |
+			FIELD_PREP(DWC_PCIE_CNT_LANE_SEL, lane) |
+			FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF) |
+			FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
+		pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL,
+				       ctrl);
+	} else if (type == DWC_PCIE_TIME_BASE_EVENT) {
+		/*
+		 * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
+		 * use it with any manually controlled duration. And it is
+		 * cleared when next measurement starts.
+		 */
+		ctrl = FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id) |
+			FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
+				   DWC_PCIE_DURATION_MANUAL_CTL) |
+			DWC_PCIE_TIME_BASED_CNT_ENABLE;
+		pci_write_config_dword(
+			pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, ctrl);
+	}
+
+	if (flags & PERF_EF_START)
+		dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
+
+	perf_event_update_userpage(event);
+
+	return 0;
+}
+
+static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
+{
+	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+	dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
+	perf_event_update_userpage(event);
+	pcie_pmu->event[type] = NULL;
+}
+
+static void dwc_pcie_pmu_remove_cpuhp_instance(void *hotplug_node)
+{
+	cpuhp_state_remove_instance_nocalls(dwc_pcie_pmu_hp_state, hotplug_node);
+}
+
+/*
+ * Find the PMU of a PCI device.
+ * @pdev: The PCI device.
+ */
+static struct dwc_pcie_pmu *dwc_pcie_find_dev_pmu(struct pci_dev *pdev)
+{
+	struct dwc_pcie_pmu *pcie_pmu;
+
+	list_for_each_entry(pcie_pmu, &dwc_pcie_pmu_head, pmu_node)
+		if (pcie_pmu->pdev == pdev)
+			return pcie_pmu;
+
+	return NULL;
+}
+
+static void dwc_pcie_pmu_unregister_pmu(void *data)
+{
+	struct dwc_pcie_pmu *pcie_pmu = data;
+
+	if (!pcie_pmu->registered)
+		return;
+
+	perf_pmu_unregister(&pcie_pmu->pmu);
+	pcie_pmu->registered = false;
+	list_del(&pcie_pmu->pmu_node);
+}
+
+static int dwc_pcie_pmu_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data)
+{
+	struct device *dev = data;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct dwc_pcie_pmu *pcie_pmu;
+
+	/* Unregister the PMU when the device is going to be deleted. */
+	if (action != BUS_NOTIFY_DEL_DEVICE)
+		return NOTIFY_DONE;
+
+	pcie_pmu = dwc_pcie_find_dev_pmu(pdev);
+	if (!pcie_pmu)
+		return NOTIFY_DONE;
+
+	dwc_pcie_pmu_unregister_pmu(pcie_pmu);
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block dwc_pcie_pmu_nb = {
+	.notifier_call = dwc_pcie_pmu_notifier,
+};
+
+static int dwc_pcie_pmu_probe(struct platform_device *plat_dev)
+{
+	struct pci_dev *pdev = NULL;
+	struct dwc_pcie_pmu *pcie_pmu;
+	bool notify = false;
+	char *name;
+	u32 bdf;
+	int ret;
+
+	/* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
+	for_each_pci_dev(pdev) {
+		u16 vsec;
+		u32 val;
+
+		if (!(pci_is_pcie(pdev) &&
+		      pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
+			continue;
+
+		vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
+						DWC_PCIE_VSEC_RAS_DES_ID);
+		if (!vsec)
+			continue;
+
+		pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
+		if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
+		    PCI_VNDR_HEADER_LEN(val) != 0x100)
+			continue;
+		pci_dbg(pdev,
+			"Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
+
+		bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
+		name = devm_kasprintf(&plat_dev->dev, GFP_KERNEL, "dwc_rootport_%x",
+				      bdf);
+		if (!name) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* All checks passed, go go go */
+		pcie_pmu = devm_kzalloc(&plat_dev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
+		if (!pcie_pmu) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		pcie_pmu->pdev = pdev;
+		pcie_pmu->ras_des = vsec;
+		pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
+		pcie_pmu->on_cpu = -1;
+		pcie_pmu->pmu = (struct pmu){
+			.module		= THIS_MODULE,
+			.attr_groups	= dwc_pcie_attr_groups,
+			.capabilities	= PERF_PMU_CAP_NO_EXCLUDE,
+			.task_ctx_nr	= perf_invalid_context,
+			.event_init	= dwc_pcie_pmu_event_init,
+			.add		= dwc_pcie_pmu_event_add,
+			.del		= dwc_pcie_pmu_event_del,
+			.start		= dwc_pcie_pmu_event_start,
+			.stop		= dwc_pcie_pmu_event_stop,
+			.read		= dwc_pcie_pmu_event_update,
+		};
+
+		/* Add this instance to the list used by the offline callback */
+		ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
+					       &pcie_pmu->cpuhp_node);
+		if (ret) {
+			pci_err(pcie_pmu->pdev,
+				"Error %d registering hotplug @%x\n", ret, bdf);
+			goto out;
+		}
+
+		/* Unwind when platform driver removes */
+		ret = devm_add_action_or_reset(
+			&plat_dev->dev, dwc_pcie_pmu_remove_cpuhp_instance,
+			&pcie_pmu->cpuhp_node);
+		if (ret)
+			goto out;
+
+		ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
+		if (ret) {
+			pci_err(pcie_pmu->pdev,
+				"Error %d registering PMU @%x\n", ret, bdf);
+			goto out;
+		}
+		ret = devm_add_action_or_reset(
+			&plat_dev->dev, dwc_pcie_pmu_unregister_pmu, pcie_pmu);
+		if (ret)
+			goto out;
+
+		/* Cache PMU to handle pci device hotplug */
+		list_add(&pcie_pmu->pmu_node, &dwc_pcie_pmu_head);
+		pcie_pmu->registered = true;
+		notify = true;
+	}
+
+	if (notify && bus_register_notifier(&pci_bus_type, &dwc_pcie_pmu_nb))
+		notify = false;
+
+	if (notify)
+		dwc_pcie_pmu_notify = true;
+
+	return 0;
+
+out:
+	pci_dev_put(pdev);
+
+	return ret;
+}
+
+static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+	struct dwc_pcie_pmu *pcie_pmu;
+
+	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
+	if (pcie_pmu->on_cpu == -1)
+		pcie_pmu->on_cpu = cpumask_local_spread(
+			0, dev_to_node(&pcie_pmu->pdev->dev));
+
+	return 0;
+}
+
+static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+	struct dwc_pcie_pmu *pcie_pmu;
+	struct pci_dev *pdev;
+	int node;
+	cpumask_t mask;
+	unsigned int target;
+
+	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
+	/* Nothing to do if this CPU doesn't own the PMU */
+	if (cpu != pcie_pmu->on_cpu)
+		return 0;
+
+	pcie_pmu->on_cpu = -1;
+	pdev = pcie_pmu->pdev;
+	node = dev_to_node(&pdev->dev);
+	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
+	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
+		target = cpumask_any(&mask);
+	else
+		target = cpumask_any_but(cpu_online_mask, cpu);
+
+	if (target >= nr_cpu_ids) {
+		pci_err(pcie_pmu->pdev, "There is no CPU to set\n");
+		return 0;
+	}
+
+	/* This PMU does NOT support interrupt, just migrate context. */
+	perf_pmu_migrate_context(&pcie_pmu->pmu, cpu, target);
+	pcie_pmu->on_cpu = target;
+
+	return 0;
+}
+
+static struct platform_driver dwc_pcie_pmu_driver = {
+	.probe = dwc_pcie_pmu_probe,
+	.driver = {.name = "dwc_pcie_pmu",},
+};
+
+static int __init dwc_pcie_pmu_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+				      "perf/dwc_pcie_pmu:online",
+				      dwc_pcie_pmu_online_cpu,
+				      dwc_pcie_pmu_offline_cpu);
+	if (ret < 0)
+		return ret;
+
+	dwc_pcie_pmu_hp_state = ret;
+
+	ret = platform_driver_register(&dwc_pcie_pmu_driver);
+	if (ret) {
+		cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
+		return ret;
+	}
+
+	dwc_pcie_pmu_dev = platform_device_register_simple(
+				"dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
+	if (IS_ERR(dwc_pcie_pmu_dev)) {
+		platform_driver_unregister(&dwc_pcie_pmu_driver);
+		return PTR_ERR(dwc_pcie_pmu_dev);
+	}
+
+	return 0;
+}
+
+static void __exit dwc_pcie_pmu_exit(void)
+{
+	platform_device_unregister(dwc_pcie_pmu_dev);
+	platform_driver_unregister(&dwc_pcie_pmu_driver);
+	cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
+
+	if (dwc_pcie_pmu_notify)
+		bus_unregister_notifier(&pci_bus_type, &dwc_pcie_pmu_nb);
+}
+
+module_init(dwc_pcie_pmu_init);
+module_exit(dwc_pcie_pmu_exit);
+
+MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
+MODULE_AUTHOR("Shuai Xue <xueshuai@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v8 4/4] MAINTAINERS: add maintainers for DesignWare PCIe PMU driver
  2023-10-17  1:32 [PATCH v8 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support Shuai Xue
                   ` (2 preceding siblings ...)
  2023-10-17  1:32 ` [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver Shuai Xue
@ 2023-10-17  1:32 ` Shuai Xue
  3 siblings, 0 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-17  1:32 UTC (permalink / raw)
  To: chengyou, kaishen, helgaas, yangyicong, will, Jonathan.Cameron,
	baolin.wang, robin.murphy
  Cc: linux-kernel, linux-arm-kernel, linux-pci, rdunlap, mark.rutland,
	zhuo.song, xueshuai, renyu.zj

Add maintainers for Synopsys DesignWare PCIe PMU driver and driver
document.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6c4cce45a09d..71f363f836ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20866,6 +20866,13 @@ L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/dw_mmc*
 
+SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+M:	Shuai Xue <xueshuai@linux.alibaba.com>
+M:	Jing Zhang <renyu.zj@linux.alibaba.com>
+S:	Supported
+F:	Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F:	drivers/perf/dwc_pcie_pmu.c
+
 SYNOPSYS HSDK RESET CONTROLLER DRIVER
 M:	Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
 S:	Supported
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-17  1:32 ` [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver Shuai Xue
@ 2023-10-17  9:16   ` Jonathan Cameron
  2023-10-18  1:19     ` Shuai Xue
  2023-10-19  7:35   ` Yicong Yang
  1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2023-10-17  9:16 UTC (permalink / raw)
  To: Shuai Xue
  Cc: chengyou, kaishen, helgaas, yangyicong, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj

On Tue, 17 Oct 2023 09:32:32 +0800
Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
> controller which implements which implements PMU for performance and
> functional debugging to facilitate system maintenance.
> 
> Document it to provide guidance on how to use it.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

A few minor things inline and one question that I'd like a comment on
for my understanding at least! (why not multiply the counter by 16 and
make the maths simpler?)

With those tidied up,
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks,

Jonathan


> ---
>  .../admin-guide/perf/dwc_pcie_pmu.rst         | 94 +++++++++++++++++++
>  Documentation/admin-guide/perf/index.rst      |  1 +
>  2 files changed, 95 insertions(+)
>  create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> 
> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> new file mode 100644
> index 000000000000..eac1b6f36450
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> @@ -0,0 +1,94 @@
> +======================================================================
> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
> +======================================================================
> +
> +DesignWare Cores (DWC) PCIe PMU
> +===============================
> +
> +The PMU is a PCIe configuration space register block provided by each PCIe Root
> +Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
> +injection, and Statistics).
> +
> +As the name indicates, the RAS DES capability supports system level
> +debugging, AER error injection, and collection of statistics. To facilitate
> +collection of statistics, Synopsys DesignWare Cores PCIe controller
> +provides the following two features:
> +
> +- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
> +  time spent in each low-power LTSSM state) and
> +- one 32-bit counter for Event Counting (error and non-error events for
> +  a specified lane)
> +
> +Note: There is no interrupt for counter overflow.
> +
> +Time Based Analysis
> +-------------------
> +
> +Using this feature you can obtain information regarding RX/TX data
> +throughput and time spent in each low-power LTSSM state by the controller.
> +The PMU measures data in two categories:
> +
> +- Group#0: Percentage of time the controller stays in LTSSM states.
> +- Group#1: Amount of data processed (Units of 16 bytes).
> +
> +Lane Event counters
> +-------------------
> +
> +Using this feature you can obtain Error and Non-Error information in
> +specific lane by the controller. The PMU event is select by:
> +
> +- Group i
> +- Event j within the Group i
> +- and Lane k
The and here is a little confusing. I'd rework as
The PMU event is selected by all of:
- Group i
- Event j within the Group i
- Lane k

> +
> +Some of the event only exist for specific configurations.

events

> +
> +DesignWare Cores (DWC) PCIe PMU Driver
> +=======================================
> +
> +This driver adds PMU devices for each PCIe Root Port named based on the BDF of
> +the Root Port. For example,
> +
> +    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
> +
> +the PMU device name for this Root Port is dwc_rootport_3018.
> +
> +The DWC PCIe PMU driver registers a perf PMU driver, which provides
> +description of available events and configuration options in sysfs, see
> +/sys/bus/event_source/devices/dwc_rootport_{bdf}.
> +
> +The "format" directory describes format of the config fields of the
> +perf_event_attr structure. The "events" directory provides configuration
> +templates for all documented events.  For example,
> +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
> +
> +The "perf list" command shall list the available events from sysfs, e.g.::
> +
> +    $# perf list | grep dwc_rootport
> +    <...>
> +    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
> +    <...>
> +    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
> +
> +Time Based Analysis Event Usage
> +-------------------------------
> +
> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
> +
> +    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
> +
> +The average RX/TX bandwidth can be calculated using the following formula:
> +
> +    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
> +    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Silly question (sorry I didn't raise it earlier) but can we make the interface
more intuitive by just multiplying the counter value at point of read by 16?

> +
> +Lane Event Usage
> +-------------------------------
> +
> +Each lane has the same event set and to avoid generating a list of hundreds
> +of events, the user need to specify the lane ID explicitly, e.g.::
> +
> +    $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
> +
> +The driver does not support sampling, therefore "perf record" will not
> +work. Per-task (without "-a") perf sessions are not supported.
> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index f60be04e4e33..6bc7739fddb5 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -19,6 +19,7 @@ Performance monitor support
>     arm_dsu_pmu
>     thunderx2-pmu
>     alibaba_pmu
> +   dwc_pcie_pmu
>     nvidia-pmu
>     meson-ddr-pmu
>     cxl


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver
  2023-10-17  1:32 ` [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver Shuai Xue
@ 2023-10-17  9:39   ` Jonathan Cameron
  2023-10-18  3:33     ` Shuai Xue
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2023-10-17  9:39 UTC (permalink / raw)
  To: Shuai Xue
  Cc: chengyou, kaishen, helgaas, yangyicong, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj

On Tue, 17 Oct 2023 09:32:34 +0800
Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
> Core controller IP which provides statistics feature. The PMU is a PCIe
> configuration space register block provided by each PCIe Root Port in a
> Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
> injection, and Statistics).
> 
> To facilitate collection of statistics the controller provides the
> following two features for each Root Port:
> 
> - one 64-bit counter for Time Based Analysis (RX/TX data throughput and
>   time spent in each low-power LTSSM state) and
> - one 32-bit counter for Event Counting (error and non-error events for
>   a specified lane)
> 
> Note: There is no interrupt for counter overflow.
> 
> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
> named based the BDF of Root Port. For example,
> 
>     30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
> 
> the PMU device name for this Root Port is dwc_rootport_3018.
> 
> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::

Question follow through from previous patch comment, why not just
multiply it by 16 when you read it?  Is something in perf going to
overflow?

> 
>     $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
> 
> average RX bandwidth can be calculated like this:
> 
>     PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

Most of the comments inline aren't perf driver specific.  To me that
part of it looks fine but I'm only an intermittent reviewer of perf
drivers, so that needs more eyes!

Anyhow, to my eyes this is coming together well but there are a few things
that don't look quite right yet.

...

> +#define DWC_PCIE_LANE_EVENT_MAX_PERIOD		GENMASK_ULL(31, 0)
> +#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD	GENMASK_ULL(63, 0)
> +
> +struct dwc_pcie_pmu {
> +	struct pmu		pmu;
> +	struct pci_dev		*pdev;		/* Root Port device */
> +	u16			ras_des;	/* RAS DES capability offset */

Could call it ras_des_offset as then the comment wouldn't be needed...

> +	u32			nr_lanes;
> +
> +	struct list_head	pmu_node;
> +	struct hlist_node	cpuhp_node;
> +	struct perf_event	*event[DWC_PCIE_EVENT_TYPE_MAX];
> +	int			on_cpu;
> +	bool			registered;
> +};
>


> +static void dwc_pcie_pmu_unregister_pmu(void *data)
> +{
> +	struct dwc_pcie_pmu *pcie_pmu = data;
> +
> +	if (!pcie_pmu->registered)
> +		return;
> +
> +	perf_pmu_unregister(&pcie_pmu->pmu);
> +	pcie_pmu->registered = false;
> +	list_del(&pcie_pmu->pmu_node);
For simplicity or reviewing I'd expect either:
a) Exact reverse order of what happened in probe.
b) A comment on why not.

That probably jus means that the perf_pmu_unregister
should be first.

> +}


...

> +static int dwc_pcie_pmu_probe(struct platform_device *plat_dev)
> +{
> +	struct pci_dev *pdev = NULL;
> +	struct dwc_pcie_pmu *pcie_pmu;
> +	bool notify = false;
> +	char *name;
> +	u32 bdf;
> +	int ret;
> +
> +	/* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
> +	for_each_pci_dev(pdev) {
> +		u16 vsec;
> +		u32 val;
> +
> +		if (!(pci_is_pcie(pdev) &&
> +		      pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
> +			continue;
> +
> +		vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
> +						DWC_PCIE_VSEC_RAS_DES_ID);
> +		if (!vsec)
> +			continue;
> +
> +		pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
> +		if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
> +		    PCI_VNDR_HEADER_LEN(val) != 0x100)

I'm curious - why check the header length?  Paranoia / defensive coding or does it vary?
I'd expect it to be fixed for a given revision. Ideally it should be backwards compatible
so that revs above 4 will always work (possibly with missing features) with rev 4 targeting
software but I guess we can't guaranteed that...

> +			continue;
> +		pci_dbg(pdev,
> +			"Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
> +
> +		bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
> +		name = devm_kasprintf(&plat_dev->dev, GFP_KERNEL, "dwc_rootport_%x",
> +				      bdf);
> +		if (!name) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		/* All checks passed, go go go */
> +		pcie_pmu = devm_kzalloc(&plat_dev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
> +		if (!pcie_pmu) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		pcie_pmu->pdev = pdev;
> +		pcie_pmu->ras_des = vsec;
> +		pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
> +		pcie_pmu->on_cpu = -1;
> +		pcie_pmu->pmu = (struct pmu){
> +			.module		= THIS_MODULE,
> +			.attr_groups	= dwc_pcie_attr_groups,
> +			.capabilities	= PERF_PMU_CAP_NO_EXCLUDE,
> +			.task_ctx_nr	= perf_invalid_context,
> +			.event_init	= dwc_pcie_pmu_event_init,
> +			.add		= dwc_pcie_pmu_event_add,
> +			.del		= dwc_pcie_pmu_event_del,
> +			.start		= dwc_pcie_pmu_event_start,
> +			.stop		= dwc_pcie_pmu_event_stop,
> +			.read		= dwc_pcie_pmu_event_update,
> +		};
> +
> +		/* Add this instance to the list used by the offline callback */
> +		ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
> +					       &pcie_pmu->cpuhp_node);
> +		if (ret) {
> +			pci_err(pcie_pmu->pdev,
> +				"Error %d registering hotplug @%x\n", ret, bdf);
> +			goto out;
> +		}
> +
> +		/* Unwind when platform driver removes */
> +		ret = devm_add_action_or_reset(
> +			&plat_dev->dev, dwc_pcie_pmu_remove_cpuhp_instance,
> +			&pcie_pmu->cpuhp_node);
> +		if (ret)
> +			goto out;
> +
> +		ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
> +		if (ret) {
> +			pci_err(pcie_pmu->pdev,
> +				"Error %d registering PMU @%x\n", ret, bdf);
> +			goto out;
> +		}
> +		ret = devm_add_action_or_reset(
> +			&plat_dev->dev, dwc_pcie_pmu_unregister_pmu, pcie_pmu);
> +		if (ret)
> +			goto out;
This is messy because your devm callback also deals with the bit below.
So if the _or_reset here happens because this call fails, the list_del will
happen on something that was never added.  Simple fix is move this down to after
pcie_pmu->registered given none of the next few line of code can fail anyway.

> +
> +		/* Cache PMU to handle pci device hotplug */
> +		list_add(&pcie_pmu->pmu_node, &dwc_pcie_pmu_head);
> +		pcie_pmu->registered = true;
> +		notify = true;
> +	}
> +
> +	if (notify && bus_register_notifier(&pci_bus_type, &dwc_pcie_pmu_nb))
> +		notify = false;
As mentioned below, I'd expect the bus_unregister_notifier to be in a remove()
callback, or you could use a devm_add_action_or_reset() an a simple callback.
You can register that as

	if (notify) {
		if (bus_register_notifier() == 0)
			ret = devm_add_action_or_reset(unreg_notifier,
						       &dwc_pcie_pmu_nb);
	}
and then you don't need to track if it was registered or not as the
cleanup only happens if it was.


> +
> +	if (notify)
> +		dwc_pcie_pmu_notify = true;
> +
> +	return 0;
> +
> +out:
> +	pci_dev_put(pdev);
> +
> +	return ret;
> +}
> +


...

> +
> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
> +{
> +	struct dwc_pcie_pmu *pcie_pmu;
> +	struct pci_dev *pdev;
> +	int node;
> +	cpumask_t mask;
> +	unsigned int target;
> +
> +	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
> +	/* Nothing to do if this CPU doesn't own the PMU */
> +	if (cpu != pcie_pmu->on_cpu)
> +		return 0;
> +
> +	pcie_pmu->on_cpu = -1;
> +	pdev = pcie_pmu->pdev;
> +	node = dev_to_node(&pdev->dev);
> +	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
> +	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
> +		target = cpumask_any(&mask);
> +	else
> +		target = cpumask_any_but(cpu_online_mask, cpu);
> +
> +	if (target >= nr_cpu_ids) {
> +		pci_err(pcie_pmu->pdev, "There is no CPU to set\n");

You have a local variable for pdev, use it here as well.

> +		return 0;
> +	}
> +
> +	/* This PMU does NOT support interrupt, just migrate context. */
> +	perf_pmu_migrate_context(&pcie_pmu->pmu, cpu, target);
> +	pcie_pmu->on_cpu = target;
> +
> +	return 0;
> +}
> +
> +static struct platform_driver dwc_pcie_pmu_driver = {
> +	.probe = dwc_pcie_pmu_probe,
> +	.driver = {.name = "dwc_pcie_pmu",},
> +};
> +
> +static int __init dwc_pcie_pmu_init(void)
> +{
> +	int ret;
> +
> +	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
> +				      "perf/dwc_pcie_pmu:online",
> +				      dwc_pcie_pmu_online_cpu,
> +				      dwc_pcie_pmu_offline_cpu);
> +	if (ret < 0)
> +		return ret;
> +
> +	dwc_pcie_pmu_hp_state = ret;
> +
> +	ret = platform_driver_register(&dwc_pcie_pmu_driver);
> +	if (ret) {
> +		cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
> +		return ret;
> +	}
> +
> +	dwc_pcie_pmu_dev = platform_device_register_simple(
> +				"dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
> +	if (IS_ERR(dwc_pcie_pmu_dev)) {
> +		platform_driver_unregister(&dwc_pcie_pmu_driver);

Why no cpuhp_remove_multi_state() in this error path?

I'd move to the approach of a gotos and a single error handling block as
that makes this sort of thing easier to spot.

> +		return PTR_ERR(dwc_pcie_pmu_dev);
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit dwc_pcie_pmu_exit(void)
> +{
> +	platform_device_unregister(dwc_pcie_pmu_dev);
> +	platform_driver_unregister(&dwc_pcie_pmu_driver);
> +	cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
> +
> +	if (dwc_pcie_pmu_notify)

If you have something unusual like this a driver module_exit() it definitely
deserves a comment on why.  I'm surprised by this as I'd expect the notifier
to be unregistered in the driver remove so not sure why this is here.
I've lost track of earlier discussions so if this was addressed then all
we need is a comment here for the next person to run into it!

> +		bus_unregister_notifier(&pci_bus_type, &dwc_pcie_pmu_nb);
> +}
> +
> +module_init(dwc_pcie_pmu_init);
> +module_exit(dwc_pcie_pmu_exit);
> +
> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
> +MODULE_AUTHOR("Shuai Xue <xueshuai@linux.alibaba.com>");
> +MODULE_LICENSE("GPL v2");


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-17  9:16   ` Jonathan Cameron
@ 2023-10-18  1:19     ` Shuai Xue
  2023-10-19 11:06       ` Jonathan Cameron
  0 siblings, 1 reply; 15+ messages in thread
From: Shuai Xue @ 2023-10-18  1:19 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: chengyou, kaishen, helgaas, yangyicong, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj



On 2023/10/17 17:16, Jonathan Cameron wrote:
> On Tue, 17 Oct 2023 09:32:32 +0800
> Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> 
>> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
>> controller which implements which implements PMU for performance and
>> functional debugging to facilitate system maintenance.
>>
>> Document it to provide guidance on how to use it.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> A few minor things inline and one question that I'd like a comment on
> for my understanding at least! (why not multiply the counter by 16 and
> make the maths simpler?)
> 
> With those tidied up,
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 

Thank you for providing prompt feedback and valuable comments to me.
(please also see my replies inline)

Best Regards,
Shuai

> 
> 
>> ---
>>  .../admin-guide/perf/dwc_pcie_pmu.rst         | 94 +++++++++++++++++++
>>  Documentation/admin-guide/perf/index.rst      |  1 +
>>  2 files changed, 95 insertions(+)
>>  create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>>
>> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> new file mode 100644
>> index 000000000000..eac1b6f36450
>> --- /dev/null
>> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> @@ -0,0 +1,94 @@
>> +======================================================================
>> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
>> +======================================================================
>> +
>> +DesignWare Cores (DWC) PCIe PMU
>> +===============================
>> +
>> +The PMU is a PCIe configuration space register block provided by each PCIe Root
>> +Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
>> +injection, and Statistics).
>> +
>> +As the name indicates, the RAS DES capability supports system level
>> +debugging, AER error injection, and collection of statistics. To facilitate
>> +collection of statistics, Synopsys DesignWare Cores PCIe controller
>> +provides the following two features:
>> +
>> +- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
>> +  time spent in each low-power LTSSM state) and
>> +- one 32-bit counter for Event Counting (error and non-error events for
>> +  a specified lane)
>> +
>> +Note: There is no interrupt for counter overflow.
>> +
>> +Time Based Analysis
>> +-------------------
>> +
>> +Using this feature you can obtain information regarding RX/TX data
>> +throughput and time spent in each low-power LTSSM state by the controller.
>> +The PMU measures data in two categories:
>> +
>> +- Group#0: Percentage of time the controller stays in LTSSM states.
>> +- Group#1: Amount of data processed (Units of 16 bytes).
>> +
>> +Lane Event counters
>> +-------------------
>> +
>> +Using this feature you can obtain Error and Non-Error information in
>> +specific lane by the controller. The PMU event is select by:
>> +
>> +- Group i
>> +- Event j within the Group i
>> +- and Lane k
> The and here is a little confusing. I'd rework as
> The PMU event is selected by all of:
> - Group i
> - Event j within the Group i
> - Lane k

Will rework it in next version.

> 
>> +
>> +Some of the event only exist for specific configurations.
> 
> events

Sorry for typo, will fix it.

> 
>> +
>> +DesignWare Cores (DWC) PCIe PMU Driver
>> +=======================================
>> +
>> +This driver adds PMU devices for each PCIe Root Port named based on the BDF of
>> +the Root Port. For example,
>> +
>> +    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>> +
>> +the PMU device name for this Root Port is dwc_rootport_3018.
>> +
>> +The DWC PCIe PMU driver registers a perf PMU driver, which provides
>> +description of available events and configuration options in sysfs, see
>> +/sys/bus/event_source/devices/dwc_rootport_{bdf}.
>> +
>> +The "format" directory describes format of the config fields of the
>> +perf_event_attr structure. The "events" directory provides configuration
>> +templates for all documented events.  For example,
>> +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
>> +
>> +The "perf list" command shall list the available events from sysfs, e.g.::
>> +
>> +    $# perf list | grep dwc_rootport
>> +    <...>
>> +    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
>> +    <...>
>> +    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
>> +
>> +Time Based Analysis Event Usage
>> +-------------------------------
>> +
>> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>> +
>> +    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>> +
>> +The average RX/TX bandwidth can be calculated using the following formula:
>> +
>> +    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
>> +    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
> 
> Silly question (sorry I didn't raise it earlier) but can we make the interface
> more intuitive by just multiplying the counter value at point of read by 16?

Really a good suggestion, and it is very convenient for end perf users.
But the unit of 16 is only applied to group#1 as described in Time Based Analysis
section.

So I prefer to left the unit part to end users.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver
  2023-10-17  9:39   ` Jonathan Cameron
@ 2023-10-18  3:33     ` Shuai Xue
  2023-10-19  8:05       ` Yicong Yang
  0 siblings, 1 reply; 15+ messages in thread
From: Shuai Xue @ 2023-10-18  3:33 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: chengyou, kaishen, helgaas, yangyicong, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj



On 2023/10/17 17:39, Jonathan Cameron wrote:
> On Tue, 17 Oct 2023 09:32:34 +0800
> Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> 
>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>> Core controller IP which provides statistics feature. The PMU is a PCIe
>> configuration space register block provided by each PCIe Root Port in a
>> Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
>> injection, and Statistics).
>>
>> To facilitate collection of statistics the controller provides the
>> following two features for each Root Port:
>>
>> - one 64-bit counter for Time Based Analysis (RX/TX data throughput and
>>   time spent in each low-power LTSSM state) and
>> - one 32-bit counter for Event Counting (error and non-error events for
>>   a specified lane)
>>
>> Note: There is no interrupt for counter overflow.
>>
>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>> named based the BDF of Root Port. For example,
>>
>>     30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>
>> the PMU device name for this Root Port is dwc_rootport_3018.
>>
>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
> 
> Question follow through from previous patch comment, why not just
> multiply it by 16 when you read it?  Is something in perf going to
> overflow?

As we discussed in Path 1/4, the unit 16 is not general for all groups of
Time Based Analysis, I prefer to leave unit part to end perf users.

> 
>>
>>     $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>
>> average RX bandwidth can be calculated like this:
>>
>>     PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> 
> Most of the comments inline aren't perf driver specific.  To me that
> part of it looks fine but I'm only an intermittent reviewer of perf
> drivers, so that needs more eyes!
> 
> Anyhow, to my eyes this is coming together well but there are a few things
> that don't look quite right yet.

Thank you for valuable comments, I will try my best to address them :)

> 
> ...
> 
>> +#define DWC_PCIE_LANE_EVENT_MAX_PERIOD		GENMASK_ULL(31, 0)
>> +#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD	GENMASK_ULL(63, 0)
>> +
>> +struct dwc_pcie_pmu {
>> +	struct pmu		pmu;
>> +	struct pci_dev		*pdev;		/* Root Port device */
>> +	u16			ras_des;	/* RAS DES capability offset */
> 
> Could call it ras_des_offset as then the comment wouldn't be needed...

Agreed, will fix it.

> 
>> +	u32			nr_lanes;
>> +
>> +	struct list_head	pmu_node;
>> +	struct hlist_node	cpuhp_node;
>> +	struct perf_event	*event[DWC_PCIE_EVENT_TYPE_MAX];
>> +	int			on_cpu;
>> +	bool			registered;
>> +};
>>
> 
> 
>> +static void dwc_pcie_pmu_unregister_pmu(void *data)
>> +{
>> +	struct dwc_pcie_pmu *pcie_pmu = data;
>> +
>> +	if (!pcie_pmu->registered)
>> +		return;
>> +
>> +	perf_pmu_unregister(&pcie_pmu->pmu);
>> +	pcie_pmu->registered = false;
>> +	list_del(&pcie_pmu->pmu_node);
> For simplicity or reviewing I'd expect either:
> a) Exact reverse order of what happened in probe.
> b) A comment on why not.
> 
> That probably jus means that the perf_pmu_unregister
> should be first.

I guess you mean perf_pmu_unregister should be last?
Bellow is the exact reverse order of what happened in probe.

	pcie_pmu->registered = false;
	list_del(&pcie_pmu->pmu_node);
	perf_pmu_unregister(&pcie_pmu->pmu);

> 
>> +}
> 
> 
> ...
> 
>> +static int dwc_pcie_pmu_probe(struct platform_device *plat_dev)
>> +{
>> +	struct pci_dev *pdev = NULL;
>> +	struct dwc_pcie_pmu *pcie_pmu;
>> +	bool notify = false;
>> +	char *name;
>> +	u32 bdf;
>> +	int ret;
>> +
>> +	/* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
>> +	for_each_pci_dev(pdev) {
>> +		u16 vsec;
>> +		u32 val;
>> +
>> +		if (!(pci_is_pcie(pdev) &&
>> +		      pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
>> +			continue;
>> +
>> +		vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
>> +						DWC_PCIE_VSEC_RAS_DES_ID);
>> +		if (!vsec)
>> +			continue;
>> +
>> +		pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
>> +		if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
>> +		    PCI_VNDR_HEADER_LEN(val) != 0x100)
> 
> I'm curious - why check the header length?  Paranoia / defensive coding or does it vary?
> I'd expect it to be fixed for a given revision. Ideally it should be backwards compatible
> so that revs above 4 will always work (possibly with missing features) with rev 4 targeting
> software but I guess we can't guaranteed that...

Kind of defensive coding. Agreed, I will remove the hender length check to make the driver more
compatible.
> 
>> +			continue;
>> +		pci_dbg(pdev,
>> +			"Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
>> +
>> +		bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
>> +		name = devm_kasprintf(&plat_dev->dev, GFP_KERNEL, "dwc_rootport_%x",
>> +				      bdf);
>> +		if (!name) {
>> +			ret = -ENOMEM;
>> +			goto out;
>> +		}
>> +
>> +		/* All checks passed, go go go */
>> +		pcie_pmu = devm_kzalloc(&plat_dev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
>> +		if (!pcie_pmu) {
>> +			ret = -ENOMEM;
>> +			goto out;
>> +		}
>> +
>> +		pcie_pmu->pdev = pdev;
>> +		pcie_pmu->ras_des = vsec;
>> +		pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
>> +		pcie_pmu->on_cpu = -1;
>> +		pcie_pmu->pmu = (struct pmu){
>> +			.module		= THIS_MODULE,
>> +			.attr_groups	= dwc_pcie_attr_groups,
>> +			.capabilities	= PERF_PMU_CAP_NO_EXCLUDE,
>> +			.task_ctx_nr	= perf_invalid_context,
>> +			.event_init	= dwc_pcie_pmu_event_init,
>> +			.add		= dwc_pcie_pmu_event_add,
>> +			.del		= dwc_pcie_pmu_event_del,
>> +			.start		= dwc_pcie_pmu_event_start,
>> +			.stop		= dwc_pcie_pmu_event_stop,
>> +			.read		= dwc_pcie_pmu_event_update,
>> +		};
>> +
>> +		/* Add this instance to the list used by the offline callback */
>> +		ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
>> +					       &pcie_pmu->cpuhp_node);
>> +		if (ret) {
>> +			pci_err(pcie_pmu->pdev,
>> +				"Error %d registering hotplug @%x\n", ret, bdf);
>> +			goto out;
>> +		}
>> +
>> +		/* Unwind when platform driver removes */
>> +		ret = devm_add_action_or_reset(
>> +			&plat_dev->dev, dwc_pcie_pmu_remove_cpuhp_instance,
>> +			&pcie_pmu->cpuhp_node);
>> +		if (ret)
>> +			goto out;
>> +
>> +		ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>> +		if (ret) {
>> +			pci_err(pcie_pmu->pdev,
>> +				"Error %d registering PMU @%x\n", ret, bdf);
>> +			goto out;
>> +		}
>> +		ret = devm_add_action_or_reset(
>> +			&plat_dev->dev, dwc_pcie_pmu_unregister_pmu, pcie_pmu);
>> +		if (ret)
>> +			goto out;
> This is messy because your devm callback also deals with the bit below.
> So if the _or_reset here happens because this call fails, the list_del will
> happen on something that was never added.  Simple fix is move this down to after
> pcie_pmu->registered given none of the next few line of code can fail anyway.

Goot point. I missed the fact that if devm_add_action_or_reset() fails, the action
dwc_pcie_pmu_unregister_pmu() will be called right away.

Will fix it in next version.


> 
>> +
>> +		/* Cache PMU to handle pci device hotplug */
>> +		list_add(&pcie_pmu->pmu_node, &dwc_pcie_pmu_head);
>> +		pcie_pmu->registered = true;
>> +		notify = true;
>> +	}
>> +
>> +	if (notify && bus_register_notifier(&pci_bus_type, &dwc_pcie_pmu_nb))
>> +		notify = false;
> As mentioned below, I'd expect the bus_unregister_notifier to be in a remove()
> callback, or you could use a devm_add_action_or_reset() an a simple callback.
> You can register that as
> 
> 	if (notify) {
> 		if (bus_register_notifier() == 0)
> 			ret = devm_add_action_or_reset(unreg_notifier,
> 						       &dwc_pcie_pmu_nb);
> 	}
> and then you don't need to track if it was registered or not as the
> cleanup only happens if it was.

Good suggestion. Will also use devm_add_action_or_reset() to unwind.

> 
> 
>> +
>> +	if (notify)
>> +		dwc_pcie_pmu_notify = true;
>> +
>> +	return 0;
>> +
>> +out:
>> +	pci_dev_put(pdev);
>> +
>> +	return ret;
>> +}
>> +
> 
> 
> ...
> 
>> +
>> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>> +{
>> +	struct dwc_pcie_pmu *pcie_pmu;
>> +	struct pci_dev *pdev;
>> +	int node;
>> +	cpumask_t mask;
>> +	unsigned int target;
>> +
>> +	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
>> +	/* Nothing to do if this CPU doesn't own the PMU */
>> +	if (cpu != pcie_pmu->on_cpu)
>> +		return 0;
>> +
>> +	pcie_pmu->on_cpu = -1;
>> +	pdev = pcie_pmu->pdev;
>> +	node = dev_to_node(&pdev->dev);
>> +	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>> +	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>> +		target = cpumask_any(&mask);
>> +	else
>> +		target = cpumask_any_but(cpu_online_mask, cpu);
>> +
>> +	if (target >= nr_cpu_ids) {
>> +		pci_err(pcie_pmu->pdev, "There is no CPU to set\n");
> 
> You have a local variable for pdev, use it here as well.

Will use local pdev directly.

> 
>> +		return 0;
>> +	}
>> +
>> +	/* This PMU does NOT support interrupt, just migrate context. */
>> +	perf_pmu_migrate_context(&pcie_pmu->pmu, cpu, target);
>> +	pcie_pmu->on_cpu = target;
>> +
>> +	return 0;
>> +}
>> +
>> +static struct platform_driver dwc_pcie_pmu_driver = {
>> +	.probe = dwc_pcie_pmu_probe,
>> +	.driver = {.name = "dwc_pcie_pmu",},
>> +};
>> +
>> +static int __init dwc_pcie_pmu_init(void)
>> +{
>> +	int ret;
>> +
>> +	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
>> +				      "perf/dwc_pcie_pmu:online",
>> +				      dwc_pcie_pmu_online_cpu,
>> +				      dwc_pcie_pmu_offline_cpu);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	dwc_pcie_pmu_hp_state = ret;
>> +
>> +	ret = platform_driver_register(&dwc_pcie_pmu_driver);
>> +	if (ret) {
>> +		cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>> +		return ret;
>> +	}
>> +
>> +	dwc_pcie_pmu_dev = platform_device_register_simple(
>> +				"dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
>> +	if (IS_ERR(dwc_pcie_pmu_dev)) {
>> +		platform_driver_unregister(&dwc_pcie_pmu_driver);
> 
> Why no cpuhp_remove_multi_state() in this error path?
> 
> I'd move to the approach of a gotos and a single error handling block as
> that makes this sort of thing easier to spot.

Agreed, will use the approach of a gotos to handle errors.

> 
>> +		return PTR_ERR(dwc_pcie_pmu_dev);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void __exit dwc_pcie_pmu_exit(void)
>> +{
>> +	platform_device_unregister(dwc_pcie_pmu_dev);
>> +	platform_driver_unregister(&dwc_pcie_pmu_driver);
>> +	cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>> +
>> +	if (dwc_pcie_pmu_notify)
> 
> If you have something unusual like this a driver module_exit() it definitely
> deserves a comment on why.  I'm surprised by this as I'd expect the notifier
> to be unregistered in the driver remove so not sure why this is here.
> I've lost track of earlier discussions so if this was addressed then all
> we need is a comment here for the next person to run into it!

All replied above, I will unregistered the notifier by devm_add_action_or_reset().

I am curious about that what the difference between unregistered in module_exit()
and remove()?

> 
>> +		bus_unregister_notifier(&pci_bus_type, &dwc_pcie_pmu_nb);
>> +}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-17  1:32 ` [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver Shuai Xue
  2023-10-17  9:16   ` Jonathan Cameron
@ 2023-10-19  7:35   ` Yicong Yang
  2023-10-19 11:51     ` Shuai Xue
  1 sibling, 1 reply; 15+ messages in thread
From: Yicong Yang @ 2023-10-19  7:35 UTC (permalink / raw)
  To: Shuai Xue
  Cc: robin.murphy, baolin.wang, Jonathan.Cameron, will, helgaas,
	kaishen, chengyou, yangyicong, linux-kernel, linux-arm-kernel,
	linux-pci, rdunlap, mark.rutland, zhuo.song, renyu.zj

On 2023/10/17 9:32, Shuai Xue wrote:
> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
> controller which implements which implements PMU for performance and
> functional debugging to facilitate system maintenance.

Double "which implements"?

> 
> Document it to provide guidance on how to use it.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Others look good to me.

Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>

> ---
>  .../admin-guide/perf/dwc_pcie_pmu.rst         | 94 +++++++++++++++++++
>  Documentation/admin-guide/perf/index.rst      |  1 +
>  2 files changed, 95 insertions(+)
>  create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> 
> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> new file mode 100644
> index 000000000000..eac1b6f36450
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> @@ -0,0 +1,94 @@
> +======================================================================
> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
> +======================================================================
> +
> +DesignWare Cores (DWC) PCIe PMU
> +===============================
> +
> +The PMU is a PCIe configuration space register block provided by each PCIe Root
> +Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
> +injection, and Statistics).
> +
> +As the name indicates, the RAS DES capability supports system level
> +debugging, AER error injection, and collection of statistics. To facilitate
> +collection of statistics, Synopsys DesignWare Cores PCIe controller
> +provides the following two features:
> +
> +- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
> +  time spent in each low-power LTSSM state) and
> +- one 32-bit counter for Event Counting (error and non-error events for
> +  a specified lane)
> +
> +Note: There is no interrupt for counter overflow.
> +
> +Time Based Analysis
> +-------------------
> +
> +Using this feature you can obtain information regarding RX/TX data
> +throughput and time spent in each low-power LTSSM state by the controller.
> +The PMU measures data in two categories:
> +
> +- Group#0: Percentage of time the controller stays in LTSSM states.
> +- Group#1: Amount of data processed (Units of 16 bytes).
> +
> +Lane Event counters
> +-------------------
> +
> +Using this feature you can obtain Error and Non-Error information in
> +specific lane by the controller. The PMU event is select by:
> +
> +- Group i
> +- Event j within the Group i
> +- and Lane k
> +
> +Some of the event only exist for specific configurations.
> +
> +DesignWare Cores (DWC) PCIe PMU Driver
> +=======================================
> +
> +This driver adds PMU devices for each PCIe Root Port named based on the BDF of
> +the Root Port. For example,
> +
> +    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
> +
> +the PMU device name for this Root Port is dwc_rootport_3018.
> +
> +The DWC PCIe PMU driver registers a perf PMU driver, which provides
> +description of available events and configuration options in sysfs, see
> +/sys/bus/event_source/devices/dwc_rootport_{bdf}.
> +
> +The "format" directory describes format of the config fields of the
> +perf_event_attr structure. The "events" directory provides configuration
> +templates for all documented events.  For example,
> +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
> +
> +The "perf list" command shall list the available events from sysfs, e.g.::
> +
> +    $# perf list | grep dwc_rootport
> +    <...>
> +    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
> +    <...>
> +    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
> +
> +Time Based Analysis Event Usage
> +-------------------------------
> +
> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
> +
> +    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
> +
> +The average RX/TX bandwidth can be calculated using the following formula:
> +
> +    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
> +    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
> +
> +Lane Event Usage
> +-------------------------------
> +
> +Each lane has the same event set and to avoid generating a list of hundreds
> +of events, the user need to specify the lane ID explicitly, e.g.::
> +
> +    $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
> +
> +The driver does not support sampling, therefore "perf record" will not
> +work. Per-task (without "-a") perf sessions are not supported.
> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index f60be04e4e33..6bc7739fddb5 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -19,6 +19,7 @@ Performance monitor support
>     arm_dsu_pmu
>     thunderx2-pmu
>     alibaba_pmu
> +   dwc_pcie_pmu
>     nvidia-pmu
>     meson-ddr-pmu
>     cxl
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver
  2023-10-18  3:33     ` Shuai Xue
@ 2023-10-19  8:05       ` Yicong Yang
  2023-10-19 12:02         ` Shuai Xue
  0 siblings, 1 reply; 15+ messages in thread
From: Yicong Yang @ 2023-10-19  8:05 UTC (permalink / raw)
  To: Shuai Xue, Jonathan Cameron
  Cc: yangyicong, chengyou, kaishen, helgaas, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj

On 2023/10/18 11:33, Shuai Xue wrote:
> 
> 
> On 2023/10/17 17:39, Jonathan Cameron wrote:
>> On Tue, 17 Oct 2023 09:32:34 +0800
>> Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>
>>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>>> Core controller IP which provides statistics feature. The PMU is a PCIe
>>> configuration space register block provided by each PCIe Root Port in a
>>> Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
>>> injection, and Statistics).
>>>
>>> To facilitate collection of statistics the controller provides the
>>> following two features for each Root Port:
>>>
>>> - one 64-bit counter for Time Based Analysis (RX/TX data throughput and
>>>   time spent in each low-power LTSSM state) and
>>> - one 32-bit counter for Event Counting (error and non-error events for
>>>   a specified lane)
>>>
>>> Note: There is no interrupt for counter overflow.
>>>
>>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>>> named based the BDF of Root Port. For example,
>>>
>>>     30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>>
>>> the PMU device name for this Root Port is dwc_rootport_3018.
>>>
>>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>
>> Question follow through from previous patch comment, why not just
>> multiply it by 16 when you read it?  Is something in perf going to
>> overflow?
> 
> As we discussed in Path 1/4, the unit 16 is not general for all groups of
> Time Based Analysis, I prefer to leave unit part to end perf users.
> 
>>
>>>
>>>     $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>>
>>> average RX bandwidth can be calculated like this:
>>>
>>>     PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>
>> Most of the comments inline aren't perf driver specific.  To me that
>> part of it looks fine but I'm only an intermittent reviewer of perf
>> drivers, so that needs more eyes!
>>
>> Anyhow, to my eyes this is coming together well but there are a few things
>> that don't look quite right yet.
> 
> Thank you for valuable comments, I will try my best to address them :)
> 
>>
>> ...
>>
>>> +#define DWC_PCIE_LANE_EVENT_MAX_PERIOD		GENMASK_ULL(31, 0)
>>> +#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD	GENMASK_ULL(63, 0)
>>> +
>>> +struct dwc_pcie_pmu {
>>> +	struct pmu		pmu;
>>> +	struct pci_dev		*pdev;		/* Root Port device */
>>> +	u16			ras_des;	/* RAS DES capability offset */
>>
>> Could call it ras_des_offset as then the comment wouldn't be needed...
> 
> Agreed, will fix it.
> 
>>
>>> +	u32			nr_lanes;
>>> +
>>> +	struct list_head	pmu_node;
>>> +	struct hlist_node	cpuhp_node;
>>> +	struct perf_event	*event[DWC_PCIE_EVENT_TYPE_MAX];
>>> +	int			on_cpu;
>>> +	bool			registered;
>>> +};
>>>
>>
>>
>>> +static void dwc_pcie_pmu_unregister_pmu(void *data)
>>> +{
>>> +	struct dwc_pcie_pmu *pcie_pmu = data;
>>> +
>>> +	if (!pcie_pmu->registered)
>>> +		return;
>>> +
>>> +	perf_pmu_unregister(&pcie_pmu->pmu);
>>> +	pcie_pmu->registered = false;
>>> +	list_del(&pcie_pmu->pmu_node);
>> For simplicity or reviewing I'd expect either:
>> a) Exact reverse order of what happened in probe.
>> b) A comment on why not.
>>
>> That probably jus means that the perf_pmu_unregister
>> should be first.
> 
> I guess you mean perf_pmu_unregister should be last?
> Bellow is the exact reverse order of what happened in probe.
> 
> 	pcie_pmu->registered = false;
> 	list_del(&pcie_pmu->pmu_node);
> 	perf_pmu_unregister(&pcie_pmu->pmu);
> 
>>
>>> +}
>>
>>
>> ...
>>
>>> +static int dwc_pcie_pmu_probe(struct platform_device *plat_dev)
>>> +{
>>> +	struct pci_dev *pdev = NULL;
>>> +	struct dwc_pcie_pmu *pcie_pmu;
>>> +	bool notify = false;
>>> +	char *name;
>>> +	u32 bdf;
>>> +	int ret;
>>> +
>>> +	/* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
>>> +	for_each_pci_dev(pdev) {
>>> +		u16 vsec;
>>> +		u32 val;
>>> +
>>> +		if (!(pci_is_pcie(pdev) &&
>>> +		      pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
>>> +			continue;
>>> +
>>> +		vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
>>> +						DWC_PCIE_VSEC_RAS_DES_ID);
>>> +		if (!vsec)
>>> +			continue;
>>> +
>>> +		pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
>>> +		if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
>>> +		    PCI_VNDR_HEADER_LEN(val) != 0x100)
>>
>> I'm curious - why check the header length?  Paranoia / defensive coding or does it vary?
>> I'd expect it to be fixed for a given revision. Ideally it should be backwards compatible
>> so that revs above 4 will always work (possibly with missing features) with rev 4 targeting
>> software but I guess we can't guaranteed that...
> 
> Kind of defensive coding. Agreed, I will remove the hender length check to make the driver more
> compatible.
>>
>>> +			continue;
>>> +		pci_dbg(pdev,
>>> +			"Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
>>> +
>>> +		bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
>>> +		name = devm_kasprintf(&plat_dev->dev, GFP_KERNEL, "dwc_rootport_%x",
>>> +				      bdf);
>>> +		if (!name) {
>>> +			ret = -ENOMEM;
>>> +			goto out;
>>> +		}
>>> +
>>> +		/* All checks passed, go go go */
>>> +		pcie_pmu = devm_kzalloc(&plat_dev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
>>> +		if (!pcie_pmu) {
>>> +			ret = -ENOMEM;
>>> +			goto out;
>>> +		}
>>> +
>>> +		pcie_pmu->pdev = pdev;
>>> +		pcie_pmu->ras_des = vsec;
>>> +		pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
>>> +		pcie_pmu->on_cpu = -1;
>>> +		pcie_pmu->pmu = (struct pmu){
>>> +			.module		= THIS_MODULE,
>>> +			.attr_groups	= dwc_pcie_attr_groups,
>>> +			.capabilities	= PERF_PMU_CAP_NO_EXCLUDE,
>>> +			.task_ctx_nr	= perf_invalid_context,
>>> +			.event_init	= dwc_pcie_pmu_event_init,
>>> +			.add		= dwc_pcie_pmu_event_add,
>>> +			.del		= dwc_pcie_pmu_event_del,
>>> +			.start		= dwc_pcie_pmu_event_start,
>>> +			.stop		= dwc_pcie_pmu_event_stop,
>>> +			.read		= dwc_pcie_pmu_event_update,
>>> +		};
>>> +
>>> +		/* Add this instance to the list used by the offline callback */
>>> +		ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
>>> +					       &pcie_pmu->cpuhp_node);
>>> +		if (ret) {
>>> +			pci_err(pcie_pmu->pdev,
>>> +				"Error %d registering hotplug @%x\n", ret, bdf);
>>> +			goto out;
>>> +		}
>>> +
>>> +		/* Unwind when platform driver removes */
>>> +		ret = devm_add_action_or_reset(
>>> +			&plat_dev->dev, dwc_pcie_pmu_remove_cpuhp_instance,
>>> +			&pcie_pmu->cpuhp_node);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>>> +		if (ret) {
>>> +			pci_err(pcie_pmu->pdev,
>>> +				"Error %d registering PMU @%x\n", ret, bdf);
>>> +			goto out;
>>> +		}
>>> +		ret = devm_add_action_or_reset(
>>> +			&plat_dev->dev, dwc_pcie_pmu_unregister_pmu, pcie_pmu);
>>> +		if (ret)
>>> +			goto out;
>> This is messy because your devm callback also deals with the bit below.
>> So if the _or_reset here happens because this call fails, the list_del will
>> happen on something that was never added.  Simple fix is move this down to after
>> pcie_pmu->registered given none of the next few line of code can fail anyway.
> 
> Goot point. I missed the fact that if devm_add_action_or_reset() fails, the action
> dwc_pcie_pmu_unregister_pmu() will be called right away.
> 
> Will fix it in next version.
> 
> 
>>
>>> +
>>> +		/* Cache PMU to handle pci device hotplug */
>>> +		list_add(&pcie_pmu->pmu_node, &dwc_pcie_pmu_head);
>>> +		pcie_pmu->registered = true;
>>> +		notify = true;
>>> +	}
>>> +
>>> +	if (notify && bus_register_notifier(&pci_bus_type, &dwc_pcie_pmu_nb))
>>> +		notify = false;
>> As mentioned below, I'd expect the bus_unregister_notifier to be in a remove()
>> callback, or you could use a devm_add_action_or_reset() an a simple callback.
>> You can register that as
>>
>> 	if (notify) {
>> 		if (bus_register_notifier() == 0)
>> 			ret = devm_add_action_or_reset(unreg_notifier,
>> 						       &dwc_pcie_pmu_nb);
>> 	}
>> and then you don't need to track if it was registered or not as the
>> cleanup only happens if it was.
> 
> Good suggestion. Will also use devm_add_action_or_reset() to unwind.
> 
>>
>>
>>> +
>>> +	if (notify)
>>> +		dwc_pcie_pmu_notify = true;
>>> +
>>> +	return 0;
>>> +
>>> +out:
>>> +	pci_dev_put(pdev);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>
>>
>> ...
>>
>>> +
>>> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>>> +{
>>> +	struct dwc_pcie_pmu *pcie_pmu;
>>> +	struct pci_dev *pdev;
>>> +	int node;
>>> +	cpumask_t mask;
>>> +	unsigned int target;
>>> +
>>> +	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
>>> +	/* Nothing to do if this CPU doesn't own the PMU */
>>> +	if (cpu != pcie_pmu->on_cpu)
>>> +		return 0;
>>> +
>>> +	pcie_pmu->on_cpu = -1;
>>> +	pdev = pcie_pmu->pdev;
>>> +	node = dev_to_node(&pdev->dev);
>>> +	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>>> +	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>>> +		target = cpumask_any(&mask);
>>> +	else
>>> +		target = cpumask_any_but(cpu_online_mask, cpu);
>>> +
>>> +	if (target >= nr_cpu_ids) {
>>> +		pci_err(pcie_pmu->pdev, "There is no CPU to set\n");
>>
>> You have a local variable for pdev, use it here as well.
> 
> Will use local pdev directly.
> 
>>
>>> +		return 0;
>>> +	}
>>> +
>>> +	/* This PMU does NOT support interrupt, just migrate context. */
>>> +	perf_pmu_migrate_context(&pcie_pmu->pmu, cpu, target);
>>> +	pcie_pmu->on_cpu = target;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static struct platform_driver dwc_pcie_pmu_driver = {
>>> +	.probe = dwc_pcie_pmu_probe,
>>> +	.driver = {.name = "dwc_pcie_pmu",},
>>> +};
>>> +
>>> +static int __init dwc_pcie_pmu_init(void)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
>>> +				      "perf/dwc_pcie_pmu:online",
>>> +				      dwc_pcie_pmu_online_cpu,
>>> +				      dwc_pcie_pmu_offline_cpu);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	dwc_pcie_pmu_hp_state = ret;
>>> +
>>> +	ret = platform_driver_register(&dwc_pcie_pmu_driver);
>>> +	if (ret) {
>>> +		cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>>> +		return ret;
>>> +	}
>>> +
>>> +	dwc_pcie_pmu_dev = platform_device_register_simple(
>>> +				"dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
>>> +	if (IS_ERR(dwc_pcie_pmu_dev)) {
>>> +		platform_driver_unregister(&dwc_pcie_pmu_driver);
>>
>> Why no cpuhp_remove_multi_state() in this error path?
>>
>> I'd move to the approach of a gotos and a single error handling block as
>> that makes this sort of thing easier to spot.
> 
> Agreed, will use the approach of a gotos to handle errors.
> 
>>
>>> +		return PTR_ERR(dwc_pcie_pmu_dev);
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void __exit dwc_pcie_pmu_exit(void)
>>> +{
>>> +	platform_device_unregister(dwc_pcie_pmu_dev);
>>> +	platform_driver_unregister(&dwc_pcie_pmu_driver);
>>> +	cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>>> +
>>> +	if (dwc_pcie_pmu_notify)
>>
>> If you have something unusual like this a driver module_exit() it definitely
>> deserves a comment on why.  I'm surprised by this as I'd expect the notifier
>> to be unregistered in the driver remove so not sure why this is here.
>> I've lost track of earlier discussions so if this was addressed then all
>> we need is a comment here for the next person to run into it!
> 
> All replied above, I will unregistered the notifier by devm_add_action_or_reset().
> 
> I am curious about that what the difference between unregistered in module_exit()
> and remove()?
> 

From my understanding, if you register it in probe() then should undo it in remove().
Otherwise you should register it in module_init(). Just make them coupled to make
sure cleanup the resources correctly.

This driver is a bit different since device and driver are created in module_init()
so will works fine in most cases, because the device/driver removal will happens the
same time when unloading the module. However if manually unbind the driver and device
without unloading the module, we'll miss to unregister the notifier in the currently
implementation.

>>
>>> +		bus_unregister_notifier(&pci_bus_type, &dwc_pcie_pmu_nb);
>>> +}
> 
> .
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-18  1:19     ` Shuai Xue
@ 2023-10-19 11:06       ` Jonathan Cameron
  2023-10-19 11:56         ` Shuai Xue
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Cameron @ 2023-10-19 11:06 UTC (permalink / raw)
  To: Shuai Xue
  Cc: chengyou, kaishen, helgaas, yangyicong, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj

On Wed, 18 Oct 2023 09:19:51 +0800
Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> On 2023/10/17 17:16, Jonathan Cameron wrote:
> > On Tue, 17 Oct 2023 09:32:32 +0800
> > Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> >   
> >> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
> >> controller which implements which implements PMU for performance and
> >> functional debugging to facilitate system maintenance.
> >>
> >> Document it to provide guidance on how to use it.
> >>
> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> >> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>  
> > 
> > A few minor things inline and one question that I'd like a comment on
> > for my understanding at least! (why not multiply the counter by 16 and
> > make the maths simpler?)
> > 
> > With those tidied up,
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >   
> 
> Thank you for providing prompt feedback and valuable comments to me.
> (please also see my replies inline)
> 
> Best Regards,
> Shuai
> 
> > 
> >   
> >> ---
> >>  .../admin-guide/perf/dwc_pcie_pmu.rst         | 94 +++++++++++++++++++
> >>  Documentation/admin-guide/perf/index.rst      |  1 +
> >>  2 files changed, 95 insertions(+)
> >>  create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> >>
> >> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> >> new file mode 100644
> >> index 000000000000..eac1b6f36450
> >> --- /dev/null
> >> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> >> @@ -0,0 +1,94 @@
> >> +======================================================================
> >> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
> >> +======================================================================
> >> +
> >> +DesignWare Cores (DWC) PCIe PMU
> >> +===============================
> >> +
> >> +The PMU is a PCIe configuration space register block provided by each PCIe Root
> >> +Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
> >> +injection, and Statistics).
> >> +
> >> +As the name indicates, the RAS DES capability supports system level
> >> +debugging, AER error injection, and collection of statistics. To facilitate
> >> +collection of statistics, Synopsys DesignWare Cores PCIe controller
> >> +provides the following two features:
> >> +
> >> +- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
> >> +  time spent in each low-power LTSSM state) and
> >> +- one 32-bit counter for Event Counting (error and non-error events for
> >> +  a specified lane)
> >> +
> >> +Note: There is no interrupt for counter overflow.
> >> +
> >> +Time Based Analysis
> >> +-------------------
> >> +
> >> +Using this feature you can obtain information regarding RX/TX data
> >> +throughput and time spent in each low-power LTSSM state by the controller.
> >> +The PMU measures data in two categories:
> >> +
> >> +- Group#0: Percentage of time the controller stays in LTSSM states.
> >> +- Group#1: Amount of data processed (Units of 16 bytes).
> >> +
> >> +Lane Event counters
> >> +-------------------
> >> +
> >> +Using this feature you can obtain Error and Non-Error information in
> >> +specific lane by the controller. The PMU event is select by:
> >> +
> >> +- Group i
> >> +- Event j within the Group i
> >> +- and Lane k  
> > The and here is a little confusing. I'd rework as
> > The PMU event is selected by all of:
> > - Group i
> > - Event j within the Group i
> > - Lane k  
> 
> Will rework it in next version.
> 
> >   
> >> +
> >> +Some of the event only exist for specific configurations.  
> > 
> > events  
> 
> Sorry for typo, will fix it.
> 
> >   
> >> +
> >> +DesignWare Cores (DWC) PCIe PMU Driver
> >> +=======================================
> >> +
> >> +This driver adds PMU devices for each PCIe Root Port named based on the BDF of
> >> +the Root Port. For example,
> >> +
> >> +    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
> >> +
> >> +the PMU device name for this Root Port is dwc_rootport_3018.
> >> +
> >> +The DWC PCIe PMU driver registers a perf PMU driver, which provides
> >> +description of available events and configuration options in sysfs, see
> >> +/sys/bus/event_source/devices/dwc_rootport_{bdf}.
> >> +
> >> +The "format" directory describes format of the config fields of the
> >> +perf_event_attr structure. The "events" directory provides configuration
> >> +templates for all documented events.  For example,
> >> +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
> >> +
> >> +The "perf list" command shall list the available events from sysfs, e.g.::
> >> +
> >> +    $# perf list | grep dwc_rootport
> >> +    <...>
> >> +    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
> >> +    <...>
> >> +    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
> >> +
> >> +Time Based Analysis Event Usage
> >> +-------------------------------
> >> +
> >> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
> >> +
> >> +    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
> >> +
> >> +The average RX/TX bandwidth can be calculated using the following formula:
> >> +
> >> +    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
> >> +    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window  
> > 
> > Silly question (sorry I didn't raise it earlier) but can we make the interface
> > more intuitive by just multiplying the counter value at point of read by 16?  
> 
> Really a good suggestion, and it is very convenient for end perf users.
> But the unit of 16 is only applied to group#1 as described in Time Based Analysis
> section.

How hard would it be to just apply it to those events?
Userspace doesn't care what the hardware does underneath - it just wants to get
moderately intuitive data back. Having the end user deal with this oddity + even
the need to document it seems to me to be unnecessary burden given how simple it
is (I assume) to remove the oddity.

> 
> So I prefer to left the unit part to end users.
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-19  7:35   ` Yicong Yang
@ 2023-10-19 11:51     ` Shuai Xue
  0 siblings, 0 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-19 11:51 UTC (permalink / raw)
  To: Yicong Yang
  Cc: robin.murphy, baolin.wang, Jonathan.Cameron, will, helgaas,
	kaishen, chengyou, yangyicong, linux-kernel, linux-arm-kernel,
	linux-pci, rdunlap, mark.rutland, zhuo.song, renyu.zj



On 2023/10/19 15:35, Yicong Yang wrote:
> On 2023/10/17 9:32, Shuai Xue wrote:
>> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
>> controller which implements which implements PMU for performance and
>> functional debugging to facilitate system maintenance.
> 
> Double "which implements"?

Sorry for the typo, will fix it.

> 
>>
>> Document it to provide guidance on how to use it.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> Others look good to me.
> 
> Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
> 

Thank you for valuable comments :)

Best Regards
Shuai

>> ---
>>  .../admin-guide/perf/dwc_pcie_pmu.rst         | 94 +++++++++++++++++++
>>  Documentation/admin-guide/perf/index.rst      |  1 +
>>  2 files changed, 95 insertions(+)
>>  create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>>
>> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> new file mode 100644
>> index 000000000000..eac1b6f36450
>> --- /dev/null
>> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> @@ -0,0 +1,94 @@
>> +======================================================================
>> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
>> +======================================================================
>> +
>> +DesignWare Cores (DWC) PCIe PMU
>> +===============================
>> +
>> +The PMU is a PCIe configuration space register block provided by each PCIe Root
>> +Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error
>> +injection, and Statistics).
>> +
>> +As the name indicates, the RAS DES capability supports system level
>> +debugging, AER error injection, and collection of statistics. To facilitate
>> +collection of statistics, Synopsys DesignWare Cores PCIe controller
>> +provides the following two features:
>> +
>> +- one 64-bit counter for Time Based Analysis (RX/TX data throughput and
>> +  time spent in each low-power LTSSM state) and
>> +- one 32-bit counter for Event Counting (error and non-error events for
>> +  a specified lane)
>> +
>> +Note: There is no interrupt for counter overflow.
>> +
>> +Time Based Analysis
>> +-------------------
>> +
>> +Using this feature you can obtain information regarding RX/TX data
>> +throughput and time spent in each low-power LTSSM state by the controller.
>> +The PMU measures data in two categories:
>> +
>> +- Group#0: Percentage of time the controller stays in LTSSM states.
>> +- Group#1: Amount of data processed (Units of 16 bytes).
>> +
>> +Lane Event counters
>> +-------------------
>> +
>> +Using this feature you can obtain Error and Non-Error information in
>> +specific lane by the controller. The PMU event is select by:
>> +
>> +- Group i
>> +- Event j within the Group i
>> +- and Lane k
>> +
>> +Some of the event only exist for specific configurations.
>> +
>> +DesignWare Cores (DWC) PCIe PMU Driver
>> +=======================================
>> +
>> +This driver adds PMU devices for each PCIe Root Port named based on the BDF of
>> +the Root Port. For example,
>> +
>> +    30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>> +
>> +the PMU device name for this Root Port is dwc_rootport_3018.
>> +
>> +The DWC PCIe PMU driver registers a perf PMU driver, which provides
>> +description of available events and configuration options in sysfs, see
>> +/sys/bus/event_source/devices/dwc_rootport_{bdf}.
>> +
>> +The "format" directory describes format of the config fields of the
>> +perf_event_attr structure. The "events" directory provides configuration
>> +templates for all documented events.  For example,
>> +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
>> +
>> +The "perf list" command shall list the available events from sysfs, e.g.::
>> +
>> +    $# perf list | grep dwc_rootport
>> +    <...>
>> +    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
>> +    <...>
>> +    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
>> +
>> +Time Based Analysis Event Usage
>> +-------------------------------
>> +
>> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>> +
>> +    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>> +
>> +The average RX/TX bandwidth can be calculated using the following formula:
>> +
>> +    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
>> +    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>> +
>> +Lane Event Usage
>> +-------------------------------
>> +
>> +Each lane has the same event set and to avoid generating a list of hundreds
>> +of events, the user need to specify the lane ID explicitly, e.g.::
>> +
>> +    $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
>> +
>> +The driver does not support sampling, therefore "perf record" will not
>> +work. Per-task (without "-a") perf sessions are not supported.
>> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
>> index f60be04e4e33..6bc7739fddb5 100644
>> --- a/Documentation/admin-guide/perf/index.rst
>> +++ b/Documentation/admin-guide/perf/index.rst
>> @@ -19,6 +19,7 @@ Performance monitor support
>>     arm_dsu_pmu
>>     thunderx2-pmu
>>     alibaba_pmu
>> +   dwc_pcie_pmu
>>     nvidia-pmu
>>     meson-ddr-pmu
>>     cxl
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
  2023-10-19 11:06       ` Jonathan Cameron
@ 2023-10-19 11:56         ` Shuai Xue
  0 siblings, 0 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-19 11:56 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: chengyou, kaishen, helgaas, yangyicong, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj



On 2023/10/19 19:06, Jonathan Cameron wrote:

...

>>>> +
>>>> +The DWC PCIe PMU driver registers a perf PMU driver, which provides
>>>> +description of available events and configuration options in sysfs, see
>>>> +/sys/bus/event_source/devices/dwc_rootport_{bdf}.
>>>> +
>>>> +The "format" directory describes format of the config fields of the
>>>> +perf_event_attr structure. The "events" directory provides configuration
>>>> +templates for all documented events.  For example,
>>>> +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
>>>> +
>>>> +The "perf list" command shall list the available events from sysfs, e.g.::
>>>> +
>>>> +    $# perf list | grep dwc_rootport
>>>> +    <...>
>>>> +    dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/        [Kernel PMU event]
>>>> +    <...>
>>>> +    dwc_rootport_3018/rx_memory_read,lane=?/               [Kernel PMU event]
>>>> +
>>>> +Time Based Analysis Event Usage
>>>> +-------------------------------
>>>> +
>>>> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>>> +
>>>> +    $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>>> +
>>>> +The average RX/TX bandwidth can be calculated using the following formula:
>>>> +
>>>> +    PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
>>>> +    PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window  
>>>
>>> Silly question (sorry I didn't raise it earlier) but can we make the interface
>>> more intuitive by just multiplying the counter value at point of read by 16?  
>>
>> Really a good suggestion, and it is very convenient for end perf users.
>> But the unit of 16 is only applied to group#1 as described in Time Based Analysis
>> section.
> 
> How hard would it be to just apply it to those events?
> Userspace doesn't care what the hardware does underneath - it just wants to get
> moderately intuitive data back. Having the end user deal with this oddity + even
> the need to document it seems to me to be unnecessary burden given how simple it
> is (I assume) to remove the oddity.

Ok. Talked me into it :)
I will multiply the counter value at point of read by 16 for group#1 events.

Thank you.

Best Regards,
Shuai

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver
  2023-10-19  8:05       ` Yicong Yang
@ 2023-10-19 12:02         ` Shuai Xue
  0 siblings, 0 replies; 15+ messages in thread
From: Shuai Xue @ 2023-10-19 12:02 UTC (permalink / raw)
  To: Yicong Yang, Jonathan Cameron
  Cc: yangyicong, chengyou, kaishen, helgaas, will, baolin.wang,
	robin.murphy, linux-kernel, linux-arm-kernel, linux-pci, rdunlap,
	mark.rutland, zhuo.song, renyu.zj



On 2023/10/19 16:05, Yicong Yang wrote:
> On 2023/10/18 11:33, Shuai Xue wrote:
>>
...
>>
>>>
>>>> +		return PTR_ERR(dwc_pcie_pmu_dev);
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void __exit dwc_pcie_pmu_exit(void)
>>>> +{
>>>> +	platform_device_unregister(dwc_pcie_pmu_dev);
>>>> +	platform_driver_unregister(&dwc_pcie_pmu_driver);
>>>> +	cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>>>> +
>>>> +	if (dwc_pcie_pmu_notify)
>>>
>>> If you have something unusual like this a driver module_exit() it definitely
>>> deserves a comment on why.  I'm surprised by this as I'd expect the notifier
>>> to be unregistered in the driver remove so not sure why this is here.
>>> I've lost track of earlier discussions so if this was addressed then all
>>> we need is a comment here for the next person to run into it!
>>
>> All replied above, I will unregistered the notifier by devm_add_action_or_reset().
>>
>> I am curious about that what the difference between unregistered in module_exit()
>> and remove()?
>>
> 
> From my understanding, if you register it in probe() then should undo it in remove().
> Otherwise you should register it in module_init(). Just make them coupled to make
> sure cleanup the resources correctly.
> 
> This driver is a bit different since device and driver are created in module_init()
> so will works fine in most cases, because the device/driver removal will happens the
> same time when unloading the module. However if manually unbind the driver and device
> without unloading the module, we'll miss to unregister the notifier in the currently
> implementation.
> 

I see, thank you for your patient explanation.

Thank you.

Best Regards,
Shuai

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-10-19 12:03 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-17  1:32 [PATCH v8 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support Shuai Xue
2023-10-17  1:32 ` [PATCH v8 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver Shuai Xue
2023-10-17  9:16   ` Jonathan Cameron
2023-10-18  1:19     ` Shuai Xue
2023-10-19 11:06       ` Jonathan Cameron
2023-10-19 11:56         ` Shuai Xue
2023-10-19  7:35   ` Yicong Yang
2023-10-19 11:51     ` Shuai Xue
2023-10-17  1:32 ` [PATCH v8 2/4] PCI: Add Alibaba Vendor ID to linux/pci_ids.h Shuai Xue
2023-10-17  1:32 ` [PATCH v8 3/4] drivers/perf: add DesignWare PCIe PMU driver Shuai Xue
2023-10-17  9:39   ` Jonathan Cameron
2023-10-18  3:33     ` Shuai Xue
2023-10-19  8:05       ` Yicong Yang
2023-10-19 12:02         ` Shuai Xue
2023-10-17  1:32 ` [PATCH v8 4/4] MAINTAINERS: add maintainers for " Shuai Xue

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).