linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] perf x86: Exposing IO stack to IO PMON mapping through sysfs
@ 2019-11-26 16:36 roman.sudarikov
  2019-11-26 16:36 ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping roman.sudarikov
  0 siblings, 1 reply; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

The Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
changes to the integrated I/O (IIO) architecture. The new solution introduces
IIO stacks, which are responsible for managing traffic between the PCIe domain
and the Mesh domain. Each IIO stack has its own PMON block and can handle either
a DMI port, an x16 PCIe root port, an MCP-Link, or various built-in accelerators.
IIO PMON blocks allow concurrent monitoring of I/O flows for up to four x4
bifurcated links within each IIO stack.

Software is expected to program the required perf counters within each IIO stack
and gather performance data. The tricky part is that IIO PMONs report data per
IIO stack, but users have no notion of IIO stacks; they only know the devices
which are connected to the platform.

Understanding the IIO stack concept well enough to find which IIO stack a
particular I/O device is connected to, or to identify which IIO PMON block to
program for monitoring a specific IIO stack, requires a lot of implicit
knowledge about the given Intel server platform architecture.

This patch set introduces:
    An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs
    A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device

Usage examples:

1. List all devices below IIO stacks
  ./perf stat --iiostat=show

Sample output without libpci:

    S0-RootPort0-uncore_iio_0<00:00.0>
    S1-RootPort0-uncore_iio_0<81:00.0>
    S0-RootPort1-uncore_iio_1<18:00.0>
    S1-RootPort1-uncore_iio_1<86:00.0>
    S1-RootPort1-uncore_iio_1<88:00.0>
    S0-RootPort2-uncore_iio_2<3d:00.0>
    S1-RootPort2-uncore_iio_2<af:00.0>
    S1-RootPort3-uncore_iio_3<da:00.0>

Sample output with libpci:

    S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
    S1-RootPort0-uncore_iio_0<81:00.0 Ethernet Controller X710 for 10GbE SFP+>
    S0-RootPort1-uncore_iio_1<18:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
    S1-RootPort1-uncore_iio_1<86:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
    S1-RootPort1-uncore_iio_1<88:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
    S0-RootPort2-uncore_iio_2<3d:00.0 Ethernet Connection X722 for 10GBASE-T>
    S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
    S1-RootPort3-uncore_iio_3<da:00.0 NVMe Datacenter SSD [Optane]>

2. Collect metrics for all I/O devices below IIO stacks

  ./perf stat --iiostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
    357708+0 records in
    357707+0 records out
    375083606016 bytes (375 GB, 349 GiB) copied, 215.381 s, 1.7 GB/s

  Performance counter stats for 'system wide':

     device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
    00:00.0                    0                    0                    0                    0
    81:00.0                    0                    0                    0                    0
    18:00.0                    0                    0                    0                    0
    86:00.0                    0                    0                    0                    0
    88:00.0                    0                    0                    0                    0
    3b:00.0                    3                    0                    0                    0
    3c:03.0                    3                    0                    0                    0
    3d:00.0                    3                    0                    0                    0
    af:00.0                    0                    0                    0                    0
    da:00.0               358559                   44                    0                   22

    215.383783574 seconds time elapsed


3. Collect metrics for a comma-separated list of I/O devices

  ./perf stat --iiostat=da:00.0 -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
    381555+0 records in
    381554+0 records out
    400088457216 bytes (400 GB, 373 GiB) copied, 374.044 s, 1.1 GB/s

  Performance counter stats for 'system wide':

     device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
    da:00.0               382462                   47                    0                   23

    374.045775505 seconds time elapsed

Roman Sudarikov (6):
  perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  perf tools: Helper functions to enumerate and probe PCI devices
  perf stat: Helper functions for list of IIO devices
  perf stat: New --iiostat mode to provide I/O performance metrics
  perf tools: Add feature check for libpci
  perf stat: Add PCI device name to --iiostat output

 arch/x86/events/intel/uncore.c                |  61 +-
 arch/x86/events/intel/uncore.h                |  13 +-
 arch/x86/events/intel/uncore_snbep.c          | 144 ++++
 tools/build/Makefile.feature                  |   2 +
 tools/build/feature/Makefile                  |   4 +
 tools/build/feature/test-all.c                |   5 +
 tools/build/feature/test-libpci.c             |  10 +
 tools/perf/Documentation/perf-stat.txt        |  12 +
 tools/perf/Makefile.config                    |  10 +
 tools/perf/arch/x86/util/Build                |   1 +
 tools/perf/arch/x86/util/iiostat.c            | 718 ++++++++++++++++++
 tools/perf/builtin-stat.c                     |  32 +-
 tools/perf/builtin-version.c                  |   1 +
 tools/perf/tests/make                         |   1 +
 tools/perf/util/Build                         |   1 +
 tools/perf/util/evsel.h                       |   1 +
 tools/perf/util/iiostat.h                     |  35 +
 tools/perf/util/pci.c                         |  99 +++
 tools/perf/util/pci.h                         |  27 +
 .../scripting-engines/trace-event-python.c    |   2 +-
 tools/perf/util/stat-display.c                |  53 +-
 tools/perf/util/stat-shadow.c                 |  12 +-
 tools/perf/util/stat.c                        |   3 +-
 tools/perf/util/stat.h                        |   2 +
 24 files changed, 1237 insertions(+), 12 deletions(-)
 create mode 100644 tools/build/feature/test-libpci.c
 create mode 100644 tools/perf/arch/x86/util/iiostat.c
 create mode 100644 tools/perf/util/iiostat.h
 create mode 100644 tools/perf/util/pci.c
 create mode 100644 tools/perf/util/pci.h


base-commit: 219d54332a09e8d8741c1e1982f5eae56099de85
-- 
2.19.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-11-26 16:36 [PATCH 0/6] perf x86: Exposing IO stack to IO PMON mapping through sysfs roman.sudarikov
@ 2019-11-26 16:36 ` roman.sudarikov
  2019-11-26 16:36   ` [PATCH 2/6] perf tools: Helper functions to enumerate and probe PCI devices roman.sudarikov
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

The Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
changes to the integrated I/O (IIO) architecture. The new solution introduces
IIO stacks, which are responsible for managing traffic between the PCIe domain
and the Mesh domain. Each IIO stack has its own PMON block and can handle either
a DMI port, an x16 PCIe root port, an MCP-Link, or various built-in accelerators.
IIO PMON blocks allow concurrent monitoring of I/O flows for up to four x4
bifurcated links within each IIO stack.

Software is expected to program the required perf counters within each IIO stack
and gather performance data. The tricky part is that IIO PMONs report data per
IIO stack, but users have no notion of IIO stacks; they only know the devices
which are connected to the platform.

Understanding the IIO stack concept well enough to find which IIO stack a
particular I/O device is connected to, or to identify which IIO PMON block to
program for monitoring a specific IIO stack, requires a lot of implicit
knowledge about the given Intel server platform architecture.

This patch set introduces:
    An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs
    A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device

The current version supports server platforms starting from the Intel® Xeon®
Scalable processor family and introduces mapping for IIO Uncore units only.
Other units can be added on demand.

Usage example:
    /sys/devices/uncore_<type>_<pmu_idx>/platform_mapping

Each Uncore unit type, by its nature, can be mapped to its own context, for example:
    CHA - each uncore_cha_<pmu_idx> is assigned to manage a distinct slice of LLC capacity
    UPI - each uncore_upi_<pmu_idx> is assigned to manage one link of Intel UPI Subsystem
    IIO - each uncore_iio_<pmu_idx> is assigned to manage one stack of the IIO module
    IMC - each uncore_imc_<pmu_idx> is assigned to manage one channel of Memory Controller

Implementation details:
    Two callbacks are added to struct intel_uncore_type to discover and map Uncore units to PMONs:
        int (*get_topology)(struct intel_uncore_type *type)
        int (*set_mapping)(struct intel_uncore_type *type)

    IIO stack to PMON mapping is exposed through
        /sys/devices/uncore_iio_<pmu_idx>/platform_mapping
        in the following format: domain:bus

Details of IIO Uncore unit mapping to IIO PMON:
Each IIO stack handles either a DMI port, an x16 PCIe root port, an MCP-Link, or
various built-in accelerators. For the Uncore IIO unit type, the platform_mapping
file holds the bus numbers of the devices which can be monitored by that IIO PMON
block on each die.

For example, on a 4-die Intel Xeon® server platform:
    $ cat /sys/devices/uncore_iio_0/platform_mapping
    0000:00,0000:40,0000:80,0000:c0

Which means:
IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000

Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
---
 arch/x86/events/intel/uncore.c       |  61 +++++++++++-
 arch/x86/events/intel/uncore.h       |  13 ++-
 arch/x86/events/intel/uncore_snbep.c | 144 +++++++++++++++++++++++++++
 3 files changed, 214 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index 86467f85c383..0f779c8fcc05 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -18,6 +18,11 @@ struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_extra_dev *uncore_extra_pci_dev;
 static int max_dies;
 
+int get_max_dies(void)
+{
+	return max_dies;
+}
+
 /* mask of cpus that collect uncore events */
 static cpumask_t uncore_cpu_mask;
 
@@ -816,6 +821,16 @@ static ssize_t uncore_get_attr_cpumask(struct device *dev,
 
 static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);
 
+static ssize_t platform_mapping_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct intel_uncore_pmu *pmu = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE - 1, "%s\n", pmu->platform_mapping ?
+		       (char *)pmu->platform_mapping : "0");
+}
+static DEVICE_ATTR_RO(platform_mapping);
+
 static struct attribute *uncore_pmu_attrs[] = {
 	&dev_attr_cpumask.attr,
 	NULL,
@@ -825,6 +840,15 @@ static const struct attribute_group uncore_pmu_attr_group = {
 	.attrs = uncore_pmu_attrs,
 };
 
+static struct attribute *platform_attrs[] = {
+	&dev_attr_platform_mapping.attr,
+	NULL,
+};
+
+static const struct attribute_group uncore_platform_discovery_group = {
+	.attrs = platform_attrs,
+};
+
 static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
 {
 	int ret;
@@ -905,11 +929,27 @@ static void uncore_types_exit(struct intel_uncore_type **types)
 		uncore_type_exit(*types);
 }
 
+static void uncore_type_attrs_compaction(struct intel_uncore_type *type)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < UNCORE_MAX_NUM_ATTR_GROUP; i++) {
+		if (!type->attr_groups[i])
+			continue;
+		if (i > j) {
+			type->attr_groups[j] = type->attr_groups[i];
+			type->attr_groups[i] = NULL;
+		}
+		j++;
+	}
+}
+
 static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
 {
 	struct intel_uncore_pmu *pmus;
 	size_t size;
 	int i, j;
+	int ret;
 
 	pmus = kcalloc(type->num_boxes, sizeof(*pmus), GFP_KERNEL);
 	if (!pmus)
@@ -922,8 +962,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
 		pmus[i].pmu_idx	= i;
 		pmus[i].type	= type;
 		pmus[i].boxes	= kzalloc(size, GFP_KERNEL);
-		if (!pmus[i].boxes)
+		if (!pmus[i].boxes) {
+			ret = -ENOMEM;
 			goto err;
+		}
 	}
 
 	type->pmus = pmus;
@@ -940,8 +982,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
 
 		attr_group = kzalloc(struct_size(attr_group, attrs, i + 1),
 								GFP_KERNEL);
-		if (!attr_group)
+		if (!attr_group) {
+			ret = -ENOMEM;
 			goto err;
+		}
 
 		attr_group->group.name = "events";
 		attr_group->group.attrs = attr_group->attrs;
@@ -954,6 +998,17 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
 
 	type->pmu_group = &uncore_pmu_attr_group;
 
+	/*
+	 * Exposing mapping of Uncore units to corresponding Uncore PMUs
+	 * through /sys/devices/uncore_<type>_<idx>/platform_mapping
+	 */
+	if (type->get_topology && type->set_mapping)
+		if (!type->get_topology(type) && !type->set_mapping(type))
+			type->platform_discovery = &uncore_platform_discovery_group;
+
+	/* For optional attributes, we can safely remove embedded NULL attr_groups elements */
+	uncore_type_attrs_compaction(type);
+
 	return 0;
 
 err:
@@ -961,7 +1016,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
 		kfree(pmus[i].boxes);
 	kfree(pmus);
 
-	return -ENOMEM;
+	return ret;
 }
 
 static int __init
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
index bbfdaa720b45..ce3727b9f7f8 100644
--- a/arch/x86/events/intel/uncore.h
+++ b/arch/x86/events/intel/uncore.h
@@ -43,6 +43,8 @@ struct intel_uncore_box;
 struct uncore_event_desc;
 struct freerunning_counters;
 
+#define UNCORE_MAX_NUM_ATTR_GROUP 5
+
 struct intel_uncore_type {
 	const char *name;
 	int num_counters;
@@ -71,13 +73,19 @@ struct intel_uncore_type {
 	struct intel_uncore_ops *ops;
 	struct uncore_event_desc *event_descs;
 	struct freerunning_counters *freerunning;
-	const struct attribute_group *attr_groups[4];
+	const struct attribute_group *attr_groups[UNCORE_MAX_NUM_ATTR_GROUP];
 	struct pmu *pmu; /* for custom pmu ops */
+	void *platform_topology;
+	/* finding Uncore units */
+	int (*get_topology)(struct intel_uncore_type *type);
+	/* mapping Uncore units to PMON ranges */
+	int (*set_mapping)(struct intel_uncore_type *type);
 };
 
 #define pmu_group attr_groups[0]
 #define format_group attr_groups[1]
 #define events_group attr_groups[2]
+#define platform_discovery attr_groups[3]
 
 struct intel_uncore_ops {
 	void (*init_box)(struct intel_uncore_box *);
@@ -99,6 +107,7 @@ struct intel_uncore_pmu {
 	int				pmu_idx;
 	int				func_id;
 	bool				registered;
+	void				*platform_mapping;
 	atomic_t			activeboxes;
 	struct intel_uncore_type	*type;
 	struct intel_uncore_box		**boxes;
@@ -490,6 +499,8 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
 	return event->pmu_private;
 }
 
+int get_max_dies(void);
+
 struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu);
 u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
 void uncore_mmio_exit_box(struct intel_uncore_box *box);
diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
index b10a5ec79e48..92ce9fbafde1 100644
--- a/arch/x86/events/intel/uncore_snbep.c
+++ b/arch/x86/events/intel/uncore_snbep.c
@@ -273,6 +273,28 @@
 #define SKX_CPUNODEID			0xc0
 #define SKX_GIDNIDMAP			0xd4
 
+/*
+ * The CPU_BUS_NUMBER MSR returns the values of the respective CPUBUSNO CSR
+ * that BIOS programmed. MSR has package scope.
+ * |  Bit  |  Default  |  Description
+ * | [63]  |    00h    | VALID - When set, indicates the CPU bus
+ *                       numbers have been initialized. (RO)
+ * |[62:48]|    ---    | Reserved
+ * |[47:40]|    00h    | BUS_NUM_5 — Return the bus number BIOS assigned
+ *                       CPUBUSNO(5). (RO)
+ * |[39:32]|    00h    | BUS_NUM_4 — Return the bus number BIOS assigned
+ *                       CPUBUSNO(4). (RO)
+ * |[31:24]|    00h    | BUS_NUM_3 — Return the bus number BIOS assigned
+ *                       CPUBUSNO(3). (RO)
+ * |[23:16]|    00h    | BUS_NUM_2 — Return the bus number BIOS assigned
+ *                       CPUBUSNO(2). (RO)
+ * |[15:8] |    00h    | BUS_NUM_1 — Return the bus number BIOS assigned
+ *                       CPUBUSNO(1). (RO)
+ * | [7:0] |    00h    | BUS_NUM_0 — Return the bus number BIOS assigned
+ *                       CPUBUSNO(0). (RO)
+ */
+#define SKX_MSR_CPU_BUS_NUMBER		0x300
+
 /* SKX CHA */
 #define SKX_CHA_MSR_PMON_BOX_FILTER_TID		(0x1ffULL << 0)
 #define SKX_CHA_MSR_PMON_BOX_FILTER_LINK	(0xfULL << 9)
@@ -3580,6 +3602,9 @@ static struct intel_uncore_ops skx_uncore_iio_ops = {
 	.read_counter		= uncore_msr_read_counter,
 };
 
+static int skx_iio_get_topology(struct intel_uncore_type *type);
+static int skx_iio_set_mapping(struct intel_uncore_type *type);
+
 static struct intel_uncore_type skx_uncore_iio = {
 	.name			= "iio",
 	.num_counters		= 4,
@@ -3594,6 +3619,8 @@ static struct intel_uncore_type skx_uncore_iio = {
 	.constraints		= skx_uncore_iio_constraints,
 	.ops			= &skx_uncore_iio_ops,
 	.format_group		= &skx_uncore_iio_format_group,
+	.get_topology		= skx_iio_get_topology,
+	.set_mapping		= skx_iio_set_mapping,
 };
 
 enum perf_uncore_iio_freerunning_type_id {
@@ -3780,6 +3807,123 @@ static int skx_count_chabox(void)
 	return hweight32(val);
 }
 
+static inline u8 skx_iio_topology_byte(void *platform_topology,
+					int die, int idx)
+{
+	return *((u8 *)(platform_topology) + die * sizeof(u64) + idx);
+}
+
+static inline bool skx_iio_topology_valid(u64 msr_value)
+{
+	return msr_value & ((u64)1 << 63);
+}
+
+static int skx_msr_cpu_bus_read(int cpu, int die)
+{
+	int ret = rdmsrl_on_cpu(cpu, SKX_MSR_CPU_BUS_NUMBER,
+				((u64 *)skx_uncore_iio.platform_topology) + die);
+
+	if (!ret) {
+		if (!skx_iio_topology_valid(*(((u64 *)skx_uncore_iio.platform_topology) + die)))
+			ret = -1;
+	}
+	return ret;
+}
+
+static int skx_iio_get_topology(struct intel_uncore_type *type)
+{
+	int ret, cpu, die, current_die;
+	struct pci_bus *bus = NULL;
+
+	while ((bus = pci_find_next_bus(bus)) != NULL)
+		if (pci_domain_nr(bus)) {
+			pr_info("Mapping of I/O stack to PMON ranges is not supported for multi-segment topology\n");
+			return -1;
+		}
+
+	/* Size of SKX_MSR_CPU_BUS_NUMBER is 8 bytes, the MSR has package scope.*/
+	type->platform_topology =
+		kzalloc(get_max_dies() * sizeof(u64), GFP_KERNEL);
+	if (!type->platform_topology)
+		return -ENOMEM;
+
+	/*
+	 * Using cpus_read_lock() to ensure cpu is not going down between
+	 * looking at cpu_online_mask.
+	 */
+	cpus_read_lock();
+	/* Invalid value to start loop.*/
+	current_die = -1;
+	for_each_online_cpu(cpu) {
+		die = topology_logical_die_id(cpu);
+		if (current_die == die)
+			continue;
+		ret = skx_msr_cpu_bus_read(cpu, die);
+		if (ret)
+			break;
+		current_die = die;
+	}
+	cpus_read_unlock();
+
+	if (ret)
+		kfree(type->platform_topology);
+	return ret;
+}
+
+static int skx_iio_set_mapping(struct intel_uncore_type *type)
+{
+	/*
+	 * Each IIO stack (PCIe root port) has its own IIO PMON block, so each
+	 * platform_mapping holds bus number(s) of PCIe root port(s), which can
+	 * be monitored by that IIO PMON block.
+	 *
+	 * For example, on a 4-die Xeon platform with up to 6 IIO stacks per die
+	 * and, therefore, 6 IIO PMON blocks per die, the platform_mapping of IIO
+	 * PMON block 0 holds "0000:00,0000:40,0000:80,0000:c0":
+	 *
+	 * $ cat /sys/devices/uncore_iio_0/platform_mapping
+	 * 0000:00,0000:40,0000:80,0000:c0
+	 *
+	 * Which means:
+	 * IIO PMON block 0 on the die 0 belongs to PCIe root port located on bus 0x00, domain 0x0000
+	 * IIO PMON block 0 on the die 1 belongs to PCIe root port located on bus 0x40, domain 0x0000
+	 * IIO PMON block 0 on the die 2 belongs to PCIe root port located on bus 0x80, domain 0x0000
+	 * IIO PMON block 0 on the die 3 belongs to PCIe root port located on bus 0xc0, domain 0x0000
+	 */
+
+	int ret = 0;
+	int die, i;
+	char *buf;
+	struct intel_uncore_pmu *pmu;
+	const int template_len = 8;
+
+	for (i = 0; i < type->num_boxes; i++) {
+		pmu = type->pmus + i;
+		/* Root bus 0x00 is valid only for die 0 AND pmu_idx = 0. */
+		if (skx_iio_topology_byte(type->platform_topology, 0, pmu->pmu_idx) || (!pmu->pmu_idx)) {
+			pmu->platform_mapping =
+				kzalloc(get_max_dies() * template_len + 1, GFP_KERNEL);
+			if (pmu->platform_mapping) {
+				buf = (char *)pmu->platform_mapping;
+				for (die = 0; die < get_max_dies(); die++)
+					buf += snprintf(buf, template_len + 1, "%04x:%02x,", 0,
+						skx_iio_topology_byte(type->platform_topology,
+								      die, pmu->pmu_idx));
+
+				*(--buf) = '\0';
+			} else {
+				for (; i >= 0; i--)
+					kfree((type->pmus + i)->platform_mapping);
+				ret = -ENOMEM;
+				break;
+			}
+		}
+	}
+
+	kfree(type->platform_topology);
+	return ret;
+}
+
 void skx_uncore_cpu_init(void)
 {
 	skx_uncore_chabox.num_boxes = skx_count_chabox();
-- 
2.19.1



* [PATCH 2/6] perf tools: Helper functions to enumerate and probe PCI devices
  2019-11-26 16:36 ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping roman.sudarikov
@ 2019-11-26 16:36   ` roman.sudarikov
  2019-11-26 16:36     ` [PATCH 3/6] perf stat: Helper functions for list of IIO devices roman.sudarikov
  2019-12-02 14:00   ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping Peter Zijlstra
  2019-12-02 19:47   ` Stephane Eranian
  2 siblings, 1 reply; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

This makes aggregation of performance data per I/O device
available in the perf user tools.

Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
---
 tools/perf/util/Build |  1 +
 tools/perf/util/pci.c | 53 +++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/pci.h | 23 +++++++++++++++++++
 3 files changed, 77 insertions(+)
 create mode 100644 tools/perf/util/pci.c
 create mode 100644 tools/perf/util/pci.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 8dcfca1a882f..02b699f8a10a 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -23,6 +23,7 @@ perf-y += memswap.o
 perf-y += parse-events.o
 perf-y += perf_regs.o
 perf-y += path.o
+perf-y += pci.o
 perf-y += print_binary.o
 perf-y += rlimit.o
 perf-y += argv_split.o
diff --git a/tools/perf/util/pci.c b/tools/perf/util/pci.c
new file mode 100644
index 000000000000..ba1a48e9d0cc
--- /dev/null
+++ b/tools/perf/util/pci.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Helper functions to access PCI CFG space.
+ *
+ * Copyright (C) 2019, Intel Corporation
+ *
+ * Authors: Roman Sudarikov <roman.sudarikov@intel.com>
+ *	    Alexander Antonov <alexander.antonov@intel.com>
+ */
+#include "pci.h"
+#include <api/fs/fs.h>
+#include <linux/kernel.h>
+#include <string.h>
+#include <unistd.h>
+
+#define PCI_DEVICE_PATH_TEMPLATE "bus/pci/devices/0000:%02x:%02x.0"
+#define PCI_DEVICE_FILE_TEMPLATE PCI_DEVICE_PATH_TEMPLATE"/%s"
+
+static bool directory_exists(const char * const path)
+{
+	return (access(path, F_OK) == 0);
+}
+
+bool pci_device_probe(struct bdf bdf)
+{
+	char path[PATH_MAX];
+
+	scnprintf(path, PATH_MAX, "%s/"PCI_DEVICE_PATH_TEMPLATE,
+		  sysfs__mountpoint(), bdf.busno, bdf.devno);
+	return directory_exists(path);
+}
+
+bool is_pci_device_root_port(struct bdf bdf, u8 *secondary, u8 *subordinate)
+{
+	char path[PATH_MAX];
+	int secondary_interim;
+	int subordinate_interim;
+
+	scnprintf(path, PATH_MAX, PCI_DEVICE_FILE_TEMPLATE,
+		  bdf.busno, bdf.devno, "secondary_bus_number");
+	if (!sysfs__read_int(path, &secondary_interim)) {
+		scnprintf(path, PATH_MAX, PCI_DEVICE_FILE_TEMPLATE,
+			  bdf.busno, bdf.devno, "subordinate_bus_number");
+		if (!sysfs__read_int(path, &subordinate_interim)) {
+			if (secondary)
+				*secondary = (u8)secondary_interim;
+			if (subordinate)
+				*subordinate = (u8)subordinate_interim;
+			return true;
+		}
+	}
+	return false;
+}
diff --git a/tools/perf/util/pci.h b/tools/perf/util/pci.h
new file mode 100644
index 000000000000..e963b12e10e7
--- /dev/null
+++ b/tools/perf/util/pci.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0*/
+/*
+ *
+ * Copyright (C) 2019, Intel Corporation
+ *
+ * Authors: Roman Sudarikov <roman.sudarikov@intel.com>
+ *	    Alexander Antonov <alexander.antonov@intel.com>
+ */
+#ifndef _PCI_H
+#define _PCI_H
+
+#include <linux/types.h>
+
+struct bdf {
+	u8 busno;
+	u8 devno;
+	u8 funcno;
+};
+
+bool pci_device_probe(struct bdf bdf);
+bool is_pci_device_root_port(struct bdf bdf, u8 *secondary, u8 *subordinate);
+
+#endif /* _PCI_H */
-- 
2.19.1



* [PATCH 3/6] perf stat: Helper functions for list of IIO devices
  2019-11-26 16:36   ` [PATCH 2/6] perf tools: Helper functions to enumerate and probe PCI devices roman.sudarikov
@ 2019-11-26 16:36     ` roman.sudarikov
  2019-11-26 16:36       ` [PATCH 4/6] perf stat: New --iiostat mode to provide I/O performance metrics roman.sudarikov
  0 siblings, 1 reply; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

Helper functions to iterate through and manipulate a list of
struct iio_device objects. The following patch will use them.

Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
---
 tools/perf/arch/x86/util/iiostat.c | 178 +++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)
 create mode 100644 tools/perf/arch/x86/util/iiostat.c

diff --git a/tools/perf/arch/x86/util/iiostat.c b/tools/perf/arch/x86/util/iiostat.c
new file mode 100644
index 000000000000..b93b9b9da418
--- /dev/null
+++ b/tools/perf/arch/x86/util/iiostat.c
@@ -0,0 +1,178 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * perf stat --iiostat
+ *
+ * Copyright (C) 2019, Intel Corporation
+ *
+ * Authors: Roman Sudarikov <roman.sudarikov@intel.com>
+ *	    Alexander Antonov <alexander.antonov@intel.com>
+ */
+#include "path.h"
+#include "pci.h"
+#include <api/fs/fs.h>
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include "util/debug.h"
+#include "util/iiostat.h"
+#include "util/counts.h"
+#include <limits.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <regex.h>
+
+struct dev_info {
+	struct bdf bdf;
+	u8 ch_mask;
+	u8 socket;
+	u8 pmu_idx;
+	u8 root_port_nr;
+};
+
+struct iio_device {
+	struct list_head node;
+	struct dev_info	dev_info;
+	int idx;
+};
+
+struct iio_devs_list {
+	struct list_head devices;
+	int nr_entries;
+};
+
+/**
+ * __iio_devs_for_each_device - iterate thru all the iio devices
+ * @list: list_head instance to iterate
+ * @device: struct iio_device iterator
+ */
+#define __iio_devs_for_each_device(list, device) \
+		list_for_each_entry(device, list, node)
+
+/**
+ * iio_devs_list_for_each_device - iterate thru all the iio devices
+ * @list: iio_devs_list instance to iterate
+ * @device: struct iio_device iterator
+ */
+#define iio_devs_list_for_each_device(list, device) \
+	__iio_devs_for_each_device(&(list->devices), device)
+
+/**
+ * __iio_devs_for_each_device_safe - safely iterate thru all the iio devices
+ * @devices: list_head instance to iterate
+ * @tmp: struct iio_device temp iterator
+ * @device: struct iio_device iterator
+ */
+#define __iio_devs_for_each_device_safe(devices, tmp, device) \
+		list_for_each_entry_safe(device, tmp, devices, node)
+
+/**
+ * iio_devs_list_for_each_device_safe - safely iterate thru all the iio devices
+ * @list: iio_devs_list instance to iterate
+ * @tmp: struct iio_device temp iterator
+ * @device: struct iio_device iterator
+ */
+#define iio_devs_list_for_each_device_safe(list, tmp, device) \
+		__iio_devs_for_each_device_safe(&(list->devices), tmp, device)
+
+#define iio_device_delete_from_list(device) \
+		list_del(&(device->node))
+
+static struct iio_device *iio_device_new(struct dev_info *info)
+{
+	struct iio_device *p =
+		(struct iio_device *)calloc(1, sizeof(struct iio_device));
+
+	if (p) {
+		INIT_LIST_HEAD(&(p->node));
+		p->dev_info = *info;
+		p->idx = -1;
+	}
+	return p;
+}
+
+static void iio_device_delete(struct iio_device *device)
+{
+	if (device) {
+		list_del_init(&(device->node));
+		free(device);
+	}
+}
+
+static void iiostat_device_show(FILE *output,
+			const struct iio_device * const device)
+{
+	if (output && device)
+		fprintf(output, "S%d-RootPort%d-uncore_iio_%d<%02x:%02x.%x>\n",
+			device->dev_info.socket,
+			device->dev_info.root_port_nr, device->dev_info.pmu_idx,
+			device->dev_info.bdf.busno, device->dev_info.bdf.devno,
+			device->dev_info.bdf.funcno);
+}
+
+static struct iio_devs_list *iio_devs_list_new(void)
+{
+	struct iio_devs_list *devs_list =
+		(struct iio_devs_list *)calloc(1, sizeof(struct iio_devs_list));
+
+	if (devs_list)
+		INIT_LIST_HEAD(&(devs_list->devices));
+	return devs_list;
+}
+
+static void iio_devs_list_free(struct iio_devs_list *list)
+{
+	struct iio_device *tmp_device;
+	struct iio_device *device;
+
+	if (list) {
+		iio_devs_list_for_each_device_safe(list, tmp_device, device)
+			iio_device_delete(device);
+		list_del_init(&(list->devices));
+		free(list);
+	}
+}
+
+static bool is_same_iio_device(struct bdf lhd, struct bdf rhd)
+{
+	return (lhd.busno == rhd.busno) && (lhd.devno == rhd.devno) &&
+		(lhd.funcno == rhd.funcno);
+}
+
+static void iio_devs_list_add_device(struct iio_devs_list *list,
+				      struct iio_device * const device)
+{
+	struct iio_device *it;
+
+	if (list && device) {
+		iio_devs_list_for_each_device(list, it)
+			if (is_same_iio_device(it->dev_info.bdf, device->dev_info.bdf))
+				return;
+		device->idx = list->nr_entries++;
+		list_add_tail(&(device->node), &(list->devices));
+	}
+}
+
+static void iio_devs_list_join_list(struct iio_devs_list *dest,
+				     struct iio_devs_list *src)
+{
+	int idx = 0;
+	struct iio_device *it;
+
+	if (dest && src) {
+		if (dest->nr_entries) {
+			it = list_last_entry(&(dest->devices),
+					     struct iio_device, node);
+			idx = it->idx + 1;
+		}
+		iio_devs_list_for_each_device(src, it)
+			it->idx = idx++;
+		list_splice_tail(&(src->devices), &(dest->devices));
+		dest->nr_entries += src->nr_entries;
+	}
+}
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/6] perf stat: New --iiostat mode to provide I/O performance metrics
  2019-11-26 16:36     ` [PATCH 3/6] perf stat: Helper functions for list of IIO devices roman.sudarikov
@ 2019-11-26 16:36       ` roman.sudarikov
  2019-11-26 16:36         ` [PATCH 5/6] perf tools: Add feature check for libpci roman.sudarikov
  0 siblings, 1 reply; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

The new --iiostat mode in perf stat is intended to provide four I/O
performance metrics for each I/O device below the IIO stacks:
    --Inbound Read(MB)   - I/O device reads from the host memory, in MB
    --Inbound Write(MB)  - I/O device writes to the host memory, in MB
    --Outbound Read(MB)  - CPU reads from the I/O device, in MB
    --Outbound Write(MB) - CPU writes to the I/O device, in MB

Each metric requires only one IIO event which increments at every 4B transfer
in the corresponding direction. The formula to compute each metric is generic:
    #EventCount * 4B / (1024 * 1024)

This implementation starts by discovering the IIO stacks on the platform and
all devices below each stack. The next step is to configure a group of four
events per device and tie each event group to its device.

Note: --iiostat introduces a new perf data aggregation mode (per I/O device),
hence the -e and -M options are not supported.

Usage examples:

1. List all devices below IIO stacks
  ./perf stat --iiostat=show

Sample output w/o libpci:

    S0-RootPort0-uncore_iio_0<00:00.0>
    S1-RootPort0-uncore_iio_0<81:00.0>
    S0-RootPort1-uncore_iio_1<18:00.0>
    S1-RootPort1-uncore_iio_1<86:00.0>
    S1-RootPort1-uncore_iio_1<88:00.0>
    S0-RootPort2-uncore_iio_2<3d:00.0>
    S1-RootPort2-uncore_iio_2<af:00.0>
    S1-RootPort3-uncore_iio_3<da:00.0>

Sample output with libpci:

    S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
    S1-RootPort0-uncore_iio_0<81:00.0 Ethernet Controller X710 for 10GbE SFP+>
    S0-RootPort1-uncore_iio_1<18:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
    S1-RootPort1-uncore_iio_1<86:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
    S1-RootPort1-uncore_iio_1<88:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
    S0-RootPort2-uncore_iio_2<3d:00.0 Ethernet Connection X722 for 10GBASE-T>
    S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
    S1-RootPort3-uncore_iio_3<da:00.0 NVMe Datacenter SSD [Optane]>

2. Collect metrics for all I/O devices below IIO stacks

  ./perf stat --iiostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
    357708+0 records in
    357707+0 records out
    375083606016 bytes (375 GB, 349 GiB) copied, 215.381 s, 1.7 GB/s

  Performance counter stats for 'system wide':

     device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
    00:00.0                    0                    0                    0                    0
    81:00.0                    0                    0                    0                    0
    18:00.0                    0                    0                    0                    0
    86:00.0                    0                    0                    0                    0
    88:00.0                    0                    0                    0                    0
    3b:00.0                    3                    0                    0                    0
    3c:03.0                    3                    0                    0                    0
    3d:00.0                    3                    0                    0                    0
    af:00.0                    0                    0                    0                    0
    da:00.0               358559                   44                    0                   22

    215.383783574 seconds time elapsed

3. Collect metrics for a comma-separated list of I/O devices

  ./perf stat --iiostat=da:00.0 -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
    381555+0 records in
    381554+0 records out
    400088457216 bytes (400 GB, 373 GiB) copied, 374.044 s, 1.1 GB/s

  Performance counter stats for 'system wide':

     device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
    da:00.0               382462                   47                    0                   23

    374.045775505 seconds time elapsed

Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
---
 tools/perf/Documentation/perf-stat.txt        |  12 +
 tools/perf/arch/x86/util/Build                |   1 +
 tools/perf/arch/x86/util/iiostat.c            | 533 +++++++++++++++++-
 tools/perf/builtin-stat.c                     |  32 +-
 tools/perf/util/evsel.h                       |   1 +
 tools/perf/util/iiostat.h                     |  35 ++
 .../scripting-engines/trace-event-python.c    |   2 +-
 tools/perf/util/stat-display.c                |  53 +-
 tools/perf/util/stat-shadow.c                 |  12 +-
 tools/perf/util/stat.c                        |   3 +-
 tools/perf/util/stat.h                        |   2 +
 11 files changed, 676 insertions(+), 10 deletions(-)
 create mode 100644 tools/perf/util/iiostat.h

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 930c51c01201..96a2ec8c6b55 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -262,6 +262,18 @@ See perf list output for the possble metrics and metricgroups.
 -A::
 --no-aggr::
 Do not aggregate counts across all monitored CPUs.
+--iiostat::
+This mode is intended to provide four I/O performance metrics for each
+I/O device below the IIO stacks:
+    --Inbound Read(MB)   - I/O device reads from the host memory, in MB
+    --Inbound Write(MB)  - I/O device writes to the host memory, in MB
+    --Outbound Read(MB)  - CPU reads from the I/O device, in MB
+    --Outbound Write(MB) - CPU writes to the I/O device, in MB
+
+Each metric requires only one IIO event which increments at every 4B
+transfer in the corresponding direction. The formula to compute each
+metric is generic:
+    #EventCount * 4B / (1024 * 1024)
 
 --topdown::
 Print top down level 1 metrics if supported by the CPU. This allows to
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 47f9c56e744f..e19566e16e5d 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -6,6 +6,7 @@ perf-y += perf_regs.o
 perf-y += group.o
 perf-y += machine.o
 perf-y += event.o
+perf-y += iiostat.o
 
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/iiostat.c b/tools/perf/arch/x86/util/iiostat.c
index b93b9b9da418..058a01d3a93f 100644
--- a/tools/perf/arch/x86/util/iiostat.c
+++ b/tools/perf/arch/x86/util/iiostat.c
@@ -9,6 +9,7 @@
  */
 #include "path.h"
 #include "pci.h"
+#include "util/cpumap.h"
 #include <api/fs/fs.h>
 #include <linux/kernel.h>
 #include <linux/err.h>
@@ -27,10 +28,89 @@
 #include <stdlib.h>
 #include <regex.h>
 
+/*
+ * The Intel® Xeon® Scalable processor family (code name Skylake-SP) makes
+ * significant changes in the integrated I/O (IIO) architecture. The new
+ * solution introduces IIO stacks which manage traffic between the PCIe
+ * domain and the Mesh domain. Each IIO stack has its own PMON block and can
+ * handle either DMI port, x16 PCIe root port, MCP-Link or various built-in
+ * accelerators. IIO PMON blocks allow concurrent monitoring of I/O flows up
+ * to 4 x4 bifurcation within each IIO stack.
+ *
+ * The new --iiostat mode in perf stat is intended to provide four I/O
+ * performance metrics for each I/O device below the IIO stacks:
+ *     --Inbound Read(MB)   - I/O device reads from the host memory, in MB
+ *     --Inbound Write(MB)  - I/O device writes to the host memory, in MB
+ *     --Outbound Read(MB)  - CPU reads from the I/O device, in MB
+ *     --Outbound Write(MB) - CPU writes to the I/O device, in MB
+ *
+ * Each metric requires only one IIO event which increments at every 4B
+ * transfer in the corresponding direction. The formula to compute each
+ * metric is generic:
+ *     #EventCount * 4B / (1024 * 1024)
+ *
+ * This implementation starts by discovering the IIO stacks on the platform
+ * and all devices below each stack. The next step is to configure a group of
+ * four events per device and tie each event group to its device.
+ *
+ * Sample output:
+
+./perf stat --iiostat=show
+	S0-RootPort0-uncore_iio_0<00:00.0>
+	S1-RootPort0-uncore_iio_0<81:00.0>
+	S0-RootPort1-uncore_iio_1<18:00.0>
+	S1-RootPort1-uncore_iio_1<86:00.0>
+	S1-RootPort1-uncore_iio_1<88:00.0>
+	S0-RootPort2-uncore_iio_2<3d:00.0>
+	S1-RootPort2-uncore_iio_2<af:00.0>
+	S1-RootPort3-uncore_iio_3<da:00.0>
+
+./perf stat --iiostat=af:00.0 -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
+	381555+0 records in
+	381554+0 records out
+	400088457216 bytes (400 GB, 373 GiB) copied, 374.044 s, 1.1 GB/s
+
+Performance counter stats for 'system wide':
+
+	 device  Inbound Read(MB)  Inbound Write(MB)  Outbound Read(MB)  Outbound Write(MB)
+	af:00.0    382462                47                   0                 23
+
+374.045775505 seconds time elapsed
+ */
+#define PCI_BUS_MAX_DEVICE_NUMBER 32
+#define PCI_BUS_MAX_FUNCTION_NUMBER 8
+
+#define PLATFORM_MAPPING_PATH	"devices/uncore_iio_%d/platform_mapping"
+
+typedef enum {
+	IIOSTAT_NONE		= 0,
+	IIOSTAT_SHOW		= 1,
+	IIOSTAT_RUN		= 2
+} iiostat_mode_t;
+
+static iiostat_mode_t iiostat_mode = IIOSTAT_NONE;
+
+static const char * const iiostat_metrics[] = {
+	"Inbound Read(MB)",
+	"Inbound Write(MB)",
+	"Outbound Read(MB)",
+	"Outbound Write(MB)",
+};
+
+static inline int iiostat_metrics_count(void)
+{
+	return sizeof(iiostat_metrics) / sizeof(char *);
+}
+
+static const char *get_iiostat_metric(int idx)
+{
+	return *(iiostat_metrics + idx % iiostat_metrics_count());
+}
+
 struct dev_info {
 	struct bdf bdf;
 	u8 ch_mask;
-	u8 socket;
+	u8 die;
 	u8 pmu_idx;
 	u8 root_port_nr;
 };
@@ -83,6 +163,45 @@ struct iio_devs_list {
 #define iio_device_delete_from_list(device) \
 		list_del(&(device->node))
 
+static u8 *rp_nr;
+
+static u8 get_rp_nr(u8 die)
+{
+	return rp_nr[die]++;
+}
+
+static u64 platform_mapping_build(char *buf)
+{
+	char *end;
+	u8 offset = 0;
+	unsigned long long interim = 0;
+
+	for (long mapping_byte = strtol(buf, &end, 16);
+		buf != end; mapping_byte = strtol(buf, &end, 16)) {
+		buf = end + 1;
+		if (*end == ',' || *end == '\n')
+			interim |= (u64)mapping_byte << (8 * offset++);
+	}
+	return interim;
+}
+
+static int uncore_pmu_iio_platform_mapping_read(u8 pmu_idx, u64 * const mapping)
+{
+	char *buf;
+	char path[PATH_MAX];
+	size_t size;
+
+	scnprintf(path, PATH_MAX, PLATFORM_MAPPING_PATH, pmu_idx);
+	if (sysfs__read_str(path, &buf, &size) < 0) {
+		fprintf(stderr, "iiostat is not supported\n");
+		return -1;
+	}
+	*mapping = (size == 2) ? 0 : platform_mapping_build(buf);
+	free(buf);
+
+	return 0;
+}
+
 static struct iio_device *iio_device_new(struct dev_info *info)
 {
 	struct iio_device *p =
@@ -109,7 +228,7 @@ static void iiostat_device_show(FILE *output,
 {
 	if (output && device)
 		fprintf(output, "S%d-RootPort%d-uncore_iio_%d<%02x:%02x.%x>\n",
-			device->dev_info.socket,
+			device->dev_info.die,
 			device->dev_info.root_port_nr, device->dev_info.pmu_idx,
 			device->dev_info.bdf.busno, device->dev_info.bdf.devno,
 			device->dev_info.bdf.funcno);
@@ -176,3 +295,413 @@ static void iio_devs_list_join_list(struct iio_devs_list *dest,
 		dest->nr_entries += src->nr_entries;
 	}
 }
+
+static int pci_rp_probe(struct dev_info *info, struct iio_devs_list *list)
+{
+	u8 secondary_bus_number = 0;
+	u8 subordinate_bus_number = 0;
+	struct iio_device *device = NULL;
+
+	if (!pci_device_probe(info->bdf))
+		return 0;
+
+	if (!is_pci_device_root_port(info->bdf,
+				     &secondary_bus_number,
+				     &subordinate_bus_number)) {
+		secondary_bus_number = info->bdf.busno;
+		subordinate_bus_number = info->bdf.busno;
+	}
+
+	for (u8 b = secondary_bus_number; b <= subordinate_bus_number; b++) {
+		for (u8 d = 0; d < PCI_BUS_MAX_DEVICE_NUMBER; d++) {
+			for (u8 f = 0; f < PCI_BUS_MAX_FUNCTION_NUMBER; f++) {
+				info->bdf.busno = b;
+				info->bdf.devno = d;
+				info->bdf.funcno = f;
+				if (!pci_device_probe(info->bdf) ||
+				    is_pci_device_root_port(info->bdf, NULL, NULL))
+					continue;
+				device = iio_device_new(info);
+				if (device) {
+					iio_devs_list_add_device(list, device);
+					break;
+				}
+				return -ENOMEM;
+			}
+		}
+	}
+	return 0;
+}
+
+static int pci_rp_scan(u8 rp, u8 pmu_idx, u8 die,
+			struct iio_devs_list **list)
+{
+	int ret = 0;
+	u8 part = 0;
+	struct dev_info info;
+	struct iio_device *device = NULL;
+
+	struct iio_devs_list *interim_list = iio_devs_list_new();
+
+	if (!interim_list)
+		return -ENOMEM;
+
+	info.bdf.busno = rp;
+	info.bdf.funcno = 0;
+	info.pmu_idx = pmu_idx;
+	info.die = die;
+	info.root_port_nr = get_rp_nr(die);
+
+	/* Extra case for root port 0x00 */
+	if (info.bdf.busno == 0x00) {
+		info.bdf.devno = 0;
+		device = iio_device_new(&info);
+		if (device)
+			iio_devs_list_add_device(interim_list, device);
+		else {
+			iio_devs_list_free(interim_list);
+			return -ENOMEM;
+		}
+	} else {
+		for (part = 0; part < 4; part++) {
+			info.bdf.devno = part;
+			info.ch_mask = (1 << part);
+			ret = pci_rp_probe(&info, interim_list);
+			if (ret) {
+				iio_devs_list_free(interim_list);
+				return ret;
+			}
+		}
+	}
+
+	if (interim_list->nr_entries)
+		*list = interim_list;
+	else
+		iio_devs_list_free(interim_list);
+
+	return 0;
+}
+
+static int pmu_scan(u8 pmu_idx, u64 mapping, struct iio_devs_list **list)
+{
+	int ret;
+	u8 rp;
+	struct iio_devs_list *interim_list, *rp_list = NULL;
+
+	interim_list = iio_devs_list_new();
+	if (!interim_list)
+		return -ENOMEM;
+
+	for (u8 die = 0; die < cpu__max_node(); die++) {
+		rp = (u8)(mapping >> (die * 8));
+		if (!rp && die)
+			break;
+
+		ret = pci_rp_scan(rp, pmu_idx, die, &rp_list);
+		if (ret) {
+			iio_devs_list_free(interim_list);
+			return ret;
+		}
+		if (rp_list) {
+			iio_devs_list_join_list(interim_list, rp_list);
+			free(rp_list);
+			rp_list = NULL;
+		}
+	}
+	if (interim_list->nr_entries)
+		*list = interim_list;
+	else
+		iio_devs_list_free(interim_list);
+	return 0;
+}
+
+static int iio_devs_scan(struct iio_devs_list **list)
+{
+	u64 mapping;
+	int ret;
+	struct iio_devs_list *pmu_dev_list = NULL;
+	struct iio_devs_list *interim = NULL;
+
+	rp_nr = (u8 *)calloc(cpu__max_node(), 1);
+	if (!rp_nr)
+		return -ENOMEM;
+
+	interim = iio_devs_list_new();
+	if (!interim) {
+		free(rp_nr);
+		return -ENOMEM;
+	}
+
+	for (u8 pmu_idx = 0; pmu_idx < 6; pmu_idx++) {
+		ret = uncore_pmu_iio_platform_mapping_read(pmu_idx, &mapping);
+		if (ret)
+			break;
+		/* IIO stack 0 on die 0 is always on bus 0x00. */
+		if ((mapping == 0) && pmu_idx)
+			continue;
+
+		ret = pmu_scan(pmu_idx, mapping, &pmu_dev_list);
+		if (ret)
+			break;
+
+		if (pmu_dev_list) {
+			iio_devs_list_join_list(interim, pmu_dev_list);
+			free(pmu_dev_list);
+			pmu_dev_list = NULL;
+		}
+	}
+
+	if (!ret)
+		*list = interim;
+	else
+		iio_devs_list_free(interim);
+
+	free(rp_nr);
+
+	return ret;
+}
+
+static int iio_dev_parse_bdf_str(struct bdf *bdf, char *str)
+{
+	int ret = 0;
+	regex_t regex;
+	/*
+	 * Expected format bus:device.function:
+	 * Valid bus range [0:ff]
+	 * Valid device range [0:1f]
+	 * Valid function range [0:7]
+	 * Example: af:00.0, d:0.0, 5e:0.0
+	 */
+	regcomp(&regex,
+		"^([a-f0-9A-F]{1,2}):(([01]{0,1})([0-9a-fA-F]{1}))\\.([0-7]{1})$",
+		REG_EXTENDED);
+	ret = regexec(&regex, str, 0, NULL, 0);
+	if (!ret)
+		sscanf(str, "%02hhx:%02hhx.%hhx",
+		       &bdf->busno, &bdf->devno, &bdf->funcno);
+	else
+		pr_warning("Unrecognized device format: %s\n", str);
+
+	regfree(&regex);
+	return ret;
+}
+
+static struct iio_device *pci_devs_list_find_device_by_bdf(
+			   const struct iio_devs_list * const list,
+			   struct bdf bdf)
+{
+	struct iio_device *it;
+
+	if (list) {
+		iio_devs_list_for_each_device(list, it) {
+			if (is_same_iio_device(it->dev_info.bdf, bdf))
+				return it;
+		}
+	}
+	return NULL;
+}
+
+static int iio_devs_list_filter_by_bdf(struct iio_devs_list **list,
+					const char *bdf_str)
+{
+	struct bdf bdf;
+	struct iio_device *device;
+	const char *delim = ",";
+	char *token = NULL;
+	char *tmp;
+	char *tmp_bdf_str = (char *)bdf_str;
+
+	struct iio_devs_list *filtered_list = iio_devs_list_new();
+
+	if (!filtered_list)
+		return -ENOMEM;
+
+	token = strtok(tmp_bdf_str, delim);
+	while (token != NULL) {
+		tmp = token;
+		if (!iio_dev_parse_bdf_str(&bdf, tmp)) {
+			if (!pci_devs_list_find_device_by_bdf(filtered_list, bdf)) {
+				device = pci_devs_list_find_device_by_bdf(*list, bdf);
+				if (device) {
+					iio_device_delete_from_list(device);
+					iio_devs_list_add_device(filtered_list, device);
+				} else
+					pr_warning("Device %02x:%02x.%x not found\n",
+						   bdf.busno, bdf.devno, bdf.funcno);
+			}
+		}
+		token = strtok(NULL, delim);
+	}
+	iio_devs_list_free(*list);
+	*list = filtered_list;
+	return 0;
+}
+
+static struct iio_device *iio_dev_get_by_idx(const struct iio_devs_list *list,
+					      int idx)
+{
+	struct iio_device *device = NULL;
+
+	if (idx < list->nr_entries)
+		iio_devs_list_for_each_device(list, device)
+			if (device->idx == idx)
+				break;
+
+	return device;
+}
+
+static int iiostat_event_group(struct evlist *evl,
+				struct iio_devs_list *dev_list)
+{
+	int ret = 0;
+	struct iio_device *device = NULL;
+	const char *iiostat_cmd_template =
+	"{uncore_iio_%x/event=0x83,umask=0x04,ch_mask=0x%02x,fc_mask=0x07/,\
+	uncore_iio_%x/event=0x83,umask=0x01,ch_mask=0x%02x,fc_mask=0x07/,\
+	uncore_iio_%x/event=0xc0,umask=0x04,ch_mask=0x%02x,fc_mask=0x07/,\
+	uncore_iio_%x/event=0xc0,umask=0x01,ch_mask=0x%02x,fc_mask=0x07/}";
+	const int len_template = strlen(iiostat_cmd_template) + 1;
+	struct evsel *evsel = NULL;
+	int metrics_count = iiostat_metrics_count();
+	char *iiostat_cmd = calloc(len_template, 1);
+
+	if (!iiostat_cmd)
+		return -ENOMEM;
+	iio_devs_list_for_each_device(dev_list, device) {
+		sprintf(iiostat_cmd, iiostat_cmd_template,
+			device->dev_info.pmu_idx, device->dev_info.ch_mask,
+			device->dev_info.pmu_idx, device->dev_info.ch_mask,
+			device->dev_info.pmu_idx, device->dev_info.ch_mask,
+			device->dev_info.pmu_idx, device->dev_info.ch_mask);
+		ret = parse_events(evl, iiostat_cmd, NULL);
+		if (ret)
+			goto out;
+	}
+	evlist__for_each_entry(evl, evsel)
+		evsel->perf_device = iio_dev_get_by_idx(dev_list,
+							evsel->idx / metrics_count);
+out:
+	list_del_init(&(dev_list->devices));
+	iio_devs_list_free(dev_list);
+	free(iiostat_cmd);
+	return ret;
+}
+
+int iiostat_parse(const struct option *opt,
+		  const char *str,
+		  int unset __maybe_unused)
+{
+	int ret = 0;
+	struct iio_devs_list *dev_list = NULL;
+	struct evlist *evl = *(struct evlist **)opt->value;
+	struct perf_stat_config *config = (struct perf_stat_config *)opt->data;
+
+	if (evl->core.nr_entries > 0) {
+		pr_err("unsupported event configuration\n");
+		return -1;
+	}
+	config->metric_only = true;
+	config->aggr_mode = AGGR_DEVICE;
+	config->iiostat_run = true;
+	ret = iio_devs_scan(&dev_list);
+	if (ret)
+		return ret;
+
+	if (!str)
+		iiostat_mode = IIOSTAT_RUN;
+	else if (!strcmp(str, "show"))
+		iiostat_mode = IIOSTAT_SHOW;
+	else {
+		iiostat_mode = IIOSTAT_RUN;
+		ret = iio_devs_list_filter_by_bdf(&dev_list, str);
+		if (ret) {
+			iio_devs_list_free(dev_list);
+			return ret;
+		}
+		if (dev_list->nr_entries == 0) {
+			pr_err("Requested devices were not found\n");
+			iio_devs_list_free(dev_list);
+			return -1;
+		}
+	}
+	return iiostat_event_group(evl, dev_list);
+}
+
+void iiostat_prefix(struct perf_stat_config *config,
+		    struct evlist *evlist,
+		    char *prefix, struct timespec *ts)
+{
+	struct iio_device *device = evlist->selected->perf_device;
+
+	if (device) {
+		if (ts)
+			sprintf(prefix, "%6lu.%09lu%s%02x:%02x.%x%s",
+					ts->tv_sec, ts->tv_nsec,
+					config->csv_sep, device->dev_info.bdf.busno,
+					device->dev_info.bdf.devno, device->dev_info.bdf.funcno,
+					config->csv_sep);
+		else
+			sprintf(prefix, "%02x:%02x.%x%s",
+					device->dev_info.bdf.busno, device->dev_info.bdf.devno,
+					device->dev_info.bdf.funcno, config->csv_sep);
+	}
+}
+
+void iiostat_print_metric(struct perf_stat_config *config, struct evsel *evsel,
+			  struct perf_stat_output_ctx *out)
+{
+	double iiostat_value = 0;
+	u64 prev_count_val = 0;
+	const char *iiostat_metric = get_iiostat_metric(evsel->idx);
+	u8 device_die =
+		((struct iio_device *)evsel->perf_device)->dev_info.die;
+	struct perf_counts_values *count =
+		perf_counts(evsel->counts, device_die, 0);
+
+	if (evsel->prev_raw_counts && !out->force_header) {
+		struct perf_counts_values *prev_count =
+			perf_counts(evsel->prev_raw_counts, device_die, 0);
+		prev_count_val = prev_count->val;
+		prev_count->val = count->val;
+	}
+	iiostat_value = (count->val - prev_count_val) / ((double) count->run / count->ena);
+	out->print_metric(config, out->ctx, NULL, "%8.0f",
+			  iiostat_metric, iiostat_value / (256 * 1024));
+}
+
+int iiostat_print_device_list(struct evlist *evlist,
+			       struct perf_stat_config *config)
+{
+	struct evsel *evsel;
+	struct iio_device *device = NULL;
+
+	if (config->aggr_mode != AGGR_DEVICE) {
+		pr_err("unsupported event config\n");
+		return -1;
+	}
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (!evsel->perf_device) {
+			pr_err("unsupported event config\n");
+			return -1;
+		}
+		if ((iiostat_mode == IIOSTAT_SHOW || verbose) && device != evsel->perf_device) {
+			device = evsel->perf_device;
+			iiostat_device_show(config->output, device);
+		}
+	}
+	return (iiostat_mode == IIOSTAT_SHOW) ? -1 : 0;
+}
+
+void iiostat_delete_device_list(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	struct iio_device *device = NULL;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (device != evsel->perf_device) {
+			device = evsel->perf_device;
+			iio_device_delete(evsel->perf_device);
+		}
+	}
+}
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 468fc49420ce..c7516a0182a0 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -66,6 +66,7 @@
 #include "util/time-utils.h"
 #include "util/top.h"
 #include "asm/bug.h"
+#include "util/iiostat.h"
 
 #include <linux/time64.h>
 #include <linux/zalloc.h>
@@ -186,6 +187,7 @@ static struct perf_stat_config stat_config = {
 	.metric_only_len	= METRIC_ONLY_LEN,
 	.walltime_nsecs_stats	= &walltime_nsecs_stats,
 	.big_num		= true,
+	.iiostat_run		= false,
 };
 
 static inline void diff_timespec(struct timespec *r, struct timespec *a,
@@ -723,6 +725,13 @@ static int parse_metric_groups(const struct option *opt,
 	return metricgroup__parse_groups(opt, str, &stat_config.metric_events);
 }
 
+__weak int iiostat_parse(const struct option *opt __maybe_unused,
+						const char *str __maybe_unused,
+						int unset __maybe_unused)
+{
+	return 0;
+}
+
 static struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -803,6 +812,8 @@ static struct option stat_options[] = {
 	OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
 		     "monitor specified metrics or metric groups (separated by ,)",
 		     parse_metric_groups),
+	OPT_CALLBACK_OPTARG(0, "iiostat", &evsel_list, &stat_config, "PCIe bandwidth",
+	     "measure PCIe bandwidth per device", iiostat_parse),
 	OPT_END()
 };
 
@@ -908,6 +919,7 @@ static int perf_stat_init_aggr_mode(void)
 		break;
 	case AGGR_GLOBAL:
 	case AGGR_THREAD:
+	case AGGR_DEVICE:
 	case AGGR_UNSET:
 	default:
 		break;
@@ -1072,6 +1084,7 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st)
 	case AGGR_NONE:
 	case AGGR_GLOBAL:
 	case AGGR_THREAD:
+	case AGGR_DEVICE:
 	case AGGR_UNSET:
 	default:
 		break;
@@ -1129,6 +1142,12 @@ __weak void arch_topdown_group_warn(void)
 {
 }
 
+__weak int iiostat_print_device_list(struct evlist *evlist __maybe_unused,
+				      struct perf_stat_config *config __maybe_unused)
+{
+	return 0;
+}
+
 /*
  * Add default attributes, if there were no attributes specified or
  * if -d/--detailed, -d -d or -d -d -d is used:
@@ -1358,6 +1377,10 @@ static int add_default_attributes(void)
 		free(str);
 	}
 
+	if (stat_config.iiostat_run &&
+		iiostat_print_device_list(evsel_list, &stat_config) < 0)
+		return -1;
+
 	if (!evsel_list->core.nr_entries) {
 		if (target__has_cpu(&target))
 			default_attrs0[0].config = PERF_COUNT_SW_CPU_CLOCK;
@@ -1680,6 +1703,10 @@ static void setup_system_wide(int forks)
 	}
 }
 
+__weak void iiostat_delete_device_list(struct evlist *evlist __maybe_unused)
+{
+}
+
 int cmd_stat(int argc, const char **argv)
 {
 	const char * const stat_usage[] = {
@@ -1844,7 +1871,7 @@ int cmd_stat(int argc, const char **argv)
 	 * --per-thread is aggregated per thread, we dont mix it with cpu mode
 	 */
 	if (((stat_config.aggr_mode != AGGR_GLOBAL &&
-	      stat_config.aggr_mode != AGGR_THREAD) || nr_cgroups) &&
+	      stat_config.aggr_mode != AGGR_THREAD && stat_config.aggr_mode != AGGR_DEVICE) || nr_cgroups) &&
 	    !target__has_cpu(&target)) {
 		fprintf(stderr, "both cgroup and no-aggregation "
 			"modes only available in system-wide mode\n");
@@ -2005,6 +2032,9 @@ int cmd_stat(int argc, const char **argv)
 	perf_stat__exit_aggr_mode();
 	perf_evlist__free_stats(evsel_list);
 out:
+	if (stat_config.iiostat_run)
+		iiostat_delete_device_list(evsel_list);
+
 	zfree(&stat_config.walltime_run);
 
 	if (smi_cost && smi_reset)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ddc5ee6f6592..4d1582db3c7b 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -100,6 +100,7 @@ struct evsel {
 		perf_evsel__sb_cb_t	*cb;
 		void			*data;
 	} side_band;
+	void			*perf_device;
 };
 
 struct perf_missing_features {
diff --git a/tools/perf/util/iiostat.h b/tools/perf/util/iiostat.h
new file mode 100644
index 000000000000..4c886beec348
--- /dev/null
+++ b/tools/perf/util/iiostat.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * perf stat --iiostat
+ *
+ * Copyright (C) 2019, Intel Corporation
+ *
+ * Authors: Roman Sudarikov <roman.sudarikov@intel.com>
+ *	    Alexander Antonov <alexander.antonov@intel.com>
+ */
+
+#ifndef _IIOSTAT_H
+#define _IIOSTAT_H
+
+#include "util/stat.h"
+#include <subcmd/parse-options.h>
+#include "util/parse-events.h"
+#include "util/evlist.h"
+
+struct option;
+struct perf_stat_config;
+struct evlist;
+struct timespec;
+
+int  iiostat_parse(const struct option *opt, const char *str,
+		    int unset __maybe_unused);
+void iiostat_prefix(struct perf_stat_config *config, struct evlist *evlist,
+		     char *prefix, struct timespec *ts);
+void iiostat_print_metric(struct perf_stat_config *config __maybe_unused,
+			   struct evsel *evsel __maybe_unused,
+			   struct perf_stat_output_ctx *out __maybe_unused);
+int iiostat_print_device_list(struct evlist *evlist __maybe_unused,
+			       struct perf_stat_config *config __maybe_unused);
+void iiostat_delete_device_list(struct evlist *evlist);
+
+#endif /* _IIOSTAT_H */
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c b/tools/perf/util/scripting-engines/trace-event-python.c
index 93c03b39cd9c..65d08daff8d9 100644
--- a/tools/perf/util/scripting-engines/trace-event-python.c
+++ b/tools/perf/util/scripting-engines/trace-event-python.c
@@ -1396,7 +1396,7 @@ static void python_process_stat(struct perf_stat_config *config,
 	struct perf_cpu_map *cpus = counter->core.cpus;
 	int cpu, thread;
 
-	if (config->aggr_mode == AGGR_GLOBAL) {
+	if (config->aggr_mode == AGGR_GLOBAL || config->aggr_mode == AGGR_DEVICE) {
 		process_stat(counter, -1, -1, tstamp,
 			     &counter->counts->aggr);
 		return;
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index ed3b0ac2f785..def4ef6bb155 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -16,6 +16,8 @@
 #include <linux/ctype.h>
 #include "cgroup.h"
 #include <api/fs/fs.h>
+#include "iiostat.h"
+#include "debug.h"
 
 #define CNTR_NOT_SUPPORTED	"<not supported>"
 #define CNTR_NOT_COUNTED	"<not counted>"
@@ -123,6 +125,7 @@ static void aggr_printout(struct perf_stat_config *config,
 			config->csv_sep);
 		break;
 	case AGGR_GLOBAL:
+	case AGGR_DEVICE:
 	case AGGR_UNSET:
 	default:
 		break;
@@ -301,6 +304,11 @@ static void print_metric_header(struct perf_stat_config *config,
 	struct outstate *os = ctx;
 	char tbuf[1024];
 
+	if (os->evsel->perf_device && os->evsel->evlist->selected->perf_device
+	    && config->iiostat_run &&
+	    os->evsel->perf_device != os->evsel->evlist->selected->perf_device)
+		return;
+
 	if (!valid_only_metric(unit))
 		return;
 	unit = fixunit(tbuf, os->evsel, unit);
@@ -322,7 +330,7 @@ static int first_shadow_cpu(struct perf_stat_config *config,
 	if (config->aggr_mode == AGGR_NONE)
 		return id;
 
-	if (config->aggr_mode == AGGR_GLOBAL)
+	if (config->aggr_mode == AGGR_GLOBAL || config->aggr_mode == AGGR_DEVICE)
 		return 0;
 
 	for (i = 0; i < perf_evsel__nr_cpus(evsel); i++) {
@@ -416,6 +424,7 @@ static void printout(struct perf_stat_config *config, int id, int nr,
 	if (config->csv_output && !config->metric_only) {
 		static int aggr_fields[] = {
 			[AGGR_GLOBAL] = 0,
+			[AGGR_DEVICE] = 0,
 			[AGGR_THREAD] = 1,
 			[AGGR_NONE] = 1,
 			[AGGR_SOCKET] = 2,
@@ -899,6 +908,7 @@ static int aggr_header_lens[] = {
 	[AGGR_NONE] = 6,
 	[AGGR_THREAD] = 24,
 	[AGGR_GLOBAL] = 0,
+	[AGGR_DEVICE] = 0,
 };
 
 static const char *aggr_header_csv[] = {
@@ -907,7 +917,8 @@ static const char *aggr_header_csv[] = {
 	[AGGR_SOCKET] 	= 	"socket,cpus",
 	[AGGR_NONE] 	= 	"cpu,",
 	[AGGR_THREAD] 	= 	"comm-pid,",
-	[AGGR_GLOBAL] 	=	""
+	[AGGR_GLOBAL]	=	"",
+	[AGGR_DEVICE]	=	"device,"
 };
 
 static void print_metric_headers(struct perf_stat_config *config,
@@ -931,6 +942,8 @@ static void print_metric_headers(struct perf_stat_config *config,
 			fputs("time,", config->output);
 		fputs(aggr_header_csv[config->aggr_mode], config->output);
 	}
+	if (config->iiostat_run && !config->interval && !config->csv_output)
+		fprintf(config->output, " device         ");
 
 	/* Print metrics headers only */
 	evlist__for_each_entry(evlist, counter) {
@@ -949,6 +962,13 @@ static void print_metric_headers(struct perf_stat_config *config,
 	fputc('\n', config->output);
 }
 
+__weak void iiostat_prefix(struct perf_stat_config *config __maybe_unused,
+			    struct evlist *evlist __maybe_unused,
+			    char *prefix __maybe_unused,
+			    struct timespec *ts __maybe_unused)
+{
+}
+
 static void print_interval(struct perf_stat_config *config,
 			   struct evlist *evlist,
 			   char *prefix, struct timespec *ts)
@@ -961,7 +981,8 @@ static void print_interval(struct perf_stat_config *config,
 	if (config->interval_clear)
 		puts(CONSOLE_CLEAR);
 
-	sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
+	if (!config->iiostat_run)
+		sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
 
 	if ((num_print_interval == 0 && !config->csv_output) || config->interval_clear) {
 		switch (config->aggr_mode) {
@@ -990,6 +1011,9 @@ static void print_interval(struct perf_stat_config *config,
 			if (!metric_only)
 				fprintf(output, "                  counts %*s events\n", unit_width, "unit");
 			break;
+		case AGGR_DEVICE:
+			fprintf(output, "#           time  device        ");
+			break;
 		case AGGR_GLOBAL:
 		default:
 			fprintf(output, "#           time");
@@ -1167,6 +1191,10 @@ perf_evlist__print_counters(struct evlist *evlist,
 	int interval = config->interval;
 	struct evsel *counter;
 	char buf[64], *prefix = NULL;
+	void *device = NULL;
+
+	if (config->iiostat_run)
+		evlist->selected = evlist__first(evlist);
 
 	if (interval)
 		print_interval(config, evlist, prefix = buf, ts);
@@ -1180,7 +1208,7 @@ perf_evlist__print_counters(struct evlist *evlist,
 			print_metric_headers(config, evlist, prefix, false);
 		if (num_print_iv++ == 25)
 			num_print_iv = 0;
-		if (config->aggr_mode == AGGR_GLOBAL && prefix)
+		if ((config->aggr_mode == AGGR_GLOBAL) && prefix)
 			fprintf(config->output, "%s", prefix);
 	}
 
@@ -1214,6 +1242,23 @@ perf_evlist__print_counters(struct evlist *evlist,
 			}
 		}
 		break;
+	case AGGR_DEVICE:
+		counter = evlist__first(evlist);
+		perf_evlist__set_selected(evlist, counter);
+		iiostat_prefix(config, evlist, prefix = buf, ts);
+		fprintf(config->output, "%s", prefix);
+		evlist__for_each_entry(evlist, counter) {
+			device = evlist->selected->perf_device;
+			if (device && device != counter->perf_device) {
+				perf_evlist__set_selected(evlist, counter);
+				iiostat_prefix(config, evlist, prefix, ts);
+				fprintf(config->output, "\n%s", prefix);
+			}
+			print_counter_aggr(config, counter, prefix);
+			if ((counter->idx + 1) == evlist->core.nr_entries)
+				fputc('\n', config->output);
+		}
+		break;
 	case AGGR_UNSET:
 	default:
 		break;
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 2c41d47f6f83..8c46c172a457 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -9,6 +9,8 @@
 #include "expr.h"
 #include "metricgroup.h"
 #include <linux/zalloc.h>
+#include "iiostat.h"
+#include "counts.h"
 
 /*
  * AGGR_GLOBAL: Use CPU 0
@@ -814,6 +816,12 @@ static void generic_metric(struct perf_stat_config *config,
 		zfree(&pctx.ids[i].name);
 }
 
+__weak void iiostat_print_metric(struct perf_stat_config *config __maybe_unused,
+				  struct evsel *evsel __maybe_unused,
+				  struct perf_stat_output_ctx *out __maybe_unused)
+{
+}
+
 void perf_stat__print_shadow_stats(struct perf_stat_config *config,
 				   struct evsel *evsel,
 				   double avg, int cpu,
@@ -829,7 +837,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
 	struct metric_event *me;
 	int num = 1;
 
-	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
+	if (config->iiostat_run) {
+		iiostat_print_metric(config, evsel, out);
+	} else if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
 		total = runtime_stat_avg(st, STAT_CYCLES, ctx, cpu);
 
 		if (total) {
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ebdd130557fb..672d33c1cafe 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -318,6 +318,7 @@ process_counter_values(struct perf_stat_config *config, struct evsel *evsel,
 		}
 		break;
 	case AGGR_GLOBAL:
+	case AGGR_DEVICE:
 		aggr->val += count->val;
 		aggr->ena += count->ena;
 		aggr->run += count->run;
@@ -377,7 +378,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	if (ret)
 		return ret;
 
-	if (config->aggr_mode != AGGR_GLOBAL)
+	if (config->aggr_mode != AGGR_GLOBAL && config->aggr_mode != AGGR_DEVICE)
 		return 0;
 
 	if (!counter->snapshot)
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index edbeb2f63e8d..be65afdaad90 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -46,6 +46,7 @@ enum aggr_mode {
 	AGGR_DIE,
 	AGGR_CORE,
 	AGGR_THREAD,
+	AGGR_DEVICE,
 	AGGR_UNSET,
 };
 
@@ -106,6 +107,7 @@ struct perf_stat_config {
 	bool			 big_num;
 	bool			 no_merge;
 	bool			 walltime_run_table;
+	bool			 iiostat_run;
 	FILE			*output;
 	unsigned int		 interval;
 	unsigned int		 timeout;
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/6] perf tools: Add feature check for libpci
  2019-11-26 16:36       ` [PATCH 4/6] perf stat: New --iiostat mode to provide I/O performance metrics roman.sudarikov
@ 2019-11-26 16:36         ` roman.sudarikov
  2019-11-26 16:36           ` [PATCH 6/6] perf stat: Add PCI device name to --iiostat output roman.sudarikov
  0 siblings, 1 reply; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

Add a feature check for libpci, used to show device names in --iiostat mode.
With libpci support, the device name can be printed next to its b:d:f notation.

Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
---
 tools/build/Makefile.feature      |  2 ++
 tools/build/feature/Makefile      |  4 ++++
 tools/build/feature/test-all.c    |  5 +++++
 tools/build/feature/test-libpci.c | 10 ++++++++++
 tools/perf/Makefile.config        | 10 ++++++++++
 tools/perf/builtin-version.c      |  1 +
 tools/perf/tests/make             |  1 +
 7 files changed, 33 insertions(+)
 create mode 100644 tools/build/feature/test-libpci.c

diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature
index 8a19753cc26a..bebdfb99607c 100644
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@@ -50,6 +50,7 @@ FEATURE_TESTS_BASIC :=                  \
         libelf-mmap                     \
         libnuma                         \
         numa_num_possible_cpus          \
+        libpci                          \
         libperl                         \
         libpython                       \
         libpython-version               \
@@ -115,6 +116,7 @@ FEATURE_DISPLAY ?=              \
          libelf                 \
          libnuma                \
          numa_num_possible_cpus \
+         libpci                 \
          libperl                \
          libpython              \
          libcrypto              \
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 8499385365c0..f0d5f886602d 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -28,6 +28,7 @@ FILES=                                          \
          test-libelf-mmap.bin                   \
          test-libnuma.bin                       \
          test-numa_num_possible_cpus.bin        \
+         test-libpci.bin                        \
          test-libperl.bin                       \
          test-libpython.bin                     \
          test-libpython-version.bin             \
@@ -210,6 +211,9 @@ PERL_EMBED_LIBADD = $(call grep-libs,$(PERL_EMBED_LDOPTS))
 PERL_EMBED_CCOPTS = `perl -MExtUtils::Embed -e ccopts 2>/dev/null`
 FLAGS_PERL_EMBED=$(PERL_EMBED_CCOPTS) $(PERL_EMBED_LDOPTS)
 
+$(OUTPUT)test-libpci.bin:
+	$(BUILD) -lpci
+
 $(OUTPUT)test-libperl.bin:
 	$(BUILD) $(FLAGS_PERL_EMBED)
 
diff --git a/tools/build/feature/test-all.c b/tools/build/feature/test-all.c
index 88145e8cde1a..c61d34804a06 100644
--- a/tools/build/feature/test-all.c
+++ b/tools/build/feature/test-all.c
@@ -74,6 +74,10 @@
 # include "test-libunwind.c"
 #undef main
 
+#define main main_test_libpci
+# include "test-libpci.c"
+#undef main
+
 #define main main_test_libaudit
 # include "test-libaudit.c"
 #undef main
@@ -210,6 +214,7 @@ int main(int argc, char *argv[])
 	main_test_libunwind();
 	main_test_libaudit();
 	main_test_libslang();
+	main_test_libpci();
 	main_test_gtk2(argc, argv);
 	main_test_gtk2_infobar(argc, argv);
 	main_test_libbfd();
diff --git a/tools/build/feature/test-libpci.c b/tools/build/feature/test-libpci.c
new file mode 100644
index 000000000000..4bbeb9ffd687
--- /dev/null
+++ b/tools/build/feature/test-libpci.c
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "pci/pci.h"
+
+int main(void)
+{
+	struct pci_access *pacc = pci_alloc();
+
+	pci_cleanup(pacc);
+	return 0;
+}
diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index 46f7fba2306c..1b9d341492c8 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -839,6 +839,16 @@ ifndef NO_LIBCAP
   endif
 endif
 
+ifndef NO_PCILIB
+  ifeq ($(feature-libpci), 1)
+    CFLAGS += -DHAVE_LIBPCI_SUPPORT
+    EXTLIBS += -lpci
+  else
+    msg := $(warning No libpci found, PCI devices will be shown without names in iiostat mode, please install libpci-dev/pciutils-devel);
+    NO_PCILIB := 1
+  endif
+endif
+
 ifndef NO_BACKTRACE
   ifeq ($(feature-backtrace), 1)
     CFLAGS += -DHAVE_BACKTRACE_SUPPORT
diff --git a/tools/perf/builtin-version.c b/tools/perf/builtin-version.c
index 05cf2af9e2c2..ec4e0eb07825 100644
--- a/tools/perf/builtin-version.c
+++ b/tools/perf/builtin-version.c
@@ -76,6 +76,7 @@ static void library_status(void)
 	STATUS(HAVE_LIBUNWIND_SUPPORT, libunwind);
 	STATUS(HAVE_DWARF_SUPPORT, libdw-dwarf-unwind);
 	STATUS(HAVE_ZLIB_SUPPORT, zlib);
+	STATUS(HAVE_LIBPCI_SUPPORT, libpci);
 	STATUS(HAVE_LZMA_SUPPORT, lzma);
 	STATUS(HAVE_AUXTRACE_SUPPORT, get_cpuid);
 	STATUS(HAVE_LIBBPF_SUPPORT, bpf);
diff --git a/tools/perf/tests/make b/tools/perf/tests/make
index c850d1664c56..0b78cf6e8377 100644
--- a/tools/perf/tests/make
+++ b/tools/perf/tests/make
@@ -109,6 +109,7 @@ make_minimal        += NO_LIBNUMA=1 NO_LIBAUDIT=1 NO_LIBBIONIC=1
 make_minimal        += NO_LIBDW_DWARF_UNWIND=1 NO_AUXTRACE=1 NO_LIBBPF=1
 make_minimal        += NO_LIBCRYPTO=1 NO_SDT=1 NO_JVMTI=1 NO_LIBZSTD=1
 make_minimal        += NO_LIBCAP=1
make_minimal        += NO_PCILIB=1
 
 # $(run) contains all available tests
 run := make_pure
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 6/6] perf stat: Add PCI device name to --iiostat output
  2019-11-26 16:36         ` [PATCH 5/6] perf tools: Add feature check for libpci roman.sudarikov
@ 2019-11-26 16:36           ` roman.sudarikov
  0 siblings, 0 replies; 14+ messages in thread
From: roman.sudarikov @ 2019-11-26 16:36 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, linux-kernel, eranian, bgregg, ak, kan.liang
  Cc: alexander.antonov, roman.sudarikov

From: Roman Sudarikov <roman.sudarikov@linux.intel.com>

Example:
   $ perf stat --iiostat=show

Sample output w/o libpci:

    S0-RootPort0-uncore_iio_0<00:00.0>
    S1-RootPort2-uncore_iio_2<af:00.0>

Sample output with libpci:

    S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
    S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
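For scripting on top of this output, the lines above can be split into their
fields with a short userspace sketch; this is not part of the patch, and the
field names below are illustrative assumptions, not anything perf defines:

```python
import re

# Parse one line of `perf stat --iiostat=show` output, e.g.
# "S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>".
# The trailing device name is only present when perf was built with libpci.
LINE_RE = re.compile(
    r"S(?P<die>\d+)-RootPort(?P<port>\d+)-uncore_iio_(?P<pmu>\d+)"
    r"<(?P<bus>[0-9a-f]{2}):(?P<dev>[0-9a-f]{2})\.(?P<fn>[0-9a-f])"
    r"(?: (?P<name>.+))?>"
)

def parse_iiostat_line(line):
    """Return a dict of fields for one --iiostat=show line, or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    # Normalize the optional libpci device name to an empty string.
    d["name"] = d["name"] or ""
    return d
```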

Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
---
 tools/perf/arch/x86/util/iiostat.c | 15 ++++++++--
 tools/perf/util/pci.c              | 46 ++++++++++++++++++++++++++++++
 tools/perf/util/pci.h              |  4 +++
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/tools/perf/arch/x86/util/iiostat.c b/tools/perf/arch/x86/util/iiostat.c
index 058a01d3a93f..7aad994e4936 100644
--- a/tools/perf/arch/x86/util/iiostat.c
+++ b/tools/perf/arch/x86/util/iiostat.c
@@ -113,6 +113,7 @@ struct dev_info {
 	u8 die;
 	u8 pmu_idx;
 	u8 root_port_nr;
+	char *name;
 };
 
 struct iio_device {
@@ -210,7 +211,12 @@ static struct iio_device *iio_device_new(struct dev_info *info)
 	if (p) {
 		INIT_LIST_HEAD(&(p->node));
 		p->dev_info = *info;
+		p->dev_info.name = strdup(pci_device_name(info->bdf));
 		p->idx = -1;
+		if (!p->dev_info.name) {
+			free(p);
+			p = NULL;
+		}
 	}
 	return p;
 }
@@ -219,6 +225,7 @@ static void iio_device_delete(struct iio_device *device)
 {
 	if (device) {
 		list_del_init(&(device->node));
+		free(device->dev_info.name);
 		free(device);
 	}
 }
@@ -227,11 +234,11 @@ static void iiostat_device_show(FILE *output,
 			const struct iio_device * const device)
 {
 	if (output && device)
-		fprintf(output, "S%d-RootPort%d-uncore_iio_%d<%02x:%02x.%x>\n",
+		fprintf(output, "S%d-RootPort%d-uncore_iio_%d<%02x:%02x.%x %s>\n",
 			device->dev_info.die,
 			device->dev_info.root_port_nr, device->dev_info.pmu_idx,
 			device->dev_info.bdf.busno, device->dev_info.bdf.devno,
-			device->dev_info.bdf.funcno);
+			device->dev_info.bdf.funcno, device->dev_info.name);
 }
 
 static struct iio_devs_list *iio_devs_list_new(void)
@@ -426,9 +433,12 @@ static int iio_devs_scan(struct iio_devs_list **list)
 	if (!rp_nr)
 		return -ENOMEM;
 
+	pci_library_init();
+
 	interim = iio_devs_list_new();
 	if (!interim) {
 		free(rp_nr);
+		pci_library_cleanup();
 		return -ENOMEM;
 	}
 
@@ -457,6 +467,7 @@ static int iio_devs_scan(struct iio_devs_list **list)
 		iio_devs_list_free(interim);
 
 	free(rp_nr);
+	pci_library_cleanup();
 
 	return ret;
 }
diff --git a/tools/perf/util/pci.c b/tools/perf/util/pci.c
index ba1a48e9d0cc..6ce05e6ba037 100644
--- a/tools/perf/util/pci.c
+++ b/tools/perf/util/pci.c
@@ -8,6 +8,9 @@
  *	    Alexander Antonov <alexander.antonov@intel.com>
  */
 #include "pci.h"
+#ifdef HAVE_LIBPCI_SUPPORT
+#include <pci/pci.h>
+#endif
 #include <api/fs/fs.h>
 #include <linux/kernel.h>
 #include <string.h>
@@ -16,6 +19,49 @@
 #define PCI_DEVICE_PATH_TEMPLATE "bus/pci/devices/0000:%02x:%02x.0"
 #define PCI_DEVICE_FILE_TEMPLATE PCI_DEVICE_PATH_TEMPLATE"/%s"
 
+#ifdef HAVE_LIBPCI_SUPPORT
+static struct pci_access *pacc;
+#endif
+
+void pci_library_init(void)
+{
+#ifdef HAVE_LIBPCI_SUPPORT
+	pacc = pci_alloc();
+	if (pacc) {
+		pci_init(pacc);
+		pci_scan_bus(pacc);
+	}
+#endif
+}
+
+void pci_library_cleanup(void)
+{
+#ifdef HAVE_LIBPCI_SUPPORT
+	pci_cleanup(pacc);
+#endif
+}
+
+char *pci_device_name(struct bdf bdf __maybe_unused)
+{
+#ifdef HAVE_LIBPCI_SUPPORT
+	struct pci_dev *device;
+	char namebuf[PATH_MAX];
+
+	if (pacc) {
+		device = pci_get_dev(pacc, 0, bdf.busno, bdf.devno, bdf.funcno);
+		if (device) {
+			pci_fill_info(device, PCI_FILL_IDENT);
+			return pci_lookup_name(pacc, namebuf, sizeof(namebuf),
+					       PCI_LOOKUP_DEVICE, device->vendor_id,
+					       device->device_id);
+		}
+	}
+	return (char *)"";
+#else
+	return (char *)"";
+#endif
+}
+
 static bool directory_exists(const char * const path)
 {
 	return (access(path, F_OK) == 0);
diff --git a/tools/perf/util/pci.h b/tools/perf/util/pci.h
index e963b12e10e7..8d8551360419 100644
--- a/tools/perf/util/pci.h
+++ b/tools/perf/util/pci.h
@@ -17,6 +17,10 @@ struct bdf {
 	u8 funcno;
 };
 
+void pci_library_init(void);
+void pci_library_cleanup(void);
+
+char *pci_device_name(struct bdf bdf);
 bool pci_device_probe(struct bdf bdf);
 bool is_pci_device_root_port(struct bdf bdf, u8 *secondary, u8 *subordinate);
 
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-11-26 16:36 ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping roman.sudarikov
  2019-11-26 16:36   ` [PATCH 2/6] perf tools: Helper functions to enumerate and probe PCI devices roman.sudarikov
@ 2019-12-02 14:00   ` Peter Zijlstra
  2019-12-02 19:47   ` Stephane Eranian
  2 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2019-12-02 14:00 UTC (permalink / raw)
  To: roman.sudarikov
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	linux-kernel, eranian, bgregg, ak, kan.liang, alexander.antonov

On Tue, Nov 26, 2019 at 07:36:25PM +0300, roman.sudarikov@linux.intel.com wrote:
> From: Roman Sudarikov <roman.sudarikov@linux.intel.com>
> 
> Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
> changes in the integrated I/O (IIO) architecture. The new solution introduces
> IIO stacks which are responsible for managing traffic between the PCIe domain
> and the Mesh domain. Each IIO stack has its own PMON block and can handle either
> DMI port, x16 PCIe root port, MCP-Link or various built-in accelerators.
> IIO PMON blocks allow concurrent monitoring of I/O flows up to 4 x4 bifurcation
> within each IIO stack.
> 
> Software is supposed to program required perf counters within each IIO stack
> and gather performance data. The tricky thing here is that IIO PMON reports data
> per IIO stack but users have no idea what IIO stacks are - they only know devices
> which are connected to the platform.
> 
> Understanding IIO stack concept to find which IIO stack that particular IO device
> is connected to, or to identify an IIO PMON block to program for monitoring
> specific IIO stack assumes a lot of implicit knowledge about given Intel server
> platform architecture.
> 
> This patch set introduces:
>     An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs-backend
>     A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device
> 
> Current version supports a server line starting Intel® Xeon® Processor Scalable
> Family and introduces mapping for IIO Uncore units only.
> Other units can be added on demand.
> 
> Usage example:
>     /sys/devices/uncore_<type>_<pmu_idx>/platform_mapping
> 
> Each Uncore unit type, by its nature, can be mapped to its own context, for example:
>     CHA - each uncore_cha_<pmu_idx> is assigned to manage a distinct slice of LLC capacity
>     UPI - each uncore_upi_<pmu_idx> is assigned to manage one link of Intel UPI Subsystem
>     IIO - each uncore_iio_<pmu_idx> is assigned to manage one stack of the IIO module
>     IMC - each uncore_imc_<pmu_idx> is assigned to manage one channel of Memory Controller
> 
> Implementation details:
>     Two callbacks added to struct intel_uncore_type to discover and map Uncore units to PMONs:
>         int (*get_topology)(struct intel_uncore_type *type)
>         int (*set_mapping)(struct intel_uncore_type *type)
> 
>     IIO stack to PMON mapping is exposed through
>         /sys/devices/uncore_iio_<pmu_idx>/platform_mapping
>         in the following format: domain:bus
> 
> Details of IIO Uncore unit mapping to IIO PMON:
> Each IIO stack is either a DMI port, x16 PCIe root port, MCP-Link or various
> built-in accelerators. For Uncore IIO Unit type, the platform_mapping file
> holds bus numbers of devices, which can be monitored by that IIO PMON block
> on each die.
> 
> For example, on a 4-die Intel Xeon® server platform:
>     $ cat /sys/devices/uncore_iio_0/platform_mapping
>     0000:00,0000:40,0000:80,0000:c0
> 
> Which means:
> IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
> IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
> IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
> IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000
> 
> Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
> Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
> Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>

Kan, can you help these people? There's a ton of process fail with this
submission. From SoB chain to CodingStyle to git-send-email threading.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-11-26 16:36 ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping roman.sudarikov
  2019-11-26 16:36   ` [PATCH 2/6] perf tools: Helper functions to enumerate and probe PCI devices roman.sudarikov
  2019-12-02 14:00   ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping Peter Zijlstra
@ 2019-12-02 19:47   ` Stephane Eranian
  2019-12-03  3:00     ` Andi Kleen
  2019-12-04 18:48     ` Sudarikov, Roman
  2 siblings, 2 replies; 14+ messages in thread
From: Stephane Eranian @ 2019-12-02 19:47 UTC (permalink / raw)
  To: roman.sudarikov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, LKML,
	bgregg, Andi Kleen, Liang, Kan, alexander.antonov

On Tue, Nov 26, 2019 at 8:36 AM <roman.sudarikov@linux.intel.com> wrote:
>
> From: Roman Sudarikov <roman.sudarikov@linux.intel.com>
>
> Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
> changes in the integrated I/O (IIO) architecture. The new solution introduces
> IIO stacks which are responsible for managing traffic between the PCIe domain
> and the Mesh domain. Each IIO stack has its own PMON block and can handle either
> DMI port, x16 PCIe root port, MCP-Link or various built-in accelerators.
> IIO PMON blocks allow concurrent monitoring of I/O flows up to 4 x4 bifurcation
> within each IIO stack.
>
> Software is supposed to program required perf counters within each IIO stack
> and gather performance data. The tricky thing here is that IIO PMON reports data
> per IIO stack but users have no idea what IIO stacks are - they only know devices
> which are connected to the platform.
>
> Understanding IIO stack concept to find which IIO stack that particular IO device
> is connected to, or to identify an IIO PMON block to program for monitoring
> specific IIO stack assumes a lot of implicit knowledge about given Intel server
> platform architecture.
>
> This patch set introduces:
>     An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs-backend
>     A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device
>
> Current version supports a server line starting Intel® Xeon® Processor Scalable
> Family and introduces mapping for IIO Uncore units only.
> Other units can be added on demand.
>
> Usage example:
>     /sys/devices/uncore_<type>_<pmu_idx>/platform_mapping
>
> Each Uncore unit type, by its nature, can be mapped to its own context, for example:
>     CHA - each uncore_cha_<pmu_idx> is assigned to manage a distinct slice of LLC capacity
>     UPI - each uncore_upi_<pmu_idx> is assigned to manage one link of Intel UPI Subsystem
>     IIO - each uncore_iio_<pmu_idx> is assigned to manage one stack of the IIO module
>     IMC - each uncore_imc_<pmu_idx> is assigned to manage one channel of Memory Controller
>
> Implementation details:
>     Two callbacks added to struct intel_uncore_type to discover and map Uncore units to PMONs:
>         int (*get_topology)(struct intel_uncore_type *type)
>         int (*set_mapping)(struct intel_uncore_type *type)
>
>     IIO stack to PMON mapping is exposed through
>         /sys/devices/uncore_iio_<pmu_idx>/platform_mapping
>         in the following format: domain:bus
>
> Details of IIO Uncore unit mapping to IIO PMON:
> Each IIO stack is either a DMI port, x16 PCIe root port, MCP-Link or various
> built-in accelerators. For Uncore IIO Unit type, the platform_mapping file
> holds bus numbers of devices, which can be monitored by that IIO PMON block
> on each die.
>
> For example, on a 4-die Intel Xeon® server platform:
>     $ cat /sys/devices/uncore_iio_0/platform_mapping
>     0000:00,0000:40,0000:80,0000:c0
>
> Which means:
> IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
> IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
> IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
> IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000
>
You are just looking at one die (package). How does your enumeration
help figure out whether iio_0 is on socket0 or socket1, and then
figure out which bus/domain is on which socket?

And how does that help map actual devices (using the output of lspci)
to the IIO? You need to show how you would do that, which is really
what people want, with what you have in your patch right now.
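For what it's worth, with the sysfs layout proposed in the patch, userspace
could correlate a device's domain:bus (as seen in lspci) with an IIO PMON
block roughly as below. This is a hedged sketch, not part of the series; it
assumes the comma-separated platform_mapping entries are ordered by die
index, as the changelog describes:

```python
import glob
import os

def bus_to_iio_pmu(domain_bus,
                   sysfs_glob="/sys/devices/uncore_iio_*/platform_mapping"):
    """Return (pmu_name, die) for a 'dddd:bb' domain:bus string, or None.

    Scans every uncore_iio_<N>/platform_mapping file and matches the
    requested domain:bus against each die's entry.
    """
    for path in glob.glob(sysfs_glob):
        with open(path) as f:
            entries = f.read().strip().split(",")
        for die, entry in enumerate(entries):
            if entry == domain_bus:
                # Directory name, e.g. "uncore_iio_0", identifies the PMON.
                pmu = os.path.basename(os.path.dirname(path))
                return pmu, die
    return None
```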

>
> Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
> Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
> Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
> ---
>  arch/x86/events/intel/uncore.c       |  61 +++++++++++-
>  arch/x86/events/intel/uncore.h       |  13 ++-
>  arch/x86/events/intel/uncore_snbep.c | 144 +++++++++++++++++++++++++++
>  3 files changed, 214 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
> index 86467f85c383..0f779c8fcc05 100644
> --- a/arch/x86/events/intel/uncore.c
> +++ b/arch/x86/events/intel/uncore.c
> @@ -18,6 +18,11 @@ struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
>  struct pci_extra_dev *uncore_extra_pci_dev;
>  static int max_dies;
>
> +int get_max_dies(void)
> +{
> +       return max_dies;
> +}
> +
>  /* mask of cpus that collect uncore events */
>  static cpumask_t uncore_cpu_mask;
>
> @@ -816,6 +821,16 @@ static ssize_t uncore_get_attr_cpumask(struct device *dev,
>
>  static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);
>
> +static ssize_t platform_mapping_show(struct device *dev,
> +                               struct device_attribute *attr, char *buf)
> +{
> +       struct intel_uncore_pmu *pmu = dev_get_drvdata(dev);
> +
> +       return snprintf(buf, PAGE_SIZE - 1, "%s\n", pmu->platform_mapping ?
> +                      (char *)pmu->platform_mapping : "0");
> +}
> +static DEVICE_ATTR_RO(platform_mapping);
> +
>  static struct attribute *uncore_pmu_attrs[] = {
>         &dev_attr_cpumask.attr,
>         NULL,
> @@ -825,6 +840,15 @@ static const struct attribute_group uncore_pmu_attr_group = {
>         .attrs = uncore_pmu_attrs,
>  };
>
> +static struct attribute *platform_attrs[] = {
> +       &dev_attr_platform_mapping.attr,
> +       NULL,
> +};
> +
> +static const struct attribute_group uncore_platform_discovery_group = {
> +       .attrs = platform_attrs,
> +};
> +
>  static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>  {
>         int ret;
> @@ -905,11 +929,27 @@ static void uncore_types_exit(struct intel_uncore_type **types)
>                 uncore_type_exit(*types);
>  }
>
> +static void uncore_type_attrs_compaction(struct intel_uncore_type *type)
> +{
> +       int i, j;
> +
> +       for (i = 0, j = 0; i < UNCORE_MAX_NUM_ATTR_GROUP; i++) {
> +               if (!type->attr_groups[i])
> +                       continue;
> +               if (i > j) {
> +                       type->attr_groups[j] = type->attr_groups[i];
> +                       type->attr_groups[i] = NULL;
> +               }
> +               j++;
> +       }
> +}
> +
>  static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>  {
>         struct intel_uncore_pmu *pmus;
>         size_t size;
>         int i, j;
> +       int ret;
>
>         pmus = kcalloc(type->num_boxes, sizeof(*pmus), GFP_KERNEL);
>         if (!pmus)
> @@ -922,8 +962,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>                 pmus[i].pmu_idx = i;
>                 pmus[i].type    = type;
>                 pmus[i].boxes   = kzalloc(size, GFP_KERNEL);
> -               if (!pmus[i].boxes)
> +               if (!pmus[i].boxes) {
> +                       ret = -ENOMEM;
>                         goto err;
> +               }
>         }
>
>         type->pmus = pmus;
> @@ -940,8 +982,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>
>                 attr_group = kzalloc(struct_size(attr_group, attrs, i + 1),
>                                                                 GFP_KERNEL);
> -               if (!attr_group)
> +               if (!attr_group) {
> +                       ret = -ENOMEM;
>                         goto err;
> +               }
>
>                 attr_group->group.name = "events";
>                 attr_group->group.attrs = attr_group->attrs;
> @@ -954,6 +998,17 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>
>         type->pmu_group = &uncore_pmu_attr_group;
>
> +       /*
> +        * Exposing mapping of Uncore units to corresponding Uncore PMUs
> +        * through /sys/devices/uncore_<type>_<idx>/platform_mapping
> +        */
> +       if (type->get_topology && type->set_mapping)
> +               if (!type->get_topology(type) && !type->set_mapping(type))
> +                       type->platform_discovery = &uncore_platform_discovery_group;
> +
> +       /* For optional attributes, we can safely remove embedded NULL attr_groups elements */
> +       uncore_type_attrs_compaction(type);
> +
>         return 0;
>
>  err:
> @@ -961,7 +1016,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>                 kfree(pmus[i].boxes);
>         kfree(pmus);
>
> -       return -ENOMEM;
> +       return ret;
>  }
>
>  static int __init
> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
> index bbfdaa720b45..ce3727b9f7f8 100644
> --- a/arch/x86/events/intel/uncore.h
> +++ b/arch/x86/events/intel/uncore.h
> @@ -43,6 +43,8 @@ struct intel_uncore_box;
>  struct uncore_event_desc;
>  struct freerunning_counters;
>
> +#define UNCORE_MAX_NUM_ATTR_GROUP 5
> +
>  struct intel_uncore_type {
>         const char *name;
>         int num_counters;
> @@ -71,13 +73,19 @@ struct intel_uncore_type {
>         struct intel_uncore_ops *ops;
>         struct uncore_event_desc *event_descs;
>         struct freerunning_counters *freerunning;
> -       const struct attribute_group *attr_groups[4];
> +       const struct attribute_group *attr_groups[UNCORE_MAX_NUM_ATTR_GROUP];
>         struct pmu *pmu; /* for custom pmu ops */
> +       void *platform_topology;
> +       /* finding Uncore units */
> +       int (*get_topology)(struct intel_uncore_type *type);
> +       /* mapping Uncore units to PMON ranges */
> +       int (*set_mapping)(struct intel_uncore_type *type);
>  };
>
>  #define pmu_group attr_groups[0]
>  #define format_group attr_groups[1]
>  #define events_group attr_groups[2]
> +#define platform_discovery attr_groups[3]
>
>  struct intel_uncore_ops {
>         void (*init_box)(struct intel_uncore_box *);
> @@ -99,6 +107,7 @@ struct intel_uncore_pmu {
>         int                             pmu_idx;
>         int                             func_id;
>         bool                            registered;
> +       void                            *platform_mapping;
>         atomic_t                        activeboxes;
>         struct intel_uncore_type        *type;
>         struct intel_uncore_box         **boxes;
> @@ -490,6 +499,8 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
>         return event->pmu_private;
>  }
>
> +int get_max_dies(void);
> +
>  struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu);
>  u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
>  void uncore_mmio_exit_box(struct intel_uncore_box *box);
> diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
> index b10a5ec79e48..92ce9fbafde1 100644
> --- a/arch/x86/events/intel/uncore_snbep.c
> +++ b/arch/x86/events/intel/uncore_snbep.c
> @@ -273,6 +273,28 @@
>  #define SKX_CPUNODEID                  0xc0
>  #define SKX_GIDNIDMAP                  0xd4
>
> +/*
> + * The CPU_BUS_NUMBER MSR returns the values of the respective CPUBUSNO CSR
> + * that BIOS programmed. MSR has package scope.
> + * |  Bit  |  Default  |  Description
> + * | [63]  |    00h    | VALID - When set, indicates the CPU bus
> + *                       numbers have been initialized. (RO)
> + * |[62:48]|    ---    | Reserved
> + * |[47:40]|    00h    | BUS_NUM_5 — Return the bus number BIOS assigned
> + *                       CPUBUSNO(5). (RO)
> + * |[39:32]|    00h    | BUS_NUM_4 — Return the bus number BIOS assigned
> + *                       CPUBUSNO(4). (RO)
> + * |[31:24]|    00h    | BUS_NUM_3 — Return the bus number BIOS assigned
> + *                       CPUBUSNO(3). (RO)
> + * |[23:16]|    00h    | BUS_NUM_2 — Return the bus number BIOS assigned
> + *                       CPUBUSNO(2). (RO)
> + * |[15:8] |    00h    | BUS_NUM_1 — Return the bus number BIOS assigned
> + *                       CPUBUSNO(1). (RO)
> + * | [7:0] |    00h    | BUS_NUM_0 — Return the bus number BIOS assigned
> + *                       CPUBUSNO(0). (RO)
> + */
> +#define SKX_MSR_CPU_BUS_NUMBER         0x300
> +
>  /* SKX CHA */
>  #define SKX_CHA_MSR_PMON_BOX_FILTER_TID                (0x1ffULL << 0)
>  #define SKX_CHA_MSR_PMON_BOX_FILTER_LINK       (0xfULL << 9)
> @@ -3580,6 +3602,9 @@ static struct intel_uncore_ops skx_uncore_iio_ops = {
>         .read_counter           = uncore_msr_read_counter,
>  };
>
> +static int skx_iio_get_topology(struct intel_uncore_type *type);
> +static int skx_iio_set_mapping(struct intel_uncore_type *type);
> +
>  static struct intel_uncore_type skx_uncore_iio = {
>         .name                   = "iio",
>         .num_counters           = 4,
> @@ -3594,6 +3619,8 @@ static struct intel_uncore_type skx_uncore_iio = {
>         .constraints            = skx_uncore_iio_constraints,
>         .ops                    = &skx_uncore_iio_ops,
>         .format_group           = &skx_uncore_iio_format_group,
> +       .get_topology           = skx_iio_get_topology,
> +       .set_mapping            = skx_iio_set_mapping,
>  };
>
>  enum perf_uncore_iio_freerunning_type_id {
> @@ -3780,6 +3807,123 @@ static int skx_count_chabox(void)
>         return hweight32(val);
>  }
>
> +static inline u8 skx_iio_topology_byte(void *platform_topology,
> +                                       int die, int idx)
> +{
> +       return *((u8 *)(platform_topology) + die * sizeof(u64) + idx);
> +}
> +
> +static inline bool skx_iio_topology_valid(u64 msr_value)
> +{
> +       return msr_value & ((u64)1 << 63);
> +}
> +
> +static int skx_msr_cpu_bus_read(int cpu, int die)
> +{
> +       int ret = rdmsrl_on_cpu(cpu, SKX_MSR_CPU_BUS_NUMBER,
> +                               ((u64 *)skx_uncore_iio.platform_topology) + die);
> +
> +       if (!ret) {
> +               if (!skx_iio_topology_valid(*(((u64 *)skx_uncore_iio.platform_topology) + die)))
> +                       ret = -1;
> +       }
> +       return ret;
> +}
> +
> +static int skx_iio_get_topology(struct intel_uncore_type *type)
> +{
> +       int ret, cpu, die, current_die;
> +       struct pci_bus *bus = NULL;
> +
> +       while ((bus = pci_find_next_bus(bus)) != NULL)
> +               if (pci_domain_nr(bus)) {
> +                       pr_info("Mapping of I/O stack to PMON ranges is not supported for multi-segment topology\n");
> +                       return -1;
> +               }
> +
> +       /* Size of SKX_MSR_CPU_BUS_NUMBER is 8 bytes, the MSR has package scope.*/
> +       type->platform_topology =
> +               kzalloc(get_max_dies() * sizeof(u64), GFP_KERNEL);
> +       if (!type->platform_topology)
> +               return -ENOMEM;
> +
> +       /*
> +        * Using cpus_read_lock() to ensure cpu is not going down between
> +        * looking at cpu_online_mask.
> +        */
> +       cpus_read_lock();
> +       /* Invalid value to start loop.*/
> +       current_die = -1;
> +       for_each_online_cpu(cpu) {
> +               die = topology_logical_die_id(cpu);
> +               if (current_die == die)
> +                       continue;
> +               ret = skx_msr_cpu_bus_read(cpu, die);
> +               if (ret)
> +                       break;
> +               current_die = die;
> +       }
> +       cpus_read_unlock();
> +
> +       if (ret)
> +               kfree(type->platform_topology);
> +       return ret;
> +}
> +
> +static int skx_iio_set_mapping(struct intel_uncore_type *type)
> +{
> +       /*
> +        * Each IIO stack (PCIe root port) has its own IIO PMON block, so each
> +        * platform_mapping holds bus number(s) of PCIe root port(s), which can
> +        * be monitored by that IIO PMON block.
> +        *
> +        * For example, on a 4-die Xeon platform with up to 6 IIO stacks per die
> +        * and, therefore, 6 IIO PMON blocks per die, the platform_mapping of IIO
> +        * PMON block 0 holds "0000:00,0000:40,0000:80,0000:c0":
> +        *
> +        * $ cat /sys/devices/uncore_iio_0/platform_mapping
> +        * 0000:00,0000:40,0000:80,0000:c0
> +        *
> +        * Which means:
> +        * IIO PMON block 0 on the die 0 belongs to PCIe root port located on bus 0x00, domain 0x0000
> +        * IIO PMON block 0 on the die 1 belongs to PCIe root port located on bus 0x40, domain 0x0000
> +        * IIO PMON block 0 on the die 2 belongs to PCIe root port located on bus 0x80, domain 0x0000
> +        * IIO PMON block 0 on the die 3 belongs to PCIe root port located on bus 0xc0, domain 0x0000
> +        */
> +
> +       int ret = 0;
> +       int die, i;
> +       char *buf;
> +       struct intel_uncore_pmu *pmu;
> +       const int template_len = 8;
> +
> +       for (i = 0; i < type->num_boxes; i++) {
> +               pmu = type->pmus + i;
> +               /* Root bus 0x00 is valid only for die 0 AND pmu_idx = 0. */
> +               if (skx_iio_topology_byte(type->platform_topology, 0, pmu->pmu_idx) || (!pmu->pmu_idx)) {
> +                       pmu->platform_mapping =
> +                               kzalloc(get_max_dies() * template_len + 1, GFP_KERNEL);
> +                       if (pmu->platform_mapping) {
> +                               buf = (char *)pmu->platform_mapping;
> +                               for (die = 0; die < get_max_dies(); die++)
> +                                       buf += snprintf(buf, template_len + 1, "%04x:%02x,", 0,
> +                                               skx_iio_topology_byte(type->platform_topology,
> +                                                                     die, pmu->pmu_idx));
> +
> +                               *(--buf) = '\0';
> +                       } else {
> +                               for (; i >= 0; i--)
> +                                       kfree((type->pmus + i)->platform_mapping);
> +                               ret = -ENOMEM;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       kfree(type->platform_topology);
> +       return ret;
> +}
> +
>  void skx_uncore_cpu_init(void)
>  {
>         skx_uncore_chabox.num_boxes = skx_count_chabox();
> --
> 2.19.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-12-02 19:47   ` Stephane Eranian
@ 2019-12-03  3:00     ` Andi Kleen
  2019-12-04 18:48     ` Sudarikov, Roman
  1 sibling, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2019-12-03  3:00 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: roman.sudarikov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, LKML, bgregg, Liang, Kan,
	alexander.antonov

> You are just looking at one die (package). How does your enumeration
> help figure out whether iio_0 is on socket0 or socket1, and then
> figure out which bus/domain is on which socket?
> 
> And how does that help map actual devices (using the output of lspci)
> to the IIO? You need to show how you would do that, which is really
> what people want, with what you have in your patch right now.

See the rest of the patch series. It implements all of this in
the perf tool.

-Andi


* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-12-02 19:47   ` Stephane Eranian
  2019-12-03  3:00     ` Andi Kleen
@ 2019-12-04 18:48     ` Sudarikov, Roman
  2019-12-05 18:02       ` Stephane Eranian
  1 sibling, 1 reply; 14+ messages in thread
From: Sudarikov, Roman @ 2019-12-04 18:48 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, LKML,
	bgregg, Andi Kleen, Liang, Kan, alexander.antonov

On 02.12.2019 22:47, Stephane Eranian wrote:
> On Tue, Nov 26, 2019 at 8:36 AM <roman.sudarikov@linux.intel.com> wrote:
>> From: Roman Sudarikov <roman.sudarikov@linux.intel.com>
>>
>> Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
>> changes in the integrated I/O (IIO) architecture. The new solution introduces
>> IIO stacks which are responsible for managing traffic between the PCIe domain
>> and the Mesh domain. Each IIO stack has its own PMON block and can handle either
>> DMI port, x16 PCIe root port, MCP-Link or various built-in accelerators.
>> IIO PMON blocks allow concurrent monitoring of I/O flows up to 4 x4 bifurcation
>> within each IIO stack.
>>
>> Software is supposed to program required perf counters within each IIO stack
>> and gather performance data. The tricky thing here is that IIO PMON reports data
>> per IIO stack but users have no idea what IIO stacks are - they only know devices
>> which are connected to the platform.
>>
>> Understanding IIO stack concept to find which IIO stack that particular IO device
>> is connected to, or to identify an IIO PMON block to program for monitoring
>> specific IIO stack assumes a lot of implicit knowledge about given Intel server
>> platform architecture.
>>
>> This patch set introduces:
>>      An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs-backend
>>      A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device
>>
>> Current version supports a server line starting Intel® Xeon® Processor Scalable
>> Family and introduces mapping for IIO Uncore units only.
>> Other units can be added on demand.
>>
>> Usage example:
>>      /sys/devices/uncore_<type>_<pmu_idx>/platform_mapping
>>
>> Each Uncore unit type, by its nature, can be mapped to its own context, for example:
>>      CHA - each uncore_cha_<pmu_idx> is assigned to manage a distinct slice of LLC capacity
>>      UPI - each uncore_upi_<pmu_idx> is assigned to manage one link of Intel UPI Subsystem
>>      IIO - each uncore_iio_<pmu_idx> is assigned to manage one stack of the IIO module
>>      IMC - each uncore_imc_<pmu_idx> is assigned to manage one channel of Memory Controller
>>
>> Implementation details:
>>      Two callbacks added to struct intel_uncore_type to discover and map Uncore units to PMONs:
>>          int (*get_topology)(struct intel_uncore_type *type)
>>          int (*set_mapping)(struct intel_uncore_type *type)
>>
>>      IIO stack to PMON mapping is exposed through
>>          /sys/devices/uncore_iio_<pmu_idx>/platform_mapping
>>          in the following format: domain:bus
>>
>> Details of IIO Uncore unit mapping to IIO PMON:
>> Each IIO stack is either a DMI port, x16 PCIe root port, MCP-Link or various
>> built-in accelerators. For Uncore IIO Unit type, the platform_mapping file
>> holds bus numbers of devices, which can be monitored by that IIO PMON block
>> on each die.
>>
>> For example, on a 4-die Intel Xeon® server platform:
>>      $ cat /sys/devices/uncore_iio_0/platform_mapping
>>      0000:00,0000:40,0000:80,0000:c0
>>
>> Which means:
>> IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
>> IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
>> IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
>> IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000
>>
> You are just looking at one die (package). How does your enumeration
> help figure out whether iio_0 is on socket0 or socket1, and then
> figure out which bus/domain is on which socket?
>
> And how does that help map actual devices (using the output of lspci)
> to the IIO? You need to show how you would do that, which is really
> what people want, with what you have in your patch right now.
No. I'm enumerating all IIO PMUs for all sockets on the platform.

Let's take a 4-socket SKX as an example - sysfs exposes 6 instances of
IIO PMU, and each socket has its own instance of each of them: socket 0
has its own IIO PMU0, socket 1 also has its own IIO PMU0, and so on.
The same applies to IIO PMUs 1 through 5.
Below is sample output:

$:/sys/devices# cat uncore_iio_0/platform_mapping
0000:00,0000:40,0000:80,0000:c0
$:/sys/devices# cat uncore_iio_1/platform_mapping
0000:16,0000:44,0000:84,0000:c4
$:/sys/devices# cat uncore_iio_2/platform_mapping
0000:24,0000:58,0000:98,0000:d8
$:/sys/devices# cat uncore_iio_3/platform_mapping
0000:32,0000:6c,0000:ac,0000:ec
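
The decoding rule for these files is positional: the N-th comma-separated
"domain:bus" entry in a platform_mapping file belongs to die/socket N. A
minimal sketch of a parser (the helper name is illustrative, not part of any
kernel or perf interface):

```python
def parse_platform_mapping(text):
    """Decode a platform_mapping string: entry N belongs to die/socket N."""
    mapping = {}
    for die, entry in enumerate(text.strip().split(",")):
        domain, bus = entry.split(":")
        mapping[die] = (int(domain, 16), int(bus, 16))
    return mapping

# Sample output of uncore_iio_1 from above: socket 1's IIO PMU1 covers bus 0x44.
print(parse_platform_mapping("0000:16,0000:44,0000:84,0000:c4"))
# -> {0: (0, 22), 1: (0, 68), 2: (0, 132), 3: (0, 196)}
```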

Technically, the idea is as follows: the kernel part of the feature locates
IIO stacks and creates the IIO PMON to IIO stack mapping. The userspace part
locates the IO devices connected to each IIO stack on each socket and
configures only the required IIO counters to provide four IO performance
metrics - Inbound Read, Inbound Write, Outbound Read, Outbound Write -
attributed to each device.


Follow-up patches show how users can benefit from the feature; see
https://lkml.org/lkml/2019/11/26/451

Below is sample output:

1. show mode

./perf stat --iiostat=show

     S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
     S1-RootPort0-uncore_iio_0<81:00.0 Ethernet Controller X710 for 10GbE SFP+>
     S0-RootPort1-uncore_iio_1<18:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
     S1-RootPort1-uncore_iio_1<86:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
     S1-RootPort1-uncore_iio_1<88:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
     S0-RootPort2-uncore_iio_2<3d:00.0 Ethernet Connection X722 for 10GBASE-T>
     S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
     S1-RootPort3-uncore_iio_3<da:00.0 NVMe Datacenter SSD [Optane]>

For example, the NIC at 81:00.0 is local to S1, is connected to its RootPort0, and is covered by IIO PMU0 (on socket 1)
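
That lookup can be sketched in a few lines: across all IIO PMUs, pick the
largest root-bus number that does not exceed the device's bus number. This is
a simplifying assumption for illustration - the actual perf tool walks the PCI
hierarchy - and the sample mappings below are the 4-socket values quoted
earlier in this message (the show output above is from a different, 2-socket
machine, so the socket index differs):

```python
def find_iio_pmu(dev_bdf, pmu_maps):
    """Map a device (e.g. "81:00.0") to (socket, IIO PMU) by root bus number."""
    dev_bus = int(dev_bdf.split(":")[0], 16)
    best = None
    for name, text in pmu_maps.items():
        for socket, entry in enumerate(text.split(",")):
            bus = int(entry.split(":")[1], 16)
            # Candidate root bus must not exceed the device bus; keep the closest one.
            if bus <= dev_bus and (best is None or bus > best[0]):
                best = (bus, socket, name)
    return (best[1], best[2]) if best else None

pmu_maps = {
    "uncore_iio_0": "0000:00,0000:40,0000:80,0000:c0",
    "uncore_iio_1": "0000:16,0000:44,0000:84,0000:c4",
    "uncore_iio_2": "0000:24,0000:58,0000:98,0000:d8",
    "uncore_iio_3": "0000:32,0000:6c,0000:ac,0000:ec",
}
# On this 4-socket layout, a device on bus 0x81 falls under socket 2's IIO PMU0.
print(find_iio_pmu("81:00.0", pmu_maps))  # -> (2, 'uncore_iio_0')
```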

2. collector mode

   ./perf stat --iiostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
     357708+0 records in
     357707+0 records out
     375083606016 bytes (375 GB, 349 GiB) copied, 215.381 s, 1.7 GB/s

   Performance counter stats for 'system wide':

      device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
     00:00.0                    0                    0                    0                    0
     81:00.0                    0                    0                    0                    0
     18:00.0                    0                    0                    0                    0
     86:00.0                    0                    0                    0                    0
     88:00.0                    0                    0                    0                    0
     3b:00.0                    3                    0                    0                    0
     3c:03.0                    3                    0                    0                    0
     3d:00.0                    3                    0                    0                    0
     af:00.0                    0                    0                    0                    0
     da:00.0               358559                   44                    0                   22

     215.383783574 seconds time elapsed

>
>> Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
>> Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
>> Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
>> ---
>>   arch/x86/events/intel/uncore.c       |  61 +++++++++++-
>>   arch/x86/events/intel/uncore.h       |  13 ++-
>>   arch/x86/events/intel/uncore_snbep.c | 144 +++++++++++++++++++++++++++
>>   3 files changed, 214 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
>> index 86467f85c383..0f779c8fcc05 100644
>> --- a/arch/x86/events/intel/uncore.c
>> +++ b/arch/x86/events/intel/uncore.c
>> @@ -18,6 +18,11 @@ struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
>>   struct pci_extra_dev *uncore_extra_pci_dev;
>>   static int max_dies;
>>
>> +int get_max_dies(void)
>> +{
>> +       return max_dies;
>> +}
>> +
>>   /* mask of cpus that collect uncore events */
>>   static cpumask_t uncore_cpu_mask;
>>
>> @@ -816,6 +821,16 @@ static ssize_t uncore_get_attr_cpumask(struct device *dev,
>>
>>   static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);
>>
>> +static ssize_t platform_mapping_show(struct device *dev,
>> +                               struct device_attribute *attr, char *buf)
>> +{
>> +       struct intel_uncore_pmu *pmu = dev_get_drvdata(dev);
>> +
>> +       return snprintf(buf, PAGE_SIZE - 1, "%s\n", pmu->platform_mapping ?
>> +                      (char *)pmu->platform_mapping : "0");
>> +}
>> +static DEVICE_ATTR_RO(platform_mapping);
>> +
>>   static struct attribute *uncore_pmu_attrs[] = {
>>          &dev_attr_cpumask.attr,
>>          NULL,
>> @@ -825,6 +840,15 @@ static const struct attribute_group uncore_pmu_attr_group = {
>>          .attrs = uncore_pmu_attrs,
>>   };
>>
>> +static struct attribute *platform_attrs[] = {
>> +       &dev_attr_platform_mapping.attr,
>> +       NULL,
>> +};
>> +
>> +static const struct attribute_group uncore_platform_discovery_group = {
>> +       .attrs = platform_attrs,
>> +};
>> +
>>   static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>>   {
>>          int ret;
>> @@ -905,11 +929,27 @@ static void uncore_types_exit(struct intel_uncore_type **types)
>>                  uncore_type_exit(*types);
>>   }
>>
>> +static void uncore_type_attrs_compaction(struct intel_uncore_type *type)
>> +{
>> +       int i, j;
>> +
>> +       for (i = 0, j = 0; i < UNCORE_MAX_NUM_ATTR_GROUP; i++) {
>> +               if (!type->attr_groups[i])
>> +                       continue;
>> +               if (i > j) {
>> +                       type->attr_groups[j] = type->attr_groups[i];
>> +                       type->attr_groups[i] = NULL;
>> +               }
>> +               j++;
>> +       }
>> +}
>> +
>>   static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>   {
>>          struct intel_uncore_pmu *pmus;
>>          size_t size;
>>          int i, j;
>> +       int ret;
>>
>>          pmus = kcalloc(type->num_boxes, sizeof(*pmus), GFP_KERNEL);
>>          if (!pmus)
>> @@ -922,8 +962,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>                  pmus[i].pmu_idx = i;
>>                  pmus[i].type    = type;
>>                  pmus[i].boxes   = kzalloc(size, GFP_KERNEL);
>> -               if (!pmus[i].boxes)
>> +               if (!pmus[i].boxes) {
>> +                       ret = -ENOMEM;
>>                          goto err;
>> +               }
>>          }
>>
>>          type->pmus = pmus;
>> @@ -940,8 +982,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>
>>                  attr_group = kzalloc(struct_size(attr_group, attrs, i + 1),
>>                                                                  GFP_KERNEL);
>> -               if (!attr_group)
>> +               if (!attr_group) {
>> +                       ret = -ENOMEM;
>>                          goto err;
>> +               }
>>
>>                  attr_group->group.name = "events";
>>                  attr_group->group.attrs = attr_group->attrs;
>> @@ -954,6 +998,17 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>
>>          type->pmu_group = &uncore_pmu_attr_group;
>>
>> +       /*
>> +        * Exposing mapping of Uncore units to corresponding Uncore PMUs
>> +        * through /sys/devices/uncore_<type>_<idx>/platform_mapping
>> +        */
>> +       if (type->get_topology && type->set_mapping)
>> +               if (!type->get_topology(type) && !type->set_mapping(type))
>> +                       type->platform_discovery = &uncore_platform_discovery_group;
>> +
>> +       /* For optional attributes, we can safely remove embedded NULL attr_groups elements */
>> +       uncore_type_attrs_compaction(type);
>> +
>>          return 0;
>>
>>   err:
>> @@ -961,7 +1016,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>                  kfree(pmus[i].boxes);
>>          kfree(pmus);
>>
>> -       return -ENOMEM;
>> +       return ret;
>>   }
>>
>>   static int __init
>> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
>> index bbfdaa720b45..ce3727b9f7f8 100644
>> --- a/arch/x86/events/intel/uncore.h
>> +++ b/arch/x86/events/intel/uncore.h
>> @@ -43,6 +43,8 @@ struct intel_uncore_box;
>>   struct uncore_event_desc;
>>   struct freerunning_counters;
>>
>> +#define UNCORE_MAX_NUM_ATTR_GROUP 5
>> +
>>   struct intel_uncore_type {
>>          const char *name;
>>          int num_counters;
>> @@ -71,13 +73,19 @@ struct intel_uncore_type {
>>          struct intel_uncore_ops *ops;
>>          struct uncore_event_desc *event_descs;
>>          struct freerunning_counters *freerunning;
>> -       const struct attribute_group *attr_groups[4];
>> +       const struct attribute_group *attr_groups[UNCORE_MAX_NUM_ATTR_GROUP];
>>          struct pmu *pmu; /* for custom pmu ops */
>> +       void *platform_topology;
>> +       /* finding Uncore units */
>> +       int (*get_topology)(struct intel_uncore_type *type);
>> +       /* mapping Uncore units to PMON ranges */
>> +       int (*set_mapping)(struct intel_uncore_type *type);
>>   };
>>
>>   #define pmu_group attr_groups[0]
>>   #define format_group attr_groups[1]
>>   #define events_group attr_groups[2]
>> +#define platform_discovery attr_groups[3]
>>
>>   struct intel_uncore_ops {
>>          void (*init_box)(struct intel_uncore_box *);
>> @@ -99,6 +107,7 @@ struct intel_uncore_pmu {
>>          int                             pmu_idx;
>>          int                             func_id;
>>          bool                            registered;
>> +       void                            *platform_mapping;
>>          atomic_t                        activeboxes;
>>          struct intel_uncore_type        *type;
>>          struct intel_uncore_box         **boxes;
>> @@ -490,6 +499,8 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
>>          return event->pmu_private;
>>   }
>>
>> +int get_max_dies(void);
>> +
>>   struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu);
>>   u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
>>   void uncore_mmio_exit_box(struct intel_uncore_box *box);
>> diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
>> index b10a5ec79e48..92ce9fbafde1 100644
>> --- a/arch/x86/events/intel/uncore_snbep.c
>> +++ b/arch/x86/events/intel/uncore_snbep.c
>> @@ -273,6 +273,28 @@
>>   #define SKX_CPUNODEID                  0xc0
>>   #define SKX_GIDNIDMAP                  0xd4
>>
>> +/*
>> + * The CPU_BUS_NUMBER MSR returns the values of the respective CPUBUSNO CSR
>> + * that BIOS programmed. MSR has package scope.
>> + * |  Bit  |  Default  |  Description
>> + * | [63]  |    00h    | VALID - When set, indicates the CPU bus
>> + *                       numbers have been initialized. (RO)
>> + * |[62:48]|    ---    | Reserved
>> + * |[47:40]|    00h    | BUS_NUM_5 — Return the bus number BIOS assigned
>> + *                       CPUBUSNO(5). (RO)
>> + * |[39:32]|    00h    | BUS_NUM_4 — Return the bus number BIOS assigned
>> + *                       CPUBUSNO(4). (RO)
>> + * |[31:24]|    00h    | BUS_NUM_3 — Return the bus number BIOS assigned
>> + *                       CPUBUSNO(3). (RO)
>> + * |[23:16]|    00h    | BUS_NUM_2 — Return the bus number BIOS assigned
>> + *                       CPUBUSNO(2). (RO)
>> + * |[15:8] |    00h    | BUS_NUM_1 — Return the bus number BIOS assigned
>> + *                       CPUBUSNO(1). (RO)
>> + * | [7:0] |    00h    | BUS_NUM_0 — Return the bus number BIOS assigned
>> + *                       CPUBUSNO(0). (RO)
>> + */
>> +#define SKX_MSR_CPU_BUS_NUMBER         0x300
>> +
>>   /* SKX CHA */
>>   #define SKX_CHA_MSR_PMON_BOX_FILTER_TID                (0x1ffULL << 0)
>>   #define SKX_CHA_MSR_PMON_BOX_FILTER_LINK       (0xfULL << 9)
>> @@ -3580,6 +3602,9 @@ static struct intel_uncore_ops skx_uncore_iio_ops = {
>>          .read_counter           = uncore_msr_read_counter,
>>   };
>>
>> +static int skx_iio_get_topology(struct intel_uncore_type *type);
>> +static int skx_iio_set_mapping(struct intel_uncore_type *type);
>> +
>>   static struct intel_uncore_type skx_uncore_iio = {
>>          .name                   = "iio",
>>          .num_counters           = 4,
>> @@ -3594,6 +3619,8 @@ static struct intel_uncore_type skx_uncore_iio = {
>>          .constraints            = skx_uncore_iio_constraints,
>>          .ops                    = &skx_uncore_iio_ops,
>>          .format_group           = &skx_uncore_iio_format_group,
>> +       .get_topology           = skx_iio_get_topology,
>> +       .set_mapping            = skx_iio_set_mapping,
>>   };
>>
>>   enum perf_uncore_iio_freerunning_type_id {
>> @@ -3780,6 +3807,123 @@ static int skx_count_chabox(void)
>>          return hweight32(val);
>>   }
>>
>> +static inline u8 skx_iio_topology_byte(void *platform_topology,
>> +                                       int die, int idx)
>> +{
>> +       return *((u8 *)(platform_topology) + die * sizeof(u64) + idx);
>> +}
>> +
>> +static inline bool skx_iio_topology_valid(u64 msr_value)
>> +{
>> +       return msr_value & ((u64)1 << 63);
>> +}
>> +
>> +static int skx_msr_cpu_bus_read(int cpu, int die)
>> +{
>> +       int ret = rdmsrl_on_cpu(cpu, SKX_MSR_CPU_BUS_NUMBER,
>> +                               ((u64 *)skx_uncore_iio.platform_topology) + die);
>> +
>> +       if (!ret) {
>> +               if (!skx_iio_topology_valid(*(((u64 *)skx_uncore_iio.platform_topology) + die)))
>> +                       ret = -1;
>> +       }
>> +       return ret;
>> +}
>> +
>> +static int skx_iio_get_topology(struct intel_uncore_type *type)
>> +{
>> +       int ret, cpu, die, current_die;
>> +       struct pci_bus *bus = NULL;
>> +
>> +       while ((bus = pci_find_next_bus(bus)) != NULL)
>> +               if (pci_domain_nr(bus)) {
>> +                       pr_info("Mapping of I/O stack to PMON ranges is not supported for multi-segment topology\n");
>> +                       return -1;
>> +               }
>> +
>> +       /* Size of SKX_MSR_CPU_BUS_NUMBER is 8 bytes, the MSR has package scope.*/
>> +       type->platform_topology =
>> +               kzalloc(get_max_dies() * sizeof(u64), GFP_KERNEL);
>> +       if (!type->platform_topology)
>> +               return -ENOMEM;
>> +
>> +       /*
>> +        * Using cpus_read_lock() to ensure cpu is not going down between
>> +        * looking at cpu_online_mask.
>> +        */
>> +       cpus_read_lock();
>> +       /* Invalid value to start loop.*/
>> +       current_die = -1;
>> +       for_each_online_cpu(cpu) {
>> +               die = topology_logical_die_id(cpu);
>> +               if (current_die == die)
>> +                       continue;
>> +               ret = skx_msr_cpu_bus_read(cpu, die);
>> +               if (ret)
>> +                       break;
>> +               current_die = die;
>> +       }
>> +       cpus_read_unlock();
>> +
>> +       if (ret)
>> +               kfree(type->platform_topology);
>> +       return ret;
>> +}
>> +
>> +static int skx_iio_set_mapping(struct intel_uncore_type *type)
>> +{
>> +       /*
>> +        * Each IIO stack (PCIe root port) has its own IIO PMON block, so each
>> +        * platform_mapping holds bus number(s) of PCIe root port(s), which can
>> +        * be monitored by that IIO PMON block.
>> +        *
>> +        * For example, on a 4-die Xeon platform with up to 6 IIO stacks per die
>> +        * and, therefore, 6 IIO PMON blocks per die, the platform_mapping of IIO
>> +        * PMON block 0 holds "0000:00,0000:40,0000:80,0000:c0":
>> +        *
>> +        * $ cat /sys/devices/uncore_iio_0/platform_mapping
>> +        * 0000:00,0000:40,0000:80,0000:c0
>> +        *
>> +        * Which means:
>> +        * IIO PMON block 0 on the die 0 belongs to PCIe root port located on bus 0x00, domain 0x0000
>> +        * IIO PMON block 0 on the die 1 belongs to PCIe root port located on bus 0x40, domain 0x0000
>> +        * IIO PMON block 0 on the die 2 belongs to PCIe root port located on bus 0x80, domain 0x0000
>> +        * IIO PMON block 0 on the die 3 belongs to PCIe root port located on bus 0xc0, domain 0x0000
>> +        */
>> +
>> +       int ret = 0;
>> +       int die, i;
>> +       char *buf;
>> +       struct intel_uncore_pmu *pmu;
>> +       const int template_len = 8;
>> +
>> +       for (i = 0; i < type->num_boxes; i++) {
>> +               pmu = type->pmus + i;
>> +               /* Root bus 0x00 is valid only for die 0 AND pmu_idx = 0. */
>> +               if (skx_iio_topology_byte(type->platform_topology, 0, pmu->pmu_idx) || (!pmu->pmu_idx)) {
>> +                       pmu->platform_mapping =
>> +                               kzalloc(get_max_dies() * template_len + 1, GFP_KERNEL);
>> +                       if (pmu->platform_mapping) {
>> +                               buf = (char *)pmu->platform_mapping;
>> +                               for (die = 0; die < get_max_dies(); die++)
>> +                                       buf += snprintf(buf, template_len + 1, "%04x:%02x,", 0,
>> +                                               skx_iio_topology_byte(type->platform_topology,
>> +                                                                     die, pmu->pmu_idx));
>> +
>> +                               *(--buf) = '\0';
>> +                       } else {
>> +                               for (; i >= 0; i--)
>> +                                       kfree((type->pmus + i)->platform_mapping);
>> +                               ret = -ENOMEM;
>> +                               break;
>> +                       }
>> +               }
>> +       }
>> +
>> +       kfree(type->platform_topology);
>> +       return ret;
>> +}
>> +
>>   void skx_uncore_cpu_init(void)
>>   {
>>          skx_uncore_chabox.num_boxes = skx_count_chabox();
>> --
>> 2.19.1
>>



* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-12-04 18:48     ` Sudarikov, Roman
@ 2019-12-05 18:02       ` Stephane Eranian
  2019-12-05 22:28         ` Andi Kleen
  2019-12-06 16:08         ` Sudarikov, Roman
  0 siblings, 2 replies; 14+ messages in thread
From: Stephane Eranian @ 2019-12-05 18:02 UTC (permalink / raw)
  To: Sudarikov, Roman
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, LKML,
	Brendan Gregg, Andi Kleen, Liang, Kan, alexander.antonov

On Wed, Dec 4, 2019 at 10:48 AM Sudarikov, Roman
<roman.sudarikov@linux.intel.com> wrote:
>
> On 02.12.2019 22:47, Stephane Eranian wrote:
> > On Tue, Nov 26, 2019 at 8:36 AM <roman.sudarikov@linux.intel.com> wrote:
> >> From: Roman Sudarikov <roman.sudarikov@linux.intel.com>
> >>
> >> Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
> >> changes in the integrated I/O (IIO) architecture. The new solution introduces
> >> IIO stacks which are responsible for managing traffic between the PCIe domain
> >> and the Mesh domain. Each IIO stack has its own PMON block and can handle either
> >> DMI port, x16 PCIe root port, MCP-Link or various built-in accelerators.
> >> IIO PMON blocks allow concurrent monitoring of I/O flows up to 4 x4 bifurcation
> >> within each IIO stack.
> >>
> >> Software is supposed to program required perf counters within each IIO stack
> >> and gather performance data. The tricky thing here is that IIO PMON reports data
> >> per IIO stack but users have no idea what IIO stacks are - they only know devices
> >> which are connected to the platform.
> >>
> >> Understanding the IIO stack concept to find which IIO stack a particular IO device
> >> is connected to, or to identify the IIO PMON block to program for monitoring a
> >> specific IIO stack, assumes a lot of implicit knowledge about the given Intel server
> >> platform architecture.
> >>
> >> This patch set introduces:
> >>      An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs-backend
> >>      A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device
> >>
> >> Current version supports a server line starting Intel® Xeon® Processor Scalable
> >> Family and introduces mapping for IIO Uncore units only.
> >> Other units can be added on demand.
> >>
> >> Usage example:
> >>      /sys/devices/uncore_<type>_<pmu_idx>/platform_mapping
> >>
> >> Each Uncore unit type, by its nature, can be mapped to its own context, for example:
> >>      CHA - each uncore_cha_<pmu_idx> is assigned to manage a distinct slice of LLC capacity
> >>      UPI - each uncore_upi_<pmu_idx> is assigned to manage one link of Intel UPI Subsystem
> >>      IIO - each uncore_iio_<pmu_idx> is assigned to manage one stack of the IIO module
> >>      IMC - each uncore_imc_<pmu_idx> is assigned to manage one channel of Memory Controller
> >>
> >> Implementation details:
> >>      Two callbacks added to struct intel_uncore_type to discover and map Uncore units to PMONs:
> >>          int (*get_topology)(struct intel_uncore_type *type)
> >>          int (*set_mapping)(struct intel_uncore_type *type)
> >>
> >>      IIO stack to PMON mapping is exposed through
> >>          /sys/devices/uncore_iio_<pmu_idx>/platform_mapping
> >>          in the following format: domain:bus
> >>
> >> Details of IIO Uncore unit mapping to IIO PMON:
> >> Each IIO stack is either a DMI port, x16 PCIe root port, MCP-Link or various
> >> built-in accelerators. For Uncore IIO Unit type, the platform_mapping file
> >> holds bus numbers of devices, which can be monitored by that IIO PMON block
> >> on each die.
> >>
> >> For example, on a 4-die Intel Xeon® server platform:
> >>      $ cat /sys/devices/uncore_iio_0/platform_mapping
> >>      0000:00,0000:40,0000:80,0000:c0
> >>
> >> Which means:
> >> IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
> >> IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
> >> IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
> >> IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000
> >>
> > You are just looking at one die (package). How does your enumeration
> > help figure out whether iio_0 is on socket 0 or socket 1, and then
> > which bus/domain is on which socket?
> >
> > And how does that help map actual devices (using the output of lspci)
> > to the IIO? You need to show how you would do that with what you have
> > in your patch right now, which is really what people want.
> No. I'm enumerating all IIO PMUs for all sockets on the platform.
>
> Let's take a 4-socket SKX as an example - sysfs exposes 6 instances of
> the IIO PMU, and each socket has its own instance of each IIO PMU,
> meaning that socket 0 has its own IIO PMU0, socket 1 also has its own
> IIO PMU0, and so on. The same applies to IIO PMUs 1 through 5.

I know that.

> Below is sample output:
>
> $:/sys/devices# cat uncore_iio_0/platform_mapping
> 0000:00,0000:40,0000:80,0000:c0
> $:/sys/devices# cat uncore_iio_1/platform_mapping
> 0000:16,0000:44,0000:84,0000:c4
> $:/sys/devices# cat uncore_iio_2/platform_mapping
> 0000:24,0000:58,0000:98,0000:d8
> $:/sys/devices# cat uncore_iio_3/platform_mapping
> 0000:32,0000:6c,0000:ac,0000:ec
>
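As an illustration of how the userspace part could consume this sysfs format (this is not part of the patch set; the function name is made up), a minimal parser for a platform_mapping string:

```python
def parse_platform_mapping(text):
    """Parse a platform_mapping string ("domain:bus,domain:bus,...")
    into a {die_index: (domain, bus)} dict.  Entries appear in die
    order, one per die, matching the sysfs format described above."""
    mapping = {}
    for die, entry in enumerate(text.strip().split(",")):
        domain, bus = entry.split(":")
        mapping[die] = (int(domain, 16), int(bus, 16))
    return mapping

# e.g. the uncore_iio_0 sample above:
print(parse_platform_mapping("0000:00,0000:40,0000:80,0000:c0"))
# {0: (0, 0), 1: (0, 64), 2: (0, 128), 3: (0, 192)}
```

A tool would then only need to compare a device's root bus (from lspci or sysfs) against these per-die buses to find the covering IIO PMU.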
> Technically, the idea is as follows - the kernel part of the feature
> locates IIO stacks and creates the IIO PMON to IIO stack mapping.
> The userspace part locates the IO devices connected to each IIO stack
> on each socket and configures only the required IIO counters to
> provide 4 IO performance metrics - Inbound Read, Inbound Write, Outbound
> Read, Outbound Write - attributed to each device.
>
>
> Follow up patches show how users can benefit from the feature; see
> https://lkml.org/lkml/2019/11/26/451
>
I know this is useful. I have done this for internal users a long time ago.

> Below is sample output:
>
> 1. show mode
>
> ./perf stat --iiostat=show
>
>      S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
>      S1-RootPort0-uncore_iio_0<81:00.0 Ethernet Controller X710 for 10GbE SFP+>
>      S0-RootPort1-uncore_iio_1<18:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
>      S1-RootPort1-uncore_iio_1<86:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
>      S1-RootPort1-uncore_iio_1<88:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
>      S0-RootPort2-uncore_iio_2<3d:00.0 Ethernet Connection X722 for 10GBASE-T>
>      S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
>      S1-RootPort3-uncore_iio_3<da:00.0 NVMe Datacenter SSD [Optane]>
>
> For example, NIC at 81:00.0 is local to S1, connected to its RootPort0 and is covered by IIO PMU0 (on socket 1)
>
> 2. collector mode
>
>    ./perf stat --iiostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
>      357708+0 records in
>      357707+0 records out
>      375083606016 bytes (375 GB, 349 GiB) copied, 215.381 s, 1.7 GB/s
>
>    Performance counter stats for 'system wide':
>
>       device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
>      00:00.0                    0                    0                    0                    0
>      81:00.0                    0                    0                    0                    0
>      18:00.0                    0                    0                    0                    0
>      86:00.0                    0                    0                    0                    0
>      88:00.0                    0                    0                    0                    0
>      3b:00.0                    3                    0                    0                    0
>      3c:03.0                    3                    0                    0                    0
>      3d:00.0                    3                    0                    0                    0
>      af:00.0                    0                    0                    0                    0
>      da:00.0               358559                   44                    0                   22
>
I think this output would be more useful with the socket information.
People care about NUMA locality, and this output does not cover that
(in a single cmdline). It would also benefit from having the actual
Linux device names, e.g., sda, sdb, eth0, ...



>      215.383783574 seconds time elapsed
>
> >
> >> Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
> >> Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
> >> Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
> >> ---
> >>   arch/x86/events/intel/uncore.c       |  61 +++++++++++-
> >>   arch/x86/events/intel/uncore.h       |  13 ++-
> >>   arch/x86/events/intel/uncore_snbep.c | 144 +++++++++++++++++++++++++++
> >>   3 files changed, 214 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
> >> index 86467f85c383..0f779c8fcc05 100644
> >> --- a/arch/x86/events/intel/uncore.c
> >> +++ b/arch/x86/events/intel/uncore.c
> >> @@ -18,6 +18,11 @@ struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
> >>   struct pci_extra_dev *uncore_extra_pci_dev;
> >>   static int max_dies;
> >>
> >> +int get_max_dies(void)
> >> +{
> >> +       return max_dies;
> >> +}
> >> +
> >>   /* mask of cpus that collect uncore events */
> >>   static cpumask_t uncore_cpu_mask;
> >>
> >> @@ -816,6 +821,16 @@ static ssize_t uncore_get_attr_cpumask(struct device *dev,
> >>
> >>   static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);
> >>
> >> +static ssize_t platform_mapping_show(struct device *dev,
> >> +                               struct device_attribute *attr, char *buf)
> >> +{
> >> +       struct intel_uncore_pmu *pmu = dev_get_drvdata(dev);
> >> +
> >> +       return snprintf(buf, PAGE_SIZE - 1, "%s\n", pmu->platform_mapping ?
> >> +                      (char *)pmu->platform_mapping : "0");
> >> +}
> >> +static DEVICE_ATTR_RO(platform_mapping);
> >> +
> >>   static struct attribute *uncore_pmu_attrs[] = {
> >>          &dev_attr_cpumask.attr,
> >>          NULL,
> >> @@ -825,6 +840,15 @@ static const struct attribute_group uncore_pmu_attr_group = {
> >>          .attrs = uncore_pmu_attrs,
> >>   };
> >>
> >> +static struct attribute *platform_attrs[] = {
> >> +       &dev_attr_platform_mapping.attr,
> >> +       NULL,
> >> +};
> >> +
> >> +static const struct attribute_group uncore_platform_discovery_group = {
> >> +       .attrs = platform_attrs,
> >> +};
> >> +
> >>   static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
> >>   {
> >>          int ret;
> >> @@ -905,11 +929,27 @@ static void uncore_types_exit(struct intel_uncore_type **types)
> >>                  uncore_type_exit(*types);
> >>   }
> >>
> >> +static void uncore_type_attrs_compaction(struct intel_uncore_type *type)
> >> +{
> >> +       int i, j;
> >> +
> >> +       for (i = 0, j = 0; i < UNCORE_MAX_NUM_ATTR_GROUP; i++) {
> >> +               if (!type->attr_groups[i])
> >> +                       continue;
> >> +               if (i > j) {
> >> +                       type->attr_groups[j] = type->attr_groups[i];
> >> +                       type->attr_groups[i] = NULL;
> >> +               }
> >> +               j++;
> >> +       }
> >> +}
> >> +
> >>   static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
> >>   {
> >>          struct intel_uncore_pmu *pmus;
> >>          size_t size;
> >>          int i, j;
> >> +       int ret;
> >>
> >>          pmus = kcalloc(type->num_boxes, sizeof(*pmus), GFP_KERNEL);
> >>          if (!pmus)
> >> @@ -922,8 +962,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
> >>                  pmus[i].pmu_idx = i;
> >>                  pmus[i].type    = type;
> >>                  pmus[i].boxes   = kzalloc(size, GFP_KERNEL);
> >> -               if (!pmus[i].boxes)
> >> +               if (!pmus[i].boxes) {
> >> +                       ret = -ENOMEM;
> >>                          goto err;
> >> +               }
> >>          }
> >>
> >>          type->pmus = pmus;
> >> @@ -940,8 +982,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
> >>
> >>                  attr_group = kzalloc(struct_size(attr_group, attrs, i + 1),
> >>                                                                  GFP_KERNEL);
> >> -               if (!attr_group)
> >> +               if (!attr_group) {
> >> +                       ret = -ENOMEM;
> >>                          goto err;
> >> +               }
> >>
> >>                  attr_group->group.name = "events";
> >>                  attr_group->group.attrs = attr_group->attrs;
> >> @@ -954,6 +998,17 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
> >>
> >>          type->pmu_group = &uncore_pmu_attr_group;
> >>
> >> +       /*
> >> +        * Exposing mapping of Uncore units to corresponding Uncore PMUs
> >> +        * through /sys/devices/uncore_<type>_<idx>/platform_mapping
> >> +        */
> >> +       if (type->get_topology && type->set_mapping)
> >> +               if (!type->get_topology(type) && !type->set_mapping(type))
> >> +                       type->platform_discovery = &uncore_platform_discovery_group;
> >> +
> >> +       /* For optional attributes, we can safely remove embedded NULL attr_groups elements */
> >> +       uncore_type_attrs_compaction(type);
> >> +
> >>          return 0;
> >>
> >>   err:
> >> @@ -961,7 +1016,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
> >>                  kfree(pmus[i].boxes);
> >>          kfree(pmus);
> >>
> >> -       return -ENOMEM;
> >> +       return ret;
> >>   }
> >>
> >>   static int __init
> >> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
> >> index bbfdaa720b45..ce3727b9f7f8 100644
> >> --- a/arch/x86/events/intel/uncore.h
> >> +++ b/arch/x86/events/intel/uncore.h
> >> @@ -43,6 +43,8 @@ struct intel_uncore_box;
> >>   struct uncore_event_desc;
> >>   struct freerunning_counters;
> >>
> >> +#define UNCORE_MAX_NUM_ATTR_GROUP 5
> >> +
> >>   struct intel_uncore_type {
> >>          const char *name;
> >>          int num_counters;
> >> @@ -71,13 +73,19 @@ struct intel_uncore_type {
> >>          struct intel_uncore_ops *ops;
> >>          struct uncore_event_desc *event_descs;
> >>          struct freerunning_counters *freerunning;
> >> -       const struct attribute_group *attr_groups[4];
> >> +       const struct attribute_group *attr_groups[UNCORE_MAX_NUM_ATTR_GROUP];
> >>          struct pmu *pmu; /* for custom pmu ops */
> >> +       void *platform_topology;
> >> +       /* finding Uncore units */
> >> +       int (*get_topology)(struct intel_uncore_type *type);
> >> +       /* mapping Uncore units to PMON ranges */
> >> +       int (*set_mapping)(struct intel_uncore_type *type);
> >>   };
> >>
> >>   #define pmu_group attr_groups[0]
> >>   #define format_group attr_groups[1]
> >>   #define events_group attr_groups[2]
> >> +#define platform_discovery attr_groups[3]
> >>
> >>   struct intel_uncore_ops {
> >>          void (*init_box)(struct intel_uncore_box *);
> >> @@ -99,6 +107,7 @@ struct intel_uncore_pmu {
> >>          int                             pmu_idx;
> >>          int                             func_id;
> >>          bool                            registered;
> >> +       void                            *platform_mapping;
> >>          atomic_t                        activeboxes;
> >>          struct intel_uncore_type        *type;
> >>          struct intel_uncore_box         **boxes;
> >> @@ -490,6 +499,8 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
> >>          return event->pmu_private;
> >>   }
> >>
> >> +int get_max_dies(void);
> >> +
> >>   struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu);
> >>   u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
> >>   void uncore_mmio_exit_box(struct intel_uncore_box *box);
> >> diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
> >> index b10a5ec79e48..92ce9fbafde1 100644
> >> --- a/arch/x86/events/intel/uncore_snbep.c
> >> +++ b/arch/x86/events/intel/uncore_snbep.c
> >> @@ -273,6 +273,28 @@
> >>   #define SKX_CPUNODEID                  0xc0
> >>   #define SKX_GIDNIDMAP                  0xd4
> >>
> >> +/*
> >> + * The CPU_BUS_NUMBER MSR returns the values of the respective CPUBUSNO CSR
> >> + * that BIOS programmed. MSR has package scope.
> >> + * |  Bit  |  Default  |  Description
> >> + * | [63]  |    00h    | VALID - When set, indicates the CPU bus
> >> + *                       numbers have been initialized. (RO)
> >> + * |[62:48]|    ---    | Reserved
> >> + * |[47:40]|    00h    | BUS_NUM_5 — Return the bus number BIOS assigned
> >> + *                       CPUBUSNO(5). (RO)
> >> + * |[39:32]|    00h    | BUS_NUM_4 — Return the bus number BIOS assigned
> >> + *                       CPUBUSNO(4). (RO)
> >> + * |[31:24]|    00h    | BUS_NUM_3 — Return the bus number BIOS assigned
> >> + *                       CPUBUSNO(3). (RO)
> >> + * |[23:16]|    00h    | BUS_NUM_2 — Return the bus number BIOS assigned
> >> + *                       CPUBUSNO(2). (RO)
> >> + * |[15:8] |    00h    | BUS_NUM_1 — Return the bus number BIOS assigned
> >> + *                       CPUBUSNO(1). (RO)
> >> + * | [7:0] |    00h    | BUS_NUM_0 — Return the bus number BIOS assigned
> >> + *                       CPUBUSNO(0). (RO)
> >> + */
> >> +#define SKX_MSR_CPU_BUS_NUMBER         0x300
> >> +
> >>   /* SKX CHA */
> >>   #define SKX_CHA_MSR_PMON_BOX_FILTER_TID                (0x1ffULL << 0)
> >>   #define SKX_CHA_MSR_PMON_BOX_FILTER_LINK       (0xfULL << 9)
> >> @@ -3580,6 +3602,9 @@ static struct intel_uncore_ops skx_uncore_iio_ops = {
> >>          .read_counter           = uncore_msr_read_counter,
> >>   };
> >>
> >> +static int skx_iio_get_topology(struct intel_uncore_type *type);
> >> +static int skx_iio_set_mapping(struct intel_uncore_type *type);
> >> +
> >>   static struct intel_uncore_type skx_uncore_iio = {
> >>          .name                   = "iio",
> >>          .num_counters           = 4,
> >> @@ -3594,6 +3619,8 @@ static struct intel_uncore_type skx_uncore_iio = {
> >>          .constraints            = skx_uncore_iio_constraints,
> >>          .ops                    = &skx_uncore_iio_ops,
> >>          .format_group           = &skx_uncore_iio_format_group,
> >> +       .get_topology           = skx_iio_get_topology,
> >> +       .set_mapping            = skx_iio_set_mapping,
> >>   };
> >>
> >>   enum perf_uncore_iio_freerunning_type_id {
> >> @@ -3780,6 +3807,123 @@ static int skx_count_chabox(void)
> >>          return hweight32(val);
> >>   }
> >>
> >> +static inline u8 skx_iio_topology_byte(void *platform_topology,
> >> +                                       int die, int idx)
> >> +{
> >> +       return *((u8 *)(platform_topology) + die * sizeof(u64) + idx);
> >> +}
> >> +
> >> +static inline bool skx_iio_topology_valid(u64 msr_value)
> >> +{
> >> +       return msr_value & ((u64)1 << 63);
> >> +}
> >> +
> >> +static int skx_msr_cpu_bus_read(int cpu, int die)
> >> +{
> >> +       int ret = rdmsrl_on_cpu(cpu, SKX_MSR_CPU_BUS_NUMBER,
> >> +                               ((u64 *)skx_uncore_iio.platform_topology) + die);
> >> +
> >> +       if (!ret) {
> >> +               if (!skx_iio_topology_valid(*(((u64 *)skx_uncore_iio.platform_topology) + die)))
> >> +                       ret = -1;
> >> +       }
> >> +       return ret;
> >> +}
> >> +
> >> +static int skx_iio_get_topology(struct intel_uncore_type *type)
> >> +{
> >> +       int ret, cpu, die, current_die;
> >> +       struct pci_bus *bus = NULL;
> >> +
> >> +       while ((bus = pci_find_next_bus(bus)) != NULL)
> >> +               if (pci_domain_nr(bus)) {
> >> +                       pr_info("Mapping of I/O stack to PMON ranges is not supported for multi-segment topology\n");
> >> +                       return -1;
> >> +               }
> >> +
> >> +       /* Size of SKX_MSR_CPU_BUS_NUMBER is 8 bytes, the MSR has package scope.*/
> >> +       type->platform_topology =
> >> +               kzalloc(get_max_dies() * sizeof(u64), GFP_KERNEL);
> >> +       if (!type->platform_topology)
> >> +               return -ENOMEM;
> >> +
> >> +       /*
> >> +        * Using cpus_read_lock() to ensure cpu is not going down between
> >> +        * looking at cpu_online_mask.
> >> +        */
> >> +       cpus_read_lock();
> >> +       /* Invalid value to start loop.*/
> >> +       current_die = -1;
> >> +       for_each_online_cpu(cpu) {
> >> +               die = topology_logical_die_id(cpu);
> >> +               if (current_die == die)
> >> +                       continue;
> >> +               ret = skx_msr_cpu_bus_read(cpu, die);
> >> +               if (ret)
> >> +                       break;
> >> +               current_die = die;
> >> +       }
> >> +       cpus_read_unlock();
> >> +
> >> +       if (ret)
> >> +               kfree(type->platform_topology);
> >> +       return ret;
> >> +}
> >> +
> >> +static int skx_iio_set_mapping(struct intel_uncore_type *type)
> >> +{
> >> +       /*
> >> +        * Each IIO stack (PCIe root port) has its own IIO PMON block, so each
> >> +        * platform_mapping holds bus number(s) of PCIe root port(s), which can
> >> +        * be monitored by that IIO PMON block.
> >> +        *
> >> +        * For example, on a 4-die Xeon platform with up to 6 IIO stacks per die
> >> +        * and, therefore, 6 IIO PMON blocks per die, the platform_mapping of IIO
> >> +        * PMON block 0 holds "0000:00,0000:40,0000:80,0000:c0":
> >> +        *
> >> +        * $ cat /sys/devices/uncore_iio_0/platform_mapping
> >> +        * 0000:00,0000:40,0000:80,0000:c0
> >> +        *
> >> +        * Which means:
> >> +        * IIO PMON block 0 on the die 0 belongs to PCIe root port located on bus 0x00, domain 0x0000
> >> +        * IIO PMON block 0 on the die 1 belongs to PCIe root port located on bus 0x40, domain 0x0000
> >> +        * IIO PMON block 0 on the die 2 belongs to PCIe root port located on bus 0x80, domain 0x0000
> >> +        * IIO PMON block 0 on the die 3 belongs to PCIe root port located on bus 0xc0, domain 0x0000
> >> +        */
> >> +
> >> +       int ret = 0;
> >> +       int die, i;
> >> +       char *buf;
> >> +       struct intel_uncore_pmu *pmu;
> >> +       const int template_len = 8;
> >> +
> >> +       for (i = 0; i < type->num_boxes; i++) {
> >> +               pmu = type->pmus + i;
> >> +               /* Root bus 0x00 is valid only for die 0 AND pmu_idx = 0. */
> >> +               if (skx_iio_topology_byte(type->platform_topology, 0, pmu->pmu_idx) || (!pmu->pmu_idx)) {
> >> +                       pmu->platform_mapping =
> >> +                               kzalloc(get_max_dies() * template_len + 1, GFP_KERNEL);
> >> +                       if (pmu->platform_mapping) {
> >> +                               buf = (char *)pmu->platform_mapping;
> >> +                               for (die = 0; die < get_max_dies(); die++)
> >> +                                       buf += snprintf(buf, template_len + 1, "%04x:%02x,", 0,
> >> +                                               skx_iio_topology_byte(type->platform_topology,
> >> +                                                                     die, pmu->pmu_idx));
> >> +
> >> +                               *(--buf) = '\0';
> >> +                       } else {
> >> +                               for (; i >= 0; i--)
> >> +                                       kfree((type->pmus + i)->platform_mapping);
> >> +                               ret = -ENOMEM;
> >> +                               break;
> >> +                       }
> >> +               }
> >> +       }
> >> +
> >> +       kfree(type->platform_topology);
> >> +       return ret;
> >> +}
> >> +
> >>   void skx_uncore_cpu_init(void)
> >>   {
> >>          skx_uncore_chabox.num_boxes = skx_count_chabox();
> >> --
> >> 2.19.1
> >>
>
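The SKX_MSR_CPU_BUS_NUMBER layout quoted in the patch above can be decoded mechanically; a userspace-style sketch (illustrative only, not kernel code - the function name is made up):

```python
def decode_cpu_bus_number(msr):
    """Decode an SKX CPU_BUS_NUMBER MSR value per the layout in the
    patch: bit 63 = VALID, bytes 0..5 = BUS_NUM_0..BUS_NUM_5.
    Returns the six bus numbers, or None if VALID is not set."""
    if not (msr >> 63) & 1:
        return None
    return [(msr >> (8 * i)) & 0xFF for i in range(6)]

# A made-up value with buses 0x00/0x16/0x24/0x32 and VALID set,
# matching the uncore_iio_0..3 sample buses for one die:
val = (1 << 63) | (0x32 << 24) | (0x24 << 16) | (0x16 << 8) | 0x00
print(decode_cpu_bus_number(val))  # [0, 22, 36, 50, 0, 0]
```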

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-12-05 18:02       ` Stephane Eranian
@ 2019-12-05 22:28         ` Andi Kleen
  2019-12-06 16:08         ` Sudarikov, Roman
  1 sibling, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2019-12-05 22:28 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Sudarikov, Roman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, LKML, Brendan Gregg, Liang, Kan,
	alexander.antonov

On Thu, Dec 05, 2019 at 10:02:55AM -0800, Stephane Eranian wrote:
> does not cover that (in a single cmdline). 

> It would also benefit from
> having the actual Linux device names, e.g., sda, sdb, eth0, ...

Some example code to do that mapping in the other direction is here

https://github.com/numactl/numactl/blob/master/affinity.c


-Andi
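For the direction discussed in this thread (Linux device name -> PCI bus, to match against platform_mapping), a rough userspace sketch could follow the device's sysfs symlink; the paths below are the standard sysfs layout, and the helper names are made up:

```python
import os
import re

def bus_of_bdf(bdf):
    """Extract the bus number from a 'dddd:bb:dd.f' PCI address."""
    m = re.match(r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):", bdf)
    if not m:
        raise ValueError("not a PCI address: %r" % bdf)
    return int(m.group(2), 16)

def pci_addr_of_netdev(name):
    """Resolve a network interface name (e.g. 'eth0') to its PCI
    address by following the sysfs device symlink."""
    link = os.readlink("/sys/class/net/%s/device" % name)
    return os.path.basename(link)  # e.g. '0000:81:00.0'

print("bus of 0000:81:00.0 = %#x" % bus_of_bdf("0000:81:00.0"))
```

Block devices work the same way via /sys/class/block/<name>/device; the resulting bus number is what would be matched against the uncore_iio_* platform_mapping entries.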

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
  2019-12-05 18:02       ` Stephane Eranian
  2019-12-05 22:28         ` Andi Kleen
@ 2019-12-06 16:08         ` Sudarikov, Roman
  1 sibling, 0 replies; 14+ messages in thread
From: Sudarikov, Roman @ 2019-12-06 16:08 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, LKML,
	Brendan Gregg, Andi Kleen, Liang, Kan, alexander.antonov

On 05.12.2019 21:02, Stephane Eranian wrote:
> On Wed, Dec 4, 2019 at 10:48 AM Sudarikov, Roman
> <roman.sudarikov@linux.intel.com> wrote:
>> On 02.12.2019 22:47, Stephane Eranian wrote:
>>> On Tue, Nov 26, 2019 at 8:36 AM <roman.sudarikov@linux.intel.com> wrote:
>>>> From: Roman Sudarikov <roman.sudarikov@linux.intel.com>
>>>>
>>>> Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
>>>> changes in the integrated I/O (IIO) architecture. The new solution introduces
>>>> IIO stacks which are responsible for managing traffic between the PCIe domain
>>>> and the Mesh domain. Each IIO stack has its own PMON block and can handle either
>>>> DMI port, x16 PCIe root port, MCP-Link or various built-in accelerators.
>>>> IIO PMON blocks allow concurrent monitoring of I/O flows up to 4 x4 bifurcation
>>>> within each IIO stack.
>>>>
>>>> Software is supposed to program required perf counters within each IIO stack
>>>> and gather performance data. The tricky thing here is that IIO PMON reports data
>>>> per IIO stack but users have no idea what IIO stacks are - they only know devices
>>>> which are connected to the platform.
>>>>
>>>> Understanding the IIO stack concept to find which IIO stack a particular IO device
>>>> is connected to, or to identify the IIO PMON block to program for monitoring a
>>>> specific IIO stack, assumes a lot of implicit knowledge about the given Intel server
>>>> platform architecture.
>>>>
>>>> This patch set introduces:
>>>>       An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs-backend
>>>>       A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device
>>>>
>>>> Current version supports a server line starting Intel® Xeon® Processor Scalable
>>>> Family and introduces mapping for IIO Uncore units only.
>>>> Other units can be added on demand.
>>>>
>>>> Usage example:
>>>>       /sys/devices/uncore_<type>_<pmu_idx>/platform_mapping
>>>>
>>>> Each Uncore unit type, by its nature, can be mapped to its own context, for example:
>>>>       CHA - each uncore_cha_<pmu_idx> is assigned to manage a distinct slice of LLC capacity
>>>>       UPI - each uncore_upi_<pmu_idx> is assigned to manage one link of Intel UPI Subsystem
>>>>       IIO - each uncore_iio_<pmu_idx> is assigned to manage one stack of the IIO module
>>>>       IMC - each uncore_imc_<pmu_idx> is assigned to manage one channel of Memory Controller
>>>>
>>>> Implementation details:
>>>>       Two callbacks added to struct intel_uncore_type to discover and map Uncore units to PMONs:
>>>>           int (*get_topology)(struct intel_uncore_type *type)
>>>>           int (*set_mapping)(struct intel_uncore_type *type)
>>>>
>>>>       IIO stack to PMON mapping is exposed through
>>>>           /sys/devices/uncore_iio_<pmu_idx>/platform_mapping
>>>>           in the following format: domain:bus
>>>>
>>>> Details of IIO Uncore unit mapping to IIO PMON:
>>>> Each IIO stack is either a DMI port, x16 PCIe root port, MCP-Link or various
>>>> built-in accelerators. For Uncore IIO Unit type, the platform_mapping file
>>>> holds bus numbers of devices, which can be monitored by that IIO PMON block
>>>> on each die.
>>>>
>>>> For example, on a 4-die Intel Xeon® server platform:
>>>>       $ cat /sys/devices/uncore_iio_0/platform_mapping
>>>>       0000:00,0000:40,0000:80,0000:c0
>>>>
>>>> Which means:
>>>> IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
>>>> IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
>>>> IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
>>>> IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000
>>>>
>>> You are just looking at one die (package). How does your enumeration
>>> help figure out whether iio_0 is on socket 0 or socket 1, and then
>>> which bus/domain is on which socket?
>>>
>>> And how does that help map actual devices (using the output of lspci)
>>> to the IIO? You need to show how you would do that with what you have
>>> in your patch right now, which is really what people want.
>> No. I'm enumerating all IIO PMUs for all sockets on the platform.
>>
>> Let's take a 4-socket SKX as an example - sysfs exposes 6 instances of
>> the IIO PMU, and each socket has its own instance of each IIO PMU,
>> meaning that socket 0 has its own IIO PMU0, socket 1 also has its own
>> IIO PMU0, and so on. The same applies to IIO PMUs 1 through 5.
> I know that.
>
>> Below is sample output:
>>
>> $:/sys/devices# cat uncore_iio_0/platform_mapping
>> 0000:00,0000:40,0000:80,0000:c0
>> $:/sys/devices# cat uncore_iio_1/platform_mapping
>> 0000:16,0000:44,0000:84,0000:c4
>> $:/sys/devices# cat uncore_iio_2/platform_mapping
>> 0000:24,0000:58,0000:98,0000:d8
>> $:/sys/devices# cat uncore_iio_3/platform_mapping
>> 0000:32,0000:6c,0000:ac,0000:ec
>>
>> Technically, the idea is as follows - the kernel part of the feature
>> locates IIO stacks and creates the IIO PMON to IIO stack mapping.
>> The userspace part locates the IO devices connected to each IIO stack
>> on each socket and configures only the required IIO counters to
>> provide 4 IO performance metrics - Inbound Read, Inbound Write, Outbound
>> Read, Outbound Write - attributed to each device.
>>
>>
>> Follow up patches show how users can benefit from the feature; see
>> https://lkml.org/lkml/2019/11/26/451
>>
> I know this is useful. I have done this for internal users a long time ago.
>
>> Below is sample output:
>>
>> 1. show mode
>>
>> ./perf stat --iiostat=show
>>
>>       S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
>>       S1-RootPort0-uncore_iio_0<81:00.0 Ethernet Controller X710 for 10GbE SFP+>
>>       S0-RootPort1-uncore_iio_1<18:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
>>       S1-RootPort1-uncore_iio_1<86:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
>>       S1-RootPort1-uncore_iio_1<88:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
>>       S0-RootPort2-uncore_iio_2<3d:00.0 Ethernet Connection X722 for 10GBASE-T>
>>       S1-RootPort2-uncore_iio_2<af:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
>>       S1-RootPort3-uncore_iio_3<da:00.0 NVMe Datacenter SSD [Optane]>
>>
>> For example, NIC at 81:00.0 is local to S1, connected to its RootPort0 and is covered by IIO PMU0 (on socket 1)
>>
>> 2. collector mode
>>
>>     ./perf stat --iiostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
>>       357708+0 records in
>>       357707+0 records out
>>       375083606016 bytes (375 GB, 349 GiB) copied, 215.381 s, 1.7 GB/s
>>
>>     Performance counter stats for 'system wide':
>>
>>        device             Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)   Outbound Write(MB)
>>       00:00.0                    0                    0                    0                    0
>>       81:00.0                    0                    0                    0                    0
>>       18:00.0                    0                    0                    0                    0
>>       86:00.0                    0                    0                    0                    0
>>       88:00.0                    0                    0                    0                    0
>>       3b:00.0                    3                    0                    0                    0
>>       3c:03.0                    3                    0                    0                    0
>>       3d:00.0                    3                    0                    0                    0
>>       af:00.0                    0                    0                    0                    0
>>       da:00.0               358559                   44                    0                   22
>>
> I think this output would be more useful with the socket information.
> People care about NUMA locality. This output
> does not cover that (in a single cmdline). It would also benefit from
> having the actual Linux device names, e.g., sda, ssda, eth0, ....,
Hi Stephane,

I still think we should keep the b:d.f notation as part of the output
and, sure, we can add the socket and device name information, so it
will look like this:

Before:
    Performance counter stats for 'system wide':

       device    Inbound Read(MB)   Inbound Write(MB)
      da:00.0

After:
    Performance counter stats for 'system wide':

            device             Inbound Read(MB)    Inbound Write(MB)
  S1<nvme0>da:00.0

Are you OK with that approach?

BTW, addressing this requires code changes in the userspace part only.
Can we proceed with the kernel patch review? Once it is finalized, I'll
send the userspace part and we will figure out the right output format.

Thanks,
Roman
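
For illustration, a column in the "After" form above could be assembled
like this. This is only a sketch of the proposed format: the function
name, parameters, and the exact layout are hypothetical, since the final
output format is still being discussed in this thread.

```c
#include <stdio.h>

/*
 * Format one --iiostat device column in the proposed
 * "S<socket><<name>><bus>:<dev>.<fn>" style, e.g. "S1<nvme0>da:00.0".
 * Illustrative only; not the perf tool's actual code.
 */
static int format_iio_device(char *buf, size_t len, int socket,
			     const char *name, unsigned int bus,
			     unsigned int dev, unsigned int fn)
{
	return snprintf(buf, len, "S%d<%s>%02x:%02x.%d",
			socket, name, bus, dev, fn);
}
```

With socket 1, device name "nvme0" and PCI address da:00.0, this yields
"S1<nvme0>da:00.0", matching the example above.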
>
>>       215.383783574 seconds time elapsed
>>
>>>> Signed-off-by: Roman Sudarikov <roman.sudarikov@linux.intel.com>
>>>> Co-developed-by: Alexander Antonov <alexander.antonov@intel.com>
>>>> Signed-off-by: Alexander Antonov <alexander.antonov@intel.com>
>>>> ---
>>>>    arch/x86/events/intel/uncore.c       |  61 +++++++++++-
>>>>    arch/x86/events/intel/uncore.h       |  13 ++-
>>>>    arch/x86/events/intel/uncore_snbep.c | 144 +++++++++++++++++++++++++++
>>>>    3 files changed, 214 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
>>>> index 86467f85c383..0f779c8fcc05 100644
>>>> --- a/arch/x86/events/intel/uncore.c
>>>> +++ b/arch/x86/events/intel/uncore.c
>>>> @@ -18,6 +18,11 @@ struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
>>>>    struct pci_extra_dev *uncore_extra_pci_dev;
>>>>    static int max_dies;
>>>>
>>>> +int get_max_dies(void)
>>>> +{
>>>> +       return max_dies;
>>>> +}
>>>> +
>>>>    /* mask of cpus that collect uncore events */
>>>>    static cpumask_t uncore_cpu_mask;
>>>>
>>>> @@ -816,6 +821,16 @@ static ssize_t uncore_get_attr_cpumask(struct device *dev,
>>>>
>>>>    static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);
>>>>
>>>> +static ssize_t platform_mapping_show(struct device *dev,
>>>> +                               struct device_attribute *attr, char *buf)
>>>> +{
>>>> +       struct intel_uncore_pmu *pmu = dev_get_drvdata(dev);
>>>> +
>>>> +       return snprintf(buf, PAGE_SIZE - 1, "%s\n", pmu->platform_mapping ?
>>>> +                      (char *)pmu->platform_mapping : "0");
>>>> +}
>>>> +static DEVICE_ATTR_RO(platform_mapping);
>>>> +
>>>>    static struct attribute *uncore_pmu_attrs[] = {
>>>>           &dev_attr_cpumask.attr,
>>>>           NULL,
>>>> @@ -825,6 +840,15 @@ static const struct attribute_group uncore_pmu_attr_group = {
>>>>           .attrs = uncore_pmu_attrs,
>>>>    };
>>>>
>>>> +static struct attribute *platform_attrs[] = {
>>>> +       &dev_attr_platform_mapping.attr,
>>>> +       NULL,
>>>> +};
>>>> +
>>>> +static const struct attribute_group uncore_platform_discovery_group = {
>>>> +       .attrs = platform_attrs,
>>>> +};
>>>> +
>>>>    static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>>>>    {
>>>>           int ret;
>>>> @@ -905,11 +929,27 @@ static void uncore_types_exit(struct intel_uncore_type **types)
>>>>                   uncore_type_exit(*types);
>>>>    }
>>>>
>>>> +static void uncore_type_attrs_compaction(struct intel_uncore_type *type)
>>>> +{
>>>> +       int i, j;
>>>> +
>>>> +       for (i = 0, j = 0; i < UNCORE_MAX_NUM_ATTR_GROUP; i++) {
>>>> +               if (!type->attr_groups[i])
>>>> +                       continue;
>>>> +               if (i > j) {
>>>> +                       type->attr_groups[j] = type->attr_groups[i];
>>>> +                       type->attr_groups[i] = NULL;
>>>> +               }
>>>> +               j++;
>>>> +       }
>>>> +}
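
The compaction loop above shifts non-NULL attribute groups toward index
0, preserving order, so sysfs registration sees a contiguous,
NULL-terminated array even when an optional group was never populated.
The same logic can be exercised in isolation; the sketch below uses a
plain string array instead of the kernel's attribute_group pointers, so
the type and names are illustrative only.

```c
#include <stddef.h>

#define MAX_GROUPS 5

/*
 * Shift non-NULL entries toward index 0, preserving their relative
 * order, and NULL out the vacated tail slots.
 */
static void compact_groups(const char *groups[MAX_GROUPS])
{
	int i, j;

	for (i = 0, j = 0; i < MAX_GROUPS; i++) {
		if (!groups[i])
			continue;
		if (i > j) {
			groups[j] = groups[i];
			groups[i] = NULL;
		}
		j++;
	}
}
```

For example, { "pmu", NULL, "events", NULL, "mapping" } compacts to
{ "pmu", "events", "mapping", NULL, NULL }.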
>>>> +
>>>>    static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>    {
>>>>           struct intel_uncore_pmu *pmus;
>>>>           size_t size;
>>>>           int i, j;
>>>> +       int ret;
>>>>
>>>>           pmus = kcalloc(type->num_boxes, sizeof(*pmus), GFP_KERNEL);
>>>>           if (!pmus)
>>>> @@ -922,8 +962,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>                   pmus[i].pmu_idx = i;
>>>>                   pmus[i].type    = type;
>>>>                   pmus[i].boxes   = kzalloc(size, GFP_KERNEL);
>>>> -               if (!pmus[i].boxes)
>>>> +               if (!pmus[i].boxes) {
>>>> +                       ret = -ENOMEM;
>>>>                           goto err;
>>>> +               }
>>>>           }
>>>>
>>>>           type->pmus = pmus;
>>>> @@ -940,8 +982,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>
>>>>                   attr_group = kzalloc(struct_size(attr_group, attrs, i + 1),
>>>>                                                                   GFP_KERNEL);
>>>> -               if (!attr_group)
>>>> +               if (!attr_group) {
>>>> +                       ret = -ENOMEM;
>>>>                           goto err;
>>>> +               }
>>>>
>>>>                   attr_group->group.name = "events";
>>>>                   attr_group->group.attrs = attr_group->attrs;
>>>> @@ -954,6 +998,17 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>
>>>>           type->pmu_group = &uncore_pmu_attr_group;
>>>>
>>>> +       /*
>>>> +        * Exposing mapping of Uncore units to corresponding Uncore PMUs
>>>> +        * through /sys/devices/uncore_<type>_<idx>/platform_mapping
>>>> +        */
>>>> +       if (type->get_topology && type->set_mapping)
>>>> +               if (!type->get_topology(type) && !type->set_mapping(type))
>>>> +                       type->platform_discovery = &uncore_platform_discovery_group;
>>>> +
>>>> +       /* For optional attributes, we can safely remove embedded NULL attr_groups elements */
>>>> +       uncore_type_attrs_compaction(type);
>>>> +
>>>>           return 0;
>>>>
>>>>    err:
>>>> @@ -961,7 +1016,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>                   kfree(pmus[i].boxes);
>>>>           kfree(pmus);
>>>>
>>>> -       return -ENOMEM;
>>>> +       return ret;
>>>>    }
>>>>
>>>>    static int __init
>>>> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
>>>> index bbfdaa720b45..ce3727b9f7f8 100644
>>>> --- a/arch/x86/events/intel/uncore.h
>>>> +++ b/arch/x86/events/intel/uncore.h
>>>> @@ -43,6 +43,8 @@ struct intel_uncore_box;
>>>>    struct uncore_event_desc;
>>>>    struct freerunning_counters;
>>>>
>>>> +#define UNCORE_MAX_NUM_ATTR_GROUP 5
>>>> +
>>>>    struct intel_uncore_type {
>>>>           const char *name;
>>>>           int num_counters;
>>>> @@ -71,13 +73,19 @@ struct intel_uncore_type {
>>>>           struct intel_uncore_ops *ops;
>>>>           struct uncore_event_desc *event_descs;
>>>>           struct freerunning_counters *freerunning;
>>>> -       const struct attribute_group *attr_groups[4];
>>>> +       const struct attribute_group *attr_groups[UNCORE_MAX_NUM_ATTR_GROUP];
>>>>           struct pmu *pmu; /* for custom pmu ops */
>>>> +       void *platform_topology;
>>>> +       /* finding Uncore units */
>>>> +       int (*get_topology)(struct intel_uncore_type *type);
>>>> +       /* mapping Uncore units to PMON ranges */
>>>> +       int (*set_mapping)(struct intel_uncore_type *type);
>>>>    };
>>>>
>>>>    #define pmu_group attr_groups[0]
>>>>    #define format_group attr_groups[1]
>>>>    #define events_group attr_groups[2]
>>>> +#define platform_discovery attr_groups[3]
>>>>
>>>>    struct intel_uncore_ops {
>>>>           void (*init_box)(struct intel_uncore_box *);
>>>> @@ -99,6 +107,7 @@ struct intel_uncore_pmu {
>>>>           int                             pmu_idx;
>>>>           int                             func_id;
>>>>           bool                            registered;
>>>> +       void                            *platform_mapping;
>>>>           atomic_t                        activeboxes;
>>>>           struct intel_uncore_type        *type;
>>>>           struct intel_uncore_box         **boxes;
>>>> @@ -490,6 +499,8 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
>>>>           return event->pmu_private;
>>>>    }
>>>>
>>>> +int get_max_dies(void);
>>>> +
>>>>    struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu);
>>>>    u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
>>>>    void uncore_mmio_exit_box(struct intel_uncore_box *box);
>>>> diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
>>>> index b10a5ec79e48..92ce9fbafde1 100644
>>>> --- a/arch/x86/events/intel/uncore_snbep.c
>>>> +++ b/arch/x86/events/intel/uncore_snbep.c
>>>> @@ -273,6 +273,28 @@
>>>>    #define SKX_CPUNODEID                  0xc0
>>>>    #define SKX_GIDNIDMAP                  0xd4
>>>>
>>>> +/*
>>>> + * The CPU_BUS_NUMBER MSR returns the values of the respective CPUBUSNO CSR
>>>> + * that BIOS programmed. MSR has package scope.
>>>> + * |  Bit  |  Default  |  Description
>>>> + * | [63]  |    00h    | VALID - When set, indicates the CPU bus
>>>> + *                       numbers have been initialized. (RO)
>>>> + * |[62:48]|    ---    | Reserved
>>>> + * |[47:40]|    00h    | BUS_NUM_5 — Return the bus number BIOS assigned
>>>> + *                       CPUBUSNO(5). (RO)
>>>> + * |[39:32]|    00h    | BUS_NUM_4 — Return the bus number BIOS assigned
>>>> + *                       CPUBUSNO(4). (RO)
>>>> + * |[31:24]|    00h    | BUS_NUM_3 — Return the bus number BIOS assigned
>>>> + *                       CPUBUSNO(3). (RO)
>>>> + * |[23:16]|    00h    | BUS_NUM_2 — Return the bus number BIOS assigned
>>>> + *                       CPUBUSNO(2). (RO)
>>>> + * |[15:8] |    00h    | BUS_NUM_1 — Return the bus number BIOS assigned
>>>> + *                       CPUBUSNO(1). (RO)
>>>> + * | [7:0] |    00h    | BUS_NUM_0 — Return the bus number BIOS assigned
>>>> + *                       CPUBUSNO(0). (RO)
>>>> + */
>>>> +#define SKX_MSR_CPU_BUS_NUMBER         0x300
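
The bit layout documented above can be decoded with plain shifts and
masks; a standalone sketch (the helper names here are illustrative, not
kernel APIs):

```c
#include <stdbool.h>
#include <stdint.h>

/* VALID flag: bit 63 of the CPU_BUS_NUMBER MSR value. */
static bool cpu_bus_number_valid(uint64_t msr)
{
	return msr & (1ULL << 63);
}

/* BUS_NUM_n: byte n (n = 0..5) of the MSR value. */
static uint8_t cpu_bus_number(uint64_t msr, int n)
{
	return (msr >> (8 * n)) & 0xff;
}
```

So a raw value with bit 63 set and bytes 0x00/0x40/0x80/0xc0 in
BUS_NUM_0..3 decodes to the four per-die root bus numbers used in the
mapping example further below.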
>>>> +
>>>>    /* SKX CHA */
>>>>    #define SKX_CHA_MSR_PMON_BOX_FILTER_TID                (0x1ffULL << 0)
>>>>    #define SKX_CHA_MSR_PMON_BOX_FILTER_LINK       (0xfULL << 9)
>>>> @@ -3580,6 +3602,9 @@ static struct intel_uncore_ops skx_uncore_iio_ops = {
>>>>           .read_counter           = uncore_msr_read_counter,
>>>>    };
>>>>
>>>> +static int skx_iio_get_topology(struct intel_uncore_type *type);
>>>> +static int skx_iio_set_mapping(struct intel_uncore_type *type);
>>>> +
>>>>    static struct intel_uncore_type skx_uncore_iio = {
>>>>           .name                   = "iio",
>>>>           .num_counters           = 4,
>>>> @@ -3594,6 +3619,8 @@ static struct intel_uncore_type skx_uncore_iio = {
>>>>           .constraints            = skx_uncore_iio_constraints,
>>>>           .ops                    = &skx_uncore_iio_ops,
>>>>           .format_group           = &skx_uncore_iio_format_group,
>>>> +       .get_topology           = skx_iio_get_topology,
>>>> +       .set_mapping            = skx_iio_set_mapping,
>>>>    };
>>>>
>>>>    enum perf_uncore_iio_freerunning_type_id {
>>>> @@ -3780,6 +3807,123 @@ static int skx_count_chabox(void)
>>>>           return hweight32(val);
>>>>    }
>>>>
>>>> +static inline u8 skx_iio_topology_byte(void *platform_topology,
>>>> +                                       int die, int idx)
>>>> +{
>>>> +       return *((u8 *)(platform_topology) + die * sizeof(u64) + idx);
>>>> +}
>>>> +
>>>> +static inline bool skx_iio_topology_valid(u64 msr_value)
>>>> +{
>>>> +       return msr_value & ((u64)1 << 63);
>>>> +}
>>>> +
>>>> +static int skx_msr_cpu_bus_read(int cpu, int die)
>>>> +{
>>>> +       int ret = rdmsrl_on_cpu(cpu, SKX_MSR_CPU_BUS_NUMBER,
>>>> +                               ((u64 *)skx_uncore_iio.platform_topology) + die);
>>>> +
>>>> +       if (!ret) {
>>>> +               if (!skx_iio_topology_valid(*(((u64 *)skx_uncore_iio.platform_topology) + die)))
>>>> +                       ret = -1;
>>>> +       }
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static int skx_iio_get_topology(struct intel_uncore_type *type)
>>>> +{
>>>> +       int ret, cpu, die, current_die;
>>>> +       struct pci_bus *bus = NULL;
>>>> +
>>>> +       while ((bus = pci_find_next_bus(bus)) != NULL)
>>>> +               if (pci_domain_nr(bus)) {
>>>> +                       pr_info("Mapping of I/O stack to PMON ranges is not supported for multi-segment topology\n");
>>>> +                       return -1;
>>>> +               }
>>>> +
>>>> +       /* Size of SKX_MSR_CPU_BUS_NUMBER is 8 bytes, the MSR has package scope.*/
>>>> +       type->platform_topology =
>>>> +               kzalloc(get_max_dies() * sizeof(u64), GFP_KERNEL);
>>>> +       if (!type->platform_topology)
>>>> +               return -ENOMEM;
>>>> +
>>>> +       /*
>>>> +        * Using cpus_read_lock() to ensure cpu is not going down between
>>>> +        * looking at cpu_online_mask.
>>>> +        */
>>>> +       cpus_read_lock();
>>>> +       /* Invalid value to start loop.*/
>>>> +       current_die = -1;
>>>> +       for_each_online_cpu(cpu) {
>>>> +               die = topology_logical_die_id(cpu);
>>>> +               if (current_die == die)
>>>> +                       continue;
>>>> +               ret = skx_msr_cpu_bus_read(cpu, die);
>>>> +               if (ret)
>>>> +                       break;
>>>> +               current_die = die;
>>>> +       }
>>>> +       cpus_read_unlock();
>>>> +
>>>> +       if (ret)
>>>> +               kfree(type->platform_topology);
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static int skx_iio_set_mapping(struct intel_uncore_type *type)
>>>> +{
>>>> +       /*
>>>> +        * Each IIO stack (PCIe root port) has its own IIO PMON block, so each
>>>> +        * platform_mapping holds bus number(s) of PCIe root port(s), which can
>>>> +        * be monitored by that IIO PMON block.
>>>> +        *
>>>> +        * For example, on a 4-die Xeon platform with up to 6 IIO stacks per die
>>>> +        * and, therefore, 6 IIO PMON blocks per die, the platform_mapping of IIO
>>>> +        * PMON block 0 holds "0000:00,0000:40,0000:80,0000:c0":
>>>> +        *
>>>> +        * $ cat /sys/devices/uncore_iio_0/platform_mapping
>>>> +        * 0000:00,0000:40,0000:80,0000:c0
>>>> +        *
>>>> +        * Which means:
>>>> +        * IIO PMON block 0 on the die 0 belongs to PCIe root port located on bus 0x00, domain 0x0000
>>>> +        * IIO PMON block 0 on the die 1 belongs to PCIe root port located on bus 0x40, domain 0x0000
>>>> +        * IIO PMON block 0 on the die 2 belongs to PCIe root port located on bus 0x80, domain 0x0000
>>>> +        * IIO PMON block 0 on the die 3 belongs to PCIe root port located on bus 0xc0, domain 0x0000
>>>> +        */
>>>> +
>>>> +       int ret = 0;
>>>> +       int die, i;
>>>> +       char *buf;
>>>> +       struct intel_uncore_pmu *pmu;
>>>> +       const int template_len = 8;
>>>> +
>>>> +       for (i = 0; i < type->num_boxes; i++) {
>>>> +               pmu = type->pmus + i;
>>>> +               /* Root bus 0x00 is valid only for die 0 AND pmu_idx = 0. */
>>>> +               if (skx_iio_topology_byte(type->platform_topology, 0, pmu->pmu_idx) || (!pmu->pmu_idx)) {
>>>> +                       pmu->platform_mapping =
>>>> +                               kzalloc(get_max_dies() * template_len + 1, GFP_KERNEL);
>>>> +                       if (pmu->platform_mapping) {
>>>> +                               buf = (char *)pmu->platform_mapping;
>>>> +                               for (die = 0; die < get_max_dies(); die++)
>>>> +                                       buf += snprintf(buf, template_len + 1, "%04x:%02x,", 0,
>>>> +                                               skx_iio_topology_byte(type->platform_topology,
>>>> +                                                                     die, pmu->pmu_idx));
>>>> +
>>>> +                               *(--buf) = '\0';
>>>> +                       } else {
>>>> +                               for (; i >= 0; i--)
>>>> +                                       kfree((type->pmus + i)->platform_mapping);
>>>> +                               ret = -ENOMEM;
>>>> +                               break;
>>>> +                       }
>>>> +               }
>>>> +       }
>>>> +
>>>> +       kfree(type->platform_topology);
>>>> +       return ret;
>>>> +}
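
The string-building part of skx_iio_set_mapping() can be checked on its
own: each die contributes one "%04x:%02x," entry (domain is always 0 on
single-segment systems) and the trailing comma is replaced with a NUL.
Below is a userspace sketch of that logic, assuming a pre-extracted
per-die array of bus-number bytes; the function is illustrative, not the
kernel code itself.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Build the "DDDD:BB,DDDD:BB,..." platform_mapping string for one IIO
 * PMON index from one bus-number byte per die. Caller frees the result.
 */
static char *build_platform_mapping(const uint8_t *bus_per_die, int num_dies)
{
	const int template_len = 8;	/* strlen("0000:xx,") */
	char *mapping, *buf;
	int die;

	mapping = calloc(num_dies * template_len + 1, 1);
	if (!mapping)
		return NULL;

	buf = mapping;
	for (die = 0; die < num_dies; die++)
		buf += snprintf(buf, template_len + 1, "%04x:%02x,",
				0, bus_per_die[die]);
	*(--buf) = '\0';	/* drop the trailing comma */
	return mapping;
}
```

With per-die buses { 0x00, 0x40, 0x80, 0xc0 } this produces exactly the
"0000:00,0000:40,0000:80,0000:c0" string shown in the comment above.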
>>>> +
>>>>    void skx_uncore_cpu_init(void)
>>>>    {
>>>>           skx_uncore_chabox.num_boxes = skx_count_chabox();
>>>> --
>>>> 2.19.1
>>>>




Thread overview: 14+ messages
2019-11-26 16:36 [PATCH 0/6] perf x86: Exposing IO stack to IO PMON mapping through sysfs roman.sudarikov
2019-11-26 16:36 ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping roman.sudarikov
2019-11-26 16:36   ` [PATCH 2/6] perf tools: Helper functions to enumerate and probe PCI devices roman.sudarikov
2019-11-26 16:36     ` [PATCH 3/6] perf stat: Helper functions for list of IIO devices roman.sudarikov
2019-11-26 16:36       ` [PATCH 4/6] perf stat: New --iiostat mode to provide I/O performance metrics roman.sudarikov
2019-11-26 16:36         ` [PATCH 5/6] perf tools: Add feature check for libpci roman.sudarikov
2019-11-26 16:36           ` [PATCH 6/6] perf stat: Add PCI device name to --iiostat output roman.sudarikov
2019-12-02 14:00   ` [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping Peter Zijlstra
2019-12-02 19:47   ` Stephane Eranian
2019-12-03  3:00     ` Andi Kleen
2019-12-04 18:48     ` Sudarikov, Roman
2019-12-05 18:02       ` Stephane Eranian
2019-12-05 22:28         ` Andi Kleen
2019-12-06 16:08         ` Sudarikov, Roman
