* [PATCH V5 0/6] Intel memory b/w monitoring support
@ 2016-03-01 23:48 Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group Vikas Shivappa
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

The patch series has two preparatory patches for cqm and then four MBM
patches. The patches are based on tip perf/core.
Thanks to Thomas for the feedback on V4; I have tried to implement his
feedback in this version.

Memory bandwidth monitoring (MBM) provides the OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket, and total bandwidth measures the total system bandwidth.
Extending cache quality of service monitoring (CQM), we add four more
events to the perf infrastructure:

intel_cqm_llc/local_bytes - bytes sent through the local socket memory controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
intel_cqm_llc/local_bw - current local bandwidth
intel_cqm_llc/total_bw - current total bandwidth

The tasks are associated with a Resource Monitoring ID (RMID), just as
in cqm, and the OS uses an MSR write to indicate the RMID of the task
during scheduling.
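
As background, a minimal sketch (not part of this series; the helper name
is hypothetical and error handling is omitted) of what that MSR write
amounts to: the RMID sits in the low bits of the IA32_PQR_ASSOC MSR,
which is updated when a monitored task is switched in.

        #define MSR_IA32_PQR_ASSOC      0x0c8f

        /* hypothetical helper: make 'rmid' the active monitoring ID */
        static inline void __set_active_rmid(u32 rmid, u32 closid)
        {
                /* RMID in bits 9:0, CLOSID in the upper 32 bits */
                wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
        }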

Changes in V5:
As per Thomas's feedback, made the changes below:
- Fixed the memory leak and notifier leak in cqm init and made it a
separate patch
- Changed the mbm patch to use topology_max_packages() to count the
maximum number of packages rather than online packages
- Removed the unnecessary out: label and goto in patch 0003
- Fixed the restarting of the timer when the event list is empty
- Fixed the incorrect usage of a mutex in timer context (see the brief
note after this list)
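
(Brief note on that last point: hrtimer callbacks run in hard interrupt
context, so a sleeping lock such as a mutex cannot be taken there. A
minimal sketch of the shape used in patch 6's mbm_hrtimer_handle(), with
the body elided:)

        static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *t)
        {
                unsigned long flags;

                /* mutex_lock() may sleep and is invalid in hardirq context */
                raw_spin_lock_irqsave(&cache_lock, flags);
                /* ... walk cache_groups and update the active MBM samples ... */
                raw_spin_unlock_irqrestore(&cache_lock, flags);

                return HRTIMER_RESTART;
        }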

Changes in v4:
The V4 version of MBM is almost a complete rewrite of the prior
versions, except for some declarations.
That seemed the best way to address all of Thomas's earlier comments.

[PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a
[PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
[PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
[PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
[PATCH 5/6] x86/mbm: RMID Recycling MBM changes
[PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group
  2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
@ 2016-03-01 23:48 ` Vikas Shivappa
  2016-03-07 23:04   ` Peter Zijlstra
  2016-03-01 23:48 ` [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak Vikas Shivappa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

Currently cqm (cache quality of service monitoring) groups all events
belonging to the same PID so that they use one RMID. However, it is not
counting all of these different events, so we end up with a count of
zero for all events other than the group leader. This patch addresses
the issue by keeping a flag in perf_event.hw, which holds the other cqm
related fields. The flag is updated at event creation and during
grouping.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 13 ++++++++++---
 include/linux/perf_event.h                 |  1 +
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index a316ca9..e6be335 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -281,9 +281,13 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
 
 	/*
 	 * Events that target same task are placed into the same cache group.
+	 * Mark it as a multi event group, so that we update ->count
+	 * for every event rather than just the group leader later.
 	 */
-	if (a->hw.target == b->hw.target)
+	if (a->hw.target == b->hw.target) {
+		b->hw.is_group_event = true;
 		return true;
+	}
 
 	/*
 	 * Are we an inherited event?
@@ -849,6 +853,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
 	bool conflict = false;
 	u32 rmid;
 
+	event->hw.is_group_event = false;
 	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
 		rmid = iter->hw.cqm_rmid;
 
@@ -940,7 +945,9 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 		return __perf_event_count(event);
 
 	/*
-	 * Only the group leader gets to report values. This stops us
+	 * Only the group leader gets to report values, except in the case
+	 * of multiple events in the same group, where we still need to
+	 * read the other events. This stops us
 	 * reporting duplicate values to userspace, and gives us a clear
 	 * rule for which task gets to report the values.
 	 *
@@ -948,7 +955,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 	 * specific packages - we forfeit that ability when we create
 	 * task events.
 	 */
-	if (!cqm_group_leader(event))
+	if (!cqm_group_leader(event) && !event->hw.is_group_event)
 		return 0;
 
 	/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5c5a3f..a3ba886 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -121,6 +121,7 @@ struct hw_perf_event {
 		struct { /* intel_cqm */
 			int			cqm_state;
 			u32			cqm_rmid;
+			bool			is_group_event;
 			struct list_head	cqm_events_entry;
 			struct list_head	cqm_groups_entry;
 			struct list_head	cqm_group_entry;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
  2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group Vikas Shivappa
@ 2016-03-01 23:48 ` Vikas Shivappa
  2016-03-02  8:00   ` Thomas Gleixner
  2016-03-02 23:53   ` Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init Vikas Shivappa
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

Fixes the hotcpu notifier leak and a memory leak during cqm(cache
quality of service monitoring) initialization.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e6be335..5666171 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1322,7 +1322,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {
 
 static int __init intel_cqm_init(void)
 {
-	char *str, scale[20];
+	char *str = NULL, scale[20];
 	int i, cpu, ret;
 
 	if (!x86_match_cpu(intel_cqm_match))
@@ -1382,16 +1382,23 @@ static int __init intel_cqm_init(void)
 		cqm_pick_event_reader(i);
 	}
 
-	__perf_cpu_notifier(intel_cqm_cpu_notifier);
-
 	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
-	if (ret)
+	if (ret) {
 		pr_err("Intel CQM perf registration failed: %d\n", ret);
-	else
+		goto out;
+	} else {
 		pr_info("Intel CQM monitoring enabled\n");
+	}
 
+	/*
+	 * Register the hot cpu notifier once we are sure cqm
+	 * is enabled to avoid notifier leak.
+	 */
+	__perf_cpu_notifier(intel_cqm_cpu_notifier);
 out:
 	cpu_notifier_register_done();
+	if (ret)
+		kfree(str);
 
 	return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak Vikas Shivappa
@ 2016-03-01 23:48 ` Vikas Shivappa
  2016-03-02  8:04   ` Thomas Gleixner
  2016-03-02 23:56   ` Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management Vikas Shivappa
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

The MBM init patch enumerates Intel Memory Bandwidth Monitoring (MBM)
and initializes the perf events and data structures for monitoring
memory bandwidth. It is based on the original patch series by Tony Luck
and Kanaka Juvva.

Memory bandwidth monitoring (MBM) provides the OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket, and total bandwidth measures the total system bandwidth.

Extending cache quality of service monitoring (CQM), we add four more
events to the perf infrastructure:
intel_cqm_llc/local_bytes - bytes sent through the local socket memory
controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
intel_cqm_llc/local_bw - current local bandwidth
intel_cqm_llc/total_bw - current total bandwidth

The tasks are associated with a Resource Monitoring ID (RMID), just as
in cqm, and the OS uses an MSR write to indicate the RMID of the task
during scheduling.
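
For illustration, a minimal userspace sketch (assumptions: the PMU is
exposed as 'intel_cqm' under /sys/bus/event_source/devices, the event
encodings are the ones listed above, and error handling is trimmed) that
counts total_bytes for the calling task:

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                struct perf_event_attr attr = { 0 };
                uint64_t count;
                FILE *f;
                int fd;

                /* dynamic PMU type is exported by the perf core in sysfs */
                f = fopen("/sys/bus/event_source/devices/intel_cqm/type", "r");
                if (!f || fscanf(f, "%u", &attr.type) != 1)
                        return 1;
                fclose(f);

                attr.size = sizeof(attr);
                attr.config = 0x02;     /* total_bytes, per the table above */

                /* monitor the calling task on any CPU */
                fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
                if (fd < 0)
                        return 1;

                sleep(1);
                if (read(fd, &count, sizeof(count)) == sizeof(count))
                        printf("total_bytes: %llu\n", (unsigned long long)count);
                return 0;
        }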

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h          |   2 +
 arch/x86/kernel/cpu/common.c               |   4 +-
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 157 ++++++++++++++++++++++++++++-
 3 files changed, 158 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ad8c94..9b4233e 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -241,6 +241,8 @@
 
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
 #define X86_FEATURE_CLZERO	(13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa05680..13af76e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -635,7 +635,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 			cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
 			c->x86_capability[CPUID_F_1_EDX] = edx;
 
-			if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+			if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
+			      ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
+			       (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
 				c->x86_cache_max_rmid = ecx;
 				c->x86_cache_occ_scale = ebx;
 			}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 5666171..cf08a0f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -15,6 +15,7 @@
 
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+static bool cqm_enabled, mbm_enabled;
 
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
@@ -42,6 +43,30 @@ struct intel_pqr_state {
  * interrupts disabled, which is sufficient for the protection.
  */
 static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+/**
+ * struct sample - mbm event's (local or total) data
+ * @interval_start Time this interval began
+ * @interval_bytes #bytes in this interval
+ * @total_bytes    #bytes since we began monitoring
+ * @prev_msr       previous value of MSR
+ * @bandwidth      bytes/sec in previous completed interval
+ */
+struct sample {
+	ktime_t interval_start;
+	u64	interval_bytes;
+	u64	total_bytes;
+	u64	prev_msr;
+	u64	bandwidth;
+};
+
+/*
+ * samples profiled for total memory bandwidth type events
+ */
+static struct sample *mbm_total;
+/*
+ * samples profiled for local memory bandwidth type events
+ */
+static struct sample *mbm_local;
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -1152,6 +1177,28 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
 EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
 EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
 
+EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
+EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
+EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
+EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
+EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
+EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
+EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(total_bw, intel_cqm_total_bw, "event=0x04");
+EVENT_ATTR_STR(total_bw.per-pkg, intel_cqm_total_bw_pkg, "1");
+EVENT_ATTR_STR(total_bw.unit, intel_cqm_total_bw_unit, "MB/sec");
+EVENT_ATTR_STR(total_bw.scale, intel_cqm_total_bw_scale, "1e-6");
+EVENT_ATTR_STR(total_bw.snapshot, intel_cqm_total_bw_snapshot, "1");
+
+EVENT_ATTR_STR(local_bw, intel_cqm_local_bw, "event=0x05");
+EVENT_ATTR_STR(local_bw.per-pkg, intel_cqm_local_bw_pkg, "1");
+EVENT_ATTR_STR(local_bw.unit, intel_cqm_local_bw_unit, "MB/sec");
+EVENT_ATTR_STR(local_bw.scale, intel_cqm_local_bw_scale, "1e-6");
+EVENT_ATTR_STR(local_bw.snapshot, intel_cqm_local_bw_snapshot, "1");
+
 static struct attribute *intel_cqm_events_attr[] = {
 	EVENT_PTR(intel_cqm_llc),
 	EVENT_PTR(intel_cqm_llc_pkg),
@@ -1161,9 +1208,58 @@ static struct attribute *intel_cqm_events_attr[] = {
 	NULL,
 };
 
+static struct attribute *intel_mbm_events_attr[] = {
+	EVENT_PTR(intel_cqm_total_bytes),
+	EVENT_PTR(intel_cqm_local_bytes),
+	EVENT_PTR(intel_cqm_total_bw),
+	EVENT_PTR(intel_cqm_local_bw),
+	EVENT_PTR(intel_cqm_total_bytes_pkg),
+	EVENT_PTR(intel_cqm_local_bytes_pkg),
+	EVENT_PTR(intel_cqm_total_bw_pkg),
+	EVENT_PTR(intel_cqm_local_bw_pkg),
+	EVENT_PTR(intel_cqm_total_bytes_unit),
+	EVENT_PTR(intel_cqm_local_bytes_unit),
+	EVENT_PTR(intel_cqm_total_bw_unit),
+	EVENT_PTR(intel_cqm_local_bw_unit),
+	EVENT_PTR(intel_cqm_total_bytes_scale),
+	EVENT_PTR(intel_cqm_local_bytes_scale),
+	EVENT_PTR(intel_cqm_total_bw_scale),
+	EVENT_PTR(intel_cqm_local_bw_scale),
+	EVENT_PTR(intel_cqm_total_bw_snapshot),
+	EVENT_PTR(intel_cqm_local_bw_snapshot),
+	NULL,
+};
+
+static struct attribute *intel_cmt_mbm_events_attr[] = {
+	EVENT_PTR(intel_cqm_llc),
+	EVENT_PTR(intel_cqm_total_bytes),
+	EVENT_PTR(intel_cqm_local_bytes),
+	EVENT_PTR(intel_cqm_total_bw),
+	EVENT_PTR(intel_cqm_local_bw),
+	EVENT_PTR(intel_cqm_llc_pkg),
+	EVENT_PTR(intel_cqm_total_bytes_pkg),
+	EVENT_PTR(intel_cqm_local_bytes_pkg),
+	EVENT_PTR(intel_cqm_total_bw_pkg),
+	EVENT_PTR(intel_cqm_local_bw_pkg),
+	EVENT_PTR(intel_cqm_llc_unit),
+	EVENT_PTR(intel_cqm_total_bytes_unit),
+	EVENT_PTR(intel_cqm_local_bytes_unit),
+	EVENT_PTR(intel_cqm_total_bw_unit),
+	EVENT_PTR(intel_cqm_local_bw_unit),
+	EVENT_PTR(intel_cqm_llc_scale),
+	EVENT_PTR(intel_cqm_total_bytes_scale),
+	EVENT_PTR(intel_cqm_local_bytes_scale),
+	EVENT_PTR(intel_cqm_total_bw_scale),
+	EVENT_PTR(intel_cqm_local_bw_scale),
+	EVENT_PTR(intel_cqm_llc_snapshot),
+	EVENT_PTR(intel_cqm_total_bw_snapshot),
+	EVENT_PTR(intel_cqm_local_bw_snapshot),
+	NULL,
+};
+
 static struct attribute_group intel_cqm_events_group = {
 	.name = "events",
-	.attrs = intel_cqm_events_attr,
+	.attrs = NULL,
 };
 
 PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1320,12 +1416,47 @@ static const struct x86_cpu_id intel_cqm_match[] = {
 	{}
 };
 
+static const struct x86_cpu_id intel_mbm_local_match[] = {
+	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
+	{}
+};
+
+static const struct x86_cpu_id intel_mbm_total_match[] = {
+	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
+	{}
+};
+
+static int intel_mbm_init(void)
+{
+	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
+
+	array_size = sizeof(struct sample) * maxid * topology_max_packages();
+	mbm_local = kmalloc(array_size, GFP_KERNEL);
+	if (!mbm_local)
+		return -ENOMEM;
+
+	mbm_total = kmalloc(array_size, GFP_KERNEL);
+	if (!mbm_total) {
+		kfree(mbm_local);
+		ret = -ENOMEM;
+	}
+
+	return ret;
+}
+
 static int __init intel_cqm_init(void)
 {
 	char *str = NULL, scale[20];
 	int i, cpu, ret;
 
-	if (!x86_match_cpu(intel_cqm_match))
+	if (x86_match_cpu(intel_cqm_match))
+		cqm_enabled = true;
+
+	if (x86_match_cpu(intel_mbm_local_match) &&
+	     x86_match_cpu(intel_mbm_total_match))
+		mbm_enabled = true;
+
+	if (!cqm_enabled && !mbm_enabled)
 		return -ENODEV;
 
 	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1382,12 +1513,27 @@ static int __init intel_cqm_init(void)
 		cqm_pick_event_reader(i);
 	}
 
+	if (mbm_enabled)
+		ret = intel_mbm_init();
+	if (ret && !cqm_enabled)
+		goto out;
+
+	if (cqm_enabled && mbm_enabled)
+		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
+	else if (!cqm_enabled && mbm_enabled)
+		intel_cqm_events_group.attrs = intel_mbm_events_attr;
+	else if (cqm_enabled && !mbm_enabled)
+		intel_cqm_events_group.attrs = intel_cqm_events_attr;
+
 	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
 	if (ret) {
 		pr_err("Intel CQM perf registration failed: %d\n", ret);
 		goto out;
 	} else {
-		pr_info("Intel CQM monitoring enabled\n");
+		if (cqm_enabled)
+			pr_info("Intel CQM monitoring enabled\n");
+		if (mbm_enabled)
+			pr_info("Intel MBM enabled\n");
 	}
 
 	/*
@@ -1397,8 +1543,11 @@ static int __init intel_cqm_init(void)
 	__perf_cpu_notifier(intel_cqm_cpu_notifier);
 out:
 	cpu_notifier_register_done();
-	if (ret)
+	if (ret) {
+		mbm_enabled = false;
+		cqm_enabled = false;
 		kfree(str);
+	}
 
 	return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
  2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
                   ` (2 preceding siblings ...)
  2016-03-01 23:48 ` [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init Vikas Shivappa
@ 2016-03-01 23:48 ` Vikas Shivappa
  2016-03-07 23:03   ` Peter Zijlstra
  2016-03-01 23:48 ` [PATCH 5/6] x86/mbm: RMID Recycling MBM changes Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling Vikas Shivappa
  5 siblings, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

From: Tony Luck <tony.luck@intel.com>

This includes all the core infrastructure to measure total_bytes and
bandwidth.

We have per-socket counters for both total system-wide L3 external
bytes and local socket memory-controller bytes. The current bandwidth
is calculated for a minimum diff time (time since it was last counted)
of 100ms. The OS writes MSR_IA32_QM_EVTSEL to select the event and
reads the counter from MSR_IA32_QM_CTR, and uses the IA32_PQR_ASSOC
MSR to associate the RMID with the task. The tasks have a common RMID
for cqm (cache quality of service monitoring) and MBM, hence most of
the scheduling code is reused from cqm.

Much of the scheduling code in Tony's original patch was dropped, and a
3-4 line change was added in intel_cqm_event_read(); this change was
made because the timer is no longer added on every context switch.
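
For illustration, a worked instance of the 24-bit counter wrap handling
that update_sample() performs below (the numbers are made up purely for
the example):

        u64 prev_msr = 0xfffff0;   /* raw MSR value at the previous read     */
        u64 val      = 0x000010;   /* new raw MSR value, counter has wrapped */
        u64 bytes;

        if (val < prev_msr)        /* wrapped exactly once since last read */
                bytes = MBM_CNTR_MAX - prev_msr + val + 1;   /* 0x20 counts */
        else
                bytes = val - prev_msr;
        bytes *= cqm_l3_scale;     /* hardware counts -> bytes */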

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 158 ++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index cf08a0f..6638dcc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -13,6 +13,11 @@
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
+/*
+ * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
+ * value
+ */
+#define MBM_CNTR_MAX		0xffffff
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 static bool cqm_enabled, mbm_enabled;
@@ -68,6 +73,16 @@ static struct sample *mbm_total;
  */
 static struct sample *mbm_local;
 
+#define pkg_id	topology_physical_package_id(smp_processor_id())
+/*
+ * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
+ * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
+ * rmids per socket, an example is given below
+ * RMID1 of Socket0:  vrmid =  1
+ * RMID1 of Socket1:  vrmid =  1 * (cqm_max_rmid + 1) + 1
+ * RMID1 of Socket2:  vrmid =  2 * (cqm_max_rmid + 1) + 1
+ */
+#define rmid_2_index(rmid)  ((pkg_id * (cqm_max_rmid + 1)) + rmid)
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
  * Also protects event->hw.cqm_rmid
@@ -91,8 +106,19 @@ static cpumask_t cqm_cpumask;
 #define RMID_VAL_UNAVAIL	(1ULL << 62)
 
 #define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
+/*
+ * MBM Event IDs as defined in SDM section 17.15.5
+ * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
+ */
+enum mbm_evt_type {
+	QOS_MBM_TOTAL_EVENT_ID = 0x02,
+	QOS_MBM_LOCAL_EVENT_ID,
+	QOS_MBM_TOTAL_BW_EVENT_ID,
+	QOS_MBM_LOCAL_BW_EVENT_ID,
+};
 
-#define QOS_EVENT_MASK	QOS_L3_OCCUP_EVENT_ID
+#define QOS_MBM_BW_EVENT_MASK 0x04
+#define QOS_MBM_LOCAL_EVENT_MASK 0x01
 
 /*
  * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -422,9 +448,16 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
 struct rmid_read {
 	u32 rmid;
 	atomic64_t value;
+	enum mbm_evt_type evt_type;
 };
 
 static void __intel_cqm_event_count(void *info);
+static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type);
+
+static bool is_mbm_event(int e)
+{
+	return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_BW_EVENT_ID);
+}
 
 /*
  * Exchange the RMID of a group of events.
@@ -866,6 +899,98 @@ static void intel_cqm_rmid_rotate(struct work_struct *work)
 	schedule_delayed_work(&intel_cqm_rmid_work, delay);
 }
 
+static struct sample *update_sample(unsigned int rmid,
+				    enum mbm_evt_type evt_type, int first)
+{
+	ktime_t cur_time;
+	struct sample *mbm_current;
+	u32 vrmid = rmid_2_index(rmid);
+	u64 val, bytes, diff_time;
+	u32 eventid;
+
+	if (evt_type & QOS_MBM_LOCAL_EVENT_MASK) {
+		mbm_current = &mbm_local[vrmid];
+		eventid     =  QOS_MBM_LOCAL_EVENT_ID;
+	} else {
+		mbm_current = &mbm_total[vrmid];
+		eventid     = QOS_MBM_TOTAL_EVENT_ID;
+	}
+
+	cur_time = ktime_get();
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	rdmsrl(MSR_IA32_QM_CTR, val);
+	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+		return mbm_current;
+	val &= MBM_CNTR_MAX;
+
+	if (first) {
+		mbm_current->interval_start = cur_time;
+		mbm_current->prev_msr = val;
+		mbm_current->total_bytes = 0;
+		mbm_current->interval_bytes = 0;
+		mbm_current->bandwidth = 0;
+		return mbm_current;
+	}
+
+	if (val < mbm_current->prev_msr)
+		bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
+	else
+		bytes = val - mbm_current->prev_msr;
+	bytes *= cqm_l3_scale;
+
+	mbm_current->total_bytes += bytes;
+	mbm_current->interval_bytes += bytes;
+	mbm_current->prev_msr = val;
+	diff_time = ktime_ms_delta(cur_time, mbm_current->interval_start);
+
+	/*
+	 * The b/w measured is really the most recent/current b/w.
+	 * We wait till enough time has passed to avoid
+	 * arithmetic rounding problems. Having it at >=100ms,
+	 * such errors would be <=1%.
+	 */
+	if (diff_time > 100) {
+		bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
+		do_div(bytes, diff_time);
+		mbm_current->bandwidth = bytes;
+		mbm_current->interval_bytes = 0;
+		mbm_current->interval_start = cur_time;
+	}
+
+	return mbm_current;
+}
+
+static u64 rmid_read_mbm(unsigned int rmid, enum mbm_evt_type evt_type)
+{
+	struct sample *mbm_current;
+
+	mbm_current = update_sample(rmid, evt_type, 0);
+
+	if (evt_type & QOS_MBM_BW_EVENT_MASK)
+		return mbm_current->bandwidth;
+	else
+		return mbm_current->total_bytes;
+}
+
+static void __intel_mbm_event_init(void *info)
+{
+	struct rmid_read *rr = info;
+
+	update_sample(rr->rmid, rr->evt_type, 1);
+}
+
+static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type)
+{
+	struct rmid_read rr = {
+		.value = ATOMIC64_INIT(0),
+	};
+
+	rr.rmid = rmid;
+	rr.evt_type = evt_type;
+	/* on each socket, init sample */
+	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
+}
+
 /*
  * Find a group and setup RMID.
  *
@@ -886,6 +1011,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
 			/* All tasks in a group share an RMID */
 			event->hw.cqm_rmid = rmid;
 			*group = iter;
+			if (is_mbm_event(event->attr.config))
+				init_mbm_sample(rmid, event->attr.config);
 			return;
 		}
 
@@ -902,6 +1029,9 @@ static void intel_cqm_setup_event(struct perf_event *event,
 	else
 		rmid = __get_rmid();
 
+	if (is_mbm_event(event->attr.config))
+		init_mbm_sample(rmid, event->attr.config);
+
 	event->hw.cqm_rmid = rmid;
 }
 
@@ -923,7 +1053,10 @@ static void intel_cqm_event_read(struct perf_event *event)
 	if (!__rmid_valid(rmid))
 		goto out;
 
-	val = __rmid_read(rmid);
+	if (is_mbm_event(event->attr.config))
+		val = rmid_read_mbm(rmid, event->attr.config);
+	else
+		val = __rmid_read(rmid);
 
 	/*
 	 * Ignore this reading on error states and do not update the value.
@@ -954,6 +1087,17 @@ static inline bool cqm_group_leader(struct perf_event *event)
 	return !list_empty(&event->hw.cqm_groups_entry);
 }
 
+static void __intel_mbm_event_count(void *info)
+{
+	struct rmid_read *rr = info;
+	u64 val;
+
+	val = rmid_read_mbm(rr->rmid, rr->evt_type);
+	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+		return;
+	atomic64_add(val, &rr->value);
+}
+
 static u64 intel_cqm_event_count(struct perf_event *event)
 {
 	unsigned long flags;
@@ -1007,7 +1151,12 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 	if (!__rmid_valid(rr.rmid))
 		goto out;
 
-	on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+	if (is_mbm_event(event->attr.config)) {
+		rr.evt_type = event->attr.config;
+		on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, &rr, 1);
+	} else {
+		on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+	}
 
 	raw_spin_lock_irqsave(&cache_lock, flags);
 	if (event->hw.cqm_rmid == rr.rmid)
@@ -1122,7 +1271,8 @@ static int intel_cqm_event_init(struct perf_event *event)
 	if (event->attr.type != intel_cqm_pmu.type)
 		return -ENOENT;
 
-	if (event->attr.config & ~QOS_EVENT_MASK)
+	if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
+	     (event->attr.config > QOS_MBM_LOCAL_BW_EVENT_ID))
 		return -EINVAL;
 
 	/* unsupported modes and filters */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/6] x86/mbm: RMID Recycling MBM changes
  2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
                   ` (3 preceding siblings ...)
  2016-03-01 23:48 ` [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management Vikas Shivappa
@ 2016-03-01 23:48 ` Vikas Shivappa
  2016-03-01 23:48 ` [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling Vikas Shivappa
  5 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

An RMID can be allocated or deallocated as part of RMID recycling.
When an RMID is allocated for an mbm event, the mbm counter needs to be
initialized, because the next time we read the counter we need the
previous value to account for the total bytes that went to the memory
controller. Similarly, when an RMID is deallocated we need to update
the ->count variable.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 32 ++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 6638dcc..fa5ec85 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -453,6 +453,7 @@ struct rmid_read {
 
 static void __intel_cqm_event_count(void *info);
 static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type);
+static void __intel_mbm_event_count(void *info);
 
 static bool is_mbm_event(int e)
 {
@@ -479,8 +480,14 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
 			.rmid = old_rmid,
 		};
 
-		on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
-				 &rr, 1);
+		if (is_mbm_event(group->attr.config)) {
+			rr.evt_type = group->attr.config;
+			on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count,
+					 &rr, 1);
+		} else {
+			on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
+					 &rr, 1);
+		}
 		local64_set(&group->count, atomic64_read(&rr.value));
 	}
 
@@ -492,6 +499,22 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
 
 	raw_spin_unlock_irq(&cache_lock);
 
+	/*
+	 * If the allocation is for mbm, init the mbm stats.
+	 * Need to check if each event in the group is mbm event
+	 * because there could be multiple type of events in the same group.
+	 */
+	if (__rmid_valid(rmid)) {
+		event = group;
+		if (is_mbm_event(event->attr.config))
+			init_mbm_sample(rmid, event->attr.config);
+
+		list_for_each_entry(event, head, hw.cqm_group_entry) {
+			if (is_mbm_event(event->attr.config))
+				init_mbm_sample(rmid, event->attr.config);
+		}
+	}
+
 	return old_rmid;
 }
 
@@ -1011,7 +1034,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
 			/* All tasks in a group share an RMID */
 			event->hw.cqm_rmid = rmid;
 			*group = iter;
-			if (is_mbm_event(event->attr.config))
+			if (is_mbm_event(event->attr.config) &&
+			    __rmid_valid(rmid))
 				init_mbm_sample(rmid, event->attr.config);
 			return;
 		}
@@ -1029,7 +1053,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
 	else
 		rmid = __get_rmid();
 
-	if (is_mbm_event(event->attr.config))
+	if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
 		init_mbm_sample(rmid, event->attr.config);
 
 	event->hw.cqm_rmid = rmid;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling
  2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
                   ` (4 preceding siblings ...)
  2016-03-01 23:48 ` [PATCH 5/6] x86/mbm: RMID Recycling MBM changes Vikas Shivappa
@ 2016-03-01 23:48 ` Vikas Shivappa
  2016-03-02 23:58   ` Vikas Shivappa
  5 siblings, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-01 23:48 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

This patch adds a per-package timer which periodically updates the
memory bandwidth counters for the events that are currently active.
The current patch uses a periodic timer of 1s, since the SDM guarantees
that the counter will not overflow within 1s, but this period could
certainly be improved by calibrating on the system. The time to
overflow is really a function of the maximum memory bandwidth the
socket can support, the maximum counter value and the scaling factor.
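
For reference, that function can be written out symbolically (no hardware
numbers assumed) as roughly

        t_overflow ~= (MBM_CNTR_MAX + 1) * cqm_l3_scale / max_socket_bandwidth

i.e. the counter range times the bytes-per-count upscale factor, divided
by the peak bytes/sec the socket can sustain. Any timer period safely
below that, such as the 1s used here, guarantees that update_sample()
sees at most one wrap between two reads of the same RMID.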

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 132 ++++++++++++++++++++++++++++-
 1 file changed, 130 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index fa5ec85..2870fc7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -18,9 +18,15 @@
  * value
  */
 #define MBM_CNTR_MAX		0xffffff
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME	1000
+
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;
 
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
@@ -48,6 +54,7 @@ struct intel_pqr_state {
  * interrupts disabled, which is sufficient for the protection.
  */
 static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
 /**
  * struct sample - mbm event's (local or total) data
  * @interval_start Time this interval began
@@ -1122,6 +1129,84 @@ static void __intel_mbm_event_count(void *info)
 	atomic64_add(val, &rr->value);
 }
 
+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+	struct perf_event *iter, *iter1;
+	int ret = HRTIMER_RESTART;
+	struct list_head *head;
+	unsigned long flags;
+	u32 grp_rmid;
+
+	/*
+	 * Need to cache_lock as the timer Event Select MSR reads
+	 * can race with the mbm/cqm count() and mbm_init() reads.
+	 */
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
+	if (list_empty(&cache_groups)) {
+		ret = HRTIMER_NORESTART;
+		goto out;
+	}
+
+	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+		grp_rmid = iter->hw.cqm_rmid;
+		if (!__rmid_valid(grp_rmid))
+			continue;
+		if (is_mbm_event(iter->attr.config))
+			update_sample(grp_rmid, iter->attr.config, 0);
+
+		head = &iter->hw.cqm_group_entry;
+		if (list_empty(head))
+			continue;
+		list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+			if (!iter1->hw.is_group_event)
+				break;
+			if (is_mbm_event(iter1->attr.config))
+				update_sample(iter1->hw.cqm_rmid,
+					      iter1->attr.config, 0);
+		}
+	}
+
+	hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+	return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+	hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+			     HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+	hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+	on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+	on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+	struct hrtimer *hr;
+	int i;
+
+	for (i = 0; i < mbm_socket_max; i++) {
+		hr = &mbm_timers[i];
+		hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+		hr->function = mbm_hrtimer_handle;
+	}
+}
+
 static u64 intel_cqm_event_count(struct perf_event *event)
 {
 	unsigned long flags;
@@ -1251,8 +1336,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
 static void intel_cqm_event_destroy(struct perf_event *event)
 {
 	struct perf_event *group_other = NULL;
+	unsigned long flags;
 
 	mutex_lock(&cache_mutex);
+	/*
+	* Hold the cache_lock as mbm timer handlers could be
+	* scanning the list of events.
+	*/
+	raw_spin_lock_irqsave(&cache_lock, flags);
 
 	/*
 	 * If there's another event in this group...
@@ -1284,6 +1375,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 		}
 	}
 
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+	/*
+	 * Stop the mbm overflow timers when the last event is destroyed.
+	*/
+	if (mbm_enabled && list_empty(&cache_groups))
+		mbm_stop_timers();
+
 	mutex_unlock(&cache_mutex);
 }
 
@@ -1291,6 +1390,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 {
 	struct perf_event *group = NULL;
 	bool rotate = false;
+	unsigned long flags;
 
 	if (event->attr.type != intel_cqm_pmu.type)
 		return -ENOENT;
@@ -1316,9 +1416,21 @@ static int intel_cqm_event_init(struct perf_event *event)
 
 	mutex_lock(&cache_mutex);
 
+	/*
+	 * Start the mbm overflow timers when the first event is created.
+	*/
+	if (mbm_enabled && list_empty(&cache_groups))
+		mbm_start_timers();
+
 	/* Will also set rmid */
 	intel_cqm_setup_event(event, &group);
 
+	/*
+	 * Hold the cache_lock as mbm timer handlers could be
+	 * scanning the list of events.
+	 */
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
 	if (group) {
 		list_add_tail(&event->hw.cqm_group_entry,
 			      &group->hw.cqm_group_entry);
@@ -1337,6 +1449,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 			rotate = true;
 	}
 
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
 	mutex_unlock(&cache_mutex);
 
 	if (rotate)
@@ -1604,15 +1717,30 @@ static int intel_mbm_init(void)
 {
 	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
 
-	array_size = sizeof(struct sample) * maxid * topology_max_packages();
+	mbm_socket_max = topology_max_packages();
+	array_size = sizeof(struct sample) * maxid * mbm_socket_max;
 	mbm_local = kmalloc(array_size, GFP_KERNEL);
 	if (!mbm_local)
 		return -ENOMEM;
 
 	mbm_total = kmalloc(array_size, GFP_KERNEL);
 	if (!mbm_total) {
-		kfree(mbm_local);
 		ret = -ENOMEM;
+		goto out;
+	}
+
+	array_size = sizeof(struct hrtimer) * mbm_socket_max;
+	mbm_timers = kmalloc(array_size, GFP_KERNEL);
+	if (!mbm_timers) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	mbm_hrtimer_init();
+
+out:
+	if (ret) {
+		kfree(mbm_local);
+		kfree(mbm_total);
 	}
 
 	return ret;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
  2016-03-01 23:48 ` [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak Vikas Shivappa
@ 2016-03-02  8:00   ` Thomas Gleixner
  2016-03-02 17:58     ` Vikas Shivappa
  2016-03-02 23:53   ` Vikas Shivappa
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2016-03-02  8:00 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Tue, 1 Mar 2016, Vikas Shivappa wrote:

> Fixes the hotcpu notifier leak and a memory leak during cqm(cache
> quality of service monitoring) initialization.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_cqm.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> index e6be335..5666171 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> @@ -1322,7 +1322,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>  
>  static int __init intel_cqm_init(void)
>  {
> -	char *str, scale[20];
> +	char *str = NULL, scale[20];
>  	int i, cpu, ret;
>  
>  	if (!x86_match_cpu(intel_cqm_match))
> @@ -1382,16 +1382,23 @@ static int __init intel_cqm_init(void)
>  		cqm_pick_event_reader(i);
>  	}
>  
> -	__perf_cpu_notifier(intel_cqm_cpu_notifier);
> -
>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> -	if (ret)
> +	if (ret) {
>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
> -	else
> +		goto out;
> +	} else {
>  		pr_info("Intel CQM monitoring enabled\n");
> +	}
>  
> +	/*
> +	 * Register the hot cpu notifier once we are sure cqm
> +	 * is enabled to avoid notifier leak.
> +	 */
> +	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>  out:
>  	cpu_notifier_register_done();
> +	if (ret)
> +		kfree(str);

This still leaks cqm_rmid_ptrs ....

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-01 23:48 ` [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init Vikas Shivappa
@ 2016-03-02  8:04   ` Thomas Gleixner
  2016-03-02 17:59     ` Vikas Shivappa
  2016-03-02 23:56   ` Vikas Shivappa
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2016-03-02  8:04 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Tue, 1 Mar 2016, Vikas Shivappa wrote:
> @@ -1397,8 +1543,11 @@ static int __init intel_cqm_init(void)
>  	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>  out:
>  	cpu_notifier_register_done();
> -	if (ret)
> +	if (ret) {
> +		mbm_enabled = false;
> +		cqm_enabled = false;
>  		kfree(str);


Leaks mbm_local and mbm_total ....

> +	}
>  
>  	return ret;
>  }
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
  2016-03-02  8:00   ` Thomas Gleixner
@ 2016-03-02 17:58     ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-02 17:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Wed, 2 Mar 2016, Thomas Gleixner wrote:

> On Tue, 1 Mar 2016, Vikas Shivappa wrote:
>
>> Fixes the hotcpu notifier leak and a memory leak during cqm(cache
>> quality of service monitoring) initialization.
>>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>> ---
>>  arch/x86/kernel/cpu/perf_event_intel_cqm.c | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> index e6be335..5666171 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> @@ -1322,7 +1322,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>>
>>  static int __init intel_cqm_init(void)
>>  {
>> -	char *str, scale[20];
>> +	char *str = NULL, scale[20];
>>  	int i, cpu, ret;
>>
>>  	if (!x86_match_cpu(intel_cqm_match))
>> @@ -1382,16 +1382,23 @@ static int __init intel_cqm_init(void)
>>  		cqm_pick_event_reader(i);
>>  	}
>>
>> -	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>> -
>>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>> -	if (ret)
>> +	if (ret) {
>>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
>> -	else
>> +		goto out;
>> +	} else {
>>  		pr_info("Intel CQM monitoring enabled\n");
>> +	}
>>
>> +	/*
>> +	 * Register the hot cpu notifier once we are sure cqm
>> +	 * is enabled to avoid notifier leak.
>> +	 */
>> +	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>>  out:
>>  	cpu_notifier_register_done();
>> +	if (ret)
>> +		kfree(str);
>
> This still leaks cqm_rmid_ptrs ....
>

That's right. I thought there must be something more than the str leak I
fixed, but missed this. Will fix.

Thanks,
Vikas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-02  8:04   ` Thomas Gleixner
@ 2016-03-02 17:59     ` Vikas Shivappa
  2016-03-02 21:31       ` Vikas Shivappa
  0 siblings, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-02 17:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Wed, 2 Mar 2016, Thomas Gleixner wrote:

> On Tue, 1 Mar 2016, Vikas Shivappa wrote:
>> @@ -1397,8 +1543,11 @@ static int __init intel_cqm_init(void)
>>  	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>>  out:
>>  	cpu_notifier_register_done();
>> -	if (ret)
>> +	if (ret) {
>> +		mbm_enabled = false;
>> +		cqm_enabled = false;
>>  		kfree(str);
>
>
> Leaks mbm_local and mbm_total ....

Will fix. Thanks for pointing it out. I missed the ones that are allocated
at the next level of calls from the init. Will do a check on all the
globals as well.

Vikas

>
>> +	}
>>
>>  	return ret;
>>  }
>> --
>> 1.9.1
>>
>>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-02 17:59     ` Vikas Shivappa
@ 2016-03-02 21:31       ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-02 21:31 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Thomas Gleixner, Vikas Shivappa, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Wed, 2 Mar 2016, Vikas Shivappa wrote:

>
>
> On Wed, 2 Mar 2016, Thomas Gleixner wrote:
>
>> Leaks mbm_local and mbm_total ....
>
> Will fix. Thanks for pointing out. I missed the ones which are done at the 
> next level of calls from the init. Will do a check on all the globals as 
> well.
>
> Vikas

Looks like mbm_timers (declared in patch 06) is leaked in cqm_init; will fix
that as well.

>
>> 
>>> +	}
>>>
>>>  	return ret;
>>>  }
>>> --
>>> 1.9.1
>>> 
>>> 
>> 
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
  2016-03-01 23:48 ` [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak Vikas Shivappa
  2016-03-02  8:00   ` Thomas Gleixner
@ 2016-03-02 23:53   ` Vikas Shivappa
  2016-03-08  9:22     ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-02 23:53 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

Fixes the hotcpu notifier leak and other global variable memory leaks
during cqm (cache quality of service monitoring) initialization.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---

Fixed the memory leak for cqm_rmid_ptrs as per Thomas's feedback.
  
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e6be335..37a93fa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -250,10 +250,13 @@ static int intel_cqm_setup_rmid_cache(void)
 
 	return 0;
 fail:
-	while (r--)
+	while (r--) {
 		kfree(cqm_rmid_ptrs[r]);
+		cqm_rmid_ptrs[r] = NULL;
+	}
 
 	kfree(cqm_rmid_ptrs);
+	cqm_rmid_ptrs = NULL;
 	return -ENOMEM;
 }
 
@@ -1320,9 +1323,19 @@ static const struct x86_cpu_id intel_cqm_match[] = {
 	{}
 };
 
+static void cqm_cleanup(void)
+{
+	int r = cqm_max_rmid;
+
+	while (r--)
+		kfree(cqm_rmid_ptrs[r]);
+
+	kfree(cqm_rmid_ptrs);
+}
+
 static int __init intel_cqm_init(void)
 {
-	char *str, scale[20];
+	char *str = NULL, scale[20];
 	int i, cpu, ret;
 
 	if (!x86_match_cpu(intel_cqm_match))
@@ -1382,16 +1395,25 @@ static int __init intel_cqm_init(void)
 		cqm_pick_event_reader(i);
 	}
 
-	__perf_cpu_notifier(intel_cqm_cpu_notifier);
-
 	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
-	if (ret)
+	if (ret) {
 		pr_err("Intel CQM perf registration failed: %d\n", ret);
-	else
+		goto out;
+	} else {
 		pr_info("Intel CQM monitoring enabled\n");
+	}
 
+	/*
+	 * Register the hot cpu notifier once we are sure cqm
+	 * is enabled to avoid notifier leak.
+	 */
+	__perf_cpu_notifier(intel_cqm_cpu_notifier);
 out:
 	cpu_notifier_register_done();
+	if (ret) {
+		kfree(str);
+		cqm_cleanup();
+	}
 
 	return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-01 23:48 ` [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init Vikas Shivappa
  2016-03-02  8:04   ` Thomas Gleixner
@ 2016-03-02 23:56   ` Vikas Shivappa
  2016-03-03  7:35     ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-02 23:56 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

The MBM init patch enumerates Intel Memory Bandwidth Monitoring (MBM)
and initializes the perf events and data structures for monitoring
memory bandwidth. It is based on the original patch series by Tony Luck
and Kanaka Juvva.

Memory bandwidth monitoring (MBM) provides the OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket, and total bandwidth measures the total system bandwidth.

Extending cache quality of service monitoring (CQM), we add four more
events to the perf infrastructure:
intel_cqm_llc/local_bytes - bytes sent through the local socket memory
controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
intel_cqm_llc/local_bw - current local bandwidth
intel_cqm_llc/total_bw - current total bandwidth

The tasks are associated with a Resource Monitoring ID (RMID), just as
in cqm, and the OS uses an MSR write to indicate the RMID of the task
during scheduling.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---

Fixed the mbm_local and mbm_total leaks as per Thomas's feedback.

 arch/x86/include/asm/cpufeature.h          |   2 +
 arch/x86/kernel/cpu/common.c               |   4 +-
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 157 ++++++++++++++++++++++++++++-
 3 files changed, 159 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ad8c94..9b4233e 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -241,6 +241,8 @@
 
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
 #define X86_FEATURE_CLZERO	(13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa05680..13af76e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -635,7 +635,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 			cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
 			c->x86_capability[CPUID_F_1_EDX] = edx;
 
-			if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+			if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
+			      ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
+			       (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
 				c->x86_cache_max_rmid = ecx;
 				c->x86_cache_occ_scale = ebx;
 			}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 37a93fa..5da415d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -15,6 +15,7 @@
 
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+static bool cqm_enabled, mbm_enabled;
 
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
@@ -42,6 +43,30 @@ struct intel_pqr_state {
  * interrupts disabled, which is sufficient for the protection.
  */
 static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+/**
+ * struct sample - mbm event's (local or total) data
+ * @interval_start Time this interval began
+ * @interval_bytes #bytes in this interval
+ * @total_bytes    #bytes since we began monitoring
+ * @prev_msr       previous value of MSR
+ * @bandwidth      bytes/sec in previous completed interval
+ */
+struct sample {
+	ktime_t interval_start;
+	u64	interval_bytes;
+	u64	total_bytes;
+	u64	prev_msr;
+	u64	bandwidth;
+};
+
+/*
+ * samples profiled for total memory bandwidth type events
+ */
+static struct sample *mbm_total;
+/*
+ * samples profiled for local memory bandwidth type events
+ */
+static struct sample *mbm_local;
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -1155,6 +1180,28 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
 EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
 EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
 
+EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
+EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
+EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
+EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
+EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
+EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
+EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(total_bw, intel_cqm_total_bw, "event=0x04");
+EVENT_ATTR_STR(total_bw.per-pkg, intel_cqm_total_bw_pkg, "1");
+EVENT_ATTR_STR(total_bw.unit, intel_cqm_total_bw_unit, "MB/sec");
+EVENT_ATTR_STR(total_bw.scale, intel_cqm_total_bw_scale, "1e-6");
+EVENT_ATTR_STR(total_bw.snapshot, intel_cqm_total_bw_snapshot, "1");
+
+EVENT_ATTR_STR(local_bw, intel_cqm_local_bw, "event=0x05");
+EVENT_ATTR_STR(local_bw.per-pkg, intel_cqm_local_bw_pkg, "1");
+EVENT_ATTR_STR(local_bw.unit, intel_cqm_local_bw_unit, "MB/sec");
+EVENT_ATTR_STR(local_bw.scale, intel_cqm_local_bw_scale, "1e-6");
+EVENT_ATTR_STR(local_bw.snapshot, intel_cqm_local_bw_snapshot, "1");
+
 static struct attribute *intel_cqm_events_attr[] = {
 	EVENT_PTR(intel_cqm_llc),
 	EVENT_PTR(intel_cqm_llc_pkg),
@@ -1164,9 +1211,58 @@ static struct attribute *intel_cqm_events_attr[] = {
 	NULL,
 };
 
+static struct attribute *intel_mbm_events_attr[] = {
+	EVENT_PTR(intel_cqm_total_bytes),
+	EVENT_PTR(intel_cqm_local_bytes),
+	EVENT_PTR(intel_cqm_total_bw),
+	EVENT_PTR(intel_cqm_local_bw),
+	EVENT_PTR(intel_cqm_total_bytes_pkg),
+	EVENT_PTR(intel_cqm_local_bytes_pkg),
+	EVENT_PTR(intel_cqm_total_bw_pkg),
+	EVENT_PTR(intel_cqm_local_bw_pkg),
+	EVENT_PTR(intel_cqm_total_bytes_unit),
+	EVENT_PTR(intel_cqm_local_bytes_unit),
+	EVENT_PTR(intel_cqm_total_bw_unit),
+	EVENT_PTR(intel_cqm_local_bw_unit),
+	EVENT_PTR(intel_cqm_total_bytes_scale),
+	EVENT_PTR(intel_cqm_local_bytes_scale),
+	EVENT_PTR(intel_cqm_total_bw_scale),
+	EVENT_PTR(intel_cqm_local_bw_scale),
+	EVENT_PTR(intel_cqm_total_bw_snapshot),
+	EVENT_PTR(intel_cqm_local_bw_snapshot),
+	NULL,
+};
+
+static struct attribute *intel_cmt_mbm_events_attr[] = {
+	EVENT_PTR(intel_cqm_llc),
+	EVENT_PTR(intel_cqm_total_bytes),
+	EVENT_PTR(intel_cqm_local_bytes),
+	EVENT_PTR(intel_cqm_total_bw),
+	EVENT_PTR(intel_cqm_local_bw),
+	EVENT_PTR(intel_cqm_llc_pkg),
+	EVENT_PTR(intel_cqm_total_bytes_pkg),
+	EVENT_PTR(intel_cqm_local_bytes_pkg),
+	EVENT_PTR(intel_cqm_total_bw_pkg),
+	EVENT_PTR(intel_cqm_local_bw_pkg),
+	EVENT_PTR(intel_cqm_llc_unit),
+	EVENT_PTR(intel_cqm_total_bytes_unit),
+	EVENT_PTR(intel_cqm_local_bytes_unit),
+	EVENT_PTR(intel_cqm_total_bw_unit),
+	EVENT_PTR(intel_cqm_local_bw_unit),
+	EVENT_PTR(intel_cqm_llc_scale),
+	EVENT_PTR(intel_cqm_total_bytes_scale),
+	EVENT_PTR(intel_cqm_local_bytes_scale),
+	EVENT_PTR(intel_cqm_total_bw_scale),
+	EVENT_PTR(intel_cqm_local_bw_scale),
+	EVENT_PTR(intel_cqm_llc_snapshot),
+	EVENT_PTR(intel_cqm_total_bw_snapshot),
+	EVENT_PTR(intel_cqm_local_bw_snapshot),
+	NULL,
+};
+
 static struct attribute_group intel_cqm_events_group = {
 	.name = "events",
-	.attrs = intel_cqm_events_attr,
+	.attrs = NULL,
 };
 
 PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
 		kfree(cqm_rmid_ptrs[r]);
 
 	kfree(cqm_rmid_ptrs);
+	kfree(mbm_local);
+	kfree(mbm_total);
+	mbm_enabled = false;
+	cqm_enabled = false;
+}
+
+static const struct x86_cpu_id intel_mbm_local_match[] = {
+	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
+	{}
+};
+
+static const struct x86_cpu_id intel_mbm_total_match[] = {
+	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
+	{}
+};
+
+static int intel_mbm_init(void)
+{
+	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
+
+	array_size = sizeof(struct sample) * maxid * topology_max_packages();
+	mbm_local = kmalloc(array_size, GFP_KERNEL);
+	if (!mbm_local)
+		return -ENOMEM;
+
+	mbm_total = kmalloc(array_size, GFP_KERNEL);
+	if (!mbm_total) {
+		kfree(mbm_local);
+		mbm_local = NULL;
+		ret = -ENOMEM;
+	}
+
+	return ret;
 }
 
 static int __init intel_cqm_init(void)
@@ -1338,7 +1467,14 @@ static int __init intel_cqm_init(void)
 	char *str = NULL, scale[20];
 	int i, cpu, ret;
 
-	if (!x86_match_cpu(intel_cqm_match))
+	if (x86_match_cpu(intel_cqm_match))
+		cqm_enabled = true;
+
+	if (x86_match_cpu(intel_mbm_local_match) &&
+	     x86_match_cpu(intel_mbm_total_match))
+		mbm_enabled = true;
+
+	if (!cqm_enabled && !mbm_enabled)
 		return -ENODEV;
 
 	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1395,12 +1531,27 @@ static int __init intel_cqm_init(void)
 		cqm_pick_event_reader(i);
 	}
 
+	if (mbm_enabled)
+		ret = intel_mbm_init();
+	if (ret && !cqm_enabled)
+		goto out;
+
+	if (cqm_enabled && mbm_enabled)
+		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
+	else if (!cqm_enabled && mbm_enabled)
+		intel_cqm_events_group.attrs = intel_mbm_events_attr;
+	else if (cqm_enabled && !mbm_enabled)
+		intel_cqm_events_group.attrs = intel_cqm_events_attr;
+
 	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
 	if (ret) {
 		pr_err("Intel CQM perf registration failed: %d\n", ret);
 		goto out;
 	} else {
-		pr_info("Intel CQM monitoring enabled\n");
+		if (cqm_enabled)
+			pr_info("Intel CQM monitoring enabled\n");
+		if (mbm_enabled)
+			pr_info("Intel MBM enabled\n");
 	}
 
 	/*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling
  2016-03-01 23:48 ` [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling Vikas Shivappa
@ 2016-03-02 23:58   ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-02 23:58 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: linux-kernel, x86, hpa, tglx, mingo, peterz, ravi.v.shankar,
	tony.luck, fenghua.yu, h.peter.anvin

This patch adds a per-package timer which periodically updates the
memory bandwidth counters for the events that are currently active.
The current patch uses a 1s periodic timer, since the SDM guarantees
that the counters will not overflow within 1s, but this interval could
certainly be improved by calibrating it on the system: the time to
overflow is really a function of the maximum memory b/w the socket can
support, the maximum counter value and the scaling factor.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---

Fixed mbm_timers leak in the intel_cqm_init function.
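
As an illustration of the calibration mentioned in the changelog (not
part of this patch -- the helper name and the peak-bandwidth input are
made up for the example), the wrap time follows directly from the
counter range, the scale factor and the peak socket bandwidth:

	/*
	 * Illustrative sketch only: derive a conservative polling period
	 * from the 24-bit counter range (MBM_CNTR_MAX), the per-unit scale
	 * factor (cqm_l3_scale) and an assumed peak socket bandwidth.
	 * Needs <linux/math64.h> for div64_u64().
	 */
	static u64 mbm_safe_period_ms(u64 peak_bw_bytes_per_sec,
				      unsigned int scale)
	{
		u64 wrap_bytes = (u64)MBM_CNTR_MAX * scale;

		/* time in ms for the counter to wrap at peak bandwidth */
		return div64_u64(wrap_bytes * MSEC_PER_SEC,
				 peak_bw_bytes_per_sec);
	}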

 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 134 ++++++++++++++++++++++++++++-
 1 file changed, 132 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e4b520f..0b3a9ac 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -18,9 +18,15 @@
  * value
  */
 #define MBM_CNTR_MAX		0xffffff
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME	1000
+
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;
 
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
@@ -48,6 +54,7 @@ struct intel_pqr_state {
  * interrupts disabled, which is sufficient for the protection.
  */
 static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
 /**
  * struct sample - mbm event's (local or total) data
  * @interval_start Time this interval began
@@ -1125,6 +1132,84 @@ static void __intel_mbm_event_count(void *info)
 	atomic64_add(val, &rr->value);
 }
 
+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+	struct perf_event *iter, *iter1;
+	int ret = HRTIMER_RESTART;
+	struct list_head *head;
+	unsigned long flags;
+	u32 grp_rmid;
+
+	/*
+	 * Need to hold the cache_lock as the timer Event Select MSR reads
+	 * can race with the mbm/cqm count() and mbm_init() reads.
+	 */
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
+	if (list_empty(&cache_groups)) {
+		ret = HRTIMER_NORESTART;
+		goto out;
+	}
+
+	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+		grp_rmid = iter->hw.cqm_rmid;
+		if (!__rmid_valid(grp_rmid))
+			continue;
+		if (is_mbm_event(iter->attr.config))
+			update_sample(grp_rmid, iter->attr.config, 0);
+
+		head = &iter->hw.cqm_group_entry;
+		if (list_empty(head))
+			continue;
+		list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+			if (!iter1->hw.is_group_event)
+				break;
+			if (is_mbm_event(iter1->attr.config))
+				update_sample(iter1->hw.cqm_rmid,
+					      iter1->attr.config, 0);
+		}
+	}
+
+	hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+	return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+	hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+			     HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+	hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+	on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+	on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+	struct hrtimer *hr;
+	int i;
+
+	for (i = 0; i < mbm_socket_max; i++) {
+		hr = &mbm_timers[i];
+		hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+		hr->function = mbm_hrtimer_handle;
+	}
+}
+
 static u64 intel_cqm_event_count(struct perf_event *event)
 {
 	unsigned long flags;
@@ -1254,8 +1339,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
 static void intel_cqm_event_destroy(struct perf_event *event)
 {
 	struct perf_event *group_other = NULL;
+	unsigned long flags;
 
 	mutex_lock(&cache_mutex);
+	/*
+	* Hold the cache_lock as mbm timer handlers could be
+	* scanning the list of events.
+	*/
+	raw_spin_lock_irqsave(&cache_lock, flags);
 
 	/*
 	 * If there's another event in this group...
@@ -1287,6 +1378,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 		}
 	}
 
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+	/*
+	 * Stop the mbm overflow timers when the last event is destroyed.
+	*/
+	if (mbm_enabled && list_empty(&cache_groups))
+		mbm_stop_timers();
+
 	mutex_unlock(&cache_mutex);
 }
 
@@ -1294,6 +1393,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 {
 	struct perf_event *group = NULL;
 	bool rotate = false;
+	unsigned long flags;
 
 	if (event->attr.type != intel_cqm_pmu.type)
 		return -ENOENT;
@@ -1319,9 +1419,21 @@ static int intel_cqm_event_init(struct perf_event *event)
 
 	mutex_lock(&cache_mutex);
 
+	/*
+	 * Start the mbm overflow timers when the first event is created.
+	*/
+	if (mbm_enabled && list_empty(&cache_groups))
+		mbm_start_timers();
+
 	/* Will also set rmid */
 	intel_cqm_setup_event(event, &group);
 
+	/*
+	* Hold the cache_lock as mbm timer handlers could be
+	* scanning the list of events.
+	*/
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
 	if (group) {
 		list_add_tail(&event->hw.cqm_group_entry,
 			      &group->hw.cqm_group_entry);
@@ -1340,6 +1452,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 			rotate = true;
 	}
 
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
 	mutex_unlock(&cache_mutex);
 
 	if (rotate)
@@ -1603,6 +1716,7 @@ static void cqm_cleanup(void)
 	kfree(cqm_rmid_ptrs);
 	kfree(mbm_local);
 	kfree(mbm_total);
+	kfree(mbm_timers);
 	mbm_enabled = false;
 	cqm_enabled = false;
 }
@@ -1621,16 +1735,32 @@ static int intel_mbm_init(void)
 {
 	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
 
-	array_size = sizeof(struct sample) * maxid * topology_max_packages();
+	mbm_socket_max = topology_max_packages();
+	array_size = sizeof(struct sample) * maxid * mbm_socket_max;
 	mbm_local = kmalloc(array_size, GFP_KERNEL);
 	if (!mbm_local)
 		return -ENOMEM;
 
 	mbm_total = kmalloc(array_size, GFP_KERNEL);
 	if (!mbm_total) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	array_size = sizeof(struct hrtimer) * mbm_socket_max;
+	mbm_timers = kmalloc(array_size, GFP_KERNEL);
+	if (!mbm_timers) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	mbm_hrtimer_init();
+
+out:
+	if (ret) {
 		kfree(mbm_local);
 		mbm_local = NULL;
-		ret = -ENOMEM;
+		kfree(mbm_total);
+		mbm_total = NULL;
 	}
 
 	return ret;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-02 23:56   ` Vikas Shivappa
@ 2016-03-03  7:35     ` Thomas Gleixner
  2016-03-03 18:26       ` Vikas Shivappa
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2016-03-03  7:35 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Wed, 2 Mar 2016, Vikas Shivappa wrote:
> +	if (cqm_enabled && mbm_enabled)
> +		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
> +	else if (!cqm_enabled && mbm_enabled)
> +		intel_cqm_events_group.attrs = intel_mbm_events_attr;
> +	else if (cqm_enabled && !mbm_enabled)
> +		intel_cqm_events_group.attrs = intel_cqm_events_attr;
> +
>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>  	if (ret) {
>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
>  		goto out;

So what cleans up mbm_local and mbm_total in that case?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-03  7:35     ` Thomas Gleixner
@ 2016-03-03 18:26       ` Vikas Shivappa
  2016-03-03 18:37         ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-03 18:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Wed, 2 Mar 2016, Thomas Gleixner wrote:

> On Wed, 2 Mar 2016, Vikas Shivappa wrote:
>> +	if (cqm_enabled && mbm_enabled)
>> +		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
>> +	else if (!cqm_enabled && mbm_enabled)
>> +		intel_cqm_events_group.attrs = intel_mbm_events_attr;
>> +	else if (cqm_enabled && !mbm_enabled)
>> +		intel_cqm_events_group.attrs = intel_cqm_events_attr;
>> +
>>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>>  	if (ret) {
>>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
>>  		goto out;
>
> So what cleans up mbm_local and mbm_total in that case?

I put all the cleanup code in the cqm_cleanup .. - please see copy below

@@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
                 kfree(cqm_rmid_ptrs[r]);

         kfree(cqm_rmid_ptrs);
+       kfree(mbm_local);
+       kfree(mbm_total);
+       mbm_enabled = false;
+       cqm_enabled = false;
+}

Thanks,
Vikas


>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-03 18:26       ` Vikas Shivappa
@ 2016-03-03 18:37         ` Thomas Gleixner
  2016-03-08  9:25           ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2016-03-03 18:37 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Thu, 3 Mar 2016, Vikas Shivappa wrote:

> 
> 
> On Wed, 2 Mar 2016, Thomas Gleixner wrote:
> 
> > On Wed, 2 Mar 2016, Vikas Shivappa wrote:
> > > +	if (cqm_enabled && mbm_enabled)
> > > +		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
> > > +	else if (!cqm_enabled && mbm_enabled)
> > > +		intel_cqm_events_group.attrs = intel_mbm_events_attr;
> > > +	else if (cqm_enabled && !mbm_enabled)
> > > +		intel_cqm_events_group.attrs = intel_cqm_events_attr;
> > > +
> > >  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> > >  	if (ret) {
> > >  		pr_err("Intel CQM perf registration failed: %d\n", ret);
> > >  		goto out;
> > 
> > So what cleans up mbm_local and mbm_total in that case?
> 
> I put all the cleanup code in the cqm_cleanup .. - please see copy below

Ok, missed that.

> @@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
>                 kfree(cqm_rmid_ptrs[r]);
> 
>         kfree(cqm_rmid_ptrs);
> +       kfree(mbm_local);
> +       kfree(mbm_total);
> +       mbm_enabled = false;
> +       cqm_enabled = false;
> +}
> 
> Thanks,
> Vikas
> 
> 
> > 
> > Thanks,
> > 
> > 	tglx
> > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
  2016-03-01 23:48 ` [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management Vikas Shivappa
@ 2016-03-07 23:03   ` Peter Zijlstra
  2016-03-07 23:27     ` Luck, Tony
  2016-03-10 22:49     ` Vikas Shivappa
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2016-03-07 23:03 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, linux-kernel, x86, hpa, tglx, mingo,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Tue, Mar 01, 2016 at 03:48:26PM -0800, Vikas Shivappa wrote:

> Lot of the scheduling code was taken out from Tony's patch and a 3-4
> lines of change were added in the intel_cqm_event_read. Since the timer
> is no more added on every context switch this change was made.

Is this here to confuse people or is there some actual information in
it?

> +/*
> + * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
> + * value
> + */
> +#define MBM_CNTR_MAX		0xffffff

#define MBM_CNTR_WIDTH	24
#define MBM_CNTR_MAX	((1U << MBM_CNTR_WIDTH) - 1)


>  #define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
> +/*
> + * MBM Event IDs as defined in SDM section 17.15.5
> + * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
> + */
> +enum mbm_evt_type {
> +	QOS_MBM_TOTAL_EVENT_ID = 0x02,
> +	QOS_MBM_LOCAL_EVENT_ID,
> +	QOS_MBM_TOTAL_BW_EVENT_ID,
> +	QOS_MBM_LOCAL_BW_EVENT_ID,
> +};

QOS_L3_*_EVENT_ID is a define, these are an enum. Rather inconsistent.

>  struct rmid_read {
>  	u32 rmid;

Hole, you could've filled with the enum (which ends up being an int I
think).

>  	atomic64_t value;
> +	enum mbm_evt_type evt_type;
>  };

> +static bool is_mbm_event(int e)

You had an enum type, you might as well use it.

> +{
> +	return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_BW_EVENT_ID);
> +}
>  

> +static struct sample *update_sample(unsigned int rmid,
> +				    enum mbm_evt_type evt_type, int first)
> +{
> +	ktime_t cur_time;
> +	struct sample *mbm_current;
> +	u32 vrmid = rmid_2_index(rmid);
> +	u64 val, bytes, diff_time;
> +	u32 eventid;
> +
> +	if (evt_type & QOS_MBM_LOCAL_EVENT_MASK) {
> +		mbm_current = &mbm_local[vrmid];
> +		eventid     =  QOS_MBM_LOCAL_EVENT_ID;
> +	} else {
> +		mbm_current = &mbm_total[vrmid];
> +		eventid     = QOS_MBM_TOTAL_EVENT_ID;
> +	}
> +
> +	cur_time = ktime_get();
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	rdmsrl(MSR_IA32_QM_CTR, val);
> +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> +		return mbm_current;

> +	val &= MBM_CNTR_MAX;

> +	if (val < mbm_current->prev_msr)
> +		bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
> +	else
> +		bytes = val - mbm_current->prev_msr;

Would not something like:

	shift = 64 - MBM_CNTR_WIDTH;

	bytes = (val << shift) - (prev << shift);
	bytes >>= shift;

be less obtuse? (and consistent with how every other perf update
function does it).

What guarantee is there we didn't wrap multiple times? Doesn't that
deserve a comment?
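
Spelled out as a stand-alone helper, that scheme looks roughly like
this (a sketch rather than the driver's code, using the MBM_CNTR_WIDTH
define suggested above; it assumes the counter wrapped at most once
between reads):

	#define MBM_CNTR_WIDTH	24

	/*
	 * Delta between two reads of a 24-bit free-running counter.
	 * Shifting both values to the top of the u64 makes the subtraction
	 * wrap naturally, so no explicit "val < prev" check is needed.
	 */
	static u64 mbm_cntr_delta(u64 val, u64 prev)
	{
		unsigned int shift = 64 - MBM_CNTR_WIDTH;

		return ((val << shift) - (prev << shift)) >> shift;
	}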

> +	bytes *= cqm_l3_scale;
> +
> +	mbm_current->total_bytes += bytes;
> +	mbm_current->interval_bytes += bytes;
> +	mbm_current->prev_msr = val;
> +	diff_time = ktime_ms_delta(cur_time, mbm_current->interval_start);

Here we do a / 1e6

> +
> +	/*
> +	 * The b/w measured is really the most recent/current b/w.
> +	 * We wait till enough time has passed to avoid
> +	 * arthmetic rounding problems.Having it at >=100ms,
> +	 * such errors would be <=1%.
> +	 */
> +	if (diff_time > 100) {

This could well be > 100e6 instead, avoiding the above division most of
the time.

> +		bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
> +		do_div(bytes, diff_time);
> +		mbm_current->bandwidth = bytes;
> +		mbm_current->interval_bytes = 0;
> +		mbm_current->interval_start = cur_time;
> +	}
> +
> +	return mbm_current;
> +}

How does the above time tracking deal with the event not actually having
been scheduled the whole time?


> +static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type)
> +{
> +	struct rmid_read rr = {
> +		.value = ATOMIC64_INIT(0),
> +	};
> +
> +	rr.rmid = rmid;
> +	rr.evt_type = evt_type;

That's just sad.. put those two in the struct init as well.

> +	/* on each socket, init sample */
> +	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
> +}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group
  2016-03-01 23:48 ` [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group Vikas Shivappa
@ 2016-03-07 23:04   ` Peter Zijlstra
  2016-03-10  0:18     ` Vikas Shivappa
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2016-03-07 23:04 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, linux-kernel, x86, hpa, tglx, mingo,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Tue, Mar 01, 2016 at 03:48:23PM -0800, Vikas Shivappa wrote:
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -121,6 +121,7 @@ struct hw_perf_event {
>  		struct { /* intel_cqm */
>  			int			cqm_state;
>  			u32			cqm_rmid;
> +			bool			is_group_event;
>  			struct list_head	cqm_events_entry;
>  			struct list_head	cqm_groups_entry;
>  			struct list_head	cqm_group_entry;

Please, no 'bool' in structures.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
  2016-03-07 23:03   ` Peter Zijlstra
@ 2016-03-07 23:27     ` Luck, Tony
  2016-03-08  8:49       ` Peter Zijlstra
  2016-03-10 22:49     ` Vikas Shivappa
  1 sibling, 1 reply; 29+ messages in thread
From: Luck, Tony @ 2016-03-07 23:27 UTC (permalink / raw)
  To: Peter Zijlstra, Vikas Shivappa
  Cc: Shivappa, Vikas, linux-kernel, x86, hpa, tglx, mingo, Shankar,
	Ravi V, Yu, Fenghua, Anvin, H Peter

>> +		bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
>> +		do_div(bytes, diff_time);
>> +		mbm_current->bandwidth = bytes;
>> +		mbm_current->interval_bytes = 0;
>> +		mbm_current->interval_start = cur_time;
>> +	}
>>> +
>> +	return mbm_current;
>> +}
>
> How does the above time tracking deal with the event not actually having
> been scheduled the whole time?

That's been the topic of a few philosophical debates ... what exactly are
we trying to say when we report that a process has a "memory bandwidth"
of, say, 1523 MBytes/s?  We need to know both the amount of data moved
and to pick an interval to measure and divide by. Does it make a difference
whether the process voluntarily gave up the cpu for some part of the interval
(by blocking on I/O)? Or did the scheduler time-slice it out to run other jobs?

The above code gives the average bandwidth across the last interval
(with a minimum interval size of 100ms to avoid craziness with rounding
errors on exceptionally tiny intervals). Some folks apparently want to get
a "rate" directly from perf.  I think many folks will find the "bytes" counters
more helpful (where they control the sample interval with '-I" flag to perf
utility).
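
(As a quick sanity check of the 1% figure in the patch comment: the
interval is measured in whole milliseconds via ktime_ms_delta(), so
truncation costs at most about 1 ms, i.e.

	worst-case relative error <= 1 ms / 100 ms = 1%

for the 100 ms minimum interval.)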

-Tony

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
  2016-03-07 23:27     ` Luck, Tony
@ 2016-03-08  8:49       ` Peter Zijlstra
  2016-03-10 22:46         ` Vikas Shivappa
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2016-03-08  8:49 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Vikas Shivappa, Shivappa, Vikas, linux-kernel, x86, hpa, tglx,
	mingo, Shankar, Ravi V, Yu, Fenghua, Anvin, H Peter

On Mon, Mar 07, 2016 at 11:27:26PM +0000, Luck, Tony wrote:
> >> +		bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
> >> +		do_div(bytes, diff_time);
> >> +		mbm_current->bandwidth = bytes;
> >> +		mbm_current->interval_bytes = 0;
> >> +		mbm_current->interval_start = cur_time;
> >> +	}
> >>> +
> >> +	return mbm_current;
> >> +}
> >
> > How does the above time tracking deal with the event not actually having
> > been scheduled the whole time?
> 
> That's been the topic of a few philosophical debates ... what exactly are
> we trying to say when we report that a process has a "memory bandwidth"
> of, say, 1523 MBytes/s?  We need to know both the amount of data moved
> and to pick an interval to measure and divide by. Does it make a difference
> whether the process voluntarily gave up the cpu for some part of the interval
> (by blocking on I/O)? Or did the scheduler time-slice it out to run other jobs?
> 
> The above code gives the average bandwidth across the last interval
> (with a minimum interval size of 100ms to avoid craziness with rounding
> errors on exceptionally tiny intervals). Some folks apparently want to get
> a "rate" directly from perf.  I think many folks will find the "bytes" counters
> more helpful (where they control the sample interval with '-I" flag to perf
> utility).

So why didn't any of that make it into the Changelog? This is very much
different from any other perf driver, at the very least this debate
should have been mentioned and the choice defended.

Also, why are we doing the time tracking and divisions at all? Can't we
simply report the number of bytes transferred and let userspace sort out
the rest?

Userspace is provided the time the event was enabled, the time the event
was running and it can fairly trivially obtain walltime if it so desires
and then it can compute whatever definition of bandwidth it wants to
use.
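
As a rough userspace sketch of that (assuming the fd was opened with
read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING, and that the event counts bytes the way
the total_bytes/local_bytes events do):

	#include <stdint.h>
	#include <unistd.h>

	struct read_format {
		uint64_t value;		/* raw count: bytes for the *_bytes events */
		uint64_t time_enabled;	/* ns the event was enabled */
		uint64_t time_running;	/* ns the event was actually scheduled */
	};

	static double mbm_mbps(int fd)
	{
		struct read_format rf;

		if (read(fd, &rf, sizeof(rf)) != sizeof(rf) || !rf.time_running)
			return 0.0;

		/*
		 * Bandwidth while the event was scheduled; dividing by
		 * time_enabled instead gives the wall-clock averaged view.
		 */
		return (double)rf.value / ((double)rf.time_running / 1e9) / 1e6;
	}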

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
  2016-03-02 23:53   ` Vikas Shivappa
@ 2016-03-08  9:22     ` Thomas Gleixner
  2016-03-08 19:36       ` Vikas Shivappa
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2016-03-08  9:22 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Wed, 2 Mar 2016, Vikas Shivappa wrote:

Please fix the subject line prefix: "x86/perf/intel/cqm:"

> Fixes the hotcpu notifier leak and other global variable memory leaks
> during cqm(cache quality of service monitoring) initialization.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> ---
> 
> Fixed the memory leak for cqm_rmid_ptrs as per Thomas feedback.
>   
>  arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++++++++++++++------
>  1 file changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> index e6be335..37a93fa 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> @@ -250,10 +250,13 @@ static int intel_cqm_setup_rmid_cache(void)
>  
>  	return 0;
>  fail:
> -	while (r--)
> +	while (r--) {
>  		kfree(cqm_rmid_ptrs[r]);
> +		cqm_rmid_ptrs[r] = NULL;
> +	}
>  
>  	kfree(cqm_rmid_ptrs);
> +	cqm_rmid_ptrs = NULL;

This partial rollback is crap. Why can't you utilize cqm_cleanup() ?
Performance certainly is not an issue here.

>  	return -ENOMEM;
>  }
>  
> @@ -1320,9 +1323,19 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>  	{}
>  };
>  
> +static void cqm_cleanup(void)
> +{
> +	int r = cqm_max_rmid;
> +
> +	while (r--)
> +		kfree(cqm_rmid_ptrs[r]);
> +
> +	kfree(cqm_rmid_ptrs);
> +}
> +
>  static int __init intel_cqm_init(void)
>  {
> -	char *str, scale[20];
> +	char *str = NULL, scale[20];
>  	int i, cpu, ret;
>  
>  	if (!x86_match_cpu(intel_cqm_match))
> @@ -1382,16 +1395,25 @@ static int __init intel_cqm_init(void)
>  		cqm_pick_event_reader(i);
>  	}
>  
> -	__perf_cpu_notifier(intel_cqm_cpu_notifier);
> -
>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> -	if (ret)
> +	if (ret) {
>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
> -	else
> +		goto out;
> +	} else {
>  		pr_info("Intel CQM monitoring enabled\n");
> +	}
>  
> +	/*
> +	 * Register the hot cpu notifier once we are sure cqm
> +	 * is enabled to avoid notifier leak.
> +	 */
> +	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>  out:
>  	cpu_notifier_register_done();
> +	if (ret) {
> +		kfree(str);
> +		cqm_cleanup();

This is still broken. If intel_cqm_setup_rmid_cache() fails, you clear
cqm_rmid_ptrs and then jump to out. ret is -ENOMEM, so you will call
cqm_cleanup which will dereference cqm_rmid_ptrs .....

Find below a delta patch, which fixes that proper and safe.

Thanks,

	tglx

8<----------------

 arch/x86/events/intel/cqm.c |   40 +++++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 21 deletions(-)

--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -211,6 +211,20 @@ static void __put_rmid(u32 rmid)
 	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
 }
 
+static void cqm_cleanup(void)
+{
+	int i;
+
+	if (!cqm_rmid_ptrs)
+		return;
+
+	for (i = 0; i < cqm_max_rmid; i++)
+		kfree(cqm_rmid_ptrs[i]);
+
+	kfree(cqm_rmid_ptrs);
+	cqm_rmid_ptrs = NULL;
+}
+
 static int intel_cqm_setup_rmid_cache(void)
 {
 	struct cqm_rmid_entry *entry;
@@ -218,7 +232,7 @@ static int intel_cqm_setup_rmid_cache(vo
 	int r = 0;
 
 	nr_rmids = cqm_max_rmid + 1;
-	cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
+	cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
 				nr_rmids, GFP_KERNEL);
 	if (!cqm_rmid_ptrs)
 		return -ENOMEM;
@@ -247,16 +261,10 @@ static int intel_cqm_setup_rmid_cache(vo
 	mutex_lock(&cache_mutex);
 	intel_cqm_rotation_rmid = __get_rmid();
 	mutex_unlock(&cache_mutex);
-
 	return 0;
-fail:
-	while (r--) {
-		kfree(cqm_rmid_ptrs[r]);
-		cqm_rmid_ptrs[r] = NULL;
-	}
 
-	kfree(cqm_rmid_ptrs);
-	cqm_rmid_ptrs = NULL;
+fail:
+	cqm_cleanup();
 	return -ENOMEM;
 }
 
@@ -1313,16 +1321,6 @@ static const struct x86_cpu_id intel_cqm
 	{}
 };
 
-static void cqm_cleanup(void)
-{
-	int r = cqm_max_rmid;
-
-	while (r--)
-		kfree(cqm_rmid_ptrs[r]);
-
-	kfree(cqm_rmid_ptrs);
-}
-
 static int __init intel_cqm_init(void)
 {
 	char *str = NULL, scale[20];
@@ -1389,10 +1387,10 @@ static int __init intel_cqm_init(void)
 	if (ret) {
 		pr_err("Intel CQM perf registration failed: %d\n", ret);
 		goto out;
-	} else {
-		pr_info("Intel CQM monitoring enabled\n");
 	}
 
+	pr_info("Intel CQM monitoring enabled\n");
+
 	/*
 	 * Register the hot cpu notifier once we are sure cqm
 	 * is enabled to avoid notifier leak.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-03 18:37         ` Thomas Gleixner
@ 2016-03-08  9:25           ` Thomas Gleixner
  2016-03-08 19:36             ` Vikas Shivappa
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2016-03-08  9:25 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin

On Thu, 3 Mar 2016, Thomas Gleixner wrote:
> On Thu, 3 Mar 2016, Vikas Shivappa wrote:
> 
> > 
> > 
> > On Wed, 2 Mar 2016, Thomas Gleixner wrote:
> > 
> > > On Wed, 2 Mar 2016, Vikas Shivappa wrote:
> > > > +	if (cqm_enabled && mbm_enabled)
> > > > +		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
> > > > +	else if (!cqm_enabled && mbm_enabled)
> > > > +		intel_cqm_events_group.attrs = intel_mbm_events_attr;
> > > > +	else if (cqm_enabled && !mbm_enabled)
> > > > +		intel_cqm_events_group.attrs = intel_cqm_events_attr;
> > > > +
> > > >  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> > > >  	if (ret) {
> > > >  		pr_err("Intel CQM perf registration failed: %d\n", ret);
> > > >  		goto out;
> > > 
> > > So what cleans up mbm_local and mbm_total in that case?
> > 
> > I put all the cleanup code in the cqm_cleanup .. - please see copy below
> 
> Ok, missed that.
> 
> > @@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
> >                 kfree(cqm_rmid_ptrs[r]);
> > 
> >         kfree(cqm_rmid_ptrs);
> > +       kfree(mbm_local);
> > +       kfree(mbm_total);
> > +       mbm_enabled = false;
> > +       cqm_enabled = false;
> > +}

After looking at it closely, you really need a separate mbm_cleanup()
function, so you can utilize it from mbm_init() and in the exit path.
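
Something along these lines (a minimal sketch of the split, using the
existing globals; where exactly it gets called from is up to the final
patch):

	static void mbm_cleanup(void)
	{
		kfree(mbm_local);
		mbm_local = NULL;
		kfree(mbm_total);
		mbm_total = NULL;
		mbm_enabled = false;
	}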

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
  2016-03-08  9:22     ` Thomas Gleixner
@ 2016-03-08 19:36       ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-08 19:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Tue, 8 Mar 2016, Thomas Gleixner wrote:

> On Wed, 2 Mar 2016, Vikas Shivappa wrote:
>
> Please fix the subject line prefix: "x86/perf/intel/cqm:"

Will fix..

>
>> Fixes the hotcpu notifier leak and other global variable memory leaks
>> during cqm(cache quality of service monitoring) initialization.
>>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>> ---
>>
>> Fixed the memory leak for cqm_rmid_ptrs as per Thomas feedback.
>>
>>  arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++++++++++++++------
>>  1 file changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> index e6be335..37a93fa 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> @@ -250,10 +250,13 @@ static int intel_cqm_setup_rmid_cache(void)
>>
>>  	return 0;
>>  fail:
>> -	while (r--)
>> +	while (r--) {
>>  		kfree(cqm_rmid_ptrs[r]);
>> +		cqm_rmid_ptrs[r] = NULL;
>> +	}
>>
>>  	kfree(cqm_rmid_ptrs);
>> +	cqm_rmid_ptrs = NULL;
>
> This partial rollback is crap. Why can't you utilize cqm_cleanup() ?
> Performance certainly is not an issue here.
>
>>  	return -ENOMEM;
>>  }
>>
>> @@ -1320,9 +1323,19 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>>  	{}
>>  };
>>
>> +static void cqm_cleanup(void)
>> +{
>> +	int r = cqm_max_rmid;
>> +
>> +	while (r--)
>> +		kfree(cqm_rmid_ptrs[r]);
>> +
>> +	kfree(cqm_rmid_ptrs);
>> +}
>> +
>>  static int __init intel_cqm_init(void)
>>  {
>> -	char *str, scale[20];
>> +	char *str = NULL, scale[20];
>>  	int i, cpu, ret;
>>
>>  	if (!x86_match_cpu(intel_cqm_match))
>> @@ -1382,16 +1395,25 @@ static int __init intel_cqm_init(void)
>>  		cqm_pick_event_reader(i);
>>  	}
>>
>> -	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>> -
>>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>> -	if (ret)
>> +	if (ret) {
>>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
>> -	else
>> +		goto out;
>> +	} else {
>>  		pr_info("Intel CQM monitoring enabled\n");
>> +	}
>>
>> +	/*
>> +	 * Register the hot cpu notifier once we are sure cqm
>> +	 * is enabled to avoid notifier leak.
>> +	 */
>> +	__perf_cpu_notifier(intel_cqm_cpu_notifier);
>>  out:
>>  	cpu_notifier_register_done();
>> +	if (ret) {
>> +		kfree(str);
>> +		cqm_cleanup();
>
> This is still broken. If intel_cqm_setup_rmid_cache() fails, you clear
> cqm_rmid_ptrs and then jump to out. ret is -ENOMEM, so you will call
> cqm_cleanup which will dereference cqm_rmid_ptrs .....
>
> Find below a delta patch, which fixes that proper and safe.

Thanks for sending it. Will resend the patch with your fix.

-Vikas


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
  2016-03-08  9:25           ` Thomas Gleixner
@ 2016-03-08 19:36             ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-08 19:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Tue, 8 Mar 2016, Thomas Gleixner wrote:

> On Thu, 3 Mar 2016, Thomas Gleixner wrote:
>> On Thu, 3 Mar 2016, Vikas Shivappa wrote:
>>
>>>
>>>
>>> On Wed, 2 Mar 2016, Thomas Gleixner wrote:
>>>
>>>> On Wed, 2 Mar 2016, Vikas Shivappa wrote:
>>>>> +	if (cqm_enabled && mbm_enabled)
>>>>> +		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
>>>>> +	else if (!cqm_enabled && mbm_enabled)
>>>>> +		intel_cqm_events_group.attrs = intel_mbm_events_attr;
>>>>> +	else if (cqm_enabled && !mbm_enabled)
>>>>> +		intel_cqm_events_group.attrs = intel_cqm_events_attr;
>>>>> +
>>>>>  	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>>>>>  	if (ret) {
>>>>>  		pr_err("Intel CQM perf registration failed: %d\n", ret);
>>>>>  		goto out;
>>>>
>>>> So what cleans up mbm_local and mbm_total in that case?
>>>
>>> I put all the cleanup code in the cqm_cleanup .. - please see copy below
>>
>> Ok, missed that.
>>
>>> @@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
>>>                 kfree(cqm_rmid_ptrs[r]);
>>>
>>>         kfree(cqm_rmid_ptrs);
>>> +       kfree(mbm_local);
>>> +       kfree(mbm_total);
>>> +       mbm_enabled = false;
>>> +       cqm_enabled = false;
>>> +}
>
> After looking at it closely, you really need a separate mbm_cleanup()
> function, so you can utilize it from mbm_init() and in the exit path.

Will fix and separate the mbm and cqm cleanups.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group
  2016-03-07 23:04   ` Peter Zijlstra
@ 2016-03-10  0:18     ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-10  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, tglx,
	mingo, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Mon, 7 Mar 2016, Peter Zijlstra wrote:

> On Tue, Mar 01, 2016 at 03:48:23PM -0800, Vikas Shivappa wrote:
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -121,6 +121,7 @@ struct hw_perf_event {
>>  		struct { /* intel_cqm */
>>  			int			cqm_state;
>>  			u32			cqm_rmid;
>> +			bool			is_group_event;
>>  			struct list_head	cqm_events_entry;
>>  			struct list_head	cqm_groups_entry;
>>  			struct list_head	cqm_group_entry;
>
> Please, no 'bool' in structures.

Will fix..

thanks,
vikas

>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
  2016-03-08  8:49       ` Peter Zijlstra
@ 2016-03-10 22:46         ` Vikas Shivappa
  0 siblings, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-10 22:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luck, Tony, Vikas Shivappa, Shivappa, Vikas, linux-kernel, x86,
	hpa, tglx, mingo, Shankar, Ravi V, Yu, Fenghua, Anvin, H Peter



On Tue, 8 Mar 2016, Peter Zijlstra wrote:

> On Mon, Mar 07, 2016 at 11:27:26PM +0000, Luck, Tony wrote:
>>>> +		bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
>>>> +		do_div(bytes, diff_time);
>>>> +		mbm_current->bandwidth = bytes;
>>>> +		mbm_current->interval_bytes = 0;
>>>> +		mbm_current->interval_start = cur_time;
>>>> +	}
>>>>> +
>>>> +	return mbm_current;
>>>> +}
>>>
>>> How does the above time tracking deal with the event not actually having
>>> been scheduled the whole time?
>>
>> That's been the topic of a few philosophical debates ... what exactly are
>> we trying to say when we report that a process has a "memory bandwidth"
>> of, say, 1523 MBytes/s?  We need to know both the amount of data moved
>> and to pick an interval to measure and divide by. Does it make a difference
>> whether the process voluntarily gave up the cpu for some part of the interval
>> (by blocking on I/O)? Or did the scheduler time-slice it out to run other jobs?
>>
>> The above code gives the average bandwidth across the last interval
>> (with a minimum interval size of 100ms to avoid craziness with rounding
>> errors on exceptionally tiny intervals). Some folks apparently want to get
>> a "rate" directly from perf.  I think many folks will find the "bytes" counters
>> more helpful (where they control the sample interval with '-I" flag to perf
>> utility).
>
> So why didn't any of that make it into the Changelog? This is very much
> different from any other perf driver, at the very least this debate
> should have been mentioned and the choice defended.
>
> Also, why are we doing the time tracking and divisions at all? Can't we
> simply report the number of bytes transferred and let userspace sort out
> the rest?
>
> Userspace is provided the time the event was enabled, the time the event
> was running and it can fairly trivially obtain walltime if it so desires
> and then it can compute whatever definition of bandwidth it wants to
> use.


We had discussions on removing the bw events; I discussed this with Tony and
will update the patch to drop them, so this code will be removed.

thanks,
Vikas

>
>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
  2016-03-07 23:03   ` Peter Zijlstra
  2016-03-07 23:27     ` Luck, Tony
@ 2016-03-10 22:49     ` Vikas Shivappa
  1 sibling, 0 replies; 29+ messages in thread
From: Vikas Shivappa @ 2016-03-10 22:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vikas Shivappa, vikas.shivappa, linux-kernel, x86, hpa, tglx,
	mingo, ravi.v.shankar, tony.luck, fenghua.yu, h.peter.anvin



On Mon, 7 Mar 2016, Peter Zijlstra wrote:

> On Tue, Mar 01, 2016 at 03:48:26PM -0800, Vikas Shivappa wrote:
>
>> Lot of the scheduling code was taken out from Tony's patch and a 3-4
>> lines of change were added in the intel_cqm_event_read. Since the timer
>> is no more added on every context switch this change was made.
>
> Is this here to confuse people or is there some actual information in
> it?

Will remove the comment.

>
>> +/*
>> + * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
>> + * value
>> + */
>> +#define MBM_CNTR_MAX		0xffffff
>
> #define MBM_CNTR_WIDTH	24
> #define MBM_CNTR_MAX	((1U << MBM_CNTR_WIDTH) - 1)
>
>
Will fix

>>  #define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
>> +/*
>> + * MBM Event IDs as defined in SDM section 17.15.5
>> + * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
>> + */
>> +enum mbm_evt_type {
>> +	QOS_MBM_TOTAL_EVENT_ID = 0x02,
>> +	QOS_MBM_LOCAL_EVENT_ID,
>> +	QOS_MBM_TOTAL_BW_EVENT_ID,
>> +	QOS_MBM_LOCAL_BW_EVENT_ID,
>> +};
>
> QOS_L3_*_EVENT_ID is a define, these are an enum. Rather inconsistent.
>

Will be changing all of them to #define, and we are also removing the bw
events.

>>  struct rmid_read {
>>  	u32 rmid;
>
> Hole, you could've filled with the enum (which ends up being an int I
> think).
>
>>  	atomic64_t value;
>> +	enum mbm_evt_type evt_type;
>>  };
>
>> +static bool is_mbm_event(int e)
>
> You had an enum type, you might as well use it.

the enum will be gone..

>
>> +{
>> +	return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_BW_EVENT_ID);
>> +}
>>
>
>> +static struct sample *update_sample(unsigned int rmid,
>> +				    enum mbm_evt_type evt_type, int first)
>> +{
>> +	ktime_t cur_time;
>> +	struct sample *mbm_current;
>> +	u32 vrmid = rmid_2_index(rmid);
>> +	u64 val, bytes, diff_time;
>> +	u32 eventid;
>> +
>> +	if (evt_type & QOS_MBM_LOCAL_EVENT_MASK) {
>> +		mbm_current = &mbm_local[vrmid];
>> +		eventid     =  QOS_MBM_LOCAL_EVENT_ID;
>> +	} else {
>> +		mbm_current = &mbm_total[vrmid];
>> +		eventid     = QOS_MBM_TOTAL_EVENT_ID;
>> +	}
>> +
>> +	cur_time = ktime_get();
>> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>> +	rdmsrl(MSR_IA32_QM_CTR, val);
>> +	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>> +		return mbm_current;
>
>> +	val &= MBM_CNTR_MAX;
>
>> +	if (val < mbm_current->prev_msr)
>> +		bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
>> +	else
>> +		bytes = val - mbm_current->prev_msr;
>
> Would not something like:
>
> 	shift = 64 - MBM_CNTR_WIDTH;
>
> 	bytes = (val << shift) - (prev << shift);
> 	bytes >>= shift;
>
> be less obtuse? (and consistent with how every other perf update
> function does it).

Will fix.

>
> What guarantee is there we didn't wrap multiple times? Doesn't that
> deserve a comment?


This is taken care of in the next patch, 0006. I have put a comment there, at
the definition of the timers, noting that the h/w guarantees the counter won't
overflow within 1s, but I can add another comment here in patch 0006.

>
>> +	bytes *= cqm_l3_scale;
>> +
>> +	mbm_current->total_bytes += bytes;
>> +	mbm_current->interval_bytes += bytes;
>> +	mbm_current->prev_msr = val;
>> +	diff_time = ktime_ms_delta(cur_time, mbm_current->interval_start);
>
> Here we do a / 1e6
>
>> +
>> +	/*
>> +	 * The b/w measured is really the most recent/current b/w.
>> +	 * We wait till enough time has passed to avoid
>> +	 * arthmetic rounding problems.Having it at >=100ms,
>> +	 * such errors would be <=1%.
>> +	 */
>> +	if (diff_time > 100) {
>
> This could well be > 100e6 instead, avoiding the above division most of
> the time.
>
>> +		bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
>> +		do_div(bytes, diff_time);
>> +		mbm_current->bandwidth = bytes;
>> +		mbm_current->interval_bytes = 0;
>> +		mbm_current->interval_start = cur_time;
>> +	}
>> +
>> +	return mbm_current;
>> +}
>
> How does the above time tracking deal with the event not actually having
> been scheduled the whole time?

Will be removing the bw events - so should address all three comments above.

>
>
>> +static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type)
>> +{
>> +	struct rmid_read rr = {
>> +		.value = ATOMIC64_INIT(0),
>> +	};
>> +
>> +	rr.rmid = rmid;
>> +	rr.evt_type = evt_type;
>
> That's just sad.. put those two in the struct init as well.

Will fix

thanks,
vikas
>
>> +	/* on each socket, init sample */
>> +	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
>> +}
>

^ permalink raw reply	[flat|nested] 29+ messages in thread


Thread overview: 29+ messages
2016-03-01 23:48 [PATCH V5 0/6] Intel memory b/w monitoring support Vikas Shivappa
2016-03-01 23:48 ` [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group Vikas Shivappa
2016-03-07 23:04   ` Peter Zijlstra
2016-03-10  0:18     ` Vikas Shivappa
2016-03-01 23:48 ` [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak Vikas Shivappa
2016-03-02  8:00   ` Thomas Gleixner
2016-03-02 17:58     ` Vikas Shivappa
2016-03-02 23:53   ` Vikas Shivappa
2016-03-08  9:22     ` Thomas Gleixner
2016-03-08 19:36       ` Vikas Shivappa
2016-03-01 23:48 ` [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init Vikas Shivappa
2016-03-02  8:04   ` Thomas Gleixner
2016-03-02 17:59     ` Vikas Shivappa
2016-03-02 21:31       ` Vikas Shivappa
2016-03-02 23:56   ` Vikas Shivappa
2016-03-03  7:35     ` Thomas Gleixner
2016-03-03 18:26       ` Vikas Shivappa
2016-03-03 18:37         ` Thomas Gleixner
2016-03-08  9:25           ` Thomas Gleixner
2016-03-08 19:36             ` Vikas Shivappa
2016-03-01 23:48 ` [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management Vikas Shivappa
2016-03-07 23:03   ` Peter Zijlstra
2016-03-07 23:27     ` Luck, Tony
2016-03-08  8:49       ` Peter Zijlstra
2016-03-10 22:46         ` Vikas Shivappa
2016-03-10 22:49     ` Vikas Shivappa
2016-03-01 23:48 ` [PATCH 5/6] x86/mbm: RMID Recycling MBM changes Vikas Shivappa
2016-03-01 23:48 ` [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling Vikas Shivappa
2016-03-02 23:58   ` Vikas Shivappa
