* [PATCH 00/11] perf: Intel Cache QoS Monitoring support
@ 2014-09-24 14:04 Matt Fleming
  2014-09-24 14:04 ` [PATCH 01/11] perf stat: Fix AGGR_CORE segfault on multi-socket system Matt Fleming
                   ` (10 more replies)
  0 siblings, 11 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

From: Matt Fleming <matt.fleming@intel.com>

This patch series adds a new PMU driver for the Intel Cache Monitoring
hardware feature available in Intel Xeon processors, which allows
monitoring of LLC occupancy on a task, group or system-wide basis.

The first few patches modify tools/perf to handle per-package counters,
which necessitates discarding some values when doing per-cpu reads to
avoid getting duplicate data. The rest add support for the new PMU code.

I've left a notoriously funky bit of code, the RMID rotation code, as
the last patch in an attempt to simplify things. Rotation multiplexes
the RMIDs and so works around the hardware limit on how many of them
exist. The rest of the patches work fine without it, but there are a
number of scenarios where being able to monitor more tasks than RMIDs
is extremely useful.

The series is based on tip/perf/core.

Matt Fleming (10):
  perf stat: Fix AGGR_CORE segfault on multi-socket system
  perf tools: Refactor unit and scale function parameters
  perf tools: Parse event per-package info files
  perf: Make perf_cgroup_from_task() global
  perf: Add ->count() function to read per-package counters
  perf: Move cgroup init before PMU ->event_init()
  perf/x86/intel: Add Intel Cache QoS Monitoring support
  perf/x86/intel: Implement LRU monitoring ID allocation for CQM
  perf/x86/intel: Support task events with Intel CQM
  perf/x86/intel: Perform rotation on Intel CQM RMIDs

Peter P Waskiewicz Jr (1):
  x86: Add support for Intel Cache QoS Monitoring (CQM) detection

 arch/x86/include/asm/cpufeature.h          |    9 +-
 arch/x86/include/asm/processor.h           |    3 +
 arch/x86/kernel/cpu/Makefile               |    2 +-
 arch/x86/kernel/cpu/common.c               |   39 +
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 1159 ++++++++++++++++++++++++++++
 include/linux/perf_event.h                 |   48 ++
 include/uapi/linux/perf_event.h            |    1 +
 kernel/events/core.c                       |   63 +-
 tools/perf/builtin-stat.c                  |   81 +-
 tools/perf/util/evsel.c                    |    6 +-
 tools/perf/util/evsel.h                    |    8 +-
 tools/perf/util/parse-events.c             |   10 +-
 tools/perf/util/pmu.c                      |   66 +-
 tools/perf/util/pmu.h                      |    8 +-
 14 files changed, 1434 insertions(+), 69 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_cqm.c

-- 
1.9.3



* [PATCH 01/11] perf stat: Fix AGGR_CORE segfault on multi-socket system
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-24 14:04 ` [PATCH 02/11] perf tools: Refactor unit and scale function parameters Matt Fleming
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

When printing the stats associated with a counter in AGGR_CORE mode, the
'cpu' argument represents an encoded socket and core_id, not a 'cpu'.
Using it as an index into any of the *_stats[MAX_NR_CPUS] arrays
generates a SIGSEGV if the encoded socket id is non-zero.

Follow the AGGR_GLOBAL case and reset the cpu index to 0.
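
To illustrate why the encoded id cannot be used as an array index, here
is a minimal standalone sketch. The encoding (socket in the upper 16
bits, core id in the lower 16) and the names encode_core_id(),
stats_index_ok() and stats_index() are assumptions made for this
example, as is the small MAX_NR_CPUS value; none of them are taken from
the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_CPUS 256	/* illustrative size, not perf's real value */

/* Assumed encoding: socket in the upper 16 bits, core id in the lower 16. */
int encode_core_id(int socket, int core)
{
	assert(socket >= 0 && core >= 0);
	return (socket << 16) | (core & 0xffff);
}

/* True if 'idx' can safely index a *_stats[MAX_NR_CPUS] style array. */
bool stats_index_ok(int idx)
{
	return idx >= 0 && idx < MAX_NR_CPUS;
}

/* Mirrors the idea of the fix: aggregated modes always use slot 0. */
int stats_index(bool aggregated, int cpu)
{
	return aggregated ? 0 : cpu;
}
```

With this encoding any id on socket 1 is at least 65536, far outside
the array, while resetting the index to 0 keeps the lookup in bounds.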

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 tools/perf/builtin-stat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 5fe0edb1de5d..7c30fbd2d9b7 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -964,7 +964,7 @@ static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 
 	aggr_printout(evsel, cpu, nr);
 
-	if (aggr_mode == AGGR_GLOBAL)
+	if (aggr_mode == AGGR_GLOBAL || aggr_mode == AGGR_CORE)
 		cpu = 0;
 
 	fprintf(output, fmt, avg, csv_sep);
-- 
1.9.3



* [PATCH 02/11] perf tools: Refactor unit and scale function parameters
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
  2014-09-24 14:04 ` [PATCH 01/11] perf stat: Fix AGGR_CORE segfault on multi-socket system Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-29  9:21   ` Jiri Olsa
  2014-10-03  5:25   ` [tip:perf/core] " tip-bot for Matt Fleming
  2014-09-24 14:04 ` [PATCH 03/11] perf tools: Parse event per-package info files Matt Fleming
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

Passing pointers to alias modifiers 'unit' and 'scale' isn't very
future-proof since if we add more modifiers to the list we'll end up
passing more arguments.

Instead wrap everything up in a struct perf_pmu_info, which can easily
be expanded when additional alias modifiers are necessary in the future.
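
As a rough standalone sketch of the pattern (not the perf code itself),
the caller passes one struct and the callee fills in every modifier,
applying defaults for anything left unset. struct alias_sketch and
check_alias_sketch() are hypothetical names invented for this example;
only struct perf_pmu_info and its two fields come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct introduced by this patch. */
struct perf_pmu_info {
	const char *unit;
	double scale;
};

/* Hypothetical stand-in for an alias; either field may be unset. */
struct alias_sketch {
	const char *unit;	/* NULL when no unit was provided */
	double scale;		/* 0.0 when no scale was provided */
};

/* Fill 'info' from 'alias', defaulting unit to "" and scale to 1.0. */
int check_alias_sketch(const struct alias_sketch *alias,
		       struct perf_pmu_info *info)
{
	assert(info != NULL);
	if (!alias)
		return -1;

	info->unit = alias->unit ? alias->unit : "";
	info->scale = alias->scale != 0.0 ? alias->scale : 1.0;
	return 0;
}
```

Adding another modifier later means adding one field to the struct and
one assignment here; the function signature and its callers stay the
same.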

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 tools/perf/util/parse-events.c |  9 ++++-----
 tools/perf/util/pmu.c          | 38 +++++++++++++++++++++++---------------
 tools/perf/util/pmu.h          |  7 ++++++-
 3 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 61be3e695ec2..9522cf22ad81 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -634,10 +634,9 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 			 char *name, struct list_head *head_config)
 {
 	struct perf_event_attr attr;
+	struct perf_pmu_info info;
 	struct perf_pmu *pmu;
 	struct perf_evsel *evsel;
-	const char *unit;
-	double scale;
 
 	pmu = perf_pmu__find(name);
 	if (!pmu)
@@ -656,7 +655,7 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 		return evsel ? 0 : -ENOMEM;
 	}
 
-	if (perf_pmu__check_alias(pmu, head_config, &unit, &scale))
+	if (perf_pmu__check_alias(pmu, head_config, &info))
 		return -EINVAL;
 
 	/*
@@ -671,8 +670,8 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 	evsel = __add_event(list, idx, &attr, pmu_event_name(head_config),
 			    pmu->cpus);
 	if (evsel) {
-		evsel->unit = unit;
-		evsel->scale = scale;
+		evsel->unit = info.unit;
+		evsel->scale = info.scale;
 	}
 
 	return evsel ? 0 : -ENOMEM;
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 22a4ad5a927a..93a41ca96b8e 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -210,6 +210,19 @@ static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FI
 	return 0;
 }
 
+static inline bool pmu_alias_info_file(char *name)
+{
+	size_t len;
+
+	len = strlen(name);
+	if (len > 5 && !strcmp(name + len - 5, ".unit"))
+		return true;
+	if (len > 6 && !strcmp(name + len - 6, ".scale"))
+		return true;
+
+	return false;
+}
+
 /*
  * Process all the sysfs attributes located under the directory
  * specified in 'dir' parameter.
@@ -218,7 +231,6 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 {
 	struct dirent *evt_ent;
 	DIR *event_dir;
-	size_t len;
 	int ret = 0;
 
 	event_dir = opendir(dir);
@@ -234,13 +246,9 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 			continue;
 
 		/*
-		 * skip .unit and .scale info files
-		 * parsed in perf_pmu__new_alias()
+		 * skip info files parsed in perf_pmu__new_alias()
 		 */
-		len = strlen(name);
-		if (len > 5 && !strcmp(name + len - 5, ".unit"))
-			continue;
-		if (len > 6 && !strcmp(name + len - 6, ".scale"))
+		if (pmu_alias_info_file(name))
 			continue;
 
 		snprintf(path, PATH_MAX, "%s/%s", dir, name);
@@ -645,7 +653,7 @@ static int check_unit_scale(struct perf_pmu_alias *alias,
  * defined for the alias
  */
 int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
-			  const char **unit, double *scale)
+			  struct perf_pmu_info *info)
 {
 	struct parse_events_term *term, *h;
 	struct perf_pmu_alias *alias;
@@ -655,8 +663,8 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 	 * Mark unit and scale as not set
 	 * (different from default values, see below)
 	 */
-	*unit   = NULL;
-	*scale  = 0.0;
+	info->unit   = NULL;
+	info->scale  = 0.0;
 
 	list_for_each_entry_safe(term, h, head_terms, list) {
 		alias = pmu_find_alias(pmu, term);
@@ -666,7 +674,7 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 		if (ret)
 			return ret;
 
-		ret = check_unit_scale(alias, unit, scale);
+		ret = check_unit_scale(alias, &info->unit, &info->scale);
 		if (ret)
 			return ret;
 
@@ -679,11 +687,11 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 	 * set defaults as for evsel
 	 * unit cannot left to NULL
 	 */
-	if (*unit == NULL)
-		*unit   = "";
+	if (info->unit == NULL)
+		info->unit   = "";
 
-	if (*scale == 0.0)
-		*scale  = 1.0;
+	if (info->scale == 0.0)
+		info->scale  = 1.0;
 
 	return 0;
 }
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 0f5c0a88fdc8..fe90a012c003 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -25,6 +25,11 @@ struct perf_pmu {
 	struct list_head list;    /* ELEM */
 };
 
+struct perf_pmu_info {
+	const char *unit;
+	double scale;
+};
+
 struct perf_pmu *perf_pmu__find(const char *name);
 int perf_pmu__config(struct perf_pmu *pmu, struct perf_event_attr *attr,
 		     struct list_head *head_terms);
@@ -33,7 +38,7 @@ int perf_pmu__config_terms(struct list_head *formats,
 			   struct list_head *head_terms,
 			   bool zero);
 int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
-			  const char **unit, double *scale);
+			  struct perf_pmu_info *info);
 struct list_head *perf_pmu__alias(struct perf_pmu *pmu,
 				  struct list_head *head_terms);
 int perf_pmu_wrap(void);
-- 
1.9.3



* [PATCH 03/11] perf tools: Parse event per-package info files
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
  2014-09-24 14:04 ` [PATCH 01/11] perf stat: Fix AGGR_CORE segfault on multi-socket system Matt Fleming
  2014-09-24 14:04 ` [PATCH 02/11] perf tools: Refactor unit and scale function parameters Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-24 14:04 ` [PATCH 04/11] perf: Make perf_cgroup_from_task() global Matt Fleming
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

In preparation for upcoming PMU drivers that support system-wide,
per-package counters and hence report duplicate values, add support for
parsing the .per-pkg file.

An event can export this info file to indicate that all but one value
per socket should be discarded.

The discarding is much easier to do in userspace than inside the kernel
because the kernel cannot infer what userspace is going to do with the
reported values, what order it will read them in, etc.
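
The userspace side of this can be sketched roughly as follows. This is
a simplified illustration, not the code in the diff below: it tracks
which sockets have already contributed with a 'seen' array, whereas the
patch looks for an already-stashed value. sum_per_pkg(), MAX_SOCKETS
and the cpu_socket[] topology array are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SOCKETS 8	/* illustrative limit */

/*
 * Sum per-CPU readings of a per-package counter, counting each socket
 * only once. cpu_socket[i] is the socket id of CPU i.
 */
uint64_t sum_per_pkg(const uint64_t *val, const int *cpu_socket, int ncpus)
{
	bool seen[MAX_SOCKETS] = { false };
	uint64_t total = 0;

	for (int i = 0; i < ncpus; i++) {
		int s = cpu_socket[i];

		assert(s >= 0 && s < MAX_SOCKETS);
		if (seen[s])
			continue;	/* duplicate of a value already counted */

		seen[s] = true;
		total += val[i];
	}

	return total;
}
```

For four CPUs on two sockets reporting 100, 100, 40 and 40, a naive sum
gives 280 while the deduplicated sum is 140.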

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 tools/perf/builtin-stat.c      | 79 +++++++++++++++++++++++++++++++++++++++++-
 tools/perf/util/evsel.c        |  6 +++-
 tools/perf/util/evsel.h        |  8 +++--
 tools/perf/util/parse-events.c |  1 +
 tools/perf/util/pmu.c          | 28 +++++++++++++++
 tools/perf/util/pmu.h          |  1 +
 6 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7c30fbd2d9b7..c5dd11fbbd7c 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -388,17 +388,89 @@ static void update_shadow_stats(struct perf_evsel *counter, u64 *count)
 }
 
 /*
+ * If 'evsel' is a per-socket event we may get duplicate values
+ * reported. We need to discard all but one per-socket value.
+ */
+static bool counter_per_socket_skip(struct perf_evsel *evsel, int cpu, u64 val)
+{
+	struct cpu_map *map;
+	int i, ncpus;
+	int s1, s2;
+
+	map = perf_evsel__cpus(evsel);
+	ncpus = map->nr;
+
+	s1 = cpu_map__get_socket(evsel_list->cpus, map->map[cpu]);
+
+	/*
+	 * Read all CPUs for this socket and see if any already have
+	 * value assigned.
+	 */
+	for (i = 0; i < ncpus; i++) {
+		s2 = cpu_map__get_socket(evsel_list->cpus, map->map[i]);
+		if (s1 != s2)
+			continue;
+
+		if (evsel->counts->cpu[i].val)
+			return true;
+	}
+
+	/* Stash the counter value in unused ->counts */
+	evsel->counts->cpu[cpu].val = val;
+	return false;
+}
+
+static bool aggr_per_socket_skip(struct perf_evsel *evsel, int cpu)
+{
+	struct cpu_map *map;
+	int leader_cpu = -1;
+	int i, ncpus;
+	int s1, s2;
+
+	map = perf_evsel__cpus(evsel);
+	ncpus = map->nr;
+
+	s1 = cpu_map__get_socket(evsel_list->cpus, map->map[cpu]);
+
+	/*
+	 * Find the first enabled counter for this socket and skip
+	 * everything else.
+	 */
+	for (i = 0; i < ncpus; i++) {
+		s2 = cpu_map__get_socket(evsel_list->cpus, map->map[i]);
+		if (s1 != s2)
+			continue;
+
+		if (!evsel->counts->cpu[i].ena)
+			continue;
+
+		leader_cpu = i;
+		break;
+	}
+
+	if (cpu == leader_cpu)
+		return false;
+
+	return true;
+}
+
+/*
  * Read out the results of a single counter:
  * aggregate counts across CPUs in system-wide mode
  */
 static int read_counter_aggr(struct perf_evsel *counter)
 {
 	struct perf_stat *ps = counter->priv;
+	bool (*f_skip)(struct perf_evsel *evsel, int cpu, u64 val) = NULL;
 	u64 *count = counter->counts->aggr.values;
 	int i;
 
+	if (counter->per_pkg)
+		f_skip = counter_per_socket_skip;
+
 	if (__perf_evsel__read(counter, perf_evsel__nr_cpus(counter),
-			       thread_map__nr(evsel_list->threads), scale) < 0)
+			       thread_map__nr(evsel_list->threads),
+			       scale, f_skip) < 0)
 		return -1;
 
 	for (i = 0; i < 3; i++)
@@ -1128,6 +1200,11 @@ static void print_aggr(char *prefix)
 			val = ena = run = 0;
 			nr = 0;
 			for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
+				if (counter->per_pkg) {
+					if (aggr_per_socket_skip(counter, cpu))
+						continue;
+				}
+
 				cpu2 = perf_evsel__cpus(counter)->map[cpu];
 				s2 = aggr_get_id(evsel_list->cpus, cpu2);
 				if (s2 != id)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b38de5819323..e7693b59e081 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -883,7 +883,8 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 }
 
 int __perf_evsel__read(struct perf_evsel *evsel,
-		       int ncpus, int nthreads, bool scale)
+		       int ncpus, int nthreads, bool scale,
+		       bool (*f_skip)(struct perf_evsel *evsel, int cpu, u64 val))
 {
 	size_t nv = scale ? 3 : 1;
 	int cpu, thread;
@@ -903,6 +904,9 @@ int __perf_evsel__read(struct perf_evsel *evsel,
 				  &count, nv * sizeof(u64)) < 0)
 				return -errno;
 
+			if (f_skip && f_skip(evsel, cpu, count.val))
+				continue;
+
 			aggr->val += count.val;
 			if (scale) {
 				aggr->ena += count.ena;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 7bc314be6a7b..c5441e547b28 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -87,6 +87,7 @@ struct perf_evsel {
 	bool			immediate;
 	bool			system_wide;
 	bool			tracking;
+	bool			per_pkg;
 	/* parse modifier helper */
 	int			exclude_GH;
 	int			nr_members;
@@ -253,7 +254,8 @@ static inline int perf_evsel__read_on_cpu_scaled(struct perf_evsel *evsel,
 }
 
 int __perf_evsel__read(struct perf_evsel *evsel, int ncpus, int nthreads,
-		       bool scale);
+		       bool scale,
+		       bool (*f_skip)(struct perf_evsel *evsel, int cpu, u64 val));
 
 /**
  * perf_evsel__read - Read the aggregate results on all CPUs
@@ -265,7 +267,7 @@ int __perf_evsel__read(struct perf_evsel *evsel, int ncpus, int nthreads,
 static inline int perf_evsel__read(struct perf_evsel *evsel,
 				    int ncpus, int nthreads)
 {
-	return __perf_evsel__read(evsel, ncpus, nthreads, false);
+	return __perf_evsel__read(evsel, ncpus, nthreads, false, NULL);
 }
 
 /**
@@ -278,7 +280,7 @@ static inline int perf_evsel__read(struct perf_evsel *evsel,
 static inline int perf_evsel__read_scaled(struct perf_evsel *evsel,
 					  int ncpus, int nthreads)
 {
-	return __perf_evsel__read(evsel, ncpus, nthreads, true);
+	return __perf_evsel__read(evsel, ncpus, nthreads, true, NULL);
 }
 
 void hists__init(struct hists *hists);
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 9522cf22ad81..e1d8025c46bc 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -672,6 +672,7 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 	if (evsel) {
 		evsel->unit = info.unit;
 		evsel->scale = info.scale;
+		evsel->per_pkg = info.per_pkg;
 	}
 
 	return evsel ? 0 : -ENOMEM;
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 93a41ca96b8e..a08e8893b2e1 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -20,6 +20,7 @@ struct perf_pmu_alias {
 	struct list_head list;  /* ELEM */
 	char unit[UNIT_MAX_LEN+1];
 	double scale;
+	bool per_pkg;
 };
 
 struct perf_pmu_format {
@@ -173,6 +174,24 @@ error:
 	return -1;
 }
 
+static int
+perf_pmu__parse_per_pkg(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+	char path[PATH_MAX];
+	int fd;
+
+	snprintf(path, PATH_MAX, "%s/%s.per-pkg", dir, name);
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	close(fd);
+
+	alias->per_pkg = true;
+	return 0;
+}
+
 static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FILE *file)
 {
 	struct perf_pmu_alias *alias;
@@ -191,6 +210,7 @@ static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FI
 	INIT_LIST_HEAD(&alias->terms);
 	alias->scale = 1.0;
 	alias->unit[0] = '\0';
+	alias->per_pkg = false;
 
 	ret = parse_events_terms(&alias->terms, buf);
 	if (ret) {
@@ -204,6 +224,7 @@ static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FI
 	 */
 	perf_pmu__parse_unit(alias, dir, name);
 	perf_pmu__parse_scale(alias, dir, name);
+	perf_pmu__parse_per_pkg(alias, dir, name);
 
 	list_add_tail(&alias->list, list);
 
@@ -219,6 +240,8 @@ static inline bool pmu_alias_info_file(char *name)
 		return true;
 	if (len > 6 && !strcmp(name + len - 6, ".scale"))
 		return true;
+	if (len > 8 && !strcmp(name + len - 8, ".per-pkg"))
+		return true;
 
 	return false;
 }
@@ -659,6 +682,8 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 	struct perf_pmu_alias *alias;
 	int ret;
 
+	info->per_pkg = false;
+
 	/*
 	 * Mark unit and scale as not set
 	 * (different from default values, see below)
@@ -678,6 +703,9 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 		if (ret)
 			return ret;
 
+		if (alias->per_pkg)
+			info->per_pkg = true;
+
 		list_del(&term->list);
 		free(term);
 	}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index fe90a012c003..82262eb90de8 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -28,6 +28,7 @@ struct perf_pmu {
 struct perf_pmu_info {
 	const char *unit;
 	double scale;
+	bool per_pkg;
 };
 
 struct perf_pmu *perf_pmu__find(const char *name);
-- 
1.9.3



* [PATCH 04/11] perf: Make perf_cgroup_from_task() global
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (2 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 03/11] perf tools: Parse event per-package info files Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-10-07 18:51   ` Peter Zijlstra
  2014-09-24 14:04 ` [PATCH 05/11] perf: Add ->count() function to read per-package counters Matt Fleming
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

Move perf_cgroup_from_task() from kernel/events to include/ along with
the necessary struct definitions, so that it can be used by the PMU
code.

The upcoming Intel Cache Monitoring PMU driver assigns monitoring IDs
based on a task's association with a cgroup - all tasks in the same
cgroup share an ID. We can use perf_cgroup_from_task() to track this
association.
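
For readers unfamiliar with the container_of() idiom that
perf_cgroup_from_task() relies on, here is a userspace sketch with
heavily simplified stand-in structures. The *_sketch types, the 'rmid'
field and cgroup_from_task() are hypothetical and exist only for this
illustration; the real kernel structures carry far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace version of the kernel's container_of() idiom. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct cgroup_subsys_state. */
struct css_sketch {
	int id;
};

/* Stand-in for struct perf_cgroup, which embeds its css. */
struct perf_cgroup_sketch {
	int rmid;		/* hypothetical per-cgroup monitoring ID */
	struct css_sketch css;
};

/* Stand-in for a task that points at its cgroup's css. */
struct task_sketch {
	struct css_sketch *css;
};

/* Recover the enclosing cgroup object from the task's css pointer. */
struct perf_cgroup_sketch *cgroup_from_task(struct task_sketch *task)
{
	assert(task && task->css);
	return container_of(task->css, struct perf_cgroup_sketch, css);
}
```

Two tasks whose css pointers refer to the same embedded member resolve
to the same cgroup object, and therefore to the same ID.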

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 include/linux/perf_event.h | 30 ++++++++++++++++++++++++++++++
 kernel/events/core.c       | 28 +---------------------------
 2 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d07986f..1a4e2846d6fb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -53,6 +53,7 @@ struct perf_guest_info_callbacks {
 #include <linux/sysfs.h>
 #include <linux/perf_regs.h>
 #include <linux/workqueue.h>
+#include <linux/cgroup.h>
 #include <asm/local.h>
 
 struct perf_callchain_entry {
@@ -544,6 +545,35 @@ struct perf_output_handle {
 	int				page;
 };
 
+#ifdef CONFIG_CGROUP_PERF
+
+/*
+ * perf_cgroup_info keeps track of time_enabled for a cgroup.
+ * This is a per-cpu dynamically allocated data structure.
+ */
+struct perf_cgroup_info {
+	u64				time;
+	u64				timestamp;
+};
+
+struct perf_cgroup {
+	struct cgroup_subsys_state	css;
+	struct perf_cgroup_info	__percpu *info;
+};
+
+/*
+ * Must ensure cgroup is pinned (css_get) before calling
+ * this function. In other words, we cannot call this function
+ * if there is no cgroup event for the current CPU context.
+ */
+static inline struct perf_cgroup *
+perf_cgroup_from_task(struct task_struct *task)
+{
+	return container_of(task_css(task, perf_event_cgrp_id),
+			    struct perf_cgroup, css);
+}
+#endif /* CONFIG_CGROUP_PERF */
+
 #ifdef CONFIG_PERF_EVENTS
 
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 733c61636f0d..957cc13d4b1a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -34,11 +34,11 @@
 #include <linux/syscalls.h>
 #include <linux/anon_inodes.h>
 #include <linux/kernel_stat.h>
+#include <linux/cgroup.h>
 #include <linux/perf_event.h>
 #include <linux/ftrace_event.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/mm_types.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
@@ -351,32 +351,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
 
 #ifdef CONFIG_CGROUP_PERF
 
-/*
- * perf_cgroup_info keeps track of time_enabled for a cgroup.
- * This is a per-cpu dynamically allocated data structure.
- */
-struct perf_cgroup_info {
-	u64				time;
-	u64				timestamp;
-};
-
-struct perf_cgroup {
-	struct cgroup_subsys_state	css;
-	struct perf_cgroup_info	__percpu *info;
-};
-
-/*
- * Must ensure cgroup is pinned (css_get) before calling
- * this function. In other words, we cannot call this function
- * if there is no cgroup event for the current CPU context.
- */
-static inline struct perf_cgroup *
-perf_cgroup_from_task(struct task_struct *task)
-{
-	return container_of(task_css(task, perf_event_cgrp_id),
-			    struct perf_cgroup, css);
-}
-
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
-- 
1.9.3



* [PATCH 05/11] perf: Add ->count() function to read per-package counters
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (3 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 04/11] perf: Make perf_cgroup_from_task() global Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-24 14:04 ` [PATCH 06/11] perf: Move cgroup init before PMU ->event_init() Matt Fleming
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

For PMU drivers that record per-package counters, the ->count variable
cannot be used to record an accurate aggregated value, since it's not
possible to perform SMP cross-calls to cpus on other packages from the
context in which we update ->count.

Introduce a new optional ->count() accessor function that can be used to
customize how values are collected. If a PMU driver doesn't provide a
->count() function, we fall back to the existing code.

There is necessarily a window of staleness with this approach because
the task that generated the counter value may not have been scheduled by
the cpu recently.

An alternative and more complex approach would be to use a hrtimer to
periodically refresh the values from a more permissive scheduling
context. So, we're trading off complexity for accuracy.
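
The optional-callback dispatch described above can be sketched in
standalone C as follows. The *_sketch types and fixed_count() are
hypothetical, and plain integers stand in for the kernel's local64_t
and atomic64_t counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct event_sketch;

struct pmu_sketch {
	/* Optional: NULL means "use the default accounting". */
	uint64_t (*count)(struct event_sketch *event);
};

struct event_sketch {
	const struct pmu_sketch *pmu;
	uint64_t count;
	uint64_t child_count;
};

/* Default path: the event's own count plus its children's. */
uint64_t default_count(struct event_sketch *event)
{
	return event->count + event->child_count;
}

/* Prefer the driver's accessor when it provides one. */
uint64_t event_count(struct event_sketch *event)
{
	assert(event && event->pmu);
	if (event->pmu->count)
		return event->pmu->count(event);

	return default_count(event);
}

/* Hypothetical driver override that ignores the cached fields. */
uint64_t fixed_count(struct event_sketch *event)
{
	(void)event;
	return 42;
}
```

Drivers that leave the pointer NULL see no change in behaviour, which
is why existing PMUs do not need to be touched.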

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 include/linux/perf_event.h | 10 ++++++++++
 kernel/events/core.c       |  5 ++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a4e2846d6fb..1bf06b6fd5dc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -264,6 +264,11 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Return the count value for a counter.
+	 */
+	u64 (*count)			(struct perf_event *event); /*optional*/
 };
 
 /**
@@ -743,6 +748,11 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
 		__perf_event_task_sched_out(prev, next);
 }
 
+static inline u64 __perf_event_count(struct perf_event *event)
+{
+	return local64_read(&event->count) + atomic64_read(&event->child_count);
+}
+
 extern void perf_event_mmap(struct vm_area_struct *vma);
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
 extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 957cc13d4b1a..7925f21cee7f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3052,7 +3052,10 @@ static void __perf_event_read(void *info)
 
 static inline u64 perf_event_count(struct perf_event *event)
 {
-	return local64_read(&event->count) + atomic64_read(&event->child_count);
+	if (event->pmu->count)
+		return event->pmu->count(event);
+
+	return __perf_event_count(event);
 }
 
 static u64 perf_event_read(struct perf_event *event)
-- 
1.9.3



* [PATCH 06/11] perf: Move cgroup init before PMU ->event_init()
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (4 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 05/11] perf: Add ->count() function to read per-package counters Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-10-07 19:34   ` Peter Zijlstra
  2014-09-24 14:04 ` [PATCH 07/11] x86: Add support for Intel Cache QoS Monitoring (CQM) detection Matt Fleming
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

The Intel QoS PMU needs to know whether an event is part of a cgroup
during ->event_init(), because tasks in the same cgroup share a
monitoring ID.

Move the cgroup initialisation before calling into the PMU driver.

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 kernel/events/core.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7925f21cee7f..8a07f51a23ff 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6859,7 +6859,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		 struct perf_event *group_leader,
 		 struct perf_event *parent_event,
 		 perf_overflow_handler_t overflow_handler,
-		 void *context)
+		 void *context, bool cgroup, pid_t pid)
 {
 	struct pmu *pmu;
 	struct perf_event *event;
@@ -6952,6 +6952,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
 		goto err_ns;
 
+	if (cgroup) {
+		err = perf_cgroup_connect(pid, event, attr, group_leader);
+		if (err)
+			goto err_ns;
+	}
+
 	pmu = perf_init_event(event);
 	if (!pmu)
 		goto err_ns;
@@ -6975,6 +6981,8 @@ err_pmu:
 		event->destroy(event);
 	module_put(pmu->module);
 err_ns:
+	if (is_cgroup_event(event))
+		perf_detach_cgroup(event);
 	if (event->ns)
 		put_pid_ns(event->ns);
 	kfree(event);
@@ -7178,6 +7186,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	struct fd group = {NULL, 0};
 	struct task_struct *task = NULL;
 	struct pmu *pmu;
+	bool cgroup = false;
 	int event_fd;
 	int move_group = 0;
 	int err;
@@ -7247,21 +7256,16 @@ SYSCALL_DEFINE5(perf_event_open,
 
 	get_online_cpus();
 
+	if (flags & PERF_FLAG_PID_CGROUP)
+		cgroup = true;
+
 	event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
-				 NULL, NULL);
+				 NULL, NULL, cgroup, pid);
 	if (IS_ERR(event)) {
 		err = PTR_ERR(event);
 		goto err_cpus;
 	}
 
-	if (flags & PERF_FLAG_PID_CGROUP) {
-		err = perf_cgroup_connect(pid, event, &attr, group_leader);
-		if (err) {
-			__free_event(event);
-			goto err_cpus;
-		}
-	}
-
 	if (is_sampling_event(event)) {
 		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
 			err = -ENOTSUPP;
@@ -7461,7 +7465,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 	 */
 
 	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
-				 overflow_handler, context);
+				 overflow_handler, context, false, -1);
 	if (IS_ERR(event)) {
 		err = PTR_ERR(event);
 		goto err;
@@ -7792,7 +7796,7 @@ inherit_event(struct perf_event *parent_event,
 					   parent_event->cpu,
 					   child,
 					   group_leader, parent_event,
-				           NULL, NULL);
+				           NULL, NULL, false, -1);
 	if (IS_ERR(child_event))
 		return child_event;
 
-- 
1.9.3



* [PATCH 07/11] x86: Add support for Intel Cache QoS Monitoring (CQM) detection
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (5 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 06/11] perf: Move cgroup init before PMU ->event_init() Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-24 14:04 ` [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support Matt Fleming
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin,
	Peter P Waskiewicz Jr, Matt Fleming

From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>

This patch adds support for the new Cache QoS Monitoring (CQM)
feature found in future Intel Xeon processors.  It adds the new
values for tracking CQM resources to the cpuinfo_x86 structure,
plus the CPUID detection routines for CQM.

CQM allows a process, or set of processes, to be tracked by the CPU
to determine the cache usage of that task group.  Software can
read this data from the CPU and report cache usage and occupancy
for a particular process, or group of processes.

More information about Cache QoS Monitoring can be found in the
Intel (R) x86 Architecture Software Developer Manual, section 17.14.
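
The detection logic described above can be sketched outside the kernel as
a pure decoding function. This is only an illustration under stated
assumptions: struct cpuid_regs, struct cqm_info and cqm_decode() are
hypothetical names that do not appear in the patch, and the sketch
reports a scale of -1 when occupancy monitoring is absent, whereas the
patch leaves x86_cache_occ_scale untouched in that case. The bit
positions follow the feature definitions in the diff below (EDX bit 1 of
sub-leaf 0, EDX bit 0 of sub-leaf 1).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for raw CPUID output; not a kernel type. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical result type mirroring the two new cpuinfo_x86 fields. */
struct cqm_info {
	bool has_llc_qos;	/* leaf 0xF, sub-leaf 0, EDX bit 1 */
	bool has_llc_occup;	/* leaf 0xF, sub-leaf 1, EDX bit 0 */
	int max_rmid;		/* -1 if CQM is absent */
	int occ_scale;		/* counter units -> bytes, -1 if unknown */
};

/* Decode the register values returned for CPUID leaf 0xF, sub-leaves 0 and 1. */
static struct cqm_info cqm_decode(struct cpuid_regs sub0, struct cpuid_regs sub1)
{
	struct cqm_info info = { false, false, -1, -1 };

	info.has_llc_qos = (sub0.edx >> 1) & 1;
	if (!info.has_llc_qos)
		return info;

	/* Overridden below if occupancy monitoring exists. */
	info.max_rmid = (int)sub0.ebx;

	info.has_llc_occup = sub1.edx & 1;
	if (info.has_llc_occup) {
		info.max_rmid = (int)sub1.ecx;
		info.occ_scale = (int)sub1.ebx;
	}

	return info;
}
```

Keeping the decoding separate from the CPUID instruction itself means the
logic can be exercised with made-up register values on any machine.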

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 arch/x86/include/asm/cpufeature.h |  9 ++++++++-
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/kernel/cpu/common.c      | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index bb9b258d60e7..dfc795ec3320 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -8,7 +8,7 @@
 #include <asm/required-features.h>
 #endif
 
-#define NCAPINTS	11	/* N 32-bit words worth of info */
+#define NCAPINTS	13	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -215,6 +215,7 @@
 #define X86_FEATURE_ERMS	( 9*32+ 9) /* Enhanced REP MOVSB/STOSB */
 #define X86_FEATURE_INVPCID	( 9*32+10) /* Invalidate Processor Context ID */
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
+#define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
@@ -231,6 +232,12 @@
 #define X86_FEATURE_XGETBV1	(10*32+ 2) /* XGETBV with ECX = 1 */
 #define X86_FEATURE_XSAVES	(10*32+ 3) /* XSAVES/XRSTORS */
 
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:0 (edx), word 11 */
+#define X86_FEATURE_CQM_LLC	(11*32+ 1) /* LLC QoS if 1 */
+
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
+#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec794732..2c3a46af3334 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -109,6 +109,9 @@ struct cpuinfo_x86 {
 	/* in KB - valid for CPUS which support this call: */
 	int			x86_cache_size;
 	int			x86_cache_alignment;	/* In bytes */
+	/* Cache QoS architectural values: */
+	int			x86_cache_max_rmid;	/* max index */
+	int			x86_cache_occ_scale;	/* scale to bytes */
 	int			x86_power;
 	unsigned long		loops_per_jiffy;
 	/* cpuid returned max cores value: */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e4ab2b42bd6f..4e00b578ea2a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -642,6 +642,30 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		c->x86_capability[10] = eax;
 	}
 
+	/* Additional Intel-defined flags: level 0x0000000F */
+	if (c->cpuid_level >= 0x0000000F) {
+		u32 eax, ebx, ecx, edx;
+
+		/* QoS sub-leaf, EAX=0Fh, ECX=0 */
+		cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[11] = edx;
+		if (cpu_has(c, X86_FEATURE_CQM_LLC)) {
+			/* will be overridden if occupancy monitoring exists */
+			c->x86_cache_max_rmid = ebx;
+
+			/* QoS sub-leaf, EAX=0Fh, ECX=1 */
+			cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
+			c->x86_capability[12] = edx;
+			if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+				c->x86_cache_max_rmid = ecx;
+				c->x86_cache_occ_scale = ebx;
+			}
+		} else {
+			c->x86_cache_max_rmid = -1;
+			c->x86_cache_occ_scale = -1;
+		}
+	}
+
 	/* AMD-defined flags: level 0x80000001 */
 	xlvl = cpuid_eax(0x80000000);
 	c->extended_cpuid_level = xlvl;
@@ -830,6 +854,20 @@ static void generic_identify(struct cpuinfo_x86 *c)
 	detect_nopl(c);
 }
 
+static void x86_init_cache_qos(struct cpuinfo_x86 *c)
+{
+	/*
+	 * The heavy lifting of max_rmid and cache_occ_scale are handled
+	 * in get_cpu_cap().  Here we just set the max_rmid for the boot_cpu
+	 * in case CQM bits really aren't there in this CPU.
+	 */
+	if (c != &boot_cpu_data) {
+		boot_cpu_data.x86_cache_max_rmid =
+			min(boot_cpu_data.x86_cache_max_rmid,
+			    c->x86_cache_max_rmid);
+	}
+}
+
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
@@ -919,6 +957,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	init_hypervisor(c);
 	x86_init_rdrand(c);
+	x86_init_cache_qos(c);
 
 	/*
 	 * Clear/Set all flags overriden by options, need do it
-- 
1.9.3



* [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (6 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 07/11] x86: Add support for Intel Cache QoS Monitoring (CQM) detection Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-24 16:40   ` Andi Kleen
  2014-10-07 19:43   ` Peter Zijlstra
  2014-09-24 14:04 ` [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM Matt Fleming
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

Future Intel Xeon processors support a Cache QoS Monitoring feature that
allows tracking of the LLC occupancy for a task or task group, i.e. the
amount of data pulled into the LLC for the task (group).

Currently the PMU only supports per-cpu events. We create an event for
each cpu and read out all the LLC occupancy values.

Because this results in duplicate values being written out to userspace,
we also export a .per-pkg event file so that the perf tools only
accumulate values for one cpu per package.
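
As a rough illustration of the per-package accumulation described above,
the following userspace sketch counts only the first cpu seen in each
package. It is not the perf tools implementation: sum_per_pkg(), the
pkg_of_cpu array and the MAX_PKGS bound are all assumptions made for the
example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PKGS 64	/* illustrative bound, not taken from the patch */

/*
 * Sum per-cpu readings, counting only the first cpu seen in each package.
 * pkg_of_cpu[cpu] is the (hypothetical) package id of that cpu and
 * val[cpu] is the value read for it.
 */
static uint64_t sum_per_pkg(const int *pkg_of_cpu, const uint64_t *val,
			    size_t ncpus)
{
	bool seen[MAX_PKGS] = { false };
	uint64_t total = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		int pkg = pkg_of_cpu[cpu];

		/* Skip out-of-range ids and packages already counted. */
		if (pkg < 0 || pkg >= MAX_PKGS || seen[pkg])
			continue;

		seen[pkg] = true;
		total += val[cpu];
	}

	return total;
}
```

With two packages of two cpus each, where every cpu in a package reports
the same occupancy, only one value per package contributes to the total.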

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 arch/x86/kernel/cpu/Makefile               |   2 +-
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 502 +++++++++++++++++++++++++++++
 include/linux/perf_event.h                 |   7 +
 3 files changed, 510 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_cqm.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 7e1fd4e08552..8abb18fbcd13 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -38,7 +38,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o perf_event_intel_cqm.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
new file mode 100644
index 000000000000..acf687fd7e2e
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -0,0 +1,502 @@
+/*
+ * Intel Cache Quality-of-Service Monitoring (CQM) support.
+ *
+ * Based very, very heavily on work by Peter Zijlstra.
+ */
+
+#include <linux/perf_event.h>
+#include <linux/slab.h>
+#include "perf_event.h"
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+#define MSR_IA32_QM_CTR		0x0c8e
+#define MSR_IA32_QM_EVTSEL	0x0c8d
+
+static unsigned int cqm_max_rmid = -1;
+static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+
+struct intel_cqm_state {
+	raw_spinlock_t		lock;
+	int			rmid;
+	int 			cnt;
+};
+
+static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+
+/*
+ * Protects cache_cgroups.
+ */
+static DEFINE_MUTEX(cache_mutex);
+
+/*
+ * Groups of events that have the same target(s), one RMID per group.
+ */
+static LIST_HEAD(cache_groups);
+
+/*
+ * Mask of CPUs for reading CQM values. We only need one per-socket.
+ */
+static cpumask_t cqm_cpumask;
+
+#define RMID_VAL_ERROR		(1ULL << 63)
+#define RMID_VAL_UNAVAIL	(1ULL << 62)
+
+#define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
+
+#define QOS_EVENT_MASK	QOS_L3_OCCUP_EVENT_ID
+
+static u64 __rmid_read(unsigned long rmid)
+{
+	u64 val;
+
+	/*
+	 * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
+	 * it just says that to increase confusion.
+	 */
+	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
+	rdmsrl(MSR_IA32_QM_CTR, val);
+
+	/*
+	 * Aside from the ERROR and UNAVAIL bits, assume this thing returns
+	 * the number of cachelines tagged with @rmid.
+	 */
+	return val;
+}
+
+static unsigned long *cqm_rmid_bitmap;
+
+/*
+ * Returns < 0 on fail.
+ */
+static int __get_rmid(void)
+{
+	return bitmap_find_free_region(cqm_rmid_bitmap, cqm_max_rmid, 0);
+}
+
+static void __put_rmid(int rmid)
+{
+	bitmap_release_region(cqm_rmid_bitmap, rmid, 0);
+}
+
+static int intel_cqm_setup_rmid_cache(void)
+{
+	cqm_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(cqm_max_rmid), GFP_KERNEL);
+	if (!cqm_rmid_bitmap)
+		return -ENOMEM;
+
+	bitmap_zero(cqm_rmid_bitmap, cqm_max_rmid);
+
+	/*
+	 * RMID 0 is special and is always allocated. It's used for all
+	 * tasks that are not monitored.
+	 */
+	bitmap_allocate_region(cqm_rmid_bitmap, 0, 0);
+
+	return 0;
+}
+
+/*
+ * Determine if @a and @b measure the same set of tasks.
+ */
+static bool __match_event(struct perf_event *a, struct perf_event *b)
+{
+	if ((a->attach_state & PERF_ATTACH_TASK) !=
+	    (b->attach_state & PERF_ATTACH_TASK))
+		return false;
+
+	/* not task */
+
+	return true; /* if not task, we're machine wide */
+}
+
+/*
+ * Determine if @a's tasks intersect with @b's tasks
+ */
+static bool __conflict_event(struct perf_event *a, struct perf_event *b)
+{
+	/*
+	 * If one of them is not a task, same story as above with cgroups.
+	 */
+	if (!(a->attach_state & PERF_ATTACH_TASK) ||
+	    !(b->attach_state & PERF_ATTACH_TASK))
+		return true;
+
+	/*
+	 * Must be non-overlapping.
+	 */
+	return false;
+}
+
+/*
+ * Find a group and setup RMID.
+ *
+ * If we're part of a group, we use the group's RMID.
+ */
+static int intel_cqm_setup_event(struct perf_event *event,
+				 struct perf_event **group)
+{
+	struct perf_event *iter;
+	int rmid;
+
+	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+		if (__match_event(iter, event)) {
+			/* All tasks in a group share an RMID */
+			event->hw.cqm_rmid = iter->hw.cqm_rmid;
+			*group = iter;
+			return 0;
+		}
+
+		if (__conflict_event(iter, event))
+			return -EBUSY;
+	}
+
+	rmid = __get_rmid();
+	if (rmid < 0)
+		return rmid;
+
+	event->hw.cqm_rmid = rmid;
+	return 0;
+}
+
+static void intel_cqm_event_read(struct perf_event *event)
+{
+	unsigned long rmid = event->hw.cqm_rmid;
+	u64 val;
+
+	val = __rmid_read(rmid);
+
+	/*
+	 * Ignore this reading on error states and do not update the value.
+	 */
+	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+		return;
+
+	val *= cqm_l3_scale; /* cachelines -> bytes */
+
+	local64_set(&event->count, val);
+}
+
+static void intel_cqm_event_start(struct perf_event *event, int mode)
+{
+	struct intel_cqm_state *state = &__get_cpu_var(cqm_state);
+	unsigned long rmid = event->hw.cqm_rmid;
+	unsigned long flags;
+
+	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
+		return;
+
+	event->hw.cqm_state &= ~PERF_HES_STOPPED;
+
+	raw_spin_lock_irqsave(&state->lock, flags);
+
+	if (state->cnt++)
+		WARN_ON_ONCE(state->rmid != rmid);
+	else
+		WARN_ON_ONCE(state->rmid);
+
+	state->rmid = rmid;
+	wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+
+	raw_spin_unlock_irqrestore(&state->lock, flags);
+}
+
+static void intel_cqm_event_stop(struct perf_event *event, int mode)
+{
+	struct intel_cqm_state *state = &__get_cpu_var(cqm_state);
+	unsigned long flags;
+
+	if (event->hw.cqm_state & PERF_HES_STOPPED)
+		return;
+
+	event->hw.cqm_state |= PERF_HES_STOPPED;
+
+	raw_spin_lock_irqsave(&state->lock, flags);
+	intel_cqm_event_read(event);
+
+	if (!--state->cnt) {
+		state->rmid = 0;
+		wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+	} else {
+		WARN_ON_ONCE(!state->rmid);
+	}
+
+	raw_spin_unlock_irqrestore(&state->lock, flags);
+}
+
+static int intel_cqm_event_add(struct perf_event *event, int mode)
+{
+	int rmid;
+
+	event->hw.cqm_state = PERF_HES_STOPPED;
+	rmid = event->hw.cqm_rmid;
+	WARN_ON_ONCE(!rmid);
+
+	if (mode & PERF_EF_START)
+		intel_cqm_event_start(event, mode);
+
+	return 0;
+}
+
+static void intel_cqm_event_del(struct perf_event *event, int mode)
+{
+	intel_cqm_event_stop(event, mode);
+}
+
+static void intel_cqm_event_destroy(struct perf_event *event)
+{
+	struct perf_event *group_other = NULL;
+
+	mutex_lock(&cache_mutex);
+
+	/*
+	 * If there's another event in this group...
+	 */
+	if (!list_empty(&event->hw.cqm_group_entry)) {
+		group_other = list_first_entry(&event->hw.cqm_group_entry,
+					       struct perf_event,
+					       hw.cqm_group_entry);
+		list_del(&event->hw.cqm_group_entry);
+	}
+
+	/*
+	 * And we're the group leader..
+	 */
+	if (!list_empty(&event->hw.cqm_groups_entry)) {
+		/*
+		 * If there was a group_other, make that leader, otherwise
+		 * destroy the group and return the RMID.
+		 */
+		if (group_other) {
+			list_replace(&event->hw.cqm_groups_entry,
+				     &group_other->hw.cqm_groups_entry);
+		} else {
+			int rmid = event->hw.cqm_rmid;
+			__put_rmid(rmid);
+			list_del(&event->hw.cqm_groups_entry);
+		}
+	}
+
+	mutex_unlock(&cache_mutex);
+}
+
+static struct pmu intel_cqm_pmu;
+
+/*
+ * XXX there's a bit of a problem in that we cannot simply do the one
+ * event per node as one would want, since that one event would only get
+ * scheduled on the one cpu. But we want to 'schedule' the RMID on all
+ * CPUs.
+ *
+ * This means we want events for each CPU, however, that generates a lot
+ * of duplicate values out to userspace -- this is not to be helped
+ * unless we want to change the core code in some way. For more info,
+ * see intel_cqm_event_read().
+ */
+static int intel_cqm_event_init(struct perf_event *event)
+{
+	struct perf_event *group = NULL;
+	int err;
+
+	if (event->attr.type != intel_cqm_pmu.type)
+		return -ENOENT;
+
+	if (event->attr.config & ~QOS_EVENT_MASK)
+		return -EINVAL;
+
+	if (event->cpu == -1)
+		return -EINVAL;
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
+	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
+
+	event->destroy = intel_cqm_event_destroy;
+
+	mutex_lock(&cache_mutex);
+
+	err = intel_cqm_setup_event(event, &group); /* will also set rmid */
+	if (err)
+		goto out;
+
+	if (group) {
+		list_add_tail(&event->hw.cqm_group_entry,
+			      &group->hw.cqm_group_entry);
+	} else {
+		list_add_tail(&event->hw.cqm_groups_entry,
+			      &cache_groups);
+	}
+
+out:
+	mutex_unlock(&cache_mutex);
+	return err;
+}
+
+EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
+EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
+
+static struct attribute *intel_cqm_events_attr[] = {
+	EVENT_PTR(intel_cqm_llc),
+	EVENT_PTR(intel_cqm_llc_pkg),
+	NULL,
+};
+
+static struct attribute_group intel_cqm_events_group = {
+	.name = "events",
+	.attrs = intel_cqm_events_attr,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+static struct attribute *intel_cqm_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group intel_cqm_format_group = {
+	.name = "format",
+	.attrs = intel_cqm_formats_attr,
+};
+
+static const struct attribute_group *intel_cqm_attr_groups[] = {
+	&intel_cqm_events_group,
+	&intel_cqm_format_group,
+	NULL,
+};
+
+static struct pmu intel_cqm_pmu = {
+	.attr_groups	= intel_cqm_attr_groups,
+	.task_ctx_nr	= perf_sw_context,
+	.event_init	= intel_cqm_event_init,
+	.add		= intel_cqm_event_add,
+	.del		= intel_cqm_event_del,
+	.start		= intel_cqm_event_start,
+	.stop		= intel_cqm_event_stop,
+	.read		= intel_cqm_event_read,
+};
+
+static inline void cqm_pick_event_reader(int cpu)
+{
+	int phys_id = topology_physical_package_id(cpu);
+	int i;
+
+	for_each_cpu(i, &cqm_cpumask) {
+		if (phys_id == topology_physical_package_id(i))
+			return;	/* already got reader for this socket */
+	}
+
+	cpumask_set_cpu(cpu, &cqm_cpumask);
+}
+
+static void intel_cqm_cpu_prepare(unsigned int cpu)
+{
+	struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+	raw_spin_lock_init(&state->lock);
+	state->rmid = 0;
+
+	WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
+	WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
+}
+
+static void intel_cqm_cpu_exit(unsigned int cpu)
+{
+	int phys_id = topology_physical_package_id(cpu);
+	int i;
+
+	/*
+	 * Is @cpu a designated cqm reader?
+	 */
+	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
+		return;
+
+	for_each_online_cpu(i) {
+		if (i == cpu)
+			continue;
+
+		if (phys_id == topology_physical_package_id(i)) {
+			cpumask_set_cpu(i, &cqm_cpumask);
+			break;
+		}
+	}
+}
+
+static int intel_cqm_cpu_notifier(struct notifier_block *nb,
+				  unsigned long action, void *hcpu)
+{
+	unsigned int cpu  = (unsigned long)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		intel_cqm_cpu_prepare(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		intel_cqm_cpu_exit(cpu);
+		break;
+	case CPU_STARTING:
+		cqm_pick_event_reader(cpu);
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static int __init intel_cqm_init(void)
+{
+	int i, cpu, ret;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_CQM_OCCUP_LLC))
+		return -ENODEV;
+
+	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
+
+	/*
+	 * It's possible that not all resources support the same number
+	 * of RMIDs. Instead of making scheduling much more complicated
+	 * (where we have to match a task's RMID to a cpu that supports
+	 * that many RMIDs) just find the minimum RMIDs supported across
+	 * all cpus.
+	 *
+	 * Also, check that the scales match on all cpus.
+	 */
+	for_each_online_cpu(cpu) {
+		struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+		if (c->x86_cache_max_rmid < cqm_max_rmid)
+			cqm_max_rmid = c->x86_cache_max_rmid;
+
+		if (c->x86_cache_occ_scale != cqm_l3_scale) {
+			pr_err("Multiple LLC scale values, disabling\n");
+			return -EINVAL;
+		}
+	}
+
+	ret = intel_cqm_setup_rmid_cache();
+	if (ret)
+		return ret;
+
+	for_each_online_cpu(i) {
+		intel_cqm_cpu_prepare(i);
+		cqm_pick_event_reader(i);
+	}
+
+	__perf_cpu_notifier(intel_cqm_cpu_notifier);
+
+	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
+
+	if (ret)
+		pr_err("Intel CQM perf registration failed: %d\n", ret);
+	else
+		pr_info("Intel CQM monitoring enabled\n");
+
+	return ret;
+}
+device_initcall(intel_cqm_init);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1bf06b6fd5dc..50ceb0665cda 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -128,6 +128,13 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+		struct { /* intel_cqm */
+			int			cqm_state;
+			int			cqm_rmid;
+			struct list_head	cqm_events_entry;
+			struct list_head	cqm_groups_entry;
+			struct list_head	cqm_group_entry;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*
-- 
1.9.3



* [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (7 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-10-08  9:51   ` Peter Zijlstra
  2014-09-24 14:04 ` [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM Matt Fleming
  2014-09-24 14:04 ` [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs Matt Fleming
  10 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

It's possible to run into issues with re-using unused monitoring IDs
because there may be stale cachelines associated with that ID from a
previous allocation. This can cause the LLC occupancy values to be
inaccurate.

To attempt to mitigate this problem we place the IDs on a least recently
used list, essentially a FIFO. The basic idea is that the longer the
time period between ID re-use the lower the probability that stale
cachelines exist in the cache.
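
The FIFO behaviour can be sketched with a small ring buffer. This is a
simplified stand-in for the list-based code in the diff below, written
only to show the ordering: struct id_fifo, id_fifo_init(), id_get(),
id_put() and NR_IDS are hypothetical names, and no locking is modelled.

```c
#include <stddef.h>

#define NR_IDS 4	/* illustrative; real hardware reports its own maximum */

/* Free IDs in least-recently-used order, oldest at 'head'. */
struct id_fifo {
	int slot[NR_IDS];
	size_t head;	/* index of the oldest free ID */
	size_t count;	/* number of free IDs */
};

static void id_fifo_init(struct id_fifo *f)
{
	f->head = 0;
	f->count = 0;

	/* ID 0 is reserved for unmonitored tasks, so it is never free. */
	for (int id = 1; id < NR_IDS; id++)
		f->slot[f->count++] = id;
}

/* Take the ID that has been free the longest; returns -1 if none is free. */
static int id_get(struct id_fifo *f)
{
	int id;

	if (f->count == 0)
		return -1;

	id = f->slot[f->head];
	f->head = (f->head + 1) % NR_IDS;
	f->count--;

	return id;
}

/* Return an ID; it goes to the tail so it is reused last. */
static void id_put(struct id_fifo *f, int id)
{
	if (f->count == NR_IDS)
		return;	/* full: caller returned an ID it never owned */

	f->slot[(f->head + f->count) % NR_IDS] = id;
	f->count++;
}
```

An ID that was just released is handed out again only after every other
free ID has been used, which maximises the time between re-use.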

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 100 ++++++++++++++++++++++++++---
 1 file changed, 92 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index acf687fd7e2e..a8c1e40b32b9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -24,7 +24,7 @@ struct intel_cqm_state {
 static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
 
 /*
- * Protects cache_cgroups.
+ * Protects cache_groups and cqm_rmid_lru.
  */
 static DEFINE_MUTEX(cache_mutex);
 
@@ -63,36 +63,120 @@ static u64 __rmid_read(unsigned long rmid)
 	return val;
 }
 
-static unsigned long *cqm_rmid_bitmap;
+struct cqm_rmid_entry {
+	u64 rmid;
+	struct list_head list;
+};
+
+/*
+ * A least recently used list of RMIDs.
+ *
+ * Oldest entry at the head, newest (most recently used) entry at the
+ * tail. This list is never traversed, it's only used to keep track of
+ * the lru order. That is, we only pick entries off the head or insert
+ * them on the tail.
+ *
+ * All entries on the list are 'free', and their RMIDs are not currently
+ * in use. To mark an RMID as in use, remove its entry from the lru
+ * list.
+ *
+ * This list is protected by cache_mutex.
+ */
+static LIST_HEAD(cqm_rmid_lru);
+
+/*
+ * We use a simple array of pointers so that we can lookup a struct
+ * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
+ * and __put_rmid() from having to worry about dealing with struct
+ * cqm_rmid_entry - they just deal with rmids, i.e. integers.
+ *
+ * Once this array is initialized it is read-only. No locks are required
+ * to access it.
+ *
+ * All entries for all RMIDs can be looked up in this array at all
+ * times.
+ */
+static struct cqm_rmid_entry **cqm_rmid_ptrs;
+
+static inline struct cqm_rmid_entry *__rmid_entry(int rmid)
+{
+	struct cqm_rmid_entry *entry;
+
+	entry = cqm_rmid_ptrs[rmid];
+	WARN_ON(entry->rmid != rmid);
+
+	return entry;
+}
 
 /*
  * Returns < 0 on fail.
+ *
+ * We expect to be called with cache_mutex held.
  */
 static int __get_rmid(void)
 {
-	return bitmap_find_free_region(cqm_rmid_bitmap, cqm_max_rmid, 0);
+	struct cqm_rmid_entry *entry;
+
+	lockdep_assert_held(&cache_mutex);
+
+	if (list_empty(&cqm_rmid_lru))
+		return -EAGAIN;
+
+	entry = list_first_entry(&cqm_rmid_lru, struct cqm_rmid_entry, list);
+	list_del(&entry->list);
+
+	return entry->rmid;
 }
 
 static void __put_rmid(int rmid)
 {
-	bitmap_release_region(cqm_rmid_bitmap, rmid, 0);
+	struct cqm_rmid_entry *entry;
+
+	lockdep_assert_held(&cache_mutex);
+
+	entry = __rmid_entry(rmid);
+
+	list_add_tail(&entry->list, &cqm_rmid_lru);
 }
 
 static int intel_cqm_setup_rmid_cache(void)
 {
-	cqm_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(cqm_max_rmid), GFP_KERNEL);
-	if (!cqm_rmid_bitmap)
+	struct cqm_rmid_entry *entry;
+	int r;
+
+	cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
+				(cqm_max_rmid + 1), GFP_KERNEL);
+	if (!cqm_rmid_ptrs)
 		return -ENOMEM;
 
-	bitmap_zero(cqm_rmid_bitmap, cqm_max_rmid);
+	for (r = 0; r <= cqm_max_rmid; r++) {
+		struct cqm_rmid_entry *entry;
+
+		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+		if (!entry)
+			goto fail;
+
+		INIT_LIST_HEAD(&entry->list);
+		entry->rmid = r;
+		cqm_rmid_ptrs[r] = entry;
+
+		list_add_tail(&entry->list, &cqm_rmid_lru);
+	}
 
 	/*
 	 * RMID 0 is special and is always allocated. It's used for all
 	 * tasks that are not monitored.
 	 */
-	bitmap_allocate_region(cqm_rmid_bitmap, 0, 0);
+	entry = __rmid_entry(0);
+	list_del(&entry->list);
 
 	return 0;
+fail:
+	while (r--)
+		kfree(cqm_rmid_ptrs[r]);
+
+	kfree(cqm_rmid_ptrs);
+	return -ENOMEM;
 }
 
 /*
-- 
1.9.3



* [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (8 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-09-24 14:45   ` Matt Fleming
                     ` (2 more replies)
  2014-09-24 14:04 ` [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs Matt Fleming
  10 siblings, 3 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

From: Matt Fleming <matt.fleming@intel.com>

Add support for task events as well as system-wide events. This change
has a big impact on the way that we gather LLC occupancy values in
intel_cqm_event_read().

Currently, for system-wide (per-cpu) events we defer processing to
userspace which knows how to discard all but one cpu result per package.

Things aren't so simple for task events because we need to do the value
aggregation ourselves. To do this, we defer updating the LLC occupancy
value in event->count from intel_cqm_event_read() and do an SMP
cross-call to read values for all packages in intel_cqm_event_count().
We need to ensure that we only do this for one task event per cache
group, otherwise we'll report duplicate values.
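
The aggregation step can be illustrated with a plain function that sums
one raw reading per package. This is a sketch, not the driver code: in
the patch each reading comes from an SMP cross-call, whereas here the
readings are assumed to be already collected into an array, and
cqm_sum_packages() is a hypothetical name. The error bits and the scaling
match the definitions used by the driver.

```c
#include <stddef.h>
#include <stdint.h>

#define RMID_VAL_ERROR		(1ULL << 63)
#define RMID_VAL_UNAVAIL	(1ULL << 62)

/*
 * Sum one raw counter reading per package, converting cachelines to
 * bytes with 'scale'. Readings flagged as error or unavailable are
 * skipped rather than added.
 */
static uint64_t cqm_sum_packages(const uint64_t *raw, size_t npkgs,
				 uint32_t scale)
{
	uint64_t total = 0;

	for (size_t i = 0; i < npkgs; i++) {
		if (raw[i] & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
			continue;

		total += raw[i] * scale;
	}

	return total;
}
```

Because every package contributes to the same total, running this for
more than one event in a cache group would count the same cachelines
repeatedly, which is why only one task event per group performs it.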

If we're a system-wide event we want to fall back to the default
perf_event_count() implementation. Refactor this into a common function
so that we don't duplicate the code.

Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
driver.  This task information is only available in the upper layers of
the perf infrastructure.

Other perf backends stash the target task in event->hw.*target so we
need to do something similar. The task is used to determine whether
events should share a cache group and an RMID.

All tasks in a thread group are tracked together.

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 198 +++++++++++++++++++++++++----
 include/linux/perf_event.h                 |   1 +
 include/uapi/linux/perf_event.h            |   1 +
 kernel/events/core.c                       |   2 +
 4 files changed, 180 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index a8c1e40b32b9..d40c29e6eeff 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -181,23 +181,124 @@ fail:
 
 /*
  * Determine if @a and @b measure the same set of tasks.
+ *
+ * If @a and @b measure the same set of tasks then we want to share a
+ * single RMID.
  */
 static bool __match_event(struct perf_event *a, struct perf_event *b)
 {
+	/* Per-cpu and task events don't mix */
 	if ((a->attach_state & PERF_ATTACH_TASK) !=
 	    (b->attach_state & PERF_ATTACH_TASK))
 		return false;
 
-	/* not task */
+#ifdef CONFIG_CGROUP_PERF
+	if (a->cgrp != b->cgrp)
+		return false;
+#endif
+
+	/* If not task event, we're machine wide */
+	if (!(b->attach_state & PERF_ATTACH_TASK))
+		return true;
+
+	/*
+	 * Events that target same task are placed into the same cache group.
+	 */
+	if (a->hw.cqm_target == b->hw.cqm_target)
+		return true;
+
+	/*
+	 * Are we an inherited event?
+	 */
+	if (b->parent == a)
+		return true;
 
-	return true; /* if not task, we're machine wide */
+	return false;
+}
+
+#ifdef CONFIG_CGROUP_PERF
+static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
+{
+	if (event->attach_state & PERF_ATTACH_TASK)
+		return perf_cgroup_from_task(event->hw.cqm_target);
+
+	return event->cgrp;
 }
+#endif
 
 /*
  * Determine if @a's tasks intersect with @b's tasks
+ *
+ * There are combinations of events that we explicitly prohibit,
+ *
+ *		   PROHIBITS
+ *     system-wide    -> 	cgroup and task
+ *     cgroup 	      ->	system-wide
+ *     		      ->	task in cgroup
+ *     task 	      -> 	system-wide
+ *     		      ->	task in cgroup
+ *
+ * Call this function before allocating an RMID.
  */
 static bool __conflict_event(struct perf_event *a, struct perf_event *b)
 {
+#ifdef CONFIG_CGROUP_PERF
+	/*
+	 * We can have any number of cgroups but only one system-wide
+	 * event at a time.
+	 */
+	if (a->cgrp && b->cgrp) {
+		struct perf_cgroup *ac = a->cgrp;
+		struct perf_cgroup *bc = b->cgrp;
+
+		/*
+		 * This condition should have been caught in
+		 * __match_event() and we should be sharing an RMID.
+		 */
+		WARN_ON_ONCE(ac == bc);
+
+		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
+		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
+			return true;
+
+		return false;
+	}
+
+	if (a->cgrp || b->cgrp) {
+		struct perf_cgroup *ac, *bc;
+
+		/*
+		 * cgroup and system-wide events are mutually exclusive
+		 */
+		if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
+		    (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
+			return true;
+
+		/*
+		 * Ensure neither event is part of the other's cgroup
+		 */
+		ac = event_to_cgroup(a);
+		bc = event_to_cgroup(b);
+		if (ac == bc)
+			return true;
+
+		/*
+		 * Must have cgroup and non-intersecting task events.
+		 */
+		if (!ac || !bc)
+			return false;
+
+		/*
+		 * We have cgroup and task events, and the task belongs
+		 * to a cgroup. Check for overlap.
+		 */
+		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
+		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
+			return true;
+
+		return false;
+	}
+#endif
 	/*
 	 * If one of them is not a task, same story as above with cgroups.
 	 */
@@ -217,7 +318,7 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
  * If we're part of a group, we use the group's RMID.
  */
 static int intel_cqm_setup_event(struct perf_event *event,
-				 struct perf_event **group)
+				 struct perf_event **group, int cpu)
 {
 	struct perf_event *iter;
 	int rmid;
@@ -244,9 +345,16 @@ static int intel_cqm_setup_event(struct perf_event *event,
 
 static void intel_cqm_event_read(struct perf_event *event)
 {
-	unsigned long rmid = event->hw.cqm_rmid;
+	unsigned long rmid;
 	u64 val;
 
+	/*
+	 * Task events are handled by intel_cqm_event_count().
+	 */
+	if (event->cpu == -1)
+		return;
+
+	rmid = event->hw.cqm_rmid;
 	val = __rmid_read(rmid);
 
 	/*
@@ -257,9 +365,67 @@ static void intel_cqm_event_read(struct perf_event *event)
 
 	val *= cqm_l3_scale; /* cachelines -> bytes */
 
+	/*
+	 * If this event is per-cpu then we don't need to do any
+	 * aggregation in the kernel, it's all done in userland.
+	 */
 	local64_set(&event->count, val);
 }
 
+static void __intel_cqm_event_count(void *info)
+{
+	struct perf_event *event = info;
+	u64 val;
+
+	val = __rmid_read(event->hw.cqm_rmid);
+
+	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+		return;
+
+	val *= cqm_l3_scale; /* cachelines -> bytes */
+
+	local64_add(val, &event->count);
+}
+
+static inline bool cqm_group_leader(struct perf_event *event)
+{
+	return !list_empty(&event->hw.cqm_groups_entry);
+}
+
+static u64 intel_cqm_event_count(struct perf_event *event)
+{
+	unsigned int cpu;
+
+	/*
+	 * We only need to worry about task events. System-wide events
+	 * are handled like usual, i.e. entirely with
+	 * intel_cqm_event_read().
+	 */
+	if (event->cpu != -1)
+		return __perf_event_count(event);
+
+	/*
+	 * Only the group leader gets to report values. This stops us
+	 * reporting duplicate values to userspace, and gives us a clear
+	 * rule for which task gets to report the values.
+	 *
+	 * Note that it is impossible to attribute these values to
+	 * specific packages - we forfeit that ability when we create
+	 * task events.
+	 */
+	if (!cqm_group_leader(event))
+		return 0;
+
+	local64_set(&event->count, 0);
+
+	for_each_cpu(cpu, &cqm_cpumask) {
+		smp_call_function_single(cpu, __intel_cqm_event_count,
+					 event, 1);
+	}
+
+	return __perf_event_count(event);
+}
+
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
 	struct intel_cqm_state *state = &__get_cpu_var(cqm_state);
@@ -345,7 +511,7 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 	/*
 	 * And we're the group leader..
 	 */
-	if (!list_empty(&event->hw.cqm_groups_entry)) {
+	if (cqm_group_leader(event)) {
 		/*
 		 * If there was a group_other, make that leader, otherwise
 		 * destroy the group and return the RMID.
@@ -365,17 +531,6 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 
 static struct pmu intel_cqm_pmu;
 
-/*
- * XXX there's a bit of a problem in that we cannot simply do the one
- * event per node as one would want, since that one event would one get
- * scheduled on the one cpu. But we want to 'schedule' the RMID on all
- * CPUs.
- *
- * This means we want events for each CPU, however, that generates a lot
- * of duplicate values out to userspace -- this is not to be helped
- * unless we want to change the core code in some way. Fore more info,
- * see intel_cqm_event_read().
- */
 static int intel_cqm_event_init(struct perf_event *event)
 {
 	struct perf_event *group = NULL;
@@ -387,9 +542,6 @@ static int intel_cqm_event_init(struct perf_event *event)
 	if (event->attr.config & ~QOS_EVENT_MASK)
 		return -EINVAL;
 
-	if (event->cpu == -1)
-		return -EINVAL;
-
 	/* unsupported modes and filters */
 	if (event->attr.exclude_user   ||
 	    event->attr.exclude_kernel ||
@@ -407,7 +559,8 @@ static int intel_cqm_event_init(struct perf_event *event)
 
 	mutex_lock(&cache_mutex);
 
-	err = intel_cqm_setup_event(event, &group); /* will also set rmid */
+	/* Will also set rmid */
+	err = intel_cqm_setup_event(event, &group, event->cpu);
 	if (err)
 		goto out;
 
@@ -464,6 +617,7 @@ static struct pmu intel_cqm_pmu = {
 	.start		= intel_cqm_event_start,
 	.stop		= intel_cqm_event_stop,
 	.read		= intel_cqm_event_read,
+	.count		= intel_cqm_event_count,
 };
 
 static inline void cqm_pick_event_reader(int cpu)
@@ -574,8 +728,8 @@ static int __init intel_cqm_init(void)
 
 	__perf_cpu_notifier(intel_cqm_cpu_notifier);
 
-	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
-
+	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm",
+				PERF_TYPE_INTEL_CQM);
 	if (ret)
 		pr_err("Intel CQM perf registration failed: %d\n", ret);
 	else
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 50ceb0665cda..4cb6e1e668a1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -134,6 +134,7 @@ struct hw_perf_event {
 			struct list_head	cqm_events_entry;
 			struct list_head	cqm_groups_entry;
 			struct list_head	cqm_group_entry;
+			struct task_struct	*cqm_target;
 		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de254874..85bd517878f5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -32,6 +32,7 @@ enum perf_type_id {
 	PERF_TYPE_HW_CACHE			= 3,
 	PERF_TYPE_RAW				= 4,
 	PERF_TYPE_BREAKPOINT			= 5,
+	PERF_TYPE_INTEL_CQM			= 6,
 
 	PERF_TYPE_MAX,				/* non-ABI */
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8a07f51a23ff..44174b395705 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6924,6 +6924,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		else if (attr->type == PERF_TYPE_BREAKPOINT)
 			event->hw.bp_target = task;
 #endif
+		else if (attr->type == PERF_TYPE_INTEL_CQM)
+			event->hw.cqm_target = task;
 	}
 
 	if (!overflow_handler && parent_event) {
-- 
1.9.3
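
For readers following the task-event path in intel_cqm_event_count() above,
here is a minimal userland sketch of the aggregation rule: only the group
leader reports, the count is rebuilt from zero, and each per-package reading
is skipped if it carries an error or unavailable flag. The names
sketch_task_count, SKETCH_VAL_ERROR and SKETCH_VAL_UNAVAIL are hypothetical,
and the bit positions (63 and 62) are an assumption about how RMID_VAL_ERROR
and RMID_VAL_UNAVAIL are defined elsewhere in the series. The array of raw
readings stands in for the per-package smp_call_function_single() calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed status bits, standing in for RMID_VAL_ERROR / RMID_VAL_UNAVAIL. */
#define SKETCH_VAL_ERROR	(1ULL << 63)
#define SKETCH_VAL_UNAVAIL	(1ULL << 62)

/*
 * Sum the per-package occupancy readings for one task event.
 *
 * pkg_raw[i] is the raw counter value read on the designated reader
 * CPU of package i. Non-leaders report 0 so that userspace does not
 * see the same occupancy more than once.
 */
static uint64_t sketch_task_count(const uint64_t *pkg_raw, size_t nr_pkgs,
				  uint64_t scale, bool is_leader)
{
	uint64_t total = 0;
	size_t i;

	if (!is_leader)
		return 0;

	for (i = 0; i < nr_pkgs; i++) {
		/* Ignore readings the hardware flagged as bad. */
		if (pkg_raw[i] & (SKETCH_VAL_ERROR | SKETCH_VAL_UNAVAIL))
			continue;

		total += pkg_raw[i] * scale;	/* cachelines -> bytes */
	}

	return total;
}
```

Because the readings are summed, the result cannot be attributed to a
specific package, which matches the note in the comment above.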



* [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
                   ` (9 preceding siblings ...)
  2014-09-24 14:04 ` [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM Matt Fleming
@ 2014-09-24 14:04 ` Matt Fleming
  2014-10-08 11:19   ` Peter Zijlstra
                     ` (2 more replies)
  10 siblings, 3 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

From: Matt Fleming <matt.fleming@intel.com>

There are many use cases where people will want to monitor more tasks
than there exist RMIDs in the hardware, meaning that we have to perform
some kind of multiplexing.

We do this by "rotating" the RMIDs in a workqueue, and assigning an RMID
to a waiting event when the RMID becomes unused.

This scheme reserves one RMID at all times for rotation. When we need to
schedule a new event we give it the reserved RMID, pick a victim event
from the front of the global CQM list and wait for the victim's RMID to
drop to zero occupancy, before it becomes the new reserved RMID.

The comments above __intel_cqm_rmid_rotate() have more details.
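
As a rough illustration of the scheme described above, the following
userland sketch models one rotation step. It is a simplification under
stated assumptions, not the driver code: groups are a small array rather
than a list, limbo holds a single RMID rather than a list, and the step is
skipped entirely when the reserved RMID is not available. All names
(sketch_state, sketch_rotate, sketch_drained, SKETCH_INVALID_RMID) are
hypothetical.

```c
#include <stddef.h>

#define SKETCH_INVALID_RMID	(-1)

struct sketch_state {
	int group_rmid[8];	/* RMID held by each group, in list order */
	size_t nr_groups;
	int reserved;		/* RMID kept free for rotation */
	int limbo;		/* victim's old RMID, waiting to drain */
};

/*
 * One rotation step: the head group is the victim and moves to the
 * tail without an RMID, the first waiting group receives the reserved
 * RMID, and the victim's old RMID goes into limbo.
 *
 * Returns 1 if a rotation happened, 0 otherwise.
 */
static int sketch_rotate(struct sketch_state *s)
{
	size_t i, waiting = s->nr_groups;
	int victim_rmid;

	for (i = 0; i < s->nr_groups; i++) {
		if (s->group_rmid[i] == SKETCH_INVALID_RMID) {
			waiting = i;
			break;
		}
	}

	/* Nobody waiting, head has no RMID, or nothing to hand out. */
	if (waiting == s->nr_groups || waiting == 0 ||
	    s->reserved == SKETCH_INVALID_RMID)
		return 0;

	/* Take the head's RMID away and rotate the head to the tail. */
	victim_rmid = s->group_rmid[0];
	for (i = 1; i < s->nr_groups; i++)
		s->group_rmid[i - 1] = s->group_rmid[i];
	s->group_rmid[s->nr_groups - 1] = SKETCH_INVALID_RMID;

	/* The waiting group shifted one slot towards the head. */
	s->group_rmid[waiting - 1] = s->reserved;
	s->reserved = SKETCH_INVALID_RMID;
	s->limbo = victim_rmid;

	return 1;
}

/* Call once the limbo RMID's occupancy has dropped to zero. */
static void sketch_drained(struct sketch_state *s)
{
	if (s->limbo != SKETCH_INVALID_RMID) {
		s->reserved = s->limbo;
		s->limbo = SKETCH_INVALID_RMID;
	}
}
```

The point of the sketch is the invariant: after sketch_drained() there is
again exactly one clean RMID set aside, so the next waiting group can always
be served.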

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 475 +++++++++++++++++++++++++++--
 1 file changed, 447 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index d40c29e6eeff..717d77fde652 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -24,9 +24,13 @@ struct intel_cqm_state {
 static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
 
 /*
- * Protects cache_cgroups and cqm_rmid_lru.
+ * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
+ * Also protects event->hw.cqm_rmid
+ *
+ * Hold either for stability, both for modification.
  */
 static DEFINE_MUTEX(cache_mutex);
+static DEFINE_RAW_SPINLOCK(cache_lock);
 
 /*
  * Groups of events that have the same target(s), one RMID per group.
@@ -45,7 +49,34 @@ static cpumask_t cqm_cpumask;
 
 #define QOS_EVENT_MASK	QOS_L3_OCCUP_EVENT_ID
 
-static u64 __rmid_read(unsigned long rmid)
+/*
+ * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
+ *
+ * This rmid is always free and is guaranteed to have an associated
+ * near-zero occupancy value, i.e. no cachelines are tagged with this
+ * RMID, once __intel_cqm_rmid_rotate() returns.
+ */
+static unsigned int intel_cqm_rotation_rmid;
+
+#define INVALID_RMID		(-1)
+
+/*
+ * Is @rmid valid for programming the hardware?
+ *
+ * rmid 0 is reserved by the hardware for all non-monitored tasks, which
+ * means that we should never come across an rmid with that value.
+ * Likewise, an rmid value of -1 is used to indicate "no rmid currently
+ * assigned" and is used as part of the rotation code.
+ */
+static inline bool __rmid_valid(unsigned int rmid)
+{
+	if (!rmid || rmid == INVALID_RMID)
+		return false;
+
+	return true;
+}
+
+static u64 __rmid_read(unsigned int rmid)
 {
 	u64 val;
 
@@ -64,12 +95,12 @@ static u64 __rmid_read(unsigned long rmid)
 }
 
 struct cqm_rmid_entry {
-	u64 rmid;
+	unsigned int rmid;
 	struct list_head list;
 };
 
 /*
- * A least recently used list of RMIDs.
+ * cqm_rmid_free_lru - A least recently used list of RMIDs.
  *
  * Oldest entry at the head, newest (most recently used) entry at the
  * tail. This list is never traversed, it's only used to keep track of
@@ -80,9 +111,18 @@ struct cqm_rmid_entry {
  * in use. To mark an RMID as in use, remove its entry from the lru
  * list.
  *
- * This list is protected by cache_mutex.
+ *
+ * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
+ *
+ * This list contains RMIDs that no one is currently using but that
+ * may have a non-zero occupancy value associated with them. The
+ * rotation worker moves RMIDs from the limbo list to the free list once
+ * the occupancy value drops below __intel_cqm_threshold.
+ *
+ * Both lists are protected by cache_mutex.
  */
-static LIST_HEAD(cqm_rmid_lru);
+static LIST_HEAD(cqm_rmid_free_lru);
+static LIST_HEAD(cqm_rmid_limbo_lru);
 
 /*
  * We use a simple array of pointers so that we can lookup a struct
@@ -119,24 +159,25 @@ static int __get_rmid(void)
 
 	lockdep_assert_held(&cache_mutex);
 
-	if (list_empty(&cqm_rmid_lru))
-		return -EAGAIN;
+	if (list_empty(&cqm_rmid_free_lru))
+		return INVALID_RMID;
 
-	entry = list_first_entry(&cqm_rmid_lru, struct cqm_rmid_entry, list);
+	entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
 	list_del(&entry->list);
 
 	return entry->rmid;
 }
 
-static void __put_rmid(int rmid)
+static void __put_rmid(unsigned int rmid)
 {
 	struct cqm_rmid_entry *entry;
 
 	lockdep_assert_held(&cache_mutex);
 
+	WARN_ON(!__rmid_valid(rmid));
 	entry = __rmid_entry(rmid);
 
-	list_add_tail(&entry->list, &cqm_rmid_lru);
+	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
 }
 
 static int intel_cqm_setup_rmid_cache(void)
@@ -160,7 +201,7 @@ static int intel_cqm_setup_rmid_cache(void)
 		entry->rmid = r;
 		cqm_rmid_ptrs[r] = entry;
 
-		list_add_tail(&entry->list, &cqm_rmid_lru);
+		list_add_tail(&entry->list, &cqm_rmid_free_lru);
 	}
 
 	/*
@@ -170,6 +211,10 @@ static int intel_cqm_setup_rmid_cache(void)
 	entry = __rmid_entry(0);
 	list_del(&entry->list);
 
+	mutex_lock(&cache_mutex);
+	intel_cqm_rotation_rmid = __get_rmid();
+	mutex_unlock(&cache_mutex);
+
 	return 0;
 fail:
 	while (r--)
@@ -313,6 +358,345 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
 }
 
 /*
+ * Exchange the RMID of a group of events.
+ */
+static unsigned int
+intel_cqm_xchg_rmid(struct perf_event *group, unsigned int rmid)
+{
+	struct perf_event *event;
+	unsigned int old_rmid = group->hw.cqm_rmid;
+	struct list_head *head = &group->hw.cqm_group_entry;
+
+	lockdep_assert_held(&cache_mutex);
+	lockdep_assert_held(&cache_lock);
+
+	group->hw.cqm_rmid = rmid;
+	list_for_each_entry(event, head, hw.cqm_group_entry)
+		event->hw.cqm_rmid = rmid;
+
+	return old_rmid;
+}
+
+/*
+ * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
+ * cachelines are still tagged with RMIDs in limbo, we progressively
+ * increment the threshold until we find an RMID in limbo with <=
+ * __intel_cqm_threshold lines tagged. This is designed to mitigate the
+ * problem where cachelines tagged with an RMID are not steadily being
+ * evicted.
+ *
+ * On successful rotations we decrease the threshold back towards zero.
+ */
+static unsigned int __intel_cqm_threshold;
+
+struct intel_cqm_smp_info {
+	unsigned long *bitmap;	/* index by cpu num */
+	unsigned int nr_bits;
+};
+
+/*
+ * Test whether an RMID has a zero occupancy value on this cpu.
+ */
+static void intel_cqm_stable(void *arg)
+{
+	struct intel_cqm_smp_info *info = arg;
+	unsigned long *bitmap;
+	unsigned int nr_bits;
+	unsigned int cpu;
+	int i = -1;
+
+	nr_bits = info->nr_bits;
+	cpu = smp_processor_id();
+	bitmap = &info->bitmap[cpu * BITS_TO_LONGS(nr_bits)];
+
+	for (; i = find_next_bit(bitmap, nr_bits, i+1), i < nr_bits;) {
+		if (__rmid_read(i) > __intel_cqm_threshold)
+			clear_bit(i, bitmap);
+	}
+}
+
+/*
+ * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
+ * cachelines are tagged with those RMIDs. After this we can reuse them
+ * and know that the current set of active RMIDs is stable.
+ *
+ * Return %true or %false depending on whether we were able to stabilize
+ * an RMID for intel_cqm_rotation_rmid.
+ */
+static bool intel_cqm_rmid_stabilize(void)
+{
+	struct cqm_rmid_entry *entry;
+	struct intel_cqm_smp_info info;
+	unsigned long *limbo_bitmap;
+	unsigned long *free_bitmap;
+	unsigned int nr_bits, cpu;
+	struct perf_event *event;
+	unsigned long timeout;
+
+	lockdep_assert_held(&cache_mutex);
+
+	timeout = jiffies + msecs_to_jiffies(50);
+
+	nr_bits = cqm_max_rmid + 1;
+	limbo_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(nr_bits) *
+			       nr_cpumask_bits, GFP_KERNEL);
+	if (!limbo_bitmap)
+		return false;
+
+	free_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(nr_bits), GFP_KERNEL);
+	if (!free_bitmap) {
+		kfree(limbo_bitmap);
+		return false;
+	}
+
+	info.nr_bits = nr_bits;
+	info.bitmap = limbo_bitmap;
+
+retry:
+	bitmap_zero(limbo_bitmap, nr_bits * nr_cpumask_bits);
+
+	preempt_disable();
+	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
+		for_each_cpu(cpu, &cqm_cpumask) {
+			unsigned long *map;
+
+			map = &limbo_bitmap[cpu * BITS_TO_LONGS(nr_bits)];
+			set_bit(entry->rmid, map);
+		}
+	}
+
+	/*
+	 * Test whether an RMID is free for each package.
+	 */
+	smp_call_function_many(&cqm_cpumask, intel_cqm_stable, &info, true);
+
+	/*
+	 * Convert all cpu bitmaps into a single bitmap by ANDing all of
+	 * them together. If we've still got any bits set that indicates
+	 * an RMID is now unused on all cpus.
+	 */
+	bitmap_fill(free_bitmap, nr_bits);
+	for_each_cpu(cpu, &cqm_cpumask) {
+		unsigned long *map;
+
+		map = &limbo_bitmap[cpu * BITS_TO_LONGS(nr_bits)];
+		bitmap_and(free_bitmap, free_bitmap, map, nr_bits);
+	}
+
+	if (!bitmap_empty(free_bitmap, nr_bits)) {
+		int i = -1;
+
+		for (; i = find_next_bit(free_bitmap, nr_bits, i+1),
+			i < nr_bits;) {
+			entry = __rmid_entry(i);
+
+			list_del(&entry->list);	/* remove from limbo */
+
+			/*
+			 * The rotation RMID gets priority if it's
+			 * currently invalid. In which case, skip adding
+			 * the RMID to the free lru.
+			 */
+			if (!__rmid_valid(intel_cqm_rotation_rmid)) {
+				intel_cqm_rotation_rmid = i;
+				continue;
+			}
+
+			/*
+			 * If we have groups waiting for RMIDs, hand
+			 * them one now.
+			 */
+			raw_spin_lock_irq(&cache_lock);
+			list_for_each_entry(event, &cache_groups,
+					    hw.cqm_groups_entry) {
+				if (__rmid_valid(event->hw.cqm_rmid))
+					continue;
+
+				intel_cqm_xchg_rmid(event, i);
+				entry = NULL;
+				break;
+			}
+			raw_spin_unlock_irq(&cache_lock);
+
+			if (!entry)
+				continue;
+
+			/*
+			 * Otherwise place it onto the free list.
+			 */
+			list_add_tail(&entry->list, &cqm_rmid_free_lru);
+		}
+	}
+
+	preempt_enable();
+
+	if (!__rmid_valid(intel_cqm_rotation_rmid)) {
+		schedule_timeout_interruptible(1);
+
+		if (time_before(jiffies, timeout))
+			goto retry;
+	}
+
+	kfree(free_bitmap);
+	kfree(limbo_bitmap);
+
+	return __rmid_valid(intel_cqm_rotation_rmid);
+}
+
+/*
+ * Pick a victim group and move it to the tail of the group list.
+ */
+static struct perf_event *
+__intel_cqm_pick_and_rotate(void)
+{
+	struct perf_event *rotor;
+
+	lockdep_assert_held(&cache_mutex);
+	lockdep_assert_held(&cache_lock);
+
+	rotor = list_first_entry(&cache_groups, struct perf_event,
+				 hw.cqm_groups_entry);
+	list_rotate_left(&cache_groups);
+
+	return rotor;
+}
+
+/*
+ * Attempt to rotate the groups and assign new RMIDs.
+ *
+ * Rotating RMIDs is complicated because the hardware doesn't give us
+ * any clues.
+ *
+ * There are problems with the hardware interface; when you change the
+ * task:RMID map, cachelines retain their 'old' tags, giving a skewed
+ * picture. In order to work around this, we must always keep one free
+ * RMID - intel_cqm_rotation_rmid.
+ *
+ * Rotation works by taking away an RMID from a group (the old RMID),
+ * and assigning the free RMID to another group (the new RMID). We must
+ * then wait for the old RMID to not be used (no cachelines tagged).
+ * This ensures that all cachelines are tagged with 'active' RMIDs. At
+ * this point we can start reading values for the new RMID and treat the
+ * old RMID as the free RMID for the next rotation.
+ *
+ * Return %true or %false depending on whether we did any rotating.
+ */
+static bool __intel_cqm_rmid_rotate(void)
+{
+	struct perf_event *group, *rotor, *start = NULL;
+	struct cqm_rmid_entry *entry;
+	unsigned int nr_needed = 0;
+	unsigned int rmid;
+	bool rotated = false;
+
+	mutex_lock(&cache_mutex);
+
+again:
+	/*
+	 * Fast path through this function if there are no groups and no
+	 * RMIDs that need cleaning.
+	 */
+	if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
+		goto out;
+
+	list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
+		if (!__rmid_valid(group->hw.cqm_rmid)) {
+			if (!start)
+				start = group;
+			nr_needed++;
+		}
+	}
+
+	/*
+	 * We have some event groups, but they all have RMIDs assigned
+	 * and no RMIDs need cleaning.
+	 */
+	if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
+		goto out;
+
+	if (!nr_needed)
+		goto stabilize;
+
+	/*
+	 * We have more event groups without RMIDs than available RMIDs.
+	 *
+	 * We force deallocate the rmid of the group at the head of
+	 * cache_groups. The first event group without an RMID then gets
+	 * assigned intel_cqm_rotation_rmid. This ensures we always make
+	 * forward progress.
+	 *
+	 * Rotate the cache_groups list so the previous head is now the
+	 * tail.
+	 */
+	raw_spin_lock_irq(&cache_lock);
+
+	rotor = __intel_cqm_pick_and_rotate();
+	rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
+
+	/*
+	 * The group at the front of the list should always have a valid
+	 * RMID. If it doesn't then no groups have RMIDs assigned.
+	 */
+	if (!__rmid_valid(rmid)) {
+		raw_spin_unlock_irq(&cache_lock);
+		goto stabilize;
+	}
+
+	/*
+	 * If the rotation is going to succeed, reduce the threshold so
+	 * that we don't needlessly reuse dirty RMIDs.
+	 */
+	if (__rmid_valid(intel_cqm_rotation_rmid)) {
+		intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
+		intel_cqm_rotation_rmid = INVALID_RMID;
+
+		if (__intel_cqm_threshold)
+			__intel_cqm_threshold--;
+	}
+
+	entry = __rmid_entry(rmid);
+	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
+
+	raw_spin_unlock_irq(&cache_lock);
+
+	rotated = true;
+
+stabilize:
+	/*
+	 * We now need to stabilize the RMID we freed above (if any) to
+	 * ensure that the next time we rotate we have an RMID with zero
+	 * occupancy value.
+	 *
+	 * Alternatively, if we didn't need to perform any rotation,
+	 * we'll have a bunch of RMIDs in limbo that need stabilizing.
+	 */
+	if (!intel_cqm_rmid_stabilize()) {
+		__intel_cqm_threshold++;
+		goto again;
+	}
+
+	WARN_ON(!__rmid_valid(intel_cqm_rotation_rmid));
+
+out:
+	mutex_unlock(&cache_mutex);
+	return rotated;
+}
+
+static void intel_cqm_rmid_rotate(struct work_struct *work);
+
+static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
+
+static unsigned int __rotation_period = 250; /* ms */
+
+static void intel_cqm_rmid_rotate(struct work_struct *work)
+{
+	__intel_cqm_rmid_rotate();
+
+	schedule_delayed_work(&intel_cqm_rmid_work,
+			      msecs_to_jiffies(__rotation_period));
+}
+
+/*
  * Find a group and setup RMID.
  *
  * If we're part of a group, we use the group's RMID.
@@ -321,7 +705,6 @@ static int intel_cqm_setup_event(struct perf_event *event,
 				 struct perf_event **group, int cpu)
 {
 	struct perf_event *iter;
-	int rmid;
 
 	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
 		if (__match_event(iter, event)) {
@@ -335,17 +718,14 @@ static int intel_cqm_setup_event(struct perf_event *event,
 			return -EBUSY;
 	}
 
-	rmid = __get_rmid();
-	if (rmid < 0)
-		return rmid;
-
-	event->hw.cqm_rmid = rmid;
+	event->hw.cqm_rmid = __get_rmid();
 	return 0;
 }
 
 static void intel_cqm_event_read(struct perf_event *event)
 {
-	unsigned long rmid;
+	unsigned long flags;
+	unsigned int rmid;
 	u64 val;
 
 	/*
@@ -354,14 +734,19 @@ static void intel_cqm_event_read(struct perf_event *event)
 	if (event->cpu == -1)
 		return;
 
+	raw_spin_lock_irqsave(&cache_lock, flags);
 	rmid = event->hw.cqm_rmid;
+
+	if (!__rmid_valid(rmid))
+		goto out;
+
 	val = __rmid_read(rmid);
 
 	/*
 	 * Ignore this reading on error states and do not update the value.
 	 */
 	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
+		goto out;
 
 	val *= cqm_l3_scale; /* cachelines -> bytes */
 
@@ -370,21 +755,34 @@ static void intel_cqm_event_read(struct perf_event *event)
 	 * aggregation in the kernel, it's all done in userland.
 	 */
 	local64_set(&event->count, val);
+out:
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
 }
 
 static void __intel_cqm_event_count(void *info)
 {
 	struct perf_event *event = info;
+	unsigned long flags;
+	unsigned int rmid;
 	u64 val;
 
-	val = __rmid_read(event->hw.cqm_rmid);
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
+	rmid = event->hw.cqm_rmid;
+	if (!__rmid_valid(rmid))
+		goto unlock;
+
+	val = __rmid_read(rmid);
 
 	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
+		goto unlock;
 
 	val *= cqm_l3_scale; /* cachelines -> bytes */
 
 	local64_add(val, &event->count);
+
+unlock:
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
 }
 
 static inline bool cqm_group_leader(struct perf_event *event)
@@ -429,7 +827,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
 	struct intel_cqm_state *state = &__get_cpu_var(cqm_state);
-	unsigned long rmid = event->hw.cqm_rmid;
+	unsigned int rmid = event->hw.cqm_rmid;
 	unsigned long flags;
 
 	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
@@ -475,15 +873,19 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
 {
-	int rmid;
+	unsigned long flags;
+	unsigned int rmid;
+
+	raw_spin_lock_irqsave(&cache_lock, flags);
 
 	event->hw.cqm_state = PERF_HES_STOPPED;
 	rmid = event->hw.cqm_rmid;
-	WARN_ON_ONCE(!rmid);
 
-	if (mode & PERF_EF_START)
+	if (__rmid_valid(rmid) && (mode & PERF_EF_START))
 		intel_cqm_event_start(event, mode);
 
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
 	return 0;
 }
 
@@ -520,8 +922,10 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 			list_replace(&event->hw.cqm_groups_entry,
 				     &group_other->hw.cqm_groups_entry);
 		} else {
-			int rmid = event->hw.cqm_rmid;
-			__put_rmid(rmid);
+			unsigned int rmid = event->hw.cqm_rmid;
+
+			if (__rmid_valid(rmid))
+				__put_rmid(rmid);
 			list_del(&event->hw.cqm_groups_entry);
 		}
 	}
@@ -534,6 +938,7 @@ static struct pmu intel_cqm_pmu;
 static int intel_cqm_event_init(struct perf_event *event)
 {
 	struct perf_event *group = NULL;
+	bool rotate = false;
 	int err;
 
 	if (event->attr.type != intel_cqm_pmu.type)
@@ -570,10 +975,24 @@ static int intel_cqm_event_init(struct perf_event *event)
 	} else {
 		list_add_tail(&event->hw.cqm_groups_entry,
 			      &cache_groups);
+
+		/*
+		 * All RMIDs are either in use or have recently been
+		 * used. Kick the rotation worker to clean/free some.
+		 *
+		 * We only do this for the group leader, rather than for
+		 * every event in a group to save on needless work.
+		 */
+		if (!__rmid_valid(event->hw.cqm_rmid))
+			rotate = true;
 	}
 
 out:
 	mutex_unlock(&cache_mutex);
+
+	if (rotate)
+		schedule_delayed_work(&intel_cqm_rmid_work, 0);
+
 	return err;
 }
 
-- 
1.9.3



* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-09-24 14:04 ` [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM Matt Fleming
@ 2014-09-24 14:45   ` Matt Fleming
  2014-10-08 10:01   ` Peter Zijlstra
  2014-10-08 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 14:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 24 Sep, at 03:04:14PM, Matt Fleming wrote:
> 
> All tasks in a thread group are tracked together.

D'oh, this is a stale comment and isn't true.

-- 
Matt Fleming, Intel Open Source Technology Center


* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-09-24 14:04 ` [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support Matt Fleming
@ 2014-09-24 16:40   ` Andi Kleen
  2014-09-24 20:27     ` Matt Fleming
  2014-10-07 19:43   ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2014-09-24 16:40 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

Matt Fleming <matt@console-pimps.org> writes:
>
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index 7e1fd4e08552..8abb18fbcd13 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -38,7 +38,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
>  obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
>  obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
>  obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
> -obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
> +obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o perf_event_intel_cqm.o

What's missing to be able to make this a module?

> +
> +	/*
> +	 * Is @cpu a designated cqm reader?
> +	 */
> +	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
> +		return;
> +
> +	for_each_online_cpu(i) {

Likely possible cpus to avoid races? Otherwise you'll need more locking.

> +		if (i == cpu)
> +			continue;
> +
> +		if (phys_id == topology_physical_package_id(i)) {
> +			cpumask_set_cpu(i, &cqm_cpumask);
> +			break;
> +		}
> +	}
> +}
> +
> +static int intel_cqm_cpu_notifier(struct notifier_block *nb,
> +				  unsigned long action, void *hcpu)
> +{
> +	unsigned int cpu  = (unsigned long)hcpu;
> +
> +	switch (action & ~CPU_TASKS_FROZEN) {
> +	case CPU_UP_PREPARE:
> +		intel_cqm_cpu_prepare(cpu);
> +		break;
> +	case CPU_DOWN_PREPARE:
> +		intel_cqm_cpu_exit(cpu);
> +		break;
> +	case CPU_STARTING:
> +		cqm_pick_event_reader(cpu);
> +		break;
> +	}
> +
> +	return NOTIFY_OK;
> +}
> +
> +static int __init intel_cqm_init(void)
> +{
> +	int i, cpu, ret;
> +
> +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_CQM_OCCUP_LLC))
> +		return -ENODEV;

This should use cpufeature.h

> +
> +	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> +
> +	/*
> +	 * It's possible that not all resources support the same number
> +	 * of RMIDs. Instead of making scheduling much more complicated
> +	 * (where we have to match a task's RMID to a cpu that supports
> +	 * that many RMIDs) just find the minimum RMIDs supported across
> +	 * all cpus.
> +	 *
> +	 * Also, check that the scales match on all cpus.
> +	 */
> +	for_each_online_cpu(cpu) {

And this should take the cpu hotplug lock (although it may be
latent at this point if it's only running at early initialization)

But in fact what good is the test then if you only
ever likely check cpu #0?

> +		struct cpuinfo_x86 *c = &cpu_data(cpu);
> +
> +		if (c->x86_cache_max_rmid < cqm_max_rmid)
> +			cqm_max_rmid = c->x86_cache_max_rmid;
> +
> +		if (c->x86_cache_occ_scale != cqm_l3_scale) {
> +			pr_err("Multiple LLC scale values, disabling\n");
> +			return -EINVAL;
> +		}
> +	}
> +
> +	ret = intel_cqm_setup_rmid_cache();
> +	if (ret)
> +		return ret;
> +
> +	for_each_online_cpu(i) {
> +		intel_cqm_cpu_prepare(i);
> +		cqm_pick_event_reader(i);
> +	}


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-09-24 16:40   ` Andi Kleen
@ 2014-09-24 20:27     ` Matt Fleming
  2014-09-24 20:39       ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-09-24 20:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 24 Sep, at 09:40:10AM, Andi Kleen wrote:
> Matt Fleming <matt@console-pimps.org> writes:
> >
> > diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> > index 7e1fd4e08552..8abb18fbcd13 100644
> > --- a/arch/x86/kernel/cpu/Makefile
> > +++ b/arch/x86/kernel/cpu/Makefile
> > @@ -38,7 +38,7 @@ obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
> >  obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
> >  obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_uncore_snb.o
> >  obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore_snbep.o perf_event_intel_uncore_nhmex.o
> > -obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o
> > +obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_rapl.o perf_event_intel_cqm.o
> 
> What's missing to be able to make this a module?
 
Not sure, that's not something I'd thought of. I simply copied every
other PMU driver in this directory.

But as an experiment I tried it, and the only thing that appears to be
necessary is adding EXPORT_SYMBOL_GPL() to events_sysfs_show.

> > +
> > +	/*
> > +	 * Is @cpu a designated cqm reader?
> > +	 */
> > +	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
> > +		return;
> > +
> > +	for_each_online_cpu(i) {
> 
> Likely possible cpus to avoid races? Otherwise you'll need more locking.
 
I was under the impression that CPU_DOWN_PREPARE notifiers were
serialized against cpu hotplug. And reading the code, that does appear
to be so.

What race have you got in mind?

> > +static int __init intel_cqm_init(void)
> > +{
> > +	int i, cpu, ret;
> > +
> > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_CQM_OCCUP_LLC))
> > +		return -ENODEV;
> 
> This should use cpufeature.h
 
What? Please be more explicit.

> > +
> > +	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> > +
> > +	/*
> > +	 * It's possible that not all resources support the same number
> > +	 * of RMIDs. Instead of making scheduling much more complicated
> > +	 * (where we have to match a task's RMID to a cpu that supports
> > +	 * that many RMIDs) just find the minimum RMIDs supported across
> > +	 * all cpus.
> > +	 *
> > +	 * Also, check that the scales match on all cpus.
> > +	 */
> > +	for_each_online_cpu(cpu) {
> 
> And this should take the cpu hotplug lock (although it may be
> latent at this point if it's only running at early initialization)
 
Good catch, this is racy. I'll fix this up.

> But in fact what good is the test then if you only
> ever likely check cpu #0?

We don't, we check every online cpu, not just cpu 0.

-- 
Matt Fleming, Intel Open Source Technology Center

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-09-24 20:27     ` Matt Fleming
@ 2014-09-24 20:39       ` Andi Kleen
  2014-09-29 18:51         ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2014-09-24 20:39 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Andi Kleen, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel,
	H. Peter Anvin, Matt Fleming, Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 09:27:39PM +0100, Matt Fleming wrote:
> > > +static int __init intel_cqm_init(void)
> > > +{
> > > +	int i, cpu, ret;
> > > +
> > > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_CQM_OCCUP_LLC))
> > > +		return -ENODEV;
> > 
> > This should use cpufeature.h
>  
> What? Please be more explicit.

Using x86_match_cpu and friends, then it would work for module
auto loading too.

> > But in fact what good is the test then if you only
> > ever likely check cpu #0?
> 
> We don't, we check every online cpu, not just cpu 0.

There are no other cpus online at this point?

-Andi

* Re: [PATCH 02/11] perf tools: Refactor unit and scale function parameters
  2014-09-24 14:04 ` [PATCH 02/11] perf tools: Refactor unit and scale function parameters Matt Fleming
@ 2014-09-29  9:21   ` Jiri Olsa
  2014-10-03  5:25   ` [tip:perf/core] " tip-bot for Matt Fleming
  1 sibling, 0 replies; 46+ messages in thread
From: Jiri Olsa @ 2014-09-29  9:21 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:06PM +0100, Matt Fleming wrote:
> From: Matt Fleming <matt.fleming@intel.com>
> 
> Passing pointers to alias modifiers 'unit' and 'scale' isn't very
> future-proof since if we add more modifiers to the list we'll end up
> passing more arguments.
> 
> Instead wrap everything up in a struct perf_pmu_info, which can easily
> be expanded when additional alias modifiers are necessary in the future.
> 
> Cc: Jiri Olsa <jolsa@redhat.com>

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-09-24 20:39       ` Andi Kleen
@ 2014-09-29 18:51         ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-09-29 18:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 24 Sep, at 10:39:37PM, Andi Kleen wrote:
> On Wed, Sep 24, 2014 at 09:27:39PM +0100, Matt Fleming wrote:
> > > > +static int __init intel_cqm_init(void)
> > > > +{
> > > > +	int i, cpu, ret;
> > > > +
> > > > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_CQM_OCCUP_LLC))
> > > > +		return -ENODEV;
> > > 
> > > This should use cpufeature.h
> >  
> > What? Please be more explicit.
> 
> Using x86_match_cpu and friends, then it would work for module
> auto loading too.
 
OK, I'll take a look at that, thanks.

> > > But in fact what good is the test then if you only
> > > ever likely check cpu #0?
> > 
> > We don't, we check every online cpu, not just cpu 0.
> 
> There are no other cpus online at this point?

There are definitely other cpus online at this point, smp_init() gets
called before do_initcalls().

-- 
Matt Fleming, Intel Open Source Technology Center

* [tip:perf/core] perf tools: Refactor unit and scale function parameters
  2014-09-24 14:04 ` [PATCH 02/11] perf tools: Refactor unit and scale function parameters Matt Fleming
  2014-09-29  9:21   ` Jiri Olsa
@ 2014-10-03  5:25   ` tip-bot for Matt Fleming
  1 sibling, 0 replies; 46+ messages in thread
From: tip-bot for Matt Fleming @ 2014-10-03  5:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, linux-kernel, hpa, mingo, jolsa, peterz, matt.fleming, jolsa, tglx

Commit-ID:  46441bdc76fee08e297ebcf17e4ca91013b1ee9e
Gitweb:     http://git.kernel.org/tip/46441bdc76fee08e297ebcf17e4ca91013b1ee9e
Author:     Matt Fleming <matt.fleming@intel.com>
AuthorDate: Wed, 24 Sep 2014 15:04:06 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 29 Sep 2014 15:03:57 -0300

perf tools: Refactor unit and scale function parameters

Passing pointers to alias modifiers 'unit' and 'scale' isn't very
future-proof since if we add more modifiers to the list we'll end up
passing more arguments.

Instead wrap everything up in a struct perf_pmu_info, which can easily
be expanded when additional alias modifiers are necessary in the future.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1411567455-31264-3-git-send-email-matt@console-pimps.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/parse-events.c |  9 ++++-----
 tools/perf/util/pmu.c          | 38 +++++++++++++++++++++++---------------
 tools/perf/util/pmu.h          |  7 ++++++-
 3 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 61be3e6..9522cf2 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -634,10 +634,9 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 			 char *name, struct list_head *head_config)
 {
 	struct perf_event_attr attr;
+	struct perf_pmu_info info;
 	struct perf_pmu *pmu;
 	struct perf_evsel *evsel;
-	const char *unit;
-	double scale;
 
 	pmu = perf_pmu__find(name);
 	if (!pmu)
@@ -656,7 +655,7 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 		return evsel ? 0 : -ENOMEM;
 	}
 
-	if (perf_pmu__check_alias(pmu, head_config, &unit, &scale))
+	if (perf_pmu__check_alias(pmu, head_config, &info))
 		return -EINVAL;
 
 	/*
@@ -671,8 +670,8 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 	evsel = __add_event(list, idx, &attr, pmu_event_name(head_config),
 			    pmu->cpus);
 	if (evsel) {
-		evsel->unit = unit;
-		evsel->scale = scale;
+		evsel->unit = info.unit;
+		evsel->scale = info.scale;
 	}
 
 	return evsel ? 0 : -ENOMEM;
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 22a4ad5..93a41ca 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -210,6 +210,19 @@ static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FI
 	return 0;
 }
 
+static inline bool pmu_alias_info_file(char *name)
+{
+	size_t len;
+
+	len = strlen(name);
+	if (len > 5 && !strcmp(name + len - 5, ".unit"))
+		return true;
+	if (len > 6 && !strcmp(name + len - 6, ".scale"))
+		return true;
+
+	return false;
+}
+
 /*
  * Process all the sysfs attributes located under the directory
  * specified in 'dir' parameter.
@@ -218,7 +231,6 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 {
 	struct dirent *evt_ent;
 	DIR *event_dir;
-	size_t len;
 	int ret = 0;
 
 	event_dir = opendir(dir);
@@ -234,13 +246,9 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 			continue;
 
 		/*
-		 * skip .unit and .scale info files
-		 * parsed in perf_pmu__new_alias()
+		 * skip info files parsed in perf_pmu__new_alias()
 		 */
-		len = strlen(name);
-		if (len > 5 && !strcmp(name + len - 5, ".unit"))
-			continue;
-		if (len > 6 && !strcmp(name + len - 6, ".scale"))
+		if (pmu_alias_info_file(name))
 			continue;
 
 		snprintf(path, PATH_MAX, "%s/%s", dir, name);
@@ -645,7 +653,7 @@ static int check_unit_scale(struct perf_pmu_alias *alias,
  * defined for the alias
  */
 int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
-			  const char **unit, double *scale)
+			  struct perf_pmu_info *info)
 {
 	struct parse_events_term *term, *h;
 	struct perf_pmu_alias *alias;
@@ -655,8 +663,8 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 	 * Mark unit and scale as not set
 	 * (different from default values, see below)
 	 */
-	*unit   = NULL;
-	*scale  = 0.0;
+	info->unit   = NULL;
+	info->scale  = 0.0;
 
 	list_for_each_entry_safe(term, h, head_terms, list) {
 		alias = pmu_find_alias(pmu, term);
@@ -666,7 +674,7 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 		if (ret)
 			return ret;
 
-		ret = check_unit_scale(alias, unit, scale);
+		ret = check_unit_scale(alias, &info->unit, &info->scale);
 		if (ret)
 			return ret;
 
@@ -679,11 +687,11 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
 	 * set defaults as for evsel
 	 * unit cannot left to NULL
 	 */
-	if (*unit == NULL)
-		*unit   = "";
+	if (info->unit == NULL)
+		info->unit   = "";
 
-	if (*scale == 0.0)
-		*scale  = 1.0;
+	if (info->scale == 0.0)
+		info->scale  = 1.0;
 
 	return 0;
 }
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 0f5c0a8..fe90a01 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -25,6 +25,11 @@ struct perf_pmu {
 	struct list_head list;    /* ELEM */
 };
 
+struct perf_pmu_info {
+	const char *unit;
+	double scale;
+};
+
 struct perf_pmu *perf_pmu__find(const char *name);
 int perf_pmu__config(struct perf_pmu *pmu, struct perf_event_attr *attr,
 		     struct list_head *head_terms);
@@ -33,7 +38,7 @@ int perf_pmu__config_terms(struct list_head *formats,
 			   struct list_head *head_terms,
 			   bool zero);
 int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
-			  const char **unit, double *scale);
+			  struct perf_pmu_info *info);
 struct list_head *perf_pmu__alias(struct perf_pmu *pmu,
 				  struct list_head *head_terms);
 int perf_pmu_wrap(void);
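
As an aside to the diff above: pmu_alias_info_file() repeats the same
length-and-strcmp test once per suffix. If more info files are added later,
a table-driven check is one way to keep the helper short. The following is a
minimal userspace sketch, not part of the commit; the names info_suffixes,
has_suffix and alias_info_file are hypothetical, and only the ".unit" and
".scale" suffixes are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical suffix table; only ".unit" and ".scale" come from the patch. */
static const char *const info_suffixes[] = { ".unit", ".scale" };

/* Return true if @name ends in @suffix and has a non-empty stem before it. */
static bool has_suffix(const char *name, const char *suffix)
{
	size_t len = strlen(name);
	size_t slen = strlen(suffix);

	return len > slen && !strcmp(name + len - slen, suffix);
}

/* Same result as pmu_alias_info_file() for the two suffixes in the table. */
static bool alias_info_file(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(info_suffixes) / sizeof(info_suffixes[0]); i++)
		if (has_suffix(name, info_suffixes[i]))
			return true;

	return false;
}
```

The "len > slen" comparison mirrors the "len > 5" and "len > 6" tests in the
patch, so a bare ".unit" with no event name in front of it is not treated as
an info file.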

* Re: [PATCH 04/11] perf: Make perf_cgroup_from_task() global
  2014-09-24 14:04 ` [PATCH 04/11] perf: Make perf_cgroup_from_task() global Matt Fleming
@ 2014-10-07 18:51   ` Peter Zijlstra
  2014-10-08 10:27     ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-07 18:51 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:08PM +0100, Matt Fleming wrote:
> From: Matt Fleming <matt.fleming@intel.com>
> 
> Move perf_cgroup_from_task() from kernel/events to include/ along with
> the necessary struct definitions, so that it can be used by the PMU
> code.
> 
> The upcoming Intel Cache Monitoring PMU driver assigns monitoring IDs
> based on a task's association with a cgroup - all tasks in the same
> cgroup share an ID. We can use perf_cgroup_from_task() to track this
> association.

Not yet having read the rest of the patches and maybe understanding
things wrong, that doesn't sound right.

The RMID should be associated with events, not groups. The event can be
associated with whatever perf provides {task, cgroup, cpu}.

* Re: [PATCH 06/11] perf: Move cgroup init before PMU ->event_init()
  2014-09-24 14:04 ` [PATCH 06/11] perf: Move cgroup init before PMU ->event_init() Matt Fleming
@ 2014-10-07 19:34   ` Peter Zijlstra
  2014-10-08 10:32     ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-07 19:34 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:10PM +0100, Matt Fleming wrote:
> +++ b/kernel/events/core.c
> @@ -6859,7 +6859,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  		 struct perf_event *group_leader,
>  		 struct perf_event *parent_event,
>  		 perf_overflow_handler_t overflow_handler,
> -		 void *context)
> +		 void *context, bool cgroup, pid_t pid)
>  {
>  	struct pmu *pmu;
>  	struct perf_event *event;

I don't get this extension, why a bool and pid_t ?

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-09-24 14:04 ` [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support Matt Fleming
  2014-09-24 16:40   ` Andi Kleen
@ 2014-10-07 19:43   ` Peter Zijlstra
  2014-10-08 10:36     ` Matt Fleming
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-07 19:43 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:12PM +0100, Matt Fleming wrote:
> +/*
> + * Determine if @a and @b measure the same set of tasks.
> + */
> +static bool __match_event(struct perf_event *a, struct perf_event *b)
> +{
> +	if ((a->attach_state & PERF_ATTACH_TASK) !=
> +	    (b->attach_state & PERF_ATTACH_TASK))
> +		return false;
> +
> +	/* not task */
> +
> +	return true; /* if not task, we're machine wide */
> +}

You cut too much out there. That first test checks whether the two
events are of the same type; ie. both tasks or both cpu. After that you
still need to verify that they are indeed the same target.
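
The two-step check described here can be sketched in userspace as follows.
This is an illustration only, under stated assumptions: struct sketch_event,
its target field and SKETCH_ATTACH_TASK are simplified stand-ins, not the
kernel's struct perf_event or PERF_ATTACH_TASK, and the version that PATCH 10
actually implements may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; not the kernel's definitions. */
#define SKETCH_ATTACH_TASK 0x1

struct sketch_event {
	unsigned int attach_state;	/* SKETCH_ATTACH_TASK if per-task */
	const void *target;		/* hypothetical: monitored task, NULL if per-cpu */
};

static bool sketch_match_event(const struct sketch_event *a,
			       const struct sketch_event *b)
{
	assert(a && b);

	/* Step 1: are both events per-task, or both per-cpu? */
	if ((a->attach_state & SKETCH_ATTACH_TASK) !=
	    (b->attach_state & SKETCH_ATTACH_TASK))
		return false;

	/* Step 2: per-task events must also share the same target. */
	if (a->attach_state & SKETCH_ATTACH_TASK)
		return a->target == b->target;

	/* Neither is a task event: both are machine wide. */
	return true;
}
```

Without step 2, two task events monitoring different tasks would be reported
as measuring the same set of tasks.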



* Re: [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM
  2014-09-24 14:04 ` [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM Matt Fleming
@ 2014-10-08  9:51   ` Peter Zijlstra
  2014-10-08 10:53     ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08  9:51 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:13PM +0100, Matt Fleming wrote:
> From: Matt Fleming <matt.fleming@intel.com>
> 
> It's possible to run into issues with re-using unused monitoring IDs
> because there may be stale cachelines associated with that ID from a
> previous allocation. This can cause the LLC occupancy values to be
> inaccurate.
> 
> To attempt to mitigate this problem we place the IDs on a least recently
> used list, essentially a FIFO. The basic idea is that the longer the
> time period between ID re-use the lower the probability that stale
> cachelines exist in the cache.

Do we want to provide a user configurable minimum guaranteed queue time?

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-09-24 14:04 ` [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM Matt Fleming
  2014-09-24 14:45   ` Matt Fleming
@ 2014-10-08 10:01   ` Peter Zijlstra
  2014-10-08 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 10:01 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:14PM +0100, Matt Fleming wrote:
> Other perf backends stash the target task in event->hw.*target so we
> need to do something similar. The task is used to determine whether
> events should share a cache group and an RMID.

Yeah, we should maybe clean that up and provide a task target point in
the event or so. I've actually done that patch several times, but never
gotten around to finishing it or so -- sure can't seem to find it atm.

* Re: [PATCH 04/11] perf: Make perf_cgroup_from_task() global
  2014-10-07 18:51   ` Peter Zijlstra
@ 2014-10-08 10:27     ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Tue, 07 Oct, at 08:51:57PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:08PM +0100, Matt Fleming wrote:
> > From: Matt Fleming <matt.fleming@intel.com>
> > 
> > Move perf_cgroup_from_task() from kernel/events to include/ along with
> > the necessary struct definitions, so that it can be used by the PMU
> > code.
> > 
> > The upcoming Intel Cache Monitoring PMU driver assigns monitoring IDs
> > based on a task's association with a cgroup - all tasks in the same
> > cgroup share an ID. We can use perf_cgroup_from_task() to track this
> > association.
> 
> Not yet having read the rest of the patches and maybe understanding
> things wrong, that doesn't sound right.
> 
> The RMID should be associated with events, not groups. The event can be
> associated with whatever perf provides {task, cgroup, cpu}.

I think I just wrote the commit message in a goofy way.

What we actually use perf_cgroup_from_task() for is to figure out when
to prohibit an event from being created if it overlaps/conflicts with an
existing event.

I'll rewrite the commit message to be clearer.

-- 
Matt Fleming, Intel Open Source Technology Center

* Re: [PATCH 06/11] perf: Move cgroup init before PMU ->event_init()
  2014-10-07 19:34   ` Peter Zijlstra
@ 2014-10-08 10:32     ` Matt Fleming
  2014-10-08 10:49       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Tue, 07 Oct, at 09:34:24PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:10PM +0100, Matt Fleming wrote:
> > +++ b/kernel/events/core.c
> > @@ -6859,7 +6859,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> >  		 struct perf_event *group_leader,
> >  		 struct perf_event *parent_event,
> >  		 perf_overflow_handler_t overflow_handler,
> > -		 void *context)
> > +		 void *context, bool cgroup, pid_t pid)
> >  {
> >  	struct pmu *pmu;
> >  	struct perf_event *event;
> 
> I don't get this extension, why a bool and pid_t ?

So that it's possible to figure out whether we need to call
perf_cgroup_connect(). 

Oh, why not just 'int fd', you mean? Ermm... yeah that would be better.

-- 
Matt Fleming, Intel Open Source Technology Center

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-10-07 19:43   ` Peter Zijlstra
@ 2014-10-08 10:36     ` Matt Fleming
  2014-10-08 12:15       ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Tue, 07 Oct, at 09:43:10PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:12PM +0100, Matt Fleming wrote:
> > +/*
> > + * Determine if @a and @b measure the same set of tasks.
> > + */
> > +static bool __match_event(struct perf_event *a, struct perf_event *b)
> > +{
> > +	if ((a->attach_state & PERF_ATTACH_TASK) !=
> > +	    (b->attach_state & PERF_ATTACH_TASK))
> > +		return false;
> > +
> > +	/* not task */
> > +
> > +	return true; /* if not task, we're machine wide */
> > +}
> 
> You cut too much out there. That first test checks whether the two
> events are of the same type; ie. both tasks or both cpu. After that you
> still need to verify that they are indeed the same target.

This gets fixed in PATCH 10 where we actually implement monitoring of
task events. At this point in the series, we'll return -EINVAL from
intel_cqm_event_init() for anything other than a cpu event.

-- 
Matt Fleming, Intel Open Source Technology Center

* Re: [PATCH 06/11] perf: Move cgroup init before PMU ->event_init()
  2014-10-08 10:32     ` Matt Fleming
@ 2014-10-08 10:49       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 10:49 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Oct 08, 2014 at 11:32:50AM +0100, Matt Fleming wrote:
> On Tue, 07 Oct, at 09:34:24PM, Peter Zijlstra wrote:
> > On Wed, Sep 24, 2014 at 03:04:10PM +0100, Matt Fleming wrote:
> > > +++ b/kernel/events/core.c
> > > @@ -6859,7 +6859,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> > >  		 struct perf_event *group_leader,
> > >  		 struct perf_event *parent_event,
> > >  		 perf_overflow_handler_t overflow_handler,
> > > -		 void *context)
> > > +		 void *context, bool cgroup, pid_t pid)
> > >  {
> > >  	struct pmu *pmu;
> > >  	struct perf_event *event;
> > 
> > I don't get this extension, why a bool and pid_t ?
> 
> So that it's possible to figure out whether we need to call
> perf_cgroup_connect(). 
> 
> Oh, why not just 'int fd', you mean? Ermm... yeah that would be better.

Jah ;-)

* Re: [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM
  2014-10-08  9:51   ` Peter Zijlstra
@ 2014-10-08 10:53     ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 10:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 08 Oct, at 11:51:09AM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:13PM +0100, Matt Fleming wrote:
> > From: Matt Fleming <matt.fleming@intel.com>
> > 
> > It's possible to run into issues with re-using unused monitoring IDs
> > because there may be stale cachelines associated with that ID from a
> > previous allocation. This can cause the LLC occupancy values to be
> > inaccurate.
> > 
> > To attempt to mitigate this problem we place the IDs on a least recently
> > used list, essentially a FIFO. The basic idea is that the longer the
> > time period between ID re-use the lower the probability that stale
> > cachelines exist in the cache.
> 
> Do we want to provide a user configurable minimum guaranteed queue time?

Potentially, yeah. That might be better suited as part of the final
patch that includes the rotation code, which already has a delayed
workqueue.

We could add a minimum queue time before we start querying whether the
data occupancy value for an RMID has dropped to zero on all sockets.

That'd save us from an expensive smp_call_function_many() when we're
unlikely to succeed anyway.
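
To illustrate the idea, here is a minimal userspace sketch of a FIFO of free
RMIDs with a minimum queue time. It is not code from the series: NR_RMIDS,
struct rmid_fifo, rmid_put() and rmid_get() are hypothetical names, and time
is a plain counter rather than jiffies. The point is only that an RMID that
has not been queued long enough is refused before any occupancy query is
attempted.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_RMIDS 4	/* hypothetical; real hardware reports its own limit */

struct rmid_fifo {
	unsigned int id[NR_RMIDS];		/* ring buffer of free RMIDs */
	unsigned long queued_at[NR_RMIDS];	/* time each entry was freed */
	unsigned int head, count;
};

/* Queue a freed RMID at the tail, recording when it was freed. */
static void rmid_put(struct rmid_fifo *f, unsigned int rmid, unsigned long now)
{
	unsigned int tail;

	assert(f->count < NR_RMIDS);
	tail = (f->head + f->count) % NR_RMIDS;
	f->id[tail] = rmid;
	f->queued_at[tail] = now;
	f->count++;
}

/*
 * Take the oldest free RMID, but only once it has been queued for at
 * least @min_queue_time. Returns false if the queue is empty or the
 * oldest entry is still too fresh.
 */
static bool rmid_get(struct rmid_fifo *f, unsigned long now,
		     unsigned long min_queue_time, unsigned int *rmid)
{
	if (!f->count || now - f->queued_at[f->head] < min_queue_time)
		return false;

	*rmid = f->id[f->head];
	f->head = (f->head + 1) % NR_RMIDS;
	f->count--;
	return true;
}
```

Because the list is a FIFO, the head is always the entry that has waited
longest, so checking the head alone is enough to know whether any RMID is
old enough.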

-- 
Matt Fleming, Intel Open Source Technology Center

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-09-24 14:04 ` [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM Matt Fleming
  2014-09-24 14:45   ` Matt Fleming
  2014-10-08 10:01   ` Peter Zijlstra
@ 2014-10-08 11:07   ` Peter Zijlstra
  2014-10-08 12:10     ` Matt Fleming
  2014-10-10 10:54     ` Peter Zijlstra
  2 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 11:07 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Sep 24, 2014 at 03:04:14PM +0100, Matt Fleming wrote:
> From: Matt Fleming <matt.fleming@intel.com>
> 
> Add support for task events as well as system-wide events. This change
> has a big impact on the way that we gather LLC occupancy values in
> intel_cqm_event_read().
> 
> Currently, for system-wide (per-cpu) events we defer processing to
> userspace which knows how to discard all but one cpu result per package.
> 
> Things aren't so simple for task events because we need to do the value
> aggregation ourselves. To do this, we defer updating the LLC occupancy
> value in event->count from intel_cqm_event_read() and do an SMP
> cross-call to read values for all packages in intel_cqm_event_count().
> We need to ensure that we only do this for one task event per cache
> group, otherwise we'll report duplicate values.
> 
> If we're a system-wide event we want to fall back to the default
> perf_event_count() implementation. Refactor this into a common function
> so that we don't duplicate the code.

So it looks like these events will be classified as regular HW events,
this means they'll be mixed with the other HW events, and we'll stop
scheduling the moment either one returns a fail.

There are two alternatives;
 1) create an extra task context to keep them in
 2) pretend to be a software event and do the scheduling yourself

I think my initial proposal was 2, can you clarify why you've changed
that? Lemme go read the next patch though, maybe that'll clarify things
further.

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-09-24 14:04 ` [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs Matt Fleming
@ 2014-10-08 11:19   ` Peter Zijlstra
  2014-10-08 11:56     ` Matt Fleming
  2014-10-08 18:08   ` Peter Zijlstra
  2014-10-08 18:10   ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 11:19 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> This scheme reserves one RMID at all times for rotation. When we need to
> schedule a new event we give it the reserved RMID, pick a victim event
> from the front of the global CQM list and wait for the victim's RMID to
> drop to zero occupancy, before it becomes the new reserved RMID.

> +/*
> + * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
> + * cachelines are still tagged with RMIDs in limbo, we progressively
> + * increment the threshold until we find an RMID in limbo with <=
> + * __intel_cqm_threshold lines tagged. This is designed to mitigate the
> + * problem where cachelines tagged with an RMID are not steadily being
> + * evicted.
> + *
> + * On successful rotations we decrease the threshold back towards zero.
> + */
> +static unsigned int __intel_cqm_threshold;

Ah, so I was about to tell you there is the possibility we'll never quite
reach 0. But it appears you've cured that with this adaptive threshold
thing?

Is there an upper bound on the threshold after which we'll just wait, or
will you keep increasing it until something matches?

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-10-08 11:19   ` Peter Zijlstra
@ 2014-10-08 11:56     ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 11:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, 08 Oct, at 01:19:27PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> > This scheme reserves one RMID at all times for rotation. When we need to
> > schedule a new event we give it the reserved RMID, pick a victim event
> > from the front of the global CQM list and wait for the victim's RMID to
> > drop to zero occupancy, before it becomes the new reserved RMID.
> 
> > +/*
> > + * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
> > + * cachelines are still tagged with RMIDs in limbo, we progressively
> > + * increment the threshold until we find an RMID in limbo with <=
> > + * __intel_cqm_threshold lines tagged. This is designed to mitigate the
> > + * problem where cachelines tagged with an RMID are not steadily being
> > + * evicted.
> > + *
> > + * On successful rotations we decrease the threshold back towards zero.
> > + */
> > +static unsigned int __intel_cqm_threshold;
> 
> Ah, so I was about to tell you there is the possibility we'll never quite
> reach 0. But it appears you've cured that with this adaptive threshold
> thing?
 
Yeah, that is the idea. There are more games that we can play for
picking a "good" RMID to reuse, but this threshold provides a final
guarantee that we will make forward progress.

It also provides a good indication of how inaccurate you can expect your
results to be at any given time and for a particular event, but we don't
expose that currently. It might make sense to print a warning each time
the threshold reaches a new high.

> Is there an upper bound on the threshold after which we'll just wait, or
> will you keep increasing it until something matches?

We'll keep increasing it until something matches, though crucially, we
will decrease it for every consecutive match thereafter.

A threshold upper bound does seem like a good idea, though. I'm not a
massive fan of user-configurable knobs, but this does seem like the kind
of thing where people may want that control.
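
The behaviour discussed above, with an upper bound added, could look roughly
like the userspace sketch below. It is an illustration under assumptions,
not the code in the patch: struct cqm_threshold, rmid_reusable() and
threshold_update() are hypothetical names, and the step size of one line per
attempt is a guess, since the actual increment is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

struct cqm_threshold {
	unsigned int cur;	/* lines an RMID may still have tagged */
	unsigned int max;	/* hypothetical upper bound (the "knob") */
};

/* Is an RMID with @occupancy lines still tagged clean enough to reuse? */
static bool rmid_reusable(const struct cqm_threshold *t, unsigned int occupancy)
{
	return occupancy <= t->cur;
}

/* Called after each rotation attempt. */
static void threshold_update(struct cqm_threshold *t, bool rotated)
{
	assert(t->cur <= t->max);

	if (rotated) {
		if (t->cur > 0)
			t->cur--;	/* decay back towards zero */
	} else if (t->cur < t->max) {
		t->cur++;		/* relax, but never past the bound */
	}
}
```

Note the trade-off: once cur reaches max, a failed rotation no longer relaxes
the threshold, so the forward-progress guarantee of the unbounded scheme is
given up in exchange for a cap on how inaccurate the results can get.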

-- 
Matt Fleming, Intel Open Source Technology Center

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-10-08 11:07   ` Peter Zijlstra
@ 2014-10-08 12:10     ` Matt Fleming
  2014-10-08 14:49       ` Peter Zijlstra
  2014-10-10 10:54     ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 08 Oct, at 01:07:43PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:14PM +0100, Matt Fleming wrote:
> > From: Matt Fleming <matt.fleming@intel.com>
> > 
> > Add support for task events as well as system-wide events. This change
> > has a big impact on the way that we gather LLC occupancy values in
> > intel_cqm_event_read().
> > 
> > Currently, for system-wide (per-cpu) events we defer processing to
> > userspace which knows how to discard all but one cpu result per package.
> > 
> > Things aren't so simple for task events because we need to do the value
> > aggregation ourselves. To do this, we defer updating the LLC occupancy
> > value in event->count from intel_cqm_event_read() and do an SMP
> > cross-call to read values for all packages in intel_cqm_event_count().
> > We need to ensure that we only do this for one task event per cache
> > group, otherwise we'll report duplicate values.
> > 
> > If we're a system-wide event we want to fall back to the default
> > perf_event_count() implementation. Refactor this into a common function
> > so that we don't duplicate the code.
> 
> So it looks like these events will be classified as regular HW events,
> this means they'll be mixed with the other HW events, and we'll stop
> scheduling the moment either one returns a fail.
> 
> There are two alternatives;
>  1) create an extra task context to keep them in
>  2) pretend to be a software event and do the scheduling yourself
> 
> I think my initial proposal was 2, can you clarify why you've changed
> that? Lemme go read the next patch though, maybe that'll clarify things
> further.

Ah, interesting.

I dropped the internal scheduling because I preferred the idea of
"failing fast", in the sense that if we can't schedule multiple events
simultaneously because they conflict, we should report that to the user
at event init time, rather than trying to manage the conflict ourselves,
with the resultant loss of accuracy.

But I wasn't aware of the issue you've brought up, and it sounds like we
need to do the scheduling anyway.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-10-08 10:36     ` Matt Fleming
@ 2014-10-08 12:15       ` Matt Fleming
  2014-10-08 14:47         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 08 Oct, at 11:36:58AM, Matt Fleming wrote:
> On Tue, 07 Oct, at 09:43:10PM, Peter Zijlstra wrote:
> > On Wed, Sep 24, 2014 at 03:04:12PM +0100, Matt Fleming wrote:
> > > +/*
> > > + * Determine if @a and @b measure the same set of tasks.
> > > + */
> > > +static bool __match_event(struct perf_event *a, struct perf_event *b)
> > > +{
> > > +	if ((a->attach_state & PERF_ATTACH_TASK) !=
> > > +	    (b->attach_state & PERF_ATTACH_TASK))
> > > +		return false;
> > > +
> > > +	/* not task */
> > > +
> > > +	return true; /* if not task, we're machine wide */
> > > +}
> > 
> > > You cut too much out there. That first test checks whether the two
> > events are of the same type; ie. both tasks or both cpu. After that you
> > still need to verify that they are indeed the same target.
> 
> This gets fixed in PATCH 10 where we actually implement monitoring of
> task events. At this point in the series, we'll return -EINVAL from
> intel_cqm_event_init() for anything other than a cpu event.
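
For reference, the two-step check Peter describes (same kind of event first,
then same target) can be modelled in userspace roughly as below. This is a
sketch under assumptions, not the code from PATCH 10: `ATTACH_TASK`,
`struct model_task` and `struct model_event` are stand-ins invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ATTACH_TASK 0x1	/* stand-in flag, not the kernel's PERF_ATTACH_TASK value */

struct model_task {
	int pid;
};

/* Hypothetical, reduced view of an event: its kind and its target task. */
struct model_event {
	unsigned int attach_state;
	const struct model_task *target;	/* NULL for cpu (system-wide) events */
};

/* Determine if @a and @b measure the same set of tasks (model only). */
static bool match_event_model(const struct model_event *a,
			      const struct model_event *b)
{
	assert(a != NULL && b != NULL);

	/* Step 1: both task events or both cpu events? */
	if ((a->attach_state & ATTACH_TASK) !=
	    (b->attach_state & ATTACH_TASK))
		return false;

	/* Step 2: for task events, the target must be the same task. */
	if (a->attach_state & ATTACH_TASK)
		return a->target == b->target;

	return true;	/* not task events, so both are machine wide */
}
```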

I was having an interesting discussion with one of the teams using this
stuff at Intel and they made the suggestion that when using,

  perf stat -p <pid>

we should by default opt for sharing an RMID between all tasks in that
thread group, rather than assigning a new RMID for each task, which is
what we do currently.

Right now, it's like the Oprah Winfrey of RMID assignment, "You get an
RMID, and you get an RMID!"

Which means we'll run out of RMIDs quicker, and enable the rotation code
sooner.

I'm wondering whether we should require that the user specify whether
they want per-thread monitoring if using -p, via some perf tools event
modifier, and make the record-per-thread-data scenario the exceptional
case, rather than the default?

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-10-08 12:15       ` Matt Fleming
@ 2014-10-08 14:47         ` Peter Zijlstra
  2014-10-08 20:51           ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 14:47 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Oct 08, 2014 at 01:15:35PM +0100, Matt Fleming wrote:
> I was having an interesting discussion with one of the teams using this
> stuff at Intel and they made the suggestion that when using,
> 
>   perf stat -p <pid>
> 
> we should by default opt for sharing an RMID between all tasks in that
> thread group, rather than assigning a new RMID for each task, which is
> what we do currently.
> 
> Right now, it's like the Oprah Winfrey of RMID assignment, "You get an
> RMID, and you get an RMID!"
> 
> Which means we'll run out of RMIDs quicker, and enable the rotation code
> sooner.
> 
> I'm wondering whether we should require that the user specify whether
> they want per-thread monitoring if using -p, via some perf tools event
> modifier, and make the record-per-thread-data scenario the exceptional
> case, rather than the default?

Right so perf cannot do this. And I'm not sure that's fixable, it
depends a bit on how the cqm thing deals with inherited events, IFF it
can reuse RMIDs for inherited events we might be able to extend the
syscall to install 'inherited' events throughout the process group,
instead of just the one thread.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-10-08 12:10     ` Matt Fleming
@ 2014-10-08 14:49       ` Peter Zijlstra
  2014-10-10 12:02         ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 14:49 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Oct 08, 2014 at 01:10:44PM +0100, Matt Fleming wrote:
> Ah, interesting.
> 
> I dropped the internal scheduling because I preferred the idea of
> "failing fast", in the sense that if we can't schedule multiple events
> simultaneously because they conflict, we should report that to the user
> at event init time, rather than trying to manage the conflict ourselves,
> with the resultant loss of accuracy.

The thing is, with multiplexing you cannot fail at event creation time
anyhow. The only time where you can 'fail' is when programming the PMU,
when it's full, it's full.

Those that don't fit, get to wait their turn.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-09-24 14:04 ` [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs Matt Fleming
  2014-10-08 11:19   ` Peter Zijlstra
@ 2014-10-08 18:08   ` Peter Zijlstra
  2014-10-08 19:02     ` Thomas Gleixner
  2014-10-08 18:10   ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 18:08 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> +	preempt_disable();
> +	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
> +		for_each_cpu(cpu, &cqm_cpumask) {
> +			unsigned long *map;
> +
> +			map = &limbo_bitmap[cpu * BITS_TO_LONGS(nr_bits)];
> +			set_bit(entry->rmid, map);
> +		}
> +	}
> +
> +	/*
> +	 * Test whether an RMID is free for each package.
> +	 */
> +	smp_call_function_many(&cqm_cpumask, intel_cqm_stable, &info, true);
> +
> +	/*
> +	 * Convert all cpu bitmaps into a single bitmap by ANDing all of
> +	 * them together. If we've still got any bits set that indicates
> +	 * an RMID is now unused on all cpus.
> +	 */
> +	bitmap_fill(free_bitmap, nr_bits);
> +	for_each_cpu(cpu, &cqm_cpumask) {
> +		unsigned long *map;
> +
> +		map = &limbo_bitmap[cpu * BITS_TO_LONGS(nr_bits)];
> +		bitmap_and(free_bitmap, free_bitmap, map, nr_bits);
> +	}
> +
> +	if (!bitmap_empty(free_bitmap, nr_bits)) {
> +		int i = -1;
> +
> +		for (; i = find_next_bit(free_bitmap, nr_bits, i+1),
> +			i < nr_bits;) {
> +			entry = __rmid_entry(i);
> +
> +			list_del(&entry->list);	/* remove from limbo */
> +
> +			/*
> +			 * The rotation RMID gets priority if it's
> +			 * currently invalid. In which case, skip adding
> +			 * the RMID to the free lru.
> +			 */
> +			if (!__rmid_valid(intel_cqm_rotation_rmid)) {
> +				intel_cqm_rotation_rmid = i;
> +				continue;
> +			}
> +
> +			/*
> +			 * If we have groups waiting for RMIDs, hand
> +			 * them one now.
> +			 */
> +			raw_spin_lock_irq(&cache_lock);
> +			list_for_each_entry(event, &cache_groups,
> +					    hw.cqm_groups_entry) {
> +				if (__rmid_valid(event->hw.cqm_rmid))
> +					continue;
> +
> +				intel_cqm_xchg_rmid(event, i);
> +				entry = NULL;
> +				break;
> +			}
> +			raw_spin_unlock_irq(&cache_lock);
> +
> +			if (!entry)
> +				continue;
> +
> +			/*
> +			 * Otherwise place it onto the free list.
> +			 */
> +			list_add_tail(&entry->list, &cqm_rmid_free_lru);
> +		}
> +	}
> +
> +	preempt_enable();

Why is all that under preempt_disable()?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-09-24 14:04 ` [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs Matt Fleming
  2014-10-08 11:19   ` Peter Zijlstra
  2014-10-08 18:08   ` Peter Zijlstra
@ 2014-10-08 18:10   ` Peter Zijlstra
  2014-10-08 20:04     ` Matt Fleming
  2 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 18:10 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> +	limbo_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(nr_bits) *
> +			       nr_cpumask_bits, GFP_KERNEL);

That's going to be a _huge_ amount of memory on SGI class systems. Do we
really need per-cpu storage for this?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-10-08 18:08   ` Peter Zijlstra
@ 2014-10-08 19:02     ` Thomas Gleixner
  2014-10-08 19:59       ` Matt Fleming
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Gleixner @ 2014-10-08 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Fleming, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, 8 Oct 2014, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> > +	preempt_disable();

< SNIP loooong code >

> > +	preempt_enable();
> 
> Why is all that under preempt_disable()?

To make life harder for people who care about latencies and
deterministic behaviour perhaps?

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-10-08 19:02     ` Thomas Gleixner
@ 2014-10-08 19:59       ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 19:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, 08 Oct, at 09:02:16PM, Thomas Gleixner wrote:
> On Wed, 8 Oct 2014, Peter Zijlstra wrote:
> > On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> > > +	preempt_disable();
> 
> < SNIP loooong code >
> 
> > > +	preempt_enable();
> > 
> > Why is all that under preempt_disable()?
> 
> To make life harder for people who care about latencies and
> deterministic behaviour perhaps?

Yeah that looks... wrong.

I'll fix that.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs
  2014-10-08 18:10   ` Peter Zijlstra
@ 2014-10-08 20:04     ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-10-08 20:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming

On Wed, 08 Oct, at 08:10:44PM, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:15PM +0100, Matt Fleming wrote:
> > +	limbo_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(nr_bits) *
> > +			       nr_cpumask_bits, GFP_KERNEL);
> 
> That's going to be a _huge_ amount of memory on SGI class systems. Do we
> really need per-cpu storage for this?

Ah, no we don't.

Allocating it the above way just makes things easier because you can
index the array directly using your cpu.

I'll shrink this down to the minimum memory needed.
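
As a rough feel for the difference, the sketch below computes the allocation
size for one bitmap per slot, where a slot is either a possible cpu (as in
the patch) or a package. The helper and macro names are made up for the
example, and any RMID or cpu counts plugged in are illustrative, not
measured.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's BITS_PER_LONG / BITS_TO_LONGS(). */
#define MODEL_BITS_PER_LONG	(sizeof(long) * CHAR_BIT)
#define MODEL_BITS_TO_LONGS(n)	\
	(((n) + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/*
 * Bytes needed for @nr_slots bitmaps of @nr_bits bits each.
 * With nr_slots == nr_cpumask_bits this mirrors the kmalloc() size in the
 * patch; with nr_slots == number of packages it is the reduced layout.
 */
static size_t limbo_bitmap_bytes(size_t nr_bits, size_t nr_slots)
{
	assert(nr_bits > 0 && nr_slots > 0);

	return sizeof(long) * MODEL_BITS_TO_LONGS(nr_bits) * nr_slots;
}
```

Since only one cpu per package is ever in cqm_cpumask, the size scales with
the slot count alone: a hypothetical 4096-cpu, 4-package machine would need
1024 times less memory with per-package slots, whatever nr_bits is.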

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support
  2014-10-08 14:47         ` Peter Zijlstra
@ 2014-10-08 20:51           ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-08 20:51 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Oct 08, 2014 at 04:47:04PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 01:15:35PM +0100, Matt Fleming wrote:
> > I was having an interesting discussion with one of the teams using this
> > stuff at Intel and they made the suggestion that when using,
> > 
> >   perf stat -p <pid>
> > 
> > we should by default opt for sharing an RMID between all tasks in that
> > thread group, rather than assigning a new RMID for each task, which is
> > what we do currently.
> > 
> > Right now, it's like the Oprah Winfrey of RMID assignment, "You get an
> > RMID, and you get an RMID!"
> > 
> > Which means we'll run out of RMIDs quicker, and enable the rotation code
> > sooner.
> > 
> > I'm wondering whether we should require that the user specify whether
> > they want per-thread monitoring if using -p, via some perf tools event
> > modifier, and make the record-per-thread-data scenario the exceptional
> > case, rather than the default?
> 
> Right so perf cannot do this. And I'm not sure that's fixable, it
> depends a bit on how the cqm thing deals with inherited events, IFF it
> can reuse RMIDs for inherited events we might be able to extend the
> syscall to install 'inherited' events throughout the process group,
> instead of just the one thread.

Right, so inherited events aren't going to work right for this. Look at
patch 10, these events target a different task and will not match, so
we'll not share RMIDs.

The only way to make this happen is stuff the process in a cgroup and
measure the cgroup.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-10-08 11:07   ` Peter Zijlstra
  2014-10-08 12:10     ` Matt Fleming
@ 2014-10-10 10:54     ` Peter Zijlstra
  2014-10-10 11:10       ` Matt Fleming
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-10 10:54 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, Oct 08, 2014 at 01:07:43PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 24, 2014 at 03:04:14PM +0100, Matt Fleming wrote:
> > From: Matt Fleming <matt.fleming@intel.com>
> > 
> > Add support for task events as well as system-wide events. This change
> > has a big impact on the way that we gather LLC occupancy values in
> > intel_cqm_event_read().
> > 
> > Currently, for system-wide (per-cpu) events we defer processing to
> > userspace which knows how to discard all but one cpu result per package.
> > 
> > Things aren't so simple for task events because we need to do the value
> > aggregation ourselves. To do this, we defer updating the LLC occupancy
> > value in event->count from intel_cqm_event_read() and do an SMP
> > cross-call to read values for all packages in intel_cqm_event_count().
> > We need to ensure that we only do this for one task event per cache
> > group, otherwise we'll report duplicate values.
> > 
> > If we're a system-wide event we want to fall back to the default
> > perf_event_count() implementation. Refactor this into a common function
> > so that we don't duplicate the code.
> 
> So it looks like these events will be classified as regular HW events,
> this means they'll be mixed with the other HW events, and we'll stop
> scheduling the moment either one returns a fail.
> 
> There are two alternatives;
>  1) create an extra task context to keep them in
>  2) pretend to be a software event and do the scheduling yourself
> 
> I think my initial proposal was 2, can you clarify why you've changed
> that? Lemme go read the next patch though, maybe that'll clarify things
> further.

As you rightly pointed out on IRC, the following (as found in patch 8):

+static struct pmu intel_cqm_pmu = {
+       .attr_groups    = intel_cqm_attr_groups,
+       .task_ctx_nr    = perf_sw_context,

Does indeed make it a software event, so we then need to do all the
scheduling ourselves.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-10-10 10:54     ` Peter Zijlstra
@ 2014-10-10 11:10       ` Matt Fleming
  0 siblings, 0 replies; 46+ messages in thread
From: Matt Fleming @ 2014-10-10 11:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Fri, 10 Oct, at 12:54:25PM, Peter Zijlstra wrote:
> 
> As you rightly pointed out on IRC, the following (as found in patch 8):
> 
> +static struct pmu intel_cqm_pmu = {
> +       .attr_groups    = intel_cqm_attr_groups,
> +       .task_ctx_nr    = perf_sw_context,
> 
> Does indeed make it a software event, so we then need to do all the
> scheduling ourselves.

OK, I'll take a look at that.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-10-08 14:49       ` Peter Zijlstra
@ 2014-10-10 12:02         ` Matt Fleming
  2014-10-10 14:02           ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Matt Fleming @ 2014-10-10 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Wed, 08 Oct, at 04:49:57PM, Peter Zijlstra wrote:
> 
> The thing is, with multiplexing you cannot fail at event creation time
> anyhow. The only time where you can 'fail' is when programming the PMU,
> when it's full, it's full.
> 
> Those that don't fit, get to wait their turn.

For CQM it's not about "fitting" everything in the PMU but more about
monitoring the same thing (task, cgroup) with different events, i.e. one
thing with two RMIDs. We have the RMID recycling algorithm to make
things fit, but that doesn't help us out here.

An example scenario that isn't supported by this patch series is
monitoring a cgroup while simultaneously monitoring a task that's part
of that cgroup. Whichever event is created second will fail at event
init time.

And that seemed like a fair approach to me. But the more I think about
it, the more I begin to agree that maybe we should allow users the
flexibility to create conflicting events, particularly because there
appears to be precedent in other parts of perf.
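
To pin down what "conflicting" means in the cgroup-plus-task scenario, here
is a small userspace model. It assumes a flat (non-nested) cgroup layout and
uses invented types and names throughout; it is not how the series detects
conflicts, only an illustration of the condition.

```c
#include <assert.h>
#include <stdbool.h>

enum target_kind { TARGET_TASK, TARGET_CGROUP };

/* Hypothetical description of what an event monitors. */
struct model_target {
	enum target_kind kind;
	int id;		/* task id or cgroup id, depending on kind */
	int cgroup_id;	/* tasks only: the cgroup the task belongs to */
};

/*
 * Two targets conflict when one is a task and the other is the cgroup
 * containing that task, because the task would then need two RMIDs.
 * Assumption: cgroups are flat, so two cgroup targets never overlap.
 */
static bool targets_conflict(const struct model_target *a,
			     const struct model_target *b)
{
	assert(a && b);

	/* Same kind: either the same target (shareable) or disjoint. */
	if (a->kind == b->kind)
		return false;

	const struct model_target *task = (a->kind == TARGET_TASK) ? a : b;
	const struct model_target *cgrp = (a->kind == TARGET_CGROUP) ? a : b;

	return task->cgroup_id == cgrp->id;
}
```

Under this model, failing fast means rejecting the second event at init time
whenever targets_conflict() is true, while the alternative is to accept both
and time-share the monitoring between them.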

Hmm... "rotation" is starting to become my least favourite word.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM
  2014-10-10 12:02         ` Matt Fleming
@ 2014-10-10 14:02           ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2014-10-10 14:02 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, linux-kernel, H. Peter Anvin, Matt Fleming,
	Arnaldo Carvalho de Melo

On Fri, Oct 10, 2014 at 01:02:02PM +0100, Matt Fleming wrote:
> On Wed, 08 Oct, at 04:49:57PM, Peter Zijlstra wrote:
> > 
> > The thing is, with multiplexing you cannot fail at event creation time
> > anyhow. The only time where you can 'fail' is when programming the PMU,
> > when it's full, it's full.
> > 
> > Those that don't fit, get to wait their turn.
> 
> For CQM it's not about "fitting" everything in the PMU but more about
> monitoring the same thing (task, cgroup) with different events, i.e. one
> thing with two RMIDs. We have the RMID recycling algorithm to make
> things fit, but that doesn't help us out here.
> 
> An example scenario that isn't supported by this patch series is
> monitoring a cgroup while simultaneously monitoring a task that's part
> of that cgroup. Whichever event is created second will fail at event
> init time.

That was what I had that conflict thing for, ISTR seeing some parts of
that here.

> And that seemed like a fair approach to me. But the more I think about
> it, the more I begin to agree that maybe we should allow users the
> flexibility to create conflicting events, particularly because there
> appears to be precedent in other parts of perf.

Yeah, although typically it's not this hard. CQM is 'interesting' because
it's so different.

> Hmm... "rotation" is starting to become my least favourite word.

:-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2014-10-10 14:02 UTC | newest]

Thread overview: 46+ messages
2014-09-24 14:04 [PATCH 00/11] perf: Intel Cache QoS Monitoring support Matt Fleming
2014-09-24 14:04 ` [PATCH 01/11] perf stat: Fix AGGR_CORE segfault on multi-socket system Matt Fleming
2014-09-24 14:04 ` [PATCH 02/11] perf tools: Refactor unit and scale function parameters Matt Fleming
2014-09-29  9:21   ` Jiri Olsa
2014-10-03  5:25   ` [tip:perf/core] " tip-bot for Matt Fleming
2014-09-24 14:04 ` [PATCH 03/11] perf tools: Parse event per-package info files Matt Fleming
2014-09-24 14:04 ` [PATCH 04/11] perf: Make perf_cgroup_from_task() global Matt Fleming
2014-10-07 18:51   ` Peter Zijlstra
2014-10-08 10:27     ` Matt Fleming
2014-09-24 14:04 ` [PATCH 05/11] perf: Add ->count() function to read per-package counters Matt Fleming
2014-09-24 14:04 ` [PATCH 06/11] perf: Move cgroup init before PMU ->event_init() Matt Fleming
2014-10-07 19:34   ` Peter Zijlstra
2014-10-08 10:32     ` Matt Fleming
2014-10-08 10:49       ` Peter Zijlstra
2014-09-24 14:04 ` [PATCH 07/11] x86: Add support for Intel Cache QoS Monitoring (CQM) detection Matt Fleming
2014-09-24 14:04 ` [PATCH 08/11] perf/x86/intel: Add Intel Cache QoS Monitoring support Matt Fleming
2014-09-24 16:40   ` Andi Kleen
2014-09-24 20:27     ` Matt Fleming
2014-09-24 20:39       ` Andi Kleen
2014-09-29 18:51         ` Matt Fleming
2014-10-07 19:43   ` Peter Zijlstra
2014-10-08 10:36     ` Matt Fleming
2014-10-08 12:15       ` Matt Fleming
2014-10-08 14:47         ` Peter Zijlstra
2014-10-08 20:51           ` Peter Zijlstra
2014-09-24 14:04 ` [PATCH 09/11] perf/x86/intel: Implement LRU monitoring ID allocation for CQM Matt Fleming
2014-10-08  9:51   ` Peter Zijlstra
2014-10-08 10:53     ` Matt Fleming
2014-09-24 14:04 ` [PATCH 10/11] perf/x86/intel: Support task events with Intel CQM Matt Fleming
2014-09-24 14:45   ` Matt Fleming
2014-10-08 10:01   ` Peter Zijlstra
2014-10-08 11:07   ` Peter Zijlstra
2014-10-08 12:10     ` Matt Fleming
2014-10-08 14:49       ` Peter Zijlstra
2014-10-10 12:02         ` Matt Fleming
2014-10-10 14:02           ` Peter Zijlstra
2014-10-10 10:54     ` Peter Zijlstra
2014-10-10 11:10       ` Matt Fleming
2014-09-24 14:04 ` [PATCH 11/11] perf/x86/intel: Perform rotation on Intel CQM RMIDs Matt Fleming
2014-10-08 11:19   ` Peter Zijlstra
2014-10-08 11:56     ` Matt Fleming
2014-10-08 18:08   ` Peter Zijlstra
2014-10-08 19:02     ` Thomas Gleixner
2014-10-08 19:59       ` Matt Fleming
2014-10-08 18:10   ` Peter Zijlstra
2014-10-08 20:04     ` Matt Fleming
