* [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
@ 2017-01-06 21:59 Vikas Shivappa
  2017-01-06 21:59 ` [PATCH 01/12] Documentation, x86/cqm: Intel Resource Monitoring Documentation Vikas Shivappa
                   ` (12 more replies)
  0 siblings, 13 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

Resending version 5 with updated send list. Sorry for the spam.

CQM (cache quality monitoring) is part of Intel RDT (Resource Director
Technology), which enables monitoring and control of shared processor
resources via an MSR interface.
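
For context, occupancy is reported through a pair of MSRs. A minimal C
sketch of that flow, mirroring __rmid_read() in the cqm driver this
series touches (the helper name read_llc_occupancy() is illustrative,
not a kernel symbol):

#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_QM_EVTSEL	0x0c8d
#define MSR_IA32_QM_CTR		0x0c8e
#define QOS_L3_OCCUP_EVENT_ID	0x01
#define RMID_VAL_ERROR		(1ULL << 63)
#define RMID_VAL_UNAVAIL	(1ULL << 62)

static u64 read_llc_occupancy(u32 rmid)
{
	u64 val;

	/* Select the <event, RMID> pair, then read the 64-bit counter. */
	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
	rdmsrl(MSR_IA32_QM_CTR, val);

	/* The top bits flag an invalid RMID or data not yet available. */
	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
		return 0;

	return val;	/* occupancy in units of cqm_l3_scale bytes */
}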

The current upstream cqm (cache monitoring) code has major issues which
make the feature almost unusable. This series tries to fix them, and also
addresses Thomas's comments on previous versions of the cqm2 patch series
to better document/organize what we are trying to fix.

	Changes in V5
- Based on Peterz's feedback, removed the file interface in the perf_event
cgroup to start and stop continuous monitoring.
- Based on Andi's feedback and references, David has sent a patch optimizing
the perf overhead as a separate patch, which is generic and not cqm
specific.

This is a continuation of the patch series David (davidcc@google.com)
previously posted; hence it is based on his patches and also tries to fix
the same issues. Patches apply on 4.10-rc2.

Below are the issues and the fixes we attempt:

- Issue(1): Inaccurate per-package and systemwide data. Just prints
zeros or arbitrary numbers.

Fix: The patches fix this by throwing an error if the mode is not supported.
The supported modes are task monitoring and cgroup monitoring.
The per-package data for, say, socket x is returned with the
-C <cpu on socketx> -G cgrpy option.
The systemwide data can be obtained by monitoring the root cgroup.

- Issue(2): RMIDs are global and don't scale with more packages, and hence
we also run out of RMIDs very soon.

Fix: Support per-pkg RMIDs, which scale better with more packages and
give us more RMIDs to use, allocated only when needed (i.e. when tasks
are actually scheduled on the package).
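
A rough C sketch of the per-pkg idea (it follows patch 04 later in this
series; alloc_pkg_rmid() is an illustrative stand-in for the patch's
alloc_needed_pkg_rmid(), and locking is elided):

#include <linux/types.h>

static u32 __get_rmid(int pkg);		/* per-package free list, patch 04 */
static bool __rmid_valid(u32 rmid);

/*
 * Each event carries an array of RMIDs, one slot per package, filled
 * lazily the first time the task is scheduled in on that package.
 */
static void alloc_pkg_rmid(u32 *cqm_rmid, int pkg_id)
{
	u32 rmid;

	if (cqm_rmid[pkg_id])		/* already allocated on this pkg */
		return;

	rmid = __get_rmid(pkg_id);	/* may fail: limited h/w resource */
	if (__rmid_valid(rmid))
		cqm_rmid[pkg_id] = rmid;
	/* else leave it 0; a read on this package then reports an error */
}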

- Issue(3): Cgroup monitoring is incomplete. No hierarchical monitoring
support; inconsistent or wrong data is seen when monitoring a cgroup.

Fix: Full cgroup monitoring support is added. Different cgroups in the
same hierarchy can be monitored together or separately, and a task can
be monitored together with the cgroup it belongs to.

- Issue(4): A lot of inconsistent data is currently seen when we monitor
different kinds of events, like cgroup and task events, *together*.

Fix: Patch adds support to monitor a cgroup x and a task p1 within
cgroup x, and also to monitor different cgroups and tasks together.

- Issue(5): CAT and cqm/mbm write the same PQR_ASSOC MSR separately.
Fix: Integrate the sched_in code and write the PQR MSR only once on every switch_to.
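
Roughly, the combined sched_in path looks like the sketch below. This is
a simplified illustration, not the exact series code: the function name
and the reduced intel_pqr_state are stand-ins, but the MSR layout (RMID
in the low half, CLOSID in the high half of PQR_ASSOC) matches what the
patches program with a single wrmsr():

#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

struct intel_pqr_state {
	u32 rmid;	/* monitoring id for cqm/mbm */
	u32 closid;	/* allocation class for CAT */
};

static void rdt_sched_in_sketch(struct intel_pqr_state *state,
				u32 next_rmid, u32 next_closid)
{
	/* Skip the expensive MSR write when nothing changed. */
	if (state->rmid == next_rmid && state->closid == next_closid)
		return;

	state->rmid = next_rmid;
	state->closid = next_closid;
	wrmsr(MSR_IA32_PQR_ASSOC, next_rmid, next_closid);
}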

- Issue(6): RMID recycling leads to inaccurate data, complicates the
code and increases the code footprint. Currently it almost makes the
feature *unusable*, as we only see zeros and inconsistent data once we
run out of RMIDs in the lifetime of a system boot. The only way to get
correct numbers is to reboot the system once we run out of RMIDs.

Root cause: Recycling steals an RMID from an existing event x and gives
it to another event y. However, due to the nature of monitoring
llc_occupancy, we may miss tracking an unknown (possibly large) part of
the cache fills during the time the event does not have an RMID. Hence
the user ends up with inaccurate data for both events x and y, and the
inaccuracy is arbitrary and cannot be measured. Even if an event x gets
another RMID very soon after losing the previous one, we still miss all
the occupancy data that was tied to the previous RMID, which means we
cannot get accurate data even when the event has an RMID for most of the
time. There is no way to guarantee accurate results with recycling, and
the data is inaccurate to an arbitrary degree. The fact that an event can
lose an RMID at any time complicates a lot of code in sched_in, init,
count and read. It also complicates mbm, as we may lose the RMID at any
time and hence need to keep a history of all the old counts.

Fix: Recycling is removed, based on Tony's original observation that it
introduces a lot of code, fails to provide accurate data and hence has
questionable benefits. Despite several attempts to improve the recycling,
there is no way to guarantee accurate data, as explained above, and the
incorrectness is of arbitrary degree (we cannot say, for example, that
the data is off by x%). As a fix we introduce per-pkg RMIDs to mitigate
the scarcity of RMIDs to a large extent - this is because RMIDs are
plenty: about 2 to 4 per logical processor/SMT thread on each package.
So on a 2-socket BDW system with, say, 44 logical processors/SMT threads
we have 176 RMIDs on each package (a total of 2x176 = 352 RMIDs). Also,
cgroup is fully supported, so many threads, like all the threads in one
VM/container, can be grouped and use just one RMID. The RMIDs scale with
the number of sockets. If we still run out of RMIDs, perf read throws an
error because we are not able to monitor; we have run out of a limited
h/w resource.
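
In patch 02 this shows up at event creation time (later patches move the
allocation to sched_in and surface the error on read). A small C sketch
of the policy; cqm_assign_rmid() is an illustrative wrapper around the
real __get_rmid()/__rmid_valid() helpers:

#include <linux/errno.h>
#include <linux/printk.h>
#include <linux/types.h>

static u32 __get_rmid(void);		/* pulls from the free list */
static bool __rmid_valid(u32 rmid);

/* No stealing from other events: if the free list is empty, just fail. */
static int cqm_assign_rmid(u32 *out_rmid)
{
	u32 rmid = __get_rmid();

	if (!__rmid_valid(rmid)) {
		pr_info("out of RMIDs\n");
		return -EINVAL;		/* surfaces as a perf open error */
	}

	*out_rmid = rmid;
	return 0;
}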

This is better than recycling (even with a better version than the one
upstream), where the user thinks events are being monitored but they are
actually not monitored for arbitrary amounts of time, resulting in
inaccurate data of arbitrary degree. The inaccurate data defeats the
purpose of RDT, whose goal is to provide consistent system behaviour by
giving the ability to monitor and control processor resources in an
accurate and reliable fashion. The fix instead provides accurate data and
to a large extent mitigates the RMID scarcity.

What's working now (unit tested):
Task monitoring, cgroup hierarchical monitoring, monitoring multiple
cgroups, cgroup and task in the same cgroup,
per-pkg RMIDs, error on read.

TBD:
- Most of MBM is working but will need updates for hierarchical
monitoring and other changes related to the new features we introduce.

Below is a list of the patches and what each patch fixes. Each commit
message also gives details on what the patch actually fixes among the
bunch:

[PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling

Before the patch: The user sees only zeros or wrong data once we run out
of RMIDs.
After: The user sees either correct data or an error that we have run out
of RMIDs.

[PATCH 03/12] x86/rdt: Add rdt common/cqm compile option
[PATCH 04/12] x86/cqm: Add Per pkg rmid support

Before patch: RMIDs are global.
Tests: Available RMIDs increase by x times, where x is the number of packages.
Adds LAZY RMID allocation - RMIDs are allocated at first sched_in.

[PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
[PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support
[PATCH 07/12] x86/rdt,cqm: Scheduling support update

Before patch: cgroup monitoring is not fully supported.
After: cgroup monitoring is fully supported, including hierarchical
monitoring.

[PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup

Before patch: cgroup and task could not be monitored together; doing so
resulted in a lot of inconsistent data.
After: Can monitor a task and a cgroup together, and also supports
monitoring a task within a cgroup and the cgroup together.

[PATCH 9/12] x86/cqm: Add RMID reuse

Before patch: Once an RMID is used, it is never used again.
After: We reuse the RMIDs which are freed. The user can specify NOLAZY
RMID allocation, and open fails if we fail to get all RMIDs at open.

[PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg
[PATCH 11/12] perf/stat: fix bug in handling events in error state
[PATCH 12/12] perf/stat: revamp read error handling, snapshot and

Patches 1/12 - 9/12 add all the features, but the data is not visible to
perf/core nor to perf user mode. Patches 11-12 fix this and make the
data available to perf user mode.


* [PATCH 01/12] Documentation, x86/cqm: Intel Resource Monitoring Documentation
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
@ 2017-01-06 21:59 ` Vikas Shivappa
  2017-01-06 21:59 ` [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling Vikas Shivappa
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

Add documentation for the usage of cqm and mbm events via the perf
interface, with examples.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 Documentation/x86/intel_rdt_mon_ui.txt | 62 ++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 Documentation/x86/intel_rdt_mon_ui.txt

diff --git a/Documentation/x86/intel_rdt_mon_ui.txt b/Documentation/x86/intel_rdt_mon_ui.txt
new file mode 100644
index 0000000..881fa58
--- /dev/null
+++ b/Documentation/x86/intel_rdt_mon_ui.txt
@@ -0,0 +1,62 @@
+User Interface for Resource Monitoring in Intel Resource Director Technology
+
+Vikas Shivappa<vikas.shivappa@intel.com>
+David Carrillo-Cisneros<davidcc@google.com>
+Stephane Eranian <eranian@google.com>
+
+This feature is enabled by the CONFIG_INTEL_RDT_M Kconfig and the
+X86 /proc/cpuinfo flag bits cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
+
+Resource Monitoring
+-------------------
+Resource Monitoring includes cqm (cache quality monitoring) and
+mbm (memory bandwidth monitoring) and uses the perf interface. A
+lightweight interface to enable monitoring without perf is provided as well.
+
+CQM provides the OS/VMM a way to monitor LLC occupancy. It measures the
+amount of L3 cache fills per task or cgroup.
+
+MBM provides the OS/VMM a way to monitor bandwidth from one level of the
+cache hierarchy to another. The current patches support L3 external
+bandwidth monitoring. Both 'local bandwidth' and 'total bandwidth'
+monitoring are supported for the socket. Local bandwidth measures the
+amount of data sent through the memory controller on the socket, and
+total b/w measures the total system bandwidth.
+
+To check the monitoring events enabled:
+
+$ ./tools/perf/perf list | grep -i cqm
+intel_cqm/llc_occupancy/                           [Kernel PMU event]
+intel_cqm/local_bytes/                             [Kernel PMU event]
+intel_cqm/total_bytes/                             [Kernel PMU event]
+
+Monitoring tasks and cgroups using perf
+---------------------------------------
+Monitoring tasks and cgroups is like using any other perf event.
+
+$perf stat -I 1000 -e intel_cqm/local_bytes/ -p PID1
+
+This will monitor the local_bytes event of PID1 and report once
+every 1000ms.
+
+$mkdir /sys/fs/cgroup/perf_event/p1
+$echo PID1 > /sys/fs/cgroup/perf_event/p1/tasks
+$echo PID2 > /sys/fs/cgroup/perf_event/p1/tasks
+
+$perf stat -I 1000 -e intel_cqm/llc_occupancy/ -a -G p1
+
+This will monitor the llc_occupancy event of the perf cgroup p1 in
+interval mode.
+
+Hierarchical monitoring should work just like other events. Users can
+also monitor a task within a cgroup and the cgroup together, or
+different cgroups in the same hierarchy can be monitored together.
+
+The events are associated with RMIDs and are grouped when optimal. The
+RMIDs are a limited hardware resource, and if we run out of them the
+events will just return an error on read.
+
+To obtain per-package data for a cgroup (on package x), provide any
+cpu in the package as input to -C:
+
+$perf stat -I 1000 -e intel_cqm/llc_occupancy/ -C <cpu_y on package_x> -G p1
-- 
1.9.1


* [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
  2017-01-06 21:59 ` [PATCH 01/12] Documentation, x86/cqm: Intel Resource Monitoring Documentation Vikas Shivappa
@ 2017-01-06 21:59 ` Vikas Shivappa
  2017-01-06 21:59 ` [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option Vikas Shivappa
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

From: David Carrillo-Cisneros <davidcc@google.com>

Until now we only supported global RMIDs and not per-pkg RMIDs; hence
the RMIDs would not scale and we would more easily run out of them.
When we ran out of RMIDs we did RMID recycling, which led to several
issues, some of which are listed below, and to complications which may
outweigh the benefits.

RMID recycling 'steals' an RMID from an event x that is being monitored
and gives it to an event y that is also being monitored.
- This does not guarantee that we get correct data for both events, as
there are always times when the events are not being monitored. Hence
the user may be given incorrect data. The extent of the error is
arbitrary and unknown, as we cannot measure how much of the occupancy we
missed.
- It significantly complicated the usage of RMIDs, the reading of data
and MBM counting, because when reading the counters we had to keep a
history of previous counts and also make sure that no one had stolen the
RMID we want to use. All of this had a large code footprint.
- When a process changes RMID, it no longer tracks the cache fills it
had under the old RMID - so we do not even guarantee the data to be
correct for the time it is monitored, even if it had an RMID for almost
all of that time.
- There was inconsistent data in the current code due to the way RMIDs
were recycled: we ended up stealing RMIDs from monitored events before
using the freed RMIDs on the limbo list. This led to zero counts as soon
as one ran out of RMIDs in the lifetime of a system boot, which almost
makes the feature unusable. Also, the state transitions were messy.

This patch removes RMID recycling and just throws an error when there
are no more RMIDs to monitor with. H/w provides ~4 RMIDs per logical
processor on each package, and if we use all of them, and only when the
tasks are scheduled on the packages, the problem of scarcity can be
mitigated.

The conflict handling is also removed, as it resulted in a lot of
inconsistent data when different kinds of events, like systemwide,
cgroup and task, were monitored together.

David's original patch was modified by Vikas <vikas.shivappa@linux.intel.com>
to only remove the recycling parts from the code and to edit the commit
message.

Tests: cqm should either give correct numbers for task events or just
throw an error that it has run out of RMIDs.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c | 647 ++------------------------------------------
 1 file changed, 30 insertions(+), 617 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 8c00dc0..7c37a25 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -4,6 +4,8 @@
  * Based very, very heavily on work by Peter Zijlstra.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
@@ -18,6 +20,7 @@
  * Guaranteed time in ms as per SDM where MBM counters will not overflow.
  */
 #define MBM_CTR_OVERFLOW_TIME	1000
+#define RMID_DEFAULT_QUEUE_TIME	250
 
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
@@ -91,15 +94,6 @@ struct sample {
 #define QOS_MBM_TOTAL_EVENT_ID	0x02
 #define QOS_MBM_LOCAL_EVENT_ID	0x03
 
-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
-
 #define INVALID_RMID		(-1)
 
 /*
@@ -112,7 +106,7 @@ struct sample {
  */
 static inline bool __rmid_valid(u32 rmid)
 {
-	if (!rmid || rmid == INVALID_RMID)
+	if (!rmid || rmid > cqm_max_rmid)
 		return false;
 
 	return true;
@@ -137,8 +131,7 @@ static u64 __rmid_read(u32 rmid)
 }
 
 enum rmid_recycle_state {
-	RMID_YOUNG = 0,
-	RMID_AVAILABLE,
+	RMID_AVAILABLE = 0,
 	RMID_DIRTY,
 };
 
@@ -228,7 +221,7 @@ static void __put_rmid(u32 rmid)
 	entry = __rmid_entry(rmid);
 
 	entry->queue_time = jiffies;
-	entry->state = RMID_YOUNG;
+	entry->state = RMID_DIRTY;
 
 	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
 }
@@ -281,10 +274,6 @@ static int intel_cqm_setup_rmid_cache(void)
 	entry = __rmid_entry(0);
 	list_del(&entry->list);
 
-	mutex_lock(&cache_mutex);
-	intel_cqm_rotation_rmid = __get_rmid();
-	mutex_unlock(&cache_mutex);
-
 	return 0;
 
 fail:
@@ -343,92 +332,6 @@ static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
 }
 #endif
 
-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- *		   PROHIBITS
- *     system-wide    -> 	cgroup and task
- *     cgroup 	      ->	system-wide
- *     		      ->	task in cgroup
- *     task 	      -> 	system-wide
- *     		      ->	task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
-	/*
-	 * We can have any number of cgroups but only one system-wide
-	 * event at a time.
-	 */
-	if (a->cgrp && b->cgrp) {
-		struct perf_cgroup *ac = a->cgrp;
-		struct perf_cgroup *bc = b->cgrp;
-
-		/*
-		 * This condition should have been caught in
-		 * __match_event() and we should be sharing an RMID.
-		 */
-		WARN_ON_ONCE(ac == bc);
-
-		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-			return true;
-
-		return false;
-	}
-
-	if (a->cgrp || b->cgrp) {
-		struct perf_cgroup *ac, *bc;
-
-		/*
-		 * cgroup and system-wide events are mutually exclusive
-		 */
-		if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
-		    (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
-			return true;
-
-		/*
-		 * Ensure neither event is part of the other's cgroup
-		 */
-		ac = event_to_cgroup(a);
-		bc = event_to_cgroup(b);
-		if (ac == bc)
-			return true;
-
-		/*
-		 * Must have cgroup and non-intersecting task events.
-		 */
-		if (!ac || !bc)
-			return false;
-
-		/*
-		 * We have cgroup and task events, and the task belongs
-		 * to a cgroup. Check for for overlap.
-		 */
-		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-			return true;
-
-		return false;
-	}
-#endif
-	/*
-	 * If one of them is not a task, same story as above with cgroups.
-	 */
-	if (!(a->attach_state & PERF_ATTACH_TASK) ||
-	    !(b->attach_state & PERF_ATTACH_TASK))
-		return true;
-
-	/*
-	 * Must be non-overlapping.
-	 */
-	return false;
-}
-
 struct rmid_read {
 	u32 rmid;
 	u32 evt_type;
@@ -458,461 +361,14 @@ static void cqm_mask_call(struct rmid_read *rr)
 }
 
 /*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
-	struct perf_event *event;
-	struct list_head *head = &group->hw.cqm_group_entry;
-	u32 old_rmid = group->hw.cqm_rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	/*
-	 * If our RMID is being deallocated, perform a read now.
-	 */
-	if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
-		struct rmid_read rr = {
-			.rmid = old_rmid,
-			.evt_type = group->attr.config,
-			.value = ATOMIC64_INIT(0),
-		};
-
-		cqm_mask_call(&rr);
-		local64_set(&group->count, atomic64_read(&rr.value));
-	}
-
-	raw_spin_lock_irq(&cache_lock);
-
-	group->hw.cqm_rmid = rmid;
-	list_for_each_entry(event, head, hw.cqm_group_entry)
-		event->hw.cqm_rmid = rmid;
-
-	raw_spin_unlock_irq(&cache_lock);
-
-	/*
-	 * If the allocation is for mbm, init the mbm stats.
-	 * Need to check if each event in the group is mbm event
-	 * because there could be multiple type of events in the same group.
-	 */
-	if (__rmid_valid(rmid)) {
-		event = group;
-		if (is_mbm_event(event->attr.config))
-			init_mbm_sample(rmid, event->attr.config);
-
-		list_for_each_entry(event, head, hw.cqm_group_entry) {
-			if (is_mbm_event(event->attr.config))
-				init_mbm_sample(rmid, event->attr.config);
-		}
-	}
-
-	return old_rmid;
-}
-
-/*
- * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
- * cachelines are still tagged with RMIDs in limbo, we progressively
- * increment the threshold until we find an RMID in limbo with <=
- * __intel_cqm_threshold lines tagged. This is designed to mitigate the
- * problem where cachelines tagged with an RMID are not steadily being
- * evicted.
- *
- * On successful rotations we decrease the threshold back towards zero.
- *
  * __intel_cqm_max_threshold provides an upper bound on the threshold,
  * and is measured in bytes because it's exposed to userland.
  */
 static unsigned int __intel_cqm_threshold;
 static unsigned int __intel_cqm_max_threshold;
 
-/*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
-	struct cqm_rmid_entry *entry;
-
-	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-		if (entry->state != RMID_AVAILABLE)
-			break;
-
-		if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
-			entry->state = RMID_DIRTY;
-	}
-}
-
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
-	struct perf_event *leader, *event;
-
-	lockdep_assert_held(&cache_mutex);
-
-	leader = list_first_entry(&cache_groups, struct perf_event,
-				  hw.cqm_groups_entry);
-	event = leader;
-
-	list_for_each_entry_continue(event, &cache_groups,
-				     hw.cqm_groups_entry) {
-		if (__rmid_valid(event->hw.cqm_rmid))
-			continue;
-
-		if (__conflict_event(event, leader))
-			continue;
-
-		intel_cqm_xchg_rmid(event, rmid);
-		return true;
-	}
-
-	return false;
-}
-
-/*
- * Initially use this constant for both the limbo queue time and the
- * rotation timer interval, pmu::hrtimer_interval_ms.
- *
- * They don't need to be the same, but the two are related since if you
- * rotate faster than you recycle RMIDs, you may run out of available
- * RMIDs.
- */
-#define RMID_DEFAULT_QUEUE_TIME 250	/* ms */
-
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @nr_available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @nr_available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
-	struct cqm_rmid_entry *entry, *tmp;
-
-	lockdep_assert_held(&cache_mutex);
-
-	*available = 0;
-	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-		unsigned long min_queue_time;
-		unsigned long now = jiffies;
-
-		/*
-		 * We hold RMIDs placed into limbo for a minimum queue
-		 * time. Before the minimum queue time has elapsed we do
-		 * not recycle RMIDs.
-		 *
-		 * The reasoning is that until a sufficient time has
-		 * passed since we stopped using an RMID, any RMID
-		 * placed onto the limbo list will likely still have
-		 * data tagged in the cache, which means we'll probably
-		 * fail to recycle it anyway.
-		 *
-		 * We can save ourselves an expensive IPI by skipping
-		 * any RMIDs that have not been queued for the minimum
-		 * time.
-		 */
-		min_queue_time = entry->queue_time +
-			msecs_to_jiffies(__rmid_queue_time_ms);
-
-		if (time_after(min_queue_time, now))
-			break;
-
-		entry->state = RMID_AVAILABLE;
-		(*available)++;
-	}
-
-	/*
-	 * Fast return if none of the RMIDs on the limbo list have been
-	 * sitting on the queue for the minimum queue time.
-	 */
-	if (!*available)
-		return false;
-
-	/*
-	 * Test whether an RMID is free for each package.
-	 */
-	on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
-
-	list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
-		/*
-		 * Exhausted all RMIDs that have waited min queue time.
-		 */
-		if (entry->state == RMID_YOUNG)
-			break;
-
-		if (entry->state == RMID_DIRTY)
-			continue;
-
-		list_del(&entry->list);	/* remove from limbo */
-
-		/*
-		 * The rotation RMID gets priority if it's
-		 * currently invalid. In which case, skip adding
-		 * the RMID to the the free lru.
-		 */
-		if (!__rmid_valid(intel_cqm_rotation_rmid)) {
-			intel_cqm_rotation_rmid = entry->rmid;
-			continue;
-		}
-
-		/*
-		 * If we have groups waiting for RMIDs, hand
-		 * them one now provided they don't conflict.
-		 */
-		if (intel_cqm_sched_in_event(entry->rmid))
-			continue;
-
-		/*
-		 * Otherwise place it onto the free list.
-		 */
-		list_add_tail(&entry->list, &cqm_rmid_free_lru);
-	}
-
-
-	return __rmid_valid(intel_cqm_rotation_rmid);
-}
-
-/*
- * Pick a victim group and move it to the tail of the group list.
- * @next: The first group without an RMID
- */
-static void __intel_cqm_pick_and_rotate(struct perf_event *next)
-{
-	struct perf_event *rotor;
-	u32 rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	rotor = list_first_entry(&cache_groups, struct perf_event,
-				 hw.cqm_groups_entry);
-
-	/*
-	 * The group at the front of the list should always have a valid
-	 * RMID. If it doesn't then no groups have RMIDs assigned and we
-	 * don't need to rotate the list.
-	 */
-	if (next == rotor)
-		return;
-
-	rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
-	__put_rmid(rmid);
-
-	list_rotate_left(&cache_groups);
-}
-
-/*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
-	struct perf_event *group, *g;
-	u32 rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
-		if (group == event)
-			continue;
-
-		rmid = group->hw.cqm_rmid;
-
-		/*
-		 * Skip events that don't have a valid RMID.
-		 */
-		if (!__rmid_valid(rmid))
-			continue;
-
-		/*
-		 * No conflict? No problem! Leave the event alone.
-		 */
-		if (!__conflict_event(group, event))
-			continue;
-
-		intel_cqm_xchg_rmid(group, INVALID_RMID);
-		__put_rmid(rmid);
-	}
-}
-
-/*
- * Attempt to rotate the groups and assign new RMIDs.
- *
- * We rotate for two reasons,
- *   1. To handle the scheduling of conflicting events
- *   2. To recycle RMIDs
- *
- * Rotating RMIDs is complicated because the hardware doesn't give us
- * any clues.
- *
- * There's problems with the hardware interface; when you change the
- * task:RMID map cachelines retain their 'old' tags, giving a skewed
- * picture. In order to work around this, we must always keep one free
- * RMID - intel_cqm_rotation_rmid.
- *
- * Rotation works by taking away an RMID from a group (the old RMID),
- * and assigning the free RMID to another group (the new RMID). We must
- * then wait for the old RMID to not be used (no cachelines tagged).
- * This ensure that all cachelines are tagged with 'active' RMIDs. At
- * this point we can start reading values for the new RMID and treat the
- * old RMID as the free RMID for the next rotation.
- *
- * Return %true or %false depending on whether we did any rotating.
- */
-static bool __intel_cqm_rmid_rotate(void)
-{
-	struct perf_event *group, *start = NULL;
-	unsigned int threshold_limit;
-	unsigned int nr_needed = 0;
-	unsigned int nr_available;
-	bool rotated = false;
-
-	mutex_lock(&cache_mutex);
-
-again:
-	/*
-	 * Fast path through this function if there are no groups and no
-	 * RMIDs that need cleaning.
-	 */
-	if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
-		goto out;
-
-	list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
-		if (!__rmid_valid(group->hw.cqm_rmid)) {
-			if (!start)
-				start = group;
-			nr_needed++;
-		}
-	}
-
-	/*
-	 * We have some event groups, but they all have RMIDs assigned
-	 * and no RMIDs need cleaning.
-	 */
-	if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
-		goto out;
-
-	if (!nr_needed)
-		goto stabilize;
-
-	/*
-	 * We have more event groups without RMIDs than available RMIDs,
-	 * or we have event groups that conflict with the ones currently
-	 * scheduled.
-	 *
-	 * We force deallocate the rmid of the group at the head of
-	 * cache_groups. The first event group without an RMID then gets
-	 * assigned intel_cqm_rotation_rmid. This ensures we always make
-	 * forward progress.
-	 *
-	 * Rotate the cache_groups list so the previous head is now the
-	 * tail.
-	 */
-	__intel_cqm_pick_and_rotate(start);
-
-	/*
-	 * If the rotation is going to succeed, reduce the threshold so
-	 * that we don't needlessly reuse dirty RMIDs.
-	 */
-	if (__rmid_valid(intel_cqm_rotation_rmid)) {
-		intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
-		intel_cqm_rotation_rmid = __get_rmid();
-
-		intel_cqm_sched_out_conflicting_events(start);
-
-		if (__intel_cqm_threshold)
-			__intel_cqm_threshold--;
-	}
-
-	rotated = true;
-
-stabilize:
-	/*
-	 * We now need to stablize the RMID we freed above (if any) to
-	 * ensure that the next time we rotate we have an RMID with zero
-	 * occupancy value.
-	 *
-	 * Alternatively, if we didn't need to perform any rotation,
-	 * we'll have a bunch of RMIDs in limbo that need stabilizing.
-	 */
-	threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
-
-	while (intel_cqm_rmid_stabilize(&nr_available) &&
-	       __intel_cqm_threshold < threshold_limit) {
-		unsigned int steal_limit;
-
-		/*
-		 * Don't spin if nobody is actively waiting for an RMID,
-		 * the rotation worker will be kicked as soon as an
-		 * event needs an RMID anyway.
-		 */
-		if (!nr_needed)
-			break;
-
-		/* Allow max 25% of RMIDs to be in limbo. */
-		steal_limit = (cqm_max_rmid + 1) / 4;
-
-		/*
-		 * We failed to stabilize any RMIDs so our rotation
-		 * logic is now stuck. In order to make forward progress
-		 * we have a few options:
-		 *
-		 *   1. rotate ("steal") another RMID
-		 *   2. increase the threshold
-		 *   3. do nothing
-		 *
-		 * We do both of 1. and 2. until we hit the steal limit.
-		 *
-		 * The steal limit prevents all RMIDs ending up on the
-		 * limbo list. This can happen if every RMID has a
-		 * non-zero occupancy above threshold_limit, and the
-		 * occupancy values aren't dropping fast enough.
-		 *
-		 * Note that there is prioritisation at work here - we'd
-		 * rather increase the number of RMIDs on the limbo list
-		 * than increase the threshold, because increasing the
-		 * threshold skews the event data (because we reuse
-		 * dirty RMIDs) - threshold bumps are a last resort.
-		 */
-		if (nr_available < steal_limit)
-			goto again;
-
-		__intel_cqm_threshold++;
-	}
-
-out:
-	mutex_unlock(&cache_mutex);
-	return rotated;
-}
-
-static void intel_cqm_rmid_rotate(struct work_struct *work);
-
-static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
-
 static struct pmu intel_cqm_pmu;
 
-static void intel_cqm_rmid_rotate(struct work_struct *work)
-{
-	unsigned long delay;
-
-	__intel_cqm_rmid_rotate();
-
-	delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
-	schedule_delayed_work(&intel_cqm_rmid_work, delay);
-}
-
 static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
 {
 	struct sample *mbm_current;
@@ -984,11 +440,10 @@ static void init_mbm_sample(u32 rmid, u32 evt_type)
  *
  * If we're part of a group, we use the group's RMID.
  */
-static void intel_cqm_setup_event(struct perf_event *event,
+static int intel_cqm_setup_event(struct perf_event *event,
 				  struct perf_event **group)
 {
 	struct perf_event *iter;
-	bool conflict = false;
 	u32 rmid;
 
 	event->hw.is_group_event = false;
@@ -1001,26 +456,24 @@ static void intel_cqm_setup_event(struct perf_event *event,
 			*group = iter;
 			if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
 				init_mbm_sample(rmid, event->attr.config);
-			return;
+			return 0;
 		}
 
-		/*
-		 * We only care about conflicts for events that are
-		 * actually scheduled in (and hence have a valid RMID).
-		 */
-		if (__conflict_event(iter, event) && __rmid_valid(rmid))
-			conflict = true;
 	}
 
-	if (conflict)
-		rmid = INVALID_RMID;
-	else
-		rmid = __get_rmid();
+	rmid = __get_rmid();
+
+	if (!__rmid_valid(rmid)) {
+		pr_info("out of RMIDs\n");
+		return -EINVAL;
+	}
 
 	if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
 		init_mbm_sample(rmid, event->attr.config);
 
 	event->hw.cqm_rmid = rmid;
+
+	return 0;
 }
 
 static void intel_cqm_event_read(struct perf_event *event)
@@ -1166,7 +619,6 @@ static void mbm_hrtimer_init(void)
 
 static u64 intel_cqm_event_count(struct perf_event *event)
 {
-	unsigned long flags;
 	struct rmid_read rr = {
 		.evt_type = event->attr.config,
 		.value = ATOMIC64_INIT(0),
@@ -1206,24 +658,11 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 	 * Notice that we don't perform the reading of an RMID
 	 * atomically, because we can't hold a spin lock across the
 	 * IPIs.
-	 *
-	 * Speculatively perform the read, since @event might be
-	 * assigned a different (possibly invalid) RMID while we're
-	 * busying performing the IPI calls. It's therefore necessary to
-	 * check @event's RMID afterwards, and if it has changed,
-	 * discard the result of the read.
 	 */
 	rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
-
-	if (!__rmid_valid(rr.rmid))
-		goto out;
-
 	cqm_mask_call(&rr);
+	local64_set(&event->count, atomic64_read(&rr.value));
 
-	raw_spin_lock_irqsave(&cache_lock, flags);
-	if (event->hw.cqm_rmid == rr.rmid)
-		local64_set(&event->count, atomic64_read(&rr.value));
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
 out:
 	return __perf_event_count(event);
 }
@@ -1238,34 +677,16 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 
 	event->hw.cqm_state &= ~PERF_HES_STOPPED;
 
-	if (state->rmid_usecnt++) {
-		if (!WARN_ON_ONCE(state->rmid != rmid))
-			return;
-	} else {
-		WARN_ON_ONCE(state->rmid);
-	}
-
 	state->rmid = rmid;
 	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
-	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
 		return;
 
 	event->hw.cqm_state |= PERF_HES_STOPPED;
-
-	intel_cqm_event_read(event);
-
-	if (!--state->rmid_usecnt) {
-		state->rmid = 0;
-		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);
-	} else {
-		WARN_ON_ONCE(!state->rmid);
-	}
 }
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -1342,8 +763,8 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 static int intel_cqm_event_init(struct perf_event *event)
 {
 	struct perf_event *group = NULL;
-	bool rotate = false;
 	unsigned long flags;
+	int ret = 0;
 
 	if (event->attr.type != intel_cqm_pmu.type)
 		return -ENOENT;
@@ -1373,46 +794,36 @@ static int intel_cqm_event_init(struct perf_event *event)
 
 	mutex_lock(&cache_mutex);
 
+	/* Will also set rmid, return error on RMID not being available*/
+	if (intel_cqm_setup_event(event, &group)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	/*
 	 * Start the mbm overflow timers when the first event is created.
 	*/
 	if (mbm_enabled && list_empty(&cache_groups))
 		mbm_start_timers();
 
-	/* Will also set rmid */
-	intel_cqm_setup_event(event, &group);
-
 	/*
 	* Hold the cache_lock as mbm timer handlers be
 	* scanning the list of events.
 	*/
 	raw_spin_lock_irqsave(&cache_lock, flags);
 
-	if (group) {
+	if (group)
 		list_add_tail(&event->hw.cqm_group_entry,
 			      &group->hw.cqm_group_entry);
-	} else {
+	else
 		list_add_tail(&event->hw.cqm_groups_entry,
 			      &cache_groups);
 
-		/*
-		 * All RMIDs are either in use or have recently been
-		 * used. Kick the rotation worker to clean/free some.
-		 *
-		 * We only do this for the group leader, rather than for
-		 * every event in a group to save on needless work.
-		 */
-		if (!__rmid_valid(event->hw.cqm_rmid))
-			rotate = true;
-	}
-
 	raw_spin_unlock_irqrestore(&cache_lock, flags);
+out:
 	mutex_unlock(&cache_mutex);
 
-	if (rotate)
-		schedule_delayed_work(&intel_cqm_rmid_work, 0);
-
-	return 0;
+	return ret;
 }
 
 EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
@@ -1706,6 +1117,8 @@ static int __init intel_cqm_init(void)
 	__intel_cqm_max_threshold =
 		boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);
 
+	__intel_cqm_threshold = __intel_cqm_max_threshold / cqm_l3_scale;
+
 	snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
 	str = kstrdup(scale, GFP_KERNEL);
 	if (!str) {
-- 
1.9.1


* [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
  2017-01-06 21:59 ` [PATCH 01/12] Documentation, x86/cqm: Intel Resource Monitoring Documentation Vikas Shivappa
  2017-01-06 21:59 ` [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling Vikas Shivappa
@ 2017-01-06 21:59 ` Vikas Shivappa
  2017-01-16 18:05   ` Thomas Gleixner
  2017-01-06 21:59 ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support Vikas Shivappa
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

Add a compile option INTEL_RDT which enables common code for all RDT
(Resource Director Technology) features, and a specific INTEL_RDT_M
which enables code for RDT monitoring. CQM (cache quality monitoring)
and mbm (memory b/w monitoring) are part of Intel RDT monitoring.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>

Conflicts:
	arch/x86/Kconfig
---
 arch/x86/Kconfig               | 17 +++++++++++++++++
 arch/x86/events/intel/Makefile |  3 ++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493..b2f4b24 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -412,11 +412,28 @@ config GOLDFISH
        def_bool y
        depends on X86_GOLDFISH
 
+config INTEL_RDT
+	bool
+
+config INTEL_RDT_M
+	bool "Intel Resource Director Technology Monitoring support"
+	default n
+	depends on X86 && CPU_SUP_INTEL
+	select INTEL_RDT
+	help
+	  Select to enable resource monitoring which is a sub-feature of
+	  Intel Resource Director Technology(RDT). More information about
+	  RDT can be found in the Intel x86 Architecture Software
+	  Developer Manual.
+
+	  Say N if unsure.
+
 config INTEL_RDT_A
 	bool "Intel Resource Director Technology Allocation support"
 	default n
 	depends on X86 && CPU_SUP_INTEL
 	select KERNFS
+	select INTEL_RDT
 	help
 	  Select to enable resource allocation which is a sub-feature of
 	  Intel Resource Director Technology(RDT). More information about
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 06c2baa..2e002a5 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,5 @@
-obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o
+obj-$(CONFIG_INTEL_RDT_M)		+= cqm.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= lbr.o p4.o p6.o pt.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)	+= intel-rapl-perf.o
-- 
1.9.1


* [PATCH 04/12] x86/cqm: Add Per pkg rmid support
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (2 preceding siblings ...)
  2017-01-06 21:59 ` [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option Vikas Shivappa
@ 2017-01-06 21:59 ` Vikas Shivappa
  2017-01-16 18:15   ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support\ Thomas Gleixner
  2017-01-06 21:59 ` [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare Vikas Shivappa
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

The RMID is currently global; this patch extends it to per-pkg RMIDs.
The h/w provides a set of RMIDs on each package, and the same task can
hence be associated with different RMIDs on each package.

The patch introduces a new cqm_pkgs_data structure to keep track of the
per-package free list, limbo list and other locking structures.
The corresponding rmid field in the perf_event is changed to hold an
array of u32 RMIDs instead of a single u32.

The RMIDs are not assigned at event creation time; they are assigned
lazily at the first sched_in time for a task, so an RMID is never
allocated if a task is not scheduled on a package. This allows better
usage of RMIDs and scales with the increasing number of
sockets/packages.

Locking:
event list - perf init and terminate hold the mutex. The spin lock is
held to guard against the mbm hrtimer.
per pkg free and limbo lists - global spin lock. Used by
get_rmid, put_rmid, perf start and terminate.

Tests: The number of available RMIDs increases by x times, where x is
the number of sockets, and the usage is dynamic so we save more.

The patch is based on David Carrillo-Cisneros's <davidcc@google.com>
patches in the cqm2 series.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c | 340 ++++++++++++++++++++++++--------------------
 arch/x86/events/intel/cqm.h |  37 +++++
 include/linux/perf_event.h  |   2 +-
 3 files changed, 226 insertions(+), 153 deletions(-)
 create mode 100644 arch/x86/events/intel/cqm.h

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 7c37a25..68fd1da 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -11,6 +11,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/intel_rdt_common.h>
 #include "../perf_event.h"
+#include "cqm.h"
 
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
@@ -25,7 +26,7 @@
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
+unsigned int cqm_socket_max;
 
 /*
  * The cached intel_pqr_state is strictly per CPU and can never be
@@ -83,6 +84,8 @@ struct sample {
  */
 static cpumask_t cqm_cpumask;
 
+struct pkg_data **cqm_pkgs_data;
+
 #define RMID_VAL_ERROR		(1ULL << 63)
 #define RMID_VAL_UNAVAIL	(1ULL << 62)
 
@@ -142,50 +145,11 @@ struct cqm_rmid_entry {
 	unsigned long queue_time;
 };
 
-/*
- * cqm_rmid_free_lru - A least recently used list of RMIDs.
- *
- * Oldest entry at the head, newest (most recently used) entry at the
- * tail. This list is never traversed, it's only used to keep track of
- * the lru order. That is, we only pick entries of the head or insert
- * them on the tail.
- *
- * All entries on the list are 'free', and their RMIDs are not currently
- * in use. To mark an RMID as in use, remove its entry from the lru
- * list.
- *
- *
- * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
- *
- * This list is contains RMIDs that no one is currently using but that
- * may have a non-zero occupancy value associated with them. The
- * rotation worker moves RMIDs from the limbo list to the free list once
- * the occupancy value drops below __intel_cqm_threshold.
- *
- * Both lists are protected by cache_mutex.
- */
-static LIST_HEAD(cqm_rmid_free_lru);
-static LIST_HEAD(cqm_rmid_limbo_lru);
-
-/*
- * We use a simple array of pointers so that we can lookup a struct
- * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
- * and __put_rmid() from having to worry about dealing with struct
- * cqm_rmid_entry - they just deal with rmids, i.e. integers.
- *
- * Once this array is initialized it is read-only. No locks are required
- * to access it.
- *
- * All entries for all RMIDs can be looked up in the this array at all
- * times.
- */
-static struct cqm_rmid_entry **cqm_rmid_ptrs;
-
-static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
+static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
 {
 	struct cqm_rmid_entry *entry;
 
-	entry = cqm_rmid_ptrs[rmid];
+	entry = &cqm_pkgs_data[domain]->cqm_rmid_ptrs[rmid];
 	WARN_ON(entry->rmid != rmid);
 
 	return entry;
@@ -196,91 +160,56 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
  *
  * We expect to be called with cache_mutex held.
  */
-static u32 __get_rmid(void)
+static u32 __get_rmid(int domain)
 {
+	struct list_head *cqm_flist;
 	struct cqm_rmid_entry *entry;
 
-	lockdep_assert_held(&cache_mutex);
+	lockdep_assert_held(&cache_lock);
 
-	if (list_empty(&cqm_rmid_free_lru))
+	cqm_flist = &cqm_pkgs_data[domain]->cqm_rmid_free_lru;
+
+	if (list_empty(cqm_flist))
 		return INVALID_RMID;
 
-	entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
+	entry = list_first_entry(cqm_flist, struct cqm_rmid_entry, list);
 	list_del(&entry->list);
 
 	return entry->rmid;
 }
 
-static void __put_rmid(u32 rmid)
+static void __put_rmid(u32 rmid, int domain)
 {
 	struct cqm_rmid_entry *entry;
 
-	lockdep_assert_held(&cache_mutex);
+	lockdep_assert_held(&cache_lock);
 
-	WARN_ON(!__rmid_valid(rmid));
-	entry = __rmid_entry(rmid);
+	WARN_ON(!rmid);
+	entry = __rmid_entry(rmid, domain);
 
 	entry->queue_time = jiffies;
 	entry->state = RMID_DIRTY;
 
-	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
+	list_add_tail(&entry->list, &cqm_pkgs_data[domain]->cqm_rmid_limbo_lru);
 }
 
 static void cqm_cleanup(void)
 {
 	int i;
 
-	if (!cqm_rmid_ptrs)
+	if (!cqm_pkgs_data)
 		return;
 
-	for (i = 0; i < cqm_max_rmid; i++)
-		kfree(cqm_rmid_ptrs[i]);
-
-	kfree(cqm_rmid_ptrs);
-	cqm_rmid_ptrs = NULL;
-	cqm_enabled = false;
-}
-
-static int intel_cqm_setup_rmid_cache(void)
-{
-	struct cqm_rmid_entry *entry;
-	unsigned int nr_rmids;
-	int r = 0;
-
-	nr_rmids = cqm_max_rmid + 1;
-	cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
-				nr_rmids, GFP_KERNEL);
-	if (!cqm_rmid_ptrs)
-		return -ENOMEM;
-
-	for (; r <= cqm_max_rmid; r++) {
-		struct cqm_rmid_entry *entry;
-
-		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
-		if (!entry)
-			goto fail;
-
-		INIT_LIST_HEAD(&entry->list);
-		entry->rmid = r;
-		cqm_rmid_ptrs[r] = entry;
-
-		list_add_tail(&entry->list, &cqm_rmid_free_lru);
+	for (i = 0; i < cqm_socket_max; i++) {
+		if (cqm_pkgs_data[i]) {
+			kfree(cqm_pkgs_data[i]->cqm_rmid_ptrs);
+			kfree(cqm_pkgs_data[i]);
+		}
 	}
-
-	/*
-	 * RMID 0 is special and is always allocated. It's used for all
-	 * tasks that are not monitored.
-	 */
-	entry = __rmid_entry(0);
-	list_del(&entry->list);
-
-	return 0;
-
-fail:
-	cqm_cleanup();
-	return -ENOMEM;
+	kfree(cqm_pkgs_data);
 }
 
+
 /*
  * Determine if @a and @b measure the same set of tasks.
  *
@@ -333,13 +262,13 @@ static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
 #endif
 
 struct rmid_read {
-	u32 rmid;
+	u32 *rmid;
 	u32 evt_type;
 	atomic64_t value;
 };
 
 static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
+static void init_mbm_sample(u32 *rmid, u32 evt_type);
 static void __intel_mbm_event_count(void *info);
 
 static bool is_cqm_event(int e)
@@ -420,10 +349,11 @@ static void __intel_mbm_event_init(void *info)
 {
 	struct rmid_read *rr = info;
 
-	update_sample(rr->rmid, rr->evt_type, 1);
+	if (__rmid_valid(rr->rmid[pkg_id]))
+		update_sample(rr->rmid[pkg_id], rr->evt_type, 1);
 }
 
-static void init_mbm_sample(u32 rmid, u32 evt_type)
+static void init_mbm_sample(u32 *rmid, u32 evt_type)
 {
 	struct rmid_read rr = {
 		.rmid = rmid,
@@ -444,7 +374,7 @@ static int intel_cqm_setup_event(struct perf_event *event,
 				  struct perf_event **group)
 {
 	struct perf_event *iter;
-	u32 rmid;
+	u32 *rmid, sizet;
 
 	event->hw.is_group_event = false;
 	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
@@ -454,24 +384,20 @@ static int intel_cqm_setup_event(struct perf_event *event,
 			/* All tasks in a group share an RMID */
 			event->hw.cqm_rmid = rmid;
 			*group = iter;
-			if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
+			if (is_mbm_event(event->attr.config))
 				init_mbm_sample(rmid, event->attr.config);
 			return 0;
 		}
-
-	}
-
-	rmid = __get_rmid();
-
-	if (!__rmid_valid(rmid)) {
-		pr_info("out of RMIDs\n");
-		return -EINVAL;
 	}
 
-	if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
-		init_mbm_sample(rmid, event->attr.config);
-
-	event->hw.cqm_rmid = rmid;
+	/*
+	 * RMIDs are allocated in LAZY mode by default only when
+	 * tasks monitored are scheduled in.
+	 */
+	sizet = sizeof(u32) * cqm_socket_max;
+	event->hw.cqm_rmid = kzalloc(sizet, GFP_KERNEL);
+	if (!event->hw.cqm_rmid)
+		return -ENOMEM;
 
 	return 0;
 }
@@ -489,7 +415,7 @@ static void intel_cqm_event_read(struct perf_event *event)
 		return;
 
 	raw_spin_lock_irqsave(&cache_lock, flags);
-	rmid = event->hw.cqm_rmid;
+	rmid = event->hw.cqm_rmid[pkg_id];
 
 	if (!__rmid_valid(rmid))
 		goto out;
@@ -515,12 +441,12 @@ static void __intel_cqm_event_count(void *info)
 	struct rmid_read *rr = info;
 	u64 val;
 
-	val = __rmid_read(rr->rmid);
-
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
-
-	atomic64_add(val, &rr->value);
+	if (__rmid_valid(rr->rmid[pkg_id])) {
+		val = __rmid_read(rr->rmid[pkg_id]);
+		if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+			return;
+		atomic64_add(val, &rr->value);
+	}
 }
 
 static inline bool cqm_group_leader(struct perf_event *event)
@@ -533,10 +459,12 @@ static void __intel_mbm_event_count(void *info)
 	struct rmid_read *rr = info;
 	u64 val;
 
-	val = rmid_read_mbm(rr->rmid, rr->evt_type);
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
-	atomic64_add(val, &rr->value);
+	if (__rmid_valid(rr->rmid[pkg_id])) {
+		val = rmid_read_mbm(rr->rmid[pkg_id], rr->evt_type);
+		if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+			return;
+		atomic64_add(val, &rr->value);
+	}
 }
 
 static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
@@ -559,7 +487,7 @@ static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
 	}
 
 	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
-		grp_rmid = iter->hw.cqm_rmid;
+		grp_rmid = iter->hw.cqm_rmid[pkg_id];
 		if (!__rmid_valid(grp_rmid))
 			continue;
 		if (is_mbm_event(iter->attr.config))
@@ -572,7 +500,7 @@ static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
 			if (!iter1->hw.is_group_event)
 				break;
 			if (is_mbm_event(iter1->attr.config))
-				update_sample(iter1->hw.cqm_rmid,
+				update_sample(iter1->hw.cqm_rmid[pkg_id],
 					      iter1->attr.config, 0);
 		}
 	}
@@ -610,7 +538,7 @@ static void mbm_hrtimer_init(void)
 	struct hrtimer *hr;
 	int i;
 
-	for (i = 0; i < mbm_socket_max; i++) {
+	for (i = 0; i < cqm_socket_max; i++) {
 		hr = &mbm_timers[i];
 		hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 		hr->function = mbm_hrtimer_handle;
@@ -667,16 +595,39 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 	return __perf_event_count(event);
 }
 
+void alloc_needed_pkg_rmid(u32 *cqm_rmid)
+{
+	unsigned long flags;
+	u32 rmid;
+
+	if (WARN_ON(!cqm_rmid))
+		return;
+
+	if (cqm_rmid[pkg_id])
+		return;
+
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
+	rmid = __get_rmid(pkg_id);
+	if (__rmid_valid(rmid))
+		cqm_rmid[pkg_id] = rmid;
+
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+}
+
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 rmid = event->hw.cqm_rmid;
+	u32 rmid;
 
 	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
 		return;
 
 	event->hw.cqm_state &= ~PERF_HES_STOPPED;
 
+	alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+
+	rmid = event->hw.cqm_rmid[pkg_id];
 	state->rmid = rmid;
 	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
 }
@@ -691,22 +642,27 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
 {
-	unsigned long flags;
-	u32 rmid;
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
 	event->hw.cqm_state = PERF_HES_STOPPED;
-	rmid = event->hw.cqm_rmid;
 
-	if (__rmid_valid(rmid) && (mode & PERF_EF_START))
+	if ((mode & PERF_EF_START))
 		intel_cqm_event_start(event, mode);
 
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
 	return 0;
 }
 
+static inline void
+	cqm_event_free_rmid(struct perf_event *event)
+{
+	u32 *rmid = event->hw.cqm_rmid;
+	int d;
+
+	for (d = 0; d < cqm_socket_max; d++) {
+		if (__rmid_valid(rmid[d]))
+			__put_rmid(rmid[d], d);
+	}
+	kfree(event->hw.cqm_rmid);
+	list_del(&event->hw.cqm_groups_entry);
+}
 static void intel_cqm_event_destroy(struct perf_event *event)
 {
 	struct perf_event *group_other = NULL;
@@ -737,16 +693,11 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 		 * If there was a group_other, make that leader, otherwise
 		 * destroy the group and return the RMID.
 		 */
-		if (group_other) {
+		if (group_other)
 			list_replace(&event->hw.cqm_groups_entry,
 				     &group_other->hw.cqm_groups_entry);
-		} else {
-			u32 rmid = event->hw.cqm_rmid;
-
-			if (__rmid_valid(rmid))
-				__put_rmid(rmid);
-			list_del(&event->hw.cqm_groups_entry);
-		}
+		else
+			cqm_event_free_rmid(event);
 	}
 
 	raw_spin_unlock_irqrestore(&cache_lock, flags);
@@ -794,7 +745,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 
 	mutex_lock(&cache_mutex);
 
-	/* Will also set rmid, return error on RMID not being available*/
+	/* Delay allocating RMIDs */
 	if (intel_cqm_setup_event(event, &group)) {
 		ret = -EINVAL;
 		goto out;
@@ -1036,12 +987,95 @@ static void mbm_cleanup(void)
 	{}
 };
 
+static int pkg_data_init_cpu(int cpu)
+{
+	struct cqm_rmid_entry *ccqm_rmid_ptrs = NULL, *entry = NULL;
+	int curr_pkgid = topology_physical_package_id(cpu);
+	struct pkg_data *pkg_data = NULL;
+	int i = 0, nr_rmids, ret = 0;
+
+	if (cqm_pkgs_data[curr_pkgid])
+		return 0;
+
+	pkg_data = kzalloc_node(sizeof(struct pkg_data),
+				GFP_KERNEL, cpu_to_node(cpu));
+	if (!pkg_data)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&pkg_data->cqm_rmid_free_lru);
+	INIT_LIST_HEAD(&pkg_data->cqm_rmid_limbo_lru);
+
+	mutex_init(&pkg_data->pkg_data_mutex);
+	raw_spin_lock_init(&pkg_data->pkg_data_lock);
+
+	pkg_data->rmid_work_cpu = cpu;
+
+	nr_rmids = cqm_max_rmid + 1;
+	ccqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry) *
+			   nr_rmids, GFP_KERNEL);
+	if (!ccqm_rmid_ptrs) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	for (; i <= cqm_max_rmid; i++) {
+		entry = &ccqm_rmid_ptrs[i];
+		INIT_LIST_HEAD(&entry->list);
+		entry->rmid = i;
+
+		list_add_tail(&entry->list, &pkg_data->cqm_rmid_free_lru);
+	}
+
+	pkg_data->cqm_rmid_ptrs = ccqm_rmid_ptrs;
+	cqm_pkgs_data[curr_pkgid] = pkg_data;
+
+	/*
+	 * RMID 0 is special and is always allocated. It's used for all
+	 * tasks that are not monitored.
+	 */
+	entry = __rmid_entry(0, curr_pkgid);
+	list_del(&entry->list);
+
+	return 0;
+fail:
+	kfree(ccqm_rmid_ptrs);
+	ccqm_rmid_ptrs = NULL;
+	kfree(pkg_data);
+	pkg_data = NULL;
+	cqm_pkgs_data[curr_pkgid] = NULL;
+	return ret;
+}
+
+static int cqm_init_pkgs_data(void)
+{
+	int i, cpu, ret = 0;
+
+	cqm_pkgs_data = kzalloc(
+		sizeof(struct pkg_data *) * cqm_socket_max,
+		GFP_KERNEL);
+	if (!cqm_pkgs_data)
+		return -ENOMEM;
+
+	for (i = 0; i < cqm_socket_max; i++)
+		cqm_pkgs_data[i] = NULL;
+
+	for_each_online_cpu(cpu) {
+		ret = pkg_data_init_cpu(cpu);
+		if (ret)
+			goto fail;
+	}
+
+	return 0;
+fail:
+	cqm_cleanup();
+	return ret;
+}
+
 static int intel_mbm_init(void)
 {
 	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
 
-	mbm_socket_max = topology_max_packages();
-	array_size = sizeof(struct sample) * maxid * mbm_socket_max;
+	array_size = sizeof(struct sample) * maxid * cqm_socket_max;
 	mbm_local = kmalloc(array_size, GFP_KERNEL);
 	if (!mbm_local)
 		return -ENOMEM;
@@ -1052,7 +1086,7 @@ static int intel_mbm_init(void)
 		goto out;
 	}
 
-	array_size = sizeof(struct hrtimer) * mbm_socket_max;
+	array_size = sizeof(struct hrtimer) * cqm_socket_max;
 	mbm_timers = kmalloc(array_size, GFP_KERNEL);
 	if (!mbm_timers) {
 		ret = -ENOMEM;
@@ -1128,7 +1162,8 @@ static int __init intel_cqm_init(void)
 
 	event_attr_intel_cqm_llc_scale.event_str = str;
 
-	ret = intel_cqm_setup_rmid_cache();
+	cqm_socket_max = topology_max_packages();
+	ret = cqm_init_pkgs_data();
 	if (ret)
 		goto out;
 
@@ -1171,6 +1206,7 @@ static int __init intel_cqm_init(void)
 	if (ret) {
 		kfree(str);
 		cqm_cleanup();
+		cqm_enabled = false;
 		mbm_cleanup();
 	}
 
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
new file mode 100644
index 0000000..4415497
--- /dev/null
+++ b/arch/x86/events/intel/cqm.h
@@ -0,0 +1,37 @@
+#ifndef _ASM_X86_CQM_H
+#define _ASM_X86_CQM_H
+
+#ifdef CONFIG_INTEL_RDT_M
+
+#include <linux/perf_event.h>
+
+/**
+ * struct pkg_data - cqm per package(socket) meta data
+ * @cqm_rmid_free_lru    A least recently used list of free RMIDs.
+ *     These RMIDs are guaranteed to have an occupancy less than the
+ *     threshold occupancy.
+ * @cqm_rmid_limbo_lru   List of currently unused but (potentially)
+ *     dirty RMIDs.
+ *     This list contains RMIDs that no one is currently using but that
+ *     may have an occupancy value > __intel_cqm_threshold. The user can
+ *     change the threshold occupancy value.
+ * @cqm_rmid_ptrs        The array of RMID entries backing the limbo and free lists.
+ * @intel_cqm_rmid_work  Work to reuse the RMIDs that have been freed.
+ * @rmid_work_cpu        The cpu on the package on which the work is scheduled.
+ */
+struct pkg_data {
+	struct list_head	cqm_rmid_free_lru;
+	struct list_head	cqm_rmid_limbo_lru;
+
+	struct cqm_rmid_entry	*cqm_rmid_ptrs;
+
+	struct mutex		pkg_data_mutex;
+	raw_spinlock_t		pkg_data_lock;
+
+	struct delayed_work	intel_cqm_rmid_work;
+	atomic_t		reuse_scheduled;
+
+	int			rmid_work_cpu;
+};
+#endif
+#endif
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecd..a8f4749 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -141,7 +141,7 @@ struct hw_perf_event {
 		};
 		struct { /* intel_cqm */
 			int			cqm_state;
-			u32			cqm_rmid;
+			u32			*cqm_rmid;
 			int			is_group_event;
 			struct list_head	cqm_events_entry;
 			struct list_head	cqm_groups_entry;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (3 preceding siblings ...)
  2017-01-06 21:59 ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support Vikas Shivappa
@ 2017-01-06 21:59 ` Vikas Shivappa
  2017-01-17 12:11   ` Thomas Gleixner
                     ` (2 more replies)
  2017-01-06 21:59 ` [PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support Vikas Shivappa
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

From: David Carrillo-Cisneros <davidcc@google.com>

Cgroup hierarchy monitoring is not currently supported. This patch
builds the necessary data structures, the cgroup APIs (alloc, free,
attach, can_attach) and the quirks required to support cgroup hierarchy
monitoring in later patches.

- Introduce an architecture-specific data structure, arch_info, in
perf_cgroup to keep track of RMIDs and cgroup hierarchical monitoring.
- perf sched_in calls all the cgroup ancestors when a cgroup is
scheduled in. This does not work for cqm because one task has a single
per-package RMID, so we cannot write a different RMID into the MSR for
each event. The cqm driver therefore sets the
PERF_EV_CAP_CGROUP_NO_RECURSION flag, which tells perf not to call all
ancestor cgroups for each event and lets the driver handle hierarchy
monitoring for cgroups itself.
- Introduce event_terminate, since event_destroy is called after the
cgroup is disassociated from the event; the new callback lets cqm clean
up its cgroup-specific arch_info while the cgroup is still attached.
- Add the cgroup APIs for alloc, free, attach and can_attach.

The above framework will be used to build different cgroup features in
later patches.
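
As a rough illustration of the hook pattern used here (generic code
calls an arch hook that defaults to a no-op unless the architecture
defines it), below is a minimal user-space C sketch. The names
arch_cgroup_css_alloc and cgroup_css_alloc are invented for the example
and are not the kernel's symbols.

#include <stdio.h>

/* An architecture that wants the hook defines the macro before this point. */
#ifndef arch_cgroup_css_alloc
/* default: consume the argument, do nothing, report success */
# define arch_cgroup_css_alloc(parent)	((void)(parent), 0)
#endif

static int cgroup_css_alloc(int parent)
{
	/* Generic code calls the hook unconditionally; the default costs nothing. */
	return arch_cgroup_css_alloc(parent);
}

int main(void)
{
	printf("arch hook returned %d\n", cgroup_css_alloc(0));
	return 0;
}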

Tests: Same as before. Cgroup monitoring still does not work, but this
patch lays the groundwork to get it working.

Patch modified/refactored by Vikas Shivappa
<vikas.shivappa@linux.intel.com> to support recycling removal.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c       | 19 ++++++++++++++++++-
 arch/x86/include/asm/perf_event.h | 27 +++++++++++++++++++++++++++
 include/linux/perf_event.h        | 32 ++++++++++++++++++++++++++++++++
 kernel/events/core.c              | 28 +++++++++++++++++++++++++++-
 4 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 68fd1da..a9bd7bd 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
 	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
 	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
 
-	event->destroy = intel_cqm_event_destroy;
+	/*
+	 * The CQM driver handles cgroup recursion itself. Since only one
+	 * RMID can be programmed at a time on each core,
+	 * it is incompatible with the way generic code handles
+	 * cgroup hierarchies.
+	 */
+	event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
 
 	mutex_lock(&cache_mutex);
 
@@ -918,6 +924,17 @@ static int intel_cqm_event_init(struct perf_event *event)
 	.read		     = intel_cqm_event_read,
 	.count		     = intel_cqm_event_count,
 };
+#ifdef CONFIG_CGROUP_PERF
+int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
+				      struct cgroup_subsys_state *new_css)
+{}
+void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
+{}
+void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
+{}
+int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
+{}
+#endif
 
 static inline void cqm_pick_event_reader(int cpu)
 {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f353061..f38c7f0 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -299,4 +299,31 @@ static inline void perf_check_microcode(void) { }
 
 #define arch_perf_out_copy_user copy_from_user_nmi
 
+/*
+ * Hooks for architecture specific features of perf_event cgroup.
+ * Currently used by Intel's CQM.
+ */
+#ifdef CONFIG_INTEL_RDT_M
+#ifdef CONFIG_CGROUP_PERF
+
+#define perf_cgroup_arch_css_alloc	perf_cgroup_arch_css_alloc
+
+int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
+				      struct cgroup_subsys_state *new_css);
+
+#define perf_cgroup_arch_css_free	perf_cgroup_arch_css_free
+
+void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_attach		perf_cgroup_arch_attach
+
+void perf_cgroup_arch_attach(struct cgroup_taskset *tset);
+
+#define perf_cgroup_arch_can_attach	perf_cgroup_arch_can_attach
+
+int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset);
+
+#endif
+
+#endif
 #endif /* _ASM_X86_PERF_EVENT_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a8f4749..410642a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -300,6 +300,12 @@ struct pmu {
 	int (*event_init)		(struct perf_event *event);
 
 	/*
+	 * Terminate the event for this PMU. Optional complement for a
+	 * successful event_init. Called before the event fields are torn down.
+	 */
+	void (*event_terminate)		(struct perf_event *event);
+
+	/*
 	 * Notification that the event was mapped or unmapped.  Called
 	 * in the context of the mapping task.
 	 */
@@ -516,9 +522,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  * PERF_EV_CAP_SOFTWARE: Is a software event.
  * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
  * from any CPU in the package where it is active.
+ * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
+ * cgroup scoping. It does not need to be enabled for all of its descendant
+ * cgroups.
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(2)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
@@ -823,6 +833,8 @@ struct perf_cgroup_info {
 };
 
 struct perf_cgroup {
+	/* Architecture specific information. */
+	void				 *arch_info;
 	struct cgroup_subsys_state	css;
 	struct perf_cgroup_info	__percpu *info;
 };
@@ -844,6 +856,7 @@ struct perf_cgroup {
 
 #ifdef CONFIG_PERF_EVENTS
 
+extern int is_cgroup_event(struct perf_event *event);
 extern void *perf_aux_output_begin(struct perf_output_handle *handle,
 				   struct perf_event *event);
 extern void perf_aux_output_end(struct perf_output_handle *handle,
@@ -1387,4 +1400,23 @@ ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
 #define perf_event_exit_cpu	NULL
 #endif
 
+/*
+ * Hooks for architecture specific extensions for perf_cgroup.
+ */
+#ifndef perf_cgroup_arch_css_alloc
+#define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_free
+#define perf_cgroup_arch_css_free(css) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_attach
+#define perf_cgroup_arch_attach(tskset) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_can_attach
+#define perf_cgroup_arch_can_attach(tskset) 0
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ab15509..229f611 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -590,6 +590,9 @@ static inline u64 perf_event_clock(struct perf_event *event)
 	if (!cpuctx->cgrp)
 		return false;
 
+	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
+		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
+
 	/*
 	 * Cgroup scoping is recursive.  An event enabled for a cgroup is
 	 * also enabled for all its descendant cgroups.  If @cpuctx's
@@ -606,7 +609,7 @@ static inline void perf_detach_cgroup(struct perf_event *event)
 	event->cgrp = NULL;
 }
 
-static inline int is_cgroup_event(struct perf_event *event)
+int is_cgroup_event(struct perf_event *event)
 {
 	return event->cgrp != NULL;
 }
@@ -4019,6 +4022,9 @@ static void _free_event(struct perf_event *event)
 		mutex_unlock(&event->mmap_mutex);
 	}
 
+	if (event->pmu->event_terminate)
+		event->pmu->event_terminate(event);
+
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
 
@@ -9246,6 +9252,8 @@ static void account_event(struct perf_event *event)
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (event->pmu->event_terminate)
+		event->pmu->event_terminate(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
@@ -10748,6 +10756,7 @@ static int __init perf_event_sysfs_init(void)
 perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct perf_cgroup *jc;
+	int ret;
 
 	jc = kzalloc(sizeof(*jc), GFP_KERNEL);
 	if (!jc)
@@ -10759,6 +10768,12 @@ static int __init perf_event_sysfs_init(void)
 		return ERR_PTR(-ENOMEM);
 	}
 
+	jc->arch_info = NULL;
+
+	ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
+	if (ret)
+		return ERR_PTR(ret);
+
 	return &jc->css;
 }
 
@@ -10766,6 +10781,8 @@ static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
 
+	perf_cgroup_arch_css_free(css);
+
 	free_percpu(jc->info);
 	kfree(jc);
 }
@@ -10786,11 +10803,20 @@ static void perf_cgroup_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_for_each(task, css, tset)
 		task_function_call(task, __perf_cgroup_move, task);
+
+	perf_cgroup_arch_attach(tset);
+}
+
+static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+	return perf_cgroup_arch_can_attach(tset);
 }
 
+
 struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
 	.css_free	= perf_cgroup_css_free,
+	.can_attach	= perf_cgroup_can_attach,
 	.attach		= perf_cgroup_attach,
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (4 preceding siblings ...)
  2017-01-06 21:59 ` [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare Vikas Shivappa
@ 2017-01-06 21:59 ` Vikas Shivappa
  2017-01-17 14:07   ` Thomas Gleixner
  2017-01-06 22:00 ` [PATCH 07/12] x86/rdt,cqm: Scheduling support update Vikas Shivappa
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 21:59 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

From: David Carrillo-Cisneros <davidcc@google.com>

This patch adds support for monitoring a cgroup hierarchy. The
arch_info that was introduced in perf_cgroup is used to maintain the
cgroup-related RMID and hierarchy information.

Since cgroups support hierarchical monitoring, a cgroup is always
monitoring for some ancestor. By default the root is always monitored
with RMID 0, so when any cgroup is first created it reports its data to
the root. mfa, or 'monitor for ancestor', keeps track of this
information: which ancestor the cgroup is actually monitoring for,
i.e. which ancestor it has to report its data to.

By default, every cgroup's mfa points to the root.
1. event init: Whenever a new cgroup x starts to be monitored, the mfa
of each of x's descendants is pointed at cgroup x.
2. switch_to: The task finds the cgroup it is associated with; if that
cgroup is itself being monitored it uses its own RMID (a), otherwise it
uses the RMID of its mfa (b).
3. read: During the read call, cgroup x adds the counts of those
descendants that had cgroup x as their mfa and were themselves
monitored (to account for scenario (a) in switch_to).

Locking: cgroup traversal uses rcu_read_lock. cgroup->arch_info is
protected by a mutex (cache_mutex) held in css_alloc, css_free, event
terminate and init.
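
As an illustration of steps 1 and 2 above, here is a minimal user-space
C model of the mfa bookkeeping. It is only a sketch, not the driver
code; the struct layout and the RMID values are invented for the
example, and per-package RMIDs are modeled as plain ints.

#include <stdio.h>

struct cg {
	int level;          /* root is level 0 */
	int mon_enabled;    /* monitored itself? */
	int rmid;           /* per-package RMID, modeled as a plain int */
	struct cg *mfa;     /* nearest monitored ancestor (NULL for root) */
};

static int sched_in_rmid(struct cg *c)
{
	return c->mon_enabled ? c->rmid : c->mfa->rmid;
}

int main(void)
{
	struct cg root = { 0, 1, 0, NULL };  /* root always monitored, RMID 0 */
	struct cg x    = { 1, 0, 0, &root };
	struct cg y    = { 2, 0, 0, &root }; /* descendant of x, reports to root */

	/* step 1: cgroup x starts monitoring with RMID 5 */
	x.mon_enabled = 1;
	x.rmid = 5;
	if (y.mfa->level < x.level)          /* y's mfa is shallower than x */
		y.mfa = &x;

	/* step 2: a task in y schedules in and uses x's RMID */
	printf("task in y schedules in with RMID %d\n", sched_in_rmid(&y));
	return 0;
}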

Tests: Cgroup monitoring should work. Monitoring multiple cgroups in
the same hierarchy works. Monitoring a cgroup and a task within the
same cgroup does not work yet.

Patch modified/refactored by Vikas Shivappa
<vikas.shivappa@linux.intel.com> to support recycling removal.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c             | 227 +++++++++++++++++++++++++++-----
 arch/x86/include/asm/intel_rdt_common.h |  64 +++++++++
 2 files changed, 257 insertions(+), 34 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index a9bd7bd..c6479ae 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -85,6 +85,7 @@ struct sample {
 static cpumask_t cqm_cpumask;
 
 struct pkg_data **cqm_pkgs_data;
+struct cgrp_cqm_info cqm_rootcginfo;
 
 #define RMID_VAL_ERROR		(1ULL << 63)
 #define RMID_VAL_UNAVAIL	(1ULL << 62)
@@ -193,6 +194,11 @@ static void __put_rmid(u32 rmid, int domain)
 	list_add_tail(&entry->list, &cqm_pkgs_data[domain]->cqm_rmid_limbo_lru);
 }
 
+static bool is_task_event(struct perf_event *e)
+{
+	return (e->attach_state & PERF_ATTACH_TASK);
+}
+
 static void cqm_cleanup(void)
 {
 	int i;
@@ -209,7 +215,6 @@ static void cqm_cleanup(void)
 	kfree(cqm_pkgs_data);
 }
 
-
 /*
  * Determine if @a and @b measure the same set of tasks.
  *
@@ -224,20 +229,18 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
 		return false;
 
 #ifdef CONFIG_CGROUP_PERF
-	if (a->cgrp != b->cgrp)
-		return false;
-#endif
-
-	/* If not task event, we're machine wide */
-	if (!(b->attach_state & PERF_ATTACH_TASK))
+	if ((is_cgroup_event(a) && is_cgroup_event(b)) &&
+		(a->cgrp == b->cgrp))
 		return true;
+#endif
 
 	/*
 	 * Events that target same task are placed into the same cache group.
 	 * Mark it as a multi event group, so that we update ->count
 	 * for every event rather than just the group leader later.
 	 */
-	if (a->hw.target == b->hw.target) {
+	if ((is_task_event(a) && is_task_event(b)) &&
+		(a->hw.target == b->hw.target)) {
 		b->hw.is_group_event = true;
 		return true;
 	}
@@ -365,6 +368,63 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
 	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
 }
 
+static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
+{
+	if (rmid != NULL) {
+		cqm_info->mon_enabled = true;
+		cqm_info->rmid = rmid;
+	} else {
+		cqm_info->mon_enabled = false;
+		cqm_info->rmid = NULL;
+	}
+}
+
+static void cqm_assign_hier_rmid(struct cgroup_subsys_state *rcss, u32 *rmid)
+{
+	struct cgrp_cqm_info *ccqm_info, *rcqm_info;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+
+	rcqm_info = css_to_cqm_info(rcss);
+
+	/* Enable or disable monitoring based on rmid.*/
+	cqm_enable_mon(rcqm_info, rmid);
+
+	pos_css = css_next_descendant_pre(rcss, rcss);
+	while (pos_css) {
+		ccqm_info = css_to_cqm_info(pos_css);
+
+		/*
+		 * Monitoring is being enabled.
+		 * Update the descendants to monitor for you, unless
+		 * they were already monitoring for a descendant of yours.
+		 */
+		if (rmid && (rcqm_info->level > ccqm_info->mfa->level))
+			ccqm_info->mfa = rcqm_info;
+
+		/*
+		 * Monitoring is being disabled.
+		 * Update the descendants who were monitoring for you
+		 * to monitor for the ancestor you were monitoring.
+		 */
+		if (!rmid && (ccqm_info->mfa == rcqm_info))
+			ccqm_info->mfa = rcqm_info->mfa;
+		pos_css = css_next_descendant_pre(pos_css, rcss);
+	}
+	rcu_read_unlock();
+}
+
+static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
+{
+#ifdef CONFIG_CGROUP_PERF
+	if (is_cgroup_event(event)) {
+		cqm_assign_hier_rmid(&event->cgrp->css, rmid);
+	}
+#endif
+	return 0;
+}
+
 /*
  * Find a group and setup RMID.
  *
@@ -402,11 +462,14 @@ static int intel_cqm_setup_event(struct perf_event *event,
 	return 0;
 }
 
+static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr);
+
 static void intel_cqm_event_read(struct perf_event *event)
 {
-	unsigned long flags;
-	u32 rmid;
-	u64 val;
+	struct rmid_read rr = {
+		.evt_type = event->attr.config,
+		.value = ATOMIC64_INIT(0),
+	};
 
 	/*
 	 * Task events are handled by intel_cqm_event_count().
@@ -414,26 +477,9 @@ static void intel_cqm_event_read(struct perf_event *event)
 	if (event->cpu == -1)
 		return;
 
-	raw_spin_lock_irqsave(&cache_lock, flags);
-	rmid = event->hw.cqm_rmid[pkg_id];
-
-	if (!__rmid_valid(rmid))
-		goto out;
-
-	if (is_mbm_event(event->attr.config))
-		val = rmid_read_mbm(rmid, event->attr.config);
-	else
-		val = __rmid_read(rmid);
-
-	/*
-	 * Ignore this reading on error states and do not update the value.
-	 */
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		goto out;
+	rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
 
-	local64_set(&event->count, val);
-out:
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
+	cqm_read_subtree(event, &rr);
 }
 
 static void __intel_cqm_event_count(void *info)
@@ -545,6 +591,55 @@ static void mbm_hrtimer_init(void)
 	}
 }
 
+static void cqm_mask_call_local(struct rmid_read *rr)
+{
+	if (is_mbm_event(rr->evt_type))
+		__intel_mbm_event_count(rr);
+	else
+		__intel_cqm_event_count(rr);
+}
+
+static inline void
+delta_local(struct perf_event *event, struct rmid_read *rr, u32 *rmid)
+{
+	atomic64_set(&rr->value, 0);
+	rr->rmid = ACCESS_ONCE(rmid);
+
+	cqm_mask_call_local(rr);
+	local64_add(atomic64_read(&rr->value), &event->count);
+}
+
+/*
+ * Since cgroups are hierarchical, add the counts of
+ * the descendants that were being monitored as well.
+ */
+static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)
+{
+#ifdef CONFIG_CGROUP_PERF
+
+	struct cgroup_subsys_state *rcss, *pos_css;
+	struct cgrp_cqm_info *ccqm_info;
+
+	cqm_mask_call_local(rr);
+	local64_set(&event->count, atomic64_read(&(rr->value)));
+
+	if (is_task_event(event))
+		return __perf_event_count(event);
+
+	rcu_read_lock();
+	rcss = &event->cgrp->css;
+	css_for_each_descendant_pre(pos_css, rcss) {
+		ccqm_info = (css_to_cqm_info(pos_css));
+
+		/* Add the descendant 'monitored cgroup' counts */
+		if (pos_css != rcss && ccqm_info->mon_enabled)
+			delta_local(event, rr, ccqm_info->rmid);
+	}
+	rcu_read_unlock();
+#endif
+	return __perf_event_count(event);
+}
+
 static u64 intel_cqm_event_count(struct perf_event *event)
 {
 	struct rmid_read rr = {
@@ -603,7 +698,7 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
 	if (WARN_ON(!cqm_rmid))
 		return;
 
-	if (cqm_rmid[pkg_id])
+	if (cqm_rmid == cqm_rootcginfo.rmid || cqm_rmid[pkg_id])
 		return;
 
 	raw_spin_lock_irqsave(&cache_lock, flags);
@@ -661,9 +756,11 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
 			__put_rmid(rmid[d], d);
 	}
 	kfree(event->hw.cqm_rmid);
+	cqm_assign_rmid(event, NULL);
 	list_del(&event->hw.cqm_groups_entry);
 }
-static void intel_cqm_event_destroy(struct perf_event *event)
+
+static void intel_cqm_event_terminate(struct perf_event *event)
 {
 	struct perf_event *group_other = NULL;
 	unsigned long flags;
@@ -917,6 +1014,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 	.attr_groups	     = intel_cqm_attr_groups,
 	.task_ctx_nr	     = perf_sw_context,
 	.event_init	     = intel_cqm_event_init,
+	.event_terminate     = intel_cqm_event_terminate,
 	.add		     = intel_cqm_event_add,
 	.del		     = intel_cqm_event_stop,
 	.start		     = intel_cqm_event_start,
@@ -924,12 +1022,67 @@ static int intel_cqm_event_init(struct perf_event *event)
 	.read		     = intel_cqm_event_read,
 	.count		     = intel_cqm_event_count,
 };
+
 #ifdef CONFIG_CGROUP_PERF
 int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
 				      struct cgroup_subsys_state *new_css)
-{}
+{
+	struct cgrp_cqm_info *cqm_info, *pcqm_info;
+	struct perf_cgroup *new_cgrp;
+
+	if (!parent_css) {
+		cqm_rootcginfo.level = 0;
+
+		cqm_rootcginfo.mon_enabled = true;
+		cqm_rootcginfo.cont_mon = true;
+		cqm_rootcginfo.mfa = NULL;
+		INIT_LIST_HEAD(&cqm_rootcginfo.tskmon_rlist);
+
+		if (new_css) {
+			new_cgrp = css_to_perf_cgroup(new_css);
+			new_cgrp->arch_info = &cqm_rootcginfo;
+		}
+		return 0;
+	}
+
+	mutex_lock(&cache_mutex);
+
+	new_cgrp = css_to_perf_cgroup(new_css);
+
+	cqm_info = kzalloc(sizeof(struct cgrp_cqm_info), GFP_KERNEL);
+	if (!cqm_info) {
+		mutex_unlock(&cache_mutex);
+		return -ENOMEM;
+	}
+
+	pcqm_info = (css_to_cqm_info(parent_css));
+	cqm_info->level = pcqm_info->level + 1;
+	cqm_info->rmid = pcqm_info->rmid;
+
+	cqm_info->cont_mon = false;
+	cqm_info->mon_enabled = false;
+	INIT_LIST_HEAD(&cqm_info->tskmon_rlist);
+	if (!pcqm_info->mfa)
+		cqm_info->mfa = pcqm_info;
+	else
+		cqm_info->mfa = pcqm_info->mfa;
+
+	new_cgrp->arch_info = cqm_info;
+	mutex_unlock(&cache_mutex);
+
+	return 0;
+}
+
 void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
-{}
+{
+	struct perf_cgroup *cgrp = css_to_perf_cgroup(css);
+
+	mutex_lock(&cache_mutex);
+	kfree(cgrp_to_cqm_info(cgrp));
+	cgrp->arch_info = NULL;
+	mutex_unlock(&cache_mutex);
+}
+
 void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
 {}
 int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
@@ -1053,6 +1206,12 @@ static int pkg_data_init_cpu(int cpu)
 	entry = __rmid_entry(0, curr_pkgid);
 	list_del(&entry->list);
 
+	cqm_rootcginfo.rmid = kzalloc(sizeof(u32) * cqm_socket_max, GFP_KERNEL);
+	if (!cqm_rootcginfo.rmid) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
 	return 0;
 fail:
 	kfree(ccqm_rmid_ptrs);
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index b31081b..e11ed5e 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -24,4 +24,68 @@ struct intel_pqr_state {
 
 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 
+/**
+ * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
+ * @cont_mon     Continuous monitoring flag
+ * @mon_enabled  Whether monitoring is enabled
+ * @level        Level in the cgroup tree. Root is level 0.
+ * @rmid        The rmids of the cgroup.
+ * @mfa          'Monitoring for ancestor' points to the cqm_info
+ *  of the ancestor the cgroup is monitoring for. 'Monitoring for ancestor'
+ *  means you will use an ancestor's RMID at sched_in if you are
+ *  not monitoring yourself.
+ *
+ *  Due to the hierarchical nature of cgroups, every cgroup just
+ *  monitors for the 'nearest monitored ancestor' at all times.
+ *  Since the root cgroup is always monitored, all descendants
+ *  at boot time monitor for root and hence every mfa points to root,
+ *  except for root->mfa which is NULL.
+ *  1. RMID setup: When cgroup x starts monitoring:
+ *    for each descendant y, if y's mfa->level < x->level, then
+ *    y->mfa = x. (Where level of root node = 0...)
+ *  2. sched_in: During sched_in for x
+ *    if (x->mon_enabled) choose x->rmid
+ *    else choose x->mfa->rmid.
+ *  3. read: for each descendant of cgroup x
+ *     if (x->monitored) count += rmid_read(x->rmid).
+ *  4. evt_destroy: for each descendant y of x, if (y->mfa == x) then
+ *     y->mfa = x->mfa. Meaning if any descendant was monitoring for x,
+ *     set that descendant to monitor for the cgroup which x was monitoring for.
+ *
+ * @tskmon_rlist List of tasks being monitored in the cgroup
+ *  When a task which belongs to a cgroup x is being monitored, it always uses
+ *  its own task->rmid even if cgroup x is monitored during sched_in.
+ *  To account for the counts of such tasks, cgroup keeps this list
+ *  and parses it during read.
+ *
+ *  Perf handles hierarchy for other events, but because RMIDs are per pkg
+ *  this is handled here.
+ */
+struct cgrp_cqm_info {
+	bool cont_mon;
+	bool mon_enabled;
+	int level;
+	u32 *rmid;
+	struct cgrp_cqm_info *mfa;
+	struct list_head tskmon_rlist;
+};
+
+struct tsk_rmid_entry {
+	u32 *rmid;
+	struct list_head list;
+};
+
+#ifdef CONFIG_CGROUP_PERF
+
+# define css_to_perf_cgroup(css_) container_of(css_, struct perf_cgroup, css)
+# define cgrp_to_cqm_info(cgrp_) ((struct cgrp_cqm_info *)cgrp_->arch_info)
+# define css_to_cqm_info(css_) cgrp_to_cqm_info(css_to_perf_cgroup(css_))
+
+#else
+
+# define css_to_perf_cgroup(css_) NULL
+# define cgrp_to_cqm_info(cgrp_) NULL
+# define css_to_cqm_info(css_) NULL
+
+#endif
 #endif /* _ASM_X86_INTEL_RDT_COMMON_H */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 07/12] x86/rdt,cqm: Scheduling support update
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (5 preceding siblings ...)
  2017-01-06 21:59 ` [PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support Vikas Shivappa
@ 2017-01-06 22:00 ` Vikas Shivappa
  2017-01-17 21:58   ` Thomas Gleixner
  2017-01-06 22:00 ` [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together Vikas Shivappa
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 22:00 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

Introduce a scheduling hook, finish_arch_pre_lock_switch, which is
called just after perf sched_in during a context switch. This hook
handles both the CAT and the cqm sched_in scenarios.
The IA32_PQR_ASSOC MSR is used by both CAT (cache allocation) and cqm,
and this patch integrates the two MSR writes into one. The common
sched_in path checks whether the per-CPU cached RMID or CLOSid differs
from the task's and performs the MSR write only then.

During sched_in the task uses its own RMID if the task is monitored,
otherwise it uses the RMID of the task's cgroup.
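
That combined decision can be pictured with the following user-space
sketch. It is illustrative only: wrmsr_pqr, sched_in and the field
names are stand-ins, not the kernel's; the point is that the (modeled)
PQR register is written only when the cached RMID or CLOSid changes.

#include <stdio.h>

struct pqr_state { unsigned rmid, closid; };

static void wrmsr_pqr(struct pqr_state *hw, unsigned rmid, unsigned closid)
{
	printf("MSR write: rmid=%u closid=%u\n", rmid, closid);
	hw->rmid = rmid;
	hw->closid = closid;
}

static void sched_in(struct pqr_state *hw, unsigned task_rmid,
		     unsigned cgrp_rmid, unsigned closid)
{
	/* monitored task wins, otherwise fall back to the cgroup RMID */
	unsigned rmid = task_rmid ? task_rmid : cgrp_rmid;

	if (rmid != hw->rmid || closid != hw->closid)
		wrmsr_pqr(hw, rmid, closid);	/* one write covers CAT + CQM */
}

int main(void)
{
	struct pqr_state hw = { 0, 0 };

	sched_in(&hw, 0, 3, 1);	/* unmonitored task: cgroup RMID 3, MSR written */
	sched_in(&hw, 0, 3, 1);	/* nothing changed: no MSR write */
	sched_in(&hw, 7, 3, 1);	/* monitored task: RMID 7, MSR written */
	return 0;
}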

This patch is based on David Carrillo-Cisneros's <davidcc@google.com>
patches in the cqm2 series.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c              | 45 ++++++++++-----
 arch/x86/include/asm/intel_pqr_common.h  | 38 +++++++++++++
 arch/x86/include/asm/intel_rdt.h         | 39 -------------
 arch/x86/include/asm/intel_rdt_common.h  | 11 ++++
 arch/x86/include/asm/processor.h         |  4 ++
 arch/x86/kernel/cpu/Makefile             |  1 +
 arch/x86/kernel/cpu/intel_rdt_common.c   | 98 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  4 +-
 arch/x86/kernel/process_32.c             |  4 --
 arch/x86/kernel/process_64.c             |  4 --
 kernel/sched/core.c                      |  1 +
 kernel/sched/sched.h                     |  3 +
 12 files changed, 188 insertions(+), 64 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_pqr_common.h
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index c6479ae..597a184 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -28,13 +28,6 @@
 static bool cqm_enabled, mbm_enabled;
 unsigned int cqm_socket_max;
 
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 static struct hrtimer *mbm_timers;
 /**
  * struct sample - mbm event's (local or total) data
@@ -74,6 +67,8 @@ struct sample {
 static DEFINE_MUTEX(cache_mutex);
 static DEFINE_RAW_SPINLOCK(cache_lock);
 
+DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
+
 /*
  * Groups of events that have the same target(s), one RMID per group.
  */
@@ -108,7 +103,7 @@ struct sample {
  * Likewise, an rmid value of -1 is used to indicate "no rmid currently
  * assigned" and is used as part of the rotation code.
  */
-static inline bool __rmid_valid(u32 rmid)
+bool __rmid_valid(u32 rmid)
 {
 	if (!rmid || rmid > cqm_max_rmid)
 		return false;
@@ -161,7 +156,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
  *
  * We expect to be called with cache_mutex held.
  */
-static u32 __get_rmid(int domain)
+u32 __get_rmid(int domain)
 {
 	struct list_head *cqm_flist;
 	struct cqm_rmid_entry *entry;
@@ -368,6 +363,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
 	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
 }
 
+#ifdef CONFIG_CGROUP_PERF
+struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
+{
+	struct cgrp_cqm_info *ccinfo = NULL;
+	struct perf_cgroup *pcgrp;
+
+	pcgrp = perf_cgroup_from_task(tsk, NULL);
+
+	if (!pcgrp)
+		return NULL;
+	else
+		ccinfo = cgrp_to_cqm_info(pcgrp);
+
+	return ccinfo;
+}
+#endif
+
 static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
 {
 	if (rmid != NULL) {
@@ -713,26 +725,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 rmid;
 
 	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
 		return;
 
 	event->hw.cqm_state &= ~PERF_HES_STOPPED;
 
-	alloc_needed_pkg_rmid(event->hw.cqm_rmid);
-
-	rmid = event->hw.cqm_rmid[pkg_id];
-	state->rmid = rmid;
-	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
+	if (is_task_event(event)) {
+		alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+		state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
+	}
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
 		return;
 
 	event->hw.cqm_state |= PERF_HES_STOPPED;
+	state->next_task_rmid = 0;
 }
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -1366,6 +1379,8 @@ static int __init intel_cqm_init(void)
 	if (mbm_enabled)
 		pr_info("Intel MBM enabled\n");
 
+	static_branch_enable(&cqm_enable_key);
+
 	/*
 	 * Setup the hot cpu notifier once we are sure cqm
 	 * is enabled to avoid notifier leak.
diff --git a/arch/x86/include/asm/intel_pqr_common.h b/arch/x86/include/asm/intel_pqr_common.h
new file mode 100644
index 0000000..8fe9d8e
--- /dev/null
+++ b/arch/x86/include/asm/intel_pqr_common.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_X86_INTEL_PQR_COMMON_H
+#define _ASM_X86_INTEL_PQR_COMMON_H
+
+#ifdef CONFIG_INTEL_RDT
+
+#include <linux/jump_label.h>
+#include <linux/types.h>
+#include <asm/percpu.h>
+#include <asm/msr.h>
+#include <asm/intel_rdt_common.h>
+
+void __intel_rdt_sched_in(void);
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ *   which supports resource control and we enable by mounting the
+ *   resctrl file system.
+ * - Caches the per cpu CLOSid/RMID values and does the MSR write only
+ *   when a task with a different CLOSid/RMID is scheduled in.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+	if (static_branch_likely(&rdt_enable_key) ||
+		static_branch_unlikely(&cqm_enable_key)) {
+		__intel_rdt_sched_in();
+	}
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 95ce5c8..3b4a099 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -5,7 +5,6 @@
 
 #include <linux/kernfs.h>
 #include <linux/jump_label.h>
-
 #include <asm/intel_rdt_common.h>
 
 #define IA32_L3_QOS_CFG		0xc81
@@ -182,43 +181,5 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
 
-/*
- * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
- *
- * Following considerations are made so that this has minimal impact
- * on scheduler hot path:
- * - This will stay as no-op unless we are running on an Intel SKU
- *   which supports resource control and we enable by mounting the
- *   resctrl file system.
- * - Caches the per cpu CLOSid values and does the MSR write only
- *   when a task with a different CLOSid is scheduled in.
- *
- * Must be called with preemption disabled.
- */
-static inline void intel_rdt_sched_in(void)
-{
-	if (static_branch_likely(&rdt_enable_key)) {
-		struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-		int closid;
-
-		/*
-		 * If this task has a closid assigned, use it.
-		 * Else use the closid assigned to this cpu.
-		 */
-		closid = current->closid;
-		if (closid == 0)
-			closid = this_cpu_read(cpu_closid);
-
-		if (closid != state->closid) {
-			state->closid = closid;
-			wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
-		}
-	}
-}
-
-#else
-
-static inline void intel_rdt_sched_in(void) {}
-
 #endif /* CONFIG_INTEL_RDT_A */
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index e11ed5e..544acaa 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -18,12 +18,23 @@
  */
 struct intel_pqr_state {
 	u32			rmid;
+	u32			next_task_rmid;
 	u32			closid;
 	int			rmid_usecnt;
 };
 
 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 
+u32 __get_rmid(int domain);
+bool __rmid_valid(u32 rmid);
+void alloc_needed_pkg_rmid(u32 *cqm_rmid);
+struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
+
+extern struct cgrp_cqm_info cqm_rootcginfo;
+
+DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
 /**
  * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
  * @cont_mon     Continuous monitoring flag
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eaf1005..ec4beed 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@
 #include <asm/nops.h>
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
+#include <asm/intel_pqr_common.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -903,4 +904,7 @@ static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
 
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
+
+#define finish_arch_pre_lock_switch intel_rdt_sched_in
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 5200001..d354e84 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
+obj-$(CONFIG_INTEL_RDT)		+= intel_rdt_common.o
 obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
new file mode 100644
index 0000000..c3c50cd
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -0,0 +1,98 @@
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/cacheinfo.h>
+#include <linux/cpuhotplug.h>
+#include <asm/intel-family.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Both functions which modify the state
+ * (intel_cqm_event_start and intel_cqm_event_stop) are called with
+ * interrupts disabled, which is sufficient for the protection.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#define pkg_id	topology_physical_package_id(smp_processor_id())
+
+#ifdef CONFIG_INTEL_RDT_M
+static inline int get_cgroup_sched_rmid(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+	struct cgrp_cqm_info *ccinfo = NULL;
+
+	ccinfo = cqminfo_from_tsk(current);
+
+	if (!ccinfo)
+		return 0;
+
+	/*
+	 * A cgroup is always monitoring for itself or
+	 * for an ancestor (default is root).
+	 */
+	if (ccinfo->mon_enabled) {
+		alloc_needed_pkg_rmid(ccinfo->rmid);
+		return ccinfo->rmid[pkg_id];
+	} else {
+		alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
+		return ccinfo->mfa->rmid[pkg_id];
+	}
+#endif
+
+	return 0;
+}
+
+static inline int get_sched_in_rmid(void)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	u32 rmid = 0;
+
+	rmid = state->next_task_rmid;
+
+	return rmid ? rmid : get_cgroup_sched_rmid();
+}
+#endif
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ *   which supports resource control and we enable by mounting the
+ *   resctrl file system or it supports resource monitoring.
+ * - Caches the per cpu CLOSid/RMID values and does the MSR write only
+ *   when a task with a different CLOSid/RMID is scheduled in.
+ */
+void __intel_rdt_sched_in(void)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	int closid = 0;
+	u32 rmid = 0;
+
+#ifdef CONFIG_INTEL_RDT_A
+	if (static_branch_likely(&rdt_enable_key)) {
+		/*
+		 * If this task has a closid assigned, use it.
+		 * Else use the closid assigned to this cpu.
+		 */
+		closid = current->closid;
+		if (closid == 0)
+			closid = this_cpu_read(cpu_closid);
+	}
+#endif
+
+#ifdef CONFIG_INTEL_RDT_M
+	if (static_branch_unlikely(&cqm_enable_key))
+		rmid = get_sched_in_rmid();
+#endif
+
+	if (closid != state->closid || rmid != state->rmid) {
+
+		state->closid = closid;
+		state->rmid = rmid;
+		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
+	}
+}
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 8af04af..8b6b429 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -206,7 +206,7 @@ static void rdt_update_cpu_closid(void *closid)
 	 * executing task might have its own closid selected. Just reuse
 	 * the context switch code.
 	 */
-	intel_rdt_sched_in();
+	__intel_rdt_sched_in();
 }
 
 /*
@@ -328,7 +328,7 @@ static void move_myself(struct callback_head *head)
 
 	preempt_disable();
 	/* update PQR_ASSOC MSR to make resource group go into effect */
-	intel_rdt_sched_in();
+	__intel_rdt_sched_in();
 	preempt_enable();
 
 	kfree(callback);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a0ac3e8..d0d7441 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -53,7 +53,6 @@
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
 #include <asm/vm86.h>
-#include <asm/intel_rdt.h>
 
 void __show_regs(struct pt_regs *regs, int all)
 {
@@ -297,8 +296,5 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 
 	this_cpu_write(current_task, next_p);
 
-	/* Load the Intel cache allocation PQR MSR. */
-	intel_rdt_sched_in();
-
 	return prev_p;
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a61e141..a76b65e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,7 +49,6 @@
 #include <asm/switch_to.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/vdso.h>
-#include <asm/intel_rdt.h>
 
 __visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
 
@@ -477,9 +476,6 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp)
 			loadsegment(ss, __KERNEL_DS);
 	}
 
-	/* Load the Intel cache allocation PQR MSR. */
-	intel_rdt_sched_in();
-
 	return prev_p;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c56fb57..bf970ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2767,6 +2767,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	finish_arch_pre_lock_switch();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b34c78..61b47a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1121,6 +1121,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
+#ifndef finish_arch_pre_lock_switch
+# define finish_arch_pre_lock_switch()	do { } while (0)
+#endif
 #ifndef finish_arch_post_lock_switch
 # define finish_arch_post_lock_switch()	do { } while (0)
 #endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (6 preceding siblings ...)
  2017-01-06 22:00 ` [PATCH 07/12] x86/rdt,cqm: Scheduling support update Vikas Shivappa
@ 2017-01-06 22:00 ` Vikas Shivappa
  2017-01-17 16:11   ` Thomas Gleixner
  2017-01-06 22:00 ` [PATCH 09/12] x86/cqm: Add RMID reuse Vikas Shivappa
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 22:00 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

From: Vikas Shivappa <vikas.shivappa@intel.com>

This patch adds support to monitor a cgroup x and a task p1 together
when p1 is part of cgroup x. Since we cannot write two RMIDs during
sched_in, the driver handles this case itself.

This patch introduces a u32 *rmid in the task_struct which keeps track
of the RMIDs associated with the task. There is also a list in the
arch_info of perf_cgroup, called tskmon_rlist, which keeps track of the
tasks in the cgroup that are monitored.

The tskmon_rlist is modified in two scenarios:
- At event_init of a task p1 which is part of a cgroup, add p1 to the
cgroup's tskmon_rlist. At event_destroy delete the task from the list.
- When a task moves from cgroup x to cgroup y, if the task was being
monitored, remove the task from cgroup x's tskmon_rlist and add it to
cgroup y's tskmon_rlist.

sched_in: When the task p1 is scheduled in, we write the task RMID to
the PQR_ASSOC MSR.

read (for task p1): Same as any other cqm task event.

read (for the cgroup x): When counting for the cgroup, the tskmon_rlist
is traversed and the corresponding RMID counts are added.
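
A minimal user-space model of that cgroup read is sketched below. It is
illustrative only: rmid_read, the RMID numbers and the occupancy values
are invented; the idea is that the cgroup's own RMID count is summed
with the counts of the RMIDs on its task monitoring list.

#include <stdio.h>

#define NR_TASKS 2

static unsigned long long rmid_read(int rmid)
{
	/* stand-in for the real per-RMID counter read */
	static const unsigned long long occupancy[] = { 0, 4096, 8192, 1024 };
	return occupancy[rmid];
}

int main(void)
{
	int cgroup_rmid = 1;
	int task_rmids[NR_TASKS] = { 2, 3 };	/* monitored tasks inside the cgroup */
	unsigned long long count = rmid_read(cgroup_rmid);

	for (int i = 0; i < NR_TASKS; i++)
		count += rmid_read(task_rmids[i]);

	printf("cgroup count = %llu\n", count);	/* 4096 + 8192 + 1024 */
	return 0;
}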

Tests: Monitoring a cgroup x and a task within cgroup x should
work.

Patch heavily based on David Carrillo-Cisneros's <davidcc@google.com>
patches from a prior version of the cqm2 series.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c | 137 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/sched.h       |   3 +
 2 files changed, 137 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 597a184..e1765e3 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -363,6 +363,36 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
 	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
 }
 
+static inline int add_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
+{
+	struct tsk_rmid_entry *entry;
+
+	entry = kzalloc(sizeof(struct tsk_rmid_entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&entry->list);
+	entry->rmid = rmid;
+
+	list_add_tail(&entry->list, l);
+
+	return 0;
+}
+
+static inline void del_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
+{
+	struct tsk_rmid_entry *entry = NULL, *tmp1;
+
+	list_for_each_entry_safe(entry, tmp1, l, list) {
+		if (entry->rmid == rmid) {
+
+			list_del(&entry->list);
+			kfree(entry);
+			break;
+		}
+	}
+}
+
 #ifdef CONFIG_CGROUP_PERF
 struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
 {
@@ -380,6 +410,49 @@ struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
 }
 #endif
 
+static inline void
+cgrp_tskmon_update(struct task_struct *tsk, u32 *rmid, bool ena)
+{
+	struct cgrp_cqm_info *ccinfo = NULL;
+
+#ifdef CONFIG_CGROUP_PERF
+	ccinfo = cqminfo_from_tsk(tsk);
+#endif
+	if (!ccinfo)
+		return;
+
+	if (ena)
+		add_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
+	else
+		del_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
+}
+
+static int cqm_assign_task_rmid(struct perf_event *event, u32 *rmid)
+{
+	struct task_struct *tsk;
+	int ret = 0;
+
+	rcu_read_lock();
+	tsk = event->hw.target;
+	if (pid_alive(tsk)) {
+		get_task_struct(tsk);
+
+		if (rmid != NULL)
+			cgrp_tskmon_update(tsk, rmid, true);
+		else
+			cgrp_tskmon_update(tsk, tsk->rmid, false);
+
+		tsk->rmid = rmid;
+
+		put_task_struct(tsk);
+	} else {
+		ret = -EINVAL;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
 {
 	if (rmid != NULL) {
@@ -429,8 +502,12 @@ static void cqm_assign_hier_rmid(struct cgroup_subsys_state *rcss, u32 *rmid)
 
 static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
 {
+	if (is_task_event(event)) {
+		if (cqm_assign_task_rmid(event, rmid))
+			return -EINVAL;
+	}
 #ifdef CONFIG_CGROUP_PERF
-	if (is_cgroup_event(event)) {
+	else if (is_cgroup_event(event)) {
 		cqm_assign_hier_rmid(&event->cgrp->css, rmid);
 	}
 #endif
@@ -631,6 +708,8 @@ static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)
 
 	struct cgroup_subsys_state *rcss, *pos_css;
 	struct cgrp_cqm_info *ccqm_info;
+	struct tsk_rmid_entry *entry;
+	struct list_head *l;
 
 	cqm_mask_call_local(rr);
 	local64_set(&event->count, atomic64_read(&(rr->value)));
@@ -646,6 +725,13 @@ static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)
 		/* Add the descendant 'monitored cgroup' counts */
 		if (pos_css != rcss && ccqm_info->mon_enabled)
 			delta_local(event, rr, ccqm_info->rmid);
+
+		/* Add your own and descendant 'monitored task' counts */
+		if (!list_empty(&ccqm_info->tskmon_rlist)) {
+			l = &ccqm_info->tskmon_rlist;
+			list_for_each_entry(entry, l, list)
+				delta_local(event, rr, entry->rmid);
+		}
 	}
 	rcu_read_unlock();
 #endif
@@ -1096,10 +1182,55 @@ void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
 	mutex_unlock(&cache_mutex);
 }
 
+/*
+ * Called while attaching/detaching task to a cgroup.
+ */
+static bool is_task_monitored(struct task_struct *tsk)
+{
+	return (tsk->rmid != NULL);
+}
+
 void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
-{}
+{
+	struct cgroup_subsys_state *new_css;
+	struct cgrp_cqm_info *cqm_info;
+	struct task_struct *task;
+
+	mutex_lock(&cache_mutex);
+
+	cgroup_taskset_for_each(task, new_css, tset) {
+		if (!is_task_monitored(task))
+			continue;
+
+		cqm_info = cqminfo_from_tsk(task);
+		if (cqm_info)
+			add_cgrp_tskmon_entry(task->rmid,
+					     &cqm_info->tskmon_rlist);
+	}
+	mutex_unlock(&cache_mutex);
+}
+
 int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
-{}
+{
+	struct cgroup_subsys_state *new_css;
+	struct cgrp_cqm_info *cqm_info;
+	struct task_struct *task;
+
+	mutex_lock(&cache_mutex);
+	cgroup_taskset_for_each(task, new_css, tset) {
+		if (!is_task_monitored(task))
+			continue;
+		cqm_info = cqminfo_from_tsk(task);
+
+		if (cqm_info)
+			del_cgrp_tskmon_entry(task->rmid,
+					     &cqm_info->tskmon_rlist);
+
+	}
+	mutex_unlock(&cache_mutex);
+
+	return 0;
+}
 #endif
 
 static inline void cqm_pick_event_reader(int cpu)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d19052..8072583 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1824,6 +1824,9 @@ struct task_struct {
 #ifdef CONFIG_INTEL_RDT_A
 	int closid;
 #endif
+#ifdef CONFIG_INTEL_RDT_M
+	u32 *rmid;
+#endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
 #ifdef CONFIG_COMPAT
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 09/12] x86/cqm: Add RMID reuse
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (7 preceding siblings ...)
  2017-01-06 22:00 ` [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together Vikas Shivappa
@ 2017-01-06 22:00 ` Vikas Shivappa
  2017-01-17 16:59   ` Thomas Gleixner
  2017-01-06 22:00 ` [PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg reads Vikas Shivappa
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 22:00 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

When an RMID is freed by an event it cannot be reused immediately, as
the RMID may still have some cache occupancy. Hence when an RMID is
freed it goes onto a limbo list rather than the free list. This patch
adds support to periodically check the occupancy values of such RMIDs
and move them to the free list once their occupancy drops below the
threshold occupancy value. The threshold occupancy value can be
modified by the user according to their requirements.
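
The limbo scan can be sketched in user-space C as follows. This is a
model only, with invented RMID numbers and occupancy values; the real
driver walks per-package limbo lists under cache_lock, which is omitted
here.

#include <stdio.h>

#define NR_LIMBO 3

struct limbo_rmid { int rmid; unsigned long long occupancy; int freed; };

int main(void)
{
	unsigned long long threshold = 2048;	/* user-tunable in the driver */
	struct limbo_rmid limbo[NR_LIMBO] = {
		{ 4, 128,  0 },	/* mostly drained, can be reused */
		{ 5, 8192, 0 },	/* still dirty, stays in limbo */
		{ 6, 0,    0 },
	};

	for (int i = 0; i < NR_LIMBO; i++) {
		if (limbo[i].occupancy < threshold) {
			limbo[i].freed = 1;	/* move from limbo to free list */
			printf("RMID %d moved to free list\n", limbo[i].rmid);
		}
	}
	return 0;
}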

Tests: Before this patch, task monitoring would simply throw an error
once all RMIDs had been used within the lifetime of a system boot.
After this patch, the freed RMIDs can be reused.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c | 107 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index e1765e3..92efe12 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -174,6 +174,13 @@ u32 __get_rmid(int domain)
 	return entry->rmid;
 }
 
+static void cqm_schedule_rmidwork(int domain);
+
+static inline bool is_first_cqmwork(int domain)
+{
+	return (!atomic_cmpxchg(&cqm_pkgs_data[domain]->reuse_scheduled, 0, 1));
+}
+
 static void __put_rmid(u32 rmid, int domain)
 {
 	struct cqm_rmid_entry *entry;
@@ -294,6 +301,93 @@ static void cqm_mask_call(struct rmid_read *rr)
 static unsigned int __intel_cqm_threshold;
 static unsigned int __intel_cqm_max_threshold;
 
+/*
+ * Test whether an RMID has a zero occupancy value on this cpu.
+ */
+static void intel_cqm_stable(void)
+{
+	struct cqm_rmid_entry *entry;
+	struct list_head *llist;
+
+	llist = &cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru;
+	list_for_each_entry(entry, llist, list) {
+
+		if (__rmid_read(entry->rmid) < __intel_cqm_threshold)
+			entry->state = RMID_AVAILABLE;
+	}
+}
+
+static void __intel_cqm_rmid_reuse(void)
+{
+	struct cqm_rmid_entry *entry, *tmp;
+	struct list_head *llist, *flist;
+	struct pkg_data *pdata;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&cache_lock, flags);
+	pdata = cqm_pkgs_data[pkg_id];
+	llist = &pdata->cqm_rmid_limbo_lru;
+	flist = &pdata->cqm_rmid_free_lru;
+
+	if (list_empty(llist))
+		goto end;
+	/*
+	 * Test whether an RMID is free
+	 */
+	intel_cqm_stable();
+
+	list_for_each_entry_safe(entry, tmp, llist, list) {
+
+		if (entry->state == RMID_DIRTY)
+			continue;
+		/*
+		 * Otherwise remove from limbo and place it onto the free list.
+		 */
+		list_del(&entry->list);
+		list_add_tail(&entry->list, flist);
+	}
+
+end:
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+}
+
+static bool reschedule_cqm_work(void)
+{
+	unsigned long flags;
+	bool nwork = false;
+
+	raw_spin_lock_irqsave(&cache_lock, flags);
+
+	if (!list_empty(&cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru))
+		nwork = true;
+	else
+		atomic_set(&cqm_pkgs_data[pkg_id]->reuse_scheduled, 0U);
+
+	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+	return nwork;
+}
+
+static void cqm_schedule_rmidwork(int domain)
+{
+	struct delayed_work *dwork;
+	unsigned long delay;
+
+	dwork = &cqm_pkgs_data[domain]->intel_cqm_rmid_work;
+	delay = msecs_to_jiffies(RMID_DEFAULT_QUEUE_TIME);
+
+	schedule_delayed_work_on(cqm_pkgs_data[domain]->rmid_work_cpu,
+			     dwork, delay);
+}
+
+static void intel_cqm_rmid_reuse(struct work_struct *work)
+{
+	__intel_cqm_rmid_reuse();
+
+	if (reschedule_cqm_work())
+		cqm_schedule_rmidwork(pkg_id);
+}
+
 static struct pmu intel_cqm_pmu;
 
 static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
@@ -548,6 +642,8 @@ static int intel_cqm_setup_event(struct perf_event *event,
 	if (!event->hw.cqm_rmid)
 		return -ENOMEM;
 
+	cqm_assign_rmid(event, event->hw.cqm_rmid);
+
 	return 0;
 }
 
@@ -863,6 +959,7 @@ static void intel_cqm_event_terminate(struct perf_event *event)
 {
 	struct perf_event *group_other = NULL;
 	unsigned long flags;
+	int d;
 
 	mutex_lock(&cache_mutex);
 	/*
@@ -905,6 +1002,13 @@ static void intel_cqm_event_terminate(struct perf_event *event)
 		mbm_stop_timers();
 
 	mutex_unlock(&cache_mutex);
+
+	for (d = 0; d < cqm_socket_max; d++) {
+
+		if (cqm_pkgs_data[d] != NULL && is_first_cqmwork(d)) {
+			cqm_schedule_rmidwork(d);
+		}
+	}
 }
 
 static int intel_cqm_event_init(struct perf_event *event)
@@ -1322,6 +1426,9 @@ static int pkg_data_init_cpu(int cpu)
 	mutex_init(&pkg_data->pkg_data_mutex);
 	raw_spin_lock_init(&pkg_data->pkg_data_lock);
 
+	INIT_DEFERRABLE_WORK(
+		&pkg_data->intel_cqm_rmid_work, intel_cqm_rmid_reuse);
+
 	pkg_data->rmid_work_cpu = cpu;
 
 	nr_rmids = cqm_max_rmid + 1;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg reads.
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (8 preceding siblings ...)
  2017-01-06 22:00 ` [PATCH 09/12] x86/cqm: Add RMID reuse Vikas Shivappa
@ 2017-01-06 22:00 ` Vikas Shivappa
  2017-01-06 22:00 ` [PATCH 11/12] perf/stat: fix bug in handling events in error state Vikas Shivappa
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 22:00 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

For cqm cgroup events, the event can be read even if it was not active
on the CPU on which it is being read. This is because the RMIDs are per
package, so if we read the llc_occupancy value on a CPU x we are really
reading the occupancy of the package that CPU x belongs to.

This patch adds a PERF_INACTIVE_CPU_READ_PKG to indicate this behaviour
of cqm and also changes the perf/core to still call the reads even when
the event is inactive on the cpu for cgroup events. The task events have
event->cpu as -1 and hence it does not apply for task events.

Tests: Before this patch, perf stat -C <cpux> did not return a count to
the perf/core. After this patch the count for the package is returned to
the perf/core. The count is still not visible in the perf user tool - that
is fixed in the next patches.

Patch is based on David Carrillo-Cisneros <davidcc@google.com> patches
in cqm2 series.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/events/intel/cqm.c             | 31 ++++++++++++++++++++++++-------
 arch/x86/include/asm/intel_rdt_common.h |  2 +-
 include/linux/perf_event.h              | 19 ++++++++++++++++---
 kernel/events/core.c                    | 16 ++++++++++++----
 4 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 92efe12..3f5860c 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -111,6 +111,13 @@ bool __rmid_valid(u32 rmid)
 	return true;
 }
 
+static inline bool __rmid_valid_raw(u32 rmid)
+{
+	if (rmid > cqm_max_rmid)
+		return false;
+	return true;
+}
+
 static u64 __rmid_read(u32 rmid)
 {
 	u64 val;
@@ -884,16 +891,16 @@ static u64 intel_cqm_event_count(struct perf_event *event)
 	return __perf_event_count(event);
 }
 
-void alloc_needed_pkg_rmid(u32 *cqm_rmid)
+u32 alloc_needed_pkg_rmid(u32 *cqm_rmid)
 {
 	unsigned long flags;
 	u32 rmid;
 
 	if (WARN_ON(!cqm_rmid))
-		return;
+		return -EINVAL;
 
 	if (cqm_rmid == cqm_rootcginfo.rmid || cqm_rmid[pkg_id])
-		return;
+		return 0;
 
 	raw_spin_lock_irqsave(&cache_lock, flags);
 
@@ -902,6 +909,8 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
 		cqm_rmid[pkg_id] = rmid;
 
 	raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+	return rmid;
 }
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
@@ -913,10 +922,8 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 
 	event->hw.cqm_state &= ~PERF_HES_STOPPED;
 
-	if (is_task_event(event)) {
-		alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+	if (is_task_event(event))
 		state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
-	}
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
@@ -932,10 +939,19 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
 {
+	u32 rmid;
+
 	event->hw.cqm_state = PERF_HES_STOPPED;
 
-	if ((mode & PERF_EF_START))
+	/*
+	 * If Lazy RMID alloc fails indicate the error to the user.
+	*/
+	if ((mode & PERF_EF_START)) {
+		rmid = alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+		if (!__rmid_valid_raw(rmid))
+			return -EINVAL;
 		intel_cqm_event_start(event, mode);
+	}
 
 	return 0;
 }
@@ -1048,6 +1064,7 @@ static int intel_cqm_event_init(struct perf_event *event)
 	 * cgroup hierarchies.
 	 */
 	event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
+	event->event_caps |= PERF_EV_CAP_INACTIVE_CPU_READ_PKG;
 
 	mutex_lock(&cache_mutex);
 
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index 544acaa..fcaaaeb 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -27,7 +27,7 @@ struct intel_pqr_state {
 
 u32 __get_rmid(int domain);
 bool __rmid_valid(u32 rmid);
-void alloc_needed_pkg_rmid(u32 *cqm_rmid);
+u32 alloc_needed_pkg_rmid(u32 *cqm_rmid);
 struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
 
 extern struct cgrp_cqm_info cqm_rootcginfo;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 410642a..adfddec 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -525,10 +525,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
  * cgroup scoping. It does not need to be enabled for all of its descendants
  * cgroups.
+ * PERF_EV_CAP_INACTIVE_CPU_READ_PKG: A cgroup event where we can read
+ * the package count on any cpu on the pkg even if inactive.
  */
-#define PERF_EV_CAP_SOFTWARE		BIT(0)
-#define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
-#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(2)
+#define PERF_EV_CAP_SOFTWARE                    BIT(0)
+#define PERF_EV_CAP_READ_ACTIVE_PKG             BIT(1)
+#define PERF_EV_CAP_CGROUP_NO_RECURSION         BIT(2)
+#define PERF_EV_CAP_INACTIVE_CPU_READ_PKG       BIT(3)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
@@ -722,6 +725,16 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+#ifdef CONFIG_PERF_EVENTS
+static inline bool __perf_can_read_inactive(struct perf_event *event)
+{
+	if ((event->group_caps & PERF_EV_CAP_INACTIVE_CPU_READ_PKG))
+		return true;
+
+	return false;
+}
+#endif
+
 /**
  * struct perf_event_context - event context structure
  *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 229f611..e71ca66 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3443,9 +3443,13 @@ struct perf_read_data {
 
 static int find_cpu_to_read(struct perf_event *event, int local_cpu)
 {
+	bool active = event->state == PERF_EVENT_STATE_ACTIVE;
 	int event_cpu = event->oncpu;
 	u16 local_pkg, event_pkg;
 
+	if (__perf_can_read_inactive(event) && !active)
+		event_cpu = event->cpu;
+
 	if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
 		event_pkg =  topology_physical_package_id(event_cpu);
 		local_pkg =  topology_physical_package_id(local_cpu);
@@ -3467,6 +3471,7 @@ static void __perf_event_read(void *info)
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct pmu *pmu = event->pmu;
+	bool read_inactive = __perf_can_read_inactive(event);
 
 	/*
 	 * If this is a task context, we need to check whether it is
@@ -3475,7 +3480,7 @@ static void __perf_event_read(void *info)
 	 * event->count would have been updated to a recent sample
 	 * when the event was scheduled out.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
+	if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
 		return;
 
 	raw_spin_lock(&ctx->lock);
@@ -3485,7 +3490,7 @@ static void __perf_event_read(void *info)
 	}
 
 	update_event_times(event);
-	if (event->state != PERF_EVENT_STATE_ACTIVE)
+	if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
 		goto unlock;
 
 	if (!data->group) {
@@ -3500,7 +3505,8 @@ static void __perf_event_read(void *info)
 
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		update_event_times(sub);
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+		if (sub->state == PERF_EVENT_STATE_ACTIVE ||
+		    __perf_can_read_inactive(sub)) {
 			/*
 			 * Use sibling's PMU rather than @event's since
 			 * sibling could be on different (eg: software) PMU.
@@ -3578,13 +3584,15 @@ u64 perf_event_read_local(struct perf_event *event)
 
 static int perf_event_read(struct perf_event *event, bool group)
 {
+	bool active = event->state == PERF_EVENT_STATE_ACTIVE;
 	int ret = 0, cpu_to_read, local_cpu;
 
 	/*
 	 * If event is enabled and currently active on a CPU, update the
 	 * value in the event structure:
 	 */
-	if (event->state == PERF_EVENT_STATE_ACTIVE) {
+	if (active || __perf_can_read_inactive(event)) {
+
 		struct perf_read_data data = {
 			.event = event,
 			.group = group,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 11/12] perf/stat: fix bug in handling events in error state
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (9 preceding siblings ...)
  2017-01-06 22:00 ` [PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg reads Vikas Shivappa
@ 2017-01-06 22:00 ` Vikas Shivappa
  2017-01-06 22:00 ` [PATCH 12/12] perf/stat: revamp read error handling, snapshot and per_pkg events Vikas Shivappa
  2017-01-17 17:31 ` [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Thomas Gleixner
  12 siblings, 0 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 22:00 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

From: Stephane Eranian <eranian@google.com>

When an event is in error state, read() returns 0
instead of the size of the buffer. In certain modes, such
as interval printing, ignoring the 0 return value
may cause bogus count deltas to be computed and
thus invalid results printed.

This patch fixes the problem by modifying read_counters()
to mark the event as not scaled (scaled = -1), forcing
the printout routine to show <NOT COUNTED>.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 tools/perf/builtin-stat.c | 12 +++++++++---
 tools/perf/util/evsel.c   |  4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index a02f2e9..9f109b9 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -310,8 +310,12 @@ static int read_counter(struct perf_evsel *counter)
 			struct perf_counts_values *count;
 
 			count = perf_counts(counter->counts, cpu, thread);
-			if (perf_evsel__read(counter, cpu, thread, count))
+			if (perf_evsel__read(counter, cpu, thread, count)) {
+				counter->counts->scaled = -1;
+				perf_counts(counter->counts, cpu, thread)->ena = 0;
+				perf_counts(counter->counts, cpu, thread)->run = 0;
 				return -1;
+			}
 
 			if (STAT_RECORD) {
 				if (perf_evsel__write_stat_event(counter, cpu, thread, count)) {
@@ -336,12 +340,14 @@ static int read_counter(struct perf_evsel *counter)
 static void read_counters(void)
 {
 	struct perf_evsel *counter;
+	int ret;
 
 	evlist__for_each_entry(evsel_list, counter) {
-		if (read_counter(counter))
+		ret = read_counter(counter);
+		if (ret)
 			pr_debug("failed to read counter %s\n", counter->name);
 
-		if (perf_stat_process_counter(&stat_config, counter))
+		if (ret == 0 && perf_stat_process_counter(&stat_config, counter))
 			pr_warning("failed to process counter %s\n", counter->name);
 	}
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 04e536a..7aa10e3 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1232,7 +1232,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
-	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
+	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
 		return -errno;
 
 	return 0;
@@ -1250,7 +1250,7 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 	if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
 		return -ENOMEM;
 
-	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) < 0)
+	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
 		return -errno;
 
 	perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 12/12] perf/stat: revamp read error handling, snapshot and per_pkg events
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (10 preceding siblings ...)
  2017-01-06 22:00 ` [PATCH 11/12] perf/stat: fix bug in handling events in error state Vikas Shivappa
@ 2017-01-06 22:00 ` Vikas Shivappa
  2017-01-17 17:31 ` [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Thomas Gleixner
  12 siblings, 0 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-06 22:00 UTC (permalink / raw)
  To: vikas.shivappa, vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, tglx, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

From: David Carrillo-Cisneros <davidcc@google.com>

A package wide event can return a valid read even if it has not run on a
specific cpu; this does not fit well with the assumption that run == 0
is equivalent to <not counted>.

To fix the problem, this patch defines special error values for val,
run and ena (~0ULL) and uses them to signal read errors, allowing run == 0
to be a valid value for package events. A new value, (NA), is output on
read error and when the event has not been enabled (time enabled == 0).

Finally, this patch revamps the calculation of deltas and scaling for
snapshot events, removing the calculation of deltas for time running and
enabled in snapshot events, as it should be.

Tests: After this patch, user mode can see the package llc_occupancy
count when calling perf stat -C <cpux> -G cgroup_y, and the systemwide
count for the cgroup when run with the -a option.

Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 tools/perf/builtin-stat.c | 36 +++++++++++++++++++++++-----------
 tools/perf/util/counts.h  | 19 ++++++++++++++++++
 tools/perf/util/evsel.c   | 49 ++++++++++++++++++++++++++++++++++++-----------
 tools/perf/util/evsel.h   |  8 ++++++--
 tools/perf/util/stat.c    | 35 +++++++++++----------------------
 5 files changed, 99 insertions(+), 48 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 9f109b9..69dd3a4 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -311,10 +311,8 @@ static int read_counter(struct perf_evsel *counter)
 
 			count = perf_counts(counter->counts, cpu, thread);
 			if (perf_evsel__read(counter, cpu, thread, count)) {
-				counter->counts->scaled = -1;
-				perf_counts(counter->counts, cpu, thread)->ena = 0;
-				perf_counts(counter->counts, cpu, thread)->run = 0;
-				return -1;
+				/* do not write stat for failed reads. */
+				continue;
 			}
 
 			if (STAT_RECORD) {
@@ -725,12 +723,16 @@ static int run_perf_stat(int argc, const char **argv)
 
 static void print_running(u64 run, u64 ena)
 {
+	bool is_na = run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || !ena;
+
 	if (csv_output) {
-		fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
-					csv_sep,
-					run,
-					csv_sep,
-					ena ? 100.0 * run / ena : 100.0);
+		if (is_na)
+			fprintf(stat_config.output, "%sNA%sNA", csv_sep, csv_sep);
+		else
+			fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
+				csv_sep, run, csv_sep, 100.0 * run / ena);
+	} else if (is_na) {
+		fprintf(stat_config.output, "  (NA)");
 	} else if (run != ena) {
 		fprintf(stat_config.output, "  (%.2f%%)", 100.0 * run / ena);
 	}
@@ -1103,7 +1105,7 @@ static void printout(int id, int nr, struct perf_evsel *counter, double uval,
 		if (counter->cgrp)
 			os.nfields++;
 	}
-	if (run == 0 || ena == 0 || counter->counts->scaled == -1) {
+	if (run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || counter->counts->scaled == -1) {
 		if (metric_only) {
 			pm(&os, NULL, "", "", 0);
 			return;
@@ -1209,12 +1211,17 @@ static void print_aggr(char *prefix)
 		id = aggr_map->map[s];
 		first = true;
 		evlist__for_each_entry(evsel_list, counter) {
+			bool all_nan = true;
 			val = ena = run = 0;
 			nr = 0;
 			for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
 				s2 = aggr_get_id(perf_evsel__cpus(counter), cpu);
 				if (s2 != id)
 					continue;
+				/* skip NA reads. */
+				if (perf_counts_values__is_na(perf_counts(counter->counts, cpu, 0)))
+					continue;
+				all_nan = false;
 				val += perf_counts(counter->counts, cpu, 0)->val;
 				ena += perf_counts(counter->counts, cpu, 0)->ena;
 				run += perf_counts(counter->counts, cpu, 0)->run;
@@ -1228,6 +1235,10 @@ static void print_aggr(char *prefix)
 				fprintf(output, "%s", prefix);
 
 			uval = val * counter->scale;
+			if (all_nan) {
+				run = PERF_COUNTS_NA;
+				ena = PERF_COUNTS_NA;
+			}
 			printout(id, nr, counter, uval, prefix, run, ena, 1.0);
 			if (!metric_only)
 				fputc('\n', output);
@@ -1306,7 +1317,10 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 		if (prefix)
 			fprintf(output, "%s", prefix);
 
-		uval = val * counter->scale;
+		if (val != PERF_COUNTS_NA)
+			uval = val * counter->scale;
+		else
+			uval = NAN;
 		printout(cpu, 0, counter, uval, prefix, run, ena, 1.0);
 
 		fputc('\n', output);
diff --git a/tools/perf/util/counts.h b/tools/perf/util/counts.h
index 34d8baa..b65e97a 100644
--- a/tools/perf/util/counts.h
+++ b/tools/perf/util/counts.h
@@ -3,6 +3,9 @@
 
 #include "xyarray.h"
 
+/* Not Available (NA) value. Any operation with a NA equals a NA. */
+#define PERF_COUNTS_NA ((u64)~0ULL)
+
 struct perf_counts_values {
 	union {
 		struct {
@@ -14,6 +17,22 @@ struct perf_counts_values {
 	};
 };
 
+static inline void
+perf_counts_values__make_na(struct perf_counts_values *values)
+{
+	values->val = PERF_COUNTS_NA;
+	values->ena = PERF_COUNTS_NA;
+	values->run = PERF_COUNTS_NA;
+}
+
+static inline bool
+perf_counts_values__is_na(struct perf_counts_values *values)
+{
+	return values->val == PERF_COUNTS_NA ||
+	       values->ena == PERF_COUNTS_NA ||
+	       values->run == PERF_COUNTS_NA;
+}
+
 struct perf_counts {
 	s8			  scaled;
 	struct perf_counts_values aggr;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 7aa10e3..bbcdcd7 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1191,6 +1191,9 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 	if (!evsel->prev_raw_counts)
 		return;
 
+	if (perf_counts_values__is_na(count))
+		return;
+
 	if (cpu == -1) {
 		tmp = evsel->prev_raw_counts->aggr;
 		evsel->prev_raw_counts->aggr = *count;
@@ -1199,26 +1202,43 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 		*perf_counts(evsel->prev_raw_counts, cpu, thread) = *count;
 	}
 
-	count->val = count->val - tmp.val;
+	/* Snapshot events do not calculate deltas for count values. */
+	if (!evsel->snapshot)
+		count->val = count->val - tmp.val;
 	count->ena = count->ena - tmp.ena;
 	count->run = count->run - tmp.run;
 }
 
 void perf_counts_values__scale(struct perf_counts_values *count,
-			       bool scale, s8 *pscaled)
+			       bool scale, bool per_pkg, bool snapshot, s8 *pscaled)
 {
 	s8 scaled = 0;
 
+	if (perf_counts_values__is_na(count)) {
+		if (pscaled)
+			*pscaled = -1;
+		return;
+	}
+
 	if (scale) {
-		if (count->run == 0) {
+		/*
+		 * per-pkg events can have run == 0 in a CPU and still be
+		 * valid.
+		 */
+		if (count->run == 0 && !per_pkg) {
 			scaled = -1;
 			count->val = 0;
 		} else if (count->run < count->ena) {
 			scaled = 1;
-			count->val = (u64)((double) count->val * count->ena / count->run + 0.5);
+			/* Snapshot events do not scale counts values. */
+			if (!snapshot && count->run)
+				count->val = (u64)((double) count->val * count->ena /
+					     count->run + 0.5);
 		}
-	} else
-		count->ena = count->run = 0;
+
+	} else {
+		count->run = count->ena;
+	}
 
 	if (pscaled)
 		*pscaled = scaled;
@@ -1232,8 +1252,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
-	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
+	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0) {
+		perf_counts_values__make_na(count);
 		return -errno;
+	}
 
 	return 0;
 }
@@ -1241,6 +1263,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 			      int cpu, int thread, bool scale)
 {
+	int ret = 0;
 	struct perf_counts_values count;
 	size_t nv = scale ? 3 : 1;
 
@@ -1250,13 +1273,17 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 	if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
 		return -ENOMEM;
 
-	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
-		return -errno;
+	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0) {
+		perf_counts_values__make_na(&count);
+		ret = -errno;
+		goto exit;
+	}
 
 	perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-	perf_counts_values__scale(&count, scale, NULL);
+	perf_counts_values__scale(&count, scale, evsel->per_pkg, evsel->snapshot, NULL);
+exit:
 	*perf_counts(evsel->counts, cpu, thread) = count;
-	return 0;
+	return ret;
 }
 
 static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 06ef6f2..9788ca1 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -82,6 +82,10 @@ struct perf_evsel_config_term {
  * @is_pos: the position (counting backwards) of the event id (PERF_SAMPLE_ID or
  *          PERF_SAMPLE_IDENTIFIER) in a non-sample event i.e. if sample_id_all
  *          is used there is an id sample appended to non-sample events
+ * @snapshot: an event that whose raw value cannot be extrapolated based on
+ *	    the ratio of running/enabled time.
+ * @per_pkg: an event that runs package wide. All cores in same package will
+ *	    read the same value, even if running time == 0.
  * @priv:   And what is in its containing unnamed union are tool specific
  */
 struct perf_evsel {
@@ -153,8 +157,8 @@ static inline int perf_evsel__nr_cpus(struct perf_evsel *evsel)
 	return perf_evsel__cpus(evsel)->nr;
 }
 
-void perf_counts_values__scale(struct perf_counts_values *count,
-			       bool scale, s8 *pscaled);
+void perf_counts_values__scale(struct perf_counts_values *count, bool scale,
+			       bool per_pkg, bool snapshot, s8 *pscaled);
 
 void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 				struct perf_counts_values *count);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 39345c2d..514b953 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -202,7 +202,7 @@ static void zero_per_pkg(struct perf_evsel *counter)
 }
 
 static int check_per_pkg(struct perf_evsel *counter,
-			 struct perf_counts_values *vals, int cpu, bool *skip)
+			 int cpu, bool *skip)
 {
 	unsigned long *mask = counter->per_pkg_mask;
 	struct cpu_map *cpus = perf_evsel__cpus(counter);
@@ -224,17 +224,6 @@ static int check_per_pkg(struct perf_evsel *counter,
 		counter->per_pkg_mask = mask;
 	}
 
-	/*
-	 * we do not consider an event that has not run as a good
-	 * instance to mark a package as used (skip=1). Otherwise
-	 * we may run into a situation where the first CPU in a package
-	 * is not running anything, yet the second is, and this function
-	 * would mark the package as used after the first CPU and would
-	 * not read the values from the second CPU.
-	 */
-	if (!(vals->run && vals->ena))
-		return 0;
-
 	s = cpu_map__get_socket(cpus, cpu, NULL);
 	if (s < 0)
 		return -1;
@@ -249,30 +238,27 @@ static int check_per_pkg(struct perf_evsel *counter,
 		       struct perf_counts_values *count)
 {
 	struct perf_counts_values *aggr = &evsel->counts->aggr;
-	static struct perf_counts_values zero;
 	bool skip = false;
 
-	if (check_per_pkg(evsel, count, cpu, &skip)) {
+	if (check_per_pkg(evsel, cpu, &skip)) {
 		pr_err("failed to read per-pkg counter\n");
 		return -1;
 	}
 
-	if (skip)
-		count = &zero;
-
 	switch (config->aggr_mode) {
 	case AGGR_THREAD:
 	case AGGR_CORE:
 	case AGGR_SOCKET:
 	case AGGR_NONE:
-		if (!evsel->snapshot)
-			perf_evsel__compute_deltas(evsel, cpu, thread, count);
-		perf_counts_values__scale(count, config->scale, NULL);
+		perf_evsel__compute_deltas(evsel, cpu, thread, count);
+		perf_counts_values__scale(count, config->scale,
+					  evsel->per_pkg, evsel->snapshot, NULL);
 		if (config->aggr_mode == AGGR_NONE)
 			perf_stat__update_shadow_stats(evsel, count->values, cpu);
 		break;
 	case AGGR_GLOBAL:
-		aggr->val += count->val;
+		if (!skip)
+			aggr->val += count->val;
 		if (config->scale) {
 			aggr->ena += count->ena;
 			aggr->run += count->run;
@@ -337,9 +323,10 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	if (config->aggr_mode != AGGR_GLOBAL)
 		return 0;
 
-	if (!counter->snapshot)
-		perf_evsel__compute_deltas(counter, -1, -1, aggr);
-	perf_counts_values__scale(aggr, config->scale, &counter->counts->scaled);
+	perf_evsel__compute_deltas(counter, -1, -1, aggr);
+	perf_counts_values__scale(aggr, config->scale,
+				  counter->per_pkg, counter->snapshot,
+				  &counter->counts->scaled);
 
 	for (i = 0; i < 3; i++)
 		update_stats(&ps->res_stats[i], count[i]);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option
  2017-01-06 21:59 ` [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option Vikas Shivappa
@ 2017-01-16 18:05   ` Thomas Gleixner
  2017-01-17 17:25     ` Shivappa Vikas
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-16 18:05 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> Add a compile option INTEL_RDT which enables common code for all
> RDT(Resource director technology) and a specific INTEL_RDT_M which
> enables code for RDT monitoring. CQM(cache quality monitoring) and
> mbm(memory b/w monitoring) are part of Intel RDT monitoring.

If we handle this with its own config option, can you please make this
thing modular? There is no point to have this compiled in and wasting
memory if it's not supported by the CPU.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/12] x86/cqm: Add Per pkg rmid support\
  2017-01-06 21:59 ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support Vikas Shivappa
@ 2017-01-16 18:15   ` Thomas Gleixner
  2017-01-17 19:11     ` Shivappa Vikas
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-16 18:15 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:

> Subject : [PATCH 04/12] x86/cqm: Add Per pkg rmid support

Can you please be a bit more careful about the subject lines. 'Per' wants
to be 'per' and pkg really can be written as package. There is no point in
cryptic abbreviations for no value. Finally, RMID wants to be all uppercase
as you use it in the text below.

> Patch introduces a new cqm_pkgs_data to keep track of the per package

Again. We know that this is a patch.....

> free list, limbo list and other locking structures.

So free list and limbo list are locking structures, interesting.

> The corresponding rmid structures in the perf_event is

s/structures/structure/

> changed to hold an array of u32 RMIDs instead of a single u32.
> 
> The RMIDs are not assigned at the time of event creation and are
> assigned in lazy mode at the first sched_in time for a task, thereby
> rmid is never allocated if a task is not scheduled on a package. This
> helps better usage of RMIDs and its scales with the increasing
> sockets/packages.
> 
> Locking:
> event list - perf init and terminate hold mutex. spin lock is held to
> gaurd from the mbm hrtimer.
> per pkg free and limbo list - global spin lock. Used by
> get_rmid,put_rmid, perf start, terminate

Locking documentation wants to be in the code not in some random changelog.
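
For illustration only (a sketch, not from the patch; field names taken from
struct pkg_data, the placement and comments are assumptions, assuming the
usual <linux/list.h> and <linux/workqueue.h> includes):

/* Locking rules live next to the data they protect: */
struct pkg_data {
	/*
	 * Per-package free and limbo LRU lists. Protected by the global
	 * cache_lock spinlock; used from __get_rmid()/__put_rmid(), event
	 * start and event terminate.
	 */
	struct list_head	cqm_rmid_free_lru;
	struct list_head	cqm_rmid_limbo_lru;
	/* Delayed work which recycles limbo RMIDs; runs on @rmid_work_cpu. */
	struct delayed_work	intel_cqm_rmid_work;
	int			rmid_work_cpu;
};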

> Tests: RMIDs available increase by x times where x is number of sockets
> and the usage is dynamic so we save more.

What does 'Tests:' mean? Is there a test in this patch? If yes, I seem to be
missing it.

> Patch is based on David Carrillo-Cisneros <davidcc@google.com> patches
> in cqm2 series.

We document such attributions with a tag:

Originally-From: David Carrillo-Cisneros <davidcc@google.com>

That way tools can pick it up.

>  static cpumask_t cqm_cpumask;
>  
> +struct pkg_data **cqm_pkgs_data;

Why is this global? There is no user outside of cqm.c AFAICT.

> -/*
> - * We use a simple array of pointers so that we can lookup a struct
> - * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
> - * and __put_rmid() from having to worry about dealing with struct
> - * cqm_rmid_entry - they just deal with rmids, i.e. integers.
> - *
> - * Once this array is initialized it is read-only. No locks are required
> - * to access it.
> - *
> - * All entries for all RMIDs can be looked up in the this array at all
> - * times.

So this comment was actually valuable. Sure, it does not match the new
implementation, but the basic principle is still the same. The comment
wants to be updated and not just removed.

>   *
>   * We expect to be called with cache_mutex held.

You rename cache_mutex to cache_lock, but updating the comments is
optional, right?

> -static u32 __get_rmid(void)
> +static u32 __get_rmid(int domain)

  unsigned int domain please. The domain cannot be negative.

>  {
> +	struct list_head *cqm_flist;
>  	struct cqm_rmid_entry *entry;

Please keep the variables as a reverse fir tree:

>  	struct cqm_rmid_entry *entry;
> +	struct list_head *cqm_flist;

>  
> -	lockdep_assert_held(&cache_mutex);
> +	lockdep_assert_held(&cache_lock);
 
> -static void __put_rmid(u32 rmid)
> +static void __put_rmid(u32 rmid, int domain)
>  {
>  	struct cqm_rmid_entry *entry;
>  
> -	lockdep_assert_held(&cache_mutex);
> +	lockdep_assert_held(&cache_lock);
>  
> -	WARN_ON(!__rmid_valid(rmid));
> -	entry = __rmid_entry(rmid);
> +	WARN_ON(!rmid);

What's wrong with __rmid_valid() ?

> +	entry = __rmid_entry(rmid, domain);
>  
> +	kfree(cqm_pkgs_data);
>  }
>  
> +

Random white space noise.

>  static bool is_cqm_event(int e)
> @@ -420,10 +349,11 @@ static void __intel_mbm_event_init(void *info)
>  {
>  	struct rmid_read *rr = info;
>  
> -	update_sample(rr->rmid, rr->evt_type, 1);
> +	if (__rmid_valid(rr->rmid[pkg_id]))

Where the heck is pkg_id coming from?

Ahh:

#define pkg_id  topology_physical_package_id(smp_processor_id())

That's crap. It's existing crap, but nevertheless crap.

First of all it's irritating as 'pkg_id' looks like a variable and in this
case it would be at least file global, which makes no sense at all.

Secondly, we really want to move this to the logical package id as that is
actually guaranteeing that the package id is smaller than
topology_max_packages(). The physical package id has no such guarantee.

And then please keep this local so its simple to read and parse:

    	 unsigned int pkgid = topology_logical_package_id(smp_processor_id());

> +		update_sample(rr->rmid[pkg_id], rr->evt_type, 1);
>  }

> @@ -444,7 +374,7 @@ static int intel_cqm_setup_event(struct perf_event *event,
>  				  struct perf_event **group)
>  {
>  	struct perf_event *iter;
> -	u32 rmid;
> +	u32 *rmid, sizet;

What's wrong with size? It's not confusable with size_t, right?

> +void alloc_needed_pkg_rmid(u32 *cqm_rmid)

Again: Why is this global?

And what's the meaning of needed in the function name? If the RMID wouldn't
be needed then the function would not be called, right?

> +{
> +	unsigned long flags;
> +	u32 rmid;
> +
> +	if (WARN_ON(!cqm_rmid))
> +		return;
> +
> +	if (cqm_rmid[pkg_id])
> +		return;
> +
> +	raw_spin_lock_irqsave(&cache_lock, flags);
> +
> +	rmid = __get_rmid(pkg_id);
> +	if (__rmid_valid(rmid))
> +		cqm_rmid[pkg_id] = rmid;
> +
> +	raw_spin_unlock_irqrestore(&cache_lock, flags);
> +}
> +
>  static void intel_cqm_event_start(struct perf_event *event, int mode)
>  {
>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> -	u32 rmid = event->hw.cqm_rmid;
> +	u32 rmid;
>  
>  	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
>  		return;
>  
>  	event->hw.cqm_state &= ~PERF_HES_STOPPED;
>  
> +	alloc_needed_pkg_rmid(event->hw.cqm_rmid);
> +
> +	rmid = event->hw.cqm_rmid[pkg_id];

The allocation can fail. What's in event->hw.cqm_rmid[pkg_id] then? Zero or
what? And why is any of those values fine?

Can you please add comments so reviewers and readers do not have to figure
out every substantial detail themselves?

>  	state->rmid = rmid;
>  	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);

> +static inline void
> +	cqm_event_free_rmid(struct perf_event *event)

Yet another coding style variant from your unlimited supply of coding
horrors.

> +{
> +	u32 *rmid = event->hw.cqm_rmid;
> +	int d;
> +
> +	for (d = 0; d < cqm_socket_max; d++) {
> +		if (__rmid_valid(rmid[d]))
> +			__put_rmid(rmid[d], d);
> +	}
> +	kfree(event->hw.cqm_rmid);
> +	list_del(&event->hw.cqm_groups_entry);
> +}
>  static void intel_cqm_event_destroy(struct perf_event *event)
>  {
>  	struct perf_event *group_other = NULL;
> @@ -737,16 +693,11 @@ static void intel_cqm_event_destroy(struct perf_event *event)
>  		 * If there was a group_other, make that leader, otherwise
>  		 * destroy the group and return the RMID.
>  		 */
> -		if (group_other) {
> +		if (group_other)
>  			list_replace(&event->hw.cqm_groups_entry,
>  				     &group_other->hw.cqm_groups_entry);

Please keep the brackets. See:

http://lkml.kernel.org/r/alpine.DEB.2.20.1609101416420.32361@nanos

> -		} else {
> -			u32 rmid = event->hw.cqm_rmid;
> -
> -			if (__rmid_valid(rmid))
> -				__put_rmid(rmid);
> -			list_del(&event->hw.cqm_groups_entry);
> -		}
> +		else
> +			cqm_event_free_rmid(event);
>  	}
>  
>  	raw_spin_unlock_irqrestore(&cache_lock, flags);

> +static int pkg_data_init_cpu(int cpu)

cpus are unsigned int

> +{
> +	struct cqm_rmid_entry *ccqm_rmid_ptrs = NULL, *entry = NULL;

What are the pointer initializations for?

> +	int curr_pkgid = topology_physical_package_id(cpu);

Please use logical packages.

> +	struct pkg_data *pkg_data = NULL;
> +	int i = 0, nr_rmids, ret = 0;

Crap. 'i' is used in the for() loop below and therefore initialized exactly
there and not at some random other place. ret is a completely pointless
variable and the initialization is even more pointless. There is exactly
one code path using it (fail) and you can just return -ENOMEM from there.

> +	if (cqm_pkgs_data[curr_pkgid])
> +		return 0;
> +
> +	pkg_data = kzalloc_node(sizeof(struct pkg_data),
> +				GFP_KERNEL, cpu_to_node(cpu));
> +	if (!pkg_data)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&pkg_data->cqm_rmid_free_lru);
> +	INIT_LIST_HEAD(&pkg_data->cqm_rmid_limbo_lru);
> +
> +	mutex_init(&pkg_data->pkg_data_mutex);
> +	raw_spin_lock_init(&pkg_data->pkg_data_lock);
> +
> +	pkg_data->rmid_work_cpu = cpu;
> +
> +	nr_rmids = cqm_max_rmid + 1;
> +	ccqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry) *
> +			   nr_rmids, GFP_KERNEL);
> +	if (!ccqm_rmid_ptrs) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	for (; i <= cqm_max_rmid; i++) {
> +		entry = &ccqm_rmid_ptrs[i];
> +		INIT_LIST_HEAD(&entry->list);
> +		entry->rmid = i;
> +
> +		list_add_tail(&entry->list, &pkg_data->cqm_rmid_free_lru);
> +	}
> +
> +	pkg_data->cqm_rmid_ptrs = ccqm_rmid_ptrs;
> +	cqm_pkgs_data[curr_pkgid] = pkg_data;
> +
> +	/*
> +	 * RMID 0 is special and is always allocated. It's used for all
> +	 * tasks that are not monitored.
> +	 */
> +	entry = __rmid_entry(0, curr_pkgid);
> +	list_del(&entry->list);
> +
> +	return 0;
> +fail:
> +	kfree(ccqm_rmid_ptrs);
> +	ccqm_rmid_ptrs = NULL;

And clearing the local variable has which value?

> +	kfree(pkg_data);
> +	pkg_data = NULL;
> +	cqm_pkgs_data[curr_pkgid] = NULL;

It never got set, so why do you need to clear it? Just because you do not
trust the compiler?

> +	return ret;
> +}
> +
> +static int cqm_init_pkgs_data(void)
> +{
> +	int i, cpu, ret = 0;

Again, pointless 0 initialization.

> +	cqm_pkgs_data = kzalloc(
> +		sizeof(struct pkg_data *) * cqm_socket_max,
> +		GFP_KERNEL);

Using a simple local variable to calculate size first would spare this new
variant of coding style horror.

> +	if (!cqm_pkgs_data)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < cqm_socket_max; i++)
> +		cqm_pkgs_data[i] = NULL;

Surely kzalloc() is not reliable enough, so make sure that the pointers are
really NULL. Brilliant stuff.

> +
> +	for_each_online_cpu(cpu) {
> +		ret = pkg_data_init_cpu(cpu);
> +		if (ret)
> +			goto fail;
> +	}

And that's protected against CPU hotplug in which way?

Aside from that: what allocates the package data for packages which come
online _AFTER_ this function is called? Nothing AFAICT.

What you really need is a CPU hotplug callback for the prepare stage, which
does this setup race-free and also handles packages which come online after
this.
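
Roughly (an untested sketch; assumes a dynamic prepare-stage state,
CPUHP_BP_PREPARE_DYN, is available and reuses pkg_data_init_cpu() as the
prepare callback, with an "error" label in the caller):

static int intel_cqm_prep_cpu(unsigned int cpu)
{
	return pkg_data_init_cpu(cpu);
}

	/* in intel_cqm_init(), instead of the for_each_online_cpu() loop: */
	ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "perf/x86/cqm:prepare",
				intel_cqm_prep_cpu, NULL);
	if (ret < 0)
		goto error;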

> diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
> new file mode 100644
> index 0000000..4415497
> --- /dev/null
> +++ b/arch/x86/events/intel/cqm.h
> @@ -0,0 +1,37 @@
> +#ifndef _ASM_X86_CQM_H
> +#define _ASM_X86_CQM_H
> +
> +#ifdef CONFIG_INTEL_RDT_M
> +
> +#include <linux/perf_event.h>
> +
> +/**
> + * struct pkg_data - cqm per package(socket) meta data
> + * @cqm_rmid_free_lru    A least recently used list of free RMIDs
> + *     These RMIDs are guaranteed to have an occupancy less than the
> + * threshold occupancy

You certainly did not even try to run kernel-doc on this.

    @var:  Explanation

is the proper format. Can you spot the difference?

Also the multi-line comments are horribly formatted:

   @var:       	    This is a multi-line comment which is necessary because
   		    it needs a lot of text......

> + * @cqm_rmid_entry - The entry in the limbo and free lists.

And this is incorrect as well. You are not even trying to make stuff
consistently wrong.

> + * @delayed_work - Work to reuse the RMIDs that have been freed.
> + * @rmid_work_cpu - The cpu on the package on which work is scheduled.

Also the formatting wants to be tabular:

 * @var:		Text
 * @long_var:		Text
 * @really_long_var:	Text
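
For reference, a kernel-doc comment in that format could look roughly like
this (member names as in the patch, descriptions paraphrased from the
original comment):

/**
 * struct pkg_data - per-package (socket) cqm metadata
 * @cqm_rmid_free_lru:		LRU list of free RMIDs, all guaranteed to
 *				have an occupancy below the threshold
 * @cqm_rmid_limbo_lru:		RMIDs waiting for their occupancy to drop
 *				below the threshold
 * @intel_cqm_rmid_work:	delayed work which moves limbo RMIDs back
 *				to the free list
 * @rmid_work_cpu:		CPU on the package on which the work is
 *				scheduled
 */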

Sigh,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-06 21:59 ` [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare Vikas Shivappa
@ 2017-01-17 12:11   ` Thomas Gleixner
  2017-01-17 12:31     ` Peter Zijlstra
  2017-01-18  2:14     ` Shivappa Vikas
  2017-01-17 13:46   ` Thomas Gleixner
  2017-01-17 15:26   ` Peter Zijlstra
  2 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 12:11 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:

> From: David Carrillo-Cisneros <davidcc@google.com>
> 
> cgroup hierarchy monitoring is not supported currently. This patch
> builds all the necessary datastructures, cgroup APIs like alloc, free
> etc and necessary quirks for supporting cgroup hierarchy monitoring in
> later patches.
> 
> - Introduce a architecture specific data structure arch_info in
> perf_cgroup to keep track of RMIDs and cgroup hierarchical monitoring.
> - perf sched_in calls all the cgroup ancestors when a cgroup is
> scheduled in. This will not work with cqm as we have a common per pkg
> rmid associated with one task and hence cannot write different RMIds
> into the MSR for each event. cqm driver enables a flag
> PERF_EV_CGROUP_NO_RECURSION which indicates the perf to not call all
> ancestor cgroups for each event and let the driver handle the hierarchy
> monitoring for cgroup.
> - Introduce event_terminate as event_destroy is called after cgrp is
> disassociated from the event to support rmid handling of the cgroup.
> This helps cqm clean up the cqm specific arch_info.
> - Add the cgroup APIs for alloc,free,attach and can_attach
> 
> The above framework will be used to build different cgroup features in
> later patches.

That's not a framework. It's a hodgepodge of core and x86 specific changes.

I'm not even trying to review it as a whole, simply because such changes
want to be split into several preparatory changes in the core which provide
the 'framework' parts and then an actual user in the architecture
code. I'll give some general feedback whatsoever.

> Tests: Same as before. Cgroup still doesnt work but we did the prep to
> get it to work

Oh well. What tests are the same as before? This information is just there
to take up room in the changelog, right?

> Patch modified/refactored by Vikas Shivappa
> <vikas.shivappa@linux.intel.com> to support recycling removal.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>

So this patch is from David, but where is Davids Signed-off-by?

> ---
>  arch/x86/events/intel/cqm.c       | 19 ++++++++++++++++++-
>  arch/x86/include/asm/perf_event.h | 27 +++++++++++++++++++++++++++
>  include/linux/perf_event.h        | 32 ++++++++++++++++++++++++++++++++
>  kernel/events/core.c              | 28 +++++++++++++++++++++++++++-
>  4 files changed, 104 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
> index 68fd1da..a9bd7bd 100644
> --- a/arch/x86/events/intel/cqm.c
> +++ b/arch/x86/events/intel/cqm.c
> @@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
>  	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
>  	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
>  
> -	event->destroy = intel_cqm_event_destroy;
> +	/*
> +	 * CQM driver handles cgroup recursion and since only noe
> +	 * RMID can be programmed at the time in each core, then
> +	 * it is incompatible with the way generic code handles
> +	 * cgroup hierarchies.
> +	 */
> +	event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
>  
>  	mutex_lock(&cache_mutex);
>  
> @@ -918,6 +924,17 @@ static int intel_cqm_event_init(struct perf_event *event)
>  	.read		     = intel_cqm_event_read,
>  	.count		     = intel_cqm_event_count,
>  };
> +#ifdef CONFIG_CGROUP_PERF
> +int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
> +				      struct cgroup_subsys_state *new_css)
> +{}
> +void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
> +{}
> +void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
> +{}
> +int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
> +{}
> +#endif

What the heck is this for? It does not even compile because
perf_cgroup_arch_css_alloc() and perf_cgroup_arch_can_attach() are empty
functions.

Crap, crap, crap.

>  static inline void cqm_pick_event_reader(int cpu)
>  {
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index f353061..f38c7f0 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -299,4 +299,31 @@ static inline void perf_check_microcode(void) { }
>  
>  #define arch_perf_out_copy_user copy_from_user_nmi
>  
> +/*
> + * Hooks for architecture specific features of perf_event cgroup.
> + * Currently used by Intel's CQM.
> + */
> +#ifdef CONFIG_INTEL_RDT_M
> +#ifdef CONFIG_CGROUP_PERF
> +
> +#define perf_cgroup_arch_css_alloc	perf_cgroup_arch_css_alloc
> +
> +int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
> +				      struct cgroup_subsys_state *new_css);
> +
> +#define perf_cgroup_arch_css_free	perf_cgroup_arch_css_free
> +
> +void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
> +
> +#define perf_cgroup_arch_attach		perf_cgroup_arch_attach
> +
> +void perf_cgroup_arch_attach(struct cgroup_taskset *tset);
> +
> +#define perf_cgroup_arch_can_attach	perf_cgroup_arch_can_attach
> +
> +int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset);
> +
> +#endif
> +
> +#endif
>  #endif /* _ASM_X86_PERF_EVENT_H */

How the heck is one supposed to figure out which endif belongs to what?
Random new lines are not helping with that.

Aside from that, the double ifdef is horrible and this really is not even
remotely a framework. It's hardcoded crap to serve that CQM mess. Nothing
else can ever use it. So don't pretend it is a 'framework'.

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a8f4749..410642a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -300,6 +300,12 @@ struct pmu {
>  	int (*event_init)		(struct perf_event *event);
>  
>  	/*
> +	 * Terminate the event for this PMU. Optional complement for a
> +	 * successful event_init. Called before the event fields are tear down.
> +	 */
> +	void (*event_terminate)		(struct perf_event *event);

And why does this need to be a PMU callback. It's called right before
perf_cgroup_detach(). Why does it need extra treatment and cannot be done
from the cgroup muck?

> +
> +	/*
>  	 * Notification that the event was mapped or unmapped.  Called
>  	 * in the context of the mapping task.
>  	 */
> @@ -516,9 +522,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
>   * PERF_EV_CAP_SOFTWARE: Is a software event.
>   * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
>   * from any CPU in the package where it is active.
> + * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
> + * cgroup scoping. It does not need to be enabled for all of its descendants
> + * cgroups.
>   */
>  #define PERF_EV_CAP_SOFTWARE		BIT(0)
>  #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
> +#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(2)
>  
>  #define SWEVENT_HLIST_BITS		8
>  #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
> @@ -823,6 +833,8 @@ struct perf_cgroup_info {
>  };
>  
>  struct perf_cgroup {
> +	/* Architecture specific information. */

That's a really useful comment.

> +	void				 *arch_info;
>  	struct cgroup_subsys_state	css;
>  	struct perf_cgroup_info	__percpu *info;
>  };
> @@ -844,6 +856,7 @@ struct perf_cgroup {
>  
>  #ifdef CONFIG_PERF_EVENTS
>  
> +extern int is_cgroup_event(struct perf_event *event);
>  extern void *perf_aux_output_begin(struct perf_output_handle *handle,
>  				   struct perf_event *event);
>  extern void perf_aux_output_end(struct perf_output_handle *handle,
> @@ -1387,4 +1400,23 @@ ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
>  #define perf_event_exit_cpu	NULL
>  #endif
>  
> +/*
> + * Hooks for architecture specific extensions for perf_cgroup.

No. That's not architecture specific. That's CQM specific hackery.

> + */
> +#ifndef perf_cgroup_arch_css_alloc
> +#define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
> +#endif

I really hate this define style. That can be solved nicely with weak
functions which avoid all this define and ifdeffery mess.
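
For example (sketch only; signatures copied from the patch, the __weak
defaults are assumptions):

/* Core code: default stubs that an architecture can override. */
int __weak perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
				      struct cgroup_subsys_state *new_css)
{
	return 0;
}

void __weak perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
{
}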

> +#ifndef perf_cgroup_arch_css_free
> +#define perf_cgroup_arch_css_free(css) do { } while (0)
> +#endif
> +
> +#ifndef perf_cgroup_arch_attach
> +#define perf_cgroup_arch_attach(tskset) do { } while (0)
> +#endif
> +
> +#ifndef perf_cgroup_arch_can_attach
> +#define perf_cgroup_arch_can_attach(tskset) 0

This one is exceptionally stupid. Here is the use case:

> +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
> +{
> +     return perf_cgroup_arch_can_attach(tset);
>  }
>
> +
>  struct cgroup_subsys perf_event_cgrp_subsys = {
>       .css_alloc      = perf_cgroup_css_alloc,
>       .css_free       = perf_cgroup_css_free,
> +     .can_attach     = perf_cgroup_can_attach,
>       .attach         = perf_cgroup_attach,
>  };

So you need an extra function for calling a stub macro, when it could just
be done by assigning the real function (if required) to the callback
pointer and otherwise leaving it NULL.
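
I.e. something along these lines (sketch; assumes the arch callback is only
declared when CONFIG_INTEL_RDT_M is set):

struct cgroup_subsys perf_event_cgrp_subsys = {
	.css_alloc	= perf_cgroup_css_alloc,
	.css_free	= perf_cgroup_css_free,
#ifdef CONFIG_INTEL_RDT_M
	.can_attach	= perf_cgroup_arch_can_attach,
#endif
	.attach		= perf_cgroup_attach,
};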

But all of this is moot because this 'arch framework' is just crap because
it's in reality a CQM extension of the core code, which is not going to
happen.

> +#endif
> +
>  #endif /* _LINUX_PERF_EVENT_H */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ab15509..229f611 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -590,6 +590,9 @@ static inline u64 perf_event_clock(struct perf_event *event)
>  	if (!cpuctx->cgrp)
>  		return false;
>  
> +	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
> +		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
> +

Comments explaining what this does are overrated, right?

>  	/*
>  	 * Cgroup scoping is recursive.  An event enabled for a cgroup is
>  	 * also enabled for all its descendant cgroups.  If @cpuctx's
> @@ -606,7 +609,7 @@ static inline void perf_detach_cgroup(struct perf_event *event)
>  	event->cgrp = NULL;
>  }
>  
> -static inline int is_cgroup_event(struct perf_event *event)
> +int is_cgroup_event(struct perf_event *event)

So this is made global because there is no actual user outside of the core.

>  {
>  	return event->cgrp != NULL;
>  }
> @@ -4019,6 +4022,9 @@ static void _free_event(struct perf_event *event)
>  		mutex_unlock(&event->mmap_mutex);
>  	}
>  
> +	if (event->pmu->event_terminate)
> +		event->pmu->event_terminate(event);
> +
>  	if (is_cgroup_event(event))
>  		perf_detach_cgroup(event);
>  
> @@ -9246,6 +9252,8 @@ static void account_event(struct perf_event *event)
>  	exclusive_event_destroy(event);
>  
>  err_pmu:
> +	if (event->pmu->event_terminate)
> +		event->pmu->event_terminate(event);
>  	if (event->destroy)
>  		event->destroy(event);
>  	module_put(pmu->module);
> @@ -10748,6 +10756,7 @@ static int __init perf_event_sysfs_init(void)
>  perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  {
>  	struct perf_cgroup *jc;
> +	int ret;
>  
>  	jc = kzalloc(sizeof(*jc), GFP_KERNEL);
>  	if (!jc)
> @@ -10759,6 +10768,12 @@ static int __init perf_event_sysfs_init(void)
>  		return ERR_PTR(-ENOMEM);
>  	}
>  
> +	jc->arch_info = NULL;

Never trust kzalloc to zero out a data structure correctly!

> +
> +	ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
> +	if (ret)
> +		return ERR_PTR(ret);

And then leak the allocated memory in case of error.

Another wonderful piece of trainwreck engineering.

As I said above: Split this into bits and pieces and provide a proper
justification for each of the items you add to the core: terminate,
PERF_EV_CAP_CGROUP_NO_RECURSION.

Then sit down and come up with a solution which allows to make use of the
cgroup core extensions for more than a single instance of a particular
piece of x86 perf hardware.

And please provide changelogs which explain WHY all of this is necessary,
not just the WHAT.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-17 12:11   ` Thomas Gleixner
@ 2017-01-17 12:31     ` Peter Zijlstra
  2017-01-18  2:14     ` Shivappa Vikas
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2017-01-17 12:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin

On Tue, Jan 17, 2017 at 01:11:50PM +0100, Thomas Gleixner wrote:
> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> 
> > From: David Carrillo-Cisneros <davidcc@google.com>
> > 
> > cgroup hierarchy monitoring is not supported currently. This patch
> > builds all the necessary datastructures, cgroup APIs like alloc, free
> > etc and necessary quirks for supporting cgroup hierarchy monitoring in
> > later patches.
> > 
> > - Introduce a architecture specific data structure arch_info in
> > perf_cgroup to keep track of RMIDs and cgroup hierarchical monitoring.
> > - perf sched_in calls all the cgroup ancestors when a cgroup is
> > scheduled in. This will not work with cqm as we have a common per pkg
> > rmid associated with one task and hence cannot write different RMIds
> > into the MSR for each event. cqm driver enables a flag
> > PERF_EV_CGROUP_NO_RECURSION which indicates the perf to not call all
> > ancestor cgroups for each event and let the driver handle the hierarchy
> > monitoring for cgroup.
> > - Introduce event_terminate as event_destroy is called after cgrp is
> > disassociated from the event to support rmid handling of the cgroup.
> > This helps cqm clean up the cqm specific arch_info.
> > - Add the cgroup APIs for alloc,free,attach and can_attach
> > 
> > The above framework will be used to build different cgroup features in
> > later patches.
> 
> That's not a framework. It's a hodgepodge of core and x86 specific changes.

Trainwreck comes to mind. It completely fails to describe the semantics of
the hacks and how they would preserve the cgroup invariants.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-06 21:59 ` [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare Vikas Shivappa
  2017-01-17 12:11   ` Thomas Gleixner
@ 2017-01-17 13:46   ` Thomas Gleixner
  2017-01-17 20:22     ` Shivappa Vikas
  2017-01-17 15:26   ` Peter Zijlstra
  2 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 13:46 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> @@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
>  	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
>  	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
>  
> -	event->destroy = intel_cqm_event_destroy;

I missed this in the first round, but tripped over it when looking at one
of the follow up patches.

How is that supposed to work?

1) intel_cqm_event_destroy() is still in the code and unused, which emits a
   compiler warning, but that can obviously be ignored for good measure.

2) How would any testing of this mess actually work?

   Not at all. Nothing ever tears down an event. So you just leave
   everything hanging around, probably with dangling pointers left and
   right.

So now the 'Tests: Same as before.' in the so-called changelog makes sense:

   'Same as before' means: Completely untested and broken.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support
  2017-01-06 21:59 ` [PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support Vikas Shivappa
@ 2017-01-17 14:07   ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 14:07 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> From: David Carrillo-Cisneros <davidcc@google.com>
> 
> Patch adds support for monitoring cgroup hierarchy. The
> arch_info that was introduced in the perf_cgroup is used to maintain the
> cgroup related rmid and hierarchy information.
> 
> Since cgroup supports hierarchical monitoring, a cgroup is always
> monitoring for some ancestor. By default root is always monitored with
> RMID 0 and hence when any cgroup is first created it always reports data
> to the root.  mfa or 'monitor for ancestor' is used to keep track of
> this information.  Basically which ancestor the cgroup is actually
> monitoring for or has to report the data to.
> 
> By default, all cgroup's mfa points to root.
> 1.event init: When ever a new cgroup x would start to be monitored,
> the mfa of the monitored cgroup's descendants point towards the cgroup
> x.
> 2.switch_to: task finds the cgroup its associated with and if the
> cgroup itself is being monitored cgroup uses its own rmid(a) else it uses
> the rmid of the mfa(b).
> 3.read: During the read call, the cgroup x just adds the
> counts of its descendants who had cgroup x as mfa and were also
> monitored(To count the scenario (a) in switch_to).
> 
> Locking: cgroup traversal: rcu_readlock.  cgroup->arch_info: css_alloc,
> css_free, event terminate, init hold mutex.

Again: Locking must be documented in the code, not in a changelog. And the
documentation of locking wants to be elaborate, not a sloppy unparseable
blurb.

> Tests: Cgroup monitoring should work. Monitoring multiple cgroups in the

should work is really a great test.....

> same hierarchy works. monitoring cgroup and a task within same cgroup
> doesnt work yet.
> 
> Patch modified/refactored by Vikas Shivappa
> <vikas.shivappa@linux.intel.com> to support recycling removal.

Preserving Signed-off-by tags is not in your book, right?

> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>  
>  struct pkg_data **cqm_pkgs_data;
> +struct cgrp_cqm_info cqm_rootcginfo;

And this is global because?

>  #define RMID_VAL_ERROR		(1ULL << 63)
>  #define RMID_VAL_UNAVAIL	(1ULL << 62)
> @@ -193,6 +194,11 @@ static void __put_rmid(u32 rmid, int domain)
>  	list_add_tail(&entry->list, &cqm_pkgs_data[domain]->cqm_rmid_limbo_lru);
>  }
>  
> +static bool is_task_event(struct perf_event *e)

inline

> +{
> +	return (e->attach_state & PERF_ATTACH_TASK);
> +}

And if it is needed at all, this should go into a core perf header. It's not
at all CQM specific.
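
For illustration, a minimal sketch of what such a generic helper could look
like if it moved into include/linux/perf_event.h; the exact placement and form
are assumptions, not part of the posted series:

/* Sketch only: a generic helper alongside the other perf_event inlines. */
static inline bool is_task_event(struct perf_event *event)
{
	return !!(event->attach_state & PERF_ATTACH_TASK);
}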

> +static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
> +{
> +	if (rmid != NULL) {
> +		cqm_info->mon_enabled = true;
> +		cqm_info->rmid = rmid;
> +	} else {
> +		cqm_info->mon_enabled = false;
> +		cqm_info->rmid = NULL;
> +	}

The function truly implements what the function name suggests. Really
intuitive - NOT!

>  
> +static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr);

Forward declaration should be on top of the file and not at some random
place in the middle of the code.

> +
>  static void intel_cqm_event_read(struct perf_event *event)
>  {
> -	unsigned long flags;
> -	u32 rmid;
> -	u64 val;
> +	struct rmid_read rr = {
> +		.evt_type = event->attr.config,
> +		.value = ATOMIC64_INIT(0),
> +	};
>  
>  	/*
>  	 * Task events are handled by intel_cqm_event_count().
> @@ -414,26 +477,9 @@ static void intel_cqm_event_read(struct perf_event *event)
>  	if (event->cpu == -1)
>  		return;
>  
> -	raw_spin_lock_irqsave(&cache_lock, flags);
> -	rmid = event->hw.cqm_rmid[pkg_id];
> -
> -	if (!__rmid_valid(rmid))
> -		goto out;
> -
> -	if (is_mbm_event(event->attr.config))
> -		val = rmid_read_mbm(rmid, event->attr.config);
> -	else
> -		val = __rmid_read(rmid);
> -
> -	/*
> -	 * Ignore this reading on error states and do not update the value.
> -	 */
> -	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> -		goto out;
> +	rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
>  
> -	local64_set(&event->count, val);
> -out:
> -	raw_spin_unlock_irqrestore(&cache_lock, flags);
> +	cqm_read_subtree(event, &rr);

So why isn't the implementation right here? That's just another pointless
indirection in the code to make review and reading harder.

>  }
>  
>  static void __intel_cqm_event_count(void *info)
> @@ -545,6 +591,55 @@ static void mbm_hrtimer_init(void)
>  	}
>  }
>  
> +static void cqm_mask_call_local(struct rmid_read *rr)

I have a hard time understanding the name of this function. Where the heck
is a mask or a call involved here?

> +{
> +	if (is_mbm_event(rr->evt_type))
> +		__intel_mbm_event_count(rr);
> +	else
> +		__intel_cqm_event_count(rr);
> +}
> +
> +static inline void
> +	delta_local(struct perf_event *event, struct rmid_read *rr, u32 *rmid)

Hell no. We either do

static inline void
delta_local(struct perf_event *event, struct rmid_read *rr, u32 *rmid)

or

static inline void delta_local(struct perf_event *event, struct rmid_read *rr,
       	      	   	       u32 *rmid)

but not this completely unparseable crap.

Aside of that what is delta_local? I cannot see any delta here. And that
rmid_read pointer is confusing as hell. What's the point of it?

You clear the value and set the rmid. So the only value you preserve is the
event type and that one is in the event and can be reinitialized locally.

> +{
> +	atomic64_set(&rr->value, 0);
> +	rr->rmid = ACCESS_ONCE(rmid);

What's the purpose of this access once? Voodoo programming is the only
reasonable explanation I came up with.

> +
> +	cqm_mask_call_local(rr);
> +	local64_add(atomic64_read(&rr->value), &event->count);
> +}
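
Taking the two comments above together, a hedged sketch (helper name assumed)
of how the per-cgroup read could rebuild the rmid_read locally instead of
threading one pointer through:

/* Sketch: read one RMID set and fold the result into the event count. */
static void cqm_add_rmid_count(struct perf_event *event, u32 *rmid)
{
	struct rmid_read rr = {
		.evt_type = event->attr.config,
		.value	  = ATOMIC64_INIT(0),
		.rmid	  = rmid,
	};

	/* Same dispatch as in the patch, just with a local rmid_read. */
	if (is_mbm_event(rr.evt_type))
		__intel_mbm_event_count(&rr);
	else
		__intel_cqm_event_count(&rr);

	local64_add(atomic64_read(&rr.value), &event->count);
}
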
> +
> +/*
> + * Since cgroup follows hierarchy, add the count of
> + *  the descendents who were being monitored as well.
> + */
> +static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)
> +{
> +#ifdef CONFIG_CGROUP_PERF
> +
> +	struct cgroup_subsys_state *rcss, *pos_css;
> +	struct cgrp_cqm_info *ccqm_info;
> +
> +	cqm_mask_call_local(rr);
> +	local64_set(&event->count, atomic64_read(&(rr->value)));
> +
> +	if (is_task_event(event))
> +		return __perf_event_count(event);

And what happens for non-task, non-cgroup, aka CPU-wide, events?

> +
> +	rcu_read_lock();
> +	rcss = &event->cgrp->css;
> +	css_for_each_descendant_pre(pos_css, rcss) {
> +		ccqm_info = (css_to_cqm_info(pos_css));
> +
> +		/* Add the descendent 'monitored cgroup' counts */
> +		if (pos_css != rcss && ccqm_info->mon_enabled)
> +			delta_local(event, rr, ccqm_info->rmid);
> +	}
> +	rcu_read_unlock();
> +#endif
> +	return __perf_event_count(event);

Oh well. Yet another function which does something different than the
function name suggests. It does not read the subtree, it reads either the
task or the cgroup thingy.

> -static void intel_cqm_event_destroy(struct perf_event *event)
> +
> +static void intel_cqm_event_terminate(struct perf_event *event)

Ah. So now that function gets reused. Not sure whether that works better
than before, but at least the compiler warning is gone. Progress!

>  {
>  	struct perf_event *group_other = NULL;
>  	unsigned long flags;
> @@ -917,6 +1014,7 @@ static int intel_cqm_event_init(struct perf_event *event)
>  	.attr_groups	     = intel_cqm_attr_groups,
>  	.task_ctx_nr	     = perf_sw_context,
>  	.event_init	     = intel_cqm_event_init,
> +	.event_terminate     = intel_cqm_event_terminate,
>  	.add		     = intel_cqm_event_add,
>  	.del		     = intel_cqm_event_stop,
>  	.start		     = intel_cqm_event_start,
> @@ -924,12 +1022,67 @@ static int intel_cqm_event_init(struct perf_event *event)
>  	.read		     = intel_cqm_event_read,
>  	.count		     = intel_cqm_event_count,
>  };
> +
>  #ifdef CONFIG_CGROUP_PERF
>  int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
>  				      struct cgroup_subsys_state *new_css)
> -{}
> +{
> +	struct cgrp_cqm_info *cqm_info, *pcqm_info;
> +	struct perf_cgroup *new_cgrp;
> +
> +	if (!parent_css) {
> +		cqm_rootcginfo.level = 0;
> +
> +		cqm_rootcginfo.mon_enabled = true;
> +		cqm_rootcginfo.cont_mon = true;
> +		cqm_rootcginfo.mfa = NULL;
> +		INIT_LIST_HEAD(&cqm_rootcginfo.tskmon_rlist);
> +
> +		if (new_css) {
> +			new_cgrp = css_to_perf_cgroup(new_css);
> +			new_cgrp->arch_info = &cqm_rootcginfo;
> +		}
> +		return 0;
> +	}
> +
> +	mutex_lock(&cache_mutex);
> +
> +	new_cgrp = css_to_perf_cgroup(new_css);
> +
> +	cqm_info = kzalloc(sizeof(struct cgrp_cqm_info), GFP_KERNEL);
> +	if (!cqm_info) {
> +		mutex_unlock(&cache_mutex);
> +		return -ENOMEM;
> +	}
> +
> +	pcqm_info = (css_to_cqm_info(parent_css));

The extra brackets are necessary to make it more readable, right?

> +	cqm_info->level = pcqm_info->level + 1;
> +	cqm_info->rmid = pcqm_info->rmid;
> +
> +	cqm_info->cont_mon = false;
> +	cqm_info->mon_enabled = false;
> +	INIT_LIST_HEAD(&cqm_info->tskmon_rlist);
> +	if (!pcqm_info->mfa)
> +		cqm_info->mfa = pcqm_info;
> +	else
> +		cqm_info->mfa = pcqm_info->mfa;

Comments would really confuse the reader here.

> +
> +	new_cgrp->arch_info = cqm_info;
> +	mutex_unlock(&cache_mutex);
> +
> +	return 0;
> +}
> +
>  void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
> -{}
> +{
> +	struct perf_cgroup *cgrp = css_to_perf_cgroup(css);
> +
> +	mutex_lock(&cache_mutex);
> +	kfree(cgrp_to_cqm_info(cgrp));
> +	cgrp->arch_info = NULL;

So the two lines above both deal with cgrp->arch_info. Why do you need that
extra conversion macro for the kfree() call?

> +	mutex_unlock(&cache_mutex);
> +}
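
The simpler shape the question points at might be (sketch; locking kept as in
the patch):

void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
{
	struct perf_cgroup *cgrp = css_to_perf_cgroup(css);

	mutex_lock(&cache_mutex);
	/* cgrp->arch_info is the very pointer the conversion macro resolves to */
	kfree(cgrp->arch_info);
	cgrp->arch_info = NULL;
	mutex_unlock(&cache_mutex);
}
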
> +
>  void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
>  {}
>  int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
> @@ -1053,6 +1206,12 @@ static int pkg_data_init_cpu(int cpu)
>  	entry = __rmid_entry(0, curr_pkgid);
>  	list_del(&entry->list);
>  
> +	cqm_rootcginfo.rmid = kzalloc(sizeof(u32) * cqm_socket_max, GFP_KERNEL);
> +	if (!cqm_rootcginfo.rmid) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}

Brilliant. So this allocates the cqm_rootcginfo.rmid array for each package
over and over. That's just a memory leak on boot, but when hotplugging a
full socket _AFTER_ initialization this is going to replace the previously
used and RMID populated array with a zeroed out one.

Why would a global allocation be placed into a per cpu data setup function?
Just because there happened to be a nice empty spot to hack it in?
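
One possible shape of the fix, as a sketch (function name and call site are
assumptions): allocate the root cgroup's RMID array exactly once from the
driver's global init path rather than from the per-CPU setup.

/* Sketch: called once, e.g. from intel_cqm_init(), not per package/CPU. */
static int __init cqm_alloc_root_rmids(void)
{
	cqm_rootcginfo.rmid = kcalloc(cqm_socket_max, sizeof(u32), GFP_KERNEL);

	return cqm_rootcginfo.rmid ? 0 : -ENOMEM;
}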

> +
>  	return 0;
>  fail:
>  	kfree(ccqm_rmid_ptrs);
> diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
> index b31081b..e11ed5e 100644
> --- a/arch/x86/include/asm/intel_rdt_common.h
> +++ b/arch/x86/include/asm/intel_rdt_common.h
> @@ -24,4 +24,68 @@ struct intel_pqr_state {
>  
>  DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>  
> +/**
> + * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
> + * @cont_mon     Continuous monitoring flag

Please check the syntax for kernel doc struct members ...

> + * @mon_enabled  Whether monitoring is enabled
> + * @level        Level in the cgroup tree. Root is level 0.
> + * @rmid        The rmids of the cgroup.
> + * @mfa          'Monitoring for ancestor' points to the cqm_info
> + *  of the ancestor the cgroup is monitoring for. 'Monitoring for ancestor'
> + *  means you will use an ancestors RMID at sched_in if you are
> + *  not monitoring yourself.
> + *
> + *  Due to the hierarchical nature of cgroups, every cgroup just
> + *  monitors for the 'nearest monitored ancestor' at all times.
> + *  Since root cgroup is always monitored, all descendents
> + *  at boot time monitor for root and hence all mfa points to root except
> + *  for root->mfa which is NULL.
> + *  1. RMID setup: When cgroup x start monitoring:
> + *    for each descendent y, if y's mfa->level < x->level, then
> + *    y->mfa = x. (Where level of root node = 0...)
> + *  2. sched_in: During sched_in for x
> + *    if (x->mon_enabled) choose x->rmid
> + *    else choose x->mfa->rmid.
> + *  3. read: for each descendent of cgroup x
> + *     if (x->monitored) count += rmid_read(x->rmid).
> + *  4. evt_destroy: for each descendent y of x, if (y->mfa == x) then
> + *     y->mfa = x->mfa. Meaning if any descendent was monitoring for x,
> + *     set that descendent to monitor for the cgroup which x was monitoring for.

That mfa member is not the right place for this extensive
documentation. Put it somewhere in the code where the whole machinery is
described.

> + * @tskmon_rlist List of tasks being monitored in the cgroup
> + *  When a task which belongs to a cgroup x is being monitored, it always uses
> + *  its own task->rmid even if cgroup x is monitored during sched_in.
> + *  To account for the counts of such tasks, cgroup keeps this list
> + *  and parses it during read.
> + *
> + *  Perf handles hierarchy for other events, but because RMIDs are per pkg
> + *  this is handled here.
> +*/
> +struct cgrp_cqm_info {
> +	bool cont_mon;
> +	bool mon_enabled;
> +	int level;
> +	u32 *rmid;
> +	struct cgrp_cqm_info *mfa;
> +	struct list_head tskmon_rlist;
> +};
> +
> +struct tsk_rmid_entry {
> +	u32 *rmid;
> +	struct list_head list;
> +};

Please make the struct members tabular, for readability's sake.
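
For reference, the tabular layout being asked for would look roughly like this:

struct cgrp_cqm_info {
	bool			cont_mon;
	bool			mon_enabled;
	int			level;
	u32			*rmid;
	struct cgrp_cqm_info	*mfa;
	struct list_head	tskmon_rlist;
};

struct tsk_rmid_entry {
	u32			*rmid;
	struct list_head	list;
};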

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-06 21:59 ` [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare Vikas Shivappa
  2017-01-17 12:11   ` Thomas Gleixner
  2017-01-17 13:46   ` Thomas Gleixner
@ 2017-01-17 15:26   ` Peter Zijlstra
  2017-01-17 20:27     ` Shivappa Vikas
  2 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2017-01-17 15:26 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, tglx,
	mingo, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, Jan 06, 2017 at 01:59:58PM -0800, Vikas Shivappa wrote:
> - Introduce event_terminate as event_destroy is called after cgrp is
> disassociated from the event to support rmid handling of the cgroup.
> This helps cqm clean up the cqm specific arch_info.

You've not even tried to audit the code to see if you can either move
the existing ->destroy() invocation or the perf_detach_cgroup() one,
have you?

Minimal APIs are a good thing, don't expand unless you absolutely have
to.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together
  2017-01-06 22:00 ` [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together Vikas Shivappa
@ 2017-01-17 16:11   ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 16:11 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:

> From: Vikas Shivappa<vikas.shivappa@intel.com>
> 
> This patch adds support to monitor a cgroup x and a task p1
> when p1 is part of cgroup x. Since we cannot write two RMIDs during
> sched_in, the driver handles this.

Again you explain WHAT not WHY....

> This patch introduces a u32 *rmid in the task_struct which keeps track
> of the RMIDs associated with the task.  There is also a list in the
> arch_info of perf_cgroup called taskmon_list which keeps track of tasks
> in the cgroup that are monitored.
> 
> The taskmon_list is modified in 2 scenarios.
> - at event_init of task p1 which is part of a cgroup, add the p1 to the
> cgroup->tskmon_list. At event_destroy delete the task from the list.
> - at the time of a task move from cgrp x to cgrp y, if the task was monitored
> remove the task from the cgrp x tskmon_list and add it to the cgrp y
> tskmon_list.
> 
> sched in: When the task p1 is scheduled in, we write the task RMID in
> the PQR_ASSOC MSR

Great information.

> read(for task p1): As any other cqm task event
> 
> read(for the cgroup x): When counting for cgroup, the taskmon list is
> traversed and the corresponding RMID counts are added.
> 
> Tests: Monitoring a cgroup x and a task within the cgroup x should
> work.

Emphasis on should.

> +static inline int add_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
> +{
> +	struct tsk_rmid_entry *entry;
> +
> +	entry = kzalloc(sizeof(struct tsk_rmid_entry), GFP_KERNEL);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&entry->list);
> +	entry->rmid = rmid;
> +
> +	list_add_tail(&entry->list, l);
> +
> +	return 0;

And why does this function have a return value? The return value is never
evaluated at any call site.

> +}
> +
> +static inline void del_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
> +{
> +	struct tsk_rmid_entry *entry = NULL, *tmp1;

And where is *tmp2? What is wrong with simply *tmp?

> +
> +	list_for_each_entry_safe(entry, tmp1, l, list) {
> +		if (entry->rmid == rmid) {
> +
> +			list_del(&entry->list);
> +			kfree(entry);
> +			break;
> +		}
> +	}
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
>  {
> @@ -380,6 +410,49 @@ struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
>  }
>  #endif
>  
> +static inline void
> +	cgrp_tskmon_update(struct task_struct *tsk, u32 *rmid, bool ena)

Sigh

> +{
> +	struct cgrp_cqm_info *ccinfo = NULL;
> +
> +#ifdef CONFIG_CGROUP_PERF
> +	ccinfo = cqminfo_from_tsk(tsk);
> +#endif
> +	if (!ccinfo)
> +		return;
> +
> +	if (ena)
> +		add_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
> +	else
> +		del_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
> +}
> +
> +static int cqm_assign_task_rmid(struct perf_event *event, u32 *rmid)
> +{
> +	struct task_struct *tsk;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +	tsk = event->hw.target;
> +	if (pid_alive(tsk)) {
> +		get_task_struct(tsk);

This works because after the pid_alive() check the task cannot be
released before issuing get_task_struct(), right?

That's voodoo protection. How would a non alive task end up as event
target?

> +
> +		if (rmid != NULL)
> +			cgrp_tskmon_update(tsk, rmid, true);
> +		else
> +			cgrp_tskmon_update(tsk, tsk->rmid, false);
> +
> +		tsk->rmid = rmid;
> +
> +		put_task_struct(tsk);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
>  {
>  	if (rmid != NULL) {
> @@ -429,8 +502,12 @@ static void cqm_assign_hier_rmid(struct cgroup_subsys_state *rcss, u32 *rmid)
>  
>  static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
>  {
> +	if (is_task_event(event)) {
> +		if (cqm_assign_task_rmid(event, rmid))
> +			return -EINVAL;
> +	}
>  #ifdef CONFIG_CGROUP_PERF
> -	if (is_cgroup_event(event)) {
> +	else if (is_cgroup_event(event)) {
>  		cqm_assign_hier_rmid(&event->cgrp->css, rmid);
>  	}

So you keep adding stuff to cqm_assign_rmid() which handles enable and
disable. But the only call site is in cqm_event_free_rmid() which calls
that function with rmid = NULL, i.e. disable.

Can you finally explain how this is supposed to work and how all of this
has been tested and validated?

If you had used the already known 'Tests: Same as before' line to the
changelog, then we would have known that it's broken as before w/o looking
at the patch.

So the new variant of 'broken' is: Bla should work ....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/12] x86/cqm: Add RMID reuse
  2017-01-06 22:00 ` [PATCH 09/12] x86/cqm: Add RMID reuse Vikas Shivappa
@ 2017-01-17 16:59   ` Thomas Gleixner
  2017-01-18  0:26     ` Shivappa Vikas
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 16:59 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> +static void cqm_schedule_rmidwork(int domain);

This forward declaration is required because all callers of that function
are coming _after_ the function implementation, right?

> +static inline bool is_first_cqmwork(int domain)
> +{
> +	return (!atomic_cmpxchg(&cqm_pkgs_data[domain]->reuse_scheduled, 0, 1));

What's the purpose of these outer brackets? Enhanced readability?

> +}
> +
>  static void __put_rmid(u32 rmid, int domain)
>  {
>  	struct cqm_rmid_entry *entry;
> @@ -294,6 +301,93 @@ static void cqm_mask_call(struct rmid_read *rr)
>  static unsigned int __intel_cqm_threshold;
>  static unsigned int __intel_cqm_max_threshold;
>  
> +/*
> + * Test whether an RMID has a zero occupancy value on this cpu.

This tests whether the occupancy value is less than
__intel_cqm_threshold. Unless I'm missing something, the value can be set by
user space and is therefore not necessarily zero.

Your commentary is really useful: it's either wrong, superfluous or
nonexistent.

> + */
> +static void intel_cqm_stable(void)
> +{
> +	struct cqm_rmid_entry *entry;
> +	struct list_head *llist;
> +
> +	llist = &cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru;
> +	list_for_each_entry(entry, llist, list) {
> +
> +		if (__rmid_read(entry->rmid) < __intel_cqm_threshold)
> +			entry->state = RMID_AVAILABLE;
> +	}
> +}
> +
> +static void __intel_cqm_rmid_reuse(void)
> +{
> +	struct cqm_rmid_entry *entry, *tmp;
> +	struct list_head *llist, *flist;
> +	struct pkg_data *pdata;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&cache_lock, flags);
> +	pdata = cqm_pkgs_data[pkg_id];
> +	llist = &pdata->cqm_rmid_limbo_lru;
> +	flist = &pdata->cqm_rmid_free_lru;
> +
> +	if (list_empty(llist))
> +		goto end;
> +	/*
> +	 * Test whether an RMID is free
> +	 */
> +	intel_cqm_stable();
> +
> +	list_for_each_entry_safe(entry, tmp, llist, list) {
> +
> +		if (entry->state == RMID_DIRTY)
> +			continue;
> +		/*
> +		 * Otherwise remove from limbo and place it onto the free list.
> +		 */
> +		list_del(&entry->list);
> +		list_add_tail(&entry->list, flist);

This is really a performance-optimized implementation. Iterate the limbo
list first and check the occupancy value, conditionally updating the state,
and then iterate the same list again and conditionally move the entries which
were just updated to the free list. Of course all of this happens with
interrupts disabled and a global lock held, to enhance it further.

> +	}
> +
> +end:
> +	raw_spin_unlock_irqrestore(&cache_lock, flags);
> +}
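
A hedged sketch of the single-pass variant the comment implies: check the
occupancy and move the entry to the free list in the same walk, under one lock
acquisition.

/* Sketch: one walk of the limbo list replaces intel_cqm_stable() plus the
 * second loop; the names are the ones used in the patch. */
list_for_each_entry_safe(entry, tmp, llist, list) {
	if (__rmid_read(entry->rmid) < __intel_cqm_threshold) {
		entry->state = RMID_AVAILABLE;
		list_move_tail(&entry->list, flist);
	}
}
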
> +
> +static bool reschedule_cqm_work(void)
> +{
> +	unsigned long flags;
> +	bool nwork = false;
> +
> +	raw_spin_lock_irqsave(&cache_lock, flags);
> +
> +	if (!list_empty(&cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru))
> +		nwork = true;
> +	else
> +		atomic_set(&cqm_pkgs_data[pkg_id]->reuse_scheduled, 0U);

0U because the 'val' argument of atomic_set() is 'int', right?

> +	raw_spin_unlock_irqrestore(&cache_lock, flags);
> +
> +	return nwork;
> +}
> +
> +static void cqm_schedule_rmidwork(int domain)
> +{
> +	struct delayed_work *dwork;
> +	unsigned long delay;
> +
> +	dwork = &cqm_pkgs_data[domain]->intel_cqm_rmid_work;
> +	delay = msecs_to_jiffies(RMID_DEFAULT_QUEUE_TIME);
> +
> +	schedule_delayed_work_on(cqm_pkgs_data[domain]->rmid_work_cpu,
> +			     dwork, delay);
> +}
> +
> +static void intel_cqm_rmid_reuse(struct work_struct *work)

Your naming conventions are really random: cqm_*, intel_cqm_* and then
totally random function names. Is there any deeper thought involved or is it
just what it looks like: random?

> +{
> +	__intel_cqm_rmid_reuse();
> +
> +	if (reschedule_cqm_work())
> +		cqm_schedule_rmidwork(pkg_id);

Great stuff. You first try to move the limbo RMIDs to the free list and
then you evaluate the limbo list yet again, thereby acquiring and dropping
the global cache lock several times.

Dammit, the stupid reuse function already knows whether the list is empty
or not. But returning that information would make too much sense.

> +}
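
Following that remark, a sketch of the worker if __intel_cqm_rmid_reuse() were
changed to report whether limbo entries remain (the bool return is an
assumption):

static void intel_cqm_rmid_reuse(struct work_struct *work)
{
	/* The reuse pass reports, under its own lock, whether limbo work remains. */
	if (__intel_cqm_rmid_reuse())
		cqm_schedule_rmidwork(pkg_id);
	else
		atomic_set(&cqm_pkgs_data[pkg_id]->reuse_scheduled, 0);
}
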
> +
>  static struct pmu intel_cqm_pmu;
>  
>  static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
> @@ -548,6 +642,8 @@ static int intel_cqm_setup_event(struct perf_event *event,
>  	if (!event->hw.cqm_rmid)
>  		return -ENOMEM;
>  
> +	cqm_assign_rmid(event, event->hw.cqm_rmid);

Oh, so now there is a second user which calls cqm_assign_rmid(). Finally it
might actually do something. Whether that something is useful and
functional is another matter; I seriously doubt it.

> +
>  	return 0;
>  }
>  
> @@ -863,6 +959,7 @@ static void intel_cqm_event_terminate(struct perf_event *event)
>  {
>  	struct perf_event *group_other = NULL;
>  	unsigned long flags;
> +	int d;
>  
>  	mutex_lock(&cache_mutex);
>  	/*
> @@ -905,6 +1002,13 @@ static void intel_cqm_event_terminate(struct perf_event *event)
>  		mbm_stop_timers();
>  
>  	mutex_unlock(&cache_mutex);
> +
> +	for (d = 0; d < cqm_socket_max; d++) {
> +
> +		if (cqm_pkgs_data[d] != NULL && is_first_cqmwork(d)) {
> +			cqm_schedule_rmidwork(d);
> +		}

Let's look at the logic of this.

When the event terminates, you unconditionally schedule the rmid work
on ALL packages, whether the event was freed or not. Really brilliant.

There is no reason to schedule anything if the event was never using any
rmid on a given package or if the RMIDs are not freed because there is a
new group leader.

The NOHZ FULL people will love that extra work activity for nothing.

Also the detection of whether the work is already scheduled is intuitive as
usual: is_first_cqmwork()? What???? !cqmwork_scheduled() would be too
simple to understand, right?

Oh well.

> +	}
>  }
>  
>  static int intel_cqm_event_init(struct perf_event *event)
> @@ -1322,6 +1426,9 @@ static int pkg_data_init_cpu(int cpu)
>  	mutex_init(&pkg_data->pkg_data_mutex);
>  	raw_spin_lock_init(&pkg_data->pkg_data_lock);
>  
> +	INIT_DEFERRABLE_WORK(
> +		&pkg_data->intel_cqm_rmid_work, intel_cqm_rmid_reuse);

Stop this crappy formatting.

	INIT_DEFERRABLE_WORK(&pkg_data->intel_cqm_rmid_work,
			     intel_cqm_rmid_reuse);

is the canonical way to do line breaks. Is it really that hard?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option
  2017-01-16 18:05   ` Thomas Gleixner
@ 2017-01-17 17:25     ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-17 17:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Mon, 16 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> Add a compile option INTEL_RDT which enables common code for all
>> RDT(Resource director technology) and a specific INTEL_RDT_M which
>> enables code for RDT monitoring. CQM(cache quality monitoring) and
>> mbm(memory b/w monitoring) are part of Intel RDT monitoring.
>
> If we handle this with its own config option, can you please make this
> thing modular? There is no point to have this compiled in and wasting
> memory if it's not supported by the CPU.

Will fix, can change this to a module...

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
                   ` (11 preceding siblings ...)
  2017-01-06 22:00 ` [PATCH 12/12] perf/stat: revamp read error handling, snapshot and per_pkg events Vikas Shivappa
@ 2017-01-17 17:31 ` Thomas Gleixner
  2017-01-18  2:38   ` Shivappa Vikas
  12 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 17:31 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> Cqm(cache quality monitoring) is part of Intel RDT(resource director
> technology) which enables monitoring and controlling of processor shared
> resources via MSR interface.

We know that already. No need for advertising this over and over.

> Below are the issues and the fixes we attempt-

Emphasis on attempt.

> - Issue(1): Inaccurate data for per package data, systemwide. Just prints
> zeros or arbitrary numbers.
> 
> Fix: Patches fix this by just throwing an error if the mode is not supported. 
> The modes supported is task monitoring and cgroup monitoring. 
> Also the per package
> data for say socket x is returned with the -C <cpu on socketx> -G cgrpy option.
> The systemwide data can be looked up by monitoring root cgroup.

Fine. That just lacks any comment in the implementation. Otherwise I would
not have asked the question about CPU monitoring. Though I fundamentally
hate the idea of requiring cgroups for this to work.

If I just want to look at CPU X why on earth do I have to set up all that
cgroup muck? Just because your main focus is cgroups?

> - Issue(2): RMIDs are global and dont scale with more packages and hence
> also run out of RMIDs very soon.
> 
> Fix: Support per pkg RMIDs hence scale better with more
> packages, and get more RMIDs to use and use when needed (ie when tasks
> are actually scheduled on the package).

That's fine; it's just that the implementation is completely broken.

> - Issue(5): CAT and cqm/mbm write the same PQR_ASSOC_MSR seperately
> Fix: Integrate the sched in code and write the PQR_MSR only once every switch_to 

Brilliant stuff, that. I bet that the separate MSR writes which we have now
are actually faster, even in the worst case of two writes.

> [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling

That's the only undisputed and probably correct patch in that whole series.

Executive summary: The whole series except 2/12 is a complete trainwreck.

Major issues:

- The patch split is random and leads to non-compilable and non-functional
  steps, aside from creating an unreviewable mess.

- The core code is updated with random hooks which claim to be a generic
  framework but are completely hardcoded for that particular CQM case.

  The introduced hooks are neither fully documented (semantically and
  functionally) nor is it justified why they are required.

- The code quality is horrible in coding style, technical correctness and
  design.

So how can we proceed from here?

I want to see small patch series which change a certain aspect of the
implementation and only that. They need to be split in preparatory changes,
which refactor code or add new interfaces to the core code, and the actual
implementation in CQM/MBM.

Each of the patches must have a proper changelog explaining the WHY and if
required the semantical properties of a new interface.

Each of the patches must compile without warnings/errors and be fully
functional vs. the changes it makes.

Any attempt to resend this disaster as a whole will be NACKed w/o even
looking at it. I've wasted enough time with this wholesale approach in the
past and it did not work out. I'm not going to play that wasteful game
over and over again.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/12] x86/cqm: Add Per pkg rmid support\
  2017-01-16 18:15   ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support\ Thomas Gleixner
@ 2017-01-17 19:11     ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-17 19:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Mon, 16 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>
>> Subject : [PATCH 04/12] x86/cqm: Add Per pkg rmid support
>
> Can you please be a bit more careful about the subject lines? 'Per' wants
> to be 'per' and pkg really can be written as package. There is no point in
> cryptic abbreviations that add no value. Finally, RMID wants to be all uppercase,
> as you use it in the text below.

Will fix and write more readable names.

>
>> Patch introduces a new cqm_pkgs_data to keep track of the per package
>
> Again. We know that this is a patch.....
>
>> free list, limbo list and other locking structures.
>
> So free list and limbo list are locking structures, interesting.
>
>> The corresponding rmid structures in the perf_event is
>
> s/structures/structure/

Will fix

>
>> changed to hold an array of u32 RMIDs instead of a single u32.
>>
>> The RMIDs are not assigned at the time of event creation and are
>> assigned in lazy mode at the first sched_in time for a task, so an
>> RMID is never allocated if a task is not scheduled on a package. This
>> makes better use of RMIDs and scales with the increasing number of
>> sockets/packages.
>>
>> Locking:
>> event list - perf init and terminate hold mutex. spin lock is held to
>> guard against the mbm hrtimer.
>> per pkg free and limbo list - global spin lock. Used by
>> get_rmid,put_rmid, perf start, terminate
>
> Locking documentation wants to be in the code not in some random changelog.

Will add this to the code where I do the locking.

>
>> Tests: RMIDs available increase by x times where x is number of sockets
>> and the usage is dynamic so we save more.
>
> What does 'Tests:' mean? Is there a test in this patch? If so, I seem to be
> missing it.

The patch does not include a test case. Will add a description of the actual tests.
Will that work?

In this case, the test was done as below:

On a dual-socket BDW (Broadwell) system:
- For testing, force max_rmid to a small value x.
- Run x threads, each affinitized to one socket.
- This gives valid data with the per-package patch.

>
>> Patch is based on David Carrillo-Cisneros <davidcc@google.com> patches
>> in cqm2 series.
>
> We document such attributions with a tag:
>
> Originally-From: David Carrillo-Cisneros <davidcc@google.com>
>
> That way tools can pick it up.

OK, will fix this on all patches rather than using the confusing 'based on' in the
changelogs.

>
>>  static cpumask_t cqm_cpumask;
>>
>> +struct pkg_data **cqm_pkgs_data;
>
> Why is this global? There is no user outside of cqm.c AFAICT.

will fix

>
>> -/*
>> - * We use a simple array of pointers so that we can lookup a struct
>> - * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
>> - * and __put_rmid() from having to worry about dealing with struct
>> - * cqm_rmid_entry - they just deal with rmids, i.e. integers.
>> - *
>> - * Once this array is initialized it is read-only. No locks are required
>> - * to access it.
>> - *
>> - * All entries for all RMIDs can be looked up in the this array at all
>> - * times.
>
> So this comment was actually valuable. Sure, it does not match the new
> implementation, but the basic principle is still the same. The comment
> wants to be updated and not just removed.
>
>>   *
>>   * We expect to be called with cache_mutex held.
>
> You rename cache_mutex to cache_lock, but updating the comments is
> optional, right?
>
>> -static u32 __get_rmid(void)
>> +static u32 __get_rmid(int domain)
>
>  unsigned int domain please. The domain cannot be negative.
>
>>  {
>> +	struct list_head *cqm_flist;
>>  	struct cqm_rmid_entry *entry;
>
> Please keep the variables as a reverse fir tree:
>
>>  	struct cqm_rmid_entry *entry;
>> +	struct list_head *cqm_flist;
>
>>
>> -	lockdep_assert_held(&cache_mutex);
>> +	lockdep_assert_held(&cache_lock);
>
>> -static void __put_rmid(u32 rmid)
>> +static void __put_rmid(u32 rmid, int domain)
>>  {
>>  	struct cqm_rmid_entry *entry;
>>
>> -	lockdep_assert_held(&cache_mutex);
>> +	lockdep_assert_held(&cache_lock);
>>
>> -	WARN_ON(!__rmid_valid(rmid));
>> -	entry = __rmid_entry(rmid);
>> +	WARN_ON(!rmid);
>
> What's wrong with __rmid_valid() ?
>
>> +	entry = __rmid_entry(rmid, domain);
>>
>> +	kfree(cqm_pkgs_data);
>>  }
>>
>> +
>
> Random white space noise.
>

Will fix with respect to all comments above

>>  static bool is_cqm_event(int e)
>> @@ -420,10 +349,11 @@ static void __intel_mbm_event_init(void *info)
>>  {
>>  	struct rmid_read *rr = info;
>>
>> -	update_sample(rr->rmid, rr->evt_type, 1);
>> +	if (__rmid_valid(rr->rmid[pkg_id]))
>
> Where the heck is pkg_id coming from?
>
> Ahh:
>
> #define pkg_id  topology_physical_package_id(smp_processor_id())
>
> That's crap. It's existing crap, but nevertheless crap.
>
> First of all it's irritating as 'pkg_id' looks like a variable and in this
> case it would be at least file global, which makes no sense at all.
>
> Secondly, we really want to move this to the logical package id as that is
> actually guaranteeing that the package id is smaller than
> topology_max_packages(). The physical package id has no such guarantee.
>
> And then please keep this local so its simple to read and parse:

OK, this pkg_id was from the MBM patch. Will change this and fix it, probably
with a separate patch to change this first, as it's already used
upstream.

>
>    	 unsigned int pkgid = topology_logical_package_id(smp_processor_id());
>
>> +		update_sample(rr->rmid[pkg_id], rr->evt_type, 1);
>>  }
>
>> @@ -444,7 +374,7 @@ static int intel_cqm_setup_event(struct perf_event *event,
>>  				  struct perf_event **group)
>>  {
>>  	struct perf_event *iter;
>> -	u32 rmid;
>> +	u32 *rmid, sizet;
>
> What's wrong with size? It's not confusable with size_t, right?
>
>> +void alloc_needed_pkg_rmid(u32 *cqm_rmid)
>
> Again: Why is this global?
>
> And what's the meaning of needed in the function name? If the RMID wouldn't
> be needed then the function would not be called, right?

This is related to the other problem you commented on a lot, that I added
structures/functions and only used them later in a separate patch (hence the
compiler warnings and a whole bunch of such issues). This one is used later -

I will organize the patches better to fix all of these.

>
>> +{
>> +	unsigned long flags;
>> +	u32 rmid;
>> +
>> +	if (WARN_ON(!cqm_rmid))
>> +		return;
>> +
>> +	if (cqm_rmid[pkg_id])
>> +		return;
>> +
>> +	raw_spin_lock_irqsave(&cache_lock, flags);
>> +
>> +	rmid = __get_rmid(pkg_id);
>> +	if (__rmid_valid(rmid))
>> +		cqm_rmid[pkg_id] = rmid;
>> +
>> +	raw_spin_unlock_irqrestore(&cache_lock, flags);
>> +}
>> +
>>  static void intel_cqm_event_start(struct perf_event *event, int mode)
>>  {
>>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> -	u32 rmid = event->hw.cqm_rmid;
>> +	u32 rmid;
>>
>>  	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
>>  		return;
>>
>>  	event->hw.cqm_state &= ~PERF_HES_STOPPED;
>>
>> +	alloc_needed_pkg_rmid(event->hw.cqm_rmid);
>> +
>> +	rmid = event->hw.cqm_rmid[pkg_id];
>
> The allocation can fail. What's in event->hw.cqm_rmid[pkg_id] then? Zero or
> what? And why is either of those values fine?
>
> Can you please add comments so reviewers and readers do not have to figure
> out every substantial detail themselves?

If the allocation fails, it will have zero, which is the default. And zero is
the default RMID for all threads. A later patch reports an error to
the user if an RMID wasn't available. Will comment this.

>
>>  	state->rmid = rmid;
>>  	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
>
>> +static inline void
>> +	cqm_event_free_rmid(struct perf_event *event)
>
> Yet another coding style variant from your unlimited supply of coding
> horrors.
>
>> +{
>> +	u32 *rmid = event->hw.cqm_rmid;
>> +	int d;
>> +
>> +	for (d = 0; d < cqm_socket_max; d++) {
>> +		if (__rmid_valid(rmid[d]))
>> +			__put_rmid(rmid[d], d);
>> +	}
>> +	kfree(event->hw.cqm_rmid);
>> +	list_del(&event->hw.cqm_groups_entry);
>> +}
>>  static void intel_cqm_event_destroy(struct perf_event *event)
>>  {
>>  	struct perf_event *group_other = NULL;
>> @@ -737,16 +693,11 @@ static void intel_cqm_event_destroy(struct perf_event *event)
>>  		 * If there was a group_other, make that leader, otherwise
>>  		 * destroy the group and return the RMID.
>>  		 */
>> -		if (group_other) {
>> +		if (group_other)
>>  			list_replace(&event->hw.cqm_groups_entry,
>>  				     &group_other->hw.cqm_groups_entry);
>
> Please keep the brackets. See:
>
> http://lkml.kernel.org/r/alpine.DEB.2.20.1609101416420.32361@nanos
>
>> -		} else {
>> -			u32 rmid = event->hw.cqm_rmid;
>> -
>> -			if (__rmid_valid(rmid))
>> -				__put_rmid(rmid);
>> -			list_del(&event->hw.cqm_groups_entry);
>> -		}
>> +		else
>> +			cqm_event_free_rmid(event);
>>  	}
>>
>>  	raw_spin_unlock_irqrestore(&cache_lock, flags);
>
>> +static int pkg_data_init_cpu(int cpu)
>
> cpus are unsigned int
>
>> +{
>> +	struct cqm_rmid_entry *ccqm_rmid_ptrs = NULL, *entry = NULL;
>
> What are the pointer initializations for?
>
>> +	int curr_pkgid = topology_physical_package_id(cpu);
>
> Please use logical packages.

Will fix this and all the coding style errors pointed out.

>
>> +	struct pkg_data *pkg_data = NULL;
>> +	int i = 0, nr_rmids, ret = 0;
>
> Crap. 'i' is used in the for() loop below and therefore gets initialized exactly
> there and not at some random other place. ret is a completely pointless
> variable and the initialization is even more pointless. There is exactly
> one code path using it (fail) and you can just return -ENOMEM from there.
>
>> +	if (cqm_pkgs_data[curr_pkgid])
>> +		return 0;
>> +
>> +	pkg_data = kzalloc_node(sizeof(struct pkg_data),
>> +				GFP_KERNEL, cpu_to_node(cpu));
>> +	if (!pkg_data)
>> +		return -ENOMEM;
>> +
>> +	INIT_LIST_HEAD(&pkg_data->cqm_rmid_free_lru);
>> +	INIT_LIST_HEAD(&pkg_data->cqm_rmid_limbo_lru);
>> +
>> +	mutex_init(&pkg_data->pkg_data_mutex);
>> +	raw_spin_lock_init(&pkg_data->pkg_data_lock);
>> +
>> +	pkg_data->rmid_work_cpu = cpu;
>> +
>> +	nr_rmids = cqm_max_rmid + 1;
>> +	ccqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry) *
>> +			   nr_rmids, GFP_KERNEL);
>> +	if (!ccqm_rmid_ptrs) {
>> +		ret = -ENOMEM;
>> +		goto fail;
>> +	}
>> +
>> +	for (; i <= cqm_max_rmid; i++) {
>> +		entry = &ccqm_rmid_ptrs[i];
>> +		INIT_LIST_HEAD(&entry->list);
>> +		entry->rmid = i;
>> +
>> +		list_add_tail(&entry->list, &pkg_data->cqm_rmid_free_lru);
>> +	}
>> +
>> +	pkg_data->cqm_rmid_ptrs = ccqm_rmid_ptrs;
>> +	cqm_pkgs_data[curr_pkgid] = pkg_data;
>> +
>> +	/*
>> +	 * RMID 0 is special and is always allocated. It's used for all
>> +	 * tasks that are not monitored.
>> +	 */
>> +	entry = __rmid_entry(0, curr_pkgid);
>> +	list_del(&entry->list);
>> +
>> +	return 0;
>> +fail:
>> +	kfree(ccqm_rmid_ptrs);
>> +	ccqm_rmid_ptrs = NULL;
>
> And what value does clearing the local variable have?
>
>> +	kfree(pkg_data);
>> +	pkg_data = NULL;
>> +	cqm_pkgs_data[curr_pkgid] = NULL;
>
> It never got set, so why do you need to clear it? Just because you do not
> trust the compiler?
>
>> +	return ret;
>> +}
>> +
>> +static int cqm_init_pkgs_data(void)
>> +{
>> +	int i, cpu, ret = 0;
>
> Again, pointless 0 initialization.
>
>> +	cqm_pkgs_data = kzalloc(
>> +		sizeof(struct pkg_data *) * cqm_socket_max,
>> +		GFP_KERNEL);
>
> Using a simple local variable to calculate size first would spare this new
> variant of coding style horror.
>
>> +	if (!cqm_pkgs_data)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < cqm_socket_max; i++)
>> +		cqm_pkgs_data[i] = NULL;
>
> Surely kzalloc() is not reliable enough, so make sure that the pointers are
> really NULL. Brilliant stuff.

Will fix all the initialization and repeated zeroing indicated.
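
The form with a local size variable that the comment asks for might be (sketch):

	size_t size = sizeof(*cqm_pkgs_data) * cqm_socket_max;

	/* kzalloc() already returns zeroed memory, so no manual NULL loop is needed. */
	cqm_pkgs_data = kzalloc(size, GFP_KERNEL);
	if (!cqm_pkgs_data)
		return -ENOMEM;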

>
>> +
>> +	for_each_online_cpu(cpu) {
>> +		ret = pkg_data_init_cpu(cpu);
>> +		if (ret)
>> +			goto fail;
>> +	}
>
> And that's protected against CPU hotplug in which way?
>
> Aside of that. What allocates the package data for packages which come
> online _AFTER_ this function is called? Nothing AFAICT.
>
> What you really need is a CPU hotplug callback for the prepare stage, which
> does this setup race-free and also handles packages which come
> online after this.

Will fix the init and protect this against CPU hotplug (get_online_cpus()).
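
A hedged sketch of the prepare-stage hotplug callback being asked for; the
cpuhp state constant, callback name and state string are assumptions:

static int intel_cqm_prep_cpu(unsigned int cpu)
{
	/* Runs before the CPU comes up; also covers packages plugged in later. */
	return pkg_data_init_cpu(cpu);
}

/* In the init path, instead of the for_each_online_cpu() loop: */
ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "perf/x86/cqm:prepare",
			intel_cqm_prep_cpu, NULL);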

>
>> diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
>> new file mode 100644
>> index 0000000..4415497
>> --- /dev/null
>> +++ b/arch/x86/events/intel/cqm.h
>> @@ -0,0 +1,37 @@
>> +#ifndef _ASM_X86_CQM_H
>> +#define _ASM_X86_CQM_H
>> +
>> +#ifdef CONFIG_INTEL_RDT_M
>> +
>> +#include <linux/perf_event.h>
>> +
>> +/**
>> + * struct pkg_data - cqm per package(socket) meta data
>> + * @cqm_rmid_free_lru    A least recently used list of free RMIDs
>> + *     These RMIDs are guaranteed to have an occupancy less than the
>> + * threshold occupancy
>
> You certainly did not even try to run kernel doc on this.
>
>    @var:  Explanation
>
> is the proper format. Can you spot the difference?
>
> Also the multi-line comments are horribly formatted:
>
>   @var:       	    This is a multi-line comment which is necessary because
>   		    it needs a lot of text......
>
>> + * @cqm_rmid_entry - The entry in the limbo and free lists.
>
> And this is incorrect as well. You are not even trying to make stuff
> consistently wrong.
>
>> + * @delayed_work - Work to reuse the RMIDs that have been freed.
>> + * @rmid_work_cpu - The cpu on the package on which work is scheduled.
>
> Also the formatting wants to be tabular:
>
> * @var:		Text
> * @long_var:		Text
> * @really_long_var:	Text

Will fix the comment formats.
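
A sketch of the requested kernel-doc shape for the structure in question; the
member names are taken from the code quoted earlier, descriptions abridged:

/**
 * struct pkg_data - cqm per-package (socket) metadata
 * @cqm_rmid_free_lru:		LRU list of free RMIDs, guaranteed to be below
 *				the occupancy threshold
 * @cqm_rmid_limbo_lru:		RMIDs freed but still above the threshold
 * @cqm_rmid_ptrs:		entries backing the free and limbo lists
 * @intel_cqm_rmid_work:	delayed work to reuse freed RMIDs
 * @rmid_work_cpu:		CPU on this package the work is scheduled on
 */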

Thanks,
Vikas



>
> Sigh,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-17 13:46   ` Thomas Gleixner
@ 2017-01-17 20:22     ` Shivappa Vikas
  2017-01-17 21:31       ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-17 20:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Tue, 17 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> @@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
>>  	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
>>  	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
>>
>> -	event->destroy = intel_cqm_event_destroy;
>
> I missed this in the first round, but tripped over it when looking at one
> of the follow up patches.
>
> How is that supposed to work?
>
> 1) intel_cqm_event_destroy() is still in the code and unused which emits a
>   compiler warning, but that can obviously be ignored for a good measure.
>
> 2) How would any testing of this mess actually work?
>
>   Not at all. Nothing ever tears down an event. So you just leave
>   everything hanging around, probably with dangling pointers left and
>   right.

The terminate is defined in the next patch. Will fix this, as we don't need all this
new API and the cgroup CQM-specific structures can be freed with cgroup hooks
instead of creating this new one.

Thanks,
Vikas

>
> So now the 'Tests: Same as before.' in the so called changelog makes sense:
>
>   'Same as before' means: Completely untested and broken.
>
> Thanks,
>
> 	tglx
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-17 15:26   ` Peter Zijlstra
@ 2017-01-17 20:27     ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-17 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, tglx, mingo, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Tue, 17 Jan 2017, Peter Zijlstra wrote:

> On Fri, Jan 06, 2017 at 01:59:58PM -0800, Vikas Shivappa wrote:
>> - Introduce event_terminate as event_destroy is called after cgrp is
>> disassociated from the event to support rmid handling of the cgroup.
>> This helps cqm clean up the cqm specific arch_info.
>
> You've not even tried to audit the code to see if you can either move
> the existing ->destroy() invocation or the perf_detach_cgroup() one,
> have you?
>
> Minimal APIs are a good thing, don't expand unless you absolutely have
> to.

Yes, perf_detach_cgroup() can just do the job. Thanks for pointing it out. Will
fix.

>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-17 20:22     ` Shivappa Vikas
@ 2017-01-17 21:31       ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 21:31 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: Vikas Shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Tue, 17 Jan 2017, Shivappa Vikas wrote:
> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
> 
> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> > > @@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event
> > > *event)
> > >  	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
> > >  	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
> > > 
> > > -	event->destroy = intel_cqm_event_destroy;
> > 
> > I missed this in the first round, but tripped over it when looking at one
> > of the follow up patches.
> > 
> > How is that supposed to work?
> > 
> > 1) intel_cqm_event_destroy() is still in the code and unused which emits a
> >   compiler warning, but that can obviously be ignored for a good measure.
> > 
> > 2) How would any testing of this mess actually work?
> > 
> >   Not at all. Nothing ever tears down an event. So you just leave
> >   everything hanging around, probably with dangling pointers left and
> >   right.
> 
> The terminate is defined in the next patch.

I know and that does not make it any better. It's broken, end of story.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/12] x86/rdt,cqm: Scheduling support update
  2017-01-06 22:00 ` [PATCH 07/12] x86/rdt,cqm: Scheduling support update Vikas Shivappa
@ 2017-01-17 21:58   ` Thomas Gleixner
  2017-01-17 22:30     ` Shivappa Vikas
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-17 21:58 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> Introduce a scheduling hook finish_arch_pre_lock_switch which is
> called just after the perf sched_in during context switch. This method
> handles both cat and cqm sched in scenarios.

Sure, we need yet another special hook. What's wrong with
finish_arch_post_lock_switch()?

And again. This wants to be a separate patch to the core code with a proper
justification for that hook. Dammit. Changelogs are supposed to explain WHY
not WHAT. How often do I have to explain that again?

> The IA32_PQR_ASSOC MSR is used by cat(cache allocation) and cqm and this
> patch integrates the two msr writes to one. The common sched_in patch
> checks if the per cpu cache has a different RMID or CLOSid than the task
> and does the MSR write.
> 
> During sched_in the task uses the task RMID if the task is monitored or
> else uses the task's cgroup rmid.

And that's relevant for that patch because it explains the existing
behaviour of the RMID, right?

Darn, again you create an unreviewable hodgepodge of changes. The whole
split of the RMID handling into a perf part and the actual RMID update can
be done as a separate patch before switching over to the combined
RMID/CLOSID update mechanism.

> +DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
> +
>  /*
>   * Groups of events that have the same target(s), one RMID per group.
>   */
> @@ -108,7 +103,7 @@ struct sample {
>   * Likewise, an rmid value of -1 is used to indicate "no rmid currently
>   * assigned" and is used as part of the rotation code.
>   */
> -static inline bool __rmid_valid(u32 rmid)
> +bool __rmid_valid(u32 rmid)

And once more this becomes global because there is no user outside of cqm.c.

>  {
>  	if (!rmid || rmid > cqm_max_rmid)
>  		return false;
> @@ -161,7 +156,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
>   *
>   * We expect to be called with cache_mutex held.
>   */
> -static u32 __get_rmid(int domain)
> +u32 __get_rmid(int domain)

Ditto.

>  {
>  	struct list_head *cqm_flist;
>  	struct cqm_rmid_entry *entry;
> @@ -368,6 +363,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
>  	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
>  }
>  
> +#ifdef CONFIG_CGROUP_PERF
> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
> +{
> +	struct cgrp_cqm_info *ccinfo = NULL;
> +	struct perf_cgroup *pcgrp;
> +
> +	pcgrp = perf_cgroup_from_task(tsk, NULL);
> +
> +	if (!pcgrp)
> +		return NULL;
> +	else
> +		ccinfo = cgrp_to_cqm_info(pcgrp);
> +
> +	return ccinfo;

What the heck? Either you do:

	struct cgrp_cqm_info *ccinfo = NULL;
	struct perf_cgroup *pcgrp;

	pcgrp = perf_cgroup_from_task(tsk, NULL);
	if (pcgrp)
		ccinfo = cgrp_to_cqm_info(pcgrp);

	return ccinfo;

or

	struct perf_cgroup *pcgrp;

	pcgrp = perf_cgroup_from_task(tsk, NULL);
	if (pcgrp)
		return cgrp_to_cqm_info(pcgrp);
	return NULL;

But the above combination does not make any sense at all. Hacking at it until
it compiles and works by chance is not really a good engineering principle.

> +}
> +#endif
> +
>  static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
>  {
>  	if (rmid != NULL) {
> @@ -713,26 +725,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
>  static void intel_cqm_event_start(struct perf_event *event, int mode)
>  {
>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> -	u32 rmid;
>  
>  	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
>  		return;
>  
>  	event->hw.cqm_state &= ~PERF_HES_STOPPED;
>  
> -	alloc_needed_pkg_rmid(event->hw.cqm_rmid);
> -
> -	rmid = event->hw.cqm_rmid[pkg_id];
> -	state->rmid = rmid;
> -	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
> +	if (is_task_event(event)) {
> +		alloc_needed_pkg_rmid(event->hw.cqm_rmid);
> +		state->next_task_rmid = event->hw.cqm_rmid[pkg_id];

Huh? When is this going to be evaluated? Assume the task is running on a
CPU in NOHZ-full mode in user space without ever going through schedule(). How
is that supposed to ever activate the event? It isn't, AFAICT.

> +	}
>  }
>  
>  static void intel_cqm_event_stop(struct perf_event *event, int mode)
>  {
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
>  	if (event->hw.cqm_state & PERF_HES_STOPPED)
>  		return;
>  
>  	event->hw.cqm_state |= PERF_HES_STOPPED;
> +	state->next_task_rmid = 0;

Ditto.

>  }
>  
>  static int intel_cqm_event_add(struct perf_event *event, int mode)
> @@ -1366,6 +1379,8 @@ static int __init intel_cqm_init(void)
>  	if (mbm_enabled)
>  		pr_info("Intel MBM enabled\n");
>  
> +	static_branch_enable(&cqm_enable_key);
> +
> +++ b/arch/x86/include/asm/intel_pqr_common.h
> @@ -0,0 +1,38 @@
> +#ifndef _ASM_X86_INTEL_PQR_COMMON_H
> +#define _ASM_X86_INTEL_PQR_COMMON_H
> +
> +#ifdef CONFIG_INTEL_RDT
> +
> +#include <linux/jump_label.h>
> +#include <linux/types.h>

Missing newline to separate the include blocks.

> +#include <asm/percpu.h>
> +#include <asm/msr.h>
> +#include <asm/intel_rdt_common.h>
> +
> +void __intel_rdt_sched_in(void);
> +
> +/*
> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + *
> + * Following considerations are made so that this has minimal impact
> + * on scheduler hot path:
> + * - This will stay as no-op unless we are running on an Intel SKU
> + *   which supports resource control and we enable by mounting the
> + *   resctrl file system.
> + * - Caches the per cpu CLOSid values and does the MSR write only
> + *   when a task with a different CLOSid is scheduled in.
> + */
> +static inline void intel_rdt_sched_in(void)
> +{
> +	if (static_branch_likely(&rdt_enable_key) ||
> +		static_branch_unlikely(&cqm_enable_key)) {

Groan. Why do you need TWO keys in order to make this decision? Just
because the context switch path is not yet bloated enough?

> +		__intel_rdt_sched_in();
> +	}
> +}
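
A sketch of the single-key shape being hinted at; the key name is an
assumption, and it would be flipped on by whichever of CAT or CQM enables
first:

static inline void intel_rdt_sched_in(void)
{
	/* One key covers both CAT and CQM instead of two separate branches. */
	if (static_branch_unlikely(&rdt_pqr_sched_in_key))
		__intel_rdt_sched_in();
}
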
> +
> +#else
> +
> +static inline void intel_rdt_sched_in(void) {}
> +
> +#endif
> +#endif
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index 95ce5c8..3b4a099 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -5,7 +5,6 @@
>  
>  #include <linux/kernfs.h>
>  #include <linux/jump_label.h>
> -

Darn. Your random whitespace changes are annoying. And this one is actually
wrong. The newline is there on purpose to separate the linux and asm includes
visually.

>  #include <asm/intel_rdt_common.h>
>  

> diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
> index e11ed5e..544acaa 100644
> --- a/arch/x86/include/asm/intel_rdt_common.h
> +++ b/arch/x86/include/asm/intel_rdt_common.h
> @@ -18,12 +18,23 @@
>   */
>  struct intel_pqr_state {
>  	u32			rmid;
> +	u32			next_task_rmid;
>  	u32			closid;
>  	int			rmid_usecnt;
>  };
>  
>  DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>  
> +u32 __get_rmid(int domain);
> +bool __rmid_valid(u32 rmid);
> +void alloc_needed_pkg_rmid(u32 *cqm_rmid);
> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
> +
> +extern struct cgrp_cqm_info cqm_rootcginfo;

Yet another completely pointless export.

> +DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> +
>  /**
>   * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
>   * @cont_mon     Continuous monitoring flag

> diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
> new file mode 100644
> index 0000000..c3c50cd
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/intel_rdt_common.c
> @@ -0,0 +1,98 @@
> +#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
> +
> +#include <linux/slab.h>
> +#include <linux/err.h>
> +#include <linux/cacheinfo.h>
> +#include <linux/cpuhotplug.h>

See above.

> +#include <asm/intel-family.h>
> +#include <asm/intel_rdt.h>
> +
> +/*
> + * The cached intel_pqr_state is strictly per CPU and can never be
> + * updated from a remote CPU. Both functions which modify the state
> + * (intel_cqm_event_start and intel_cqm_event_stop) are called with
> + * interrupts disabled, which is sufficient for the protection.

Just copying comments without actually fixing them is lazy at best. There
are other functions which modify that state.

> + */
> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> +
> +#define pkg_id	topology_physical_package_id(smp_processor_id())

How often are you replicating this macro?  Get rid of it completely
everywhere instead of spreading this crap all over the place.

> +#ifdef CONFIG_INTEL_RDT_M
> +static inline int get_cgroup_sched_rmid(void)
> +{
> +#ifdef CONFIG_CGROUP_PERF
> +	struct cgrp_cqm_info *ccinfo = NULL;
> +
> +	ccinfo = cqminfo_from_tsk(current);
> +
> +	if (!ccinfo)
> +		return 0;
> +
> +	/*
> +	 * A cgroup is always monitoring for itself or
> +	 * for an ancestor(default is root).
> +	 */
> +	if (ccinfo->mon_enabled) {
> +		alloc_needed_pkg_rmid(ccinfo->rmid);
> +		return ccinfo->rmid[pkg_id];
> +	} else {
> +		alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
> +		return ccinfo->mfa->rmid[pkg_id];
> +	}
> +#endif
> +
> +	return 0;
> +}
> +
> +static inline int get_sched_in_rmid(void)
> +{
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +	u32 rmid = 0;
> +
> +	rmid = state->next_task_rmid;
> +
> +	return rmid ? rmid : get_cgroup_sched_rmid();

So this effectively splits the perf sched in work into two parts. One is
handling the perf sched in mechanics and the other is doing the eventual
cgroup muck including rmid allocation.

Why on earth can't the perf sched in work update state->rmid right away and
this extra sched in function just handle CLOSID/RMID updates to the MSR?
Because that would actually make sense, right?

I really can't see how all of this hackery which goes through loops and
hoops over and over will make context switches faster. I'm inclined to bet
that the current code, which is halfway sane, is faster even with two
writes.

> +}
> +#endif
> +
> +/*
> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + *
> + * Following considerations are made so that this has minimal impact
> + * on scheduler hot path:
> + * - This will stay as no-op unless we are running on an Intel SKU
> + *   which supports resource control and we enable by mounting the
> + *   resctrl file system or it supports resource monitoring.
> + * - Caches the per cpu CLOSid/RMID values and does the MSR write only
> + *   when a task with a different CLOSid/RMID is scheduled in.
> + */
> +void __intel_rdt_sched_in(void)
> +{
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +	int closid = 0;
> +	u32 rmid = 0;

So rmid is a u32 and closid is int. Are you rolling a dice when choosing
data types?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/12] x86/rdt,cqm: Scheduling support update
  2017-01-17 21:58   ` Thomas Gleixner
@ 2017-01-17 22:30     ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-17 22:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Tue, 17 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> Introduce a scheduling hook finish_arch_pre_lock_switch which is
>> called just after the perf sched_in during context switch. This method
>> handles both cat and cqm sched in scenarios.
>
> Sure, we need yet another special hook. What's wrong with
> finish_arch_post_lock_switch()?
>
> And again. This wants to be a seperate patch to the core code with a proper
> justification for that hook. Dammit. Changelogs are supposed to explain WHY
> not WHAT. How often do I have to explain that again?

Will fix. Will split this into three patches:
the sched hook (using finish_arch_post_lock_switch), the perf sched_in
changes, and the actual write to the MSR.

>
>> The IA32_PQR_ASSOC MSR is used by cat(cache allocation) and cqm and this
>> patch integrates the two msr writes to one. The common sched_in patch
>> checks if the per cpu cache has a different RMID or CLOSid than the task
>> and does the MSR write.
>>
>> During sched_in the task uses the task RMID if the task is monitored or
>> else uses the task's cgroup rmid.
>
> And that's relevant for that patch because it explains the existing
> behaviour of the RMID, right?
>
> Darn, again you create a unreviewable hodgepodge of changes. The whole
> split of the RMID handling into a perf part and the actual RMID update can
> be done as a seperate patch before switching over to the combined
> RMID/COSID update mechanism.
>
>> +DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
>> +
>>  /*
>>   * Groups of events that have the same target(s), one RMID per group.
>>   */
>> @@ -108,7 +103,7 @@ struct sample {
>>   * Likewise, an rmid value of -1 is used to indicate "no rmid currently
>>   * assigned" and is used as part of the rotation code.
>>   */
>> -static inline bool __rmid_valid(u32 rmid)
>> +bool __rmid_valid(u32 rmid)
>
> And once more this becomes global because there is no user outside of cqm.c.
>
>>  {
>>  	if (!rmid || rmid > cqm_max_rmid)
>>  		return false;
>> @@ -161,7 +156,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
>>   *
>>   * We expect to be called with cache_mutex held.
>>   */
>> -static u32 __get_rmid(int domain)
>> +u32 __get_rmid(int domain)
>
> Ditto.

Will fix the unnecessary globals. These should have been removed in this version,
as all of this should have gone away with the removal of continuous monitoring.

>
>>  {
>>  	struct list_head *cqm_flist;
>>  	struct cqm_rmid_entry *entry;
>> @@ -368,6 +363,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
>>  	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
>>  }
>>
>> +#ifdef CONFIG_CGROUP_PERF
>> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
>> +{
>> +	struct cgrp_cqm_info *ccinfo = NULL;
>> +	struct perf_cgroup *pcgrp;
>> +
>> +	pcgrp = perf_cgroup_from_task(tsk, NULL);
>> +
>> +	if (!pcgrp)
>> +		return NULL;
>> +	else
>> +		ccinfo = cgrp_to_cqm_info(pcgrp);
>> +
>> +	return ccinfo;
>
> What the heck? Either you do:
>
> 	struct cgrp_cqm_info *ccinfo = NULL;
> 	struct perf_cgroup *pcgrp;
>
> 	pcgrp = perf_cgroup_from_task(tsk, NULL);
> 	if (pcgrp)
> 		ccinfo = cgrp_to_cqm_info(pcgrp);
>
> 	return ccinfo;

Will fix.

>
> or
>
> 	struct perf_cgroup *pcgrp;
>
> 	pcgrp = perf_cgroup_from_task(tsk, NULL);
> 	if (pcgrp)
> 		return cgrp_to_cqm_info(pcgrp);
> 	return NULL;
>
> But the above combination does not make any sense at all. Hack at it until
> it compiles and works by chance is not a really good engineering principle.
>
>> +}
>> +#endif
>> +
>>  static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
>>  {
>>  	if (rmid != NULL) {
>> @@ -713,26 +725,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
>>  static void intel_cqm_event_start(struct perf_event *event, int mode)
>>  {
>>  	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> -	u32 rmid;
>>
>>  	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
>>  		return;
>>
>>  	event->hw.cqm_state &= ~PERF_HES_STOPPED;
>>
>> -	alloc_needed_pkg_rmid(event->hw.cqm_rmid);
>> -
>> -	rmid = event->hw.cqm_rmid[pkg_id];
>> -	state->rmid = rmid;
>> -	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
>> +	if (is_task_event(event)) {
>> +		alloc_needed_pkg_rmid(event->hw.cqm_rmid);
>> +		state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
>
> Huch? When is this going to be evaluated? Assume the task is running on a
> CPU in NOHZ full mode in user space w/o ever going through schedule. How is
> that supposed to activate the event ever? Not, AFAICT.
>
>> +	}
>>  }
>>
>>  static void intel_cqm_event_stop(struct perf_event *event, int mode)
>>  {
>> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +
>>  	if (event->hw.cqm_state & PERF_HES_STOPPED)
>>  		return;
>>
>>  	event->hw.cqm_state |= PERF_HES_STOPPED;
>> +	state->next_task_rmid = 0;
>
> Ditto.
>
>>  }
>>
>>  static int intel_cqm_event_add(struct perf_event *event, int mode)
>> @@ -1366,6 +1379,8 @@ static int __init intel_cqm_init(void)
>>  	if (mbm_enabled)
>>  		pr_info("Intel MBM enabled\n");
>>
>> +	static_branch_enable(&cqm_enable_key);
>> +
>> +++ b/arch/x86/include/asm/intel_pqr_common.h
>> @@ -0,0 +1,38 @@
>> +#ifndef _ASM_X86_INTEL_PQR_COMMON_H
>> +#define _ASM_X86_INTEL_PQR_COMMON_H
>> +
>> +#ifdef CONFIG_INTEL_RDT
>> +
>> +#include <linux/jump_label.h>
>> +#include <linux/types.h>
>
> Missing new line to seperate include blocks
>
>> +#include <asm/percpu.h>
>> +#include <asm/msr.h>
>> +#include <asm/intel_rdt_common.h>
>> +
>> +void __intel_rdt_sched_in(void);
>> +
>> +/*
>> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>> + *
>> + * Following considerations are made so that this has minimal impact
>> + * on scheduler hot path:
>> + * - This will stay as no-op unless we are running on an Intel SKU
>> + *   which supports resource control and we enable by mounting the
>> + *   resctrl file system.
>> + * - Caches the per cpu CLOSid values and does the MSR write only
>> + *   when a task with a different CLOSid is scheduled in.
>> + */
>> +static inline void intel_rdt_sched_in(void)
>> +{
>> +	if (static_branch_likely(&rdt_enable_key) ||
>> +		static_branch_unlikely(&cqm_enable_key)) {
>
> Groan. Why do you need TWO keys in order to make this decision? Just
> because the context switch path is not yet bloated enough?

Will fix this to use a common rdt enable key for both. Will also disable it when
no event is being monitored, which is the most likely case.
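
For illustration, a minimal sketch of what the hook could then look like,
assuming a single shared static key (the combined key is an assumption here):

	static inline void intel_rdt_sched_in(void)
	{
		/* Patched in only while allocation or monitoring is in use. */
		if (static_branch_likely(&rdt_enable_key))
			__intel_rdt_sched_in();
	}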

>
>> +		__intel_rdt_sched_in();
>> +	}
>> +}
>> +
>> +#else
>> +
>> +static inline void intel_rdt_sched_in(void) {}
>> +
>> +#endif
>> +#endif
>> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
>> index 95ce5c8..3b4a099 100644
>> --- a/arch/x86/include/asm/intel_rdt.h
>> +++ b/arch/x86/include/asm/intel_rdt.h
>> @@ -5,7 +5,6 @@
>>
>>  #include <linux/kernfs.h>
>>  #include <linux/jump_label.h>
>> -
>
> Darn. Your random whitespace changes are annoying. And this one is actually
> wrong. The new line is there on purpose to seperate linux and asm includes
> visually.

Will fix.

>
>>  #include <asm/intel_rdt_common.h>
>>
>
>> diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
>> index e11ed5e..544acaa 100644
>> --- a/arch/x86/include/asm/intel_rdt_common.h
>> +++ b/arch/x86/include/asm/intel_rdt_common.h
>> @@ -18,12 +18,23 @@
>>   */
>>  struct intel_pqr_state {
>>  	u32			rmid;
>> +	u32			next_task_rmid;
>>  	u32			closid;
>>  	int			rmid_usecnt;
>>  };
>>
>>  DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>>
>> +u32 __get_rmid(int domain);
>> +bool __rmid_valid(u32 rmid);
>> +void alloc_needed_pkg_rmid(u32 *cqm_rmid);
>> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
>> +
>> +extern struct cgrp_cqm_info cqm_rootcginfo;
>
> Yet another completely pointless export.
>
>> +DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
>> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>> +
>>  /**
>>   * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
>>   * @cont_mon     Continuous monitoring flag
>
>> diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
>> new file mode 100644
>> index 0000000..c3c50cd
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/intel_rdt_common.c
>> @@ -0,0 +1,98 @@
>> +#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/slab.h>
>> +#include <linux/err.h>
>> +#include <linux/cacheinfo.h>
>> +#include <linux/cpuhotplug.h>
>
> See above.
>
>> +#include <asm/intel-family.h>
>> +#include <asm/intel_rdt.h>
>> +
>> +/*
>> + * The cached intel_pqr_state is strictly per CPU and can never be
>> + * updated from a remote CPU. Both functions which modify the state
>> + * (intel_cqm_event_start and intel_cqm_event_stop) are called with
>> + * interrupts disabled, which is sufficient for the protection.
>
> Just copying comments without actually fixing them is lazy at best. There
> are other functions which modify that state.
>
>> + */
>> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
>> +
>> +#define pkg_id	topology_physical_package_id(smp_processor_id())
>
> How often are you replicating this macro?  Get rid of it completely
> everywhere instead of spreading this crap all over the place.

Will fix.

>
>> +#ifdef CONFIG_INTEL_RDT_M
>> +static inline int get_cgroup_sched_rmid(void)
>> +{
>> +#ifdef CONFIG_CGROUP_PERF
>> +	struct cgrp_cqm_info *ccinfo = NULL;
>> +
>> +	ccinfo = cqminfo_from_tsk(current);
>> +
>> +	if (!ccinfo)
>> +		return 0;
>> +
>> +	/*
>> +	 * A cgroup is always monitoring for itself or
>> +	 * for an ancestor(default is root).
>> +	 */
>> +	if (ccinfo->mon_enabled) {
>> +		alloc_needed_pkg_rmid(ccinfo->rmid);
>> +		return ccinfo->rmid[pkg_id];
>> +	} else {
>> +		alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
>> +		return ccinfo->mfa->rmid[pkg_id];
>> +	}
>> +#endif
>> +
>> +	return 0;
>> +}
>> +
>> +static inline int get_sched_in_rmid(void)
>> +{
>> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +	u32 rmid = 0;
>> +
>> +	rmid = state->next_task_rmid;
>> +
>> +	return rmid ? rmid : get_cgroup_sched_rmid();
>
> So this effectively splits the perf sched in work into two parts. One is
> handling the perf sched in mechanics and the other is doing the eventual
> cgroup muck including rmid allocation.
>
> Why on earth can't the perf sched in work update state->rmid right away and
> this extra sched in function just handle CLOSID/RMID updates to the MSR ?
> Because that would actually make sense, right?
>
> I really can;t see how all of this hackery which goes through loops and
> hoops over and over will make context switches faster. I'm inclined to bet
> that the current code, which is halfways sane is faster even with two
> writes.

Yes, will fix this.
I forgot to change this when I removed continuous monitoring. It should indeed
make the sched code simpler, which is always better.

The split was originally wanted because perf sched_in calls all the cgroup
ancestor events for each event in the hierarchy, but we cannot have two
different RMIDs monitored at the same time (in cases where multiple cgroups
are monitored in the same hierarchy). So we just introduce the _NO_RECURSION
flag for perf to call only the event and not its ancestors.
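
For reference, a rough sketch of the simplified hook along the lines suggested
above - perf updates the cached RMID at sched_in, so the hook only compares and
does a single MSR write. The next_rmid field and the task closid lookup are
illustrative assumptions, not the actual patch code:

	void __intel_rdt_sched_in(void)
	{
		struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
		u32 closid = current->closid;	/* 0 means use the cpu default */
		u32 rmid = state->next_rmid;	/* filled in by perf sched_in */

		/* At most one MSR write per context switch. */
		if (closid != state->closid || rmid != state->rmid) {
			state->closid = closid;
			state->rmid = rmid;
			wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
		}
	}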

>
>> +}
>> +#endif
>> +
>> +/*
>> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>> + *
>> + * Following considerations are made so that this has minimal impact
>> + * on scheduler hot path:
>> + * - This will stay as no-op unless we are running on an Intel SKU
>> + *   which supports resource control and we enable by mounting the
>> + *   resctrl file system or it supports resource monitoring.
>> + * - Caches the per cpu CLOSid/RMID values and does the MSR write only
>> + *   when a task with a different CLOSid/RMID is scheduled in.
>> + */
>> +void __intel_rdt_sched_in(void)
>> +{
>> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +	int closid = 0;
>> +	u32 rmid = 0;
>
> So rmid is a u32 and closid is int. Are you rolling a dice when chosing
> data types?

Will fix.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/12] x86/cqm: Add RMID reuse
  2017-01-17 16:59   ` Thomas Gleixner
@ 2017-01-18  0:26     ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-18  0:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Tue, 17 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> +static void cqm_schedule_rmidwork(int domain);
>
> This forward declaration is required because all callers of that function
> are coming _after_ the function implementation, right?
>
>> +static inline bool is_first_cqmwork(int domain)
>> +{
>> +	return (!atomic_cmpxchg(&cqm_pkgs_data[domain]->reuse_scheduled, 0, 1));
>
> What's the purpose of these outer brackets? Enhanced readability?

Will fix the coding style and the ordering of the function declarations.

>
>> +}
>> +
>>  static void __put_rmid(u32 rmid, int domain)
>>  {
>>  	struct cqm_rmid_entry *entry;
>> @@ -294,6 +301,93 @@ static void cqm_mask_call(struct rmid_read *rr)
>>  static unsigned int __intel_cqm_threshold;
>>  static unsigned int __intel_cqm_max_threshold;
>>
>> +/*
>> + * Test whether an RMID has a zero occupancy value on this cpu.
>
> This tests whether the occupancy value is less than
> __intel_cqm_threshold. Unless I'm missing something the value can be set by
> user space and therefor is not necessarily zero.
>
> Your commentry is really useful: It's either wrong or superflous or non
> existent.

Will fix; yes, it is the user-configurable value.

>
>> + */
>> +static void intel_cqm_stable(void)
>> +{
>> +	struct cqm_rmid_entry *entry;
>> +	struct list_head *llist;
>> +
>> +	llist = &cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru;
>> +	list_for_each_entry(entry, llist, list) {
>> +
>> +		if (__rmid_read(entry->rmid) < __intel_cqm_threshold)
>> +			entry->state = RMID_AVAILABLE;
>> +	}
>> +}
>> +
>> +static void __intel_cqm_rmid_reuse(void)
>> +{
>> +	struct cqm_rmid_entry *entry, *tmp;
>> +	struct list_head *llist, *flist;
>> +	struct pkg_data *pdata;
>> +	unsigned long flags;
>> +
>> +	raw_spin_lock_irqsave(&cache_lock, flags);
>> +	pdata = cqm_pkgs_data[pkg_id];
>> +	llist = &pdata->cqm_rmid_limbo_lru;
>> +	flist = &pdata->cqm_rmid_free_lru;
>> +
>> +	if (list_empty(llist))
>> +		goto end;
>> +	/*
>> +	 * Test whether an RMID is free
>> +	 */
>> +	intel_cqm_stable();
>> +
>> +	list_for_each_entry_safe(entry, tmp, llist, list) {
>> +
>> +		if (entry->state == RMID_DIRTY)
>> +			continue;
>> +		/*
>> +		 * Otherwise remove from limbo and place it onto the free list.
>> +		 */
>> +		list_del(&entry->list);
>> +		list_add_tail(&entry->list, flist);
>
> This is really a performance optimized implementation. Iterate the limbo
> list first and check the occupancy value, conditionally update the state
> and then iterate the same list and conditionally move the entries which got
> just updated to the free list. Of course all of this happens with
> interrupts disabled and a global lock held to enhance it further.

Will fix this. This can use the per-package lock, since the global cache_lock
only protects the event list. Also yes, we don't need two loops now as it is
per package.
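
A sketch of the single-pass version under the per-package lock (the lock and
list names are taken from the quoted code, but this is only illustrative):

	static bool __intel_cqm_rmid_reuse(int domain)
	{
		struct pkg_data *pdata = cqm_pkgs_data[domain];
		struct cqm_rmid_entry *entry, *tmp;
		unsigned long flags;
		bool more;

		raw_spin_lock_irqsave(&pdata->pkg_data_lock, flags);
		list_for_each_entry_safe(entry, tmp, &pdata->cqm_rmid_limbo_lru, list) {
			/* Occupancy dropped below the user threshold: reusable. */
			if (__rmid_read(entry->rmid) < __intel_cqm_threshold) {
				entry->state = RMID_AVAILABLE;
				list_move_tail(&entry->list, &pdata->cqm_rmid_free_lru);
			}
		}
		more = !list_empty(&pdata->cqm_rmid_limbo_lru);
		raw_spin_unlock_irqrestore(&pdata->pkg_data_lock, flags);

		return more;	/* true if limbo entries remain */
	}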

>
>> +	}
>> +
>> +end:
>> +	raw_spin_unlock_irqrestore(&cache_lock, flags);
>> +}
>> +
>> +static bool reschedule_cqm_work(void)
>> +{
>> +	unsigned long flags;
>> +	bool nwork = false;
>> +
>> +	raw_spin_lock_irqsave(&cache_lock, flags);
>> +
>> +	if (!list_empty(&cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru))
>> +		nwork = true;
>> +	else
>> +		atomic_set(&cqm_pkgs_data[pkg_id]->reuse_scheduled, 0U);
>
> 0U because the 'val' argument of atomic_set() is 'int', right?

Will fix.

>
>> +	raw_spin_unlock_irqrestore(&cache_lock, flags);
>> +
>> +	return nwork;
>> +}
>> +
>> +static void cqm_schedule_rmidwork(int domain)
>> +{
>> +	struct delayed_work *dwork;
>> +	unsigned long delay;
>> +
>> +	dwork = &cqm_pkgs_data[domain]->intel_cqm_rmid_work;
>> +	delay = msecs_to_jiffies(RMID_DEFAULT_QUEUE_TIME);
>> +
>> +	schedule_delayed_work_on(cqm_pkgs_data[domain]->rmid_work_cpu,
>> +			     dwork, delay);
>> +}
>> +
>> +static void intel_cqm_rmid_reuse(struct work_struct *work)
>
> Your naming conventions are really random. cqm_* intel_cqm_* and then
> totaly random function names. Is there any deeper thought involved or is it
> just what it looks like: random?
>
>> +{
>> +	__intel_cqm_rmid_reuse();
>> +
>> +	if (reschedule_cqm_work())
>> +		cqm_schedule_rmidwork(pkg_id);
>
> Great stuff. You first try to move the limbo RMIDs to the free list and
> then you reevaluate the limbo list again. Thereby acquiring and dropping
> the global cache lock several times.
>
> Dammit, the stupid reuse function already knows whether the list is empty
> or not. But returning that information would make too much sense.

Will fix. The global lock is not needed either.
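
The work function could then reuse that return value instead of walking the
list again (sketch only, continuing the assumed helper above):

	static void intel_cqm_rmid_reuse(struct work_struct *work)
	{
		/* Reschedule only while limbo entries are left over. */
		if (__intel_cqm_rmid_reuse(pkg_id))
			cqm_schedule_rmidwork(pkg_id);
		else
			atomic_set(&cqm_pkgs_data[pkg_id]->reuse_scheduled, 0);
	}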

>
>> +}
>> +
>>  static struct pmu intel_cqm_pmu;
>>
>>  static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
>> @@ -548,6 +642,8 @@ static int intel_cqm_setup_event(struct perf_event *event,
>>  	if (!event->hw.cqm_rmid)
>>  		return -ENOMEM;
>>
>> +	cqm_assign_rmid(event, event->hw.cqm_rmid);
>
> Oh, so now there is a second user which calls cqm_assign_rmid(). Finally it
> might actually do something. Whether that something is useful and
> functional, I seriously doubt it.
>
>> +
>>  	return 0;
>>  }
>>
>> @@ -863,6 +959,7 @@ static void intel_cqm_event_terminate(struct perf_event *event)
>>  {
>>  	struct perf_event *group_other = NULL;
>>  	unsigned long flags;
>> +	int d;
>>
>>  	mutex_lock(&cache_mutex);
>>  	/*
>> @@ -905,6 +1002,13 @@ static void intel_cqm_event_terminate(struct perf_event *event)
>>  		mbm_stop_timers();
>>
>>  	mutex_unlock(&cache_mutex);
>> +
>> +	for (d = 0; d < cqm_socket_max; d++) {
>> +
>> +		if (cqm_pkgs_data[d] != NULL && is_first_cqmwork(d)) {
>> +			cqm_schedule_rmidwork(d);
>> +		}
>
> Let's look at the logic of this.
>
> When the event terminates, then you unconditionally schedule the rmid work
> on ALL packages whether the event was freed or not. Really brilliant.
>
> There is no reason to schedule anything if the event was never using any
> rmid on a given package or if the RMIDs are not freed because there is a
> new group leader.

Will fix this to not schedule the work if the event did not use the package.
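
Roughly along these lines (a sketch; it glosses over when the RMIDs are
actually put back):

	for (d = 0; d < cqm_socket_max; d++) {
		/* Skip packages where this event never had a valid RMID. */
		if (!cqm_pkgs_data[d] || !__rmid_valid(event->hw.cqm_rmid[d]))
			continue;
		if (is_first_cqmwork(d))
			cqm_schedule_rmidwork(d);
	}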

>
> The NOHZ FULL people will love that extra work activity for nothing.
>
> Also the detection whether the work is already scheduled is intuitive as
> usual: is_first_cqmwork() ? What???? !cqmwork_scheduled() would be too
> simple to understand, right?
>
> Oh well.
>
>> +	}
>>  }
>>
>>  static int intel_cqm_event_init(struct perf_event *event)
>> @@ -1322,6 +1426,9 @@ static int pkg_data_init_cpu(int cpu)
>>  	mutex_init(&pkg_data->pkg_data_mutex);
>>  	raw_spin_lock_init(&pkg_data->pkg_data_lock);
>>
>> +	INIT_DEFERRABLE_WORK(
>> +		&pkg_data->intel_cqm_rmid_work, intel_cqm_rmid_reuse);
>
> Stop this crappy formatting.
>
> 	INIT_DEFERRABLE_WORK(&pkg_data->intel_cqm_rmid_work,
> 			     intel_cqm_rmid_reuse);
>
> is the canonical way to do line breaks. Is it really that hard?

Will fix the formatting.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
  2017-01-17 12:11   ` Thomas Gleixner
  2017-01-17 12:31     ` Peter Zijlstra
@ 2017-01-18  2:14     ` Shivappa Vikas
  1 sibling, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-18  2:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Tue, 17 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>
>> From: David Carrillo-Cisneros <davidcc@google.com>
>>
>> cgroup hierarchy monitoring is not supported currently. This patch
>> builds all the necessary datastructures, cgroup APIs like alloc, free
>> etc and necessary quirks for supporting cgroup hierarchy monitoring in
>> later patches.
>>
>> - Introduce a architecture specific data structure arch_info in
>> perf_cgroup to keep track of RMIDs and cgroup hierarchical monitoring.
>> - perf sched_in calls all the cgroup ancestors when a cgroup is
>> scheduled in. This will not work with cqm as we have a common per pkg
>> rmid associated with one task and hence cannot write different RMIds
>> into the MSR for each event. cqm driver enables a flag
>> PERF_EV_CGROUP_NO_RECURSION which indicates the perf to not call all
>> ancestor cgroups for each event and let the driver handle the hierarchy
>> monitoring for cgroup.
>> - Introduce event_terminate as event_destroy is called after cgrp is
>> disassociated from the event to support rmid handling of the cgroup.
>> This helps cqm clean up the cqm specific arch_info.
>> - Add the cgroup APIs for alloc,free,attach and can_attach
>>
>> The above framework will be used to build different cgroup features in
>> later patches.
>
> That's not a framework. It's a hodgepodge of core and x86 specific changes.
>
> I'm not even trying to review it as a whole, simply because such changes
> want to be split into several preparatory changes in the core which provide
> the 'framework' parts and then an actual user in the architecture
> code. I'll give some general feedback whatsoever.

Will split them into multiple patches.

>
>> Tests: Same as before. Cgroup still doesnt work but we did the prep to
>> get it to work
>
> Oh well. What tests are the same as before? This information is just there
> to take room in the changelog, right?

Will fix the changelogs.

>
>> Patch modified/refactored by Vikas Shivappa
>> <vikas.shivappa@linux.intel.com> to support recycling removal.
>>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>
> So this patch is from David, but where is Davids Signed-off-by?
>
>> ---
>>  arch/x86/events/intel/cqm.c       | 19 ++++++++++++++++++-
>>  arch/x86/include/asm/perf_event.h | 27 +++++++++++++++++++++++++++
>>  include/linux/perf_event.h        | 32 ++++++++++++++++++++++++++++++++
>>  kernel/events/core.c              | 28 +++++++++++++++++++++++++++-
>>  4 files changed, 104 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
>> index 68fd1da..a9bd7bd 100644
>> --- a/arch/x86/events/intel/cqm.c
>> +++ b/arch/x86/events/intel/cqm.c
>> @@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
>>  	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
>>  	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
>>
>> -	event->destroy = intel_cqm_event_destroy;
>> +	/*
>> +	 * CQM driver handles cgroup recursion and since only noe
>> +	 * RMID can be programmed at the time in each core, then
>> +	 * it is incompatible with the way generic code handles
>> +	 * cgroup hierarchies.
>> +	 */
>> +	event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
>>
>>  	mutex_lock(&cache_mutex);
>>
>> @@ -918,6 +924,17 @@ static int intel_cqm_event_init(struct perf_event *event)
>>  	.read		     = intel_cqm_event_read,
>>  	.count		     = intel_cqm_event_count,
>>  };
>> +#ifdef CONFIG_CGROUP_PERF
>> +int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
>> +				      struct cgroup_subsys_state *new_css)
>> +{}
>> +void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
>> +{}
>> +void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
>> +{}
>> +int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
>> +{}
>> +#endif
>
> What the heck is this for? It does not even compile because
> perf_cgroup_arch_css_alloc() and perf_cgroup_arch_can_attach() are empty
> functions.
>
> Crap, crap, crap.
>
>>  static inline void cqm_pick_event_reader(int cpu)
>>  {
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index f353061..f38c7f0 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -299,4 +299,31 @@ static inline void perf_check_microcode(void) { }
>>
>>  #define arch_perf_out_copy_user copy_from_user_nmi
>>
>> +/*
>> + * Hooks for architecture specific features of perf_event cgroup.
>> + * Currently used by Intel's CQM.
>> + */
>> +#ifdef CONFIG_INTEL_RDT_M
>> +#ifdef CONFIG_CGROUP_PERF
>> +
>> +#define perf_cgroup_arch_css_alloc	perf_cgroup_arch_css_alloc
>> +
>> +int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
>> +				      struct cgroup_subsys_state *new_css);
>> +
>> +#define perf_cgroup_arch_css_free	perf_cgroup_arch_css_free
>> +
>> +void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
>> +
>> +#define perf_cgroup_arch_attach		perf_cgroup_arch_attach
>> +
>> +void perf_cgroup_arch_attach(struct cgroup_taskset *tset);
>> +
>> +#define perf_cgroup_arch_can_attach	perf_cgroup_arch_can_attach
>> +
>> +int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset);
>> +
>> +#endif
>> +
>> +#endif
>>  #endif /* _ASM_X86_PERF_EVENT_H */
>
> How the heck is one supposed to figure out which endif is belonging to
> what? Random new lines are not helping for that.

Will add comments to indicate which #ifdef each #endif belongs to.

>
> Aside of that the double ifdef is horrible and this really is not at even
> remotely a framework. It's hardcoded crap to serve that CQM mess. Nothing
> else can ever use it. So don't pretend it to be a 'framework'.
>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index a8f4749..410642a 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -300,6 +300,12 @@ struct pmu {
>>  	int (*event_init)		(struct perf_event *event);
>>
>>  	/*
>> +	 * Terminate the event for this PMU. Optional complement for a
>> +	 * successful event_init. Called before the event fields are tear down.
>> +	 */
>> +	void (*event_terminate)		(struct perf_event *event);
>
> And why does this need to be a PMU callback. It's called right before
> perf_cgroup_detach(). Why does it need extra treatment and cannot be done
> from the cgroup muck?

Will remove the event_terminate callback and do this in the cgroup API itself.

>
>> +
>> +	/*
>>  	 * Notification that the event was mapped or unmapped.  Called
>>  	 * in the context of the mapping task.
>>  	 */
>> @@ -516,9 +522,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
>>   * PERF_EV_CAP_SOFTWARE: Is a software event.
>>   * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
>>   * from any CPU in the package where it is active.
>> + * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
>> + * cgroup scoping. It does not need to be enabled for all of its descendants
>> + * cgroups.
>>   */
>>  #define PERF_EV_CAP_SOFTWARE		BIT(0)
>>  #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
>> +#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(2)
>>
>>  #define SWEVENT_HLIST_BITS		8
>>  #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
>> @@ -823,6 +833,8 @@ struct perf_cgroup_info {
>>  };
>>
>>  struct perf_cgroup {
>> +	/* Architecture specific information. */
>
> That's a really useful comment.
>
>> +	void				 *arch_info;
>>  	struct cgroup_subsys_state	css;
>>  	struct perf_cgroup_info	__percpu *info;
>>  };
>> @@ -844,6 +856,7 @@ struct perf_cgroup {
>>
>>  #ifdef CONFIG_PERF_EVENTS
>>
>> +extern int is_cgroup_event(struct perf_event *event);
>>  extern void *perf_aux_output_begin(struct perf_output_handle *handle,
>>  				   struct perf_event *event);
>>  extern void perf_aux_output_end(struct perf_output_handle *handle,
>> @@ -1387,4 +1400,23 @@ ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
>>  #define perf_event_exit_cpu	NULL
>>  #endif
>>
>> +/*
>> + * Hooks for architecture specific extensions for perf_cgroup.
>
> No. That's not architecture specific. That's CQM specific hackery.
>
>> + */
>> +#ifndef perf_cgroup_arch_css_alloc
>> +#define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
>> +#endif
>
> I really hate this define style. That can be solved nicely with weak
> functions which avoid all this define and ifdeffery mess.
>
>> +#ifndef perf_cgroup_arch_css_free
>> +#define perf_cgroup_arch_css_free(css) do { } while (0)
>> +#endif
>> +
>> +#ifndef perf_cgroup_arch_attach
>> +#define perf_cgroup_arch_attach(tskset) do { } while (0)
>> +#endif
>> +
>> +#ifndef perf_cgroup_arch_can_attach
>> +#define perf_cgroup_arch_can_attach(tskset) 0
>
> This one is exceptionally stupid. Here is the use case:

Will fix this and use weak functions.
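
For illustration, the weak-function style could look roughly like this (a
sketch, not the final interface):

	/* kernel/events/core.c: default no-op implementations. */
	int __weak perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
	{
		return 0;
	}

	void __weak perf_cgroup_arch_attach(struct cgroup_taskset *tset)
	{
	}

	/*
	 * The arch/driver code (cqm in this case) then provides ordinary
	 * non-weak definitions to override these, with no #define stubs.
	 */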

>
>> +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
>> +{
>> +     return perf_cgroup_arch_can_attach(tset);
>>  }
>>
>> +
>>  struct cgroup_subsys perf_event_cgrp_subsys = {
>>       .css_alloc      = perf_cgroup_css_alloc,
>>       .css_free       = perf_cgroup_css_free,
>> +     .can_attach     = perf_cgroup_can_attach,
>>       .attach         = perf_cgroup_attach,
>>  };
>
> So you need a extra function for calling a stub macro if it just can
> be done by assigning the real function (if required) to the callback
> pointer and otherwise leave it NULL.
>
> But all of this is moot because this 'arch framework' is just crap because
> it's in reality a CQM extension of the core code, which is not going to
> happen.
>
>> +#endif
>> +
>>  #endif /* _LINUX_PERF_EVENT_H */
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index ab15509..229f611 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -590,6 +590,9 @@ static inline u64 perf_event_clock(struct perf_event *event)
>>  	if (!cpuctx->cgrp)
>>  		return false;
>>
>> +	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
>> +		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
>> +
>
> Comments explaining what this does are overrated, right?

Will fix and add a comment.

perf sched_in calls the cgroup ancestors during a cgroup event sched_in, but
the cqm events are package wide and mapped to RMIDs. Since only one RMID can
be written to PQR_ASSOC at a time, we need to handle this separately. The
PERF_EV_CAP_CGROUP_NO_RECURSION event cap tells perf not to call all the
ancestor cgroups.
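
Something like the following over the quoted check (a sketch of the comment to
be added):

	/*
	 * Events that handle cgroup scoping themselves (e.g. cqm, which can
	 * only program one RMID per cpu at a time) match only their own
	 * cgroup, not any descendant of it.
	 */
	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;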

>
>>  	/*
>>  	 * Cgroup scoping is recursive.  An event enabled for a cgroup is
>>  	 * also enabled for all its descendant cgroups.  If @cpuctx's
>> @@ -606,7 +609,7 @@ static inline void perf_detach_cgroup(struct perf_event *event)
>>  	event->cgrp = NULL;
>>  }
>>
>> -static inline int is_cgroup_event(struct perf_event *event)
>> +int is_cgroup_event(struct perf_event *event)
>
> So this is made global because there is no actual user outside of the core.

Will fix.

>
>>  {
>>  	return event->cgrp != NULL;
>>  }
>> @@ -4019,6 +4022,9 @@ static void _free_event(struct perf_event *event)
>>  		mutex_unlock(&event->mmap_mutex);
>>  	}
>>
>> +	if (event->pmu->event_terminate)
>> +		event->pmu->event_terminate(event);
>> +
>>  	if (is_cgroup_event(event))
>>  		perf_detach_cgroup(event);
>>
>> @@ -9246,6 +9252,8 @@ static void account_event(struct perf_event *event)
>>  	exclusive_event_destroy(event);
>>
>>  err_pmu:
>> +	if (event->pmu->event_terminate)
>> +		event->pmu->event_terminate(event);
>>  	if (event->destroy)
>>  		event->destroy(event);
>>  	module_put(pmu->module);
>> @@ -10748,6 +10756,7 @@ static int __init perf_event_sysfs_init(void)
>>  perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>>  {
>>  	struct perf_cgroup *jc;
>> +	int ret;
>>
>>  	jc = kzalloc(sizeof(*jc), GFP_KERNEL);
>>  	if (!jc)
>> @@ -10759,6 +10768,12 @@ static int __init perf_event_sysfs_init(void)
>>  		return ERR_PTR(-ENOMEM);
>>  	}
>>
>> +	jc->arch_info = NULL;
>
> Never trust kzalloc to zero out a data structure correctly!
>
>> +
>> +	ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
>> +	if (ret)
>> +		return ERR_PTR(ret);
>
> And then leak the allocated memory in case of error.
>
> Another wonderful piece of trainwreck engineering.
>
> As I said above: Split this into bits and pieces and provide a proper
> justification for each of the items you add to the core: terminate,
> PERF_EV_CAP_CGROUP_NO_RECURSION.
>
> Then sit down and come up with a solution which allows to make use of the
> cgroup core extensions for more than a single instance of a particular
> piece of x86 perf hardware.

Will fix. Can split the patches (and remove the terminate callback) and then
add generic extensions which can be used.

cqm needs a way to maintain cgroup-specific information, as shown in the
struct below, because of the way the hardware RMIDs are designed:

- the RMIDs are per package, meaning the count (for mbm) and occupancy (for
cqm) cannot be updated per cpu when a task is scheduled.
- we cannot track two events in a hierarchy at the same time, as we cannot
write two RMIDs.

Because of this we maintain the mfa pointer, as shown below, to keep track of
which ancestor the cgroup (or the task) needs to report data to.

  * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
  * @mon_enabled  Whether monitoring is enabled
  * @level        Level in the cgroup tree. Root is level 0.
  * @rmid        The rmids of the cgroup.
  * @mfa          'Monitoring for ancestor' points to the cqm_info
  *  of the ancestor the cgroup is monitoring for. 'Monitoring for ancestor'
  *  means you will use an ancestors RMID at sched_in if you are
  *  not monitoring yourself.
  *
  *  Due to the hierarchical nature of cgroups, every cgroup just
  *  monitors for the 'nearest monitored ancestor' at all times.
  *  Since root cgroup is always monitored, all descendents
  *  at boot time monitor for root and hence all mfa points to root except
  *  for root->mfa which is NULL.
  *  1. RMID setup: When cgroup x start monitoring:
  *    for each descendent y, if y's mfa->level < x->level, then
  *    y->mfa = x. (Where level of root node = 0...)
  *  2. sched_in: During sched_in for x
  *    if (x->mon_enabled) choose x->rmid
  *    else choose x->mfa->rmid.
  *  3. read: for each descendent of cgroup x
  *     if (x->monitored) count += rmid_read(x->rmid).
  *  4. evt_destroy: for each descendent y of x, if (y->mfa == x) then
  *     y->mfa = x->mfa. Meaning if any descendent was monitoring for x,
  *     set that descendent to monitor for the cgroup which x was monitoring for.
  *
  * @tskmon_rlist List of tasks being monitored in the cgroup
  *  When a task which belongs to a cgroup x is being monitored, it always uses
  *  its own task->rmid even if cgroup x is monitored during sched_in.
  *  To account for the counts of such tasks, cgroup keeps this list
  *  and parses it during read.
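
For reference, the structure those comments describe would look roughly like
this (reconstructed from the description above; the exact types and names may
differ in the patches):

	struct cgrp_cqm_info {
		bool			mon_enabled;	/* monitoring enabled */
		int			level;		/* depth in cgroup tree, root = 0 */
		u32			*rmid;		/* per-package RMIDs */
		struct cgrp_cqm_info	*mfa;		/* nearest monitored ancestor */
		struct list_head	tskmon_rlist;	/* monitored tasks in this cgroup */
	};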

Thanks,
Vikas

>
> And please provide changelogs which explain WHY all of this is necessary,
> not just the WHAT.
>
> Thanks,
>
> 	tglx
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-17 17:31 ` [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Thomas Gleixner
@ 2017-01-18  2:38   ` Shivappa Vikas
  2017-01-18  8:53     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-18  2:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Tue, 17 Jan 2017, Thomas Gleixner wrote:

> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> Cqm(cache quality monitoring) is part of Intel RDT(resource director
>> technology) which enables monitoring and controlling of processor shared
>> resources via MSR interface.
>
> We know that already. No need for advertising this over and over.
>
>> Below are the issues and the fixes we attempt-
>
> Emphasis on attempt.
>
>> - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>> zeros or arbitrary numbers.
>>
>> Fix: Patches fix this by just throwing an error if the mode is not supported.
>> The modes supported is task monitoring and cgroup monitoring.
>> Also the per package
>> data for say socket x is returned with the -C <cpu on socketx> -G cgrpy option.
>> The systemwide data can be looked up by monitoring root cgroup.
>
> Fine. That just lacks any comment in the implementation. Otherwise I would
> not have asked the question about cpu monitoring. Though I fundamentaly
> hate the idea of requiring cgroups for this to work.
>
> If I just want to look at CPU X why on earth do I have to set up all that
> cgroup muck? Just because your main focus is cgroups?

The upstream per-cpu data is broken because it does not override the other
task event RMIDs on that cpu with the cpu event RMID.

This can be fixed by adding a per-cpu struct to hold the RMID that is
affinitized to the cpu; however, we would then miss all the task
llc_occupancy in it - still evaluating this.
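
Roughly what that per-cpu slot could look like (names are placeholders):

	/* Sketch: RMID of a cpu-wide event pinned to this cpu, 0 if none. */
	struct cqm_cpu_state {
		u32	cpu_event_rmid;
	};
	DEFINE_PER_CPU(struct cqm_cpu_state, cqm_cpu_state);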

>
>> - Issue(2): RMIDs are global and dont scale with more packages and hence
>> also run out of RMIDs very soon.
>>
>> Fix: Support per pkg RMIDs hence scale better with more
>> packages, and get more RMIDs to use and use when needed (ie when tasks
>> are actually scheduled on the package).
>
> That's fine, just the implementation is completely broken
>
>> - Issue(5): CAT and cqm/mbm write the same PQR_ASSOC_MSR seperately
>> Fix: Integrate the sched in code and write the PQR_MSR only once every switch_to
>
> Brilliant stuff that. I bet that the seperate MSR writes which we have now
> are actually faster even in the worst case of two writes.
>
>> [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling
>
> That's the only undisputed and probably correct patch in that whole series.
>
> Executive summary: The whole series except 2/12 is a complete trainwreck.
>
> Major issues:
>
> - Patch split is random and leads to non compilable and non functional
>  steps aside of creating a unreviewable mess.

Will split as per the comments into pieces that compile without warnings. The
series was tested for compile and build, but not for warnings (except for the
int is_cgroup_event one which you pointed out).

>
> - The core code is updated with random hooks which claim to be a generic
>  framework but are completely hardcoded for that particular CQM case.
>
>  The introduced hooks are neither fully documented (semantical and
>  functional) nor justified why they are required.
>
> - The code quality is horrible in coding style, technical correctness and
>  design.
>
> So how can we proceed from here?
>
> I want to see small patch series which change a certain aspect of the
> implementation and only that. They need to be split in preparatory changes,
> which refactor code or add new interfaces to the core code, and the actual
> implementation in CQM/MBM.

I appreciate your time for the review and all the feedback.

I plan to send a smaller patch series first which includes the following, and
then follow up with other series:

- the current 2/12
- per-package RMID support, with the fixes mentioned.
- the reuse patch which reuses the RMIDs that are freed from events.
- a patch to indicate an error if an RMID is still not available.
- any relevant sched and cpu hotplug updates.

That way it is a functional set where tasks can use different RMIDs on
different packages and provide correct data for 'task events'. Either the data
would be correct or we would indicate an error that we were limited by
hardware RMIDs.
Is that a reasonable set to target?

Then I would follow up with separate series to get correct data for 'cgroup
events', to support cgroup and task together, and for the other issues:
systemwide / per-cpu events, etc.

Thanks,
Vikas

>
> Each of the patches must have a proper changelog explaining the WHY and if
> required the semantical properties of a new interface.
>
> Each of the patches must compile without warnigns/errors and be fully
> functional vs. the changes they do.
>
> Any attempt to resend this disaster as a whole will be NACKed w/o even
> looking at it. I've wasted enough time with this wholesale approach in the
> past and it did not work out. I'm not going to play that wasteful game
> over and over again.
>
> Thanks,
>
> 	tglx
>
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  2:38   ` Shivappa Vikas
@ 2017-01-18  8:53     ` Thomas Gleixner
  2017-01-18  9:56       ` Peter Zijlstra
                         ` (7 more replies)
  0 siblings, 8 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-18  8:53 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: Vikas Shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen,
	h.peter.anvin

On Tue, 17 Jan 2017, Shivappa Vikas wrote:
> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> > > - Issue(1): Inaccurate data for per package data, systemwide. Just prints
> > > zeros or arbitrary numbers.
> > > 
> > > Fix: Patches fix this by just throwing an error if the mode is not
> > > supported.
> > > The modes supported is task monitoring and cgroup monitoring.
> > > Also the per package
> > > data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
> > > option.
> > > The systemwide data can be looked up by monitoring root cgroup.
> > 
> > Fine. That just lacks any comment in the implementation. Otherwise I would
> > not have asked the question about cpu monitoring. Though I fundamentaly
> > hate the idea of requiring cgroups for this to work.
> > 
> > If I just want to look at CPU X why on earth do I have to set up all that
> > cgroup muck? Just because your main focus is cgroups?
> 
> The upstream per cpu data is broken because its not overriding the other task
> event RMIDs on that cpu with the cpu event RMID.
> 
> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
> to the cpu, however then we miss all the task llc_occupancy in that - still
> evaluating it.

The point here is that CQM is closely connected to the cache allocation
technology. After a lengthy discussion we ended up having

  - per cpu CLOSID
  - per task CLOSID

where all tasks which do not have a CLOSID assigned use the CLOSID which is
assigned to the CPU they are running on.

So if I configure a system by simply partitioning the cache per cpu, which
is the proper way to do it for HPC and RT usecases where workloads are
partitioned on CPUs as well, then I really want to have an equally simple
way to monitor the occupancy for that reservation.

And looking at that from the CAT point of view, which is the proper way to
do it, makes it obvious that CQM should be modeled to match CAT.

So lets assume the following:

   CPU 0-3     default CLOSID 0
   CPU 4       	       CLOSID 1
   CPU 5	       CLOSID 2
   CPU 6	       CLOSID 3
   CPU 7	       CLOSID 3

   T1  		       CLOSID 4
   T2		       CLOSID 5
   T3		       CLOSID 6
   T4		       CLOSID 6

   All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
   they run on.

then the obvious basic monitoring requirement is to have a RMID for each
CLOSID.

So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
care at all about the occupancy of T1 simply because that is running on a
separate reservation. Trying to make that an aggregated value in the first
place is completely wrong. If you want an aggregate, which is pretty much
useless, then user space tools can generate it easily.

The whole approach you and David have taken is to whack some desired cgroup
functionality and whatever into CQM without rethinking the overall
design. And that's fundamentally broken because it does not take cache (and
memory bandwidth) allocation into account.

I seriously doubt, that the existing CQM/MBM code can be refactored in any
useful way. As Peter Zijlstra said before: Remove the existing cruft
completely and start with completely new design from scratch.

And this new design should start from the allocation angle and then add the
whole other muck on top so far its possible. Allocation related monitoring
must be the primary focus, everything else is just tinkering.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
@ 2017-01-18  9:56       ` Peter Zijlstra
  2017-01-19 19:59         ` Shivappa Vikas
  2017-01-18 19:41       ` Shivappa Vikas
                         ` (6 subsequent siblings)
  7 siblings, 1 reply; 91+ messages in thread
From: Peter Zijlstra @ 2017-01-18  9:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin

On Wed, Jan 18, 2017 at 09:53:02AM +0100, Thomas Gleixner wrote:
> The whole approach you and David have taken is to whack some desired cgroup
> functionality and whatever into CQM without rethinking the overall
> design. And that's fundamentaly broken because it does not take cache (and
> memory bandwidth) allocation into account.
> 
> I seriously doubt, that the existing CQM/MBM code can be refactored in any
> useful way. As Peter Zijlstra said before: Remove the existing cruft
> completely and start with completely new design from scratch.
> 
> And this new design should start from the allocation angle and then add the
> whole other muck on top so far its possible. Allocation related monitoring
> must be the primary focus, everything else is just tinkering.

Agreed, the little I have seen of these patches is quite horrible. And
there seems to be a definite lack of design; or at the very least an
utter lack of communication of it.

The approach, insofar as I could make sense of it, seems to utterly
rape perf-cgroup. I think Thomas makes a sensible point in trying to
match it to the CAT stuff.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
  2017-01-18  9:56       ` Peter Zijlstra
@ 2017-01-18 19:41       ` Shivappa Vikas
  2017-01-18 21:03       ` David Carrillo-Cisneros
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-18 19:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, davidcc, eranian, linux-kernel,
	x86, hpa, mingo, peterz, ravi.v.shankar, tony.luck, fenghua.yu,
	andi.kleen, h.peter.anvin



On Wed, 18 Jan 2017, Thomas Gleixner wrote:

> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>>> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>>>> - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>>>> zeros or arbitrary numbers.
>>>>
>>>> Fix: Patches fix this by just throwing an error if the mode is not
>>>> supported.
>>>> The modes supported is task monitoring and cgroup monitoring.
>>>> Also the per package
>>>> data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
>>>> option.
>>>> The systemwide data can be looked up by monitoring root cgroup.
>>>
>>> Fine. That just lacks any comment in the implementation. Otherwise I would
>>> not have asked the question about cpu monitoring. Though I fundamentaly
>>> hate the idea of requiring cgroups for this to work.
>>>
>>> If I just want to look at CPU X why on earth do I have to set up all that
>>> cgroup muck? Just because your main focus is cgroups?
>>
>> The upstream per cpu data is broken because its not overriding the other task
>> event RMIDs on that cpu with the cpu event RMID.
>>
>> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
>> to the cpu, however then we miss all the task llc_occupancy in that - still
>> evaluating it.
>
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
>
>  - per cpu CLOSID
>  - per task CLOSID
>
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
>
> So if I configure a system by simply partitioning the cache per cpu, which
> is the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple
> way to monitor the occupancy for that reservation.
>
> And looking at that from the CAT point of view, which is the proper way to
> do it, makes it obvious that CQM should be modeled to match CAT.

Ok, makes sense. Tony and Fenghua had suggested some ideas to model the two
more closely together. Let me do some more brainstorming and try to come up
with a draft that can be discussed.

>
> So lets assume the following:
>
>   CPU 0-3     default CLOSID 0
>   CPU 4       	       CLOSID 1
>   CPU 5	       CLOSID 2
>   CPU 6	       CLOSID 3
>   CPU 7	       CLOSID 3
>
>   T1  		       CLOSID 4
>   T2		       CLOSID 5
>   T3		       CLOSID 6
>   T4		       CLOSID 6
>
>   All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>   they run on.
>
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.
>
> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> care at all about the occupancy of T1 simply because that is running on a
> seperate reservation.

Ok, then we can give the cpu monitoring a priority just like CAT.

> Trying to make that an aggregated value in the first
> place is completely wrong. If you want an aggregate, which is pretty much
> useless, then user space tools can generate it easily.
>
> The whole approach you and David have taken is to whack some desired cgroup
> functionality and whatever into CQM without rethinking the overall
> design. And that's fundamentaly broken because it does not take cache (and
> memory bandwidth) allocation into account.
>
> I seriously doubt, that the existing CQM/MBM code can be refactored in any
> useful way. As Peter Zijlstra said before: Remove the existing cruft
> completely and start with completely new design from scratch.

I missed that Peter had indicated a new design from scratch. I was only
concerned with the implementation given that CAT was still in progress. Since
CAT is upstream now, we may be able to do better.

Thanks,
Vikas

>
> And this new design should start from the allocation angle and then add the
> whole other muck on top so far its possible. Allocation related monitoring
> must be the primary focus, everything else is just tinkering.
>
> Thanks,
>
> 	tglx
>
>
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
  2017-01-18  9:56       ` Peter Zijlstra
  2017-01-18 19:41       ` Shivappa Vikas
@ 2017-01-18 21:03       ` David Carrillo-Cisneros
  2017-01-19 17:41         ` Thomas Gleixner
  2017-01-18 21:16       ` Yu, Fenghua
                         ` (4 subsequent siblings)
  7 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-18 21:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>> > > zeros or arbitrary numbers.
>> > >
>> > > Fix: Patches fix this by just throwing an error if the mode is not
>> > > supported.
>> > > The modes supported is task monitoring and cgroup monitoring.
>> > > Also the per package
>> > > data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
>> > > option.
>> > > The systemwide data can be looked up by monitoring root cgroup.
>> >
>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>> > not have asked the question about cpu monitoring. Though I fundamentaly
>> > hate the idea of requiring cgroups for this to work.
>> >
>> > If I just want to look at CPU X why on earth do I have to set up all that
>> > cgroup muck? Just because your main focus is cgroups?
>>
>> The upstream per cpu data is broken because its not overriding the other task
>> event RMIDs on that cpu with the cpu event RMID.
>>
>> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
>> to the cpu, however then we miss all the task llc_occupancy in that - still
>> evaluating it.
>
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
>
>   - per cpu CLOSID
>   - per task CLOSID
>
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
>
> So if I configure a system by simply partitioning the cache per cpu, which
> is the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple
> way to monitor the occupancy for that reservation.
>
> And looking at that from the CAT point of view, which is the proper way to
> do it, makes it obvious that CQM should be modeled to match CAT.
>
> So lets assume the following:
>
>    CPU 0-3     default CLOSID 0
>    CPU 4               CLOSID 1
>    CPU 5               CLOSID 2
>    CPU 6               CLOSID 3
>    CPU 7               CLOSID 3
>
>    T1                  CLOSID 4
>    T2                  CLOSID 5
>    T3                  CLOSID 6
>    T4                  CLOSID 6
>
>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>    they run on.
>
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.

There are use cases where the RMID to CLOSID mapping is not that simple.
Some of them are:
1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread
during phases that initialize relevant data, while changing it to another during
phases that pollute cache. Yet, we want the RMID to remain the same.

A different variation is to change CLOSID to increase/decrease the size of the
allocated cache when high/low contention is detected.

2. Contention detection. I start with:
   - T1 has RMID 1.
   - T1 changes RMID to 2.
and will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases.
The rate of change will be relative to the level of cache contention present
at the time. This all happens without changing the CLOSID.

>
> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> care at all about the occupancy of T1 simply because that is running on a
> seperate reservation.

It is not useless for scenarios where CLOSIDs and RMIDs change dynamically.
See above.

> Trying to make that an aggregated value in the first
> place is completely wrong. If you want an aggregate, which is pretty much
> useless, then user space tools can generate it easily.

Not useless, see above.

Having user space tools to aggregate implies wasting some of the already
scarce RMIDs.

>
> The whole approach you and David have taken is to whack some desired cgroup
> functionality and whatever into CQM without rethinking the overall
> design. And that's fundamentaly broken because it does not take cache (and
> memory bandwidth) allocation into account.

Monitoring and allocation are closely related yet independent.

I see the advantages of allowing a per-cpu RMID as you describe in the example.

Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond
simply monitoring occupancy per allocation.

>
> I seriously doubt, that the existing CQM/MBM code can be refactored in any
> useful way. As Peter Zijlstra said before: Remove the existing cruft
> completely and start with completely new design from scratch.
>
> And this new design should start from the allocation angle and then add the
> whole other muck on top so far its possible. Allocation related monitoring
> must be the primary focus, everything else is just tinkering.

Assuming that my stated need for more than one RMID per CLOSID or more
than one CLOSID per RMID is recognized, what would be the advantage of
starting the design of monitoring from the allocation perspective?

It's quite doable to create a new version of CQM/CMT without all the
cgroup murk. We can also create an easy way to open events to monitor
CLOSIDs. Yet, I don't see the advantage of dissociating monitoring from
perf and directly building it on top of allocation without the
assumption of 1 CLOSID : 1 RMID.

Thanks,
David

>
> Thanks,
>
>         tglx
>
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
                         ` (2 preceding siblings ...)
  2017-01-18 21:03       ` David Carrillo-Cisneros
@ 2017-01-18 21:16       ` Yu, Fenghua
  2017-01-19  2:09       ` David Carrillo-Cisneros
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 91+ messages in thread
From: Yu, Fenghua @ 2017-01-18 21:16 UTC (permalink / raw)
  To: Thomas Gleixner, Shivappa, Vikas
  Cc: Vikas Shivappa, davidcc, eranian, linux-kernel, x86, hpa, mingo,
	peterz, Shankar, Ravi V, Luck, Tony, Kleen, Andi, Anvin, H Peter

> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
> > On Tue, 17 Jan 2017, Thomas Gleixner wrote:
> > > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> > > > - Issue(1): Inaccurate data for per package data, systemwide. Just
> > > > prints zeros or arbitrary numbers.
> > > >
> > > > Fix: Patches fix this by just throwing an error if the mode is not
> > > > supported.
> > > > The modes supported is task monitoring and cgroup monitoring.
> > > > Also the per package
> > > > data for say socket x is returned with the -C <cpu on socketx> -G
> > > > cgrpy option.
> > > > The systemwide data can be looked up by monitoring root cgroup.
> > >
> > > Fine. That just lacks any comment in the implementation. Otherwise I
> > > would not have asked the question about cpu monitoring. Though I
> > > fundamentaly hate the idea of requiring cgroups for this to work.
> > >
> > > If I just want to look at CPU X why on earth do I have to set up all
> > > that cgroup muck? Just because your main focus is cgroups?
> >
> > The upstream per cpu data is broken because its not overriding the
> > other task event RMIDs on that cpu with the cpu event RMID.
> >
> > Can be fixed by adding a percpu struct to hold the RMID thats
> > affinitized to the cpu, however then we miss all the task
> > llc_occupancy in that - still evaluating it.
> 
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
> 
>   - per cpu CLOSID
>   - per task CLOSID
> 
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
> 
> So if I configure a system by simply partitioning the cache per cpu, which is
> the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple way
> to monitor the occupancy for that reservation.
> 
> And looking at that from the CAT point of view, which is the proper way to do
> it, makes it obvious that CQM should be modeled to match CAT.
> 
> So lets assume the following:
> 
>    CPU 0-3     default CLOSID 0
>    CPU 4               CLOSID 1
>    CPU 5               CLOSID 2
>    CPU 6               CLOSID 3
>    CPU 7               CLOSID 3
> 
>    T1                  CLOSID 4
>    T2                  CLOSID 5
>    T3                  CLOSID 6
>    T4                  CLOSID 6
> 
>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>    they run on.
> 
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.

So the mapping between RMID and CLOSID is a 1:1 mapping, right?

Then we could change the current resctrl interface in the kernel as follows
(a rough sketch follows the list):

1. In rdtgroup_mkdir() (i.e. creating a partition in resctrl), allocate one
   RMID for the partition. The mapping between RMID and CLOSID is set up in
   mkdir.
2. In rdtgroup_rmdir() (i.e. removing a partition in resctrl), free the RMID.
   The mapping between RMID and CLOSID is dismissed.
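
A rough sketch of that kernel-side change (assumptions: an "rmid" field added
to struct rdtgroup and a hypothetical alloc_rmid()/free_rmid() pair managing
the hardware RMID pool; neither exists in the current resctrl code):

/* Sketch only: rdtgrp->rmid and alloc_rmid()/free_rmid() are assumed here. */
static int rdtgroup_assign_rmid(struct rdtgroup *rdtgrp)
{
        int rmid = alloc_rmid();        /* hypothetical helper */

        if (rmid < 0)
                return rmid;            /* out of RMIDs */
        rdtgrp->rmid = rmid;            /* 1:1 with rdtgrp->closid */
        return 0;
}

static void rdtgroup_release_rmid(struct rdtgroup *rdtgrp)
{
        free_rmid(rdtgrp->rmid);        /* hypothetical helper */
        rdtgrp->rmid = 0;
}

rdtgroup_assign_rmid() would then be called from rdtgroup_mkdir() and
rdtgroup_release_rmid() from rdtgroup_rmdir().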

In user space:
1. Create a partition in resctrl, allocate an L3 CBM in schemata, and assign a PID in "tasks".
2. Start a user monitoring tool (e.g. perf) to monitor the PID. The monitoring tool needs to be updated to know the resctrl interface; we may update perf to work with it.

Since the PID is assigned to the partition which has a CLOSID and RMID mapping, the PID is monitored while it's running in the allocated portion of L3.

Is the above proposal the right way to go?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
                         ` (3 preceding siblings ...)
  2017-01-18 21:16       ` Yu, Fenghua
@ 2017-01-19  2:09       ` David Carrillo-Cisneros
  2017-01-19 16:58         ` David Carrillo-Cisneros
  2017-01-19  2:21       ` Vikas Shivappa
                         ` (2 subsequent siblings)
  7 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-19  2:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>> > > zeros or arbitrary numbers.
>> > >
>> > > Fix: Patches fix this by just throwing an error if the mode is not
>> > > supported.
>> > > The modes supported is task monitoring and cgroup monitoring.
>> > > Also the per package
>> > > data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
>> > > option.
>> > > The systemwide data can be looked up by monitoring root cgroup.
>> >
>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>> > not have asked the question about cpu monitoring. Though I fundamentaly
>> > hate the idea of requiring cgroups for this to work.
>> >
>> > If I just want to look at CPU X why on earth do I have to set up all that
>> > cgroup muck? Just because your main focus is cgroups?
>>
>> The upstream per cpu data is broken because its not overriding the other task
>> event RMIDs on that cpu with the cpu event RMID.
>>
>> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
>> to the cpu, however then we miss all the task llc_occupancy in that - still
>> evaluating it.
>
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
>
>   - per cpu CLOSID
>   - per task CLOSID
>
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
>
> So if I configure a system by simply partitioning the cache per cpu, which
> is the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple
> way to monitor the occupancy for that reservation.
>
> And looking at that from the CAT point of view, which is the proper way to
> do it, makes it obvious that CQM should be modeled to match CAT.
>
> So lets assume the following:
>
>    CPU 0-3     default CLOSID 0
>    CPU 4               CLOSID 1
>    CPU 5               CLOSID 2
>    CPU 6               CLOSID 3
>    CPU 7               CLOSID 3
>
>    T1                  CLOSID 4
>    T2                  CLOSID 5
>    T3                  CLOSID 6
>    T4                  CLOSID 6
>
>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>    they run on.
>
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.
>
> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> care at all about the occupancy of T1 simply because that is running on a
> seperate reservation. Trying to make that an aggregated value in the first
> place is completely wrong. If you want an aggregate, which is pretty much
> useless, then user space tools can generate it easily.
>
> The whole approach you and David have taken is to whack some desired cgroup
> functionality and whatever into CQM without rethinking the overall
> design. And that's fundamentaly broken because it does not take cache (and
> memory bandwidth) allocation into account.
>
> I seriously doubt, that the existing CQM/MBM code can be refactored in any
> useful way. As Peter Zijlstra said before: Remove the existing cruft
> completely and start with completely new design from scratch.
>
> And this new design should start from the allocation angle and then add the
> whole other muck on top so far its possible. Allocation related monitoring
> must be the primary focus, everything else is just tinkering.
>

If in this email you meant "Resource group" where you wrote "CLOSID", then
please disregard my previous email. It seems like a good idea to me to have
a 1:1 mapping between RMIDs and "Resource groups".

The distinction matters because changing the schemata in the resource group
would likely trigger a change of CLOSID, which is useful.

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
                         ` (4 preceding siblings ...)
  2017-01-19  2:09       ` David Carrillo-Cisneros
@ 2017-01-19  2:21       ` Vikas Shivappa
  2017-01-19  6:45       ` Stephane Eranian
  2017-01-20  2:32       ` Vikas Shivappa
  7 siblings, 0 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-19  2:21 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin

Based on Thomas's and Peterz's feedback, I can think of two variants which target:

-Support monitoring and allocating using the same resctrl group.
 The user can use a resctrl group to allocate resources and also monitor
 them (with respect to tasks or cpu).
-Allow 'task only' monitoring outside of resctrl. This mode can be used
 when the user wants to override the RMIDs in resctrl or wants
 to monitor more than just the resctrl groups.

	option 1> without modifying the resctrl

In this design everything in the resctrl interface works as
before (the info and resource group files such as tasks and schemata all
remain the same), but each resctrl group is mapped to one RMID as well as
a CLOSID.

We still need a user interface for the user to read the counters. We could
create one file to enable monitoring and another file in the resctrl
directory which reflects the counts, but that may not be efficient since
the user often reads the counts frequently.

For the user interface there may be two options to do this:

	1.a> Build a new user mode interface resmon

Since modifying the existing perf tool to
suit this different h/w architecture does not follow the CAT
interface model, it may well be better to have a dedicated
interface for RDT monitoring (just like we had a new fs for CAT).

$resmon -r <resctrl group> -s <mon_mask> -I <time in ms>

"resctrl group": is the resctrl directory.

"mon_mask:  is a bit mask of logical packages which indicates which packages user is
interested in monitoring.

"time in ms": The time for which the monitoring takes place
(this can potentially be changed to start and stop/read options)

Example 1 (Some examples modeled from resctrl ui documentation)
---------

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores shares L3
with the tasks.

# cd /sys/fs/resctrl

# mkdir p0
# echo "L3:0=3ff" > p0/schemata

Cores 0-1 are assigned to the new group so that the
kernel and the tasks running there get 50% of the cache.

# echo 03 > p0/cpus

monitor the cpus 0-1 for 5s

# resmon -r p0 -s 1 -I 5000


Example 2
---------

A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated 25% of the
cache.

# cd /sys/fs/resctrl

# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678

Monitor the task for 5s on socket zero

# resmon -r p1 -s 1 -I 5000

Example 3
---------

Sometimes the user may just want to profile the cache occupancy first before
assigning any CLOSIDs. This also provides an override option where the user
can monitor some tasks which have, say, CLOS 0 that he is about to place
in a CLOSID based on the amount of cache occupancy. This could apply to
the same real-time tasks above where the user is calibrating the % of cache
that's needed.

monitor a task PIDx on socket 0 for 10s

# resmon -t PIDx -s 1 -I 10000

	1.b> Add a new option to perf apart from supporting the task monitoring
	in perf.

- Monitor a resctrl group.

Introduce a new option for perf "-R" which indicates to monitor a
resctrl group.

$mkdir /sys/fs/resctrl/p1
$echo PID1 > /sys/fs/resctrl/p1/tasks
$echo PID2 > /sys/fs/resctrl/p1/tasks

$perf stat -e llc_occupancy -R p1

would return the count for the resctrl group p1.

- Monitor a task outside of resctrl group ('task only')

In this case, perf can also monitor individual tasks using the -t
option just like before.

$perf stat -e llc_occupancy -t PID1

- Monitor CPUs.

For example 1 above, perf can be used to monitor the resctrl group p0:

$perf stat -e llc_occupancy -R p0

The issue with both options may be what happens when we run out of
RMIDs. For the resctrl groups, since we know the max number of groups that
can be created and the # of CLOSIDs is much smaller than the # of RMIDs,
we reserve an RMID for each resctrl group so there is never a case where
an RMID is not available for a resctrl group.
Task monitoring can use the rest of the RMIDs.

Why do we need separate 'task only' monitoring?
------------------------------------------------

The separate task monitoring option lets
the user use the RMIDs effectively and not be restricted to the # of
CLOSIDs. It also handles scenarios like example 3.

RMID allocation/init
--------------------

resctrl monitoring:
RMIDs are allocated when CLOSIDs are allocated during mkdir. One RMID is
allocated per socket, just like the CLOSID.

task monitoring:
When task events are created, RMIDs are allocated. We can also do a lazy
allocation of RMIDs when the tasks are actually scheduled in on a
socket.

Kernel Scheduling
-----------------

During ctx switch cqm chooses the RMID in the following priority order
(1 is the highest priority), sketched below:

1. if the cpu has an RMID, choose that
2. if the task has an RMID directly tied to it, choose that
3. choose the RMID of the task's resctrl group
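
A rough sketch of that selection (the per-cpu variable and the rmid/rdtgrp
fields on task_struct are illustrative names only, not the actual kernel
data structures):

/* Sketch only: pick the RMID to load at context switch. */
static DEFINE_PER_CPU(u32, cqm_cpu_rmid);

static u32 cqm_pick_rmid(struct task_struct *tsk)
{
        u32 rmid = this_cpu_read(cqm_cpu_rmid);

        if (rmid)                       /* 1. cpu event RMID */
                return rmid;
        if (tsk->rmid)                  /* 2. task event RMID */
                return tsk->rmid;
        return tsk->rdtgrp->rmid;       /* 3. task's resctrl group RMID */
}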

Read
----

When the user asks cqm to retrieve the monitored count, we read the
counter MSR and return the count.


	option 2> Modifying the resctrl

This changes the resctrl schemata interface so that the user inputs
CLOSIDs and RMIDs instead of CBMs.

# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=<closidx>;1=<closidy>" > /sys/fs/resctrl/p0/schemata

There is a mapping between closid and cbm which the user can change.

# echo 0xff > .../config/L3/0/cbm

Display the CLOSids

# ls .../config/L3/
0
1
2
.
.
.
15

As an extension for cqm, this schemata can be modified to also have the
RMIDs chosen by the user. That way the user can configure different RMIDs
for the same CLOSID if needed, like in example 3. Also, since we have
so many more RMIDs than CLOSIDs, the user is not restricted by the number
of resctrl groups he can create (with the current model, the user cannot
create more directories than the number of CLOSIDs).

# echo "L3:0=<closidx>,<RMID1>;1=<closidy>,<RMID2>" > /sys/fs/resctrl/p0/schemata

The user interface to monitor can be the same as shown in design variant #1,
with the difference that this may have less need for the 'task only'
monitoring.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
                         ` (5 preceding siblings ...)
  2017-01-19  2:21       ` Vikas Shivappa
@ 2017-01-19  6:45       ` Stephane Eranian
  2017-01-19 18:03         ` Thomas Gleixner
  2017-01-20  2:32       ` Vikas Shivappa
  7 siblings, 1 reply; 91+ messages in thread
From: Stephane Eranian @ 2017-01-19  6:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, David Carrillo Cisneros, LKML,
	x86, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Shankar,
	Ravi V, Luck, Tony, Fenghua Yu, Kleen, Andi, h.peter.anvin

On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>> > > zeros or arbitrary numbers.
>> > >
>> > > Fix: Patches fix this by just throwing an error if the mode is not
>> > > supported.
>> > > The modes supported is task monitoring and cgroup monitoring.
>> > > Also the per package
>> > > data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
>> > > option.
>> > > The systemwide data can be looked up by monitoring root cgroup.
>> >
>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>> > not have asked the question about cpu monitoring. Though I fundamentaly
>> > hate the idea of requiring cgroups for this to work.
>> >
>> > If I just want to look at CPU X why on earth do I have to set up all that
>> > cgroup muck? Just because your main focus is cgroups?
>>
>> The upstream per cpu data is broken because its not overriding the other task
>> event RMIDs on that cpu with the cpu event RMID.
>>
>> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
>> to the cpu, however then we miss all the task llc_occupancy in that - still
>> evaluating it.
>
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
>
>   - per cpu CLOSID
>   - per task CLOSID
>
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
>
> So if I configure a system by simply partitioning the cache per cpu, which
> is the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple
> way to monitor the occupancy for that reservation.
>
Your use case is specific to HPC and not the Web workloads we run.
Jobs run in cgroups which may span all the CPUs of the machine.
CAT may be used to partition the cache. Cgroups would run inside a partition.
There may be multiple cgroups running in the same partition. I can
understand the value of tracking occupancy per CLOSID, however that
granularity is not enough for our use case. Inside a partition, we want
to know the occupancy of each cgroup to be able to assign blame to the
top consumer. Thus, there needs to be a way to monitor occupancy per
cgroup. I'd like to understand how your proposal would cover this use case.

Another important aspect is that CQM measures new allocations, thus to
get total occupancy you need to be able to monitor the thread, CPU,
CLOSID or cgroup from the beginning of execution; in the case of a
cgroup, from the moment the first thread is scheduled into the cgroup.
To do this an RMID needs to be assigned to the monitored entity from the
beginning. It could be done by creating a CQM event just to cause an
RMID to be assigned, as discussed earlier on this thread. Then if a perf
stat is launched later it will get the same RMID and report full
occupancy. But that requires the first event to remain alive, i.e., some
process must keep the file descriptor open, i.e., we need some daemon or
a perf stat running in the background.

There are also use cases where you want CQM without necessarily enabling
CAT, for instance, if you want to know the cache footprint of a workload
to estimate how it could be co-located with others.

I think any viable proposal needs to be able to support your use case
as well as the ones I described above.

> And looking at that from the CAT point of view, which is the proper way to
> do it, makes it obvious that CQM should be modeled to match CAT.
>
> So lets assume the following:
>
>    CPU 0-3     default CLOSID 0
>    CPU 4               CLOSID 1
>    CPU 5               CLOSID 2
>    CPU 6               CLOSID 3
>    CPU 7               CLOSID 3
>
>    T1                  CLOSID 4
>    T2                  CLOSID 5
>    T3                  CLOSID 6
>    T4                  CLOSID 6
>
>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>    they run on.
>
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.
>
> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> care at all about the occupancy of T1 simply because that is running on a
> seperate reservation. Trying to make that an aggregated value in the first
> place is completely wrong. If you want an aggregate, which is pretty much
> useless, then user space tools can generate it easily.
>
> The whole approach you and David have taken is to whack some desired cgroup
> functionality and whatever into CQM without rethinking the overall
> design. And that's fundamentaly broken because it does not take cache (and
> memory bandwidth) allocation into account.
>
> I seriously doubt, that the existing CQM/MBM code can be refactored in any
> useful way. As Peter Zijlstra said before: Remove the existing cruft
> completely and start with completely new design from scratch.
>
> And this new design should start from the allocation angle and then add the
> whole other muck on top so far its possible. Allocation related monitoring
> must be the primary focus, everything else is just tinkering.
>
> Thanks,
>
>         tglx
>
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-19  2:09       ` David Carrillo-Cisneros
@ 2017-01-19 16:58         ` David Carrillo-Cisneros
  2017-01-19 17:54           ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-19 16:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Wed, Jan 18, 2017 at 6:09 PM, David Carrillo-Cisneros
<davidcc@google.com> wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just prints
>>> > > zeros or arbitrary numbers.
>>> > >
>>> > > Fix: Patches fix this by just throwing an error if the mode is not
>>> > > supported.
>>> > > The modes supported is task monitoring and cgroup monitoring.
>>> > > Also the per package
>>> > > data for say socket x is returned with the -C <cpu on socketx> -G cgrpy
>>> > > option.
>>> > > The systemwide data can be looked up by monitoring root cgroup.
>>> >
>>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>>> > not have asked the question about cpu monitoring. Though I fundamentaly
>>> > hate the idea of requiring cgroups for this to work.
>>> >
>>> > If I just want to look at CPU X why on earth do I have to set up all that
>>> > cgroup muck? Just because your main focus is cgroups?
>>>
>>> The upstream per cpu data is broken because its not overriding the other task
>>> event RMIDs on that cpu with the cpu event RMID.
>>>
>>> Can be fixed by adding a percpu struct to hold the RMID thats affinitized
>>> to the cpu, however then we miss all the task llc_occupancy in that - still
>>> evaluating it.
>>
>> The point here is that CQM is closely connected to the cache allocation
>> technology. After a lengthy discussion we ended up having
>>
>>   - per cpu CLOSID
>>   - per task CLOSID
>>
>> where all tasks which do not have a CLOSID assigned use the CLOSID which is
>> assigned to the CPU they are running on.
>>
>> So if I configure a system by simply partitioning the cache per cpu, which
>> is the proper way to do it for HPC and RT usecases where workloads are
>> partitioned on CPUs as well, then I really want to have an equaly simple
>> way to monitor the occupancy for that reservation.
>>
>> And looking at that from the CAT point of view, which is the proper way to
>> do it, makes it obvious that CQM should be modeled to match CAT.
>>
>> So lets assume the following:
>>
>>    CPU 0-3     default CLOSID 0
>>    CPU 4               CLOSID 1
>>    CPU 5               CLOSID 2
>>    CPU 6               CLOSID 3
>>    CPU 7               CLOSID 3
>>
>>    T1                  CLOSID 4
>>    T2                  CLOSID 5
>>    T3                  CLOSID 6
>>    T4                  CLOSID 6
>>
>>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>>    they run on.
>>
>> then the obvious basic monitoring requirement is to have a RMID for each
>> CLOSID.
>>
>> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
>> care at all about the occupancy of T1 simply because that is running on a
>> seperate reservation. Trying to make that an aggregated value in the first
>> place is completely wrong. If you want an aggregate, which is pretty much
>> useless, then user space tools can generate it easily.
>>
>> The whole approach you and David have taken is to whack some desired cgroup
>> functionality and whatever into CQM without rethinking the overall
>> design. And that's fundamentaly broken because it does not take cache (and
>> memory bandwidth) allocation into account.
>>
>> I seriously doubt, that the existing CQM/MBM code can be refactored in any
>> useful way. As Peter Zijlstra said before: Remove the existing cruft
>> completely and start with completely new design from scratch.
>>
>> And this new design should start from the allocation angle and then add the
>> whole other muck on top so far its possible. Allocation related monitoring
>> must be the primary focus, everything else is just tinkering.
>>
>
> If in this email you meant "Resource group" where you wrote "CLOSID", then
> please disregard my previous email. It seems like a good idea to me to have
> a 1:1 mapping between RMIDs and "Resource groups".
>
> The distinction matter because changing the schemata in the resource group
> would likely trigger a change of CLOSID, which is useful.
>

Just realized that the sharing of CLOSIDs is not part of the accepted
version of RDT. My mental model was still on the old CAT driver that did
allow sharing of CLOSIDs between cgroups. Now I understand why a CLOSID
was assumed to be equivalent to a "Resource group". Sorry for the noise.
Then the comments in my previous email hold.

In summary and in addition to the latest emails:

A 1:1 mapping from CLOSID/"Resource group" to RMID, as Fenghua suggested,
is very problematic because the number of CLOSIDs is much, much smaller
than the number of RMIDs, and, as Stephane mentioned, it's a common use
case to want to independently monitor many tasks/cgroups inside an
allocation partition.

A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement for
cgroup monitoring, but the case where the CLOSID changes would be messy.
For llc_occupancy, if RMIDs are changed, old RMIDs still hold valid
occupancy for an indefinite time, so either the RMIDs would have to be
preserved (breaking the 1:many mapping) or the old RMIDs should be
tracked while they are dirty.
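
For the "tracked while they are dirty" part, one option (sketch only, all
names made up) is a limbo list of retired RMIDs that is polled until their
llc_occupancy drops below a threshold before they are recycled:

/* Sketch only: __rmid_read() stands for the llc_occupancy MSR read. */
struct limbo_rmid {
        struct list_head list;
        u32 rmid;
};

static LIST_HEAD(rmid_limbo);

static void rmid_limbo_scan(u64 threshold_bytes)
{
        struct limbo_rmid *e, *tmp;

        list_for_each_entry_safe(e, tmp, &rmid_limbo, list) {
                if (__rmid_read(e->rmid) < threshold_bytes) {
                        list_del(&e->list);
                        free_rmid(e->rmid);     /* hypothetical helper */
                        kfree(e);
                }
        }
}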

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18 21:03       ` David Carrillo-Cisneros
@ 2017-01-19 17:41         ` Thomas Gleixner
  2017-01-20  7:37           ` David Carrillo-Cisneros
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-19 17:41 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Wed, 18 Jan 2017, David Carrillo-Cisneros wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> There are use cases where the RMID to CLOSID mapping is not that simple.
> Some of them are:
>
> 1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread
> during phases that initialize relevant data, while changing it to another during
> phases that pollute cache. Yet, we want the RMID to remain the same.

That's fine. I did not say that you need fixed RMID <-> CLOSID mappings. The
point is that monitoring across different CLOSID domains is pointless.

I have no idea how you want to do that with the proposed implementation to
switch the RMID of the thread on the fly, but that's a different story.

> A different variation is to change CLOSID to increase/decrease the size of the
> allocated cache when high/low contention is detected.
> 
> 2. Contention detection. I start with:
>    - T1 has RMID 1.
>    - T1 changes RMID to 2.
>  will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases.

Of course RMID1 decreases because it's no longer in use. Oh well.

> The rate of change will be relative to the level of cache contention present
> at the time. This all happens without changing the CLOSID.

See above.

> >
> > So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> > care at all about the occupancy of T1 simply because that is running on a
> > seperate reservation.
> 
> It is not useless for scenarios where CLOSID and RMIDs change dynamically
> See above.

Above you are talking about the same CLOSID and different RMIDS and not
about changing both.

> > Trying to make that an aggregated value in the first
> > place is completely wrong. If you want an aggregate, which is pretty much
> > useless, then user space tools can generate it easily.
> 
> Not useless, see above.

It is pretty useless, because CPU4 has CLOSID1 while T1 has CLOSID4 and
making an aggregate over those two has absolutely nothing to do with your
scenario above.

If you want the aggregate value, then create it in user space and oracle
(or should I say google) out of it whatever you want, but do not impose
that on the kernel.

> Having user space tools to aggregate implies wasting some of the already
> scarce RMIDs.

Oh well. Can you please explain how you want to monitor the scenario I
explained above:

CPU4	  CLOSID 1
T1	  CLOSID 4

So if T1 runs on CPU4 then it uses CLOSID 4 which does not at all affect
the cache occupancy of CLOSID 1. So if you use the same RMID then you
pollute either the information of CPU4 (CLOSID1) or the information of T1
(CLOSID4)

To gather any useful information for both CPU4 and T1 you need TWO
RMIDs. Everything else is voodoo and crystal ball analysis and we are not
going to support that.
 
> > The whole approach you and David have taken is to whack some desired cgroup
> > functionality and whatever into CQM without rethinking the overall
> > design. And that's fundamentaly broken because it does not take cache (and
> > memory bandwidth) allocation into account.
> 
> Monitoring and allocation are closely related yet independent.

Independent to some degree. Sure you can claim they are completely
independent, but lots of the resulting combinations make absolutely no
sense at all. And we really don't want to support non-sensical measurements
just because we can. The outcome of this is complexity, inaccuracy and code
which is too horrible to look at.

> I see the advantages of allowing a per-cpu RMID as you describe in the example.
> 
> Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond
> one simply monitoring occupancy per allocation.

I agree there are use cases where you want to monitor across allocations,
like monitoring a task which has no CLOSID assigned and runs on different
CPUs and therefore potentially on different CLOSIDs which are assigned to
the different CPUs.

That's fine and you want a separate RMID for this.

But once you have a fixed CLOSID association then reusing and aggregating
across CLOSID domains is more than useless.

> > I seriously doubt, that the existing CQM/MBM code can be refactored in any
> > useful way. As Peter Zijlstra said before: Remove the existing cruft
> > completely and start with completely new design from scratch.
> >
> > And this new design should start from the allocation angle and then add the
> > whole other muck on top so far its possible. Allocation related monitoring
> > must be the primary focus, everything else is just tinkering.
> 
> Assuming that my stated need for more than one RMID per CLOSID or more
> than one CLOSID per RMID is recognized, what would be the advantage of
> starting the design of monitoring from the allocation perspective?
>
> It's quite doable to create a new version of CQM/CMT without all the
> cgroup murk.
>
> We can also create an easy way to open events to monitor CLOSIDs. Yet, I
> don't see the advantage of dissociating monitoring from perf and directly
> building in on top of allocation without the assumption of 1 CLOSID : 1
> RMID.

I did not say that you need to remove it from perf. perf is still going to
be the interface to interact with monitoring, but it needs to be done in a
way which makes sense. The current cgroup focussed proposal which is
completely oblivious of the allocation mechanism does not make any sense to
me at all.

Starting the design from the allocation POV makes a lot of sense because
that's the point where you start to make the decisions about useful and
useless monitoring choices. And limiting the choices is the best way to
limit the RMID exhaustion in the first place.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-19 16:58         ` David Carrillo-Cisneros
@ 2017-01-19 17:54           ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-19 17:54 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> A 1:1 mapping between CLOSID/"Resource group" to RMID, as Fenghua suggested
> is very problematic because the number of CLOSIDs is much much smaller than the
> number of RMIDs, and, as Stephane mentioned it's a common use case to want to
> independently monitor many task/cgroups inside an allocation partition.

Again, that was not my intention. I just want to limit the combinations.

> A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement of
> cgroup monitoring but the case where CLOSID changes would be messy. In

CLOSIDs of RDT groups do not change. They are allocated when the group is
created.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-19  6:45       ` Stephane Eranian
@ 2017-01-19 18:03         ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-19 18:03 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Shivappa Vikas, Vikas Shivappa, David Carrillo Cisneros, LKML,
	x86, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Shankar,
	Ravi V, Luck, Tony, Fenghua Yu, Kleen, Andi, h.peter.anvin

On Wed, 18 Jan 2017, Stephane Eranian wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >

> Your use case is specific to HPC and not Web workloads we run.  Jobs run
> in cgroups which may span all the CPUs of the machine.  CAT may be used
> to partition the cache. Cgroups would run inside a partition.  There may
> be multiple cgroups running in the same partition. I can understand the
> value of tracking occupancy per CLOSID, however that granularity is not
> enough for our use case.  Inside a partition, we want to know the
> occupancy of each cgroup to be able to assign blame to the top
> consumer. Thus, there needs to be a way to monitor occupancy per
> cgroup. I'd like to understand how your proposal would cover this use
> case.

The point I'm making as I explained to David is that we need to start from
the allocation angle. Of course you can monitor different tasks or task
groups inside an allocation.

> Another important aspect is that CQM measures new allocations, thus to
> get total occupancy you need to be able to monitor the thread, CPU,
> CLOSid or cgroup from the beginning of execution. In the case of a cgroup
> from the moment where the first thread is scheduled into the cgroup. To
> do this a RMID needs to be assigned from the beginning to the entity to
> be monitored.  It could be by creating a CQM event just to cause an RMID
> to be assigned as discussed earlier on this thread. And then if a perf
> stat is launched later it will get the same RMID and report full
> occupancy. But that requires the first event to remain alive, i.e., some
> process must keep the file descriptor open, i.e., need some daemon or a
> perf stat running in the background.

That's fine, but there must be a less convoluted way to do that. The
currently proposed stuff is simply horrible because it lacks any form of
design and is just hacked into submission.

> There are also use cases where you want CQM without necessarily enabling
> CAT, for instance, if you want to know the cache footprint of a workload
> to estimate how if it could be co-located with others.

That's a subset of the other stuff because it's all bound to CLOSID 0. So
you can again monitor tasks or task groups separately.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  9:56       ` Peter Zijlstra
@ 2017-01-19 19:59         ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-19 19:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Shivappa Vikas, Vikas Shivappa, davidcc,
	eranian, linux-kernel, x86, hpa, mingo, ravi.v.shankar,
	tony.luck, fenghua.yu, andi.kleen, h.peter.anvin


Hello Peterz,

On Wed, 18 Jan 2017, Peter Zijlstra wrote:

> On Wed, Jan 18, 2017 at 09:53:02AM +0100, Thomas Gleixner wrote:
>> The whole approach you and David have taken is to whack some desired cgroup
>> functionality and whatever into CQM without rethinking the overall
>> design. And that's fundamentaly broken because it does not take cache (and
>> memory bandwidth) allocation into account.
>>
>> I seriously doubt, that the existing CQM/MBM code can be refactored in any
>> useful way. As Peter Zijlstra said before: Remove the existing cruft
>> completely and start with completely new design from scratch.
>>
>> And this new design should start from the allocation angle and then add the
>> whole other muck on top so far its possible. Allocation related monitoring
>> must be the primary focus, everything else is just tinkering.
>
> Agreed, the little I have seen of these patches is quite horrible. And
> there seems to be a definite lack of design; or at the very least an
> utter lack of communication of it.

The 1/12 Documentation patch describes the interface. Basically we are just
trying to support task and cgroup monitoring.

By the design document, do you mean a document describing how we enable the
cgroup for cqm, since it's a special case?
(This would include all the arch_info in the perf_cgroup we add to keep track
of the hierarchy in the driver, etc.)

Thanks,
Vikas

>
> The approach, in so far that I could make sense of it, seems to utterly
> rape perf-cgroup. I think Thomas makes a sensible point in trying to
> match it to the CAT stuffs.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-18  8:53     ` Thomas Gleixner
                         ` (6 preceding siblings ...)
  2017-01-19  6:45       ` Stephane Eranian
@ 2017-01-20  2:32       ` Vikas Shivappa
  2017-01-20  7:58         ` David Carrillo-Cisneros
  2017-01-20 19:31         ` Stephane Eranian
  7 siblings, 2 replies; 91+ messages in thread
From: Vikas Shivappa @ 2017-01-20  2:32 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: davidcc, eranian, linux-kernel, x86, hpa, mingo, peterz,
	ravi.v.shankar, tony.luck, fenghua.yu, andi.kleen, h.peter.anvin,
	tglx

Resending, including Thomas, also with some changes. Sorry for the spam.

Based on Thomas's and Peterz's feedback, I can think of two design
variants which target:

-Support monitoring and allocating using the same resctrl group.
The user can use a resctrl group to allocate resources and also monitor
them (with respect to tasks or cpu).

-Also allow monitoring outside of resctrl so that the user can
monitor subgroups which use the same closid. This mode can be used
when the user wants to monitor more than just the resctrl groups.

The first design version uses and modifies perf_cgroup; the second version
builds a new interface, resmon. The first version is close to the patches
already sent, with some additions/changes. This includes details of the
design as per Thomas's/Peterz's feedback.

1> First Design option: without modifying the resctrl and using perf
--------------------------------------------------------------------
--------------------------------------------------------------------

In this design everything in the resctrl interface works as
before (the info and resource group files such as tasks and schemata all
remain the same).


Monitor cqm using perf
----------------------

perf can monitor individual tasks using the -t
option just like before.

# perf stat -e llc_occupancy -t PID1,PID2

The user can monitor the cpu occupancy using the -C option in perf:

# perf stat -e llc_occupancy -C 5

The following shows how the user can monitor cgroup occupancy:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks

# perf stat -e intel_cqm/llc_occupancy/ -a -G g2

To monitor a resctrl group, the user can put the same tasks that are in
the resctrl group into a cgroup.

To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
group p1 to cgroup g1:

# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks

Introducing a new option for resctrl may complicate monitoring because
supporting both cgroup 'task groups' and resctrl 'task groups' leads to
situations where, if the groups intersect, there is no way to know which
L3 allocations contribute to which group.

ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4

The only way to get occupancy for both g1 and p1 would be to allocate an
RMID for each task, which can just as well be done with the -t option.

Monitoring cqm cgroups Implementation
-------------------------------------

When monitoring two different cgroups in the same hierarchy (e.g. g11
has an ancestor g1 and both are being monitored, as shown below) we
need the g11 counts to be considered for g1 as well.

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g1/g11

When measuring llc_occupancy for g1 we cannot write two different RMIDs
during context switch to measure the occupancy for both g1 and g11
(because we need to count for g11 as well).
Hence the driver maintains this information and, during ctx switch, writes
the RMID of the lowest member in the ancestry which is being monitored.

The cqm_info is added to the perf_cgroup structure to maintain this
information. The structure is allocated and destroyed at css_alloc and
css_free. All the events tied to a cgroup can use the same
information while reading the counts.

struct perf_cgroup {
#ifdef CONFIG_INTEL_RDT_M
	void *cqm_info;			/* per-cgroup cqm state, see below */
#endif
...
};

struct cqm_info {
	bool mon_enabled;		/* this cgroup is monitored directly */
	int level;			/* depth in the hierarchy, root = 0 */
	u32 *rmid;			/* per-package RMIDs for this cgroup */
	struct cqm_info *mfa;		/* nearest monitored ancestor */
	struct list_head tskmon_rlist;	/* tasks monitored within this cgroup */
};

Due to the hierarchical nature of cgroups, every cgroup just
monitors for the 'nearest monitored ancestor' at all times.
Since the root cgroup is always monitored, all descendants
at boot time monitor for root and hence every mfa points to root,
except for root->mfa which is NULL.

1. RMID setup: when cgroup x starts monitoring:
   for each descendant y, if y's mfa->level < x->level, then
   y->mfa = x. (Where the level of the root node = 0...)
2. sched_in: during sched_in for x,
   if (x->mon_enabled) choose x->rmid,
     else choose x->mfa->rmid.
3. read: for each descendant y of cgroup x,
   if (y->mon_enabled) count += rmid_read(y->rmid).
4. evt_destroy: for each descendant y of x, if (y->mfa == x)
   then y->mfa = x->mfa. Meaning, if any descendant was monitoring for
   x, set that descendant to monitor for the cgroup which x was
   monitoring for.

To monitor a task in a cgroup x along with monitoring cgroup x itself,
cqm_info maintains a list of tasks that are being monitored in the
cgroup.

When a task which belongs to a cgroup x is being monitored, it
always uses its own task->rmid during sched_in, even if cgroup x is monitored.
To account for the counts of such tasks, the cgroup keeps this list
and parses it during read.
taskmon_rlist is used to maintain the list. The list is modified when a
task is attached to the cgroup or removed from the group.
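
A sketch of the sched_in side of this (task_cqm_info() and
cqm_task_in_rlist() are assumed helpers, and tsk->rmid is an illustrative
field; the struct names mirror the cqm_info shown above):

/* Sketch only: pick the RMID for a task on package 'pkg' at sched_in. */
static u32 cqm_cgroup_rmid(struct task_struct *tsk, int pkg)
{
        struct cqm_info *ci = task_cqm_info(tsk);       /* assumed lookup */

        if (tsk->rmid && cqm_task_in_rlist(ci, tsk))
                return tsk->rmid;               /* task monitored directly */
        if (ci->mon_enabled)
                return ci->rmid[pkg];           /* this cgroup is monitored */
        return ci->mfa->rmid[pkg];              /* nearest monitored ancestor */
}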

Example 1 (Some examples modeled from resctrl ui documentation)
---------

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores shares L3
with the tasks.

# cd /sys/fs/resctrl

# mkdir p0
# echo "L3:0=3ff" > p0/schemata

Cores 0-1 are assigned to the new group so that the
kernel and the tasks running there get 50% of the cache.

# echo 03 > p0/cpus

monitor the cpus 0-1 

# perf stat -e llc_occupancy -C 0-1

Example 2
---------

A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated 25% of the
cache.

# cd /sys/fs/resctrl

# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678

To monitor the same group of tasks, create a cgroup g1 and add the task
to it:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
# perf stat -e llc_occupancy -a -G g1

Example 3
---------

Sometimes the user may just want to profile the cache occupancy first before
assigning any CLOSIDs. This also provides an override option where the user
can monitor some tasks which have, say, CLOS 0 that he is about to place
in a CLOSID based on the amount of cache occupancy. This could apply to
the same real-time tasks above where the user is calibrating the % of cache
that's needed.

# perf stat -e llc_occupancy -t PIDx,PIDy

RMID allocation
---------------

RMIDs are allocated per package to achieve better scaling of RMIDs.
RMIDs are plentiful (2-4 per logical processor) and also are per package,
meaning a two-socket system has twice the number of RMIDs.
If we still run out of RMIDs, an error is returned saying that monitoring
wasn't possible because no RMID was available.
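
A minimal per-package free-bitmap allocator could look like this (sketch
only; struct pkg_rmid_pool and alloc_rmid_on_pkg() are illustrative names,
not the actual implementation; the max RMID comes from CPUID.0xF.1:ECX):

/* Sketch only: RMID 0 is reserved for the default (root) group. */
struct pkg_rmid_pool {
        raw_spinlock_t lock;
        unsigned long *free_map;        /* bitmap of free RMIDs */
        u32 max_rmid;
};

static int alloc_rmid_on_pkg(struct pkg_rmid_pool *p)
{
        u32 rmid;

        raw_spin_lock(&p->lock);
        rmid = find_next_bit(p->free_map, p->max_rmid + 1, 1);
        if (rmid > p->max_rmid) {
                raw_spin_unlock(&p->lock);
                return -ENOSPC;         /* out of RMIDs: report the error */
        }
        clear_bit(rmid, p->free_map);
        raw_spin_unlock(&p->lock);
        return rmid;
}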

Kernel Scheduling
-----------------

During ctx switch cqm chooses the RMID in the following priority order
(the chosen RMID is then written to PQR_ASSOC; see the sketch below):

1. if the cpu has an RMID, choose that
2. if the task has an RMID directly tied to it, choose that (task is
   monitored)
3. choose the RMID of the task's cgroup (by default tasks belong to the
   root cgroup with RMID 0)
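
On the MSR side this ends up as a single write of the chosen RMID together
with the CLOSID into IA32_PQR_ASSOC, roughly as below (sketch; the real code
would cache the previous values to avoid redundant MSR writes):

#define MSR_IA32_PQR_ASSOC	0x0c8f

/*
 * Sketch only: PQR_ASSOC takes the RMID in bits 9:0 and the CLOSID in
 * bits 63:32.
 */
static void __cqm_sched_in(u32 rmid, u32 closid)
{
        wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}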

Read
----

When the user asks cqm to retrieve the monitored count, we read the
counter MSR and return the count. For a cgroup hierarchy, the count is
measured as explained in the cgroup implementation section, by traversing
the cgroup hierarchy.
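
The counter read itself is the usual two-MSR sequence: select the RMID and
event in IA32_QM_EVTSEL, then read IA32_QM_CTR. A sketch (the Error and
Unavailable bits and the CPUID-reported upscaling factor are left out):

#define MSR_IA32_QM_EVTSEL	0x0c8d
#define MSR_IA32_QM_CTR		0x0c8e
#define QM_EVT_LLC_OCCUPANCY	1

/* Sketch only: read llc_occupancy for one RMID on the local package. */
static u64 rmid_read(u32 rmid)
{
        u64 val;

        wrmsr(MSR_IA32_QM_EVTSEL, QM_EVT_LLC_OCCUPANCY, rmid);
        rdmsrl(MSR_IA32_QM_CTR, val);
        return val;
}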


2> Second Design option: Build a new usermode tool resmon
---------------------------------------------------------
---------------------------------------------------------

In this design everything in the resctrl interface works as
before (the info and resource group files such as tasks and schemata all
remain the same).

This version supports monitoring resctrl groups directly.
But we need a user interface for the user to read the counters. We could
create one file to enable monitoring and another file in the resctrl
directory which reflects the counts, but that may not be efficient since
the user often reads the counts frequently.

Build a new user mode interface resmon
--------------------------------------

Since modifying the existing perf tool to
suit this different h/w architecture does not follow the CAT
interface model, it may well be better to have a dedicated
interface for RDT monitoring (just like we had a new fs for CAT).

resmon supports monitoring a resctrl group or a task. The two modes may
provide enough granularity for monitoring:
-can monitor cpu data.
-can monitor per resctrl group data.
-can choose a custom subset of tasks within a resctrl group and monitor them.

# resmon [<options>]
-r <resctrl group> 
-t <PID>
-s <mon_mask> 
-I <time in ms>

"resctrl group": is the resctrl directory.

"mon_mask:  is a bit mask of logical packages which indicates which packages user is
interested in monitoring.

"time in ms": The time for which the monitoring takes place
(this can potentially be changed to start and stop/read options)

Example 1 (Some examples modeled from resctrl ui documentation)
---------

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores shares L3
with the tasks.

# cd /sys/fs/resctrl
# mkdir p0
# echo "L3:0=3ff" > p0/schemata

Cores 0-1 are assigned to the new group so that the
kernel and the tasks running there get 50% of the cache.

# echo 03 > p0/cpus

monitor the cpus 0-1 for 10s.

# resmon -r p0 -s 1 -I 10000

Example 2
---------

A real-time task running on cpus 2-3 (socket 0) is allocated a dedicated 25% of the
cache.

# cd /sys/fs/resctrl

# mkdir p1
# echo "L3:0=0f00;1=ffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2-3 5678

Monitor the task for 5s on socket zero

# resmon -r p1 -s 1 -I 5000

Example 3
---------

Sometimes the user may just want to profile the cache occupancy first before
assigning any CLOSIDs. This also provides an override option where the user
can monitor some tasks which have, say, CLOS 0 that he is about to place
in a CLOSID based on the amount of cache occupancy. This could apply to
the same real-time tasks above where the user is calibrating the % of cache
that's needed.

# resmon -t PIDx,PIDy -s 1 -I 10000

This returns the sum of the counts of PIDx and PIDy.

RMID Allocation
---------------

This would remain the same as in design version 1, where we support
per-package RMIDs and return an error when we run out of RMIDs due to the
h/w-limited number of RMIDs.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-19 17:41         ` Thomas Gleixner
@ 2017-01-20  7:37           ` David Carrillo-Cisneros
  2017-01-20  8:30             ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-20  7:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 18 Jan 2017, David Carrillo-Cisneros wrote:
>> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> There are use cases where the RMID to CLOSID mapping is not that simple.
>> Some of them are:
>>
>> 1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread
>> during phases that initialize relevant data, while changing it to another during
>> phases that pollute cache. Yet, we want the RMID to remain the same.
>
> That's fine. I did not say that you need fixed RMD <-> CLOSID mappings. The
> point is that monitoring across different CLOSID domains is pointless.
>
> I have no idea how you want to do that with the proposed implementation to
> switch the RMID of the thread on the fly, but that's a different story.
>
>> A different variation is to change CLOSID to increase/decrease the size of the
>> allocated cache when high/low contention is detected.
>>
>> 2. Contention detection. I start with:
>>    - T1 has RMID 1.
>>    - T1 changes RMID to 2.
>>  will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases.
>
> Of course does RMID1 decrease because it's not longer in use. Oh well.
>
>> The rate of change will be relative to the level of cache contention present
>> at the time. This all happens without changing the CLOSID.
>
> See above.
>
>> >
>> > So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
>> > care at all about the occupancy of T1 simply because that is running on a
>> > seperate reservation.
>>
>> It is not useless for scenarios where CLOSID and RMIDs change dynamically
>> See above.
>
> Above you are talking about the same CLOSID and different RMIDS and not
> about changing both.

The scenario I talked about implies changing the CLOSID without affecting
monitoring. It happens when the allocation needs of a thread/cgroup/CPU
change dynamically. Forcing the RMID to change together with the CLOSID
would give wrong monitoring values unless the old RMID is kept around
until it becomes free, which is ugly and would waste an RMID.

>
>> > Trying to make that an aggregated value in the first
>> > place is completely wrong. If you want an aggregate, which is pretty much
>> > useless, then user space tools can generate it easily.
>>
>> Not useless, see above.
>
> It is prettey useless, because CPU4 has CLOSID1 while T1 has CLOSID4 and
> making an aggregate over those two has absolutely nothing to do with your
> scenario above.

That's true. It is useless in the case you mentioned. I erroneously
interpreted the "useless" in your comment as a general statement about
aggregating RMID occupancies.

>
> If you want the aggregate value, then create it in user space and oracle
> (or should I say google) out of it whatever you want, but do not impose
> that to the kernel.
>
>> Having user space tools to aggregate implies wasting some of the already
>> scarce RMIDs.
>
> Oh well. Can you please explain how you want to monitor the scenario I
> explained above:
>
> CPU4      CLOSID 1
> T1        CLOSID 4
>
> So if T1 runs on CPU4 then it uses CLOSID 4 which does not at all affect
> the cache occupancy of CLOSID 1. So if you use the same RMID then you
> pollute either the information of CPU4 (CLOSID1) or the information of T1
> (CLOSID4)
>
> To gather any useful information for both CPU1 and T1 you need TWO
> RMIDs. Everything else is voodoo and crystal ball analysis and we are not
> going to support that.
>

Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
just because the CLOSID changed is wasteful.

>> > The whole approach you and David have taken is to whack some desired cgroup
>> > functionality and whatever into CQM without rethinking the overall
>> > design. And that's fundamentaly broken because it does not take cache (and
>> > memory bandwidth) allocation into account.
>>
>> Monitoring and allocation are closely related yet independent.
>
> Independent to some degree. Sure you can claim they are completely
> independent, but lots of the resulting combinations make absolutely no
> sense at all. And we really don't want to support non-sensical measurements
> just because we can. The outcome of this is complexity, inaccuracy and code
> which is too horrible to look at.
>
>> I see the advantages of allowing a per-cpu RMID as you describe in the example.
>>
>> Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond
>> one simply monitoring occupancy per allocation.
>
> I agree there are use cases where you want to monitor across allocations,
> like monitoring a task which has no CLOSID assigned and runs on different
> CPUs and therefor potentially on different CLOSIDs which are assigned to
> the different CPUs.
>
> That's fine and you want a seperate RMID for this.
>
> But once you have a fixed CLOSID association then reusing and aggregating
> across CLOSID domains is more than useless.
>

Correct. But there may not be a fixed CLOSID association if loads
exhibit dynamic behavior and/or system load changes dynamically.

>> > I seriously doubt, that the existing CQM/MBM code can be refactored in any
>> > useful way. As Peter Zijlstra said before: Remove the existing cruft
>> > completely and start with completely new design from scratch.
>> >
>> > And this new design should start from the allocation angle and then add the
>> > whole other muck on top so far its possible. Allocation related monitoring
>> > must be the primary focus, everything else is just tinkering.
>>
>> Assuming that my stated need for more than one RMID per CLOSID or more
>> than one CLOSID per RMID is recognized, what would be the advantage of
>> starting the design of monitoring from the allocation perspective?
>>
>> It's quite doable to create a new version of CQM/CMT without all the
>> cgroup murk.
>>
>> We can also create an easy way to open events to monitor CLOSIDs. Yet, I
>> don't see the advantage of dissociating monitoring from perf and directly
>> building in on top of allocation without the assumption of 1 CLOSID : 1
>> RMID.
>
> I did not say that you need to remove it from perf. perf is still going to
> be the interface to interact with monitoring, but it needs to be done in a
> way which makes sense. The current cgroup focussed proposal which is
> completely oblivious of the allocation mechanism does not make any sense to
> me at all.
>
> Starting the design from the allocation POV makes a lot of sense because
> that's the point where you start to make the decisions about useful and
> useless monitoring choices. And limiting the choices is the best way to
> limit the RMID exhaustion in the first place.

Thanks for the extra explanation.
David

>
> Thanks,
>
>         tglx
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20  2:32       ` Vikas Shivappa
@ 2017-01-20  7:58         ` David Carrillo-Cisneros
  2017-01-20 13:28           ` Thomas Gleixner
  2017-01-20 20:40           ` Shivappa Vikas
  2017-01-20 19:31         ` Stephane Eranian
  1 sibling, 2 replies; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-20  7:58 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, Stephane Eranian, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck, Tony,
	Fenghua Yu, andi.kleen, H. Peter Anvin, Thomas Gleixner

On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
<vikas.shivappa@linux.intel.com> wrote:
> Resending including Thomas , also with some changes. Sorry for the spam
>
> Based on Thomas and Peterz feedback Can think of two design
> variants which target:
>
> -Support monitoring and allocating using the same resctrl group.
> user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or cpu)
>
> -Also allows monitoring outside of resctrl so that user can
> monitor subgroups who use the same closid. This mode can be used
> when user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup, second version
> builds a new interface resmon.

The second version would require building a whole new set of tools,
deploying them and maintaining them. Users would have to run perf for
certain events and resmon (or whatever the new tool ends up being
called) for RDT. I see that as too complex and would much prefer to keep
using perf.

> The first version is close to the patches
> sent with some additions/changes. This includes details of the design as
> per Thomas/Peterz feedback.
>
> 1> First Design option: without modifying the resctrl and using perf
> --------------------------------------------------------------------
> --------------------------------------------------------------------
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same)
>
>
> Monitor cqm using perf
> ----------------------
>
> perf can monitor individual tasks using the -t
> option just like before.
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> user can monitor the cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>
> To monitor a resctrl group, user can group the same tasks in resctrl
> group into the cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where:
> if the groups intersect, then there is no way to know what
> l3_allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for g1 and p1 would be to allocate an RMID
> for each task which can as well be done with the -t option.

That's simply recreating the resctrl group as a cgroup.

I think that the main advantage of doing allocation first is that we
could reuse the context switch hook of RDT allocation and greatly
simplify the PMU side of it.

If resctrl groups could lift the restriction of one resctrl group per
CLOSID, then the user could create many resctrl groups in the way perf
cgroups are created now. The advantage is that there won't be a cgroup
hierarchy, making things much simpler. There is also no need to optimize
the perf event context switch to make llc_occupancy work.

Then we only need a way to tell the perf_event_open syscall that
monitoring must happen on a given resctrl group.

My first thought is to have an "rdt_monitor" file per resctrl group. A
user passes it to perf_event_open the way cgroup file descriptors are
passed now. We could extend the meaning of the PERF_FLAG_PID_CGROUP flag
to also cover rdt_monitor files. The syscall can figure out whether it
is a cgroup or an rdt group. The rdt_monitoring PMU would only work with
rdt_monitor groups.

The rdt_monitoring PMU would then be pretty dumb, having neither task
nor CPU contexts, and would just provide the pmu->read and
pmu->event_init functions, as sketched below.
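
To make concrete how little such a PMU would need, here is a rough
sketch of that idea (the helper rdt_mon_read_rmid() and all other names
are invented for illustration; they are not from any posted patch):

/* forward declaration so event_init can check the dynamically assigned type */
static struct pmu rdt_mon_pmu;

static int rdt_mon_event_init(struct perf_event *event)
{
        if (event->attr.type != rdt_mon_pmu.type)
                return -ENOENT;
        /* here: resolve the rdt_monitor group behind the fd and pin its RMID */
        return 0;
}

static void rdt_mon_event_read(struct perf_event *event)
{
        /* rdt_mon_read_rmid() is an assumed helper summing QM_CTR per package */
        local64_set(&event->count, rdt_mon_read_rmid(event));
}

/* perf core still calls add/del/start/stop, but they can be no-ops here */
static int  rdt_mon_event_add(struct perf_event *event, int flags)   { return 0; }
static void rdt_mon_event_del(struct perf_event *event, int flags)   { }
static void rdt_mon_event_start(struct perf_event *event, int flags) { }
static void rdt_mon_event_stop(struct perf_event *event, int flags)  { }

static struct pmu rdt_mon_pmu = {
        .task_ctx_nr    = perf_invalid_context, /* neither task nor CPU contexts */
        .event_init     = rdt_mon_event_init,
        .add            = rdt_mon_event_add,
        .del            = rdt_mon_event_del,
        .start          = rdt_mon_event_start,
        .stop           = rdt_mon_event_stop,
        .read           = rdt_mon_event_read,
};

/* registered once at boot: perf_pmu_register(&rdt_mon_pmu, "rdt_monitoring", -1); */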

Task monitoring can be done with resctrl as well, by adding the PID to a
new resctrl group and opening the event on it. And since we'd allow a
CLOSID to be shared between resctrl groups, allocation wouldn't break.

It's a first idea, so please don't hate too hard ;).

David

>
> Monitoring cqm cgroups Implementation
> -------------------------------------
>
> When monitoring two different cgroups in the same hierarchy (ex say g11
> has an ancestor g1 which are both being monitored as shown below) we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring for g1 llc_occupancy we cannot write two different RMIDs
> (because we need to count for g11 as well)
> during context switch to measure the occupancy for both g1 and g11.
> Hence the driver maintains this information and writes the RMID of the
> lowest member in the ancestory which is being monitored during ctx
> switch.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same
> information while reading the counts.
>
> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
>         void *cqm_info;
> #endif
> ...
>
>  }
>
> struct cqm_info {
>   bool mon_enabled;
>   int level;
>   u32 *rmid;
>   struct cgrp_cqm_info *mfa;
>   struct list_head tskmon_rlist;
>  };
>
> Due to the hierarchical nature of cgroups, every cgroup just
> monitors for the 'nearest monitored ancestor' at all times.
> Since root cgroup is always monitored, all descendents
> at boot time monitor for root and hence all mfa points to root
> except for root->mfa which is NULL.
>
> 1. RMID setup: When cgroup x start monitoring:
>    for each descendent y, if y's mfa->level < x->level, then
>    y->mfa = x. (Where level of root node = 0...)
> 2. sched_in: During sched_in for x
>    if (x->mon_enabled) choose x->rmid
>      else choose x->mfa->rmid.
> 3. read: for each descendent of cgroup x
>    if (x->monitored) count += rmid_read(x->rmid).
> 4. evt_destroy: for each descendent y of x, if (y->mfa == x)
>    then y->mfa = x->mfa. Meaning if any descendent was monitoring for
>    x, set that descendent to monitor for the cgroup which x was
>    monitoring for.
>
> To monitor a task in a cgroup x along with monitoring cgroup x itself
> cqm_info maintains a list of tasks that are being monitored in the
> cgroup.
>
> When a task which belongs to a cgroup x is being monitored, it
> always uses its own task->rmid even if cgroup x is monitored during sched_in.
> To account for the counts of such tasks, cgroup keeps this list
> and parses it during read.
> taskmon_rlist is used to maintain the list. The list is modified when a
> task is attached to the cgroup or removed from the group.
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on core 4-7 and
> non real-time workload assigned to core 0-3. The real-time tasks share
> text and data, so a per task association is not required and due to
> interaction with the kernel it's desired that the kernel on these cores shares L3
> with the tasks.
>
> # cd /sys/fs/resctrl
>
> # echo "L3:0=3ff" > schemata
>
> core 0-1 are assigned to the new group and make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> monitor the cpus 0-1
>
> # perf stat -e llc_occupancy -C 0-1
>
> Example 2
> ---------
>
> A real time task running on cpu 2-3(socket 0) is allocated a dedicated 25% of the
> cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> To monitor the same group of tasks create a cgroup g1
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # perf stat -e llc_occupancy -a -G g1
>
> Example 3
> ---------
>
> sometimes user may just want to profile the cache occupancy first before
> assigning any CLOSids. Also this provides an override option where user
> can monitor some tasks which have say CLOS 0 that he is about to place
> in a CLOSId based on the amount of cache occupancy. This could apply to
> the same real time tasks above where user is caliberating the % of cache
> thats needed.
>
> # perf stat -e llc_occupancy -t PIDx,PIDy
>
> RMID allocation
> ---------------
>
> RMIDs are allocated per package to achieve better scaling of RMIDs.
> RMIDs are plenty (2-4 per logical processor) and also are per package
> meaning a two socket system would have twice the number of RMIDs.
> If we still run out of RMIDs an error is thrown that monitoring wasnt
> possible as the RMID wasnt available.
>
> Kernel Scheduling
> -----------------
>
> During ctx switch cqm choses the RMID in the following priority
>
> 1. if cpu has a RMID , choose that
> 2. if the task has a RMID directly tied to it choose that (task is
>    monitored)
> 3. choose the RMID of the task's cgroup (by default tasks belong to root
>    cgroup with RMID 0)
>
> Read
> ----
>
> When user calls cqm to retrieve the monitored count, we read the
> counter_msr and return the count. For cgroup hierarcy , the count is
> measured as explained in the cgroup implementation section by traversing
> the cgroup hierarchy.
>
>
> 2> Second Design option: Build a new usermode tool resmon
> ---------------------------------------------------------
> ---------------------------------------------------------
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same).
>
> This version supports monitoring resctrl groups directly.
> But we need a user interface for the user to read the counters.  We can
> create one file to set monitoring and one
> file in resctrl directory which will reflect the counts but may not be
> efficient as a lot of times user reads the counts frequently.
>
> Build a new user mode interface resmon
> --------------------------------------
>
> Since modifying the existing perf to
> suit the different h/w architecture seems to not follow the CAT
> interface model, it may well be better to have a different and dedicated
> interface for the RDT monitoring (just like we had a new fs for CAT)
>
> resmon supports monitoring a resctrl group or a task. The two modes may
> provide enough granularity needed for monitoring
> -can monitor cpu data.
> -can monitor per resctrl group data.
> -can choose custom or subset of tasks with in a resctrl group and monitor.
>
> # resmon [<options>]
> -r <resctrl group>
> -t <PID>
> -s <mon_mask>
> -I <time in ms>
>
> "resctrl group": is the resctrl directory.
>
> "mon_mask:  is a bit mask of logical packages which indicates which packages user is
> interested in monitoring.
>
> "time in ms": The time for which the monitoring takes place
> (this can potentially be changed to start and stop/read options)
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on core 4-7 and
> non real-time workload assigned to core 0-3. The real-time tasks share
> text and data, so a per task association is not required and due to
> interaction with the kernel it's desired that the kernel on these cores shares L3
> with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> core 0-1 are assigned to the new group and make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> monitor the cpus 0-1 for 10s.
>
> # resmon -r p0 -s 1 -I 10000
>
> Example 2
> ---------
>
> A real time task running on cpu 2-3(socket 0) is allocated a dedicated 25% of the
> cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> Monitor the task for 5s on socket zero
>
> # resmon -r p1 -s 1 -I 5000
>
> Example 3
> ---------
>
> sometimes user may just want to profile the cache occupancy first before
> assigning any CLOSids. Also this provides an override option where user
> can monitor some tasks which have say CLOS 0 that he is about to place
> in a CLOSId based on the amount of cache occupancy. This could apply to
> the same real time tasks above where user is caliberating the % of cache
> thats needed.
>
> # resmon -t PIDx,PIDy -s 1 -I 10000
>
> returns the sum of count of PIDx and PIDy
>
> RMID Allocation
> ---------------
>
> This would remain the same like design version 1, where we support per
> package RMIDs and throw error when out of RMIDs due to h/w limited
> RMIDs.
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20  7:37           ` David Carrillo-Cisneros
@ 2017-01-20  8:30             ` Thomas Gleixner
  2017-01-20 20:27               ` David Carrillo-Cisneros
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-20  8:30 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Above you are talking about the same CLOSID and different RMIDS and not
> > about changing both.
> 
> The scenario I talked about implies changing CLOSID without affecting
> monitoring.
> It happens when the allocation needs for a thread/cgroup/CPU change
> dynamically. Forcing to change the RMID together with the CLOSID would
> give wrong monitoring values unless the old RMID is kept around until
> becomes free, which is ugly and would waste a RMID.

When the allocation needs for a resource control group change, then we
simply update the allocation constraints of that group without changing
the CLOSID. So everything just stays the same.

If you move entities to a different group then of course the CLOSID
changes and then it's a different story how to deal with monitoring.

> > To gather any useful information for both CPU1 and T1 you need TWO
> > RMIDs. Everything else is voodoo and crystal ball analysis and we are not
> > going to support that.
> >
> 
> Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
> just because the CLOSID changed is wasteful.

Again, the CLOSID only changes if you move entities to a different resource
control group and in that case the RMID change is the least of your worries.

> Correct. But there may not be a fixed CLOSID association if loads
> exhibit dynamic behavior and/or system load changes dynamically.

So, you really want to move entities around between resource control groups
dynamically? I'm not seeing why you would want to do that, but I'm all ears
to get educated on that.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20  7:58         ` David Carrillo-Cisneros
@ 2017-01-20 13:28           ` Thomas Gleixner
  2017-01-20 20:11             ` David Carrillo-Cisneros
  2017-01-20 20:40           ` Shivappa Vikas
  1 sibling, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-20 13:28 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Vikas Shivappa, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> 
> If resctrl groups could lift the restriction of one resctl per CLOSID,
> then the user can create many resctrl in the way perf cgroups are
> created now. The advantage is that there wont be cgroup hierarchy!
> making things much simpler. Also no need to optimize perf event
> context switch to make llc_occupancy work.

So if I understand you correctly, then you want a mechanism to have groups
of entities (tasks, cpus) and associate them to a particular resource
control group.

So they share the CLOSID of the control group and each entity group can
have its own RMID.

Now you want to be able to move the entity groups around between control
groups without losing the RMID associated to the entity group.

So the whole picture would look like this:

rdt ->  CTRLGRP -> CLOSID

mon ->  MONGRP  -> RMID
   
And you want to move MONGRP from one CTRLGRP to another.

Can you please write up in an abstract way what the design requirements
are that you need. So far we are talking about implementation details and
unspecified wishlists, but what we really need is an abstract requirement.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20  2:32       ` Vikas Shivappa
  2017-01-20  7:58         ` David Carrillo-Cisneros
@ 2017-01-20 19:31         ` Stephane Eranian
  1 sibling, 0 replies; 91+ messages in thread
From: Stephane Eranian @ 2017-01-20 19:31 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Shivappa, Vikas, David Carrillo Cisneros, LKML, x86,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V,
	Luck, Tony, Fenghua Yu, Kleen, Andi, h.peter.anvin,
	Thomas Gleixner

On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
<vikas.shivappa@linux.intel.com> wrote:
>
> Resending including Thomas , also with some changes. Sorry for the spam
>
> Based on Thomas and Peterz feedback Can think of two design
> variants which target:
>
> -Support monitoring and allocating using the same resctrl group.
> user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or cpu)
>
> -Also allows monitoring outside of resctrl so that user can
> monitor subgroups who use the same closid. This mode can be used
> when user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup, second version
> builds a new interface resmon. The first version is close to the patches
> sent with some additions/changes. This includes details of the design as
> per Thomas/Peterz feedback.
>
> 1> First Design option: without modifying the resctrl and using perf
> --------------------------------------------------------------------
> --------------------------------------------------------------------
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same)
>
>
> Monitor cqm using perf
> ----------------------
>
> perf can monitor individual tasks using the -t
> option just like before.
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> user can monitor the cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>
Presented this way, this does not quite address the use case I
described earlier here. We want to be able to monitor the cgroup
allocations from the first thread creation. What you have above has a
large gap. Many apps do allocations as their very first steps, so if
you do:
$ my_test_prg &
[1456]
$ echo 1456 >/sys/fs/cgroup/perf_event/g2/tasks
$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2

You have a race. But if you allow:

$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2 (i.e, on an empty cgroup)
$ echo $$ >/sys/fs/cgroup/perf_event/g2/tasks (put shell in cgroup, so
my_test_prg runs immediately in the cgroup)
$ my_test_prg &

Then there is a way to avoid the gap.

>
> To monitor a resctrl group, user can group the same tasks in resctrl
> group into the cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where:
> if the groups intersect, then there is no way to know what
> l3_allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for g1 and p1 would be to allocate an RMID
> for each task which can as well be done with the -t option.
>
> Monitoring cqm cgroups Implementation
> -------------------------------------
>
> When monitoring two different cgroups in the same hierarchy (ex say g11
> has an ancestor g1 which are both being monitored as shown below) we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring for g1 llc_occupancy we cannot write two different RMIDs
> (because we need to count for g11 as well)
> during context switch to measure the occupancy for both g1 and g11.
> Hence the driver maintains this information and writes the RMID of the
> lowest member in the ancestory which is being monitored during ctx
> switch.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same
> information while reading the counts.
>
> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
>         void *cqm_info;
> #endif
> ...
>
>  }
>
> struct cqm_info {
>   bool mon_enabled;
>   int level;
>   u32 *rmid;
>   struct cgrp_cqm_info *mfa;
>   struct list_head tskmon_rlist;
>  };
>
> Due to the hierarchical nature of cgroups, every cgroup just
> monitors for the 'nearest monitored ancestor' at all times.
> Since root cgroup is always monitored, all descendents
> at boot time monitor for root and hence all mfa points to root
> except for root->mfa which is NULL.
>
> 1. RMID setup: When cgroup x start monitoring:
>    for each descendent y, if y's mfa->level < x->level, then
>    y->mfa = x. (Where level of root node = 0...)
> 2. sched_in: During sched_in for x
>    if (x->mon_enabled) choose x->rmid
>      else choose x->mfa->rmid.
> 3. read: for each descendent of cgroup x
>    if (x->monitored) count += rmid_read(x->rmid).
> 4. evt_destroy: for each descendent y of x, if (y->mfa == x)
>    then y->mfa = x->mfa. Meaning if any descendent was monitoring for
>    x, set that descendent to monitor for the cgroup which x was
>    monitoring for.
>
> To monitor a task in a cgroup x along with monitoring cgroup x itself
> cqm_info maintains a list of tasks that are being monitored in the
> cgroup.
>
> When a task which belongs to a cgroup x is being monitored, it
> always uses its own task->rmid even if cgroup x is monitored during sched_in.
> To account for the counts of such tasks, cgroup keeps this list
> and parses it during read.
> taskmon_rlist is used to maintain the list. The list is modified when a
> task is attached to the cgroup or removed from the group.
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on core 4-7 and
> non real-time workload assigned to core 0-3. The real-time tasks share
> text and data, so a per task association is not required and due to
> interaction with the kernel it's desired that the kernel on these cores shares L3
> with the tasks.
>
> # cd /sys/fs/resctrl
>
> # echo "L3:0=3ff" > schemata
>
> core 0-1 are assigned to the new group and make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> monitor the cpus 0-1
>
> # perf stat -e llc_occupancy -C 0-1
>
> Example 2
> ---------
>
> A real time task running on cpu 2-3(socket 0) is allocated a dedicated 25% of the
> cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> To monitor the same group of tasks create a cgroup g1
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # perf stat -e llc_occupancy -a -G g1
>
> Example 3
> ---------
>
> sometimes user may just want to profile the cache occupancy first before
> assigning any CLOSids. Also this provides an override option where user
> can monitor some tasks which have say CLOS 0 that he is about to place
> in a CLOSId based on the amount of cache occupancy. This could apply to
> the same real time tasks above where user is caliberating the % of cache
> thats needed.
>
> # perf stat -e llc_occupancy -t PIDx,PIDy
>
> RMID allocation
> ---------------
>
> RMIDs are allocated per package to achieve better scaling of RMIDs.
> RMIDs are plenty (2-4 per logical processor) and also are per package
> meaning a two socket system would have twice the number of RMIDs.
> If we still run out of RMIDs an error is thrown that monitoring wasnt
> possible as the RMID wasnt available.
>
> Kernel Scheduling
> -----------------
>
> During ctx switch cqm choses the RMID in the following priority
>
> 1. if cpu has a RMID , choose that
> 2. if the task has a RMID directly tied to it choose that (task is
>    monitored)
> 3. choose the RMID of the task's cgroup (by default tasks belong to root
>    cgroup with RMID 0)
>
> Read
> ----
>
> When user calls cqm to retrieve the monitored count, we read the
> counter_msr and return the count. For cgroup hierarcy , the count is
> measured as explained in the cgroup implementation section by traversing
> the cgroup hierarchy.
>
>
> 2> Second Design option: Build a new usermode tool resmon
> ---------------------------------------------------------
> ---------------------------------------------------------
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same).
>
> This version supports monitoring resctrl groups directly.
> But we need a user interface for the user to read the counters.  We can
> create one file to set monitoring and one
> file in resctrl directory which will reflect the counts but may not be
> efficient as a lot of times user reads the counts frequently.
>
> Build a new user mode interface resmon
> --------------------------------------
>
> Since modifying the existing perf to
> suit the different h/w architecture seems to not follow the CAT
> interface model, it may well be better to have a different and dedicated
> interface for the RDT monitoring (just like we had a new fs for CAT)
>
> resmon supports monitoring a resctrl group or a task. The two modes may
> provide enough granularity needed for monitoring
> -can monitor cpu data.
> -can monitor per resctrl group data.
> -can choose custom or subset of tasks with in a resctrl group and monitor.
>
> # resmon [<options>]
> -r <resctrl group>
> -t <PID>
> -s <mon_mask>
> -I <time in ms>
>
> "resctrl group": is the resctrl directory.
>
> "mon_mask:  is a bit mask of logical packages which indicates which packages user is
> interested in monitoring.
>
> "time in ms": The time for which the monitoring takes place
> (this can potentially be changed to start and stop/read options)
>
> Example 1 (Some examples modeled from resctrl ui documentation)
> ---------
>
> A single socket system which has real-time tasks running on core 4-7 and
> non real-time workload assigned to core 0-3. The real-time tasks share
> text and data, so a per task association is not required and due to
> interaction with the kernel it's desired that the kernel on these cores shares L3
> with the tasks.
>
> # cd /sys/fs/resctrl
> # mkdir p0
> # echo "L3:0=3ff" > p0/schemata
>
> core 0-1 are assigned to the new group and make sure that the
> kernel and the tasks running there get 50% of the cache.
>
> # echo 03 > p0/cpus
>
> monitor the cpus 0-1 for 10s.
>
> # resmon -r p0 -s 1 -I 10000
>
> Example 2
> ---------
>
> A real time task running on cpu 2-3(socket 0) is allocated a dedicated 25% of the
> cache.
>
> # cd /sys/fs/resctrl
>
> # mkdir p1
> # echo "L3:0=0f00;1=ffff" > p1/schemata
> # echo 5678 > p1/tasks
> # taskset -cp 2-3 5678
>
> Monitor the task for 5s on socket zero
>
> # resmon -r p1 -s 1 -I 5000
>
> Example 3
> ---------
>
> sometimes user may just want to profile the cache occupancy first before
> assigning any CLOSids. Also this provides an override option where user
> can monitor some tasks which have say CLOS 0 that he is about to place
> in a CLOSId based on the amount of cache occupancy. This could apply to
> the same real time tasks above where user is caliberating the % of cache
> thats needed.
>
> # resmon -t PIDx,PIDy -s 1 -I 10000
>
> returns the sum of count of PIDx and PIDy
>
> RMID Allocation
> ---------------
>
> This would remain the same like design version 1, where we support per
> package RMIDs and throw error when out of RMIDs due to h/w limited
> RMIDs.
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 13:28           ` Thomas Gleixner
@ 2017-01-20 20:11             ` David Carrillo-Cisneros
  2017-01-20 21:08               ` Shivappa Vikas
                                 ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-20 20:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> >
> > If resctrl groups could lift the restriction of one resctl per CLOSID,
> > then the user can create many resctrl in the way perf cgroups are
> > created now. The advantage is that there wont be cgroup hierarchy!
> > making things much simpler. Also no need to optimize perf event
> > context switch to make llc_occupancy work.
>
> So if I understand you correctly, then you want a mechanism to have groups
> of entities (tasks, cpus) and associate them to a particular resource
> control group.
>
> So they share the CLOSID of the control group and each entity group can
> have its own RMID.
>
> Now you want to be able to move the entity groups around between control
> groups without losing the RMID associated to the entity group.
>
> So the whole picture would look like this:
>
> rdt ->  CTRLGRP -> CLOSID
>
> mon ->  MONGRP  -> RMID
>
> And you want to move MONGRP from one CTRLGRP to another.

Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
same thing. Details below.

>
> Can you please write up in a abstract way what the design requirements are
> that you need. So far we are talking about implementation details and
> unspecfied wishlists, but what we really need is an abstract requirement.

My pleasure:


Design Proposal for Monitoring of RDT Allocation Groups.
-----------------------------------------------------------------------------

Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible although
useless. A unique CLOSID forbids having more CTRLGRPs than physical
CLOSIDs, and CLOSIDs are much more scarce than RMIDs.

If we lift the condition of a unique CLOSID, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID, and RDT Allocation must maintain the schemata
to CLOSID relationship (similarly to what the previous CAT driver used
to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
now: adding an element removes it from its previous CTRLGRP.


This change would allow further partitioning the allocation groups
into (allocation, monitoring) groups as follows:

With allocation only:
            CTRLGRP0     CTRLGRP_ALLOC_ONLY
schemata:  L3:0=0xff0       L3:0=0x00f
tasks:       PID0       P0_0,P0_1,P1_0,P1_1
cpus:        0x3                0xC

If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:
            CTRLGRP0     CTRLGRP1     CTRLGRP2        CTRLGRP3
schemata:  L3:0=0xff0   L3:0=0x00f    L3:0=0x00f     L3:0=0x00f
tasks:       PID0         <none>      P0_0,P0_1     P1_0, P1_1
cpus:        0x3           0xC          0x0             0x0

Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for (L3,0).
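
One way to picture the bookkeeping RDT Allocation would then need is a
small refcounted schemata-to-CLOSID table. This is only an illustrative
sketch, simplified to a single L3 bitmask per CLOSID and with made-up
names (MAX_CLOSID, closid_tab):

struct closid_entry {
        u32     cbm;            /* cache bitmask this CLOSID is programmed with */
        int     refcount;       /* number of CTRLGRPs sharing this CLOSID */
};
static struct closid_entry closid_tab[MAX_CLOSID];

/* Reuse an existing CLOSID with the same schemata or claim a free one. */
static int get_closid(u32 cbm)
{
        int i, free = -1;

        for (i = 0; i < MAX_CLOSID; i++) {
                if (closid_tab[i].refcount && closid_tab[i].cbm == cbm) {
                        closid_tab[i].refcount++;
                        return i;       /* CTRLGRPs with equal schemata share it */
                }
                if (!closid_tab[i].refcount && free < 0)
                        free = i;
        }
        if (free < 0)
                return -ENOSPC;         /* out of hardware CLOSIDs, like old CAT */
        closid_tab[free].cbm = cbm;
        closid_tab[free].refcount = 1;
        return free;
}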


Now we can ask perf to monitor any of the CTRLGRPs independently (once
we solve how to tell perf which (CTRLGRP, resource_id) to monitor). The
perf_event will reserve and assign the RMID to the monitored CTRLGRP.
The RDT subsystem will context switch the whole PQR_ASSOC MSR (CLOSID
and RMID), so perf won't have to.
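
The context switch side of that then boils down to a single MSR write,
roughly along these lines (a sketch only: the RMID/CLOSID field layout
is the architectural IA32_PQR_ASSOC format, while resolve_closid(),
resolve_rmid() and the per-cpu cache are assumed helpers, not real
code):

struct pqr_state {
        u32     rmid;           /* RMID currently programmed on this CPU */
        u32     closid;         /* CLOSID currently programmed on this CPU */
};
static DEFINE_PER_CPU(struct pqr_state, pqr_state);

static void __rdt_sched_in(struct task_struct *next)
{
        struct pqr_state *state = this_cpu_ptr(&pqr_state);
        u32 closid = resolve_closid(next);      /* from the task's CTRLGRP */
        u32 rmid   = resolve_rmid(next);        /* from its monitored group, 0 if none */

        /* Write IA32_PQR_ASSOC only once per switch and only when it changed. */
        if (closid != state->closid || rmid != state->rmid) {
                state->closid = closid;
                state->rmid = rmid;
                wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);  /* low: RMID, high: CLOSID */
        }
}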

If a CTRLGRP's schemata changes, the RDT subsystem will find a new
CLOSID for the new schemata (potentially reusing an existing one) or
fail (just like the old CAT used to). The RMID does not change during
schemata updates.

If a CTRLGRP dies, the monitoring perf_event continues to exist as a
useless wraith, just as happens with cgroup events now.

Since CTRLGRPs have no hierarchy, there is no need to handle that in
the new RDT Monitoring PMU, greatly simplifying it over the previously
proposed versions.

A breaking change in user-observed behavior with respect to the
existing CQM PMU is that there wouldn't be task events. A task must be
part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
pair. If a user wants to monitor a task across multiple resources
(e.g. l3_occupancy across two packages), she must create one event per
resource_id and add the two counts.

I see this breaking change as an improvement, since hiding the cache
topology from user space introduced lots of ugliness and complexity to
the CQM PMU without improving accuracy over user space adding the
events.

Implementation ideas:

A first idea is to expose one monitoring file per resource in a CTRLGRP,
so the list of CTRLGRP's files would be: schemata, tasks, cpus,
monitor_l3_0, monitor_l3_1, ...

The monitor_<resource_id> file descriptor is passed to perf_event_open
the way cgroup file descriptors are passed now, as sketched below. All
events for the same (CTRLGRP, resource_id) share an RMID.
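
A hedged sketch of how the user space side could look under that
proposal (the monitor_l3_0 file, the reuse of PERF_FLAG_PID_CGROUP and
the config value are assumptions of this design, not an existing ABI):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open an llc_occupancy event bound to CTRLGRP2's L3 cache-id 0. */
static int open_ctrlgrp_event(int pmu_type)
{
        struct perf_event_attr attr;
        int mon_fd, ev_fd;

        /* hypothetical per-resource monitor file exposed by the CTRLGRP */
        mon_fd = open("/sys/fs/resctrl/CTRLGRP2/monitor_l3_0", O_RDONLY);
        if (mon_fd < 0)
                return -1;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = pmu_type;   /* dynamic type of the rdt_monitoring PMU (from sysfs) */
        attr.config = 1;        /* llc_occupancy event id, value is made up */

        /* reuse the cgroup convention: pid argument = fd of the group being monitored */
        ev_fd = syscall(__NR_perf_event_open, &attr, mon_fd,
                        0 /* a CPU in the target package */, -1 /* group_fd */,
                        PERF_FLAG_PID_CGROUP);
        close(mon_fd);
        return ev_fd;
}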

The RMID allocation part can either be handled by RDT Allocation or by
the RDT Monitoring PMU. Either way, the existence of the PMU's
perf_events allocates/releases the RMID.

Also, since this new design removes hierarchy and task events, it
allows for a simple solution of the RMID rotation problem. The removal
of task events eliminates the cgroup vs task event conflict existing
in the upstream version; it also removes the need to ensure that all
active packages have RMIDs at the same time, which added complexity to
my version of CQM/CMT. Lastly, the removal of hierarchy removes the
reliance on cgroups, the complex tree-based read, and all the hooks
and cgroup files that "raped" the cgroup subsystem.

Thoughts?

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20  8:30             ` Thomas Gleixner
@ 2017-01-20 20:27               ` David Carrillo-Cisneros
  0 siblings, 0 replies; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-20 20:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Fri, Jan 20, 2017 at 12:30 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>> On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Above you are talking about the same CLOSID and different RMIDS and not
>> > about changing both.
>>
>> The scenario I talked about implies changing CLOSID without affecting
>> monitoring.
>> It happens when the allocation needs for a thread/cgroup/CPU change
>> dynamically. Forcing to change the RMID together with the CLOSID would
>> give wrong monitoring values unless the old RMID is kept around until
>> becomes free, which is ugly and would waste a RMID.
>
> When the allocation needs for a resource control group change, then we
> simply update the allocation constraints of that group without chaning the
> CLOSID. So everything just stays the same.
>
> If you move entities to a different group then of course the CLOSID
> changes and then it's a different story how to deal with monitoring.
>
>> > To gather any useful information for both CPU1 and T1 you need TWO
>> > RMIDs. Everything else is voodoo and crystal ball analysis and we are not
>> > going to support that.
>> >
>>
>> Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
>> just because the CLOSID changed is wasteful.
>
> Again, the CLOSID only changes if you move entities to a different resource
> control group and in that case the RMID change is the least of your worries.
>
>> Correct. But there may not be a fixed CLOSID association if loads
>> exhibit dynamic behavior and/or system load changes dynamically.
>
> So, you really want to move entities around between resource control groups
> dynamically? I'm not seing why you would want to do that, but I'm all ear
> to get educated on that.

No, I don't want to move entities across resource control groups. I
was confused by the idea of CLOSIDs being married to control groups,
but now it is clear even to me that that was never the intention.

Thanks,
David

>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20  7:58         ` David Carrillo-Cisneros
  2017-01-20 13:28           ` Thomas Gleixner
@ 2017-01-20 20:40           ` Shivappa Vikas
  1 sibling, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-20 20:40 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Vikas Shivappa, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin, Thomas Gleixner



On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:

> On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
> <vikas.shivappa@linux.intel.com> wrote:
>> Resending including Thomas , also with some changes. Sorry for the spam
>>
>> Based on Thomas and Peterz feedback Can think of two design
>> variants which target:
>>
>> -Support monitoring and allocating using the same resctrl group.
>> user can use a resctrl group to allocate resources and also monitor
>> them (with respect to tasks or cpu)
>>
>> -Also allows monitoring outside of resctrl so that user can
>> monitor subgroups who use the same closid. This mode can be used
>> when user wants to monitor more than just the resctrl groups.
>>
>> The first design version uses and modifies perf_cgroup, second version
>> builds a new interface resmon.
>
> The second version would require to build a whole new set of tools,
> deploy them and maintain them. Users will have to run perf for certain
> events and resmon (or whatever is named the new tool) for rdt. I see
> it as too complex and much prefer to keep using perf.

This was so that we have the flexibility to align the tool with the
requirements of the feature rather than twisting perf's behaviour, and
also have that flexibility in the future when new RDT features are added
(something similar to what we did by introducing resctrl groups instead
of using cgroups for CAT).

Sometimes that's a lot simpler, as we don't need a lot of code given the
limited/specific syscalls we need to support, just like the resctrl fs
which is specific to RDT.

It looks like your requirement is to be able to monitor a group of
tasks independently of the resctrl groups?

The task option should provide that flexibility to monitor a bunch of
tasks independently, regardless of whether they are part of a resctrl
group or not. The assignment of RMIDs is controlled underneath by the
kernel, so we can optimize the usage of RMIDs, and the RMIDs are tied to
this group of tasks whether it is a subset of a resctrl group or not.

>
>> The first version is close to the patches
>> sent with some additions/changes. This includes details of the design as
>> per Thomas/Peterz feedback.
>>
>> 1> First Design option: without modifying the resctrl and using perf
>> --------------------------------------------------------------------
>> --------------------------------------------------------------------
>>
>> In this design everything in resctrl interface works like
>> before (the info, resource group files like task schemata all remain the
>> same)
>>
>>
>> Monitor cqm using perf
>> ----------------------
>>
>> perf can monitor individual tasks using the -t
>> option just like before.
>>
>> # perf stat -e llc_occupancy -t PID1,PID2
>>
>> user can monitor the cpu occupancy using the -C option in perf:
>>
>> # perf stat -e llc_occupancy -C 5
>>
>> Below shows how user can monitor cgroup occupancy:
>>
>> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
>> # mkdir /sys/fs/cgroup/perf_event/g1
>> # mkdir /sys/fs/cgroup/perf_event/g2
>> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>>
>> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>>
>> To monitor a resctrl group, user can group the same tasks in resctrl
>> group into the cgroup.
>>
>> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
>> group p1 to cgroup g1
>>
>> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>>
>> Introducing a new option for resctrl may complicate monitoring because
>> supporting cgroup 'task groups' and resctrl 'task groups' leads to
>> situations where:
>> if the groups intersect, then there is no way to know what
>> l3_allocations contribute to which group.
>>
>> ex:
>> p1 has tasks t1, t2, t3
>> g1 has tasks t2, t3, t4
>>
>> The only way to get occupancy for g1 and p1 would be to allocate an RMID
>> for each task which can as well be done with the -t option.
>
> That's simply recreating the resctrl group as a cgroup.
>
> I think that the main advantage of doing allocation first is that we
> could use the context switch in rdt allocation and greatly simplify
> the pmu side of it.
>
> If resctrl groups could lift the restriction of one resctl per CLOSID,
> then the user can create many resctrl in the way perf cgroups are
> created now. The advantage is that there wont be cgroup hierarchy!
> making things much simpler. Also no need to optimize perf event
> context switch to make llc_occupancy work.
>
> Then we only need a way to express that monitoring must happen in a
> resctl to the perf_event_open syscall.
>
> My first thought is to have a "rdt_monitor" file per resctl group. A
> user passes it to perf_event_open in the way cgroups are passed now.
> We could extend the meaning of the flag PERF_FLAG_PID_CGROUP to also
> cover rdt_monitor files. The syscall can figure if it's a cgroup or a
> rdt_group. The rdt_monitoring PMU would only work with rdt_monitor
> groups
>
> Then the rdm_monitoring PMU will be pretty dumb, having neither task
> nor CPU contexts. Just providing the pmu->read and pmu->event_init
> functions.
>
> Task monitoring can be done with resctrl as well by adding the PID to
> a new resctl and opening the event on it. And, since we'd allow CLOSID
> to be shared between resctrl groups, allocation wouldn't break.

It looks like we are trying to create a MONGRP and CTRLGRP as Thomas
mentions.

Although a resctrl group does not have a hierarchy now, a task can be
part of only one group. Breaking this is just equivalent to having a
separate resmon group which may group the tasks independently of how
they are grouped in the resctrl group?

That can be achieved as well with the option to monitor at task
granularity? That means if we support the task option and the option to
monitor resctrl groups, we obtain the same functionality.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 20:11             ` David Carrillo-Cisneros
@ 2017-01-20 21:08               ` Shivappa Vikas
  2017-01-20 21:44                 ` David Carrillo-Cisneros
  2017-01-23  9:47               ` Thomas Gleixner
  2017-02-08 10:11               ` Peter Zijlstra
  2 siblings, 1 reply; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-20 21:08 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, Vikas Shivappa, Vikas Shivappa,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Luck, Tony, Fenghua Yu,
	andi.kleen, H. Peter Anvin



On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:

> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>>>
>>> If resctrl groups could lift the restriction of one resctl per CLOSID,
>>> then the user can create many resctrl in the way perf cgroups are
>>> created now. The advantage is that there wont be cgroup hierarchy!
>>> making things much simpler. Also no need to optimize perf event
>>> context switch to make llc_occupancy work.
>>
>> So if I understand you correctly, then you want a mechanism to have groups
>> of entities (tasks, cpus) and associate them to a particular resource
>> control group.
>>
>> So they share the CLOSID of the control group and each entity group can
>> have its own RMID.
>>
>> Now you want to be able to move the entity groups around between control
>> groups without losing the RMID associated to the entity group.
>>
>> So the whole picture would look like this:
>>
>> rdt ->  CTRLGRP -> CLOSID
>>
>> mon ->  MONGRP  -> RMID
>>
>> And you want to move MONGRP from one CTRLGRP to another.
>
> Almost, but not quite. My idea is no have MONGRP and CTRLGRP to be the
> same thing. Details below.
>
>>
>> Can you please write up in a abstract way what the design requirements are
>> that you need. So far we are talking about implementation details and
>> unspecfied wishlists, but what we really need is an abstract requirement.
>
> My pleasure:
>
>
> Design Proposal for Monitoring of RDT Allocation Groups.
> -----------------------------------------------------------------------------
>
> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
> cache bitmask (CBM) per resource. Non-unique CBM are possible although
> useless. An unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
> CLOSIDs are much more scarce than RMIDs.
>
> If we lift the condition of unique CLOSID, then the user can create
> multiple CTRLGRPs with the same schemata. Internally, those CTRCGRP
> would share the CLOSID and RDT_Allocation must maintain the schemata
> to CLOSID relationship (similarly to what the previous CAT driver used
> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
> now: adding an element removes it from its previous CTRLGRP.
>
>
> This change would allow further partitioning the allocation groups
> into (allocation, monitoring) groups as follows:
>
> With allocation only:
>            CTRLGRP0     CTRLGRP_ALLOC_ONLY
> schemata:  L3:0=0xff0       L3:0=x00f
> tasks:       PID0       P0_0,P0_1,P1_0,P1_1
> cpus:        0x3                0xC

It is not clear what PID0 and P0_0 mean?

If you have to support something like MONGRP and CTRLGRP, overall you
want to allow a task to be present in multiple groups?

>
> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
> independently, with the new model we could create:
>            CTRLGRP0     CTRLGRP1     CTRLGRP2        CTRLGRP3
> schemata:  L3:0=0xff0   L3:0=x00f    L3:0=0x00f     L3:0=0x00f
> tasks:       PID0         <none>      P0_0,P0_1     P1_0, P1_1
> cpus:        0x3           0xC          0x0             0x0
>
> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP2 would share the CLOSID for (L3,0).
>
>
> Now we can ask perf to monitor any of the CTRLGRPs independently -once
> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
> The perf_event will reserve and assign the RMID to the monitored
> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
> (CLOSID and RMID), so perf won't have to.

This can be solved by supporting just the -t option in perf and a new
option in perf to support resctrl group monitoring (something similar to
-R). That way we provide the flexible granularity to monitor tasks
independently of whether they are in any resctrl group (and hence also a
subset).

CTRLGRP         TASKS           MASK
CTRLGRP1        PID1,PID2       L3:0=0xf,1=0xf0
CTRLGRP2        PID3,PID4       L3:0=0xf0,1=0xf00

# perf stat -e llc_occupancy -R CTRLGRP1

# perf stat -e llc_occupancy -t PID3,PID4

The RMID allocation is independent of the resctrl CLOSID allocation and
hence the RMID is not always married to a CLOS, which seems to be the
requirement here.

OR

We could have CTRLGRPs with control_only, monitor_only or control_monitor 
options.

Now a task could be present in both a control_only and a monitor_only
group, or it could be present only in a control_monitor group. The
transitions from one state to another are guarded by this same
principle, as sketched after the examples below.

CTRLGRP         TASKS           MASK                    TYPE
CTRLGRP1        PID1,PID2       L3:0=0xf,1=0xf0         control_only
CTRLGRP2        PID3,PID4       L3:0=0xf0,1=0xf00       control_only
CTRLGRP3        PID2,PID3                               monitor_only
CTRLGRP4        PID5,PID6       L3:0=0xf0,1=0xf00       control_monitor

CTRLGRP3 allows you to monitor a set of tasks which are not bound to be
in the same CTRLGRP, and you can add or move tasks into it. Adding and
removing tasks is what is easily supported here compared to task
granularity, although such a thing could still be supported with task
granularity.

CTRLGRP4 allows you to tie monitoring and control together, so when
tasks move in and out of it we still have that group to consider. And
these groups still retain the cpu masks like before, so that cpu
monitoring is still supported.

In this case we would need a new option to support ctrlgrp monitoring in
perf, or a new tool to do all this if we don't want to bother perf.
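
A tiny sketch of that membership rule (the types and names here are
invented purely to illustrate the constraint being described):

enum ctrlgrp_type {
        CTRL_ONLY,      /* allocation only: schemata, no RMID */
        MON_ONLY,       /* monitoring only: RMID, no schemata */
        CTRL_MON,       /* both: schemata and RMID */
};

struct task_rdt_state {
        bool in_ctrl_only;      /* already member of a control_only group */
        bool in_mon_only;       /* already member of a monitor_only group */
        bool in_ctrl_mon;       /* already member of a control_monitor group */
};

/*
 * control_only and monitor_only membership may coexist (one of each),
 * while control_monitor membership is exclusive.
 */
static bool may_join(const struct task_rdt_state *t, enum ctrlgrp_type type)
{
        if (type == CTRL_MON)
                return !t->in_ctrl_only && !t->in_mon_only && !t->in_ctrl_mon;
        if (type == CTRL_ONLY)
                return !t->in_ctrl_mon && !t->in_ctrl_only;
        return !t->in_ctrl_mon && !t->in_mon_only;      /* MON_ONLY */
}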

>
> If CTRLGRP's schemata changes, the RDT subsystem will find a new
> CLOSID for the new schemata (potentially reusing an existing one) or
> fail (just like the old CAT used to). The RMID does not change during
> schemata updates.
>
> If a CTRLGRP dies, the monitoring perf_event continues to exists as a
> useless wraith, just as happens with cgroup events now.
>
> Since CTRLGRPs have no hierarchy. There is no need to handle that in
> the new RDT Monitoring PMU, greatly simplifying it over the previously
> proposed versions.
>
> A breaking change in user observed behavior with respect to the
> existing CQM PMU is that there wouldn't be task events. A task must be
> part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
> pair. If an user wants to monitor a task across multiple resources
> (e.g. l3_occupancy across two packages), she must create one event per
> resource_id and add the two counts.
>
> I see this breaking change as an improvement, since hiding the cache
> topology to user space introduced lots of ugliness and complexity to
> the CQM PMU without improving accuracy over user space adding the
> events.
>
> Implementation ideas:
>
> First idea is to expose one monitoring file per resource in a CTRLGRP,
> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
> monitor_l3_0, monitor_l3_1, ...
>
> the monitor_<resource_id> file descriptor is passed to perf_event_open
> in the way cgroup file descriptors are passed now. All events to the
> same (CTRLGRP,resource_id) share RMID.
>
> The RMID allocation part can either be handled by RDT Allocation or by
> the RDT Monitoring PMU. Either ways, the existence of PMU's
> perf_events allocates/releases the RMID.
>
> Also, since this new design removes hierarchy and task events, it
> allows for a simple solution of the RMID rotation problem. The removal
> of task events eliminates the cgroup vs task event conflict existing
> in the upstream version; it also removes the need to ensure that all
> active packages have RMIDs at the same time that added complexity to
> my version of CQM/CMT. Lastly, the removal of hierarchy removes the
> reliance on cgroups, the complex tree based read, and all the hooks
> and cgroup files that "raped" the cgroup subsystem.

Yes, not sure if the view is the same after I sent the implementation details in 
the documentation :) (most likely it is).
But the option could be to not support perf_cgroup for cqm and to 
support a new option in perf to monitor resctrl groups and tasks (or some other 
option like mongrp).

I am so far inclined towards creating a new monitoring interface; that way we don't 
try to "rape" the existing perf specifics for this RDT or later RDT quirks/features.

>
> Thoughts?
>
> Thanks,
> David
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 21:08               ` Shivappa Vikas
@ 2017-01-20 21:44                 ` David Carrillo-Cisneros
  2017-01-20 23:51                   ` Shivappa Vikas
  0 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-20 21:44 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: Thomas Gleixner, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas
<vikas.shivappa@intel.com> wrote:
>
>
> On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
>
>> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@linutronix.de>
>> wrote:
>>>
>>>
>>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>>>>
>>>>
>>>> If resctrl groups could lift the restriction of one resctl per CLOSID,
>>>> then the user can create many resctrl in the way perf cgroups are
>>>> created now. The advantage is that there wont be cgroup hierarchy!
>>>> making things much simpler. Also no need to optimize perf event
>>>> context switch to make llc_occupancy work.
>>>
>>>
>>> So if I understand you correctly, then you want a mechanism to have
>>> groups
>>> of entities (tasks, cpus) and associate them to a particular resource
>>> control group.
>>>
>>> So they share the CLOSID of the control group and each entity group can
>>> have its own RMID.
>>>
>>> Now you want to be able to move the entity groups around between control
>>> groups without losing the RMID associated to the entity group.
>>>
>>> So the whole picture would look like this:
>>>
>>> rdt ->  CTRLGRP -> CLOSID
>>>
>>> mon ->  MONGRP  -> RMID
>>>
>>> And you want to move MONGRP from one CTRLGRP to another.
>>
>>
>> Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
>> same thing. Details below.
>>
>>>
>>> Can you please write up in a abstract way what the design requirements
>>> are
>>> that you need. So far we are talking about implementation details and
>>> unspecfied wishlists, but what we really need is an abstract requirement.
>>
>>
>> My pleasure:
>>
>>
>> Design Proposal for Monitoring of RDT Allocation Groups.
>>
>> -----------------------------------------------------------------------------
>>
>> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
>> cache bitmask (CBM) per resource. Non-unique CBM are possible although
>> useless. An unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
>> CLOSIDs are much more scarce than RMIDs.
>>
>> If we lift the condition of unique CLOSID, then the user can create
>> multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
>> would share the CLOSID and RDT_Allocation must maintain the schemata
>> to CLOSID relationship (similarly to what the previous CAT driver used
>> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
>> now: adding an element removes it from its previous CTRLGRP.
>>
>>
>> This change would allow further partitioning the allocation groups
>> into (allocation, monitoring) groups as follows:
>>
>> With allocation only:
>>            CTRLGRP0     CTRLGRP_ALLOC_ONLY
>> schemata:  L3:0=0xff0       L3:0=x00f
>> tasks:       PID0       P0_0,P0_1,P1_0,P1_1
>> cpus:        0x3                0xC
>
>
> Not clear what the PID0 and P0_0 mean ?

PID0 and P*_* are arbitrary PIDs. The tasks file works the same as it
does now in RDT. I am not changing that.

>
> If you have to support something like MONGRP and CTRLGRP overall you want to
> allow for a task to be present in multiple groups ?

I am not proposing to support MONGRP and CTRLGRP. I am proposing to
allow monitoring of CTRLGRPs only.

>>
>> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
>> independently, with the new model we could create:
>>            CTRLGRP0     CTRLGRP1     CTRLGRP2        CTRLGRP3
>> schemata:  L3:0=0xff0   L3:0=x00f    L3:0=0x00f     L3:0=0x00f
>> tasks:       PID0         <none>      P0_0,P0_1     P1_0, P1_1
>> cpus:        0x3           0xC          0x0             0x0
>>
>> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
>> (L3,0).
>>
>>
>> Now we can ask perf to monitor any of the CTRLGRPs independently -once
>> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
>> The perf_event will reserve and assign the RMID to the monitored
>> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
>> (CLOSID and RMID), so perf won't have to.
>
>
> This can be solved by suporting just the -t in perf and a new option in perf
> to suport resctrl group monitoring (something similar to -R). That way we
> provide the flexible granularity to monitor tasks independent of whether
> they are in any resctrl group (and hence also a subset).

One of the key points of my proposal is to remove independent monitoring of
PIDs. That simplifies things by letting RDT handle CLOSIDs
and RMIDs together.
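
To illustrate what handling them together at switch_to() could look like, a minimal 
kernel-style sketch (the struct and hook names are made up for this proposal; only 
the MSR layout - RMID in bits 0-9, CLOSID in bits 32-63 of IA32_PQR_ASSOC - is real):

#include <linux/types.h>
#include <asm/msr.h>                    /* wrmsrl() */

#define MSR_IA32_PQR_ASSOC      0x0c8f

/* Made-up view of the proposal: one CTRLGRP provides both ids. */
struct ctrlgrp_ids {
        u32 closid;     /* picked by resctrl for the group's schemata */
        u32 rmid;       /* reserved by the perf_event, 0 if unmonitored */
};

/* Hypothetical hook called once per context switch: one MSR write for both. */
static inline void rdt_sched_in(const struct ctrlgrp_ids *ids)
{
        u64 val = ((u64)ids->closid << 32) | ids->rmid;

        /* A real implementation would cache the last value per cpu
         * and skip the wrmsrl() when nothing changed. */
        wrmsrl(MSR_IA32_PQR_ASSOC, val);
}

That keeps perf completely out of the context switch path, as described above.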

>
> CTRLGRP         TASKS           MASK
> CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0
> CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00
>
> #perf stat -e llc_occupancy -R CTRLGRP1
>
> #perf stat -e llc_occupancy -t PID3,PID4
>
> The RMID allocation is independent of resctrl CLOSid allocation and hence
> the RMID is not always married to CLOS which seems like the requirement
> here.

It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
change in my proposal.

>
> OR
>
> We could have CTRLGRPs with control_only, monitor_only or control_monitor
> options.
>
> now a task could be present in both control_only and monitor_only
> group or it could be present only in a control_monitor_group. The
> transitions from one state to another are guarded by this same principle.
>
> CTRLGRP         TASKS           MASK                    TYPE
> CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0         control_only
> CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00       control_only
> CTRLGRP3        PID2,PID3                               monitor_only
> CTRLGRP4        PID5,PID6       L3:0=0Xf0,1=0xf00       control_monitor
>
> CTRLGRP3 allows you to monitor a set of tasks which is not bound to be in
> the same CTRLGRP and you can add or move tasks into this. The adding and
> removing the tasks is whats easily supported compared to the task
> granularity although such a thing could still be supported with the task
> granularity.
>
> CTRLGRP4 allows you to tie the monitor and control together so when tasks
> move in and out of this we still have that group to consider. And these
> groups still retain the cpu masks like before so that cpu monitoring is
> still supported.

Instead of having 3 types of CTRLGRPs, I am proposing one kind
(equivalent to your control_monitor type) that uses a non-zero RMID
when an appropriate perf_event is attached to it. What advantages do
you see in having 3 distinct types?

>
> In this case we would need a new option to support the ctrlgrp monitoring in
> perf or a new tool to do all this if we dont want to bother perf.
>
>

Agree, I like expanding the cgroup fd option to take CTRLGRP fds, as
described in the Implementation Ideas part of the proposal.
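
For reference, a minimal user-space sketch of what such an open could look like, 
assuming the ctrlgrp's monitor_l3_0 file is accepted where a cgroup fd is accepted 
today (the monitor_l3_0 path, the reuse of PERF_FLAG_PID_CGROUP and the config value 
are assumptions of this proposal, not something that exists yet):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        uint64_t count;
        int grp_fd, ev_fd;
        FILE *f;

        /* Proposed per-resource monitoring file inside the CTRLGRP. */
        grp_fd = open("/sys/fs/resctrl/CTRLGRP2/monitor_l3_0", O_RDONLY);
        if (grp_fd < 0)
                return perror("open"), 1;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        /* Dynamic PMU type, as exported by the existing CQM PMU. */
        f = fopen("/sys/bus/event_source/devices/intel_cqm/type", "r");
        if (!f || fscanf(f, "%u", &attr.type) != 1)
                return perror("pmu type"), 1;
        attr.config = 1;        /* llc_occupancy event id, as in today's PMU */

        /* Assumption: a ctrlgrp fd is passed the way a cgroup fd is now,
         * so the event stays per-cpu (cpu 0 of package 0 here). */
        ev_fd = syscall(__NR_perf_event_open, &attr, grp_fd, 0, -1,
                        PERF_FLAG_PID_CGROUP);
        if (ev_fd < 0)
                return perror("perf_event_open"), 1;

        read(ev_fd, &count, sizeof(count));
        printf("llc_occupancy: %llu\n", (unsigned long long)count);
        return 0;
}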

>>
>> If CTRLGRP's schemata changes, the RDT subsystem will find a new
>> CLOSID for the new schemata (potentially reusing an existing one) or
>> fail (just like the old CAT used to). The RMID does not change during
>> schemata updates.
>>
>> If a CTRLGRP dies, the monitoring perf_event continues to exists as a
>> useless wraith, just as happens with cgroup events now.
>>
>> Since CTRLGRPs have no hierarchy. There is no need to handle that in
>> the new RDT Monitoring PMU, greatly simplifying it over the previously
>> proposed versions.
>>
>> A breaking change in user observed behavior with respect to the
>> existing CQM PMU is that there wouldn't be task events. A task must be
>> part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
>> pair. If an user wants to monitor a task across multiple resources
>> (e.g. l3_occupancy across two packages), she must create one event per
>> resource_id and add the two counts.
>>
>> I see this breaking change as an improvement, since hiding the cache
>> topology to user space introduced lots of ugliness and complexity to
>> the CQM PMU without improving accuracy over user space adding the
>> events.
>>
>> Implementation ideas:
>>
>> First idea is to expose one monitoring file per resource in a CTRLGRP,
>> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
>> monitor_l3_0, monitor_l3_1, ...
>>
>> the monitor_<resource_id> file descriptor is passed to perf_event_open
>> in the way cgroup file descriptors are passed now. All events to the
>> same (CTRLGRP,resource_id) share RMID.
>>
>> The RMID allocation part can either be handled by RDT Allocation or by
>> the RDT Monitoring PMU. Either ways, the existence of PMU's
>> perf_events allocates/releases the RMID.
>>
>> Also, since this new design removes hierarchy and task events, it
>> allows for a simple solution of the RMID rotation problem. The removal
>> of task events eliminates the cgroup vs task event conflict existing
>> in the upstream version; it also removes the need to ensure that all
>> active packages have RMIDs at the same time that added complexity to
>> my version of CQM/CMT. Lastly, the removal of hierarchy removes the
>> reliance on cgroups, the complex tree based read, and all the hooks
>> and cgroup files that "raped" the cgroup subsystem.
>
>
> Yes, not sure if the view is same after I sent the implementation details in
> documentation :) (most likely it is).
> But the option could be to not support perf_cgroup for cqm and support a new
> option in perf to monitor resctrl groups and tasks (or some other options
> like mongrp)

Agree with not supporting cgroups. This proposal is about supporting
neither cgroups nor tasks and doing all monitoring through CTRLGRPs
via an expansion of an existing perf option.

>
> I am so far inclined to creating a new monitoring interface that way we dont
> try to "rape" the existing perf specifics for this RDT or later RDT
> quirk/features.
>

On first inspection it seems to me like perf would be fine with this
approach. It requires no changes to the system call and just some
changes in the way the cgroup_fd is handled in perf_event_open
(besides making sure that a context-less PMU doesn't break things). Do
you foresee any conflict with future features?

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 21:44                 ` David Carrillo-Cisneros
@ 2017-01-20 23:51                   ` Shivappa Vikas
  2017-02-08 10:13                     ` Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Shivappa Vikas @ 2017-01-20 23:51 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Shivappa Vikas, Thomas Gleixner, Vikas Shivappa,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Luck, Tony, Fenghua Yu,
	andi.kleen, H. Peter Anvin



On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:

> On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas
> <vikas.shivappa@intel.com> wrote:
>>
>>
>> On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
>>
>>> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@linutronix.de>
>>> wrote:
>>>>
>>>>
>>>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>>>>>
>>>>>
>>>>> If resctrl groups could lift the restriction of one resctl per CLOSID,
>>>>> then the user can create many resctrl in the way perf cgroups are
>>>>> created now. The advantage is that there wont be cgroup hierarchy!
>>>>> making things much simpler. Also no need to optimize perf event
>>>>> context switch to make llc_occupancy work.
>>>>
>>>>
>>>> So if I understand you correctly, then you want a mechanism to have
>>>> groups
>>>> of entities (tasks, cpus) and associate them to a particular resource
>>>> control group.
>>>>
>>>> So they share the CLOSID of the control group and each entity group can
>>>> have its own RMID.
>>>>
>>>> Now you want to be able to move the entity groups around between control
>>>> groups without losing the RMID associated to the entity group.
>>>>
>>>> So the whole picture would look like this:
>>>>
>>>> rdt ->  CTRLGRP -> CLOSID
>>>>
>>>> mon ->  MONGRP  -> RMID
>>>>
>>>> And you want to move MONGRP from one CTRLGRP to another.
>>>
>>>
>>> Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
>>> same thing. Details below.
>>>
>>>>
>>>> Can you please write up in a abstract way what the design requirements
>>>> are
>>>> that you need. So far we are talking about implementation details and
>>>> unspecfied wishlists, but what we really need is an abstract requirement.
>>>
>>>
>>> My pleasure:
>>>
>>>
>>> Design Proposal for Monitoring of RDT Allocation Groups.
>>>
>>> -----------------------------------------------------------------------------
>>>
>>> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
>>> cache bitmask (CBM) per resource. Non-unique CBM are possible although
>>> useless. An unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
>>> CLOSIDs are much more scarce than RMIDs.
>>>
>>> If we lift the condition of unique CLOSID, then the user can create
>>> multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
>>> would share the CLOSID and RDT_Allocation must maintain the schemata
>>> to CLOSID relationship (similarly to what the previous CAT driver used
>>> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
>>> now: adding an element removes it from its previous CTRLGRP.
>>>
>>>
>>> This change would allow further partitioning the allocation groups
>>> into (allocation, monitoring) groups as follows:
>>>
>>> With allocation only:
>>>            CTRLGRP0     CTRLGRP_ALLOC_ONLY
>>> schemata:  L3:0=0xff0       L3:0=x00f
>>> tasks:       PID0       P0_0,P0_1,P1_0,P1_1
>>> cpus:        0x3                0xC
>>
>>
>> Not clear what the PID0 and P0_0 mean ?
>
> PID0, and P*_* are arbitrary PIDs. The tasks file works the same as it
> does now in RDT. I am not changing that.
>
>>
>> If you have to support something like MONGRP and CTRLGRP overall you want to
>> allow for a task to be present in multiple groups ?
>
> I am not proposing to support MONGRP and CTRLGRP. I am proposing to
> allow monitoring of CTRLGRPs only.
>
>>>
>>> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
>>> independently, with the new model we could create:
>>>            CTRLGRP0     CTRLGRP1     CTRLGRP2        CTRLGRP3
>>> schemata:  L3:0=0xff0   L3:0=x00f    L3:0=0x00f     L3:0=0x00f
>>> tasks:       PID0         <none>      P0_0,P0_1     P1_0, P1_1
>>> cpus:        0x3           0xC          0x0             0x0
>>>
>>> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
>>> (L3,0).
>>>
>>>
>>> Now we can ask perf to monitor any of the CTRLGRPs independently -once
>>> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
>>> The perf_event will reserve and assign the RMID to the monitored
>>> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
>>> (CLOSID and RMID), so perf won't have to.
>>
>>
>> This can be solved by supporting just the -t in perf and a new option in perf
>> to support resctrl group monitoring (something similar to -R). That way we
>> provide the flexible granularity to monitor tasks independent of whether
>> they are in any resctrl group (and hence also a subset).
>
> One of the key points of my proposal is to remove monitoring PIDs
> independently. That simplifies things by letting RDT handle CLOSIDs
> and RMIDs together.
>
>>
>> CTRLGRP         TASKS           MASK
>> CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0
>> CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00
>>
>> #perf stat -e llc_occupancy -R CTRLGRP1
>>
>> #perf stat -e llc_occupancy -t PID3,PID4
>>
>> The RMID allocation is independent of resctrl CLOSid allocation and hence
>> the RMID is not always married to CLOS which seems like the requirement
>> here.
>
> It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
> change in my proposal.
>
>>
>> OR
>>
>> We could have CTRLGRPs with control_only, monitor_only or control_monitor
>> options.
>>
>> now a task could be present in both control_only and monitor_only
>> group or it could be present only in a control_monitor_group. The
>> transitions from one state to another are guarded by this same principle.
>>
>> CTRLGRP         TASKS           MASK                    TYPE
>> CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0         control_only
>> CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00       control_only
>> CTRLGRP3        PID2,PID3                               monitor_only
>> CTRLGRP4        PID5,PID6       L3:0=0Xf0,1=0xf00       control_monitor
>>
>> CTRLGRP3 allows you to monitor a set of tasks which is not bound to be in
>> the same CTRLGRP and you can add or move tasks into this. The adding and
>> removing the tasks is whats easily supported compared to the task
>> granularity although such a thing could still be supported with the task
>> granularity.
>>
>> CTRLGRP4 allows you to tie the monitor and control together so when tasks
>> move in and out of this we still have that group to consider. And these
>> groups still retain the cpu masks like before so that cpu monitoring is
>> still supported.
>
> Instead of having 3 types of CTRLGRPs, I am proposing one kind
> (equivalent to your control_monitor type) that uses a non-zero RMID
> when an appropriate perf_event is attached to it. What advantages do
> you see on having 3 distinct types?

Basically I am trying to collate the requirements from what you and Stephane 
mentioned, from Thomas, and from some other OEMs who only care about task monitoring 
(probably for a long time/lifetime, so they want it to be efficient etc).

To cover all the scenarios, I see these as the design requirements:

A group here is a group of tasks.

1. To set up control groups and be able to monitor the control groups.
2. To be able to monitor the groups from the beginning of task creation 
(lifetime or continuous monitoring).
3. To be able to monitor groups and not care about the CLOS (or allocation is 
'don't care'). Meaning the monitor groups may be a subset of the control groups 
or a superset, or may intersect multiple control groups.
Or IOW, the task set here is arbitrary.
4. To monitor a task without bothering to create a group.

CTRLGRP         TASKS           MASK                    TYPE
CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0         control_only
CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00       control_only
CTRLGRP3        PID2,PID3                               monitor_only
CTRLGRP4        PID5,PID6       L3:0=0Xf0,1=0xf00       control_monitor

Now the implementation of this could be either through adding a new option to perf 
(reusing some of the cgroup code or not etc, which is implementation specific), or through 
creating a new tool - by tool I really mean an ioctl mechanism - so we just have a syscall 
which can be called. We don't need a user mode tool per se. For example I use resmon as 
the tool below, but that is equivalent to having a similar option in perf.

With this, if the user wants to do #1:

# resmon -R CTRLGRP1

#2

The resctrl group adds the children of its tasks to the same group, so we can 
essentially do continuous monitoring:

# echo $$ > ../ctrlgrp1/tasks
# resmon -R ctrlgrp1
# task1 &

#3 above is the situation where you want to first profile a bunch of tasks to see 
what the cache usage is (don't care about the CLOS) and then, based on the usage, 
assign them or group them into control groups and give them cache. This could be a 
typical usage model in large scale cluster workload distribution or even in a 
real-time workload.

Create a monitor_only group for this, like CTRLGRP3 above:

# echo PID1, ... PIDn > ../ctrlgrp3/tasks
# resmon -R ctrlgrp3

#4 above is when, say, you are already monitoring some tasks like PID2 and PID3 
which are part of a CTRLGRP, but you want to monitor just PID2 now. You cannot create 
a new CTRLGRP with just PID2 if PID2 is already present in one of the CTRLGRPs 
(if we break this restriction, it complicates the number of groups that 
can be created, as we assume there is no hierarchy).

The user can use the -t or task monitoring option to do this. The user can also use 
this option without bothering to create any resctrl groups at all:

# resmon -t PID4, PID5, PID6

I think the email thread is getting very long and we should just meet f2f, probably 
next week, to iron out the requirements and chalk out a design proposal.

Overall it looks like we can do without recycling/rotation and instead just reuse 
RMIDs and throw an error when we run out of them, and support per-package RMIDs. But we 
need to add support for cpu monitoring and resctrl group monitoring (without losing 
the option to have the flexibility to support monitoring at other 
granularities than the resctrl groups).
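
Roughly the allocation model I have in mind for the no-recycling part (just a sketch, 
the names and data structures below are made up; the point is a per-package pool and a 
hard error when it is empty, instead of stealing RMIDs later):

#include <linux/types.h>
#include <linux/bitops.h>
#include <linux/errno.h>

#define RMID_MAX        256     /* per package; the real count comes from CPUID */

/* Made-up per-package bookkeeping: a plain bitmap of in-use RMIDs. */
struct pkg_rmid_pool {
        unsigned long used[BITS_TO_LONGS(RMID_MAX)];
};

/*
 * Reserve an RMID on one package when the monitor group/event is set up.
 * No recycling: if the pool is empty we fail the setup right away.
 */
static int alloc_rmid(struct pkg_rmid_pool *pool)
{
        int rmid;

        /* RMID 0 is left for the default/unmonitored case. */
        for (rmid = 1; rmid < RMID_MAX; rmid++)
                if (!test_and_set_bit(rmid, pool->used))
                        return rmid;

        return -ENOSPC;         /* caller errors out, no silent reuse */
}

static void free_rmid(struct pkg_rmid_pool *pool, int rmid)
{
        clear_bit(rmid, pool->used);
}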

>
>>
>> In this case we would need a new option to support the ctrlgrp monitoring in
>> perf or a new tool to do all this if we dont want to bother perf.
>>
>>
>
> Agree, I like expanding the cgroup fd option to take CTRLGRP fds, as
> described in the Implementation Ideas part of the proposal.
>
>>>
>>> If CTRLGRP's schemata changes, the RDT subsystem will find a new
>>> CLOSID for the new schemata (potentially reusing an existing one) or
>>> fail (just like the old CAT used to). The RMID does not change during
>>> schemata updates.
>>>
>>> If a CTRLGRP dies, the monitoring perf_event continues to exists as a
>>> useless wraith, just as happens with cgroup events now.
>>>
>>> Since CTRLGRPs have no hierarchy. There is no need to handle that in
>>> the new RDT Monitoring PMU, greatly simplifying it over the previously
>>> proposed versions.
>>>
>>> A breaking change in user observed behavior with respect to the
>>> existing CQM PMU is that there wouldn't be task events. A task must be
>>> part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
>>> pair. If an user wants to monitor a task across multiple resources
>>> (e.g. l3_occupancy across two packages), she must create one event per
>>> resource_id and add the two counts.
>>>
>>> I see this breaking change as an improvement, since hiding the cache
>>> topology to user space introduced lots of ugliness and complexity to
>>> the CQM PMU without improving accuracy over user space adding the
>>> events.
>>>
>>> Implementation ideas:
>>>
>>> First idea is to expose one monitoring file per resource in a CTRLGRP,
>>> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
>>> monitor_l3_0, monitor_l3_1, ...
>>>
>>> the monitor_<resource_id> file descriptor is passed to perf_event_open
>>> in the way cgroup file descriptors are passed now. All events to the
>>> same (CTRLGRP,resource_id) share RMID.
>>>
>>> The RMID allocation part can either be handled by RDT Allocation or by
>>> the RDT Monitoring PMU. Either ways, the existence of PMU's
>>> perf_events allocates/releases the RMID.
>>>
>>> Also, since this new design removes hierarchy and task events, it
>>> allows for a simple solution of the RMID rotation problem. The removal
>>> of task events eliminates the cgroup vs task event conflict existing
>>> in the upstream version; it also removes the need to ensure that all
>>> active packages have RMIDs at the same time that added complexity to
>>> my version of CQM/CMT. Lastly, the removal of hierarchy removes the
>>> reliance on cgroups, the complex tree based read, and all the hooks
>>> and cgroup files that "raped" the cgroup subsystem.
>>
>>
>> Yes, not sure if the view is same after I sent the implementation details in
>> documentation :) (most likely it is).
>> But the option could be to not support perf_cgroup for cqm and support a new
>> option in perf to monitor resctrl groups and tasks (or some other options
>> like mongrp)
>
> Agree with no supporting cgroups. This proposal is about supporting
> neither cgroups nor tasks and do all monitoring through CTRLGRPs
> through an expansion of an existing perf option.
>
>>
>> I am so far inclined to creating a new monitoring interface that way we dont
>> try to "rape" the existing perf specifics for this RDT or later RDT
>> quirk/features.
>>
>
> On first inspection it seems to me like perf would be fine with this
> approach. It requires no changes to the system call and just some
> changes in the way the cgroup_fd is handled in perf_event_open
> (besides making sure that a context-less PMU don't break things). Do
> you foresee any conflict with future features?

The new tool would really add a new syscall instead of modifying the existing 
perf_event_open syscall for the cgroup (please see above).

Thanks,
Vikas

>
> Thanks,
> David
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 20:11             ` David Carrillo-Cisneros
  2017-01-20 21:08               ` Shivappa Vikas
@ 2017-01-23  9:47               ` Thomas Gleixner
  2017-01-23 11:30                 ` Peter Zijlstra
  2017-02-01 20:08                 ` Luck, Tony
  2017-02-08 10:11               ` Peter Zijlstra
  2 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-01-23  9:47 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Vikas Shivappa, Vikas Shivappa, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Luck,
	Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Can you please write up in a abstract way what the design requirements are
> > that you need. So far we are talking about implementation details and
> > unspecfied wishlists, but what we really need is an abstract requirement.
> 
> My pleasure:
> 
> 
> Design Proposal for Monitoring of RDT Allocation Groups.

I was asking for requirements, not a design proposal. In order to make a
design you need a requirements specification.

So again: 

  Can everyone involved please write up their specific requirements
  for CQM and stop spamming us with half-baked design proposals?

  And I mean abstract requirements and not again something which is
  referring to existing crap or some desired crap.

The complete list of requirements has to be agreed on before we talk about
anything else.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-23  9:47               ` Thomas Gleixner
@ 2017-01-23 11:30                 ` Peter Zijlstra
  2017-02-01 20:08                 ` Luck, Tony
  1 sibling, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2017-01-23 11:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Carrillo-Cisneros, Vikas Shivappa, Vikas Shivappa,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar, Shankar,
	Ravi V, Luck, Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Mon, Jan 23, 2017 at 10:47:44AM +0100, Thomas Gleixner wrote:
> So again: 
> 
>   Can please everyone involved write up their specific requirements
>   for CQM and stop spamming us with half baken design proposals?
> 
>   And I mean abstract requirements and not again something which is
>   referring to existing crap or some desired crap.
> 
> The complete list of requirements has to be agreed on before we talk about
> anything else.

So something along the lines of:

 A) need to create a (named) group of tasks
   1) group composition needs to be dynamic; ie. we can add/remove member
      tasks at any time.
   2) a task can only belong to _one_ group at any one time.
   3) grouping need not be hierarchical?

 B) for each group, we need to set a CAT mask
   1) this CAT mask must be dynamic; ie. we can, during the existence of
      the group, change the mask at any time.

 C) for each group, we need to monitor CQM bits
   1) this monitor need not change


Supporting Use-Cases:

A.1: The Job (or VM) can have a dynamic task set
B.1: Dynamic QoS for each Job (or VM) as demand / load changes



Feel free to expand etc..

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-23  9:47               ` Thomas Gleixner
  2017-01-23 11:30                 ` Peter Zijlstra
@ 2017-02-01 20:08                 ` Luck, Tony
  2017-02-01 23:12                   ` David Carrillo-Cisneros
  2017-02-02  0:35                   ` Andi Kleen
  1 sibling, 2 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-01 20:08 UTC (permalink / raw)
  To: Thomas Gleixner, David Carrillo-Cisneros
  Cc: Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu,
	Fenghua, Kleen, Andi, Anvin, H Peter

> I was asking for requirements, not a design proposal. In order to make a
> design you need a requirements specification.

Here's what I came up with ... not a fully baked list, but should allow for some useful
discussion on whether any of these are not really needed, or if there is a glaring hole
that misses some use case:

1)	Able to measure using all supported events (currently L3 occupancy, Total B/W, Local B/W)
2)	Measure per thread
3)	Including kernel threads
4)	Put multiple threads into a single measurement group (forced by h/w shortage of RMIDs, but probably good to have anyway)
5)	New threads created inherit measurement group from parent
6)	Report separate results per domain (L3)
7)	Must be able to measure based on existing resctrl CAT group
8)	Can get measurements for subsets of tasks in a CAT group (to find the guys hogging the resources)
9)	Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)
10)	Put multiple CPUs into a group

Nice to have:
1)	Readout using "perf(1)" [subset of modes that make sense ... tying monitoring to resctrl file system will make most command line usage of perf(1) close to impossible.]

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-01 20:08                 ` Luck, Tony
@ 2017-02-01 23:12                   ` David Carrillo-Cisneros
  2017-02-02 17:39                     ` Luck, Tony
                                       ` (3 more replies)
  2017-02-02  0:35                   ` Andi Kleen
  1 sibling, 4 replies; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-01 23:12 UTC (permalink / raw)
  To: Luck, Tony, Thomas Gleixner
  Cc: Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu,
	Fenghua, Kleen, Andi, Anvin, H Peter

On Wed, Feb 1, 2017 at 12:08 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> > I was asking for requirements, not a design proposal. In order to make a
> > design you need a requirements specification.
>
> Here's what I came up with ... not a fully baked list, but should allow for some useful
> discussion on whether any of these are not really needed, or if there is a glaring hole
> that misses some use case:
>
> 1)      Able to measure using all supported events (currently L3 occupancy, Total B/W, Local B/W)
> 2)      Measure per thread
> 3)      Including kernel threads
> 4)      Put multiple threads into a single measurement group (forced by h/w shortage of RMIDs, but probably good to have anyway)

Even with infinite hw RMIDs you want to be able to have one RMID per
thread group, to avoid reading a potentially large list of RMIDs every
time you measure one group's event (with the delay and error
associated with measuring many RMIDs whose values fluctuate rapidly).
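
For context on the cost: each RMID has to be read individually through the
IA32_QM_EVTSEL/IA32_QM_CTR MSR pair, on a CPU of the package you care about,
so a group backed by N RMIDs means N of these round trips per package on
every read. A rough kernel-style sketch of a single read (scaling and error
handling trimmed; the helper name is made up, the MSR layout is from the SDM):

#include <linux/types.h>
#include <linux/bitops.h>
#include <asm/msr.h>

#define MSR_IA32_QM_EVTSEL      0x0c8d
#define MSR_IA32_QM_CTR         0x0c8e

#define QM_CTR_ERROR            BIT_ULL(63)
#define QM_CTR_UNAVAILABLE      BIT_ULL(62)

/* event ids: 1 = llc_occupancy, 2 = total mem b/w, 3 = local mem b/w */
static u64 read_one_rmid(u32 rmid, u32 evt_id)
{
        u64 val;

        /* Select <event, RMID>: event id in bits 0-7, RMID in bits 32-41. */
        wrmsrl(MSR_IA32_QM_EVTSEL, ((u64)rmid << 32) | evt_id);
        rdmsrl(MSR_IA32_QM_CTR, val);

        if (val & (QM_CTR_ERROR | QM_CTR_UNAVAILABLE))
                return 0;       /* real code would report this */

        return val;             /* still unscaled; multiply by the CPUID factor */
}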

> 5)      New threads created inherit measurement group from parent
> 6)      Report separate results per domain (L3)
> 7)      Must be able to measure based on existing resctrl CAT group
> 8)      Can get measurements for subsets of tasks in a CAT group (to find the guys hogging the resources)
> 9)      Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)

I agree that "Measure per logical CPU" is a requirement, but why is
"pick active RMID in same precedence for task/cpu as CAT picks CLOSID"
one as well? Are we set on handling RMIDs the way CLOSIDs are
handled? There are drawbacks to doing so; one is that it would make it
impossible to do CPU monitoring and CPU filtering the way it is done for
all other PMUs.

i.e. the following commands (or their equivalent in whatever other API
you create) won't work:

a) perf stat -e intel_cqm/total_bytes/ -C 2

or

b.1) perf stat -e intel_cqm/total_bytes/ -C 2 <a_measurement_group>

or

b.2) perf stat -e intel_cqm/llc_occupancy/ -a <a_measurement_group>

In (a) because many RMIDs may run on the CPU and, in the (b) cases, because the
same measurement group's RMID will be used across all CPUs. I know
this is similar to how it is in CAT, but CAT was never intended to do
monitoring. We can do both the CAT way and the perf way, or not, but if we
will drop support for perf-like CPU support, it must be explicitly
stated and not be an implicit consequence of a design choice leaked into
requirements.

> 10)     Put multiple CPUs into a group


11) Able to measure across CAT groups.  So that a user can:
  A) measure a task that runs on CPUs that are in different CAT groups
(one of Thomas' use cases, FWICT), and
  B) measure tasks even if they change their CAT group (my use case).

>
> Nice to have:
> 1)      Readout using "perf(1)" [subset of modes that make sense ... tying monitoring to resctrl file system will make most command line usage of perf(1) close to impossible.


We discussed this offline and I still disagree that it is close to
impossible to use perf and perf_event_open. In fact, I think it's very
simple:

a) We stretch the usage of the pid parameter in perf_event_open to
also allow a PMU specific task group fd (as of now it's either a PID
or a cgroup fd).
b) PMUs that can handle non-cgroup task groups have a special PMU_CAP
flag to signal the generic code to not resolve the fd to a cgroup
pointer and, instead, save it as is in struct perf_event (a few lines
of code).
c) The PMU takes care of resolving the task group's fd.

The above is ONE way to do it, there may be others. But there is a big
advantage in leveraging perf_event_open and easing integration with the
perf tool and the myriad of tools that use the perf API.

12) Whatever fs or syscall is provided instead of the perf syscalls, it
should provide total_time_enabled in the way perf does, otherwise it is
hard to interpret MBM values.

>
> -Tony
>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-01 20:08                 ` Luck, Tony
  2017-02-01 23:12                   ` David Carrillo-Cisneros
@ 2017-02-02  0:35                   ` Andi Kleen
  2017-02-02  1:12                     ` David Carrillo-Cisneros
  2017-02-02  1:22                     ` Yu, Fenghua
  1 sibling, 2 replies; 91+ messages in thread
From: Andi Kleen @ 2017-02-02  0:35 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, David Carrillo-Cisneros, Vikas Shivappa,
	Shivappa, Vikas, Stephane Eranian, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen,
	Andi, Anvin, H Peter

"Luck, Tony" <tony.luck@intel.com> writes:
> 9)	Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)
> 10)	Put multiple CPUs into a group

I'm not sure this is a real requirement. It's just an optimization,
right? If you can assign policies to threads, you can implicitly set it
per CPU through affinity (or the other way around).
The only benefit would be possibly less context switch overhead,
but if all the threads (including idle) assigned to a CPU have the
same policy it would have the same results.

I suspect dropping this would likely simplify the interface significantly.

-Andi

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02  0:35                   ` Andi Kleen
@ 2017-02-02  1:12                     ` David Carrillo-Cisneros
  2017-02-02  1:19                       ` Andi Kleen
  2017-02-02  1:22                     ` Yu, Fenghua
  1 sibling, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-02  1:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Luck, Tony, Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Wed, Feb 1, 2017 at 4:35 PM, Andi Kleen <andi@firstfloor.org> wrote:
> "Luck, Tony" <tony.luck@intel.com> writes:
>> 9)    Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)
>> 10)   Put multiple CPUs into a group
>
> I'm not sure this is a real requirement. It's just an optimization,
> right? If you can assign policies to threads, you can implicitly set it
> per CPU through affinity (or the other way around).

That's difficult when distinct users/systems do monitoring and system
management.  What if the cluster manager decides to change affinity
for a task after the monitoring service has initiated monitoring a CPU
in the way you describe?

> The only benefit would be possibly less context switch overhead,
> but if all the thread (including idle) assigned to a CPU have the
> same policy it would have the same results.

I think another of the reasons for the CPU monitoring requirement is
to monitor interruptions in CPUs running the idle thread. In CAT,
those interruptions use the CPU's CLOSID. Here they'd use the CPU's
RMID. Since RMIDs are scarce, CPUs can be aggregated into groups to
save many of them.

Also, if perf's like monitoring is supported, it'd allow something like

  perf stat -e LLC-load,LLC-prefetches,intel_cqm/total_bytes -C 2

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02  1:12                     ` David Carrillo-Cisneros
@ 2017-02-02  1:19                       ` Andi Kleen
  0 siblings, 0 replies; 91+ messages in thread
From: Andi Kleen @ 2017-02-02  1:19 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Andi Kleen, Luck, Tony, Thomas Gleixner, Vikas Shivappa,
	Shivappa, Vikas, Stephane Eranian, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen,
	Andi, Anvin, H Peter

> > I'm not sure this is a real requirement. It's just an optimization,
> > right? If you can assign policies to threads, you can implicitly set it
> > per CPU through affinity (or the other way around).
> 
> That's difficult when distinct users/systems do monitoring and system
> management.  What if the cluster manager decides to change affinity
> for a task after the monitoring service has initiated monitoring a CPU
> in the way you describe?

Why would you want to monitor a CPU if you don't know what it is
running?  The results would be meaningless. So you really want
to integrate those two services.

> 
> > The only benefit would be possibly less context switch overhead,
> > but if all the thread (including idle) assigned to a CPU have the
> > same policy it would have the same results.
> 
> I think another of the reasons for the CPU monitoring requirement is
> to monitor interruptions in CPUs running the idle thread. In CAT,

idle threads are just threads, so they could be just exposed
to perf (e.g. combination of pid 0 + cpu filter)


> Also, if perf's like monitoring is supported, it'd allow something like
> 
>   perf stat -e LLC-load,LLC-prefetches,intel_cqm/total_bytes -C 2

This would work without a special API.

-Andi

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02  0:35                   ` Andi Kleen
  2017-02-02  1:12                     ` David Carrillo-Cisneros
@ 2017-02-02  1:22                     ` Yu, Fenghua
  2017-02-02 17:51                       ` Shivappa Vikas
  1 sibling, 1 reply; 91+ messages in thread
From: Yu, Fenghua @ 2017-02-02  1:22 UTC (permalink / raw)
  To: Andi Kleen, Luck, Tony
  Cc: Thomas Gleixner, David Carrillo-Cisneros, Vikas Shivappa,
	Shivappa, Vikas, Stephane Eranian, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Kleen, Andi, Anvin,
	H Peter

> From: Andi Kleen [mailto:andi@firstfloor.org]
> "Luck, Tony" <tony.luck@intel.com> writes:
> > 9)	Measure per logical CPU (pick active RMID in same precedence for
> task/cpu as CAT picks CLOSID)
> > 10)	Put multiple CPUs into a group
> 
> I'm not sure this is a real requirement. It's just an optimization, right? If you
> can assign policies to threads, you can implicitly set it per CPU through affinity
> (or the other way around).
> The only benefit would be possibly less context switch overhead, but if all
> the thread (including idle) assigned to a CPU have the same policy it would
> have the same results.
> 
> I suspect dropping this would likely simplify the interface significantly.

Assigning a pid P to a CPU and monitoring P does not count all events happening on the CPU.
Processes/threads other than the assigned P (e.g. kernel threads) can run on the CPU.
Monitoring P assigned to the CPU is not equal to monitoring the CPU in a lot of cases.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-01 23:12                   ` David Carrillo-Cisneros
@ 2017-02-02 17:39                     ` Luck, Tony
  2017-02-02 19:33                     ` Luck, Tony
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-02 17:39 UTC (permalink / raw)
  To: David Carrillo-Cisneros, Thomas Gleixner
  Cc: Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu,
	Fenghua, Kleen, Andi, Anvin, H Peter

>> 7)      Must be able to measure based on existing resctrl CAT group
>> 8)      Can get measurements for subsets of tasks in a CAT group (to find the guys hogging the resources)
>> 9)      Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)
>
> I agree that "Measure per logical CPU" is a requirement, but why is
> "pick active RMID in same precedence for task/cpu as CAT picks CLOSID"
> one as well? Are we set on handling RMIDs the way CLOSIDs are
> handled? there are drawbacks to do so, one is that it would make
> impossible to do CPU monitoring and CPU filtering the way is done for
> all other PMUs.

I'm too focused on monitoring existing CAT groups.  If we move the parenthetical remark
from item 9 to item 7, then I think it is better.  When monitoring a CAT group we need to
monitor exactly the processes that are controlled by the CAT group. So the RMID must match
the CLOSID, and the precedence rules make that work.

For other monitoring cases we can do things differently - so long as we have a way
to express what we want, and we don't pile a ton of code into context switch to figure
out which RMID is to be loaded into PQR_ASSOC.

I thought of another requirement this morning:

N+1) When we set up monitoring we must allocate all the resources we need (or fail the setup
         if we can't get them). Not allowed to error in the middle of monitoring because we can't
         find a free RMID.

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02  1:22                     ` Yu, Fenghua
@ 2017-02-02 17:51                       ` Shivappa Vikas
  0 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-02-02 17:51 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Andi Kleen, Luck, Tony, Thomas Gleixner, David Carrillo-Cisneros,
	Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Kleen,
	Andi, Anvin, H Peter



On Wed, 1 Feb 2017, Yu, Fenghua wrote:

>> From: Andi Kleen [mailto:andi@firstfloor.org]
>> "Luck, Tony" <tony.luck@intel.com> writes:
>>> 9)	Measure per logical CPU (pick active RMID in same precedence for
>> task/cpu as CAT picks CLOSID)
>>> 10)	Put multiple CPUs into a group
>>
>> I'm not sure this is a real requirement. It's just an optimization, right? If you
>> can assign policies to threads, you can implicitly set it per CPU through affinity
>> (or the other way around).
>> The only benefit would be possibly less context switch overhead, but if all
>> the thread (including idle) assigned to a CPU have the same policy it would
>> have the same results.
>>
>> I suspect dropping this would likely simplify the interface significantly.
>
> Assigning a pid P to a CPU and monitoring the P don't count all events happening on the CPU.
> Other processes/threads (e.g. kernel threads) than the assigned P can run on the CPU.
> Monitoring P assigned to the CPU is not equal to monitoring the CPU in a lot cases.

This matches the use case where a bunch of real-time tasks which have no CLOSid 
(kernel threads or others in the root group) want to run exclusively on a 
cpu and are configured to do so. If any other tasks from another class of 
service run there, we don't want them to pollute the cache - hence they choose their own CLOSid.

Now in order to measure this, the RMIDs need to match the same policy as CAT.

Thanks,
Vikas

>
> Thanks.
>
> -Fenghua
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-01 23:12                   ` David Carrillo-Cisneros
  2017-02-02 17:39                     ` Luck, Tony
@ 2017-02-02 19:33                     ` Luck, Tony
  2017-02-02 20:20                       ` Shivappa Vikas
  2017-02-02 20:22                       ` David Carrillo-Cisneros
  2017-02-06 18:54                     ` Luck, Tony
  2017-02-06 21:22                     ` Luck, Tony
  3 siblings, 2 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-02 19:33 UTC (permalink / raw)
  To: David Carrillo-Cisneros, Thomas Gleixner
  Cc: Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu,
	Fenghua, Kleen, Andi, Anvin, H Peter

>> Nice to have:
>> 1)      Readout using "perf(1)" [subset of modes that make sense ... tying monitoring
>> to resctrl file system will make most command line usage of perf(1) close to impossible.
>
>
> We discussed this offline and I still disagree that it is close to
> impossible to use perf and perf_event_open. In fact, I think it's very
> simple :

Maybe s/most/many/ ?

The issue here is that we are going to define which tasks and cpus are being
monitored *outside* of the perf command.  So usage like:

# perf stat -I 1000 -e intel_cqm/llc_occupancy {command}

is completely out of scope ... we aren't planning to change the perf(1)
command to know about creating a CQM monitor group, assigning the
PID of {command} to it, and then reporting on llc_occupancy.

So perf(1) usage is only going to support modes where it attaches to some
monitor group that was previously established.  The "-C 2" option to monitor
CPU 2 is certainly plausible ... assuming you set up a monitor group to track
what is happening on CPU 2 ... I just don't know how perf(1) would know the
name of that group.

Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
overloading "-g" and having perf(1) pick appropriately from cgroups or rdtgroups
depending on event type.

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02 19:33                     ` Luck, Tony
@ 2017-02-02 20:20                       ` Shivappa Vikas
  2017-02-02 20:22                       ` David Carrillo-Cisneros
  1 sibling, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-02-02 20:20 UTC (permalink / raw)
  To: Luck, Tony
  Cc: David Carrillo-Cisneros, Thomas Gleixner, Vikas Shivappa,
	Shivappa, Vikas, Stephane Eranian, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen,
	Andi, Anvin, H Peter


Hello Peterz/Andi,

On Thu, 2 Feb 2017, Luck, Tony wrote:

>>> Nice to have:
>>> 1)      Readout using "perf(1)" [subset of modes that make sense ... tying monitoring
>>> to resctrl file system will make most command line usage of perf(1) close to impossible.
>>
> Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
> overloading "-g" and have perf(1) pick appropriately from cgroups or rdtgroups
> depending on event type.

Assume we build support to monitor the existing resctrl CAT groups like Thomas 
suggested. For the perf interface, would 
something like the below seem reasonable or a disaster (given that we have a new -R 
option specific to this PMU/which works only on this PMU)?

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

Now monitor the group p1 using perf. perf would have a new option -R to monitor 
the resctrl groups. perf would still have a cqm event like today's 
intel_cqm/llc_occupancy, which however supports only the -R mode and not any of 
-C, -t, -G etc. So pretty much the -R works like a -G, except that it works on 
the resctrl fs and not on a perf_cgroup.
The PMU would have a flag to indicate to the perf user mode tool that only 
the llc_occupancy event is supported with -R.

# perf stat -e intel_cqm/llc_occupancy -R p1

-Vikas

>
> -Tony
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02 19:33                     ` Luck, Tony
  2017-02-02 20:20                       ` Shivappa Vikas
@ 2017-02-02 20:22                       ` David Carrillo-Cisneros
  2017-02-02 23:41                         ` Luck, Tony
  1 sibling, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-02 20:22 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Thu, Feb 2, 2017 at 11:33 AM, Luck, Tony <tony.luck@intel.com> wrote:
>>> Nice to have:
>>> 1)      Readout using "perf(1)" [subset of modes that make sense ... tying monitoring
>>> to resctrl file system will make most command line usage of perf(1) close to impossible.
>>
>>
>> We discussed this offline and I still disagree that it is close to
>> impossible to use perf and perf_event_open. In fact, I think it's very
>> simple :
>
> Maybe s/most/many/ ?
>
> The issue here is that we are going to define which tasks and cpus are being
> monitored *outside* of the perf command.  So usage like:
>
> # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>
> are completely out of scope ... we aren't planning to change the perf(1)
> command to know about creating a CQM monitor group, assigning the
> PID of {command} to it, and then report on llc_occupancy.
>
> So perf(1) usage is only going to support modes where it attaches to some
> monitor group that was previously established.  The "-C 2" option to monitor
> CPU 2 is certainly plausible ... assuming you set up a monitor group to track
> what is happening on CPU 2 ... I just don't know how perf(1) would know the
> name of that group.

There is no need to change perf(1) to support
 # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}

the PMU can work with resctrl to provide the support through
perf_event_open, with the advantage that tools other than perf could
also use it.

I'd argue it is more stable and has fewer corner cases if the
task_mongroups get extra RMIDs for the task events attached to them
than if userspace tools create and destroy groups and move tasks
behind the scenes.

I provided implementation details in the write-up I shared offline on
Monday. If "easy monitoring" of a stand-alone task becomes a
requirement, we can dig into the pros and cons of implementing it in the
kernel vs in user space.

>
> Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
> overloading "-g" and have perf(1) pick appropriately from cgroups or rdtgroups
> depending on event type.

I see it more like generalizing the -G option to represent a task
group that can be a cgroup or a PMU specific one.

Currently perf(1) simply translates the argument of the -G option
into a file descriptor. My idea doesn't change that; it just makes the perf
tool look for a "task_group_root" file in the PMU folder and use it
as the base path for the file descriptor. If a PMU doesn't have
such a file, then perf(1) uses the perf cgroup mount point, as it
does now. That makes for a very simple implementation on the perf tool
side.
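
A sketch of that tool-side lookup, just to show how small it is (the
"task_group_root" file is the hypothetical piece; everything else is the usual
open-a-directory-fd step perf already does for cgroups):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Resolve a -G argument to a directory fd. If the PMU exports the
 * (hypothetical) task_group_root file, its content is the base path;
 * otherwise fall back to the perf cgroup mount point as today.
 */
static int resolve_task_group_fd(const char *pmu, const char *group,
                                 const char *cgroup_mnt)
{
        char path[4096], base[4096];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/task_group_root", pmu);
        f = fopen(path, "r");
        if (f && fgets(base, sizeof(base), f)) {
                base[strcspn(base, "\n")] = '\0';
                fclose(f);
        } else {
                if (f)
                        fclose(f);
                snprintf(base, sizeof(base), "%s", cgroup_mnt);
        }

        snprintf(path, sizeof(path), "%s/%s", base, group);
        return open(path, O_RDONLY);    /* this fd is what perf_event_open() gets */
}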

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02 20:22                       ` David Carrillo-Cisneros
@ 2017-02-02 23:41                         ` Luck, Tony
  2017-02-03  1:40                           ` David Carrillo-Cisneros
  0 siblings, 1 reply; 91+ messages in thread
From: Luck, Tony @ 2017-02-02 23:41 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
> There is no need to change perf(1) to support
>  # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
> 
> the PMU can work with resctrl to provide the support through
> perf_event_open, with the advantage that tools other than perf could
> also use it.

I agree it would be better to expose the counters through
a standard perf_event_open() interface ... but we don't seem
to have had much luck doing that so far.

That would need the requirements to be re-written with a
focus on what resctrl needs to do to support each of the
perf(1) command line modes of operation.  The fact that these
counters work rather differently from normal h/w counters
has resulted in massively complex volumes of code trying
to map them into what perf_event_open() expects.

The key points of weirdness seem to be:

1) We need to allocate an RMID for the duration of monitoring. While
   there are quite a lot of RMIDs, it is easy to envision scenarios
   where there are not enough.

2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a process
   of interest is running.

3) An RMID is shared by llc_occupancy, local_bytes and total_bytes events

4) For llc_occupancy the count can change even when none of the processes
   are running because cache lines are evicted

5) llc_occupancy measures the delta, not the absolute occupancy. To
   get a good result requires monitoring from process creation (or
   lots of patience, or the nuclear option "wbinvd").

6) RMID counters are package scoped


These result in all sorts of hard to resolve situations. E.g. you are
monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
looking at the cache occupancy of PID=234 using RMID=45. The scheduler
decides to run my process on your CPU.  We can only load one RMID, so
one of us will be disappointed (unless we have some crazy complex code
where your instance of perf borrows RMID=45 and reads out the local
byte count on sched_in() and sched_out() to add to the running count
you were keeping against RMID=22).

How can we document such restrictions for people who haven't been
digging in this code for over a year?

I think a perf_event_open() interface would make some simple cases
work, but result in some swearing once people start running multiple
complex monitors at the same time.

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-02 23:41                         ` Luck, Tony
@ 2017-02-03  1:40                           ` David Carrillo-Cisneros
  2017-02-03  2:14                             ` David Carrillo-Cisneros
  0 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-03  1:40 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Thu, Feb 2, 2017 at 3:41 PM, Luck, Tony <tony.luck@intel.com> wrote:
> On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
>> There is no need to change perf(1) to support
>>  # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>>
>> the PMU can work with resctrl to provide the support through
>> perf_event_open, with the advantage that tools other than perf could
>> also use it.
>
> I agree it would be better to expose the counters through
> a standard perf_event_open() interface ... but we don't seem
> to have had much luck doing that so far.
>
> That would need the requirements to be re-written with the
> focus of what does resctrl need to do to support each of the
> perf(1) command line modes of operation.  The fact that these
> counters work rather differently from normal h/w counters
> has resulted in massively complex volumes of code trying
> to map them into what perf_event_open() expects.
>
> The key points of weirdness seem to be:
>
> 1) We need to allocate an RMID for the duration of monitoring. While
>    there are quite a lot of RMIDs, it is easy to envision scenarios
>    where there are not enough.
>
> 2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a process
>    of interest is running.
>
> 3) An RMID is shared by llc_occupancy, local_bytes and total_bytes events
>
> 4) For llc_occupancy the count can change even when none of the processes
>    are running because cache lines are evicted
>
> 5) llc_occupancy measures the delta, not the absolute occupancy. To
>    get a good result requires monitoring from process creation (or
>    lots of patience, or the nuclear option "wbinvd").
>
> 6) RMID counters are package scoped
>
>
> These result in all sorts of hard to resolve situations. E.g. you are
> monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
> looking at the cache occupancy of PID=234 using RMID=45. The scheduler
> decides to run my process on your CPU.  We can only load one RMID, so
> one of us will be disappointed (unless we have some crazy complex code
> where your instance of perf borrows RMID=45 and reads out the local
> byte count on sched_in() and sched_out() to add to the running count
> you were keeping against RMID=22).
>
> How can we document such restrictions for people who haven't been
> digging in this code for over a year?
>
> I think a perf_event_open() interface would make some simple cases
> work, but result in some swearing once people start running multiple
> complex monitors at the same time.

More problems:

7) Time multiplexing of RMIDs is hard because llc_occupancy cannot be reset.

8) Only one RMID per CPU can be loaded at a time into PQR_ASSOC.

Most of the complexity in past attempts was mainly caused by:
  A. Task events being defined as system-wide and not package-wide.
What you describe in points (4) and (6) made this complicated.
  B. The cgroup hierarchy, due to (7) and (8).

A and B caused the bulk of the code by complicating RMID assignment,
reading and rotation.

Now that we've learned from past experience, we have defined
per-domain monitoring and use flat groups. FWICT, that's enough to allow
a simple implementation that can be expressed through perf_event_open.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-03  1:40                           ` David Carrillo-Cisneros
@ 2017-02-03  2:14                             ` David Carrillo-Cisneros
  2017-02-03 17:52                               ` Luck, Tony
  0 siblings, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-03  2:14 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

Something to be aware of is that CAT cpus don't work the way CPU
filtering works in perf:

If I have the following CAT groups:
 - default group with task TD
 - group GC1 with CPU0 and CLOSID 1
 - group GT1 with no CPUs and task T1 and CLOSID2
 - TD and T1 run on CPU0.

Then T1 will use CLOSID2 and TD will use CLOSID1. Some allocations done on CPU0
did not use CLOSID1.

Now, if I have the same setup in monitoring groups and I were to read
llc_occupancy in the RMID of GC1, I'd read llc_occupancy for TD only,
and have a blind spot on T1. That's not how CPU events work on perf.

So CPUs have a different meaning on CAT than on perf.

The above is another reason to separate the allocation and the
monitoring groups. Having
  - Independent allocation and monitoring groups.
  - Independent CPU and task grouping.
would allow us semantics that monitor CAT groups and can eventually be
extended to also monitor the perf way, that is, to support:
  - filter by task
  - filter by task group (cgroup or monitoring group or whatever).
  - filter by CPU (the perf way)
  - combinations of task/task_group and CPU (the perf way)

If we tie allocation groups and monitoring groups, we are tying the
meaning of CPUs and we'll have to choose between the CAT meaning or
the perf meaning.

Let's allow semantics that will allow perf like monitoring to
eventually work, even if it's not immediately supported.

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-03  2:14                             ` David Carrillo-Cisneros
@ 2017-02-03 17:52                               ` Luck, Tony
  2017-02-03 21:08                                 ` David Carrillo-Cisneros
  2017-02-07  8:08                                 ` Stephane Eranian
  0 siblings, 2 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-03 17:52 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> If we tie allocation groups and monitoring groups, we are tying the
> meaning of CPUs and we'll have to choose between the CAT meaning or
> the perf meaning.
> 
> Let's allow semantics that will allow perf like monitoring to
> eventually work, even if it's not immediately supported.

Would it work to make monitor groups be "task list only" or "cpu mask only"
(unlike control groups that allow mixing).

Then the intel_rdt_sched_in() code could pick the RMID in ways that
give you the perf(1) meaning. I.e. if you create a monitor group and assign
some CPUs to it, then we will always load the RMID for that monitor group
when running on those cpus, regardless of what group(s) the current process
belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
assign RMID using same rules as CLOSID (so measurements from a control group
would track allocation policies).
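
A pure sketch of that pick at sched_in time, with made-up names (this
is not the real rdtgroup code):

struct mon_group {
	u32	rmid;
};

/* cpu-only monitor group claiming this CPU, or NULL if none exists */
static DEFINE_PER_CPU(struct mon_group *, cpu_mon_group);

static u32 pick_rmid(struct mon_group *task_grp, int cpu)
{
	struct mon_group *cg = per_cpu(cpu_mon_group, cpu);

	if (cg)
		return cg->rmid;	/* CPU group wins: the perf(1) meaning */

	return task_grp->rmid;		/* otherwise follow the task, same rules as CLOSID */
}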

We are already planning that creating monitor only groups will change
what is reported in the control group (e.g. you pull some tasks out of
the control group to monitor them separately, so the control group only
reports the tasks that you didn't move out for monitoring).

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-03 17:52                               ` Luck, Tony
@ 2017-02-03 21:08                                 ` David Carrillo-Cisneros
  2017-02-03 22:24                                   ` Luck, Tony
  2017-02-07  8:08                                 ` Stephane Eranian
  1 sibling, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-03 21:08 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony <tony.luck@intel.com> wrote:
> On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
>> If we tie allocation groups and monitoring groups, we are tying the
>> meaning of CPUs and we'll have to choose between the CAT meaning or
>> the perf meaning.
>>
>> Let's allow semantics that will allow perf like monitoring to
>> eventually work, even if it's not immediately supported.
>
> Would it work to make monitor groups be "task list only" or "cpu mask only"
> (unlike control groups that allow mixing).

That works, but please don't use chmod. Make it explicit by the group
position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

>
> Then the intel_rdt_sched_in() code could pick the RMID in ways that
> give you the perf(1) meaning. I.e. if you create a monitor group and assign
> some CPUs to it, then we will always load the RMID for that monitor group
> when running on those cpus, regardless of what group(s) the current process
> belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
> assign RMID using same rules as CLOSID (so measurements from a control group
> would track allocation policies).

I think that's very confusing for the user. A group's observed
behavior should be determined by its attributes and not change
depending on how other groups are configured. Think of multiple users
monitoring simultaneously.

>
> We are already planning that creating monitor only groups will change
> what is reported in the control group (e.g. you pull some tasks out of
> the control group to monitor them separately, so the control group only
> reports the tasks that you didn't move out for monitoring).

That's also confusing, and the work-around that Vikas proposed of two
separate files to enumerate tasks (one for control and one for
monitoring) breaks the concept of a task group.





From our discussions, we can support the use cases we care about
without weird corner cases, by having:
  - A set of allocation groups as they stand now. Either use the current
resctrl, or rename it to something like resdir/ctrl (before v4.10
sails).
  - A set of monitoring task groups. Either in a "tasks" folder in a
resmon fs or in resdir/mon/tasks.
  - A set of monitoring CPU groups. Either in a "cpus" folder in a
resmon fs or in resdir/mon/cpus.

So when a user measures a group (shown using the -G option, it could
as well be the -R Vikas wants):

1) perf stat -e llc_occupancy -G resdir/ctrl/g1
measures the CAT allocation group as if RMIDs were managed like CLOSIDs.

2) perf stat -e llc_occupancy -G resdir/mon/tasks/g1
measures the combined occupancy of all tasks in g1 (like a cgroup in
present perf).

3) perf stat -e llc_occupancy -C <some id of resdir/mon/cpus/g1>
*XOR* perf stat -e llc_occupancy -G resdir/mon/cpus/g1
measures the combined occupancy of all tasks while running on any CPU in
g1 (perf-like filtering, not the CAT way).

I know the present implementation scope is limited, so you could:
  - support 1) and/or 2) only
  - do a simple RMID management (e.g. same RMID all packages, allocate
RMID on creation or fail)
  - do the custom fs based tool that Vikas mentioned instead of using
perf_event_open (if it's somehow easier to build and maintain a new
tool rather than reuse perf(1) ).

any or all of the above are fine. But please don't choose group
semantics that will prevent us from eventually supporting full
perf-like behavior or that we already know explode in user's face.

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-03 21:08                                 ` David Carrillo-Cisneros
@ 2017-02-03 22:24                                   ` Luck, Tony
  0 siblings, 0 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-03 22:24 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Fri, Feb 03, 2017 at 01:08:05PM -0800, David Carrillo-Cisneros wrote:
> On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony <tony.luck@intel.com> wrote:
> > On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> >> If we tie allocation groups and monitoring groups, we are tying the
> >> meaning of CPUs and we'll have to choose between the CAT meaning or
> >> the perf meaning.
> >>
> >> Let's allow semantics that will allow perf like monitoring to
> >> eventually work, even if it's not immediately supported.
> >
> > Would it work to make monitor groups be "task list only" or "cpu mask only"
> > (unlike control groups that allow mixing).
> 
> That works, but please don't use chmod. Make it explicit by the group
> position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

I had been thinking that after writing a PID to "tasks" we'd disallow
writes to "cpus". But is sounds nicer for the user to declare their
intention upfront. Counter propsosal in the naming war:

	.../monitor/bytask/{groupname}
	.../monitor/bycpu/{groupname}

> > Then the intel_rdt_sched_in() code could pick the RMID in ways that
> > give you the perf(1) meaning. I.e. if you create a monitor group and assign
> > some CPUs to it, then we will always load the RMID for that monitor group
> > when running on those cpus, regardless of what group(s) the current process
> > belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
> > assign RMID using same rules as CLOSID (so measurements from a control group
> > would track allocation policies).
> 
> I think that's very confusing for the user. A group's observed
> behavior should be determined by its attributes and not change
> depending on how other groups are configured. Think of multiple users
> monitoring simultaneously.
> 
> >
> > We are already planning that creating monitor only groups will change
> > what is reported in the control group (e.g. you pull some tasks out of
> > the control group to monitor them separately, so the control group only
> > reports the tasks that you didn't move out for monitoring).
> 
> That's also confusing, and the work-around that Vikas proposed of two
> separate files to enumerate tasks (one for control and one for
> monitoring) breaks the concept of a task group.

There are some simple cases where we can make the data shown in the
original control group look the same. E.g. we move a few tasks over to a
/bytask/ group (or several groups if we want a very fine breakdown) and
then have the report from the control group sum the RMIDs from the monitor
groups and add to the total from the native RMID of the control group.
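
Roughly (placeholder names; read_rmid_llc_occupancy() stands in for
whatever per-RMID read helper we end up with):

struct mon_grp {
	u32			rmid;
	struct list_head	list;
};

extern u64 read_rmid_llc_occupancy(u32 rmid);

static u64 ctrl_grp_llc_occupancy(u32 ctrl_rmid, struct list_head *mon_grps)
{
	struct mon_grp *m;
	u64 total = read_rmid_llc_occupancy(ctrl_rmid);

	/* add back what was split off into monitor groups */
	list_for_each_entry(m, mon_grps, list)
		total += read_rmid_llc_occupancy(m->rmid);

	return total;
}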

But this falls apart if the user asks a single monitor group to monitor
tasks from multiple control groups.  Perhaps we could disallow this
(when we assign the first task to a monitor group, capture the CLOSID
and then only allow other tasks with the same CLOSID to be added ... unless
the group becomes empty, at which point we can latch onto a new CLOSID).

/bycpu/ monitoring is very resource intensive if we have to preserve
the control group reports. We'd need to allocate MAXCLOSID[1] RMIDs for
each group so that we can keep separate counts for tasks from each
control group that run on our CPUs and then sum them to report the
/bycpu/ data (instead of just one RMID, and no math).  This also
puts more memory references into the sched_in path while we
figure out which RMID to load into PQR_ASSOC.

I'd want to warn the user in the Documentation that splitting off
too many monitor groups from a control group will result in less
than stellar accuracy in reporting as the kernel cannot read
multiple RMIDs atomically and data is changing between reads.

> I know the present implementation scope is limited, so you could:
>   - support 1) and/or 2) only
>   - do a simple RMID management (e.g. same RMID all packages, allocate
> RMID on creation or fail)
>   - do the custom fs based tool that Vikas mentioned instead of using
> perf_event_open (if it's somehow easier to build and maintain a new
> tool rather than reuse perf(1) ).
> 
> any or all of the above are fine. But please don't choose group
> semantics that will prevent us from eventually supporting full
> perf-like behavior or that we already know explode in user's face.

I'm trying hard to find a way to do this. I.e. start with a patch
that has limited capabilities and needs a custom tool, but can later
grow into something that meets your needs.

-Tony

[1] Lazy allocation means we might discover we can't get a free RMID in the
middle of a context switch ... not willing to go there.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-01 23:12                   ` David Carrillo-Cisneros
  2017-02-02 17:39                     ` Luck, Tony
  2017-02-02 19:33                     ` Luck, Tony
@ 2017-02-06 18:54                     ` Luck, Tony
  2017-02-06 21:22                     ` Luck, Tony
  3 siblings, 0 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-06 18:54 UTC (permalink / raw)
  To: David Carrillo-Cisneros, Thomas Gleixner
  Cc: Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu,
	Fenghua, Kleen, Andi, Anvin, H Peter

Digging through the e-mails from last week to generate a new version
of the requirements I looked harder at this:

> 12) Whatever fs or syscall is provided instead of perf syscalls, it
> should provide total_time_enabled in the way perf does, otherwise is
> hard to interpret MBM values.

This looks tricky if we are piggy-backing on the CAT code to switch
RMID along with CLOSID at context switch time.  We could get an
approximation by adding:

	if (newRMID != oldRMID) {
		now = sched_clock();	/* or whatever cheap clock we settle on */
		atomic64_add(now - this_cpu_read(rmid_time),
			     &rmid_enabled_time[oldRMID]);
		this_cpu_write(rmid_time, now);
	}

but:

1) that would only work on a single socket machine (we'd really want rmid_enabled_time
separately for each socket)
2) when we want to read that enabled time, we'd really need to add time for all the
threads currently running on CPUs across the system since we last switched RMID
3) reading the time and doing atomic ops in context switch code won't be popular

:-(

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-01 23:12                   ` David Carrillo-Cisneros
                                       ` (2 preceding siblings ...)
  2017-02-06 18:54                     ` Luck, Tony
@ 2017-02-06 21:22                     ` Luck, Tony
  2017-02-06 21:36                       ` Shivappa Vikas
  2017-02-06 22:16                       ` David Carrillo-Cisneros
  3 siblings, 2 replies; 91+ messages in thread
From: Luck, Tony @ 2017-02-06 21:22 UTC (permalink / raw)
  To: David Carrillo-Cisneros, Thomas Gleixner
  Cc: Vikas Shivappa, Shivappa, Vikas, Stephane Eranian, linux-kernel,
	x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu,
	Fenghua, Kleen, Andi, Anvin, H Peter

> 12) Whatever fs or syscall is provided instead of perf syscalls, it
> should provide total_time_enabled in the way perf does, otherwise is
> hard to interpret MBM values.

It seems that it is hard to define what we even mean by memory bandwidth.

If you are measuring just one task and you find that the total number of bytes
read is 1GB at some point, and one second later the total bytes is 2GB, then
it is clear that the average bandwidth for this process is 1GB/s. If you know
that the task was only running for 50% of the cycles during that 1s interval,
you could say that it is doing 2GB/s ... which is I believe what you were
thinking when you wrote #12 above.  But whether that is right depends a
bit on *why* it only ran 50% of the time. If it was time-sliced out by the
scheduler ... then it may have been trying to be a 2GB/s app. But if it
was waiting for packets from the network, then it really is using 1 GB/s.
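
In other words, the question is just which denominator to use (toy
numbers, not real counter output):

#include <stdio.h>

void bandwidth_example(void)
{
	double gbytes  = 1.0;	/* GB counted during the interval */
	double enabled = 1.0;	/* seconds of wall-clock time     */
	double running = 0.5;	/* seconds actually on a CPU      */

	/* 1 GB/s against wall-clock time vs 2 GB/s against on-cpu time */
	printf("%.1f GB/s vs %.1f GB/s\n", gbytes / enabled, gbytes / running);
}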

All bets are off if you are measuring a service that consists of several
tasks running concurrently. All you can really talk about is the aggregate
average bandwidth (total bytes / wall-clock time). It makes no sense to
try and factor in how much cpu time each of the individual tasks got.

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-06 21:22                     ` Luck, Tony
@ 2017-02-06 21:36                       ` Shivappa Vikas
  2017-02-06 21:46                         ` David Carrillo-Cisneros
  2017-02-06 22:16                       ` David Carrillo-Cisneros
  1 sibling, 1 reply; 91+ messages in thread
From: Shivappa Vikas @ 2017-02-06 21:36 UTC (permalink / raw)
  To: Luck, Tony
  Cc: David Carrillo-Cisneros, Thomas Gleixner, Vikas Shivappa,
	Shivappa, Vikas, Stephane Eranian, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen,
	Andi, Anvin, H Peter



On Mon, 6 Feb 2017, Luck, Tony wrote:

>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>> should provide total_time_enabled in the way perf does, otherwise is
>> hard to interpret MBM values.
>
> It seems that it is hard to define what we even mean by memory bandwidth.
>
> If you are measuring just one task and you find that the total number of bytes
> read is 1GB at some point, and one second later the total bytes is 2GB, then
> it is clear that the average bandwidth for this process is 1GB/s. If you know
> that the task was only running for 50% of the cycles during that 1s interval,
> you could say that it is doing 2GB/s ... which is I believe what you were
> thinking when you wrote #12 above.  But whether that is right depends a
> bit on *why* it only ran 50% of the time. If it was time-sliced out by the
> scheduler ... then it may have been trying to be a 2GB/s app. But if it
> was waiting for packets from the network, then it really is using 1 GB/s.

Is the requirement to have both enabled and run time, or just enabled time
(enabled time must be easy to report - just the wall time from start of trace
to end of trace)?

This is not reported correctly in the upstream perf cqm, and for
cgroup -C we don't report it either (since we report the package).

Thanks,
Vikas

>
> All bets are off if you are measuring a service that consists of several
> tasks running concurrently. All you can really talk about is the aggregate
> average bandwidth (total bytes / wall-clock time). It makes no sense to
> try and factor in how much cpu time each of the individual tasks got.
>
> -Tony
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-06 21:36                       ` Shivappa Vikas
@ 2017-02-06 21:46                         ` David Carrillo-Cisneros
  0 siblings, 0 replies; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-06 21:46 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: Luck, Tony, Thomas Gleixner, Vikas Shivappa, Stephane Eranian,
	linux-kernel, x86, hpa, Ingo Molnar, Peter Zijlstra, Shankar,
	Ravi V, Yu, Fenghua, Kleen, Andi, Anvin, H Peter

On Mon, Feb 6, 2017 at 1:36 PM, Shivappa Vikas <vikas.shivappa@intel.com> wrote:
>
>
> On Mon, 6 Feb 2017, Luck, Tony wrote:
>
>>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>>> should provide total_time_enabled in the way perf does, otherwise is
>>> hard to interpret MBM values.
>>
>>
>> It seems that it is hard to define what we even mean by memory bandwidth.
>>
>> If you are measuring just one task and you find that the total number of
>> bytes
>> read is 1GB at some point, and one second later the total bytes is 2GB,
>> then
>> it is clear that the average bandwidth for this process is 1GB/s. If you
>> know
>> that the task was only running for 50% of the cycles during that 1s
>> interval,
>> you could say that it is doing 2GB/s ... which is I believe what you were
>> thinking when you wrote #12 above.  But whether that is right depends a
>> bit on *why* it only ran 50% of the time. If it was time-sliced out by the
>> scheduler ... then it may have been trying to be a 2GB/s app. But if it
>> was waiting for packets from the network, then it really is using 1 GB/s.
>
>
> Is the requirement is to have both enabled and run time or just enabled time
> (enabled time must be easy to report - just the wall time from start trace
> to end trace)?

Both, but since the original requirements dropped rotation, then
total_running == total_enabled.

>
> This is not reported correctly in the upstream perf cqm and for
> cgroup -C we dont report it either (since we report the package).

Using the -x option shows the run time and the % enabled. Many tools
use that CSV output.

>
> Thanks,
> Vikas
>
>
>>
>> All bets are off if you are measuring a service that consists of several
>> tasks running concurrently. All you can really talk about is the aggregate
>> average bandwidth (total bytes / wall-clock time). It makes no sense to
>> try and factor in how much cpu time each of the individual tasks got.
>>
>> -Tony
>>
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-06 21:22                     ` Luck, Tony
  2017-02-06 21:36                       ` Shivappa Vikas
@ 2017-02-06 22:16                       ` David Carrillo-Cisneros
  2017-02-06 23:27                         ` Luck, Tony
  1 sibling, 1 reply; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-06 22:16 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Mon, Feb 6, 2017 at 1:22 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>> should provide total_time_enabled in the way perf does, otherwise is
>> hard to interpret MBM values.
>
> It seems that it is hard to define what we even mean by memory bandwidth.
>
> If you are measuring just one task and you find that the total number of bytes
> read is 1GB at some point, and one second later the total bytes is 2GB, then
> it is clear that the average bandwidth for this process is 1GB/s. If you know
> that the task was only running for 50% of the cycles during that 1s interval,
> you could say that it is doing 2GB/s ... which is I believe what you were
> thinking when you wrote #12 above.

Yes, that's one of the cases.

> But whether that is right depends a
> bit on *why* it only ran 50% of the time. If it was time-sliced out by the
> scheduler ... then it may have been trying to be a 2GB/s app. But if it
> was waiting for packets from the network, then it really is using 1 GB/s.

IMO, "right" means that measured bandwidth and running time are
correct. The *why* is a bigger question.

>
> All bets are off if you are measuring a service that consists of several
> tasks running concurrently. All you can really talk about is the aggregate
> average bandwidth (total bytes / wall-clock time). It makes no sense to
> try and factor in how much cpu time each of the individual tasks got.

cgroup mode gives a per-CPU breakdown of event and running time, the
tool aggregates it into running time vs event count. Both per-cpu
breakdown and the aggregate are useful.

Piggy-backing on perf's cgroup mode would give us all the above for free.

>
> -Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-06 22:16                       ` David Carrillo-Cisneros
@ 2017-02-06 23:27                         ` Luck, Tony
  2017-02-07  0:33                           ` David Carrillo-Cisneros
  0 siblings, 1 reply; 91+ messages in thread
From: Luck, Tony @ 2017-02-06 23:27 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

> cgroup mode gives a per-CPU breakdown of event and running time, the
> tool aggregates it into running time vs event count. Both per-cpu
> breakdown and the aggregate are useful.
>
> Piggy-backing on perf's cgroup mode would give us all the above for free.

Do you have some sample output from a perf run on a cgroup measuring a
"normal" event showing what you get?

I think that requires that we still go through perf ->start() and ->stop() functions
to know how much time we spent running.  I thought we were looking at bundling
the RMID updates into the same spot in sched() where we switch the CLOSID.
More or less at the "start" point, but there is no "stop".  If we are switching between
runnable processes, it amounts to pretty much the same thing ... except we bill
to someone all the time instead of having a gap in the context switch where we
stopped billing to the old task and haven't started billing to the new one yet.
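
For reference, the combined update amounts to something like this
(field layout per the SDM: RMID in the low bits, CLOSID in the upper
32; the function name is made up, and the MSR define already exists
in msr-index.h):

#define MSR_IA32_PQR_ASSOC	0x0c8f

static inline void __rdt_sched_in(u32 closid, u32 rmid)
{
	u64 val = ((u64)closid << 32) | rmid;

	/* one MSR write per switch_to covers both allocation and monitoring */
	wrmsrl(MSR_IA32_PQR_ASSOC, val);
}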

But if we idle ... then we don't "stop".  Shouldn't matter much from a measurement
perspective because idle won't use cache or consume bandwidth. But we'd count
that time as "on cpu" for the last process to run.

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-06 23:27                         ` Luck, Tony
@ 2017-02-07  0:33                           ` David Carrillo-Cisneros
  0 siblings, 0 replies; 91+ messages in thread
From: David Carrillo-Cisneros @ 2017-02-07  0:33 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Vikas Shivappa, Shivappa, Vikas,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Mon, Feb 6, 2017 at 3:27 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> cgroup mode gives a per-CPU breakdown of event and running time, the
>> tool aggregates it into running time vs event count. Both per-cpu
>> breakdown and the aggregate are useful.
>>
>> Piggy-backing on perf's cgroup mode would give us all the above for free.
>
> Do you have some sample output from a perf run on a cgroup measuring a
> "normal" event showing what you get?

# perf stat -I 1000 -e cycles -a -C 0-1 -A -x, -G /
     1.000116648,CPU0,20677864,,cycles,/
     1.000169948,CPU1,24760887,,cycles,/
     2.000453849,CPU0,36120862,,cycles,/
     2.000480259,CPU1,12535575,,cycles,/
     3.000664762,CPU0,7564504,,cycles,/
     3.000692552,CPU1,7307480,,cycles,/

>
> I think that requires that we still go through perf ->start() and ->stop() functions
> to know how much time we spent running.  I thought we were looking at bundling
> the RMID updates into the same spot in sched() where we switch the CLOSID.
> More or less at the "start" point, but there is no "stop".  If we are switching between
> runnable processes, it amounts to pretty much the same thing ... except we bill
> to someone all the time instead of having a gap in the context switch where we
> stopped billing to the old task and haven't started billing to the new one yet.

Another problem is that it will require a perf event all the time for
timing measurements to be consistent with RMID measurements.

The only sane option I can come up with is to do timing in RDT the way perf
cgroup does it (keep a per-cpu time that increases by the local clock's
delta). A reader can add the times for all CPUs in cpu_mask.
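
Roughly the bookkeeping perf cgroup already does, with placeholder
names (one instance per monitoring group per CPU):

struct mon_time {
	u64	time;	/* accumulated enabled time, in ns  */
	u64	stamp;	/* local clock value at last update */
};

static void mon_update_time(struct mon_time *t)
{
	u64 now = sched_clock();

	t->time  += now - t->stamp;
	t->stamp  = now;
}

/* a reader sums ->time over all CPUs in the group's cpu_mask */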

>
> But if we idle ... then we don't "stop".  Shouldn't matter much from a measurement
> perspective because idle won't use cache or consume bandwidth. But we'd count
> that time as "on cpu" for the last process to run.

I may be missing something basic, but isn't __switch_to called when
switching to the idle task? That will update the CLOSID and RMID to
whatever the idle task is in, isn't it?

Thanks,
David

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-03 17:52                               ` Luck, Tony
  2017-02-03 21:08                                 ` David Carrillo-Cisneros
@ 2017-02-07  8:08                                 ` Stephane Eranian
  2017-02-07 18:52                                   ` Luck, Tony
                                                     ` (2 more replies)
  1 sibling, 3 replies; 91+ messages in thread
From: Stephane Eranian @ 2017-02-07  8:08 UTC (permalink / raw)
  To: Luck, Tony
  Cc: David Carrillo-Cisneros, Thomas Gleixner, Vikas Shivappa,
	Shivappa, Vikas, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

Hi,

I wanted to take a few steps back and look at the overall goals for
cache monitoring.
From the various threads and discussion, my understanding is as follows.

I think the design must ensure that the following usage models can be monitored:
   - the allocations in your CAT partitions
   - the allocations from a task (inclusive of children tasks)
   - the allocations from a group of tasks (inclusive of children tasks)
   - the allocations from a CPU
   - the allocations from a group of CPUs

All cases but the first one (CAT) are natural usage. So I want to describe
the CAT case in more detail.
The goal, as I understand it, is to monitor what is going on inside
the CAT partition to detect
whether it saturates or if it has room to "breathe". Let's take a
simple example.

Suppose, we have a CAT group, cat1:

cat1: 20MB partition (CLOSID1)
    CPUs=CPU0,CPU1
    TASKs=PID20

There can only be one CLOSID active on a CPU at a time. The kernel
chooses to prioritize tasks over CPU when enforcing cases with multiple
CLOSIDs.

Let's review how this works for cat1 and for each scenario look at how
the kernel enforces or not the cache partition:

 1. ENFORCED: PIDx with no CLOSID runs on CPU0 or CPU1
 2. NOT ENFORCED: PIDx with CLOSIDx (x!=1) runs on CPU0, CPU1
 3. ENFORCED: PID20 runs with CLOSID1 on CPU0, CPU1
 4. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with CPU CLOSIDx (x!=1)
 5. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with no CLOSID

Now, let's review how we could track the allocations done in cat1 using a single
RMID. There can only be one RMID active at a time per CPU. The kernel
chooses to prioritize tasks over CPU:

cat1: 20MB partition (CLOSID1, RMID1)
    CPUs=CPU0,CPU1
    TASKs=PID20

 1. MONITORED: PIDx with no RMID runs on CPU0 or CPU1
 2. NOT MONITORED: PIDx with RMIDx (x!=1) runs on CPU0, CPU1
 3. MONITORED: PID20 with RMID1 runs on CPU0, CPU1
 4. MONITORED: PID20 with RMID1 runs on CPUx (x!=0,1) with CPU RMIDx (x!=1)
 5. MONITORED: PID20 runs with RMID1 on CPUx (x!=0,1) with no RMID

To make sense to a user, the cases where the hardware monitors MUST be
the same as the cases where the hardware enforces the cache
partitioning.

Here we see that it works using a single RMID.

However doing so limits certain monitoring modes where a user might want to
get a breakdown per CPU of the allocations, such as with:
  $ perf stat -a -A -e llc_occupancy -R cat1
(where -R points to the monitoring group in rsrcfs). Here this mode would not be
possible because the two CPUs in the group share the same RMID.

Now let's take another scenario, and suppose you have two monitoring groups
as follows:

mon1: RMID1
    CPUs=CPU0,CPU1
mon2: RMID2
    TASKS=PID20

If PID20 runs on CPU0, then RMID2 is activated, and thus allocations
done by PID20 are not counted towards RMID1. There is a blind spot.

Whether or not this is a problem depends on the semantic exported by
the interface for CPU mode:
   1-Count all allocations from any tasks running on CPU
   2-Count all allocations from tasks which are NOT monitoring themselves

If the kernel chooses 1, then there is a blind spot and the measurement
is not as accurate as it could be because of the decision to use only one RMID.
But if the kernel chooses 2, then everything works fine with a single RMID.

If the kernel treats occupancy monitoring as measuring cycles on a CPU, i.e.,
measure any activity from any thread (choice 1), then the single RMID per group
does not work.

If the kernel treats occupancy monitoring as measuring cycles in a cgroup on a
CPU, i.e., measures only when threads of the cgroup run on that CPU, then using
a single RMID per group works.

Hope this helps clarifies the usage model and design choices.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-07  8:08                                 ` Stephane Eranian
@ 2017-02-07 18:52                                   ` Luck, Tony
  2017-02-08 19:31                                     ` Stephane Eranian
  2017-02-07 20:10                                   ` Shivappa Vikas
  2017-02-17 13:41                                   ` Thomas Gleixner
  2 siblings, 1 reply; 91+ messages in thread
From: Luck, Tony @ 2017-02-07 18:52 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: David Carrillo-Cisneros, Thomas Gleixner, Vikas Shivappa,
	Shivappa, Vikas, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
> Hi,
> 
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring.
> From the various threads and discussion, my understanding is as follows.
> 
> I think the design must ensure that the following usage models can be monitored:
>    - the allocations in your CAT partitions
>    - the allocations from a task (inclusive of children tasks)
>    - the allocations from a group of tasks (inclusive of children tasks)
>    - the allocations from a CPU
>    - the allocations from a group of CPUs
> 
> All cases but the first one (CAT) are natural usage. So I want to describe
> the CAT case in more detail.
> The goal, as I understand it, is to monitor what is going on inside
> the CAT partition to detect
> whether it saturates or if it has room to "breathe". Let's take a
> simple example.

By "natural usage" you mean "like perf(1) provides for other events"?

But we are trying to figure out requirements here ... what data do people
need to manage caches and memory bandwidth.  So from this perspective
monitoring a CAT group is a natural first choice ... did we provision
this group with too much, or too little cache.

From that starting point I can see that a possible next step when
finding that a CAT group has too small a cache is to drill down to
find out how the tasks in the group are using cache.  Armed with that
information you could move tasks that hog too much cache (and are believed
to be streaming through memory) into a different CAT group.

What I'm not seeing is how drilling to CPUs helps you.

Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
shows that 75% of the cache occupancy is attributed to CPU0, and only
25% to CPU1.  What can you do with this information to improve things?
If it is deemed too complex (from a kernel code perspective) to
implement per-CPU reporting how bad a loss would that be?

-Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-07  8:08                                 ` Stephane Eranian
  2017-02-07 18:52                                   ` Luck, Tony
@ 2017-02-07 20:10                                   ` Shivappa Vikas
  2017-02-17 13:41                                   ` Thomas Gleixner
  2 siblings, 0 replies; 91+ messages in thread
From: Shivappa Vikas @ 2017-02-07 20:10 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Luck, Tony, David Carrillo-Cisneros, Thomas Gleixner,
	Vikas Shivappa, Shivappa, Vikas, linux-kernel, x86, hpa,
	Ingo Molnar, Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen,
	Andi, Anvin, H Peter



On Tue, 7 Feb 2017, Stephane Eranian wrote:

> Hi,
>
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring.
> From the various threads and discussion, my understanding is as follows.
>
> I think the design must ensure that the following usage models can be monitored:
>   - the allocations in your CAT partitions
>   - the allocations from a task (inclusive of children tasks)
>   - the allocations from a group of tasks (inclusive of children tasks)
>   - the allocations from a CPU
>   - the allocations from a group of CPUs
>
> All cases but the first one (CAT) are natural usage. So I want to describe
> the CAT case in more detail.
> The goal, as I understand it, is to monitor what is going on inside
> the CAT partition to detect
> whether it saturates or if it has room to "breathe". Let's take a
> simple example.
>
> Suppose, we have a CAT group, cat1:
>
> cat1: 20MB partition (CLOSID1)
>    CPUs=CPU0,CPU1
>    TASKs=PID20
>
> There can only be one CLOSID active on a CPU at a time. The kernel
> chooses to prioritize tasks over CPU when enforcing cases with multiple
> CLOSIDs.
>
> Let's review how this works for cat1 and for each scenario look at how
> the kernel enforces or not the cache partition:
>
> 1. ENFORCED: PIDx with no CLOSID runs on CPU0 or CPU1
> 2. NOT ENFORCED: PIDx with CLOSIDx (x!=1) runs on CPU0, CPU1
> 3. ENFORCED: PID20 runs with CLOSID1 on CPU0, CPU1
> 4. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with CPU CLOSIDx (x!=1)
> 5. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with no CLOSID
>
> Now, let's review how we could track the allocations done in cat1 using a single
> RMID. There can only be one RMID active at a time per CPU. The kernel
> chooses to prioritize tasks over CPU:
>
> cat1: 20MB partition (CLOSID1, RMID1)
>    CPUs=CPU0,CPU1
>    TASKs=PID20
>
> 1. MONITORED: PIDx with no RMID runs on CPU0 or CPU1
> 2. NOT MONITORED: PIDx with RMIDx (x!=1) runs on CPU0, CPU1
> 3. MONITORED: PID20 with RMID1 runs on CPU0, CPU1
> 4. MONITORED: PID20 with RMID1 runs on CPUx (x!=0,1) with CPU RMIDx (x!=1)
> 5. MONITORED: PID20 runs with RMID1 on CPUx (x!=0,1) with no RMID
>
> To make sense to a user, the cases where the hardware monitors MUST be
> the same as the cases where the hardware enforces the cache
> partitioning.
>
> Here we see that it works using a single RMID.
>
> However doing so limits certain monitoring modes where a user might want to
> get a breakdown per CPU of the allocations, such as with:
>  $ perf stat -a -A -e llc_occupancy -R cat1
> (where -R points to the monitoring group in rsrcfs). Here this mode would not be
> possible because the two CPUs in the group share the same RMID.

In the requirements here https://marc.info/?l=linux-kernel&m=148597969808732

8)      Can get measurements for subsets of tasks in a CAT group (to find the 
guys hogging the resources).

This should also apply to subsets of cpus.

That would let you monitor CPUs that are a subset of, or different from, a CAT
group.  That should let you create mon groups like the ones in the second
example you mention, along with the control groups above.

mon0: RMID0
     CPUs=CPU0

mon1: RMID1
     CPUs=CPU1

mon2: RMID2
     CPUs=CPU2

...


>
> Now let's take another scenario, and suppose you have two monitoring groups
> as follows:
>
> mon1: RMID1
>    CPUs=CPU0,CPU1
> mon2: RMID2
>    TASKS=PID20
>
> If PID20 runs on CPU0, then RMID2 is activated, and thus allocations
> done by PID20 are not counted towards RMID1. There is a blind spot.
>
> Whether or not this is a problem depends on the semantic exported by
> the interface for CPU mode:
>   1-Count all allocations from any tasks running on CPU
>   2-Count all allocations from tasks which are NOT monitoring themselves
>
> If the kernel chooses 1, then there is a blind spot and the measurement
> is not as accurate as it could be because of the decision to use only one RMID.
> But if the kernel chooses 2, then everything works fine with a single RMID.
>
> If the kernel treats occupancy monitoring as measuring cycles on a CPU, i.e.,
> measure any activity from any thread (choice 1), then the single RMID per group
> does not work.
>
> If the kernel treats occupancy monitoring as measuring cycles in a cgroup on a
> CPU, i.e., measures only when threads of the cgroup run on that CPU, then using
> a single RMID per group works.
>

Agree there are blind spots in both. But the requirements try to be based
on the resctrl allocation, as Thomas suggested, which is aligned with
monitoring real-time tasks as I understand it.
For the above example, some tasks which do not have an RMID (say, in the root
group) are the real-time tasks that are specially configured to run on a CPUx
which needs to be allocated or monitored.


> Hope this helps clarifies the usage model and design choices.
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 20:11             ` David Carrillo-Cisneros
  2017-01-20 21:08               ` Shivappa Vikas
  2017-01-23  9:47               ` Thomas Gleixner
@ 2017-02-08 10:11               ` Peter Zijlstra
  2 siblings, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2017-02-08 10:11 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, Vikas Shivappa, Vikas Shivappa,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar, Shankar,
	Ravi V, Luck, Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Fri, Jan 20, 2017 at 12:11:53PM -0800, David Carrillo-Cisneros wrote:
> Implementation ideas:
> 
> First idea is to expose one monitoring file per resource in a CTRLGRP,
> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
> monitor_l3_0, monitor_l3_1, ...
> 
> the monitor_<resource_id> file descriptor is passed to perf_event_open
> in the way cgroup file descriptors are passed now. All events to the
> same (CTRLGRP,resource_id) share RMID.
> 
> The RMID allocation part can either be handled by RDT Allocation or by
> the RDT Monitoring PMU. Either ways, the existence of PMU's
> perf_events allocates/releases the RMID.

So I've had complaints about exactly that behaviour. Someone wanted
RMIDs assigned (and measurement started) the moment the grouping got
created/tasks started running etc.

So I think the design should also explicitly state how this is supposed
to be handled and not left as an implementation detail.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-01-20 23:51                   ` Shivappa Vikas
@ 2017-02-08 10:13                     ` Peter Zijlstra
  0 siblings, 0 replies; 91+ messages in thread
From: Peter Zijlstra @ 2017-02-08 10:13 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: David Carrillo-Cisneros, Thomas Gleixner, Vikas Shivappa,
	Stephane Eranian, linux-kernel, x86, hpa, Ingo Molnar, Shankar,
	Ravi V, Luck, Tony, Fenghua Yu, andi.kleen, H. Peter Anvin

On Fri, Jan 20, 2017 at 03:51:48PM -0800, Shivappa Vikas wrote:
> I think the email thread is going very long and we should just meet f2f
> probably next week to iron out the requirements and chalk out a design
> proposal.

The thread isn't the problem; you lot not trimming your emails is
however.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-07 18:52                                   ` Luck, Tony
@ 2017-02-08 19:31                                     ` Stephane Eranian
  0 siblings, 0 replies; 91+ messages in thread
From: Stephane Eranian @ 2017-02-08 19:31 UTC (permalink / raw)
  To: Luck, Tony
  Cc: David Carrillo-Cisneros, Thomas Gleixner, Vikas Shivappa,
	Shivappa, Vikas, linux-kernel, x86, hpa, Ingo Molnar,
	Peter Zijlstra, Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin,
	H Peter

Tony,

On Tue, Feb 7, 2017 at 10:52 AM, Luck, Tony <tony.luck@intel.com> wrote:
> On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
>> Hi,
>>
>> I wanted to take a few steps back and look at the overall goals for
>> cache monitoring.
>> From the various threads and discussion, my understanding is as follows.
>>
>> I think the design must ensure that the following usage models can be monitored:
>>    - the allocations in your CAT partitions
>>    - the allocations from a task (inclusive of children tasks)
>>    - the allocations from a group of tasks (inclusive of children tasks)
>>    - the allocations from a CPU
>>    - the allocations from a group of CPUs
>>
>> All cases but the first one (CAT) are natural usage. So I want to describe
>> the CAT case in more detail.
>> The goal, as I understand it, is to monitor what is going on inside
>> the CAT partition to detect
>> whether it saturates or if it has room to "breathe". Let's take a
>> simple example.
>
> By "natural usage" you mean "like perf(1) provides for other events"?
>
Yes, people are used to monitoring events per task or per CPU. In that
sense, it is the common usage model. Cgroup monitoring is a derivative
of per-cpu mode.

> But we are trying to figure out requirements here ... what data do people
> need to manage caches and memory bandwidth.  So from this perspective
> monitoring a CAT group is a natural first choice ... did we provision
> this group with too much, or too little cache.
>
I am not saying CAT is not natural. I am saying it is a justified requirement,
but a new one, and thus we need to make sure it is understood and that the
kernel tracks CAT partitions and CAT partition cache occupancy monitoring
similarly.

> From that starting point I can see that a possible next step when
> finding that a CAT group has too small a cache is to drill down to
> find out how the tasks in the group are using cache.  Armed with that
> information you could move tasks that hog too much cache (and are believed
> to be streaming through memory) into a different CAT group.
>
This is a valid usage model. But you have people who care about monitoring
occupancy but do not necessarily use CAT partitions. Yet in this case, the
occupancy data is still very useful to gauge cache footprint of a workload.
Therefore this usage model should not be discounted.

> What I'm not seeing is how drilling to CPUs helps you.
>
Looking for imbalance, for instance.
Are all the allocations done from only a subset of the CPUs?

> Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
> shows that 75% of the cache occupancy is attributed to CPU0, and only
> 25% to CPU1.  What can you do with this information to improve things?
> If it is deemed too complex (from a kernel code perspective) to
> implement per-CPU reporting how bad a loss would that be?
>
It is okay to first focus on per-task and per-CAT partition. What I'd
like to see is
an API that could possibly be extended later on to do per-CPU only mode. I am
okay with having only per-CAT and per-task groups initially to keep
things simpler.
But the rsrcfs interface should allow extension to per-CPU only mode. Then the
kernel implementation would take care of allocating the RMID accordingly. The
key is always to ensure allocations can be tracked since the inception of the
group, be it CAT, tasks, or CPU.

> -Tony

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
  2017-02-07  8:08                                 ` Stephane Eranian
  2017-02-07 18:52                                   ` Luck, Tony
  2017-02-07 20:10                                   ` Shivappa Vikas
@ 2017-02-17 13:41                                   ` Thomas Gleixner
  2 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2017-02-17 13:41 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Luck, Tony, David Carrillo-Cisneros, Vikas Shivappa, Shivappa,
	Vikas, linux-kernel, x86, hpa, Ingo Molnar, Peter Zijlstra,
	Shankar, Ravi V, Yu, Fenghua, Kleen, Andi, Anvin, H Peter

On Tue, 7 Feb 2017, Stephane Eranian wrote:
> 
> I think the design must ensure that the following usage models can be monitored:
>    - the allocations in your CAT partitions
>    - the allocations from a task (inclusive of children tasks)
>    - the allocations from a group of tasks (inclusive of children tasks)
>    - the allocations from a CPU
>    - the allocations from a group of CPUs

What's missing here is:

     - the allocations of a subset of users (tasks/groups/cpu(s)) of a
       particular CAT partition

Looking at your requirement list, all requirements, except the first point,
have no relationship to CAT (at least not from your write up). Now the
obvious questions are:

 - Does it make sense to ignore CAT relations in these sets?

 - Does it make sense to monitor a task / group of tasks, where the tasks
   belong to different CAT partitions?

 - Does it make sense to monitor a CPU / group of CPUs as a whole
   independent of which CAT partitions have been utilized during the
   monitoring period?

I don't think it makes any sense, unless the resulting information is split
up into CAT partitions.

I'm happy to be educated on the value of making this CAT unaware, but so
far I have only come up with results which need a crystal ball to analyze.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

Thread overview: 91+ messages
2017-01-06 21:59 [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Vikas Shivappa
2017-01-06 21:59 ` [PATCH 01/12] Documentation, x86/cqm: Intel Resource Monitoring Documentation Vikas Shivappa
2017-01-06 21:59 ` [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling Vikas Shivappa
2017-01-06 21:59 ` [PATCH 03/12] x86/rdt: Add rdt common/cqm compile option Vikas Shivappa
2017-01-16 18:05   ` Thomas Gleixner
2017-01-17 17:25     ` Shivappa Vikas
2017-01-06 21:59 ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support Vikas Shivappa
2017-01-16 18:15   ` [PATCH 04/12] x86/cqm: Add Per pkg rmid support\ Thomas Gleixner
2017-01-17 19:11     ` Shivappa Vikas
2017-01-06 21:59 ` [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare Vikas Shivappa
2017-01-17 12:11   ` Thomas Gleixner
2017-01-17 12:31     ` Peter Zijlstra
2017-01-18  2:14     ` Shivappa Vikas
2017-01-17 13:46   ` Thomas Gleixner
2017-01-17 20:22     ` Shivappa Vikas
2017-01-17 21:31       ` Thomas Gleixner
2017-01-17 15:26   ` Peter Zijlstra
2017-01-17 20:27     ` Shivappa Vikas
2017-01-06 21:59 ` [PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support Vikas Shivappa
2017-01-17 14:07   ` Thomas Gleixner
2017-01-06 22:00 ` [PATCH 07/12] x86/rdt,cqm: Scheduling support update Vikas Shivappa
2017-01-17 21:58   ` Thomas Gleixner
2017-01-17 22:30     ` Shivappa Vikas
2017-01-06 22:00 ` [PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup together Vikas Shivappa
2017-01-17 16:11   ` Thomas Gleixner
2017-01-06 22:00 ` [PATCH 09/12] x86/cqm: Add RMID reuse Vikas Shivappa
2017-01-17 16:59   ` Thomas Gleixner
2017-01-18  0:26     ` Shivappa Vikas
2017-01-06 22:00 ` [PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg reads Vikas Shivappa
2017-01-06 22:00 ` [PATCH 11/12] perf/stat: fix bug in handling events in error state Vikas Shivappa
2017-01-06 22:00 ` [PATCH 12/12] perf/stat: revamp read error handling, snapshot and per_pkg events Vikas Shivappa
2017-01-17 17:31 ` [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes Thomas Gleixner
2017-01-18  2:38   ` Shivappa Vikas
2017-01-18  8:53     ` Thomas Gleixner
2017-01-18  9:56       ` Peter Zijlstra
2017-01-19 19:59         ` Shivappa Vikas
2017-01-18 19:41       ` Shivappa Vikas
2017-01-18 21:03       ` David Carrillo-Cisneros
2017-01-19 17:41         ` Thomas Gleixner
2017-01-20  7:37           ` David Carrillo-Cisneros
2017-01-20  8:30             ` Thomas Gleixner
2017-01-20 20:27               ` David Carrillo-Cisneros
2017-01-18 21:16       ` Yu, Fenghua
2017-01-19  2:09       ` David Carrillo-Cisneros
2017-01-19 16:58         ` David Carrillo-Cisneros
2017-01-19 17:54           ` Thomas Gleixner
2017-01-19  2:21       ` Vikas Shivappa
2017-01-19  6:45       ` Stephane Eranian
2017-01-19 18:03         ` Thomas Gleixner
2017-01-20  2:32       ` Vikas Shivappa
2017-01-20  7:58         ` David Carrillo-Cisneros
2017-01-20 13:28           ` Thomas Gleixner
2017-01-20 20:11             ` David Carrillo-Cisneros
2017-01-20 21:08               ` Shivappa Vikas
2017-01-20 21:44                 ` David Carrillo-Cisneros
2017-01-20 23:51                   ` Shivappa Vikas
2017-02-08 10:13                     ` Peter Zijlstra
2017-01-23  9:47               ` Thomas Gleixner
2017-01-23 11:30                 ` Peter Zijlstra
2017-02-01 20:08                 ` Luck, Tony
2017-02-01 23:12                   ` David Carrillo-Cisneros
2017-02-02 17:39                     ` Luck, Tony
2017-02-02 19:33                     ` Luck, Tony
2017-02-02 20:20                       ` Shivappa Vikas
2017-02-02 20:22                       ` David Carrillo-Cisneros
2017-02-02 23:41                         ` Luck, Tony
2017-02-03  1:40                           ` David Carrillo-Cisneros
2017-02-03  2:14                             ` David Carrillo-Cisneros
2017-02-03 17:52                               ` Luck, Tony
2017-02-03 21:08                                 ` David Carrillo-Cisneros
2017-02-03 22:24                                   ` Luck, Tony
2017-02-07  8:08                                 ` Stephane Eranian
2017-02-07 18:52                                   ` Luck, Tony
2017-02-08 19:31                                     ` Stephane Eranian
2017-02-07 20:10                                   ` Shivappa Vikas
2017-02-17 13:41                                   ` Thomas Gleixner
2017-02-06 18:54                     ` Luck, Tony
2017-02-06 21:22                     ` Luck, Tony
2017-02-06 21:36                       ` Shivappa Vikas
2017-02-06 21:46                         ` David Carrillo-Cisneros
2017-02-06 22:16                       ` David Carrillo-Cisneros
2017-02-06 23:27                         ` Luck, Tony
2017-02-07  0:33                           ` David Carrillo-Cisneros
2017-02-02  0:35                   ` Andi Kleen
2017-02-02  1:12                     ` David Carrillo-Cisneros
2017-02-02  1:19                       ` Andi Kleen
2017-02-02  1:22                     ` Yu, Fenghua
2017-02-02 17:51                       ` Shivappa Vikas
2017-02-08 10:11               ` Peter Zijlstra
2017-01-20 20:40           ` Shivappa Vikas
2017-01-20 19:31         ` Stephane Eranian
