* [PATCH v3 00/46] Cache Monitoring Technology (aka CQM)
@ 2016-10-30  0:37 David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
                   ` (45 more replies)
  0 siblings, 46 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This series introduces the next iteration of kernel support for
Cache Monitoring Technology, or CMT (formerly Cache QoS Monitoring, CQM),
available in Intel Xeon processors.

Intel's documentation has replaced the name Intel CQM with Intel CMT.
This version renames all code to CMT accordingly.

It is rebased on tip x86/core to continue on top of
Fenghua Yu's Intel CAT series (partially merged).

One of the main limitations of the previous version is the inability
to simultaneously monitor:
  1) An llc_occupancy CPU event and a cgroup or task llc_occupancy
  event on that CPU.
  2) cgroup events for cgroups in the same descendancy line.
  3) cgroup events and any task event whose thread runs in a cgroup
  in the same descendancy line.

Another limitation is that monitoring for a cgroup was enabled/disabled by
the existence of a perf event for that cgroup. Since the llc_occupancy
event measures changes in occupancy rather than total occupancy, an event
must stay enabled for a long enough period of time in order to read
meaningful llc_occupancy values. The context switch overhead caused by
the perf events is undesirable in some sensitive scenarios.

This series of patches addresses the shortcomings mentioned above and
adds some other improvements. The main changes are:
	- No more potential conflicts between different events. The new
	version builds a hierarchy of RMIDs that captures the dependencies
	between monitored cgroups. llc_occupancy for a cgroup is the sum of
	llc_occupancies for that cgroup's RMID and all other RMIDs in the
	cgroup's subtree (both monitored cgroups and threads).

	- A cgroup integration that allows monitoring of a cgroup to start
	without creating a perf event, decreasing the context switch
	overhead. Monitoring is controlled by a semicolon-separated list
	of flags passed to a perf cgroup attribute, e.g.:

		echo "1;3;0;1" > cgroup_path/perf_event.cmt_monitoring

	CPU packages 0, 1 and 3 have flags > 0, which marks those packages
	to be monitored using RMIDs even if no perf_event is attached
	to the cgroup. The meaning of the other flag values is explained
	in their own patches.
	
	A perf_event is always required in order to read llc_occupancy.
	This cgroup integration uses Intel's PQR code and is intended to
	share code with the upcoming Intel CAT driver.
	
	- A more stable rotation algorithm: the new algorithm explicitly
	defines SLOs to guarantee that RMIDs are assigned and kept
	long enough to produce meaningful occupancy values.

	- Reduced impact of stealing/rotation of RMIDs: the new algorithm
	tries to assign dirty RMIDs to their previous owners when
	suitable, decreasing the error introduced by RMID rotation and
	the negative impact of dirty RMIDs that drop occupancy too slowly
	when unscheduled.

	- Eliminate pmu::count: perf's generic perf_event_count()
	performs a quick add of atomic types. The introduction of
	pmu::count in the previous CMT series to read occupancy for thread
	events changed the behavior of perf_event_count() by performing a
	potentially slow IPI and MSR write/read. It also made pmu::read
	behave differently depending on whether the event was a cpu/cgroup
	event or a thread event. This patch series removes the custom
	pmu::count from CMT and provides consistent behavior for all
	calls of perf_event_read (see the first sketch after this list).

	- Add error return for pmu::read: reads of CQM events may fail
	due to stealing of RMIDs, even after successfully adding an event
	to a PMU. This patch series expands pmu::read with an int return
	value and propagates the error to callers that can fail
	(i.e. perf_read); see the second sketch after this list.
	The ability of pmu::read to fail is consistent with the recent
	changes that allow perf_event_read to fail for transactional
	reading of event groups.

	- Introduce additional flags in perf_event::group_caps and
	perf_event::event_caps: the flags PERF_EV_CAP_READ_ANY_{,CPU_}PKG
	allow reading CMT events while an event is inactive, saving
	unnecessary IPIs. The flag PERF_EV_CAP_CGROUP_NO_RECURSION prevents
	generic code from programming multiple CMT events on a CPU when
	dealing with a cgroup hierarchy, since this is unsupported by the
	hardware (see the last sketch below).
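
As referenced in the pmu::count item above, the fast path this series
restores is, roughly, the plain atomic sum below. This is a paraphrase of
the generic helpers around this kernel version, trimmed for illustration
rather than quoted from the tree: with the CMT pmu::count callback gone,
perf_event_count() no longer has a per-PMU detour that may cost an IPI
plus an MSR access.

	/* Sketch, not an exact quote: the cheap generic path. */
	static u64 __perf_event_count(struct perf_event *event)
	{
		return local64_read(&event->count) +
		       atomic64_read(&event->child_count);
	}

	static u64 perf_event_count(struct perf_event *event)
	{
		/*
		 * Previously: if (event->pmu->count)
		 *                     return event->pmu->count(event);
		 * which, for CMT thread events, meant an IPI plus an MSR
		 * read instead of the quick add in __perf_event_count().
		 */
		return __perf_event_count(event);
	}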

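The pmu::read change can be pictured as follows. This is only a sketch of
the new contract: the callback returns 0 or a negative errno, and callers
that are allowed to fail (such as perf_read) pass the error on. The
helpers cmt_rmid_valid() and cmt_read_occupancy(), as well as the caller
read_one_event(), are illustrative placeholders, not functions from the
series.

	/* Driver side (sketch): report a stolen/rotated RMID as an error. */
	static int intel_cmt_event_read(struct perf_event *event)
	{
		u64 occupancy;

		if (!cmt_rmid_valid(event))		/* placeholder helper */
			return -ENODATA;		/* RMID was stolen/rotated */

		occupancy = cmt_read_occupancy(event);	/* placeholder MSR read */
		local64_set(&event->count, occupancy);
		return 0;
	}

	/* Caller side (sketch): pmu->read() used to return void; now the
	 * error can reach user space through perf_read and friends. */
	static int read_one_event(struct perf_event *event)
	{
		return event->pmu->read(event);
	}
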
This patch series also updates the perf tool to fix error handling and to
better handle the idiosyncrasies of snapshot and per-pkg events.
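
Finally, the new capability flags from the change list above can be
pictured as extra bits next to the existing PERF_EV_CAP_SOFTWARE in
perf_event::event_caps and perf_event::group_caps. The bit positions
below are placeholders chosen for illustration, not values taken from
the patches; the comments restate the cover-letter semantics.

	#define PERF_EV_CAP_SOFTWARE		BIT(0)	/* pre-existing capability bit */
	/* New bits, positions illustrative only: */
	#define PERF_EV_CAP_READ_ANY_PKG	BIT(1)	/* read allowed while inactive, avoiding IPIs */
	#define PERF_EV_CAP_READ_ANY_CPU_PKG	BIT(2)	/* per-package variant of the above */
	#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(3)	/* don't program multiple CMT events per CPU
							 * for a cgroup hierarchy (unsupported by hw) */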

Support for Intel MBM is yet to be built on top of this driver.

Changes in 3rd version:
  - Rename from CQM to CMT, making it consistent with Intel's latest docs.
  - Plenty of fixes requested by Thomas G. Mainly:
      - Redesign of pmonr state machine.
      - Avoid abuse of lock_nested() by defining static lock_class_key's.
      - Simplify locking rules.
      - Remove unnecessary macros, inlines and wrappers.
      - Remove reliance on WARN_ONs for error handling.
      - Use kzalloc.
      - Fix comments and line breaks.
      - Add high level overview in comments of cmt header file.
      - Cleaner device initialization/termination. Still not modular,
	I am holding that change until the integration with perf cgroup
	is discussed (currently it is through architecture-specific hooks,
	see patch 36).
  - Clean up and simplify RMID rotation code.
  - Add user specific flags (uflags) for both events and
    perf_cgroup.cmt_monitoring to allow No Rotation and No Lazy Allocation
    of rmids.
  - Use CPU Hotplug state machine.
  - No longer need the new hook perf_event_exec to start monitoring after
    an exec (hook introduced in v1, removed in this one).
  - Remove polling of llc_occupancy for active rmids. Replaced by
    asynchronous read (see patch 30).
  - Change rmid pools to bitmaps, thus removing "wrapped rmid" (wrmid).
  - Removal of per-package pools of wrmids used as temporary objects.
  - Added a very useful debugfs node to observe internals such as:
	- monr hierarchy.
	- per-package data.
	- llc_occupancy of rmids.
  - Reduction of code size to 66% of v2 (now 641 KBs).
  - Rebased to tip x86/cache.

Changes in 2nd version:
  - As requested by Peter Z., redo commit history to completely remove
    old version of CQM in a single patch.
  - Use topology_max_packages and fix build errors reported by
    Vikas Shivappa.
  - Split largest patches, clean up.
  - Rebased to peterz/queue perf/core.


David Carrillo-Cisneros (45):
  perf/x86/intel/cqm: remove previous version of CQM and MBM
  perf/x86/intel: rename CQM cpufeatures to CMT
  x86/intel: add CONFIG_INTEL_RDT_M configuration flag
  perf/x86/intel/cmt: add device initialization and CPU hotplug support
  perf/x86/intel/cmt: add per-package locks
  perf/x86/intel/cmt: add intel_cmt pmu
  perf/core: add RDT Monitoring attributes to struct hw_perf_event
  perf/x86/intel/cmt: add MONitored Resource (monr) initialization
  perf/x86/intel/cmt: add basic monr hierarchy
  perf/x86/intel/cmt: add Package MONitored Resource (pmonr)
    initialization
  perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr
  perf/x86/intel/cmt: add per-package rmid pools
  perf/x86/intel/cmt: add pmonr's Off and Unused states
  perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states
  perf/x86/intel: encapsulate rmid and closid updates in pqr cache
  perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del
  perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID
  perf/core: add arch_info field to struct perf_cgroup
  perf/x86/intel/cmt: add support for cgroup events
  perf/core: add pmu::event_terminate
  perf/x86/intel/cmt: use newly introduced event_terminate
  perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop
  perf/core: hooks to add architecture specific features in perf_cgroup
  perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline}
  perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE
  sched: introduce the finish_arch_pre_lock_switch() scheduler hook
  perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch
  perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to
    pmu::read
  perf/x86/intel/cmt: add error handling to intel_cmt_event_read
  perf/x86/intel/cmt: add asynchronous read for task events
  perf/x86/intel/cmt: add subtree read for cgroup events
  perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags
  perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt
  perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION
  perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt
  perf/core: add perf_event cgroup hooks for subsystem attributes
  perf/x86/intel/cmt: add cont_monitoring to perf cgroup
  perf/x86/intel/cmt: introduce read SLOs for rotation
  perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute
  perf/x86/intel/cmt: add rotation scheduled work
  perf/x86/intel/cmt: add rotation minimum progress SLO
  perf/x86/intel/cmt: add rmid stealing
  perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag
  perf/x86/intel/cmt: add debugfs intel_cmt directory
  perf/stat: revamp read error handling, snapshot and per_pkg events

Stephane Eranian (1):
  perf/stat: fix bug in handling events in error state

 arch/alpha/kernel/perf_event.c           |    3 +-
 arch/arc/kernel/perf_event.c             |    3 +-
 arch/arm64/include/asm/hw_breakpoint.h   |    2 +-
 arch/arm64/kernel/hw_breakpoint.c        |    3 +-
 arch/metag/kernel/perf/perf_event.c      |    5 +-
 arch/mips/kernel/perf_event_mipsxx.c     |    3 +-
 arch/powerpc/include/asm/hw_breakpoint.h |    2 +-
 arch/powerpc/kernel/hw_breakpoint.c      |    3 +-
 arch/powerpc/perf/core-book3s.c          |   11 +-
 arch/powerpc/perf/core-fsl-emb.c         |    5 +-
 arch/powerpc/perf/hv-24x7.c              |    5 +-
 arch/powerpc/perf/hv-gpci.c              |    3 +-
 arch/s390/kernel/perf_cpum_cf.c          |    5 +-
 arch/s390/kernel/perf_cpum_sf.c          |    3 +-
 arch/sh/include/asm/hw_breakpoint.h      |    2 +-
 arch/sh/kernel/hw_breakpoint.c           |    3 +-
 arch/sparc/kernel/perf_event.c           |    2 +-
 arch/tile/kernel/perf_event.c            |    3 +-
 arch/x86/Kconfig                         |   12 +
 arch/x86/events/amd/ibs.c                |    2 +-
 arch/x86/events/amd/iommu.c              |    5 +-
 arch/x86/events/amd/uncore.c             |    3 +-
 arch/x86/events/core.c                   |    3 +-
 arch/x86/events/intel/Makefile           |    3 +-
 arch/x86/events/intel/bts.c              |    3 +-
 arch/x86/events/intel/cmt.c              | 3498 ++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h              |  344 +++
 arch/x86/events/intel/cqm.c              | 1766 ---------------
 arch/x86/events/intel/cstate.c           |    3 +-
 arch/x86/events/intel/pt.c               |    3 +-
 arch/x86/events/intel/rapl.c             |    3 +-
 arch/x86/events/intel/uncore.c           |    3 +-
 arch/x86/events/intel/uncore.h           |    2 +-
 arch/x86/events/msr.c                    |    3 +-
 arch/x86/include/asm/cpufeatures.h       |   14 +-
 arch/x86/include/asm/hw_breakpoint.h     |    2 +-
 arch/x86/include/asm/intel_rdt_common.h  |   62 +-
 arch/x86/include/asm/perf_event.h        |   29 +
 arch/x86/include/asm/processor.h         |    4 +
 arch/x86/kernel/cpu/Makefile             |    3 +-
 arch/x86/kernel/cpu/common.c             |   10 +-
 arch/x86/kernel/cpu/intel_rdt_common.c   |   37 +
 arch/x86/kernel/hw_breakpoint.c          |    3 +-
 arch/x86/kvm/pmu.h                       |   10 +-
 drivers/bus/arm-cci.c                    |    3 +-
 drivers/bus/arm-ccn.c                    |    3 +-
 drivers/perf/arm_pmu.c                   |    3 +-
 include/linux/cpuhotplug.h               |    4 +-
 include/linux/perf_event.h               |   70 +-
 kernel/events/core.c                     |  177 +-
 kernel/sched/core.c                      |    1 +
 kernel/sched/sched.h                     |    3 +
 kernel/trace/bpf_trace.c                 |    4 +-
 tools/perf/builtin-stat.c                |   42 +-
 tools/perf/util/counts.h                 |   19 +
 tools/perf/util/evsel.c                  |   49 +-
 tools/perf/util/evsel.h                  |    8 +-
 tools/perf/util/stat.c                   |   35 +-
 58 files changed, 4361 insertions(+), 1956 deletions(-)
 create mode 100644 arch/x86/events/intel/cmt.c
 create mode 100644 arch/x86/events/intel/cmt.h
 delete mode 100644 arch/x86/events/intel/cqm.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
@ 2016-10-30  0:37 ` David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT David Carrillo-Cisneros
                   ` (44 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Completely remove the previous version of the CQM + MBM driver to ease
review of the new version (this patch series).

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/Makefile |    2 +-
 arch/x86/events/intel/cqm.c    | 1766 ----------------------------------------
 include/linux/cpuhotplug.h     |    2 -
 include/linux/perf_event.h     |   14 -
 kernel/events/core.c           |   10 -
 kernel/trace/bpf_trace.c       |    4 +-
 6 files changed, 3 insertions(+), 1795 deletions(-)
 delete mode 100644 arch/x86/events/intel/cqm.c

diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 06c2baa..e9d8520 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,4 @@
-obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= lbr.o p4.o p6.o pt.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)	+= intel-rapl-perf.o
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
deleted file mode 100644
index 0c45cc8..0000000
--- a/arch/x86/events/intel/cqm.c
+++ /dev/null
@@ -1,1766 +0,0 @@
-/*
- * Intel Cache Quality-of-Service Monitoring (CQM) support.
- *
- * Based very, very heavily on work by Peter Zijlstra.
- */
-
-#include <linux/perf_event.h>
-#include <linux/slab.h>
-#include <asm/cpu_device_id.h>
-#include <asm/intel_rdt_common.h>
-#include "../perf_event.h"
-
-#define MSR_IA32_QM_CTR		0x0c8e
-#define MSR_IA32_QM_EVTSEL	0x0c8d
-
-#define MBM_CNTR_WIDTH		24
-/*
- * Guaranteed time in ms as per SDM where MBM counters will not overflow.
- */
-#define MBM_CTR_OVERFLOW_TIME	1000
-
-static u32 cqm_max_rmid = -1;
-static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
-
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-static struct hrtimer *mbm_timers;
-/**
- * struct sample - mbm event's (local or total) data
- * @total_bytes    #bytes since we began monitoring
- * @prev_msr       previous value of MSR
- */
-struct sample {
-	u64	total_bytes;
-	u64	prev_msr;
-};
-
-/*
- * samples profiled for total memory bandwidth type events
- */
-static struct sample *mbm_total;
-/*
- * samples profiled for local memory bandwidth type events
- */
-static struct sample *mbm_local;
-
-#define pkg_id	topology_physical_package_id(smp_processor_id())
-/*
- * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
- * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
- * rmids per socket, an example is given below
- * RMID1 of Socket0:  vrmid =  1
- * RMID1 of Socket1:  vrmid =  1 * (cqm_max_rmid + 1) + 1
- * RMID1 of Socket2:  vrmid =  2 * (cqm_max_rmid + 1) + 1
- */
-#define rmid_2_index(rmid)  ((pkg_id * (cqm_max_rmid + 1)) + rmid)
-/*
- * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
- * Also protects event->hw.cqm_rmid
- *
- * Hold either for stability, both for modification of ->hw.cqm_rmid.
- */
-static DEFINE_MUTEX(cache_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
-
-/*
- * Groups of events that have the same target(s), one RMID per group.
- */
-static LIST_HEAD(cache_groups);
-
-/*
- * Mask of CPUs for reading CQM values. We only need one per-socket.
- */
-static cpumask_t cqm_cpumask;
-
-#define RMID_VAL_ERROR		(1ULL << 63)
-#define RMID_VAL_UNAVAIL	(1ULL << 62)
-
-/*
- * Event IDs are used to program IA32_QM_EVTSEL before reading event
- * counter from IA32_QM_CTR
- */
-#define QOS_L3_OCCUP_EVENT_ID	0x01
-#define QOS_MBM_TOTAL_EVENT_ID	0x02
-#define QOS_MBM_LOCAL_EVENT_ID	0x03
-
-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
-
-#define INVALID_RMID		(-1)
-
-/*
- * Is @rmid valid for programming the hardware?
- *
- * rmid 0 is reserved by the hardware for all non-monitored tasks, which
- * means that we should never come across an rmid with that value.
- * Likewise, an rmid value of -1 is used to indicate "no rmid currently
- * assigned" and is used as part of the rotation code.
- */
-static inline bool __rmid_valid(u32 rmid)
-{
-	if (!rmid || rmid == INVALID_RMID)
-		return false;
-
-	return true;
-}
-
-static u64 __rmid_read(u32 rmid)
-{
-	u64 val;
-
-	/*
-	 * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
-	 * it just says that to increase confusion.
-	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
-	rdmsrl(MSR_IA32_QM_CTR, val);
-
-	/*
-	 * Aside from the ERROR and UNAVAIL bits, assume this thing returns
-	 * the number of cachelines tagged with @rmid.
-	 */
-	return val;
-}
-
-enum rmid_recycle_state {
-	RMID_YOUNG = 0,
-	RMID_AVAILABLE,
-	RMID_DIRTY,
-};
-
-struct cqm_rmid_entry {
-	u32 rmid;
-	enum rmid_recycle_state state;
-	struct list_head list;
-	unsigned long queue_time;
-};
-
-/*
- * cqm_rmid_free_lru - A least recently used list of RMIDs.
- *
- * Oldest entry at the head, newest (most recently used) entry at the
- * tail. This list is never traversed, it's only used to keep track of
- * the lru order. That is, we only pick entries of the head or insert
- * them on the tail.
- *
- * All entries on the list are 'free', and their RMIDs are not currently
- * in use. To mark an RMID as in use, remove its entry from the lru
- * list.
- *
- *
- * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
- *
- * This list is contains RMIDs that no one is currently using but that
- * may have a non-zero occupancy value associated with them. The
- * rotation worker moves RMIDs from the limbo list to the free list once
- * the occupancy value drops below __intel_cqm_threshold.
- *
- * Both lists are protected by cache_mutex.
- */
-static LIST_HEAD(cqm_rmid_free_lru);
-static LIST_HEAD(cqm_rmid_limbo_lru);
-
-/*
- * We use a simple array of pointers so that we can lookup a struct
- * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
- * and __put_rmid() from having to worry about dealing with struct
- * cqm_rmid_entry - they just deal with rmids, i.e. integers.
- *
- * Once this array is initialized it is read-only. No locks are required
- * to access it.
- *
- * All entries for all RMIDs can be looked up in the this array at all
- * times.
- */
-static struct cqm_rmid_entry **cqm_rmid_ptrs;
-
-static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
-{
-	struct cqm_rmid_entry *entry;
-
-	entry = cqm_rmid_ptrs[rmid];
-	WARN_ON(entry->rmid != rmid);
-
-	return entry;
-}
-
-/*
- * Returns < 0 on fail.
- *
- * We expect to be called with cache_mutex held.
- */
-static u32 __get_rmid(void)
-{
-	struct cqm_rmid_entry *entry;
-
-	lockdep_assert_held(&cache_mutex);
-
-	if (list_empty(&cqm_rmid_free_lru))
-		return INVALID_RMID;
-
-	entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
-	list_del(&entry->list);
-
-	return entry->rmid;
-}
-
-static void __put_rmid(u32 rmid)
-{
-	struct cqm_rmid_entry *entry;
-
-	lockdep_assert_held(&cache_mutex);
-
-	WARN_ON(!__rmid_valid(rmid));
-	entry = __rmid_entry(rmid);
-
-	entry->queue_time = jiffies;
-	entry->state = RMID_YOUNG;
-
-	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
-}
-
-static void cqm_cleanup(void)
-{
-	int i;
-
-	if (!cqm_rmid_ptrs)
-		return;
-
-	for (i = 0; i < cqm_max_rmid; i++)
-		kfree(cqm_rmid_ptrs[i]);
-
-	kfree(cqm_rmid_ptrs);
-	cqm_rmid_ptrs = NULL;
-	cqm_enabled = false;
-}
-
-static int intel_cqm_setup_rmid_cache(void)
-{
-	struct cqm_rmid_entry *entry;
-	unsigned int nr_rmids;
-	int r = 0;
-
-	nr_rmids = cqm_max_rmid + 1;
-	cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
-				nr_rmids, GFP_KERNEL);
-	if (!cqm_rmid_ptrs)
-		return -ENOMEM;
-
-	for (; r <= cqm_max_rmid; r++) {
-		struct cqm_rmid_entry *entry;
-
-		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
-		if (!entry)
-			goto fail;
-
-		INIT_LIST_HEAD(&entry->list);
-		entry->rmid = r;
-		cqm_rmid_ptrs[r] = entry;
-
-		list_add_tail(&entry->list, &cqm_rmid_free_lru);
-	}
-
-	/*
-	 * RMID 0 is special and is always allocated. It's used for all
-	 * tasks that are not monitored.
-	 */
-	entry = __rmid_entry(0);
-	list_del(&entry->list);
-
-	mutex_lock(&cache_mutex);
-	intel_cqm_rotation_rmid = __get_rmid();
-	mutex_unlock(&cache_mutex);
-
-	return 0;
-
-fail:
-	cqm_cleanup();
-	return -ENOMEM;
-}
-
-/*
- * Determine if @a and @b measure the same set of tasks.
- *
- * If @a and @b measure the same set of tasks then we want to share a
- * single RMID.
- */
-static bool __match_event(struct perf_event *a, struct perf_event *b)
-{
-	/* Per-cpu and task events don't mix */
-	if ((a->attach_state & PERF_ATTACH_TASK) !=
-	    (b->attach_state & PERF_ATTACH_TASK))
-		return false;
-
-#ifdef CONFIG_CGROUP_PERF
-	if (a->cgrp != b->cgrp)
-		return false;
-#endif
-
-	/* If not task event, we're machine wide */
-	if (!(b->attach_state & PERF_ATTACH_TASK))
-		return true;
-
-	/*
-	 * Events that target same task are placed into the same cache group.
-	 * Mark it as a multi event group, so that we update ->count
-	 * for every event rather than just the group leader later.
-	 */
-	if (a->hw.target == b->hw.target) {
-		b->hw.is_group_event = true;
-		return true;
-	}
-
-	/*
-	 * Are we an inherited event?
-	 */
-	if (b->parent == a)
-		return true;
-
-	return false;
-}
-
-#ifdef CONFIG_CGROUP_PERF
-static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
-{
-	if (event->attach_state & PERF_ATTACH_TASK)
-		return perf_cgroup_from_task(event->hw.target, event->ctx);
-
-	return event->cgrp;
-}
-#endif
-
-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- *		   PROHIBITS
- *     system-wide    -> 	cgroup and task
- *     cgroup 	      ->	system-wide
- *     		      ->	task in cgroup
- *     task 	      -> 	system-wide
- *     		      ->	task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
-	/*
-	 * We can have any number of cgroups but only one system-wide
-	 * event at a time.
-	 */
-	if (a->cgrp && b->cgrp) {
-		struct perf_cgroup *ac = a->cgrp;
-		struct perf_cgroup *bc = b->cgrp;
-
-		/*
-		 * This condition should have been caught in
-		 * __match_event() and we should be sharing an RMID.
-		 */
-		WARN_ON_ONCE(ac == bc);
-
-		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-			return true;
-
-		return false;
-	}
-
-	if (a->cgrp || b->cgrp) {
-		struct perf_cgroup *ac, *bc;
-
-		/*
-		 * cgroup and system-wide events are mutually exclusive
-		 */
-		if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
-		    (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
-			return true;
-
-		/*
-		 * Ensure neither event is part of the other's cgroup
-		 */
-		ac = event_to_cgroup(a);
-		bc = event_to_cgroup(b);
-		if (ac == bc)
-			return true;
-
-		/*
-		 * Must have cgroup and non-intersecting task events.
-		 */
-		if (!ac || !bc)
-			return false;
-
-		/*
-		 * We have cgroup and task events, and the task belongs
-		 * to a cgroup. Check for for overlap.
-		 */
-		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-			return true;
-
-		return false;
-	}
-#endif
-	/*
-	 * If one of them is not a task, same story as above with cgroups.
-	 */
-	if (!(a->attach_state & PERF_ATTACH_TASK) ||
-	    !(b->attach_state & PERF_ATTACH_TASK))
-		return true;
-
-	/*
-	 * Must be non-overlapping.
-	 */
-	return false;
-}
-
-struct rmid_read {
-	u32 rmid;
-	u32 evt_type;
-	atomic64_t value;
-};
-
-static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
-static void __intel_mbm_event_count(void *info);
-
-static bool is_cqm_event(int e)
-{
-	return (e == QOS_L3_OCCUP_EVENT_ID);
-}
-
-static bool is_mbm_event(int e)
-{
-	return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
-}
-
-static void cqm_mask_call(struct rmid_read *rr)
-{
-	if (is_mbm_event(rr->evt_type))
-		on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, rr, 1);
-	else
-		on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, rr, 1);
-}
-
-/*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
-	struct perf_event *event;
-	struct list_head *head = &group->hw.cqm_group_entry;
-	u32 old_rmid = group->hw.cqm_rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	/*
-	 * If our RMID is being deallocated, perform a read now.
-	 */
-	if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
-		struct rmid_read rr = {
-			.rmid = old_rmid,
-			.evt_type = group->attr.config,
-			.value = ATOMIC64_INIT(0),
-		};
-
-		cqm_mask_call(&rr);
-		local64_set(&group->count, atomic64_read(&rr.value));
-	}
-
-	raw_spin_lock_irq(&cache_lock);
-
-	group->hw.cqm_rmid = rmid;
-	list_for_each_entry(event, head, hw.cqm_group_entry)
-		event->hw.cqm_rmid = rmid;
-
-	raw_spin_unlock_irq(&cache_lock);
-
-	/*
-	 * If the allocation is for mbm, init the mbm stats.
-	 * Need to check if each event in the group is mbm event
-	 * because there could be multiple type of events in the same group.
-	 */
-	if (__rmid_valid(rmid)) {
-		event = group;
-		if (is_mbm_event(event->attr.config))
-			init_mbm_sample(rmid, event->attr.config);
-
-		list_for_each_entry(event, head, hw.cqm_group_entry) {
-			if (is_mbm_event(event->attr.config))
-				init_mbm_sample(rmid, event->attr.config);
-		}
-	}
-
-	return old_rmid;
-}
-
-/*
- * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
- * cachelines are still tagged with RMIDs in limbo, we progressively
- * increment the threshold until we find an RMID in limbo with <=
- * __intel_cqm_threshold lines tagged. This is designed to mitigate the
- * problem where cachelines tagged with an RMID are not steadily being
- * evicted.
- *
- * On successful rotations we decrease the threshold back towards zero.
- *
- * __intel_cqm_max_threshold provides an upper bound on the threshold,
- * and is measured in bytes because it's exposed to userland.
- */
-static unsigned int __intel_cqm_threshold;
-static unsigned int __intel_cqm_max_threshold;
-
-/*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
-	struct cqm_rmid_entry *entry;
-
-	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-		if (entry->state != RMID_AVAILABLE)
-			break;
-
-		if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
-			entry->state = RMID_DIRTY;
-	}
-}
-
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
-	struct perf_event *leader, *event;
-
-	lockdep_assert_held(&cache_mutex);
-
-	leader = list_first_entry(&cache_groups, struct perf_event,
-				  hw.cqm_groups_entry);
-	event = leader;
-
-	list_for_each_entry_continue(event, &cache_groups,
-				     hw.cqm_groups_entry) {
-		if (__rmid_valid(event->hw.cqm_rmid))
-			continue;
-
-		if (__conflict_event(event, leader))
-			continue;
-
-		intel_cqm_xchg_rmid(event, rmid);
-		return true;
-	}
-
-	return false;
-}
-
-/*
- * Initially use this constant for both the limbo queue time and the
- * rotation timer interval, pmu::hrtimer_interval_ms.
- *
- * They don't need to be the same, but the two are related since if you
- * rotate faster than you recycle RMIDs, you may run out of available
- * RMIDs.
- */
-#define RMID_DEFAULT_QUEUE_TIME 250	/* ms */
-
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @nr_available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @nr_available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
-	struct cqm_rmid_entry *entry, *tmp;
-
-	lockdep_assert_held(&cache_mutex);
-
-	*available = 0;
-	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-		unsigned long min_queue_time;
-		unsigned long now = jiffies;
-
-		/*
-		 * We hold RMIDs placed into limbo for a minimum queue
-		 * time. Before the minimum queue time has elapsed we do
-		 * not recycle RMIDs.
-		 *
-		 * The reasoning is that until a sufficient time has
-		 * passed since we stopped using an RMID, any RMID
-		 * placed onto the limbo list will likely still have
-		 * data tagged in the cache, which means we'll probably
-		 * fail to recycle it anyway.
-		 *
-		 * We can save ourselves an expensive IPI by skipping
-		 * any RMIDs that have not been queued for the minimum
-		 * time.
-		 */
-		min_queue_time = entry->queue_time +
-			msecs_to_jiffies(__rmid_queue_time_ms);
-
-		if (time_after(min_queue_time, now))
-			break;
-
-		entry->state = RMID_AVAILABLE;
-		(*available)++;
-	}
-
-	/*
-	 * Fast return if none of the RMIDs on the limbo list have been
-	 * sitting on the queue for the minimum queue time.
-	 */
-	if (!*available)
-		return false;
-
-	/*
-	 * Test whether an RMID is free for each package.
-	 */
-	on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
-
-	list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
-		/*
-		 * Exhausted all RMIDs that have waited min queue time.
-		 */
-		if (entry->state == RMID_YOUNG)
-			break;
-
-		if (entry->state == RMID_DIRTY)
-			continue;
-
-		list_del(&entry->list);	/* remove from limbo */
-
-		/*
-		 * The rotation RMID gets priority if it's
-		 * currently invalid. In which case, skip adding
-		 * the RMID to the the free lru.
-		 */
-		if (!__rmid_valid(intel_cqm_rotation_rmid)) {
-			intel_cqm_rotation_rmid = entry->rmid;
-			continue;
-		}
-
-		/*
-		 * If we have groups waiting for RMIDs, hand
-		 * them one now provided they don't conflict.
-		 */
-		if (intel_cqm_sched_in_event(entry->rmid))
-			continue;
-
-		/*
-		 * Otherwise place it onto the free list.
-		 */
-		list_add_tail(&entry->list, &cqm_rmid_free_lru);
-	}
-
-
-	return __rmid_valid(intel_cqm_rotation_rmid);
-}
-
-/*
- * Pick a victim group and move it to the tail of the group list.
- * @next: The first group without an RMID
- */
-static void __intel_cqm_pick_and_rotate(struct perf_event *next)
-{
-	struct perf_event *rotor;
-	u32 rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	rotor = list_first_entry(&cache_groups, struct perf_event,
-				 hw.cqm_groups_entry);
-
-	/*
-	 * The group at the front of the list should always have a valid
-	 * RMID. If it doesn't then no groups have RMIDs assigned and we
-	 * don't need to rotate the list.
-	 */
-	if (next == rotor)
-		return;
-
-	rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
-	__put_rmid(rmid);
-
-	list_rotate_left(&cache_groups);
-}
-
-/*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
-	struct perf_event *group, *g;
-	u32 rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
-		if (group == event)
-			continue;
-
-		rmid = group->hw.cqm_rmid;
-
-		/*
-		 * Skip events that don't have a valid RMID.
-		 */
-		if (!__rmid_valid(rmid))
-			continue;
-
-		/*
-		 * No conflict? No problem! Leave the event alone.
-		 */
-		if (!__conflict_event(group, event))
-			continue;
-
-		intel_cqm_xchg_rmid(group, INVALID_RMID);
-		__put_rmid(rmid);
-	}
-}
-
-/*
- * Attempt to rotate the groups and assign new RMIDs.
- *
- * We rotate for two reasons,
- *   1. To handle the scheduling of conflicting events
- *   2. To recycle RMIDs
- *
- * Rotating RMIDs is complicated because the hardware doesn't give us
- * any clues.
- *
- * There's problems with the hardware interface; when you change the
- * task:RMID map cachelines retain their 'old' tags, giving a skewed
- * picture. In order to work around this, we must always keep one free
- * RMID - intel_cqm_rotation_rmid.
- *
- * Rotation works by taking away an RMID from a group (the old RMID),
- * and assigning the free RMID to another group (the new RMID). We must
- * then wait for the old RMID to not be used (no cachelines tagged).
- * This ensure that all cachelines are tagged with 'active' RMIDs. At
- * this point we can start reading values for the new RMID and treat the
- * old RMID as the free RMID for the next rotation.
- *
- * Return %true or %false depending on whether we did any rotating.
- */
-static bool __intel_cqm_rmid_rotate(void)
-{
-	struct perf_event *group, *start = NULL;
-	unsigned int threshold_limit;
-	unsigned int nr_needed = 0;
-	unsigned int nr_available;
-	bool rotated = false;
-
-	mutex_lock(&cache_mutex);
-
-again:
-	/*
-	 * Fast path through this function if there are no groups and no
-	 * RMIDs that need cleaning.
-	 */
-	if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
-		goto out;
-
-	list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
-		if (!__rmid_valid(group->hw.cqm_rmid)) {
-			if (!start)
-				start = group;
-			nr_needed++;
-		}
-	}
-
-	/*
-	 * We have some event groups, but they all have RMIDs assigned
-	 * and no RMIDs need cleaning.
-	 */
-	if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
-		goto out;
-
-	if (!nr_needed)
-		goto stabilize;
-
-	/*
-	 * We have more event groups without RMIDs than available RMIDs,
-	 * or we have event groups that conflict with the ones currently
-	 * scheduled.
-	 *
-	 * We force deallocate the rmid of the group at the head of
-	 * cache_groups. The first event group without an RMID then gets
-	 * assigned intel_cqm_rotation_rmid. This ensures we always make
-	 * forward progress.
-	 *
-	 * Rotate the cache_groups list so the previous head is now the
-	 * tail.
-	 */
-	__intel_cqm_pick_and_rotate(start);
-
-	/*
-	 * If the rotation is going to succeed, reduce the threshold so
-	 * that we don't needlessly reuse dirty RMIDs.
-	 */
-	if (__rmid_valid(intel_cqm_rotation_rmid)) {
-		intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
-		intel_cqm_rotation_rmid = __get_rmid();
-
-		intel_cqm_sched_out_conflicting_events(start);
-
-		if (__intel_cqm_threshold)
-			__intel_cqm_threshold--;
-	}
-
-	rotated = true;
-
-stabilize:
-	/*
-	 * We now need to stablize the RMID we freed above (if any) to
-	 * ensure that the next time we rotate we have an RMID with zero
-	 * occupancy value.
-	 *
-	 * Alternatively, if we didn't need to perform any rotation,
-	 * we'll have a bunch of RMIDs in limbo that need stabilizing.
-	 */
-	threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
-
-	while (intel_cqm_rmid_stabilize(&nr_available) &&
-	       __intel_cqm_threshold < threshold_limit) {
-		unsigned int steal_limit;
-
-		/*
-		 * Don't spin if nobody is actively waiting for an RMID,
-		 * the rotation worker will be kicked as soon as an
-		 * event needs an RMID anyway.
-		 */
-		if (!nr_needed)
-			break;
-
-		/* Allow max 25% of RMIDs to be in limbo. */
-		steal_limit = (cqm_max_rmid + 1) / 4;
-
-		/*
-		 * We failed to stabilize any RMIDs so our rotation
-		 * logic is now stuck. In order to make forward progress
-		 * we have a few options:
-		 *
-		 *   1. rotate ("steal") another RMID
-		 *   2. increase the threshold
-		 *   3. do nothing
-		 *
-		 * We do both of 1. and 2. until we hit the steal limit.
-		 *
-		 * The steal limit prevents all RMIDs ending up on the
-		 * limbo list. This can happen if every RMID has a
-		 * non-zero occupancy above threshold_limit, and the
-		 * occupancy values aren't dropping fast enough.
-		 *
-		 * Note that there is prioritisation at work here - we'd
-		 * rather increase the number of RMIDs on the limbo list
-		 * than increase the threshold, because increasing the
-		 * threshold skews the event data (because we reuse
-		 * dirty RMIDs) - threshold bumps are a last resort.
-		 */
-		if (nr_available < steal_limit)
-			goto again;
-
-		__intel_cqm_threshold++;
-	}
-
-out:
-	mutex_unlock(&cache_mutex);
-	return rotated;
-}
-
-static void intel_cqm_rmid_rotate(struct work_struct *work);
-
-static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
-
-static struct pmu intel_cqm_pmu;
-
-static void intel_cqm_rmid_rotate(struct work_struct *work)
-{
-	unsigned long delay;
-
-	__intel_cqm_rmid_rotate();
-
-	delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
-	schedule_delayed_work(&intel_cqm_rmid_work, delay);
-}
-
-static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
-{
-	struct sample *mbm_current;
-	u32 vrmid = rmid_2_index(rmid);
-	u64 val, bytes, shift;
-	u32 eventid;
-
-	if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
-		mbm_current = &mbm_local[vrmid];
-		eventid     = QOS_MBM_LOCAL_EVENT_ID;
-	} else {
-		mbm_current = &mbm_total[vrmid];
-		eventid     = QOS_MBM_TOTAL_EVENT_ID;
-	}
-
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
-	rdmsrl(MSR_IA32_QM_CTR, val);
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return mbm_current->total_bytes;
-
-	if (first) {
-		mbm_current->prev_msr = val;
-		mbm_current->total_bytes = 0;
-		return mbm_current->total_bytes;
-	}
-
-	/*
-	 * The h/w guarantees that counters will not overflow
-	 * so long as we poll them at least once per second.
-	 */
-	shift = 64 - MBM_CNTR_WIDTH;
-	bytes = (val << shift) - (mbm_current->prev_msr << shift);
-	bytes >>= shift;
-
-	bytes *= cqm_l3_scale;
-
-	mbm_current->total_bytes += bytes;
-	mbm_current->prev_msr = val;
-
-	return mbm_current->total_bytes;
-}
-
-static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
-{
-	return update_sample(rmid, evt_type, 0);
-}
-
-static void __intel_mbm_event_init(void *info)
-{
-	struct rmid_read *rr = info;
-
-	update_sample(rr->rmid, rr->evt_type, 1);
-}
-
-static void init_mbm_sample(u32 rmid, u32 evt_type)
-{
-	struct rmid_read rr = {
-		.rmid = rmid,
-		.evt_type = evt_type,
-		.value = ATOMIC64_INIT(0),
-	};
-
-	/* on each socket, init sample */
-	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
-}
-
-/*
- * Find a group and setup RMID.
- *
- * If we're part of a group, we use the group's RMID.
- */
-static void intel_cqm_setup_event(struct perf_event *event,
-				  struct perf_event **group)
-{
-	struct perf_event *iter;
-	bool conflict = false;
-	u32 rmid;
-
-	event->hw.is_group_event = false;
-	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
-		rmid = iter->hw.cqm_rmid;
-
-		if (__match_event(iter, event)) {
-			/* All tasks in a group share an RMID */
-			event->hw.cqm_rmid = rmid;
-			*group = iter;
-			if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
-				init_mbm_sample(rmid, event->attr.config);
-			return;
-		}
-
-		/*
-		 * We only care about conflicts for events that are
-		 * actually scheduled in (and hence have a valid RMID).
-		 */
-		if (__conflict_event(iter, event) && __rmid_valid(rmid))
-			conflict = true;
-	}
-
-	if (conflict)
-		rmid = INVALID_RMID;
-	else
-		rmid = __get_rmid();
-
-	if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
-		init_mbm_sample(rmid, event->attr.config);
-
-	event->hw.cqm_rmid = rmid;
-}
-
-static void intel_cqm_event_read(struct perf_event *event)
-{
-	unsigned long flags;
-	u32 rmid;
-	u64 val;
-
-	/*
-	 * Task events are handled by intel_cqm_event_count().
-	 */
-	if (event->cpu == -1)
-		return;
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-	rmid = event->hw.cqm_rmid;
-
-	if (!__rmid_valid(rmid))
-		goto out;
-
-	if (is_mbm_event(event->attr.config))
-		val = rmid_read_mbm(rmid, event->attr.config);
-	else
-		val = __rmid_read(rmid);
-
-	/*
-	 * Ignore this reading on error states and do not update the value.
-	 */
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		goto out;
-
-	local64_set(&event->count, val);
-out:
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-}
-
-static void __intel_cqm_event_count(void *info)
-{
-	struct rmid_read *rr = info;
-	u64 val;
-
-	val = __rmid_read(rr->rmid);
-
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
-
-	atomic64_add(val, &rr->value);
-}
-
-static inline bool cqm_group_leader(struct perf_event *event)
-{
-	return !list_empty(&event->hw.cqm_groups_entry);
-}
-
-static void __intel_mbm_event_count(void *info)
-{
-	struct rmid_read *rr = info;
-	u64 val;
-
-	val = rmid_read_mbm(rr->rmid, rr->evt_type);
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
-	atomic64_add(val, &rr->value);
-}
-
-static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
-{
-	struct perf_event *iter, *iter1;
-	int ret = HRTIMER_RESTART;
-	struct list_head *head;
-	unsigned long flags;
-	u32 grp_rmid;
-
-	/*
-	 * Need to cache_lock as the timer Event Select MSR reads
-	 * can race with the mbm/cqm count() and mbm_init() reads.
-	 */
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	if (list_empty(&cache_groups)) {
-		ret = HRTIMER_NORESTART;
-		goto out;
-	}
-
-	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
-		grp_rmid = iter->hw.cqm_rmid;
-		if (!__rmid_valid(grp_rmid))
-			continue;
-		if (is_mbm_event(iter->attr.config))
-			update_sample(grp_rmid, iter->attr.config, 0);
-
-		head = &iter->hw.cqm_group_entry;
-		if (list_empty(head))
-			continue;
-		list_for_each_entry(iter1, head, hw.cqm_group_entry) {
-			if (!iter1->hw.is_group_event)
-				break;
-			if (is_mbm_event(iter1->attr.config))
-				update_sample(iter1->hw.cqm_rmid,
-					      iter1->attr.config, 0);
-		}
-	}
-
-	hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
-out:
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
-	return ret;
-}
-
-static void __mbm_start_timer(void *info)
-{
-	hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
-			     HRTIMER_MODE_REL_PINNED);
-}
-
-static void __mbm_stop_timer(void *info)
-{
-	hrtimer_cancel(&mbm_timers[pkg_id]);
-}
-
-static void mbm_start_timers(void)
-{
-	on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
-}
-
-static void mbm_stop_timers(void)
-{
-	on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
-}
-
-static void mbm_hrtimer_init(void)
-{
-	struct hrtimer *hr;
-	int i;
-
-	for (i = 0; i < mbm_socket_max; i++) {
-		hr = &mbm_timers[i];
-		hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-		hr->function = mbm_hrtimer_handle;
-	}
-}
-
-static u64 intel_cqm_event_count(struct perf_event *event)
-{
-	unsigned long flags;
-	struct rmid_read rr = {
-		.evt_type = event->attr.config,
-		.value = ATOMIC64_INIT(0),
-	};
-
-	/*
-	 * We only need to worry about task events. System-wide events
-	 * are handled like usual, i.e. entirely with
-	 * intel_cqm_event_read().
-	 */
-	if (event->cpu != -1)
-		return __perf_event_count(event);
-
-	/*
-	 * Only the group leader gets to report values except in case of
-	 * multiple events in the same group, we still need to read the
-	 * other events.This stops us
-	 * reporting duplicate values to userspace, and gives us a clear
-	 * rule for which task gets to report the values.
-	 *
-	 * Note that it is impossible to attribute these values to
-	 * specific packages - we forfeit that ability when we create
-	 * task events.
-	 */
-	if (!cqm_group_leader(event) && !event->hw.is_group_event)
-		return 0;
-
-	/*
-	 * Getting up-to-date values requires an SMP IPI which is not
-	 * possible if we're being called in interrupt context. Return
-	 * the cached values instead.
-	 */
-	if (unlikely(in_interrupt()))
-		goto out;
-
-	/*
-	 * Notice that we don't perform the reading of an RMID
-	 * atomically, because we can't hold a spin lock across the
-	 * IPIs.
-	 *
-	 * Speculatively perform the read, since @event might be
-	 * assigned a different (possibly invalid) RMID while we're
-	 * busying performing the IPI calls. It's therefore necessary to
-	 * check @event's RMID afterwards, and if it has changed,
-	 * discard the result of the read.
-	 */
-	rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
-
-	if (!__rmid_valid(rr.rmid))
-		goto out;
-
-	cqm_mask_call(&rr);
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-	if (event->hw.cqm_rmid == rr.rmid)
-		local64_set(&event->count, atomic64_read(&rr.value));
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-out:
-	return __perf_event_count(event);
-}
-
-static void intel_cqm_event_start(struct perf_event *event, int mode)
-{
-	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 rmid = event->hw.cqm_rmid;
-
-	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
-		return;
-
-	event->hw.cqm_state &= ~PERF_HES_STOPPED;
-
-	if (state->rmid_usecnt++) {
-		if (!WARN_ON_ONCE(state->rmid != rmid))
-			return;
-	} else {
-		WARN_ON_ONCE(state->rmid);
-	}
-
-	state->rmid = rmid;
-	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
-}
-
-static void intel_cqm_event_stop(struct perf_event *event, int mode)
-{
-	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
-	if (event->hw.cqm_state & PERF_HES_STOPPED)
-		return;
-
-	event->hw.cqm_state |= PERF_HES_STOPPED;
-
-	intel_cqm_event_read(event);
-
-	if (!--state->rmid_usecnt) {
-		state->rmid = 0;
-		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);
-	} else {
-		WARN_ON_ONCE(!state->rmid);
-	}
-}
-
-static int intel_cqm_event_add(struct perf_event *event, int mode)
-{
-	unsigned long flags;
-	u32 rmid;
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	event->hw.cqm_state = PERF_HES_STOPPED;
-	rmid = event->hw.cqm_rmid;
-
-	if (__rmid_valid(rmid) && (mode & PERF_EF_START))
-		intel_cqm_event_start(event, mode);
-
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
-	return 0;
-}
-
-static void intel_cqm_event_destroy(struct perf_event *event)
-{
-	struct perf_event *group_other = NULL;
-	unsigned long flags;
-
-	mutex_lock(&cache_mutex);
-	/*
-	* Hold the cache_lock as mbm timer handlers could be
-	* scanning the list of events.
-	*/
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	/*
-	 * If there's another event in this group...
-	 */
-	if (!list_empty(&event->hw.cqm_group_entry)) {
-		group_other = list_first_entry(&event->hw.cqm_group_entry,
-					       struct perf_event,
-					       hw.cqm_group_entry);
-		list_del(&event->hw.cqm_group_entry);
-	}
-
-	/*
-	 * And we're the group leader..
-	 */
-	if (cqm_group_leader(event)) {
-		/*
-		 * If there was a group_other, make that leader, otherwise
-		 * destroy the group and return the RMID.
-		 */
-		if (group_other) {
-			list_replace(&event->hw.cqm_groups_entry,
-				     &group_other->hw.cqm_groups_entry);
-		} else {
-			u32 rmid = event->hw.cqm_rmid;
-
-			if (__rmid_valid(rmid))
-				__put_rmid(rmid);
-			list_del(&event->hw.cqm_groups_entry);
-		}
-	}
-
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
-	/*
-	 * Stop the mbm overflow timers when the last event is destroyed.
-	*/
-	if (mbm_enabled && list_empty(&cache_groups))
-		mbm_stop_timers();
-
-	mutex_unlock(&cache_mutex);
-}
-
-static int intel_cqm_event_init(struct perf_event *event)
-{
-	struct perf_event *group = NULL;
-	bool rotate = false;
-	unsigned long flags;
-
-	if (event->attr.type != intel_cqm_pmu.type)
-		return -ENOENT;
-
-	if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
-	     (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
-		return -EINVAL;
-
-	if ((is_cqm_event(event->attr.config) && !cqm_enabled) ||
-	    (is_mbm_event(event->attr.config) && !mbm_enabled))
-		return -EINVAL;
-
-	/* unsupported modes and filters */
-	if (event->attr.exclude_user   ||
-	    event->attr.exclude_kernel ||
-	    event->attr.exclude_hv     ||
-	    event->attr.exclude_idle   ||
-	    event->attr.exclude_host   ||
-	    event->attr.exclude_guest  ||
-	    event->attr.sample_period) /* no sampling */
-		return -EINVAL;
-
-	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
-	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
-
-	event->destroy = intel_cqm_event_destroy;
-
-	mutex_lock(&cache_mutex);
-
-	/*
-	 * Start the mbm overflow timers when the first event is created.
-	*/
-	if (mbm_enabled && list_empty(&cache_groups))
-		mbm_start_timers();
-
-	/* Will also set rmid */
-	intel_cqm_setup_event(event, &group);
-
-	/*
-	* Hold the cache_lock as mbm timer handlers be
-	* scanning the list of events.
-	*/
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	if (group) {
-		list_add_tail(&event->hw.cqm_group_entry,
-			      &group->hw.cqm_group_entry);
-	} else {
-		list_add_tail(&event->hw.cqm_groups_entry,
-			      &cache_groups);
-
-		/*
-		 * All RMIDs are either in use or have recently been
-		 * used. Kick the rotation worker to clean/free some.
-		 *
-		 * We only do this for the group leader, rather than for
-		 * every event in a group to save on needless work.
-		 */
-		if (!__rmid_valid(event->hw.cqm_rmid))
-			rotate = true;
-	}
-
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-	mutex_unlock(&cache_mutex);
-
-	if (rotate)
-		schedule_delayed_work(&intel_cqm_rmid_work, 0);
-
-	return 0;
-}
-
-EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
-EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
-EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
-EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
-EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
-
-EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
-EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
-EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
-EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
-
-EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
-EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
-EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
-EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
-
-static struct attribute *intel_cqm_events_attr[] = {
-	EVENT_PTR(intel_cqm_llc),
-	EVENT_PTR(intel_cqm_llc_pkg),
-	EVENT_PTR(intel_cqm_llc_unit),
-	EVENT_PTR(intel_cqm_llc_scale),
-	EVENT_PTR(intel_cqm_llc_snapshot),
-	NULL,
-};
-
-static struct attribute *intel_mbm_events_attr[] = {
-	EVENT_PTR(intel_cqm_total_bytes),
-	EVENT_PTR(intel_cqm_local_bytes),
-	EVENT_PTR(intel_cqm_total_bytes_pkg),
-	EVENT_PTR(intel_cqm_local_bytes_pkg),
-	EVENT_PTR(intel_cqm_total_bytes_unit),
-	EVENT_PTR(intel_cqm_local_bytes_unit),
-	EVENT_PTR(intel_cqm_total_bytes_scale),
-	EVENT_PTR(intel_cqm_local_bytes_scale),
-	NULL,
-};
-
-static struct attribute *intel_cmt_mbm_events_attr[] = {
-	EVENT_PTR(intel_cqm_llc),
-	EVENT_PTR(intel_cqm_total_bytes),
-	EVENT_PTR(intel_cqm_local_bytes),
-	EVENT_PTR(intel_cqm_llc_pkg),
-	EVENT_PTR(intel_cqm_total_bytes_pkg),
-	EVENT_PTR(intel_cqm_local_bytes_pkg),
-	EVENT_PTR(intel_cqm_llc_unit),
-	EVENT_PTR(intel_cqm_total_bytes_unit),
-	EVENT_PTR(intel_cqm_local_bytes_unit),
-	EVENT_PTR(intel_cqm_llc_scale),
-	EVENT_PTR(intel_cqm_total_bytes_scale),
-	EVENT_PTR(intel_cqm_local_bytes_scale),
-	EVENT_PTR(intel_cqm_llc_snapshot),
-	NULL,
-};
-
-static struct attribute_group intel_cqm_events_group = {
-	.name = "events",
-	.attrs = NULL,
-};
-
-PMU_FORMAT_ATTR(event, "config:0-7");
-static struct attribute *intel_cqm_formats_attr[] = {
-	&format_attr_event.attr,
-	NULL,
-};
-
-static struct attribute_group intel_cqm_format_group = {
-	.name = "format",
-	.attrs = intel_cqm_formats_attr,
-};
-
-static ssize_t
-max_recycle_threshold_show(struct device *dev, struct device_attribute *attr,
-			   char *page)
-{
-	ssize_t rv;
-
-	mutex_lock(&cache_mutex);
-	rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold);
-	mutex_unlock(&cache_mutex);
-
-	return rv;
-}
-
-static ssize_t
-max_recycle_threshold_store(struct device *dev,
-			    struct device_attribute *attr,
-			    const char *buf, size_t count)
-{
-	unsigned int bytes, cachelines;
-	int ret;
-
-	ret = kstrtouint(buf, 0, &bytes);
-	if (ret)
-		return ret;
-
-	mutex_lock(&cache_mutex);
-
-	__intel_cqm_max_threshold = bytes;
-	cachelines = bytes / cqm_l3_scale;
-
-	/*
-	 * The new maximum takes effect immediately.
-	 */
-	if (__intel_cqm_threshold > cachelines)
-		__intel_cqm_threshold = cachelines;
-
-	mutex_unlock(&cache_mutex);
-
-	return count;
-}
-
-static DEVICE_ATTR_RW(max_recycle_threshold);
-
-static struct attribute *intel_cqm_attrs[] = {
-	&dev_attr_max_recycle_threshold.attr,
-	NULL,
-};
-
-static const struct attribute_group intel_cqm_group = {
-	.attrs = intel_cqm_attrs,
-};
-
-static const struct attribute_group *intel_cqm_attr_groups[] = {
-	&intel_cqm_events_group,
-	&intel_cqm_format_group,
-	&intel_cqm_group,
-	NULL,
-};
-
-static struct pmu intel_cqm_pmu = {
-	.hrtimer_interval_ms = RMID_DEFAULT_QUEUE_TIME,
-	.attr_groups	     = intel_cqm_attr_groups,
-	.task_ctx_nr	     = perf_sw_context,
-	.event_init	     = intel_cqm_event_init,
-	.add		     = intel_cqm_event_add,
-	.del		     = intel_cqm_event_stop,
-	.start		     = intel_cqm_event_start,
-	.stop		     = intel_cqm_event_stop,
-	.read		     = intel_cqm_event_read,
-	.count		     = intel_cqm_event_count,
-};
-
-static inline void cqm_pick_event_reader(int cpu)
-{
-	int reader;
-
-	/* First online cpu in package becomes the reader */
-	reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
-	if (reader >= nr_cpu_ids)
-		cpumask_set_cpu(cpu, &cqm_cpumask);
-}
-
-static int intel_cqm_cpu_starting(unsigned int cpu)
-{
-	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
-	struct cpuinfo_x86 *c = &cpu_data(cpu);
-
-	state->rmid = 0;
-	state->closid = 0;
-	state->rmid_usecnt = 0;
-
-	WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
-	WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
-
-	cqm_pick_event_reader(cpu);
-	return 0;
-}
-
-static int intel_cqm_cpu_exit(unsigned int cpu)
-{
-	int target;
-
-	/* Is @cpu the current cqm reader for this package ? */
-	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
-		return 0;
-
-	/* Find another online reader in this package */
-	target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
-
-	if (target < nr_cpu_ids)
-		cpumask_set_cpu(target, &cqm_cpumask);
-
-	return 0;
-}
-
-static const struct x86_cpu_id intel_cqm_match[] = {
-	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_OCCUP_LLC },
-	{}
-};
-
-static void mbm_cleanup(void)
-{
-	if (!mbm_enabled)
-		return;
-
-	kfree(mbm_local);
-	kfree(mbm_total);
-	mbm_enabled = false;
-}
-
-static const struct x86_cpu_id intel_mbm_local_match[] = {
-	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
-	{}
-};
-
-static const struct x86_cpu_id intel_mbm_total_match[] = {
-	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
-	{}
-};
-
-static int intel_mbm_init(void)
-{
-	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
-
-	mbm_socket_max = topology_max_packages();
-	array_size = sizeof(struct sample) * maxid * mbm_socket_max;
-	mbm_local = kmalloc(array_size, GFP_KERNEL);
-	if (!mbm_local)
-		return -ENOMEM;
-
-	mbm_total = kmalloc(array_size, GFP_KERNEL);
-	if (!mbm_total) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	array_size = sizeof(struct hrtimer) * mbm_socket_max;
-	mbm_timers = kmalloc(array_size, GFP_KERNEL);
-	if (!mbm_timers) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	mbm_hrtimer_init();
-
-out:
-	if (ret)
-		mbm_cleanup();
-
-	return ret;
-}
-
-static int __init intel_cqm_init(void)
-{
-	char *str = NULL, scale[20];
-	int cpu, ret;
-
-	if (x86_match_cpu(intel_cqm_match))
-		cqm_enabled = true;
-
-	if (x86_match_cpu(intel_mbm_local_match) &&
-	     x86_match_cpu(intel_mbm_total_match))
-		mbm_enabled = true;
-
-	if (!cqm_enabled && !mbm_enabled)
-		return -ENODEV;
-
-	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
-
-	/*
-	 * It's possible that not all resources support the same number
-	 * of RMIDs. Instead of making scheduling much more complicated
-	 * (where we have to match a task's RMID to a cpu that supports
-	 * that many RMIDs) just find the minimum RMIDs supported across
-	 * all cpus.
-	 *
-	 * Also, check that the scales match on all cpus.
-	 */
-	get_online_cpus();
-	for_each_online_cpu(cpu) {
-		struct cpuinfo_x86 *c = &cpu_data(cpu);
-
-		if (c->x86_cache_max_rmid < cqm_max_rmid)
-			cqm_max_rmid = c->x86_cache_max_rmid;
-
-		if (c->x86_cache_occ_scale != cqm_l3_scale) {
-			pr_err("Multiple LLC scale values, disabling\n");
-			ret = -EINVAL;
-			goto out;
-		}
-	}
-
-	/*
-	 * A reasonable upper limit on the max threshold is the number
-	 * of lines tagged per RMID if all RMIDs have the same number of
-	 * lines tagged in the LLC.
-	 *
-	 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
-	 */
-	__intel_cqm_max_threshold =
-		boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);
-
-	snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
-	str = kstrdup(scale, GFP_KERNEL);
-	if (!str) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	event_attr_intel_cqm_llc_scale.event_str = str;
-
-	ret = intel_cqm_setup_rmid_cache();
-	if (ret)
-		goto out;
-
-	if (mbm_enabled)
-		ret = intel_mbm_init();
-	if (ret && !cqm_enabled)
-		goto out;
-
-	if (cqm_enabled && mbm_enabled)
-		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
-	else if (!cqm_enabled && mbm_enabled)
-		intel_cqm_events_group.attrs = intel_mbm_events_attr;
-	else if (cqm_enabled && !mbm_enabled)
-		intel_cqm_events_group.attrs = intel_cqm_events_attr;
-
-	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
-	if (ret) {
-		pr_err("Intel CQM perf registration failed: %d\n", ret);
-		goto out;
-	}
-
-	if (cqm_enabled)
-		pr_info("Intel CQM monitoring enabled\n");
-	if (mbm_enabled)
-		pr_info("Intel MBM enabled\n");
-
-	/*
-	 * Setup the hot cpu notifier once we are sure cqm
-	 * is enabled to avoid notifier leak.
-	 */
-	cpuhp_setup_state(CPUHP_AP_PERF_X86_CQM_STARTING,
-			  "AP_PERF_X86_CQM_STARTING",
-			  intel_cqm_cpu_starting, NULL);
-	cpuhp_setup_state(CPUHP_AP_PERF_X86_CQM_ONLINE, "AP_PERF_X86_CQM_ONLINE",
-			  NULL, intel_cqm_cpu_exit);
-
-out:
-	put_online_cpus();
-
-	if (ret) {
-		kfree(str);
-		cqm_cleanup();
-		mbm_cleanup();
-	}
-
-	return ret;
-}
-device_initcall(intel_cqm_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index afe641c..320a3be 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -68,7 +68,6 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_STARTING,
 	CPUHP_AP_PERF_X86_STARTING,
 	CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
-	CPUHP_AP_PERF_X86_CQM_STARTING,
 	CPUHP_AP_PERF_X86_CSTATE_STARTING,
 	CPUHP_AP_PERF_XTENSA_STARTING,
 	CPUHP_AP_PERF_METAG_STARTING,
@@ -111,7 +110,6 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_ONLINE,
 	CPUHP_AP_PERF_X86_AMD_POWER_ONLINE,
 	CPUHP_AP_PERF_X86_RAPL_ONLINE,
-	CPUHP_AP_PERF_X86_CQM_ONLINE,
 	CPUHP_AP_PERF_X86_CSTATE_ONLINE,
 	CPUHP_AP_PERF_S390_CF_ONLINE,
 	CPUHP_AP_PERF_S390_SF_ONLINE,
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 060d0ed..345ec20 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -139,14 +139,6 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
-		struct { /* intel_cqm */
-			int			cqm_state;
-			u32			cqm_rmid;
-			int			is_group_event;
-			struct list_head	cqm_events_entry;
-			struct list_head	cqm_groups_entry;
-			struct list_head	cqm_group_entry;
-		};
 		struct { /* itrace */
 			int			itrace_started;
 		};
@@ -408,12 +400,6 @@ struct pmu {
 	 */
 	size_t				task_ctx_size;
 
-
-	/*
-	 * Return the count value for a counter.
-	 */
-	u64 (*count)			(struct perf_event *event); /*optional*/
-
 	/*
 	 * Set up pmu-private data structures for an AUX area
 	 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 55953db..d99a51c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3511,9 +3511,6 @@ static void __perf_event_read(void *info)
 
 static inline u64 perf_event_count(struct perf_event *event)
 {
-	if (event->pmu->count)
-		return event->pmu->count(event);
-
 	return __perf_event_count(event);
 }
 
@@ -3523,7 +3520,6 @@ static inline u64 perf_event_count(struct perf_event *event)
  *   - either for the current task, or for this CPU
  *   - does not have inherit set, for inherited task events
  *     will not be local and we cannot read them atomically
- *   - must not have a pmu::count method
  */
 u64 perf_event_read_local(struct perf_event *event)
 {
@@ -3551,12 +3547,6 @@ u64 perf_event_read_local(struct perf_event *event)
 	WARN_ON_ONCE(event->attr.inherit);
 
 	/*
-	 * It must not have a pmu::count method, those are not
-	 * NMI safe.
-	 */
-	WARN_ON_ONCE(event->pmu->count);
-
-	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
 	 * oncpu == -1).
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 5dcb992..52c2c85 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -252,8 +252,8 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
 		     event->attr.type != PERF_TYPE_RAW))
 		return -EINVAL;
 
-	/* make sure event is local and doesn't have pmu::count */
-	if (unlikely(event->oncpu != cpu || event->pmu->count))
+	/* make sure event is local */
+	if (unlikely(event->oncpu != cpu))
 		return -EINVAL;
 
 	/*
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
@ 2016-10-30  0:37 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag David Carrillo-Cisneros
                   ` (43 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The CMT name has superseded CQM in Intel's documentation.

Rename the cpufeatures accordingly. The next patches in this series use
the CMT name.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/include/asm/cpufeatures.h | 14 +++++++-------
 arch/x86/kernel/cpu/common.c       | 10 +++++-----
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 90b8c0b..cd3b215 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -222,7 +222,7 @@
 #define X86_FEATURE_ERMS	( 9*32+ 9) /* Enhanced REP MOVSB/STOSB */
 #define X86_FEATURE_INVPCID	( 9*32+10) /* Invalidate Processor Context ID */
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
-#define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
+#define X86_FEATURE_CMT		( 9*32+12) /* Cache Monitoring Technology */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
 #define X86_FEATURE_RDT_A	( 9*32+15) /* Resource Director Technology Allocation */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
@@ -245,13 +245,13 @@
 #define X86_FEATURE_XGETBV1	(10*32+ 2) /* XGETBV with ECX = 1 */
 #define X86_FEATURE_XSAVES	(10*32+ 3) /* XSAVES/XRSTORS */
 
-/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:0 (edx), word 11 */
-#define X86_FEATURE_CQM_LLC	(11*32+ 1) /* LLC QoS if 1 */
+/* Intel-defined CPU CMT Sub-leaf, CPUID level 0x0000000F:0 (edx), word 11 */
+#define X86_FEATURE_CMT_LLC	(11*32+ 1) /* LLC CMT if 1 */
 
-/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
-#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
-#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
-#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */
+/* Intel-defined CPU CMT Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
+#define X86_FEATURE_CMT_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CMT_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CMT_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
 #define X86_FEATURE_CLZERO	(13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9bd910a..911ee16 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -691,7 +691,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
 		c->x86_capability[CPUID_F_0_EDX] = edx;
 
-		if (cpu_has(c, X86_FEATURE_CQM_LLC)) {
+		if (cpu_has(c, X86_FEATURE_CMT_LLC)) {
 			/* will be overridden if occupancy monitoring exists */
 			c->x86_cache_max_rmid = ebx;
 
@@ -699,9 +699,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 			cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
 			c->x86_capability[CPUID_F_1_EDX] = edx;
 
-			if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
-			      ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
-			       (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
+			if ((cpu_has(c, X86_FEATURE_CMT_OCCUP_LLC)) ||
+			      ((cpu_has(c, X86_FEATURE_CMT_MBM_TOTAL)) ||
+			       (cpu_has(c, X86_FEATURE_CMT_MBM_LOCAL)))) {
 				c->x86_cache_max_rmid = ecx;
 				c->x86_cache_occ_scale = ebx;
 			}
@@ -969,7 +969,7 @@ static void x86_init_cache_qos(struct cpuinfo_x86 *c)
 	/*
 	 * The heavy lifting of max_rmid and cache_occ_scale are handled
 	 * in get_cpu_cap().  Here we just set the max_rmid for the boot_cpu
-	 * in case CQM bits really aren't there in this CPU.
+	 * in case CMT bits really aren't there in this CPU.
 	 */
 	if (c != &boot_cpu_data) {
 		boot_cpu_data.x86_cache_max_rmid =
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
                   ` (42 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add a new flag for building drivers for Intel RDT Monitoring
(currently CMT; in the future, MBM).

This driver may be converted to a module once the hooks in
perf cgroup are discussed.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/Kconfig | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 770fb5f..d31825d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -419,6 +419,18 @@ config INTEL_RDT_A
 
 	  Say N if unsure.
 
+config INTEL_RDT_M
+	bool "Intel Resource Director Technology Monitoring support"
+	default n
+	depends on PERF_EVENTS && X86 && CPU_SUP_INTEL
+	---help---
+	  Select to enable resource monitoring, which is a sub-feature of
+	  Intel Resource Director Technology (RDT). More information about
+	  RDT can be found in the Intel x86 Architecture Software
+	  Developer Manual.
+
+	  Say N if unsure.
+
 if X86_32
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (2 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 15:19   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
                   ` (41 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Introduce struct pkg_data to store the per-package locks and data for
the new CMT driver.

Each pkg_data is initialized/terminated on demand when the first/last CPU
in its package goes online/offline.

More details in the code's comments.
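
As an aside for readers (not part of the patch), the per-package data is
meant to be walked either under rcu_read_lock() or with cmt_mutex held; a
minimal sketch using the cmt_pkgs_data_next_rcu() helper and the pkg_data
fields added below (cmt_print_pkgs() is a hypothetical name):

/* illustrative only; not in the patch */
static void cmt_print_pkgs(void)
{
	struct pkg_data *pkgd = NULL;

	/* readers hold either rcu_read_lock() or cmt_mutex */
	rcu_read_lock();
	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
		pr_info("pkg %u: work_cpu %u, max_rmid %u\n",
			pkgd->pkgid, pkgd->work_cpu, pkgd->max_rmid);
	rcu_read_unlock();
}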

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/Makefile |   1 +
 arch/x86/events/intel/cmt.c    | 268 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h    |  29 +++++
 include/linux/cpuhotplug.h     |   2 +
 4 files changed, 300 insertions(+)
 create mode 100644 arch/x86/events/intel/cmt.c
 create mode 100644 arch/x86/events/intel/cmt.h

diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index e9d8520..02fecbc 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= intel-uncore.o
 intel-uncore-objs			:= uncore.o uncore_nhmex.o uncore_snb.o uncore_snbep.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE)	+= intel-cstate.o
 intel-cstate-objs			:= cstate.o
+obj-$(CONFIG_INTEL_RDT_M)		+= cmt.o
diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
new file mode 100644
index 0000000..267a9ec
--- /dev/null
+++ b/arch/x86/events/intel/cmt.c
@@ -0,0 +1,268 @@
+/*
+ * Intel Cache Monitoring Technology (CMT) support.
+ */
+
+#include <linux/slab.h>
+#include <asm/cpu_device_id.h>
+#include "cmt.h"
+#include "../perf_event.h"
+
+static DEFINE_MUTEX(cmt_mutex);
+
+static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
+
+static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
+
+/* Array of packages (array of pkgds). It's protected by RCU or cmt_mutex. */
+static struct pkg_data **cmt_pkgs_data;
+
+/*
+ * If @pkgd == NULL, return the first online pkg_data in cmt_pkgs_data.
+ * Otherwise, return the next online pkg_data, or NULL if there are no more.
+ */
+static struct pkg_data *cmt_pkgs_data_next_rcu(struct pkg_data *pkgd)
+{
+	u16 p, nr_pkgs = topology_max_packages();
+
+	if (!pkgd)
+		return rcu_dereference_check(cmt_pkgs_data[0],
+					     lockdep_is_held(&cmt_mutex));
+	p = pkgd->pkgid + 1;
+	pkgd = NULL;
+
+	while (!pkgd && p < nr_pkgs) {
+		pkgd = rcu_dereference_check(cmt_pkgs_data[p++],
+					     lockdep_is_held(&cmt_mutex));
+	}
+
+	return pkgd;
+}
+
+static void free_pkg_data(struct pkg_data *pkg_data)
+{
+	kfree(pkg_data);
+}
+
+/* Init pkg_data for @cpu's package. */
+static struct pkg_data *alloc_pkg_data(int cpu)
+{
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	struct pkg_data *pkgd;
+	int numa_node = cpu_to_node(cpu);
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	if (c->x86_cache_occ_scale != cmt_l3_scale) {
+		/* 0 scale must have been converted to 1 automatically. */
+		if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {
+			pr_err("Multiple LLC scale values, disabling CMT support.\n");
+			return ERR_PTR(-ENXIO);
+		}
+	}
+
+	pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, numa_node);
+	if (!pkgd)
+		return ERR_PTR(-ENOMEM);
+
+	pkgd->max_rmid = c->x86_cache_max_rmid;
+
+	pkgd->work_cpu = cpu;
+	pkgd->pkgid = pkgid;
+
+	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);
+
+	return pkgd;
+}
+
+static void __terminate_pkg_data(struct pkg_data *pkgd)
+{
+	lockdep_assert_held(&cmt_mutex);
+
+	free_pkg_data(pkgd);
+}
+
+static int init_pkg_data(int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	lockdep_assert_held(&cmt_mutex);
+
+	/* Verify that this pkgid isn't already initialized. */
+	if (WARN_ON_ONCE(cmt_pkgs_data[pkgid]))
+		return -EPERM;
+
+	pkgd = alloc_pkg_data(cpu);
+	if (IS_ERR(pkgd))
+		return PTR_ERR(pkgd);
+
+	rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
+	synchronize_rcu();
+
+	return 0;
+}
+
+static int intel_cmt_hp_online_enter(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	rcu_read_lock();
+	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
+	if (pkgd->work_cpu >= nr_cpu_ids)
+		pkgd->work_cpu = cpu;
+
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int intel_cmt_hp_online_exit(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	rcu_read_lock();
+	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
+	if (pkgd->work_cpu == cpu)
+		pkgd->work_cpu = cpumask_any_but(
+				topology_core_cpumask(cpu), cpu);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int intel_cmt_prep_up(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	int err = 0;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	mutex_lock(&cmt_mutex);
+	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
+					 lockdep_is_held(&cmt_mutex));
+	if (!pkgd)
+		err = init_pkg_data(cpu);
+	mutex_unlock(&cmt_mutex);
+
+	return err;
+}
+
+static int intel_cmt_prep_down(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	mutex_lock(&cmt_mutex);
+	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
+					 lockdep_is_held(&cmt_mutex));
+	if (pkgd->work_cpu >= nr_cpu_ids) {
+		/* will destroy pkgd */
+		__terminate_pkg_data(pkgd);
+		RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
+		synchronize_rcu();
+	}
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+}
+
+static const struct x86_cpu_id intel_cmt_match[] = {
+	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CMT_OCCUP_LLC },
+	{}
+};
+
+static void cmt_dealloc(void)
+{
+	kfree(cmt_pkgs_data);
+	cmt_pkgs_data = NULL;
+}
+
+static int __init cmt_alloc(void)
+{
+	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
+	if (cmt_l3_scale == 0)
+		cmt_l3_scale = 1;
+
+	cmt_pkgs_data = kcalloc(topology_max_packages(),
+				sizeof(*cmt_pkgs_data), GFP_KERNEL);
+	if (!cmt_pkgs_data)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __init cmt_start(void)
+{
+	char *str, scale[20];
+	int err;
+
+	/* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
+	__min_max_rmid = UINT_MAX;
+	err = cpuhp_setup_state(CPUHP_PERF_X86_CMT_PREP,
+				"PERF_X86_CMT_PREP",
+				intel_cmt_prep_up,
+				intel_cmt_prep_down);
+	if (err)
+		return err;
+
+	err = cpuhp_setup_state(CPUHP_AP_PERF_X86_CMT_ONLINE,
+				"AP_PERF_X86_CMT_ONLINE",
+				intel_cmt_hp_online_enter,
+				intel_cmt_hp_online_exit);
+	if (err)
+		goto rm_prep;
+
+	snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
+	str = kstrdup(scale, GFP_KERNEL);
+	if (!str) {
+		err = -ENOMEM;
+		goto rm_online;
+	}
+
+	return 0;
+
+rm_online:
+	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
+rm_prep:
+	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
+
+	return err;
+}
+
+static int __init intel_cmt_init(void)
+{
+	struct pkg_data *pkgd = NULL;
+	int err = 0;
+
+	if (!x86_match_cpu(intel_cmt_match)) {
+		err = -ENODEV;
+		goto err_exit;
+	}
+
+	err = cmt_alloc();
+	if (err)
+		goto err_exit;
+
+	err = cmt_start();
+	if (err)
+		goto err_dealloc;
+
+	pr_info("Intel CMT enabled with ");
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		pr_cont("%d RMIDs for pkg %d, ",
+			pkgd->max_rmid + 1, pkgd->pkgid);
+	}
+	rcu_read_unlock();
+	pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);
+
+	return err;
+
+err_dealloc:
+	cmt_dealloc();
+err_exit:
+	pr_err("Intel CMT registration failed with error: %d\n", err);
+	return err;
+}
+
+device_initcall(intel_cmt_init);
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
new file mode 100644
index 0000000..8c16797
--- /dev/null
+++ b/arch/x86/events/intel/cmt.h
@@ -0,0 +1,29 @@
+/*
+ * Intel Cache Monitoring Technology (CMT) support.
+ * (formerly Intel Cache QoS Monitoring, CQM)
+ *
+ *
+ * Locking
+ *
+ * One global cmt_mutex. One mutex and spin_lock per package.
+ * cmt_pkgs_data is RCU protected.
+ *
+ * Rules:
+ *  - cmt_mutex: Hold for CMT init/terminate, event init/terminate,
+ *  cgroup start/stop.
+ */
+
+/**
+ * struct pkg_data - Per-package CMT data.
+ *
+ * @work_cpu:			CPU to run rotation and other batch jobs.
+ *				It must be in the package associated to its
+ *				instance of pkg_data.
+ * @max_rmid:			Max rmid valid for CPUs in this package.
+ * @pkgid:			The logical package id for this pkgd.
+ */
+struct pkg_data {
+	unsigned int		work_cpu;
+	u32			max_rmid;
+	u16			pkgid;
+};
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 320a3be..604660a 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -11,6 +11,7 @@ enum cpuhp_state {
 	CPUHP_PERF_X86_UNCORE_PREP,
 	CPUHP_PERF_X86_AMD_UNCORE_PREP,
 	CPUHP_PERF_X86_RAPL_PREP,
+	CPUHP_PERF_X86_CMT_PREP,
 	CPUHP_PERF_BFIN,
 	CPUHP_PERF_POWER,
 	CPUHP_PERF_SUPERH,
@@ -110,6 +111,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_ONLINE,
 	CPUHP_AP_PERF_X86_AMD_POWER_ONLINE,
 	CPUHP_AP_PERF_X86_RAPL_ONLINE,
+	CPUHP_AP_PERF_X86_CMT_ONLINE,
 	CPUHP_AP_PERF_X86_CSTATE_ONLINE,
 	CPUHP_AP_PERF_S390_CF_ONLINE,
 	CPUHP_AP_PERF_S390_SF_ONLINE,
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (3 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 21:23   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
                   ` (40 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Per-package locks potentially reduce contention compared to the
system-wide approach of the previous CQM/CMT driver.

Lockdep needs lock_class_key instances to be statically initialized and/or
to use nesting, but nesting is currently hard-coded for up to 8 levels and
it is fragile to depend on lockdep internals.
To circumvent this problem, statically define CMT_MAX_NR_PKGS
lock_class_key instances, one per package.

Additional details in the code's comments.
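
For illustration (not part of the patch), the per-package keys allow code
that nests every pkgd->mutex, always in ascending pkgid order, to stay
lockdep-clean even with more than 8 packages; a later patch in this series
adds helpers that do exactly this (monr_hrchy_acquire_mutexes()). A sketch,
assuming cmt_mutex is held so the package array cannot change
(example_lock_all_pkgs() is a hypothetical name):

/* illustrative only; not in the patch */
static void example_lock_all_pkgs(void)
{
	struct pkg_data *pkgd = NULL;

	/* ascending pkgid order keeps the cross-package lock order consistent */
	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
		mutex_lock(&pkgd->mutex);

	/* ... modify state shared across all packages ... */

	pkgd = NULL;
	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
		mutex_unlock(&pkgd->mutex);
}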

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 22 ++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |  8 ++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 267a9ec..f12a06b 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -7,6 +7,14 @@
 #include "cmt.h"
 #include "../perf_event.h"
 
+/* Increase as needed as Intel CPUs grow. */
+#define CMT_MAX_NR_PKGS		8
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key	mutex_keys[CMT_MAX_NR_PKGS];
+static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
+#endif
+
 static DEFINE_MUTEX(cmt_mutex);
 
 static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
@@ -51,6 +59,12 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 	int numa_node = cpu_to_node(cpu);
 	u16 pkgid = topology_logical_package_id(cpu);
 
+	if (pkgid >= CMT_MAX_NR_PKGS) {
+		pr_err("CMT_MAX_NR_PKGS of %d is insufficient for logical packages.\n",
+		       CMT_MAX_NR_PKGS);
+		return ERR_PTR(-ENOSPC);
+	}
+
 	if (c->x86_cache_occ_scale != cmt_l3_scale) {
 		/* 0 scale must have been converted to 1 automatically. */
 		if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {
@@ -65,9 +79,17 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 
 	pkgd->max_rmid = c->x86_cache_max_rmid;
 
+	mutex_init(&pkgd->mutex);
+	raw_spin_lock_init(&pkgd->lock);
+
 	pkgd->work_cpu = cpu;
 	pkgd->pkgid = pkgid;
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_class(&pkgd->mutex, &mutex_keys[pkgid]);
+	lockdep_set_class(&pkgd->lock, &lock_keys[pkgid]);
+#endif
+
 	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);
 
 	return pkgd;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 8c16797..55416db 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -11,11 +11,16 @@
  * Rules:
  *  - cmt_mutex: Hold for CMT init/terminate, event init/terminate,
  *  cgroup start/stop.
+ *  - Hold pkgd->mutex and pkgd->lock in _all_ active packages to traverse or
+ *  change the monr hierarchy.
+ *  - pkgd->lock: Hold in current package to access that pkgd's members.
  */
 
 /**
  * struct pkg_data - Per-package CMT data.
  *
+ * @mutex:			Hold when modifying this pkg_data.
+ * @lock:			Hold to protect pmonrs in this pkg_data.
  * @work_cpu:			CPU to run rotation and other batch jobs.
  *				It must be in the package associated to its
  *				instance of pkg_data.
@@ -23,6 +28,9 @@
  * @pkgid:			The logical package id for this pkgd.
  */
 struct pkg_data {
+	struct mutex		mutex;
+	raw_spinlock_t		lock;
+
 	unsigned int		work_cpu;
 	u32			max_rmid;
 	u16			pkgid;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (4 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 21:27   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event David Carrillo-Cisneros
                   ` (39 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add the pmu, the llc_occupancy event attributes, and the functions for
event initialization.

The empty pmu callbacks will be filled in by future patches in this series.
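
For reference (not part of the patch), once the PMU is registered it should
show up in sysfs as "intel_cmt", and the llc_occupancy event (event=0x01 per
the format attribute above) can be opened from userspace as sketched below.
Reads only return meaningful occupancy once the empty callbacks are filled
in by later patches in this series:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	unsigned long long occupancy;
	FILE *f;
	int type, fd;

	/* PMU type id assigned by perf_pmu_register() */
	f = fopen("/sys/bus/event_source/devices/intel_cmt/type", "r");
	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = 0x01;	/* llc_occupancy */

	/* monitor the calling task on any CPU */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	if (read(fd, &occupancy, sizeof(occupancy)) == sizeof(occupancy))
		printf("llc_occupancy: %llu bytes\n", occupancy);
	close(fd);
	return 0;
}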

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 106 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f12a06b..0a24896 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -7,6 +7,9 @@
 #include "cmt.h"
 #include "../perf_event.h"
 
+#define QOS_L3_OCCUP_EVENT_ID	BIT_ULL(0)
+#define QOS_EVENT_MASK		QOS_L3_OCCUP_EVENT_ID
+
 /* Increase as needed as Intel CPUs grow. */
 #define CMT_MAX_NR_PKGS		8
 
@@ -46,6 +49,96 @@ static struct pkg_data *cmt_pkgs_data_next_rcu(struct pkg_data *pkgd)
 	return pkgd;
 }
 
+static struct pmu intel_cmt_pmu;
+
+static void intel_cmt_event_read(struct perf_event *event)
+{
+}
+
+static void intel_cmt_event_start(struct perf_event *event, int mode)
+{
+}
+
+static void intel_cmt_event_stop(struct perf_event *event, int mode)
+{
+}
+
+static int intel_cmt_event_add(struct perf_event *event, int mode)
+{
+	return 0;
+}
+
+static int intel_cmt_event_init(struct perf_event *event)
+{
+	int err = 0;
+
+	if (event->attr.type != intel_cmt_pmu.type)
+		return -ENOENT;
+	if (event->attr.config & ~QOS_EVENT_MASK)
+		return -EINVAL;
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    event->attr.inherit_stat   || /* cmt groups share rmid */
+	    event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	return err;
+}
+
+EVENT_ATTR_STR(llc_occupancy, intel_cmt_llc, "event=0x01");
+EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cmt_llc_pkg, "1");
+EVENT_ATTR_STR(llc_occupancy.unit, intel_cmt_llc_unit, "Bytes");
+EVENT_ATTR_STR(llc_occupancy.scale, intel_cmt_llc_scale, NULL);
+EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cmt_llc_snapshot, "1");
+
+static struct attribute *intel_cmt_events_attr[] = {
+	EVENT_PTR(intel_cmt_llc),
+	EVENT_PTR(intel_cmt_llc_pkg),
+	EVENT_PTR(intel_cmt_llc_unit),
+	EVENT_PTR(intel_cmt_llc_scale),
+	EVENT_PTR(intel_cmt_llc_snapshot),
+	NULL,
+};
+
+static struct attribute_group intel_cmt_events_group = {
+	.name = "events",
+	.attrs = intel_cmt_events_attr,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+static struct attribute *intel_cmt_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group intel_cmt_format_group = {
+	.name = "format",
+	.attrs = intel_cmt_formats_attr,
+};
+
+static const struct attribute_group *intel_cmt_attr_groups[] = {
+	&intel_cmt_events_group,
+	&intel_cmt_format_group,
+	NULL,
+};
+
+static struct pmu intel_cmt_pmu = {
+	.attr_groups	     = intel_cmt_attr_groups,
+	.task_ctx_nr	     = perf_sw_context,
+	.event_init	     = intel_cmt_event_init,
+	.add		     = intel_cmt_event_add,
+	.del		     = intel_cmt_event_stop,
+	.start		     = intel_cmt_event_start,
+	.stop		     = intel_cmt_event_stop,
+	.read		     = intel_cmt_event_read,
+};
+
 static void free_pkg_data(struct pkg_data *pkg_data)
 {
 	kfree(pkg_data);
@@ -199,6 +292,12 @@ static void cmt_dealloc(void)
 	cmt_pkgs_data = NULL;
 }
 
+static void cmt_stop(void)
+{
+	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
+	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
+}
+
 static int __init cmt_alloc(void)
 {
 	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -240,6 +339,7 @@ static int __init cmt_start(void)
 		err = -ENOMEM;
 		goto rm_online;
 	}
+	event_attr_intel_cmt_llc_scale.event_str = str;
 
 	return 0;
 
@@ -269,6 +369,10 @@ static int __init intel_cmt_init(void)
 	if (err)
 		goto err_dealloc;
 
+	err = perf_pmu_register(&intel_cmt_pmu, "intel_cmt", -1);
+	if (err)
+		goto err_stop;
+
 	pr_info("Intel CMT enabled with ");
 	rcu_read_lock();
 	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
@@ -280,6 +384,8 @@ static int __init intel_cmt_init(void)
 
 	return err;
 
+err_stop:
+	cmt_stop();
 err_dealloc:
 	cmt_dealloc();
 err_exit:
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (5 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
                   ` (38 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add attributes to hw_perf_event that are required by CMT and,
in the future, MBM.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 345ec20..0202b32 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -139,6 +139,12 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+#ifdef CONFIG_INTEL_RDT_M
+		struct { /* intel_cmt */
+			void			*cmt_monr;
+			struct list_head	cmt_list;
+		};
+#endif
 		struct { /* itrace */
 			int			itrace_started;
 		};
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (6 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 23:09   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy David Carrillo-Cisneros
                   ` (37 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Perf events use monrs (MONitored Resources) to monitor CMT events. Events
that share a monitoring target (the same thread or cgroup) share a monr.

This patch introduces monrs and adds support for monr creation/destruction.

An event's associated monr is referenced by event->cmt_monr (introduced
in the previous patch).

monr->mon_events references the first event that uses that monr; events
that share the monr are appended to the first event's cmt_list list head.

Hold all pkgd->mutexes to modify monr->mon_events and the event's monr data.

Support for CPU and cgroup events is added in future patches in this series.

More details in the code's comments.
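
To make the list layout concrete, here is an illustrative walk (not part of
the patch) over all events that share a monr; it assumes cmt_mutex, or all
pkgd->mutexes, are held as described above, and monr_print_events() is a
hypothetical name:

/* illustrative only; not in the patch */
static void monr_print_events(struct monr *monr)
{
	struct perf_event *head = monr->mon_events, *pos;

	if (!head)
		return;

	/* the head event first, then the events chained on its hw.cmt_list */
	pr_debug("head event %p uses monr %p\n", head, monr);
	list_for_each_entry(pos, &head->hw.cmt_list, hw.cmt_list)
		pr_debug("event %p shares monr %p\n", pos, monr);
}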

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 228 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |  20 ++++
 2 files changed, 248 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 0a24896..23606a7 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -19,6 +19,8 @@ static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
 #endif
 
 static DEFINE_MUTEX(cmt_mutex);
+/* List of monrs that are associated with an event. */
+static LIST_HEAD(cmt_event_monrs);
 
 static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
 
@@ -49,8 +51,210 @@ static struct pkg_data *cmt_pkgs_data_next_rcu(struct pkg_data *pkgd)
 	return pkgd;
 }
 
+/*
+ * Functions to lock/unlock/assert all per-package mutexes/locks at once.
+ */
+
+static void monr_hrchy_acquire_mutexes(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	/* RCU protected by cmt_mutex. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		mutex_lock(&pkgd->mutex);
+}
+
+static void monr_hrchy_release_mutexes(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	/* RCU protected by cmt_mutex. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		mutex_unlock(&pkgd->mutex);
+}
+
+static void monr_hrchy_assert_held_mutexes(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	/* RCU protected by cmt_mutex. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		lockdep_assert_held(&pkgd->mutex);
+}
+
+static void monr_dealloc(struct monr *monr)
+{
+	kfree(monr);
+}
+
+static struct monr *monr_alloc(void)
+{
+	struct monr *monr;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	monr = kzalloc(sizeof(*monr), GFP_KERNEL);
+	if (!monr)
+		return ERR_PTR(-ENOMEM);
+
+	return monr;
+}
+
+static inline struct monr *monr_from_event(struct perf_event *event)
+{
+	return (struct monr *) READ_ONCE(event->hw.cmt_monr);
+}
+
+static struct monr *monr_remove_event(struct perf_event *event)
+{
+	struct monr *monr = monr_from_event(event);
+
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_assert_held_mutexes();
+
+	if (list_empty(&monr->mon_events->hw.cmt_list)) {
+		monr->mon_events = NULL;
+		/* remove from cmt_event_monrs */
+		list_del_init(&monr->entry);
+	} else {
+		if (monr->mon_events == event)
+			monr->mon_events = list_next_entry(event, hw.cmt_list);
+		list_del_init(&event->hw.cmt_list);
+	}
+
+	WRITE_ONCE(event->hw.cmt_monr, NULL);
+
+	return monr;
+}
+
+static int monr_append_event(struct monr *monr, struct perf_event *event)
+{
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_assert_held_mutexes();
+
+	if (monr->mon_events) {
+		list_add_tail(&event->hw.cmt_list,
+			      &monr->mon_events->hw.cmt_list);
+	} else {
+		monr->mon_events = event;
+		list_add_tail(&monr->entry, &cmt_event_monrs);
+	}
+
+	WRITE_ONCE(event->hw.cmt_monr, monr);
+
+	return 0;
+}
+
+static bool is_cgroup_event(struct perf_event *event)
+{
+	return false;
+}
+
+static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
+{
+	return -EPERM;
+}
+
+static int monr_hrchy_attach_cpu_event(struct perf_event *event)
+{
+	return -EPERM;
+}
+
+static int monr_hrchy_attach_task_event(struct perf_event *event)
+{
+	struct monr *monr;
+	int err;
+
+	monr = monr_alloc();
+	if (IS_ERR(monr))
+		return -ENOMEM;
+
+	err = monr_append_event(monr, event);
+	if (err)
+		monr_dealloc(monr);
+	return err;
+}
+
+/* Insert or create monr in appropriate position in hierarchy. */
+static int monr_hrchy_attach_event(struct perf_event *event)
+{
+	int err = 0;
+
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+
+	if (!is_cgroup_event(event) &&
+	    !(event->attach_state & PERF_ATTACH_TASK)) {
+		err = monr_hrchy_attach_cpu_event(event);
+		goto exit;
+	}
+	if (is_cgroup_event(event)) {
+		err = monr_hrchy_attach_cgroup_event(event);
+		goto exit;
+	}
+	err = monr_hrchy_attach_task_event(event);
+exit:
+	monr_hrchy_release_mutexes();
+
+	return err;
+}
+
+/**
+ * __match_event() - Determine if @a and @b should share a rmid.
+ */
+static bool __match_event(struct perf_event *a, struct perf_event *b)
+{
+	/* Cgroup/non-task per-cpu and task events don't mix */
+	if ((a->attach_state & PERF_ATTACH_TASK) !=
+	    (b->attach_state & PERF_ATTACH_TASK))
+		return false;
+
+#ifdef CONFIG_CGROUP_PERF
+	if (a->cgrp != b->cgrp)
+		return false;
+#endif
+
+	/* If not a task event, it's a cgroup or a non-task cpu event. */
+	if (!(b->attach_state & PERF_ATTACH_TASK))
+		return true;
+
+	/* Events that target the same task are placed into the same group. */
+	if (a->hw.target == b->hw.target)
+		return true;
+
+	/* Are we an inherited event? */
+	if (b->parent == a)
+		return true;
+
+	return false;
+}
+
 static struct pmu intel_cmt_pmu;
 
+/* Try to find a monr with the same target, otherwise create a new one. */
+static int mon_group_setup_event(struct perf_event *event)
+{
+	struct monr *monr;
+	int err;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	list_for_each_entry(monr, &cmt_event_monrs, entry) {
+		if (!__match_event(monr->mon_events, event))
+			continue;
+		monr_hrchy_acquire_mutexes();
+		err = monr_append_event(monr, event);
+		monr_hrchy_release_mutexes();
+		return err;
+	}
+	/*
+	 * Since no match was found, create a new monr and set this
+	 * event as head of a mon_group. All events in this group
+	 * will share the monr.
+	 */
+	return monr_hrchy_attach_event(event);
+}
+
 static void intel_cmt_event_read(struct perf_event *event)
 {
 }
@@ -68,6 +272,20 @@ static int intel_cmt_event_add(struct perf_event *event, int mode)
 	return 0;
 }
 
+static void intel_cmt_event_destroy(struct perf_event *event)
+{
+	struct monr *monr;
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+
+	/* monr is detached from the event. */
+	monr = monr_remove_event(event);
+
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+}
+
 static int intel_cmt_event_init(struct perf_event *event)
 {
 	int err = 0;
@@ -88,6 +306,16 @@ static int intel_cmt_event_init(struct perf_event *event)
 	    event->attr.sample_period) /* no sampling */
 		return -EINVAL;
 
+	event->destroy = intel_cmt_event_destroy;
+
+	INIT_LIST_HEAD(&event->hw.cmt_list);
+
+	mutex_lock(&cmt_mutex);
+
+	err = mon_group_setup_event(event);
+
+	mutex_unlock(&cmt_mutex);
+
 	return err;
 }
 
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 55416db..0ce5d4d 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -3,6 +3,12 @@
  * (formerly Intel Cache QoS Monitoring, CQM)
  *
  *
+ * A "Monitored Resource" (monr) is the entity monitored by CMT and MBM.
+ * In order to monitor a cgroup and/or thread, it must be associated with
+ * a monr. A monr is active on a CPU when a thread that is associated with
+ * it (either directly or through a cgroup) is scheduled in it.
+ *
+ *
  * Locking
  *
  * One global cmt_mutex. One mutex and spin_lock per package.
@@ -35,3 +41,17 @@ struct pkg_data {
 	u32			max_rmid;
 	u16			pkgid;
 };
+
+/**
+ * struct monr - MONitored Resource.
+ * @mon_events:		The head of event's group that use this monr, if any.
+ * @entry:		List entry into cmt_event_monrs.
+ *
+ * A monr is assigned to every CMT event and/or monitored cgroup when
+ * monitoring is activated and that instance's address does not change during
+ * the lifetime of the event or cgroup.
+ */
+struct monr {
+	struct perf_event		*mon_events;
+	struct list_head		entry;
+};
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (7 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization David Carrillo-Cisneros
                   ` (36 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add the root of the monr hierarchy and auxiliary functions for locking.

Also, add support for attaching CPU and task events to the monr hierarchy.
As of this patch, both types of events always use the root monr (this
will change when cgroups are introduced later in this series).

More details in the code's comments.
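
As a small illustration (not part of the patch) of the parent/children
links added here, the direct children of a monr can be walked as below,
with the per-package locks held as the locking rules in cmt.h require;
monr_print_children() is a hypothetical name:

/* illustrative only; not in the patch */
static void monr_print_children(struct monr *parent)
{
	struct monr *child;

	list_for_each_entry(child, &parent->children, parent_entry)
		pr_debug("monr %p is a child of %p\n", child, parent);
}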

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/cmt.h | 26 +++++++++++++++
 2 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 23606a7..39f4bfa 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -26,6 +26,9 @@ static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
 
 static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
 
+/* Root for system-wide hierarchy of MONitored Resources (monr). */
+static struct monr *monr_hrchy_root;
+
 /* Array of packages (array of pkgds). It's protected by RCU or cmt_mutex. */
 static struct pkg_data **cmt_pkgs_data;
 
@@ -82,6 +85,24 @@ static void monr_hrchy_assert_held_mutexes(void)
 		lockdep_assert_held(&pkgd->mutex);
 }
 
+static void monr_hrchy_acquire_locks(unsigned long *flags)
+{
+	struct pkg_data *pkgd = NULL;
+
+	raw_local_irq_save(*flags);
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		raw_spin_lock(&pkgd->lock);
+}
+
+static void monr_hrchy_release_locks(unsigned long *flags)
+{
+	struct pkg_data *pkgd = NULL;
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		raw_spin_unlock(&pkgd->lock);
+	raw_local_irq_restore(*flags);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	kfree(monr);
@@ -97,6 +118,10 @@ static struct monr *monr_alloc(void)
 	if (!monr)
 		return ERR_PTR(-ENOMEM);
 
+	INIT_LIST_HEAD(&monr->entry);
+	INIT_LIST_HEAD(&monr->children);
+	INIT_LIST_HEAD(&monr->parent_entry);
+
 	return monr;
 }
 
@@ -145,6 +170,26 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 	return 0;
 }
 
+static void monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
+{
+	unsigned long flags;
+
+	monr_hrchy_acquire_locks(&flags);
+	list_add_tail(&monr->parent_entry, &parent->children);
+	monr->parent = parent;
+	monr_hrchy_release_locks(&flags);
+}
+
+static void monr_hrchy_remove_leaf(struct monr *monr)
+{
+	unsigned long flags;
+
+	monr_hrchy_acquire_locks(&flags);
+	list_del_init(&monr->parent_entry);
+	monr->parent = NULL;
+	monr_hrchy_release_locks(&flags);
+}
+
 static bool is_cgroup_event(struct perf_event *event)
 {
 	return false;
@@ -155,20 +200,32 @@ static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
 	return -EPERM;
 }
 
+/*
+ * This non-cgroup version creates a two-level hierarchy: the root and a
+ * first level with all event monrs underneath it.
+ */
+static struct monr *monr_hrchy_get_monr_parent(struct perf_event *event)
+{
+	return monr_hrchy_root;
+}
+
 static int monr_hrchy_attach_cpu_event(struct perf_event *event)
 {
-	return -EPERM;
+	return monr_append_event(monr_hrchy_root, event);
 }
 
 static int monr_hrchy_attach_task_event(struct perf_event *event)
 {
-	struct monr *monr;
+	struct monr *monr_parent, *monr;
 	int err;
 
+	monr_parent = monr_hrchy_get_monr_parent(event);
 	monr = monr_alloc();
 	if (IS_ERR(monr))
 		return -ENOMEM;
 
+	monr_hrchy_insert_leaf(monr, monr_parent);
+
 	err = monr_append_event(monr, event);
 	if (err)
 		monr_dealloc(monr);
@@ -199,6 +256,12 @@ static int monr_hrchy_attach_event(struct perf_event *event)
 	return err;
 }
 
+static void monr_destroy(struct monr *monr)
+{
+	monr_hrchy_remove_leaf(monr);
+	monr_dealloc(monr);
+}
+
 /**
  * __match_event() - Determine if @a and @b should share a rmid.
  */
@@ -281,6 +344,7 @@ static void intel_cmt_event_destroy(struct perf_event *event)
 
 	/* monr is dettached from event. */
 	monr = monr_remove_event(event);
+	monr_destroy(monr);
 
 	monr_hrchy_release_mutexes();
 	mutex_unlock(&cmt_mutex);
@@ -516,6 +580,9 @@ static const struct x86_cpu_id intel_cmt_match[] = {
 
 static void cmt_dealloc(void)
 {
+	kfree(monr_hrchy_root);
+	monr_hrchy_root = NULL;
+
 	kfree(cmt_pkgs_data);
 	cmt_pkgs_data = NULL;
 }
@@ -537,6 +604,12 @@ static int __init cmt_alloc(void)
 	if (!cmt_pkgs_data)
 		return -ENOMEM;
 
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_root = monr_alloc();
+	mutex_unlock(&cmt_mutex);
+	if (IS_ERR(monr_hrchy_root))
+		return PTR_ERR(monr_hrchy_root);
+
 	return 0;
 }
 
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 0ce5d4d..46e8335 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -7,6 +7,25 @@
  * In order to monitor a cgroup and/or thread, it must be associated with
  * a monr. A monr is active on a CPU when a thread that is associated with
  * it (either directly or through a cgroup) is scheduled in it.
+ * The monrs are organized in a tree hierarchy named "monr hierarchy". It
+ * captures the dependencies between the monitored entities, e.g.:
+ *
+ *	   cgroup hierarchy		        monr hierarchy
+ *------------------------------------------------------------------------
+ *          root cgroup                           root monr
+ *       (always monitored)                        /      \
+ *       /                \                     monr A    monr B1
+ *  cgroup A             cgroup B                 |
+ * (monitored)        (not monitored)           monr A1
+ *      |              /          \
+ *   task A1       task B1       task B2
+ * (monitored)   (monitored)  (not monitored)
+ *
+ *
+ * This driver maintains the monr hierarchy separately from the cgroup
+ * hierarchy in order to reduce the need for synchronization between the two
+ * and to make it possible to capture dependencies between threads in the same
+ * cgroup or process.
  *
  *
  * Locking
@@ -46,6 +65,9 @@ struct pkg_data {
  * struct monr - MONitored Resource.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
+ * @parent:		Parent in monr hierarchy.
+ * @children:		List of children in monr hierarchy.
+ * @parent_entry:	Entry in parent's children list.
  *
  * A monr is assigned to every CMT event and/or monitored cgroup when
  * monitoring is activated and that instance's address does not change during
  * the lifetime of the event or cgroup.
 struct monr {
 	struct perf_event		*mon_events;
 	struct list_head		entry;
+
+	struct monr			*parent;
+	struct list_head		children;
+	struct list_head		parent_entry;
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (8 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr David Carrillo-Cisneros
                   ` (35 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A pmonr is the per-package component of a monr. This patch only adds
initialization/destruction of pmonrs. Future patches explain their
usage and add functionality.

CPU hotplug is supported by initializing/terminating all pmonrs in the
monr hierarchy when the first/last CPU in a package goes online/offline.
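
For illustration only (not part of the patch), a monr's pmonr for a given
package is reached through the RCU protected pmonrs array, in the same way
pkgd_pmonr() does below; a minimal sketch that checks whether a monr
currently has a pmonr on the package of @cpu (monr_has_pkg() is a
hypothetical name):

/* illustrative only; not in the patch */
static bool monr_has_pkg(struct monr *monr, int cpu)
{
	u16 pkgid = topology_logical_package_id(cpu);
	bool present;

	/* pmonrs[pkgid] is NULL while the package is offline */
	rcu_read_lock();
	present = rcu_dereference(monr->pmonrs[pkgid]) != NULL;
	rcu_read_unlock();

	return present;
}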

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 161 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h |  20 +++++-
 2 files changed, 177 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 39f4bfa..06e6325 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -103,13 +103,49 @@ static void monr_hrchy_release_locks(unsigned long *flags)
 	raw_local_irq_restore(*flags);
 }
 
+static inline struct pmonr *pkgd_pmonr(struct pkg_data *pkgd, struct monr *monr)
+{
+#ifdef CONFIG_LOCKDEP
+	bool safe = lockdep_is_held(&cmt_mutex) ||
+		    lockdep_is_held(&pkgd->lock) ||
+		    rcu_read_lock_held();
+#endif
+
+	return rcu_dereference_check(monr->pmonrs[pkgd->pkgid], safe);
+}
+
+static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
+{
+	struct pmonr *pmonr;
+	int cpu_node = cpu_to_node(pkgd->work_cpu);
+
+	pmonr = kzalloc_node(sizeof(*pmonr), GFP_KERNEL, cpu_node);
+	if (!pmonr)
+		return ERR_PTR(-ENOMEM);
+
+	pmonr->pkgd = pkgd;
+
+	return pmonr;
+}
+
 static void monr_dealloc(struct monr *monr)
 {
+	u16 p, nr_pkgs = topology_max_packages();
+
+	for (p = 0; p < nr_pkgs; p++) {
+		/* out of monr_hrchy, so no need for rcu or lock protection. */
+		if (!monr->pmonrs[p])
+			continue;
+		kfree(monr->pmonrs[p]);
+	}
 	kfree(monr);
 }
 
+/* Alloc monr with all pmonrs in Off state. */
 static struct monr *monr_alloc(void)
 {
+	struct pkg_data *pkgd = NULL;
+	struct pmonr *pmonr;
 	struct monr *monr;
 
 	lockdep_assert_held(&cmt_mutex);
@@ -122,6 +158,28 @@ static struct monr *monr_alloc(void)
 	INIT_LIST_HEAD(&monr->children);
 	INIT_LIST_HEAD(&monr->parent_entry);
 
+	monr->pmonrs = kcalloc(topology_max_packages(),
+			       sizeof(pmonr), GFP_KERNEL);
+	if (!monr->pmonrs) {
+		monr_dealloc(monr);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Do not create pmonrs for uninitialized packages.
+	 * Protected from initialization of new pkgs by cmt_mutex.
+	 */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		pmonr = pmonr_alloc(pkgd);
+		if (IS_ERR(pmonr)) {
+			monr_dealloc(monr);
+			return ERR_CAST(pmonr);
+		}
+		pmonr->monr = monr;
+		/* safe to assign since pmonr is not in monr_hrchy. */
+		RCU_INIT_POINTER(monr->pmonrs[pkgd->pkgid], pmonr);
+	}
+
 	return monr;
 }
 
@@ -318,6 +376,69 @@ static int mon_group_setup_event(struct perf_event *event)
 	return monr_hrchy_attach_event(event);
 }
 
+static struct monr *monr_next_child(struct monr *pos, struct monr *parent)
+{
+	if (!pos)
+		return list_first_entry_or_null(
+			&parent->children, struct monr, parent_entry);
+	if (list_is_last(&pos->parent_entry, &parent->children))
+		return NULL;
+
+	return list_next_entry(pos, parent_entry);
+}
+
+static struct monr *monr_next_descendant_pre(struct monr *pos,
+					     struct monr *root)
+{
+	struct monr *next;
+
+	if (!pos)
+		return root;
+
+	next = monr_next_child(NULL, pos);
+	if (next)
+		return next;
+
+	while (pos != root) {
+		next = monr_next_child(pos, pos->parent);
+		if (next)
+			return next;
+		pos = pos->parent;
+	}
+
+	return NULL;
+}
+
+static struct monr *monr_leftmost_descendant(struct monr *pos)
+{
+	struct monr *last;
+
+	do {
+		last = pos;
+		pos = monr_next_child(NULL, pos);
+	} while (pos);
+
+	return last;
+}
+
+static struct monr *monr_next_descendant_post(struct monr *pos,
+					      struct monr *root)
+{
+	struct monr *next;
+
+	if (!pos)
+		return monr_leftmost_descendant(root);
+
+	if (pos == root)
+		return NULL;
+
+	next = monr_next_child(pos, pos->parent);
+	if (next)
+		return monr_leftmost_descendant(next);
+
+	return pos->parent;
+}
+
 static void intel_cmt_event_read(struct perf_event *event)
 {
 }
@@ -482,14 +603,29 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 
 static void __terminate_pkg_data(struct pkg_data *pkgd)
 {
+	struct monr *pos = NULL;
+	unsigned long flags;
+
 	lockdep_assert_held(&cmt_mutex);
 
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	/* post-order traversal guarantees pos to be leaf of monr hierarchy. */
+	while ((pos = monr_next_descendant_post(pos, monr_hrchy_root)))
+		RCU_INIT_POINTER(pos->pmonrs[pkgd->pkgid], NULL);
+
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	synchronize_rcu();
+
 	free_pkg_data(pkgd);
 }
 
 static int init_pkg_data(int cpu)
 {
+	struct monr *pos = NULL;
 	struct pkg_data *pkgd;
+	struct pmonr *pmonr;
+	int err = 0;
 	u16 pkgid = topology_logical_package_id(cpu);
 
 	lockdep_assert_held(&cmt_mutex);
@@ -502,10 +638,28 @@ static int init_pkg_data(int cpu)
 	if (IS_ERR(pkgd))
 		return PTR_ERR(pkgd);
 
-	rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
-	synchronize_rcu();
+	while ((pos = monr_next_descendant_pre(pos, monr_hrchy_root))) {
+		pmonr = pmonr_alloc(pkgd);
+		if (IS_ERR(pmonr)) {
+			err = PTR_ERR(pmonr);
+			break;
+		}
+		pmonr->monr = pos;
+		/*
+		 * No need to protect pmonrs since this pkgd is
+		 * not set in cmt_pkgs_data yet.
+		 */
+		RCU_INIT_POINTER(pos->pmonrs[pkgid], pmonr);
+	}
 
-	return 0;
+	if (err) {
+		__terminate_pkg_data(pkgd);
+	} else {
+		rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
+		synchronize_rcu();
+	}
+
+	return err;
 }
 
 static int intel_cmt_hp_online_enter(unsigned int cpu)
@@ -604,6 +758,7 @@ static int __init cmt_alloc(void)
 	if (!cmt_pkgs_data)
 		return -ENOMEM;
 
+	/* won't alloc any pmonrs since no cmt_pkgs_data entry is initialized yet. */
 	mutex_lock(&cmt_mutex);
 	monr_hrchy_root = monr_alloc();
 	mutex_unlock(&cmt_mutex);
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 46e8335..7f3a7b8 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -27,6 +27,9 @@
  * and to make possible to capture dependencies between threads in the same
  * cgroup or process.
  *
+ * Each monr has a package monr (pmonr) for each package with at least one
+ * online cpu. The pmonr handles the CMT and MBM monitoring within its package.
+ *
  *
  * Locking
  *
@@ -38,8 +41,19 @@
  *  cgroup start/stop.
  *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
  *  change the monr hierarchy.
- *  - pkgd->lock: Hold in current package to access that pkgd's members.
+ *  - pkgd->lock: Hold in current package to access that pkgd's members. Hold
+ *  a pmonr's package pkgd->lock for non-atomic access to pmonr.
+ */
+
+/**
+ * struct pmonr - per-package component of MONitored Resources (monr).
+ * @monr:		The monr that contains this pmonr.
+ * @pkgd:		The package data associated with this pmonr.
  */
+struct pmonr {
+	struct monr				*monr;
+	struct pkg_data				*pkgd;
+};
 
 /**
  * struct pkg_data - Per-package CMT data.
@@ -65,6 +79,7 @@ struct pkg_data {
  * struct monr - MONitored Resource.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
+ * @pmonrs:		Per-package pmonrs.
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
@@ -72,10 +87,13 @@ struct pkg_data {
  * A monr is assigned to every CMT event and/or monitored cgroup when
  * monitoring is activated and that instance's address does not change during
  * the lifetime of the event or cgroup.
+ *
+ * On initialization, all of a monr's pmonrs start in the Off state.
  */
 struct monr {
 	struct perf_event		*mon_events;
 	struct list_head		entry;
+	struct pmonr			**pmonrs;
 
 	struct monr			*parent;
 	struct list_head		children;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (9 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools David Carrillo-Cisneros
                   ` (34 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

uflags allow users to signal special behavior for a pmonr. This patch
series introduces two uflags that provide new behavior and are relevant
to users:
  1) CMT_UF_NOLAZY_RMID: signal that rmids must be reserved immediately.
  2) CMT_UF_NOSTEAL_RMID: rmids cannot be stolen.

A monr maintains one cmt_user_flags field at "monr level" and a set of
"package level" ones, one per possible hardware package.

The effective uflags for a pmonr are the OR of its monr-level uflags and
the package-level uflags of the pmonr's pkgd.

A user passes uflags for all pmonrs in an event's monr by setting them
in the perf_event_attr::config1 field. In future patches in this series,
users could specify per package uflags through attributes in the
perf cgroup fs.

This patch only introduces the infrastructure to maintain uflags and the
first uflag, CMT_UF_HAS_USER, which marks monrs and pmonrs as in use by
a cgroup or event. This flag is special because it is always taken as set
for a perf event, regardless of the value in event->attr.config1.
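
As an aside (not part of the patch), the effective-uflags rule above can be
pictured with a small stand-alone C sketch. All names and the package count
below are illustrative assumptions, not the kernel code:

	#include <stdio.h>

	enum cmt_user_flags {
		CMT_UF_HAS_USER = 1 << 0,	/* has cgroup or event users */
	};

	#define NR_PKGS 2	/* assumed number of packages for the sketch */

	struct monr_sketch {
		enum cmt_user_flags uflags;		 /* monr level */
		enum cmt_user_flags pkg_uflags[NR_PKGS]; /* package level */
	};

	/* Effective uflags for the pmonr of @m in package @pkgid. */
	static enum cmt_user_flags effective_uflags(struct monr_sketch *m,
						    int pkgid)
	{
		return m->uflags | m->pkg_uflags[pkgid];
	}

	int main(void)
	{
		struct monr_sketch m = {
			.uflags = 0,
			.pkg_uflags = { CMT_UF_HAS_USER, 0 },
		};

		/* only package 0 is marked as having a user. */
		printf("pkg0=%#x pkg1=%#x\n",
		       effective_uflags(&m, 0), effective_uflags(&m, 1));
		return 0;
	}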

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 166 ++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/cmt.h |  18 +++++
 2 files changed, 180 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 06e6325..07560e5 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -29,6 +29,13 @@ static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
 /* Root for system-wide hierarchy of MONitored Resources (monr). */
 static struct monr *monr_hrchy_root;
 
+/* Flags for root monr and all its pmonrs while being monitored. */
+static enum cmt_user_flags root_monr_uflags = CMT_UF_HAS_USER;
+
+/* Auxiliary flags */
+static enum cmt_user_flags *pkg_uflags_zeroes;
+static size_t pkg_uflags_size;
+
 /* Array of packages (array of pkgds). It's protected by RCU or cmt_mutex. */
 static struct pkg_data **cmt_pkgs_data;
 
@@ -128,10 +135,19 @@ static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 	return pmonr;
 }
 
+static inline bool monr_is_root(struct monr *monr)
+{
+	return monr_hrchy_root == monr;
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
 
+	if (WARN_ON_ONCE(monr->nr_has_user) ||
+	    WARN_ON_ONCE(monr->mon_events))
+		return;
+
 	for (p = 0; p < nr_pkgs; p++) {
 		/* out of monr_hrchy, so no need for rcu or lock protection. */
 		if (!monr->pmonrs[p])
@@ -150,7 +166,8 @@ static struct monr *monr_alloc(void)
 
 	lockdep_assert_held(&cmt_mutex);
 
-	monr = kzalloc(sizeof(*monr), GFP_KERNEL);
+	/* Extra space for pkg_uflags. */
+	monr = kzalloc(sizeof(*monr) + pkg_uflags_size, GFP_KERNEL);
 	if (!monr)
 		return ERR_PTR(-ENOMEM);
 
@@ -183,14 +200,118 @@ static struct monr *monr_alloc(void)
 	return monr;
 }
 
+static enum cmt_user_flags pmonr_uflags(struct pmonr *pmonr)
+{
+	struct monr *monr = pmonr->monr;
+
+	return monr->uflags | monr->pkg_uflags[pmonr->pkgd->pkgid];
+}
+
+static int __pmonr_apply_uflags(struct pmonr *pmonr,
+		enum cmt_user_flags pmonr_uflags)
+{
+	if (monr_is_root(pmonr->monr) && (~pmonr_uflags & root_monr_uflags))
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool pkg_uflags_has_user(enum cmt_user_flags *uflags)
+{
+	int p, nr_pkgs = topology_max_packages();
+
+	for (p = 0; p < nr_pkgs; p++)
+		if (uflags[p] & CMT_UF_HAS_USER)
+			return true;
+	return false;
+}
+
+static bool monr_has_user(struct monr *monr)
+{
+	return monr->uflags & CMT_UF_HAS_USER ||
+	       pkg_uflags_has_user(monr->pkg_uflags);
+}
+
+static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
+{
+	enum cmt_user_flags pmonr_uflags;
+	struct pkg_data *pkgd = NULL;
+	struct pmonr *pmonr;
+	int p, err;
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		p = pkgd->pkgid;
+		pmonr_uflags = monr->uflags |
+				(puflags ? puflags[p] : monr->pkg_uflags[p]);
+		pmonr = pkgd_pmonr(pkgd, monr);
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+/* Apply puflags for all packages or rollback and fail. */
+static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
+{
+	int p, err;
+	unsigned long flags;
+
+	monr_hrchy_assert_held_mutexes();
+	monr_hrchy_acquire_locks(&flags);
+
+	err = __monr_apply_uflags(monr, puflags);
+	if (err)
+		goto exit;
+
+	/* Proceed to exit if there are no uflags to store in pkg_uflags. */
+	if (!puflags)
+		goto exit;
+
+	/*
+	 * Now that we've succeeded in applying puflags to online packages,
+	 * store the new puflags in all packages, even those not online. It's
+	 * up to CPU hotplug to apply the pkg_uflags in oncoming packages.
+	 */
+	for (p = 0; p < topology_max_packages(); p++)
+		monr->pkg_uflags[p] = puflags[p];
+
+exit:
+	monr_hrchy_release_locks(&flags);
+
+	return err;
+}
+
 static inline struct monr *monr_from_event(struct perf_event *event)
 {
 	return (struct monr *) READ_ONCE(event->hw.cmt_monr);
 }
 
+static enum cmt_user_flags uflags_from_event(struct perf_event *event)
+{
+	return event->attr.config1 | CMT_UF_HAS_USER;
+}
+
+/* Return true if monr uflags changed, false otherwise. */
+static bool monr_account_uflags(struct monr *monr,
+				enum cmt_user_flags uflags, bool account)
+{
+	enum cmt_user_flags old_flags = monr->uflags;
+
+	if (uflags & CMT_UF_HAS_USER)
+		monr->nr_has_user += account ? 1 : -1;
+
+	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0);
+
+	return old_flags != monr->uflags;
+}
+
 static struct monr *monr_remove_event(struct perf_event *event)
 {
 	struct monr *monr = monr_from_event(event);
+	enum cmt_user_flags uflags = uflags_from_event(event);
+	int err;
 
 	lockdep_assert_held(&cmt_mutex);
 	monr_hrchy_assert_held_mutexes();
@@ -207,11 +328,23 @@ static struct monr *monr_remove_event(struct perf_event *event)
 
 	WRITE_ONCE(event->hw.cmt_monr, NULL);
 
+	if (monr_account_uflags(monr, uflags, false)) {
+		/*
+		 * Undo flags on error; this cannot fail since flags require
+		 * rmids and fewer flags mean fewer rmids required.
+		 */
+		err = monr_apply_uflags(monr, NULL);
+		WARN_ON_ONCE(err);
+	}
+
 	return monr;
 }
 
 static int monr_append_event(struct monr *monr, struct perf_event *event)
 {
+	enum cmt_user_flags uflags = uflags_from_event(event);
+	int err;
+
 	lockdep_assert_held(&cmt_mutex);
 	monr_hrchy_assert_held_mutexes();
 
@@ -225,7 +358,14 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 
 	WRITE_ONCE(event->hw.cmt_monr, monr);
 
-	return 0;
+	if (!monr_account_uflags(monr, uflags, true))
+		return 0;
+
+	err = monr_apply_uflags(monr, NULL);
+	if (err)
+		monr_remove_event(event);
+
+	return err;
 }
 
 static void monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
@@ -465,7 +605,8 @@ static void intel_cmt_event_destroy(struct perf_event *event)
 
+	/* monr is detached from event. */
 	monr = monr_remove_event(event);
-	monr_destroy(monr);
+	if (!monr_has_user(monr))
+		monr_destroy(monr);
 
 	monr_hrchy_release_mutexes();
 	mutex_unlock(&cmt_mutex);
@@ -625,6 +766,7 @@ static int init_pkg_data(int cpu)
 	struct monr *pos = NULL;
 	struct pkg_data *pkgd;
 	struct pmonr *pmonr;
+	unsigned long flags;
 	int err = 0;
 	u16 pkgid = topology_logical_package_id(cpu);
 
@@ -650,6 +792,10 @@ static int init_pkg_data(int cpu)
 		 * not set in cmt_pkgs_data yet.
 		 */
 		RCU_INIT_POINTER(pos->pmonrs[pkgid], pmonr);
+
+		raw_spin_lock_irqsave(&pkgd->lock, flags);
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr));
+		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
 	}
 
 	if (err) {
@@ -739,6 +885,9 @@ static void cmt_dealloc(void)
 
 	kfree(cmt_pkgs_data);
 	cmt_pkgs_data = NULL;
+
+	kfree(pkg_uflags_zeroes);
+	pkg_uflags_zeroes = NULL;
 }
 
 static void cmt_stop(void)
@@ -749,6 +898,11 @@ static void cmt_stop(void)
 
 static int __init cmt_alloc(void)
 {
+	pkg_uflags_size = sizeof(*pkg_uflags_zeroes) * topology_max_packages();
+	pkg_uflags_zeroes = kzalloc(pkg_uflags_size, GFP_KERNEL);
+	if (!pkg_uflags_zeroes)
+		return -ENOMEM;
+
 	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
 	if (cmt_l3_scale == 0)
 		cmt_l3_scale = 1;
@@ -771,7 +925,11 @@ static int __init cmt_alloc(void)
 static int __init cmt_start(void)
 {
 	char *str, scale[20];
-	int err;
+	int err, p;
+
+	monr_account_uflags(monr_hrchy_root, root_monr_uflags, true);
+	for (p = 0; p < topology_max_packages(); p++)
+		monr_hrchy_root->pkg_uflags[p] = root_monr_uflags;
 
 	/* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
 	__min_max_rmid = UINT_MAX;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 7f3a7b8..66b078a 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -76,6 +76,16 @@ struct pkg_data {
 };
 
 /**
+ * enum cmt_user_flags - user set flags for monr and pmonrs.
+ */
+enum cmt_user_flags {
+	/* if CMT_UF_HAS_USER is not set, other flags are meaningless. */
+	CMT_UF_HAS_USER		= BIT(0), /* has cgroup or event users */
+	CMT_UF_MAX		= BIT(1) - 1,
+	CMT_UF_ERROR		= CMT_UF_MAX + 1,
+};
+
+/**
  * struct monr - MONitored Resource.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
@@ -83,6 +93,10 @@ struct pkg_data {
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
+ * @nr_has_user:	nr of events in mon_events with CMT_UF_HAS_USER set.
+ * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
+ * @pkg_uflags:		package level cmt_user_flags, each entry is used as
+ *			pmonr uflags if that package is online.
  *
  * A monr is assigned to every CMT event and/or monitored cgroup when
  * monitoring is activated and that instance's address does not change during
@@ -98,4 +112,8 @@ struct monr {
 	struct monr			*parent;
 	struct list_head		children;
 	struct list_head		parent_entry;
+
+	int				nr_has_user;
+	enum cmt_user_flags		uflags;
+	enum cmt_user_flags		pkg_uflags[];
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (10 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states David Carrillo-Cisneros
                   ` (33 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A Resource Monitoring ID (RMID) is a hardware ID used to track cache
occupancy and memory bandwidth. The rmids are a per-package resource
and only one can be programmed at a time per logical CPU.

This patch series creates per-package rmid pools and (by default)
lazy allocation of rmids (an rmid is only reserved when a thread runs in
a package) to potentially allow more simultaneous rmid users than the
system-wide approach of the previous CQM/CMT driver.
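
For illustration only (not part of the patch), a user-space sketch of the
per-package free-rmid pool with lazy, first-use allocation; the names and
sizes are assumptions:

	#include <limits.h>
	#include <stdio.h>

	#define MAX_RMIDS	8
	#define INVALID_RMID	UINT_MAX

	/* Per-package pool of free rmids, kept as a small bitmap. */
	struct pkg_pool {
		unsigned long free_rmids;	/* bit set => rmid is free */
	};

	/* Reserve an rmid only when a monitored thread first runs here. */
	static unsigned int pkg_alloc_rmid(struct pkg_pool *p)
	{
		unsigned int r;

		for (r = 0; r < MAX_RMIDS; r++) {
			if (p->free_rmids & (1UL << r)) {
				p->free_rmids &= ~(1UL << r);
				return r;
			}
		}
		return INVALID_RMID;	/* this package ran out of rmids */
	}

	int main(void)
	{
		struct pkg_pool pkg0 = { .free_rmids = (1UL << MAX_RMIDS) - 1 };

		printf("first: %u\n", pkg_alloc_rmid(&pkg0));
		printf("second: %u\n", pkg_alloc_rmid(&pkg0));
		return 0;
	}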

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c |  5 +++++
 arch/x86/events/intel/cmt.h | 24 +++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 07560e5..5799816 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -725,6 +725,11 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 		return ERR_PTR(-ENOMEM);
 
 	pkgd->max_rmid = c->x86_cache_max_rmid;
+	if (pkgd->max_rmid >= CMT_MAX_NR_RMIDS) {
+		pr_err("CPU Package %d supports %d RMIDs. Using %d only.\n",
+		       pkgid, pkgd->max_rmid, CMT_MAX_NR_RMIDS - 1);
+		pkgd->max_rmid = CMT_MAX_NR_RMIDS - 1;
+	}
 
 	mutex_init(&pkgd->mutex);
 	raw_spin_lock_init(&pkgd->lock);
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 66b078a..6211392 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -3,6 +3,11 @@
  * (formerly Intel Cache QoS Monitoring, CQM)
  *
  *
+ * A Resource Monitoring ID (RMID) is a hardware ID used in Intel RDT to
+ * monitor cache and memory events such as LLC Occupancy and Memory
+ * Bandwidth. Changes in such metrics that are caused by a CPU are
+ * counted towards the rmid active in that CPU at the time.
+ *
  * A "Monitored Resource" (monr) is the entity monitored by CMT and MBM.
  * In order to monitor a cgroup and/or thread, it must be associated to
  * a monr. A monr is active in a CPU when a thread that is associated to
@@ -28,7 +33,8 @@
  * cgroup or process.
  *
  * Each monr has a package monr (pmonr) for each package with at least one
- * online cpu. The pmonr handles the CMT and MBM monitoring within its package.
+ * online cpu. The pmonr handles the CMT and MBM monitoring within its package
+ * by managing the rmid to write into each CPU that runs a monitored thread.
  *
  *
  * Locking
@@ -55,9 +61,22 @@ struct pmonr {
 	struct pkg_data				*pkgd;
 };
 
+/*
+ * Compile-time constant required for bitmap macros.
+ * Broadwell EP has 2 rmids per logical core; use twice that as an upper bound.
+ * 128 is a reasonable upper bound for logical cores per package for the
+ * foreseeable future. Adjust as CPUs grow.
+ */
+#define CMT_MAX_NR_RMIDS	(2 * 2 * 128)
+#define CMT_MAX_NR_RMIDS_BYTES	DIV_ROUND_UP(CMT_MAX_NR_RMIDS, BITS_PER_BYTE)
+#define CMT_MAX_NR_RMIDS_LONGS	BITS_TO_LONGS(CMT_MAX_NR_RMIDS)
+
 /**
  * struct pkg_data - Per-package CMT data.
  *
+ * @free_rmids:			Pool of free rmids.
+ * @dirty_rmids:		Pool of "dirty" rmids that are not referenced
+ *				by a pmonr.
  * @mutex:			Hold when modifying this pkg_data.
  * @lock:			Hold to protect pmonrs in this pkg_data.
  * @work_cpu:			CPU to run rotation and other batch jobs.
@@ -67,6 +86,9 @@ struct pmonr {
  * @pkgid:			The logical package id for this pkgd.
  */
 struct pkg_data {
+	unsigned long		free_rmids[CMT_MAX_NR_RMIDS_LONGS];
+	unsigned long		dirty_rmids[CMT_MAX_NR_RMIDS_LONGS];
+
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (11 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states David Carrillo-Cisneros
                   ` (32 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A pmonr uses a state machine to keep track of its rmids and their
hierarchical dependencies with other pmonrs in the same package.

This patch introduces the first two states of that state machine.
It also adds pmonr_rmids: a word-size container to atomically access
a pmonr's sched and read rmids.

More details in code's comments.
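
As an aside (not the kernel code), a minimal C11 sketch of the word-size
container idea: the sched and read rmids are packed into one 64-bit value
so both can be read with a single atomic load. Names are illustrative:

	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	#define INVALID_RMID	((uint32_t)-1)

	/* Word-size summary of the two rmids, readable in one atomic load. */
	union rmids_sketch {
		uint64_t value;
		struct {
			uint32_t sched_rmid;
			uint32_t read_rmid;
		};
	};

	static _Atomic uint64_t atomic_rmids;

	static void set_rmids(uint32_t sched, uint32_t read)
	{
		union rmids_sketch r = {
			.sched_rmid = sched,
			.read_rmid  = read,
		};

		atomic_store(&atomic_rmids, r.value);
	}

	int main(void)
	{
		union rmids_sketch r;

		set_rmids(INVALID_RMID, 0);	/* e.g. an "Unused"-like encoding */
		r.value = atomic_load(&atomic_rmids);
		printf("sched=%#x read=%#x\n", r.sched_rmid, r.read_rmid);
		return 0;
	}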

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 57 +++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/cmt.h | 53 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 5799816..fb6877f 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -10,6 +10,8 @@
 #define QOS_L3_OCCUP_EVENT_ID	BIT_ULL(0)
 #define QOS_EVENT_MASK		QOS_L3_OCCUP_EVENT_ID
 
+#define INVALID_RMID		-1
+
 /* Increase as needed as Intel CPUs grow. */
 #define CMT_MAX_NR_PKGS		8
 
@@ -121,6 +123,16 @@ static inline struct pmonr *pkgd_pmonr(struct pkg_data *pkgd, struct monr *monr)
 	return rcu_dereference_check(monr->pmonrs[pkgd->pkgid], safe);
 }
 
+static inline void pmonr_set_rmids(struct pmonr *pmonr,
+				   u32 sched_rmid, u32 read_rmid)
+{
+	union pmonr_rmids rmids;
+
+	rmids.sched_rmid = sched_rmid;
+	rmids.read_rmid  = read_rmid;
+	atomic64_set(&pmonr->atomic_rmids, rmids.value);
+}
+
 static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 {
 	struct pmonr *pmonr;
@@ -130,6 +142,7 @@ static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 	if (!pmonr)
 		return ERR_PTR(-ENOMEM);
 
+	pmonr_set_rmids(pmonr, INVALID_RMID, INVALID_RMID);
 	pmonr->pkgd = pkgd;
 
 	return pmonr;
@@ -140,6 +153,29 @@ static inline bool monr_is_root(struct monr *monr)
 	return monr_hrchy_root == monr;
 }
 
+/* pkg_data lock is not required for transition from Off state. */
+static void pmonr_to_unused(struct pmonr *pmonr)
+{
+	/*
+	 * Do not warn on re-entering Unused state to simplify cleanup
+	 * of initialized pmonrs that were not scheduled.
+	 */
+	if (pmonr->state == PMONR_UNUSED)
+		return;
+
+	if (pmonr->state == PMONR_OFF) {
+		pmonr->state = PMONR_UNUSED;
+		pmonr_set_rmids(pmonr, INVALID_RMID, 0);
+		return;
+	}
+}
+
+static void pmonr_unused_to_off(struct pmonr *pmonr)
+{
+	pmonr->state = PMONR_OFF;
+	pmonr_set_rmids(pmonr, INVALID_RMID, 0);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
@@ -152,6 +188,8 @@ static void monr_dealloc(struct monr *monr)
 		/* out of monr_hrchy, so no need for rcu or lock protection. */
 		if (!monr->pmonrs[p])
 			continue;
+		if (WARN_ON_ONCE(monr->pmonrs[p]->state != PMONR_OFF))
+			continue;
 		kfree(monr->pmonrs[p]);
 	}
 	kfree(monr);
@@ -210,9 +248,20 @@ static enum cmt_user_flags pmonr_uflags(struct pmonr *pmonr)
 static int __pmonr_apply_uflags(struct pmonr *pmonr,
 		enum cmt_user_flags pmonr_uflags)
 {
+	if (!(pmonr_uflags & CMT_UF_HAS_USER)) {
+		if (pmonr->state != PMONR_OFF) {
+			pmonr_to_unused(pmonr);
+			pmonr_unused_to_off(pmonr);
+		}
+		return 0;
+	}
+
 	if (monr_is_root(pmonr->monr) && (~pmonr_uflags & root_monr_uflags))
 		return -EINVAL;
 
+	if (pmonr->state == PMONR_OFF)
+		pmonr_to_unused(pmonr);
+
 	return 0;
 }
 
@@ -750,15 +799,19 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 static void __terminate_pkg_data(struct pkg_data *pkgd)
 {
 	struct monr *pos = NULL;
+	struct pmonr *pmonr;
 	unsigned long flags;
 
 	lockdep_assert_held(&cmt_mutex);
 
 	raw_spin_lock_irqsave(&pkgd->lock, flags);
 	/* post-order traversal guarantees pos to be leaf of monr hierarchy. */
-	while ((pos = monr_next_descendant_post(pos, monr_hrchy_root)))
+	while ((pos = monr_next_descendant_post(pos, monr_hrchy_root))) {
+		pmonr = pkgd_pmonr(pkgd, pos);
+		pmonr_to_unused(pmonr);
+		pmonr_unused_to_off(pmonr);
 		RCU_INIT_POINTER(pos->pmonrs[pkgd->pkgid], NULL);
-
+	}
 	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
 
 	synchronize_rcu();
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 6211392..05325c8 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -52,13 +52,66 @@
  */
 
 /**
+ * enum pmonr_state - pmonrs can be in one of the following states:
+ *   - Off:	  pmonr is unavailable for monitoring. It's the starting state.
+ *   - Unused:	  pmonr is available for monitoring but no thread associated to
+ *		  this pmonr's monr has been scheduled in this pmonr's package.
+ *
+ * The valid state transitions are:
+ *
+ *    From:	|    To:			Cause:
+ *=============================================================================
+ *  Off		|  Unused	monitoring is enabled for a pmonr.
+ *-----------------------------------------------------------------------------
+ *  Unused	|  Off		monitoring is disabled for a pmonr.
+ *-----------------------------------------------------------------------------
+ */
+enum pmonr_state {
+	PMONR_OFF = 0,
+	PMONR_UNUSED,
+};
+
+/**
+ * union pmonr_rmids - Machine-size summary of a pmonr's rmid state.
+ * @value:		One word accesor.
+ * @sched_rmid:		The rmid to write in the PQR MSR in sched in/out.
+ * @read_rmid:		The rmid to read occupancy from.
+ *
+ * An atomically readable/writable summary of the rmids used by a pmonr.
+ * Its values can also be used to atomically read the state (preventing
+ * unnecessary locks of pkgd->lock) in the following way:
+ *					pmonr state
+ *	      |      Off         Unused
+ * ============================================================================
+ * sched_rmid |	INVALID_RMID  INVALID_RMID
+ * ----------------------------------------------------------------------------
+ *  read_rmid |	INVALID_RMID        0
+ *
+ */
+union pmonr_rmids {
+	long		value;
+	struct {
+		u32	sched_rmid;
+		u32	read_rmid;
+	};
+};
+
+/**
  * struct pmonr - per-package component of MONitored Resources (monr).
  * @monr:		The monr that contains this pmonr.
  * @pkgd:		The package data associated with this pmonr.
+ * @atomic_rmids:	Atomic accessor for this pmonr's rmids.
+ * @state:		The state for this pmonr, note that this can also
+ *			be inferred from the combination of sched_rmid and
+ *			read_rmid in @atomic_rmids.
  */
 struct pmonr {
 	struct monr				*monr;
 	struct pkg_data				*pkgd;
+
+	/* all writers are sync'ed by package's lock. */
+	atomic64_t				atomic_rmids;
+	enum pmonr_state			state;
 };
 
 /*
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (12 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache David Carrillo-Cisneros
                   ` (31 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add remaining states for pmonr's state machine:
  - Active: A pmonr that is actively used.
  - Dep_Idle: A pmonr that failed to obtain a rmid. It "borrows" its rmid
    from its lowest monitored (Active in same pkgd) ancestor in the
    monr hierarchy.
  - Dep_Dirty: A pmonr that was Active but has lost its rmid (due to rmid
    rotation, introduced later in this patch series). It is similar to
    Dep_Idle but keeps track of its former rmid in case there is a reuse
    opportunity in the future.

This patch adds the states and state transition functions for pmonrs.
It also adds infrastructure and usage statistics to struct pkg_data that
will be used later in this series.

The transitions Unused -> Active and Unused -> Dep_Idle are inline because
they will be called during task context switches the first time a monr
runs in a package (later in this series).

More details in code's comments.
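
For intuition only (not kernel code), a stand-alone sketch of the borrowing
idea behind Dep_Idle: a pmonr without its own rmid walks up to its lowest
Active ancestor and borrows that ancestor's rmid for scheduling, assuming
the root is always Active. All names are made up:

	#include <stdio.h>

	enum state { SK_ACTIVE, SK_DEP_IDLE };

	struct node {
		struct node *parent;
		enum state state;
		unsigned int rmid;
	};

	/*
	 * Walk up the hierarchy to the lowest Active ancestor and borrow
	 * its rmid for scheduling. Must not be called on the root, which
	 * is assumed to always be Active and to own its rmid.
	 */
	static unsigned int borrow_sched_rmid(struct node *n)
	{
		struct node *p = n->parent;

		while (p->state != SK_ACTIVE)
			p = p->parent;
		return p->rmid;
	}

	int main(void)
	{
		struct node root = { .state = SK_ACTIVE, .rmid = 0 };
		struct node mid  = { .parent = &root, .state = SK_DEP_IDLE };
		struct node leaf = { .parent = &mid,  .state = SK_DEP_IDLE };

		printf("leaf borrows rmid %u\n", borrow_sched_rmid(&leaf));
		return 0;
	}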

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 237 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |  95 +++++++++++++++++-
 2 files changed, 329 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index fb6877f..86c3013 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -142,6 +142,10 @@ static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 	if (!pmonr)
 		return ERR_PTR(-ENOMEM);
 
+	/* pmonr_deps_{head, entry} are in a union, initialize one of them. */
+	INIT_LIST_HEAD(&pmonr->pmonr_deps_head);
+	INIT_LIST_HEAD(&pmonr->pkgd_deps_entry);
+	INIT_LIST_HEAD(&pmonr->rot_entry);
 	pmonr_set_rmids(pmonr, INVALID_RMID, INVALID_RMID);
 	pmonr->pkgd = pkgd;
 
@@ -153,9 +157,108 @@ static inline bool monr_is_root(struct monr *monr)
 	return monr_hrchy_root == monr;
 }
 
+/*
+ * Test ancestry within the monr hierarchy.
+ * Return true if @a is an ancestor of @b or equal to it.
+ */
+static inline bool monr_hrchy_is_ancestor(struct monr *a, struct monr *b)
+{
+	if (monr_hrchy_root == a || a == b)
+		return true;
+	if (monr_hrchy_root == b)
+		return false;
+
+	b = b->parent;
+	/* Break at the root */
+	while (b != monr_hrchy_root) {
+		if (a == b)
+			return true;
+		b = b->parent;
+	}
+
+	return false;
+}
+
+/**
+ * pmonr_find_lma() - Find Lowest Monitored Ancestor (lma) of a pmonr.
+ * @pmonr:		The pmonr to start the search on.
+ *
+ * Always succeeds since monr_hrchy_root's pmonrs are always in Active state.
+ * Return: lma of @pmonr.
+ */
+static struct pmonr *pmonr_find_lma(struct pmonr *pmonr)
+{
+	struct monr *monr = pmonr->monr;
+	struct pkg_data *pkgd = pmonr->pkgd;
+
+	lockdep_assert_held(&pkgd->lock);
+
+	while ((monr = monr->parent)) {
+		/* protected by pkgd lock. */
+		pmonr = pkgd_pmonr(pkgd, monr);
+		if (pmonr->state == PMONR_ACTIVE)
+			return pmonr;
+	}
+	/* Should have hit monr_hrchy_root. */
+	WARN_ON_ONCE(true);
+
+	return pkgd_pmonr(pkgd, monr_hrchy_root);
+}
+
+/**
+ * pmonr_move_all_dependants() - Move all dependants from @old lender to @new.
+ * @old: Old lender.
+ * @new: New lender.
+ *
+ * @new->monr must be ancestor of @old->monr and they must be distinct.
+ */
+static void pmonr_move_all_dependants(struct pmonr *old, struct pmonr *new)
+{
+	struct pmonr *dep;
+	union pmonr_rmids dep_rmids, new_rmids;
+
+	new_rmids.value = atomic64_read(&new->atomic_rmids);
+	/* Update this pmonr's dependants to depend on new lender. */
+	list_for_each_entry(dep, &old->pmonr_deps_head, pmonr_deps_entry) {
+		dep->lender = new;
+		dep_rmids.value = atomic64_read(&dep->atomic_rmids);
+		pmonr_set_rmids(dep, new_rmids.sched_rmid, dep_rmids.read_rmid);
+	}
+	list_splice_tail_init(&old->pmonr_deps_head, &new->pmonr_deps_head);
+}
+
+/**
+ * pmonr_move_dependants() - Move some dependants from @old lender to @new.
+ *
+ * Move @old's dependants that are @new->monr descendants to be @new's
+ * dependants. As opposed to pmonr_move_all_dependants, @new->monr does not
+ * need to be an ancestor of @old->monr.
+ */
+static inline void pmonr_move_dependants(struct pmonr *old, struct pmonr *new)
+{
+	struct pmonr *dep, *tmp;
+	union pmonr_rmids dep_rmids, new_rmids;
+
+	new_rmids.value = atomic64_read(&new->atomic_rmids);
+
+	list_for_each_entry_safe(dep, tmp, &old->pmonr_deps_head,
+				 pmonr_deps_entry) {
+		if (!monr_hrchy_is_ancestor(new->monr, dep->monr))
+			continue;
+		list_move_tail(&dep->pmonr_deps_entry, &new->pmonr_deps_head);
+		dep->lender = new;
+		dep_rmids.value = atomic64_read(&dep->atomic_rmids);
+		pmonr_set_rmids(dep, new_rmids.sched_rmid, dep_rmids.read_rmid);
+	}
+}
+
 /* pkg_data lock is not required for transition from Off state. */
 static void pmonr_to_unused(struct pmonr *pmonr)
 {
+	struct pkg_data *pkgd = pmonr->pkgd;
+	struct pmonr *lender;
+	union pmonr_rmids rmids;
+
 	/*
 	 * Do not warn on re-entering Unused state to simplify cleanup
 	 * of initialized pmonrs that were not scheduled.
@@ -168,6 +271,98 @@ static void pmonr_to_unused(struct pmonr *pmonr)
 		pmonr_set_rmids(pmonr, INVALID_RMID, 0);
 		return;
 	}
+
+	lockdep_assert_held(&pkgd->lock);
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+
+	if (pmonr->state == PMONR_ACTIVE) {
+		if (monr_is_root(pmonr->monr)) {
+			WARN_ON_ONCE(!list_empty(&pmonr->pmonr_deps_head));
+		} else {
+			lender = pmonr_find_lma(pmonr);
+			pmonr_move_all_dependants(pmonr, lender);
+		}
+		__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
+
+	} else if (pmonr->state == PMONR_DEP_IDLE ||
+		   pmonr->state == PMONR_DEP_DIRTY) {
+
+		pmonr->lender = NULL;
+		list_del_init(&pmonr->pmonr_deps_entry);
+		list_del_init(&pmonr->pkgd_deps_entry);
+
+		if (pmonr->state == PMONR_DEP_DIRTY)
+			__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
+		else
+			pkgd->nr_dep_pmonrs--;
+	} else {
+		WARN_ON_ONCE(true);
+		return;
+	}
+
+	list_del_init(&pmonr->rot_entry);
+	pmonr->state = PMONR_UNUSED;
+	pmonr_set_rmids(pmonr, INVALID_RMID, INVALID_RMID);
+}
+
+static inline void __pmonr_to_active_helper(struct pmonr *pmonr, u32 rmid)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+
+	list_move_tail(&pmonr->rot_entry, &pkgd->active_pmonrs);
+	pmonr->state = PMONR_ACTIVE;
+	pmonr_set_rmids(pmonr, rmid, rmid);
+	atomic64_set(&pmonr->last_enter_active, get_jiffies_64());
+}
+
+static inline void pmonr_unused_to_active(struct pmonr *pmonr, u32 rmid)
+{
+	struct pmonr *lender;
+
+	__clear_bit(rmid, pmonr->pkgd->free_rmids);
+	__pmonr_to_active_helper(pmonr, rmid);
+	/*
+	 * If monr is root, no ancestor exists to move pmonr to. If monr is
+	 * root's child, no dependants of its parent (root) could be moved.
+	 * Check both cases separately to avoid unnecessary calls to
+	 * pmonr_move_dependants.
+	 */
+	if (!monr_is_root(pmonr->monr) && !monr_is_root(pmonr->monr->parent)) {
+		lender = pmonr_find_lma(pmonr);
+		pmonr_move_dependants(lender, pmonr);
+	}
+}
+
+/* helper function for transitions to Dep_{Idle,Dirty} states. */
+static inline void __pmonr_to_dep_helper(
+	struct pmonr *pmonr, struct pmonr *lender, u32 read_rmid)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+	union pmonr_rmids lender_rmids;
+
+	pmonr->lender = lender;
+	list_move_tail(&pmonr->pmonr_deps_entry, &lender->pmonr_deps_head);
+	list_move_tail(&pmonr->pkgd_deps_entry, &pkgd->dep_pmonrs);
+
+	if (read_rmid == INVALID_RMID) {
+		list_move_tail(&pmonr->rot_entry, &pkgd->dep_idle_pmonrs);
+		pkgd->nr_dep_pmonrs++;
+		pmonr->state = PMONR_DEP_IDLE;
+	} else {
+		list_move_tail(&pmonr->rot_entry, &pkgd->dep_dirty_pmonrs);
+		pmonr->state = PMONR_DEP_DIRTY;
+	}
+
+	lender_rmids.value = atomic64_read(&lender->atomic_rmids);
+	pmonr_set_rmids(pmonr, lender_rmids.sched_rmid, read_rmid);
+}
+
+static inline void pmonr_unused_to_dep_idle(struct pmonr *pmonr)
+{
+	struct pmonr *lender;
+
+	lender = pmonr_find_lma(pmonr);
+	__pmonr_to_dep_helper(pmonr, lender, INVALID_RMID);
 }
 
 static void pmonr_unused_to_off(struct pmonr *pmonr)
@@ -176,6 +371,43 @@ static void pmonr_unused_to_off(struct pmonr *pmonr)
 	pmonr_set_rmids(pmonr, INVALID_RMID, 0);
 }
 
+static void pmonr_active_to_dep_dirty(struct pmonr *pmonr)
+{
+	struct pmonr *lender;
+	union pmonr_rmids rmids;
+
+	lender = pmonr_find_lma(pmonr);
+	pmonr_move_all_dependants(pmonr, lender);
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	__pmonr_to_dep_helper(pmonr, lender, rmids.read_rmid);
+}
+
+static void __pmonr_dep_to_active_helper(struct pmonr *pmonr, u32 rmid)
+{
+	list_del_init(&pmonr->pkgd_deps_entry);
+	/* pmonr will no longer be dependent on pmonr_lender. */
+	list_del_init(&pmonr->pmonr_deps_entry);
+	pmonr_move_dependants(pmonr->lender, pmonr);
+	pmonr->lender = NULL;
+	__pmonr_to_active_helper(pmonr, rmid);
+}
+
+static void pmonr_dep_idle_to_active(struct pmonr *pmonr, u32 rmid)
+{
+	__clear_bit(rmid, pmonr->pkgd->free_rmids);
+	pmonr->pkgd->nr_dep_pmonrs--;
+	__pmonr_dep_to_active_helper(pmonr, rmid);
+}
+
+static void pmonr_dep_dirty_to_active(struct pmonr *pmonr)
+{
+	union pmonr_rmids rmids;
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	__pmonr_dep_to_active_helper(pmonr, rmids.read_rmid);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
@@ -780,6 +1012,11 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 		pkgd->max_rmid = CMT_MAX_NR_RMIDS - 1;
 	}
 
+	INIT_LIST_HEAD(&pkgd->active_pmonrs);
+	INIT_LIST_HEAD(&pkgd->dep_idle_pmonrs);
+	INIT_LIST_HEAD(&pkgd->dep_dirty_pmonrs);
+	INIT_LIST_HEAD(&pkgd->dep_pmonrs);
+
 	mutex_init(&pkgd->mutex);
 	raw_spin_lock_init(&pkgd->lock);
 
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 05325c8..bf90c26 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -36,6 +36,21 @@
  * online cpu. The pmonr handles the CMT and MBM monitoring within its package
  * by managing the rmid to write into each CPU that runs a monitored thread.
  *
+ * The lma of a pmonr is its closest ancestor pmonr that is in Active state.
+ *
+ * A pmonr allocates a rmid when needed, depending on its state (see
+ * enum pmonr_state comments). If a pmonr fails to obtain a free rmid, it
+ * "borrows" the one used by its Lowest Monitored Ancestor (lma).
+ *
+ * The "borrowed" rmid is used when threads are scheduled in so that the
+ * occupancy and memory bandwidth for those threads are accounted for in the
+ * monr hierarchy. Yet, that pmonr cannot use a "borrowed" rmid to read,
+ * since that rmid is not counting the "borrower"'s monr cache events.
+ * Therefore, a pmonr uses rmids in two ways:
+ *   (1) to schedule, and (2) to read.
+ * When a pmonr owns a rmid (Active state), that rmid is used for both
+ * schedule and read.
+ *
  *
  * Locking
  *
@@ -56,6 +71,16 @@
  *   - Off:	  pmonr is unavailable for monitoring. It's the starting state.
  *   - Unused:	  pmonr is available for monitoring but no thread associated to
  *		  this pmonr's monr has been scheduled in this pmonr's package.
+ *   - Active:	  pmonr is actively used. It successfully obtained a free rmid
+ *		  to sched in/out and uses it to read pmonr's llc_occupancy.
+ *   - Dep_Idle:  pmonr failed to obtain its own free rmid and is borrowing the
+ *		  rmid from its lowest Active ancestor monr (its lma monr).
+ *   - Dep_Dirty: pmonr was Active but its rmid was stolen. This state differs
+ *		  from Dep_Idle in that the pmonr keeps a reference to its
+ *		  former Active rmid. If the pmonr becomes eligible to recoup
+ *		  its rmid in the near future, this previously used rmid can
+ *		  be reused even if "dirty" without introducing additional
+ *		  counting error.
  *
  * The valid state transitions are:
  *
@@ -64,11 +89,37 @@
  *  Off		|  Unused	monitoring is enabled for a pmonr.
  *-----------------------------------------------------------------------------
  *  Unused	|  Off		monitoring is disabled for a pmonr.
+ *		|--------------------------------------------------------------
+ *		|  Active	First thread associated to pmonr is scheduled
+ *		|		in package and a free rmid is available.
+ *		|--------------------------------------------------------------
+ *		|  Dep_Idle	Could not find a free rmid available.
+ *-----------------------------------------------------------------------------
+ *  Active	|  Dep_Dirty	rmid is stolen, keep reference to old rmid
+ *		|		in read_rmid, but it is not used to read.
+ *		|--------------------------------------------------------------
+ *		|  Unused	pmonr releases the rmid, released rmid can be
+ *		|		"dirty" and therefore goes to dirty_rmids.
+ *-----------------------------------------------------------------------------
+ *  Dep_Idle	|  Active	pmonr receives a "clean" rmid.
+ *		|--------------------------------------------------------------
+ *		|  Unused	pmonr is no longer waiting for rmid.
+ *-----------------------------------------------------------------------------
+ *  Dep_Dirty	|  Active	dirty rmid is reissued to pmonr that had it
+ *		|		before the transition.
+ *		|--------------------------------------------------------------
+ *		|  Dep_Idle	dirty rmid has become "clean" and is reissued
+ *		|		to a distinct pmonr (or goes to free_rmids).
+ *		|--------------------------------------------------------------
+ *		|  Unused	pmonr is no longer waiting for rmid.
  *-----------------------------------------------------------------------------
  */
 enum pmonr_state {
 	PMONR_OFF = 0,
 	PMONR_UNUSED,
+	PMONR_ACTIVE,
+	PMONR_DEP_IDLE,
+	PMONR_DEP_DIRTY,
 };
 
 /**
@@ -81,11 +132,11 @@ enum pmonr_state {
  * Its values can also used to atomically read the state (preventing
  * unnecessary locks of pkgd->lock) in the following way:
  *					pmonr state
- *	      |      Off         Unused
+ *	      |      Off         Unused       Active      Dep_Idle     Dep_Dirty
  * ============================================================================
- * sched_rmid |	INVALID_RMID  INVALID_RMID
+ * sched_rmid |	INVALID_RMID  INVALID_RMID    valid       lender's     lender's
  * ----------------------------------------------------------------------------
- *  read_rmid |	INVALID_RMID        0
+ *  read_rmid |	INVALID_RMID        0	      (same)    INVALID_RMID   old rmid
  *
  */
 union pmonr_rmids {
@@ -98,16 +149,42 @@ union pmonr_rmids {
 
 /**
  * struct pmonr - per-package component of MONitored Resources (monr).
+ * @lender:		if in Dep_Idle or Dep_Dirty state, it's the pmonr that
+ *			lends its rmid to this pmonr. NULL otherwise.
+ * @pmonr_deps_head:	List of pmonrs in Dep_Idle or Dep_Dirty state that
+ *			borrow their sched_rmid from this pmonr.
+ * @pmonr_deps_entry:	Entry into lender's @pmonr_deps_head when in Dep_Idle
+ *			or Dep_Dirty state.
+ * @pkgd_deps_entry:	When in Dep_Dirty state, the list entry for dep_pmonrs.
  * @monr:		The monr that contains this pmonr.
  * @pkgd:		The package data associated with this pmonr.
+ * @rot_entry:		List entry to attach to pmonr rotation lists in
+ *			pkg_data.
+ *
+ * @last_enter_active:	Time of last entry into Active state.
  * @atomic_rmids:	Atomic accessor for this pmonr's rmids.
  * @state:		The state for this pmonr, note that this can also
  *			be inferred from the combination of sched_rmid and
  *			read_rmid in @atomic_rmids.
  */
 struct pmonr {
+	struct pmonr				*lender;
+	/* save space with union since pmonr is in only one state at a time. */
+	union {
+		struct { /* variables for Active state. */
+			struct list_head	pmonr_deps_head;
+		};
+		struct { /* variables for Dep_Idle and Dep_Dirty states. */
+			struct list_head	pmonr_deps_entry;
+			struct list_head	pkgd_deps_entry;
+		};
+	};
+
 	struct monr				*monr;
 	struct pkg_data				*pkgd;
+	struct list_head			rot_entry;
+
+	atomic64_t				last_enter_active;
 
 	/* all writers are sync'ed by package's lock. */
 	atomic64_t				atomic_rmids;
@@ -130,7 +207,13 @@ struct pmonr {
  * @free_rmids:			Pool of free rmids.
  * @dirty_rmids:		Pool of "dirty" rmids that are not referenced
  *				by a pmonr.
+ * @active_pmonrs:		LRU of Active pmonrs.
+ * @dep_idle_pmonrs:		LRU of Dep_Idle pmonrs.
+ * @dep_dirty_pmonrs:		LRU of Dep_Dirty pmonrs.
+ * @dep_pmonrs:			LRU of Dep_Idle and Dep_Dirty pmonrs.
+ * @nr_dep_pmonrs:		nr Dep_Idle + nr Dep_Dirty pmonrs.
  * @mutex:			Hold when modifying this pkg_data.
+ * @mutex_key:			lockdep class for pkg_data's mutex.
  * @lock:			Hold to protect pmonrs in this pkg_data.
  * @work_cpu:			CPU to run rotation and other batch jobs.
  *				It must be in the package associated to its
@@ -142,6 +225,12 @@ struct pkg_data {
 	unsigned long		free_rmids[CMT_MAX_NR_RMIDS_LONGS];
 	unsigned long		dirty_rmids[CMT_MAX_NR_RMIDS_LONGS];
 
+	struct list_head	active_pmonrs;
+	struct list_head	dep_idle_pmonrs;
+	struct list_head	dep_dirty_pmonrs;
+	struct list_head	dep_pmonrs;
+	int			nr_dep_pmonrs;
+
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (13 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del David Carrillo-Cisneros
                   ` (30 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Encapsulate updates to the PQR_ASSOC MSR's RMID and CLOSID in
intel_rdt_common. Use the new interface in Intel CMT.

Change RDT common code to build for both CONFIG_INTEL_RDT_A and
CONFIG_INTEL_RDT_M.
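
For reference only, a rough user-space sketch of the general idea of a
software PQR cache: keep the last RMID/CLOSID pair and skip redundant MSR
writes. The helper names and the skip-if-unchanged policy below are
assumptions for illustration, not the exact semantics of this patch:

	#include <stdint.h>
	#include <stdio.h>

	/* Stand-in for the real wrmsr(); just logs the would-be MSR write. */
	static void fake_wrmsr(uint32_t rmid, uint32_t closid)
	{
		printf("PQR_ASSOC <- rmid=%u closid=%u\n", rmid, closid);
	}

	/* Software cache of the last value written to the (per-CPU) MSR. */
	struct pqr_cache {
		uint32_t rmid;
		uint32_t closid;
	};

	static void pqr_update_rmid(struct pqr_cache *c, uint32_t rmid)
	{
		if (c->rmid == rmid)	/* avoid a redundant MSR write */
			return;
		c->rmid = rmid;
		fake_wrmsr(c->rmid, c->closid);
	}

	int main(void)
	{
		struct pqr_cache cpu0 = { 0, 0 };

		pqr_update_rmid(&cpu0, 5);	/* writes */
		pqr_update_rmid(&cpu0, 5);	/* cached, no write */
		return 0;
	}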

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c             |  6 ++++++
 arch/x86/include/asm/intel_rdt_common.h | 30 ++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/Makefile            |  3 ++-
 arch/x86/kernel/cpu/intel_rdt_common.c  |  8 ++++++++
 4 files changed, 44 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 86c3013..ce5be74 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -4,6 +4,7 @@
 
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/intel_rdt_common.h>
 #include "cmt.h"
 #include "../perf_event.h"
 
@@ -1118,11 +1119,16 @@ static int intel_cmt_hp_online_enter(unsigned int cpu)
 	return 0;
 }
 
+/* Restore CPU's pqr_cache to initial state. */
 static int intel_cmt_hp_online_exit(unsigned int cpu)
 {
+	struct intel_pqr_state *state = per_cpu_ptr(&pqr_state, cpu);
 	struct pkg_data *pkgd;
 	u16 pkgid = topology_logical_package_id(cpu);
 
+	pqr_cache_update_rmid(0);
+	memset(state, 0, sizeof(*state));
+
 	rcu_read_lock();
 	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
 	if (pkgd->work_cpu == cpu)
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index b31081b..1d5e691 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -1,13 +1,21 @@
 #ifndef _ASM_X86_INTEL_RDT_COMMON_H
 #define _ASM_X86_INTEL_RDT_COMMON_H
 
+#if defined(CONFIG_INTEL_RDT_A) || defined(CONFIG_INTEL_RDT_M)
+
+#include <linux/types.h>
+#include <asm/percpu.h>
+#include <asm/msr.h>
+
 #define MSR_IA32_PQR_ASSOC	0x0c8f
 
+
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
  * @rmid:		The cached Resource Monitoring ID
+ * @next_rmid:		Next rmid to write to hw
  * @closid:		The cached Class Of Service ID
- * @rmid_usecnt:	The usage counter for rmid
+ * @next_closid:	Next closid to write to hw
  *
  * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
  * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -18,10 +26,28 @@
  */
 struct intel_pqr_state {
 	u32			rmid;
+	u32			next_rmid;
 	u32			closid;
-	int			rmid_usecnt;
+	u32			next_closid;
 };
 
 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 
+static inline void pqr_cache_update_rmid(u32 rmid)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	state->next_rmid = rmid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
+
+static inline void pqr_cache_update_closid(u32 closid)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	state->next_closid = closid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
+
+#endif
 #endif /* _ASM_X86_INTEL_RDT_COMMON_H */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index cf4bfd0..b095e65 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,7 +34,8 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o
+obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o intel_rdt_common.o
+obj-$(CONFIG_INTEL_RDT_M)	+= intel_rdt_common.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
new file mode 100644
index 0000000..7fd5b20
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -0,0 +1,8 @@
+#include <asm/intel_rdt_common.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Functions that modify pqr_state
+ * must ensure interruptions are handled properly.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (14 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID David Carrillo-Cisneros
                   ` (29 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Now that the pmonr state machine and pqr_common are in place, add
pmonr_update_sched_rmid to find the appropriate rmid to use. With it,
complete the bodies of the PMU functions that start/stop and add/del events.

A pmonr in Unused state tries to allocate a free rmid the first time one
of its monitored threads is scheduled in a CPU package (lazy allocation
of rmids). If there are no available rmids in that package, the pmonr
enters the Dep_Idle state (it borrows the sched_rmid from its
Lowest Monitored Ancestor (lma) pmonr).

When an event is stopped and no other event runs on that CPU, the PQR MSR
uses the rmid of monr_hrchy_root's pmonr for that CPU's package.

Details in pmonr state machine's comments.
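
As an illustration only (not the kernel code), a minimal pthread sketch of
the lock-free fast path described above: the rmid is read atomically and
the package lock is only taken on first use. The hard-coded rmid value is
a placeholder:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	#define INVALID_RMID	((uint32_t)-1)

	static _Atomic uint32_t sched_rmid = INVALID_RMID;
	static pthread_mutex_t pkg_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Lock-free fast path; fall back to the lock only on first use. */
	static uint32_t get_sched_rmid(void)
	{
		uint32_t r = atomic_load(&sched_rmid);

		if (r != INVALID_RMID)
			return r;		/* already assigned */

		pthread_mutex_lock(&pkg_lock);
		r = atomic_load(&sched_rmid);	/* re-check under the lock */
		if (r == INVALID_RMID) {
			r = 7;			/* pretend we found a free rmid */
			atomic_store(&sched_rmid, r);
		}
		pthread_mutex_unlock(&pkg_lock);
		return r;
	}

	int main(void)
	{
		printf("rmid: %u\n", get_sched_rmid());	/* slow path */
		printf("rmid: %u\n", get_sched_rmid());	/* fast path */
		return 0;
	}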

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 101 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index ce5be74..9421a3e 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -650,6 +650,74 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 	return err;
 }
 
+/**
+ * pmonr_update_sched_rmid() - Update sched_rmid for @pmonr in current package.
+ *
+ * Always finds valid rmids for a non-Off pmonr. Safe to call with IRQs disabled.
+ * A lock-free fast path reuses the rmid when the pmonr has been scheduled
+ * before in this package. Otherwise, tries to get a free rmid. On failure,
+ * enters Dep_Idle state and uses the rmid of its lender. There is always a
+ * pmonr to borrow from since monr_hrchy_root has all its pmonrs in Active
+ * state.
+ * Return: new pmonr_rmids for pmonr.
+ */
+static inline union pmonr_rmids pmonr_update_sched_rmid(struct pmonr *pmonr)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+	union pmonr_rmids rmids;
+	u32 free_rmid;
+
+	/* Use atomic_rmids to check state in a lock-free fastpath. */
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	if (rmids.sched_rmid != INVALID_RMID)
+		return rmids;
+
+	/* No need to obtain RMID if in Off state. */
+	if (rmids.sched_rmid == rmids.read_rmid)
+		return rmids;
+
+	/*
+	 * Lock-free path failed. Now acquire lock and verify that state
+	 * and atomic_rmids haven't changed. If still Unused, try to
+	 * obtain a free RMID.
+	 */
+	raw_spin_lock(&pkgd->lock);
+
+	/* With lock acquired it is ok to read pmonr::state. */
+	if (pmonr->state != PMONR_UNUSED) {
+		/* Update rmids in case they changed before acquiring lock. */
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		raw_spin_unlock(&pkgd->lock);
+		return rmids;
+	}
+
+	free_rmid = find_first_bit(pkgd->free_rmids, CMT_MAX_NR_RMIDS);
+	if (free_rmid == CMT_MAX_NR_RMIDS)
+		pmonr_unused_to_dep_idle(pmonr);
+	else
+		pmonr_unused_to_active(pmonr, free_rmid);
+
+	raw_spin_unlock(&pkgd->lock);
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+
+	return rmids;
+}
+
+static inline union pmonr_rmids monr_get_sched_in_rmids(struct monr *monr)
+{
+	struct pmonr *pmonr;
+	union pmonr_rmids rmids;
+	u16 pkgid = topology_logical_package_id(smp_processor_id());
+
+	rcu_read_lock();
+	pmonr = rcu_dereference(monr->pmonrs[pkgid]);
+	rmids = pmonr_update_sched_rmid(pmonr);
+	rcu_read_unlock();
+
+	return rmids;
+}
+
 static void monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
 {
 	unsigned long flags;
@@ -865,16 +933,49 @@ static void intel_cmt_event_read(struct perf_event *event)
 {
 }
 
+static inline void __intel_cmt_event_start(struct perf_event *event,
+					   union pmonr_rmids rmids)
+{
+	if (!(event->hw.state & PERF_HES_STOPPED))
+		return;
+	event->hw.state &= ~PERF_HES_STOPPED;
+	pqr_cache_update_rmid(rmids.sched_rmid);
+}
+
 static void intel_cmt_event_start(struct perf_event *event, int mode)
 {
+	union pmonr_rmids rmids;
+
+	rmids = monr_get_sched_in_rmids(monr_from_event(event));
+	__intel_cmt_event_start(event, rmids);
 }
 
 static void intel_cmt_event_stop(struct perf_event *event, int mode)
 {
+	union pmonr_rmids rmids;
+
+	if (event->hw.state & PERF_HES_STOPPED)
+		return;
+	event->hw.state |= PERF_HES_STOPPED;
+	rmids = monr_get_sched_in_rmids(monr_hrchy_root);
+	/*
+	 * HW tracks the rmid even when event is not scheduled and event
+	 * reads occur even if event is Inactive. Therefore there is no need to
+	 * read when event is stopped.
+	 */
+	pqr_cache_update_rmid(rmids.sched_rmid);
 }
 
 static int intel_cmt_event_add(struct perf_event *event, int mode)
 {
+	union pmonr_rmids rmids;
+
+	event->hw.state = PERF_HES_STOPPED;
+	rmids = monr_get_sched_in_rmids(monr_from_event(event));
+
+	if (mode & PERF_EF_START)
+		__intel_cmt_event_start(event, rmids);
+
 	return 0;
 }
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (15 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup David Carrillo-Cisneros
                   ` (28 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This uflag allows the user to specify that a rmid must be allocated at
monr initialization, or fail otherwise.

For this to work, we split __pmonr_apply_uflags into reserve and apply
modes. The reserve mode will try to reserve a free rmid and, if successful,
the apply mode can proceed using the previously reserved rmid.
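
For illustration only, a tiny stand-alone sketch of the reserve/apply split:
phase one either reserves a resource on every package or rolls back, and
phase two applies using the reservations and cannot fail. All names and
values below are made up:

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_PKGS 3

	static bool rmid_free[NR_PKGS] = { true, false, true };
	static int  reserved[NR_PKGS]  = { -1, -1, -1 };

	/* Phase 1: reserve on every package, or roll back and fail. */
	static int reserve_all(void)
	{
		int p;

		for (p = 0; p < NR_PKGS; p++) {
			if (!rmid_free[p])
				goto rollback;
			rmid_free[p] = false;
			reserved[p] = p + 10;	/* pretend rmid */
		}
		return 0;

	rollback:
		while (--p >= 0) {
			rmid_free[p] = true;
			reserved[p] = -1;
		}
		return -1;
	}

	/* Phase 2: apply using the reservations; this step cannot fail. */
	static void apply_all(void)
	{
		int p;

		for (p = 0; p < NR_PKGS; p++)
			printf("pkg %d -> rmid %d\n", p, reserved[p]);
	}

	int main(void)
	{
		if (reserve_all())
			printf("no free rmid on some package, nothing changed\n");
		else
			apply_all();
		return 0;
	}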

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 116 +++++++++++++++++++++++++++++++++++++++-----
 arch/x86/events/intel/cmt.h |   5 +-
 2 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 9421a3e..3883cb4 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -33,7 +33,8 @@ static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
 static struct monr *monr_hrchy_root;
 
 /* Flags for root monr and all its pmonrs while being monitored. */
-static enum cmt_user_flags root_monr_uflags = CMT_UF_HAS_USER;
+static enum cmt_user_flags root_monr_uflags =
+		CMT_UF_HAS_USER | CMT_UF_NOLAZY_RMID;
 
 /* Auxiliar flags */
 static enum cmt_user_flags *pkg_uflags_zeroes;
@@ -414,6 +415,7 @@ static void monr_dealloc(struct monr *monr)
 	u16 p, nr_pkgs = topology_max_packages();
 
 	if (WARN_ON_ONCE(monr->nr_has_user) ||
+	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
 	    WARN_ON_ONCE(monr->mon_events))
 		return;
 
@@ -478,11 +480,28 @@ static enum cmt_user_flags pmonr_uflags(struct pmonr *pmonr)
 	return monr->uflags | monr->pkg_uflags[pmonr->pkgd->pkgid];
 }
 
+/*
+ * Callable in two modes:
+ *   1) @reserve == true: will check if uflags are applicable and store in
+ *   @res_rmid the "reserved" rmid.
+ *   2) @reserve == false: will apply pmonr_uflags using the rmid stored in
+ *   @res_rmid (if any). Cannot fail.
+ */
 static int __pmonr_apply_uflags(struct pmonr *pmonr,
-		enum cmt_user_flags pmonr_uflags)
+		enum cmt_user_flags pmonr_uflags, bool reserve, u32 *res_rmid)
 {
+	struct pkg_data *pkgd = pmonr->pkgd;
+	u32 free_rmid;
+
+	if (WARN_ON_ONCE(!res_rmid))
+		return -EINVAL;
+	if (WARN_ON_ONCE(reserve && *res_rmid != INVALID_RMID))
+		return -EINVAL;
+
 	if (!(pmonr_uflags & CMT_UF_HAS_USER)) {
 		if (pmonr->state != PMONR_OFF) {
+			if (reserve)
+				return 0;
 			pmonr_to_unused(pmonr);
 			pmonr_unused_to_off(pmonr);
 		}
@@ -492,8 +511,40 @@ static int __pmonr_apply_uflags(struct pmonr *pmonr,
 	if (monr_is_root(pmonr->monr) && (~pmonr_uflags & root_monr_uflags))
 		return -EINVAL;
 
-	if (pmonr->state == PMONR_OFF)
-		pmonr_to_unused(pmonr);
+	if (pmonr->state == PMONR_OFF) {
+		if (!reserve)
+			pmonr_to_unused(pmonr);
+	}
+	if (pmonr->state == PMONR_ACTIVE)
+		return 0;
+	if (!(pmonr_uflags & CMT_UF_NOLAZY_RMID))
+		return 0;
+	if (pmonr->state == PMONR_DEP_DIRTY) {
+		if (!reserve)
+			pmonr_dep_dirty_to_active(pmonr);
+		return 0;
+	}
+
+	/*
+	 * At this point pmonr is in either Unused or Dep_Idle state and
+	 * needs a rmid to transition to Active.
+	 */
+	if (reserve) {
+		free_rmid = find_first_bit(pkgd->free_rmids, CMT_MAX_NR_RMIDS);
+		if (free_rmid == CMT_MAX_NR_RMIDS)
+			return -ENOSPC;
+		*res_rmid = free_rmid;
+		__clear_bit(*res_rmid, pkgd->free_rmids);
+		return 0;
+	}
+
+	/* both cases use the reserved rmid. */
+	if (pmonr->state == PMONR_UNUSED) {
+		pmonr_unused_to_active(pmonr, *res_rmid);
+	} else {
+		WARN_ON_ONCE(pmonr->state != PMONR_DEP_IDLE);
+		pmonr_dep_idle_to_active(pmonr, *res_rmid);
+	}
 
 	return 0;
 }
@@ -514,7 +565,10 @@ static bool monr_has_user(struct monr *monr)
 	       pkg_uflags_has_user(monr->pkg_uflags);
 }
 
-static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
+static int __monr_apply_uflags(struct monr *monr,
+			       enum cmt_user_flags *puflags,
+			       bool reserve,
+			       u32 *res_rmids)
 {
 	enum cmt_user_flags pmonr_uflags;
 	struct pkg_data *pkgd = NULL;
@@ -526,7 +580,10 @@ static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 		pmonr_uflags = monr->uflags |
 				(puflags ? puflags[p] : monr->pkg_uflags[p]);
 		pmonr = pkgd_pmonr(pkgd, monr);
-		err = __pmonr_apply_uflags(pmonr, pmonr_uflags);
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags,
+					   reserve, &res_rmids[p]);
+		/* The apply pass (reserve == false) should not fail. */
+		WARN_ON_ONCE(!reserve && err);
 		if (err)
 			return err;
 	}
@@ -537,17 +594,26 @@ static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 /* Apply puflags for all packages or rollback and fail. */
 static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 {
+	struct pkg_data *pkgd = NULL;
+	u32 res_rmids[CMT_MAX_NR_PKGS];
 	int p, err;
 	unsigned long flags;
 
 	monr_hrchy_assert_held_mutexes();
 	monr_hrchy_acquire_locks(&flags);
 
-	err = __monr_apply_uflags(monr, puflags);
+	for (p = 0; p < CMT_MAX_NR_PKGS; p++)
+		res_rmids[p] = INVALID_RMID;
+
+	/* First call of __monr_apply_uflags will only "reserve" rmids. */
+	err = __monr_apply_uflags(monr, puflags, true, res_rmids);
 	if (err)
-		goto exit;
+		goto error;
+
+	/* second call actually applies the flags. */
+	err = __monr_apply_uflags(monr, puflags, false, res_rmids);
+	WARN_ON_ONCE(err);
 
-	/* Proceed to exit if no uflags to update to pkg_uflags. */
 	if (!puflags)
 		goto exit;
 
@@ -563,6 +629,14 @@ static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 	monr_hrchy_release_locks(&flags);
 
 	return err;
+
+error:
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		p = pkgd->pkgid;
+		if (res_rmids[p] != INVALID_RMID)
+			__set_bit(res_rmids[p], pkgd->free_rmids);
+	}
+	goto exit;
 }
 
 static inline struct monr *monr_from_event(struct perf_event *event)
@@ -583,8 +657,11 @@ static bool monr_account_uflags(struct monr *monr,
 
 	if (uflags & CMT_UF_HAS_USER)
 		monr->nr_has_user += account ? 1 : -1;
+	if (uflags & CMT_UF_NOLAZY_RMID)
+		monr->nr_nolazy_rmid += account ? 1 : -1;
 
-	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0);
+	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0) |
+			(monr->nr_nolazy_rmid ? CMT_UF_NOLAZY_RMID : 0);
 
 	return old_flags != monr->uflags;
 }
@@ -1165,6 +1242,7 @@ static int init_pkg_data(int cpu)
 	struct pmonr *pmonr;
 	unsigned long flags;
 	int err = 0;
+	u32 res_rmid;
 	u16 pkgid = topology_logical_package_id(cpu);
 
 	lockdep_assert_held(&cmt_mutex);
@@ -1190,9 +1268,25 @@ static int init_pkg_data(int cpu)
 		 */
 		RCU_INIT_POINTER(pos->pmonrs[pkgid], pmonr);
 
+		res_rmid = INVALID_RMID;
 		raw_spin_lock_irqsave(&pkgd->lock, flags);
-		err = __pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr));
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr),
+					   true, &res_rmid);
+		if (!err)
+			__pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr),
+					     false, &res_rmid);
 		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+		/*
+		 * Do not fail the whole package initialization because a pmonr
+		 * failed to apply its uflags, just report the error.
+		 */
+		if (err) {
+			pr_err("Not enough free RMIDs in package %d for Intel CMT.\n",
+				pkgid);
+			pos->pkg_uflags[pkgid] |= CMT_UF_ERROR;
+			err = 0;
+		}
 	}
 
 	if (err) {
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index bf90c26..754a9c8 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -245,7 +245,8 @@ struct pkg_data {
 enum cmt_user_flags {
 	/* if no has_user other flags are meaningless. */
 	CMT_UF_HAS_USER		= BIT(0), /* has cgroup or event users */
-	CMT_UF_MAX		= BIT(1) - 1,
+	CMT_UF_NOLAZY_RMID	= BIT(1), /* try to obtain rmid on creation */
+	CMT_UF_MAX		= BIT(2) - 1,
 	CMT_UF_ERROR		= CMT_UF_MAX + 1,
 };
 
@@ -258,6 +259,7 @@ enum cmt_user_flags {
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
+ * @nr_nolazy_rmid:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
  * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
  * @pkg_uflags:		package level cmt_user_flags, each entry is used as
  *			pmonr uflags if that package is online.
@@ -278,6 +280,7 @@ struct monr {
 	struct list_head		parent_entry;
 
 	int				nr_has_user;
+	int				nr_nolazy_rmid;
 	enum cmt_user_flags		uflags;
 	enum cmt_user_flags		pkg_uflags[];
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (16 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events David Carrillo-Cisneros
                   ` (27 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This is the first patch for cgroup support in this series.

It adds a new field to perf_cgroup that the intel_cmt PMU uses to
associate a monr with a perf_cgroup instance.
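
As a rough sketch of the intended use, the accessors that the intel_cmt
PMU layers on top of the new field look like this (these helpers are
added by a later patch in this series; shown here only to illustrate
the field):

	static inline struct monr *monr_from_perf_cgroup(struct perf_cgroup *cgrp)
	{
		return (struct monr *)READ_ONCE(cgrp->arch_info);
	}

	static inline void perf_cgroup_set_monr(struct perf_cgroup *cgrp,
						struct monr *monr)
	{
		WRITE_ONCE(cgrp->arch_info, monr);
	}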

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 4 +++-
 kernel/events/core.c       | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0202b32..406119b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -815,7 +815,9 @@ struct perf_cgroup_info {
 };
 
 struct perf_cgroup {
-	struct cgroup_subsys_state	css;
+	/* Architecture specific information. */
+	void				 *arch_info;
+	struct cgroup_subsys_state	 css;
 	struct perf_cgroup_info	__percpu *info;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d99a51c..0de3ca5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10739,6 +10739,8 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(-ENOMEM);
 	}
 
+	jc->arch_info = NULL;
+
 	return &jc->css;
 }
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (17 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 20/46] perf/core: add pmu::event_terminate David Carrillo-Cisneros
                   ` (26 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

First part of cgroup support for CMT.

A monr's position in the monr hierarchy depends on the position of its
target cgroup or thread in the cgroup hierarchy (see code comments for
details).

A monr that monitors a cgroup keeps a reference to that cgroup in
monr->mon_cgrp; future patches use it to add support for cgroup
monitoring without requiring an active perf_event at all times.
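
As a rough illustration (a sketch, not part of the patch), assume
cgroups A/B/C nested in that order, where only A and C are monitored:

	cgroup hierarchy:  root -> A -> B -> C
	monr hierarchy:    monr_hrchy_root -> monr(A) -> monr(C)

B has no monr of its own; its perf_cgroup resolves to monr(A), the monr
of its lowest monitored ancestor, so B's tasks are accounted there.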

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 293 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |   2 +
 2 files changed, 295 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 3883cb4..a5b7d2d 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -125,6 +125,14 @@ static inline struct pmonr *pkgd_pmonr(struct pkg_data *pkgd, struct monr *monr)
 	return rcu_dereference_check(monr->pmonrs[pkgd->pkgid], safe);
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static inline struct cgroup_subsys_state *get_root_perf_css(void)
+{
+	/* Get css for root cgroup */
+	return  init_css_set.subsys[perf_event_cgrp_id];
+}
+#endif
+
 static inline void pmonr_set_rmids(struct pmonr *pmonr,
 				   u32 sched_rmid, u32 read_rmid)
 {
@@ -416,6 +424,7 @@ static void monr_dealloc(struct monr *monr)
 
 	if (WARN_ON_ONCE(monr->nr_has_user) ||
 	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
+	    WARN_ON_ONCE(monr->mon_cgrp) ||
 	    WARN_ON_ONCE(monr->mon_events))
 		return;
 
@@ -639,6 +648,7 @@ static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 	goto exit;
 }
 
+/* can be NULL if the monr was for a cgroup that has gone offline. */
 static inline struct monr *monr_from_event(struct perf_event *event)
 {
 	return (struct monr *) READ_ONCE(event->hw.cmt_monr);
@@ -727,6 +737,75 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 	return err;
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static inline struct monr *monr_from_perf_cgroup(struct perf_cgroup *cgrp)
+{
+	return (struct monr *)READ_ONCE(cgrp->arch_info);
+}
+
+static inline void perf_cgroup_set_monr(struct perf_cgroup *cgrp,
+					struct monr *monr)
+{
+	WRITE_ONCE(cgrp->arch_info, monr);
+}
+
+/* Get cgroup for both task and cgroup event. */
+static struct perf_cgroup *perf_cgroup_from_task_event(struct perf_event *event)
+{
+#ifdef CONFIG_LOCKDEP
+	bool rcu_safe = lockdep_is_held(&cmt_mutex);
+#endif
+
+	return container_of(
+		task_css_check(event->hw.target, perf_event_cgrp_id, rcu_safe),
+		struct perf_cgroup, css);
+}
+
+static struct perf_cgroup *perf_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct perf_cgroup, css);
+}
+
+/**
+ * perf_cgroup_mon_started() - Tell if cgroup is monitored by its own monr.
+ *
+ * A perf_cgroup is being monitored when it is referenced back by
+ * its monr's mon_cgrp. Otherwise, the cgroup only uses the monr used to
+ * monitor another cgroup (the one that is referenced back by monr's mon_cgrp).
+ */
+static inline bool perf_cgroup_mon_started(struct perf_cgroup *cgrp)
+{
+	struct monr *monr;
+
+	/*
+	 * monr can be referenced by a cgroup other than the one in its
+	 * mon_cgrp, be careful.
+	 */
+	monr = monr_from_perf_cgroup(cgrp);
+
+	/* The root monr has no cgroup associated before initialization. */
+	return  monr->mon_cgrp == cgrp;
+}
+
+/**
+ * perf_cgroup_find_lma() - Find @cgrp lowest monitored ancestor.
+ *
+ * Find the lowest monitored ancestor of @cgrp, not including @cgrp itself.
+ * Return: lma or NULL if no ancestor is monitored.
+ */
+struct perf_cgroup *perf_cgroup_find_lma(struct perf_cgroup *cgrp)
+{
+	struct cgroup_subsys_state *parent_css;
+
+	do {
+		parent_css = cgrp->css.parent;
+		cgrp = parent_css ? perf_cgroup_from_css(parent_css) : NULL;
+	} while (cgrp && !perf_cgroup_mon_started(cgrp));
+	return cgrp;
+}
+
+#endif
+
 /**
  * pmonr_update_sched_rmid() - Update sched_rmid for @pmonr in current package.
  *
@@ -815,6 +894,214 @@ static void monr_hrchy_remove_leaf(struct monr *monr)
 	monr_hrchy_release_locks(&flags);
 }
 
+#ifdef CONFIG_CGROUP_PERF
+
+/* Similar to css_next_descendant_pre but skips the subtree rooted by pos. */
+struct cgroup_subsys_state *
+css_skip_subtree_pre(struct cgroup_subsys_state *pos,
+		     struct cgroup_subsys_state *root)
+{
+	struct cgroup_subsys_state *next;
+
+	while (pos != root) {
+		next = css_next_child(pos, pos->parent);
+		if (next)
+			return next;
+		pos = pos->parent;
+	}
+	return NULL;
+}
+
+/* Make the monrs of all of css's descendants depend on new_monr. */
+inline void css_subtree_update_monr_dependants(struct cgroup_subsys_state *css,
+					       struct monr *new_monr)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct perf_cgroup *pos_cgrp;
+	struct monr *pos_monr;
+	unsigned long flags;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	rcu_read_lock();
+
+	pos_css = css_next_descendant_pre(css, css);
+	while (pos_css) {
+		pos_cgrp = perf_cgroup_from_css(pos_css);
+		pos_monr = monr_from_perf_cgroup(pos_cgrp);
+
+		/* Skip css that are not online, sync'ed with cmt_mutex. */
+		if (!(pos_css->flags & CSS_ONLINE)) {
+			pos_css = css_next_descendant_pre(pos_css, css);
+			continue;
+		}
+		if (!perf_cgroup_mon_started(pos_cgrp)) {
+			perf_cgroup_set_monr(pos_cgrp, new_monr);
+			pos_css = css_next_descendant_pre(pos_css, css);
+			continue;
+		}
+		rcu_read_unlock();
+
+		monr_hrchy_acquire_locks(&flags);
+		pos_monr->parent = new_monr;
+		list_move_tail(&pos_monr->parent_entry, &new_monr->children);
+		monr_hrchy_release_locks(&flags);
+
+		rcu_read_lock();
+		/*
+		 * Skip subtrees rooted by a css that owns a monr, since the
+		 * css in those subtrees use the monr at their subtree root.
+		 */
+		pos_css = css_skip_subtree_pre(pos_css, css);
+	}
+	rcu_read_unlock();
+}
+
+static inline int __css_start_monitoring(struct cgroup_subsys_state *css)
+{
+	struct perf_cgroup *cgrp, *cgrp_lma, *pos_cgrp;
+	struct monr *monr, *monr_parent, *pos_monr, *tmp_monr;
+	unsigned long flags;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	cgrp = perf_cgroup_from_css(css);
+
+	cgrp_lma = perf_cgroup_find_lma(cgrp);
+	if (!cgrp_lma) {
+		perf_cgroup_set_monr(cgrp, monr_hrchy_root);
+		monr_hrchy_root->mon_cgrp = cgrp;
+		return 0;
+	}
+	/*
+	 * The monr for the lowest monitored ancestor is the direct ancestor
+	 * of monr in the monr hierarchy.
+	 */
+	monr_parent = monr_from_perf_cgroup(cgrp_lma);
+
+	monr = monr_alloc();
+	if (IS_ERR(monr))
+		return PTR_ERR(monr);
+	/*
+	 * New monr has no children yet so it can be inserted in hierarchy as
+	 * a leaf. Since all monr's pmonr are in Off state, there is no risk
+	 * of pmonr state transitions in the scheduler path.
+	 */
+	monr_hrchy_acquire_locks(&flags);
+	monr_hrchy_insert_leaf(monr, monr_parent);
+	monr_hrchy_release_locks(&flags);
+
+	/*
+	 * Previous lock also works as a barrier to prevent attaching
+	 * the monr to cgrp before it is in monr hierarchy.
+	 */
+	perf_cgroup_set_monr(cgrp, monr);
+	monr->mon_cgrp = cgrp;
+	css_subtree_update_monr_dependants(css, monr);
+
+	monr_hrchy_acquire_locks(&flags);
+	/* Move task-event monrs that are descendant from css's cgroup. */
+	list_for_each_entry_safe(pos_monr, tmp_monr,
+				 &monr_parent->children, parent_entry) {
+		if (pos_monr->mon_cgrp)
+			continue;
+		/*
+		 * all events in event group have the same cgroup.
+		 * No RCU read lock necessary for task_css_check since calling
+		 * inside critical section.
+		 */
+		pos_cgrp = perf_cgroup_from_task_event(pos_monr->mon_events);
+		if (!cgroup_is_descendant(pos_cgrp->css.cgroup,
+					  cgrp->css.cgroup))
+			continue;
+		pos_monr->parent = monr;
+		list_move_tail(&pos_monr->parent_entry, &monr->children);
+	}
+	monr_hrchy_release_locks(&flags);
+
+	return 0;
+}
+
+static inline void __css_stop_monitoring(struct cgroup_subsys_state *css)
+{
+	struct perf_cgroup *cgrp, *cgrp_lma;
+	struct monr *monr, *monr_parent, *pos_monr;
+	unsigned long flags;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	cgrp = perf_cgroup_from_css(css);
+	monr = monr_from_perf_cgroup(cgrp);
+	/*
+	 * When css is root cgroup's css, detach cgroup but do not
+	 * destroy monr.
+	 */
+	cgrp_lma = perf_cgroup_find_lma(cgrp);
+	if (!cgrp_lma) {
+		/* monr of root cgrp must be monr_hrchy_root. */
+		monr->mon_cgrp = NULL;
+		return;
+	}
+
+	monr_parent = monr_from_perf_cgroup(cgrp_lma);
+	css_subtree_update_monr_dependants(css, monr_parent);
+
+	monr_hrchy_acquire_locks(&flags);
+
+	/* Move the children monrs that are not cgroups. */
+	list_for_each_entry(pos_monr, &monr->children, parent_entry)
+		pos_monr->parent = monr_parent;
+	list_splice_tail_init(&monr->children, &monr_parent->children);
+
+	perf_cgroup_set_monr(cgrp, monr_from_perf_cgroup(cgrp_lma));
+	monr->mon_cgrp = NULL;
+	monr_hrchy_remove_leaf(monr);
+
+	monr_hrchy_release_locks(&flags);
+}
+
+static bool is_cgroup_event(struct perf_event *event)
+{
+	return event->cgrp;
+}
+
+static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
+{
+	struct monr *monr;
+	struct perf_cgroup *cgrp = event->cgrp;
+	int err;
+	bool started = false;
+
+	if (!perf_cgroup_mon_started(cgrp)) {
+		css_get(&cgrp->css);
+		err = __css_start_monitoring(&cgrp->css);
+		css_put(&cgrp->css);
+		if (err)
+			return err;
+		started = true;
+	}
+
+	monr = monr_from_perf_cgroup(cgrp);
+	err = monr_append_event(monr, event);
+	if (err && started) {
+		css_get(&cgrp->css);
+		__css_stop_monitoring(&cgrp->css);
+		css_put(&cgrp->css);
+	}
+
+	return err;
+}
+
+/* return monr of cgroup that contains the task to monitor. */
+static struct monr *monr_hrchy_get_monr_parent(struct perf_event *event)
+{
+	struct perf_cgroup *cgrp = perf_cgroup_from_task_event(event);
+
+	return monr_from_perf_cgroup(cgrp);
+}
+
+#else /* CONFIG_CGROUP_PERF */
+
 static bool is_cgroup_event(struct perf_event *event)
 {
 	return false;
@@ -834,6 +1121,8 @@ static struct monr *monr_hrchy_get_monr_parent(struct perf_event *event)
 	return monr_hrchy_root;
 }
 
+#endif
+
 static int monr_hrchy_attach_cpu_event(struct perf_event *event)
 {
 	return monr_append_event(monr_hrchy_root, event);
@@ -883,6 +1172,10 @@ static int monr_hrchy_attach_event(struct perf_event *event)
 
 static void monr_destroy(struct monr *monr)
 {
+#ifdef CONFIG_CGROUP_PERF
+	if (monr->mon_cgrp)
+		__css_stop_monitoring(&monr->mon_cgrp->css);
+#endif
 	monr_hrchy_remove_leaf(monr);
 	monr_dealloc(monr);
 }
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 754a9c8..dc52641 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -252,6 +252,7 @@ enum cmt_user_flags {
 
 /**
  * struct monr - MONitored Resource.
+ * @mon_cgrp:		The cgroup associated with this monr, if any.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
  * @pmonrs:		Per-package pmonrs.
@@ -271,6 +272,7 @@ enum cmt_user_flags {
  * On initialization, all monr's pmonrs start in Off state.
  */
 struct monr {
+	struct perf_cgroup		*mon_cgrp;
 	struct perf_event		*mon_events;
 	struct list_head		entry;
 	struct pmonr			**pmonrs;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 20/46] perf/core: add pmu::event_terminate
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (18 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate David Carrillo-Cisneros
                   ` (25 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The new pmu::event_terminate callback allows a PMU to access an event's
cgroup before the event is torn down.

CMT uses it to detach the cgroup from a monr before perf clears
event->cgrp.
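
A minimal sketch of how a PMU wires up the new callback (the names below
are placeholders, not from this series; intel_cmt's actual use appears
in the next patch):

	static void my_pmu_event_terminate(struct perf_event *event)
	{
		/* event->cgrp (if any) is still valid at this point. */
	}

	static struct pmu my_pmu = {
		.event_init		= my_pmu_event_init,
		.event_terminate	= my_pmu_event_terminate,
		/* ... */
	};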

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 6 ++++++
 kernel/events/core.c       | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 406119b..14dff7a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -298,6 +298,12 @@ struct pmu {
 	int (*event_init)		(struct perf_event *event);
 
 	/*
+	 * Terminate the event for this PMU. Optional complement for a
+	 * successful event_init. Called before the event fields are torn down.
+	 */
+	void (*event_terminate)		(struct perf_event *event);
+
+	/*
 	 * Notification that the event was mapped or unmapped.  Called
 	 * in the context of the mapping task.
 	 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0de3ca5..464f46d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4005,6 +4005,8 @@ static void _free_event(struct perf_event *event)
 		ring_buffer_attach(event, NULL);
 		mutex_unlock(&event->mmap_mutex);
 	}
+	if (event->pmu->event_terminate)
+		event->pmu->event_terminate(event);
 
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
@@ -9226,6 +9228,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (event->pmu->event_terminate)
+		event->pmu->event_terminate(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (19 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 20/46] perf/core: add pmu::event_terminate David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop David Carrillo-Cisneros
                   ` (24 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The change from event_destroy to event_terminate makes it possible to
check whether the event is a cgroup event during event destruction,
before generic code clears event->cgrp.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index a5b7d2d..f7da8cf 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1349,7 +1349,7 @@ static int intel_cmt_event_add(struct perf_event *event, int mode)
 	return 0;
 }
 
-static void intel_cmt_event_destroy(struct perf_event *event)
+static void intel_cmt_event_terminate(struct perf_event *event)
 {
 	struct monr *monr;
 
@@ -1385,8 +1385,6 @@ static int intel_cmt_event_init(struct perf_event *event)
 	    event->attr.sample_period) /* no sampling */
 		return -EINVAL;
 
-	event->destroy = intel_cmt_event_destroy;
-
 	INIT_LIST_HEAD(&event->hw.cmt_list);
 
 	mutex_lock(&cmt_mutex);
@@ -1439,6 +1437,7 @@ static struct pmu intel_cmt_pmu = {
 	.attr_groups	     = intel_cmt_attr_groups,
 	.task_ctx_nr	     = perf_sw_context,
 	.event_init	     = intel_cmt_event_init,
+	.event_terminate     = intel_cmt_event_terminate,
 	.add		     = intel_cmt_event_add,
 	.del		     = intel_cmt_event_stop,
 	.start		     = intel_cmt_event_start,
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (20 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup David Carrillo-Cisneros
                   ` (23 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Start/stop cgroup monitoring when the intel_cmt device is
started/stopped.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 99 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f7da8cf..5c64d94 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -766,6 +766,11 @@ static struct perf_cgroup *perf_cgroup_from_css(struct cgroup_subsys_state *css)
 	return container_of(css, struct perf_cgroup, css);
 }
 
+static struct monr *monr_from_css(struct cgroup_subsys_state *css)
+{
+	return ((struct monr *)perf_cgroup_from_css(css)->arch_info);
+}
+
 /**
  * perf_cgroup_mon_started() - Tell if cgroup is monitored by its own monr.
  *
@@ -1445,6 +1450,45 @@ static struct pmu intel_cmt_pmu = {
 	.read		     = intel_cmt_event_read,
 };
 
+#ifdef CONFIG_CGROUP_PERF
+static int __css_go_online(struct cgroup_subsys_state *css)
+{
+	struct perf_cgroup *lma, *cgrp = perf_cgroup_from_css(css);
+	int err;
+
+	if (!css->parent) {
+		err = __css_start_monitoring(css);
+		if (err)
+			return err;
+		return monr_apply_uflags(monr_from_css(css), NULL);
+	}
+	lma = perf_cgroup_find_lma(cgrp);
+	perf_cgroup_set_monr(cgrp, monr_from_perf_cgroup(lma));
+
+	return 0;
+}
+
+static void __css_go_offline(struct cgroup_subsys_state *css)
+{
+	struct monr *monr;
+	struct perf_cgroup *cgrp = perf_cgroup_from_css(css);
+
+	monr = monr_from_perf_cgroup(cgrp);
+	if (!perf_cgroup_mon_started(cgrp)) {
+		perf_cgroup_set_monr(cgrp, NULL);
+		return;
+	}
+
+	monr_apply_uflags(monr, pkg_uflags_zeroes);
+	/*
+	 * Terminate monr even if there are other users (events), monr will
+	 * stay zombie until those events are terminated.
+	 */
+	monr_destroy(monr);
+}
+
+#endif
+
 static void free_pkg_data(struct pkg_data *pkg_data)
 {
 	kfree(pkg_data);
@@ -1666,6 +1710,42 @@ static const struct x86_cpu_id intel_cmt_match[] = {
 	{}
 };
 
+#ifdef CONFIG_CGROUP_PERF
+/* Start/stop monitoring for all cgroups in the cgroup hierarchy. */
+static int __switch_monitoring_all_cgroups(bool online)
+{
+	int err = 0;
+	struct cgroup_subsys_state *css_root, *css = NULL;
+
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_assert_held_mutexes();
+
+	rcu_read_lock();
+	/* Get css for root cgroup */
+	css_root =  get_root_perf_css();
+
+	css_for_each_descendant_pre(css, css_root) {
+		if (!css_tryget_online(css))
+			continue;
+
+		rcu_read_unlock();
+
+		if (online)
+			err = __css_go_online(css);
+		else
+			__css_go_offline(css);
+		css_put(css);
+		if (err)
+			return err;
+
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+#endif
+
 static void cmt_dealloc(void)
 {
 	kfree(monr_hrchy_root);
@@ -1682,6 +1762,13 @@ static void cmt_stop(void)
 {
 	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
 	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+	__switch_monitoring_all_cgroups(false);
+	monr_hrchy_release_mutexes();
+
+	mutex_unlock(&cmt_mutex);
 }
 
 static int __init cmt_alloc(void)
@@ -1743,6 +1830,18 @@ static int __init cmt_start(void)
 	}
 	event_attr_intel_cmt_llc_scale.event_str = str;
 
+#ifdef CONFIG_CGROUP_PERF
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+	err = __switch_monitoring_all_cgroups(true);
+	if (err)
+		__switch_monitoring_all_cgroups(false);
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+	if (err)
+		goto rm_online;
+#endif
+
 	return 0;
 
 rm_online:
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (21 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline} David Carrillo-Cisneros
                   ` (22 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The hooks allow architectures to extend the behavior of the perf cgroup.

In this patch series, they are used to let intel_cmt follow changes in
the perf_cgroup hierarchy.
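
An architecture opts in by providing the functions and defining the
matching macros in its perf headers; otherwise the empty defaults below
apply. As a sketch, the opt-in side looks like this (this is what the
x86 patch later in this series does):

	#define perf_cgroup_arch_css_online	perf_cgroup_arch_css_online
	int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);

	#define perf_cgroup_arch_css_offline	perf_cgroup_arch_css_offline
	void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);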

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 11 +++++++++++
 kernel/events/core.c       | 12 ++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 14dff7a..9f388d4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1386,4 +1386,15 @@ int perf_event_exit_cpu(unsigned int cpu);
 #define perf_event_exit_cpu	NULL
 #endif
 
+/*
+ * Hooks for architecture specific extensions for perf_cgroup.
+ */
+#ifndef perf_cgroup_arch_css_online
+# define perf_cgroup_arch_css_online(css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_offline
+# define perf_cgroup_arch_css_offline(css) do { } while (0)
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 464f46d..e11a16a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10748,6 +10748,16 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	return &jc->css;
 }
 
+static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
+{
+	return perf_cgroup_arch_css_online(css);
+}
+
+static void perf_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+	perf_cgroup_arch_css_offline(css);
+}
+
 static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
@@ -10776,6 +10786,8 @@ static void perf_cgroup_attach(struct cgroup_taskset *tset)
 
 struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
+	.css_online	= perf_cgroup_css_online,
+	.css_offline	= perf_cgroup_css_offline,
 	.css_free	= perf_cgroup_css_free,
 	.attach		= perf_cgroup_attach,
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline}
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (22 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE David Carrillo-Cisneros
                   ` (21 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Use the newly introduced architecture specific hooks to update the monr
hierarchy when a cgroup goes online/offline.

Add cmt_initialized_key and use it to skip the hooks when intel_cmt is
not active.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c       | 36 ++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h | 19 +++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 5c64d94..7545deb 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -24,6 +24,7 @@ static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
 static DEFINE_MUTEX(cmt_mutex);
 /* List of monrs that are associated with an event. */
 static LIST_HEAD(cmt_event_monrs);
+DEFINE_STATIC_KEY_FALSE(cmt_initialized_key);
 
 static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
 
@@ -1468,6 +1469,21 @@ static int __css_go_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css)
+{
+	int err = 0;
+
+	if (static_branch_unlikely(&cmt_initialized_key)) {
+		mutex_lock(&cmt_mutex);
+		monr_hrchy_acquire_mutexes();
+		err = __css_go_online(css);
+		monr_hrchy_release_mutexes();
+		mutex_unlock(&cmt_mutex);
+	}
+
+	return err;
+}
+
 static void __css_go_offline(struct cgroup_subsys_state *css)
 {
 	struct monr *monr;
@@ -1487,6 +1503,17 @@ static void __css_go_offline(struct cgroup_subsys_state *css)
 	monr_destroy(monr);
 }
 
+void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css)
+{
+	if (static_branch_unlikely(&cmt_initialized_key)) {
+		mutex_lock(&cmt_mutex);
+		monr_hrchy_acquire_mutexes();
+		__css_go_offline(css);
+		monr_hrchy_release_mutexes();
+		mutex_unlock(&cmt_mutex);
+	}
+}
+
 #endif
 
 static void free_pkg_data(struct pkg_data *pkg_data)
@@ -1760,6 +1787,12 @@ static void cmt_dealloc(void)
 
 static void cmt_stop(void)
 {
+
+	static_branch_disable(&cmt_initialized_key);
+
+	/* Make sure CMT is turned off before starting to tear it down. */
+	barrier();
+
 	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
 	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
 
@@ -1841,6 +1874,9 @@ static int __init cmt_start(void)
 	if (err)
 		goto rm_online;
 #endif
+	/* Make sure everything is ready prior to turning on the key. */
+	barrier();
+	static_branch_enable(&cmt_initialized_key);
 
 	return 0;
 
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f353061..783bdbb 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -299,4 +299,23 @@ static inline void perf_check_microcode(void) { }
 
 #define arch_perf_out_copy_user copy_from_user_nmi
 
+
+/*
+ * Hooks for architecture specific features of perf_event cgroup.
+ * Currently used by Intel's CMT.
+ */
+#ifdef CONFIG_CGROUP_PERF
+#ifdef CONFIG_INTEL_RDT_M
+
+#define perf_cgroup_arch_css_online \
+	perf_cgroup_arch_css_online
+int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_css_offline \
+	perf_cgroup_arch_css_offline
+void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);
+
+#endif
+#endif
+
 #endif /* _ASM_X86_PERF_EVENT_H */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (23 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline} David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook David Carrillo-Cisneros
                   ` (20 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add a new enumeration to store a monr's internal state information.
The new monr->flags contrasts with monr->uflags, which stores user
provided flags.

The first flag is CMT_MONR_ZOMBIE, used to signal that the monr is no
longer valid yet must be kept around until it is released by its last
user. This is useful when a monr is attached to a cgroup that is
destroyed while there are still events that reference that cgroup.
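
In short, the intended lifecycle is (a condensed sketch of the
monr_destroy() logic in the diff below, not additional code):

	if (!(monr->flags & CMT_MONR_ZOMBIE)) {
		/* detach from its cgroup and from the monr hierarchy */
		monr->flags = CMT_MONR_ZOMBIE;
	}
	/* free now only if unreferenced, otherwise when the last user goes. */
	if (!monr_has_user(monr))
		monr_dealloc(monr);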

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 17 +++++++++++++++--
 arch/x86/events/intel/cmt.h |  9 +++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 7545deb..830ce29 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -423,7 +423,12 @@ static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
 
-	if (WARN_ON_ONCE(monr->nr_has_user) ||
+	/*
+	 * Only zombie monrs with all pmonrs in Off state and not attached
+	 * to the monr hierarchy can be deallocated.
+	 */
+	if (WARN_ON_ONCE(!(monr->flags & CMT_MONR_ZOMBIE)) ||
+	    WARN_ON_ONCE(monr->nr_has_user) ||
 	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
 	    WARN_ON_ONCE(monr->mon_cgrp) ||
 	    WARN_ON_ONCE(monr->mon_events))
@@ -1178,12 +1183,20 @@ static int monr_hrchy_attach_event(struct perf_event *event)
 
 static void monr_destroy(struct monr *monr)
 {
+	if (monr->flags & CMT_MONR_ZOMBIE)
+		goto zombie;
+
 #ifdef CONFIG_CGROUP_PERF
 	if (monr->mon_cgrp)
 		__css_stop_monitoring(&monr->mon_cgrp->css);
 #endif
 	monr_hrchy_remove_leaf(monr);
-	monr_dealloc(monr);
+
+	monr->flags = CMT_MONR_ZOMBIE;
+
+zombie:
+	if (!monr_has_user(monr))
+		monr_dealloc(monr);
 }
 
 /**
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index dc52641..1e40e6b 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -240,6 +240,13 @@ struct pkg_data {
 };
 
 /**
+ * enum monr_flags - internal monr's flags.
+ */
+enum monr_flags {
+	CMT_MONR_ZOMBIE = BIT(0), /* monr has been terminated. */
+};
+
+/**
  * enum cmt_user_flags - user set flags for monr and pmonrs.
  */
 enum cmt_user_flags {
@@ -259,6 +266,7 @@ enum cmt_user_flags {
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
+ * @flags:		monr_flags.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
 * @nr_nolazy_rmid:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
  * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
@@ -281,6 +289,7 @@ struct monr {
 	struct list_head		children;
 	struct list_head		parent_entry;
 
+	enum monr_flags			flags;
 	int				nr_has_user;
 	int				nr_nolazy_rmid;
 	enum cmt_user_flags		uflags;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (24 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch David Carrillo-Cisneros
                   ` (19 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This hook allows architecture specific code to be called right after
perf_events' context switch but before the scheduler lock is released.

It serves two uses in this patch series:
  1) Calls CMT's cgroup context switch code that updates the current RMID
  when no perf event is active (in continuous monitoring mode).
  2) Calls __pqr_ctx_switch to perform a single write of the final value
  to the slow PQR_ASSOC MSR.

This hook is different from the one used by Intel CAT in the series
currently under review in LKML. The CAT series simply adds a call to
intel_rdt_sched_in in __switch_to (see
"[PATCH v6 09/10] x86/intel_rdt: Add scheduler hook").

This series proposes using finish_arch_pre_lock_switch instead, since
CMT's integration with perf_events requires the intel rdt common code's
context switch to occur after perf's context switch and before the
switch lock is released, in order to perform (1) correctly.
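
For reference, an architecture enables the hook by defining the macro in
its own headers; as a sketch, this is what a later patch in this series
does for x86:

	/* arch/x86/include/asm/processor.h (added later in the series): */
	#define finish_arch_pre_lock_switch intel_pqr_ctx_switch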

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 kernel/sched/core.c  | 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..2138ee6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2766,6 +2766,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	finish_arch_pre_lock_switch();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..0a0208e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1112,6 +1112,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
+#ifndef finish_arch_pre_lock_switch
+# define finish_arch_pre_lock_switch()	do { } while (0)
+#endif
 #ifndef finish_arch_post_lock_switch
 # define finish_arch_post_lock_switch()	do { } while (0)
 #endif
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (25 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read David Carrillo-Cisneros
                   ` (18 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

In order to support intel_cmt for cgroups with no perf_event, this driver
calls intel_pqr_ctx_switch during a task context switch, after perf_event
has added/deleted all events (if any), using the newly introduced
finish_arch_pre_lock_switch().

If pmu->add sets an rmid, the next rmid is "marked" using the flag
PQR_RMID_FLAG_EVENT. Removing an event sets the root rmid without the
flag PQR_RMID_FLAG_EVENT, to indicate that it can be overwritten by an
actively monitored cgroup.

If no event added to the pmu wrote an rmid into the PQR MSR
(i.e. PQR_RMID_FLAG_EVENT is not set), then __pqr_ctx_switch calls
__intel_cmt_no_event_sched_in to update the PQR cache with the rmid of
the active perf cgroup's monr.

In some cases, a change to the PQR MSR software cache must be written
through immediately, using the flag PQR_FLAG_WT (see the sketch after
this list). These are:
  1) updating PQR_ASSOC when a cpu goes offline.
  2) the first execution slot of a newly exec'd task, since newly exec'd
  tasks run before a context switch.
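
A short sketch of how callers pass these flags (both calls are taken
from the diff below):

	/* Event scheduled in: rmid is event-owned, maybe write-through. */
	pqr_cache_update_rmid(rmids.sched_rmid,
			      PQR_RMID_FLAG_EVENT | (wt ? PQR_FLAG_WT : 0));

	/* CPU goes offline: reset the rmid and flush it to the MSR now. */
	pqr_cache_update_rmid(0, PQR_FLAG_WT);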

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c             | 39 ++++++++++++++++++++++++++++-----
 arch/x86/include/asm/intel_rdt_common.h | 38 ++++++++++++++++++++++++++++----
 arch/x86/include/asm/processor.h        |  4 ++++
 arch/x86/kernel/cpu/intel_rdt_common.c  | 29 ++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 830ce29..bd903ae 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1323,12 +1323,13 @@ static void intel_cmt_event_read(struct perf_event *event)
 }
 
 static inline void __intel_cmt_event_start(struct perf_event *event,
-					   union pmonr_rmids rmids)
+					   union pmonr_rmids rmids, bool wt)
 {
 	if (!(event->hw.state & PERF_HES_STOPPED))
 		return;
 	event->hw.state &= ~PERF_HES_STOPPED;
-	pqr_cache_update_rmid(rmids.sched_rmid);
+	pqr_cache_update_rmid(rmids.sched_rmid,
+			PQR_RMID_FLAG_EVENT | (wt ? PQR_FLAG_WT : 0));
 }
 
 static void intel_cmt_event_start(struct perf_event *event, int mode)
@@ -1336,7 +1337,7 @@ static void intel_cmt_event_start(struct perf_event *event, int mode)
 	union pmonr_rmids rmids;
 
 	rmids = monr_get_sched_in_rmids(monr_from_event(event));
-	__intel_cmt_event_start(event, rmids);
+	__intel_cmt_event_start(event, rmids, false);
 }
 
 static void intel_cmt_event_stop(struct perf_event *event, int mode)
@@ -1352,18 +1353,25 @@ static void intel_cmt_event_stop(struct perf_event *event, int mode)
 	 * reads occur even if event is Inactive. Therefore there is no need to
 	 * read when event is stopped.
 	 */
-	pqr_cache_update_rmid(rmids.sched_rmid);
+	pqr_cache_update_rmid(rmids.sched_rmid, 0);
 }
 
 static int intel_cmt_event_add(struct perf_event *event, int mode)
 {
 	union pmonr_rmids rmids;
+	bool wt;
 
 	event->hw.state = PERF_HES_STOPPED;
 	rmids = monr_get_sched_in_rmids(monr_from_event(event));
+	/*
+	 * A newly exec'd task may run from the file loader without a context
+	 * switch; if so, the PQR sw cache would not update the rmid in hw.
+	 * Avoid that by requesting write-through mode on the PQR sw cache.
+	 */
+	wt = event->total_time_running == 0;
 
 	if (mode & PERF_EF_START)
-		__intel_cmt_event_start(event, rmids);
+		__intel_cmt_event_start(event, rmids, wt);
 
 	return 0;
 }
@@ -1697,7 +1705,7 @@ static int intel_cmt_hp_online_exit(unsigned int cpu)
 	struct pkg_data *pkgd;
 	u16 pkgid = topology_logical_package_id(cpu);
 
-	pqr_cache_update_rmid(0);
+	pqr_cache_update_rmid(0, PQR_FLAG_WT);
 	memset(state, 0, sizeof(*state));
 
 	rcu_read_lock();
@@ -1932,6 +1940,8 @@ static int __init intel_cmt_init(void)
 	rcu_read_unlock();
 	pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);
 
+	static_branch_inc(&pqr_common_enable_key);
+
 	return err;
 
 err_stop:
@@ -1943,4 +1953,21 @@ static int __init intel_cmt_init(void)
 	return err;
 }
 
+/* Schedule task without a CMT perf_event. */
+inline void __intel_cmt_no_event_sched_in(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+	struct monr *monr;
+	union pmonr_rmids rmids;
+
+	/* Assume CMT enabled is likely given that PQR is enabled. */
+	if (!static_branch_likely(&cmt_initialized_key))
+		return;
+	/* Safe to call from_task since we are in scheduler lock. */
+	monr = monr_from_perf_cgroup(perf_cgroup_from_task(current, NULL));
+	rmids = monr_get_sched_in_rmids(monr);
+	pqr_cache_update_rmid(rmids.sched_rmid, 0);
+#endif
+}
+
 device_initcall(intel_cmt_init);
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index 1d5e691..d8d0dc3 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -3,6 +3,7 @@
 
 #if defined(CONFIG_INTEL_RDT_A) || defined(CONFIG_INTEL_RDT_M)
 
+#include <linux/jump_label.h>
 #include <linux/types.h>
 #include <asm/percpu.h>
 #include <asm/msr.h>
@@ -10,35 +11,49 @@
 #define MSR_IA32_PQR_ASSOC	0x0c8f
 
 
+extern struct static_key_false pqr_common_enable_key;
+
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
  * @rmid:		The cached Resource Monitoring ID
  * @next_rmid:		Next rmid to write to hw
  * @closid:		The cached Class Of Service ID
  * @next_closid:	Next closid to write to hw
+ * @next_rmid_flags:	Next rmid's flags
  *
  * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
  * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
  * contains both parts, so we need to cache them.
  *
  * The cache also helps to avoid pointless updates if the value does
- * not change.
+ * not change. It also keeps track of the type of RMID set (event vs
+ * no event) used to determine when a cgroup RMID is required.
  */
 struct intel_pqr_state {
 	u32			rmid;
 	u32			next_rmid;
 	u32			closid;
 	u32			next_closid;
+	int			next_rmid_flags;
 };
 
+#define PQR_FLAG_WT		BIT(0) /* write-through. */
+#define PQR_RMID_FLAG_EVENT	BIT(1) /* associated to a perf_event. */
+
 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 
-static inline void pqr_cache_update_rmid(u32 rmid)
+static inline void pqr_cache_update_rmid(u32 rmid, int next_rmid_flags)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 
 	state->next_rmid = rmid;
-	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+	state->next_rmid_flags |= next_rmid_flags;
+
+	if (next_rmid_flags & PQR_FLAG_WT) {
+		state->rmid = rmid;
+		wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+		state->next_rmid_flags &= ~PQR_FLAG_WT;
+	}
 }
 
 static inline void pqr_cache_update_closid(u32 closid)
@@ -46,7 +61,22 @@ static inline void pqr_cache_update_closid(u32 closid)
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 
 	state->next_closid = closid;
-	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
+
+void __pqr_ctx_switch(void);
+
+inline void __intel_cmt_no_event_sched_in(void);
+
+static inline void intel_pqr_ctx_switch(void)
+{
+	if (static_branch_unlikely(&pqr_common_enable_key))
+		__pqr_ctx_switch();
+}
+
+#else
+
+static inline void intel_pqr_ctx_switch(void)
+{
 }
 
 #endif
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 984a7bf..967ea93 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@ struct vm86;
 #include <asm/nops.h>
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
+#include <asm/intel_rdt_common.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -856,4 +857,7 @@ bool xen_set_default_idle(void);
 
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
+
+#define finish_arch_pre_lock_switch intel_pqr_ctx_switch
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
index 7fd5b20..6e45287 100644
--- a/arch/x86/kernel/cpu/intel_rdt_common.c
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -6,3 +6,32 @@
  * must ensure interruptions are handled properly.
  */
 DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
+
+/*
+ * Update hw's RMID using cgroup's if perf_event did not.
+ * Sync pqr cache with MSR.
+ */
+inline void __pqr_ctx_switch(void)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	/*
+	 * Obtain a rmid from current task's cgroup if no perf event
+	 * set a rmid.
+	 */
+	if (likely(!(state->next_rmid_flags & PQR_RMID_FLAG_EVENT)))
+		__intel_cmt_no_event_sched_in();
+
+	state->next_rmid_flags = 0;
+
+	/* __intel_cmt_no_event_sched_in might have changed next_rmid. */
+	if (likely(state->rmid == state->next_rmid &&
+			state->closid == state->next_closid))
+		return;
+
+	state->rmid = state->next_rmid;
+	state->closid = state->next_closid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (26 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read David Carrillo-Cisneros
                   ` (17 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

New PMUs, such as CMT's, do not guarantee that a read will succeed even
if pmu::add was successful.

In the generic code, this patch adds an int error return and completes the
error checking path up to perf_read().

In CMT's PMU, it adds proper error handling of hw read failures.
In other PMUs, pmu::read() simply returns 0.
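
As a quick orientation before the per-architecture churn below, the
converted callback has the following shape. This is only an illustrative
sketch; example_pmu_read() and example_read_hw_counter() are made-up
names, not code from this series:

static int example_pmu_read(struct perf_event *event)
{
	u64 val;

	/*
	 * A PMU whose hardware read can fail now propagates the error,
	 * which reaches perf_read(); a PMU that cannot fail keeps its
	 * old body and simply ends with "return 0;".
	 */
	if (example_read_hw_counter(event, &val))
		return -ENODATA;

	local64_set(&event->count, val);
	return 0;
}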

Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/alpha/kernel/perf_event.c           |  3 +-
 arch/arc/kernel/perf_event.c             |  3 +-
 arch/arm64/include/asm/hw_breakpoint.h   |  2 +-
 arch/arm64/kernel/hw_breakpoint.c        |  3 +-
 arch/metag/kernel/perf/perf_event.c      |  5 +--
 arch/mips/kernel/perf_event_mipsxx.c     |  3 +-
 arch/powerpc/include/asm/hw_breakpoint.h |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c      |  3 +-
 arch/powerpc/perf/core-book3s.c          | 11 +++---
 arch/powerpc/perf/core-fsl-emb.c         |  5 +--
 arch/powerpc/perf/hv-24x7.c              |  5 +--
 arch/powerpc/perf/hv-gpci.c              |  3 +-
 arch/s390/kernel/perf_cpum_cf.c          |  5 +--
 arch/s390/kernel/perf_cpum_sf.c          |  3 +-
 arch/sh/include/asm/hw_breakpoint.h      |  2 +-
 arch/sh/kernel/hw_breakpoint.c           |  3 +-
 arch/sparc/kernel/perf_event.c           |  2 +-
 arch/tile/kernel/perf_event.c            |  3 +-
 arch/x86/events/amd/ibs.c                |  2 +-
 arch/x86/events/amd/iommu.c              |  5 +--
 arch/x86/events/amd/uncore.c             |  3 +-
 arch/x86/events/core.c                   |  3 +-
 arch/x86/events/intel/bts.c              |  3 +-
 arch/x86/events/intel/cmt.c              |  4 ++-
 arch/x86/events/intel/cstate.c           |  3 +-
 arch/x86/events/intel/pt.c               |  3 +-
 arch/x86/events/intel/rapl.c             |  3 +-
 arch/x86/events/intel/uncore.c           |  3 +-
 arch/x86/events/intel/uncore.h           |  2 +-
 arch/x86/events/msr.c                    |  3 +-
 arch/x86/include/asm/hw_breakpoint.h     |  2 +-
 arch/x86/kernel/hw_breakpoint.c          |  3 +-
 arch/x86/kvm/pmu.h                       | 10 +++---
 drivers/bus/arm-cci.c                    |  3 +-
 drivers/bus/arm-ccn.c                    |  3 +-
 drivers/perf/arm_pmu.c                   |  3 +-
 include/linux/perf_event.h               |  6 ++--
 kernel/events/core.c                     | 60 ++++++++++++++++++++------------
 38 files changed, 120 insertions(+), 73 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 5c218aa..3bf8a60 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -520,11 +520,12 @@ static void alpha_pmu_del(struct perf_event *event, int flags)
 }
 
 
-static void alpha_pmu_read(struct perf_event *event)
+static int alpha_pmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
 	alpha_perf_event_update(event, hwc, hwc->idx, 0);
+	return 0;
 }
 
 
diff --git a/arch/arc/kernel/perf_event.c b/arch/arc/kernel/perf_event.c
index 2ce24e7..efbcc2d 100644
--- a/arch/arc/kernel/perf_event.c
+++ b/arch/arc/kernel/perf_event.c
@@ -116,9 +116,10 @@ static void arc_perf_event_update(struct perf_event *event,
 	local64_sub(delta, &hwc->period_left);
 }
 
-static void arc_pmu_read(struct perf_event *event)
+static int arc_pmu_read(struct perf_event *event)
 {
 	arc_perf_event_update(event, &event->hw, event->hw.idx);
+	return 0;
 }
 
 static int arc_pmu_cache_event(u64 config)
diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
index 9510ace..f82f21f 100644
--- a/arch/arm64/include/asm/hw_breakpoint.h
+++ b/arch/arm64/include/asm/hw_breakpoint.h
@@ -127,7 +127,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 
 extern int arch_install_hw_breakpoint(struct perf_event *bp);
 extern void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-extern void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern int hw_breakpoint_pmu_read(struct perf_event *bp);
 extern int hw_breakpoint_slots(int type);
 
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index 948b731..a4fe1c5 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -936,8 +936,9 @@ static int __init arch_hw_breakpoint_init(void)
 }
 arch_initcall(arch_hw_breakpoint_init);
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
+	return 0;
 }
 
 /*
diff --git a/arch/metag/kernel/perf/perf_event.c b/arch/metag/kernel/perf/perf_event.c
index 052cba2..128aa0a 100644
--- a/arch/metag/kernel/perf/perf_event.c
+++ b/arch/metag/kernel/perf/perf_event.c
@@ -360,15 +360,16 @@ static void metag_pmu_del(struct perf_event *event, int flags)
 	perf_event_update_userpage(event);
 }
 
-static void metag_pmu_read(struct perf_event *event)
+static int metag_pmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
 	/* Don't read disabled counters! */
 	if (hwc->idx < 0)
-		return;
+		return 0;
 
 	metag_pmu_event_update(event, hwc, hwc->idx);
+	return 0;
 }
 
 static struct pmu pmu = {
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index d3ba9f4..d64ed3e 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -507,7 +507,7 @@ static void mipspmu_del(struct perf_event *event, int flags)
 	perf_event_update_userpage(event);
 }
 
-static void mipspmu_read(struct perf_event *event)
+static int mipspmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
@@ -516,6 +516,7 @@ static void mipspmu_read(struct perf_event *event)
 		return;
 
 	mipspmu_event_update(event, hwc, hwc->idx);
+	return 0;
 }
 
 static void mipspmu_enable(struct pmu *pmu)
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index ac6432d..5218696 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -66,7 +66,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 						unsigned long val, void *data);
 int arch_install_hw_breakpoint(struct perf_event *bp);
 void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
 extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);
 
 extern struct pmu perf_ops_bp;
diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 9781c69..8a016ad 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -364,7 +364,8 @@ void flush_ptrace_hw_breakpoint(struct task_struct *tsk)
 	t->ptrace_bps[0] = NULL;
 }
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
 	/* TODO */
+	return 0;
 }
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 72c27b8..1019d1e 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1002,20 +1002,20 @@ static u64 check_and_compute_delta(u64 prev, u64 val)
 	return delta;
 }
 
-static void power_pmu_read(struct perf_event *event)
+static int power_pmu_read(struct perf_event *event)
 {
 	s64 val, delta, prev;
 
 	if (event->hw.state & PERF_HES_STOPPED)
-		return;
+		return 0;
 
 	if (!event->hw.idx)
-		return;
+		return 0;
 
 	if (is_ebb_event(event)) {
 		val = read_pmc(event->hw.idx);
 		local64_set(&event->hw.prev_count, val);
-		return;
+		return 0;
 	}
 
 	/*
@@ -1029,7 +1029,7 @@ static void power_pmu_read(struct perf_event *event)
 		val = read_pmc(event->hw.idx);
 		delta = check_and_compute_delta(prev, val);
 		if (!delta)
-			return;
+			return 0;
 	} while (local64_cmpxchg(&event->hw.prev_count, prev, val) != prev);
 
 	local64_add(delta, &event->count);
@@ -1049,6 +1049,7 @@ static void power_pmu_read(struct perf_event *event)
 		if (val < 1)
 			val = 1;
 	} while (local64_cmpxchg(&event->hw.period_left, prev, val) != prev);
+	return 0;
 }
 
 /*
diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
index 5d747b4..46d982e 100644
--- a/arch/powerpc/perf/core-fsl-emb.c
+++ b/arch/powerpc/perf/core-fsl-emb.c
@@ -176,12 +176,12 @@ static void write_pmlcb(int idx, unsigned long val)
 	isync();
 }
 
-static void fsl_emb_pmu_read(struct perf_event *event)
+static int fsl_emb_pmu_read(struct perf_event *event)
 {
 	s64 val, delta, prev;
 
 	if (event->hw.state & PERF_HES_STOPPED)
-		return;
+		return 0;
 
 	/*
 	 * Performance monitor interrupts come even when interrupts
@@ -198,6 +198,7 @@ static void fsl_emb_pmu_read(struct perf_event *event)
 	delta = (val - prev) & 0xfffffffful;
 	local64_add(delta, &event->count);
 	local64_sub(delta, &event->hw.period_left);
+	return 0;
 }
 
 /*
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 7b2ca16..41cd973 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1268,7 +1268,7 @@ static void update_event_count(struct perf_event *event, u64 now)
 	local64_add(now - prev, &event->count);
 }
 
-static void h_24x7_event_read(struct perf_event *event)
+static int h_24x7_event_read(struct perf_event *event)
 {
 	u64 now;
 	struct hv_24x7_request_buffer *request_buffer;
@@ -1289,7 +1289,7 @@ static void h_24x7_event_read(struct perf_event *event)
 		int ret;
 
 		if (__this_cpu_read(hv_24x7_txn_err))
-			return;
+			return 0;
 
 		request_buffer = (void *)get_cpu_var(hv_24x7_reqb);
 
@@ -1323,6 +1323,7 @@ static void h_24x7_event_read(struct perf_event *event)
 		now = h_24x7_get_value(event);
 		update_event_count(event, now);
 	}
+	return 0;
 }
 
 static void h_24x7_event_start(struct perf_event *event, int flags)
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 43fabb3..66c1ce7 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -191,12 +191,13 @@ static u64 h_gpci_get_value(struct perf_event *event)
 	return count;
 }
 
-static void h_gpci_event_update(struct perf_event *event)
+static int h_gpci_event_update(struct perf_event *event)
 {
 	s64 prev;
 	u64 now = h_gpci_get_value(event);
 	prev = local64_xchg(&event->hw.prev_count, now);
 	local64_add(now - prev, &event->count);
+	return 0;
 }
 
 static void h_gpci_event_start(struct perf_event *event, int flags)
diff --git a/arch/s390/kernel/perf_cpum_cf.c b/arch/s390/kernel/perf_cpum_cf.c
index 037c2a2..37fa78c 100644
--- a/arch/s390/kernel/perf_cpum_cf.c
+++ b/arch/s390/kernel/perf_cpum_cf.c
@@ -471,12 +471,13 @@ static int hw_perf_event_update(struct perf_event *event)
 	return err;
 }
 
-static void cpumf_pmu_read(struct perf_event *event)
+static int cpumf_pmu_read(struct perf_event *event)
 {
 	if (event->hw.state & PERF_HES_STOPPED)
-		return;
+		return 0;
 
 	hw_perf_event_update(event);
+	return 0;
 }
 
 static void cpumf_pmu_start(struct perf_event *event, int flags)
diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index fcc634c..87fd04a 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -1296,9 +1296,10 @@ static void hw_perf_event_update(struct perf_event *event, int flush_all)
 				    sampl_overflow, event_overflow);
 }
 
-static void cpumsf_pmu_read(struct perf_event *event)
+static int cpumsf_pmu_read(struct perf_event *event)
 {
 	/* Nothing to do ... updates are interrupt-driven */
+	return 0;
 }
 
 /* Activate sampling control.
diff --git a/arch/sh/include/asm/hw_breakpoint.h b/arch/sh/include/asm/hw_breakpoint.h
index ec9ad59..d3ad1bf 100644
--- a/arch/sh/include/asm/hw_breakpoint.h
+++ b/arch/sh/include/asm/hw_breakpoint.h
@@ -60,7 +60,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 
 int arch_install_hw_breakpoint(struct perf_event *bp);
 void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
 
 extern void arch_fill_perf_breakpoint(struct perf_event *bp);
 extern int register_sh_ubc(struct sh_ubc *);
diff --git a/arch/sh/kernel/hw_breakpoint.c b/arch/sh/kernel/hw_breakpoint.c
index 2197fc5..3a2e719 100644
--- a/arch/sh/kernel/hw_breakpoint.c
+++ b/arch/sh/kernel/hw_breakpoint.c
@@ -401,9 +401,10 @@ int __kprobes hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 	return hw_breakpoint_handler(data);
 }
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
 	/* TODO */
+	return 0;
 }
 
 int register_sh_ubc(struct sh_ubc *ubc)
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 710f327..ab118e9 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1131,7 +1131,7 @@ static void sparc_pmu_del(struct perf_event *event, int _flags)
 	local_irq_restore(flags);
 }
 
-static void sparc_pmu_read(struct perf_event *event)
+static int sparc_pmu_read(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int idx = active_event_index(cpuc, event);
diff --git a/arch/tile/kernel/perf_event.c b/arch/tile/kernel/perf_event.c
index 6394c1c..e18cdf2 100644
--- a/arch/tile/kernel/perf_event.c
+++ b/arch/tile/kernel/perf_event.c
@@ -734,9 +734,10 @@ static void tile_pmu_del(struct perf_event *event, int flags)
 /*
  * Propagate event elapsed time into the event.
  */
-static inline void tile_pmu_read(struct perf_event *event)
+static inline int tile_pmu_read(struct perf_event *event)
 {
 	tile_perf_event_update(event);
+	return 0;
 }
 
 /*
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index b26ee32..3de2ada 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -511,7 +511,7 @@ static void perf_ibs_del(struct perf_event *event, int flags)
 	perf_event_update_userpage(event);
 }
 
-static void perf_ibs_read(struct perf_event *event) { }
+static int perf_ibs_read(struct perf_event *event) { return 0; }
 
 PMU_FORMAT_ATTR(rand_en,	"config:57");
 PMU_FORMAT_ATTR(cnt_ctl,	"config:19");
diff --git a/arch/x86/events/amd/iommu.c b/arch/x86/events/amd/iommu.c
index b28200d..2bfcaaa 100644
--- a/arch/x86/events/amd/iommu.c
+++ b/arch/x86/events/amd/iommu.c
@@ -317,7 +317,7 @@ static void perf_iommu_start(struct perf_event *event, int flags)
 
 }
 
-static void perf_iommu_read(struct perf_event *event)
+static int perf_iommu_read(struct perf_event *event)
 {
 	u64 count = 0ULL;
 	u64 prev_raw_count = 0ULL;
@@ -335,13 +335,14 @@ static void perf_iommu_read(struct perf_event *event)
 	prev_raw_count =  local64_read(&hwc->prev_count);
 	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
 					count) != prev_raw_count)
-		return;
+		return 0;
 
 	/* Handling 48-bit counter overflowing */
 	delta = (count << COUNTER_SHIFT) - (prev_raw_count << COUNTER_SHIFT);
 	delta >>= COUNTER_SHIFT;
 	local64_add(delta, &event->count);
 
+	return 0;
 }
 
 static void perf_iommu_stop(struct perf_event *event, int flags)
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 65577f0..c84415d 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -73,7 +73,7 @@ static struct amd_uncore *event_to_amd_uncore(struct perf_event *event)
 	return NULL;
 }
 
-static void amd_uncore_read(struct perf_event *event)
+static int amd_uncore_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	u64 prev, new;
@@ -90,6 +90,7 @@ static void amd_uncore_read(struct perf_event *event)
 	delta = (new << COUNTER_SHIFT) - (prev << COUNTER_SHIFT);
 	delta >>= COUNTER_SHIFT;
 	local64_add(delta, &event->count);
+	return 0;
 }
 
 static void amd_uncore_start(struct perf_event *event, int flags)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d31735f..9e52a7b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1848,9 +1848,10 @@ static int __init init_hw_perf_events(void)
 }
 early_initcall(init_hw_perf_events);
 
-static inline void x86_pmu_read(struct perf_event *event)
+static inline int x86_pmu_read(struct perf_event *event)
 {
 	x86_perf_event_update(event);
+	return 0;
 }
 
 /*
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 982c9e3..6832d59 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -575,8 +575,9 @@ static int bts_event_init(struct perf_event *event)
 	return 0;
 }
 
-static void bts_event_read(struct perf_event *event)
+static int bts_event_read(struct perf_event *event)
 {
+	return 0;
 }
 
 static __init int bts_init(void)
diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index bd903ae..ef1000f 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1318,8 +1318,10 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 	return pos->parent;
 }
 
-static void intel_cmt_event_read(struct perf_event *event)
+static int intel_cmt_event_read(struct perf_event *event)
 {
+	/* To add support in next patches in series */
+	return -ENOTSUPP;
 }
 
 static inline void __intel_cmt_event_start(struct perf_event *event,
diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 3ca87b5..b96fed8 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -323,7 +323,7 @@ static inline u64 cstate_pmu_read_counter(struct perf_event *event)
 	return val;
 }
 
-static void cstate_pmu_event_update(struct perf_event *event)
+static int cstate_pmu_event_update(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	u64 prev_raw_count, new_raw_count;
@@ -337,6 +337,7 @@ static void cstate_pmu_event_update(struct perf_event *event)
 		goto again;
 
 	local64_add(new_raw_count - prev_raw_count, &event->count);
+	return 0;
 }
 
 static void cstate_pmu_event_start(struct perf_event *event, int mode)
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index c5047b8..c46f946 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1356,8 +1356,9 @@ static int pt_event_add(struct perf_event *event, int mode)
 	return ret;
 }
 
-static void pt_event_read(struct perf_event *event)
+static int pt_event_read(struct perf_event *event)
 {
+	return 0;
 }
 
 static void pt_event_destroy(struct perf_event *event)
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 0a535ce..37c3dff 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -411,9 +411,10 @@ static int rapl_pmu_event_init(struct perf_event *event)
 	return ret;
 }
 
-static void rapl_pmu_event_read(struct perf_event *event)
+static int rapl_pmu_event_read(struct perf_event *event)
 {
 	rapl_event_update(event);
+	return 0;
 }
 
 static ssize_t rapl_get_attr_cpumask(struct device *dev,
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index efca268..ad7d035 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -580,10 +580,11 @@ static void uncore_pmu_event_del(struct perf_event *event, int flags)
 	event->hw.last_tag = ~0ULL;
 }
 
-void uncore_pmu_event_read(struct perf_event *event)
+int uncore_pmu_event_read(struct perf_event *event)
 {
 	struct intel_uncore_box *box = uncore_event_to_box(event);
 	uncore_perf_event_update(box, event);
+	return 0;
 }
 
 /*
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
index ad986c1..818b0ef 100644
--- a/arch/x86/events/intel/uncore.h
+++ b/arch/x86/events/intel/uncore.h
@@ -345,7 +345,7 @@ struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu
 u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
 void uncore_pmu_start_hrtimer(struct intel_uncore_box *box);
 void uncore_pmu_cancel_hrtimer(struct intel_uncore_box *box);
-void uncore_pmu_event_read(struct perf_event *event);
+int uncore_pmu_event_read(struct perf_event *event);
 void uncore_perf_event_update(struct intel_uncore_box *box, struct perf_event *event);
 struct event_constraint *
 uncore_get_constraint(struct intel_uncore_box *box, struct perf_event *event);
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index 4bb3ec6..add5caf 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -170,7 +170,7 @@ static inline u64 msr_read_counter(struct perf_event *event)
 
 	return now;
 }
-static void msr_event_update(struct perf_event *event)
+static int msr_event_update(struct perf_event *event)
 {
 	u64 prev, now;
 	s64 delta;
@@ -188,6 +188,7 @@ static void msr_event_update(struct perf_event *event)
 		delta = sign_extend64(delta, 31);
 
 	local64_add(delta, &event->count);
+	return 0;
 }
 
 static void msr_event_start(struct perf_event *event, int flags)
diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index 6c98be8..a1c4ce00 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -59,7 +59,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 
 int arch_install_hw_breakpoint(struct perf_event *bp);
 void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
 void hw_breakpoint_pmu_unthrottle(struct perf_event *bp);
 
 extern void
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 8771766..e72ce6e 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -540,7 +540,8 @@ int hw_breakpoint_exceptions_notify(
 	return hw_breakpoint_handler(data);
 }
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
 	/* TODO */
+	return 0;
 }
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f96e1f9..46fd299 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -39,12 +39,14 @@ static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
 
 static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 {
-	u64 counter, enabled, running;
+	u64 counter, counter_tmp, enabled, running;
 
 	counter = pmc->counter;
-	if (pmc->perf_event)
-		counter += perf_event_read_value(pmc->perf_event,
-						 &enabled, &running);
+	if (pmc->perf_event) {
+		if (!perf_event_read_value(pmc->perf_event, &counter_tmp,
+					   &enabled, &running))
+			counter += counter_tmp;
+	}
 	/* FIXME: Scaling needed? */
 	return counter & pmc_bitmask(pmc);
 }
diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
index 8900823..2d27c70 100644
--- a/drivers/bus/arm-cci.c
+++ b/drivers/bus/arm-cci.c
@@ -1033,9 +1033,10 @@ static u64 pmu_event_update(struct perf_event *event)
 	return new_raw_count;
 }
 
-static void pmu_read(struct perf_event *event)
+static int pmu_read(struct perf_event *event)
 {
 	pmu_event_update(event);
+	return 0;
 }
 
 static void pmu_event_set_period(struct perf_event *event)
diff --git a/drivers/bus/arm-ccn.c b/drivers/bus/arm-ccn.c
index d1074d9..846ed8b 100644
--- a/drivers/bus/arm-ccn.c
+++ b/drivers/bus/arm-ccn.c
@@ -1145,9 +1145,10 @@ static void arm_ccn_pmu_event_del(struct perf_event *event, int flags)
 		hrtimer_cancel(&ccn->dt.hrtimer);
 }
 
-static void arm_ccn_pmu_event_read(struct perf_event *event)
+static int arm_ccn_pmu_event_read(struct perf_event *event)
 {
 	arm_ccn_pmu_event_update(event);
+	return 0;
 }
 
 static void arm_ccn_pmu_enable(struct pmu *pmu)
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index b37b572..aee7dff 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -163,10 +163,11 @@ u64 armpmu_event_update(struct perf_event *event)
 	return new_raw_count;
 }
 
-static void
+static int
 armpmu_read(struct perf_event *event)
 {
 	armpmu_event_update(event);
+	return 0;
 }
 
 static void
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9f388d4..9120640 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -366,7 +366,7 @@ struct pmu {
 	 * For sampling capable PMUs this will also update the software period
 	 * hw_perf_event::period_left field.
 	 */
-	void (*read)			(struct perf_event *event);
+	int (*read)			(struct perf_event *event);
 
 	/*
 	 * Group events scheduling is treated as a transaction, add
@@ -886,8 +886,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 extern void perf_pmu_migrate_context(struct pmu *pmu,
 				int src_cpu, int dst_cpu);
 extern u64 perf_event_read_local(struct perf_event *event);
-extern u64 perf_event_read_value(struct perf_event *event,
-				 u64 *enabled, u64 *running);
+extern int perf_event_read_value(struct perf_event *event,
+				 u64 *total, u64 *enabled, u64 *running);
 
 
 struct perf_sample_data {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e11a16a..059e5bb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2721,7 +2721,7 @@ static void __perf_event_sync_stat(struct perf_event *event,
 	 */
 	switch (event->state) {
 	case PERF_EVENT_STATE_ACTIVE:
-		event->pmu->read(event);
+		(void)event->pmu->read(event);
 		/* fall-through */
 
 	case PERF_EVENT_STATE_INACTIVE:
@@ -3473,6 +3473,7 @@ static void __perf_event_read(void *info)
 		return;
 
 	raw_spin_lock(&ctx->lock);
+
 	if (ctx->is_active) {
 		update_context_time(ctx);
 		update_cgrp_time_from_event(event);
@@ -3483,14 +3484,15 @@ static void __perf_event_read(void *info)
 		goto unlock;
 
 	if (!data->group) {
-		pmu->read(event);
-		data->ret = 0;
+		data->ret = pmu->read(event);
 		goto unlock;
 	}
 
 	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
 
-	pmu->read(event);
+	data->ret = pmu->read(event);
+	if (data->ret)
+		goto unlock;
 
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		update_event_times(sub);
@@ -3499,7 +3501,9 @@ static void __perf_event_read(void *info)
 			 * Use sibling's PMU rather than @event's since
 			 * sibling could be on different (eg: software) PMU.
 			 */
-			sub->pmu->read(sub);
+			data->ret = sub->pmu->read(sub);
+			if (data->ret)
+				goto unlock;
 		}
 	}
 
@@ -3520,6 +3524,7 @@ static inline u64 perf_event_count(struct perf_event *event)
  *   - either for the current task, or for this CPU
  *   - does not have inherit set, for inherited task events
  *     will not be local and we cannot read them atomically
+ *   - pmu::read cannot fail
  */
 u64 perf_event_read_local(struct perf_event *event)
 {
@@ -3552,7 +3557,7 @@ u64 perf_event_read_local(struct perf_event *event)
 	 * oncpu == -1).
 	 */
 	if (event->oncpu == smp_processor_id())
-		event->pmu->read(event);
+		(void) event->pmu->read(event);
 
 	val = local64_read(&event->count);
 	local_irq_restore(flags);
@@ -3611,7 +3616,6 @@ static int perf_event_read(struct perf_event *event, bool group)
 			update_event_times(event);
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
-
 	return ret;
 }
 
@@ -4219,18 +4223,22 @@ static int perf_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+int perf_event_read_value(struct perf_event *event,
+			  u64 *total, u64 *enabled, u64 *running)
 {
 	struct perf_event *child;
-	u64 total = 0;
 
+	int ret;
+	*total = 0;
 	*enabled = 0;
 	*running = 0;
 
 	mutex_lock(&event->child_mutex);
 
-	(void)perf_event_read(event, false);
-	total += perf_event_count(event);
+	ret = perf_event_read(event, false);
+	if (ret)
+		goto exit;
+	*total += perf_event_count(event);
 
 	*enabled += event->total_time_enabled +
 			atomic64_read(&event->child_total_time_enabled);
@@ -4238,14 +4246,17 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 			atomic64_read(&event->child_total_time_running);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
-		(void)perf_event_read(child, false);
-		total += perf_event_count(child);
+		ret = perf_event_read(child, false);
+		if (ret)
+			goto exit;
+		*total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
 	}
+exit:
 	mutex_unlock(&event->child_mutex);
 
-	return total;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
@@ -4342,9 +4353,11 @@ static int perf_read_one(struct perf_event *event,
 {
 	u64 enabled, running;
 	u64 values[4];
-	int n = 0;
+	int n = 0, ret;
 
-	values[n++] = perf_event_read_value(event, &enabled, &running);
+	ret = perf_event_read_value(event, &values[n++], &enabled, &running);
+	if (ret)
+		return ret;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
 		values[n++] = enabled;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
@@ -5625,7 +5638,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 		values[n++] = running;
 
 	if (leader != event)
-		leader->pmu->read(leader);
+		(void)leader->pmu->read(leader);
 
 	values[n++] = perf_event_count(leader);
 	if (read_format & PERF_FORMAT_ID)
@@ -5638,7 +5651,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 
 		if ((sub != event) &&
 		    (sub->state == PERF_EVENT_STATE_ACTIVE))
-			sub->pmu->read(sub);
+			(void)sub->pmu->read(sub);
 
 		values[n++] = perf_event_count(sub);
 		if (read_format & PERF_FORMAT_ID)
@@ -7349,8 +7362,9 @@ void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 	preempt_enable_notrace();
 }
 
-static void perf_swevent_read(struct perf_event *event)
+static int perf_swevent_read(struct perf_event *event)
 {
+	return 0;
 }
 
 static int perf_swevent_add(struct perf_event *event, int flags)
@@ -8263,7 +8277,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
 		return HRTIMER_NORESTART;
 
-	event->pmu->read(event);
+	(void)event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs = get_irq_regs();
@@ -8378,9 +8392,10 @@ static void cpu_clock_event_del(struct perf_event *event, int flags)
 	cpu_clock_event_stop(event, flags);
 }
 
-static void cpu_clock_event_read(struct perf_event *event)
+static int cpu_clock_event_read(struct perf_event *event)
 {
 	cpu_clock_event_update(event);
+	return 0;
 }
 
 static int cpu_clock_event_init(struct perf_event *event)
@@ -8455,13 +8470,14 @@ static void task_clock_event_del(struct perf_event *event, int flags)
 	task_clock_event_stop(event, PERF_EF_UPDATE);
 }
 
-static void task_clock_event_read(struct perf_event *event)
+static int task_clock_event_read(struct perf_event *event)
 {
 	u64 now = perf_clock();
 	u64 delta = now - event->ctx->timestamp;
 	u64 time = event->ctx->time + delta;
 
 	task_clock_event_update(event, time);
+	return 0;
 }
 
 static int task_clock_event_init(struct perf_event *event)
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (27 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events David Carrillo-Cisneros
                   ` (16 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

First part of intel_cmt_event_read: error conditions and a placeholder
for the functionality added in upcoming patches in this series.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index ef1000f..f5ab48e 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1320,6 +1320,31 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 
 static int intel_cmt_event_read(struct perf_event *event)
 {
+	struct monr *monr = monr_from_event(event);
+
+	/*
+	 * preemption disabled since called holding
+	 * event's ctx->lock raw_spin_lock.
+	 */
+	WARN_ON_ONCE(!preempt_count());
+
+	/* terminated monrs are zombies and must not be read. */
+	if (WARN_ON_ONCE(monr->flags & CMT_MONR_ZOMBIE))
+		return -ENXIO;
+
+	/*
+	 * Only event parent can return a value, everyone else share its
+	 * rmid and therefore doesn't track occupancy independently.
+	 */
+	if (event->parent) {
+		local64_set(&event->count, 0);
+		return 0;
+	}
+
+	if (event->attach_state & PERF_ATTACH_TASK) {
+		/* To add support in next patches in series */
+		return -ENOTSUPP;
+	}
 	/* To add support in next patches in series */
 	return -ENOTSUPP;
 }
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (28 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events David Carrillo-Cisneros
                   ` (15 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Reading CMT/MBM task events in intel_cmt poses a challenge since it
requires reading from multiple sockets (usually accomplished with
an IPI) while called with interrupts disabled.

The current upstream driver avoids the problematic read with
interrupts disabled by making perf_event_read() a dummy (no-op) for
llc_occupancy task events. The actual read is performed in
perf_event_count() whenever perf_event_count() is called with
interrupts enabled. This works but changes the expected behavior of
perf_event_read() and perf_event_count().

This patch follows a different approach: it performs asynchronous reads
of all remote packages and waits until either the reads complete or a
deadline expires. It returns an error if an IPI does not complete
on time.

This asynchronous approach has advantages:
  1) It does not alter perf_event_count().
  2) perf_event_read() performs a real read for all types of events.
  3) Reads in all packages are executed in parallel. Parallel reads are
  especially advantageous because reading CMT/MBM events is slow
  (it requires a sequential read and write to two MSRs). I measured an
  llc_occupancy read on my HSW system to take ~1250 cycles.
  Parallel reads of all caches will become a bigger advantage with
  upcoming larger processors (up to 8 packages) and when CMT support
  for L2 is rolled out, since task events will require a read of all
  L2 cache units.

This patch also introduces struct cmt_csd and a per-package array of
cmt_csd's (one per rmid). This array is used to control potentially
concurrent reads of each rmid's event.
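
In outline, the read path added below runs in two phases. The condensed
sketch here only shows the shape; for_each_remote_package(),
package_cpu() and for_each_issued_ccsd() are placeholder helpers, not
real kernel APIs:

	/* Phase 1: fire one asynchronous IPI per remote package. */
	for_each_remote_package(pkgd) {
		ccsd = &pkgd->ccsds[read_rmid];
		/* Only one issuer per rmid at a time. */
		if (atomic_inc_return(&ccsd->on_read) > 1)
			return -EBUSY;
		smp_call_function_single_async(package_cpu(pkgd), &ccsd->csd);
	}

	/* Deadline taken after issuing, so every package gets the full wait. */
	deadline = get_jiffies_64() + msecs_to_jiffies(CMT_IPI_WAIT_TIME);

	/* Phase 2: poll each issued cmt_csd until done or deadline. */
	for_each_issued_ccsd(ccsd) {
		while (atomic_read(&ccsd->on_read) &&
		       time_before64(get_jiffies_64(), deadline))
			cpu_relax();
		if (atomic_read(&ccsd->on_read))
			return -EBUSY;	/* the IPI took too long */
		count += ccsd->value;
	}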

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 206 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h |  14 +++
 2 files changed, 217 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f5ab48e..f9195ec 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -8,6 +8,12 @@
 #include "cmt.h"
 #include "../perf_event.h"
 
+#define RMID_VAL_UNAVAIL	BIT_ULL(62)
+#define RMID_VAL_ERROR		BIT_ULL(63)
+
+#define MSR_IA32_QM_CTR		0x0c8e
+#define MSR_IA32_QM_EVTSEL	0x0c8d
+
 #define QOS_L3_OCCUP_EVENT_ID	BIT_ULL(0)
 #define QOS_EVENT_MASK		QOS_L3_OCCUP_EVENT_ID
 
@@ -1229,6 +1235,41 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
 	return false;
 }
 
+/* Must be called in a cpu in rmid's package. */
+static int cmt_rmid_read(u32 rmid, u64 *val)
+{
+	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
+	rdmsrl(MSR_IA32_QM_CTR, *val);
+
+	/* Ignore this reading on error states and do not update the value. */
+	if (WARN_ON_ONCE(*val & RMID_VAL_ERROR))
+		return -EINVAL;
+	if (WARN_ON_ONCE(*val & RMID_VAL_UNAVAIL))
+		return -ENODATA;
+
+	return 0;
+}
+
+/* time to wait before time out rmid read IPI */
+#define CMT_IPI_WAIT_TIME	100	/* ms */
+
+static void smp_call_rmid_read(void *data)
+{
+	struct cmt_csd *ccsd = (struct cmt_csd *)data;
+
+	ccsd->ret = cmt_rmid_read(ccsd->rmid, &ccsd->value);
+
+	/*
+	 * smp_call_function_single_async must have cleared csd.flags
+	 * before invoking func.
+	 */
+	WARN_ON_ONCE(ccsd->csd.flags);
+
+	/* ensure values are stored before clearing on_read. */
+	barrier();
+	atomic_set(&ccsd->on_read, 0);
+}
+
 static struct pmu intel_cmt_pmu;
 
 /* Try to find a monr with same target, otherwise create new one. */
@@ -1318,9 +1359,145 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 	return pos->parent;
 }
 
+/* Issue reads to CPUs in remote packages. */
+static int issue_read_remote_pkgs(struct monr *monr,
+				  struct cmt_csd **issued_ccsds,
+				  u32 *local_rmid)
+{
+	struct cmt_csd *ccsd;
+	struct pmonr *pmonr;
+	struct pkg_data *pkgd = NULL;
+	union pmonr_rmids rmids;
+	int err = 0, read_cpu;
+	u16 p, local_pkgid = topology_logical_package_id(smp_processor_id());
+
+	/* Issue remote packages. */
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+
+		pmonr = pkgd_pmonr(pkgd, monr);
+		/* Retrieve rmid and check state without acquiring pkg locks. */
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		/* Skip Off and Unused states. */
+		if (rmids.sched_rmid == INVALID_RMID)
+			continue;
+		/*
+		 * pmonrs in Dep_{Idle,Dirty} states have run without
+		 * their own rmid and would report wrong occupancy.
+		 */
+		if (rmids.read_rmid == INVALID_RMID) {
+			err = -EBUSY;
+			goto exit;
+		}
+		p = pkgd->pkgid;
+		if (p == local_pkgid) {
+			*local_rmid = rmids.read_rmid;
+			continue;
+		}
+		ccsd = &pkgd->ccsds[rmids.read_rmid];
+		/*
+		 * Reads of remote packages are only required for task events.
+		 * pmu->read in task events is serialized by task_ctx->lock in
+		 * perf generic code. Events with same task target share rmid
+		 * and task_ctx->lock, so there is no need to support
+		 * concurrent remote reads to same RMID.
+		 *
+		 * ccsd->on_read could be not zero if a read expired before,
+		 * in that rare case, fail now and hope next time the ongoing
+		 * IPI will have completed.
+		 */
+		if (atomic_inc_return(&ccsd->on_read) > 1) {
+			err = -EBUSY;
+			goto exit;
+		}
+		issued_ccsds[p] = ccsd;
+		read_cpu = cpumask_any(topology_core_cpumask(pkgd->work_cpu));
+		err = smp_call_function_single_async(read_cpu, &ccsd->csd);
+		if (WARN_ON_ONCE(err))
+			goto exit;
+	}
+exit:
+	rcu_read_unlock();
+
+	return err;
+}
+
+/*
+ * Fail if the IPI hasn't finished by @deadline if @count != NULL.
+ * @count == NULL signals no update and therefore no reason to wait.
+ */
+static int read_issued_pkgs(struct cmt_csd **issued_ccsds,
+			    u64 deadline, u64 *count)
+{
+	struct cmt_csd *ccsd;
+	int p;
+
+	for (p = 0; p < CMT_MAX_NR_PKGS; p++) {
+		ccsd = issued_ccsds[p];
+		if (!ccsd)
+			continue;
+
+		/* A smp_cond_acquire on ccsd->on_read and time. */
+		while (atomic_read(&ccsd->on_read) &&
+				 time_before64(get_jiffies_64(), deadline))
+			cpu_relax();
+
+		/*
+		 * guarantee that ccsd->ret and ccsd->value are read after
+		 * read or deadline.
+		 */
+		smp_rmb();
+
+		/* last IPI took unusually long. */
+		if (WARN_ON_ONCE(atomic_read(&ccsd->on_read)))
+			return -EBUSY;
+		/* ccsd->on_read is always cleared after csd.flags. */
+		if (WARN_ON_ONCE(ccsd->csd.flags))
+			return -EBUSY;
+		if (ccsd->ret)
+			return ccsd->ret;
+
+		*count += ccsd->value;
+	}
+
+	return 0;
+}
+
+static int read_all_pkgs(struct monr *monr, int wait_time_ms, u64 *count)
+{
+	struct cmt_csd *issued_ccsds[CMT_MAX_NR_PKGS];
+	int err = 0;
+	u32 local_rmid = INVALID_RMID;
+	u64 deadline, val;
+
+	*count = 0;
+	memset(issued_ccsds, 0, CMT_MAX_NR_PKGS * sizeof(*issued_ccsds));
+	err = issue_read_remote_pkgs(monr, issued_ccsds, &local_rmid);
+	if (err)
+		return err;
+	/*
+	 * Save deadline after issuing reads so that all packages have at
+	 * least wait_time_ms to complete.
+	 */
+	deadline = get_jiffies_64() + msecs_to_jiffies(wait_time_ms);
+
+	/* Read local package. */
+	if (local_rmid != INVALID_RMID) {
+		err = cmt_rmid_read(local_rmid, &val);
+		if (WARN_ON_ONCE(err))
+			return err;
+		*count += val;
+	}
+
+	return read_issued_pkgs(issued_ccsds, deadline, count);
+}
+
 static int intel_cmt_event_read(struct perf_event *event)
 {
 	struct monr *monr = monr_from_event(event);
+	u64 count;
+	u16 pkgid = topology_logical_package_id(smp_processor_id());
+	int err;
 
 	/*
 	 * preemption disabled since called holding
@@ -1342,11 +1519,17 @@ static int intel_cmt_event_read(struct perf_event *event)
 	}
 
 	if (event->attach_state & PERF_ATTACH_TASK) {
+		/* It's a task event. */
+		err = read_all_pkgs(monr, CMT_IPI_WAIT_TIME, &count);
+	} else {
 		/* To add support in next patches in series */
 		return -ENOTSUPP;
 	}
-	/* To add support in next patches in series */
-	return -ENOTSUPP;
+	if (err)
+		return err;
+	local64_set(&event->count, count);
+
+	return 0;
 }
 
 static inline void __intel_cmt_event_start(struct perf_event *event,
@@ -1566,15 +1749,17 @@ void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css)
 
 static void free_pkg_data(struct pkg_data *pkg_data)
 {
+	kfree(pkg_data->ccsds);
 	kfree(pkg_data);
 }
 
 /* Init pkg_data for @cpu 's package. */
 static struct pkg_data *alloc_pkg_data(int cpu)
 {
+	struct cmt_csd *ccsd;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	struct pkg_data *pkgd;
-	int numa_node = cpu_to_node(cpu);
+	int r, ccsds_nr_bytes, numa_node = cpu_to_node(cpu);
 	u16 pkgid = topology_logical_package_id(cpu);
 
 	if (pkgid >= CMT_MAX_NR_PKGS) {
@@ -1618,6 +1803,21 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 	lockdep_set_class(&pkgd->lock, &lock_keys[pkgid]);
 #endif
 
+	ccsds_nr_bytes = (pkgd->max_rmid + 1) * sizeof(*(pkgd->ccsds));
+	pkgd->ccsds = kzalloc_node(ccsds_nr_bytes, GFP_KERNEL, numa_node);
+	if (!pkgd->ccsds) {
+		free_pkg_data(pkgd);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	for (r = 0; r <= pkgd->max_rmid; r++) {
+		ccsd = &pkgd->ccsds[r];
+		ccsd->rmid = r;
+		ccsd->csd.func = smp_call_rmid_read;
+		ccsd->csd.info = ccsd;
+		__set_bit(r, pkgd->free_rmids);
+	}
+
 	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);
 
 	return pkgd;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 1e40e6b..8bb43bd 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -191,6 +191,19 @@ struct pmonr {
 	enum pmonr_state			state;
 };
 
+/**
+ * struct cmt_csd - data for the async IPI call that reads rmids on remote packages.
+ *
+ * One per rmid per package. One issuer at a time. Readers wait on @on_read.
+ */
+struct cmt_csd {
+	struct call_single_data csd;
+	atomic_t		on_read;
+	u64			value;
+	int			ret;
+	u32			rmid;
+};
+
 /*
  * Compile constant required for bitmap macros.
  * Broadwell EP has 2 rmids per logical core, use twice as many as upper bound.
@@ -237,6 +250,7 @@ struct pkg_data {
 	unsigned int		work_cpu;
 	u32			max_rmid;
 	u16			pkgid;
+	struct cmt_csd		*ccsds;
 };
 
 /**
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (29 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags David Carrillo-Cisneros
                   ` (14 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

An llc_occupancy read for a cgroup event must read llc_occupancy for all
monrs in or below the event's monr.

The cgroup's monr's pmonr must have a valid rmid for the read to be
meaningful. Descendant pmonrs that do not have a valid read_rmid are
skipped since their occupancy is already included in the occupancy of
their Lowest Monitored Ancestor (lma) pmonr.
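
Condensed, the per-package accumulation added below looks roughly like
this; apart from pkgd_pmonr(), the helpers are placeholder names, not
the real functions in the patch:

	/* Occupancy of the subtree rooted at the event's monr, one package. */
	total = rmid_llc_occupancy(root_read_rmid);
	for_each_descendant_monr(pos, root_monr) {	/* pre-order walk */
		pmonr = pkgd_pmonr(pkgd, pos);
		/*
		 * Skip pmonrs without a valid read_rmid: their occupancy
		 * is already charged to their lma's rmid.
		 */
		if (!pmonr_has_valid_read_rmid(pmonr))
			continue;
		total += rmid_llc_occupancy(pmonr_read_rmid(pmonr));
	}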

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f9195ec..275d128 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1359,6 +1359,110 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 	return pos->parent;
 }
 
+static int read_subtree_rmids(u32 root_r, unsigned long *rmids_bm, u64 *total)
+{
+	u64 val;
+	int r, err;
+
+	/* first iteration reads root's rmid. */
+	r = root_r;
+	do {
+		if (r != INVALID_RMID) {
+			err = cmt_rmid_read(r, &val);
+			if (WARN_ON_ONCE(err))
+				return err;
+			(*total) += val;
+		}
+		if (!rmids_bm)
+			break;
+		if (root_r != INVALID_RMID) {
+			root_r = INVALID_RMID;
+			r = find_first_bit(rmids_bm, CMT_MAX_NR_RMIDS);
+		} else {
+			r = find_next_bit(rmids_bm, CMT_MAX_NR_RMIDS, r + 1);
+		}
+	} while (r < CMT_MAX_NR_RMIDS);
+
+	return 0;
+}
+
+/**
+ * pmonr_read_subtree() - Read occupancy for a pmonr subtree.
+ *
+ * Read and add occupancy for all pmonrs in the subtree rooted at
+ * @root_pmonr->monr and in @root_pmonr->pkgd package.
+ * Fast-path for common case of a leaf pmonr, otherwise, a best effort
+ * two-stages read:
+ *   1) read all rmids in subtree with pkgd->lock held, and
+ *   2) read and add occupancy for rmids in previous stage, without locks held.
+ */
+static int pmonr_read_subtree(struct pmonr *root_pmonr, u64 *total)
+{
+	struct monr *pos;
+	struct pkg_data *pkgd = root_pmonr->pkgd;
+	struct pmonr *pmonr;
+	union pmonr_rmids rmids;
+	int err = 0, root_r;
+	unsigned long flags, *rmids_bm = NULL;
+
+	*total = 0;
+	rmids.value = atomic64_read(&root_pmonr->atomic_rmids);
+	/*
+	 * The root of the subtree must be in Unused or Active state for the
+	 * read to be meaningful (Unused pmonrs have zero occupancy), yet its
+	 * descendants can be in Dep_{Idle,Dirty} since those states use their
+	 * Lowest Monitored Ancestor's rmid.
+	 */
+	if (rmids.sched_rmid == INVALID_RMID) {
+		/* Unused state. */
+		if (rmids.read_rmid == 0)
+			root_r = INVALID_RMID;
+		else
+		/* Off state. */
+			return -ENODATA;
+	} else {
+		/* Dep_{Idle, Dirty} state. */
+		if (rmids.sched_rmid != rmids.read_rmid)
+			return -ENODATA;
+		/* Active state */
+		root_r = rmids.read_rmid;
+	}
+	/*
+	 * Lock-less fast-path for common case of childless monr. No need
+	 * to lock for list_empty since either path leads to a read that is
+	 * correct at some time close to the moment the check happens.
+	 */
+	if (list_empty(&root_pmonr->monr->children))
+		goto read_rmids;
+
+	rmids_bm = kzalloc(CMT_MAX_NR_RMIDS_BYTES, GFP_ATOMIC);
+	if (!rmids_bm)
+		return -ENOMEM;
+
+	/* Lock to protect against changes in the pmonr hierarchy. */
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+	/* Starts on subtree's first child. */
+	pos = root_pmonr->monr;
+	while ((pos = monr_next_descendant_pre(pos, root_pmonr->monr))) {
+		/* protected by pkgd lock. */
+		pmonr = pkgd_pmonr(pkgd, pos);
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		/* Exclude all pmonrs not in Active or Dep_Dirty states. */
+		if (rmids.sched_rmid == INVALID_RMID ||
+		    rmids.read_rmid == INVALID_RMID)
+			continue;
+		__set_bit(rmids.read_rmid, rmids_bm);
+	}
+
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+read_rmids:
+	err = read_subtree_rmids(root_r, rmids_bm, total);
+	kfree(rmids_bm);
+
+	return err;
+}
+
 /* Issue reads to CPUs in remote packages. */
 static int issue_read_remote_pkgs(struct monr *monr,
 				  struct cmt_csd **issued_ccsds,
@@ -1522,8 +1626,13 @@ static int intel_cmt_event_read(struct perf_event *event)
 		/* It's a task event. */
 		err = read_all_pkgs(monr, CMT_IPI_WAIT_TIME, &count);
 	} else {
-		/* To add support in next patches in series */
-		return -ENOTSUPP;
+		struct pmonr *pmonr;
+
+		/* It's either a cgroup or a cpu event. */
+		rcu_read_lock();
+		pmonr = rcu_dereference(monr->pmonrs[pkgid]);
+		err = pmonr_read_subtree(pmonr, &count);
+		rcu_read_unlock();
 	}
 	if (err)
 		return err;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (30 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt David Carrillo-Cisneros
                   ` (13 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Introduce two new PERF_EV_CAP_READ capabilities to save unnecessary IPIs.
Since the PMU hw keeps track of rmids at all times, both capabilities in
this patch allow events to be read even when inactive.

These capabilities also remove the need to read the value of an event on
pmu->stop (already baked into previous patches).
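
On the PMU side, usage is a one-time choice at event init; this mirrors
how the CMT patch later in this series wires it up:

	/*
	 * Task events can be read on any CPU in any package; CPU and
	 * cgroup events only within event->cpu's package. Both can be
	 * read while inactive, since the hardware keeps tracking rmids.
	 */
	if (event->cpu < 0)
		event->event_caps |= PERF_EV_CAP_READ_ANY_PKG;
	else
		event->event_caps |= PERF_EV_CAP_READ_ANY_CPU_PKG;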

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 16 +++++++--
 kernel/events/core.c       | 84 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 75 insertions(+), 25 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9120640..72fe105 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -510,13 +510,23 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
 
 /*
  * Event capabilities. For event_caps and groups caps.
+ * Only one of the PERF_EV_CAP_READ_* can be set at a time.
  *
- * PERF_EV_CAP_SOFTWARE: Is a software event.
- * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
- * from any CPU in the package where it is active.
+ * PERF_EV_CAP_SOFTWARE: A software event.
+ *
+ * PERF_EV_CAP_READ_ACTIVE_PKG: An event readable from any CPU in the
+ * package where it is active.
+ *
+ * PERF_EV_CAP_READ_ANY_CPU_PKG: A CPU (or cgroup) event readable from any
+ * CPU in its event->cpu's package, even if inactive.
+ *
+ * PERF_EV_CAP_READ_ANY_PKG: An event readable from any CPU in any package,
+ * even if inactive.
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_READ_ANY_CPU_PKG	BIT(2)
+#define PERF_EV_CAP_READ_ANY_PKG	BIT(3)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 059e5bb..77afd68 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3432,22 +3432,55 @@ static void perf_event_enable_on_exec(int ctxn)
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
+	bool read_inactive;
 	int ret;
 };
 
-static int find_cpu_to_read(struct perf_event *event, int local_cpu)
+static int find_cpu_to_read(struct perf_event *event, bool *read_inactive)
 {
-	int event_cpu = event->oncpu;
+	bool active = event->state == PERF_EVENT_STATE_ACTIVE;
+	int local_cpu, event_cpu = active ? event->oncpu : event->cpu;
 	u16 local_pkg, event_pkg;
 
+	/* Do not read if event is neither Active nor Inactive. */
+	if (event->state <= PERF_EVENT_STATE_OFF) {
+		*read_inactive = false;
+		return -1;
+	}
+
+	local_cpu = get_cpu();
+	if (event->group_caps & PERF_EV_CAP_READ_ANY_PKG) {
+		*read_inactive = true;
+		event_cpu = local_cpu;
+		goto exit;
+	}
+
+	/* Neither active nor a CPU or cgroup event. */
+	if (event_cpu < 0) {
+		*read_inactive = false;
+		goto exit;
+	}
+
+	*read_inactive = event->group_caps & PERF_EV_CAP_READ_ANY_CPU_PKG;
+	if (!active && !*read_inactive)
+		goto exit;
+
+	/* Could be Inactive and have PERF_EV_CAP_READ_ANY_CPU_PKG. */
 	if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
 		event_pkg =  topology_physical_package_id(event_cpu);
 		local_pkg =  topology_physical_package_id(local_cpu);
 
 		if (event_pkg == local_pkg)
-			return local_cpu;
+			event_cpu = local_cpu;
 	}
 
+exit:
+	/*
+	 *  __perf_event_read tolerates change of local cpu.
+	 * There is no need to keep CPU pinned.
+	 */
+	put_cpu();
+
 	return event_cpu;
 }
 
@@ -3461,15 +3494,16 @@ static void __perf_event_read(void *info)
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct pmu *pmu = event->pmu;
+	bool active, read_inactive = data->read_inactive;
 
 	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu.  If not it has been
-	 * scheduled out before the smp call arrived.  In that case
-	 * event->count would have been updated to a recent sample
-	 * when the event was scheduled out.
+	 * If this is a task context and !read_inactive, we need to check
+	 * whether it is the current task context of this cpu.
+	 * If not it has been scheduled out before the smp call arrived.
+	 * In that case event->count would have been updated to a recent
+	 * sample when the event was scheduled out.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
+	if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
 		return;
 
 	raw_spin_lock(&ctx->lock);
@@ -3480,7 +3514,13 @@ static void __perf_event_read(void *info)
 	}
 
 	update_event_times(event);
-	if (event->state != PERF_EVENT_STATE_ACTIVE)
+
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		goto unlock;
+
+	/* If event->state > Off, then it's either Active or Inactive. */
+	active = event->state == PERF_EVENT_STATE_ACTIVE;
+	if (!active && !read_inactive)
 		goto unlock;
 
 	if (!data->group) {
@@ -3496,7 +3536,12 @@ static void __perf_event_read(void *info)
 
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		update_event_times(sub);
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+		/*
+		 * Since leader is Active, siblings are either Active or
+		 * Inactive.
+		 */
+		active = sub->state == PERF_EVENT_STATE_ACTIVE;
+		if (active || read_inactive) {
 			/*
 			 * Use sibling's PMU rather than @event's since
 			 * sibling could be on different (eg: software) PMU.
@@ -3567,23 +3612,18 @@ u64 perf_event_read_local(struct perf_event *event)
 
 static int perf_event_read(struct perf_event *event, bool group)
 {
-	int ret = 0, cpu_to_read, local_cpu;
+	bool read_inactive;
+	int ret = 0, cpu_to_read;
 
-	/*
-	 * If event is enabled and currently active on a CPU, update the
-	 * value in the event structure:
-	 */
-	if (event->state == PERF_EVENT_STATE_ACTIVE) {
+	cpu_to_read = find_cpu_to_read(event, &read_inactive);
+
+	if (cpu_to_read >= 0) {
 		struct perf_read_data data = {
 			.event = event,
 			.group = group,
+			.read_inactive = read_inactive,
 			.ret = 0,
 		};
-
-		local_cpu = get_cpu();
-		cpu_to_read = find_cpu_to_read(event, local_cpu);
-		put_cpu();
-
 		/*
 		 * Purposely ignore the smp_call_function_single() return
 		 * value.
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (31 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION David Carrillo-Cisneros
                   ` (12 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Use new flags in CMT pmu.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 275d128..614b2f4 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1733,6 +1733,15 @@ static int intel_cmt_event_init(struct perf_event *event)
 
 	INIT_LIST_HEAD(&event->hw.cmt_list);
 
+	/*
+	 * Task events can be read in any CPU in any package. CPU events
+	 * only in CPU's package. Both can read even if inactive.
+	 */
+	if (event->cpu < 0)
+		event->event_caps |= PERF_EV_CAP_READ_ANY_PKG;
+	else
+		event->event_caps |= PERF_EV_CAP_READ_ANY_CPU_PKG;
+
 	mutex_lock(&cmt_mutex);
 
 	err = mon_group_setup_event(event);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (32 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt David Carrillo-Cisneros
                   ` (11 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The generic code handles the cgroup hierarchy by adding to the PMU the
events of all the ancestor cgroups of the cgroup to read.
This approach is incompatible with the CMT hw, which only allows one rmid
per virtual core at a time. CMT's PMU works around this limitation by
internally maintaining the hierarchical dependency between monitored
cgroups (the monr hierarchy).

The flag introduced in this patch signals to the generic code that this
cgroup event does not need its ancestors' events to be added recursively.
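For illustration only, a minimal sketch of the two matching behaviours
(hedged: this is not a hunk of the patch; the helper name is made up, the
flag and field names follow this series, and cgroup_is_descendant() is the
usual kernel helper, assuming kernel context):

  static bool cgrp_event_matches(struct perf_event *event,
				 struct cgroup *running_cgrp)
  {
	struct cgroup *ev_cgrp = event->cgrp->css.cgroup;

	/* The CMT PMU tracks descendants itself: exact match only. */
	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
		return running_cgrp == ev_cgrp;

	/* Default: an event on /A is also enabled for /A/B, /A/B/C, ... */
	return cgroup_is_descendant(running_cgrp, ev_cgrp);
  }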

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 5 +++++
 kernel/events/core.c       | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 72fe105..3b1d542 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -522,11 +522,16 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  *
  * PERF_EV_CAP_READ_ANY_PKG: An event readable from any CPU in any package,
  * even if inactive.
+ *
+ * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
+ * cgroup scoping. It does not need to be enabled for all of its descendant
+ * cgroups.
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
 #define PERF_EV_CAP_READ_ANY_CPU_PKG	BIT(2)
 #define PERF_EV_CAP_READ_ANY_PKG	BIT(3)
+#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(4)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 77afd68..4f43c75 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -590,6 +590,9 @@ perf_cgroup_match(struct perf_event *event)
 	if (!cpuctx->cgrp)
 		return false;
 
+	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
+		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
+
 	/*
 	 * Cgroup scoping is recursive.  An event enabled for a cgroup is
 	 * also enabled for all its descendant cgroups.  If @cpuctx's
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (33 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes David Carrillo-Cisneros
                   ` (10 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Use newly added flag.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 614b2f4..194038b 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1734,6 +1734,14 @@ static int intel_cmt_event_init(struct perf_event *event)
 	INIT_LIST_HEAD(&event->hw.cmt_list);
 
 	/*
+	 * CMT hw only allows one rmid per core at a time and therefore
+	 * it is not compatible with the way generic code handles cgroup
+	 * dependencies.
+	 */
+	if (event->cgrp)
+		event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
+
+	/*
 	 * Task events can be read in any CPU in any package. CPU events
 	 * only in CPU's package. Both can read even if inactive.
 	 */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (34 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup David Carrillo-Cisneros
                   ` (9 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Allow architectures to define additional attributes for the perf cgroup.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 4 ++++
 kernel/events/core.c       | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3b1d542..26e6ee3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1412,4 +1412,8 @@ int perf_event_exit_cpu(unsigned int cpu);
 # define perf_cgroup_arch_css_offline(css) do { } while (0)
 #endif
 
+#ifndef PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f43c75..b6ca765 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10849,5 +10849,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_offline	= perf_cgroup_css_offline,
 	.css_free	= perf_cgroup_css_free,
 	.attach		= perf_cgroup_attach,
+	/* Expand architecture specific attributes. */
+	PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (35 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation David Carrillo-Cisneros
                   ` (8 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Expose the attribute intel_cmt.cont_monitoring to perf cgroups using
the newly introduced hook PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS.

The format of the new attribute is a semicolon-separated list of per-package
hexadecimal flags; missing or empty entries are assigned zero.

  echo "1;2" > g1/intel_cmt.cont_monitoring

This implies 0x1 for pkg 0, 0x2 for pkg 1, and 0x0 for all other pkgs.

This patch introduces the basic ideas of per-package uflags through
a cgroup attribute. The format can be changed to match the Intel CAT
schemata file format once that is settled.
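As a hedged illustration of the format from user space (not part of the
patch; the path and helper are hypothetical), the string can be built the
same way cmt_monitoring_seq_show() below prints it:

  #include <stdio.h>

  /* Write one hex uflags value per package, ';'-separated. */
  static int write_cont_monitoring(const char *path,
				   const unsigned int *uflags, int nr_pkgs)
  {
	FILE *f = fopen(path, "w");
	int p;

	if (!f)
		return -1;
	for (p = 0; p < nr_pkgs; p++)
		fprintf(f, "%x%c", uflags[p], p + 1 < nr_pkgs ? ';' : '\n');
	return fclose(f);
  }

With uflags = {0x1, 0x2} this writes "1;2", matching the example above.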

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c       | 173 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  10 +++
 2 files changed, 183 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 194038b..3ade923 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -2323,4 +2323,177 @@ inline void __intel_cmt_no_event_sched_in(void)
 #endif
 }
 
+#ifdef CONFIG_CGROUP_PERF
+
+static int cmt_monitoring_seq_show(struct seq_file *m, void *v)
+{
+	struct perf_cgroup *cgrp = perf_cgroup_from_css(seq_css(m));
+	struct monr *monr = NULL;
+	int p, nr_pkgs = topology_max_packages();
+
+	mutex_lock(&cmt_mutex);
+
+	if (perf_cgroup_mon_started(cgrp))
+		monr = monr_from_perf_cgroup(cgrp);
+
+	for (p = 0; p < nr_pkgs; p++) {
+		seq_printf(m, "%x", monr ? monr->pkg_uflags[p] : 0);
+		if (p + 1 < nr_pkgs)
+			seq_puts(m, ";");
+	}
+
+	seq_puts(m, "\n");
+
+	mutex_unlock(&cmt_mutex);
+	return 0;
+}
+
+/**
+ * Parses uflags string of the form: m1;m2;...;mP where m* are hexadecimal
+ * masks and P is less or equal than topology_max_packages(). Values not
+ * provided in the mask are assumed to be zero.
+ * On success, it allocates and returns an array.
+ */
+static enum cmt_user_flags *parse_pkg_uflags(char *buf, size_t nbytes)
+{
+	enum cmt_user_flags *uflags;
+	char *local_buf, *b, *m;
+	int err = 0, pmask, nr_pkgs = topology_max_packages();
+	u16 p = 0;
+
+	uflags = kcalloc(nr_pkgs, sizeof(*uflags), GFP_KERNEL);
+	if (!uflags)
+		return ERR_PTR(-ENOMEM);
+
+	local_buf = kcalloc(nbytes, sizeof(char), GFP_KERNEL);
+	if (!local_buf) {
+		err = -ENOMEM;
+		goto error;
+	}
+	memcpy(local_buf, buf, nbytes);
+	b = local_buf;
+	while ((m = strsep(&b, ";")) != NULL) {
+		if (p >= nr_pkgs) {
+			err = -EINVAL;
+			goto error;
+		}
+		if (!*m) {
+			uflags[p] = 0;
+			continue;
+		}
+		err = kstrtoint(m, 16, &pmask);
+		if (err)
+			goto error;
+		uflags[p] = pmask;
+		if (uflags[p] > CMT_UF_MAX) {
+			err = -EINVAL;
+			goto error;
+		}
+		/*
+		 * Non-zero flags must have CMT_UF_HAS_USER set. Otherwise
+		 * monrs could end up allocated but never used.
+		 */
+		if (uflags[p] && (!(uflags[p] & CMT_UF_HAS_USER))) {
+			err = -EINVAL;
+			goto error;
+		}
+		p++;
+	}
+
+	kfree(local_buf);
+
+	return uflags;
+
+error:
+	kfree(local_buf);
+	kfree(uflags);
+
+	return ERR_PTR(err);
+}
+
+static ssize_t cmt_monitoring_write(struct kernfs_open_file *of,
+		char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct monr *monr;
+	int err = 0;
+	bool is_mon;
+	enum cmt_user_flags *uflags;
+
+	/* root is read-only */
+	if (css == get_root_perf_css())
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+
+	/* Monitoring active, use new flags. */
+	uflags = parse_pkg_uflags(buf, nbytes);
+	if (IS_ERR(uflags)) {
+		err = PTR_ERR(uflags);
+		goto exit_unlock;
+	}
+
+	is_mon = perf_cgroup_mon_started(perf_cgroup_from_css(css));
+	if (!is_mon) {
+		if (pkg_uflags_has_user(uflags)) {
+			err = __css_start_monitoring(css);
+			if (err)
+				goto exit_free;
+		} else {
+			/*
+			 * uflags must be all zero or parse_pkg_uflags
+			 * would have failed.
+			 */
+			goto exit_free;
+		}
+	}
+
+	/* At this point the monr is guaranteed to be this css's monr. */
+	monr = monr_from_css(css);
+
+	/* Disregard if flags have not changed. */
+	if (!memcmp(uflags, monr->pkg_uflags, pkg_uflags_size))
+		goto exit_free;
+
+	/*
+	 * will update monr->pkg_flags. Do not exit on error, continue to
+	 * clean up monr if unused.
+	 */
+	err = monr_apply_uflags(monr, uflags);
+
+	if (!monr_has_user(monr))
+		monr_destroy(monr);
+
+exit_free:
+	kfree(uflags);
+exit_unlock:
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+	return err ?: nbytes;
+}
+
+struct cftype perf_event_cgrp_arch_subsys_cftypes[] = {
+	{
+		/*
+		 * allows per-package specification of uflags. It takes a
+		 * semi-colon separated list of hex uflags values. The hex
+		 * value in the i-th position (counting from 0) corresponds to
+		 * the uflags of the package with logical id == i. Empty and
+		 * missing hex values receive 0.
+		 * e.g. "1;2" -> uflag 0x1 for pkg 0, uflag 0x2 for pkg 1,
+		 * and 0x0 for all others.
+		 */
+		.name = "cmt_monitoring",
+		.seq_show = cmt_monitoring_seq_show,
+		.write = cmt_monitoring_write,
+	},
+
+	{}	/* terminate */
+};
+
+#endif
+
 device_initcall(intel_cmt_init);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 783bdbb..babee97 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -315,6 +315,16 @@ int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);
 	perf_cgroup_arch_css_offline
 void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);
 
+extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS \
+	.dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
+	.legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
+
+#else
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+
 #endif
 #endif
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (36 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute David Carrillo-Cisneros
                   ` (7 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

To make rmid rotation more dependable, this patch series introduces
rotation Service Level Objectives (SLOs) that are described in the
code's documentation.

This patch introduces the cmt_{pre,min}_mon_slice SLOs, which protect
against bogus values when a rmid has not been available since the beginning
of monitoring. It also introduces the auxiliary variables necessary for the
SLOs to work and the checks in intel_cmt_event_read that enforce the SLOs
for reads of the llc_occupancy event.
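As a hedged sketch of the resulting timeline with the defaults added below
(2000 ms and 5000 ms; the variable names follow the patch, but the snippet
itself is illustrative and not a hunk of it):

  /* 'recoup' is the jiffies64 time when the monr got all its rmids back. */
  u64 readable_from = recoup + msecs_to_jiffies(CMT_DEFAULT_PRE_MON_SLICE);
  u64 keep_until    = readable_from +
		      msecs_to_jiffies(CMT_DEFAULT_MIN_MON_SLICE);

  /*
   * llc_occupancy reads return -EAGAIN until readable_from (~2 s after the
   * recoup); later patches in this series use keep_until (~7 s after the
   * recoup) to decide when rmids may be stolen again.
   */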

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 46 ++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h | 28 +++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 3ade923..649eb5f 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -51,6 +51,25 @@ static size_t pkg_uflags_size;
 static struct pkg_data **cmt_pkgs_data;
 
 /*
+ * Rotation Service Level Objectives (SLO) for monrs with llc_occupancy
+ * monitoring. Note that these are monr level SLOs, therefore all pmonrs in
+ * the monr meet or exceed them.
+ * (A "monitored"  monr is a monr with no pmonr in a Dependent state).
+ *
+ * SLOs:
+ *
+ * @__cmt_pre_mon_slice: Min time a monr is monitored before being readable.
+ * @__cmt_min_mon_slice: Min time a monr stays monitored after becoming
+ *                       readable.
+ */
+#define CMT_DEFAULT_PRE_MON_SLICE 2000		/* ms */
+static u64 __cmt_pre_mon_slice;
+
+#define CMT_DEFAULT_MIN_MON_SLICE 5000		/* ms */
+static u64 __cmt_min_mon_slice;
+
+
+/*
  * If @pkgd == NULL, return first online, pkg_data in cmt_pkgs_data.
  * Otherwise next online pkg_data or NULL if no more.
  */
@@ -300,6 +319,7 @@ static void pmonr_to_unused(struct pmonr *pmonr)
 			pmonr_move_all_dependants(pmonr, lender);
 		}
 		__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
+		pkgd->nr_dirty_rmids++;
 
 	} else if (pmonr->state == PMONR_DEP_IDLE ||
 		   pmonr->state == PMONR_DEP_DIRTY) {
@@ -312,6 +332,11 @@ static void pmonr_to_unused(struct pmonr *pmonr)
 			__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
 		else
 			pkgd->nr_dep_pmonrs--;
+
+
+		if (!atomic_dec_and_test(&pmonr->monr->nr_dep_pmonrs))
+			atomic64_set(&pmonr->monr->last_rmid_recoup,
+				     get_jiffies_64());
 	} else {
 		WARN_ON_ONCE(true);
 		return;
@@ -372,6 +397,7 @@ static inline void __pmonr_to_dep_helper(
 
 	lender_rmids.value = atomic64_read(&lender->atomic_rmids);
 	pmonr_set_rmids(pmonr, lender_rmids.sched_rmid, read_rmid);
+	atomic_inc(&pmonr->monr->nr_dep_pmonrs);
 }
 
 static inline void pmonr_unused_to_dep_idle(struct pmonr *pmonr)
@@ -390,6 +416,7 @@ static void pmonr_unused_to_off(struct pmonr *pmonr)
 
 static void pmonr_active_to_dep_dirty(struct pmonr *pmonr)
 {
+	struct pkg_data *pkgd = pmonr->pkgd;
 	struct pmonr *lender;
 	union pmonr_rmids rmids;
 
@@ -398,6 +425,7 @@ static void pmonr_active_to_dep_dirty(struct pmonr *pmonr)
 
 	rmids.value = atomic64_read(&pmonr->atomic_rmids);
 	__pmonr_to_dep_helper(pmonr, lender, rmids.read_rmid);
+	pkgd->nr_dirty_rmids++;
 }
 
 static void __pmonr_dep_to_active_helper(struct pmonr *pmonr, u32 rmid)
@@ -408,6 +436,9 @@ static void __pmonr_dep_to_active_helper(struct pmonr *pmonr, u32 rmid)
 	pmonr_move_dependants(pmonr->lender, pmonr);
 	pmonr->lender = NULL;
 	__pmonr_to_active_helper(pmonr, rmid);
+
+	if (!atomic_dec_and_test(&pmonr->monr->nr_dep_pmonrs))
+		atomic64_set(&pmonr->monr->last_rmid_recoup, get_jiffies_64());
 }
 
 static void pmonr_dep_idle_to_active(struct pmonr *pmonr, u32 rmid)
@@ -422,6 +453,7 @@ static void pmonr_dep_dirty_to_active(struct pmonr *pmonr)
 	union pmonr_rmids rmids;
 
 	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	pmonr->pkgd->nr_dirty_rmids--;
 	__pmonr_dep_to_active_helper(pmonr, rmids.read_rmid);
 }
 
@@ -1599,7 +1631,7 @@ static int read_all_pkgs(struct monr *monr, int wait_time_ms, u64 *count)
 static int intel_cmt_event_read(struct perf_event *event)
 {
 	struct monr *monr = monr_from_event(event);
-	u64 count;
+	u64 count, recoup, wait_end;
 	u16 pkgid = topology_logical_package_id(smp_processor_id());
 	int err;
 
@@ -1614,6 +1646,15 @@ static int intel_cmt_event_read(struct perf_event *event)
 		return -ENXIO;
 
 	/*
+	 * If rmid has been stolen, only read if enough time has elapsed since
+	 * rmid were recovered.
+	 */
+	recoup = atomic64_read(&monr->last_rmid_recoup);
+	wait_end = recoup + __cmt_pre_mon_slice;
+	if (recoup && time_before64(get_jiffies_64(), wait_end))
+		return -EAGAIN;
+
+	/*
 	 * Only event parent can return a value, everyone else share its
 	 * rmid and therefore doesn't track occupancy independently.
 	 */
@@ -2267,6 +2308,9 @@ static int __init intel_cmt_init(void)
 	struct pkg_data *pkgd = NULL;
 	int err = 0;
 
+	__cmt_pre_mon_slice = msecs_to_jiffies(CMT_DEFAULT_PRE_MON_SLICE);
+	__cmt_min_mon_slice = msecs_to_jiffies(CMT_DEFAULT_MIN_MON_SLICE);
+
 	if (!x86_match_cpu(intel_cmt_match)) {
 		err = -ENODEV;
 		goto err_exit;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 8bb43bd..8756666 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -52,6 +52,24 @@
  * schedule and read.
  *
  *
+ * Rotation
+ *
+ * The number of rmids in hw is relatively small with respect to the number
+ * of potentially monitored resources. rmids are rotated among pmonrs that
+ * need one, to give a fair-ish usage of this resource.
+ *
+ * A hw constraint is that occupancy for a rmid cannot be reset, therefore
+ * a rmid with llc_occupancy needs some time unscheduled until all cache lines
+ * tagged to it are evicted from the cache (if this ever happens).
+ *
+ * When a rmid is "rotated", it is stolen from a pmonr and must wait until its
+ * llc_occupancy has decreased enough to be considered "clean". Meanwhile, that
+ * rmid is considered "dirty".
+ *
+ * Rotation logic periodically reads the occupancy of these "dirty" rmids and,
+ * once clean, the rmid is either reused or placed in a free pool.
+ *
+ *
  * Locking
  *
  * One global cmt_mutex. One mutex and spin_lock per package.
@@ -62,6 +80,7 @@
  *  cgroup start/stop.
  *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
  *  change the monr hierarchy.
+ *  - pkgd->mutex: Hold in current package for rotation in that pkgd.
  *  - pkgd->lock: Hold in current package to access that pkgd's members. Hold
  *  a pmonr's package pkgd->lock for non-atomic access to pmonr.
  */
@@ -225,6 +244,7 @@ struct cmt_csd {
  * @dep_dirty_pmonrs:		LRU of Dep_Dirty pmonrs.
  * @dep_pmonrs:			LRU of Dep_Idle and Dep_Dirty pmonrs.
  * @nr_dep_pmonrs:		nr Dep_Idle + nr Dep_Dirty pmonrs.
+ * @nr_dirty_rmids:		"dirty" rmids, both with and without a pmonr.
  * @mutex:			Hold when modifying this pkg_data.
  * @mutex_key:			lockdep class for pkg_data's mutex.
  * @lock:			Hold to protect pmonrs in this pkg_data.
@@ -243,6 +263,7 @@ struct pkg_data {
 	struct list_head	dep_dirty_pmonrs;
 	struct list_head	dep_pmonrs;
 	int			nr_dep_pmonrs;
+	int			nr_dirty_rmids;
 
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
@@ -280,6 +301,10 @@ enum cmt_user_flags {
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
+ * @last_rmid_recoup:	Last time that nr_dep_pmonrs decreased to zero. It's
+ *			zero if a rmid has never been stolen from this monr.
+ * @nr_dep_pmonrs:	nr of Dep_* pmonrs in this monr. A zero implies that
+ *			monr is monitoring in all required packages.
  * @flags:		monr_flags.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
  * @nr_nolazy_user:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
@@ -303,6 +328,9 @@ struct monr {
 	struct list_head		children;
 	struct list_head		parent_entry;
 
+	atomic64_t			last_rmid_recoup;
+	atomic_t			nr_dep_pmonrs;
+
 	enum monr_flags			flags;
 	int				nr_has_user;
 	int				nr_nolazy_rmid;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (37 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work David Carrillo-Cisneros
                   ` (6 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Expose max_recycle_threshold as a configurable sysfs attribute of the
intel_cmt pmu. max_recycle_threshold is the maximum occupancy, in bytes,
that a dirty rmid may still report and yet be recycled, i.e. the maximum
error introduced by reusing rmids whose occupancy has not dropped to zero.
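As a hedged worked example of the default picked later in this patch (the
35MB/56-RMID numbers come from the comment in the patch, not from a
measurement):

  /* __cmt_max_threshold defaults to llc_size / (2 * nr_rmids), in bytes. */
  unsigned int llc_kb   = 35 * 1024;	/* 35 MB LLC           */
  unsigned int nr_rmids = 56;		/* __min_max_rmid + 1  */
  unsigned int dflt     = llc_kb * 1024 / (2 * nr_rmids);
  /* dflt == 327680 bytes == 320 KB, ~0.9% of the LLC. */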

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 649eb5f..05803a8 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -61,6 +61,7 @@ static struct pkg_data **cmt_pkgs_data;
  * @__cmt_pre_mon_slice: Min time a monr is monitored before being readable.
  * @__cmt_min_mon_slice: Min time a monr stays monitored after becoming
  *                       readable.
+ * @__cmt_max_threshold: Max bytes of error due to reusing dirty rmids.
  */
 #define CMT_DEFAULT_PRE_MON_SLICE 2000		/* ms */
 static u64 __cmt_pre_mon_slice;
@@ -68,6 +69,7 @@ static u64 __cmt_pre_mon_slice;
 #define CMT_DEFAULT_MIN_MON_SLICE 5000		/* ms */
 static u64 __cmt_min_mon_slice;
 
+static unsigned int __cmt_max_threshold;	/* bytes */
 
 /*
  * If @pkgd == NULL, return first online, pkg_data in cmt_pkgs_data.
@@ -1831,9 +1833,54 @@ static struct attribute_group intel_cmt_format_group = {
 	.attrs = intel_cmt_formats_attr,
 };
 
+static ssize_t max_recycle_threshold_show(struct device *dev,
+				struct device_attribute *attr, char *page)
+{
+	ssize_t rv;
+
+	mutex_lock(&cmt_mutex);
+	rv = snprintf(page, PAGE_SIZE - 1, "%u\n",
+		      READ_ONCE(__cmt_max_threshold));
+
+	mutex_unlock(&cmt_mutex);
+	return rv;
+}
+
+static ssize_t max_recycle_threshold_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	unsigned int bytes;
+	int err;
+
+	err = kstrtouint(buf, 0, &bytes);
+	if (err)
+		return err;
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+	WRITE_ONCE(__cmt_max_threshold, bytes);
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(max_recycle_threshold);
+
+static struct attribute *intel_cmt_attrs[] = {
+	&dev_attr_max_recycle_threshold.attr,
+	NULL,
+};
+
+static const struct attribute_group intel_cmt_group = {
+	.attrs = intel_cmt_attrs,
+};
+
 static const struct attribute_group *intel_cmt_attr_groups[] = {
 	&intel_cmt_events_group,
 	&intel_cmt_format_group,
+	&intel_cmt_group,
 	NULL,
 };
 
@@ -2270,6 +2317,16 @@ static int __init cmt_start(void)
 	if (err)
 		goto rm_prep;
 
+	/*
+	 * A reasonable default upper limit on the max threshold is half of
+	 * the number of lines tagged per RMID if all RMIDs had the same
+	 * number of lines tagged in the LLC.
+	 *
+	 * For a 35MB LLC and 56 RMIDs, this is ~0.9% of the LLC or 320 KBs.
+	 */
+	__cmt_max_threshold = boot_cpu_data.x86_cache_size * 1024 /
+			(2 * (__min_max_rmid + 1));
+
 	snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
 	str = kstrdup(scale, GFP_KERNEL);
 	if (!str) {
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (38 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO David Carrillo-Cisneros
                   ` (5 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Schedule the rotation work every pmu->hrtimer_interval_ms milliseconds.
The period defaults to CMT_DEFAULT_ROTATION_PERIOD.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 98 ++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h |  2 +
 2 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 05803a8..8bf6aa5 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -51,6 +51,13 @@ static size_t pkg_uflags_size;
 static struct pkg_data **cmt_pkgs_data;
 
 /*
+ * Time between execution of rotation logic. The frequency of execution does
+ * not affect the rate at which RMIDs are recycled.
+ * The rotation period is stored in pmu->hrtimer_interval_ms.
+ */
+#define CMT_DEFAULT_ROTATION_PERIOD 1200	/* ms */
+
+/*
  * Rotation Service Level Objectives (SLO) for monrs with llc_occupancy
  * monitoring. Note that these are monr level SLOs, therefore all pmonrs in
  * the monr meet or exceed them.
@@ -1306,6 +1313,79 @@ static void smp_call_rmid_read(void *data)
 
 static struct pmu intel_cmt_pmu;
 
+/* Schedule rotation in one package. */
+static bool __intel_cmt_schedule_rotation_for_pkg(struct pkg_data *pkgd)
+{
+	unsigned long delay;
+
+	if (pkgd->work_cpu >= nr_cpu_ids)
+		return false;
+	delay = msecs_to_jiffies(intel_cmt_pmu.hrtimer_interval_ms);
+
+	return schedule_delayed_work_on(pkgd->work_cpu,
+					&pkgd->rotation_work, delay);
+}
+
+static void intel_cmt_schedule_rotation(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		__intel_cmt_schedule_rotation_for_pkg(pkgd);
+	rcu_read_unlock();
+}
+
+/*
+ * Rotation for @pkgd is needed if its package has at least one online CPU and:
+ *   - there is a non-root monr (such monr could request rmids in @pkgd at any
+ *     time), or
+ *   - there are dirty rmids in pkgd.
+ */
+static bool intel_cmt_need_rmid_rotation(struct pkg_data *pkgd)
+{
+	unsigned long flags;
+	bool do_rot;
+
+	/* protected by cmt_mutex. */
+	if (!list_empty(&monr_hrchy_root->children))
+		return true;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	do_rot =  pkgd->nr_dep_pmonrs || pkgd->nr_dirty_rmids;
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	return do_rot;
+}
+
+/*
+ * Rotation function, runs per-package.
+ */
+static void intel_cmt_rmid_rotation_work(struct work_struct *work)
+{
+	struct pkg_data *pkgd;
+
+	pkgd = container_of(to_delayed_work(work),
+			    struct pkg_data, rotation_work);
+
+	/* Bail out if this pkg_data is on its way to be destroyed. */
+	if (pkgd->work_cpu >= nr_cpu_ids)
+		return;
+
+	mutex_lock(&pkgd->mutex);
+
+	if (!intel_cmt_need_rmid_rotation(pkgd))
+		goto exit;
+
+	/* To add call to rotation function in next patch */
+
+	if (intel_cmt_need_rmid_rotation(pkgd))
+		__intel_cmt_schedule_rotation_for_pkg(pkgd);
+
+exit:
+	mutex_unlock(&pkgd->mutex);
+}
+
 /* Try to find a monr with same target, otherwise create new one. */
 static int mon_group_setup_event(struct perf_event *event)
 {
@@ -1796,6 +1876,11 @@ static int intel_cmt_event_init(struct perf_event *event)
 	mutex_lock(&cmt_mutex);
 
 	err = mon_group_setup_event(event);
+	/*
+	 * schedule rotation even on error, in case the error was caused by
+	 * insufficient rmids.
+	 */
+	intel_cmt_schedule_rotation();
 
 	mutex_unlock(&cmt_mutex);
 
@@ -1885,6 +1970,7 @@ static const struct attribute_group *intel_cmt_attr_groups[] = {
 };
 
 static struct pmu intel_cmt_pmu = {
+	.hrtimer_interval_ms = CMT_DEFAULT_ROTATION_PERIOD,
 	.attr_groups	     = intel_cmt_attr_groups,
 	.task_ctx_nr	     = perf_sw_context,
 	.event_init	     = intel_cmt_event_init,
@@ -2009,6 +2095,7 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 	mutex_init(&pkgd->mutex);
 	raw_spin_lock_init(&pkgd->lock);
 
+	INIT_DELAYED_WORK(&pkgd->rotation_work, intel_cmt_rmid_rotation_work);
 	pkgd->work_cpu = cpu;
 	pkgd->pkgid = pkgid;
 
@@ -2131,9 +2218,10 @@ static int intel_cmt_hp_online_enter(unsigned int cpu)
 
 	rcu_read_lock();
 	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
-	if (pkgd->work_cpu >= nr_cpu_ids)
+	if (pkgd->work_cpu >= nr_cpu_ids) {
 		pkgd->work_cpu = cpu;
-
+		__intel_cmt_schedule_rotation_for_pkg(pkgd);
+	}
 	rcu_read_unlock();
 
 	return 0;
@@ -2184,6 +2272,7 @@ static int intel_cmt_prep_down(unsigned int cpu)
 	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
 					 lockdep_is_held(&cmt_mutex));
 	if (pkgd->work_cpu >= nr_cpu_ids) {
+		cancel_delayed_work_sync(&pkgd->rotation_work);
 		/* will destroy pkgd */
 		__terminate_pkg_data(pkgd);
 		RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
@@ -2569,6 +2658,11 @@ static ssize_t cmt_monitoring_write(struct kernfs_open_file *of,
 		monr_destroy(monr);
 
 exit_free:
+	/*
+	 * schedule rotation even if in error, in case the error was caused by
+	 * insufficient rmids.
+	 */
+	intel_cmt_schedule_rotation();
 	kfree(uflags);
 exit_unlock:
 	monr_hrchy_release_mutexes();
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 8756666..872cce0 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -248,6 +248,7 @@ struct cmt_csd {
  * @mutex:			Hold when modifying this pkg_data.
  * @mutex_key:			lockdep class for pkg_data's mutex.
  * @lock:			Hold to protect pmonrs in this pkg_data.
+ * @rotation_work:		Task that performs rotation of rmids.
  * @work_cpu:			CPU to run rotation and other batch jobs.
  *				It must be in the package associated to its
  *				instance of pkg_data.
@@ -268,6 +269,7 @@ struct pkg_data {
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
 
+	struct delayed_work	rotation_work;
 	unsigned int		work_cpu;
 	u32			max_rmid;
 	u16			pkgid;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (39 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing David Carrillo-Cisneros
                   ` (4 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Try to activate monrs at a rate of at least __cmt_min_progress_rate
pmonrs per second.
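As a hedged worked example of the goal computed per rotation pass
(constants from this series; the snippet is illustrative, not a hunk of
the patch):

  unsigned int elapsed_ms = 1200;	/* CMT_DEFAULT_ROTATION_PERIOD   */
  unsigned int min_rate   = 2;		/* CMT_DEFAULT_MIN_PROGRESS_RATE */
  unsigned int active_goal = elapsed_ms * min_rate / 1000;
  /* == 2; the code clamps it with max(1u, ...) for very short periods. */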

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 274 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 273 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 8bf6aa5..ba82f95 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -79,6 +79,14 @@ static u64 __cmt_min_mon_slice;
 static unsigned int __cmt_max_threshold;	/* bytes */
 
 /*
+ * Rotation SLO for all monr events (including those without llc_occupancy):
+ * @__cmt_min_progress_rate: Min number of pmonrs that must go to the Active
+ * state per second; otherwise, the recycling occupancy error increases.
+ */
+#define CMT_DEFAULT_MIN_PROGRESS_RATE 2		/* pmonrs per sec */
+static unsigned int __cmt_min_progress_rate = CMT_DEFAULT_MIN_PROGRESS_RATE;
+
+/*
  * If @pkgd == NULL, return first online, pkg_data in cmt_pkgs_data.
  * Otherwise next online pkg_data or NULL if no more.
  */
@@ -466,6 +474,21 @@ static void pmonr_dep_dirty_to_active(struct pmonr *pmonr)
 	__pmonr_dep_to_active_helper(pmonr, rmids.read_rmid);
 }
 
+/* dirty rmid must be clean enough to go to free_rmids. */
+static void pmonr_dep_dirty_to_dep_idle_helper(struct pmonr *pmonr,
+					       union pmonr_rmids rmids)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+
+	pmonr->pkgd->nr_dirty_rmids--;
+	__set_bit(rmids.read_rmid, pkgd->free_rmids);
+	list_move_tail(&pmonr->rot_entry, &pkgd->dep_idle_pmonrs);
+	pkgd->nr_dep_pmonrs++;
+
+	pmonr->state = PMONR_DEP_IDLE;
+	pmonr_set_rmids(pmonr, rmids.sched_rmid, INVALID_RMID);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
@@ -1311,6 +1334,242 @@ static void smp_call_rmid_read(void *data)
 	atomic_set(&ccsd->on_read, 0);
 }
 
+/*
+ * Try to reuse dirty rmids for pmonrs at the front of dep_dirty_pmonrs.
+ */
+static int __try_activate_dep_dirty_pmonrs(struct pkg_data *pkgd)
+{
+	int reused = 0;
+	struct pmonr *pmonr;
+	struct list_head *lhead = &pkgd->dep_pmonrs;
+
+	lockdep_assert_held(&pkgd->lock);
+
+	while ((pmonr = list_first_entry_or_null(
+				lhead, struct pmonr, pkgd_deps_entry))) {
+		if (!pmonr || pmonr->state == PMONR_DEP_IDLE)
+			break;
+		pmonr_dep_dirty_to_active(pmonr);
+		reused++;
+	}
+
+	return reused;
+}
+
+static int try_activate_dep_dirty_pmonrs(struct pkg_data *pkgd)
+{
+	int nr_reused;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	nr_reused = __try_activate_dep_dirty_pmonrs(pkgd);
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	return nr_reused;
+}
+
+static inline int __try_use_free_rmid(struct pkg_data *pkgd, u32 rmid)
+{
+	struct pmonr *pmonr;
+
+	lockdep_assert_held(&pkgd->lock);
+
+	pmonr = list_first_entry_or_null(&pkgd->dep_idle_pmonrs,
+					 struct pmonr, rot_entry);
+	if (!pmonr)
+		return 0;
+	/* The state transition will move the rmid to the active list.  */
+	pmonr_dep_idle_to_active(pmonr, rmid);
+
+	return 1 + __try_activate_dep_dirty_pmonrs(pkgd);
+}
+
+static int __try_use_free_rmids(struct pkg_data *pkgd)
+{
+	int nr_activated = 0, nr_used, r;
+
+	for_each_set_bit(r, pkgd->free_rmids, CMT_MAX_NR_RMIDS) {
+		/* Removes the rmid from free list if succeeds. */
+		nr_used = __try_use_free_rmid(pkgd, r);
+		if (!nr_used)
+			break;
+		nr_activated += nr_used;
+	}
+
+	return nr_activated;
+}
+
+static bool is_rmid_dirty(struct pkg_data *pkgd, u32 rmid, bool do_read,
+			  unsigned int dirty_thld, unsigned int *min_dirty)
+{
+	u64 val;
+
+	if (do_read && WARN_ON_ONCE(cmt_rmid_read(rmid, &val)))
+		return true;
+	if (val > dirty_thld) {
+		if (val < *min_dirty)
+			*min_dirty = val;
+		return true;
+	}
+
+	return false;
+}
+
+static int try_free_dep_dirty_pmonrs(struct pkg_data *pkgd,
+				     bool do_read,
+				     unsigned int dirty_thld,
+				     unsigned int *min_dirty)
+{
+	struct pmonr *pmonr, *tmp;
+	union pmonr_rmids rmids;
+	int nr_activated = 0;
+	unsigned long flags;
+
+	/*
+	 * No need to acquire pkg lock for pkgd->dep_dirty_pmonrs because
+	 * rotation logic is the only user of this list.
+	 */
+	list_for_each_entry_safe(pmonr, tmp,
+				 &pkgd->dep_dirty_pmonrs, rot_entry) {
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		if (is_rmid_dirty(pkgd, rmids.read_rmid,
+					do_read, dirty_thld, min_dirty))
+			continue;
+
+		raw_spin_lock_irqsave(&pkgd->lock, flags);
+		pmonr_dep_dirty_to_dep_idle_helper(pmonr, rmids);
+		nr_activated += __try_use_free_rmid(pkgd, rmids.read_rmid);
+		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+	}
+
+	return nr_activated;
+}
+
+static int try_free_dirty_rmids(struct pkg_data *pkgd,
+				bool do_read,
+				unsigned int dirty_thld,
+				unsigned int *min_dirty,
+				unsigned long *rmids_bm)
+{
+	int nr_activated = 0, r;
+	unsigned long flags;
+
+	/*
+	 * To avoid holding pkgd->lock while reading rmids in hw (slow), hold
+	 * once and save all rmids that must be read. Then read them while
+	 * unlocked.
+	 */
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	memcpy(rmids_bm, pkgd->dirty_rmids, CMT_MAX_NR_RMIDS_BYTES);
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	for_each_set_bit(r, rmids_bm, CMT_MAX_NR_RMIDS) {
+		if (is_rmid_dirty(pkgd, r, do_read, dirty_thld, min_dirty))
+			continue;
+
+		raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+		pkgd->nr_dirty_rmids--;
+		__clear_bit(r, pkgd->dirty_rmids);
+		__set_bit(r, pkgd->free_rmids);
+		nr_activated += __try_use_free_rmid(pkgd, r);
+
+		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+	}
+
+	return nr_activated;
+}
+
+/**
+ * __intel_cmt_rmid_rotate - Rotate rmids among pmonrs and handle dirty rmids.
+ * @pkgd:		The package data to rotate rmids on.
+ * @active_goal:	Target min nr of pmonrs to put in Active state.
+ * @max_dirty_thld:	Upper bound for dirty_thld, in CMT cache units.
+ *
+ * The goals for each iteration of rotation logic are:
+ *   1) to activate @active_goal pmonrs.
+ *
+ * In order to activate Dep_{Dirty,Idle} pmonrs, rotation logic:
+ *   1) activate eligible Dep_Dirty pmonrs: These pmonrs can reuse their former
+ *   rmid, even if it is not clean, without increasing the error.
+ *   2) take clean rmids from Dep_Dirty pmonrs and reuse them for other pmonrs
+ *   or add them to pool of free rmids.
+ *   3) use free rmids to activate Dep_Idle pmonrs.
+ *
+ * Rotation logic also checks the occupancy of dirty rmids and, if now clean,
+ * uses them or adds them to free rmids.
+ * When a Dep_Idle pmonr is activated, any Dep_Dirty pmonr that is immediately
+ * after it in the pkg->dep_pmonrs list can be activated reusing its dirty
+ * rmid.
+ */
+static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
+		unsigned int active_goal, unsigned int max_dirty_thld)
+{
+	unsigned int dirty_thld = 0, min_dirty, nr_activated;
+	unsigned int nr_dep_pmonrs;
+	unsigned long flags, *rmids_bm = NULL;
+	bool do_active_goal, read_dirty = true, dirty_is_max;
+
+	lockdep_assert_held(&pkgd->mutex);
+
+	rmids_bm = kzalloc(CMT_MAX_NR_RMIDS_BYTES, GFP_KERNEL);
+	if (!rmids_bm)
+		return -ENOMEM;
+
+	nr_activated = try_activate_dep_dirty_pmonrs(pkgd);
+
+again:
+	min_dirty = UINT_MAX;
+
+	/* retry every iteration since dirty_thld may have changed. */
+	nr_activated += try_free_dirty_rmids(pkgd, read_dirty,
+					     dirty_thld, &min_dirty, rmids_bm);
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	nr_activated += __try_use_free_rmids(pkgd);
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	nr_activated += try_free_dep_dirty_pmonrs(pkgd, read_dirty,
+						  dirty_thld, &min_dirty);
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	nr_activated += __try_use_free_rmids(pkgd);
+	nr_dep_pmonrs = pkgd->nr_dep_pmonrs;
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	/*
+	 * If there is no room to increase dirty_thld, then no more dirty rmids
+	 * could be reused and must give up active goal.
+	 */
+	dirty_is_max = dirty_thld >= max_dirty_thld;
+	do_active_goal = nr_activated < active_goal && !dirty_is_max;
+
+	/*
+	 * Since Dep_Dirty pmonrs have their own dirty rmid, only Dep_Idle
+	 * pmonrs are waiting for a rmid to be available. Stop if no pmonr
+	 * wait for rmid or no goals to pursue.
+	 */
+	if (!nr_dep_pmonrs || !do_active_goal)
+		goto exit;
+
+	/*
+	 * Try to activate more pmonrs by increasing the dirty threshold.
+	 * Using the minimum observed occupancy in dirty rmids guarantees to
+	 * recover at least one rmid per iteration.
+	 */
+	if (do_active_goal) {
+		dirty_thld = min(min_dirty, max_dirty_thld);
+		/* do not read occupancy for dirty rmids twice. */
+		read_dirty = true;
+		goto again;
+	}
+
+exit:
+	kfree(rmids_bm);
+
+	return 0;
+}
+
 static struct pmu intel_cmt_pmu;
 
 /* Schedule rotation in one package. */
@@ -1360,10 +1619,20 @@ static bool intel_cmt_need_rmid_rotation(struct pkg_data *pkgd)
 
 /*
  * Rotation function, runs per-package.
+ * If rmids are needed in a package it will steal rmids from pmonrs that have
+ * been active longer than __cmt_pre_mon_slice + __cmt_min_mon_slice.
+ * The hardware doesn't provide a way to free occupancy for a rmid that will
+ * be reused. Therefore, before reusing a rmid, it should stay unscheduled for
+ * a while, hoping that the cache lines counted towards this rmid will
+ * eventually be replaced and the rmid occupancy will decrease below
+ * __cmt_max_threshold.
  */
 static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 {
 	struct pkg_data *pkgd;
+	/* not precise elapsed time, but good enough for rotation purposes. */
+	unsigned int elapsed_ms = intel_cmt_pmu.hrtimer_interval_ms;
+	unsigned int active_goal, max_dirty_threshold;
 
 	pkgd = container_of(to_delayed_work(work),
 			    struct pkg_data, rotation_work);
@@ -1377,7 +1646,10 @@ static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 	if (!intel_cmt_need_rmid_rotation(pkgd))
 		goto exit;
 
-	/* To add call to rotation function in next patch */
+	active_goal = max(1u, (elapsed_ms * __cmt_min_progress_rate) / 1000);
+	max_dirty_threshold = READ_ONCE(__cmt_max_threshold) / cmt_l3_scale;
+
+	__intel_cmt_rmid_rotate(pkgd, active_goal, max_dirty_threshold);
 
 	if (intel_cmt_need_rmid_rotation(pkgd))
 		__intel_cmt_schedule_rotation_for_pkg(pkgd);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (40 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag David Carrillo-Cisneros
                   ` (3 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add rmid rotation code that steals rmids whenever not enough
pmonrs are being reactivated.

More details in the code's comments.
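As a hedged worked example of how many rmids one rotation pass may try to
steal (constants and formulas from this and the preceding patches; the
Dep_Idle and dirty counts are made-up inputs):

  /*
   * With active_goal = 2 (previous patch) and 56 rmids (max_rmid = 55):
   *   max_dirty_goal = min(active_goal + 1, (max_rmid + 1) / 4)
   *                  = min(3, 14) = 3
   *   dirty_goal     = min(max_dirty_goal, nr_dep_pmonrs + dirty_cushion)
   *                  = min(3, 4 + 2) = 3    (4 Dep_Idle pmonrs, cushion 2)
   *   nr_to_steal    = dirty_goal - nr_dirty = 3 - 0 = 3
   */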

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 149 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 144 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index ba82f95..e677511 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1368,6 +1368,106 @@ static int try_activate_dep_dirty_pmonrs(struct pkg_data *pkgd)
 	return nr_reused;
 }
 
+/**
+ * can_steal_rmid() - Tell if this pmonr's rmid can be stolen.
+ *
+ * The "rmid cycle" for a pmonr starts when an Active pmonr gets its rmid
+ * stolen and completes when it receives a rmid again.
+ * A monr "rmid recoup" occurs when all its non Off/Unused pmonrs
+ * obtain a rmid (i.e. when all pmonrs that need a rmid have one).
+ *
+ * A pmonr's rmid can be stolen if either:
+ *   1) No other pmonr in pmonr's monr has been stolen before, or
+ *   2) Some pmonrs have had rmids stolen but rmids for all pmonrs have been
+ *   recovered (rmid recoup) and kept for at least
+ *     __cmt_pre_mon_slice + __cmt_min_mon_slice time.
+ *   3) At least one of the pmonrs with pkgid smaller than @pmonr's has not
+ *   completed its first "rmid cycle". Once this condition is false, the pmonr
+ *   will have completed its last "rmid cycle" and stealing will no longer
+ *   be allowed.
+ *   This guarantees that the last "rmid cycle" of a pmonr occurs in
+ *   pkgid order, preventing rmid deadlocks. It also guarantees that all
+ *   pmonrs will eventually have a last "rmid cycle", recovering all
+ *   required rmids.
+ */
+static bool can_steal_rmid(struct pmonr *pmonr)
+{
+	union pmonr_rmids rmids;
+	struct monr *monr = pmonr->monr;
+	struct pkg_data *pkgd = NULL;
+	struct pmonr *pos_pmonr;
+	bool need_rmid_state;
+	u64 last_all_active, next_steal_time, last_pmonr_active;
+
+	last_all_active = atomic64_read(&monr->last_rmid_recoup);
+	/*
+	 * Can steal if no pmonr has been stolen or all not Unused have been
+	 * in Active state for long enough.
+	 */
+	if (!atomic_read(&monr->nr_dep_pmonrs)) {
+		/* Check steal condition 1. */
+		if (!last_all_active)
+			return true;
+		next_steal_time = last_all_active +
+				__cmt_pre_mon_slice + __cmt_min_mon_slice;
+		/* Check steal condition 2. */
+		if (time_after64(next_steal_time, get_jiffies_64()))
+			return true;
+
+		return false;
+	}
+
+	rcu_read_lock();
+
+	/* Check for steal condition 3 without locking. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		/* To avoid deadlocks, wait for pmonr in pkgid order. */
+		if (pkgd->pkgid >= pmonr->pkgd->pkgid)
+			break;
+		pos_pmonr = pkgd_pmonr(pkgd, monr);
+		rmids.value = atomic64_read(&pos_pmonr->atomic_rmids);
+		last_pmonr_active = atomic64_read(
+				&pos_pmonr->last_enter_active);
+
+		/* pmonrs in Dep_{Idle,Dirty} states are waiting for a rmid. */
+		need_rmid_state = rmids.sched_rmid != INVALID_RMID &&
+				  rmids.sched_rmid != rmids.read_rmid;
+
+		/* test if pos_pmonr has finished its first rmid cycle. */
+		if (need_rmid_state && last_all_active <= last_pmonr_active) {
+			rcu_read_unlock();
+
+			return true;
+		}
+	}
+	rcu_read_unlock();
+
+	return false;
+}
+
+/* Steal as many rmids as possible, up to @max_to_steal. */
+static int try_steal_active_pmonrs(struct pkg_data *pkgd,
+				   unsigned int max_to_steal)
+{
+	struct pmonr *pmonr, *tmp;
+	unsigned long flags;
+	int nr_stolen = 0;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+	list_for_each_entry_safe(pmonr, tmp, &pkgd->active_pmonrs, rot_entry) {
+		if (!can_steal_rmid(pmonr))
+			continue;
+		pmonr_active_to_dep_dirty(pmonr);
+		nr_stolen++;
+		if (nr_stolen == max_to_steal)
+			break;
+	}
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	return nr_stolen;
+}
+
 static inline int __try_use_free_rmid(struct pkg_data *pkgd, u32 rmid)
 {
 	struct pmonr *pmonr;
@@ -1485,9 +1585,17 @@ static int try_free_dirty_rmids(struct pkg_data *pkgd,
  * @pkgd:		The package data to rotate rmids on.
  * @active_goal:	Target min nr of pmonrs to put in Active state.
  * @max_dirty_thld:	Upper bound for dirty_thld, in CMT cache units.
+ * @max_dirty_goal:	Max nr of rmids to leave dirty, waiting to drop
+ *			occupancy.
+ * @dirty_cushion:	nr of rmids to try to keep dirty on top of the
+ *			nr of pmonrs that need a rmid (Dep_Idle), in case
+ *			some dirty rmids do not drop occupancy fast enough.
  *
  * The goals for each iteration of rotation logic are:
  *   1) to activate @active_goal pmonrs.
+ *   2) if any pmonr is waiting for rmid (Dep_Idle), to steal enough rmids to
+ *   meet its dirty_goal. The dirty_goal is an estimate of the number of dirty
+ *   rmids required so that next call reaches its @active_goal.
  *
  * In order to activate Dep_{Dirty,Idle} pmonrs, rotation logic:
  *   1) activate eligible Dep_Dirty pmonrs: These pmonrs can reuse their former
@@ -1503,12 +1611,14 @@ static int try_free_dirty_rmids(struct pkg_data *pkgd,
  * rmid.
  */
 static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
-		unsigned int active_goal, unsigned int max_dirty_thld)
+		unsigned int active_goal, unsigned int max_dirty_thld,
+		unsigned int max_dirty_goal, unsigned int dirty_cushion)
 {
 	unsigned int dirty_thld = 0, min_dirty, nr_activated;
-	unsigned int nr_dep_pmonrs;
+	unsigned int nr_to_steal, nr_stolen;
+	unsigned int nr_dirty, dirty_goal, nr_dep_pmonrs;
 	unsigned long flags, *rmids_bm = NULL;
-	bool do_active_goal, read_dirty = true, dirty_is_max;
+	bool do_active_goal, do_dirty_goal, read_dirty = true, dirty_is_max;
 
 	lockdep_assert_held(&pkgd->mutex);
 
@@ -1534,6 +1644,7 @@ static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
 
 	raw_spin_lock_irqsave(&pkgd->lock, flags);
 	nr_activated += __try_use_free_rmids(pkgd);
+	nr_dirty = pkgd->nr_dirty_rmids;
 	nr_dep_pmonrs = pkgd->nr_dep_pmonrs;
 	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
 
@@ -1544,14 +1655,27 @@ static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
 	dirty_is_max = dirty_thld >= max_dirty_thld;
 	do_active_goal = nr_activated < active_goal && !dirty_is_max;
 
+	dirty_goal = min(max_dirty_goal, nr_dep_pmonrs + dirty_cushion);
+	do_dirty_goal = nr_dirty < dirty_goal;
+
 	/*
 	 * Since Dep_Dirty pmonrs have their own dirty rmid, only Dep_Idle
 	 * pmonrs are waiting for a rmid to be available. Stop if no pmonr
 	 * wait for rmid or no goals to pursue.
 	 */
-	if (!nr_dep_pmonrs || !do_active_goal)
+	if (!nr_dep_pmonrs || (!do_dirty_goal && !do_active_goal))
 		goto exit;
 
+	if (do_dirty_goal) {
+		nr_to_steal = dirty_goal - nr_dirty;
+		nr_stolen = try_steal_active_pmonrs(pkgd, nr_to_steal);
+		/*
+		 * We already tried to steal from all Active pmonrs, so it
+		 * makes no sense to reattempt.
+		 */
+		max_dirty_goal = 0;
+	}
+
 	/*
 	 * Try to activate more pmonrs by increasing the dirty threshold.
 	 * Using the minimum observed occupancy in dirty rmids guarantees to
@@ -1633,6 +1757,7 @@ static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 	/* not precise elapsed time, but good enough for rotation purposes. */
 	unsigned int elapsed_ms = intel_cmt_pmu.hrtimer_interval_ms;
 	unsigned int active_goal, max_dirty_threshold;
+	unsigned int dirty_cushion, max_dirty_goal;
 
 	pkgd = container_of(to_delayed_work(work),
 			    struct pkg_data, rotation_work);
@@ -1649,7 +1774,21 @@ static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 	active_goal = max(1u, (elapsed_ms * __cmt_min_progress_rate) / 1000);
 	max_dirty_threshold = READ_ONCE(__cmt_max_threshold) / cmt_l3_scale;
 
-	__intel_cmt_rmid_rotate(pkgd, active_goal, max_dirty_threshold);
+	/*
+	 * Upper bound for the nr of rmids to be dirty in order to have a good
+	 * chance of finding enough rmids in next iteration of rotation logic.
+	 */
+	max_dirty_goal = min(active_goal + 1, (pkgd->max_rmid + 1) / 4);
+
+	/*
+	 * Nr of extra rmids to put in dirty in case some don't drop occupancy.
+	 * To be calculated in a sensible manner once statistics about rmid
+	 * recycling rate are in place.
+	 */
+	dirty_cushion = 2;
+
+	__intel_cmt_rmid_rotate(pkgd, active_goal, max_dirty_threshold,
+				max_dirty_goal, dirty_cushion);
 
 	if (intel_cmt_need_rmid_rotation(pkgd))
 		__intel_cmt_schedule_rotation_for_pkg(pkgd);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (41 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory David Carrillo-Cisneros
                   ` (2 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Allow the user to specify that the rmids used by a cgroup/event cannot be
stolen.

pmonrs marked with CMT_UF_NOSTEAL_RMID will never lose a valid rmid once
they have received one.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 10 +++++++++-
 arch/x86/events/intel/cmt.h |  5 ++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index e677511..8cbcbc6 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -41,7 +41,7 @@ static struct monr *monr_hrchy_root;
 
 /* Flags for root monr and all its pmonrs while being monitored. */
 static enum cmt_user_flags root_monr_uflags =
-		CMT_UF_HAS_USER | CMT_UF_NOLAZY_RMID;
+		CMT_UF_HAS_USER | CMT_UF_NOSTEAL_RMID | CMT_UF_NOLAZY_RMID;
 
 /* Auxiliar flags */
 static enum cmt_user_flags *pkg_uflags_zeroes;
@@ -499,6 +499,7 @@ static void monr_dealloc(struct monr *monr)
 	 */
 	if (WARN_ON_ONCE(!(monr->flags & CMT_MONR_ZOMBIE)) ||
 	    WARN_ON_ONCE(monr->nr_has_user) ||
+	    WARN_ON_ONCE(monr->nr_nosteal_rmid) ||
 	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
 	    WARN_ON_ONCE(monr->mon_cgrp) ||
 	    WARN_ON_ONCE(monr->mon_events))
@@ -743,10 +744,13 @@ static bool monr_account_uflags(struct monr *monr,
 
 	if (uflags & CMT_UF_HAS_USER)
 		monr->nr_has_user += account ? 1 : -1;
+	if (uflags & CMT_UF_NOSTEAL_RMID)
+		monr->nr_nosteal_rmid += account ? 1 : -1;
 	if (uflags & CMT_UF_NOLAZY_RMID)
 		monr->nr_nolazy_rmid += account ? 1 : -1;
 
 	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0) |
+			(monr->nr_nosteal_rmid ? CMT_UF_NOSTEAL_RMID : 0) |
 			(monr->nr_nolazy_rmid ? CMT_UF_NOLAZY_RMID : 0);
 
 	return old_flags != monr->uflags;
@@ -1400,6 +1404,10 @@ static bool can_steal_rmid(struct pmonr *pmonr)
 	u64 last_all_active, next_steal_time, last_pmonr_active;
 
 	last_all_active = atomic64_read(&monr->last_rmid_recoup);
+
+	if (pmonr_uflags(pmonr) & CMT_UF_NOSTEAL_RMID)
+		return false;
+
 	/*
 	 * Can steal if no pmonr has been stolen or all not Unused have been
 	 * in Active state for long enough.
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 872cce0..c377076 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -290,7 +290,8 @@ enum cmt_user_flags {
 	/* if no has_user other flags are meaningless. */
 	CMT_UF_HAS_USER		= BIT(0), /* has cgroup or event users */
 	CMT_UF_NOLAZY_RMID	= BIT(1), /* try to obtain rmid on creation */
-	CMT_UF_MAX		= BIT(2) - 1,
+	CMT_UF_NOSTEAL_RMID	= BIT(2), /* do not steal this rmid */
+	CMT_UF_MAX		= BIT(3) - 1,
 	CMT_UF_ERROR		= CMT_UF_MAX + 1,
 };
 
@@ -309,6 +310,7 @@ enum cmt_user_flags {
  *			monr is monitoring in all required packages.
  * @flags:		monr_flags.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
+ * @nr_nosteal_rmid:	nr of CMT_UF_NOSTEAL_RMID set in events in mon_events.
  * @nr_nolazy_user:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
  * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
  * @pkg_uflags:		package level cmt_user_flags, each entry is used as
@@ -335,6 +337,7 @@ struct monr {
 
 	enum monr_flags			flags;
 	int				nr_has_user;
+	int				nr_nosteal_rmid;
 	int				nr_nolazy_rmid;
 	enum cmt_user_flags		uflags;
 	enum cmt_user_flags		pkg_uflags[];
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (42 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 45/46] perf/stat: fix bug in handling events in error state David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events David Carrillo-Cisneros
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add a debugfs directory to intel_cmt to help maintenance. It exposes the
following human-readable snapshots of the internals of the CMT driver:
  - hrchy: a per-monr view of the monr hierarchy.
  - pkgs: a per-package view of all online struct pkg_data.
  - rmids: a per-package view of the occupancy and state of all rmids.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 385 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 385 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 8cbcbc6..554ebd2 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -2372,6 +2372,376 @@ static ssize_t max_recycle_threshold_store(struct device *dev,
 
 static DEVICE_ATTR_RW(max_recycle_threshold);
 
+
+#define INTEL_CMT_DEBUGFS
+
+#ifdef INTEL_CMT_DEBUGFS
+
+#include <linux/debugfs.h>
+
+#define DBG_PRINTF(format__, ...) \
+	seq_printf(s, "%*s" format__, 4 * pad, "", ##__VA_ARGS__)
+
+static void cmt_dbg_show__pmonr(struct seq_file *s,
+				struct pmonr *pmonr,
+				int pad)
+{
+	struct pmonr *pos;
+	union pmonr_rmids rmids;
+	static const char * const state_strs[] = {"OFF", "UNUSED", "ACTIVE",
+						  "DEP_IDLE", "DEP_DIRTY"};
+
+	DBG_PRINTF("pmonr: (%p, pkgid: %d, monr: %p)\n",
+			pmonr, pmonr->pkgd->pkgid, pmonr->monr);
+	pad++;
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	DBG_PRINTF("atomic_rmids: (%d,%d), state: %s, lender: %p\n",
+			rmids.sched_rmid, rmids.read_rmid,
+			state_strs[pmonr->state], pmonr->lender);
+
+	if (pmonr->state == PMONR_ACTIVE) {
+		DBG_PRINTF("pmonr_deps_head:");
+		list_for_each_entry(pos, &pmonr->pmonr_deps_head,
+				    pmonr_deps_entry) {
+			seq_printf(s, "%p,", pos);
+		}
+		seq_puts(s, "\n");
+	}
+
+	DBG_PRINTF("last_enter_active: %lu\n",
+			atomic64_read(&pmonr->last_enter_active));
+
+}
+
+static void cmt_dbg_show__monr(struct seq_file *s, struct monr *monr, int pad)
+{
+	struct pkg_data *pkgd = NULL;
+	struct monr *pos;
+	struct pmonr *pmonr;
+	int p;
+
+	DBG_PRINTF("\nmonr: %p, parent: %p\n", monr, monr->parent);
+
+	pad++;
+
+	DBG_PRINTF("children: [");
+
+
+	list_for_each_entry(pos, &monr->children, parent_entry)
+		seq_printf(s, "%p, ", pos);
+
+	seq_puts(s, "]\n");
+
+	DBG_PRINTF("mon_cgrp: (%s, %p)",
+		monr->mon_cgrp ? monr->mon_cgrp->css.cgroup->kn->name : "NA",
+		monr->mon_cgrp);
+	DBG_PRINTF("mon_events: %p\n", monr->mon_events);
+
+	DBG_PRINTF("pmonrs:\n");
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		pmonr = pkgd_pmonr(pkgd, monr);
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+	}
+	rcu_read_unlock();
+
+	DBG_PRINTF("last_rmid_recoup: %lu, nr_dep_pmonrs: %u, flags: %x\n",
+		atomic64_read(&monr->last_rmid_recoup),
+		atomic_read(&monr->nr_dep_pmonrs), monr->flags);
+	DBG_PRINTF("nr_has_user: %d, nr_nosteal_rmid: %d,",
+		monr->nr_has_user, monr->nr_nosteal_rmid);
+	DBG_PRINTF("nr_nolazy_rmid: %d, uflags: %x\n",
+		monr->nr_nolazy_rmid, monr->uflags);
+	DBG_PRINTF("pkg_uflags: [");
+	for (p = 0; p < topology_max_packages(); p++)
+		seq_printf(s, "%x;", monr->pkg_uflags[p]);
+	seq_puts(s, "]\n");
+}
+
+
+static void cmt_dbg_show__pkgd(struct seq_file *s,
+			       struct pkg_data *pkgd, int pad, bool csd_full)
+{
+	struct pmonr *pmonr;
+	unsigned long flags;
+	int r;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+	DBG_PRINTF("\npkgd: %p, pkgid: %d, max_rmid: %d\n",
+		pkgd, pkgd->pkgid, pkgd->max_rmid);
+	pad++;
+
+	DBG_PRINTF("free_rmids: [%*pbl]\n",
+			CMT_MAX_NR_RMIDS, pkgd->free_rmids);
+	DBG_PRINTF("dirty_rmids: [%*pbl]\n",
+			CMT_MAX_NR_RMIDS, pkgd->dirty_rmids);
+
+	DBG_PRINTF("active_pmonrs:\n");
+	list_for_each_entry(pmonr, &pkgd->active_pmonrs, rot_entry)
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+
+	DBG_PRINTF("dep_idle_pmonrs:\n");
+	list_for_each_entry(pmonr, &pkgd->dep_idle_pmonrs, rot_entry)
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+
+	DBG_PRINTF("dep_dirty_pmonrs:\n");
+	list_for_each_entry(pmonr, &pkgd->dep_dirty_pmonrs, rot_entry)
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+	/*
+	 * Only print the pmonr pointer since these pmonrs are already listed
+	 * above as either Dep_Idle or Dep_Dirty pmonrs.
+	 */
+	DBG_PRINTF("dep_pmonrs: [");
+	list_for_each_entry(pmonr, &pkgd->dep_pmonrs, pkgd_deps_entry)
+		seq_printf(s, "%p,", pmonr);
+	seq_puts(s, "]\n");
+
+	DBG_PRINTF("nr_dirty_rmids: %d, nr_dep_pmonrs: %d\n",
+		pkgd->nr_dirty_rmids, pkgd->nr_dep_pmonrs);
+
+	DBG_PRINTF("work_cpu: %d\n", pkgd->work_cpu);
+
+	DBG_PRINTF("ccsds (");
+	if (csd_full)
+		seq_puts(s, "flags, info, func, ");
+	seq_puts(s, "on_read, value, ret, rmid): [");
+
+	pad += 1;
+	for (r = 0; r <= pkgd->max_rmid; r++) {
+		struct cmt_csd *ccsd = &pkgd->ccsds[r];
+
+		if (r % 4 == 0) {
+			seq_puts(s, "\n");
+			DBG_PRINTF("(");
+		} else {
+			seq_puts(s, "(");
+		}
+
+		if (csd_full) {
+			seq_printf(s, "%d,  %p, %p, ", ccsd->csd.flags,
+				   ccsd->csd.info, ccsd->csd.func);
+		}
+		seq_printf(s, "%d, %llu, %d, %d",
+			atomic_read(&ccsd->on_read), ccsd->value,
+			ccsd->ret, ccsd->rmid);
+		seq_puts(s, "),\t");
+	}
+	seq_puts(s, "]");
+	pad -= 1;
+
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+}
+
+static int cmt_dbg_pkgs_show(struct seq_file *s, void *unused)
+{
+	struct pkg_data *pkgd = NULL;
+	int pad = 0;
+
+	mutex_lock(&cmt_mutex);
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		cmt_dbg_show__pkgd(s, pkgd, pad, false);
+		seq_puts(s, "\n");
+	}
+
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+}
+
+static int cmt_dbg_hrchy_show(struct seq_file *s, void *unused)
+{
+	struct monr *pos = NULL;
+	int pad = 0;
+
+	mutex_lock(&cmt_mutex);
+	while ((pos = monr_next_descendant_pre(pos, monr_hrchy_root)))
+		cmt_dbg_show__monr(s, pos, pad);
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+}
+
+/* Must run on a CPU in pkgd's pkg. */
+static int cmt_dbg_rmids__rmids(struct seq_file *s, struct pkg_data *pkgd,
+				unsigned long *rmids, int pad)
+{
+	unsigned long zero_val[CMT_MAX_NR_RMIDS_LONGS];
+	int err, r, nr_printed = 0;
+	u64 val;
+
+	bitmap_copy(zero_val, rmids, CMT_MAX_NR_RMIDS);
+
+	DBG_PRINTF("non-zero value (rmid, scaled llc_occupancy): [");
+	pad++;
+	for_each_set_bit(r, rmids, CMT_MAX_NR_RMIDS) {
+		err = cmt_rmid_read(r, &val);
+		if (!err && !val)
+			continue;
+
+		if (nr_printed % 4 == 0) {
+			seq_puts(s, "\n");
+			DBG_PRINTF("(");
+		} else {
+			seq_puts(s, "(");
+		}
+		nr_printed++;
+
+		if (err) {
+			seq_printf(s, "%d, error: %d),\t", r, err);
+			__clear_bit(r, zero_val);
+			continue;
+		}
+		seq_printf(s, "%d, %llu), ", r, val * cmt_l3_scale);
+		__clear_bit(r, zero_val);
+	}
+	seq_puts(s, "]\n");
+	pad--;
+
+	DBG_PRINTF("zero value: [%*pbl]\n", CMT_MAX_NR_RMIDS, zero_val);
+
+	return 0;
+}
+
+static int __cmt_dbg_rmids__pkgd(struct seq_file *s,
+				struct pkg_data *pkgd, int pad)
+{
+	unsigned long rmids_in_pmonr[CMT_MAX_NR_RMIDS_LONGS];
+	int r;
+
+	memset(rmids_in_pmonr, 0, CMT_MAX_NR_RMIDS_BYTES);
+	bitmap_fill(rmids_in_pmonr, pkgd->max_rmid + 1);
+
+	for_each_set_bit(r, pkgd->free_rmids, CMT_MAX_NR_RMIDS)
+		__clear_bit(r, rmids_in_pmonr);
+
+	for_each_set_bit(r, pkgd->dirty_rmids, CMT_MAX_NR_RMIDS)
+		__clear_bit(r, rmids_in_pmonr);
+
+
+	raw_spin_lock(&pkgd->lock);
+
+	DBG_PRINTF("free_rmids:\n");
+	cmt_dbg_rmids__rmids(s, pkgd, pkgd->free_rmids, pad + 1);
+
+	DBG_PRINTF("dirty_rmids:\n");
+	cmt_dbg_rmids__rmids(s, pkgd, pkgd->dirty_rmids, pad + 1);
+
+	DBG_PRINTF("rmids_in_pmonr:\n");
+	cmt_dbg_rmids__rmids(s, pkgd, rmids_in_pmonr, pad + 1);
+
+	raw_spin_unlock(&pkgd->lock);
+
+	return 0;
+}
+
+struct dbg_smp_data {
+	struct seq_file *s;
+	struct pkg_data *pkgd;
+	int pad;
+};
+
+void cmt_dbg_rmids_pkgd(void *data)
+{
+	struct dbg_smp_data *d;
+
+	d = (struct dbg_smp_data *)data;
+	__cmt_dbg_rmids__pkgd(d->s, d->pkgd, d->pad);
+}
+
+static int cmt_dbg_rmids_show(struct seq_file *s, void *unused)
+{
+	struct dbg_smp_data d;
+	struct pkg_data *pkgd = NULL;
+	int pad = 0, err;
+
+	mutex_lock(&cmt_mutex);
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		DBG_PRINTF("pkgid: %d\n", pkgd->pkgid);
+		d.s = s;
+		d.pkgd = pkgd;
+		d.pad = pad + 1;
+		err = smp_call_function_single(pkgd->work_cpu,
+					       cmt_dbg_rmids_pkgd, &d, true);
+		seq_puts(s, "\n");
+	}
+
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+
+}
+
+#define CMT_DBGS_FILE(name__) \
+static int cmt_dbg_ ## name__ ## _open(struct inode *inode, struct file *file)\
+{\
+	return single_open(file, cmt_dbg_ ## name__ ## _show,\
+			   inode->i_private);\
+} \
+static const struct file_operations cmt_dbg_ ## name__ ## _ops = {\
+	.open		= cmt_dbg_ ## name__ ## _open,\
+	.read		= seq_read,\
+	.llseek		= seq_lseek,\
+	.release	= single_release,\
+}
+
+CMT_DBGS_FILE(hrchy);
+CMT_DBGS_FILE(pkgs);
+CMT_DBGS_FILE(rmids);
+
+struct dentry *cmt_dbgfs_root;
+
+static int start_debugfs(void)
+{
+	struct dentry *root, *pkgs, *hrchy, *rmids;
+
+	root = debugfs_create_dir("intel_cmt", NULL);
+	if (IS_ERR(root))
+		return PTR_ERR(root);
+
+	pkgs = debugfs_create_file("pkgs", 0444,
+				   root, NULL, &cmt_dbg_pkgs_ops);
+	if (IS_ERR(pkgs))
+		return PTR_ERR(pkgs);
+
+	hrchy = debugfs_create_file("hrchy", 0444,
+				   root, NULL, &cmt_dbg_hrchy_ops);
+	if (IS_ERR(hrchy))
+		return PTR_ERR(hrchy);
+
+	rmids = debugfs_create_file("rmids", 0444,
+				   root, NULL, &cmt_dbg_rmids_ops);
+	if (IS_ERR(rmids))
+		return PTR_ERR(rmids);
+
+	cmt_dbgfs_root = root;
+
+	return 0;
+}
+
+static void stop_debugfs(void)
+{
+	if (!cmt_dbgfs_root)
+		return;
+	debugfs_remove_recursive(cmt_dbgfs_root);
+	cmt_dbgfs_root = NULL;
+}
+
+#else
+
+static int start_debugfs(void)
+{
+	return 0;
+}
+
+static void stop_debugfs(void)
+{
+}
+
+#endif
+
 static struct attribute *intel_cmt_attrs[] = {
 	&dev_attr_max_recycle_threshold.attr,
 	NULL,
@@ -2774,6 +3144,15 @@ static void cmt_stop(void)
 	mutex_unlock(&cmt_mutex);
 }
 
+static void intel_cmt_terminate(void)
+{
+	stop_debugfs();
+	static_branch_dec(&pqr_common_enable_key);
+	perf_pmu_unregister(&intel_cmt_pmu);
+	cmt_stop();
+	cmt_dealloc();
+}
+
 static int __init cmt_alloc(void)
 {
 	pkg_uflags_size = sizeof(*pkg_uflags_zeroes) * topology_max_packages();
@@ -2904,6 +3283,12 @@ static int __init intel_cmt_init(void)
 
 	static_branch_inc(&pqr_common_enable_key);
 
+	err = start_debugfs();
+	if (err) {
+		intel_cmt_terminate();
+		goto err_exit;
+	}
+
 	return err;
 
 err_stop:
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 45/46] perf/stat: fix bug in handling events in error state
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (43 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events David Carrillo-Cisneros
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

From: Stephane Eranian <eranian@google.com>

When an event is in error state, read() returns 0
instead of the size of the buffer. In certain modes, such
as interval printing, ignoring the 0 return value
may cause bogus count deltas to be computed and
thus invalid results to be printed.

This patch fixes the problem by modifying read_counters()
to mark the event as not scaled (scaled = -1), forcing
the printout routine to show <NOT COUNTED>.
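
For illustration, a minimal user-space sketch of the failure mode described
above; the struct layout and helper name are illustrative, not taken from
the perf tool sources:

	#include <stdint.h>
	#include <unistd.h>

	struct counts { uint64_t val, ena, run; };

	/* Returns 0 on a complete read, -1 otherwise. */
	static int read_count(int fd, struct counts *c)
	{
		ssize_t n = read(fd, c, sizeof(*c));

		/*
		 * Checking only for n < 0 misses the n == 0 case (event in
		 * error state): *c stays stale and interval deltas become
		 * bogus. Treating n <= 0 as a failure, as this patch does,
		 * avoids that.
		 */
		return n == (ssize_t)sizeof(*c) ? 0 : -1;
	}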

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/builtin-stat.c | 12 +++++++++---
 tools/perf/util/evsel.c   |  4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 688dea7..c3c4b49 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -310,8 +310,12 @@ static int read_counter(struct perf_evsel *counter)
 			struct perf_counts_values *count;
 
 			count = perf_counts(counter->counts, cpu, thread);
-			if (perf_evsel__read(counter, cpu, thread, count))
+			if (perf_evsel__read(counter, cpu, thread, count)) {
+				counter->counts->scaled = -1;
+				perf_counts(counter->counts, cpu, thread)->ena = 0;
+				perf_counts(counter->counts, cpu, thread)->run = 0;
 				return -1;
+			}
 
 			if (STAT_RECORD) {
 				if (perf_evsel__write_stat_event(counter, cpu, thread, count)) {
@@ -336,12 +340,14 @@ static int read_counter(struct perf_evsel *counter)
 static void read_counters(void)
 {
 	struct perf_evsel *counter;
+	int ret;
 
 	evlist__for_each_entry(evsel_list, counter) {
-		if (read_counter(counter))
+		ret = read_counter(counter);
+		if (ret)
 			pr_debug("failed to read counter %s\n", counter->name);
 
-		if (perf_stat_process_counter(&stat_config, counter))
+		if (ret == 0 && perf_stat_process_counter(&stat_config, counter))
 			pr_warning("failed to process counter %s\n", counter->name);
 	}
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 8bc2711..d54efb5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1221,7 +1221,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
-	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
+	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
 		return -errno;
 
 	return 0;
@@ -1239,7 +1239,7 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 	if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
 		return -ENOMEM;
 
-	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) < 0)
+	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
 		return -errno;
 
 	perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (44 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 45/46] perf/stat: fix bug in handling events in error state David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A package wide event can return a valid read even if it has not run on a
specific cpu; this does not fit well with the assumption that run == 0
is equivalent to <not counted>.

To fix the problem, this patch defines special error values for val,
run and ena (~0ULL) and uses them to signal read errors, allowing run == 0
to be a valid value for package events. A new value, (NA), is output on
read error and when the event has not been enabled (time enabled == 0).

Finally, this patch revamps the calculation of deltas and scaling for
snapshot events, removing the calculation of deltas for time running and
enabled in snapshot events, as it should be.

Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 tools/perf/builtin-stat.c | 36 +++++++++++++++++++++++-----------
 tools/perf/util/counts.h  | 19 ++++++++++++++++++
 tools/perf/util/evsel.c   | 49 ++++++++++++++++++++++++++++++++++++-----------
 tools/perf/util/evsel.h   |  8 ++++++--
 tools/perf/util/stat.c    | 35 +++++++++++----------------------
 5 files changed, 99 insertions(+), 48 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c3c4b49..79043a3 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -311,10 +311,8 @@ static int read_counter(struct perf_evsel *counter)
 
 			count = perf_counts(counter->counts, cpu, thread);
 			if (perf_evsel__read(counter, cpu, thread, count)) {
-				counter->counts->scaled = -1;
-				perf_counts(counter->counts, cpu, thread)->ena = 0;
-				perf_counts(counter->counts, cpu, thread)->run = 0;
-				return -1;
+				/* do not write stat for failed reads. */
+				continue;
 			}
 
 			if (STAT_RECORD) {
@@ -725,12 +723,16 @@ static int run_perf_stat(int argc, const char **argv)
 
 static void print_running(u64 run, u64 ena)
 {
+	bool is_na = run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || !ena;
+
 	if (csv_output) {
-		fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
-					csv_sep,
-					run,
-					csv_sep,
-					ena ? 100.0 * run / ena : 100.0);
+		if (is_na)
+			fprintf(stat_config.output, "%sNA%sNA", csv_sep, csv_sep);
+		else
+			fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
+				csv_sep, run, csv_sep, 100.0 * run / ena);
+	} else if (is_na) {
+		fprintf(stat_config.output, "  (NA)");
 	} else if (run != ena) {
 		fprintf(stat_config.output, "  (%.2f%%)", 100.0 * run / ena);
 	}
@@ -1103,7 +1105,7 @@ static void printout(int id, int nr, struct perf_evsel *counter, double uval,
 		if (counter->cgrp)
 			os.nfields++;
 	}
-	if (run == 0 || ena == 0 || counter->counts->scaled == -1) {
+	if (run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || counter->counts->scaled == -1) {
 		if (metric_only) {
 			pm(&os, NULL, "", "", 0);
 			return;
@@ -1209,12 +1211,17 @@ static void print_aggr(char *prefix)
 		id = aggr_map->map[s];
 		first = true;
 		evlist__for_each_entry(evsel_list, counter) {
+			bool all_nan = true;
 			val = ena = run = 0;
 			nr = 0;
 			for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
 				s2 = aggr_get_id(perf_evsel__cpus(counter), cpu);
 				if (s2 != id)
 					continue;
+				/* skip NA reads. */
+				if (perf_counts_values__is_na(perf_counts(counter->counts, cpu, 0)))
+					continue;
+				all_nan = false;
 				val += perf_counts(counter->counts, cpu, 0)->val;
 				ena += perf_counts(counter->counts, cpu, 0)->ena;
 				run += perf_counts(counter->counts, cpu, 0)->run;
@@ -1228,6 +1235,10 @@ static void print_aggr(char *prefix)
 				fprintf(output, "%s", prefix);
 
 			uval = val * counter->scale;
+			if (all_nan) {
+				run = PERF_COUNTS_NA;
+				ena = PERF_COUNTS_NA;
+			}
 			printout(id, nr, counter, uval, prefix, run, ena, 1.0);
 			if (!metric_only)
 				fputc('\n', output);
@@ -1306,7 +1317,10 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 		if (prefix)
 			fprintf(output, "%s", prefix);
 
-		uval = val * counter->scale;
+		if (val != PERF_COUNTS_NA)
+			uval = val * counter->scale;
+		else
+			uval = NAN;
 		printout(cpu, 0, counter, uval, prefix, run, ena, 1.0);
 
 		fputc('\n', output);
diff --git a/tools/perf/util/counts.h b/tools/perf/util/counts.h
index 34d8baa..b65e97a 100644
--- a/tools/perf/util/counts.h
+++ b/tools/perf/util/counts.h
@@ -3,6 +3,9 @@
 
 #include "xyarray.h"
 
+/* Not Available (NA) value. Any operation with a NA equals a NA. */
+#define PERF_COUNTS_NA ((u64)~0ULL)
+
 struct perf_counts_values {
 	union {
 		struct {
@@ -14,6 +17,22 @@ struct perf_counts_values {
 	};
 };
 
+static inline void
+perf_counts_values__make_na(struct perf_counts_values *values)
+{
+	values->val = PERF_COUNTS_NA;
+	values->ena = PERF_COUNTS_NA;
+	values->run = PERF_COUNTS_NA;
+}
+
+static inline bool
+perf_counts_values__is_na(struct perf_counts_values *values)
+{
+	return values->val == PERF_COUNTS_NA ||
+	       values->ena == PERF_COUNTS_NA ||
+	       values->run == PERF_COUNTS_NA;
+}
+
 struct perf_counts {
 	s8			  scaled;
 	struct perf_counts_values aggr;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d54efb5..fa0ba96 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1180,6 +1180,9 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 	if (!evsel->prev_raw_counts)
 		return;
 
+	if (perf_counts_values__is_na(count))
+		return;
+
 	if (cpu == -1) {
 		tmp = evsel->prev_raw_counts->aggr;
 		evsel->prev_raw_counts->aggr = *count;
@@ -1188,26 +1191,43 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 		*perf_counts(evsel->prev_raw_counts, cpu, thread) = *count;
 	}
 
-	count->val = count->val - tmp.val;
+	/* Snapshot events do not calculate deltas for count values. */
+	if (!evsel->snapshot)
+		count->val = count->val - tmp.val;
 	count->ena = count->ena - tmp.ena;
 	count->run = count->run - tmp.run;
 }
 
 void perf_counts_values__scale(struct perf_counts_values *count,
-			       bool scale, s8 *pscaled)
+			       bool scale, bool per_pkg, bool snapshot, s8 *pscaled)
 {
 	s8 scaled = 0;
 
+	if (perf_counts_values__is_na(count)) {
+		if (pscaled)
+			*pscaled = -1;
+		return;
+	}
+
 	if (scale) {
-		if (count->run == 0) {
+		/*
+		 * per-pkg events can have run == 0 in a CPU and still be
+		 * valid.
+		 */
+		if (count->run == 0 && !per_pkg) {
 			scaled = -1;
 			count->val = 0;
 		} else if (count->run < count->ena) {
 			scaled = 1;
-			count->val = (u64)((double) count->val * count->ena / count->run + 0.5);
+			/* Snapshot events do not scale counts values. */
+			if (!snapshot && count->run)
+				count->val = (u64)((double) count->val * count->ena /
+					     count->run + 0.5);
 		}
-	} else
-		count->ena = count->run = 0;
+
+	} else {
+		count->run = count->ena;
+	}
 
 	if (pscaled)
 		*pscaled = scaled;
@@ -1221,8 +1241,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
-	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
+	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0) {
+		perf_counts_values__make_na(count);
 		return -errno;
+	}
 
 	return 0;
 }
@@ -1230,6 +1252,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 			      int cpu, int thread, bool scale)
 {
+	int ret = 0;
 	struct perf_counts_values count;
 	size_t nv = scale ? 3 : 1;
 
@@ -1239,13 +1262,17 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 	if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
 		return -ENOMEM;
 
-	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
-		return -errno;
+	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0) {
+		perf_counts_values__make_na(&count);
+		ret = -errno;
+		goto exit;
+	}
 
 	perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-	perf_counts_values__scale(&count, scale, NULL);
+	perf_counts_values__scale(&count, scale, evsel->per_pkg, evsel->snapshot, NULL);
+exit:
 	*perf_counts(evsel->counts, cpu, thread) = count;
-	return 0;
+	return ret;
 }
 
 static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index b1503b0..facb6494 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -80,6 +80,10 @@ struct perf_evsel_config_term {
  * @is_pos: the position (counting backwards) of the event id (PERF_SAMPLE_ID or
  *          PERF_SAMPLE_IDENTIFIER) in a non-sample event i.e. if sample_id_all
  *          is used there is an id sample appended to non-sample events
+ * @snapshot: an event whose raw value cannot be extrapolated based on
+ *	    the ratio of running/enabled time.
+ * @per_pkg: an event that runs package wide. All cores in the same package
+ *	    will read the same value, even if running time == 0.
  * @priv:   And what is in its containing unnamed union are tool specific
  */
 struct perf_evsel {
@@ -150,8 +154,8 @@ static inline int perf_evsel__nr_cpus(struct perf_evsel *evsel)
 	return perf_evsel__cpus(evsel)->nr;
 }
 
-void perf_counts_values__scale(struct perf_counts_values *count,
-			       bool scale, s8 *pscaled);
+void perf_counts_values__scale(struct perf_counts_values *count, bool scale,
+			       bool per_pkg, bool snapshot, s8 *pscaled);
 
 void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 				struct perf_counts_values *count);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 39345c2d..514b953 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -202,7 +202,7 @@ static void zero_per_pkg(struct perf_evsel *counter)
 }
 
 static int check_per_pkg(struct perf_evsel *counter,
-			 struct perf_counts_values *vals, int cpu, bool *skip)
+			 int cpu, bool *skip)
 {
 	unsigned long *mask = counter->per_pkg_mask;
 	struct cpu_map *cpus = perf_evsel__cpus(counter);
@@ -224,17 +224,6 @@ static int check_per_pkg(struct perf_evsel *counter,
 		counter->per_pkg_mask = mask;
 	}
 
-	/*
-	 * we do not consider an event that has not run as a good
-	 * instance to mark a package as used (skip=1). Otherwise
-	 * we may run into a situation where the first CPU in a package
-	 * is not running anything, yet the second is, and this function
-	 * would mark the package as used after the first CPU and would
-	 * not read the values from the second CPU.
-	 */
-	if (!(vals->run && vals->ena))
-		return 0;
-
 	s = cpu_map__get_socket(cpus, cpu, NULL);
 	if (s < 0)
 		return -1;
@@ -249,30 +238,27 @@ process_counter_values(struct perf_stat_config *config, struct perf_evsel *evsel
 		       struct perf_counts_values *count)
 {
 	struct perf_counts_values *aggr = &evsel->counts->aggr;
-	static struct perf_counts_values zero;
 	bool skip = false;
 
-	if (check_per_pkg(evsel, count, cpu, &skip)) {
+	if (check_per_pkg(evsel, cpu, &skip)) {
 		pr_err("failed to read per-pkg counter\n");
 		return -1;
 	}
 
-	if (skip)
-		count = &zero;
-
 	switch (config->aggr_mode) {
 	case AGGR_THREAD:
 	case AGGR_CORE:
 	case AGGR_SOCKET:
 	case AGGR_NONE:
-		if (!evsel->snapshot)
-			perf_evsel__compute_deltas(evsel, cpu, thread, count);
-		perf_counts_values__scale(count, config->scale, NULL);
+		perf_evsel__compute_deltas(evsel, cpu, thread, count);
+		perf_counts_values__scale(count, config->scale,
+					  evsel->per_pkg, evsel->snapshot, NULL);
 		if (config->aggr_mode == AGGR_NONE)
 			perf_stat__update_shadow_stats(evsel, count->values, cpu);
 		break;
 	case AGGR_GLOBAL:
-		aggr->val += count->val;
+		if (!skip)
+			aggr->val += count->val;
 		if (config->scale) {
 			aggr->ena += count->ena;
 			aggr->run += count->run;
@@ -337,9 +323,10 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	if (config->aggr_mode != AGGR_GLOBAL)
 		return 0;
 
-	if (!counter->snapshot)
-		perf_evsel__compute_deltas(counter, -1, -1, aggr);
-	perf_counts_values__scale(aggr, config->scale, &counter->counts->scaled);
+	perf_evsel__compute_deltas(counter, -1, -1, aggr);
+	perf_counts_values__scale(aggr, config->scale,
+				  counter->per_pkg, counter->snapshot,
+				  &counter->counts->scaled);
 
 	for (i = 0; i < 3; i++)
 		update_stats(&ps->res_stats[i], count[i]);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support
  2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
@ 2016-11-10 15:19   ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 15:19 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:

> +static void free_pkg_data(struct pkg_data *pkg_data)
> +{
> +	kfree(pkg_data);
> +}

So this is called from __terminate_pkg_data() which itself is called from
some random other place. Is this a code chasing game or is there a
technical reason why functions which belong together are not grouped
together?

> +/* Init pkg_data for @cpu 's package. */
> +static struct pkg_data *alloc_pkg_data(int cpu)
> +{
> +	struct cpuinfo_x86 *c = &cpu_data(cpu);
> +	struct pkg_data *pkgd;
> +	int numa_node = cpu_to_node(cpu);
> +	u16 pkgid = topology_logical_package_id(cpu);

Can you please sort the variables in reverse fir tree order?

	u16 pkgid = topology_logical_package_id(cpu);
	struct cpuinfo_x86 *c = &cpu_data(cpu);
	int numa_node = cpu_to_node(cpu);
	struct pkg_data *pkgd;

That's way simpler to parse than this random ordering.

> +
> +	if (c->x86_cache_occ_scale != cmt_l3_scale) {

And why would c->x86_cache_occ_scale be initialized already when you really
hotplug a CPU? It cannot be initialized because it is done from
identify_cpu() when the cpu actually starts.

You just never noticed because the driver initializes _AFTER_ all the cpus
are brought up and you never bothered to limit the number of cpus which are
brought up at boot time to a single node and then bring up the other node
_after_ loading the driver.

> +		/* 0 scale must have been converted to 1 automatically. */
> +		if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {

This check will just explode in your face when you do the above because
c->x86_cache_occ_scale is 0.

So IOW. This is broken and the wrong place to do this.

> +			pr_err("Multiple LLC scale values, disabling CMT support.\n");

Interesting. You disable CMT support. That's true for init(). In the real
hotplug case you prevent bringing the cpu up, so the message is misleading
because CMT for the already online cpus is already working and keeps doing so.

> +			return ERR_PTR(-ENXIO);
> +		}
> +	}
> +
> +	pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, numa_node);
> +	if (!pkgd)
> +		return ERR_PTR(-ENOMEM);
> +
> +	pkgd->max_rmid = c->x86_cache_max_rmid;
> +
> +	pkgd->work_cpu = cpu;

This is wrong. This wants to be -1 or something invalid. We can now stop
the hotplug process of a CPU at some random state. So if we stop right
after this callback, then this not yet online cpu is set as work cpu, and if
we then bring up another cpu in the package, then it operates with a stale
work cpu. Please make stuff symmetric. The pre online prep stage is just
there to prepare data and pre-initialize it. Anything which is related to
operational state has to be done at the point where things become
operational and undone at the same state when going down.
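
A minimal sketch of such a symmetric scheme: leave pkgd->work_cpu invalid
in the prep stage and only assign/clear it in the online callbacks.
pkg_data_of() is an assumed lookup helper, and a signed work_cpu with -1
as the invalid value is an assumption, not what the patch does:

	static int intel_cmt_hp_online_enter(unsigned int cpu)
	{
		struct pkg_data *pkgd = pkg_data_of(cpu);

		/* The package becomes operational here, not in the prep stage. */
		if (pkgd->work_cpu < 0)
			pkgd->work_cpu = cpu;
		return 0;
	}

	static int intel_cmt_hp_online_exit(unsigned int cpu)
	{
		struct pkg_data *pkgd = pkg_data_of(cpu);

		/* Undo at the same stage when going down. */
		if (pkgd->work_cpu == cpu)
			pkgd->work_cpu = -1;	/* or hand over to another online cpu */
		return 0;
	}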

> +	pkgd->pkgid = pkgid;
> +
> +	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);

What protects against the case where rmids are in use already and this cuts
__min_max_rmid short during hotplug?

> +static int init_pkg_data(int cpu)
> +{
> +	struct pkg_data *pkgd;
> +	u16 pkgid = topology_logical_package_id(cpu);
> +
> +	lockdep_assert_held(&cmt_mutex);
> +
> +	/* Verify that this pkgid isn't already initialized. */
> +	if (WARN_ON_ONCE(cmt_pkgs_data[pkgid]))
> +		return -EPERM;

For one, this is a direct dereference of something which claims to be rcu
protected. That's inconsistent.

Further this check is completely pointless. This function is called from
intel_cmt_prep_up() after detecting that there is no package data for this
particular package id. I'm all for defensive programming, but this is just
beyond silly.

Aside from that, why are you looking up pkgid in three functions in a row
instead of simply handing it from one to the other?

 cmt_prep_up() -> init_pkg_data() -> alloc_pkg_data()
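
Something along these lines would do. This is only a sketch: the signatures
are adapted from the quoted code and the bodies are trimmed to the lines
relevant to passing pkgid down the chain:

	static struct pkg_data *alloc_pkg_data(int cpu, u16 pkgid)
	{
		struct pkg_data *pkgd;

		pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, cpu_to_node(cpu));
		if (!pkgd)
			return ERR_PTR(-ENOMEM);
		pkgd->pkgid = pkgid;
		return pkgd;
	}

	static int init_pkg_data(int cpu, u16 pkgid)
	{
		struct pkg_data *pkgd = alloc_pkg_data(cpu, pkgid);

		if (IS_ERR(pkgd))
			return PTR_ERR(pkgd);
		rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
		return 0;
	}

	static int intel_cmt_prep_up(unsigned int cpu)
	{
		u16 pkgid = topology_logical_package_id(cpu);
		int err = 0;

		mutex_lock(&cmt_mutex);
		if (!rcu_access_pointer(cmt_pkgs_data[pkgid]))
			err = init_pkg_data(cpu, pkgid);
		mutex_unlock(&cmt_mutex);
		return err;
	}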

> +	pkgd = alloc_pkg_data(cpu);
> +	if (IS_ERR(pkgd))
> +		return PTR_ERR(pkgd);
> +
> +	rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
> +	synchronize_rcu();

And this synchronize_rcu() is required because of what? This is the first
CPU of a package being brought up and you are holding the cmt_mutex.

I might be missing something, but if so, then this is missing a comment.

> +static int intel_cmt_prep_down(unsigned int cpu)
> +{
> +	struct pkg_data *pkgd;
> +	u16 pkgid = topology_logical_package_id(cpu);
> +
> +	mutex_lock(&cmt_mutex);
> +	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
> +					 lockdep_is_held(&cmt_mutex));
> +	if (pkgd->work_cpu >= nr_cpu_ids) {
> +		/* will destroy pkgd */
> +		__terminate_pkg_data(pkgd);

Oh right. You free data _BEFORE_ setting the pointer to NULL and
synchronizing RCU. So anything which is merely using rcu_read_lock() can
get a valid reference and fiddle with freed data. Well done.

> +		RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
> +		synchronize_rcu();
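
For reference, a sketch of the teardown ordering being argued for here,
using the names from the quoted hunk; locking and the work_cpu check are
elided:

	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
					 lockdep_is_held(&cmt_mutex));
	/* Unpublish first so no new rcu_read_lock() reader can find pkgd. */
	RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
	/* Wait for all pre-existing readers to drop their references. */
	synchronize_rcu();
	/* Only now is it safe to tear down and free the package data. */
	__terminate_pkg_data(pkgd);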

> +static int __init cmt_start(void)
> +{
> +	char *str, scale[20];
> +	int err;
> +
> +	/* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
> +	__min_max_rmid = UINT_MAX;
> +	err = cpuhp_setup_state(CPUHP_PERF_X86_CMT_PREP,
> +				"PERF_X86_CMT_PREP",
> +				intel_cmt_prep_up,
> +				intel_cmt_prep_down);
> +	if (err)
> +		return err;
> +
> +	err = cpuhp_setup_state(CPUHP_AP_PERF_X86_CMT_ONLINE,
> +				"AP_PERF_X86_CMT_ONLINE",
> +				intel_cmt_hp_online_enter,
> +				intel_cmt_hp_online_exit);
> +	if (err)
> +		goto rm_prep;
> +
> +	snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
> +	str = kstrdup(scale, GFP_KERNEL);

That string is duplicated for memory leak detector testing purposes or
what?

> +	if (!str) {
> +		err = -ENOMEM;
> +		goto rm_online;
> +	}
> +
> +	return 0;
> +
> +rm_online:
> +	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
> +rm_prep:
> +	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
> +
> +	return err;
> +}
> +
> +static int __init intel_cmt_init(void)
> +{
> +	struct pkg_data *pkgd = NULL;
> +	int err = 0;
> +
> +	if (!x86_match_cpu(intel_cmt_match)) {
> +		err = -ENODEV;
> +		goto err_exit;

This is crap. If a CPU does NOT support this then printing the
registration failed error is just confusing. If it's not supported, return
-ENODEV and be done with it.

> +	}
> +
> +	err = cmt_alloc();
> +	if (err)
> +		goto err_exit;
> +
> +	err = cmt_start();
> +	if (err)
> +		goto err_dealloc;
> +
> +	pr_info("Intel CMT enabled with ");
> +	rcu_read_lock();
> +	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
> +		pr_cont("%d RMIDs for pkg %d, ",
> +			pkgd->max_rmid + 1, pkgd->pkgid);

And this is useful because? Because it's so nice to have random stuff printed.

The only valuable information is that this is enabled with the detected
possible number of rmids (__min_max_rmid or whatever incomprehensible
variable name you came up with).

> +	}
> +	rcu_read_unlock();
> +	pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);

i.e this should be:

  	pr_info("Intel CMT enabled. %d RMIDs, L3 scale %d KBs", ....);

Am I missing something?

If you really want to print out the per package rmids, then the only reason
to do so is when they actually differ, and that can be done where you do that
min() check.

> +
> +device_initcall(intel_cmt_init);

Oh no. This wants to be a module from the very beginning. No point in
forcing this as a builtin for no reason.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
@ 2016-11-10 21:23   ` Thomas Gleixner
  2016-11-11  2:22     ` David Carrillo-Cisneros
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 21:23 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:
> Lockdep needs lock_class_key's to be statically initialized and/or use
> nesting, but nesting is currently hard-coded for up to 8 levels and it's
> fragile to depend on lockdep internals.
> To circumvent this problem, statically define CMT_MAX_NR_PKGS number of
> lock_class_key's.

That's a proper circumvention. Define a random number on your own.
 
> Additional details in code's comments.

This is 

 - pointless. Either there are comments or not. I can see that from
   the patch.

 - wrong. Because for this crucial detail of it (the lockdep
   restriction) there is no single comment in the code.

> Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
> ---
>  arch/x86/events/intel/cmt.c | 22 ++++++++++++++++++++++
>  arch/x86/events/intel/cmt.h |  8 ++++++++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
> index 267a9ec..f12a06b 100644
> --- a/arch/x86/events/intel/cmt.c
> +++ b/arch/x86/events/intel/cmt.c
> @@ -7,6 +7,14 @@
>  #include "cmt.h"
>  #include "../perf_event.h"
>  
> +/* Increase as needed as Intel CPUs grow. */

As CPUs grow? CMT is a per L3 cache domain, i.e. package, which is
equivalent to a socket today.

> +#define CMT_MAX_NR_PKGS		8

This already breaks on these shiny new 16 socket machines companies are
working on. Not to talk about the 4096 core monstrosities which are shipped
today, which have definitely more than 16 sockets already.

And now the obvious question: How are you going to fix this?

Just by increasing the number? See below.

> +#ifdef CONFIG_LOCKDEP
> +static struct lock_class_key	mutex_keys[CMT_MAX_NR_PKGS];
> +static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
> +#endif
> +
>  static DEFINE_MUTEX(cmt_mutex);
  
> diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
> index 8c16797..55416db 100644
> --- a/arch/x86/events/intel/cmt.h
> +++ b/arch/x86/events/intel/cmt.h
> @@ -11,11 +11,16 @@
>   * Rules:
>   *  - cmt_mutex: Hold for CMT init/terminate, event init/terminate,
>   *  cgroup start/stop.
> + *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
> + *  change the monr hierarchy.

So if you hold both locks in 8 packages then you already have 16 locks held
nested. Now double that with 16 packages and you are slowly approaching the
maximum lock chain depth lockdep can cope with. Now think SGI and guess
how that works.

I certainly can understand why you want to split the locks, but my gut
feeling tells me that while it's probably not a big issue on 2/4 socket
machines, it will go down the drain rather fast on anything really big.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu
  2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
@ 2016-11-10 21:27   ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 21:27 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:
>  static int __init cmt_alloc(void)
>  {
>  	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> @@ -240,6 +339,7 @@ static int __init cmt_start(void)
>  		err = -ENOMEM;
>  		goto rm_online;
>  	}
> +	event_attr_intel_cmt_llc_scale.event_str = str;

And yes, that string dup should happen with this patch and not detached in
the previous one.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization
  2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
@ 2016-11-10 23:09   ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 23:09 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:

> Events use monrs to monitor CMT events. Events that share their
> monitoring target (same thread or cgroup) share a monr.

That wants a proper explanation of the concept of monr before explaining
what is done with them.

> This patch introduces monrs and adds support for monr creation/destruction.

'This patch': We know already that this is a patch. See SubmittingPatches
 
> An event's associated monr is referenced by event->cmt_monr (introduced
> in previous patch).
> 
> monr->mon_events references the first event that uses that monr and events
> that share monr are appended to first event's cmt_list list head.
> 
> Hold all pkgd->mutex to modify monr->mon_events and event's data.

Please do not tell what the code does, tell why and explain the concepts.

> Support for CPU and cgroups is added in future patches in this series.

There is cgroups and CPU and whatever stuff already in this one.
 
> More details in code's comments.

Please add proper descriptions of what this is about instead of describing
what the code does. A changelog should give a reviewer a proper overview
of the functionality it introduces or changes. Sending a reviewer comment
hunting is not a replacement for that.

> +static void monr_dealloc(struct monr *monr)
> +{
> +	kfree(monr);
> +}

Again your ordering of functions is interesting. Please group them together
as they are used. If you need one of them at an earlier point for an error
cleanup or such, then please use forward declarations.

> +static struct monr *monr_alloc(void)
> +{
> +	struct monr *monr;
> +
> +	lockdep_assert_held(&cmt_mutex);
> +
> +	monr = kzalloc(sizeof(*monr), GFP_KERNEL);
> +	if (!monr)
> +		return ERR_PTR(-ENOMEM);
> +
> +	return monr;
> +}
> +
> +static inline struct monr *monr_from_event(struct perf_event *event)
> +{
> +	return (struct monr *) READ_ONCE(event->hw.cmt_monr);

That type cast is required because READ_ONCE already returns the type of the
thing you read from, right? So type casting makes more sure of that.

This lacks an explanation of why you need a read once at all, and it would be
more obvious to put the READ_ONCE explicitly into the code where it is
used and required.

> +}
> +
> +static struct monr *monr_remove_event(struct perf_event *event)
> +{
> +	struct monr *monr = monr_from_event(event);

So what is the actual point of the READ_ONCE here?

> +
> +	lockdep_assert_held(&cmt_mutex);
> +	monr_hrchy_assert_held_mutexes();
> +
> +	if (list_empty(&monr->mon_events->hw.cmt_list)) {
> +		monr->mon_events = NULL;
> +		/* remove from cmt_event_monrs */
> +		list_del_init(&monr->entry);
> +	} else {
> +		if (monr->mon_events == event)
> +			monr->mon_events = list_next_entry(event, hw.cmt_list);
> +		list_del_init(&event->hw.cmt_list);
> +	}
> +
> +	WRITE_ONCE(event->hw.cmt_monr, NULL);

And for the WRITE_ONCE here? I must be missing something.

event->hw.cmt_monr cannot change here under your feet. So why force the
compiler to read it once? The write once is equally pointless. It does not
matter when it is set to NULL.

READ_ONCE/WRITE_ONCE stand out in the code and make the reader alert, so he
starts to look for something which might change under his feet. But here
there is nothing. Annoying.

> +
> +	return monr;
> +}
> +
> +static int monr_append_event(struct monr *monr, struct perf_event *event)
> +{
> +	lockdep_assert_held(&cmt_mutex);
> +	monr_hrchy_assert_held_mutexes();

Again. I really appreciate defensive programming, but with all these
repeated asserts in all these callchains, is that code actually making
progress when you have lockdep enabled?

> +	if (monr->mon_events) {
> +		list_add_tail(&event->hw.cmt_list,
> +			      &monr->mon_events->hw.cmt_list);
> +	} else {
> +		monr->mon_events = event;
> +		list_add_tail(&monr->entry, &cmt_event_monrs);
> +	}
> +
> +	WRITE_ONCE(event->hw.cmt_monr, monr);

See above.

> +
> +	return 0;
> +}
> +
> +static bool is_cgroup_event(struct perf_event *event)
> +{
> +	return false;
> +}
> +
> +static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
> +{
> +	return -EPERM;
> +}
> +
> +static int monr_hrchy_attach_cpu_event(struct perf_event *event)
> +{
> +	return -EPERM;
> +}

So why is all this here despite the changelog claiming otherwise?

> +
> +static int monr_hrchy_attach_task_event(struct perf_event *event)
> +{
> +	struct monr *monr;
> +	int err;
> +
> +	monr = monr_alloc();
> +	if (IS_ERR(monr))
> +		return -ENOMEM;
> +
> +	err = monr_append_event(monr, event);
> +	if (err)
> +		monr_dealloc(monr);
> +	return err;
> +}
> +
> +/* Insert or create monr in appropriate position in hierarchy. */
> +static int monr_hrchy_attach_event(struct perf_event *event)
> +{
> +	int err = 0;

This initialization is pointless.

> +
> +	lockdep_assert_held(&cmt_mutex);
> +	monr_hrchy_acquire_mutexes();
> +
> +	if (!is_cgroup_event(event) &&
> +	    !(event->attach_state & PERF_ATTACH_TASK)) {

So while you have a faible for wrapping things which should be in the code,
here and elsewhere you have these written-out checks over and over.

> +		err = monr_hrchy_attach_cpu_event(event);
> +		goto exit;
> +	}
> +	if (is_cgroup_event(event)) {
> +		err = monr_hrchy_attach_cgroup_event(event);
> +		goto exit;
> +	}
> +	err = monr_hrchy_attach_task_event(event);

Can we please avoid these gotos?

	if (is_cpu_event(event))
		err = attach_cpu(event);
	else if (is_cgroup_event(event))
		err = attach_cgroup(event);
	else
		err = attach_task(event);

would be compact and readable code without gotos. Hmm?

> +exit:
> +	monr_hrchy_release_mutexes();
> +
> +	return err;
> +}
> +
> +/**
> + * __match_event() - Determine if @a and @b should share a rmid.

If you use kernel doc comments, then you have to fill out the argument
descriptors as well.

> + */
> +static bool __match_event(struct perf_event *a, struct perf_event *b)

Why underscores? We use underscores when we have something like this:

__bla(...)
{
	do stuff ....
}

bla(...)
{
	lock();
	__bla(...);
	unlock();
}

But not for standalone functions which have no particular counterpart which
does extra work like serialization.

> +{
> +	/* Cgroup/non-task per-cpu and task events don't mix */
> +	if ((a->attach_state & PERF_ATTACH_TASK) !=
> +	    (b->attach_state & PERF_ATTACH_TASK))
> +		return false;

So this would become a simple one liner with a proper wrapper as well

   	if (is_task_event(a) != is_task_event(b))
		return false;

> +
> +#ifdef CONFIG_CGROUP_PERF
> +	if (a->cgrp != b->cgrp)
> +		return false;
> +#endif
> +
> +	/* If not task event, it's a a cgroup or a non-task cpu event. */
> +	if (!(b->attach_state & PERF_ATTACH_TASK))
> +		return true;
> +
> +	/* Events that target same task are placed into the same group. */
> +	if (a->hw.target == b->hw.target)
> +		return true;
> +
> +	/* Are we a inherited event? */
> +	if (b->parent == a)
> +		return true;
> +
> +	return false;
> +}
> +
>  static struct pmu intel_cmt_pmu;

The placement of this in the middle of unrelated code does what? It's
really annoying having to search for stuff at random places.

Please be more careful about placement.

>  
> +/* Try to find a monr with same target, otherwise create new one. */
> +static int mon_group_setup_event(struct perf_event *event)
> +{
> +	struct monr *monr;
> +	int err;
> +
> +	lockdep_assert_held(&cmt_mutex);
> +
> +	list_for_each_entry(monr, &cmt_event_monrs, entry) {
> +		if (!__match_event(monr->mon_events, event))
> +			continue;
> +		monr_hrchy_acquire_mutexes();
> +		err = monr_append_event(monr, event);
> +		monr_hrchy_release_mutexes();
> +		return err;
> +	}
> +	/*
> +	 * Since no match was found, create a new monr and set this
> +	 * event as head of a mon_group. All events in this group
> +	 * will share the monr.
> +	 */
> +	return monr_hrchy_attach_event(event);
> +}
> +
>  static void intel_cmt_event_read(struct perf_event *event)
>  {
>  }
> @@ -68,6 +272,20 @@ static int intel_cmt_event_add(struct perf_event *event, int mode)
>  	return 0;
>  }
>  
> +static void intel_cmt_event_destroy(struct perf_event *event)

So I was looking at the fully applied series and could not find this
function anymore because it was renamed at some random point. Can you
please avoid this for new code? If you rework something existing it might
be necessary, but for new code it's just a pointless exercise.

> +{
> +	struct monr *monr;
> +
> +	mutex_lock(&cmt_mutex);
> +	monr_hrchy_acquire_mutexes();
> +
> +	/* monr is dettached from event. */

Please comment things which are not obvious. I already know that the event
is destroyed and therefore the monr has to be detached. Comments should
explain why and not what. 'What' we can see from the code and if that has
well chosen names, then it's self explanatory. But the 'Why' of something
complex does not spring into the reader's brain right away. That's what
comments are for.

> +	monr = monr_remove_event(event);

Now, what's entirely non-obvious is the assignment of the return value with
no further check or processing. I know you do that in some follow-up patch,
but for now it's just adding a compiler warning.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-10 21:23   ` Thomas Gleixner
@ 2016-11-11  2:22     ` David Carrillo-Cisneros
  2016-11-11  7:21       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-11-11  2:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

>> To circumvent this problem, statically define CMT_MAX_NR_PKGS number of
>> lock_class_key's.
>
> That's a proper circumvention. Define a random number on your own.

A judiciously chosen number ;) .

>
>> Additional details in code's comments.
>
> This is
>
>  - pointless. Either there are comments or not. I can see that from
>    the patch.
>
>  - wrong. Because for this crucial detail of it (the lockdep
>    restriction) there is no single comment in the code.

Noted. I'll stop adding those notes to the changelogs.

>> +/* Increase as needed as Intel CPUs grow. */
>
> As CPUs grow? CMT is a per L3 cache domain, i.e. package, which is
> equivivalent to socket today..
>
>> +#define CMT_MAX_NR_PKGS              8
>
> This already breaks on these shiny new 16 socket machines companies are
> working on. Not to talk about the 4096 core monstrosities which are shipped
> today, which have definitely more than 16 sockets already.
>
> And now the obvious question: How are you going to fix this?

A possible alternative to nested spinlocks is to use a single rw_lock to
protect changes to the monr hierarchy.

This works because, when manipulating pmonrs, the monr hierarchy (monr's
parent and children) is read but never changed (reader path).
Changes to the monr hierarchy do change pmonrs (writer path).

It would be implemented like:

  // system-wide lock for monr hierarchy
  rwlock_t monr_hrchy_rwlock = __RW_LOCK_UNLOCKED(monr_hrchy_rwlock);


Reader Path: Access pmonrs in same pkgd. Used in pmonr state transitions
and when reading cgroups (read subtree of pmonrs in same package):

v3 (old):

  raw_spin_lock(&pkgd->lock);
  // read monr data, read/write pmonr data
  raw_spin_unlock(&pkgd->lock);

next version:

  read_lock(&monr_hrchy_rwlock);
  raw_spin_lock(&pkgd->lock);
  // read monr data, read/write pmonr data
  raw_spin_unlock(&pkgd->lock);
  read_unlock(&monr_hrchy_rwlock);


Writer Path: Modifies monr hierarchy (add/remove monr in
creation/destruction of events and start/stop monitoring
of cgroups):

v3 (old):

  monr_hrchy_acquire_locks(...); // acquires all pkg->lock
  // read/write monr data, potentially read/write pmonr data
  monr_hrchy_release_locks(...); // release all pkg->lock

v4:

  write_lock(&monr_hrchy_rwlock);
  // read/write monr data, potentially read/write pmonr data
  write_unlock(&monr_hrchy_rwlock);


The writer path should be taken infrequently. The reader path becomes
slower due to the additional read_lock(&monr_hrchy_rwlock). The number
of locks to acquire stays constant regardless of the number of packages
in both paths, and there would be no need to ever nest pkgd->locks.

For v3, I discarded this idea because, due to the additional
read_lock(&monr_hrchy_rwlock), the pmonr state transitions that may
occur in a context switch would be slower than in the current version.
But it may be ultimately better considering the scalability issues
that you mention.


>> + *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
>> + *  change the monr hierarchy.
>
> So if you hold both locks in 8 packages then you already have 16 locks held
> nested. Now double that with 16 packages and you are slowly approaching the
> maximum lock chain depth lockdep can cope with. Now think SGI and guess
> how that works.
>
> I certainly can understand why you want to split the locks, but my gut
> feeling tells me that while it's probably not a big issue on a 2/4 socket
> machines it will go down the drain rather fast on anything real big.
>
> Thanks,
>
>         tglx

Thanks for reviewing the patches. I really appreciate the feedback.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  2:22     ` David Carrillo-Cisneros
@ 2016-11-11  7:21       ` Peter Zijlstra
  2016-11-11  7:32         ` Ingo Molnar
                           ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-11-11  7:21 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Thu, Nov 10, 2016 at 06:22:17PM -0800, David Carrillo-Cisneros wrote:
> A possible alternative to nested spinlocks is to use a single rw_lock to
> protect changes to the monr hierarchy.
> 
> This works because, when manipulating pmonrs, the monr hierarchy (monr's
> parent and children) is read but never changed (reader path).
> Changes to the monr hierarchy do change pmonrs (writer path).
> 
> It would be implemented like:
> 
>   // system-wide lock for monr hierarchy
>   rwlock_t monr_hrchy_rwlock = __RW_LOCK_UNLOCKED(monr_hrchy_rwlock);
> 
> 
> Reader Path: Access pmonrs in same pkgd. Used in pmonr state transitions
> and when reading cgroups (read subtree of pmonrs in same package):
> 
> v3 (old):
> 
>   raw_spin_lock(&pkgd->lock);
>   // read monr data, read/write pmonr data
>   raw_spin_unlock(&pkgd->lock);
> 
> next version:
> 
>   read_lock(&monr_hrchy_rwlock);
>   raw_spin_lock(&pkgd->lock);
>   // read monr data, read/write pmonr data
>   raw_spin_unlock(&pkgd->lock);
>   read_unlock(&monr_hrchy_rwlock);
> 
> 
> Writer Path: Modifies monr hierarchy (add/remove monr in
> creation/destruction of events and start/stop monitoring
> of cgroups):
> 
> v3 (old):
> 
>   monr_hrchy_acquire_locks(...); // acquires all pkg->lock
>   // read/write monr data, potentially read/write pmonr data
>   monr_hrchy_release_locks(...); // release all pkg->lock
> 
> v4:
> 
>   write_lock(&monr_hrchy_rwlock);
>   // read/write monr data, potentially read/write pmonr data
>   write_unlock(&monr_hrchy_rwlock);
> 
> 
> The writer path should be taken infrequently. The reader path becomes
> slower due to the additional read_lock(&monr_hrchy_rwlock). The number
> of locks to acquire stays constant regardless of the number of packages
> in both paths, and there would be no need to ever nest pkgd->locks.
> 
> For v3, I discarded this idea because, due to the additional
> read_lock(&monr_hrchy_rwlock), the pmonr state transitions that may
> occur in a context switch would be slower than in the current version.
> But it may be ultimately better considering the scalability issues
> that you mention.

Well, it's very hard to suggest alternatives, because there simply isn't
anything of substance here. This patch just adds locks, and the next few
patches don't describe much either.

Why can't this be RCU?

Also, "monr" is a horribly 'word'.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  7:21       ` Peter Zijlstra
@ 2016-11-11  7:32         ` Ingo Molnar
  2016-11-11  9:41         ` Thomas Gleixner
  2016-11-15  4:53         ` David Carrillo-Cisneros
  2 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2016-11-11  7:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Carrillo-Cisneros, Thomas Gleixner, linux-kernel, x86,
	Ingo Molnar, Andi Kleen, Kan Liang, Vegard Nossum,
	Marcelo Tosatti, Nilay Vaish, Borislav Petkov, Vikas Shivappa,
	Ravi V Shankar, Fenghua Yu, Paul Turner, Stephane Eranian


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 10, 2016 at 06:22:17PM -0800, David Carrillo-Cisneros wrote:
> > A possible alternative to nested spinlocks is to use a single rw_lock to
> > protect changes to the monr hierarchy.
> > 
> > This works because, when manipulating pmonrs, the monr hierarchy (monr's
> > parent and children) is read but never changed (reader path).
> > Changes to the monr hierarchy do change pmonrs (writer path).
> > 
> > It would be implemented like:
> > 
> >   // system-wide lock for monr hierarchy
> >   rwlock_t monr_hrchy_rwlock = __RW_LOCK_UNLOCKED(monr_hrchy_rwlock);
> > 
> > 
> > Reader Path: Access pmonrs in same pkgd. Used in pmonr state transitions
> > and when reading cgroups (read subtree of pmonrs in same package):
> > 
> > v3 (old):
> > 
> >   raw_spin_lock(&pkgd->lock);
> >   // read monr data, read/write pmonr data
> >   raw_spin_unlock(&pkgd->lock);
> > 
> > next version:
> > 
> >   read_lock(&monr_hrchy_rwlock);
> >   raw_spin_lock(&pkgd->lock);
> >   // read monr data, read/write pmonr data
> >   raw_spin_unlock(&pkgd->lock);
> >   read_unlock(&monr_hrchy_rwlock);
> > 
> > 
> > Writer Path: Modifies monr hierarchy (add/remove monr in
> > creation/destruction of events and start/stop monitoring
> > of cgroups):
> > 
> > v3 (old):
> > 
> >   monr_hrchy_acquire_locks(...); // acquires all pkg->lock
> >   // read/write monr data, potentially read/write pmonr data
> >   monr_hrchy_release_locks(...); // release all pkg->lock
> > 
> > v4:
> > 
> >   write_lock(&monr_hrchy_rwlock);
> >   // read/write monr data, potentially read/write pmonr data
> >   write_unlock(&monr_hrchy_rwlock);
> > 
> > 
> > The writer path should be taken infrequently. The reader path becomes
> > slower due to the additional read_lock(&monr_hrchy_rwlock). The number
> > of locks to acquire stays constant regardless of the number of packages
> > in both paths, and there would be no need to ever nest pkgd->locks.
> > 
> > For v3, I discarded this idea because, due to the additional
> > read_lock(&monr_hrchy_rwlock), the pmonr state transitions that may
> > occur in a context switch would be slower than in the current version.
> > But it may be ultimately better considering the scalability issues
> > that you mention.
> 
> Well, it's very hard to suggest alternatives, because there simply isn't
> anything of substance here. This patch just adds locks, and the next few
> patches don't describe much either.
> 
> Why can't this be RCU?
> 
> Also, "monr" is a horribly 'word'.

And 'hrchy' is a hrbbl wrd as wll!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  7:21       ` Peter Zijlstra
  2016-11-11  7:32         ` Ingo Molnar
@ 2016-11-11  9:41         ` Thomas Gleixner
  2016-11-11 17:21           ` David Carrillo-Cisneros
  2016-11-15  4:53         ` David Carrillo-Cisneros
  2 siblings, 1 reply; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-11  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Carrillo-Cisneros, linux-kernel, x86, Ingo Molnar,
	Andi Kleen, Kan Liang, Vegard Nossum, Marcelo Tosatti,
	Nilay Vaish, Borislav Petkov, Vikas Shivappa, Ravi V Shankar,
	Fenghua Yu, Paul Turner, Stephane Eranian

On Fri, 11 Nov 2016, Peter Zijlstra wrote:
> Well, it's very hard to suggest alternatives, because there simply isn't
> anything of substance here. This patch just adds locks, and the next few
> patches don't describe much either.
> 
> Why can't this be RCU?

AFAICT from looking at later patches, the context switch path can do RMID
borrowing/stealing and similar things which need protection of the package
data. Other paths (setup/rotation/...) take all the package locks to
prevent concurrent RMID operations.

I still have not figured out why all of this is necessary, and unfortunately
there is no real coherent explanation of the overall design. The cover
letter is not really helpful either.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  9:41         ` Thomas Gleixner
@ 2016-11-11 17:21           ` David Carrillo-Cisneros
  2016-11-13 10:58             ` Thomas Gleixner
  0 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-11-11 17:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Fri, Nov 11, 2016 at 1:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 11 Nov 2016, Peter Zijlstra wrote:
>> Well, it's very hard to suggest alternatives, because there simply isn't
>> anything of substance here. This patch just adds locks, and the next few
>> patches don't describe much either.
>>
>> Why can't this be RCU?
>
> AFAICT from looking at later patches, the context switch path can do RMID
> borrowing/stealing and similar things which need protection of the package
> data. Other paths (setup/rotation/...) take all the package locks to
> prevent concurrent RMID operations.

That's right. When a pmonr makes a state transition during a context
switch, it needs to modify itself and potentially other pmonrs
because:
  - a pmonr in DEP_IDLE or DEP_DIRTY state will update its sched_rmid
if its lowest ancestor in the Active state changes. These pmonrs
"borrow" the rmid.
  - a pmonr entering the Active state must collect references to all
pmonrs that "borrow" an rmid from it.

All these transitions only affect data in pmonrs of a given package
and, for that, they are protected by pkgd->lock. RCU here is
complicated because many pmonrs must be changed atomically.
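
To make the atomicity point concrete, here is a rough sketch of an
Active transition (simplified; the field and function names are
assumptions, not the actual code in the series):

  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct pmonr {
          u32			sched_rmid;	/* rmid used at sched in */
          struct list_head	pmonr_deps;	/* pmonrs borrowing this rmid */
          struct list_head	dep_entry;
  };

  struct pkg_data {
          raw_spinlock_t		lock;
  };

  /*
   * All borrowers' sched_rmid must be repointed in one critical section
   * under pkgd->lock, so a reader on this package never sees a
   * half-updated set of pmonrs.
   */
  static void pmonr_to_active(struct pkg_data *pkgd,
                              struct pmonr *pmonr, u32 rmid)
  {
          struct pmonr *dep;

          lockdep_assert_held(&pkgd->lock);

          pmonr->sched_rmid = rmid;
          list_for_each_entry(dep, &pmonr->pmonr_deps, dep_entry)
                  dep->sched_rmid = rmid;
  }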

During a pmonr state transition, monrs are not changed, only read,
because they embed the hierarchical relationship that pmonrs use to
find ancestors and descendants. For this reason, v3 of the series
requires acquiring pkgd->lock in all packages prior to any change to
the monrs' hierarchical relationship.

The alternative proposes to protect pmonr changes (which read the monr
hierarchy) with read_lock(&monr_hrchy_rwlock) and pkgd->lock, and
changes to the monr hierarchy with write_lock(&monr_hrchy_rwlock).

> I still have not figured out why all of this is necessary, and unfortunately
> there is no real coherent explanation of the overall design. The cover
> letter is not really helpful either.

Note taken. I'll work on that.

>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11 17:21           ` David Carrillo-Cisneros
@ 2016-11-13 10:58             ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-13 10:58 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Peter Zijlstra, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Fri, 11 Nov 2016, David Carrillo-Cisneros wrote:
> On Fri, Nov 11, 2016 at 1:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > I still have not figured out why all of this is necessary, and unfortunately
> > there is no real coherent explanation of the overall design. The cover
> > letter is not really helpful either.
> 
> Note taken. I'll work on that.

The first thing which would be helpful is a proper explanation of what you
want to achieve and why, w/o going into design and implementation details.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  7:21       ` Peter Zijlstra
  2016-11-11  7:32         ` Ingo Molnar
  2016-11-11  9:41         ` Thomas Gleixner
@ 2016-11-15  4:53         ` David Carrillo-Cisneros
  2016-11-16 19:00           ` Thomas Gleixner
  2 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-11-15  4:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

> Also, "monr" is a horribly 'word'.

What makes it so bad? (honest question). Some alternatives:

- res_mon, resm, rmon (Resource Monitor)
- rmnode, rnode, rmon_node (Resource Monitoring node, similar to
Resource Monitor ID, but to reflect that it's a node in a
tree/hierarchy)
 - rdt_mon, rdtm (something with RDT + Monitoring)
 - ment, rdt_ment (Monitoring Entity)

Other suggestions?

Thanks,
David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-15  4:53         ` David Carrillo-Cisneros
@ 2016-11-16 19:00           ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-16 19:00 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Peter Zijlstra, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Mon, 14 Nov 2016, David Carrillo-Cisneros wrote:

> > Also, "monr" is a horribly 'word'.
> 
> What makes it so bad? (honest question). Some alternatives:
> 
> - res_mon, resm, rmon (Resource Monitor)
> - rmnode, rnode, rmon_node (Resource Monitoring node, similar to
> Resource Monitor ID, but to reflect that it's a node in a
> tree/hierarchy)
>  - rdt_mon, rdtm (something with RDT + Monitoring)
>  - ment, rdt_ment (Monitoring Entity)
> 
> Other suggestions?

The naming is the least of my worries right now. Before you start to rework
the series, can we please get the information about:

    - what you want to achieve and why

    - the design of your approach

so we can avoid staring at another series of 40+ patches just to figure out
that something is wrong at the conceptual level?

We sort out the naming convention once we are done with the above.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-11-16 19:03 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
2016-10-30  0:37 ` [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
2016-11-10 15:19   ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
2016-11-10 21:23   ` Thomas Gleixner
2016-11-11  2:22     ` David Carrillo-Cisneros
2016-11-11  7:21       ` Peter Zijlstra
2016-11-11  7:32         ` Ingo Molnar
2016-11-11  9:41         ` Thomas Gleixner
2016-11-11 17:21           ` David Carrillo-Cisneros
2016-11-13 10:58             ` Thomas Gleixner
2016-11-15  4:53         ` David Carrillo-Cisneros
2016-11-16 19:00           ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
2016-11-10 21:27   ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
2016-11-10 23:09   ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 20/46] perf/core: add pmu::event_terminate David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline} David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 45/46] perf/stat: fix bug in handling events in error state David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events David Carrillo-Cisneros
