* [PATCH v3 00/46] Cache Monitoring Technology (aka CQM)
@ 2016-10-30  0:37 David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
                   ` (45 more replies)
  0 siblings, 46 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This series introduces the next iteration of kernel support for
Cache Monitoring Technology, or CMT (formerly Cache QoS Monitoring, CQM),
available in Intel Xeon processors.

Intel's documentation has replaced the name Intel CQM with Intel CMT.
This version renames all code to CMT accordingly.

It is rebased on tip x86/core to continue on top of
Fenghua Yu's Intel CAT series (partially merged).

One of the main limitations of the previous version is the inability
to simultaneously monitor:
  1) An llc_occupancy CPU event and a cgroup or task llc_occupancy
  event on that CPU.
  2) cgroup events for cgroups in the same descendancy line.
  3) cgroup events and any task event whose thread runs in a cgroup
  in the same descendancy line.

Another limitation is that monitoring for a cgroup was enabled/disabled by
the existence of a perf event for that cgroup. Since the llc_occupancy
event measures changes in occupancy rather than total occupancy, an event
must stay enabled for a long enough period of time in order to read
meaningful llc_occupancy values. The context switch overhead caused by
the perf events is undesirable in some sensitive scenarios.

This series of patches addresses the shortcomings mentioned above and
adds some other improvements. The main changes are:
	- No more potential conflicts between different events. The new
	version builds a hierarchy of RMIDs that captures the dependencies
	between monitored cgroups. llc_occupancy for a cgroup is the sum of
	llc_occupancies for that cgroup's RMID and all other RMIDs in the
	cgroup's subtree (both monitored cgroups and threads).

	- A cgroup integration that allows monitoring of a cgroup to start
	without creating a perf event, decreasing the context switch
	overhead. Monitoring is controlled by a semicolon-separated list
	of flags passed to a perf cgroup attribute, e.g.:

		echo "1;3;0;1" > cgroup_path/perf_event.cmt_monitoring

	CPU packages 0, 1 and 3 have flags > 0, which marks those packages
	to be monitored using RMIDs even if no perf_event is attached
	to the cgroup. The meaning of the other flag values is explained
	in their own patches.
	
	A perf_event is always required in order to read llc_occupancy.
	This cgroup integration uses Intel's PQR code and is intended to
	share code with the upcoming Intel CAT driver.
	
	- A more stable rotation algorithm: the new algorithm explicitly
	defines SLOs to guarantee that RMIDs are assigned and kept
	long enough to produce meaningful occupancy values.

	- Reduced impact of stealing/rotation of RMIDs: the new algorithm
	tries to assign dirty RMIDs to their previous owners when
	suitable, decreasing the error introduced by RMID rotation and
	the negative impact of dirty RMIDs that drop occupancy too slowly
	when unscheduled.

	- Eliminate pmu::count: perf's generic perf_event_count()
	performs a quick add of atomic types. The introduction of
	pmu::count in the previous CMT series to read occupancy for thread
	events changed the behavior of perf_event_count() by performing a
	potentially slow IPI and MSR write/read. It also made pmu::read
	behave differently depending on whether the event was a cpu/cgroup
	event or a thread event. This patch series removes the custom
	pmu::count from CMT and provides consistent behavior for all
	calls of perf_event_read (see the first sketch after this list).

	- Add error return for pmu::read: reads of CQM events may fail
	due to stealing of RMIDs, even after successfully adding an event
	to a PMU. This patch series expands pmu::read with an int return
	value and propagates the error to callers that can fail
	(i.e. perf_read); see the second sketch after this list.
	The ability of pmu::read to fail is consistent with the recent
	changes that allow perf_event_read to fail for transactional
	reading of event groups.

	- Introduce additional flags in perf_event::group_caps and
	perf_event::event_caps: the flags PERF_EV_CAP_READ_ANY_{,CPU_}PKG
	allow reading CMT events while an event is inactive, saving
	unnecessary IPIs. The flag PERF_EV_CAP_CGROUP_NO_RECURSION prevents
	generic code from programming multiple CMT events on a CPU when
	dealing with a cgroup hierarchy, since this is unsupported by the
	hardware (see the last sketch below).
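
As referenced in the pmu::count item above, the fast path this series
restores is, roughly, the plain atomic sum below. This is a paraphrase of
the generic helpers around this kernel version, trimmed for illustration
rather than quoted from the tree: with the CMT pmu::count callback gone,
perf_event_count() no longer has a per-PMU detour that may cost an IPI
plus an MSR access.

	/* Sketch, not an exact quote: the cheap generic path. */
	static u64 __perf_event_count(struct perf_event *event)
	{
		return local64_read(&event->count) +
		       atomic64_read(&event->child_count);
	}

	static u64 perf_event_count(struct perf_event *event)
	{
		/*
		 * Previously: if (event->pmu->count)
		 *                     return event->pmu->count(event);
		 * which, for CMT thread events, meant an IPI plus an MSR
		 * read instead of the quick add in __perf_event_count().
		 */
		return __perf_event_count(event);
	}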

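The pmu::read change can be pictured as follows. This is only a sketch of
the new contract: the callback returns 0 or a negative errno, and callers
that are allowed to fail (such as perf_read) pass the error on. The
helpers cmt_rmid_valid() and cmt_read_occupancy(), as well as the caller
read_one_event(), are illustrative placeholders, not functions from the
series.

	/* Driver side (sketch): report a stolen/rotated RMID as an error. */
	static int intel_cmt_event_read(struct perf_event *event)
	{
		u64 occupancy;

		if (!cmt_rmid_valid(event))		/* placeholder helper */
			return -ENODATA;		/* RMID was stolen/rotated */

		occupancy = cmt_read_occupancy(event);	/* placeholder MSR read */
		local64_set(&event->count, occupancy);
		return 0;
	}

	/* Caller side (sketch): pmu->read() used to return void; now the
	 * error can reach user space through perf_read and friends. */
	static int read_one_event(struct perf_event *event)
	{
		return event->pmu->read(event);
	}
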
This patch series also updates the perf tool to fix error handling and to
better handle the idiosyncrasies of snapshot and per-pkg events.
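
Finally, the new capability flags from the change list above can be
pictured as extra bits next to the existing PERF_EV_CAP_SOFTWARE in
perf_event::event_caps and perf_event::group_caps. The bit positions
below are placeholders chosen for illustration, not values taken from
the patches; the comments restate the cover-letter semantics.

	#define PERF_EV_CAP_SOFTWARE		BIT(0)	/* pre-existing capability bit */
	/* New bits, positions illustrative only: */
	#define PERF_EV_CAP_READ_ANY_PKG	BIT(1)	/* read allowed while inactive, avoiding IPIs */
	#define PERF_EV_CAP_READ_ANY_CPU_PKG	BIT(2)	/* per-package variant of the above */
	#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(3)	/* don't program multiple CMT events per CPU
							 * for a cgroup hierarchy (unsupported by hw) */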

Support for Intel MBM is yet to be built on top of this driver.

Changes in 3rd version:
  - Rename from CQM to CMT, making it consistent with Intel's latest docs.
  - Plenty of fixes requested by Thomas G. Mainly:
      - Redesign of pmonr state machine.
      - Avoid abuse of lock_nested() by defining static lock_class_key's.
      - Simplify locking rules.
      - Remove unnecessary macros, inlines and wrappers.
      - Remove reliance on WARN_ONs for error handling.
      - Use kzalloc.
      - Fix comments and line breaks.
      - Add high level overview in comments of cmt header file.
      - Cleaner device initialization/termination. Still not modular,
	I am holding that change until the integration with perf cgroup
	is discussed (currently it is through architecture-specific hooks,
	see patch 36).
  - Clean up and simplify RMID rotation code.
  - Add user specific flags (uflags) for both events and
    perf_cgroup.cmt_monitoring to allow No Rotation and No Lazy Allocation
    of rmids.
  - Use CPU Hotplug state machine.
  - No longer need the new hook perf_event_exec to start monitoring after
    an exec (hook introduced in v1, removed in this one).
  - Remove polling of llc_occupancy for active rmids. Replaced by
    asynchronous read (see patch 30).
  - Change rmid pools to bitmaps, thus removing "wrapped rmid" (wrmid).
  - Removal of per-package pools of wrmids used as temporary objects.
  - Added a very useful debugfs node to observe internals such as:
	- monr hierarchy.
	- per-package data.
	- llc_occupancy of rmids.
  - Reduction of code size to 66% of v2 (now 641 KBs).
  - Rebased to tip x86/cache.

Changes in 2nd version:
  - As requested by Peter Z., redo commit history to completely remove
    old version of CQM in a single patch.
  - Use topology_max_packages and fix build errors reported by
    Vikas Shivappa.
  - Split largest patches, clean up.
  - Rebased to peterz/queue perf/core.


David Carrillo-Cisneros (45):
  perf/x86/intel/cqm: remove previous version of CQM and MBM
  perf/x86/intel: rename CQM cpufeatures to CMT
  x86/intel: add CONFIG_INTEL_RDT_M configuration flag
  perf/x86/intel/cmt: add device initialization and CPU hotplug support
  perf/x86/intel/cmt: add per-package locks
  perf/x86/intel/cmt: add intel_cmt pmu
  perf/core: add RDT Monitoring attributes to struct hw_perf_event
  perf/x86/intel/cmt: add MONitored Resource (monr) initialization
  perf/x86/intel/cmt: add basic monr hierarchy
  perf/x86/intel/cmt: add Package MONitored Resource (pmonr)
    initialization
  perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr
  perf/x86/intel/cmt: add per-package rmid pools
  perf/x86/intel/cmt: add pmonr's Off and Unused states
  perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states
  perf/x86/intel: encapsulate rmid and closid updates in pqr cache
  perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del
  perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID
  perf/core: add arch_info field to struct perf_cgroup
  perf/x86/intel/cmt: add support for cgroup events
  perf/core: add pmu::event_terminate
  perf/x86/intel/cmt: use newly introduced event_terminate
  perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop
  perf/core: hooks to add architecture specific features in perf_cgroup
  perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline}
  perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE
  sched: introduce the finish_arch_pre_lock_switch() scheduler hook
  perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch
  perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to
    pmu::read
  perf/x86/intel/cmt: add error handling to intel_cmt_event_read
  perf/x86/intel/cmt: add asynchronous read for task events
  perf/x86/intel/cmt: add subtree read for cgroup events
  perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags
  perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt
  perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION
  perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt
  perf/core: add perf_event cgroup hooks for subsystem attributes
  perf/x86/intel/cmt: add cont_monitoring to perf cgroup
  perf/x86/intel/cmt: introduce read SLOs for rotation
  perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute
  perf/x86/intel/cmt: add rotation scheduled work
  perf/x86/intel/cmt: add rotation minimum progress SLO
  perf/x86/intel/cmt: add rmid stealing
  perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag
  perf/x86/intel/cmt: add debugfs intel_cmt directory
  perf/stat: revamp read error handling, snapshot and per_pkg events

Stephane Eranian (1):
  perf/stat: fix bug in handling events in error state

 arch/alpha/kernel/perf_event.c           |    3 +-
 arch/arc/kernel/perf_event.c             |    3 +-
 arch/arm64/include/asm/hw_breakpoint.h   |    2 +-
 arch/arm64/kernel/hw_breakpoint.c        |    3 +-
 arch/metag/kernel/perf/perf_event.c      |    5 +-
 arch/mips/kernel/perf_event_mipsxx.c     |    3 +-
 arch/powerpc/include/asm/hw_breakpoint.h |    2 +-
 arch/powerpc/kernel/hw_breakpoint.c      |    3 +-
 arch/powerpc/perf/core-book3s.c          |   11 +-
 arch/powerpc/perf/core-fsl-emb.c         |    5 +-
 arch/powerpc/perf/hv-24x7.c              |    5 +-
 arch/powerpc/perf/hv-gpci.c              |    3 +-
 arch/s390/kernel/perf_cpum_cf.c          |    5 +-
 arch/s390/kernel/perf_cpum_sf.c          |    3 +-
 arch/sh/include/asm/hw_breakpoint.h      |    2 +-
 arch/sh/kernel/hw_breakpoint.c           |    3 +-
 arch/sparc/kernel/perf_event.c           |    2 +-
 arch/tile/kernel/perf_event.c            |    3 +-
 arch/x86/Kconfig                         |   12 +
 arch/x86/events/amd/ibs.c                |    2 +-
 arch/x86/events/amd/iommu.c              |    5 +-
 arch/x86/events/amd/uncore.c             |    3 +-
 arch/x86/events/core.c                   |    3 +-
 arch/x86/events/intel/Makefile           |    3 +-
 arch/x86/events/intel/bts.c              |    3 +-
 arch/x86/events/intel/cmt.c              | 3498 ++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h              |  344 +++
 arch/x86/events/intel/cqm.c              | 1766 ---------------
 arch/x86/events/intel/cstate.c           |    3 +-
 arch/x86/events/intel/pt.c               |    3 +-
 arch/x86/events/intel/rapl.c             |    3 +-
 arch/x86/events/intel/uncore.c           |    3 +-
 arch/x86/events/intel/uncore.h           |    2 +-
 arch/x86/events/msr.c                    |    3 +-
 arch/x86/include/asm/cpufeatures.h       |   14 +-
 arch/x86/include/asm/hw_breakpoint.h     |    2 +-
 arch/x86/include/asm/intel_rdt_common.h  |   62 +-
 arch/x86/include/asm/perf_event.h        |   29 +
 arch/x86/include/asm/processor.h         |    4 +
 arch/x86/kernel/cpu/Makefile             |    3 +-
 arch/x86/kernel/cpu/common.c             |   10 +-
 arch/x86/kernel/cpu/intel_rdt_common.c   |   37 +
 arch/x86/kernel/hw_breakpoint.c          |    3 +-
 arch/x86/kvm/pmu.h                       |   10 +-
 drivers/bus/arm-cci.c                    |    3 +-
 drivers/bus/arm-ccn.c                    |    3 +-
 drivers/perf/arm_pmu.c                   |    3 +-
 include/linux/cpuhotplug.h               |    4 +-
 include/linux/perf_event.h               |   70 +-
 kernel/events/core.c                     |  177 +-
 kernel/sched/core.c                      |    1 +
 kernel/sched/sched.h                     |    3 +
 kernel/trace/bpf_trace.c                 |    4 +-
 tools/perf/builtin-stat.c                |   42 +-
 tools/perf/util/counts.h                 |   19 +
 tools/perf/util/evsel.c                  |   49 +-
 tools/perf/util/evsel.h                  |    8 +-
 tools/perf/util/stat.c                   |   35 +-
 58 files changed, 4361 insertions(+), 1956 deletions(-)
 create mode 100644 arch/x86/events/intel/cmt.c
 create mode 100644 arch/x86/events/intel/cmt.h
 delete mode 100644 arch/x86/events/intel/cqm.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
@ 2016-10-30  0:37 ` David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT David Carrillo-Cisneros
                   ` (44 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Completely remove the previous version of the CQM + MBM driver to ease
review of the new version (this patch series).

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/Makefile |    2 +-
 arch/x86/events/intel/cqm.c    | 1766 ----------------------------------------
 include/linux/cpuhotplug.h     |    2 -
 include/linux/perf_event.h     |   14 -
 kernel/events/core.c           |   10 -
 kernel/trace/bpf_trace.c       |    4 +-
 6 files changed, 3 insertions(+), 1795 deletions(-)
 delete mode 100644 arch/x86/events/intel/cqm.c

diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 06c2baa..e9d8520 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,4 @@
-obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= lbr.o p4.o p6.o pt.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)	+= intel-rapl-perf.o
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
deleted file mode 100644
index 0c45cc8..0000000
--- a/arch/x86/events/intel/cqm.c
+++ /dev/null
@@ -1,1766 +0,0 @@
-/*
- * Intel Cache Quality-of-Service Monitoring (CQM) support.
- *
- * Based very, very heavily on work by Peter Zijlstra.
- */
-
-#include <linux/perf_event.h>
-#include <linux/slab.h>
-#include <asm/cpu_device_id.h>
-#include <asm/intel_rdt_common.h>
-#include "../perf_event.h"
-
-#define MSR_IA32_QM_CTR		0x0c8e
-#define MSR_IA32_QM_EVTSEL	0x0c8d
-
-#define MBM_CNTR_WIDTH		24
-/*
- * Guaranteed time in ms as per SDM where MBM counters will not overflow.
- */
-#define MBM_CTR_OVERFLOW_TIME	1000
-
-static u32 cqm_max_rmid = -1;
-static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
-
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-static struct hrtimer *mbm_timers;
-/**
- * struct sample - mbm event's (local or total) data
- * @total_bytes    #bytes since we began monitoring
- * @prev_msr       previous value of MSR
- */
-struct sample {
-	u64	total_bytes;
-	u64	prev_msr;
-};
-
-/*
- * samples profiled for total memory bandwidth type events
- */
-static struct sample *mbm_total;
-/*
- * samples profiled for local memory bandwidth type events
- */
-static struct sample *mbm_local;
-
-#define pkg_id	topology_physical_package_id(smp_processor_id())
-/*
- * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
- * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
- * rmids per socket, an example is given below
- * RMID1 of Socket0:  vrmid =  1
- * RMID1 of Socket1:  vrmid =  1 * (cqm_max_rmid + 1) + 1
- * RMID1 of Socket2:  vrmid =  2 * (cqm_max_rmid + 1) + 1
- */
-#define rmid_2_index(rmid)  ((pkg_id * (cqm_max_rmid + 1)) + rmid)
-/*
- * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
- * Also protects event->hw.cqm_rmid
- *
- * Hold either for stability, both for modification of ->hw.cqm_rmid.
- */
-static DEFINE_MUTEX(cache_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
-
-/*
- * Groups of events that have the same target(s), one RMID per group.
- */
-static LIST_HEAD(cache_groups);
-
-/*
- * Mask of CPUs for reading CQM values. We only need one per-socket.
- */
-static cpumask_t cqm_cpumask;
-
-#define RMID_VAL_ERROR		(1ULL << 63)
-#define RMID_VAL_UNAVAIL	(1ULL << 62)
-
-/*
- * Event IDs are used to program IA32_QM_EVTSEL before reading event
- * counter from IA32_QM_CTR
- */
-#define QOS_L3_OCCUP_EVENT_ID	0x01
-#define QOS_MBM_TOTAL_EVENT_ID	0x02
-#define QOS_MBM_LOCAL_EVENT_ID	0x03
-
-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
-
-#define INVALID_RMID		(-1)
-
-/*
- * Is @rmid valid for programming the hardware?
- *
- * rmid 0 is reserved by the hardware for all non-monitored tasks, which
- * means that we should never come across an rmid with that value.
- * Likewise, an rmid value of -1 is used to indicate "no rmid currently
- * assigned" and is used as part of the rotation code.
- */
-static inline bool __rmid_valid(u32 rmid)
-{
-	if (!rmid || rmid == INVALID_RMID)
-		return false;
-
-	return true;
-}
-
-static u64 __rmid_read(u32 rmid)
-{
-	u64 val;
-
-	/*
-	 * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
-	 * it just says that to increase confusion.
-	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
-	rdmsrl(MSR_IA32_QM_CTR, val);
-
-	/*
-	 * Aside from the ERROR and UNAVAIL bits, assume this thing returns
-	 * the number of cachelines tagged with @rmid.
-	 */
-	return val;
-}
-
-enum rmid_recycle_state {
-	RMID_YOUNG = 0,
-	RMID_AVAILABLE,
-	RMID_DIRTY,
-};
-
-struct cqm_rmid_entry {
-	u32 rmid;
-	enum rmid_recycle_state state;
-	struct list_head list;
-	unsigned long queue_time;
-};
-
-/*
- * cqm_rmid_free_lru - A least recently used list of RMIDs.
- *
- * Oldest entry at the head, newest (most recently used) entry at the
- * tail. This list is never traversed, it's only used to keep track of
- * the lru order. That is, we only pick entries of the head or insert
- * them on the tail.
- *
- * All entries on the list are 'free', and their RMIDs are not currently
- * in use. To mark an RMID as in use, remove its entry from the lru
- * list.
- *
- *
- * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
- *
- * This list is contains RMIDs that no one is currently using but that
- * may have a non-zero occupancy value associated with them. The
- * rotation worker moves RMIDs from the limbo list to the free list once
- * the occupancy value drops below __intel_cqm_threshold.
- *
- * Both lists are protected by cache_mutex.
- */
-static LIST_HEAD(cqm_rmid_free_lru);
-static LIST_HEAD(cqm_rmid_limbo_lru);
-
-/*
- * We use a simple array of pointers so that we can lookup a struct
- * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
- * and __put_rmid() from having to worry about dealing with struct
- * cqm_rmid_entry - they just deal with rmids, i.e. integers.
- *
- * Once this array is initialized it is read-only. No locks are required
- * to access it.
- *
- * All entries for all RMIDs can be looked up in the this array at all
- * times.
- */
-static struct cqm_rmid_entry **cqm_rmid_ptrs;
-
-static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
-{
-	struct cqm_rmid_entry *entry;
-
-	entry = cqm_rmid_ptrs[rmid];
-	WARN_ON(entry->rmid != rmid);
-
-	return entry;
-}
-
-/*
- * Returns < 0 on fail.
- *
- * We expect to be called with cache_mutex held.
- */
-static u32 __get_rmid(void)
-{
-	struct cqm_rmid_entry *entry;
-
-	lockdep_assert_held(&cache_mutex);
-
-	if (list_empty(&cqm_rmid_free_lru))
-		return INVALID_RMID;
-
-	entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
-	list_del(&entry->list);
-
-	return entry->rmid;
-}
-
-static void __put_rmid(u32 rmid)
-{
-	struct cqm_rmid_entry *entry;
-
-	lockdep_assert_held(&cache_mutex);
-
-	WARN_ON(!__rmid_valid(rmid));
-	entry = __rmid_entry(rmid);
-
-	entry->queue_time = jiffies;
-	entry->state = RMID_YOUNG;
-
-	list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
-}
-
-static void cqm_cleanup(void)
-{
-	int i;
-
-	if (!cqm_rmid_ptrs)
-		return;
-
-	for (i = 0; i < cqm_max_rmid; i++)
-		kfree(cqm_rmid_ptrs[i]);
-
-	kfree(cqm_rmid_ptrs);
-	cqm_rmid_ptrs = NULL;
-	cqm_enabled = false;
-}
-
-static int intel_cqm_setup_rmid_cache(void)
-{
-	struct cqm_rmid_entry *entry;
-	unsigned int nr_rmids;
-	int r = 0;
-
-	nr_rmids = cqm_max_rmid + 1;
-	cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
-				nr_rmids, GFP_KERNEL);
-	if (!cqm_rmid_ptrs)
-		return -ENOMEM;
-
-	for (; r <= cqm_max_rmid; r++) {
-		struct cqm_rmid_entry *entry;
-
-		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
-		if (!entry)
-			goto fail;
-
-		INIT_LIST_HEAD(&entry->list);
-		entry->rmid = r;
-		cqm_rmid_ptrs[r] = entry;
-
-		list_add_tail(&entry->list, &cqm_rmid_free_lru);
-	}
-
-	/*
-	 * RMID 0 is special and is always allocated. It's used for all
-	 * tasks that are not monitored.
-	 */
-	entry = __rmid_entry(0);
-	list_del(&entry->list);
-
-	mutex_lock(&cache_mutex);
-	intel_cqm_rotation_rmid = __get_rmid();
-	mutex_unlock(&cache_mutex);
-
-	return 0;
-
-fail:
-	cqm_cleanup();
-	return -ENOMEM;
-}
-
-/*
- * Determine if @a and @b measure the same set of tasks.
- *
- * If @a and @b measure the same set of tasks then we want to share a
- * single RMID.
- */
-static bool __match_event(struct perf_event *a, struct perf_event *b)
-{
-	/* Per-cpu and task events don't mix */
-	if ((a->attach_state & PERF_ATTACH_TASK) !=
-	    (b->attach_state & PERF_ATTACH_TASK))
-		return false;
-
-#ifdef CONFIG_CGROUP_PERF
-	if (a->cgrp != b->cgrp)
-		return false;
-#endif
-
-	/* If not task event, we're machine wide */
-	if (!(b->attach_state & PERF_ATTACH_TASK))
-		return true;
-
-	/*
-	 * Events that target same task are placed into the same cache group.
-	 * Mark it as a multi event group, so that we update ->count
-	 * for every event rather than just the group leader later.
-	 */
-	if (a->hw.target == b->hw.target) {
-		b->hw.is_group_event = true;
-		return true;
-	}
-
-	/*
-	 * Are we an inherited event?
-	 */
-	if (b->parent == a)
-		return true;
-
-	return false;
-}
-
-#ifdef CONFIG_CGROUP_PERF
-static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
-{
-	if (event->attach_state & PERF_ATTACH_TASK)
-		return perf_cgroup_from_task(event->hw.target, event->ctx);
-
-	return event->cgrp;
-}
-#endif
-
-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- *		   PROHIBITS
- *     system-wide    -> 	cgroup and task
- *     cgroup 	      ->	system-wide
- *     		      ->	task in cgroup
- *     task 	      -> 	system-wide
- *     		      ->	task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
-	/*
-	 * We can have any number of cgroups but only one system-wide
-	 * event at a time.
-	 */
-	if (a->cgrp && b->cgrp) {
-		struct perf_cgroup *ac = a->cgrp;
-		struct perf_cgroup *bc = b->cgrp;
-
-		/*
-		 * This condition should have been caught in
-		 * __match_event() and we should be sharing an RMID.
-		 */
-		WARN_ON_ONCE(ac == bc);
-
-		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-			return true;
-
-		return false;
-	}
-
-	if (a->cgrp || b->cgrp) {
-		struct perf_cgroup *ac, *bc;
-
-		/*
-		 * cgroup and system-wide events are mutually exclusive
-		 */
-		if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
-		    (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
-			return true;
-
-		/*
-		 * Ensure neither event is part of the other's cgroup
-		 */
-		ac = event_to_cgroup(a);
-		bc = event_to_cgroup(b);
-		if (ac == bc)
-			return true;
-
-		/*
-		 * Must have cgroup and non-intersecting task events.
-		 */
-		if (!ac || !bc)
-			return false;
-
-		/*
-		 * We have cgroup and task events, and the task belongs
-		 * to a cgroup. Check for for overlap.
-		 */
-		if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-		    cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-			return true;
-
-		return false;
-	}
-#endif
-	/*
-	 * If one of them is not a task, same story as above with cgroups.
-	 */
-	if (!(a->attach_state & PERF_ATTACH_TASK) ||
-	    !(b->attach_state & PERF_ATTACH_TASK))
-		return true;
-
-	/*
-	 * Must be non-overlapping.
-	 */
-	return false;
-}
-
-struct rmid_read {
-	u32 rmid;
-	u32 evt_type;
-	atomic64_t value;
-};
-
-static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
-static void __intel_mbm_event_count(void *info);
-
-static bool is_cqm_event(int e)
-{
-	return (e == QOS_L3_OCCUP_EVENT_ID);
-}
-
-static bool is_mbm_event(int e)
-{
-	return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
-}
-
-static void cqm_mask_call(struct rmid_read *rr)
-{
-	if (is_mbm_event(rr->evt_type))
-		on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, rr, 1);
-	else
-		on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, rr, 1);
-}
-
-/*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
-	struct perf_event *event;
-	struct list_head *head = &group->hw.cqm_group_entry;
-	u32 old_rmid = group->hw.cqm_rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	/*
-	 * If our RMID is being deallocated, perform a read now.
-	 */
-	if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
-		struct rmid_read rr = {
-			.rmid = old_rmid,
-			.evt_type = group->attr.config,
-			.value = ATOMIC64_INIT(0),
-		};
-
-		cqm_mask_call(&rr);
-		local64_set(&group->count, atomic64_read(&rr.value));
-	}
-
-	raw_spin_lock_irq(&cache_lock);
-
-	group->hw.cqm_rmid = rmid;
-	list_for_each_entry(event, head, hw.cqm_group_entry)
-		event->hw.cqm_rmid = rmid;
-
-	raw_spin_unlock_irq(&cache_lock);
-
-	/*
-	 * If the allocation is for mbm, init the mbm stats.
-	 * Need to check if each event in the group is mbm event
-	 * because there could be multiple type of events in the same group.
-	 */
-	if (__rmid_valid(rmid)) {
-		event = group;
-		if (is_mbm_event(event->attr.config))
-			init_mbm_sample(rmid, event->attr.config);
-
-		list_for_each_entry(event, head, hw.cqm_group_entry) {
-			if (is_mbm_event(event->attr.config))
-				init_mbm_sample(rmid, event->attr.config);
-		}
-	}
-
-	return old_rmid;
-}
-
-/*
- * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
- * cachelines are still tagged with RMIDs in limbo, we progressively
- * increment the threshold until we find an RMID in limbo with <=
- * __intel_cqm_threshold lines tagged. This is designed to mitigate the
- * problem where cachelines tagged with an RMID are not steadily being
- * evicted.
- *
- * On successful rotations we decrease the threshold back towards zero.
- *
- * __intel_cqm_max_threshold provides an upper bound on the threshold,
- * and is measured in bytes because it's exposed to userland.
- */
-static unsigned int __intel_cqm_threshold;
-static unsigned int __intel_cqm_max_threshold;
-
-/*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
-	struct cqm_rmid_entry *entry;
-
-	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-		if (entry->state != RMID_AVAILABLE)
-			break;
-
-		if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
-			entry->state = RMID_DIRTY;
-	}
-}
-
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
-	struct perf_event *leader, *event;
-
-	lockdep_assert_held(&cache_mutex);
-
-	leader = list_first_entry(&cache_groups, struct perf_event,
-				  hw.cqm_groups_entry);
-	event = leader;
-
-	list_for_each_entry_continue(event, &cache_groups,
-				     hw.cqm_groups_entry) {
-		if (__rmid_valid(event->hw.cqm_rmid))
-			continue;
-
-		if (__conflict_event(event, leader))
-			continue;
-
-		intel_cqm_xchg_rmid(event, rmid);
-		return true;
-	}
-
-	return false;
-}
-
-/*
- * Initially use this constant for both the limbo queue time and the
- * rotation timer interval, pmu::hrtimer_interval_ms.
- *
- * They don't need to be the same, but the two are related since if you
- * rotate faster than you recycle RMIDs, you may run out of available
- * RMIDs.
- */
-#define RMID_DEFAULT_QUEUE_TIME 250	/* ms */
-
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @nr_available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @nr_available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
-	struct cqm_rmid_entry *entry, *tmp;
-
-	lockdep_assert_held(&cache_mutex);
-
-	*available = 0;
-	list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-		unsigned long min_queue_time;
-		unsigned long now = jiffies;
-
-		/*
-		 * We hold RMIDs placed into limbo for a minimum queue
-		 * time. Before the minimum queue time has elapsed we do
-		 * not recycle RMIDs.
-		 *
-		 * The reasoning is that until a sufficient time has
-		 * passed since we stopped using an RMID, any RMID
-		 * placed onto the limbo list will likely still have
-		 * data tagged in the cache, which means we'll probably
-		 * fail to recycle it anyway.
-		 *
-		 * We can save ourselves an expensive IPI by skipping
-		 * any RMIDs that have not been queued for the minimum
-		 * time.
-		 */
-		min_queue_time = entry->queue_time +
-			msecs_to_jiffies(__rmid_queue_time_ms);
-
-		if (time_after(min_queue_time, now))
-			break;
-
-		entry->state = RMID_AVAILABLE;
-		(*available)++;
-	}
-
-	/*
-	 * Fast return if none of the RMIDs on the limbo list have been
-	 * sitting on the queue for the minimum queue time.
-	 */
-	if (!*available)
-		return false;
-
-	/*
-	 * Test whether an RMID is free for each package.
-	 */
-	on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
-
-	list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
-		/*
-		 * Exhausted all RMIDs that have waited min queue time.
-		 */
-		if (entry->state == RMID_YOUNG)
-			break;
-
-		if (entry->state == RMID_DIRTY)
-			continue;
-
-		list_del(&entry->list);	/* remove from limbo */
-
-		/*
-		 * The rotation RMID gets priority if it's
-		 * currently invalid. In which case, skip adding
-		 * the RMID to the the free lru.
-		 */
-		if (!__rmid_valid(intel_cqm_rotation_rmid)) {
-			intel_cqm_rotation_rmid = entry->rmid;
-			continue;
-		}
-
-		/*
-		 * If we have groups waiting for RMIDs, hand
-		 * them one now provided they don't conflict.
-		 */
-		if (intel_cqm_sched_in_event(entry->rmid))
-			continue;
-
-		/*
-		 * Otherwise place it onto the free list.
-		 */
-		list_add_tail(&entry->list, &cqm_rmid_free_lru);
-	}
-
-
-	return __rmid_valid(intel_cqm_rotation_rmid);
-}
-
-/*
- * Pick a victim group and move it to the tail of the group list.
- * @next: The first group without an RMID
- */
-static void __intel_cqm_pick_and_rotate(struct perf_event *next)
-{
-	struct perf_event *rotor;
-	u32 rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	rotor = list_first_entry(&cache_groups, struct perf_event,
-				 hw.cqm_groups_entry);
-
-	/*
-	 * The group at the front of the list should always have a valid
-	 * RMID. If it doesn't then no groups have RMIDs assigned and we
-	 * don't need to rotate the list.
-	 */
-	if (next == rotor)
-		return;
-
-	rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
-	__put_rmid(rmid);
-
-	list_rotate_left(&cache_groups);
-}
-
-/*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
-	struct perf_event *group, *g;
-	u32 rmid;
-
-	lockdep_assert_held(&cache_mutex);
-
-	list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
-		if (group == event)
-			continue;
-
-		rmid = group->hw.cqm_rmid;
-
-		/*
-		 * Skip events that don't have a valid RMID.
-		 */
-		if (!__rmid_valid(rmid))
-			continue;
-
-		/*
-		 * No conflict? No problem! Leave the event alone.
-		 */
-		if (!__conflict_event(group, event))
-			continue;
-
-		intel_cqm_xchg_rmid(group, INVALID_RMID);
-		__put_rmid(rmid);
-	}
-}
-
-/*
- * Attempt to rotate the groups and assign new RMIDs.
- *
- * We rotate for two reasons,
- *   1. To handle the scheduling of conflicting events
- *   2. To recycle RMIDs
- *
- * Rotating RMIDs is complicated because the hardware doesn't give us
- * any clues.
- *
- * There's problems with the hardware interface; when you change the
- * task:RMID map cachelines retain their 'old' tags, giving a skewed
- * picture. In order to work around this, we must always keep one free
- * RMID - intel_cqm_rotation_rmid.
- *
- * Rotation works by taking away an RMID from a group (the old RMID),
- * and assigning the free RMID to another group (the new RMID). We must
- * then wait for the old RMID to not be used (no cachelines tagged).
- * This ensure that all cachelines are tagged with 'active' RMIDs. At
- * this point we can start reading values for the new RMID and treat the
- * old RMID as the free RMID for the next rotation.
- *
- * Return %true or %false depending on whether we did any rotating.
- */
-static bool __intel_cqm_rmid_rotate(void)
-{
-	struct perf_event *group, *start = NULL;
-	unsigned int threshold_limit;
-	unsigned int nr_needed = 0;
-	unsigned int nr_available;
-	bool rotated = false;
-
-	mutex_lock(&cache_mutex);
-
-again:
-	/*
-	 * Fast path through this function if there are no groups and no
-	 * RMIDs that need cleaning.
-	 */
-	if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
-		goto out;
-
-	list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
-		if (!__rmid_valid(group->hw.cqm_rmid)) {
-			if (!start)
-				start = group;
-			nr_needed++;
-		}
-	}
-
-	/*
-	 * We have some event groups, but they all have RMIDs assigned
-	 * and no RMIDs need cleaning.
-	 */
-	if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
-		goto out;
-
-	if (!nr_needed)
-		goto stabilize;
-
-	/*
-	 * We have more event groups without RMIDs than available RMIDs,
-	 * or we have event groups that conflict with the ones currently
-	 * scheduled.
-	 *
-	 * We force deallocate the rmid of the group at the head of
-	 * cache_groups. The first event group without an RMID then gets
-	 * assigned intel_cqm_rotation_rmid. This ensures we always make
-	 * forward progress.
-	 *
-	 * Rotate the cache_groups list so the previous head is now the
-	 * tail.
-	 */
-	__intel_cqm_pick_and_rotate(start);
-
-	/*
-	 * If the rotation is going to succeed, reduce the threshold so
-	 * that we don't needlessly reuse dirty RMIDs.
-	 */
-	if (__rmid_valid(intel_cqm_rotation_rmid)) {
-		intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
-		intel_cqm_rotation_rmid = __get_rmid();
-
-		intel_cqm_sched_out_conflicting_events(start);
-
-		if (__intel_cqm_threshold)
-			__intel_cqm_threshold--;
-	}
-
-	rotated = true;
-
-stabilize:
-	/*
-	 * We now need to stablize the RMID we freed above (if any) to
-	 * ensure that the next time we rotate we have an RMID with zero
-	 * occupancy value.
-	 *
-	 * Alternatively, if we didn't need to perform any rotation,
-	 * we'll have a bunch of RMIDs in limbo that need stabilizing.
-	 */
-	threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
-
-	while (intel_cqm_rmid_stabilize(&nr_available) &&
-	       __intel_cqm_threshold < threshold_limit) {
-		unsigned int steal_limit;
-
-		/*
-		 * Don't spin if nobody is actively waiting for an RMID,
-		 * the rotation worker will be kicked as soon as an
-		 * event needs an RMID anyway.
-		 */
-		if (!nr_needed)
-			break;
-
-		/* Allow max 25% of RMIDs to be in limbo. */
-		steal_limit = (cqm_max_rmid + 1) / 4;
-
-		/*
-		 * We failed to stabilize any RMIDs so our rotation
-		 * logic is now stuck. In order to make forward progress
-		 * we have a few options:
-		 *
-		 *   1. rotate ("steal") another RMID
-		 *   2. increase the threshold
-		 *   3. do nothing
-		 *
-		 * We do both of 1. and 2. until we hit the steal limit.
-		 *
-		 * The steal limit prevents all RMIDs ending up on the
-		 * limbo list. This can happen if every RMID has a
-		 * non-zero occupancy above threshold_limit, and the
-		 * occupancy values aren't dropping fast enough.
-		 *
-		 * Note that there is prioritisation at work here - we'd
-		 * rather increase the number of RMIDs on the limbo list
-		 * than increase the threshold, because increasing the
-		 * threshold skews the event data (because we reuse
-		 * dirty RMIDs) - threshold bumps are a last resort.
-		 */
-		if (nr_available < steal_limit)
-			goto again;
-
-		__intel_cqm_threshold++;
-	}
-
-out:
-	mutex_unlock(&cache_mutex);
-	return rotated;
-}
-
-static void intel_cqm_rmid_rotate(struct work_struct *work);
-
-static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
-
-static struct pmu intel_cqm_pmu;
-
-static void intel_cqm_rmid_rotate(struct work_struct *work)
-{
-	unsigned long delay;
-
-	__intel_cqm_rmid_rotate();
-
-	delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
-	schedule_delayed_work(&intel_cqm_rmid_work, delay);
-}
-
-static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
-{
-	struct sample *mbm_current;
-	u32 vrmid = rmid_2_index(rmid);
-	u64 val, bytes, shift;
-	u32 eventid;
-
-	if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
-		mbm_current = &mbm_local[vrmid];
-		eventid     = QOS_MBM_LOCAL_EVENT_ID;
-	} else {
-		mbm_current = &mbm_total[vrmid];
-		eventid     = QOS_MBM_TOTAL_EVENT_ID;
-	}
-
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
-	rdmsrl(MSR_IA32_QM_CTR, val);
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return mbm_current->total_bytes;
-
-	if (first) {
-		mbm_current->prev_msr = val;
-		mbm_current->total_bytes = 0;
-		return mbm_current->total_bytes;
-	}
-
-	/*
-	 * The h/w guarantees that counters will not overflow
-	 * so long as we poll them at least once per second.
-	 */
-	shift = 64 - MBM_CNTR_WIDTH;
-	bytes = (val << shift) - (mbm_current->prev_msr << shift);
-	bytes >>= shift;
-
-	bytes *= cqm_l3_scale;
-
-	mbm_current->total_bytes += bytes;
-	mbm_current->prev_msr = val;
-
-	return mbm_current->total_bytes;
-}
-
-static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
-{
-	return update_sample(rmid, evt_type, 0);
-}
-
-static void __intel_mbm_event_init(void *info)
-{
-	struct rmid_read *rr = info;
-
-	update_sample(rr->rmid, rr->evt_type, 1);
-}
-
-static void init_mbm_sample(u32 rmid, u32 evt_type)
-{
-	struct rmid_read rr = {
-		.rmid = rmid,
-		.evt_type = evt_type,
-		.value = ATOMIC64_INIT(0),
-	};
-
-	/* on each socket, init sample */
-	on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
-}
-
-/*
- * Find a group and setup RMID.
- *
- * If we're part of a group, we use the group's RMID.
- */
-static void intel_cqm_setup_event(struct perf_event *event,
-				  struct perf_event **group)
-{
-	struct perf_event *iter;
-	bool conflict = false;
-	u32 rmid;
-
-	event->hw.is_group_event = false;
-	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
-		rmid = iter->hw.cqm_rmid;
-
-		if (__match_event(iter, event)) {
-			/* All tasks in a group share an RMID */
-			event->hw.cqm_rmid = rmid;
-			*group = iter;
-			if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
-				init_mbm_sample(rmid, event->attr.config);
-			return;
-		}
-
-		/*
-		 * We only care about conflicts for events that are
-		 * actually scheduled in (and hence have a valid RMID).
-		 */
-		if (__conflict_event(iter, event) && __rmid_valid(rmid))
-			conflict = true;
-	}
-
-	if (conflict)
-		rmid = INVALID_RMID;
-	else
-		rmid = __get_rmid();
-
-	if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
-		init_mbm_sample(rmid, event->attr.config);
-
-	event->hw.cqm_rmid = rmid;
-}
-
-static void intel_cqm_event_read(struct perf_event *event)
-{
-	unsigned long flags;
-	u32 rmid;
-	u64 val;
-
-	/*
-	 * Task events are handled by intel_cqm_event_count().
-	 */
-	if (event->cpu == -1)
-		return;
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-	rmid = event->hw.cqm_rmid;
-
-	if (!__rmid_valid(rmid))
-		goto out;
-
-	if (is_mbm_event(event->attr.config))
-		val = rmid_read_mbm(rmid, event->attr.config);
-	else
-		val = __rmid_read(rmid);
-
-	/*
-	 * Ignore this reading on error states and do not update the value.
-	 */
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		goto out;
-
-	local64_set(&event->count, val);
-out:
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-}
-
-static void __intel_cqm_event_count(void *info)
-{
-	struct rmid_read *rr = info;
-	u64 val;
-
-	val = __rmid_read(rr->rmid);
-
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
-
-	atomic64_add(val, &rr->value);
-}
-
-static inline bool cqm_group_leader(struct perf_event *event)
-{
-	return !list_empty(&event->hw.cqm_groups_entry);
-}
-
-static void __intel_mbm_event_count(void *info)
-{
-	struct rmid_read *rr = info;
-	u64 val;
-
-	val = rmid_read_mbm(rr->rmid, rr->evt_type);
-	if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-		return;
-	atomic64_add(val, &rr->value);
-}
-
-static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
-{
-	struct perf_event *iter, *iter1;
-	int ret = HRTIMER_RESTART;
-	struct list_head *head;
-	unsigned long flags;
-	u32 grp_rmid;
-
-	/*
-	 * Need to cache_lock as the timer Event Select MSR reads
-	 * can race with the mbm/cqm count() and mbm_init() reads.
-	 */
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	if (list_empty(&cache_groups)) {
-		ret = HRTIMER_NORESTART;
-		goto out;
-	}
-
-	list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
-		grp_rmid = iter->hw.cqm_rmid;
-		if (!__rmid_valid(grp_rmid))
-			continue;
-		if (is_mbm_event(iter->attr.config))
-			update_sample(grp_rmid, iter->attr.config, 0);
-
-		head = &iter->hw.cqm_group_entry;
-		if (list_empty(head))
-			continue;
-		list_for_each_entry(iter1, head, hw.cqm_group_entry) {
-			if (!iter1->hw.is_group_event)
-				break;
-			if (is_mbm_event(iter1->attr.config))
-				update_sample(iter1->hw.cqm_rmid,
-					      iter1->attr.config, 0);
-		}
-	}
-
-	hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
-out:
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
-	return ret;
-}
-
-static void __mbm_start_timer(void *info)
-{
-	hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
-			     HRTIMER_MODE_REL_PINNED);
-}
-
-static void __mbm_stop_timer(void *info)
-{
-	hrtimer_cancel(&mbm_timers[pkg_id]);
-}
-
-static void mbm_start_timers(void)
-{
-	on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
-}
-
-static void mbm_stop_timers(void)
-{
-	on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
-}
-
-static void mbm_hrtimer_init(void)
-{
-	struct hrtimer *hr;
-	int i;
-
-	for (i = 0; i < mbm_socket_max; i++) {
-		hr = &mbm_timers[i];
-		hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-		hr->function = mbm_hrtimer_handle;
-	}
-}
-
-static u64 intel_cqm_event_count(struct perf_event *event)
-{
-	unsigned long flags;
-	struct rmid_read rr = {
-		.evt_type = event->attr.config,
-		.value = ATOMIC64_INIT(0),
-	};
-
-	/*
-	 * We only need to worry about task events. System-wide events
-	 * are handled like usual, i.e. entirely with
-	 * intel_cqm_event_read().
-	 */
-	if (event->cpu != -1)
-		return __perf_event_count(event);
-
-	/*
-	 * Only the group leader gets to report values except in case of
-	 * multiple events in the same group, we still need to read the
-	 * other events.This stops us
-	 * reporting duplicate values to userspace, and gives us a clear
-	 * rule for which task gets to report the values.
-	 *
-	 * Note that it is impossible to attribute these values to
-	 * specific packages - we forfeit that ability when we create
-	 * task events.
-	 */
-	if (!cqm_group_leader(event) && !event->hw.is_group_event)
-		return 0;
-
-	/*
-	 * Getting up-to-date values requires an SMP IPI which is not
-	 * possible if we're being called in interrupt context. Return
-	 * the cached values instead.
-	 */
-	if (unlikely(in_interrupt()))
-		goto out;
-
-	/*
-	 * Notice that we don't perform the reading of an RMID
-	 * atomically, because we can't hold a spin lock across the
-	 * IPIs.
-	 *
-	 * Speculatively perform the read, since @event might be
-	 * assigned a different (possibly invalid) RMID while we're
-	 * busying performing the IPI calls. It's therefore necessary to
-	 * check @event's RMID afterwards, and if it has changed,
-	 * discard the result of the read.
-	 */
-	rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
-
-	if (!__rmid_valid(rr.rmid))
-		goto out;
-
-	cqm_mask_call(&rr);
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-	if (event->hw.cqm_rmid == rr.rmid)
-		local64_set(&event->count, atomic64_read(&rr.value));
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-out:
-	return __perf_event_count(event);
-}
-
-static void intel_cqm_event_start(struct perf_event *event, int mode)
-{
-	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-	u32 rmid = event->hw.cqm_rmid;
-
-	if (!(event->hw.cqm_state & PERF_HES_STOPPED))
-		return;
-
-	event->hw.cqm_state &= ~PERF_HES_STOPPED;
-
-	if (state->rmid_usecnt++) {
-		if (!WARN_ON_ONCE(state->rmid != rmid))
-			return;
-	} else {
-		WARN_ON_ONCE(state->rmid);
-	}
-
-	state->rmid = rmid;
-	wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
-}
-
-static void intel_cqm_event_stop(struct perf_event *event, int mode)
-{
-	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
-	if (event->hw.cqm_state & PERF_HES_STOPPED)
-		return;
-
-	event->hw.cqm_state |= PERF_HES_STOPPED;
-
-	intel_cqm_event_read(event);
-
-	if (!--state->rmid_usecnt) {
-		state->rmid = 0;
-		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);
-	} else {
-		WARN_ON_ONCE(!state->rmid);
-	}
-}
-
-static int intel_cqm_event_add(struct perf_event *event, int mode)
-{
-	unsigned long flags;
-	u32 rmid;
-
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	event->hw.cqm_state = PERF_HES_STOPPED;
-	rmid = event->hw.cqm_rmid;
-
-	if (__rmid_valid(rmid) && (mode & PERF_EF_START))
-		intel_cqm_event_start(event, mode);
-
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
-	return 0;
-}
-
-static void intel_cqm_event_destroy(struct perf_event *event)
-{
-	struct perf_event *group_other = NULL;
-	unsigned long flags;
-
-	mutex_lock(&cache_mutex);
-	/*
-	* Hold the cache_lock as mbm timer handlers could be
-	* scanning the list of events.
-	*/
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	/*
-	 * If there's another event in this group...
-	 */
-	if (!list_empty(&event->hw.cqm_group_entry)) {
-		group_other = list_first_entry(&event->hw.cqm_group_entry,
-					       struct perf_event,
-					       hw.cqm_group_entry);
-		list_del(&event->hw.cqm_group_entry);
-	}
-
-	/*
-	 * And we're the group leader..
-	 */
-	if (cqm_group_leader(event)) {
-		/*
-		 * If there was a group_other, make that leader, otherwise
-		 * destroy the group and return the RMID.
-		 */
-		if (group_other) {
-			list_replace(&event->hw.cqm_groups_entry,
-				     &group_other->hw.cqm_groups_entry);
-		} else {
-			u32 rmid = event->hw.cqm_rmid;
-
-			if (__rmid_valid(rmid))
-				__put_rmid(rmid);
-			list_del(&event->hw.cqm_groups_entry);
-		}
-	}
-
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-
-	/*
-	 * Stop the mbm overflow timers when the last event is destroyed.
-	*/
-	if (mbm_enabled && list_empty(&cache_groups))
-		mbm_stop_timers();
-
-	mutex_unlock(&cache_mutex);
-}
-
-static int intel_cqm_event_init(struct perf_event *event)
-{
-	struct perf_event *group = NULL;
-	bool rotate = false;
-	unsigned long flags;
-
-	if (event->attr.type != intel_cqm_pmu.type)
-		return -ENOENT;
-
-	if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
-	     (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
-		return -EINVAL;
-
-	if ((is_cqm_event(event->attr.config) && !cqm_enabled) ||
-	    (is_mbm_event(event->attr.config) && !mbm_enabled))
-		return -EINVAL;
-
-	/* unsupported modes and filters */
-	if (event->attr.exclude_user   ||
-	    event->attr.exclude_kernel ||
-	    event->attr.exclude_hv     ||
-	    event->attr.exclude_idle   ||
-	    event->attr.exclude_host   ||
-	    event->attr.exclude_guest  ||
-	    event->attr.sample_period) /* no sampling */
-		return -EINVAL;
-
-	INIT_LIST_HEAD(&event->hw.cqm_group_entry);
-	INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
-
-	event->destroy = intel_cqm_event_destroy;
-
-	mutex_lock(&cache_mutex);
-
-	/*
-	 * Start the mbm overflow timers when the first event is created.
-	*/
-	if (mbm_enabled && list_empty(&cache_groups))
-		mbm_start_timers();
-
-	/* Will also set rmid */
-	intel_cqm_setup_event(event, &group);
-
-	/*
-	* Hold the cache_lock as mbm timer handlers be
-	* scanning the list of events.
-	*/
-	raw_spin_lock_irqsave(&cache_lock, flags);
-
-	if (group) {
-		list_add_tail(&event->hw.cqm_group_entry,
-			      &group->hw.cqm_group_entry);
-	} else {
-		list_add_tail(&event->hw.cqm_groups_entry,
-			      &cache_groups);
-
-		/*
-		 * All RMIDs are either in use or have recently been
-		 * used. Kick the rotation worker to clean/free some.
-		 *
-		 * We only do this for the group leader, rather than for
-		 * every event in a group to save on needless work.
-		 */
-		if (!__rmid_valid(event->hw.cqm_rmid))
-			rotate = true;
-	}
-
-	raw_spin_unlock_irqrestore(&cache_lock, flags);
-	mutex_unlock(&cache_mutex);
-
-	if (rotate)
-		schedule_delayed_work(&intel_cqm_rmid_work, 0);
-
-	return 0;
-}
-
-EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
-EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
-EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
-EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
-EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
-
-EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
-EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
-EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
-EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
-
-EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
-EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
-EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
-EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
-
-static struct attribute *intel_cqm_events_attr[] = {
-	EVENT_PTR(intel_cqm_llc),
-	EVENT_PTR(intel_cqm_llc_pkg),
-	EVENT_PTR(intel_cqm_llc_unit),
-	EVENT_PTR(intel_cqm_llc_scale),
-	EVENT_PTR(intel_cqm_llc_snapshot),
-	NULL,
-};
-
-static struct attribute *intel_mbm_events_attr[] = {
-	EVENT_PTR(intel_cqm_total_bytes),
-	EVENT_PTR(intel_cqm_local_bytes),
-	EVENT_PTR(intel_cqm_total_bytes_pkg),
-	EVENT_PTR(intel_cqm_local_bytes_pkg),
-	EVENT_PTR(intel_cqm_total_bytes_unit),
-	EVENT_PTR(intel_cqm_local_bytes_unit),
-	EVENT_PTR(intel_cqm_total_bytes_scale),
-	EVENT_PTR(intel_cqm_local_bytes_scale),
-	NULL,
-};
-
-static struct attribute *intel_cmt_mbm_events_attr[] = {
-	EVENT_PTR(intel_cqm_llc),
-	EVENT_PTR(intel_cqm_total_bytes),
-	EVENT_PTR(intel_cqm_local_bytes),
-	EVENT_PTR(intel_cqm_llc_pkg),
-	EVENT_PTR(intel_cqm_total_bytes_pkg),
-	EVENT_PTR(intel_cqm_local_bytes_pkg),
-	EVENT_PTR(intel_cqm_llc_unit),
-	EVENT_PTR(intel_cqm_total_bytes_unit),
-	EVENT_PTR(intel_cqm_local_bytes_unit),
-	EVENT_PTR(intel_cqm_llc_scale),
-	EVENT_PTR(intel_cqm_total_bytes_scale),
-	EVENT_PTR(intel_cqm_local_bytes_scale),
-	EVENT_PTR(intel_cqm_llc_snapshot),
-	NULL,
-};
-
-static struct attribute_group intel_cqm_events_group = {
-	.name = "events",
-	.attrs = NULL,
-};
-
-PMU_FORMAT_ATTR(event, "config:0-7");
-static struct attribute *intel_cqm_formats_attr[] = {
-	&format_attr_event.attr,
-	NULL,
-};
-
-static struct attribute_group intel_cqm_format_group = {
-	.name = "format",
-	.attrs = intel_cqm_formats_attr,
-};
-
-static ssize_t
-max_recycle_threshold_show(struct device *dev, struct device_attribute *attr,
-			   char *page)
-{
-	ssize_t rv;
-
-	mutex_lock(&cache_mutex);
-	rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold);
-	mutex_unlock(&cache_mutex);
-
-	return rv;
-}
-
-static ssize_t
-max_recycle_threshold_store(struct device *dev,
-			    struct device_attribute *attr,
-			    const char *buf, size_t count)
-{
-	unsigned int bytes, cachelines;
-	int ret;
-
-	ret = kstrtouint(buf, 0, &bytes);
-	if (ret)
-		return ret;
-
-	mutex_lock(&cache_mutex);
-
-	__intel_cqm_max_threshold = bytes;
-	cachelines = bytes / cqm_l3_scale;
-
-	/*
-	 * The new maximum takes effect immediately.
-	 */
-	if (__intel_cqm_threshold > cachelines)
-		__intel_cqm_threshold = cachelines;
-
-	mutex_unlock(&cache_mutex);
-
-	return count;
-}
-
-static DEVICE_ATTR_RW(max_recycle_threshold);
-
-static struct attribute *intel_cqm_attrs[] = {
-	&dev_attr_max_recycle_threshold.attr,
-	NULL,
-};
-
-static const struct attribute_group intel_cqm_group = {
-	.attrs = intel_cqm_attrs,
-};
-
-static const struct attribute_group *intel_cqm_attr_groups[] = {
-	&intel_cqm_events_group,
-	&intel_cqm_format_group,
-	&intel_cqm_group,
-	NULL,
-};
-
-static struct pmu intel_cqm_pmu = {
-	.hrtimer_interval_ms = RMID_DEFAULT_QUEUE_TIME,
-	.attr_groups	     = intel_cqm_attr_groups,
-	.task_ctx_nr	     = perf_sw_context,
-	.event_init	     = intel_cqm_event_init,
-	.add		     = intel_cqm_event_add,
-	.del		     = intel_cqm_event_stop,
-	.start		     = intel_cqm_event_start,
-	.stop		     = intel_cqm_event_stop,
-	.read		     = intel_cqm_event_read,
-	.count		     = intel_cqm_event_count,
-};
-
-static inline void cqm_pick_event_reader(int cpu)
-{
-	int reader;
-
-	/* First online cpu in package becomes the reader */
-	reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
-	if (reader >= nr_cpu_ids)
-		cpumask_set_cpu(cpu, &cqm_cpumask);
-}
-
-static int intel_cqm_cpu_starting(unsigned int cpu)
-{
-	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
-	struct cpuinfo_x86 *c = &cpu_data(cpu);
-
-	state->rmid = 0;
-	state->closid = 0;
-	state->rmid_usecnt = 0;
-
-	WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
-	WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
-
-	cqm_pick_event_reader(cpu);
-	return 0;
-}
-
-static int intel_cqm_cpu_exit(unsigned int cpu)
-{
-	int target;
-
-	/* Is @cpu the current cqm reader for this package ? */
-	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
-		return 0;
-
-	/* Find another online reader in this package */
-	target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
-
-	if (target < nr_cpu_ids)
-		cpumask_set_cpu(target, &cqm_cpumask);
-
-	return 0;
-}
-
-static const struct x86_cpu_id intel_cqm_match[] = {
-	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_OCCUP_LLC },
-	{}
-};
-
-static void mbm_cleanup(void)
-{
-	if (!mbm_enabled)
-		return;
-
-	kfree(mbm_local);
-	kfree(mbm_total);
-	mbm_enabled = false;
-}
-
-static const struct x86_cpu_id intel_mbm_local_match[] = {
-	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
-	{}
-};
-
-static const struct x86_cpu_id intel_mbm_total_match[] = {
-	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
-	{}
-};
-
-static int intel_mbm_init(void)
-{
-	int ret = 0, array_size, maxid = cqm_max_rmid + 1;
-
-	mbm_socket_max = topology_max_packages();
-	array_size = sizeof(struct sample) * maxid * mbm_socket_max;
-	mbm_local = kmalloc(array_size, GFP_KERNEL);
-	if (!mbm_local)
-		return -ENOMEM;
-
-	mbm_total = kmalloc(array_size, GFP_KERNEL);
-	if (!mbm_total) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	array_size = sizeof(struct hrtimer) * mbm_socket_max;
-	mbm_timers = kmalloc(array_size, GFP_KERNEL);
-	if (!mbm_timers) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	mbm_hrtimer_init();
-
-out:
-	if (ret)
-		mbm_cleanup();
-
-	return ret;
-}
-
-static int __init intel_cqm_init(void)
-{
-	char *str = NULL, scale[20];
-	int cpu, ret;
-
-	if (x86_match_cpu(intel_cqm_match))
-		cqm_enabled = true;
-
-	if (x86_match_cpu(intel_mbm_local_match) &&
-	     x86_match_cpu(intel_mbm_total_match))
-		mbm_enabled = true;
-
-	if (!cqm_enabled && !mbm_enabled)
-		return -ENODEV;
-
-	cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
-
-	/*
-	 * It's possible that not all resources support the same number
-	 * of RMIDs. Instead of making scheduling much more complicated
-	 * (where we have to match a task's RMID to a cpu that supports
-	 * that many RMIDs) just find the minimum RMIDs supported across
-	 * all cpus.
-	 *
-	 * Also, check that the scales match on all cpus.
-	 */
-	get_online_cpus();
-	for_each_online_cpu(cpu) {
-		struct cpuinfo_x86 *c = &cpu_data(cpu);
-
-		if (c->x86_cache_max_rmid < cqm_max_rmid)
-			cqm_max_rmid = c->x86_cache_max_rmid;
-
-		if (c->x86_cache_occ_scale != cqm_l3_scale) {
-			pr_err("Multiple LLC scale values, disabling\n");
-			ret = -EINVAL;
-			goto out;
-		}
-	}
-
-	/*
-	 * A reasonable upper limit on the max threshold is the number
-	 * of lines tagged per RMID if all RMIDs have the same number of
-	 * lines tagged in the LLC.
-	 *
-	 * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
-	 */
-	__intel_cqm_max_threshold =
-		boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);
-
-	snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
-	str = kstrdup(scale, GFP_KERNEL);
-	if (!str) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	event_attr_intel_cqm_llc_scale.event_str = str;
-
-	ret = intel_cqm_setup_rmid_cache();
-	if (ret)
-		goto out;
-
-	if (mbm_enabled)
-		ret = intel_mbm_init();
-	if (ret && !cqm_enabled)
-		goto out;
-
-	if (cqm_enabled && mbm_enabled)
-		intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
-	else if (!cqm_enabled && mbm_enabled)
-		intel_cqm_events_group.attrs = intel_mbm_events_attr;
-	else if (cqm_enabled && !mbm_enabled)
-		intel_cqm_events_group.attrs = intel_cqm_events_attr;
-
-	ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
-	if (ret) {
-		pr_err("Intel CQM perf registration failed: %d\n", ret);
-		goto out;
-	}
-
-	if (cqm_enabled)
-		pr_info("Intel CQM monitoring enabled\n");
-	if (mbm_enabled)
-		pr_info("Intel MBM enabled\n");
-
-	/*
-	 * Setup the hot cpu notifier once we are sure cqm
-	 * is enabled to avoid notifier leak.
-	 */
-	cpuhp_setup_state(CPUHP_AP_PERF_X86_CQM_STARTING,
-			  "AP_PERF_X86_CQM_STARTING",
-			  intel_cqm_cpu_starting, NULL);
-	cpuhp_setup_state(CPUHP_AP_PERF_X86_CQM_ONLINE, "AP_PERF_X86_CQM_ONLINE",
-			  NULL, intel_cqm_cpu_exit);
-
-out:
-	put_online_cpus();
-
-	if (ret) {
-		kfree(str);
-		cqm_cleanup();
-		mbm_cleanup();
-	}
-
-	return ret;
-}
-device_initcall(intel_cqm_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index afe641c..320a3be 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -68,7 +68,6 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_STARTING,
 	CPUHP_AP_PERF_X86_STARTING,
 	CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
-	CPUHP_AP_PERF_X86_CQM_STARTING,
 	CPUHP_AP_PERF_X86_CSTATE_STARTING,
 	CPUHP_AP_PERF_XTENSA_STARTING,
 	CPUHP_AP_PERF_METAG_STARTING,
@@ -111,7 +110,6 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_ONLINE,
 	CPUHP_AP_PERF_X86_AMD_POWER_ONLINE,
 	CPUHP_AP_PERF_X86_RAPL_ONLINE,
-	CPUHP_AP_PERF_X86_CQM_ONLINE,
 	CPUHP_AP_PERF_X86_CSTATE_ONLINE,
 	CPUHP_AP_PERF_S390_CF_ONLINE,
 	CPUHP_AP_PERF_S390_SF_ONLINE,
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 060d0ed..345ec20 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -139,14 +139,6 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
-		struct { /* intel_cqm */
-			int			cqm_state;
-			u32			cqm_rmid;
-			int			is_group_event;
-			struct list_head	cqm_events_entry;
-			struct list_head	cqm_groups_entry;
-			struct list_head	cqm_group_entry;
-		};
 		struct { /* itrace */
 			int			itrace_started;
 		};
@@ -408,12 +400,6 @@ struct pmu {
 	 */
 	size_t				task_ctx_size;
 
-
-	/*
-	 * Return the count value for a counter.
-	 */
-	u64 (*count)			(struct perf_event *event); /*optional*/
-
 	/*
 	 * Set up pmu-private data structures for an AUX area
 	 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 55953db..d99a51c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3511,9 +3511,6 @@ static void __perf_event_read(void *info)
 
 static inline u64 perf_event_count(struct perf_event *event)
 {
-	if (event->pmu->count)
-		return event->pmu->count(event);
-
 	return __perf_event_count(event);
 }
 
@@ -3523,7 +3520,6 @@ static inline u64 perf_event_count(struct perf_event *event)
  *   - either for the current task, or for this CPU
  *   - does not have inherit set, for inherited task events
  *     will not be local and we cannot read them atomically
- *   - must not have a pmu::count method
  */
 u64 perf_event_read_local(struct perf_event *event)
 {
@@ -3551,12 +3547,6 @@ u64 perf_event_read_local(struct perf_event *event)
 	WARN_ON_ONCE(event->attr.inherit);
 
 	/*
-	 * It must not have a pmu::count method, those are not
-	 * NMI safe.
-	 */
-	WARN_ON_ONCE(event->pmu->count);
-
-	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
 	 * oncpu == -1).
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 5dcb992..52c2c85 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -252,8 +252,8 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
 		     event->attr.type != PERF_TYPE_RAW))
 		return -EINVAL;
 
-	/* make sure event is local and doesn't have pmu::count */
-	if (unlikely(event->oncpu != cpu || event->pmu->count))
+	/* make sure event is local */
+	if (unlikely(event->oncpu != cpu))
 		return -EINVAL;
 
 	/*
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
@ 2016-10-30  0:37 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag David Carrillo-Cisneros
                   ` (43 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The CMT name has superseded CQM in Intel's documentation.

Rename the cpufeatures accordingly. The next patches in this series use
the CMT name.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/include/asm/cpufeatures.h | 14 +++++++-------
 arch/x86/kernel/cpu/common.c       | 10 +++++-----
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 90b8c0b..cd3b215 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -222,7 +222,7 @@
 #define X86_FEATURE_ERMS	( 9*32+ 9) /* Enhanced REP MOVSB/STOSB */
 #define X86_FEATURE_INVPCID	( 9*32+10) /* Invalidate Processor Context ID */
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
-#define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
+#define X86_FEATURE_CMT		( 9*32+12) /* Cache Monitoring Technology */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
 #define X86_FEATURE_RDT_A	( 9*32+15) /* Resource Director Technology Allocation */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
@@ -245,13 +245,13 @@
 #define X86_FEATURE_XGETBV1	(10*32+ 2) /* XGETBV with ECX = 1 */
 #define X86_FEATURE_XSAVES	(10*32+ 3) /* XSAVES/XRSTORS */
 
-/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:0 (edx), word 11 */
-#define X86_FEATURE_CQM_LLC	(11*32+ 1) /* LLC QoS if 1 */
+/* Intel-defined CPU CMT Sub-leaf, CPUID level 0x0000000F:0 (edx), word 11 */
+#define X86_FEATURE_CMT_LLC	(11*32+ 1) /* LLC CMT if 1 */
 
-/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
-#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
-#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
-#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */
+/* Intel-defined CPU CMT Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
+#define X86_FEATURE_CMT_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CMT_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CMT_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
 #define X86_FEATURE_CLZERO	(13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9bd910a..911ee16 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -691,7 +691,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
 		c->x86_capability[CPUID_F_0_EDX] = edx;
 
-		if (cpu_has(c, X86_FEATURE_CQM_LLC)) {
+		if (cpu_has(c, X86_FEATURE_CMT_LLC)) {
 			/* will be overridden if occupancy monitoring exists */
 			c->x86_cache_max_rmid = ebx;
 
@@ -699,9 +699,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 			cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
 			c->x86_capability[CPUID_F_1_EDX] = edx;
 
-			if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
-			      ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
-			       (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
+			if ((cpu_has(c, X86_FEATURE_CMT_OCCUP_LLC)) ||
+			      ((cpu_has(c, X86_FEATURE_CMT_MBM_TOTAL)) ||
+			       (cpu_has(c, X86_FEATURE_CMT_MBM_LOCAL)))) {
 				c->x86_cache_max_rmid = ecx;
 				c->x86_cache_occ_scale = ebx;
 			}
@@ -969,7 +969,7 @@ static void x86_init_cache_qos(struct cpuinfo_x86 *c)
 	/*
 	 * The heavy lifting of max_rmid and cache_occ_scale are handled
 	 * in get_cpu_cap().  Here we just set the max_rmid for the boot_cpu
-	 * in case CQM bits really aren't there in this CPU.
+	 * in case CMT bits really aren't there in this CPU.
 	 */
 	if (c != &boot_cpu_data) {
 		boot_cpu_data.x86_cache_max_rmid =
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
  2016-10-30  0:37 ` [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
                   ` (42 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add a new flag for building drivers for Intel RDT Monitoring
(currently CMT; in the future, MBM).

This driver may be converted to a module once the hooks in
perf cgroup are discussed.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/Kconfig | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 770fb5f..d31825d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -419,6 +419,18 @@ config INTEL_RDT_A
 
 	  Say N if unsure.
 
+config INTEL_RDT_M
+	bool "Intel Resource Director Technology Monitoring support"
+	default n
+	depends on PERF_EVENTS && X86 && CPU_SUP_INTEL
+	---help---
+	  Select to enable resource monitoring, which is a sub-feature of
+	  Intel Resource Director Technology (RDT). More information about
+	  RDT can be found in the Intel x86 Architecture Software
+	  Developer Manual.
+
+	  Say N if unsure.
+
 if X86_32
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (2 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 15:19   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
                   ` (41 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Introduce struct pkg_data to store the per-package locks and data for
the new CMT driver.

Each pkg_data is initialized/terminated on demand when the first/last CPU
in its package goes online/offline.

More details in the code's comments.
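
As an aside for readers (not part of the patch), the per-package data is
meant to be walked either under rcu_read_lock() or with cmt_mutex held; a
minimal sketch using the cmt_pkgs_data_next_rcu() helper and the pkg_data
fields added below (cmt_print_pkgs() is a hypothetical name):

/* illustrative only; not in the patch */
static void cmt_print_pkgs(void)
{
	struct pkg_data *pkgd = NULL;

	/* readers hold either rcu_read_lock() or cmt_mutex */
	rcu_read_lock();
	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
		pr_info("pkg %u: work_cpu %u, max_rmid %u\n",
			pkgd->pkgid, pkgd->work_cpu, pkgd->max_rmid);
	rcu_read_unlock();
}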

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/Makefile |   1 +
 arch/x86/events/intel/cmt.c    | 268 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h    |  29 +++++
 include/linux/cpuhotplug.h     |   2 +
 4 files changed, 300 insertions(+)
 create mode 100644 arch/x86/events/intel/cmt.c
 create mode 100644 arch/x86/events/intel/cmt.h

diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index e9d8520..02fecbc 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= intel-uncore.o
 intel-uncore-objs			:= uncore.o uncore_nhmex.o uncore_snb.o uncore_snbep.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE)	+= intel-cstate.o
 intel-cstate-objs			:= cstate.o
+obj-$(CONFIG_INTEL_RDT_M)		+= cmt.o
diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
new file mode 100644
index 0000000..267a9ec
--- /dev/null
+++ b/arch/x86/events/intel/cmt.c
@@ -0,0 +1,268 @@
+/*
+ * Intel Cache Monitoring Technology (CMT) support.
+ */
+
+#include <linux/slab.h>
+#include <asm/cpu_device_id.h>
+#include "cmt.h"
+#include "../perf_event.h"
+
+static DEFINE_MUTEX(cmt_mutex);
+
+static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
+
+static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
+
+/* Array of packages (array of pkgds). It's protected by RCU or cmt_mutex. */
+static struct pkg_data **cmt_pkgs_data;
+
+/*
+ * If @pkgd == NULL, return the first online pkg_data in cmt_pkgs_data.
+ * Otherwise, return the next online pkg_data, or NULL if there are no more.
+ */
+static struct pkg_data *cmt_pkgs_data_next_rcu(struct pkg_data *pkgd)
+{
+	u16 p, nr_pkgs = topology_max_packages();
+
+	if (!pkgd)
+		return rcu_dereference_check(cmt_pkgs_data[0],
+					     lockdep_is_held(&cmt_mutex));
+	p = pkgd->pkgid + 1;
+	pkgd = NULL;
+
+	while (!pkgd && p < nr_pkgs) {
+		pkgd = rcu_dereference_check(cmt_pkgs_data[p++],
+					     lockdep_is_held(&cmt_mutex));
+	}
+
+	return pkgd;
+}
+
+static void free_pkg_data(struct pkg_data *pkg_data)
+{
+	kfree(pkg_data);
+}
+
+/* Init pkg_data for @cpu's package. */
+static struct pkg_data *alloc_pkg_data(int cpu)
+{
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	struct pkg_data *pkgd;
+	int numa_node = cpu_to_node(cpu);
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	if (c->x86_cache_occ_scale != cmt_l3_scale) {
+		/* 0 scale must have been converted to 1 automatically. */
+		if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {
+			pr_err("Multiple LLC scale values, disabling CMT support.\n");
+			return ERR_PTR(-ENXIO);
+		}
+	}
+
+	pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, numa_node);
+	if (!pkgd)
+		return ERR_PTR(-ENOMEM);
+
+	pkgd->max_rmid = c->x86_cache_max_rmid;
+
+	pkgd->work_cpu = cpu;
+	pkgd->pkgid = pkgid;
+
+	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);
+
+	return pkgd;
+}
+
+static void __terminate_pkg_data(struct pkg_data *pkgd)
+{
+	lockdep_assert_held(&cmt_mutex);
+
+	free_pkg_data(pkgd);
+}
+
+static int init_pkg_data(int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	lockdep_assert_held(&cmt_mutex);
+
+	/* Verify that this pkgid isn't already initialized. */
+	if (WARN_ON_ONCE(cmt_pkgs_data[pkgid]))
+		return -EPERM;
+
+	pkgd = alloc_pkg_data(cpu);
+	if (IS_ERR(pkgd))
+		return PTR_ERR(pkgd);
+
+	rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
+	synchronize_rcu();
+
+	return 0;
+}
+
+static int intel_cmt_hp_online_enter(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	rcu_read_lock();
+	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
+	if (pkgd->work_cpu >= nr_cpu_ids)
+		pkgd->work_cpu = cpu;
+
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int intel_cmt_hp_online_exit(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	rcu_read_lock();
+	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
+	if (pkgd->work_cpu == cpu)
+		pkgd->work_cpu = cpumask_any_but(
+				topology_core_cpumask(cpu), cpu);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int intel_cmt_prep_up(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	int err = 0;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	mutex_lock(&cmt_mutex);
+	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
+					 lockdep_is_held(&cmt_mutex));
+	if (!pkgd)
+		err = init_pkg_data(cpu);
+	mutex_unlock(&cmt_mutex);
+
+	return err;
+}
+
+static int intel_cmt_prep_down(unsigned int cpu)
+{
+	struct pkg_data *pkgd;
+	u16 pkgid = topology_logical_package_id(cpu);
+
+	mutex_lock(&cmt_mutex);
+	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
+					 lockdep_is_held(&cmt_mutex));
+	if (pkgd->work_cpu >= nr_cpu_ids) {
+		/* will destroy pkgd */
+		__terminate_pkg_data(pkgd);
+		RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
+		synchronize_rcu();
+	}
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+}
+
+static const struct x86_cpu_id intel_cmt_match[] = {
+	{ .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CMT_OCCUP_LLC },
+	{}
+};
+
+static void cmt_dealloc(void)
+{
+	kfree(cmt_pkgs_data);
+	cmt_pkgs_data = NULL;
+}
+
+static int __init cmt_alloc(void)
+{
+	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
+	if (cmt_l3_scale == 0)
+		cmt_l3_scale = 1;
+
+	cmt_pkgs_data = kcalloc(topology_max_packages(),
+				sizeof(*cmt_pkgs_data), GFP_KERNEL);
+	if (!cmt_pkgs_data)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __init cmt_start(void)
+{
+	char *str, scale[20];
+	int err;
+
+	/* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
+	__min_max_rmid = UINT_MAX;
+	err = cpuhp_setup_state(CPUHP_PERF_X86_CMT_PREP,
+				"PERF_X86_CMT_PREP",
+				intel_cmt_prep_up,
+				intel_cmt_prep_down);
+	if (err)
+		return err;
+
+	err = cpuhp_setup_state(CPUHP_AP_PERF_X86_CMT_ONLINE,
+				"AP_PERF_X86_CMT_ONLINE",
+				intel_cmt_hp_online_enter,
+				intel_cmt_hp_online_exit);
+	if (err)
+		goto rm_prep;
+
+	snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
+	str = kstrdup(scale, GFP_KERNEL);
+	if (!str) {
+		err = -ENOMEM;
+		goto rm_online;
+	}
+
+	return 0;
+
+rm_online:
+	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
+rm_prep:
+	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
+
+	return err;
+}
+
+static int __init intel_cmt_init(void)
+{
+	struct pkg_data *pkgd = NULL;
+	int err = 0;
+
+	if (!x86_match_cpu(intel_cmt_match)) {
+		err = -ENODEV;
+		goto err_exit;
+	}
+
+	err = cmt_alloc();
+	if (err)
+		goto err_exit;
+
+	err = cmt_start();
+	if (err)
+		goto err_dealloc;
+
+	pr_info("Intel CMT enabled with ");
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		pr_cont("%d RMIDs for pkg %d, ",
+			pkgd->max_rmid + 1, pkgd->pkgid);
+	}
+	rcu_read_unlock();
+	pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);
+
+	return err;
+
+err_dealloc:
+	cmt_dealloc();
+err_exit:
+	pr_err("Intel CMT registration failed with error: %d\n", err);
+	return err;
+}
+
+device_initcall(intel_cmt_init);
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
new file mode 100644
index 0000000..8c16797
--- /dev/null
+++ b/arch/x86/events/intel/cmt.h
@@ -0,0 +1,29 @@
+/*
+ * Intel Cache Monitoring Technology (CMT) support.
+ * (formerly Intel Cache QoS Monitoring, CQM)
+ *
+ *
+ * Locking
+ *
+ * One global cmt_mutex. One mutex and spin_lock per package.
+ * cmt_pkgs_data is RCU protected.
+ *
+ * Rules:
+ *  - cmt_mutex: Hold for CMT init/terminate, event init/terminate,
+ *  cgroup start/stop.
+ */
+
+/**
+ * struct pkg_data - Per-package CMT data.
+ *
+ * @work_cpu:			CPU to run rotation and other batch jobs.
+ *				It must be in the package associated to its
+ *				instance of pkg_data.
+ * @max_rmid:			Max rmid valid for CPUs in this package.
+ * @pkgid:			The logical package id for this pkgd.
+ */
+struct pkg_data {
+	unsigned int		work_cpu;
+	u32			max_rmid;
+	u16			pkgid;
+};
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 320a3be..604660a 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -11,6 +11,7 @@ enum cpuhp_state {
 	CPUHP_PERF_X86_UNCORE_PREP,
 	CPUHP_PERF_X86_AMD_UNCORE_PREP,
 	CPUHP_PERF_X86_RAPL_PREP,
+	CPUHP_PERF_X86_CMT_PREP,
 	CPUHP_PERF_BFIN,
 	CPUHP_PERF_POWER,
 	CPUHP_PERF_SUPERH,
@@ -110,6 +111,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_ONLINE,
 	CPUHP_AP_PERF_X86_AMD_POWER_ONLINE,
 	CPUHP_AP_PERF_X86_RAPL_ONLINE,
+	CPUHP_AP_PERF_X86_CMT_ONLINE,
 	CPUHP_AP_PERF_X86_CSTATE_ONLINE,
 	CPUHP_AP_PERF_S390_CF_ONLINE,
 	CPUHP_AP_PERF_S390_SF_ONLINE,
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (3 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 21:23   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
                   ` (40 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Per-package locks potentially reduce contention compared to the
system-wide approach of the previous CQM/CMT driver.

Lockdep needs lock_class_key instances to be statically initialized and/or
to use nesting, but nesting is currently hard-coded for up to 8 levels and
it is fragile to depend on lockdep internals.
To circumvent this problem, statically define CMT_MAX_NR_PKGS
lock_class_key instances, one per package.

Additional details in the code's comments.
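
For illustration (not part of the patch), the per-package keys allow code
that nests every pkgd->mutex, always in ascending pkgid order, to stay
lockdep-clean even with more than 8 packages; a later patch in this series
adds helpers that do exactly this (monr_hrchy_acquire_mutexes()). A sketch,
assuming cmt_mutex is held so the package array cannot change
(example_lock_all_pkgs() is a hypothetical name):

/* illustrative only; not in the patch */
static void example_lock_all_pkgs(void)
{
	struct pkg_data *pkgd = NULL;

	/* ascending pkgid order keeps the cross-package lock order consistent */
	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
		mutex_lock(&pkgd->mutex);

	/* ... modify state shared across all packages ... */

	pkgd = NULL;
	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
		mutex_unlock(&pkgd->mutex);
}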

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 22 ++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |  8 ++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 267a9ec..f12a06b 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -7,6 +7,14 @@
 #include "cmt.h"
 #include "../perf_event.h"
 
+/* Increase as needed as Intel CPUs grow. */
+#define CMT_MAX_NR_PKGS		8
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key	mutex_keys[CMT_MAX_NR_PKGS];
+static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
+#endif
+
 static DEFINE_MUTEX(cmt_mutex);
 
 static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
@@ -51,6 +59,12 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 	int numa_node = cpu_to_node(cpu);
 	u16 pkgid = topology_logical_package_id(cpu);
 
+	if (pkgid >= CMT_MAX_NR_PKGS) {
+		pr_err("CMT_MAX_NR_PKGS of %d is insufficient for logical packages.\n",
+		       CMT_MAX_NR_PKGS);
+		return ERR_PTR(-ENOSPC);
+	}
+
 	if (c->x86_cache_occ_scale != cmt_l3_scale) {
 		/* 0 scale must have been converted to 1 automatically. */
 		if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {
@@ -65,9 +79,17 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 
 	pkgd->max_rmid = c->x86_cache_max_rmid;
 
+	mutex_init(&pkgd->mutex);
+	raw_spin_lock_init(&pkgd->lock);
+
 	pkgd->work_cpu = cpu;
 	pkgd->pkgid = pkgid;
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_class(&pkgd->mutex, &mutex_keys[pkgid]);
+	lockdep_set_class(&pkgd->lock, &lock_keys[pkgid]);
+#endif
+
 	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);
 
 	return pkgd;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 8c16797..55416db 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -11,11 +11,16 @@
  * Rules:
  *  - cmt_mutex: Hold for CMT init/terminate, event init/terminate,
  *  cgroup start/stop.
+ *  - Hold pkgd->mutex and pkgd->lock in _all_ active packages to traverse or
+ *  change the monr hierarchy.
+ *  - pkgd->lock: Hold in current package to access that pkgd's members.
  */
 
 /**
  * struct pkg_data - Per-package CMT data.
  *
+ * @mutex:			Hold when modifying this pkg_data.
+ * @lock:			Hold to protect pmonrs in this pkg_data.
  * @work_cpu:			CPU to run rotation and other batch jobs.
  *				It must be in the package associated to its
  *				instance of pkg_data.
@@ -23,6 +28,9 @@
  * @pkgid:			The logical package id for this pkgd.
  */
 struct pkg_data {
+	struct mutex		mutex;
+	raw_spinlock_t		lock;
+
 	unsigned int		work_cpu;
 	u32			max_rmid;
 	u16			pkgid;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (4 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 21:27   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event David Carrillo-Cisneros
                   ` (39 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add the pmu, the llc_occupancy event attributes, and the functions for
event initialization.

The empty pmu callbacks will be filled in by future patches in this series.
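
For reference (not part of the patch), once the PMU is registered it should
show up in sysfs as "intel_cmt", and the llc_occupancy event (event=0x01 per
the format attribute above) can be opened from userspace as sketched below.
Reads only return meaningful occupancy once the empty callbacks are filled
in by later patches in this series:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	unsigned long long occupancy;
	FILE *f;
	int type, fd;

	/* PMU type id assigned by perf_pmu_register() */
	f = fopen("/sys/bus/event_source/devices/intel_cmt/type", "r");
	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = 0x01;	/* llc_occupancy */

	/* monitor the calling task on any CPU */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	if (read(fd, &occupancy, sizeof(occupancy)) == sizeof(occupancy))
		printf("llc_occupancy: %llu bytes\n", occupancy);
	close(fd);
	return 0;
}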

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 106 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f12a06b..0a24896 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -7,6 +7,9 @@
 #include "cmt.h"
 #include "../perf_event.h"
 
+#define QOS_L3_OCCUP_EVENT_ID	BIT_ULL(0)
+#define QOS_EVENT_MASK		QOS_L3_OCCUP_EVENT_ID
+
 /* Increase as needed as Intel CPUs grow. */
 #define CMT_MAX_NR_PKGS		8
 
@@ -46,6 +49,96 @@ static struct pkg_data *cmt_pkgs_data_next_rcu(struct pkg_data *pkgd)
 	return pkgd;
 }
 
+static struct pmu intel_cmt_pmu;
+
+static void intel_cmt_event_read(struct perf_event *event)
+{
+}
+
+static void intel_cmt_event_start(struct perf_event *event, int mode)
+{
+}
+
+static void intel_cmt_event_stop(struct perf_event *event, int mode)
+{
+}
+
+static int intel_cmt_event_add(struct perf_event *event, int mode)
+{
+	return 0;
+}
+
+static int intel_cmt_event_init(struct perf_event *event)
+{
+	int err = 0;
+
+	if (event->attr.type != intel_cmt_pmu.type)
+		return -ENOENT;
+	if (event->attr.config & ~QOS_EVENT_MASK)
+		return -EINVAL;
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    event->attr.inherit_stat   || /* cmt groups share rmid */
+	    event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	return err;
+}
+
+EVENT_ATTR_STR(llc_occupancy, intel_cmt_llc, "event=0x01");
+EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cmt_llc_pkg, "1");
+EVENT_ATTR_STR(llc_occupancy.unit, intel_cmt_llc_unit, "Bytes");
+EVENT_ATTR_STR(llc_occupancy.scale, intel_cmt_llc_scale, NULL);
+EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cmt_llc_snapshot, "1");
+
+static struct attribute *intel_cmt_events_attr[] = {
+	EVENT_PTR(intel_cmt_llc),
+	EVENT_PTR(intel_cmt_llc_pkg),
+	EVENT_PTR(intel_cmt_llc_unit),
+	EVENT_PTR(intel_cmt_llc_scale),
+	EVENT_PTR(intel_cmt_llc_snapshot),
+	NULL,
+};
+
+static struct attribute_group intel_cmt_events_group = {
+	.name = "events",
+	.attrs = intel_cmt_events_attr,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+static struct attribute *intel_cmt_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group intel_cmt_format_group = {
+	.name = "format",
+	.attrs = intel_cmt_formats_attr,
+};
+
+static const struct attribute_group *intel_cmt_attr_groups[] = {
+	&intel_cmt_events_group,
+	&intel_cmt_format_group,
+	NULL,
+};
+
+static struct pmu intel_cmt_pmu = {
+	.attr_groups	     = intel_cmt_attr_groups,
+	.task_ctx_nr	     = perf_sw_context,
+	.event_init	     = intel_cmt_event_init,
+	.add		     = intel_cmt_event_add,
+	.del		     = intel_cmt_event_stop,
+	.start		     = intel_cmt_event_start,
+	.stop		     = intel_cmt_event_stop,
+	.read		     = intel_cmt_event_read,
+};
+
 static void free_pkg_data(struct pkg_data *pkg_data)
 {
 	kfree(pkg_data);
@@ -199,6 +292,12 @@ static void cmt_dealloc(void)
 	cmt_pkgs_data = NULL;
 }
 
+static void cmt_stop(void)
+{
+	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
+	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
+}
+
 static int __init cmt_alloc(void)
 {
 	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -240,6 +339,7 @@ static int __init cmt_start(void)
 		err = -ENOMEM;
 		goto rm_online;
 	}
+	event_attr_intel_cmt_llc_scale.event_str = str;
 
 	return 0;
 
@@ -269,6 +369,10 @@ static int __init intel_cmt_init(void)
 	if (err)
 		goto err_dealloc;
 
+	err = perf_pmu_register(&intel_cmt_pmu, "intel_cmt", -1);
+	if (err)
+		goto err_stop;
+
 	pr_info("Intel CMT enabled with ");
 	rcu_read_lock();
 	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
@@ -280,6 +384,8 @@ static int __init intel_cmt_init(void)
 
 	return err;
 
+err_stop:
+	cmt_stop();
 err_dealloc:
 	cmt_dealloc();
 err_exit:
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (5 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
                   ` (38 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add attributes to hw_perf_event that are required by CMT and,
in the future, MBM.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 345ec20..0202b32 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -139,6 +139,12 @@ struct hw_perf_event {
 			/* for tp_event->class */
 			struct list_head	tp_list;
 		};
+#ifdef CONFIG_INTEL_RDT_M
+		struct { /* intel_cmt */
+			void			*cmt_monr;
+			struct list_head	cmt_list;
+		};
+#endif
 		struct { /* itrace */
 			int			itrace_started;
 		};
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (6 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-11-10 23:09   ` Thomas Gleixner
  2016-10-30  0:38 ` [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy David Carrillo-Cisneros
                   ` (37 subsequent siblings)
  45 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Perf events use monrs (MONitored Resources) to monitor CMT events. Events
that share a monitoring target (the same thread or cgroup) share a monr.

This patch introduces monrs and adds support for monr creation/destruction.

An event's associated monr is referenced by event->cmt_monr (introduced
in the previous patch).

monr->mon_events references the first event that uses that monr; events
that share the monr are appended to the first event's cmt_list list head.

Hold all pkgd->mutexes to modify monr->mon_events and the event's monr data.

Support for CPU and cgroup events is added in future patches in this series.

More details in the code's comments.
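
To make the list layout concrete, here is an illustrative walk (not part of
the patch) over all events that share a monr; it assumes cmt_mutex, or all
pkgd->mutexes, are held as described above, and monr_print_events() is a
hypothetical name:

/* illustrative only; not in the patch */
static void monr_print_events(struct monr *monr)
{
	struct perf_event *head = monr->mon_events, *pos;

	if (!head)
		return;

	/* the head event first, then the events chained on its hw.cmt_list */
	pr_debug("head event %p uses monr %p\n", head, monr);
	list_for_each_entry(pos, &head->hw.cmt_list, hw.cmt_list)
		pr_debug("event %p shares monr %p\n", pos, monr);
}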

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 228 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |  20 ++++
 2 files changed, 248 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 0a24896..23606a7 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -19,6 +19,8 @@ static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
 #endif
 
 static DEFINE_MUTEX(cmt_mutex);
+/* List of monrs that are associated with an event. */
+static LIST_HEAD(cmt_event_monrs);
 
 static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
 
@@ -49,8 +51,210 @@ static struct pkg_data *cmt_pkgs_data_next_rcu(struct pkg_data *pkgd)
 	return pkgd;
 }
 
+/*
+ * Functions to lock/unlock/assert all per-package mutexes/locks at once.
+ */
+
+static void monr_hrchy_acquire_mutexes(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	/* RCU protected by cmt_mutex. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		mutex_lock(&pkgd->mutex);
+}
+
+static void monr_hrchy_release_mutexes(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	/* RCU protected by cmt_mutex. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		mutex_unlock(&pkgd->mutex);
+}
+
+static void monr_hrchy_assert_held_mutexes(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	/* RCU protected by cmt_mutex. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		lockdep_assert_held(&pkgd->mutex);
+}
+
+static void monr_dealloc(struct monr *monr)
+{
+	kfree(monr);
+}
+
+static struct monr *monr_alloc(void)
+{
+	struct monr *monr;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	monr = kzalloc(sizeof(*monr), GFP_KERNEL);
+	if (!monr)
+		return ERR_PTR(-ENOMEM);
+
+	return monr;
+}
+
+static inline struct monr *monr_from_event(struct perf_event *event)
+{
+	return (struct monr *) READ_ONCE(event->hw.cmt_monr);
+}
+
+static struct monr *monr_remove_event(struct perf_event *event)
+{
+	struct monr *monr = monr_from_event(event);
+
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_assert_held_mutexes();
+
+	if (list_empty(&monr->mon_events->hw.cmt_list)) {
+		monr->mon_events = NULL;
+		/* remove from cmt_event_monrs */
+		list_del_init(&monr->entry);
+	} else {
+		if (monr->mon_events == event)
+			monr->mon_events = list_next_entry(event, hw.cmt_list);
+		list_del_init(&event->hw.cmt_list);
+	}
+
+	WRITE_ONCE(event->hw.cmt_monr, NULL);
+
+	return monr;
+}
+
+static int monr_append_event(struct monr *monr, struct perf_event *event)
+{
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_assert_held_mutexes();
+
+	if (monr->mon_events) {
+		list_add_tail(&event->hw.cmt_list,
+			      &monr->mon_events->hw.cmt_list);
+	} else {
+		monr->mon_events = event;
+		list_add_tail(&monr->entry, &cmt_event_monrs);
+	}
+
+	WRITE_ONCE(event->hw.cmt_monr, monr);
+
+	return 0;
+}
+
+static bool is_cgroup_event(struct perf_event *event)
+{
+	return false;
+}
+
+static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
+{
+	return -EPERM;
+}
+
+static int monr_hrchy_attach_cpu_event(struct perf_event *event)
+{
+	return -EPERM;
+}
+
+static int monr_hrchy_attach_task_event(struct perf_event *event)
+{
+	struct monr *monr;
+	int err;
+
+	monr = monr_alloc();
+	if (IS_ERR(monr))
+		return -ENOMEM;
+
+	err = monr_append_event(monr, event);
+	if (err)
+		monr_dealloc(monr);
+	return err;
+}
+
+/* Insert or create monr in appropriate position in hierarchy. */
+static int monr_hrchy_attach_event(struct perf_event *event)
+{
+	int err = 0;
+
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+
+	if (!is_cgroup_event(event) &&
+	    !(event->attach_state & PERF_ATTACH_TASK)) {
+		err = monr_hrchy_attach_cpu_event(event);
+		goto exit;
+	}
+	if (is_cgroup_event(event)) {
+		err = monr_hrchy_attach_cgroup_event(event);
+		goto exit;
+	}
+	err = monr_hrchy_attach_task_event(event);
+exit:
+	monr_hrchy_release_mutexes();
+
+	return err;
+}
+
+/**
+ * __match_event() - Determine if @a and @b should share a rmid.
+ */
+static bool __match_event(struct perf_event *a, struct perf_event *b)
+{
+	/* Cgroup/non-task per-cpu and task events don't mix */
+	if ((a->attach_state & PERF_ATTACH_TASK) !=
+	    (b->attach_state & PERF_ATTACH_TASK))
+		return false;
+
+#ifdef CONFIG_CGROUP_PERF
+	if (a->cgrp != b->cgrp)
+		return false;
+#endif
+
+	/* If not a task event, it's a cgroup or a non-task cpu event. */
+	if (!(b->attach_state & PERF_ATTACH_TASK))
+		return true;
+
+	/* Events that target the same task are placed into the same group. */
+	if (a->hw.target == b->hw.target)
+		return true;
+
+	/* Are we an inherited event? */
+	if (b->parent == a)
+		return true;
+
+	return false;
+}
+
 static struct pmu intel_cmt_pmu;
 
+/* Try to find a monr with the same target, otherwise create a new one. */
+static int mon_group_setup_event(struct perf_event *event)
+{
+	struct monr *monr;
+	int err;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	list_for_each_entry(monr, &cmt_event_monrs, entry) {
+		if (!__match_event(monr->mon_events, event))
+			continue;
+		monr_hrchy_acquire_mutexes();
+		err = monr_append_event(monr, event);
+		monr_hrchy_release_mutexes();
+		return err;
+	}
+	/*
+	 * Since no match was found, create a new monr and set this
+	 * event as head of a mon_group. All events in this group
+	 * will share the monr.
+	 */
+	return monr_hrchy_attach_event(event);
+}
+
 static void intel_cmt_event_read(struct perf_event *event)
 {
 }
@@ -68,6 +272,20 @@ static int intel_cmt_event_add(struct perf_event *event, int mode)
 	return 0;
 }
 
+static void intel_cmt_event_destroy(struct perf_event *event)
+{
+	struct monr *monr;
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+
+	/* monr is detached from the event. */
+	monr = monr_remove_event(event);
+
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+}
+
 static int intel_cmt_event_init(struct perf_event *event)
 {
 	int err = 0;
@@ -88,6 +306,16 @@ static int intel_cmt_event_init(struct perf_event *event)
 	    event->attr.sample_period) /* no sampling */
 		return -EINVAL;
 
+	event->destroy = intel_cmt_event_destroy;
+
+	INIT_LIST_HEAD(&event->hw.cmt_list);
+
+	mutex_lock(&cmt_mutex);
+
+	err = mon_group_setup_event(event);
+
+	mutex_unlock(&cmt_mutex);
+
 	return err;
 }
 
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 55416db..0ce5d4d 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -3,6 +3,12 @@
  * (formerly Intel Cache QoS Monitoring, CQM)
  *
  *
+ * A "Monitored Resource" (monr) is the entity monitored by CMT and MBM.
+ * In order to monitor a cgroup and/or thread, it must be associated with
+ * a monr. A monr is active on a CPU when a thread that is associated with
+ * it (either directly or through a cgroup) is scheduled in it.
+ *
+ *
  * Locking
  *
  * One global cmt_mutex. One mutex and spin_lock per package.
@@ -35,3 +41,17 @@ struct pkg_data {
 	u32			max_rmid;
 	u16			pkgid;
 };
+
+/**
+ * struct monr - MONitored Resource.
+ * @mon_events:		The head of event's group that use this monr, if any.
+ * @entry:		List entry into cmt_event_monrs.
+ *
+ * A monr is assigned to every CMT event and/or monitored cgroup when
+ * monitoring is activated and that instance's address does not change during
+ * the lifetime of the event or cgroup.
+ */
+struct monr {
+	struct perf_event		*mon_events;
+	struct list_head		entry;
+};
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (7 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization David Carrillo-Cisneros
                   ` (36 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add the root of the monr hierarchy and auxiliary functions for locking.

Also, add support for attaching CPU and task events to the monr hierarchy.
As of this patch, both types of events always use the root monr (this
will change when cgroups are introduced later in this series).

More details in the code's comments.
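
As a small illustration (not part of the patch) of the parent/children
links added here, the direct children of a monr can be walked as below,
with the per-package locks held as the locking rules in cmt.h require;
monr_print_children() is a hypothetical name:

/* illustrative only; not in the patch */
static void monr_print_children(struct monr *parent)
{
	struct monr *child;

	list_for_each_entry(child, &parent->children, parent_entry)
		pr_debug("monr %p is a child of %p\n", child, parent);
}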

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/cmt.h | 26 +++++++++++++++
 2 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 23606a7..39f4bfa 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -26,6 +26,9 @@ static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
 
 static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
 
+/* Root for system-wide hierarchy of MONitored Resources (monr). */
+static struct monr *monr_hrchy_root;
+
 /* Array of packages (array of pkgds). It's protected by RCU or cmt_mutex. */
 static struct pkg_data **cmt_pkgs_data;
 
@@ -82,6 +85,24 @@ static void monr_hrchy_assert_held_mutexes(void)
 		lockdep_assert_held(&pkgd->mutex);
 }
 
+static void monr_hrchy_acquire_locks(unsigned long *flags)
+{
+	struct pkg_data *pkgd = NULL;
+
+	raw_local_irq_save(*flags);
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		raw_spin_lock(&pkgd->lock);
+}
+
+static void monr_hrchy_release_locks(unsigned long *flags)
+{
+	struct pkg_data *pkgd = NULL;
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		raw_spin_unlock(&pkgd->lock);
+	raw_local_irq_restore(*flags);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	kfree(monr);
@@ -97,6 +118,10 @@ static struct monr *monr_alloc(void)
 	if (!monr)
 		return ERR_PTR(-ENOMEM);
 
+	INIT_LIST_HEAD(&monr->entry);
+	INIT_LIST_HEAD(&monr->children);
+	INIT_LIST_HEAD(&monr->parent_entry);
+
 	return monr;
 }
 
@@ -145,6 +170,26 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 	return 0;
 }
 
+static void monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
+{
+	unsigned long flags;
+
+	monr_hrchy_acquire_locks(&flags);
+	list_add_tail(&monr->parent_entry, &parent->children);
+	monr->parent = parent;
+	monr_hrchy_release_locks(&flags);
+}
+
+static void monr_hrchy_remove_leaf(struct monr *monr)
+{
+	unsigned long flags;
+
+	monr_hrchy_acquire_locks(&flags);
+	list_del_init(&monr->parent_entry);
+	monr->parent = NULL;
+	monr_hrchy_release_locks(&flags);
+}
+
 static bool is_cgroup_event(struct perf_event *event)
 {
 	return false;
@@ -155,20 +200,32 @@ static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
 	return -EPERM;
 }
 
+/*
+ * This non-cgroup version creates a two-level hierarchy: the root and a
+ * first level with all event monrs underneath it.
+ */
+static struct monr *monr_hrchy_get_monr_parent(struct perf_event *event)
+{
+	return monr_hrchy_root;
+}
+
 static int monr_hrchy_attach_cpu_event(struct perf_event *event)
 {
-	return -EPERM;
+	return monr_append_event(monr_hrchy_root, event);
 }
 
 static int monr_hrchy_attach_task_event(struct perf_event *event)
 {
-	struct monr *monr;
+	struct monr *monr_parent, *monr;
 	int err;
 
+	monr_parent = monr_hrchy_get_monr_parent(event);
 	monr = monr_alloc();
 	if (IS_ERR(monr))
 		return -ENOMEM;
 
+	monr_hrchy_insert_leaf(monr, monr_parent);
+
 	err = monr_append_event(monr, event);
 	if (err)
 		monr_dealloc(monr);
@@ -199,6 +256,12 @@ static int monr_hrchy_attach_event(struct perf_event *event)
 	return err;
 }
 
+static void monr_destroy(struct monr *monr)
+{
+	monr_hrchy_remove_leaf(monr);
+	monr_dealloc(monr);
+}
+
 /**
  * __match_event() - Determine if @a and @b should share a rmid.
  */
@@ -281,6 +344,7 @@ static void intel_cmt_event_destroy(struct perf_event *event)
 
 	/* monr is dettached from event. */
 	monr = monr_remove_event(event);
+	monr_destroy(monr);
 
 	monr_hrchy_release_mutexes();
 	mutex_unlock(&cmt_mutex);
@@ -516,6 +580,9 @@ static const struct x86_cpu_id intel_cmt_match[] = {
 
 static void cmt_dealloc(void)
 {
+	kfree(monr_hrchy_root);
+	monr_hrchy_root = NULL;
+
 	kfree(cmt_pkgs_data);
 	cmt_pkgs_data = NULL;
 }
@@ -537,6 +604,12 @@ static int __init cmt_alloc(void)
 	if (!cmt_pkgs_data)
 		return -ENOMEM;
 
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_root = monr_alloc();
+	mutex_unlock(&cmt_mutex);
+	if (IS_ERR(monr_hrchy_root))
+		return PTR_ERR(monr_hrchy_root);
+
 	return 0;
 }
 
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 0ce5d4d..46e8335 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -7,6 +7,25 @@
  * In order to monitor a cgroup and/or thread, it must be associated with
  * a monr. A monr is active on a CPU when a thread that is associated with
  * it (either directly or through a cgroup) is scheduled in it.
+ * The monrs are organized in a tree hierarchy named "monr hierarchy". It
+ * captures the dependencies between the monitored entities, e.g.:
+ *
+ *	   cgroup hierarchy		        monr hierarchy
+ *------------------------------------------------------------------------
+ *          root cgroup                           root monr
+ *       (always monitored)                        /      \
+ *       /                \                     monr A    monr B1
+ *  cgroup A             cgroup B                 |
+ * (monitored)        (not monitored)           monr A1
+ *      |              /          \
+ *   task A1       task B1       task B2
+ * (monitored)   (monitored)  (not monitored)
+ *
+ *
+ * This driver maintains the monr hierarchy separately from the cgroup
+ * hierarchy in order to reduce the need for synchronization between the two
+ * and to make it possible to capture dependencies between threads in the same
+ * cgroup or process.
  *
  *
  * Locking
@@ -46,6 +65,9 @@ struct pkg_data {
  * struct monr - MONitored Resource.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
+ * @parent:		Parent in monr hierarchy.
+ * @children:		List of children in monr hierarchy.
+ * @parent_entry:	Entry in parent's children list.
  *
  * A monr is assigned to every CMT event and/or monitored cgroup when
  * monitoring is activated and that instance's address does not change during
  * the lifetime of the event or cgroup.
 struct monr {
 	struct perf_event		*mon_events;
 	struct list_head		entry;
+
+	struct monr			*parent;
+	struct list_head		children;
+	struct list_head		parent_entry;
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (8 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr David Carrillo-Cisneros
                   ` (35 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A pmonr is the per-package component of a monr. This patch only adds
initialization/destruction of pmonrs. Future patches explain their
usage and add functionality.

CPU hotplug is supported by initializing/terminating all pmonrs in the
monr hierarchy when the first/last CPU in a package goes online/offline.
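
For illustration only (not part of the patch), a monr's pmonr for a given
package is reached through the RCU protected pmonrs array, in the same way
pkgd_pmonr() does below; a minimal sketch that checks whether a monr
currently has a pmonr on the package of @cpu (monr_has_pkg() is a
hypothetical name):

/* illustrative only; not in the patch */
static bool monr_has_pkg(struct monr *monr, int cpu)
{
	u16 pkgid = topology_logical_package_id(cpu);
	bool present;

	/* pmonrs[pkgid] is NULL while the package is offline */
	rcu_read_lock();
	present = rcu_dereference(monr->pmonrs[pkgid]) != NULL;
	rcu_read_unlock();

	return present;
}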

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 161 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h |  20 +++++-
 2 files changed, 177 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 39f4bfa..06e6325 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -103,13 +103,49 @@ static void monr_hrchy_release_locks(unsigned long *flags)
 	raw_local_irq_restore(*flags);
 }
 
+static inline struct pmonr *pkgd_pmonr(struct pkg_data *pkgd, struct monr *monr)
+{
+#ifdef CONFIG_LOCKDEP
+	bool safe = lockdep_is_held(&cmt_mutex) ||
+		    lockdep_is_held(&pkgd->lock) ||
+		    rcu_read_lock_held();
+#endif
+
+	return rcu_dereference_check(monr->pmonrs[pkgd->pkgid], safe);
+}
+
+static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
+{
+	struct pmonr *pmonr;
+	int cpu_node = cpu_to_node(pkgd->work_cpu);
+
+	pmonr = kzalloc_node(sizeof(*pmonr), GFP_KERNEL, cpu_node);
+	if (!pmonr)
+		return ERR_PTR(-ENOMEM);
+
+	pmonr->pkgd = pkgd;
+
+	return pmonr;
+}
+
 static void monr_dealloc(struct monr *monr)
 {
+	u16 p, nr_pkgs = topology_max_packages();
+
+	for (p = 0; p < nr_pkgs; p++) {
+		/* out of monr_hrchy, so no need for rcu or lock protection. */
+		if (!monr->pmonrs[p])
+			continue;
+		kfree(monr->pmonrs[p]);
+	}
 	kfree(monr);
 }
 
+/* Alloc monr with all pmonrs in Off state. */
 static struct monr *monr_alloc(void)
 {
+	struct pkg_data *pkgd = NULL;
+	struct pmonr *pmonr;
 	struct monr *monr;
 
 	lockdep_assert_held(&cmt_mutex);
@@ -122,6 +158,28 @@ static struct monr *monr_alloc(void)
 	INIT_LIST_HEAD(&monr->children);
 	INIT_LIST_HEAD(&monr->parent_entry);
 
+	monr->pmonrs = kcalloc(topology_max_packages(),
+			       sizeof(pmonr), GFP_KERNEL);
+	if (!monr->pmonrs) {
+		monr_dealloc(monr);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Do not create pmonrs for uninitialized packages.
+	 * Protected from initialization of new pkgs by cmt_mutex.
+	 */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		pmonr = pmonr_alloc(pkgd);
+		if (IS_ERR(pmonr)) {
+			monr_dealloc(monr);
+			return ERR_CAST(pmonr);
+		}
+		pmonr->monr = monr;
+		/* safe to assign since pmonr is not in monr_hrchy. */
+		RCU_INIT_POINTER(monr->pmonrs[pkgd->pkgid], pmonr);
+	}
+
 	return monr;
 }
 
@@ -318,6 +376,69 @@ static int mon_group_setup_event(struct perf_event *event)
 	return monr_hrchy_attach_event(event);
 }
 
+static struct monr *monr_next_child(struct monr *pos, struct monr *parent)
+{
+	if (!pos)
+		return list_first_entry_or_null(
+			&parent->children, struct monr, parent_entry);
+	if (list_is_last(&pos->parent_entry, &parent->children))
+		return NULL;
+
+	return list_next_entry(pos, parent_entry);
+}
+
+static struct monr *monr_next_descendant_pre(struct monr *pos,
+					     struct monr *root)
+{
+	struct monr *next;
+
+	if (!pos)
+		return root;
+
+	next = monr_next_child(NULL, pos);
+	if (next)
+		return next;
+
+	while (pos != root) {
+		next = monr_next_child(pos, pos->parent);
+		if (next)
+			return next;
+		pos = pos->parent;
+	}
+
+	return NULL;
+}
+
+static struct monr *monr_leftmost_descendant(struct monr *pos)
+{
+	struct monr *last;
+
+	do {
+		last = pos;
+		pos = monr_next_child(NULL, pos);
+	} while (pos);
+
+	return last;
+}
+
+static struct monr *monr_next_descendant_post(struct monr *pos,
+					      struct monr *root)
+{
+	struct monr *next;
+
+	if (!pos)
+		return monr_leftmost_descendant(root);
+
+	if (pos == root)
+		return NULL;
+
+	next = monr_next_child(pos, pos->parent);
+	if (next)
+		return monr_leftmost_descendant(next);
+
+	return pos->parent;
+}
+
 static void intel_cmt_event_read(struct perf_event *event)
 {
 }
@@ -482,14 +603,29 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 
 static void __terminate_pkg_data(struct pkg_data *pkgd)
 {
+	struct monr *pos = NULL;
+	unsigned long flags;
+
 	lockdep_assert_held(&cmt_mutex);
 
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	/* post-order traversal guarantees pos to be leaf of monr hierarchy. */
+	while ((pos = monr_next_descendant_post(pos, monr_hrchy_root)))
+		RCU_INIT_POINTER(pos->pmonrs[pkgd->pkgid], NULL);
+
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	synchronize_rcu();
+
 	free_pkg_data(pkgd);
 }
 
 static int init_pkg_data(int cpu)
 {
+	struct monr *pos = NULL;
 	struct pkg_data *pkgd;
+	struct pmonr *pmonr;
+	int err = 0;
 	u16 pkgid = topology_logical_package_id(cpu);
 
 	lockdep_assert_held(&cmt_mutex);
@@ -502,10 +638,28 @@ static int init_pkg_data(int cpu)
 	if (IS_ERR(pkgd))
 		return PTR_ERR(pkgd);
 
-	rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
-	synchronize_rcu();
+	while ((pos = monr_next_descendant_pre(pos, monr_hrchy_root))) {
+		pmonr = pmonr_alloc(pkgd);
+		if (IS_ERR(pmonr)) {
+			err = PTR_ERR(pmonr);
+			break;
+		}
+		pmonr->monr = pos;
+		/*
+		 * No need to protect pmonrs since this pkgd is
+		 * not set in cmt_pkgs_data yet.
+		 */
+		RCU_INIT_POINTER(pos->pmonrs[pkgid], pmonr);
+	}
 
-	return 0;
+	if (err) {
+		__terminate_pkg_data(pkgd);
+	} else {
+		rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
+		synchronize_rcu();
+	}
+
+	return err;
 }
 
 static int intel_cmt_hp_online_enter(unsigned int cpu)
@@ -604,6 +758,7 @@ static int __init cmt_alloc(void)
 	if (!cmt_pkgs_data)
 		return -ENOMEM;
 
+	/* won't alloc any pmonrs since no cmt_pkgs_data entry is initialized yet. */
 	mutex_lock(&cmt_mutex);
 	monr_hrchy_root = monr_alloc();
 	mutex_unlock(&cmt_mutex);
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 46e8335..7f3a7b8 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -27,6 +27,9 @@
  * and to make possible to capture dependencies between threads in the same
  * cgroup or process.
  *
+ * Each monr has a package monr (pmonr) for each package with at least one
+ * online cpu. The pmonr handles the CMT and MBM monitoring within its package.
+ *
  *
  * Locking
  *
@@ -38,8 +41,19 @@
  *  cgroup start/stop.
  *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
  *  change the monr hierarchy.
- *  - pkgd->lock: Hold in current package to access that pkgd's members.
+ *  - pkgd->lock: Hold in current package to access that pkgd's members. Hold
+ *  a pmonr's package pkgd->lock for non-atomic access to pmonr.
+ */
+
+/**
+ * struct pmonr - per-package component of MONitored Resources (monr).
+ * @monr:		The monr that contains this pmonr.
+ * @pkgd:		The package data associated with this pmonr.
  */
+struct pmonr {
+	struct monr				*monr;
+	struct pkg_data				*pkgd;
+};
 
 /**
  * struct pkg_data - Per-package CMT data.
@@ -65,6 +79,7 @@ struct pkg_data {
  * struct monr - MONitored Resource.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
+ * @pmonrs:		Per-package pmonrs.
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
@@ -72,10 +87,13 @@ struct pkg_data {
  * A monr is assigned to every CMT event and/or monitored cgroup when
  * monitoring is activated and that instance's address does not change during
  * the lifetime of the event or cgroup.
+ *
+ * On initialization, all of a monr's pmonrs start in the Off state.
  */
 struct monr {
 	struct perf_event		*mon_events;
 	struct list_head		entry;
+	struct pmonr			**pmonrs;
 
 	struct monr			*parent;
 	struct list_head		children;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (9 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools David Carrillo-Cisneros
                   ` (34 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

uflags allow users to signal special behavior for a pmonr. This patch
series introduces two uflags that provide new behavior and are relevant
to users:
  1) CMT_UF_NOLAZY_RMID: signal that rmids must be reserved immediately.
  2) CMT_UF_NOSTEAL_RMID: rmids cannot be stolen.

A monr maintains one cmt_user_flags field at "monr level" and a set of
"package level" ones, one per possible hardware package.

The effective uflags for a pmonr are the OR of its monr-level uflags and
the package-level uflags of the pmonr's pkgd.

A user passes uflags for all pmonrs in an event's monr by setting them
in the perf_event_attr::config1 field. In future patches in this series,
users could specify per package uflags through attributes in the
perf cgroup fs.

This patch only introduces the infrastructure to maintain uflags and the
first uflag, CMT_UF_HAS_USER, which marks monrs and pmonrs as in use by
a cgroup or event. This flag is special because it is always taken as set
for a perf event, regardless of the value in event->attr.config1.
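
As an aside (not part of the patch), the effective-uflags rule above can be
pictured with a small stand-alone C sketch. All names and the package count
below are illustrative assumptions, not the kernel code:

	#include <stdio.h>

	enum cmt_user_flags {
		CMT_UF_HAS_USER = 1 << 0,	/* has cgroup or event users */
	};

	#define NR_PKGS 2	/* assumed number of packages for the sketch */

	struct monr_sketch {
		enum cmt_user_flags uflags;		 /* monr level */
		enum cmt_user_flags pkg_uflags[NR_PKGS]; /* package level */
	};

	/* Effective uflags for the pmonr of @m in package @pkgid. */
	static enum cmt_user_flags effective_uflags(struct monr_sketch *m,
						    int pkgid)
	{
		return m->uflags | m->pkg_uflags[pkgid];
	}

	int main(void)
	{
		struct monr_sketch m = {
			.uflags = 0,
			.pkg_uflags = { CMT_UF_HAS_USER, 0 },
		};

		/* only package 0 is marked as having a user. */
		printf("pkg0=%#x pkg1=%#x\n",
		       effective_uflags(&m, 0), effective_uflags(&m, 1));
		return 0;
	}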

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 166 ++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/cmt.h |  18 +++++
 2 files changed, 180 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 06e6325..07560e5 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -29,6 +29,13 @@ static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
 /* Root for system-wide hierarchy of MONitored Resources (monr). */
 static struct monr *monr_hrchy_root;
 
+/* Flags for root monr and all its pmonrs while being monitored. */
+static enum cmt_user_flags root_monr_uflags = CMT_UF_HAS_USER;
+
+/* Auxiliary flags */
+static enum cmt_user_flags *pkg_uflags_zeroes;
+static size_t pkg_uflags_size;
+
 /* Array of packages (array of pkgds). It's protected by RCU or cmt_mutex. */
 static struct pkg_data **cmt_pkgs_data;
 
@@ -128,10 +135,19 @@ static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 	return pmonr;
 }
 
+static inline bool monr_is_root(struct monr *monr)
+{
+	return monr_hrchy_root == monr;
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
 
+	if (WARN_ON_ONCE(monr->nr_has_user) ||
+	    WARN_ON_ONCE(monr->mon_events))
+		return;
+
 	for (p = 0; p < nr_pkgs; p++) {
 		/* out of monr_hrchy, so no need for rcu or lock protection. */
 		if (!monr->pmonrs[p])
@@ -150,7 +166,8 @@ static struct monr *monr_alloc(void)
 
 	lockdep_assert_held(&cmt_mutex);
 
-	monr = kzalloc(sizeof(*monr), GFP_KERNEL);
+	/* Extra space for pkg_uflags. */
+	monr = kzalloc(sizeof(*monr) + pkg_uflags_size, GFP_KERNEL);
 	if (!monr)
 		return ERR_PTR(-ENOMEM);
 
@@ -183,14 +200,118 @@ static struct monr *monr_alloc(void)
 	return monr;
 }
 
+static enum cmt_user_flags pmonr_uflags(struct pmonr *pmonr)
+{
+	struct monr *monr = pmonr->monr;
+
+	return monr->uflags | monr->pkg_uflags[pmonr->pkgd->pkgid];
+}
+
+static int __pmonr_apply_uflags(struct pmonr *pmonr,
+		enum cmt_user_flags pmonr_uflags)
+{
+	if (monr_is_root(pmonr->monr) && (~pmonr_uflags & root_monr_uflags))
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool pkg_uflags_has_user(enum cmt_user_flags *uflags)
+{
+	int p, nr_pkgs = topology_max_packages();
+
+	for (p = 0; p < nr_pkgs; p++)
+		if (uflags[p] & CMT_UF_HAS_USER)
+			return true;
+	return false;
+}
+
+static bool monr_has_user(struct monr *monr)
+{
+	return monr->uflags & CMT_UF_HAS_USER ||
+	       pkg_uflags_has_user(monr->pkg_uflags);
+}
+
+static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
+{
+	enum cmt_user_flags pmonr_uflags;
+	struct pkg_data *pkgd = NULL;
+	struct pmonr *pmonr;
+	int p, err;
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		p = pkgd->pkgid;
+		pmonr_uflags = monr->uflags |
+				(puflags ? puflags[p] : monr->pkg_uflags[p]);
+		pmonr = pkgd_pmonr(pkgd, monr);
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+/* Apply puflags for all packages or rollback and fail. */
+static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
+{
+	int p, err;
+	unsigned long flags;
+
+	monr_hrchy_assert_held_mutexes();
+	monr_hrchy_acquire_locks(&flags);
+
+	err = __monr_apply_uflags(monr, puflags);
+	if (err)
+		goto exit;
+
+	/* Proceed to exit if there are no uflags to store in pkg_uflags. */
+	if (!puflags)
+		goto exit;
+
+	/*
+	 * Now that we've succeeded in applying puflags to online packages,
+	 * store the new puflags in all packages, even those not online. It's
+	 * up to CPU hotplug to apply the pkg_uflags in oncoming packages.
+	 */
+	for (p = 0; p < topology_max_packages(); p++)
+		monr->pkg_uflags[p] = puflags[p];
+
+exit:
+	monr_hrchy_release_locks(&flags);
+
+	return err;
+}
+
 static inline struct monr *monr_from_event(struct perf_event *event)
 {
 	return (struct monr *) READ_ONCE(event->hw.cmt_monr);
 }
 
+static enum cmt_user_flags uflags_from_event(struct perf_event *event)
+{
+	return event->attr.config1 | CMT_UF_HAS_USER;
+}
+
+/* Return true if monr uflags changed, false otherwise. */
+static bool monr_account_uflags(struct monr *monr,
+				enum cmt_user_flags uflags, bool account)
+{
+	enum cmt_user_flags old_flags = monr->uflags;
+
+	if (uflags & CMT_UF_HAS_USER)
+		monr->nr_has_user += account ? 1 : -1;
+
+	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0);
+
+	return old_flags != monr->uflags;
+}
+
 static struct monr *monr_remove_event(struct perf_event *event)
 {
 	struct monr *monr = monr_from_event(event);
+	enum cmt_user_flags uflags = uflags_from_event(event);
+	int err;
 
 	lockdep_assert_held(&cmt_mutex);
 	monr_hrchy_assert_held_mutexes();
@@ -207,11 +328,23 @@ static struct monr *monr_remove_event(struct perf_event *event)
 
 	WRITE_ONCE(event->hw.cmt_monr, NULL);
 
+	if (monr_account_uflags(monr, uflags, false)) {
+		/*
+		 * Undo flags on error; this cannot fail since flags require
+		 * rmids and fewer flags mean fewer rmids required.
+		 */
+		err = monr_apply_uflags(monr, NULL);
+		WARN_ON_ONCE(err);
+	}
+
 	return monr;
 }
 
 static int monr_append_event(struct monr *monr, struct perf_event *event)
 {
+	enum cmt_user_flags uflags = uflags_from_event(event);
+	int err;
+
 	lockdep_assert_held(&cmt_mutex);
 	monr_hrchy_assert_held_mutexes();
 
@@ -225,7 +358,14 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 
 	WRITE_ONCE(event->hw.cmt_monr, monr);
 
-	return 0;
+	if (!monr_account_uflags(monr, uflags, true))
+		return 0;
+
+	err = monr_apply_uflags(monr, NULL);
+	if (err)
+		monr_remove_event(event);
+
+	return err;
 }
 
 static void monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
@@ -465,7 +605,8 @@ static void intel_cmt_event_destroy(struct perf_event *event)
 
+	/* monr is detached from event. */
 	monr = monr_remove_event(event);
-	monr_destroy(monr);
+	if (!monr_has_user(monr))
+		monr_destroy(monr);
 
 	monr_hrchy_release_mutexes();
 	mutex_unlock(&cmt_mutex);
@@ -625,6 +766,7 @@ static int init_pkg_data(int cpu)
 	struct monr *pos = NULL;
 	struct pkg_data *pkgd;
 	struct pmonr *pmonr;
+	unsigned long flags;
 	int err = 0;
 	u16 pkgid = topology_logical_package_id(cpu);
 
@@ -650,6 +792,10 @@ static int init_pkg_data(int cpu)
 		 * not set in cmt_pkgs_data yet.
 		 */
 		RCU_INIT_POINTER(pos->pmonrs[pkgid], pmonr);
+
+		raw_spin_lock_irqsave(&pkgd->lock, flags);
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr));
+		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
 	}
 
 	if (err) {
@@ -739,6 +885,9 @@ static void cmt_dealloc(void)
 
 	kfree(cmt_pkgs_data);
 	cmt_pkgs_data = NULL;
+
+	kfree(pkg_uflags_zeroes);
+	pkg_uflags_zeroes = NULL;
 }
 
 static void cmt_stop(void)
@@ -749,6 +898,11 @@ static void cmt_stop(void)
 
 static int __init cmt_alloc(void)
 {
+	pkg_uflags_size = sizeof(*pkg_uflags_zeroes) * topology_max_packages();
+	pkg_uflags_zeroes = kzalloc(pkg_uflags_size, GFP_KERNEL);
+	if (!pkg_uflags_zeroes)
+		return -ENOMEM;
+
 	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
 	if (cmt_l3_scale == 0)
 		cmt_l3_scale = 1;
@@ -771,7 +925,11 @@ static int __init cmt_alloc(void)
 static int __init cmt_start(void)
 {
 	char *str, scale[20];
-	int err;
+	int err, p;
+
+	monr_account_uflags(monr_hrchy_root, root_monr_uflags, true);
+	for (p = 0; p < topology_max_packages(); p++)
+		monr_hrchy_root->pkg_uflags[p] = root_monr_uflags;
 
 	/* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
 	__min_max_rmid = UINT_MAX;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 7f3a7b8..66b078a 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -76,6 +76,16 @@ struct pkg_data {
 };
 
 /**
+ * enum cmt_user_flags - user set flags for monr and pmonrs.
+ */
+enum cmt_user_flags {
+	/* if CMT_UF_HAS_USER is not set, other flags are meaningless. */
+	CMT_UF_HAS_USER		= BIT(0), /* has cgroup or event users */
+	CMT_UF_MAX		= BIT(1) - 1,
+	CMT_UF_ERROR		= CMT_UF_MAX + 1,
+};
+
+/**
  * struct monr - MONitored Resource.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
@@ -83,6 +93,10 @@ struct pkg_data {
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
+ * @nr_has_user:	nr of events in mon_events with CMT_UF_HAS_USER set.
+ * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
+ * @pkg_uflags:		package level cmt_user_flags, each entry is used as
+ *			pmonr uflags if that package is online.
  *
  * A monr is assigned to every CMT event and/or monitored cgroup when
  * monitoring is activated and that instance's address does not change during
@@ -98,4 +112,8 @@ struct monr {
 	struct monr			*parent;
 	struct list_head		children;
 	struct list_head		parent_entry;
+
+	int				nr_has_user;
+	enum cmt_user_flags		uflags;
+	enum cmt_user_flags		pkg_uflags[];
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (10 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states David Carrillo-Cisneros
                   ` (33 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A Resource Monitoring ID (RMID) is a hardware ID used to track cache
occupancy and memory bandwidth. The rmids are a per-package resource
and only one can be programmed at a time per logical CPU.

This patch series creates per-package rmid pools and (by default)
lazy allocation of rmids (an rmid is only reserved when a thread runs in
a package) to potentially allow more simultaneous rmid users than the
system-wide approach of the previous CQM/CMT driver.
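
For illustration only (not part of the patch), a user-space sketch of the
per-package free-rmid pool with lazy, first-use allocation; the names and
sizes are assumptions:

	#include <limits.h>
	#include <stdio.h>

	#define MAX_RMIDS	8
	#define INVALID_RMID	UINT_MAX

	/* Per-package pool of free rmids, kept as a small bitmap. */
	struct pkg_pool {
		unsigned long free_rmids;	/* bit set => rmid is free */
	};

	/* Reserve an rmid only when a monitored thread first runs here. */
	static unsigned int pkg_alloc_rmid(struct pkg_pool *p)
	{
		unsigned int r;

		for (r = 0; r < MAX_RMIDS; r++) {
			if (p->free_rmids & (1UL << r)) {
				p->free_rmids &= ~(1UL << r);
				return r;
			}
		}
		return INVALID_RMID;	/* this package ran out of rmids */
	}

	int main(void)
	{
		struct pkg_pool pkg0 = { .free_rmids = (1UL << MAX_RMIDS) - 1 };

		printf("first: %u\n", pkg_alloc_rmid(&pkg0));
		printf("second: %u\n", pkg_alloc_rmid(&pkg0));
		return 0;
	}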

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c |  5 +++++
 arch/x86/events/intel/cmt.h | 24 +++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 07560e5..5799816 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -725,6 +725,11 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 		return ERR_PTR(-ENOMEM);
 
 	pkgd->max_rmid = c->x86_cache_max_rmid;
+	if (pkgd->max_rmid >= CMT_MAX_NR_RMIDS) {
+		pr_err("CPU Package %d supports %d RMIDs. Using %d only.\n",
+		       pkgid, pkgd->max_rmid, CMT_MAX_NR_RMIDS - 1);
+		pkgd->max_rmid = CMT_MAX_NR_RMIDS - 1;
+	}
 
 	mutex_init(&pkgd->mutex);
 	raw_spin_lock_init(&pkgd->lock);
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 66b078a..6211392 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -3,6 +3,11 @@
  * (formerly Intel Cache QoS Monitoring, CQM)
  *
  *
+ * A Resource Monitoring ID (RMID) is a hardware ID used in Intel RDT to
+ * monitor cache and memory events such as LLC Occupancy and Memory
+ * Bandwidth. Changes in such metrics that are caused by a CPU are
+ * counted towards the rmid active in that CPU at the time.
+ *
  * A "Monitored Resource" (monr) is the entity monitored by CMT and MBM.
  * In order to monitor a cgroup and/or thread, it must be associated to
  * a monr. A monr is active in a CPU when a thread that is associated to
@@ -28,7 +33,8 @@
  * cgroup or process.
  *
  * Each monr has a package monr (pmonr) for each package with at least one
- * online cpu. The pmonr handles the CMT and MBM monitoring within its package.
+ * online cpu. The pmonr handles the CMT and MBM monitoring within its package
+ * by managing the rmid to write into each CPU that runs a monitored thread.
  *
  *
  * Locking
@@ -55,9 +61,22 @@ struct pmonr {
 	struct pkg_data				*pkgd;
 };
 
+/*
+ * Compile-time constant required for bitmap macros.
+ * Broadwell EP has 2 rmids per logical core; use twice that as an upper bound.
+ * 128 is a reasonable upper bound for logical cores per package for the
+ * foreseeable future. Adjust as CPUs grow.
+ */
+#define CMT_MAX_NR_RMIDS	(2 * 2 * 128)
+#define CMT_MAX_NR_RMIDS_BYTES	DIV_ROUND_UP(CMT_MAX_NR_RMIDS, BITS_PER_BYTE)
+#define CMT_MAX_NR_RMIDS_LONGS	BITS_TO_LONGS(CMT_MAX_NR_RMIDS)
+
 /**
  * struct pkg_data - Per-package CMT data.
  *
+ * @free_rmids:			Pool of free rmids.
+ * @dirty_rmids:		Pool of "dirty" rmids that are not referenced
+ *				by a pmonr.
  * @mutex:			Hold when modifying this pkg_data.
  * @lock:			Hold to protect pmonrs in this pkg_data.
  * @work_cpu:			CPU to run rotation and other batch jobs.
@@ -67,6 +86,9 @@ struct pmonr {
  * @pkgid:			The logical package id for this pkgd.
  */
 struct pkg_data {
+	unsigned long		free_rmids[CMT_MAX_NR_RMIDS_LONGS];
+	unsigned long		dirty_rmids[CMT_MAX_NR_RMIDS_LONGS];
+
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (11 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states David Carrillo-Cisneros
                   ` (32 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A pmonr uses a state machine to keep track of its rmids and their
hierarchical dependencies with other pmonrs in the same package.

This patch introduces the first two states of that state machine.
It also adds pmonr_rmids: a word-size container to atomically access
a pmonr's sched and read rmids.

More details in code's comments.
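
As an aside (not the kernel code), a minimal C11 sketch of the word-size
container idea: the sched and read rmids are packed into one 64-bit value
so both can be read with a single atomic load. Names are illustrative:

	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	#define INVALID_RMID	((uint32_t)-1)

	/* Word-size summary of the two rmids, readable in one atomic load. */
	union rmids_sketch {
		uint64_t value;
		struct {
			uint32_t sched_rmid;
			uint32_t read_rmid;
		};
	};

	static _Atomic uint64_t atomic_rmids;

	static void set_rmids(uint32_t sched, uint32_t read)
	{
		union rmids_sketch r = {
			.sched_rmid = sched,
			.read_rmid  = read,
		};

		atomic_store(&atomic_rmids, r.value);
	}

	int main(void)
	{
		union rmids_sketch r;

		set_rmids(INVALID_RMID, 0);	/* e.g. an "Unused"-like encoding */
		r.value = atomic_load(&atomic_rmids);
		printf("sched=%#x read=%#x\n", r.sched_rmid, r.read_rmid);
		return 0;
	}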

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 57 +++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/cmt.h | 53 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 5799816..fb6877f 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -10,6 +10,8 @@
 #define QOS_L3_OCCUP_EVENT_ID	BIT_ULL(0)
 #define QOS_EVENT_MASK		QOS_L3_OCCUP_EVENT_ID
 
+#define INVALID_RMID		-1
+
 /* Increase as needed as Intel CPUs grow. */
 #define CMT_MAX_NR_PKGS		8
 
@@ -121,6 +123,16 @@ static inline struct pmonr *pkgd_pmonr(struct pkg_data *pkgd, struct monr *monr)
 	return rcu_dereference_check(monr->pmonrs[pkgd->pkgid], safe);
 }
 
+static inline void pmonr_set_rmids(struct pmonr *pmonr,
+				   u32 sched_rmid, u32 read_rmid)
+{
+	union pmonr_rmids rmids;
+
+	rmids.sched_rmid = sched_rmid;
+	rmids.read_rmid  = read_rmid;
+	atomic64_set(&pmonr->atomic_rmids, rmids.value);
+}
+
 static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 {
 	struct pmonr *pmonr;
@@ -130,6 +142,7 @@ static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 	if (!pmonr)
 		return ERR_PTR(-ENOMEM);
 
+	pmonr_set_rmids(pmonr, INVALID_RMID, INVALID_RMID);
 	pmonr->pkgd = pkgd;
 
 	return pmonr;
@@ -140,6 +153,29 @@ static inline bool monr_is_root(struct monr *monr)
 	return monr_hrchy_root == monr;
 }
 
+/* pkg_data lock is not required for transition from Off state. */
+static void pmonr_to_unused(struct pmonr *pmonr)
+{
+	/*
+	 * Do not warn on re-entering Unused state to simplify cleanup
+	 * of initialized pmonrs that were not scheduled.
+	 */
+	if (pmonr->state == PMONR_UNUSED)
+		return;
+
+	if (pmonr->state == PMONR_OFF) {
+		pmonr->state = PMONR_UNUSED;
+		pmonr_set_rmids(pmonr, INVALID_RMID, 0);
+		return;
+	}
+}
+
+static void pmonr_unused_to_off(struct pmonr *pmonr)
+{
+	pmonr->state = PMONR_OFF;
+	pmonr_set_rmids(pmonr, INVALID_RMID, 0);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
@@ -152,6 +188,8 @@ static void monr_dealloc(struct monr *monr)
 		/* out of monr_hrchy, so no need for rcu or lock protection. */
 		if (!monr->pmonrs[p])
 			continue;
+		if (WARN_ON_ONCE(monr->pmonrs[p]->state != PMONR_OFF))
+			continue;
 		kfree(monr->pmonrs[p]);
 	}
 	kfree(monr);
@@ -210,9 +248,20 @@ static enum cmt_user_flags pmonr_uflags(struct pmonr *pmonr)
 static int __pmonr_apply_uflags(struct pmonr *pmonr,
 		enum cmt_user_flags pmonr_uflags)
 {
+	if (!(pmonr_uflags & CMT_UF_HAS_USER)) {
+		if (pmonr->state != PMONR_OFF) {
+			pmonr_to_unused(pmonr);
+			pmonr_unused_to_off(pmonr);
+		}
+		return 0;
+	}
+
 	if (monr_is_root(pmonr->monr) && (~pmonr_uflags & root_monr_uflags))
 		return -EINVAL;
 
+	if (pmonr->state == PMONR_OFF)
+		pmonr_to_unused(pmonr);
+
 	return 0;
 }
 
@@ -750,15 +799,19 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 static void __terminate_pkg_data(struct pkg_data *pkgd)
 {
 	struct monr *pos = NULL;
+	struct pmonr *pmonr;
 	unsigned long flags;
 
 	lockdep_assert_held(&cmt_mutex);
 
 	raw_spin_lock_irqsave(&pkgd->lock, flags);
 	/* post-order traversal guarantees pos to be leaf of monr hierarchy. */
-	while ((pos = monr_next_descendant_post(pos, monr_hrchy_root)))
+	while ((pos = monr_next_descendant_post(pos, monr_hrchy_root))) {
+		pmonr = pkgd_pmonr(pkgd, pos);
+		pmonr_to_unused(pmonr);
+		pmonr_unused_to_off(pmonr);
 		RCU_INIT_POINTER(pos->pmonrs[pkgd->pkgid], NULL);
-
+	}
 	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
 
 	synchronize_rcu();
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 6211392..05325c8 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -52,13 +52,66 @@
  */
 
 /**
+ * enum pmonr_state - pmonrs can be in one of the following states:
+ *   - Off:	  pmonr is unavailable for monitoring. It's the starting state.
+ *   - Unused:	  pmonr is available for monitoring but no thread associated to
+ *		  this pmonr's monr has been scheduled in this pmonr's package.
+ *
+ * The valid state transitions are:
+ *
+ *    From:	|    To:			Cause:
+ *=============================================================================
+ *  Off		|  Unused	monitoring is enabled for a pmonr.
+ *-----------------------------------------------------------------------------
+ *  Unused	|  Off		monitoring is disabled for a pmonr.
+ *-----------------------------------------------------------------------------
+ */
+enum pmonr_state {
+	PMONR_OFF = 0,
+	PMONR_UNUSED,
+};
+
+/**
+ * union pmonr_rmids - Machine-size summary of a pmonr's rmid state.
+ * @value:		One word accesor.
+ * @sched_rmid:		The rmid to write in the PQR MSR in sched in/out.
+ * @read_rmid:		The rmid to read occupancy from.
+ *
+ * An atomically readable/writable summary of the rmids used by a pmonr.
+ * Its values can also be used to atomically read the state (preventing
+ * unnecessary locks of pkgd->lock) in the following way:
+ *					pmonr state
+ *	      |      Off         Unused
+ * ============================================================================
+ * sched_rmid |	INVALID_RMID  INVALID_RMID
+ * ----------------------------------------------------------------------------
+ *  read_rmid |	INVALID_RMID        0
+ *
+ */
+union pmonr_rmids {
+	long		value;
+	struct {
+		u32	sched_rmid;
+		u32	read_rmid;
+	};
+};
+
+/**
  * struct pmonr - per-package component of MONitored Resources (monr).
  * @monr:		The monr that contains this pmonr.
  * @pkgd:		The package data associated with this pmonr.
+ * @atomic_rmids:	Atomic accessor for this pmonr's rmids.
+ * @state:		The state for this pmonr, note that this can also
+ *			be inferred from the combination of sched_rmid and
+ *			read_rmid in @atomic_rmids.
  */
 struct pmonr {
 	struct monr				*monr;
 	struct pkg_data				*pkgd;
+
+	/* all writers are sync'ed by package's lock. */
+	atomic64_t				atomic_rmids;
+	enum pmonr_state			state;
 };
 
 /*
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (12 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache David Carrillo-Cisneros
                   ` (31 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add remaining states for pmonr's state machine:
  - Active: A pmonr that is actively used.
  - Dep_Idle: A pmonr that failed to obtain a rmid. It "borrows" its rmid
    from its lowest monitored (Active in same pkgd) ancestor in the
    monr hierarchy.
  - Dep_Dirty: A pmonr that was Active but has lost its rmid (due to rmid
    rotation, introduced later in this patch series). It is similar to
    Dep_Idle but keeps track of its former rmid in case there is a reuse
    opportunity in the future.

This patch adds the states and state transition functions for pmonrs.
It also adds infrastructure and usage statistics to struct pkg_data that
will be used later in this series.

The transitions Unused -> Active and Unused -> Dep_Idle are inline because
they will be called during task context switches the first time a monr
runs in a package (later in this series).

More details in code's comments.
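
For intuition only (not kernel code), a stand-alone sketch of the borrowing
idea behind Dep_Idle: a pmonr without its own rmid walks up to its lowest
Active ancestor and borrows that ancestor's rmid for scheduling, assuming
the root is always Active. All names are made up:

	#include <stdio.h>

	enum state { SK_ACTIVE, SK_DEP_IDLE };

	struct node {
		struct node *parent;
		enum state state;
		unsigned int rmid;
	};

	/*
	 * Walk up the hierarchy to the lowest Active ancestor and borrow
	 * its rmid for scheduling. Must not be called on the root, which
	 * is assumed to always be Active and to own its rmid.
	 */
	static unsigned int borrow_sched_rmid(struct node *n)
	{
		struct node *p = n->parent;

		while (p->state != SK_ACTIVE)
			p = p->parent;
		return p->rmid;
	}

	int main(void)
	{
		struct node root = { .state = SK_ACTIVE, .rmid = 0 };
		struct node mid  = { .parent = &root, .state = SK_DEP_IDLE };
		struct node leaf = { .parent = &mid,  .state = SK_DEP_IDLE };

		printf("leaf borrows rmid %u\n", borrow_sched_rmid(&leaf));
		return 0;
	}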

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 237 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |  95 +++++++++++++++++-
 2 files changed, 329 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index fb6877f..86c3013 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -142,6 +142,10 @@ static struct pmonr *pmonr_alloc(struct pkg_data *pkgd)
 	if (!pmonr)
 		return ERR_PTR(-ENOMEM);
 
+	/* pmonr_deps_{head, entry} are in a union, initialize one of them. */
+	INIT_LIST_HEAD(&pmonr->pmonr_deps_head);
+	INIT_LIST_HEAD(&pmonr->pkgd_deps_entry);
+	INIT_LIST_HEAD(&pmonr->rot_entry);
 	pmonr_set_rmids(pmonr, INVALID_RMID, INVALID_RMID);
 	pmonr->pkgd = pkgd;
 
@@ -153,9 +157,108 @@ static inline bool monr_is_root(struct monr *monr)
 	return monr_hrchy_root == monr;
 }
 
+/*
+ * Test ancestry within the monr hierarchy.
+ * Return true if @a is an ancestor of @b or equal to it.
+ */
+static inline bool monr_hrchy_is_ancestor(struct monr *a, struct monr *b)
+{
+	if (monr_hrchy_root == a || a == b)
+		return true;
+	if (monr_hrchy_root == b)
+		return false;
+
+	b = b->parent;
+	/* Break at the root */
+	while (b != monr_hrchy_root) {
+		if (a == b)
+			return true;
+		b = b->parent;
+	}
+
+	return false;
+}
+
+/**
+ * pmonr_find_lma() - Find Lowest Monitored Ancestor (lma) of a pmonr.
+ * @pmonr:		The pmonr to start the search on.
+ *
+ * Always succeeds since monr_hrchy_root's pmonrs are always in Active state.
+ * Return: lma of @pmonr.
+ */
+static struct pmonr *pmonr_find_lma(struct pmonr *pmonr)
+{
+	struct monr *monr = pmonr->monr;
+	struct pkg_data *pkgd = pmonr->pkgd;
+
+	lockdep_assert_held(&pkgd->lock);
+
+	while ((monr = monr->parent)) {
+		/* protected by pkgd lock. */
+		pmonr = pkgd_pmonr(pkgd, monr);
+		if (pmonr->state == PMONR_ACTIVE)
+			return pmonr;
+	}
+	/* Should have hit monr_hrchy_root. */
+	WARN_ON_ONCE(true);
+
+	return pkgd_pmonr(pkgd, monr_hrchy_root);
+}
+
+/**
+ * pmonr_move_all_dependants() - Move all dependants from @old lender to @new.
+ * @old: Old lender.
+ * @new: New lender.
+ *
+ * @new->monr must be ancestor of @old->monr and they must be distinct.
+ */
+static void pmonr_move_all_dependants(struct pmonr *old, struct pmonr *new)
+{
+	struct pmonr *dep;
+	union pmonr_rmids dep_rmids, new_rmids;
+
+	new_rmids.value = atomic64_read(&new->atomic_rmids);
+	/* Update this pmonr's dependants to depend on new lender. */
+	list_for_each_entry(dep, &old->pmonr_deps_head, pmonr_deps_entry) {
+		dep->lender = new;
+		dep_rmids.value = atomic64_read(&dep->atomic_rmids);
+		pmonr_set_rmids(dep, new_rmids.sched_rmid, dep_rmids.read_rmid);
+	}
+	list_splice_tail_init(&old->pmonr_deps_head, &new->pmonr_deps_head);
+}
+
+/**
+ * pmonr_move_dependants() - Move some dependants from @old lender to @new.
+ *
+ * Move @old's dependants that are @new->monr descendants to be @new's
+ * dependants. As opposed to pmonr_move_all_dependants, @new->monr does not
+ * need to be an ancestor of @old->monr.
+ */
+static inline void pmonr_move_dependants(struct pmonr *old, struct pmonr *new)
+{
+	struct pmonr *dep, *tmp;
+	union pmonr_rmids dep_rmids, new_rmids;
+
+	new_rmids.value = atomic64_read(&new->atomic_rmids);
+
+	list_for_each_entry_safe(dep, tmp, &old->pmonr_deps_head,
+				 pmonr_deps_entry) {
+		if (!monr_hrchy_is_ancestor(new->monr, dep->monr))
+			continue;
+		list_move_tail(&dep->pmonr_deps_entry, &new->pmonr_deps_head);
+		dep->lender = new;
+		dep_rmids.value = atomic64_read(&dep->atomic_rmids);
+		pmonr_set_rmids(dep, new_rmids.sched_rmid, dep_rmids.read_rmid);
+	}
+}
+
 /* pkg_data lock is not required for transition from Off state. */
 static void pmonr_to_unused(struct pmonr *pmonr)
 {
+	struct pkg_data *pkgd = pmonr->pkgd;
+	struct pmonr *lender;
+	union pmonr_rmids rmids;
+
 	/*
 	 * Do not warn on re-entering Unused state to simplify cleanup
 	 * of initialized pmonrs that were not scheduled.
@@ -168,6 +271,98 @@ static void pmonr_to_unused(struct pmonr *pmonr)
 		pmonr_set_rmids(pmonr, INVALID_RMID, 0);
 		return;
 	}
+
+	lockdep_assert_held(&pkgd->lock);
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+
+	if (pmonr->state == PMONR_ACTIVE) {
+		if (monr_is_root(pmonr->monr)) {
+			WARN_ON_ONCE(!list_empty(&pmonr->pmonr_deps_head));
+		} else {
+			lender = pmonr_find_lma(pmonr);
+			pmonr_move_all_dependants(pmonr, lender);
+		}
+		__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
+
+	} else if (pmonr->state == PMONR_DEP_IDLE ||
+		   pmonr->state == PMONR_DEP_DIRTY) {
+
+		pmonr->lender = NULL;
+		list_del_init(&pmonr->pmonr_deps_entry);
+		list_del_init(&pmonr->pkgd_deps_entry);
+
+		if (pmonr->state == PMONR_DEP_DIRTY)
+			__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
+		else
+			pkgd->nr_dep_pmonrs--;
+	} else {
+		WARN_ON_ONCE(true);
+		return;
+	}
+
+	list_del_init(&pmonr->rot_entry);
+	pmonr->state = PMONR_UNUSED;
+	pmonr_set_rmids(pmonr, INVALID_RMID, INVALID_RMID);
+}
+
+static inline void __pmonr_to_active_helper(struct pmonr *pmonr, u32 rmid)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+
+	list_move_tail(&pmonr->rot_entry, &pkgd->active_pmonrs);
+	pmonr->state = PMONR_ACTIVE;
+	pmonr_set_rmids(pmonr, rmid, rmid);
+	atomic64_set(&pmonr->last_enter_active, get_jiffies_64());
+}
+
+static inline void pmonr_unused_to_active(struct pmonr *pmonr, u32 rmid)
+{
+	struct pmonr *lender;
+
+	__clear_bit(rmid, pmonr->pkgd->free_rmids);
+	__pmonr_to_active_helper(pmonr, rmid);
+	/*
+	 * If monr is root, no ancestor exists to move pmonr to. If monr is
+	 * root's child, no dependants of its parent (root) could be moved.
+	 * Check both cases separately to avoid unnecessary calls to
+	 * pmonr_move_dependants.
+	 */
+	if (!monr_is_root(pmonr->monr) && !monr_is_root(pmonr->monr->parent)) {
+		lender = pmonr_find_lma(pmonr);
+		pmonr_move_dependants(lender, pmonr);
+	}
+}
+
+/* helper function for transitions to Dep_{Idle,Dirty} states. */
+static inline void __pmonr_to_dep_helper(
+	struct pmonr *pmonr, struct pmonr *lender, u32 read_rmid)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+	union pmonr_rmids lender_rmids;
+
+	pmonr->lender = lender;
+	list_move_tail(&pmonr->pmonr_deps_entry, &lender->pmonr_deps_head);
+	list_move_tail(&pmonr->pkgd_deps_entry, &pkgd->dep_pmonrs);
+
+	if (read_rmid == INVALID_RMID) {
+		list_move_tail(&pmonr->rot_entry, &pkgd->dep_idle_pmonrs);
+		pkgd->nr_dep_pmonrs++;
+		pmonr->state = PMONR_DEP_IDLE;
+	} else {
+		list_move_tail(&pmonr->rot_entry, &pkgd->dep_dirty_pmonrs);
+		pmonr->state = PMONR_DEP_DIRTY;
+	}
+
+	lender_rmids.value = atomic64_read(&lender->atomic_rmids);
+	pmonr_set_rmids(pmonr, lender_rmids.sched_rmid, read_rmid);
+}
+
+static inline void pmonr_unused_to_dep_idle(struct pmonr *pmonr)
+{
+	struct pmonr *lender;
+
+	lender = pmonr_find_lma(pmonr);
+	__pmonr_to_dep_helper(pmonr, lender, INVALID_RMID);
 }
 
 static void pmonr_unused_to_off(struct pmonr *pmonr)
@@ -176,6 +371,43 @@ static void pmonr_unused_to_off(struct pmonr *pmonr)
 	pmonr_set_rmids(pmonr, INVALID_RMID, 0);
 }
 
+static void pmonr_active_to_dep_dirty(struct pmonr *pmonr)
+{
+	struct pmonr *lender;
+	union pmonr_rmids rmids;
+
+	lender = pmonr_find_lma(pmonr);
+	pmonr_move_all_dependants(pmonr, lender);
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	__pmonr_to_dep_helper(pmonr, lender, rmids.read_rmid);
+}
+
+static void __pmonr_dep_to_active_helper(struct pmonr *pmonr, u32 rmid)
+{
+	list_del_init(&pmonr->pkgd_deps_entry);
+	/* pmonr will no longer be dependent on pmonr_lender. */
+	list_del_init(&pmonr->pmonr_deps_entry);
+	pmonr_move_dependants(pmonr->lender, pmonr);
+	pmonr->lender = NULL;
+	__pmonr_to_active_helper(pmonr, rmid);
+}
+
+static void pmonr_dep_idle_to_active(struct pmonr *pmonr, u32 rmid)
+{
+	__clear_bit(rmid, pmonr->pkgd->free_rmids);
+	pmonr->pkgd->nr_dep_pmonrs--;
+	__pmonr_dep_to_active_helper(pmonr, rmid);
+}
+
+static void pmonr_dep_dirty_to_active(struct pmonr *pmonr)
+{
+	union pmonr_rmids rmids;
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	__pmonr_dep_to_active_helper(pmonr, rmids.read_rmid);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
@@ -780,6 +1012,11 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 		pkgd->max_rmid = CMT_MAX_NR_RMIDS - 1;
 	}
 
+	INIT_LIST_HEAD(&pkgd->active_pmonrs);
+	INIT_LIST_HEAD(&pkgd->dep_idle_pmonrs);
+	INIT_LIST_HEAD(&pkgd->dep_dirty_pmonrs);
+	INIT_LIST_HEAD(&pkgd->dep_pmonrs);
+
 	mutex_init(&pkgd->mutex);
 	raw_spin_lock_init(&pkgd->lock);
 
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 05325c8..bf90c26 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -36,6 +36,21 @@
  * online cpu. The pmonr handles the CMT and MBM monitoring within its package
  * by managing the rmid to write into each CPU that runs a monitored thread.
  *
+ * The lma of a pmonr is its closest ancestor pmonr that is in Active state.
+ *
+ * A pmonr allocates a rmid when needed, depending on its state (see
+ * enum pmonr_state comments). If a pmonr fails to obtain a free rmid, it
+ * "borrows" the one used by its Lowest Monitored Ancestor (lma).
+ *
+ * The "borrowed" rmid is used when threads are scheduled in so that the
+ * occupancy and memory bandwidth for those threads are accounted for in the
+ * monr hierarchy. Yet, that pmonr cannot use a "borrowed" rmid to read,
+ * since that rmid is not counting the "borrower"'s monr cache events.
+ * Therefore, a pmonr uses rmids in two ways:
+ *   (1) to schedule, and (2) to read.
+ * When a pmonr owns a rmid (Active state), that rmid is used for both
+ * schedule and read.
+ *
  *
  * Locking
  *
@@ -56,6 +71,16 @@
  *   - Off:	  pmonr is unavailable for monitoring. It's the starting state.
  *   - Unused:	  pmonr is available for monitoring but no thread associated to
  *		  this pmonr's monr has been scheduled in this pmonr's package.
+ *   - Active:	  pmonr is actively used. It successfully obtained a free rmid
+ *		  to sched in/out and uses it to read pmonr's llc_occupancy.
+ *   - Dep_Idle:  pmonr failed to obtain its own free rmid and is borrowing the
+ *		  rmid from its lowest Active ancestor monr (its lma monr).
+ *   - Dep_Dirty: pmonr was Active but its rmid was stolen. This state differs
+ *		  from Dep_Idle in that the pmonr keeps a reference to its
+ *		  former Active rmid. If the pmonr becomes eligible to recoup
+ *		  its rmid in the near future, this previously used rmid can
+ *		  be reused even if "dirty" without introducing additional
+ *		  counting error.
  *
  * The valid state transitions are:
  *
@@ -64,11 +89,37 @@
  *  Off		|  Unused	monitoring is enabled for a pmonr.
  *-----------------------------------------------------------------------------
  *  Unused	|  Off		monitoring is disabled for a pmonr.
+ *		|--------------------------------------------------------------
+ *		|  Active	First thread associated to pmonr is scheduled
+ *		|		in package and a free rmid is available.
+ *		|--------------------------------------------------------------
+ *		|  Dep_Idle	Could not find a free rmid available.
+ *-----------------------------------------------------------------------------
+ *  Active	|  Dep_Dirty	rmid is stolen, keep reference to old rmid
+ *		|		in read_rmid, but it is not used to read.
+ *		|--------------------------------------------------------------
+ *		|  Unused	pmonr releases the rmid, released rmid can be
+ *		|		"dirty" and therefore goes to dirty_rmids.
+ *-----------------------------------------------------------------------------
+ *  Dep_Idle	|  Active	pmonr receives a "clean" rmid.
+ *		|--------------------------------------------------------------
+ *		|  Unused	pmonr is no longer waiting for rmid.
+ *-----------------------------------------------------------------------------
+ *  Dep_Dirty	|  Active	dirty rmid is reissued to pmonr that had it
+ *		|		before the transition.
+ *		|--------------------------------------------------------------
+ *		|  Dep_Idle	dirty rmid has become "clean" and is reissued
+ *		|		to a distinct pmonr (or goes to free_rmids).
+ *		|--------------------------------------------------------------
+ *		|  Unused	pmonr is no longer waiting for rmid.
  *-----------------------------------------------------------------------------
  */
 enum pmonr_state {
 	PMONR_OFF = 0,
 	PMONR_UNUSED,
+	PMONR_ACTIVE,
+	PMONR_DEP_IDLE,
+	PMONR_DEP_DIRTY,
 };
 
 /**
@@ -81,11 +132,11 @@ enum pmonr_state {
  * Its values can also used to atomically read the state (preventing
  * unnecessary locks of pkgd->lock) in the following way:
  *					pmonr state
- *	      |      Off         Unused
+ *	      |      Off         Unused       Active      Dep_Idle     Dep_Dirty
  * ============================================================================
- * sched_rmid |	INVALID_RMID  INVALID_RMID
+ * sched_rmid |	INVALID_RMID  INVALID_RMID    valid       lender's     lender's
  * ----------------------------------------------------------------------------
- *  read_rmid |	INVALID_RMID        0
+ *  read_rmid |	INVALID_RMID        0	      (same)    INVALID_RMID   old rmid
  *
  */
 union pmonr_rmids {
@@ -98,16 +149,42 @@ union pmonr_rmids {
 
 /**
  * struct pmonr - per-package component of MONitored Resources (monr).
+ * @lender:		if in Dep_Idle or Dep_Dirty state, it's the pmonr that
+ *			lends its rmid to this pmonr. NULL otherwise.
+ * @pmonr_deps_head:	List of pmonrs in Dep_Idle or Dep_Dirty state that
+ *			borrow their sched_rmid from this pmonr.
+ * @pmonr_deps_entry:	Entry into lender's @pmonr_deps_head when in Dep_Idle
+ *			or Dep_Dirty state.
+ * @pkgd_deps_entry:	When in Dep_Dirty state, the list entry for dep_pmonrs.
  * @monr:		The monr that contains this pmonr.
  * @pkgd:		The package data associated with this pmonr.
+ * @rot_entry:		List entry to attach to pmonr rotation lists in
+ *			pkg_data.
+ *
+ * @last_enter_active:	Time of last entry into Active state.
  * @atomic_rmids:	Atomic accessor for this pmonr's rmids.
  * @state:		The state for this pmonr, note that this can also
  *			be inferred from the combination of sched_rmid and
  *			read_rmid in @atomic_rmids.
  */
 struct pmonr {
+	struct pmonr				*lender;
+	/* save space with union since pmonr is in only one state at a time. */
+	union {
+		struct { /* variables for Active state. */
+			struct list_head	pmonr_deps_head;
+		};
+		struct { /* variables for Dep_Idle and Dep_Dirty states. */
+			struct list_head	pmonr_deps_entry;
+			struct list_head	pkgd_deps_entry;
+		};
+	};
+
 	struct monr				*monr;
 	struct pkg_data				*pkgd;
+	struct list_head			rot_entry;
+
+	atomic64_t				last_enter_active;
 
 	/* all writers are sync'ed by package's lock. */
 	atomic64_t				atomic_rmids;
@@ -130,7 +207,13 @@ struct pmonr {
  * @free_rmids:			Pool of free rmids.
  * @dirty_rmids:		Pool of "dirty" rmids that are not referenced
  *				by a pmonr.
+ * @active_pmonrs:		LRU of Active pmonrs.
+ * @dep_idle_pmonrs:		LRU of Dep_Idle pmonrs.
+ * @dep_dirty_pmonrs:		LRU of Dep_Dirty pmonrs.
+ * @dep_pmonrs:			LRU of Dep_Idle and Dep_Dirty pmonrs.
+ * @nr_dep_pmonrs:		nr Dep_Idle + nr Dep_Dirty pmonrs.
  * @mutex:			Hold when modifying this pkg_data.
+ * @mutex_key:			lockdep class for pkg_data's mutex.
  * @lock:			Hold to protect pmonrs in this pkg_data.
  * @work_cpu:			CPU to run rotation and other batch jobs.
  *				It must be in the package associated to its
@@ -142,6 +225,12 @@ struct pkg_data {
 	unsigned long		free_rmids[CMT_MAX_NR_RMIDS_LONGS];
 	unsigned long		dirty_rmids[CMT_MAX_NR_RMIDS_LONGS];
 
+	struct list_head	active_pmonrs;
+	struct list_head	dep_idle_pmonrs;
+	struct list_head	dep_dirty_pmonrs;
+	struct list_head	dep_pmonrs;
+	int			nr_dep_pmonrs;
+
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (13 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del David Carrillo-Cisneros
                   ` (30 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Encapsulate updates to the PQR_ASSOC MSR's RMID and CLOSID in
intel_rdt_common. Use the new interface in Intel CMT.

Change RDT common code to build for both CONFIG_INTEL_RDT_A and
CONFIG_INTEL_RDT_M.
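
For reference only, a rough user-space sketch of the general idea of a
software PQR cache: keep the last RMID/CLOSID pair and skip redundant MSR
writes. The helper names and the skip-if-unchanged policy below are
assumptions for illustration, not the exact semantics of this patch:

	#include <stdint.h>
	#include <stdio.h>

	/* Stand-in for the real wrmsr(); just logs the would-be MSR write. */
	static void fake_wrmsr(uint32_t rmid, uint32_t closid)
	{
		printf("PQR_ASSOC <- rmid=%u closid=%u\n", rmid, closid);
	}

	/* Software cache of the last value written to the (per-CPU) MSR. */
	struct pqr_cache {
		uint32_t rmid;
		uint32_t closid;
	};

	static void pqr_update_rmid(struct pqr_cache *c, uint32_t rmid)
	{
		if (c->rmid == rmid)	/* avoid a redundant MSR write */
			return;
		c->rmid = rmid;
		fake_wrmsr(c->rmid, c->closid);
	}

	int main(void)
	{
		struct pqr_cache cpu0 = { 0, 0 };

		pqr_update_rmid(&cpu0, 5);	/* writes */
		pqr_update_rmid(&cpu0, 5);	/* cached, no write */
		return 0;
	}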

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c             |  6 ++++++
 arch/x86/include/asm/intel_rdt_common.h | 30 ++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/Makefile            |  3 ++-
 arch/x86/kernel/cpu/intel_rdt_common.c  |  8 ++++++++
 4 files changed, 44 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 86c3013..ce5be74 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -4,6 +4,7 @@
 
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/intel_rdt_common.h>
 #include "cmt.h"
 #include "../perf_event.h"
 
@@ -1118,11 +1119,16 @@ static int intel_cmt_hp_online_enter(unsigned int cpu)
 	return 0;
 }
 
+/* Restore CPU's pqr_cache to initial state. */
 static int intel_cmt_hp_online_exit(unsigned int cpu)
 {
+	struct intel_pqr_state *state = per_cpu_ptr(&pqr_state, cpu);
 	struct pkg_data *pkgd;
 	u16 pkgid = topology_logical_package_id(cpu);
 
+	pqr_cache_update_rmid(0);
+	memset(state, 0, sizeof(*state));
+
 	rcu_read_lock();
 	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
 	if (pkgd->work_cpu == cpu)
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index b31081b..1d5e691 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -1,13 +1,21 @@
 #ifndef _ASM_X86_INTEL_RDT_COMMON_H
 #define _ASM_X86_INTEL_RDT_COMMON_H
 
+#if defined(CONFIG_INTEL_RDT_A) || defined(CONFIG_INTEL_RDT_M)
+
+#include <linux/types.h>
+#include <asm/percpu.h>
+#include <asm/msr.h>
+
 #define MSR_IA32_PQR_ASSOC	0x0c8f
 
+
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
  * @rmid:		The cached Resource Monitoring ID
+ * @next_rmid:		Next rmid to write to hw
  * @closid:		The cached Class Of Service ID
- * @rmid_usecnt:	The usage counter for rmid
+ * @next_closid:	Next closid to write to hw
  *
  * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
  * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -18,10 +26,28 @@
  */
 struct intel_pqr_state {
 	u32			rmid;
+	u32			next_rmid;
 	u32			closid;
-	int			rmid_usecnt;
+	u32			next_closid;
 };
 
 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 
+static inline void pqr_cache_update_rmid(u32 rmid)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	state->next_rmid = rmid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
+
+static inline void pqr_cache_update_closid(u32 closid)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	state->next_closid = closid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
+
+#endif
 #endif /* _ASM_X86_INTEL_RDT_COMMON_H */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index cf4bfd0..b095e65 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,7 +34,8 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o
+obj-$(CONFIG_INTEL_RDT_A)	+= intel_rdt.o intel_rdt_common.o
+obj-$(CONFIG_INTEL_RDT_M)	+= intel_rdt_common.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
new file mode 100644
index 0000000..7fd5b20
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -0,0 +1,8 @@
+#include <asm/intel_rdt_common.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Functions that modify pqr_state
+ * must ensure interruptions are handled properly.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (14 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID David Carrillo-Cisneros
                   ` (29 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Now that the pmonr state machine and pqr_common are in place, add
pmonr_update_sched_rmid to find the appropriate rmid to use. With it,
complete the bodies of the PMU functions that start/stop and add/del events.

A pmonr in Unused state tries to allocate a free rmid the first time one
of its monitored threads is scheduled in a CPU package (lazy allocation
of rmids). If there are no available rmids in that package, the pmonr
enters the Dep_Idle state (it borrows the sched_rmid from its
Lowest Monitored Ancestor (lma) pmonr).

When an event is stopped and no other event runs on that CPU, the PQR MSR
uses the rmid of monr_hrchy_root's pmonr for that CPU's package.

Details in pmonr state machine's comments.
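
As an illustration only (not the kernel code), a minimal pthread sketch of
the lock-free fast path described above: the rmid is read atomically and
the package lock is only taken on first use. The hard-coded rmid value is
a placeholder:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	#define INVALID_RMID	((uint32_t)-1)

	static _Atomic uint32_t sched_rmid = INVALID_RMID;
	static pthread_mutex_t pkg_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Lock-free fast path; fall back to the lock only on first use. */
	static uint32_t get_sched_rmid(void)
	{
		uint32_t r = atomic_load(&sched_rmid);

		if (r != INVALID_RMID)
			return r;		/* already assigned */

		pthread_mutex_lock(&pkg_lock);
		r = atomic_load(&sched_rmid);	/* re-check under the lock */
		if (r == INVALID_RMID) {
			r = 7;			/* pretend we found a free rmid */
			atomic_store(&sched_rmid, r);
		}
		pthread_mutex_unlock(&pkg_lock);
		return r;
	}

	int main(void)
	{
		printf("rmid: %u\n", get_sched_rmid());	/* slow path */
		printf("rmid: %u\n", get_sched_rmid());	/* fast path */
		return 0;
	}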

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 101 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index ce5be74..9421a3e 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -650,6 +650,74 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 	return err;
 }
 
+/**
+ * pmonr_update_sched_rmid() - Update sched_rmid for @pmonr in current package.
+ *
+ * Always finds valid rmids for a non-Off pmonr. Safe to call with IRQs disabled.
+ * A lock-free fast path reuses the rmid when the pmonr has been scheduled
+ * before in this package. Otherwise, tries to get a free rmid. On failure,
+ * enters Dep_Idle state and uses the rmid of its lender. There is always a
+ * pmonr to borrow from since monr_hrchy_root has all its pmonrs in Active
+ * state.
+ * Return: new pmonr_rmids for pmonr.
+ */
+static inline union pmonr_rmids pmonr_update_sched_rmid(struct pmonr *pmonr)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+	union pmonr_rmids rmids;
+	u32 free_rmid;
+
+	/* Use atomic_rmids to check state in a lock-free fastpath. */
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	if (rmids.sched_rmid != INVALID_RMID)
+		return rmids;
+
+	/* No need to obtain RMID if in Off state. */
+	if (rmids.sched_rmid == rmids.read_rmid)
+		return rmids;
+
+	/*
+	 * Lock-free path failed. Now acquire lock and verify that state
+	 * and atomic_rmids haven't changed. If still Unused, try to
+	 * obtain a free RMID.
+	 */
+	raw_spin_lock(&pkgd->lock);
+
+	/* With lock acquired it is ok to read pmonr::state. */
+	if (pmonr->state != PMONR_UNUSED) {
+		/* Update rmids in case they changed before acquiring lock. */
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		raw_spin_unlock(&pkgd->lock);
+		return rmids;
+	}
+
+	free_rmid = find_first_bit(pkgd->free_rmids, CMT_MAX_NR_RMIDS);
+	if (free_rmid == CMT_MAX_NR_RMIDS)
+		pmonr_unused_to_dep_idle(pmonr);
+	else
+		pmonr_unused_to_active(pmonr, free_rmid);
+
+	raw_spin_unlock(&pkgd->lock);
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+
+	return rmids;
+}
+
+static inline union pmonr_rmids monr_get_sched_in_rmids(struct monr *monr)
+{
+	struct pmonr *pmonr;
+	union pmonr_rmids rmids;
+	u16 pkgid = topology_logical_package_id(smp_processor_id());
+
+	rcu_read_lock();
+	pmonr = rcu_dereference(monr->pmonrs[pkgid]);
+	rmids = pmonr_update_sched_rmid(pmonr);
+	rcu_read_unlock();
+
+	return rmids;
+}
+
 static void monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
 {
 	unsigned long flags;
@@ -865,16 +933,49 @@ static void intel_cmt_event_read(struct perf_event *event)
 {
 }
 
+static inline void __intel_cmt_event_start(struct perf_event *event,
+					   union pmonr_rmids rmids)
+{
+	if (!(event->hw.state & PERF_HES_STOPPED))
+		return;
+	event->hw.state &= ~PERF_HES_STOPPED;
+	pqr_cache_update_rmid(rmids.sched_rmid);
+}
+
 static void intel_cmt_event_start(struct perf_event *event, int mode)
 {
+	union pmonr_rmids rmids;
+
+	rmids = monr_get_sched_in_rmids(monr_from_event(event));
+	__intel_cmt_event_start(event, rmids);
 }
 
 static void intel_cmt_event_stop(struct perf_event *event, int mode)
 {
+	union pmonr_rmids rmids;
+
+	if (event->hw.state & PERF_HES_STOPPED)
+		return;
+	event->hw.state |= PERF_HES_STOPPED;
+	rmids = monr_get_sched_in_rmids(monr_hrchy_root);
+	/*
+	 * HW tracks the rmid even when event is not scheduled and event
+	 * reads occur even if event is Inactive. Therefore there is no need to
+	 * read when event is stopped.
+	 */
+	pqr_cache_update_rmid(rmids.sched_rmid);
 }
 
 static int intel_cmt_event_add(struct perf_event *event, int mode)
 {
+	union pmonr_rmids rmids;
+
+	event->hw.state = PERF_HES_STOPPED;
+	rmids = monr_get_sched_in_rmids(monr_from_event(event));
+
+	if (mode & PERF_EF_START)
+		__intel_cmt_event_start(event, rmids);
+
 	return 0;
 }
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (15 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup David Carrillo-Cisneros
                   ` (28 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This uflag allows the user to specify that a rmid must be allocated at
monr initialization, or fail otherwise.

For this to work, we split __pmonr_apply_uflags into reserve and apply
modes. The reserve mode will try to reserve a free rmid and, if successful,
the apply mode can proceed using the previously reserved rmid.
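
For illustration only, a tiny stand-alone sketch of the reserve/apply split:
phase one either reserves a resource on every package or rolls back, and
phase two applies using the reservations and cannot fail. All names and
values below are made up:

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_PKGS 3

	static bool rmid_free[NR_PKGS] = { true, false, true };
	static int  reserved[NR_PKGS]  = { -1, -1, -1 };

	/* Phase 1: reserve on every package, or roll back and fail. */
	static int reserve_all(void)
	{
		int p;

		for (p = 0; p < NR_PKGS; p++) {
			if (!rmid_free[p])
				goto rollback;
			rmid_free[p] = false;
			reserved[p] = p + 10;	/* pretend rmid */
		}
		return 0;

	rollback:
		while (--p >= 0) {
			rmid_free[p] = true;
			reserved[p] = -1;
		}
		return -1;
	}

	/* Phase 2: apply using the reservations; this step cannot fail. */
	static void apply_all(void)
	{
		int p;

		for (p = 0; p < NR_PKGS; p++)
			printf("pkg %d -> rmid %d\n", p, reserved[p]);
	}

	int main(void)
	{
		if (reserve_all())
			printf("no free rmid on some package, nothing changed\n");
		else
			apply_all();
		return 0;
	}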

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 116 +++++++++++++++++++++++++++++++++++++++-----
 arch/x86/events/intel/cmt.h |   5 +-
 2 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 9421a3e..3883cb4 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -33,7 +33,8 @@ static unsigned int __min_max_rmid;	/* minimum max_rmid across all pkgs. */
 static struct monr *monr_hrchy_root;
 
 /* Flags for root monr and all its pmonrs while being monitored. */
-static enum cmt_user_flags root_monr_uflags = CMT_UF_HAS_USER;
+static enum cmt_user_flags root_monr_uflags =
+		CMT_UF_HAS_USER | CMT_UF_NOLAZY_RMID;
 
 /* Auxiliar flags */
 static enum cmt_user_flags *pkg_uflags_zeroes;
@@ -414,6 +415,7 @@ static void monr_dealloc(struct monr *monr)
 	u16 p, nr_pkgs = topology_max_packages();
 
 	if (WARN_ON_ONCE(monr->nr_has_user) ||
+	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
 	    WARN_ON_ONCE(monr->mon_events))
 		return;
 
@@ -478,11 +480,28 @@ static enum cmt_user_flags pmonr_uflags(struct pmonr *pmonr)
 	return monr->uflags | monr->pkg_uflags[pmonr->pkgd->pkgid];
 }
 
+/*
+ * Callable in two modes:
+ *   1) @reserve == true: will check if uflags are applicable and store in
+ *   @res_rmid the "reserved" rmid.
+ *   2) @reserve == false: will apply pmonr_uflags using the rmid stored in
+ *   @res_rmid (if any). Cannot fail.
+ */
 static int __pmonr_apply_uflags(struct pmonr *pmonr,
-		enum cmt_user_flags pmonr_uflags)
+		enum cmt_user_flags pmonr_uflags, bool reserve, u32 *res_rmid)
 {
+	struct pkg_data *pkgd = pmonr->pkgd;
+	u32 free_rmid;
+
+	if (WARN_ON_ONCE(!res_rmid))
+		return -EINVAL;
+	if (WARN_ON_ONCE(reserve && *res_rmid != INVALID_RMID))
+		return -EINVAL;
+
 	if (!(pmonr_uflags & CMT_UF_HAS_USER)) {
 		if (pmonr->state != PMONR_OFF) {
+			if (reserve)
+				return 0;
 			pmonr_to_unused(pmonr);
 			pmonr_unused_to_off(pmonr);
 		}
@@ -492,8 +511,40 @@ static int __pmonr_apply_uflags(struct pmonr *pmonr,
 	if (monr_is_root(pmonr->monr) && (~pmonr_uflags & root_monr_uflags))
 		return -EINVAL;
 
-	if (pmonr->state == PMONR_OFF)
-		pmonr_to_unused(pmonr);
+	if (pmonr->state == PMONR_OFF) {
+		if (!reserve)
+			pmonr_to_unused(pmonr);
+	}
+	if (pmonr->state == PMONR_ACTIVE)
+		return 0;
+	if (!(pmonr_uflags & CMT_UF_NOLAZY_RMID))
+		return 0;
+	if (pmonr->state == PMONR_DEP_DIRTY) {
+		if (!reserve)
+			pmonr_dep_dirty_to_active(pmonr);
+		return 0;
+	}
+
+	/*
+	 * At this point pmonr is in either Unused or Dep_Idle state and
+	 * needs a rmid to transition to Active.
+	 */
+	if (reserve) {
+		free_rmid = find_first_bit(pkgd->free_rmids, CMT_MAX_NR_RMIDS);
+		if (free_rmid == CMT_MAX_NR_RMIDS)
+			return -ENOSPC;
+		*res_rmid = free_rmid;
+		__clear_bit(*res_rmid, pkgd->free_rmids);
+		return 0;
+	}
+
+	/* both cases use the reserved rmid. */
+	if (pmonr->state == PMONR_UNUSED) {
+		pmonr_unused_to_active(pmonr, *res_rmid);
+	} else {
+		WARN_ON_ONCE(pmonr->state != PMONR_DEP_IDLE);
+		pmonr_dep_idle_to_active(pmonr, *res_rmid);
+	}
 
 	return 0;
 }
@@ -514,7 +565,10 @@ static bool monr_has_user(struct monr *monr)
 	       pkg_uflags_has_user(monr->pkg_uflags);
 }
 
-static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
+static int __monr_apply_uflags(struct monr *monr,
+			       enum cmt_user_flags *puflags,
+			       bool reserve,
+			       u32 *res_rmids)
 {
 	enum cmt_user_flags pmonr_uflags;
 	struct pkg_data *pkgd = NULL;
@@ -526,7 +580,10 @@ static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 		pmonr_uflags = monr->uflags |
 				(puflags ? puflags[p] : monr->pkg_uflags[p]);
 		pmonr = pkgd_pmonr(pkgd, monr);
-		err = __pmonr_apply_uflags(pmonr, pmonr_uflags);
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags,
+					   reserve, &res_rmids[p]);
+		/* The apply pass (reserve == false) should not fail. */
+		WARN_ON_ONCE(!reserve && err);
 		if (err)
 			return err;
 	}
@@ -537,17 +594,26 @@ static int __monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 /* Apply puflags for all packages or rollback and fail. */
 static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 {
+	struct pkg_data *pkgd = NULL;
+	u32 res_rmids[CMT_MAX_NR_PKGS];
 	int p, err;
 	unsigned long flags;
 
 	monr_hrchy_assert_held_mutexes();
 	monr_hrchy_acquire_locks(&flags);
 
-	err = __monr_apply_uflags(monr, puflags);
+	for (p = 0; p < CMT_MAX_NR_PKGS; p++)
+		res_rmids[p] = INVALID_RMID;
+
+	/* First call of __monr_apply_uflags will only "reserve" rmids. */
+	err = __monr_apply_uflags(monr, puflags, true, res_rmids);
 	if (err)
-		goto exit;
+		goto error;
+
+	/* second call actually applies the flags. */
+	err = __monr_apply_uflags(monr, puflags, false, res_rmids);
+	WARN_ON_ONCE(err);
 
-	/* Proceed to exit if no uflags to update to pkg_uflags. */
 	if (!puflags)
 		goto exit;
 
@@ -563,6 +629,14 @@ static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 	monr_hrchy_release_locks(&flags);
 
 	return err;
+
+error:
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		p = pkgd->pkgid;
+		if (res_rmids[p] != INVALID_RMID)
+			__set_bit(res_rmids[p], pkgd->free_rmids);
+	}
+	goto exit;
 }
 
 static inline struct monr *monr_from_event(struct perf_event *event)
@@ -583,8 +657,11 @@ static bool monr_account_uflags(struct monr *monr,
 
 	if (uflags & CMT_UF_HAS_USER)
 		monr->nr_has_user += account ? 1 : -1;
+	if (uflags & CMT_UF_NOLAZY_RMID)
+		monr->nr_nolazy_rmid += account ? 1 : -1;
 
-	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0);
+	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0) |
+			(monr->nr_nolazy_rmid ? CMT_UF_NOLAZY_RMID : 0);
 
 	return old_flags != monr->uflags;
 }
@@ -1165,6 +1242,7 @@ static int init_pkg_data(int cpu)
 	struct pmonr *pmonr;
 	unsigned long flags;
 	int err = 0;
+	u32 res_rmid;
 	u16 pkgid = topology_logical_package_id(cpu);
 
 	lockdep_assert_held(&cmt_mutex);
@@ -1190,9 +1268,25 @@ static int init_pkg_data(int cpu)
 		 */
 		RCU_INIT_POINTER(pos->pmonrs[pkgid], pmonr);
 
+		res_rmid = INVALID_RMID;
 		raw_spin_lock_irqsave(&pkgd->lock, flags);
-		err = __pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr));
+		err = __pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr),
+					   true, &res_rmid);
+		if (!err)
+			__pmonr_apply_uflags(pmonr, pmonr_uflags(pmonr),
+					     false, &res_rmid);
 		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+		/*
+		 * Do not fail the whole package initialization because a pmonr
+		 * failed to apply its uflags, just report the error.
+		 */
+		if (err) {
+			pr_err("Not enough free RMIDs in package %d for Intel CMT.\n",
+				pkgid);
+			pos->pkg_uflags[pkgid] |= CMT_UF_ERROR;
+			err = 0;
+		}
 	}
 
 	if (err) {
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index bf90c26..754a9c8 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -245,7 +245,8 @@ struct pkg_data {
 enum cmt_user_flags {
 	/* if no has_user other flags are meaningless. */
 	CMT_UF_HAS_USER		= BIT(0), /* has cgroup or event users */
-	CMT_UF_MAX		= BIT(1) - 1,
+	CMT_UF_NOLAZY_RMID	= BIT(1), /* try to obtain rmid on creation */
+	CMT_UF_MAX		= BIT(2) - 1,
 	CMT_UF_ERROR		= CMT_UF_MAX + 1,
 };
 
@@ -258,6 +259,7 @@ enum cmt_user_flags {
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
+ * @nr_nolazy_rmid:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
  * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
  * @pkg_uflags:		package level cmt_user_flags, each entry is used as
  *			pmonr uflags if that package is online.
@@ -278,6 +280,7 @@ struct monr {
 	struct list_head		parent_entry;
 
 	int				nr_has_user;
+	int				nr_nolazy_rmid;
 	enum cmt_user_flags		uflags;
 	enum cmt_user_flags		pkg_uflags[];
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (16 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events David Carrillo-Cisneros
                   ` (27 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This is the first patch for cgroup support in this series.

It adds a new field to perf_cgroup that the intel_cmt PMU uses to
associate a monr with a perf_cgroup instance.
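
As a rough sketch of the intended use, the accessors that the intel_cmt
PMU layers on top of the new field look like this (these helpers are
added by a later patch in this series; shown here only to illustrate
the field):

	static inline struct monr *monr_from_perf_cgroup(struct perf_cgroup *cgrp)
	{
		return (struct monr *)READ_ONCE(cgrp->arch_info);
	}

	static inline void perf_cgroup_set_monr(struct perf_cgroup *cgrp,
						struct monr *monr)
	{
		WRITE_ONCE(cgrp->arch_info, monr);
	}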

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 4 +++-
 kernel/events/core.c       | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0202b32..406119b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -815,7 +815,9 @@ struct perf_cgroup_info {
 };
 
 struct perf_cgroup {
-	struct cgroup_subsys_state	css;
+	/* Architecture specific information. */
+	void				 *arch_info;
+	struct cgroup_subsys_state	 css;
 	struct perf_cgroup_info	__percpu *info;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d99a51c..0de3ca5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10739,6 +10739,8 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(-ENOMEM);
 	}
 
+	jc->arch_info = NULL;
+
 	return &jc->css;
 }
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (17 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 20/46] perf/core: add pmu::event_terminate David Carrillo-Cisneros
                   ` (26 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

First part of cgroup support for CMT.

A monr's position in the monr hierarchy depends on the position of its
target cgroup or thread in the cgroup hierarchy (see code comments for
details).

A monr that monitors a cgroup keeps a reference to that cgroup in
monr->mon_cgrp; future patches use it to add support for cgroup
monitoring without requiring an active perf_event at all times.
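
As a rough illustration (a sketch, not part of the patch), assume
cgroups A/B/C nested in that order, where only A and C are monitored:

	cgroup hierarchy:  root -> A -> B -> C
	monr hierarchy:    monr_hrchy_root -> monr(A) -> monr(C)

B has no monr of its own; its perf_cgroup resolves to monr(A), the monr
of its lowest monitored ancestor, so B's tasks are accounted there.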

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 293 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/intel/cmt.h |   2 +
 2 files changed, 295 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 3883cb4..a5b7d2d 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -125,6 +125,14 @@ static inline struct pmonr *pkgd_pmonr(struct pkg_data *pkgd, struct monr *monr)
 	return rcu_dereference_check(monr->pmonrs[pkgd->pkgid], safe);
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static inline struct cgroup_subsys_state *get_root_perf_css(void)
+{
+	/* Get css for root cgroup */
+	return  init_css_set.subsys[perf_event_cgrp_id];
+}
+#endif
+
 static inline void pmonr_set_rmids(struct pmonr *pmonr,
 				   u32 sched_rmid, u32 read_rmid)
 {
@@ -416,6 +424,7 @@ static void monr_dealloc(struct monr *monr)
 
 	if (WARN_ON_ONCE(monr->nr_has_user) ||
 	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
+	    WARN_ON_ONCE(monr->mon_cgrp) ||
 	    WARN_ON_ONCE(monr->mon_events))
 		return;
 
@@ -639,6 +648,7 @@ static int monr_apply_uflags(struct monr *monr, enum cmt_user_flags *puflags)
 	goto exit;
 }
 
+/* can be NULL if the monr was for a cgroup that has gone offline. */
 static inline struct monr *monr_from_event(struct perf_event *event)
 {
 	return (struct monr *) READ_ONCE(event->hw.cmt_monr);
@@ -727,6 +737,75 @@ static int monr_append_event(struct monr *monr, struct perf_event *event)
 	return err;
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static inline struct monr *monr_from_perf_cgroup(struct perf_cgroup *cgrp)
+{
+	return (struct monr *)READ_ONCE(cgrp->arch_info);
+}
+
+static inline void perf_cgroup_set_monr(struct perf_cgroup *cgrp,
+					struct monr *monr)
+{
+	WRITE_ONCE(cgrp->arch_info, monr);
+}
+
+/* Get cgroup for both task and cgroup event. */
+static struct perf_cgroup *perf_cgroup_from_task_event(struct perf_event *event)
+{
+#ifdef CONFIG_LOCKDEP
+	bool rcu_safe = lockdep_is_held(&cmt_mutex);
+#endif
+
+	return container_of(
+		task_css_check(event->hw.target, perf_event_cgrp_id, rcu_safe),
+		struct perf_cgroup, css);
+}
+
+static struct perf_cgroup *perf_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct perf_cgroup, css);
+}
+
+/**
+ * perf_cgroup_mon_started() - Tell if cgroup is monitored by its own monr.
+ *
+ * A perf_cgroup is being monitored when it is referenced back by
+ * its monr's mon_cgrp. Otherwise, the cgroup only uses the monr used to
+ * monitor another cgroup (the one that is referenced back by monr's mon_cgrp).
+ */
+static inline bool perf_cgroup_mon_started(struct perf_cgroup *cgrp)
+{
+	struct monr *monr;
+
+	/*
+	 * monr can be referenced by a cgroup other than the one in its
+	 * mon_cgrp, be careful.
+	 */
+	monr = monr_from_perf_cgroup(cgrp);
+
+	/* The root monr has no cgroup associated before initialization. */
+	return  monr->mon_cgrp == cgrp;
+}
+
+/**
+ * perf_cgroup_find_lma() - Find @cgrp lowest monitored ancestor.
+ *
+ * Find the lowest monitored ancestor of @cgrp, not including @cgrp itself.
+ * Return: lma or NULL if no ancestor is monitored.
+ */
+struct perf_cgroup *perf_cgroup_find_lma(struct perf_cgroup *cgrp)
+{
+	struct cgroup_subsys_state *parent_css;
+
+	do {
+		parent_css = cgrp->css.parent;
+		cgrp = parent_css ? perf_cgroup_from_css(parent_css) : NULL;
+	} while (cgrp && !perf_cgroup_mon_started(cgrp));
+	return cgrp;
+}
+
+#endif
+
 /**
  * pmonr_update_sched_rmid() - Update sched_rmid for @pmonr in current package.
  *
@@ -815,6 +894,214 @@ static void monr_hrchy_remove_leaf(struct monr *monr)
 	monr_hrchy_release_locks(&flags);
 }
 
+#ifdef CONFIG_CGROUP_PERF
+
+/* Similar to css_next_descendant_pre but skips the subtree rooted by pos. */
+struct cgroup_subsys_state *
+css_skip_subtree_pre(struct cgroup_subsys_state *pos,
+		     struct cgroup_subsys_state *root)
+{
+	struct cgroup_subsys_state *next;
+
+	while (pos != root) {
+		next = css_next_child(pos, pos->parent);
+		if (next)
+			return next;
+		pos = pos->parent;
+	}
+	return NULL;
+}
+
+/* Make the monrs of all of css's descendants depend on new_monr. */
+inline void css_subtree_update_monr_dependants(struct cgroup_subsys_state *css,
+					       struct monr *new_monr)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct perf_cgroup *pos_cgrp;
+	struct monr *pos_monr;
+	unsigned long flags;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	rcu_read_lock();
+
+	pos_css = css_next_descendant_pre(css, css);
+	while (pos_css) {
+		pos_cgrp = perf_cgroup_from_css(pos_css);
+		pos_monr = monr_from_perf_cgroup(pos_cgrp);
+
+		/* Skip css that are not online, sync'ed with cmt_mutex. */
+		if (!(pos_css->flags & CSS_ONLINE)) {
+			pos_css = css_next_descendant_pre(pos_css, css);
+			continue;
+		}
+		if (!perf_cgroup_mon_started(pos_cgrp)) {
+			perf_cgroup_set_monr(pos_cgrp, new_monr);
+			pos_css = css_next_descendant_pre(pos_css, css);
+			continue;
+		}
+		rcu_read_unlock();
+
+		monr_hrchy_acquire_locks(&flags);
+		pos_monr->parent = new_monr;
+		list_move_tail(&pos_monr->parent_entry, &new_monr->children);
+		monr_hrchy_release_locks(&flags);
+
+		rcu_read_lock();
+		/*
+		 * Skip subtrees rooted by a css that owns a monr, since the
+		 * css in those subtrees use the monr at their subtree root.
+		 */
+		pos_css = css_skip_subtree_pre(pos_css, css);
+	}
+	rcu_read_unlock();
+}
+
+static inline int __css_start_monitoring(struct cgroup_subsys_state *css)
+{
+	struct perf_cgroup *cgrp, *cgrp_lma, *pos_cgrp;
+	struct monr *monr, *monr_parent, *pos_monr, *tmp_monr;
+	unsigned long flags;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	cgrp = perf_cgroup_from_css(css);
+
+	cgrp_lma = perf_cgroup_find_lma(cgrp);
+	if (!cgrp_lma) {
+		perf_cgroup_set_monr(cgrp, monr_hrchy_root);
+		monr_hrchy_root->mon_cgrp = cgrp;
+		return 0;
+	}
+	/*
+	 * The monr for the lowest monitored ancestor is the direct ancestor
+	 * of monr in the monr hierarchy.
+	 */
+	monr_parent = monr_from_perf_cgroup(cgrp_lma);
+
+	monr = monr_alloc();
+	if (IS_ERR(monr))
+		return PTR_ERR(monr);
+	/*
+	 * New monr has no children yet so it can be inserted in hierarchy as
+	 * a leaf. Since all monr's pmonr are in Off state, there is no risk
+	 * of pmonr state transitions in the scheduler path.
+	 */
+	monr_hrchy_acquire_locks(&flags);
+	monr_hrchy_insert_leaf(monr, monr_parent);
+	monr_hrchy_release_locks(&flags);
+
+	/*
+	 * Previous lock also works as a barrier to prevent attaching
+	 * the monr to cgrp before it is in monr hierarchy.
+	 */
+	perf_cgroup_set_monr(cgrp, monr);
+	monr->mon_cgrp = cgrp;
+	css_subtree_update_monr_dependants(css, monr);
+
+	monr_hrchy_acquire_locks(&flags);
+	/* Move task-event monrs that are descendant from css's cgroup. */
+	list_for_each_entry_safe(pos_monr, tmp_monr,
+				 &monr_parent->children, parent_entry) {
+		if (pos_monr->mon_cgrp)
+			continue;
+		/*
+		 * all events in event group have the same cgroup.
+		 * No RCU read lock necessary for task_css_check since calling
+		 * inside critical section.
+		 */
+		pos_cgrp = perf_cgroup_from_task_event(pos_monr->mon_events);
+		if (!cgroup_is_descendant(pos_cgrp->css.cgroup,
+					  cgrp->css.cgroup))
+			continue;
+		pos_monr->parent = monr;
+		list_move_tail(&pos_monr->parent_entry, &monr->children);
+	}
+	monr_hrchy_release_locks(&flags);
+
+	return 0;
+}
+
+static inline void __css_stop_monitoring(struct cgroup_subsys_state *css)
+{
+	struct perf_cgroup *cgrp, *cgrp_lma;
+	struct monr *monr, *monr_parent, *pos_monr;
+	unsigned long flags;
+
+	lockdep_assert_held(&cmt_mutex);
+
+	cgrp = perf_cgroup_from_css(css);
+	monr = monr_from_perf_cgroup(cgrp);
+	/*
+	 * When css is root cgroup's css, detach cgroup but do not
+	 * destroy monr.
+	 */
+	cgrp_lma = perf_cgroup_find_lma(cgrp);
+	if (!cgrp_lma) {
+		/* monr of root cgrp must be monr_hrchy_root. */
+		monr->mon_cgrp = NULL;
+		return;
+	}
+
+	monr_parent = monr_from_perf_cgroup(cgrp_lma);
+	css_subtree_update_monr_dependants(css, monr_parent);
+
+	monr_hrchy_acquire_locks(&flags);
+
+	/* Move the children monrs that are not cgroups. */
+	list_for_each_entry(pos_monr, &monr->children, parent_entry)
+		pos_monr->parent = monr_parent;
+	list_splice_tail_init(&monr->children, &monr_parent->children);
+
+	perf_cgroup_set_monr(cgrp, monr_from_perf_cgroup(cgrp_lma));
+	monr->mon_cgrp = NULL;
+	monr_hrchy_remove_leaf(monr);
+
+	monr_hrchy_release_locks(&flags);
+}
+
+static bool is_cgroup_event(struct perf_event *event)
+{
+	return event->cgrp;
+}
+
+static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
+{
+	struct monr *monr;
+	struct perf_cgroup *cgrp = event->cgrp;
+	int err;
+	bool started = false;
+
+	if (!perf_cgroup_mon_started(cgrp)) {
+		css_get(&cgrp->css);
+		err = __css_start_monitoring(&cgrp->css);
+		css_put(&cgrp->css);
+		if (err)
+			return err;
+		started = true;
+	}
+
+	monr = monr_from_perf_cgroup(cgrp);
+	err = monr_append_event(monr, event);
+	if (err && started) {
+		css_get(&cgrp->css);
+		__css_stop_monitoring(&cgrp->css);
+		css_put(&cgrp->css);
+	}
+
+	return err;
+}
+
+/* return monr of cgroup that contains the task to monitor. */
+static struct monr *monr_hrchy_get_monr_parent(struct perf_event *event)
+{
+	struct perf_cgroup *cgrp = perf_cgroup_from_task_event(event);
+
+	return monr_from_perf_cgroup(cgrp);
+}
+
+#else /* CONFIG_CGROUP_PERF */
+
 static bool is_cgroup_event(struct perf_event *event)
 {
 	return false;
@@ -834,6 +1121,8 @@ static struct monr *monr_hrchy_get_monr_parent(struct perf_event *event)
 	return monr_hrchy_root;
 }
 
+#endif
+
 static int monr_hrchy_attach_cpu_event(struct perf_event *event)
 {
 	return monr_append_event(monr_hrchy_root, event);
@@ -883,6 +1172,10 @@ static int monr_hrchy_attach_event(struct perf_event *event)
 
 static void monr_destroy(struct monr *monr)
 {
+#ifdef CONFIG_CGROUP_PERF
+	if (monr->mon_cgrp)
+		__css_stop_monitoring(&monr->mon_cgrp->css);
+#endif
 	monr_hrchy_remove_leaf(monr);
 	monr_dealloc(monr);
 }
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 754a9c8..dc52641 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -252,6 +252,7 @@ enum cmt_user_flags {
 
 /**
  * struct monr - MONitored Resource.
+ * @mon_cgrp:		The cgroup associated with this monr, if any.
  * @mon_events:		The head of event's group that use this monr, if any.
  * @entry:		List entry into cmt_event_monrs.
  * @pmonrs:		Per-package pmonrs.
@@ -271,6 +272,7 @@ enum cmt_user_flags {
  * On initialization, all monr's pmonrs start in Off state.
  */
 struct monr {
+	struct perf_cgroup		*mon_cgrp;
 	struct perf_event		*mon_events;
 	struct list_head		entry;
 	struct pmonr			**pmonrs;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 20/46] perf/core: add pmu::event_terminate
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (18 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate David Carrillo-Cisneros
                   ` (25 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The new pmu::event_terminate callback allows a PMU to access an event's
cgroup before the event is torn down.

CMT uses it to detach the cgroup from a monr before perf clears
event->cgrp.
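
A minimal sketch of how a PMU wires up the new callback (the names below
are placeholders, not from this series; intel_cmt's actual use appears
in the next patch):

	static void my_pmu_event_terminate(struct perf_event *event)
	{
		/* event->cgrp (if any) is still valid at this point. */
	}

	static struct pmu my_pmu = {
		.event_init		= my_pmu_event_init,
		.event_terminate	= my_pmu_event_terminate,
		/* ... */
	};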

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 6 ++++++
 kernel/events/core.c       | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 406119b..14dff7a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -298,6 +298,12 @@ struct pmu {
 	int (*event_init)		(struct perf_event *event);
 
 	/*
+	 * Terminate the event for this PMU. Optional complement for a
+	 * successful event_init. Called before the event fields are torn down.
+	 */
+	void (*event_terminate)		(struct perf_event *event);
+
+	/*
 	 * Notification that the event was mapped or unmapped.  Called
 	 * in the context of the mapping task.
 	 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0de3ca5..464f46d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4005,6 +4005,8 @@ static void _free_event(struct perf_event *event)
 		ring_buffer_attach(event, NULL);
 		mutex_unlock(&event->mmap_mutex);
 	}
+	if (event->pmu->event_terminate)
+		event->pmu->event_terminate(event);
 
 	if (is_cgroup_event(event))
 		perf_detach_cgroup(event);
@@ -9226,6 +9228,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	exclusive_event_destroy(event);
 
 err_pmu:
+	if (event->pmu->event_terminate)
+		event->pmu->event_terminate(event);
 	if (event->destroy)
 		event->destroy(event);
 	module_put(pmu->module);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (19 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 20/46] perf/core: add pmu::event_terminate David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop David Carrillo-Cisneros
                   ` (24 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The change from event_destroy to event_terminate makes it possible to
check whether the event is a cgroup event during event destruction,
before generic code clears event->cgrp.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index a5b7d2d..f7da8cf 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1349,7 +1349,7 @@ static int intel_cmt_event_add(struct perf_event *event, int mode)
 	return 0;
 }
 
-static void intel_cmt_event_destroy(struct perf_event *event)
+static void intel_cmt_event_terminate(struct perf_event *event)
 {
 	struct monr *monr;
 
@@ -1385,8 +1385,6 @@ static int intel_cmt_event_init(struct perf_event *event)
 	    event->attr.sample_period) /* no sampling */
 		return -EINVAL;
 
-	event->destroy = intel_cmt_event_destroy;
-
 	INIT_LIST_HEAD(&event->hw.cmt_list);
 
 	mutex_lock(&cmt_mutex);
@@ -1439,6 +1437,7 @@ static struct pmu intel_cmt_pmu = {
 	.attr_groups	     = intel_cmt_attr_groups,
 	.task_ctx_nr	     = perf_sw_context,
 	.event_init	     = intel_cmt_event_init,
+	.event_terminate     = intel_cmt_event_terminate,
 	.add		     = intel_cmt_event_add,
 	.del		     = intel_cmt_event_stop,
 	.start		     = intel_cmt_event_start,
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (20 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup David Carrillo-Cisneros
                   ` (23 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Start/stop cgroup monitoring when the intel_cmt device is
started/stopped.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 99 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f7da8cf..5c64d94 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -766,6 +766,11 @@ static struct perf_cgroup *perf_cgroup_from_css(struct cgroup_subsys_state *css)
 	return container_of(css, struct perf_cgroup, css);
 }
 
+static struct monr *monr_from_css(struct cgroup_subsys_state *css)
+{
+	return ((struct monr *)perf_cgroup_from_css(css)->arch_info);
+}
+
 /**
  * perf_cgroup_mon_started() - Tell if cgroup is monitored by its own monr.
  *
@@ -1445,6 +1450,45 @@ static struct pmu intel_cmt_pmu = {
 	.read		     = intel_cmt_event_read,
 };
 
+#ifdef CONFIG_CGROUP_PERF
+static int __css_go_online(struct cgroup_subsys_state *css)
+{
+	struct perf_cgroup *lma, *cgrp = perf_cgroup_from_css(css);
+	int err;
+
+	if (!css->parent) {
+		err = __css_start_monitoring(css);
+		if (err)
+			return err;
+		return monr_apply_uflags(monr_from_css(css), NULL);
+	}
+	lma = perf_cgroup_find_lma(cgrp);
+	perf_cgroup_set_monr(cgrp, monr_from_perf_cgroup(lma));
+
+	return 0;
+}
+
+static void __css_go_offline(struct cgroup_subsys_state *css)
+{
+	struct monr *monr;
+	struct perf_cgroup *cgrp = perf_cgroup_from_css(css);
+
+	monr = monr_from_perf_cgroup(cgrp);
+	if (!perf_cgroup_mon_started(cgrp)) {
+		perf_cgroup_set_monr(cgrp, NULL);
+		return;
+	}
+
+	monr_apply_uflags(monr, pkg_uflags_zeroes);
+	/*
+	 * Terminate monr even if there are other users (events), monr will
+	 * stay zombie until those events are terminated.
+	 */
+	monr_destroy(monr);
+}
+
+#endif
+
 static void free_pkg_data(struct pkg_data *pkg_data)
 {
 	kfree(pkg_data);
@@ -1666,6 +1710,42 @@ static const struct x86_cpu_id intel_cmt_match[] = {
 	{}
 };
 
+#ifdef CONFIG_CGROUP_PERF
+/* Start/stop monitoring for all cgroups in the cgroup hierarchy. */
+static int __switch_monitoring_all_cgroups(bool online)
+{
+	int err = 0;
+	struct cgroup_subsys_state *css_root, *css = NULL;
+
+	lockdep_assert_held(&cmt_mutex);
+	monr_hrchy_assert_held_mutexes();
+
+	rcu_read_lock();
+	/* Get css for root cgroup */
+	css_root =  get_root_perf_css();
+
+	css_for_each_descendant_pre(css, css_root) {
+		if (!css_tryget_online(css))
+			continue;
+
+		rcu_read_unlock();
+
+		if (online)
+			err = __css_go_online(css);
+		else
+			__css_go_offline(css);
+		css_put(css);
+		if (err)
+			return err;
+
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+#endif
+
 static void cmt_dealloc(void)
 {
 	kfree(monr_hrchy_root);
@@ -1682,6 +1762,13 @@ static void cmt_stop(void)
 {
 	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
 	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+	__switch_monitoring_all_cgroups(false);
+	monr_hrchy_release_mutexes();
+
+	mutex_unlock(&cmt_mutex);
 }
 
 static int __init cmt_alloc(void)
@@ -1743,6 +1830,18 @@ static int __init cmt_start(void)
 	}
 	event_attr_intel_cmt_llc_scale.event_str = str;
 
+#ifdef CONFIG_CGROUP_PERF
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+	err = __switch_monitoring_all_cgroups(true);
+	if (err)
+		__switch_monitoring_all_cgroups(false);
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+	if (err)
+		goto rm_online;
+#endif
+
 	return 0;
 
 rm_online:
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (21 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline} David Carrillo-Cisneros
                   ` (22 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The hooks allow architectures to extend the behavior of the perf cgroup.

In this patch series, they are used to let intel_cmt follow changes in
the perf_cgroup hierarchy.
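
An architecture opts in by providing the functions and defining the
matching macros in its perf headers; otherwise the empty defaults below
apply. As a sketch, the opt-in side looks like this (this is what the
x86 patch later in this series does):

	#define perf_cgroup_arch_css_online	perf_cgroup_arch_css_online
	int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);

	#define perf_cgroup_arch_css_offline	perf_cgroup_arch_css_offline
	void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);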

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 11 +++++++++++
 kernel/events/core.c       | 12 ++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 14dff7a..9f388d4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1386,4 +1386,15 @@ int perf_event_exit_cpu(unsigned int cpu);
 #define perf_event_exit_cpu	NULL
 #endif
 
+/*
+ * Hooks for architecture specific extensions for perf_cgroup.
+ */
+#ifndef perf_cgroup_arch_css_online
+# define perf_cgroup_arch_css_online(css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_offline
+# define perf_cgroup_arch_css_offline(css) do { } while (0)
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 464f46d..e11a16a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10748,6 +10748,16 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	return &jc->css;
 }
 
+static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
+{
+	return perf_cgroup_arch_css_online(css);
+}
+
+static void perf_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+	perf_cgroup_arch_css_offline(css);
+}
+
 static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
@@ -10776,6 +10786,8 @@ static void perf_cgroup_attach(struct cgroup_taskset *tset)
 
 struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
+	.css_online	= perf_cgroup_css_online,
+	.css_offline	= perf_cgroup_css_offline,
 	.css_free	= perf_cgroup_css_free,
 	.attach		= perf_cgroup_attach,
 };
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline}
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (22 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE David Carrillo-Cisneros
                   ` (21 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Use the newly introduced architecture specific hooks to update the monr
hierarchy when a cgroup goes online/offline.

Add cmt_initialized_key and use it to skip the hooks when intel_cmt is
not active.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c       | 36 ++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h | 19 +++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 5c64d94..7545deb 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -24,6 +24,7 @@ static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
 static DEFINE_MUTEX(cmt_mutex);
 /* List of monrs that are associated with an event. */
 static LIST_HEAD(cmt_event_monrs);
+DEFINE_STATIC_KEY_FALSE(cmt_initialized_key);
 
 static unsigned int cmt_l3_scale;	/* cmt hw units to bytes. */
 
@@ -1468,6 +1469,21 @@ static int __css_go_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css)
+{
+	int err = 0;
+
+	if (static_branch_unlikely(&cmt_initialized_key)) {
+		mutex_lock(&cmt_mutex);
+		monr_hrchy_acquire_mutexes();
+		err = __css_go_online(css);
+		monr_hrchy_release_mutexes();
+		mutex_unlock(&cmt_mutex);
+	}
+
+	return err;
+}
+
 static void __css_go_offline(struct cgroup_subsys_state *css)
 {
 	struct monr *monr;
@@ -1487,6 +1503,17 @@ static void __css_go_offline(struct cgroup_subsys_state *css)
 	monr_destroy(monr);
 }
 
+void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css)
+{
+	if (static_branch_unlikely(&cmt_initialized_key)) {
+		mutex_lock(&cmt_mutex);
+		monr_hrchy_acquire_mutexes();
+		__css_go_offline(css);
+		monr_hrchy_release_mutexes();
+		mutex_unlock(&cmt_mutex);
+	}
+}
+
 #endif
 
 static void free_pkg_data(struct pkg_data *pkg_data)
@@ -1760,6 +1787,12 @@ static void cmt_dealloc(void)
 
 static void cmt_stop(void)
 {
+
+	static_branch_disable(&cmt_initialized_key);
+
+	/* Make sure CMT is turned off before starting to tear it down. */
+	barrier();
+
 	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
 	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
 
@@ -1841,6 +1874,9 @@ static int __init cmt_start(void)
 	if (err)
 		goto rm_online;
 #endif
+	/* Make sure everything is ready prior to turning on the key. */
+	barrier();
+	static_branch_enable(&cmt_initialized_key);
 
 	return 0;
 
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f353061..783bdbb 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -299,4 +299,23 @@ static inline void perf_check_microcode(void) { }
 
 #define arch_perf_out_copy_user copy_from_user_nmi
 
+
+/*
+ * Hooks for architecture specific features of perf_event cgroup.
+ * Currently used by Intel's CMT.
+ */
+#ifdef CONFIG_CGROUP_PERF
+#ifdef CONFIG_INTEL_RDT_M
+
+#define perf_cgroup_arch_css_online \
+	perf_cgroup_arch_css_online
+int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_css_offline \
+	perf_cgroup_arch_css_offline
+void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);
+
+#endif
+#endif
+
 #endif /* _ASM_X86_PERF_EVENT_H */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (23 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline} David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook David Carrillo-Cisneros
                   ` (20 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add a new enumeration to store a monr's internal state information.
The new monr->flags contrasts with monr->uflags, which stores user
provided flags.

The first flag is CMT_MONR_ZOMBIE, used to signal that the monr is no
longer valid yet must be kept around until it is released by its last
user. This is useful when a monr is attached to a cgroup that is
destroyed while there are still events that reference that cgroup.
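
In short, the intended lifecycle is (a condensed sketch of the
monr_destroy() logic in the diff below, not additional code):

	if (!(monr->flags & CMT_MONR_ZOMBIE)) {
		/* detach from its cgroup and from the monr hierarchy */
		monr->flags = CMT_MONR_ZOMBIE;
	}
	/* free now only if unreferenced, otherwise when the last user goes. */
	if (!monr_has_user(monr))
		monr_dealloc(monr);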

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 17 +++++++++++++++--
 arch/x86/events/intel/cmt.h |  9 +++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 7545deb..830ce29 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -423,7 +423,12 @@ static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
 
-	if (WARN_ON_ONCE(monr->nr_has_user) ||
+	/*
+	 * Only zombie monrs with all pmonrs in Off state and not attached
+	 * to the monr hierarchy can be deallocated.
+	 */
+	if (WARN_ON_ONCE(!(monr->flags & CMT_MONR_ZOMBIE)) ||
+	    WARN_ON_ONCE(monr->nr_has_user) ||
 	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
 	    WARN_ON_ONCE(monr->mon_cgrp) ||
 	    WARN_ON_ONCE(monr->mon_events))
@@ -1178,12 +1183,20 @@ static int monr_hrchy_attach_event(struct perf_event *event)
 
 static void monr_destroy(struct monr *monr)
 {
+	if (monr->flags & CMT_MONR_ZOMBIE)
+		goto zombie;
+
 #ifdef CONFIG_CGROUP_PERF
 	if (monr->mon_cgrp)
 		__css_stop_monitoring(&monr->mon_cgrp->css);
 #endif
 	monr_hrchy_remove_leaf(monr);
-	monr_dealloc(monr);
+
+	monr->flags = CMT_MONR_ZOMBIE;
+
+zombie:
+	if (!monr_has_user(monr))
+		monr_dealloc(monr);
 }
 
 /**
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index dc52641..1e40e6b 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -240,6 +240,13 @@ struct pkg_data {
 };
 
 /**
+ * enum monr_flags - internal monr's flags.
+ */
+enum monr_flags {
+	CMT_MONR_ZOMBIE = BIT(0), /* monr has been terminated. */
+};
+
+/**
  * enum cmt_user_flags - user set flags for monr and pmonrs.
  */
 enum cmt_user_flags {
@@ -259,6 +266,7 @@ enum cmt_user_flags {
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
+ * @flags:		monr_flags.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
 * @nr_nolazy_rmid:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
  * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
@@ -281,6 +289,7 @@ struct monr {
 	struct list_head		children;
 	struct list_head		parent_entry;
 
+	enum monr_flags			flags;
 	int				nr_has_user;
 	int				nr_nolazy_rmid;
 	enum cmt_user_flags		uflags;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (24 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch David Carrillo-Cisneros
                   ` (19 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

This hook allows architecture specific code to be called right after
perf_events' context switch but before the scheduler lock is released.

It serves two uses in this patch series:
  1) Calls CMT's cgroup context switch code that updates the current RMID
  when no perf event is active (in continuous monitoring mode).
  2) Calls __pqr_ctx_switch to perform a single write of the final value
  to the slow PQR_ASSOC MSR.

This hook is different from the one used by Intel CAT in the series
currently under review in LKML. The CAT series simply adds a call to
intel_rdt_sched_in in __switch_to (see
"[PATCH v6 09/10] x86/intel_rdt: Add scheduler hook").

This series proposes using finish_arch_pre_lock_switch instead, since
CMT's integration with perf_events requires the intel rdt common code's
context switch to occur after perf's context switch and before the
switch lock is released, in order to perform (1) correctly.
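
For reference, an architecture enables the hook by defining the macro in
its own headers; as a sketch, this is what a later patch in this series
does for x86:

	/* arch/x86/include/asm/processor.h (added later in the series): */
	#define finish_arch_pre_lock_switch intel_pqr_ctx_switch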

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 kernel/sched/core.c  | 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..2138ee6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2766,6 +2766,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
+	finish_arch_pre_lock_switch();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..0a0208e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1112,6 +1112,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
+#ifndef finish_arch_pre_lock_switch
+# define finish_arch_pre_lock_switch()	do { } while (0)
+#endif
 #ifndef finish_arch_post_lock_switch
 # define finish_arch_post_lock_switch()	do { } while (0)
 #endif
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (25 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read David Carrillo-Cisneros
                   ` (18 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

In order to support intel_cmt for cgroups with no perf_event, this driver
calls intel_pqr_ctx_switch during a task context switch, after perf_event
has added/deleted all events (if any), using the newly introduced
finish_arch_pre_lock_switch().

If pmu->add sets an rmid, the next rmid is "marked" using the flag
PQR_RMID_FLAG_EVENT. Removing an event sets the root rmid without the
flag PQR_RMID_FLAG_EVENT, to indicate that it can be overwritten by an
actively monitored cgroup.

If no event added to the pmu wrote an rmid into the PQR MSR
(i.e. PQR_RMID_FLAG_EVENT is not set), then __pqr_ctx_switch calls
__intel_cmt_no_event_sched_in to update the PQR cache with the rmid of
the active perf cgroup's monr.

In some cases, a change to the PQR MSR software cache must be written
through immediately, using the flag PQR_FLAG_WT (see the sketch after
this list). These are:
  1) updating PQR_ASSOC when a cpu goes offline.
  2) the first execution slot of a newly exec'd task, since newly exec'd
  tasks run before a context switch.
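
A short sketch of how callers pass these flags (both calls are taken
from the diff below):

	/* Event scheduled in: rmid is event-owned, maybe write-through. */
	pqr_cache_update_rmid(rmids.sched_rmid,
			      PQR_RMID_FLAG_EVENT | (wt ? PQR_FLAG_WT : 0));

	/* CPU goes offline: reset the rmid and flush it to the MSR now. */
	pqr_cache_update_rmid(0, PQR_FLAG_WT);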

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c             | 39 ++++++++++++++++++++++++++++-----
 arch/x86/include/asm/intel_rdt_common.h | 38 ++++++++++++++++++++++++++++----
 arch/x86/include/asm/processor.h        |  4 ++++
 arch/x86/kernel/cpu/intel_rdt_common.c  | 29 ++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 830ce29..bd903ae 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1323,12 +1323,13 @@ static void intel_cmt_event_read(struct perf_event *event)
 }
 
 static inline void __intel_cmt_event_start(struct perf_event *event,
-					   union pmonr_rmids rmids)
+					   union pmonr_rmids rmids, bool wt)
 {
 	if (!(event->hw.state & PERF_HES_STOPPED))
 		return;
 	event->hw.state &= ~PERF_HES_STOPPED;
-	pqr_cache_update_rmid(rmids.sched_rmid);
+	pqr_cache_update_rmid(rmids.sched_rmid,
+			PQR_RMID_FLAG_EVENT | (wt ? PQR_FLAG_WT : 0));
 }
 
 static void intel_cmt_event_start(struct perf_event *event, int mode)
@@ -1336,7 +1337,7 @@ static void intel_cmt_event_start(struct perf_event *event, int mode)
 	union pmonr_rmids rmids;
 
 	rmids = monr_get_sched_in_rmids(monr_from_event(event));
-	__intel_cmt_event_start(event, rmids);
+	__intel_cmt_event_start(event, rmids, false);
 }
 
 static void intel_cmt_event_stop(struct perf_event *event, int mode)
@@ -1352,18 +1353,25 @@ static void intel_cmt_event_stop(struct perf_event *event, int mode)
 	 * reads occur even if event is Inactive. Therefore there is no need to
 	 * read when event is stopped.
 	 */
-	pqr_cache_update_rmid(rmids.sched_rmid);
+	pqr_cache_update_rmid(rmids.sched_rmid, 0);
 }
 
 static int intel_cmt_event_add(struct perf_event *event, int mode)
 {
 	union pmonr_rmids rmids;
+	bool wt;
 
 	event->hw.state = PERF_HES_STOPPED;
 	rmids = monr_get_sched_in_rmids(monr_from_event(event));
+	/*
+	 * A newly exec'd task may run from the file loader without a context
+	 * switch; if so, the PQR sw cache would not update the rmid in hw.
+	 * Avoid that by requesting write-through mode on the PQR sw cache.
+	 */
+	wt = event->total_time_running == 0;
 
 	if (mode & PERF_EF_START)
-		__intel_cmt_event_start(event, rmids);
+		__intel_cmt_event_start(event, rmids, wt);
 
 	return 0;
 }
@@ -1697,7 +1705,7 @@ static int intel_cmt_hp_online_exit(unsigned int cpu)
 	struct pkg_data *pkgd;
 	u16 pkgid = topology_logical_package_id(cpu);
 
-	pqr_cache_update_rmid(0);
+	pqr_cache_update_rmid(0, PQR_FLAG_WT);
 	memset(state, 0, sizeof(*state));
 
 	rcu_read_lock();
@@ -1932,6 +1940,8 @@ static int __init intel_cmt_init(void)
 	rcu_read_unlock();
 	pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);
 
+	static_branch_inc(&pqr_common_enable_key);
+
 	return err;
 
 err_stop:
@@ -1943,4 +1953,21 @@ static int __init intel_cmt_init(void)
 	return err;
 }
 
+/* Schedule task without a CMT perf_event. */
+inline void __intel_cmt_no_event_sched_in(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+	struct monr *monr;
+	union pmonr_rmids rmids;
+
+	/* Assume CMT enabled is likely given that PQR is enabled. */
+	if (!static_branch_likely(&cmt_initialized_key))
+		return;
+	/* Safe to call from_task since we are in scheduler lock. */
+	monr = monr_from_perf_cgroup(perf_cgroup_from_task(current, NULL));
+	rmids = monr_get_sched_in_rmids(monr);
+	pqr_cache_update_rmid(rmids.sched_rmid, 0);
+#endif
+}
+
 device_initcall(intel_cmt_init);
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index 1d5e691..d8d0dc3 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -3,6 +3,7 @@
 
 #if defined(CONFIG_INTEL_RDT_A) || defined(CONFIG_INTEL_RDT_M)
 
+#include <linux/jump_label.h>
 #include <linux/types.h>
 #include <asm/percpu.h>
 #include <asm/msr.h>
@@ -10,35 +11,49 @@
 #define MSR_IA32_PQR_ASSOC	0x0c8f
 
 
+extern struct static_key_false pqr_common_enable_key;
+
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
  * @rmid:		The cached Resource Monitoring ID
  * @next_rmid:		Next rmid to write to hw
  * @closid:		The cached Class Of Service ID
  * @next_closid:	Next closid to write to hw
+ * @next_rmid_flags:	Next rmid's flags
  *
  * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
  * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
  * contains both parts, so we need to cache them.
  *
  * The cache also helps to avoid pointless updates if the value does
- * not change.
+ * not change. It also keeps track of the type of RMID set (event vs
+ * no event) used to determine when a cgroup RMID is required.
  */
 struct intel_pqr_state {
 	u32			rmid;
 	u32			next_rmid;
 	u32			closid;
 	u32			next_closid;
+	int			next_rmid_flags;
 };
 
+#define PQR_FLAG_WT		BIT(0) /* write-through. */
+#define PQR_RMID_FLAG_EVENT	BIT(1) /* associated to a perf_event. */
+
 DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 
-static inline void pqr_cache_update_rmid(u32 rmid)
+static inline void pqr_cache_update_rmid(u32 rmid, int next_rmid_flags)
 {
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 
 	state->next_rmid = rmid;
-	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+	state->next_rmid_flags |= next_rmid_flags;
+
+	if (next_rmid_flags & PQR_FLAG_WT) {
+		state->rmid = rmid;
+		wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+		state->next_rmid_flags &= ~PQR_FLAG_WT;
+	}
 }
 
 static inline void pqr_cache_update_closid(u32 closid)
@@ -46,7 +61,22 @@ static inline void pqr_cache_update_closid(u32 closid)
 	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 
 	state->next_closid = closid;
-	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
+
+void __pqr_ctx_switch(void);
+
+inline void __intel_cmt_no_event_sched_in(void);
+
+static inline void intel_pqr_ctx_switch(void)
+{
+	if (static_branch_unlikely(&pqr_common_enable_key))
+		__pqr_ctx_switch();
+}
+
+#else
+
+static inline void intel_pqr_ctx_switch(void)
+{
 }
 
 #endif
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 984a7bf..967ea93 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@ struct vm86;
 #include <asm/nops.h>
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
+#include <asm/intel_rdt_common.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -856,4 +857,7 @@ bool xen_set_default_idle(void);
 
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
+
+#define finish_arch_pre_lock_switch intel_pqr_ctx_switch
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
index 7fd5b20..6e45287 100644
--- a/arch/x86/kernel/cpu/intel_rdt_common.c
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -6,3 +6,32 @@
  * must ensure interruptions are handled properly.
  */
 DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
+
+/*
+ * Update hw's RMID using cgroup's if perf_event did not.
+ * Sync pqr cache with MSR.
+ */
+inline void __pqr_ctx_switch(void)
+{
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+	/*
+	 * Obtain a rmid from current task's cgroup if no perf event
+	 * set a rmid.
+	 */
+	if (likely(!(state->next_rmid_flags & PQR_RMID_FLAG_EVENT)))
+		__intel_cmt_no_event_sched_in();
+
+	state->next_rmid_flags = 0;
+
+	/* __intel_cmt_no_event_sched_in might have changed next_rmid. */
+	if (likely(state->rmid == state->next_rmid &&
+			state->closid == state->next_closid))
+		return;
+
+	state->rmid = state->next_rmid;
+	state->closid = state->next_closid;
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (26 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read David Carrillo-Cisneros
                   ` (17 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

New PMUs, such as CMT's, do not guarantee that a read will succeed even
if pmu::add was successful.

In the generic code, this patch adds an int error return and completes the
error checking path up to perf_read().

In CMT's PMU, it adds proper error handling of hw read failures.
In other PMUs, pmu::read() simply returns 0.
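
As a quick orientation before the per-architecture churn below, the
converted callback has the following shape. This is only an illustrative
sketch; example_pmu_read() and example_read_hw_counter() are made-up
names, not code from this series:

static int example_pmu_read(struct perf_event *event)
{
	u64 val;

	/*
	 * A PMU whose hardware read can fail now propagates the error,
	 * which reaches perf_read(); a PMU that cannot fail keeps its
	 * old body and simply ends with "return 0;".
	 */
	if (example_read_hw_counter(event, &val))
		return -ENODATA;

	local64_set(&event->count, val);
	return 0;
}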

Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/alpha/kernel/perf_event.c           |  3 +-
 arch/arc/kernel/perf_event.c             |  3 +-
 arch/arm64/include/asm/hw_breakpoint.h   |  2 +-
 arch/arm64/kernel/hw_breakpoint.c        |  3 +-
 arch/metag/kernel/perf/perf_event.c      |  5 +--
 arch/mips/kernel/perf_event_mipsxx.c     |  3 +-
 arch/powerpc/include/asm/hw_breakpoint.h |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c      |  3 +-
 arch/powerpc/perf/core-book3s.c          | 11 +++---
 arch/powerpc/perf/core-fsl-emb.c         |  5 +--
 arch/powerpc/perf/hv-24x7.c              |  5 +--
 arch/powerpc/perf/hv-gpci.c              |  3 +-
 arch/s390/kernel/perf_cpum_cf.c          |  5 +--
 arch/s390/kernel/perf_cpum_sf.c          |  3 +-
 arch/sh/include/asm/hw_breakpoint.h      |  2 +-
 arch/sh/kernel/hw_breakpoint.c           |  3 +-
 arch/sparc/kernel/perf_event.c           |  2 +-
 arch/tile/kernel/perf_event.c            |  3 +-
 arch/x86/events/amd/ibs.c                |  2 +-
 arch/x86/events/amd/iommu.c              |  5 +--
 arch/x86/events/amd/uncore.c             |  3 +-
 arch/x86/events/core.c                   |  3 +-
 arch/x86/events/intel/bts.c              |  3 +-
 arch/x86/events/intel/cmt.c              |  4 ++-
 arch/x86/events/intel/cstate.c           |  3 +-
 arch/x86/events/intel/pt.c               |  3 +-
 arch/x86/events/intel/rapl.c             |  3 +-
 arch/x86/events/intel/uncore.c           |  3 +-
 arch/x86/events/intel/uncore.h           |  2 +-
 arch/x86/events/msr.c                    |  3 +-
 arch/x86/include/asm/hw_breakpoint.h     |  2 +-
 arch/x86/kernel/hw_breakpoint.c          |  3 +-
 arch/x86/kvm/pmu.h                       | 10 +++---
 drivers/bus/arm-cci.c                    |  3 +-
 drivers/bus/arm-ccn.c                    |  3 +-
 drivers/perf/arm_pmu.c                   |  3 +-
 include/linux/perf_event.h               |  6 ++--
 kernel/events/core.c                     | 60 ++++++++++++++++++++------------
 38 files changed, 120 insertions(+), 73 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 5c218aa..3bf8a60 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -520,11 +520,12 @@ static void alpha_pmu_del(struct perf_event *event, int flags)
 }
 
 
-static void alpha_pmu_read(struct perf_event *event)
+static int alpha_pmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
 	alpha_perf_event_update(event, hwc, hwc->idx, 0);
+	return 0;
 }
 
 
diff --git a/arch/arc/kernel/perf_event.c b/arch/arc/kernel/perf_event.c
index 2ce24e7..efbcc2d 100644
--- a/arch/arc/kernel/perf_event.c
+++ b/arch/arc/kernel/perf_event.c
@@ -116,9 +116,10 @@ static void arc_perf_event_update(struct perf_event *event,
 	local64_sub(delta, &hwc->period_left);
 }
 
-static void arc_pmu_read(struct perf_event *event)
+static int arc_pmu_read(struct perf_event *event)
 {
 	arc_perf_event_update(event, &event->hw, event->hw.idx);
+	return 0;
 }
 
 static int arc_pmu_cache_event(u64 config)
diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
index 9510ace..f82f21f 100644
--- a/arch/arm64/include/asm/hw_breakpoint.h
+++ b/arch/arm64/include/asm/hw_breakpoint.h
@@ -127,7 +127,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 
 extern int arch_install_hw_breakpoint(struct perf_event *bp);
 extern void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-extern void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern int hw_breakpoint_pmu_read(struct perf_event *bp);
 extern int hw_breakpoint_slots(int type);
 
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index 948b731..a4fe1c5 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -936,8 +936,9 @@ static int __init arch_hw_breakpoint_init(void)
 }
 arch_initcall(arch_hw_breakpoint_init);
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
+	return 0;
 }
 
 /*
diff --git a/arch/metag/kernel/perf/perf_event.c b/arch/metag/kernel/perf/perf_event.c
index 052cba2..128aa0a 100644
--- a/arch/metag/kernel/perf/perf_event.c
+++ b/arch/metag/kernel/perf/perf_event.c
@@ -360,15 +360,16 @@ static void metag_pmu_del(struct perf_event *event, int flags)
 	perf_event_update_userpage(event);
 }
 
-static void metag_pmu_read(struct perf_event *event)
+static int metag_pmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
 	/* Don't read disabled counters! */
 	if (hwc->idx < 0)
-		return;
+		return 0;
 
 	metag_pmu_event_update(event, hwc, hwc->idx);
+	return 0;
 }
 
 static struct pmu pmu = {
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index d3ba9f4..d64ed3e 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -507,7 +507,7 @@ static void mipspmu_del(struct perf_event *event, int flags)
 	perf_event_update_userpage(event);
 }
 
-static void mipspmu_read(struct perf_event *event)
+static int mipspmu_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
@@ -516,6 +516,7 @@ static void mipspmu_read(struct perf_event *event)
 		return;
 
 	mipspmu_event_update(event, hwc, hwc->idx);
+	return 0;
 }
 
 static void mipspmu_enable(struct pmu *pmu)
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index ac6432d..5218696 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -66,7 +66,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 						unsigned long val, void *data);
 int arch_install_hw_breakpoint(struct perf_event *bp);
 void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
 extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);
 
 extern struct pmu perf_ops_bp;
diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 9781c69..8a016ad 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -364,7 +364,8 @@ void flush_ptrace_hw_breakpoint(struct task_struct *tsk)
 	t->ptrace_bps[0] = NULL;
 }
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
 	/* TODO */
+	return 0;
 }
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 72c27b8..1019d1e 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1002,20 +1002,20 @@ static u64 check_and_compute_delta(u64 prev, u64 val)
 	return delta;
 }
 
-static void power_pmu_read(struct perf_event *event)
+static int power_pmu_read(struct perf_event *event)
 {
 	s64 val, delta, prev;
 
 	if (event->hw.state & PERF_HES_STOPPED)
-		return;
+		return 0;
 
 	if (!event->hw.idx)
-		return;
+		return 0;
 
 	if (is_ebb_event(event)) {
 		val = read_pmc(event->hw.idx);
 		local64_set(&event->hw.prev_count, val);
-		return;
+		return 0;
 	}
 
 	/*
@@ -1029,7 +1029,7 @@ static void power_pmu_read(struct perf_event *event)
 		val = read_pmc(event->hw.idx);
 		delta = check_and_compute_delta(prev, val);
 		if (!delta)
-			return;
+			return 0;
 	} while (local64_cmpxchg(&event->hw.prev_count, prev, val) != prev);
 
 	local64_add(delta, &event->count);
@@ -1049,6 +1049,7 @@ static void power_pmu_read(struct perf_event *event)
 		if (val < 1)
 			val = 1;
 	} while (local64_cmpxchg(&event->hw.period_left, prev, val) != prev);
+	return 0;
 }
 
 /*
diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
index 5d747b4..46d982e 100644
--- a/arch/powerpc/perf/core-fsl-emb.c
+++ b/arch/powerpc/perf/core-fsl-emb.c
@@ -176,12 +176,12 @@ static void write_pmlcb(int idx, unsigned long val)
 	isync();
 }
 
-static void fsl_emb_pmu_read(struct perf_event *event)
+static int fsl_emb_pmu_read(struct perf_event *event)
 {
 	s64 val, delta, prev;
 
 	if (event->hw.state & PERF_HES_STOPPED)
-		return;
+		return 0;
 
 	/*
 	 * Performance monitor interrupts come even when interrupts
@@ -198,6 +198,7 @@ static void fsl_emb_pmu_read(struct perf_event *event)
 	delta = (val - prev) & 0xfffffffful;
 	local64_add(delta, &event->count);
 	local64_sub(delta, &event->hw.period_left);
+	return 0;
 }
 
 /*
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 7b2ca16..41cd973 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1268,7 +1268,7 @@ static void update_event_count(struct perf_event *event, u64 now)
 	local64_add(now - prev, &event->count);
 }
 
-static void h_24x7_event_read(struct perf_event *event)
+static int h_24x7_event_read(struct perf_event *event)
 {
 	u64 now;
 	struct hv_24x7_request_buffer *request_buffer;
@@ -1289,7 +1289,7 @@ static void h_24x7_event_read(struct perf_event *event)
 		int ret;
 
 		if (__this_cpu_read(hv_24x7_txn_err))
-			return;
+			return 0;
 
 		request_buffer = (void *)get_cpu_var(hv_24x7_reqb);
 
@@ -1323,6 +1323,7 @@ static void h_24x7_event_read(struct perf_event *event)
 		now = h_24x7_get_value(event);
 		update_event_count(event, now);
 	}
+	return 0;
 }
 
 static void h_24x7_event_start(struct perf_event *event, int flags)
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 43fabb3..66c1ce7 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -191,12 +191,13 @@ static u64 h_gpci_get_value(struct perf_event *event)
 	return count;
 }
 
-static void h_gpci_event_update(struct perf_event *event)
+static int h_gpci_event_update(struct perf_event *event)
 {
 	s64 prev;
 	u64 now = h_gpci_get_value(event);
 	prev = local64_xchg(&event->hw.prev_count, now);
 	local64_add(now - prev, &event->count);
+	return 0;
 }
 
 static void h_gpci_event_start(struct perf_event *event, int flags)
diff --git a/arch/s390/kernel/perf_cpum_cf.c b/arch/s390/kernel/perf_cpum_cf.c
index 037c2a2..37fa78c 100644
--- a/arch/s390/kernel/perf_cpum_cf.c
+++ b/arch/s390/kernel/perf_cpum_cf.c
@@ -471,12 +471,13 @@ static int hw_perf_event_update(struct perf_event *event)
 	return err;
 }
 
-static void cpumf_pmu_read(struct perf_event *event)
+static int cpumf_pmu_read(struct perf_event *event)
 {
 	if (event->hw.state & PERF_HES_STOPPED)
-		return;
+		return 0;
 
 	hw_perf_event_update(event);
+	return 0;
 }
 
 static void cpumf_pmu_start(struct perf_event *event, int flags)
diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index fcc634c..87fd04a 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -1296,9 +1296,10 @@ static void hw_perf_event_update(struct perf_event *event, int flush_all)
 				    sampl_overflow, event_overflow);
 }
 
-static void cpumsf_pmu_read(struct perf_event *event)
+static int cpumsf_pmu_read(struct perf_event *event)
 {
 	/* Nothing to do ... updates are interrupt-driven */
+	return 0;
 }
 
 /* Activate sampling control.
diff --git a/arch/sh/include/asm/hw_breakpoint.h b/arch/sh/include/asm/hw_breakpoint.h
index ec9ad59..d3ad1bf 100644
--- a/arch/sh/include/asm/hw_breakpoint.h
+++ b/arch/sh/include/asm/hw_breakpoint.h
@@ -60,7 +60,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 
 int arch_install_hw_breakpoint(struct perf_event *bp);
 void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
 
 extern void arch_fill_perf_breakpoint(struct perf_event *bp);
 extern int register_sh_ubc(struct sh_ubc *);
diff --git a/arch/sh/kernel/hw_breakpoint.c b/arch/sh/kernel/hw_breakpoint.c
index 2197fc5..3a2e719 100644
--- a/arch/sh/kernel/hw_breakpoint.c
+++ b/arch/sh/kernel/hw_breakpoint.c
@@ -401,9 +401,10 @@ int __kprobes hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 	return hw_breakpoint_handler(data);
 }
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
 	/* TODO */
+	return 0;
 }
 
 int register_sh_ubc(struct sh_ubc *ubc)
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 710f327..ab118e9 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1131,7 +1131,7 @@ static void sparc_pmu_del(struct perf_event *event, int _flags)
 	local_irq_restore(flags);
 }
 
-static void sparc_pmu_read(struct perf_event *event)
+static int sparc_pmu_read(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int idx = active_event_index(cpuc, event);
diff --git a/arch/tile/kernel/perf_event.c b/arch/tile/kernel/perf_event.c
index 6394c1c..e18cdf2 100644
--- a/arch/tile/kernel/perf_event.c
+++ b/arch/tile/kernel/perf_event.c
@@ -734,9 +734,10 @@ static void tile_pmu_del(struct perf_event *event, int flags)
 /*
  * Propagate event elapsed time into the event.
  */
-static inline void tile_pmu_read(struct perf_event *event)
+static inline int tile_pmu_read(struct perf_event *event)
 {
 	tile_perf_event_update(event);
+	return 0;
 }
 
 /*
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index b26ee32..3de2ada 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -511,7 +511,7 @@ static void perf_ibs_del(struct perf_event *event, int flags)
 	perf_event_update_userpage(event);
 }
 
-static void perf_ibs_read(struct perf_event *event) { }
+static int perf_ibs_read(struct perf_event *event) { return 0; }
 
 PMU_FORMAT_ATTR(rand_en,	"config:57");
 PMU_FORMAT_ATTR(cnt_ctl,	"config:19");
diff --git a/arch/x86/events/amd/iommu.c b/arch/x86/events/amd/iommu.c
index b28200d..2bfcaaa 100644
--- a/arch/x86/events/amd/iommu.c
+++ b/arch/x86/events/amd/iommu.c
@@ -317,7 +317,7 @@ static void perf_iommu_start(struct perf_event *event, int flags)
 
 }
 
-static void perf_iommu_read(struct perf_event *event)
+static int perf_iommu_read(struct perf_event *event)
 {
 	u64 count = 0ULL;
 	u64 prev_raw_count = 0ULL;
@@ -335,13 +335,14 @@ static void perf_iommu_read(struct perf_event *event)
 	prev_raw_count =  local64_read(&hwc->prev_count);
 	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
 					count) != prev_raw_count)
-		return;
+		return 0;
 
 	/* Handling 48-bit counter overflowing */
 	delta = (count << COUNTER_SHIFT) - (prev_raw_count << COUNTER_SHIFT);
 	delta >>= COUNTER_SHIFT;
 	local64_add(delta, &event->count);
 
+	return 0;
 }
 
 static void perf_iommu_stop(struct perf_event *event, int flags)
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 65577f0..c84415d 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -73,7 +73,7 @@ static struct amd_uncore *event_to_amd_uncore(struct perf_event *event)
 	return NULL;
 }
 
-static void amd_uncore_read(struct perf_event *event)
+static int amd_uncore_read(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	u64 prev, new;
@@ -90,6 +90,7 @@ static void amd_uncore_read(struct perf_event *event)
 	delta = (new << COUNTER_SHIFT) - (prev << COUNTER_SHIFT);
 	delta >>= COUNTER_SHIFT;
 	local64_add(delta, &event->count);
+	return 0;
 }
 
 static void amd_uncore_start(struct perf_event *event, int flags)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d31735f..9e52a7b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1848,9 +1848,10 @@ static int __init init_hw_perf_events(void)
 }
 early_initcall(init_hw_perf_events);
 
-static inline void x86_pmu_read(struct perf_event *event)
+static inline int x86_pmu_read(struct perf_event *event)
 {
 	x86_perf_event_update(event);
+	return 0;
 }
 
 /*
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 982c9e3..6832d59 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -575,8 +575,9 @@ static int bts_event_init(struct perf_event *event)
 	return 0;
 }
 
-static void bts_event_read(struct perf_event *event)
+static int bts_event_read(struct perf_event *event)
 {
+	return 0;
 }
 
 static __init int bts_init(void)
diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index bd903ae..ef1000f 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1318,8 +1318,10 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 	return pos->parent;
 }
 
-static void intel_cmt_event_read(struct perf_event *event)
+static int intel_cmt_event_read(struct perf_event *event)
 {
+	/* To add support in next patches in series */
+	return -ENOTSUPP;
 }
 
 static inline void __intel_cmt_event_start(struct perf_event *event,
diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 3ca87b5..b96fed8 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -323,7 +323,7 @@ static inline u64 cstate_pmu_read_counter(struct perf_event *event)
 	return val;
 }
 
-static void cstate_pmu_event_update(struct perf_event *event)
+static int cstate_pmu_event_update(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	u64 prev_raw_count, new_raw_count;
@@ -337,6 +337,7 @@ static void cstate_pmu_event_update(struct perf_event *event)
 		goto again;
 
 	local64_add(new_raw_count - prev_raw_count, &event->count);
+	return 0;
 }
 
 static void cstate_pmu_event_start(struct perf_event *event, int mode)
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index c5047b8..c46f946 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1356,8 +1356,9 @@ static int pt_event_add(struct perf_event *event, int mode)
 	return ret;
 }
 
-static void pt_event_read(struct perf_event *event)
+static int pt_event_read(struct perf_event *event)
 {
+	return 0;
 }
 
 static void pt_event_destroy(struct perf_event *event)
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 0a535ce..37c3dff 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -411,9 +411,10 @@ static int rapl_pmu_event_init(struct perf_event *event)
 	return ret;
 }
 
-static void rapl_pmu_event_read(struct perf_event *event)
+static int rapl_pmu_event_read(struct perf_event *event)
 {
 	rapl_event_update(event);
+	return 0;
 }
 
 static ssize_t rapl_get_attr_cpumask(struct device *dev,
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index efca268..ad7d035 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -580,10 +580,11 @@ static void uncore_pmu_event_del(struct perf_event *event, int flags)
 	event->hw.last_tag = ~0ULL;
 }
 
-void uncore_pmu_event_read(struct perf_event *event)
+int uncore_pmu_event_read(struct perf_event *event)
 {
 	struct intel_uncore_box *box = uncore_event_to_box(event);
 	uncore_perf_event_update(box, event);
+	return 0;
 }
 
 /*
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
index ad986c1..818b0ef 100644
--- a/arch/x86/events/intel/uncore.h
+++ b/arch/x86/events/intel/uncore.h
@@ -345,7 +345,7 @@ struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu
 u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
 void uncore_pmu_start_hrtimer(struct intel_uncore_box *box);
 void uncore_pmu_cancel_hrtimer(struct intel_uncore_box *box);
-void uncore_pmu_event_read(struct perf_event *event);
+int uncore_pmu_event_read(struct perf_event *event);
 void uncore_perf_event_update(struct intel_uncore_box *box, struct perf_event *event);
 struct event_constraint *
 uncore_get_constraint(struct intel_uncore_box *box, struct perf_event *event);
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index 4bb3ec6..add5caf 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -170,7 +170,7 @@ static inline u64 msr_read_counter(struct perf_event *event)
 
 	return now;
 }
-static void msr_event_update(struct perf_event *event)
+static int msr_event_update(struct perf_event *event)
 {
 	u64 prev, now;
 	s64 delta;
@@ -188,6 +188,7 @@ static void msr_event_update(struct perf_event *event)
 		delta = sign_extend64(delta, 31);
 
 	local64_add(delta, &event->count);
+	return 0;
 }
 
 static void msr_event_start(struct perf_event *event, int flags)
diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index 6c98be8..a1c4ce00 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -59,7 +59,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
 
 int arch_install_hw_breakpoint(struct perf_event *bp);
 void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
 void hw_breakpoint_pmu_unthrottle(struct perf_event *bp);
 
 extern void
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 8771766..e72ce6e 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -540,7 +540,8 @@ int hw_breakpoint_exceptions_notify(
 	return hw_breakpoint_handler(data);
 }
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
 	/* TODO */
+	return 0;
 }
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f96e1f9..46fd299 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -39,12 +39,14 @@ static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
 
 static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 {
-	u64 counter, enabled, running;
+	u64 counter, counter_tmp, enabled, running;
 
 	counter = pmc->counter;
-	if (pmc->perf_event)
-		counter += perf_event_read_value(pmc->perf_event,
-						 &enabled, &running);
+	if (pmc->perf_event) {
+		if (!perf_event_read_value(pmc->perf_event, &counter_tmp,
+					   &enabled, &running))
+			counter += counter_tmp;
+	}
 	/* FIXME: Scaling needed? */
 	return counter & pmc_bitmask(pmc);
 }
diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
index 8900823..2d27c70 100644
--- a/drivers/bus/arm-cci.c
+++ b/drivers/bus/arm-cci.c
@@ -1033,9 +1033,10 @@ static u64 pmu_event_update(struct perf_event *event)
 	return new_raw_count;
 }
 
-static void pmu_read(struct perf_event *event)
+static int pmu_read(struct perf_event *event)
 {
 	pmu_event_update(event);
+	return 0;
 }
 
 static void pmu_event_set_period(struct perf_event *event)
diff --git a/drivers/bus/arm-ccn.c b/drivers/bus/arm-ccn.c
index d1074d9..846ed8b 100644
--- a/drivers/bus/arm-ccn.c
+++ b/drivers/bus/arm-ccn.c
@@ -1145,9 +1145,10 @@ static void arm_ccn_pmu_event_del(struct perf_event *event, int flags)
 		hrtimer_cancel(&ccn->dt.hrtimer);
 }
 
-static void arm_ccn_pmu_event_read(struct perf_event *event)
+static int arm_ccn_pmu_event_read(struct perf_event *event)
 {
 	arm_ccn_pmu_event_update(event);
+	return 0;
 }
 
 static void arm_ccn_pmu_enable(struct pmu *pmu)
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index b37b572..aee7dff 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -163,10 +163,11 @@ u64 armpmu_event_update(struct perf_event *event)
 	return new_raw_count;
 }
 
-static void
+static int
 armpmu_read(struct perf_event *event)
 {
 	armpmu_event_update(event);
+	return 0;
 }
 
 static void
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9f388d4..9120640 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -366,7 +366,7 @@ struct pmu {
 	 * For sampling capable PMUs this will also update the software period
 	 * hw_perf_event::period_left field.
 	 */
-	void (*read)			(struct perf_event *event);
+	int (*read)			(struct perf_event *event);
 
 	/*
 	 * Group events scheduling is treated as a transaction, add
@@ -886,8 +886,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 extern void perf_pmu_migrate_context(struct pmu *pmu,
 				int src_cpu, int dst_cpu);
 extern u64 perf_event_read_local(struct perf_event *event);
-extern u64 perf_event_read_value(struct perf_event *event,
-				 u64 *enabled, u64 *running);
+extern int perf_event_read_value(struct perf_event *event,
+				 u64 *total, u64 *enabled, u64 *running);
 
 
 struct perf_sample_data {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e11a16a..059e5bb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2721,7 +2721,7 @@ static void __perf_event_sync_stat(struct perf_event *event,
 	 */
 	switch (event->state) {
 	case PERF_EVENT_STATE_ACTIVE:
-		event->pmu->read(event);
+		(void)event->pmu->read(event);
 		/* fall-through */
 
 	case PERF_EVENT_STATE_INACTIVE:
@@ -3473,6 +3473,7 @@ static void __perf_event_read(void *info)
 		return;
 
 	raw_spin_lock(&ctx->lock);
+
 	if (ctx->is_active) {
 		update_context_time(ctx);
 		update_cgrp_time_from_event(event);
@@ -3483,14 +3484,15 @@ static void __perf_event_read(void *info)
 		goto unlock;
 
 	if (!data->group) {
-		pmu->read(event);
-		data->ret = 0;
+		data->ret = pmu->read(event);
 		goto unlock;
 	}
 
 	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
 
-	pmu->read(event);
+	data->ret = pmu->read(event);
+	if (data->ret)
+		goto unlock;
 
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		update_event_times(sub);
@@ -3499,7 +3501,9 @@ static void __perf_event_read(void *info)
 			 * Use sibling's PMU rather than @event's since
 			 * sibling could be on different (eg: software) PMU.
 			 */
-			sub->pmu->read(sub);
+			data->ret = sub->pmu->read(sub);
+			if (data->ret)
+				goto unlock;
 		}
 	}
 
@@ -3520,6 +3524,7 @@ static inline u64 perf_event_count(struct perf_event *event)
  *   - either for the current task, or for this CPU
  *   - does not have inherit set, for inherited task events
  *     will not be local and we cannot read them atomically
+ *   - pmu::read cannot fail
  */
 u64 perf_event_read_local(struct perf_event *event)
 {
@@ -3552,7 +3557,7 @@ u64 perf_event_read_local(struct perf_event *event)
 	 * oncpu == -1).
 	 */
 	if (event->oncpu == smp_processor_id())
-		event->pmu->read(event);
+		(void) event->pmu->read(event);
 
 	val = local64_read(&event->count);
 	local_irq_restore(flags);
@@ -3611,7 +3616,6 @@ static int perf_event_read(struct perf_event *event, bool group)
 			update_event_times(event);
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
-
 	return ret;
 }
 
@@ -4219,18 +4223,22 @@ static int perf_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+int perf_event_read_value(struct perf_event *event,
+			  u64 *total, u64 *enabled, u64 *running)
 {
 	struct perf_event *child;
-	u64 total = 0;
 
+	int ret;
+	*total = 0;
 	*enabled = 0;
 	*running = 0;
 
 	mutex_lock(&event->child_mutex);
 
-	(void)perf_event_read(event, false);
-	total += perf_event_count(event);
+	ret = perf_event_read(event, false);
+	if (ret)
+		goto exit;
+	*total += perf_event_count(event);
 
 	*enabled += event->total_time_enabled +
 			atomic64_read(&event->child_total_time_enabled);
@@ -4238,14 +4246,17 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 			atomic64_read(&event->child_total_time_running);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
-		(void)perf_event_read(child, false);
-		total += perf_event_count(child);
+		ret = perf_event_read(child, false);
+		if (ret)
+			goto exit;
+		*total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
 	}
+exit:
 	mutex_unlock(&event->child_mutex);
 
-	return total;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
@@ -4342,9 +4353,11 @@ static int perf_read_one(struct perf_event *event,
 {
 	u64 enabled, running;
 	u64 values[4];
-	int n = 0;
+	int n = 0, ret;
 
-	values[n++] = perf_event_read_value(event, &enabled, &running);
+	ret = perf_event_read_value(event, &values[n++], &enabled, &running);
+	if (ret)
+		return ret;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
 		values[n++] = enabled;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
@@ -5625,7 +5638,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 		values[n++] = running;
 
 	if (leader != event)
-		leader->pmu->read(leader);
+		(void)leader->pmu->read(leader);
 
 	values[n++] = perf_event_count(leader);
 	if (read_format & PERF_FORMAT_ID)
@@ -5638,7 +5651,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 
 		if ((sub != event) &&
 		    (sub->state == PERF_EVENT_STATE_ACTIVE))
-			sub->pmu->read(sub);
+			(void)sub->pmu->read(sub);
 
 		values[n++] = perf_event_count(sub);
 		if (read_format & PERF_FORMAT_ID)
@@ -7349,8 +7362,9 @@ void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 	preempt_enable_notrace();
 }
 
-static void perf_swevent_read(struct perf_event *event)
+static int perf_swevent_read(struct perf_event *event)
 {
+	return 0;
 }
 
 static int perf_swevent_add(struct perf_event *event, int flags)
@@ -8263,7 +8277,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
 		return HRTIMER_NORESTART;
 
-	event->pmu->read(event);
+	(void)event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs = get_irq_regs();
@@ -8378,9 +8392,10 @@ static void cpu_clock_event_del(struct perf_event *event, int flags)
 	cpu_clock_event_stop(event, flags);
 }
 
-static void cpu_clock_event_read(struct perf_event *event)
+static int cpu_clock_event_read(struct perf_event *event)
 {
 	cpu_clock_event_update(event);
+	return 0;
 }
 
 static int cpu_clock_event_init(struct perf_event *event)
@@ -8455,13 +8470,14 @@ static void task_clock_event_del(struct perf_event *event, int flags)
 	task_clock_event_stop(event, PERF_EF_UPDATE);
 }
 
-static void task_clock_event_read(struct perf_event *event)
+static int task_clock_event_read(struct perf_event *event)
 {
 	u64 now = perf_clock();
 	u64 delta = now - event->ctx->timestamp;
 	u64 time = event->ctx->time + delta;
 
 	task_clock_event_update(event, time);
+	return 0;
 }
 
 static int task_clock_event_init(struct perf_event *event)
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (27 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events David Carrillo-Cisneros
                   ` (16 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

First part of intel_cmt_event_read: error conditions and a placeholder
for the functionality added in upcoming patches in this series.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index ef1000f..f5ab48e 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1320,6 +1320,31 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 
 static int intel_cmt_event_read(struct perf_event *event)
 {
+	struct monr *monr = monr_from_event(event);
+
+	/*
+	 * preemption disabled since called holding
+	 * event's ctx->lock raw_spin_lock.
+	 */
+	WARN_ON_ONCE(!preempt_count());
+
+	/* terminated monrs are zombies and must not be read. */
+	if (WARN_ON_ONCE(monr->flags & CMT_MONR_ZOMBIE))
+		return -ENXIO;
+
+	/*
+	 * Only event parent can return a value, everyone else share its
+	 * rmid and therefore doesn't track occupancy independently.
+	 */
+	if (event->parent) {
+		local64_set(&event->count, 0);
+		return 0;
+	}
+
+	if (event->attach_state & PERF_ATTACH_TASK) {
+		/* To add support in next patches in series */
+		return -ENOTSUPP;
+	}
 	/* To add support in next patches in series */
 	return -ENOTSUPP;
 }
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (28 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events David Carrillo-Cisneros
                   ` (15 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Reading CMT/MBM task events in intel_cmt poses a challenge since it
requires reading from multiple sockets (usually accomplished with
an IPI) while called with interrupts disabled.

The current upstream driver avoids the problematic read with
interrupts disabled by making perf_event_read() a dummy (no-op) for
llc_occupancy task events. The actual read is performed in
perf_event_count() whenever perf_event_count() is called with
interrupts enabled. This works but changes the expected behavior of
perf_event_read() and perf_event_count().

This patch follows a different approach: it performs asynchronous reads
of all remote packages and waits until either the reads complete or a
deadline expires. It returns an error if an IPI does not complete
on time.

This asynchronous approach has advantages:
  1) It does not alter perf_event_count().
  2) perf_event_read() performs a real read for all types of events.
  3) Reads in all packages are executed in parallel. Parallel reads are
  especially advantageous because reading CMT/MBM events is slow
  (it requires a sequential read and write to two MSRs). I measured an
  llc_occupancy read on my HSW system to take ~1250 cycles.
  Parallel reads of all caches will become a bigger advantage with
  upcoming larger processors (up to 8 packages) and when CMT support
  for L2 is rolled out, since task events will require a read of all
  L2 cache units.

This patch also introduces struct cmt_csd and a per-package array of
cmt_csd's (one per rmid). This array is used to control potentially
concurrent reads of each rmid's event.
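
In outline, the read path added below runs in two phases. The condensed
sketch here only shows the shape; for_each_remote_package(),
package_cpu() and for_each_issued_ccsd() are placeholder helpers, not
real kernel APIs:

	/* Phase 1: fire one asynchronous IPI per remote package. */
	for_each_remote_package(pkgd) {
		ccsd = &pkgd->ccsds[read_rmid];
		/* Only one issuer per rmid at a time. */
		if (atomic_inc_return(&ccsd->on_read) > 1)
			return -EBUSY;
		smp_call_function_single_async(package_cpu(pkgd), &ccsd->csd);
	}

	/* Deadline taken after issuing, so every package gets the full wait. */
	deadline = get_jiffies_64() + msecs_to_jiffies(CMT_IPI_WAIT_TIME);

	/* Phase 2: poll each issued cmt_csd until done or deadline. */
	for_each_issued_ccsd(ccsd) {
		while (atomic_read(&ccsd->on_read) &&
		       time_before64(get_jiffies_64(), deadline))
			cpu_relax();
		if (atomic_read(&ccsd->on_read))
			return -EBUSY;	/* the IPI took too long */
		count += ccsd->value;
	}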

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 206 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h |  14 +++
 2 files changed, 217 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f5ab48e..f9195ec 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -8,6 +8,12 @@
 #include "cmt.h"
 #include "../perf_event.h"
 
+#define RMID_VAL_UNAVAIL	BIT_ULL(62)
+#define RMID_VAL_ERROR		BIT_ULL(63)
+
+#define MSR_IA32_QM_CTR		0x0c8e
+#define MSR_IA32_QM_EVTSEL	0x0c8d
+
 #define QOS_L3_OCCUP_EVENT_ID	BIT_ULL(0)
 #define QOS_EVENT_MASK		QOS_L3_OCCUP_EVENT_ID
 
@@ -1229,6 +1235,41 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
 	return false;
 }
 
+/* Must be called in a cpu in rmid's package. */
+static int cmt_rmid_read(u32 rmid, u64 *val)
+{
+	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
+	rdmsrl(MSR_IA32_QM_CTR, *val);
+
+	/* Ignore this reading on error states and do not update the value. */
+	if (WARN_ON_ONCE(*val & RMID_VAL_ERROR))
+		return -EINVAL;
+	if (WARN_ON_ONCE(*val & RMID_VAL_UNAVAIL))
+		return -ENODATA;
+
+	return 0;
+}
+
+/* time to wait before time out rmid read IPI */
+#define CMT_IPI_WAIT_TIME	100	/* ms */
+
+static void smp_call_rmid_read(void *data)
+{
+	struct cmt_csd *ccsd = (struct cmt_csd *)data;
+
+	ccsd->ret = cmt_rmid_read(ccsd->rmid, &ccsd->value);
+
+	/*
+	 * smp_call_function_single_async must have cleared csd.flags
+	 * before invoking func.
+	 */
+	WARN_ON_ONCE(ccsd->csd.flags);
+
+	/* ensure values are stored before clearing on_read. */
+	barrier();
+	atomic_set(&ccsd->on_read, 0);
+}
+
 static struct pmu intel_cmt_pmu;
 
 /* Try to find a monr with same target, otherwise create new one. */
@@ -1318,9 +1359,145 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 	return pos->parent;
 }
 
+/* Issue reads to CPUs in remote packages. */
+static int issue_read_remote_pkgs(struct monr *monr,
+				  struct cmt_csd **issued_ccsds,
+				  u32 *local_rmid)
+{
+	struct cmt_csd *ccsd;
+	struct pmonr *pmonr;
+	struct pkg_data *pkgd = NULL;
+	union pmonr_rmids rmids;
+	int err = 0, read_cpu;
+	u16 p, local_pkgid = topology_logical_package_id(smp_processor_id());
+
+	/* Issue remote packages. */
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+
+		pmonr = pkgd_pmonr(pkgd, monr);
+		/* Retrieve rmid and check state without acquiring pkg locks. */
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		/* Skip Off and Unused states. */
+		if (rmids.sched_rmid == INVALID_RMID)
+			continue;
+		/*
+		 * pmonrs in Dep_{Idle,Dirty} states have run without
+		 * their own rmid and would report wrong occupancy.
+		 */
+		if (rmids.read_rmid == INVALID_RMID) {
+			err = -EBUSY;
+			goto exit;
+		}
+		p = pkgd->pkgid;
+		if (p == local_pkgid) {
+			*local_rmid = rmids.read_rmid;
+			continue;
+		}
+		ccsd = &pkgd->ccsds[rmids.read_rmid];
+		/*
+		 * Reads of remote packages are only required for task events.
+		 * pmu->read in task events is serialized by task_ctx->lock in
+		 * perf generic code. Events with same task target share rmid
+		 * and task_ctx->lock, so there is no need to support
+		 * concurrent remote reads to same RMID.
+		 *
+		 * ccsd->on_read could be not zero if a read expired before,
+		 * in that rare case, fail now and hope next time the ongoing
+		 * IPI will have completed.
+		 */
+		if (atomic_inc_return(&ccsd->on_read) > 1) {
+			err = -EBUSY;
+			goto exit;
+		}
+		issued_ccsds[p] = ccsd;
+		read_cpu = cpumask_any(topology_core_cpumask(pkgd->work_cpu));
+		err = smp_call_function_single_async(read_cpu, &ccsd->csd);
+		if (WARN_ON_ONCE(err))
+			goto exit;
+	}
+exit:
+	rcu_read_unlock();
+
+	return err;
+}
+
+/*
+ * Fail if the IPI hasn't finished by @deadline if @count != NULL.
+ * @count == NULL signals no update and therefore no reason to wait.
+ */
+static int read_issued_pkgs(struct cmt_csd **issued_ccsds,
+			    u64 deadline, u64 *count)
+{
+	struct cmt_csd *ccsd;
+	int p;
+
+	for (p = 0; p < CMT_MAX_NR_PKGS; p++) {
+		ccsd = issued_ccsds[p];
+		if (!ccsd)
+			continue;
+
+		/* A smp_cond_acquire on ccsd->on_read and time. */
+		while (atomic_read(&ccsd->on_read) &&
+				 time_before64(get_jiffies_64(), deadline))
+			cpu_relax();
+
+		/*
+		 * guarantee that ccsd->ret and ccsd->value are read after
+		 * read or deadline.
+		 */
+		smp_rmb();
+
+		/* last IPI took unusually long. */
+		if (WARN_ON_ONCE(atomic_read(&ccsd->on_read)))
+			return -EBUSY;
+		/* ccsd->on_read is always cleared after csd.flags. */
+		if (WARN_ON_ONCE(ccsd->csd.flags))
+			return -EBUSY;
+		if (ccsd->ret)
+			return ccsd->ret;
+
+		*count += ccsd->value;
+	}
+
+	return 0;
+}
+
+static int read_all_pkgs(struct monr *monr, int wait_time_ms, u64 *count)
+{
+	struct cmt_csd *issued_ccsds[CMT_MAX_NR_PKGS];
+	int err = 0;
+	u32 local_rmid = INVALID_RMID;
+	u64 deadline, val;
+
+	*count = 0;
+	memset(issued_ccsds, 0, CMT_MAX_NR_PKGS * sizeof(*issued_ccsds));
+	err = issue_read_remote_pkgs(monr, issued_ccsds, &local_rmid);
+	if (err)
+		return err;
+	/*
+	 * Save deadline after issuing reads so that all packages have at
+	 * least wait_time_ms to complete.
+	 */
+	deadline = get_jiffies_64() + msecs_to_jiffies(wait_time_ms);
+
+	/* Read local package. */
+	if (local_rmid != INVALID_RMID) {
+		err = cmt_rmid_read(local_rmid, &val);
+		if (WARN_ON_ONCE(err))
+			return err;
+		*count += val;
+	}
+
+	return read_issued_pkgs(issued_ccsds, deadline, count);
+}
+
 static int intel_cmt_event_read(struct perf_event *event)
 {
 	struct monr *monr = monr_from_event(event);
+	u64 count;
+	u16 pkgid = topology_logical_package_id(smp_processor_id());
+	int err;
 
 	/*
 	 * preemption disabled since called holding
@@ -1342,11 +1519,17 @@ static int intel_cmt_event_read(struct perf_event *event)
 	}
 
 	if (event->attach_state & PERF_ATTACH_TASK) {
+		/* It's a task event. */
+		err = read_all_pkgs(monr, CMT_IPI_WAIT_TIME, &count);
+	} else {
 		/* To add support in next patches in series */
 		return -ENOTSUPP;
 	}
-	/* To add support in next patches in series */
-	return -ENOTSUPP;
+	if (err)
+		return err;
+	local64_set(&event->count, count);
+
+	return 0;
 }
 
 static inline void __intel_cmt_event_start(struct perf_event *event,
@@ -1566,15 +1749,17 @@ void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css)
 
 static void free_pkg_data(struct pkg_data *pkg_data)
 {
+	kfree(pkg_data->ccsds);
 	kfree(pkg_data);
 }
 
 /* Init pkg_data for @cpu 's package. */
 static struct pkg_data *alloc_pkg_data(int cpu)
 {
+	struct cmt_csd *ccsd;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	struct pkg_data *pkgd;
-	int numa_node = cpu_to_node(cpu);
+	int r, ccsds_nr_bytes, numa_node = cpu_to_node(cpu);
 	u16 pkgid = topology_logical_package_id(cpu);
 
 	if (pkgid >= CMT_MAX_NR_PKGS) {
@@ -1618,6 +1803,21 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 	lockdep_set_class(&pkgd->lock, &lock_keys[pkgid]);
 #endif
 
+	ccsds_nr_bytes = (pkgd->max_rmid + 1) * sizeof(*(pkgd->ccsds));
+	pkgd->ccsds = kzalloc_node(ccsds_nr_bytes, GFP_KERNEL, numa_node);
+	if (!pkgd->ccsds) {
+		free_pkg_data(pkgd);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	for (r = 0; r <= pkgd->max_rmid; r++) {
+		ccsd = &pkgd->ccsds[r];
+		ccsd->rmid = r;
+		ccsd->csd.func = smp_call_rmid_read;
+		ccsd->csd.info = ccsd;
+		__set_bit(r, pkgd->free_rmids);
+	}
+
 	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);
 
 	return pkgd;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 1e40e6b..8bb43bd 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -191,6 +191,19 @@ struct pmonr {
 	enum pmonr_state			state;
 };
 
+/**
+ * struct cmt_csd - data for the async IPI call that reads rmids on remote packages.
+ *
+ * One per rmid per package. One issuer at a time. Readers wait on @on_read.
+ */
+struct cmt_csd {
+	struct call_single_data csd;
+	atomic_t		on_read;
+	u64			value;
+	int			ret;
+	u32			rmid;
+};
+
 /*
  * Compile constant required for bitmap macros.
  * Broadwell EP has 2 rmids per logical core, use twice as many as upper bound.
@@ -237,6 +250,7 @@ struct pkg_data {
 	unsigned int		work_cpu;
 	u32			max_rmid;
 	u16			pkgid;
+	struct cmt_csd		*ccsds;
 };
 
 /**
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (29 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags David Carrillo-Cisneros
                   ` (14 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

An llc_occupancy read for a cgroup event must read llc_occupancy for all
monrs in or below the event's monr.

The cgroup's monr's pmonr must have a valid rmid for the read to be
meaningful. Descendant pmonrs that do not have a valid read_rmid are
skipped since their occupancy is already included in the occupancy of
their Lowest Monitored Ancestor (lma) pmonr.
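
Condensed, the per-package accumulation added below looks roughly like
this; apart from pkgd_pmonr(), the helpers are placeholder names, not
the real functions in the patch:

	/* Occupancy of the subtree rooted at the event's monr, one package. */
	total = rmid_llc_occupancy(root_read_rmid);
	for_each_descendant_monr(pos, root_monr) {	/* pre-order walk */
		pmonr = pkgd_pmonr(pkgd, pos);
		/*
		 * Skip pmonrs without a valid read_rmid: their occupancy
		 * is already charged to their lma's rmid.
		 */
		if (!pmonr_has_valid_read_rmid(pmonr))
			continue;
		total += rmid_llc_occupancy(pmonr_read_rmid(pmonr));
	}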

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index f9195ec..275d128 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1359,6 +1359,110 @@ static struct monr *monr_next_descendant_post(struct monr *pos,
 	return pos->parent;
 }
 
+static int read_subtree_rmids(u32 root_r, unsigned long *rmids_bm, u64 *total)
+{
+	u64 val;
+	int r, err;
+
+	/* first iteration reads root's rmid. */
+	r = root_r;
+	do {
+		if (r != INVALID_RMID) {
+			err = cmt_rmid_read(r, &val);
+			if (WARN_ON_ONCE(err))
+				return err;
+			(*total) += val;
+		}
+		if (!rmids_bm)
+			break;
+		if (root_r != INVALID_RMID) {
+			root_r = INVALID_RMID;
+			r = find_first_bit(rmids_bm, CMT_MAX_NR_RMIDS);
+		} else {
+			r = find_next_bit(rmids_bm, CMT_MAX_NR_RMIDS, r + 1);
+		}
+	} while (r < CMT_MAX_NR_RMIDS);
+
+	return 0;
+}
+
+/**
+ * pmonr_read_subtree() - Read occupancy for a pmonr subtree.
+ *
+ * Read and add occupancy for all pmonrs in the subtree rooted at
+ * @root_pmonr->monr and in @root_pmonr->pkgd package.
+ * Fast-path for common case of a leaf pmonr, otherwise, a best effort
+ * two-stages read:
+ *   1) read all rmids in subtree with pkgd->lock held, and
+ *   2) read and add occupancy for rmids in previous stage, without locks held.
+ */
+static int pmonr_read_subtree(struct pmonr *root_pmonr, u64 *total)
+{
+	struct monr *pos;
+	struct pkg_data *pkgd = root_pmonr->pkgd;
+	struct pmonr *pmonr;
+	union pmonr_rmids rmids;
+	int err = 0, root_r;
+	unsigned long flags, *rmids_bm = NULL;
+
+	*total = 0;
+	rmids.value = atomic64_read(&root_pmonr->atomic_rmids);
+	/*
+	 * The root of the subtree must be in Unused or Active state for the
+	 * read to be meaningful (Unused pmonrs have zero occupancy), yet its
+	 * descendants can be in Dep_{Idle,Dirty} since those states use their
+	 * Lowest Monitored Ancestor's rmid.
+	 */
+	if (rmids.sched_rmid == INVALID_RMID) {
+		/* Unused state. */
+		if (rmids.read_rmid == 0)
+			root_r = INVALID_RMID;
+		else
+		/* Off state. */
+			return -ENODATA;
+	} else {
+		/* Dep_{Idle, Dirty} state. */
+		if (rmids.sched_rmid != rmids.read_rmid)
+			return -ENODATA;
+		/* Active state */
+		root_r = rmids.read_rmid;
+	}
+	/*
+	 * Lock-less fast-path for common case of childless monr. No need
+	 * to lock for list_empty since either path leads to a read that is
+	 * correct at some time close to the moment the check happens.
+	 */
+	if (list_empty(&root_pmonr->monr->children))
+		goto read_rmids;
+
+	rmids_bm = kzalloc(CMT_MAX_NR_RMIDS_BYTES, GFP_ATOMIC);
+	if (!rmids_bm)
+		return -ENOMEM;
+
+	/* Lock to protect against changes in the pmonr hierarchy. */
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+	/* Starts on subtree's first child. */
+	pos = root_pmonr->monr;
+	while ((pos = monr_next_descendant_pre(pos, root_pmonr->monr))) {
+		/* protected by pkgd lock. */
+		pmonr = pkgd_pmonr(pkgd, pos);
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		/* Exclude all pmonrs not in Active or Dep_Dirty states. */
+		if (rmids.sched_rmid == INVALID_RMID ||
+		    rmids.read_rmid == INVALID_RMID)
+			continue;
+		__set_bit(rmids.read_rmid, rmids_bm);
+	}
+
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+read_rmids:
+	err = read_subtree_rmids(root_r, rmids_bm, total);
+	kfree(rmids_bm);
+
+	return err;
+}
+
 /* Issue reads to CPUs in remote packages. */
 static int issue_read_remote_pkgs(struct monr *monr,
 				  struct cmt_csd **issued_ccsds,
@@ -1522,8 +1626,13 @@ static int intel_cmt_event_read(struct perf_event *event)
 		/* It's a task event. */
 		err = read_all_pkgs(monr, CMT_IPI_WAIT_TIME, &count);
 	} else {
-		/* To add support in next patches in series */
-		return -ENOTSUPP;
+		struct pmonr *pmonr;
+
+		/* It's either a cgroup or a cpu event. */
+		rcu_read_lock();
+		pmonr = rcu_dereference(monr->pmonrs[pkgid]);
+		err = pmonr_read_subtree(pmonr, &count);
+		rcu_read_unlock();
 	}
 	if (err)
 		return err;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (30 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt David Carrillo-Cisneros
                   ` (13 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Introduce two new PERF_EV_CAP_READ capabilities to save unnecessary IPIs.
Since the PMU hw keeps track of rmids at all times, both capabilities in
this patch allow events to be read even when inactive.

These capabilities also remove the need to read the value of an event on
pmu->stop (already baked into previous patches).
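
On the PMU side, usage is a one-time choice at event init; this mirrors
how the CMT patch later in this series wires it up:

	/*
	 * Task events can be read on any CPU in any package; CPU and
	 * cgroup events only within event->cpu's package. Both can be
	 * read while inactive, since the hardware keeps tracking rmids.
	 */
	if (event->cpu < 0)
		event->event_caps |= PERF_EV_CAP_READ_ANY_PKG;
	else
		event->event_caps |= PERF_EV_CAP_READ_ANY_CPU_PKG;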

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 16 +++++++--
 kernel/events/core.c       | 84 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 75 insertions(+), 25 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9120640..72fe105 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -510,13 +510,23 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
 
 /*
  * Event capabilities. For event_caps and groups caps.
+ * Only one of the PERF_EV_CAP_READ_* can be set at a time.
  *
- * PERF_EV_CAP_SOFTWARE: Is a software event.
- * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
- * from any CPU in the package where it is active.
+ * PERF_EV_CAP_SOFTWARE: A software event.
+ *
+ * PERF_EV_CAP_READ_ACTIVE_PKG: An event readable from any CPU in the
+ * package where it is active.
+ *
+ * PERF_EV_CAP_READ_ANY_CPU_PKG: A CPU (or cgroup) event readable from any
+ * CPU in its event->cpu's package, even if inactive.
+ *
+ * PERF_EV_CAP_READ_ANY_PKG: An event readable from any CPU in any package,
+ * even if inactive.
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_READ_ANY_CPU_PKG	BIT(2)
+#define PERF_EV_CAP_READ_ANY_PKG	BIT(3)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 059e5bb..77afd68 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3432,22 +3432,55 @@ static void perf_event_enable_on_exec(int ctxn)
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
+	bool read_inactive;
 	int ret;
 };
 
-static int find_cpu_to_read(struct perf_event *event, int local_cpu)
+static int find_cpu_to_read(struct perf_event *event, bool *read_inactive)
 {
-	int event_cpu = event->oncpu;
+	bool active = event->state == PERF_EVENT_STATE_ACTIVE;
+	int local_cpu, event_cpu = active ? event->oncpu : event->cpu;
 	u16 local_pkg, event_pkg;
 
+	/* Do not read if event is neither Active nor Inactive. */
+	if (event->state <= PERF_EVENT_STATE_OFF) {
+		*read_inactive = false;
+		return -1;
+	}
+
+	local_cpu = get_cpu();
+	if (event->group_caps & PERF_EV_CAP_READ_ANY_PKG) {
+		*read_inactive = true;
+		event_cpu = local_cpu;
+		goto exit;
+	}
+
+	/* Neither active nor a CPU or cgroup event. */
+	if (event_cpu < 0) {
+		*read_inactive = false;
+		goto exit;
+	}
+
+	*read_inactive = event->group_caps & PERF_EV_CAP_READ_ANY_CPU_PKG;
+	if (!active && !*read_inactive)
+		goto exit;
+
+	/* Could be Inactive and have PERF_EV_CAP_READ_ANY_CPU_PKG. */
 	if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
 		event_pkg =  topology_physical_package_id(event_cpu);
 		local_pkg =  topology_physical_package_id(local_cpu);
 
 		if (event_pkg == local_pkg)
-			return local_cpu;
+			event_cpu = local_cpu;
 	}
 
+exit:
+	/*
+	 *  __perf_event_read tolerates change of local cpu.
+	 * There is no need to keep CPU pinned.
+	 */
+	put_cpu();
+
 	return event_cpu;
 }
 
@@ -3461,15 +3494,16 @@ static void __perf_event_read(void *info)
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct pmu *pmu = event->pmu;
+	bool active, read_inactive = data->read_inactive;
 
 	/*
-	 * If this is a task context, we need to check whether it is
-	 * the current task context of this cpu.  If not it has been
-	 * scheduled out before the smp call arrived.  In that case
-	 * event->count would have been updated to a recent sample
-	 * when the event was scheduled out.
+	 * If this is a task context and !read_inactive, we need to check
+	 * whether it is the current task context of this cpu.
+	 * If not it has been scheduled out before the smp call arrived.
+	 * In that case event->count would have been updated to a recent
+	 * sample when the event was scheduled out.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
+	if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
 		return;
 
 	raw_spin_lock(&ctx->lock);
@@ -3480,7 +3514,13 @@ static void __perf_event_read(void *info)
 	}
 
 	update_event_times(event);
-	if (event->state != PERF_EVENT_STATE_ACTIVE)
+
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		goto unlock;
+
+	/* If event->state > Off, then it's either Active or Inactive. */
+	active = event->state == PERF_EVENT_STATE_ACTIVE;
+	if (!active && !read_inactive)
 		goto unlock;
 
 	if (!data->group) {
@@ -3496,7 +3536,12 @@ static void __perf_event_read(void *info)
 
 	list_for_each_entry(sub, &event->sibling_list, group_entry) {
 		update_event_times(sub);
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+		/*
+		 * Since leader is Active, siblings are either Active or
+		 * Inactive.
+		 */
+		active = sub->state == PERF_EVENT_STATE_ACTIVE;
+		if (active || read_inactive) {
 			/*
 			 * Use sibling's PMU rather than @event's since
 			 * sibling could be on different (eg: software) PMU.
@@ -3567,23 +3612,18 @@ u64 perf_event_read_local(struct perf_event *event)
 
 static int perf_event_read(struct perf_event *event, bool group)
 {
-	int ret = 0, cpu_to_read, local_cpu;
+	bool read_inactive;
+	int ret = 0, cpu_to_read;
 
-	/*
-	 * If event is enabled and currently active on a CPU, update the
-	 * value in the event structure:
-	 */
-	if (event->state == PERF_EVENT_STATE_ACTIVE) {
+	cpu_to_read = find_cpu_to_read(event, &read_inactive);
+
+	if (cpu_to_read >= 0) {
 		struct perf_read_data data = {
 			.event = event,
 			.group = group,
+			.read_inactive = read_inactive,
 			.ret = 0,
 		};
-
-		local_cpu = get_cpu();
-		cpu_to_read = find_cpu_to_read(event, local_cpu);
-		put_cpu();
-
 		/*
 		 * Purposely ignore the smp_call_function_single() return
 		 * value.
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (31 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION David Carrillo-Cisneros
                   ` (12 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Use new flags in CMT pmu.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 275d128..614b2f4 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1733,6 +1733,15 @@ static int intel_cmt_event_init(struct perf_event *event)
 
 	INIT_LIST_HEAD(&event->hw.cmt_list);
 
+	/*
+	 * Task events can be read in any CPU in any package. CPU events
+	 * only in CPU's package. Both can read even if inactive.
+	 */
+	if (event->cpu < 0)
+		event->event_caps |= PERF_EV_CAP_READ_ANY_PKG;
+	else
+		event->event_caps |= PERF_EV_CAP_READ_ANY_CPU_PKG;
+
 	mutex_lock(&cmt_mutex);
 
 	err = mon_group_setup_event(event);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (32 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt David Carrillo-Cisneros
                   ` (11 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

The generic code handles the cgroup hierarchy by adding to the PMU the
events of all the ancestor cgroups of the cgroup to read.
This approach is incompatible with the CMT hw, which only allows one rmid
per virtual core at a time. CMT's PMU works around this limitation by
internally maintaining the hierarchical dependency between monitored
cgroups (the monr hierarchy).

The flag introduced in this patch signals to the generic code that this
cgroup event does not need its ancestors' events to be added recursively.
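For illustration only, a minimal sketch of the two matching behaviours
(hedged: this is not a hunk of the patch; the helper name is made up, the
flag and field names follow this series, and cgroup_is_descendant() is the
usual kernel helper, assuming kernel context):

  static bool cgrp_event_matches(struct perf_event *event,
				 struct cgroup *running_cgrp)
  {
	struct cgroup *ev_cgrp = event->cgrp->css.cgroup;

	/* The CMT PMU tracks descendants itself: exact match only. */
	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
		return running_cgrp == ev_cgrp;

	/* Default: an event on /A is also enabled for /A/B, /A/B/C, ... */
	return cgroup_is_descendant(running_cgrp, ev_cgrp);
  }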

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 5 +++++
 kernel/events/core.c       | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 72fe105..3b1d542 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -522,11 +522,16 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  *
  * PERF_EV_CAP_READ_ANY_PKG: An event readable from any CPU in any package,
  * even if inactive.
+ *
+ * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
+ * cgroup scoping. It does not need to be enabled for all of its descendant
+ * cgroups.
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
 #define PERF_EV_CAP_READ_ANY_CPU_PKG	BIT(2)
 #define PERF_EV_CAP_READ_ANY_PKG	BIT(3)
+#define PERF_EV_CAP_CGROUP_NO_RECURSION	BIT(4)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 77afd68..4f43c75 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -590,6 +590,9 @@ perf_cgroup_match(struct perf_event *event)
 	if (!cpuctx->cgrp)
 		return false;
 
+	if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
+		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
+
 	/*
 	 * Cgroup scoping is recursive.  An event enabled for a cgroup is
 	 * also enabled for all its descendant cgroups.  If @cpuctx's
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (33 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes David Carrillo-Cisneros
                   ` (10 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Use newly added flag.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 614b2f4..194038b 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1734,6 +1734,14 @@ static int intel_cmt_event_init(struct perf_event *event)
 	INIT_LIST_HEAD(&event->hw.cmt_list);
 
 	/*
+	 * CMT hw only allows one rmid per core at a time and therefore
+	 * it is not compatible with the way generic code handles cgroup
+	 * dependencies.
+	 */
+	if (event->cgrp)
+		event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
+
+	/*
 	 * Task events can be read in any CPU in any package. CPU events
 	 * only in CPU's package. Both can read even if inactive.
 	 */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (34 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup David Carrillo-Cisneros
                   ` (9 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Allow architectures to define additional attributes for the perf cgroup.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h | 4 ++++
 kernel/events/core.c       | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3b1d542..26e6ee3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1412,4 +1412,8 @@ int perf_event_exit_cpu(unsigned int cpu);
 # define perf_cgroup_arch_css_offline(css) do { } while (0)
 #endif
 
+#ifndef PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f43c75..b6ca765 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10849,5 +10849,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_offline	= perf_cgroup_css_offline,
 	.css_free	= perf_cgroup_css_free,
 	.attach		= perf_cgroup_attach,
+	/* Expand architecture specific attributes. */
+	PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (35 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation David Carrillo-Cisneros
                   ` (8 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Expose the attribute intel_cmt.cont_monitoring to perf cgroups using
the newly introduced hook PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS.

The format of the new attribute is a semicolon-separated list of per-package
hexadecimal flags; missing or empty entries are assigned zero.

  echo "1;2" > g1/intel_cmt.cont_monitoring

This implies 0x1 for pkg 0, 0x2 for pkg 1, and 0x0 for all other pkgs.

This patch introduces the basic ideas of per-package uflags through
a cgroup attribute. The format can be changed to match the Intel CAT
schemata file format once that is settled.
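As a hedged illustration of the format from user space (not part of the
patch; the path and helper are hypothetical), the string can be built the
same way cmt_monitoring_seq_show() below prints it:

  #include <stdio.h>

  /* Write one hex uflags value per package, ';'-separated. */
  static int write_cont_monitoring(const char *path,
				   const unsigned int *uflags, int nr_pkgs)
  {
	FILE *f = fopen(path, "w");
	int p;

	if (!f)
		return -1;
	for (p = 0; p < nr_pkgs; p++)
		fprintf(f, "%x%c", uflags[p], p + 1 < nr_pkgs ? ';' : '\n');
	return fclose(f);
  }

With uflags = {0x1, 0x2} this writes "1;2", matching the example above.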

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c       | 173 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  10 +++
 2 files changed, 183 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 194038b..3ade923 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -2323,4 +2323,177 @@ inline void __intel_cmt_no_event_sched_in(void)
 #endif
 }
 
+#ifdef CONFIG_CGROUP_PERF
+
+static int cmt_monitoring_seq_show(struct seq_file *m, void *v)
+{
+	struct perf_cgroup *cgrp = perf_cgroup_from_css(seq_css(m));
+	struct monr *monr = NULL;
+	int p, nr_pkgs = topology_max_packages();
+
+	mutex_lock(&cmt_mutex);
+
+	if (perf_cgroup_mon_started(cgrp))
+		monr = monr_from_perf_cgroup(cgrp);
+
+	for (p = 0; p < nr_pkgs; p++) {
+		seq_printf(m, "%x", monr ? monr->pkg_uflags[p] : 0);
+		if (p + 1 < nr_pkgs)
+			seq_puts(m, ";");
+	}
+
+	seq_puts(m, "\n");
+
+	mutex_unlock(&cmt_mutex);
+	return 0;
+}
+
+/**
+ * Parses uflags string of the form: m1;m2;...;mP where m* are hexadecimal
+ * masks and P is less or equal than topology_max_packages(). Values not
+ * provided in the mask are assumed to be zero.
+ * On success, it allocates and returns an array.
+ */
+static enum cmt_user_flags *parse_pkg_uflags(char *buf, size_t nbytes)
+{
+	enum cmt_user_flags *uflags;
+	char *local_buf, *b, *m;
+	int err = 0, pmask, nr_pkgs = topology_max_packages();
+	u16 p = 0;
+
+	uflags = kcalloc(nr_pkgs, sizeof(*uflags), GFP_KERNEL);
+	if (!uflags)
+		return ERR_PTR(-ENOMEM);
+
+	local_buf = kcalloc(nbytes, sizeof(char), GFP_KERNEL);
+	if (!local_buf) {
+		err = -ENOMEM;
+		goto error;
+	}
+	memcpy(local_buf, buf, nbytes);
+	b = local_buf;
+	while ((m = strsep(&b, ";")) != NULL) {
+		if (p >= nr_pkgs) {
+			err = -EINVAL;
+			goto error;
+		}
+		if (!*m) {
+			uflags[p] = 0;
+			continue;
+		}
+		err = kstrtoint(m, 16, &pmask);
+		if (err)
+			goto error;
+		uflags[p] = pmask;
+		if (uflags[p] > CMT_UF_MAX) {
+			err = -EINVAL;
+			goto error;
+		}
+		/*
+		 * Non-zero flags must have CMT_UF_HAS_USER set. Otherwise
+		 * monrs could end up allocated but never used.
+		 */
+		if (uflags[p] && (!(uflags[p] & CMT_UF_HAS_USER))) {
+			err = -EINVAL;
+			goto error;
+		}
+		p++;
+	}
+
+	kfree(local_buf);
+
+	return uflags;
+
+error:
+	kfree(local_buf);
+	kfree(uflags);
+
+	return ERR_PTR(err);
+}
+
+static ssize_t cmt_monitoring_write(struct kernfs_open_file *of,
+		char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct monr *monr;
+	int err = 0;
+	bool is_mon;
+	enum cmt_user_flags *uflags;
+
+	/* root is read-only */
+	if (css == get_root_perf_css())
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+
+	/* Monitoring active, use new flags. */
+	uflags = parse_pkg_uflags(buf, nbytes);
+	if (IS_ERR(uflags)) {
+		err = PTR_ERR(uflags);
+		goto exit_unlock;
+	}
+
+	is_mon = perf_cgroup_mon_started(perf_cgroup_from_css(css));
+	if (!is_mon) {
+		if (pkg_uflags_has_user(uflags)) {
+			err = __css_start_monitoring(css);
+			if (err)
+				goto exit_free;
+		} else {
+			/*
+			 * uflags must be all zero or parse_pkg_uflags
+			 * would have failed.
+			 */
+			goto exit_free;
+		}
+	}
+
+	/* At this point the monr is guaranteed to be this css's monr. */
+	monr = monr_from_css(css);
+
+	/* Disregard if flags have not changed. */
+	if (!memcmp(uflags, monr->pkg_uflags, pkg_uflags_size))
+		goto exit_free;
+
+	/*
+	 * will update monr->pkg_flags. Do not exit on error, continue to
+	 * clean up monr if unused.
+	 */
+	err = monr_apply_uflags(monr, uflags);
+
+	if (!monr_has_user(monr))
+		monr_destroy(monr);
+
+exit_free:
+	kfree(uflags);
+exit_unlock:
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+	return err ?: nbytes;
+}
+
+struct cftype perf_event_cgrp_arch_subsys_cftypes[] = {
+	{
+		/*
+		 * allows per-package specification of uflags. It takes a
+		 * semi-colon separated list of hex uflags values. The hex
+		 * value in the i-th position (counting from 0) corresponds to
+		 * the uflags of the package with logical id == i. Empty and
+		 * missing hex values receive 0.
+		 * e.g. "1;2" -> uflag 0x1 for pkg 0, uflag 0x2 for pkg 1,
+		 * and 0x0 for all others.
+		 */
+		.name = "cmt_monitoring",
+		.seq_show = cmt_monitoring_seq_show,
+		.write = cmt_monitoring_write,
+	},
+
+	{}	/* terminate */
+};
+
+#endif
+
 device_initcall(intel_cmt_init);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 783bdbb..babee97 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -315,6 +315,16 @@ int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);
 	perf_cgroup_arch_css_offline
 void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);
 
+extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS \
+	.dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
+	.legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
+
+#else
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+
 #endif
 #endif
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (36 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute David Carrillo-Cisneros
                   ` (7 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

To make rmid rotation more dependable, this patch series introduces
rotation Service Level Objectives (SLOs) that are described in the
code's documentation.

This patch introduces the cmt_{pre,min}_mon_slice SLOs, which protect
against bogus values when a rmid has not been available since the beginning
of monitoring. It also introduces the auxiliary variables necessary for the
SLOs to work and the checks in intel_cmt_event_read that enforce the SLOs
for reads of the llc_occupancy event.
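As a hedged sketch of the resulting timeline with the defaults added below
(2000 ms and 5000 ms; the variable names follow the patch, but the snippet
itself is illustrative and not a hunk of it):

  /* 'recoup' is the jiffies64 time when the monr got all its rmids back. */
  u64 readable_from = recoup + msecs_to_jiffies(CMT_DEFAULT_PRE_MON_SLICE);
  u64 keep_until    = readable_from +
		      msecs_to_jiffies(CMT_DEFAULT_MIN_MON_SLICE);

  /*
   * llc_occupancy reads return -EAGAIN until readable_from (~2 s after the
   * recoup); later patches in this series use keep_until (~7 s after the
   * recoup) to decide when rmids may be stolen again.
   */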

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 46 ++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h | 28 +++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 3ade923..649eb5f 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -51,6 +51,25 @@ static size_t pkg_uflags_size;
 static struct pkg_data **cmt_pkgs_data;
 
 /*
+ * Rotation Service Level Objectives (SLO) for monrs with llc_occupancy
+ * monitoring. Note that these are monr level SLOs, therefore all pmonrs in
+ * the monr meet or exceed them.
+ * (A "monitored"  monr is a monr with no pmonr in a Dependent state).
+ *
+ * SLOs:
+ *
+ * @__cmt_pre_mon_slice: Min time a monr is monitored before being readable.
+ * @__cmt_min_mon_slice: Min time a monr stays monitored after becoming
+ *                       readable.
+ */
+#define CMT_DEFAULT_PRE_MON_SLICE 2000		/* ms */
+static u64 __cmt_pre_mon_slice;
+
+#define CMT_DEFAULT_MIN_MON_SLICE 5000		/* ms */
+static u64 __cmt_min_mon_slice;
+
+
+/*
  * If @pkgd == NULL, return first online, pkg_data in cmt_pkgs_data.
  * Otherwise next online pkg_data or NULL if no more.
  */
@@ -300,6 +319,7 @@ static void pmonr_to_unused(struct pmonr *pmonr)
 			pmonr_move_all_dependants(pmonr, lender);
 		}
 		__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
+		pkgd->nr_dirty_rmids++;
 
 	} else if (pmonr->state == PMONR_DEP_IDLE ||
 		   pmonr->state == PMONR_DEP_DIRTY) {
@@ -312,6 +332,11 @@ static void pmonr_to_unused(struct pmonr *pmonr)
 			__set_bit(rmids.read_rmid, pkgd->dirty_rmids);
 		else
 			pkgd->nr_dep_pmonrs--;
+
+
+		if (!atomic_dec_and_test(&pmonr->monr->nr_dep_pmonrs))
+			atomic64_set(&pmonr->monr->last_rmid_recoup,
+				     get_jiffies_64());
 	} else {
 		WARN_ON_ONCE(true);
 		return;
@@ -372,6 +397,7 @@ static inline void __pmonr_to_dep_helper(
 
 	lender_rmids.value = atomic64_read(&lender->atomic_rmids);
 	pmonr_set_rmids(pmonr, lender_rmids.sched_rmid, read_rmid);
+	atomic_inc(&pmonr->monr->nr_dep_pmonrs);
 }
 
 static inline void pmonr_unused_to_dep_idle(struct pmonr *pmonr)
@@ -390,6 +416,7 @@ static void pmonr_unused_to_off(struct pmonr *pmonr)
 
 static void pmonr_active_to_dep_dirty(struct pmonr *pmonr)
 {
+	struct pkg_data *pkgd = pmonr->pkgd;
 	struct pmonr *lender;
 	union pmonr_rmids rmids;
 
@@ -398,6 +425,7 @@ static void pmonr_active_to_dep_dirty(struct pmonr *pmonr)
 
 	rmids.value = atomic64_read(&pmonr->atomic_rmids);
 	__pmonr_to_dep_helper(pmonr, lender, rmids.read_rmid);
+	pkgd->nr_dirty_rmids++;
 }
 
 static void __pmonr_dep_to_active_helper(struct pmonr *pmonr, u32 rmid)
@@ -408,6 +436,9 @@ static void __pmonr_dep_to_active_helper(struct pmonr *pmonr, u32 rmid)
 	pmonr_move_dependants(pmonr->lender, pmonr);
 	pmonr->lender = NULL;
 	__pmonr_to_active_helper(pmonr, rmid);
+
+	if (!atomic_dec_and_test(&pmonr->monr->nr_dep_pmonrs))
+		atomic64_set(&pmonr->monr->last_rmid_recoup, get_jiffies_64());
 }
 
 static void pmonr_dep_idle_to_active(struct pmonr *pmonr, u32 rmid)
@@ -422,6 +453,7 @@ static void pmonr_dep_dirty_to_active(struct pmonr *pmonr)
 	union pmonr_rmids rmids;
 
 	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	pmonr->pkgd->nr_dirty_rmids--;
 	__pmonr_dep_to_active_helper(pmonr, rmids.read_rmid);
 }
 
@@ -1599,7 +1631,7 @@ static int read_all_pkgs(struct monr *monr, int wait_time_ms, u64 *count)
 static int intel_cmt_event_read(struct perf_event *event)
 {
 	struct monr *monr = monr_from_event(event);
-	u64 count;
+	u64 count, recoup, wait_end;
 	u16 pkgid = topology_logical_package_id(smp_processor_id());
 	int err;
 
@@ -1614,6 +1646,15 @@ static int intel_cmt_event_read(struct perf_event *event)
 		return -ENXIO;
 
 	/*
+	 * If rmid has been stolen, only read if enough time has elapsed since
+	 * rmid were recovered.
+	 */
+	recoup = atomic64_read(&monr->last_rmid_recoup);
+	wait_end = recoup + __cmt_pre_mon_slice;
+	if (recoup && time_before64(get_jiffies_64(), wait_end))
+		return -EAGAIN;
+
+	/*
 	 * Only event parent can return a value, everyone else share its
 	 * rmid and therefore doesn't track occupancy independently.
 	 */
@@ -2267,6 +2308,9 @@ static int __init intel_cmt_init(void)
 	struct pkg_data *pkgd = NULL;
 	int err = 0;
 
+	__cmt_pre_mon_slice = msecs_to_jiffies(CMT_DEFAULT_PRE_MON_SLICE);
+	__cmt_min_mon_slice = msecs_to_jiffies(CMT_DEFAULT_MIN_MON_SLICE);
+
 	if (!x86_match_cpu(intel_cmt_match)) {
 		err = -ENODEV;
 		goto err_exit;
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 8bb43bd..8756666 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -52,6 +52,24 @@
  * schedule and read.
  *
  *
+ * Rotation
+ *
+ * The number of rmids in hw is relatively small with respect to the number
+ * of potentially monitored resources. rmids are rotated among pmonrs that
+ * need one, to give a fair-ish usage of this resource.
+ *
+ * A hw constraint is that occupancy for a rmid cannot be reset, therefore
+ * a rmid with llc_occupancy needs some time unscheduled until all cache lines
+ * tagged to it are evicted from the cache (if this ever happens).
+ *
+ * When a rmid is "rotated", it is stolen from a pmonr and must wait until its
+ * llc_occupancy has decreased enough to be considered "clean". Meanwhile, that
+ * rmid is considered "dirty".
+ *
+ * Rotation logic periodically reads the occupancy of these "dirty" rmids and,
+ * once clean, the rmid is either reused or placed in a free pool.
+ *
+ *
  * Locking
  *
  * One global cmt_mutex. One mutex and spin_lock per package.
@@ -62,6 +80,7 @@
  *  cgroup start/stop.
  *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
  *  change the monr hierarchy.
+ *  - pkgd->mutex: Hold in current package for rotation in that pkgd.
  *  - pkgd->lock: Hold in current package to access that pkgd's members. Hold
  *  a pmonr's package pkgd->lock for non-atomic access to pmonr.
  */
@@ -225,6 +244,7 @@ struct cmt_csd {
  * @dep_dirty_pmonrs:		LRU of Dep_Dirty pmonrs.
  * @dep_pmonrs:			LRU of Dep_Idle and Dep_Dirty pmonrs.
  * @nr_dep_pmonrs:		nr Dep_Idle + nr Dep_Dirty pmonrs.
+ * @nr_dirty_rmids:		"dirty" rmids, both with and without a pmonr.
  * @mutex:			Hold when modifying this pkg_data.
  * @mutex_key:			lockdep class for pkg_data's mutex.
  * @lock:			Hold to protect pmonrs in this pkg_data.
@@ -243,6 +263,7 @@ struct pkg_data {
 	struct list_head	dep_dirty_pmonrs;
 	struct list_head	dep_pmonrs;
 	int			nr_dep_pmonrs;
+	int			nr_dirty_rmids;
 
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
@@ -280,6 +301,10 @@ enum cmt_user_flags {
  * @parent:		Parent in monr hierarchy.
  * @children:		List of children in monr hierarchy.
  * @parent_entry:	Entry in parent's children list.
+ * @last_rmid_recoup:	Last time that nr_dep_pmonrs decreased to zero. It's
+ *			zero if a rmid has never been stolen from this monr.
+ * @nr_dep_pmonrs:	nr of Dep_* pmonrs in this monr. A zero implies that
+ *			monr is monitoring in all required packages.
  * @flags:		monr_flags.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
  * @nr_nolazy_user:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
@@ -303,6 +328,9 @@ struct monr {
 	struct list_head		children;
 	struct list_head		parent_entry;
 
+	atomic64_t			last_rmid_recoup;
+	atomic_t			nr_dep_pmonrs;
+
 	enum monr_flags			flags;
 	int				nr_has_user;
 	int				nr_nolazy_rmid;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (37 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work David Carrillo-Cisneros
                   ` (6 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Expose max_recycle_threshold as a configurable sysfs attribute of the
intel_cmt pmu. max_recycle_threshold is the maximum occupancy, in bytes,
that a dirty rmid may still report and yet be recycled, i.e. the maximum
error introduced by reusing rmids whose occupancy has not dropped to zero.
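As a hedged worked example of the default picked later in this patch (the
35MB/56-RMID numbers come from the comment in the patch, not from a
measurement):

  /* __cmt_max_threshold defaults to llc_size / (2 * nr_rmids), in bytes. */
  unsigned int llc_kb   = 35 * 1024;	/* 35 MB LLC           */
  unsigned int nr_rmids = 56;		/* __min_max_rmid + 1  */
  unsigned int dflt     = llc_kb * 1024 / (2 * nr_rmids);
  /* dflt == 327680 bytes == 320 KB, ~0.9% of the LLC. */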

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 649eb5f..05803a8 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -61,6 +61,7 @@ static struct pkg_data **cmt_pkgs_data;
  * @__cmt_pre_mon_slice: Min time a monr is monitored before being readable.
  * @__cmt_min_mon_slice: Min time a monr stays monitored after becoming
  *                       readable.
+ * @__cmt_max_threshold: Max bytes of error due to reusing dirty rmids.
  */
 #define CMT_DEFAULT_PRE_MON_SLICE 2000		/* ms */
 static u64 __cmt_pre_mon_slice;
@@ -68,6 +69,7 @@ static u64 __cmt_pre_mon_slice;
 #define CMT_DEFAULT_MIN_MON_SLICE 5000		/* ms */
 static u64 __cmt_min_mon_slice;
 
+static unsigned int __cmt_max_threshold;	/* bytes */
 
 /*
  * If @pkgd == NULL, return first online, pkg_data in cmt_pkgs_data.
@@ -1831,9 +1833,54 @@ static struct attribute_group intel_cmt_format_group = {
 	.attrs = intel_cmt_formats_attr,
 };
 
+static ssize_t max_recycle_threshold_show(struct device *dev,
+				struct device_attribute *attr, char *page)
+{
+	ssize_t rv;
+
+	mutex_lock(&cmt_mutex);
+	rv = snprintf(page, PAGE_SIZE - 1, "%u\n",
+		      READ_ONCE(__cmt_max_threshold));
+
+	mutex_unlock(&cmt_mutex);
+	return rv;
+}
+
+static ssize_t max_recycle_threshold_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	unsigned int bytes;
+	int err;
+
+	err = kstrtouint(buf, 0, &bytes);
+	if (err)
+		return err;
+
+	mutex_lock(&cmt_mutex);
+	monr_hrchy_acquire_mutexes();
+	WRITE_ONCE(__cmt_max_threshold, bytes);
+	monr_hrchy_release_mutexes();
+	mutex_unlock(&cmt_mutex);
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(max_recycle_threshold);
+
+static struct attribute *intel_cmt_attrs[] = {
+	&dev_attr_max_recycle_threshold.attr,
+	NULL,
+};
+
+static const struct attribute_group intel_cmt_group = {
+	.attrs = intel_cmt_attrs,
+};
+
 static const struct attribute_group *intel_cmt_attr_groups[] = {
 	&intel_cmt_events_group,
 	&intel_cmt_format_group,
+	&intel_cmt_group,
 	NULL,
 };
 
@@ -2270,6 +2317,16 @@ static int __init cmt_start(void)
 	if (err)
 		goto rm_prep;
 
+	/*
+	 * A reasonable default upper limit on the max threshold is half of
+	 * the number of lines tagged per RMID if all RMIDs had the same
+	 * number of lines tagged in the LLC.
+	 *
+	 * For a 35MB LLC and 56 RMIDs, this is ~0.9% of the LLC or 320 KBs.
+	 */
+	__cmt_max_threshold = boot_cpu_data.x86_cache_size * 1024 /
+			(2 * (__min_max_rmid + 1));
+
 	snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
 	str = kstrdup(scale, GFP_KERNEL);
 	if (!str) {
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (38 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO David Carrillo-Cisneros
                   ` (5 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Schedule the rotation work every pmu->hrtimer_interval_ms milliseconds.
The period defaults to CMT_DEFAULT_ROTATION_PERIOD.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 98 ++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cmt.h |  2 +
 2 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 05803a8..8bf6aa5 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -51,6 +51,13 @@ static size_t pkg_uflags_size;
 static struct pkg_data **cmt_pkgs_data;
 
 /*
+ * Time between execution of rotation logic. The frequency of execution does
+ * not affect the rate at which RMIDs are recycled.
+ * The rotation period is stored in pmu->hrtimer_interval_ms.
+ */
+#define CMT_DEFAULT_ROTATION_PERIOD 1200	/* ms */
+
+/*
  * Rotation Service Level Objectives (SLO) for monrs with llc_occupancy
  * monitoring. Note that these are monr level SLOs, therefore all pmonrs in
  * the monr meet or exceed them.
@@ -1306,6 +1313,79 @@ static void smp_call_rmid_read(void *data)
 
 static struct pmu intel_cmt_pmu;
 
+/* Schedule rotation in one package. */
+static bool __intel_cmt_schedule_rotation_for_pkg(struct pkg_data *pkgd)
+{
+	unsigned long delay;
+
+	if (pkgd->work_cpu >= nr_cpu_ids)
+		return false;
+	delay = msecs_to_jiffies(intel_cmt_pmu.hrtimer_interval_ms);
+
+	return schedule_delayed_work_on(pkgd->work_cpu,
+					&pkgd->rotation_work, delay);
+}
+
+static void intel_cmt_schedule_rotation(void)
+{
+	struct pkg_data *pkgd = NULL;
+
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd)))
+		__intel_cmt_schedule_rotation_for_pkg(pkgd);
+	rcu_read_unlock();
+}
+
+/*
+ * Rotation for @pkgd is needed if its package has at least one online CPU and:
+ *   - there is a non-root monr (such monr could request rmids in @pkgd at any
+ *     time), or
+ *   - there are dirty rmids in pkgd.
+ */
+static bool intel_cmt_need_rmid_rotation(struct pkg_data *pkgd)
+{
+	unsigned long flags;
+	bool do_rot;
+
+	/* protected by cmt_mutex. */
+	if (!list_empty(&monr_hrchy_root->children))
+		return true;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	do_rot =  pkgd->nr_dep_pmonrs || pkgd->nr_dirty_rmids;
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	return do_rot;
+}
+
+/*
+ * Rotation function, runs per-package.
+ */
+static void intel_cmt_rmid_rotation_work(struct work_struct *work)
+{
+	struct pkg_data *pkgd;
+
+	pkgd = container_of(to_delayed_work(work),
+			    struct pkg_data, rotation_work);
+
+	/* Bail out if this pkg_data is on its way to be destroyed. */
+	if (pkgd->work_cpu >= nr_cpu_ids)
+		return;
+
+	mutex_lock(&pkgd->mutex);
+
+	if (!intel_cmt_need_rmid_rotation(pkgd))
+		goto exit;
+
+	/* To add call to rotation function in next patch */
+
+	if (intel_cmt_need_rmid_rotation(pkgd))
+		__intel_cmt_schedule_rotation_for_pkg(pkgd);
+
+exit:
+	mutex_unlock(&pkgd->mutex);
+}
+
 /* Try to find a monr with same target, otherwise create new one. */
 static int mon_group_setup_event(struct perf_event *event)
 {
@@ -1796,6 +1876,11 @@ static int intel_cmt_event_init(struct perf_event *event)
 	mutex_lock(&cmt_mutex);
 
 	err = mon_group_setup_event(event);
+	/*
+	 * schedule rotation even on error, in case the error was caused by
+	 * insufficient rmids.
+	 */
+	intel_cmt_schedule_rotation();
 
 	mutex_unlock(&cmt_mutex);
 
@@ -1885,6 +1970,7 @@ static const struct attribute_group *intel_cmt_attr_groups[] = {
 };
 
 static struct pmu intel_cmt_pmu = {
+	.hrtimer_interval_ms = CMT_DEFAULT_ROTATION_PERIOD,
 	.attr_groups	     = intel_cmt_attr_groups,
 	.task_ctx_nr	     = perf_sw_context,
 	.event_init	     = intel_cmt_event_init,
@@ -2009,6 +2095,7 @@ static struct pkg_data *alloc_pkg_data(int cpu)
 	mutex_init(&pkgd->mutex);
 	raw_spin_lock_init(&pkgd->lock);
 
+	INIT_DELAYED_WORK(&pkgd->rotation_work, intel_cmt_rmid_rotation_work);
 	pkgd->work_cpu = cpu;
 	pkgd->pkgid = pkgid;
 
@@ -2131,9 +2218,10 @@ static int intel_cmt_hp_online_enter(unsigned int cpu)
 
 	rcu_read_lock();
 	pkgd = rcu_dereference(cmt_pkgs_data[pkgid]);
-	if (pkgd->work_cpu >= nr_cpu_ids)
+	if (pkgd->work_cpu >= nr_cpu_ids) {
 		pkgd->work_cpu = cpu;
-
+		__intel_cmt_schedule_rotation_for_pkg(pkgd);
+	}
 	rcu_read_unlock();
 
 	return 0;
@@ -2184,6 +2272,7 @@ static int intel_cmt_prep_down(unsigned int cpu)
 	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
 					 lockdep_is_held(&cmt_mutex));
 	if (pkgd->work_cpu >= nr_cpu_ids) {
+		cancel_delayed_work_sync(&pkgd->rotation_work);
 		/* will destroy pkgd */
 		__terminate_pkg_data(pkgd);
 		RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
@@ -2569,6 +2658,11 @@ static ssize_t cmt_monitoring_write(struct kernfs_open_file *of,
 		monr_destroy(monr);
 
 exit_free:
+	/*
+	 * schedule rotation even if in error, in case the error was caused by
+	 * insufficient rmids.
+	 */
+	intel_cmt_schedule_rotation();
 	kfree(uflags);
 exit_unlock:
 	monr_hrchy_release_mutexes();
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 8756666..872cce0 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -248,6 +248,7 @@ struct cmt_csd {
  * @mutex:			Hold when modifying this pkg_data.
  * @mutex_key:			lockdep class for pkg_data's mutex.
  * @lock:			Hold to protect pmonrs in this pkg_data.
+ * @rotation_work:		Task that performs rotation of rmids.
  * @work_cpu:			CPU to run rotation and other batch jobs.
  *				It must be in the package associated to its
  *				instance of pkg_data.
@@ -268,6 +269,7 @@ struct pkg_data {
 	struct mutex		mutex;
 	raw_spinlock_t		lock;
 
+	struct delayed_work	rotation_work;
 	unsigned int		work_cpu;
 	u32			max_rmid;
 	u16			pkgid;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (39 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing David Carrillo-Cisneros
                   ` (4 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Try to activate monrs at a rate of at least __cmt_min_progress_rate
pmonrs per second.
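As a hedged worked example of the goal computed per rotation pass
(constants from this series; the snippet is illustrative, not a hunk of
the patch):

  unsigned int elapsed_ms = 1200;	/* CMT_DEFAULT_ROTATION_PERIOD   */
  unsigned int min_rate   = 2;		/* CMT_DEFAULT_MIN_PROGRESS_RATE */
  unsigned int active_goal = elapsed_ms * min_rate / 1000;
  /* == 2; the code clamps it with max(1u, ...) for very short periods. */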

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 274 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 273 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 8bf6aa5..ba82f95 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -79,6 +79,14 @@ static u64 __cmt_min_mon_slice;
 static unsigned int __cmt_max_threshold;	/* bytes */
 
 /*
+ * Rotation SLO for all monr events (including those without llc_occupancy):
+ * @__cmt_min_progress_rate: Min number of pmonrs that must go to the Active
+ * state per second; otherwise, the recycling occupancy error increases.
+ */
+#define CMT_DEFAULT_MIN_PROGRESS_RATE 2		/* pmonrs per sec */
+static unsigned int __cmt_min_progress_rate = CMT_DEFAULT_MIN_PROGRESS_RATE;
+
+/*
  * If @pkgd == NULL, return first online, pkg_data in cmt_pkgs_data.
  * Otherwise next online pkg_data or NULL if no more.
  */
@@ -466,6 +474,21 @@ static void pmonr_dep_dirty_to_active(struct pmonr *pmonr)
 	__pmonr_dep_to_active_helper(pmonr, rmids.read_rmid);
 }
 
+/* dirty rmid must be clean enough to go to free_rmids. */
+static void pmonr_dep_dirty_to_dep_idle_helper(struct pmonr *pmonr,
+					       union pmonr_rmids rmids)
+{
+	struct pkg_data *pkgd = pmonr->pkgd;
+
+	pmonr->pkgd->nr_dirty_rmids--;
+	__set_bit(rmids.read_rmid, pkgd->free_rmids);
+	list_move_tail(&pmonr->rot_entry, &pkgd->dep_idle_pmonrs);
+	pkgd->nr_dep_pmonrs++;
+
+	pmonr->state = PMONR_DEP_IDLE;
+	pmonr_set_rmids(pmonr, rmids.sched_rmid, INVALID_RMID);
+}
+
 static void monr_dealloc(struct monr *monr)
 {
 	u16 p, nr_pkgs = topology_max_packages();
@@ -1311,6 +1334,242 @@ static void smp_call_rmid_read(void *data)
 	atomic_set(&ccsd->on_read, 0);
 }
 
+/*
+ * Try to reuse dirty rmids for pmonrs at the front of dep_dirty_pmonrs.
+ */
+static int __try_activate_dep_dirty_pmonrs(struct pkg_data *pkgd)
+{
+	int reused = 0;
+	struct pmonr *pmonr;
+	struct list_head *lhead = &pkgd->dep_pmonrs;
+
+	lockdep_assert_held(&pkgd->lock);
+
+	while ((pmonr = list_first_entry_or_null(
+				lhead, struct pmonr, pkgd_deps_entry))) {
+		if (!pmonr || pmonr->state == PMONR_DEP_IDLE)
+			break;
+		pmonr_dep_dirty_to_active(pmonr);
+		reused++;
+	}
+
+	return reused;
+}
+
+static int try_activate_dep_dirty_pmonrs(struct pkg_data *pkgd)
+{
+	int nr_reused;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	nr_reused = __try_activate_dep_dirty_pmonrs(pkgd);
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	return nr_reused;
+}
+
+static inline int __try_use_free_rmid(struct pkg_data *pkgd, u32 rmid)
+{
+	struct pmonr *pmonr;
+
+	lockdep_assert_held(&pkgd->lock);
+
+	pmonr = list_first_entry_or_null(&pkgd->dep_idle_pmonrs,
+					 struct pmonr, rot_entry);
+	if (!pmonr)
+		return 0;
+	/* The state transition will move the rmid to the active list.  */
+	pmonr_dep_idle_to_active(pmonr, rmid);
+
+	return 1 + __try_activate_dep_dirty_pmonrs(pkgd);
+}
+
+static int __try_use_free_rmids(struct pkg_data *pkgd)
+{
+	int nr_activated = 0, nr_used, r;
+
+	for_each_set_bit(r, pkgd->free_rmids, CMT_MAX_NR_RMIDS) {
+		/* Removes the rmid from free list if succeeds. */
+		nr_used = __try_use_free_rmid(pkgd, r);
+		if (!nr_used)
+			break;
+		nr_activated += nr_used;
+	}
+
+	return nr_activated;
+}
+
+static bool is_rmid_dirty(struct pkg_data *pkgd, u32 rmid, bool do_read,
+			  unsigned int dirty_thld, unsigned int *min_dirty)
+{
+	u64 val;
+
+	if (do_read && WARN_ON_ONCE(cmt_rmid_read(rmid, &val)))
+		return true;
+	if (val > dirty_thld) {
+		if (val < *min_dirty)
+			*min_dirty = val;
+		return true;
+	}
+
+	return false;
+}
+
+static int try_free_dep_dirty_pmonrs(struct pkg_data *pkgd,
+				     bool do_read,
+				     unsigned int dirty_thld,
+				     unsigned int *min_dirty)
+{
+	struct pmonr *pmonr, *tmp;
+	union pmonr_rmids rmids;
+	int nr_activated = 0;
+	unsigned long flags;
+
+	/*
+	 * No need to acquire pkg lock for pkgd->dep_dirty_pmonrs because
+	 * rotation logic is the only user of this list.
+	 */
+	list_for_each_entry_safe(pmonr, tmp,
+				 &pkgd->dep_dirty_pmonrs, rot_entry) {
+		rmids.value = atomic64_read(&pmonr->atomic_rmids);
+		if (is_rmid_dirty(pkgd, rmids.read_rmid,
+					do_read, dirty_thld, min_dirty))
+			continue;
+
+		raw_spin_lock_irqsave(&pkgd->lock, flags);
+		pmonr_dep_dirty_to_dep_idle_helper(pmonr, rmids);
+		nr_activated += __try_use_free_rmid(pkgd, rmids.read_rmid);
+		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+	}
+
+	return nr_activated;
+}
+
+static int try_free_dirty_rmids(struct pkg_data *pkgd,
+				bool do_read,
+				unsigned int dirty_thld,
+				unsigned int *min_dirty,
+				unsigned long *rmids_bm)
+{
+	int nr_activated = 0, r;
+	unsigned long flags;
+
+	/*
+	 * To avoid holding pkgd->lock while reading rmids in hw (slow), hold
+	 * once and save all rmids that must be read. Then read them while
+	 * unlocked.
+	 */
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	memcpy(rmids_bm, pkgd->dirty_rmids, CMT_MAX_NR_RMIDS_BYTES);
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	for_each_set_bit(r, rmids_bm, CMT_MAX_NR_RMIDS) {
+		if (is_rmid_dirty(pkgd, r, do_read, dirty_thld, min_dirty))
+			continue;
+
+		raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+		pkgd->nr_dirty_rmids--;
+		__clear_bit(r, pkgd->dirty_rmids);
+		__set_bit(r, pkgd->free_rmids);
+		nr_activated += __try_use_free_rmid(pkgd, r);
+
+		raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+	}
+
+	return nr_activated;
+}
+
+/**
+ * __intel_cmt_rmid_rotate - Rotate rmids among pmonrs and handle dirty rmids.
+ * @pkgd:		The package data to rotate rmids on.
+ * @active_goal:	Target min nr of pmonrs to put in Active state.
+ * @max_dirty_thld:	Upper bound for dirty_thld, in CMT cache units.
+ *
+ * The goals for each iteration of rotation logic are:
+ *   1) to activate @active_goal pmonrs.
+ *
+ * In order to activate Dep_{Dirty,Idle} pmonrs, rotation logic:
+ *   1) activate eligible Dep_Dirty pmonrs: These pmonrs can reuse their former
+ *   rmid, even if it is not clean, without increasing the error.
+ *   2) take clean rmids from Dep_Dirty pmonrs and reuse them for other pmonrs
+ *   or add them to pool of free rmids.
+ *   3) use free rmids to activate Dep_Idle pmonrs.
+ *
+ * Rotation logic also checks the occupancy of dirty rmids and, if now clean,
+ * uses them or adds them to free rmids.
+ * When a Dep_Idle pmonr is activated, any Dep_Dirty pmonr that is immediately
+ * after it in the pkg->dep_pmonrs list can be activated reusing its dirty
+ * rmid.
+ */
+static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
+		unsigned int active_goal, unsigned int max_dirty_thld)
+{
+	unsigned int dirty_thld = 0, min_dirty, nr_activated;
+	unsigned int nr_dep_pmonrs;
+	unsigned long flags, *rmids_bm = NULL;
+	bool do_active_goal, read_dirty = true, dirty_is_max;
+
+	lockdep_assert_held(&pkgd->mutex);
+
+	rmids_bm = kzalloc(CMT_MAX_NR_RMIDS_BYTES, GFP_KERNEL);
+	if (!rmids_bm)
+		return -ENOMEM;
+
+	nr_activated = try_activate_dep_dirty_pmonrs(pkgd);
+
+again:
+	min_dirty = UINT_MAX;
+
+	/* retry every iteration since dirty_thld may have changed. */
+	nr_activated += try_free_dirty_rmids(pkgd, read_dirty,
+					     dirty_thld, &min_dirty, rmids_bm);
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	nr_activated += __try_use_free_rmids(pkgd);
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	nr_activated += try_free_dep_dirty_pmonrs(pkgd, read_dirty,
+						  dirty_thld, &min_dirty);
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+	nr_activated += __try_use_free_rmids(pkgd);
+	nr_dep_pmonrs = pkgd->nr_dep_pmonrs;
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	/*
+	 * If there is no room to increase dirty_thld, then no more dirty rmids
+	 * could be reused and must give up active goal.
+	 */
+	dirty_is_max = dirty_thld >= max_dirty_thld;
+	do_active_goal = nr_activated < active_goal && !dirty_is_max;
+
+	/*
+	 * Since Dep_Dirty pmonrs have their own dirty rmid, only Dep_Idle
+	 * pmonrs are waiting for a rmid to be available. Stop if no pmonr
+	 * wait for rmid or no goals to pursue.
+	 */
+	if (!nr_dep_pmonrs || !do_active_goal)
+		goto exit;
+
+	/*
+	 * Try to activate more pmonrs by increasing the dirty threshold.
+	 * Using the minimum observed occupancy in dirty rmids guarantees to
+	 * recover at least one rmid per iteration.
+	 */
+	if (do_active_goal) {
+		dirty_thld = min(min_dirty, max_dirty_thld);
+		/* do not read occupancy for dirty rmids twice. */
+		read_dirty = true;
+		goto again;
+	}
+
+exit:
+	kfree(rmids_bm);
+
+	return 0;
+}
+
 static struct pmu intel_cmt_pmu;
 
 /* Schedule rotation in one package. */
@@ -1360,10 +1619,20 @@ static bool intel_cmt_need_rmid_rotation(struct pkg_data *pkgd)
 
 /*
  * Rotation function, runs per-package.
+ * If rmids are needed in a package it will steal rmids from pmonrs that have
+ * been active longer than __cmt_pre_mon_slice + __cmt_min_mon_slice.
+ * The hardware doesn't provide a way to free occupancy for a rmid that will
+ * be reused. Therefore, before reusing a rmid, it should stay unscheduled for
+ * a while, hoping that the cache lines counted towards this rmid will
+ * eventually be replaced and the rmid occupancy will decrease below
+ * __cmt_max_threshold.
  */
 static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 {
 	struct pkg_data *pkgd;
+	/* not precise elapsed time, but good enough for rotation purposes. */
+	unsigned int elapsed_ms = intel_cmt_pmu.hrtimer_interval_ms;
+	unsigned int active_goal, max_dirty_threshold;
 
 	pkgd = container_of(to_delayed_work(work),
 			    struct pkg_data, rotation_work);
@@ -1377,7 +1646,10 @@ static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 	if (!intel_cmt_need_rmid_rotation(pkgd))
 		goto exit;
 
-	/* To add call to rotation function in next patch */
+	active_goal = max(1u, (elapsed_ms * __cmt_min_progress_rate) / 1000);
+	max_dirty_threshold = READ_ONCE(__cmt_max_threshold) / cmt_l3_scale;
+
+	__intel_cmt_rmid_rotate(pkgd, active_goal, max_dirty_threshold);
 
 	if (intel_cmt_need_rmid_rotation(pkgd))
 		__intel_cmt_schedule_rotation_for_pkg(pkgd);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (40 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag David Carrillo-Cisneros
                   ` (3 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add rmid rotation code that steals rmids whenever not enough
pmonrs are being reactivated.

More details in the code's comments.
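As a hedged worked example of how many rmids one rotation pass may try to
steal (constants and formulas from this and the preceding patches; the
Dep_Idle and dirty counts are made-up inputs):

  /*
   * With active_goal = 2 (previous patch) and 56 rmids (max_rmid = 55):
   *   max_dirty_goal = min(active_goal + 1, (max_rmid + 1) / 4)
   *                  = min(3, 14) = 3
   *   dirty_goal     = min(max_dirty_goal, nr_dep_pmonrs + dirty_cushion)
   *                  = min(3, 4 + 2) = 3    (4 Dep_Idle pmonrs, cushion 2)
   *   nr_to_steal    = dirty_goal - nr_dirty = 3 - 0 = 3
   */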

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 149 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 144 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index ba82f95..e677511 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -1368,6 +1368,106 @@ static int try_activate_dep_dirty_pmonrs(struct pkg_data *pkgd)
 	return nr_reused;
 }
 
+/**
+ * can_steal_rmid() - Tell if this pmonr's rmid can be stolen.
+ *
+ * The "rmid cycle" for a pmonr starts when an Active pmonr gets its rmid
+ * stolen and completes when it receives a rmid again.
+ * A monr "rmid recoup" occurs when all its non Off/Unused pmonrs
+ * obtain a rmid (i.e. when all pmonrs that need a rmid have one).
+ *
+ * A pmonr's rmid can be stolen if either:
+ *   1) No other pmonr in pmonr's monr has been stolen before, or
+ *   2) Some pmonrs have had rmids stolen but rmids for all pmonrs have been
+ *   recovered (rmid recoup) and kept for at least
+ *     __cmt_pre_mon_slice + __cmt_min_mon_slice time.
+ *   3) At least one of the pmonrs with pkgid smaller than @pmonr's has not
+ *   completed its first "rmid cycle". Once this condition is false, the pmonr
+ *   will have completed its last "rmid cycle" and stealing will no longer
+ *   be allowed.
+ *   This guarantees that the last "rmid cycle" of a pmonr occurs in
+ *   pkgid order, preventing rmid deadlocks. It also guarantees that all
+ *   pmonrs will eventually have a last "rmid cycle", recovering all
+ *   required rmids.
+ */
+static bool can_steal_rmid(struct pmonr *pmonr)
+{
+	union pmonr_rmids rmids;
+	struct monr *monr = pmonr->monr;
+	struct pkg_data *pkgd = NULL;
+	struct pmonr *pos_pmonr;
+	bool need_rmid_state;
+	u64 last_all_active, next_steal_time, last_pmonr_active;
+
+	last_all_active = atomic64_read(&monr->last_rmid_recoup);
+	/*
+	 * Can steal if no pmonr has been stolen or all not Unused have been
+	 * in Active state for long enough.
+	 */
+	if (!atomic_read(&monr->nr_dep_pmonrs)) {
+		/* Check steal condition 1. */
+		if (!last_all_active)
+			return true;
+		next_steal_time = last_all_active +
+				__cmt_pre_mon_slice + __cmt_min_mon_slice;
+		/* Check steal condition 2. */
+		if (time_after64(next_steal_time, get_jiffies_64()))
+			return true;
+
+		return false;
+	}
+
+	rcu_read_lock();
+
+	/* Check for steal condition 3 without locking. */
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		/* To avoid deadlocks, wait for pmonr in pkgid order. */
+		if (pkgd->pkgid >= pmonr->pkgd->pkgid)
+			break;
+		pos_pmonr = pkgd_pmonr(pkgd, monr);
+		rmids.value = atomic64_read(&pos_pmonr->atomic_rmids);
+		last_pmonr_active = atomic64_read(
+				&pos_pmonr->last_enter_active);
+
+		/* pmonrs in Dep_{Idle,Dirty} states are waiting for a rmid. */
+		need_rmid_state = rmids.sched_rmid != INVALID_RMID &&
+				  rmids.sched_rmid != rmids.read_rmid;
+
+		/* test if pos_pmonr has finished its first rmid cycle. */
+		if (need_rmid_state && last_all_active <= last_pmonr_active) {
+			rcu_read_unlock();
+
+			return true;
+		}
+	}
+	rcu_read_unlock();
+
+	return false;
+}
+
+/* Steal as many rmids as possible, up to @max_to_steal. */
+static int try_steal_active_pmonrs(struct pkg_data *pkgd,
+				   unsigned int max_to_steal)
+{
+	struct pmonr *pmonr, *tmp;
+	unsigned long flags;
+	int nr_stolen = 0;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+	list_for_each_entry_safe(pmonr, tmp, &pkgd->active_pmonrs, rot_entry) {
+		if (!can_steal_rmid(pmonr))
+			continue;
+		pmonr_active_to_dep_dirty(pmonr);
+		nr_stolen++;
+		if (nr_stolen == max_to_steal)
+			break;
+	}
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+
+	return nr_stolen;
+}
+
 static inline int __try_use_free_rmid(struct pkg_data *pkgd, u32 rmid)
 {
 	struct pmonr *pmonr;
@@ -1485,9 +1585,17 @@ static int try_free_dirty_rmids(struct pkg_data *pkgd,
  * @pkgd:		The package data to rotate rmids on.
  * @active_goal:	Target min nr of pmonrs to put in Active state.
  * @max_dirty_thld:	Upper bound for dirty_thld, in CMT cache units.
+ * @max_dirty_goal:	Max nr of rmids to leave dirty, waiting to drop
+ *			occupancy.
+ * @dirty_cushion:	nr of rmids to try to keep dirty on top of the
+ *			nr of pmonrs that need a rmid (Dep_Idle), in case
+ *			some dirty rmids do not drop occupancy fast enough.
  *
  * The goals for each iteration of rotation logic are:
  *   1) to activate @active_goal pmonrs.
+ *   2) if any pmonr is waiting for rmid (Dep_Idle), to steal enough rmids to
+ *   meet its dirty_goal. The dirty_goal is an estimate of the number of dirty
+ *   rmids required so that next call reaches its @active_goal.
  *
  * In order to activate Dep_{Dirty,Idle} pmonrs, rotation logic:
  *   1) activate eligible Dep_Dirty pmonrs: These pmonrs can reuse their former
@@ -1503,12 +1611,14 @@ static int try_free_dirty_rmids(struct pkg_data *pkgd,
  * rmid.
  */
 static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
-		unsigned int active_goal, unsigned int max_dirty_thld)
+		unsigned int active_goal, unsigned int max_dirty_thld,
+		unsigned int max_dirty_goal, unsigned int dirty_cushion)
 {
 	unsigned int dirty_thld = 0, min_dirty, nr_activated;
-	unsigned int nr_dep_pmonrs;
+	unsigned int nr_to_steal, nr_stolen;
+	unsigned int nr_dirty, dirty_goal, nr_dep_pmonrs;
 	unsigned long flags, *rmids_bm = NULL;
-	bool do_active_goal, read_dirty = true, dirty_is_max;
+	bool do_active_goal, do_dirty_goal, read_dirty = true, dirty_is_max;
 
 	lockdep_assert_held(&pkgd->mutex);
 
@@ -1534,6 +1644,7 @@ static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
 
 	raw_spin_lock_irqsave(&pkgd->lock, flags);
 	nr_activated += __try_use_free_rmids(pkgd);
+	nr_dirty = pkgd->nr_dirty_rmids;
 	nr_dep_pmonrs = pkgd->nr_dep_pmonrs;
 	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
 
@@ -1544,14 +1655,27 @@ static int __intel_cmt_rmid_rotate(struct pkg_data *pkgd,
 	dirty_is_max = dirty_thld >= max_dirty_thld;
 	do_active_goal = nr_activated < active_goal && !dirty_is_max;
 
+	dirty_goal = min(max_dirty_goal, nr_dep_pmonrs + dirty_cushion);
+	do_dirty_goal = nr_dirty < dirty_goal;
+
 	/*
 	 * Since Dep_Dirty pmonrs have their own dirty rmid, only Dep_Idle
 	 * pmonrs are waiting for a rmid to be available. Stop if no pmonr
 	 * wait for rmid or no goals to pursue.
 	 */
-	if (!nr_dep_pmonrs || !do_active_goal)
+	if (!nr_dep_pmonrs || (!do_dirty_goal && !do_active_goal))
 		goto exit;
 
+	if (do_dirty_goal) {
+		nr_to_steal = dirty_goal - nr_dirty;
+		nr_stolen = try_steal_active_pmonrs(pkgd, nr_to_steal);
+		/*
+		 * We already tried to steal from all Active pmonrs, so it
+		 * makes no sense to reattempt.
+		 */
+		max_dirty_goal = 0;
+	}
+
 	/*
 	 * Try to activate more pmonrs by increasing the dirty threshold.
 	 * Using the minimum observed occupancy in dirty rmids guarantees to
@@ -1633,6 +1757,7 @@ static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 	/* not precise elapsed time, but good enough for rotation purposes. */
 	unsigned int elapsed_ms = intel_cmt_pmu.hrtimer_interval_ms;
 	unsigned int active_goal, max_dirty_threshold;
+	unsigned int dirty_cushion, max_dirty_goal;
 
 	pkgd = container_of(to_delayed_work(work),
 			    struct pkg_data, rotation_work);
@@ -1649,7 +1774,21 @@ static void intel_cmt_rmid_rotation_work(struct work_struct *work)
 	active_goal = max(1u, (elapsed_ms * __cmt_min_progress_rate) / 1000);
 	max_dirty_threshold = READ_ONCE(__cmt_max_threshold) / cmt_l3_scale;
 
-	__intel_cmt_rmid_rotate(pkgd, active_goal, max_dirty_threshold);
+	/*
+	 * Upper bound for the nr of rmids to be dirty in order to have a good
+	 * chance of finding enough rmids in next iteration of rotation logic.
+	 */
+	max_dirty_goal = min(active_goal + 1, (pkgd->max_rmid + 1) / 4);
+
+	/*
+	 * Nr of extra rmids to put in dirty in case some don't drop occupancy.
+	 * To be calculated in a sensible manner once statistics about rmid
+	 * recycling rate are in place.
+	 */
+	dirty_cushion = 2;
+
+	__intel_cmt_rmid_rotate(pkgd, active_goal, max_dirty_threshold,
+				max_dirty_goal, dirty_cushion);
 
 	if (intel_cmt_need_rmid_rotation(pkgd))
 		__intel_cmt_schedule_rotation_for_pkg(pkgd);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (41 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory David Carrillo-Cisneros
                   ` (2 subsequent siblings)
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Allow the user to specify that the rmids used by a cgroup/event cannot be
stolen.

pmonrs marked with CMT_UF_NOSTEAL_RMID will never lose a valid rmid once
they have received one.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 10 +++++++++-
 arch/x86/events/intel/cmt.h |  5 ++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index e677511..8cbcbc6 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -41,7 +41,7 @@ static struct monr *monr_hrchy_root;
 
 /* Flags for root monr and all its pmonrs while being monitored. */
 static enum cmt_user_flags root_monr_uflags =
-		CMT_UF_HAS_USER | CMT_UF_NOLAZY_RMID;
+		CMT_UF_HAS_USER | CMT_UF_NOSTEAL_RMID | CMT_UF_NOLAZY_RMID;
 
 /* Auxiliar flags */
 static enum cmt_user_flags *pkg_uflags_zeroes;
@@ -499,6 +499,7 @@ static void monr_dealloc(struct monr *monr)
 	 */
 	if (WARN_ON_ONCE(!(monr->flags & CMT_MONR_ZOMBIE)) ||
 	    WARN_ON_ONCE(monr->nr_has_user) ||
+	    WARN_ON_ONCE(monr->nr_nosteal_rmid) ||
 	    WARN_ON_ONCE(monr->nr_nolazy_rmid) ||
 	    WARN_ON_ONCE(monr->mon_cgrp) ||
 	    WARN_ON_ONCE(monr->mon_events))
@@ -743,10 +744,13 @@ static bool monr_account_uflags(struct monr *monr,
 
 	if (uflags & CMT_UF_HAS_USER)
 		monr->nr_has_user += account ? 1 : -1;
+	if (uflags & CMT_UF_NOSTEAL_RMID)
+		monr->nr_nosteal_rmid += account ? 1 : -1;
 	if (uflags & CMT_UF_NOLAZY_RMID)
 		monr->nr_nolazy_rmid += account ? 1 : -1;
 
 	monr->uflags =  (monr->nr_has_user ? CMT_UF_HAS_USER : 0) |
+			(monr->nr_nosteal_rmid ? CMT_UF_NOSTEAL_RMID : 0) |
 			(monr->nr_nolazy_rmid ? CMT_UF_NOLAZY_RMID : 0);
 
 	return old_flags != monr->uflags;
@@ -1400,6 +1404,10 @@ static bool can_steal_rmid(struct pmonr *pmonr)
 	u64 last_all_active, next_steal_time, last_pmonr_active;
 
 	last_all_active = atomic64_read(&monr->last_rmid_recoup);
+
+	if (pmonr_uflags(pmonr) & CMT_UF_NOSTEAL_RMID)
+		return false;
+
 	/*
 	 * Can steal if no pmonr has been stolen or all not Unused have been
 	 * in Active state for long enough.
diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
index 872cce0..c377076 100644
--- a/arch/x86/events/intel/cmt.h
+++ b/arch/x86/events/intel/cmt.h
@@ -290,7 +290,8 @@ enum cmt_user_flags {
 	/* if no has_user other flags are meaningless. */
 	CMT_UF_HAS_USER		= BIT(0), /* has cgroup or event users */
 	CMT_UF_NOLAZY_RMID	= BIT(1), /* try to obtain rmid on creation */
-	CMT_UF_MAX		= BIT(2) - 1,
+	CMT_UF_NOSTEAL_RMID	= BIT(2), /* do not steal this rmid */
+	CMT_UF_MAX		= BIT(3) - 1,
 	CMT_UF_ERROR		= CMT_UF_MAX + 1,
 };
 
@@ -309,6 +310,7 @@ enum cmt_user_flags {
  *			monr is monitoring in all required packages.
  * @flags:		monr_flags.
  * @nr_has_user:	nr of CMT_UF_HAS_USER set in events in mon_events.
+ * @nr_nosteal_rmid:	nr of CMT_UF_NOSTEAL_RMID set in events in mon_events.
  * @nr_nolazy_user:	nr of CMT_UF_NOLAZY_RMID set in events in mon_events.
  * @uflags:		monr level cmt_user_flags, or'ed with pkg_uflags.
  * @pkg_uflags:		package level cmt_user_flags, each entry is used as
@@ -335,6 +337,7 @@ struct monr {
 
 	enum monr_flags			flags;
 	int				nr_has_user;
+	int				nr_nosteal_rmid;
 	int				nr_nolazy_rmid;
 	enum cmt_user_flags		uflags;
 	enum cmt_user_flags		pkg_uflags[];
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (42 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 45/46] perf/stat: fix bug in handling events in error state David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events David Carrillo-Cisneros
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

Add a debugfs directory to intel_cmt to help maintenance. It exposes the
following human-readable snapshots of the internals of the CMT driver:
  - hrchy: a per-monr view of the monr hierarchy.
  - pkgs: a per-package view of all online struct pkg_data.
  - rmids: a per-package view of the occupancy and state of all rmids.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 arch/x86/events/intel/cmt.c | 385 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 385 insertions(+)

diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
index 8cbcbc6..554ebd2 100644
--- a/arch/x86/events/intel/cmt.c
+++ b/arch/x86/events/intel/cmt.c
@@ -2372,6 +2372,376 @@ static ssize_t max_recycle_threshold_store(struct device *dev,
 
 static DEVICE_ATTR_RW(max_recycle_threshold);
 
+
+#define INTEL_CMT_DEBUGFS
+
+#ifdef INTEL_CMT_DEBUGFS
+
+#include <linux/debugfs.h>
+
+#define DBG_PRINTF(format__, ...) \
+	seq_printf(s, "%*s" format__, 4 * pad, "", ##__VA_ARGS__)
+
+static void cmt_dbg_show__pmonr(struct seq_file *s,
+				struct pmonr *pmonr,
+				int pad)
+{
+	struct pmonr *pos;
+	union pmonr_rmids rmids;
+	static const char * const state_strs[] = {"OFF", "UNUSED", "ACTIVE",
+						  "DEP_IDLE", "DEP_DIRTY"};
+
+	DBG_PRINTF("pmonr: (%p, pkgid: %d, monr: %p)\n",
+			pmonr, pmonr->pkgd->pkgid, pmonr->monr);
+	pad++;
+
+	rmids.value = atomic64_read(&pmonr->atomic_rmids);
+	DBG_PRINTF("atomic_rmids: (%d,%d), state: %s, lender: %p\n",
+			rmids.sched_rmid, rmids.read_rmid,
+			state_strs[pmonr->state], pmonr->lender);
+
+	if (pmonr->state == PMONR_ACTIVE) {
+		DBG_PRINTF("pmonr_deps_head:");
+		list_for_each_entry(pos, &pmonr->pmonr_deps_head,
+				    pmonr_deps_entry) {
+			seq_printf(s, "%p,", pos);
+		}
+		seq_puts(s, "\n");
+	}
+
+	DBG_PRINTF("last_enter_active: %lu\n",
+			atomic64_read(&pmonr->last_enter_active));
+
+}
+
+static void cmt_dbg_show__monr(struct seq_file *s, struct monr *monr, int pad)
+{
+	struct pkg_data *pkgd = NULL;
+	struct monr *pos;
+	struct pmonr *pmonr;
+	int p;
+
+	DBG_PRINTF("\nmonr: %p, parent: %p\n", monr, monr->parent);
+
+	pad++;
+
+	DBG_PRINTF("children: [");
+
+
+	list_for_each_entry(pos, &monr->children, parent_entry)
+		seq_printf(s, "%p, ", pos);
+
+	seq_puts(s, "]\n");
+
+	DBG_PRINTF("mon_cgrp: (%s, %p)",
+		monr->mon_cgrp ? monr->mon_cgrp->css.cgroup->kn->name : "NA",
+		monr->mon_cgrp);
+	DBG_PRINTF("mon_events: %p\n", monr->mon_events);
+
+	DBG_PRINTF("pmonrs:\n");
+	rcu_read_lock();
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		pmonr = pkgd_pmonr(pkgd, monr);
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+	}
+	rcu_read_unlock();
+
+	DBG_PRINTF("last_rmid_recoup: %lu, nr_dep_pmonrs: %u, flags: %x\n",
+		atomic64_read(&monr->last_rmid_recoup),
+		atomic_read(&monr->nr_dep_pmonrs), monr->flags);
+	DBG_PRINTF("nr_has_user: %d, nr_nosteal_rmid: %d,",
+		monr->nr_has_user, monr->nr_nosteal_rmid);
+	DBG_PRINTF("nr_nolazy_rmid: %d, uflags: %x\n",
+		monr->nr_nolazy_rmid, monr->uflags);
+	DBG_PRINTF("pkg_uflags: [");
+	for (p = 0; p < topology_max_packages(); p++)
+		seq_printf(s, "%x;", monr->pkg_uflags[p]);
+	seq_puts(s, "]\n");
+}
+
+
+static void cmt_dbg_show__pkgd(struct seq_file *s,
+			       struct pkg_data *pkgd, int pad, bool csd_full)
+{
+	struct pmonr *pmonr;
+	unsigned long flags;
+	int r;
+
+	raw_spin_lock_irqsave(&pkgd->lock, flags);
+
+	DBG_PRINTF("\npkgd: %p, pkgid: %d, max_rmid: %d\n",
+		pkgd, pkgd->pkgid, pkgd->max_rmid);
+	pad++;
+
+	DBG_PRINTF("free_rmids: [%*pbl]\n",
+			CMT_MAX_NR_RMIDS, pkgd->free_rmids);
+	DBG_PRINTF("dirty_rmids: [%*pbl]\n",
+			CMT_MAX_NR_RMIDS, pkgd->dirty_rmids);
+
+	DBG_PRINTF("active_pmonrs:\n");
+	list_for_each_entry(pmonr, &pkgd->active_pmonrs, rot_entry)
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+
+	DBG_PRINTF("dep_idle_pmonrs:\n");
+	list_for_each_entry(pmonr, &pkgd->dep_idle_pmonrs, rot_entry)
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+
+	DBG_PRINTF("dep_dirty_pmonrs:\n");
+	list_for_each_entry(pmonr, &pkgd->dep_dirty_pmonrs, rot_entry)
+		cmt_dbg_show__pmonr(s, pmonr, pad + 1);
+	/*
+	 * Only print the pmonr pointer since these pmonrs are already listed
+	 * above as either Dep_Idle or Dep_Dirty pmonrs.
+	 */
+	DBG_PRINTF("dep_pmonrs: [");
+	list_for_each_entry(pmonr, &pkgd->dep_pmonrs, pkgd_deps_entry)
+		seq_printf(s, "%p,", pmonr);
+	seq_puts(s, "]\n");
+
+	DBG_PRINTF("nr_dirty_rmids: %d, nr_dep_pmonrs: %d\n",
+		pkgd->nr_dirty_rmids, pkgd->nr_dep_pmonrs);
+
+	DBG_PRINTF("work_cpu: %d\n", pkgd->work_cpu);
+
+	DBG_PRINTF("ccsds (");
+	if (csd_full)
+		seq_puts(s, "flags, info, func, ");
+	seq_puts(s, "on_read, value, ret, rmid): [");
+
+	pad += 1;
+	for (r = 0; r <= pkgd->max_rmid; r++) {
+		struct cmt_csd *ccsd = &pkgd->ccsds[r];
+
+		if (r % 4 == 0) {
+			seq_puts(s, "\n");
+			DBG_PRINTF("(");
+		} else {
+			seq_puts(s, "(");
+		}
+
+		if (csd_full) {
+			seq_printf(s, "%d,  %p, %p, ", ccsd->csd.flags,
+				   ccsd->csd.info, ccsd->csd.func);
+		}
+		seq_printf(s, "%d, %llu, %d, %d",
+			atomic_read(&ccsd->on_read), ccsd->value,
+			ccsd->ret, ccsd->rmid);
+		seq_puts(s, "),\t");
+	}
+	seq_puts(s, "]");
+	pad -= 1;
+
+	raw_spin_unlock_irqrestore(&pkgd->lock, flags);
+}
+
+static int cmt_dbg_pkgs_show(struct seq_file *s, void *unused)
+{
+	struct pkg_data *pkgd = NULL;
+	int pad = 0;
+
+	mutex_lock(&cmt_mutex);
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		cmt_dbg_show__pkgd(s, pkgd, pad, false);
+		seq_puts(s, "\n");
+	}
+
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+}
+
+static int cmt_dbg_hrchy_show(struct seq_file *s, void *unused)
+{
+	struct monr *pos = NULL;
+	int pad = 0;
+
+	mutex_lock(&cmt_mutex);
+	while ((pos = monr_next_descendant_pre(pos, monr_hrchy_root)))
+		cmt_dbg_show__monr(s, pos, pad);
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+}
+
+/* Must run on a CPU in pkgd's pkg. */
+static int cmt_dbg_rmids__rmids(struct seq_file *s, struct pkg_data *pkgd,
+				unsigned long *rmids, int pad)
+{
+	unsigned long zero_val[CMT_MAX_NR_RMIDS_LONGS];
+	int err, r, nr_printed = 0;
+	u64 val;
+
+	bitmap_copy(zero_val, rmids, CMT_MAX_NR_RMIDS);
+
+	DBG_PRINTF("non-zero value (rmid, scaled llc_occupancy): [");
+	pad++;
+	for_each_set_bit(r, rmids, CMT_MAX_NR_RMIDS) {
+		err = cmt_rmid_read(r, &val);
+		if (!err && !val)
+			continue;
+
+		if (nr_printed % 4 == 0) {
+			seq_puts(s, "\n");
+			DBG_PRINTF("(");
+		} else {
+			seq_puts(s, "(");
+		}
+		nr_printed++;
+
+		if (err) {
+			seq_printf(s, "%d, error: %d),\t", r, err);
+			__clear_bit(r, zero_val);
+			continue;
+		}
+		seq_printf(s, "%d, %llu), ", r, val * cmt_l3_scale);
+		__clear_bit(r, zero_val);
+	}
+	seq_puts(s, "]\n");
+	pad--;
+
+	DBG_PRINTF("zero value: [%*pbl]\n", CMT_MAX_NR_RMIDS, zero_val);
+
+	return 0;
+}
+
+static int __cmt_dbg_rmids__pkgd(struct seq_file *s,
+				struct pkg_data *pkgd, int pad)
+{
+	unsigned long rmids_in_pmonr[CMT_MAX_NR_RMIDS_LONGS];
+	int r;
+
+	memset(rmids_in_pmonr, 0, CMT_MAX_NR_RMIDS_BYTES);
+	bitmap_fill(rmids_in_pmonr, pkgd->max_rmid + 1);
+
+	for_each_set_bit(r, pkgd->free_rmids, CMT_MAX_NR_RMIDS)
+		__clear_bit(r, rmids_in_pmonr);
+
+	for_each_set_bit(r, pkgd->dirty_rmids, CMT_MAX_NR_RMIDS)
+		__clear_bit(r, rmids_in_pmonr);
+
+
+	raw_spin_lock(&pkgd->lock);
+
+	DBG_PRINTF("free_rmids:\n");
+	cmt_dbg_rmids__rmids(s, pkgd, pkgd->free_rmids, pad + 1);
+
+	DBG_PRINTF("dirty_rmids:\n");
+	cmt_dbg_rmids__rmids(s, pkgd, pkgd->dirty_rmids, pad + 1);
+
+	DBG_PRINTF("rmids_in_pmonr:\n");
+	cmt_dbg_rmids__rmids(s, pkgd, rmids_in_pmonr, pad + 1);
+
+	raw_spin_unlock(&pkgd->lock);
+
+	return 0;
+}
+
+struct dbg_smp_data {
+	struct seq_file *s;
+	struct pkg_data *pkgd;
+	int pad;
+};
+
+void cmt_dbg_rmids_pkgd(void *data)
+{
+	struct dbg_smp_data *d;
+
+	d = (struct dbg_smp_data *)data;
+	__cmt_dbg_rmids__pkgd(d->s, d->pkgd, d->pad);
+}
+
+static int cmt_dbg_rmids_show(struct seq_file *s, void *unused)
+{
+	struct dbg_smp_data d;
+	struct pkg_data *pkgd = NULL;
+	int pad = 0, err;
+
+	mutex_lock(&cmt_mutex);
+
+	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
+		DBG_PRINTF("pkgid: %d\n", pkgd->pkgid);
+		d.s = s;
+		d.pkgd = pkgd;
+		d.pad = pad + 1;
+		err = smp_call_function_single(pkgd->work_cpu,
+					       cmt_dbg_rmids_pkgd, &d, true);
+		seq_puts(s, "\n");
+	}
+
+	mutex_unlock(&cmt_mutex);
+
+	return 0;
+
+}
+
+#define CMT_DBGS_FILE(name__) \
+static int cmt_dbg_ ## name__ ## _open(struct inode *inode, struct file *file)\
+{\
+	return single_open(file, cmt_dbg_ ## name__ ## _show,\
+			   inode->i_private);\
+} \
+static const struct file_operations cmt_dbg_ ## name__ ## _ops = {\
+	.open		= cmt_dbg_ ## name__ ## _open,\
+	.read		= seq_read,\
+	.llseek		= seq_lseek,\
+	.release	= single_release,\
+}
+
+CMT_DBGS_FILE(hrchy);
+CMT_DBGS_FILE(pkgs);
+CMT_DBGS_FILE(rmids);
+
+struct dentry *cmt_dbgfs_root;
+
+static int start_debugfs(void)
+{
+	struct dentry *root, *pkgs, *hrchy, *rmids;
+
+	root = debugfs_create_dir("intel_cmt", NULL);
+	if (IS_ERR(root))
+		return PTR_ERR(root);
+
+	pkgs = debugfs_create_file("pkgs", 0444,
+				   root, NULL, &cmt_dbg_pkgs_ops);
+	if (IS_ERR(pkgs))
+		return PTR_ERR(pkgs);
+
+	hrchy = debugfs_create_file("hrchy", 0444,
+				   root, NULL, &cmt_dbg_hrchy_ops);
+	if (IS_ERR(hrchy))
+		return PTR_ERR(hrchy);
+
+	rmids = debugfs_create_file("rmids", 0444,
+				   root, NULL, &cmt_dbg_rmids_ops);
+	if (IS_ERR(rmids))
+		return PTR_ERR(rmids);
+
+	cmt_dbgfs_root = root;
+
+	return 0;
+}
+
+static void stop_debugfs(void)
+{
+	if (!cmt_dbgfs_root)
+		return;
+	debugfs_remove_recursive(cmt_dbgfs_root);
+	cmt_dbgfs_root = NULL;
+}
+
+#else
+
+static int start_debugfs(void)
+{
+	return 0;
+}
+
+static void stop_debugfs(void)
+{
+}
+
+#endif
+
 static struct attribute *intel_cmt_attrs[] = {
 	&dev_attr_max_recycle_threshold.attr,
 	NULL,
@@ -2774,6 +3144,15 @@ static void cmt_stop(void)
 	mutex_unlock(&cmt_mutex);
 }
 
+static void intel_cmt_terminate(void)
+{
+	stop_debugfs();
+	static_branch_dec(&pqr_common_enable_key);
+	perf_pmu_unregister(&intel_cmt_pmu);
+	cmt_stop();
+	cmt_dealloc();
+}
+
 static int __init cmt_alloc(void)
 {
 	pkg_uflags_size = sizeof(*pkg_uflags_zeroes) * topology_max_packages();
@@ -2904,6 +3283,12 @@ static int __init intel_cmt_init(void)
 
 	static_branch_inc(&pqr_common_enable_key);
 
+	err = start_debugfs();
+	if (err) {
+		intel_cmt_terminate();
+		goto err_exit;
+	}
+
 	return err;
 
 err_stop:
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 45/46] perf/stat: fix bug in handling events in error state
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (43 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  2016-10-30  0:38 ` [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events David Carrillo-Cisneros
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

From: Stephane Eranian <eranian@google.com>

When an event is in error state, read() returns 0
instead of the size of the buffer. In certain modes, such
as interval printing, ignoring the 0 return value
may cause bogus count deltas to be computed and
thus invalid results to be printed.

This patch fixes the problem by modifying read_counters()
to mark the event as not scaled (scaled = -1), forcing
the printout routine to show <NOT COUNTED>.
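
For illustration, a minimal user-space sketch of the failure mode described
above; the struct layout and helper name are illustrative, not taken from
the perf tool sources:

	#include <stdint.h>
	#include <unistd.h>

	struct counts { uint64_t val, ena, run; };

	/* Returns 0 on a complete read, -1 otherwise. */
	static int read_count(int fd, struct counts *c)
	{
		ssize_t n = read(fd, c, sizeof(*c));

		/*
		 * Checking only for n < 0 misses the n == 0 case (event in
		 * error state): *c stays stale and interval deltas become
		 * bogus. Treating n <= 0 as a failure, as this patch does,
		 * avoids that.
		 */
		return n == (ssize_t)sizeof(*c) ? 0 : -1;
	}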

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/builtin-stat.c | 12 +++++++++---
 tools/perf/util/evsel.c   |  4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 688dea7..c3c4b49 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -310,8 +310,12 @@ static int read_counter(struct perf_evsel *counter)
 			struct perf_counts_values *count;
 
 			count = perf_counts(counter->counts, cpu, thread);
-			if (perf_evsel__read(counter, cpu, thread, count))
+			if (perf_evsel__read(counter, cpu, thread, count)) {
+				counter->counts->scaled = -1;
+				perf_counts(counter->counts, cpu, thread)->ena = 0;
+				perf_counts(counter->counts, cpu, thread)->run = 0;
 				return -1;
+			}
 
 			if (STAT_RECORD) {
 				if (perf_evsel__write_stat_event(counter, cpu, thread, count)) {
@@ -336,12 +340,14 @@ static int read_counter(struct perf_evsel *counter)
 static void read_counters(void)
 {
 	struct perf_evsel *counter;
+	int ret;
 
 	evlist__for_each_entry(evsel_list, counter) {
-		if (read_counter(counter))
+		ret = read_counter(counter);
+		if (ret)
 			pr_debug("failed to read counter %s\n", counter->name);
 
-		if (perf_stat_process_counter(&stat_config, counter))
+		if (ret == 0 && perf_stat_process_counter(&stat_config, counter))
 			pr_warning("failed to process counter %s\n", counter->name);
 	}
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 8bc2711..d54efb5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1221,7 +1221,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
-	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
+	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
 		return -errno;
 
 	return 0;
@@ -1239,7 +1239,7 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 	if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
 		return -ENOMEM;
 
-	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) < 0)
+	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
 		return -errno;
 
 	perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events
  2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
                   ` (44 preceding siblings ...)
  2016-10-30  0:38 ` [PATCH v3 45/46] perf/stat: fix bug in handling events in error state David Carrillo-Cisneros
@ 2016-10-30  0:38 ` David Carrillo-Cisneros
  45 siblings, 0 replies; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-10-30  0:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian, David Carrillo-Cisneros

A package wide event can return a valid read even if it has not run on a
specific cpu; this does not fit well with the assumption that run == 0
is equivalent to <not counted>.

To fix the problem, this patch defines special error values for val,
run and ena (~0ULL) and uses them to signal read errors, allowing run == 0
to be a valid value for package events. A new value, (NA), is output on
read error and when the event has not been enabled (time enabled == 0).

Finally, this patch revamps the calculation of deltas and scaling for
snapshot events, removing the calculation of deltas for time running and
enabled in snapshot events, as it should be.

Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 tools/perf/builtin-stat.c | 36 +++++++++++++++++++++++-----------
 tools/perf/util/counts.h  | 19 ++++++++++++++++++
 tools/perf/util/evsel.c   | 49 ++++++++++++++++++++++++++++++++++++-----------
 tools/perf/util/evsel.h   |  8 ++++++--
 tools/perf/util/stat.c    | 35 +++++++++++----------------------
 5 files changed, 99 insertions(+), 48 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c3c4b49..79043a3 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -311,10 +311,8 @@ static int read_counter(struct perf_evsel *counter)
 
 			count = perf_counts(counter->counts, cpu, thread);
 			if (perf_evsel__read(counter, cpu, thread, count)) {
-				counter->counts->scaled = -1;
-				perf_counts(counter->counts, cpu, thread)->ena = 0;
-				perf_counts(counter->counts, cpu, thread)->run = 0;
-				return -1;
+				/* do not write stat for failed reads. */
+				continue;
 			}
 
 			if (STAT_RECORD) {
@@ -725,12 +723,16 @@ static int run_perf_stat(int argc, const char **argv)
 
 static void print_running(u64 run, u64 ena)
 {
+	bool is_na = run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || !ena;
+
 	if (csv_output) {
-		fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
-					csv_sep,
-					run,
-					csv_sep,
-					ena ? 100.0 * run / ena : 100.0);
+		if (is_na)
+			fprintf(stat_config.output, "%sNA%sNA", csv_sep, csv_sep);
+		else
+			fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
+				csv_sep, run, csv_sep, 100.0 * run / ena);
+	} else if (is_na) {
+		fprintf(stat_config.output, "  (NA)");
 	} else if (run != ena) {
 		fprintf(stat_config.output, "  (%.2f%%)", 100.0 * run / ena);
 	}
@@ -1103,7 +1105,7 @@ static void printout(int id, int nr, struct perf_evsel *counter, double uval,
 		if (counter->cgrp)
 			os.nfields++;
 	}
-	if (run == 0 || ena == 0 || counter->counts->scaled == -1) {
+	if (run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || counter->counts->scaled == -1) {
 		if (metric_only) {
 			pm(&os, NULL, "", "", 0);
 			return;
@@ -1209,12 +1211,17 @@ static void print_aggr(char *prefix)
 		id = aggr_map->map[s];
 		first = true;
 		evlist__for_each_entry(evsel_list, counter) {
+			bool all_nan = true;
 			val = ena = run = 0;
 			nr = 0;
 			for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
 				s2 = aggr_get_id(perf_evsel__cpus(counter), cpu);
 				if (s2 != id)
 					continue;
+				/* skip NA reads. */
+				if (perf_counts_values__is_na(perf_counts(counter->counts, cpu, 0)))
+					continue;
+				all_nan = false;
 				val += perf_counts(counter->counts, cpu, 0)->val;
 				ena += perf_counts(counter->counts, cpu, 0)->ena;
 				run += perf_counts(counter->counts, cpu, 0)->run;
@@ -1228,6 +1235,10 @@ static void print_aggr(char *prefix)
 				fprintf(output, "%s", prefix);
 
 			uval = val * counter->scale;
+			if (all_nan) {
+				run = PERF_COUNTS_NA;
+				ena = PERF_COUNTS_NA;
+			}
 			printout(id, nr, counter, uval, prefix, run, ena, 1.0);
 			if (!metric_only)
 				fputc('\n', output);
@@ -1306,7 +1317,10 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 		if (prefix)
 			fprintf(output, "%s", prefix);
 
-		uval = val * counter->scale;
+		if (val != PERF_COUNTS_NA)
+			uval = val * counter->scale;
+		else
+			uval = NAN;
 		printout(cpu, 0, counter, uval, prefix, run, ena, 1.0);
 
 		fputc('\n', output);
diff --git a/tools/perf/util/counts.h b/tools/perf/util/counts.h
index 34d8baa..b65e97a 100644
--- a/tools/perf/util/counts.h
+++ b/tools/perf/util/counts.h
@@ -3,6 +3,9 @@
 
 #include "xyarray.h"
 
+/* Not Available (NA) value. Any operation with a NA equals a NA. */
+#define PERF_COUNTS_NA ((u64)~0ULL)
+
 struct perf_counts_values {
 	union {
 		struct {
@@ -14,6 +17,22 @@ struct perf_counts_values {
 	};
 };
 
+static inline void
+perf_counts_values__make_na(struct perf_counts_values *values)
+{
+	values->val = PERF_COUNTS_NA;
+	values->ena = PERF_COUNTS_NA;
+	values->run = PERF_COUNTS_NA;
+}
+
+static inline bool
+perf_counts_values__is_na(struct perf_counts_values *values)
+{
+	return values->val == PERF_COUNTS_NA ||
+	       values->ena == PERF_COUNTS_NA ||
+	       values->run == PERF_COUNTS_NA;
+}
+
 struct perf_counts {
 	s8			  scaled;
 	struct perf_counts_values aggr;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d54efb5..fa0ba96 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1180,6 +1180,9 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 	if (!evsel->prev_raw_counts)
 		return;
 
+	if (perf_counts_values__is_na(count))
+		return;
+
 	if (cpu == -1) {
 		tmp = evsel->prev_raw_counts->aggr;
 		evsel->prev_raw_counts->aggr = *count;
@@ -1188,26 +1191,43 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 		*perf_counts(evsel->prev_raw_counts, cpu, thread) = *count;
 	}
 
-	count->val = count->val - tmp.val;
+	/* Snapshot events do not calculate deltas for count values. */
+	if (!evsel->snapshot)
+		count->val = count->val - tmp.val;
 	count->ena = count->ena - tmp.ena;
 	count->run = count->run - tmp.run;
 }
 
 void perf_counts_values__scale(struct perf_counts_values *count,
-			       bool scale, s8 *pscaled)
+			       bool scale, bool per_pkg, bool snapshot, s8 *pscaled)
 {
 	s8 scaled = 0;
 
+	if (perf_counts_values__is_na(count)) {
+		if (pscaled)
+			*pscaled = -1;
+		return;
+	}
+
 	if (scale) {
-		if (count->run == 0) {
+		/*
+		 * per-pkg events can have run == 0 in a CPU and still be
+		 * valid.
+		 */
+		if (count->run == 0 && !per_pkg) {
 			scaled = -1;
 			count->val = 0;
 		} else if (count->run < count->ena) {
 			scaled = 1;
-			count->val = (u64)((double) count->val * count->ena / count->run + 0.5);
+			/* Snapshot events do not scale counts values. */
+			if (!snapshot && count->run)
+				count->val = (u64)((double) count->val * count->ena /
+					     count->run + 0.5);
 		}
-	} else
-		count->ena = count->run = 0;
+
+	} else {
+		count->run = count->ena;
+	}
 
 	if (pscaled)
 		*pscaled = scaled;
@@ -1221,8 +1241,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 	if (FD(evsel, cpu, thread) < 0)
 		return -EINVAL;
 
-	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
+	if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0) {
+		perf_counts_values__make_na(count);
 		return -errno;
+	}
 
 	return 0;
 }
@@ -1230,6 +1252,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 			      int cpu, int thread, bool scale)
 {
+	int ret = 0;
 	struct perf_counts_values count;
 	size_t nv = scale ? 3 : 1;
 
@@ -1239,13 +1262,17 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 	if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
 		return -ENOMEM;
 
-	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
-		return -errno;
+	if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0) {
+		perf_counts_values__make_na(&count);
+		ret = -errno;
+		goto exit;
+	}
 
 	perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-	perf_counts_values__scale(&count, scale, NULL);
+	perf_counts_values__scale(&count, scale, evsel->per_pkg, evsel->snapshot, NULL);
+exit:
 	*perf_counts(evsel->counts, cpu, thread) = count;
-	return 0;
+	return ret;
 }
 
 static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index b1503b0..facb6494 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -80,6 +80,10 @@ struct perf_evsel_config_term {
  * @is_pos: the position (counting backwards) of the event id (PERF_SAMPLE_ID or
  *          PERF_SAMPLE_IDENTIFIER) in a non-sample event i.e. if sample_id_all
  *          is used there is an id sample appended to non-sample events
+ * @snapshot: an event whose raw value cannot be extrapolated based on
+ *	    the ratio of running/enabled time.
+ * @per_pkg: an event that runs package wide. All cores in the same package
+ *	    will read the same value, even if running time == 0.
  * @priv:   And what is in its containing unnamed union are tool specific
  */
 struct perf_evsel {
@@ -150,8 +154,8 @@ static inline int perf_evsel__nr_cpus(struct perf_evsel *evsel)
 	return perf_evsel__cpus(evsel)->nr;
 }
 
-void perf_counts_values__scale(struct perf_counts_values *count,
-			       bool scale, s8 *pscaled);
+void perf_counts_values__scale(struct perf_counts_values *count, bool scale,
+			       bool per_pkg, bool snapshot, s8 *pscaled);
 
 void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
 				struct perf_counts_values *count);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 39345c2d..514b953 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -202,7 +202,7 @@ static void zero_per_pkg(struct perf_evsel *counter)
 }
 
 static int check_per_pkg(struct perf_evsel *counter,
-			 struct perf_counts_values *vals, int cpu, bool *skip)
+			 int cpu, bool *skip)
 {
 	unsigned long *mask = counter->per_pkg_mask;
 	struct cpu_map *cpus = perf_evsel__cpus(counter);
@@ -224,17 +224,6 @@ static int check_per_pkg(struct perf_evsel *counter,
 		counter->per_pkg_mask = mask;
 	}
 
-	/*
-	 * we do not consider an event that has not run as a good
-	 * instance to mark a package as used (skip=1). Otherwise
-	 * we may run into a situation where the first CPU in a package
-	 * is not running anything, yet the second is, and this function
-	 * would mark the package as used after the first CPU and would
-	 * not read the values from the second CPU.
-	 */
-	if (!(vals->run && vals->ena))
-		return 0;
-
 	s = cpu_map__get_socket(cpus, cpu, NULL);
 	if (s < 0)
 		return -1;
@@ -249,30 +238,27 @@ process_counter_values(struct perf_stat_config *config, struct perf_evsel *evsel
 		       struct perf_counts_values *count)
 {
 	struct perf_counts_values *aggr = &evsel->counts->aggr;
-	static struct perf_counts_values zero;
 	bool skip = false;
 
-	if (check_per_pkg(evsel, count, cpu, &skip)) {
+	if (check_per_pkg(evsel, cpu, &skip)) {
 		pr_err("failed to read per-pkg counter\n");
 		return -1;
 	}
 
-	if (skip)
-		count = &zero;
-
 	switch (config->aggr_mode) {
 	case AGGR_THREAD:
 	case AGGR_CORE:
 	case AGGR_SOCKET:
 	case AGGR_NONE:
-		if (!evsel->snapshot)
-			perf_evsel__compute_deltas(evsel, cpu, thread, count);
-		perf_counts_values__scale(count, config->scale, NULL);
+		perf_evsel__compute_deltas(evsel, cpu, thread, count);
+		perf_counts_values__scale(count, config->scale,
+					  evsel->per_pkg, evsel->snapshot, NULL);
 		if (config->aggr_mode == AGGR_NONE)
 			perf_stat__update_shadow_stats(evsel, count->values, cpu);
 		break;
 	case AGGR_GLOBAL:
-		aggr->val += count->val;
+		if (!skip)
+			aggr->val += count->val;
 		if (config->scale) {
 			aggr->ena += count->ena;
 			aggr->run += count->run;
@@ -337,9 +323,10 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	if (config->aggr_mode != AGGR_GLOBAL)
 		return 0;
 
-	if (!counter->snapshot)
-		perf_evsel__compute_deltas(counter, -1, -1, aggr);
-	perf_counts_values__scale(aggr, config->scale, &counter->counts->scaled);
+	perf_evsel__compute_deltas(counter, -1, -1, aggr);
+	perf_counts_values__scale(aggr, config->scale,
+				  counter->per_pkg, counter->snapshot,
+				  &counter->counts->scaled);
 
 	for (i = 0; i < 3; i++)
 		update_stats(&ps->res_stats[i], count[i]);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support
  2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
@ 2016-11-10 15:19   ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 15:19 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:

> +static void free_pkg_data(struct pkg_data *pkg_data)
> +{
> +	kfree(pkg_data);
> +}

So this is called from __terminate_pkg_data() which itself is called from
some random other place. Is this a code chasing game or is there a
technical reason why functions which belong together are not grouped
together?

> +/* Init pkg_data for @cpu 's package. */
> +static struct pkg_data *alloc_pkg_data(int cpu)
> +{
> +	struct cpuinfo_x86 *c = &cpu_data(cpu);
> +	struct pkg_data *pkgd;
> +	int numa_node = cpu_to_node(cpu);
> +	u16 pkgid = topology_logical_package_id(cpu);

Can you please sort the variables in reverse fir tree order?

	u16 pkgid = topology_logical_package_id(cpu);
	struct cpuinfo_x86 *c = &cpu_data(cpu);
	int numa_node = cpu_to_node(cpu);
	struct pkg_data *pkgd;

That's way simpler to parse than this random ordering.

> +
> +	if (c->x86_cache_occ_scale != cmt_l3_scale) {

And why would c->x86_cache_occ_scale be initialized already when you really
hotplug a CPU? It cannot be initialized because it is done from
identify_cpu() when the cpu actually starts.

You just never noticed because the driver initializes _AFTER_ all the cpus
are brought up and you never bothered to limit the number of cpus which are
brought up at boot time to a single node and then bring up the other node
_after_ loading the driver.

> +		/* 0 scale must have been converted to 1 automatically. */
> +		if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {

This check will just explode in your face when you do the above because
c->x86_cache_occ_scale is 0.

So IOW. This is broken and the wrong place to do this.

> +			pr_err("Multiple LLC scale values, disabling CMT support.\n");

Interesting. You disable CMT support. That's true for init(). In the real
hotplug case you prevent bringing the cpu up, so the message is misleading
because CMT for the already online cpus is already working and keeps doing so.

> +			return ERR_PTR(-ENXIO);
> +		}
> +	}
> +
> +	pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, numa_node);
> +	if (!pkgd)
> +		return ERR_PTR(-ENOMEM);
> +
> +	pkgd->max_rmid = c->x86_cache_max_rmid;
> +
> +	pkgd->work_cpu = cpu;

This is wrong. This wants to be -1 or something invalid. We can now stop
the hotplug process of a CPU at some random state. So if we stop right
after this callback, then this not yet online cpu is set as work cpu, and if
we then bring up another cpu in the package, then it operates with a stale
work cpu. Please make stuff symmetric. The pre online prep stage is just
there to prepare data and pre-initialize it. Anything which is related to
operational state has to be done at the point where things become
operational and undone at the same state when going down.
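
A minimal sketch of such a symmetric scheme: leave pkgd->work_cpu invalid
in the prep stage and only assign/clear it in the online callbacks.
pkg_data_of() is an assumed lookup helper, and a signed work_cpu with -1
as the invalid value is an assumption, not what the patch does:

	static int intel_cmt_hp_online_enter(unsigned int cpu)
	{
		struct pkg_data *pkgd = pkg_data_of(cpu);

		/* The package becomes operational here, not in the prep stage. */
		if (pkgd->work_cpu < 0)
			pkgd->work_cpu = cpu;
		return 0;
	}

	static int intel_cmt_hp_online_exit(unsigned int cpu)
	{
		struct pkg_data *pkgd = pkg_data_of(cpu);

		/* Undo at the same stage when going down. */
		if (pkgd->work_cpu == cpu)
			pkgd->work_cpu = -1;	/* or hand over to another online cpu */
		return 0;
	}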

> +	pkgd->pkgid = pkgid;
> +
> +	__min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);

What protects against the case where rmids are in use already and this cuts
__min_max_rmid short during hotplug?

> +static int init_pkg_data(int cpu)
> +{
> +	struct pkg_data *pkgd;
> +	u16 pkgid = topology_logical_package_id(cpu);
> +
> +	lockdep_assert_held(&cmt_mutex);
> +
> +	/* Verify that this pkgid isn't already initialized. */
> +	if (WARN_ON_ONCE(cmt_pkgs_data[pkgid]))
> +		return -EPERM;

For one, this is a direct dereference of something which claims to be rcu
protected. That's inconsistent.

Further this check is completely pointless. This function is called from
intel_cmt_prep_up() after detecting that there is no package data for this
particular package id. I'm all for defensive programming, but this is just
beyond silly.

Aside from that, why are you looking up pkgid in three functions in a row
instead of simply handing it from one to the other?

 cmt_prep_up() -> init_pkg_data() -> alloc_pkg_data()
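
Something along these lines would do. This is only a sketch: the signatures
are adapted from the quoted code and the bodies are trimmed to the lines
relevant to passing pkgid down the chain:

	static struct pkg_data *alloc_pkg_data(int cpu, u16 pkgid)
	{
		struct pkg_data *pkgd;

		pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, cpu_to_node(cpu));
		if (!pkgd)
			return ERR_PTR(-ENOMEM);
		pkgd->pkgid = pkgid;
		return pkgd;
	}

	static int init_pkg_data(int cpu, u16 pkgid)
	{
		struct pkg_data *pkgd = alloc_pkg_data(cpu, pkgid);

		if (IS_ERR(pkgd))
			return PTR_ERR(pkgd);
		rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
		return 0;
	}

	static int intel_cmt_prep_up(unsigned int cpu)
	{
		u16 pkgid = topology_logical_package_id(cpu);
		int err = 0;

		mutex_lock(&cmt_mutex);
		if (!rcu_access_pointer(cmt_pkgs_data[pkgid]))
			err = init_pkg_data(cpu, pkgid);
		mutex_unlock(&cmt_mutex);
		return err;
	}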

> +	pkgd = alloc_pkg_data(cpu);
> +	if (IS_ERR(pkgd))
> +		return PTR_ERR(pkgd);
> +
> +	rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
> +	synchronize_rcu();

And this synchronize_rcu() is required because of what? This is the first
CPU of a package being brought up and you are holding the cmt_mutex.

I might be missing something, but if so, then this is missing a comment.

> +static int intel_cmt_prep_down(unsigned int cpu)
> +{
> +	struct pkg_data *pkgd;
> +	u16 pkgid = topology_logical_package_id(cpu);
> +
> +	mutex_lock(&cmt_mutex);
> +	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
> +					 lockdep_is_held(&cmt_mutex));
> +	if (pkgd->work_cpu >= nr_cpu_ids) {
> +		/* will destroy pkgd */
> +		__terminate_pkg_data(pkgd);

Oh right. You free data _BEFORE_ setting the pointer to NULL and
synchronizing RCU. So anything which is merely using rcu_read_lock() can
get a valid reference and fiddle with freed data. Well done.

> +		RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
> +		synchronize_rcu();
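
For reference, a sketch of the teardown ordering being argued for here,
using the names from the quoted hunk; locking and the work_cpu check are
elided:

	pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
					 lockdep_is_held(&cmt_mutex));
	/* Unpublish first so no new rcu_read_lock() reader can find pkgd. */
	RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
	/* Wait for all pre-existing readers to drop their references. */
	synchronize_rcu();
	/* Only now is it safe to tear down and free the package data. */
	__terminate_pkg_data(pkgd);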

> +static int __init cmt_start(void)
> +{
> +	char *str, scale[20];
> +	int err;
> +
> +	/* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
> +	__min_max_rmid = UINT_MAX;
> +	err = cpuhp_setup_state(CPUHP_PERF_X86_CMT_PREP,
> +				"PERF_X86_CMT_PREP",
> +				intel_cmt_prep_up,
> +				intel_cmt_prep_down);
> +	if (err)
> +		return err;
> +
> +	err = cpuhp_setup_state(CPUHP_AP_PERF_X86_CMT_ONLINE,
> +				"AP_PERF_X86_CMT_ONLINE",
> +				intel_cmt_hp_online_enter,
> +				intel_cmt_hp_online_exit);
> +	if (err)
> +		goto rm_prep;
> +
> +	snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
> +	str = kstrdup(scale, GFP_KERNEL);

That string is duplicated for memory leak detector testing purposes or
what?

> +	if (!str) {
> +		err = -ENOMEM;
> +		goto rm_online;
> +	}
> +
> +	return 0;
> +
> +rm_online:
> +	cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
> +rm_prep:
> +	cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
> +
> +	return err;
> +}
> +
> +static int __init intel_cmt_init(void)
> +{
> +	struct pkg_data *pkgd = NULL;
> +	int err = 0;
> +
> +	if (!x86_match_cpu(intel_cmt_match)) {
> +		err = -ENODEV;
> +		goto err_exit;

This is crap. If a CPU does NOT support this then printing the
registration failed error is just confusing. If it's not supported, return
-ENODEV and be done with it.

> +	}
> +
> +	err = cmt_alloc();
> +	if (err)
> +		goto err_exit;
> +
> +	err = cmt_start();
> +	if (err)
> +		goto err_dealloc;
> +
> +	pr_info("Intel CMT enabled with ");
> +	rcu_read_lock();
> +	while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
> +		pr_cont("%d RMIDs for pkg %d, ",
> +			pkgd->max_rmid + 1, pkgd->pkgid);

And this is useful because? Because it's so nice to have random stuff printed.

The only valuable information is that this is enabled with the detected
possible number of rmids (__min_max_rmid or whatever incomprehensible
variable name you came up with).

> +	}
> +	rcu_read_unlock();
> +	pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);

i.e this should be:

  	pr_info("Intel CMT enabled. %d RMIDs, L3 scale %d KBs", ....);

Am I missing something?

If you really want to print out the per package rmids, then the only reason
to do so is when they actually differ, and that can be done where you do that
min() check.

> +
> +device_initcall(intel_cmt_init);

Oh no. This wants to be a module from the very beginning. No point in
forcing this as a builtin for no reason.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
@ 2016-11-10 21:23   ` Thomas Gleixner
  2016-11-11  2:22     ` David Carrillo-Cisneros
  0 siblings, 1 reply; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 21:23 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:
> Lockdep needs lock_class_key's to be statically initialized and/or use
> nesting, but nesting is currently hard-coded for up to 8 levels and it's
> fragile to depend on lockdep internals.
> To circumvent this problem, statically define CMT_MAX_NR_PKGS number of
> lock_class_key's.

That's a proper circumvention. Define a random number on your own.
 
> Additional details in code's comments.

This is 

 - pointless. Either there are comments or not. I can see that from
   the patch.

 - wrong. Because for this crucial detail of it (the lockdep
   restriction) there is no single comment in the code.

> Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
> ---
>  arch/x86/events/intel/cmt.c | 22 ++++++++++++++++++++++
>  arch/x86/events/intel/cmt.h |  8 ++++++++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/x86/events/intel/cmt.c b/arch/x86/events/intel/cmt.c
> index 267a9ec..f12a06b 100644
> --- a/arch/x86/events/intel/cmt.c
> +++ b/arch/x86/events/intel/cmt.c
> @@ -7,6 +7,14 @@
>  #include "cmt.h"
>  #include "../perf_event.h"
>  
> +/* Increase as needed as Intel CPUs grow. */

As CPUs grow? CMT is a per L3 cache domain, i.e. package, which is
equivalent to a socket today.

> +#define CMT_MAX_NR_PKGS		8

This already breaks on these shiny new 16 socket machines companies are
working on. Not to talk about the 4096 core monstrosities which are shipped
today, which have definitely more than 16 sockets already.

And now the obvious question: How are you going to fix this?

Just by increasing the number? See below.

> +#ifdef CONFIG_LOCKDEP
> +static struct lock_class_key	mutex_keys[CMT_MAX_NR_PKGS];
> +static struct lock_class_key	lock_keys[CMT_MAX_NR_PKGS];
> +#endif
> +
>  static DEFINE_MUTEX(cmt_mutex);
  
> diff --git a/arch/x86/events/intel/cmt.h b/arch/x86/events/intel/cmt.h
> index 8c16797..55416db 100644
> --- a/arch/x86/events/intel/cmt.h
> +++ b/arch/x86/events/intel/cmt.h
> @@ -11,11 +11,16 @@
>   * Rules:
>   *  - cmt_mutex: Hold for CMT init/terminate, event init/terminate,
>   *  cgroup start/stop.
> + *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
> + *  change the monr hierarchy.

So if you hold both locks in 8 packages then you already have 16 locks held
nested. Now double that with 16 packages and you are slowly approaching the
maximum lock chain depth lockdep can cope with. Now think SGI and guess
how that works.

I certainly can understand why you want to split the locks, but my gut
feeling tells me that while it's probably not a big issue on 2/4 socket
machines, it will go down the drain rather fast on anything really big.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu
  2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
@ 2016-11-10 21:27   ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 21:27 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:
>  static int __init cmt_alloc(void)
>  {
>  	cmt_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> @@ -240,6 +339,7 @@ static int __init cmt_start(void)
>  		err = -ENOMEM;
>  		goto rm_online;
>  	}
> +	event_attr_intel_cmt_llc_scale.event_str = str;

And yes, that string dup should happen with this patch and not detached in
the previous one.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization
  2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
@ 2016-11-10 23:09   ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-10 23:09 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:

> Events use monrs to monitor CMT events. Events that share their
> monitoring target (same thread or cgroup) share a monr.

That wants a proper explanation of the concept of monr before explaining
what is done with them.

> This patch introduces monrs and adds support for monr creation/destruction.

'This patch': We know already that this is a patch. See SubmittingPatches
 
> An event's associated monr is referenced by event->cmt_monr (introduced
> in previous patch).
> 
> monr->mon_events references the first event that uses that monr and events
> that share monr are appended to first event's cmt_list list head.
> 
> Hold all pkgd->mutex to modify monr->mon_events and event's data.

Please do not tell what the code does, tell why and explain the concepts.

> Support for CPU and cgroups is added in future patches in this series.

There is cgroups and CPU and whatever stuff already in this one.
 
> More details in code's comments.

Please add proper descriptions of what this is about instead of describing
what the code does. A changelog should give a reviewer a proper overview
of the functionality it introduces or changes. Sending a reviewer comment
hunting is not a replacement for that.

> +static void monr_dealloc(struct monr *monr)
> +{
> +	kfree(monr);
> +}

Again your ordering of functions is interesting. Please group them together
as they are used. If you need one of them at an earlier point for an error
cleanup or such, then please use forward declarations.

> +static struct monr *monr_alloc(void)
> +{
> +	struct monr *monr;
> +
> +	lockdep_assert_held(&cmt_mutex);
> +
> +	monr = kzalloc(sizeof(*monr), GFP_KERNEL);
> +	if (!monr)
> +		return ERR_PTR(-ENOMEM);
> +
> +	return monr;
> +}
> +
> +static inline struct monr *monr_from_event(struct perf_event *event)
> +{
> +	return (struct monr *) READ_ONCE(event->hw.cmt_monr);

That type cast is required because READ_ONCE already returns the type of the
thing you read from, right? So type casting makes more sure of that.

This lacks an explanation of why you need a read once at all, and it would be
more obvious to put the READ_ONCE explicitly into the code where it is
used and required.

> +}
> +
> +static struct monr *monr_remove_event(struct perf_event *event)
> +{
> +	struct monr *monr = monr_from_event(event);

So what is the actual point of the READ_ONCE here?

> +
> +	lockdep_assert_held(&cmt_mutex);
> +	monr_hrchy_assert_held_mutexes();
> +
> +	if (list_empty(&monr->mon_events->hw.cmt_list)) {
> +		monr->mon_events = NULL;
> +		/* remove from cmt_event_monrs */
> +		list_del_init(&monr->entry);
> +	} else {
> +		if (monr->mon_events == event)
> +			monr->mon_events = list_next_entry(event, hw.cmt_list);
> +		list_del_init(&event->hw.cmt_list);
> +	}
> +
> +	WRITE_ONCE(event->hw.cmt_monr, NULL);

And for the WRITE_ONCE here? I must be missing something.

event->hw.cmt_monr cannot change here under your feet. So why force the
compiler to read it once? The write once is equally pointless. It does not
matter when it is set to NULL.

READ_ONCE/WRITE_ONCE stand out in the code and make the reader alert, so he
starts to look for something which might change under his feet. But here
there is nothing. Annoying.

> +
> +	return monr;
> +}
> +
> +static int monr_append_event(struct monr *monr, struct perf_event *event)
> +{
> +	lockdep_assert_held(&cmt_mutex);
> +	monr_hrchy_assert_held_mutexes();

Again. I really appreciate defensive programming, but with all these
repeated asserts in all these callchains, is that code actually making
progress when you have lockdep enabled?

> +	if (monr->mon_events) {
> +		list_add_tail(&event->hw.cmt_list,
> +			      &monr->mon_events->hw.cmt_list);
> +	} else {
> +		monr->mon_events = event;
> +		list_add_tail(&monr->entry, &cmt_event_monrs);
> +	}
> +
> +	WRITE_ONCE(event->hw.cmt_monr, monr);

See above.

> +
> +	return 0;
> +}
> +
> +static bool is_cgroup_event(struct perf_event *event)
> +{
> +	return false;
> +}
> +
> +static int monr_hrchy_attach_cgroup_event(struct perf_event *event)
> +{
> +	return -EPERM;
> +}
> +
> +static int monr_hrchy_attach_cpu_event(struct perf_event *event)
> +{
> +	return -EPERM;
> +}

So why is all this here despite the changelog claiming otherwise?

> +
> +static int monr_hrchy_attach_task_event(struct perf_event *event)
> +{
> +	struct monr *monr;
> +	int err;
> +
> +	monr = monr_alloc();
> +	if (IS_ERR(monr))
> +		return -ENOMEM;
> +
> +	err = monr_append_event(monr, event);
> +	if (err)
> +		monr_dealloc(monr);
> +	return err;
> +}
> +
> +/* Insert or create monr in appropriate position in hierarchy. */
> +static int monr_hrchy_attach_event(struct perf_event *event)
> +{
> +	int err = 0;

This initialization is pointless.

> +
> +	lockdep_assert_held(&cmt_mutex);
> +	monr_hrchy_acquire_mutexes();
> +
> +	if (!is_cgroup_event(event) &&
> +	    !(event->attach_state & PERF_ATTACH_TASK)) {

So while you have a faible for wrapping things which should be in the code,
here and elsewhere you have these written-out checks over and over.

> +		err = monr_hrchy_attach_cpu_event(event);
> +		goto exit;
> +	}
> +	if (is_cgroup_event(event)) {
> +		err = monr_hrchy_attach_cgroup_event(event);
> +		goto exit;
> +	}
> +	err = monr_hrchy_attach_task_event(event);

Can we please avoid these gotos?

	if (is_cpu_event(event))
		err = attach_cpu(event);
	else if (is_cgroup_event(event))
		err = attach_cgroup(event);
	else
		err = attach_task(event);

would be compact and readable code without gotos. Hmm?

> +exit:
> +	monr_hrchy_release_mutexes();
> +
> +	return err;
> +}
> +
> +/**
> + * __match_event() - Determine if @a and @b should share a rmid.

If you use kernel doc comments, then you have to fill out the argument
descriptors as well.

> + */
> +static bool __match_event(struct perf_event *a, struct perf_event *b)

Why underscores? We use underscores when we have something like this:

__bla(...)
{
	do stuff ....
}

bla(...)
{
	lock();
	__bla(...);
	unlock();
}

But not for standalone functions which have no particular counterpart which
does extra work like serialization.

> +{
> +	/* Cgroup/non-task per-cpu and task events don't mix */
> +	if ((a->attach_state & PERF_ATTACH_TASK) !=
> +	    (b->attach_state & PERF_ATTACH_TASK))
> +		return false;

So this would become a simple one liner with a proper wrapper as well

   	if (is_task_event(a) != is_task_event(b))
		return false;

> +
> +#ifdef CONFIG_CGROUP_PERF
> +	if (a->cgrp != b->cgrp)
> +		return false;
> +#endif
> +
> +	/* If not task event, it's a a cgroup or a non-task cpu event. */
> +	if (!(b->attach_state & PERF_ATTACH_TASK))
> +		return true;
> +
> +	/* Events that target same task are placed into the same group. */
> +	if (a->hw.target == b->hw.target)
> +		return true;
> +
> +	/* Are we a inherited event? */
> +	if (b->parent == a)
> +		return true;
> +
> +	return false;
> +}
> +
>  static struct pmu intel_cmt_pmu;

The placement of this in the middle of unrelated code does what? It's
really annoying having to search for stuff at random places.

Please be more careful about placement.

>  
> +/* Try to find a monr with same target, otherwise create new one. */
> +static int mon_group_setup_event(struct perf_event *event)
> +{
> +	struct monr *monr;
> +	int err;
> +
> +	lockdep_assert_held(&cmt_mutex);
> +
> +	list_for_each_entry(monr, &cmt_event_monrs, entry) {
> +		if (!__match_event(monr->mon_events, event))
> +			continue;
> +		monr_hrchy_acquire_mutexes();
> +		err = monr_append_event(monr, event);
> +		monr_hrchy_release_mutexes();
> +		return err;
> +	}
> +	/*
> +	 * Since no match was found, create a new monr and set this
> +	 * event as head of a mon_group. All events in this group
> +	 * will share the monr.
> +	 */
> +	return monr_hrchy_attach_event(event);
> +}
> +
>  static void intel_cmt_event_read(struct perf_event *event)
>  {
>  }
> @@ -68,6 +272,20 @@ static int intel_cmt_event_add(struct perf_event *event, int mode)
>  	return 0;
>  }
>  
> +static void intel_cmt_event_destroy(struct perf_event *event)

So I was looking at the fully applied series and could not find this
function anymore because it was renamed at some random point. Can you
please avoid this for new code? If you rework something existing it might
be necessary, but for new code it's just a pointless exercise.

> +{
> +	struct monr *monr;
> +
> +	mutex_lock(&cmt_mutex);
> +	monr_hrchy_acquire_mutexes();
> +
> +	/* monr is dettached from event. */

Please comment things which are not obvious. I already know that the event
is destroyed and therefore the monr has to be detached. Comments should
explain why and not what. 'What' we can see from the code and if that has
well chosen names, then it's self explanatory. But the 'Why' of something
complex does not spring into the reader's brain right away. That's what
comments are for.

> +	monr = monr_remove_event(event);

Now, what's entirely non-obvious is the assignment of the return value with
no further check or processing. I know you do that in some follow-up patch,
but for now it's just adding a compiler warning.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-10 21:23   ` Thomas Gleixner
@ 2016-11-11  2:22     ` David Carrillo-Cisneros
  2016-11-11  7:21       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-11-11  2:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Ingo Molnar, Andi Kleen, Kan Liang,
	Peter Zijlstra, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

>> To circumvent this problem, statically define CMT_MAX_NR_PKGS number of
>> lock_class_key's.
>
> That's a proper circumvention. Define a random number on your own.

A judiciously chosen number ;) .

>
>> Additional details in code's comments.
>
> This is
>
>  - pointless. Either there are comments or not. I can see that from
>    the patch.
>
>  - wrong. Because for this crucial detail of it (the lockdep
>    restriction) there is no single comment in the code.

Noted. I'll stop adding those notes to the changelogs.

>> +/* Increase as needed as Intel CPUs grow. */
>
> As CPUs grow? CMT is a per L3 cache domain, i.e. package, which is
> equivivalent to socket today..
>
>> +#define CMT_MAX_NR_PKGS              8
>
> This already breaks on these shiny new 16 socket machines companies are
> working on. Not to talk about the 4096 core monstrosities which are shipped
> today, which have definitely more than 16 sockets already.
>
> And now the obvious question: How are you going to fix this?

A possible alternative to nested spinlocks is to use a single rw_lock to
protect changes to the monr hierarchy.

This works because, when manipulating pmonrs, the monr hierarchy (monr's
parent and children) is read but never changed (reader path).
Changes to the monr hierarchy do change pmonrs (writer path).

It would be implemented like:

  // system-wide lock for monr hierarchy
  rwlock_t monr_hrchy_rwlock = __RW_LOCK_UNLOCKED(monr_hrchy_rwlock);


Reader Path: Access pmonrs in same pkgd. Used in pmonr state transitions
and when reading cgroups (read subtree of pmonrs in same package):

v3 (old):

  raw_spin_lock(&pkgd->lock);
  // read monr data, read/write pmonr data
  raw_spin_unlock(&pkgd->lock);

next version:

  read_lock(&monr_hrchy_rwlock);
  raw_spin_lock(&pkgd->lock);
  // read monr data, read/write pmonr data
  raw_spin_unlock(&pkgd->lock);
  read_unlock(&monr_hrchy_rwlock);


Writer Path: Modifies monr hierarchy (add/remove monr in
creation/destruction of events and start/stop monitoring
of cgroups):

v3 (old):

  monr_hrchy_acquire_locks(...); // acquires all pkg->lock
  // read/write monr data, potentially read/write pmonr data
  monr_hrchy_release_locks(...); // release all pkg->lock

v4:

  write_lock(&monr_hrchy_rwlock);
  // read/write monr data, potentially read/write pmonr data
  write_unlock(&monr_hrchy_rwlock);


The writer path should be taken infrequently. The reader path becomes
slower due to the additional read_lock(&monr_hrchy_rwlock). The number
of locks to acquire stays constant regardless of the number of packages
in both paths, and there would be no need to ever nest pkgd->locks.

For v3, I discarded this idea because, due to the additional
read_lock(&monr_hrchy_rwlock), the pmonr state transitions that may
occur in a context switch would be slower than in the current version.
But it may be ultimately better considering the scalability issues
that you mention.


>> + *  - Hold pkg->mutex and pkg->lock in _all_ active packages to traverse or
>> + *  change the monr hierarchy.
>
> So if you hold both locks in 8 packages then you already have 16 locks held
> nested. Now double that with 16 packages and you are slowly approaching the
> maximum lock chain depth lockdep can cope with. Now think SGI and guess
> how that works.
>
> I certainly can understand why you want to split the locks, but my gut
> feeling tells me that while it's probably not a big issue on a 2/4 socket
> machines it will go down the drain rather fast on anything real big.
>
> Thanks,
>
>         tglx

Thanks for reviewing the patches. I really appreciate the feedback.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  2:22     ` David Carrillo-Cisneros
@ 2016-11-11  7:21       ` Peter Zijlstra
  2016-11-11  7:32         ` Ingo Molnar
                           ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2016-11-11  7:21 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Thomas Gleixner, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Thu, Nov 10, 2016 at 06:22:17PM -0800, David Carrillo-Cisneros wrote:
> A possible alternative to nested spinlocks is to use a single rw_lock to
> protect changes to the monr hierarchy.
> 
> This works because, when manipulating pmonrs, the monr hierarchy (monr's
> parent and children) is read but never changed (reader path).
> Changes to the monr hierarchy do change pmonrs (writer path).
> 
> It would be implemented like:
> 
>   // system-wide lock for monr hierarchy
>   rwlock_t monr_hrchy_rwlock = __RW_LOCK_UNLOCKED(monr_hrchy_rwlock);
> 
> 
> Reader Path: Access pmonrs in same pkgd. Used in pmonr state transitions
> and when reading cgroups (read subtree of pmonrs in same package):
> 
> v3 (old):
> 
>   raw_spin_lock(&pkgd->lock);
>   // read monr data, read/write pmonr data
>   raw_spin_unlock(&pkgd->lock);
> 
> next version:
> 
>   read_lock(&monr_hrchy_rwlock);
>   raw_spin_lock(&pkgd->lock);
>   // read monr data, read/write pmonr data
>   raw_spin_unlock(&pkgd->lock);
>   read_unlock(&monr_hrchy_rwlock);
> 
> 
> Writer Path: Modifies monr hierarchy (add/remove monr in
> creation/destruction of events and start/stop monitoring
> of cgroups):
> 
> v3 (old):
> 
>   monr_hrchy_acquire_locks(...); // acquires all pkg->lock
>   // read/write monr data, potentially read/write pmonr data
>   monr_hrchy_release_locks(...); // release all pkg->lock
> 
> v4:
> 
>   write_lock(&monr_hrchy_rwlock);
>   // read/write monr data, potentially read/write pmonr data
>   write_unlock(&monr_hrchy_rwlock);
> 
> 
> The writer path should be taken infrequently. The reader path becomes
> slower due to the additional read_lock(&monr_hrchy_rwlock). The number
> of locks to acquire stays constant regardless of the number of packages
> in both paths, and there would be no need to ever nest pkgd->locks.
> 
> For v3, I discarded this idea because, due to the additional
> read_lock(&monr_hrchy_rwlock), the pmonr state transitions that may
> occur in a context switch would be slower than in the current version.
> But it may be ultimately better considering the scalability issues
> that you mention.

Well, it's very hard to suggest alternatives, because there simply isn't
anything of substance here. This patch just adds locks, and the next few
patches don't describe much either.

Why can't this be RCU?

Also, "monr" is a horribly 'word'.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  7:21       ` Peter Zijlstra
@ 2016-11-11  7:32         ` Ingo Molnar
  2016-11-11  9:41         ` Thomas Gleixner
  2016-11-15  4:53         ` David Carrillo-Cisneros
  2 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2016-11-11  7:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Carrillo-Cisneros, Thomas Gleixner, linux-kernel, x86,
	Ingo Molnar, Andi Kleen, Kan Liang, Vegard Nossum,
	Marcelo Tosatti, Nilay Vaish, Borislav Petkov, Vikas Shivappa,
	Ravi V Shankar, Fenghua Yu, Paul Turner, Stephane Eranian


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 10, 2016 at 06:22:17PM -0800, David Carrillo-Cisneros wrote:
> > A possible alternative to nested spinlocks is to use a single rw_lock to
> > protect changes to the monr hierarchy.
> > 
> > This works because, when manipulating pmonrs, the monr hierarchy (monr's
> > parent and children) is read but never changed (reader path).
> > Changes to the monr hierarchy do change pmonrs (writer path).
> > 
> > It would be implemented like:
> > 
> >   // system-wide lock for monr hierarchy
> >   rwlock_t monr_hrchy_rwlock = __RW_LOCK_UNLOCKED(monr_hrchy_rwlock);
> > 
> > 
> > Reader Path: Access pmonrs in same pkgd. Used in pmonr state transitions
> > and when reading cgroups (read subtree of pmonrs in same package):
> > 
> > v3 (old):
> > 
> >   raw_spin_lock(&pkgd->lock);
> >   // read monr data, read/write pmonr data
> >   raw_spin_unlock(&pkgd->lock);
> > 
> > next version:
> > 
> >   read_lock(&monr_hrchy_rwlock);
> >   raw_spin_lock(&pkgd->lock);
> >   // read monr data, read/write pmonr data
> >   raw_spin_unlock(&pkgd->lock);
> >   read_unlock(&monr_hrchy_rwlock);
> > 
> > 
> > Writer Path: Modifies monr hierarchy (add/remove monr in
> > creation/destruction of events and start/stop monitoring
> > of cgroups):
> > 
> > v3 (old):
> > 
> >   monr_hrchy_acquire_locks(...); // acquires all pkg->lock
> >   // read/write monr data, potentially read/write pmonr data
> >   monr_hrchy_release_locks(...); // release all pkg->lock
> > 
> > v4:
> > 
> >   write_lock(&monr_hrchy_rwlock);
> >   // read/write monr data, potentially read/write pmonr data
> >   write_unlock(&monr_hrchy_rwlock);
> > 
> > 
> > The writer path should be taken infrequently. The reader path becomes
> > slower due to the additional read_lock(&monr_hrchy_rwlock). The number
> > of locks to acquire stays constant regardless of the number of packages
> > in both paths, and there would be no need to ever nest pkgd->locks.
> > 
> > For v3, I discarded this idea because, due to the additional
> > read_lock(&monr_hrchy_rwlock), the pmonr state transitions that may
> > occur in a context switch would be slower than in the current version.
> > But it may be ultimately better considering the scalability issues
> > that you mention.
> 
> Well, it's very hard to suggest alternatives, because there simply isn't
> anything of substance here. This patch just adds locks, and the next few
> patches don't describe much either.
> 
> Why can't this be RCU?
> 
> Also, "monr" is a horribly 'word'.

And 'hrchy' is a hrbbl wrd as wll!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  7:21       ` Peter Zijlstra
  2016-11-11  7:32         ` Ingo Molnar
@ 2016-11-11  9:41         ` Thomas Gleixner
  2016-11-11 17:21           ` David Carrillo-Cisneros
  2016-11-15  4:53         ` David Carrillo-Cisneros
  2 siblings, 1 reply; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-11  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Carrillo-Cisneros, linux-kernel, x86, Ingo Molnar,
	Andi Kleen, Kan Liang, Vegard Nossum, Marcelo Tosatti,
	Nilay Vaish, Borislav Petkov, Vikas Shivappa, Ravi V Shankar,
	Fenghua Yu, Paul Turner, Stephane Eranian

On Fri, 11 Nov 2016, Peter Zijlstra wrote:
> Well, it's very hard to suggest alternatives, because there simply isn't
> anything of substance here. This patch just adds locks, and the next few
> patches don't describe much either.
> 
> Why can't this be RCU?

AFAICT from looking at later patches, the context switch path can do RMID
borrowing/stealing and similar things which need protection of the package
data. Other paths (setup/rotation/...) take all the package locks to
prevent concurrent RMID operations.

I still have not figured out why all of this is necessary, and unfortunately
there is no real coherent explanation of the overall design. The cover
letter is not really helpful either.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  9:41         ` Thomas Gleixner
@ 2016-11-11 17:21           ` David Carrillo-Cisneros
  2016-11-13 10:58             ` Thomas Gleixner
  0 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-11-11 17:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Fri, Nov 11, 2016 at 1:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 11 Nov 2016, Peter Zijlstra wrote:
>> Well, it's very hard to suggest alternatives, because there simply isn't
>> anything of substance here. This patch just adds locks, and the next few
>> patches don't describe much either.
>>
>> Why can't this be RCU?
>
> AFAICT from looking at later patches, the context switch path can do RMID
> borrowing/stealing and similar things which need protection of the package
> data. Other paths (setup/rotation/...) take all the package locks to
> prevent concurrent RMID operations.

That's right. When a pmonr makes a state transition during a context
switch, it needs to modify itself and potentially other pmonrs
because:
  - a pmonr in DEP_IDLE or DEP_DIRTY state will update its sched_rmid
if its lowest ancestor in the Active state changes. These pmonrs
"borrow" the rmid.
  - a pmonr entering the Active state must collect references to all
pmonrs that "borrow" an rmid from it.

All these transitions only affect data in pmonrs of a given package
and, for that, they are protected by pkgd->lock. RCU here is
complicated because many pmonrs must be changed atomically.
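
To make the atomicity point concrete, here is a rough sketch of an
Active transition (simplified; the field and function names are
assumptions, not the actual code in the series):

  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct pmonr {
          u32			sched_rmid;	/* rmid used at sched in */
          struct list_head	pmonr_deps;	/* pmonrs borrowing this rmid */
          struct list_head	dep_entry;
  };

  struct pkg_data {
          raw_spinlock_t		lock;
  };

  /*
   * All borrowers' sched_rmid must be repointed in one critical section
   * under pkgd->lock, so a reader on this package never sees a
   * half-updated set of pmonrs.
   */
  static void pmonr_to_active(struct pkg_data *pkgd,
                              struct pmonr *pmonr, u32 rmid)
  {
          struct pmonr *dep;

          lockdep_assert_held(&pkgd->lock);

          pmonr->sched_rmid = rmid;
          list_for_each_entry(dep, &pmonr->pmonr_deps, dep_entry)
                  dep->sched_rmid = rmid;
  }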

During a pmonr state transition, monrs are not changed, only read,
because they embed the hierarchical relationship that pmonrs use to
find ancestors and descendants. For this reason, v3 of the series
requires acquiring pkgd->lock in all packages prior to any change to
the monrs' hierarchical relationship.

The alternative proposes to protect pmonr changes (which read the monr
hierarchy) with read_lock(&monr_hrchy_rwlock) and pkgd->lock, and
changes to the monr hierarchy with write_lock(&monr_hrchy_rwlock).

> I still have not figured out why all of this is necessary, and unfortunately
> there is no real coherent explanation of the overall design. The cover
> letter is not really helpful either.

Note taken. I'll work on that.

>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11 17:21           ` David Carrillo-Cisneros
@ 2016-11-13 10:58             ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-13 10:58 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Peter Zijlstra, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Fri, 11 Nov 2016, David Carrillo-Cisneros wrote:
> On Fri, Nov 11, 2016 at 1:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > I still have not figured out why all of this is necessary, and unfortunately
> > there is no real coherent explanation of the overall design. The cover
> > letter is not really helpful either.
> 
> Note taken. I'll work on that.

The first thing which would be helpful is a proper explanation of what you
want to achieve and why, w/o going into design and implementation details.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-11  7:21       ` Peter Zijlstra
  2016-11-11  7:32         ` Ingo Molnar
  2016-11-11  9:41         ` Thomas Gleixner
@ 2016-11-15  4:53         ` David Carrillo-Cisneros
  2016-11-16 19:00           ` Thomas Gleixner
  2 siblings, 1 reply; 59+ messages in thread
From: David Carrillo-Cisneros @ 2016-11-15  4:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

> Also, "monr" is a horribly 'word'.

What makes it so bad? (honest question). Some alternatives:

- res_mon, resm, rmon (Resource Monitor)
- rmnode, rnode, rmon_node (Resource Monitoring node, similar to
Resource Monitor ID, but to reflect that it's a node in a
tree/hierarchy)
 - rdt_mon, rdtm (something with RDT + Monitoring)
 - ment, rdt_ment (Monitoring Entity)

Other suggestions?

Thanks,
David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks
  2016-11-15  4:53         ` David Carrillo-Cisneros
@ 2016-11-16 19:00           ` Thomas Gleixner
  0 siblings, 0 replies; 59+ messages in thread
From: Thomas Gleixner @ 2016-11-16 19:00 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: Peter Zijlstra, linux-kernel, x86, Ingo Molnar, Andi Kleen,
	Kan Liang, Vegard Nossum, Marcelo Tosatti, Nilay Vaish,
	Borislav Petkov, Vikas Shivappa, Ravi V Shankar, Fenghua Yu,
	Paul Turner, Stephane Eranian

On Mon, 14 Nov 2016, David Carrillo-Cisneros wrote:

> > Also, "monr" is a horribly 'word'.
> 
> What makes it so bad? (honest question). Some alternatives:
> 
> - res_mon, resm, rmon (Resource Monitor)
> - rmnode, rnode, rmon_node (Resource Monitoring node, similar to
> Resource Monitor ID, but to reflect that it's a node in a
> tree/hierarchy)
>  - rdt_mon, rdtm (something with RDT + Monitoring)
>  - ment, rdt_ment (Monitoring Entity)
> 
> Other suggestions?

The naming is the least of my worries right now. Before you start to rework
the series, can we please get the information about:

    - what you want to achieve and why

    - the design of your approach

so we can avoid staring at another series of 40+ patches just to figure out
that something is wrong at the conceptual level?

We sort out the naming convention once we are done with the above.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-11-16 19:03 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-30  0:37 [PATCH v3 00/46] Cache Monitoring Technology (aka CQM) David Carrillo-Cisneros
2016-10-30  0:37 ` [PATCH v3 01/46] perf/x86/intel/cqm: remove previous version of CQM and MBM David Carrillo-Cisneros
2016-10-30  0:37 ` [PATCH v3 02/46] perf/x86/intel: rename CQM cpufeatures to CMT David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 03/46] x86/intel: add CONFIG_INTEL_RDT_M configuration flag David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support David Carrillo-Cisneros
2016-11-10 15:19   ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 05/46] perf/x86/intel/cmt: add per-package locks David Carrillo-Cisneros
2016-11-10 21:23   ` Thomas Gleixner
2016-11-11  2:22     ` David Carrillo-Cisneros
2016-11-11  7:21       ` Peter Zijlstra
2016-11-11  7:32         ` Ingo Molnar
2016-11-11  9:41         ` Thomas Gleixner
2016-11-11 17:21           ` David Carrillo-Cisneros
2016-11-13 10:58             ` Thomas Gleixner
2016-11-15  4:53         ` David Carrillo-Cisneros
2016-11-16 19:00           ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 06/46] perf/x86/intel/cmt: add intel_cmt pmu David Carrillo-Cisneros
2016-11-10 21:27   ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 07/46] perf/core: add RDT Monitoring attributes to struct hw_perf_event David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 08/46] perf/x86/intel/cmt: add MONitored Resource (monr) initialization David Carrillo-Cisneros
2016-11-10 23:09   ` Thomas Gleixner
2016-10-30  0:38 ` [PATCH v3 09/46] perf/x86/intel/cmt: add basic monr hierarchy David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 10/46] perf/x86/intel/cmt: add Package MONitored Resource (pmonr) initialization David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 11/46] perf/x86/intel/cmt: add cmt_user_flags (uflags) to monr David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 12/46] perf/x86/intel/cmt: add per-package rmid pools David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 13/46] perf/x86/intel/cmt: add pmonr's Off and Unused states David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 14/46] perf/x86/intel/cmt: add Active and Dep_{Idle, Dirty} states David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 15/46] perf/x86/intel: encapsulate rmid and closid updates in pqr cache David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 16/46] perf/x86/intel/cmt: set sched rmid and complete pmu start/stop/add/del David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 17/46] perf/x86/intel/cmt: add uflag CMT_UF_NOLAZY_RMID David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 18/46] perf/core: add arch_info field to struct perf_cgroup David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 19/46] perf/x86/intel/cmt: add support for cgroup events David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 20/46] perf/core: add pmu::event_terminate David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 21/46] perf/x86/intel/cmt: use newly introduced event_terminate David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 22/46] perf/x86/intel/cmt: sync cgroups and intel_cmt device start/stop David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 23/46] perf/core: hooks to add architecture specific features in perf_cgroup David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 24/46] perf/x86/intel/cmt: add perf_cgroup_arch_css_{online,offline} David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 25/46] perf/x86/intel/cmt: add monr->flags and CMT_MONR_ZOMBIE David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 26/46] sched: introduce the finish_arch_pre_lock_switch() scheduler hook David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 27/46] perf/x86/intel: add pqr cache flags and intel_pqr_ctx_switch David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 28/46] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 29/46] perf/x86/intel/cmt: add error handling to intel_cmt_event_read David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 30/46] perf/x86/intel/cmt: add asynchronous read for task events David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 31/46] perf/x86/intel/cmt: add subtree read for cgroup events David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 32/46] perf/core: Add PERF_EV_CAP_READ_ANY_{CPU_,}PKG flags David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 33/46] perf/x86/intel/cmt: use PERF_EV_CAP_READ_{,CPU_}PKG flags in Intel cmt David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 34/46] perf/core: introduce PERF_EV_CAP_CGROUP_NO_RECURSION David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 35/46] perf/x86/intel/cmt: use PERF_EV_CAP_CGROUP_NO_RECURSION in intel_cmt David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 36/46] perf/core: add perf_event cgroup hooks for subsystem attributes David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 37/46] perf/x86/intel/cmt: add cont_monitoring to perf cgroup David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 38/46] perf/x86/intel/cmt: introduce read SLOs for rotation David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 39/46] perf/x86/intel/cmt: add max_recycle_threshold sysfs attribute David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 40/46] perf/x86/intel/cmt: add rotation scheduled work David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 41/46] perf/x86/intel/cmt: add rotation minimum progress SLO David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 42/46] perf/x86/intel/cmt: add rmid stealing David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 43/46] perf/x86/intel/cmt: add CMT_UF_NOSTEAL_RMID flag David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 44/46] perf/x86/intel/cmt: add debugfs intel_cmt directory David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 45/46] perf/stat: fix bug in handling events in error state David Carrillo-Cisneros
2016-10-30  0:38 ` [PATCH v3 46/46] perf/stat: revamp read error handling, snapshot and per_pkg events David Carrillo-Cisneros
