From: Vikas Shivappa
To: vikas.shivappa@intel.com, vikas.shivappa@linux.intel.com
Cc: davidcc@google.com, eranian@google.com, linux-kernel@vger.kernel.org,
    x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org,
    peterz@infradead.org, ravi.v.shankar@intel.com, tony.luck@intel.com,
    fenghua.yu@intel.com, andi.kleen@intel.com, h.peter.anvin@intel.com
Subject: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Date: Fri, 6 Jan 2017 13:59:53 -0800
Message-Id: <1483740005-23499-1-git-send-email-vikas.shivappa@linux.intel.com>

Resending version 5 with updated send list. Sorry for the spam.

Cqm (cache quality monitoring) is part of Intel RDT (resource director
technology), which enables monitoring and controlling of processor shared
resources via an MSR interface.

The current upstream cqm (cache monitoring) has major issues which make
the feature almost unusable. This series tries to fix them, and also
addresses Thomas's comments on previous versions of the cqm2 patch series
by better documenting and organizing what we are trying to fix.

Changes in V5
- Based on Peterz's feedback, removed the file interface in the perf_event
  cgroup to start and stop continuous monitoring.
- Based on Andi's feedback and references, David has sent a patch
  optimizing the perf overhead as a separate patch which is generic and
  not cqm specific.

This is a continuation of the patch series David (davidcc@google.com)
previously posted; hence it is based on his patches and tries to fix the
same issues.

Patches apply on 4.10-rc2.

Below are the issues and the fixes we attempt:

- Issue(1): Inaccurate data for per package data and systemwide
  monitoring. It just prints zeros or arbitrary numbers.

  Fix: Patches fix this by just throwing an error if the mode is not
  supported. The modes supported are task monitoring and cgroup
  monitoring. Also, the per package data for, say, socket x is returned
  with the -C -G cgrpy option. The systemwide data can be looked up by
  monitoring the root cgroup.

- Issue(2): RMIDs are global and do not scale with more packages, and
  hence we also run out of RMIDs very soon.

  Fix: Support per pkg RMIDs, which scale better with more packages,
  give us more RMIDs to use, and are used only when needed (i.e. when
  tasks are actually scheduled on the package).

- Issue(3): Cgroup monitoring is not complete. There is no hierarchical
  monitoring support, and inconsistent or wrong data is seen when
  monitoring a cgroup.

  Fix: Patch adds full cgroup monitoring support. Different cgroups in
  the same hierarchy can be monitored together and separately. A task
  and the cgroup to which the task belongs can also be monitored
  together.

- Issue(4): A lot of inconsistent data is seen currently when we monitor
  different kinds of events, like cgroup and task events, *together*.

  Fix: Patch adds support to monitor a cgroup x and a task p1 within
  cgroup x, and also to monitor different cgroups and tasks together.

- Issue(5): CAT and cqm/mbm write the same PQR_ASSOC_MSR separately.

  Fix: Integrate the sched in code and write the PQR_MSR only once every
  switch_to.

- Issue(6): RMID recycling leads to inaccurate data, complicates the
  code and increases the code footprint. Currently, it almost makes the
  feature *unusable*, as we only see zeroes and inconsistent data once
  we run out of RMIDs in the lifetime of a system boot.
  The only way to get right numbers is to reboot the system once we run
  out of RMIDs.

  Root cause: Recycling steals an RMID from an existing event x and gives
  it to another event y. However, due to the nature of monitoring
  llc_occupancy, we may miss tracking an unknown (possibly large) part of
  the cache fills during the time the event does not have an RMID. Hence
  the user ends up with inaccurate data for both events x and y, and the
  inaccuracy is arbitrary and cannot be measured. Even if event x gets
  another RMID very soon after losing the previous one, we still miss all
  the occupancy data that was tied to the previous RMID, which means we
  cannot get accurate data even when the event has an RMID for most of
  the time. There is no way to guarantee accurate results with recycling,
  and the data is inaccurate by an arbitrary degree. The fact that an
  event can lose an RMID at any time complicates a lot of code in
  sched_in, init, count and read. It also complicates mbm, as we may lose
  the RMID at any time and hence need to keep a history of all the old
  counts.

  Fix: Recycling is removed, based originally on Tony's idea that it
  introduces a lot of code and fails to provide accurate data, and hence
  has questionable benefits. In spite of several attempts to improve the
  recycling, there is no way to guarantee accurate data as explained
  above, and the incorrectness is of arbitrary degree (we cannot say, for
  example, that the data is off by x%). As a fix we introduce per-pkg
  RMIDs to mitigate the scarcity of RMIDs to a large extent. This works
  because RMIDs are plentiful: about 2 to 4 per logical processor/SMT
  thread on each package. So on a 2 socket BDW system with, say, 44
  logical processors/SMT threads per package (at 4 RMIDs per thread), we
  have 176 RMIDs on each package (a total of 2x176 = 352 RMIDs). Also,
  cgroup is fully supported, and hence many threads, like all threads in
  one VM/container, can be grouped to use just one RMID. The RMIDs scale
  with the number of sockets.
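
As a rough check of the RMID numbers above, the arithmetic can be written
out as below. This is only an illustrative sketch: the function name and
its parameters are made up here, and the figure of 4 RMIDs per thread is
the upper end of the "2 to 4" range, chosen because it reproduces the
176 quoted for the BDW example.

```python
def rmids_available(packages, threads_per_package, rmids_per_thread):
    """Return (RMIDs per package, total RMIDs with per-package pools).

    Hypothetical helper for illustration; not part of the patch series.
    """
    per_pkg = threads_per_package * rmids_per_thread
    return per_pkg, packages * per_pkg


# 2 socket system, 44 SMT threads per package, 4 RMIDs per thread.
per_pkg, total = rmids_available(packages=2, threads_per_package=44,
                                 rmids_per_thread=4)
print(per_pkg, total)  # 176 352
```

With a single global pool the usable count would stay at the per-package
figure, so the per-pkg scheme multiplies the available RMIDs by the number
of packages.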
  If we still run out of RMIDs, perf read throws an error because we are
  not able to monitor once this limited h/w resource is exhausted. This
  may be better than recycling (even a better version than the one
  upstream), where the user thinks events are being monitored but they
  are actually not monitored for an arbitrary amount of time, resulting
  in inaccurate data of arbitrary degree. Inaccurate data defeats the
  purpose of RDT, whose goal is to provide consistent system behaviour by
  giving the ability to monitor and control processor resources in an
  accurate and reliable fashion. The fix instead helps provide accurate
  data and to a large extent mitigates the RMID scarcity.

What's working now (unit tested):
Task monitoring, cgroup hierarchical monitoring, monitoring multiple
cgroups, cgroup and task in the same cgroup, per pkg rmids, error on read.

TBD:
- Most of MBM is working but will need updates to hierarchical
  monitoring and other new feature related changes we introduce.

Below is a list of patches and what each patch fixes. Each commit message
also gives details on what the patch actually fixes among the bunch:

[PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling

Before the patch: Users see only zeros or wrong data once we run out of
RMIDs.
After: Users see either correct data or an error that we ran out of
RMIDs.

[PATCH 03/12] x86/rdt: Add rdt common/cqm compile option
[PATCH 04/12] x86/cqm: Add Per pkg rmid support

Before patch: RMIDs are global.
Tests: Available RMIDs increase by x times, where x is the number of
packages.
Adds LAZY RMID alloc - RMIDs are allocated during first sched in.

[PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
[PATCH 06/12] x86/cqm: Add cgroup hierarchical monitoring support
[PATCH 07/12] x86/rdt,cqm: Scheduling support update

Before patch: cgroup monitoring not supported fully.
After: cgroup monitoring is fully supported including hierarchical
monitoring.

[PATCH 08/12] x86/cqm: Add support for monitoring task and cgroup

Before patch: cgroup and task could not be monitored together, and doing
so would result in a lot of inconsistent data.
After: Can monitor task and cgroup together, and also supports monitoring
a task within a cgroup and the cgroup together.

[PATCH 09/12] x86/cqm: Add RMID reuse

Before patch: Once an RMID is used, it is never used again.
After: We reuse the RMIDs which are freed. User can specify NOLAZY RMID
allocation, and open fails if we fail to get all RMIDs at open.

[PATCH 10/12] perf/core,x86/cqm: Add read for Cgroup events,per pkg
[PATCH 11/12] perf/stat: fix bug in handling events in error state
[PATCH 12/12] perf/stat: revamp read error handling, snapshot and

Patches 1/12 - 9/12 add all the features, but the data is visible neither
to perf/core nor to the perf user mode. Patches 11-12 fix this and make
the data available to the perf user mode.
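
For readers who want a picture of the allocation behaviour described for
patches 04/12 and 09/12 (per-pkg pools, LAZY allocation at first sched in,
NOLAZY failing at open, reuse of freed RMIDs, error on read), here is a
small user-space model. It is a sketch only and not the kernel code: every
class and method name below is hypothetical, and keeping an event in the
error state once an allocation has failed is an assumption of this model.

```python
class RmidExhausted(Exception):
    """Raised when a package has no free RMID left (hypothetical name)."""


class PackagePool:
    """Free RMIDs of one package; released RMIDs become reusable."""

    def __init__(self, nr_rmids):
        self.free = list(range(nr_rmids))

    def alloc(self):
        if not self.free:
            raise RmidExhausted("out of RMIDs on this package")
        return self.free.pop()

    def release(self, rmid):
        self.free.append(rmid)


class Event:
    """One monitored task or cgroup, holding at most one RMID per package."""

    def __init__(self, pools, lazy=True):
        self.pools = pools
        self.rmid = {}       # package id -> RMID held on that package
        self.failed = set()  # packages where allocation failed
        if not lazy:
            # NOLAZY: take an RMID on every package now, else fail the open.
            try:
                for pkg in range(len(pools)):
                    self.rmid[pkg] = pools[pkg].alloc()
            except RmidExhausted:
                self.close()  # give back what was already taken
                raise

    def sched_in(self, pkg):
        # LAZY: allocate only when first scheduled on this package.
        if pkg in self.rmid or pkg in self.failed:
            return
        try:
            self.rmid[pkg] = self.pools[pkg].alloc()
        except RmidExhausted:
            self.failed.add(pkg)  # assumption: event stays in error state

    def read(self):
        # Report an error rather than return data with unmonitored gaps.
        if self.failed:
            raise RmidExhausted("unmonitored on packages %s"
                                % sorted(self.failed))
        return dict(self.rmid)

    def close(self):
        for pkg, rmid in self.rmid.items():
            self.pools[pkg].release(rmid)
        self.rmid.clear()
```

For example, with `pools = [PackagePool(1), PackagePool(1)]`, a first
event scheduled on package 0 reads fine, a second event scheduled there
gets an error on read, and once the first event is closed its RMID can be
handed out again.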