* [PATCH 00/21] Support Architectural LBR
@ 2020-06-19 14:03 kan.liang
  2020-06-19 14:03 ` [PATCH 01/21] x86/cpufeatures: Add Architectural LBRs feature bit kan.liang
                   ` (20 more replies)
  0 siblings, 21 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

LBR (Last Branch Records) enables recording of software path history
by logging taken branches and other control flows within architectural
registers. Intel CPUs have had model-specific LBRs for quite some time,
but Architectural LBR now evolves them into an architectural feature.

The main advantages for the users are:
- Faster context switching due to XSAVES support and faster reset of
  LBR MSRs via the new DEPTH MSR
- Faster LBR read for a non-PEBS event due to XSAVES support, which
  lowers the overhead of the NMI handler. (For a PEBS event, the LBR
  information is recorded in the PEBS records. There is no impact on
  the PEBS event.)
- The Linux kernel can support the LBR features without knowing the
  model number of the current CPU.
- Clean exposure of LBRs to guests without relying on model-specific
  features. (An improvement for KVM. Not included in this patch series.)
- Support for running with fewer LBRs than the full 32, to lower
  overhead (currently not exposed, however)

The key improvements for the perf kernel in this patch series include:
- No model check is required. The capabilities of Architectural LBR
  can be enumerated by CPUID.
- Each LBR record or entry still consists of three MSRs,
  IA32_LBR_x_FROM_IP, IA32_LBR_x_TO_IP and IA32_LBR_x_INFO, but they
  become architectural MSRs.
- Architectural LBR is stack-like now. Entry 0 is always the youngest
  branch, entry 1 the next youngest... The TOS MSR has been removed.
- A new IA32_LBR_CTL MSR is introduced to enable and configure LBRs,
  which replaces the IA32_DEBUGCTL[bit 0] and the LBR_SELECT MSR.
- The possible LBR depth can be retrieved from CPUID enumeration. The
  max value is written to the new MSR_ARCH_LBR_DEPTH as the number of
  LBR entries.
- Faster reset of the LBR MSRs via the new DEPTH MSR, which avoids
  touching potentially close to a hundred individual MSRs (see the
  sketch after this list).
- XSAVES and XRSTORS are used to read, save and restore the
  LBR-related MSRs.
- Faster, direct reporting of the branch type by the LBR hardware,
  without needing to access and decode the profiled code.
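
As a rough illustration of the DEPTH-based reset mentioned above (a
sketch only; the actual reset code appears later in the series, and
the function name here is hypothetical), a single MSR write replaces
the per-entry zeroing:

  /*
   * Sketch: writing the current depth back to MSR_ARCH_LBR_DEPTH
   * zeroes every LBR_FROM/LBR_TO/LBR_INFO MSR in one shot, instead
   * of writing up to 32 * 3 individual MSRs.
   */
  static void arch_lbr_reset_sketch(void)
  {
  	wrmsrl(MSR_ARCH_LBR_DEPTH, x86_pmu.lbr_nr);
  }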

The existing LBR capabilities, such as CPL filtering, branch filtering,
call stack mode, mispredict information, cycles information and branch
type information, are all retained for Architectural LBR.

XSAVES and XRSTORS improvements:

In call stack mode, perf uses the LBR information to reconstruct a
call stack. To get a complete call stack, perf has to save and restore
all LBR registers during a context switch, and the number of LBR
registers is large. To reduce the overhead, an LBR state component is
introduced with Architectural LBR, and the perf subsystem uses
XSAVES/XRSTORS to save/restore the LBRs during a context switch.

LBR call stack mode is not always enabled, so the perf subsystem only
needs to save/restore the LBR state on demand. To avoid unnecessary
save/restore of the LBR state at every context switch, a software
concept, the dynamic supervisor state, is introduced, which
- does not allocate a buffer in each task->fpu;
- does not save/restore a state component at each context switch;
- sets the bit corresponding to a dynamic supervisor feature in
  IA32_XSS at boot time, and avoids setting it at run time;
- dynamically allocates a buffer for a state component on demand, e.g.
  only allocates an LBR-specific XSAVE buffer when LBR is enabled in
  perf. (Note: the buffer has to include the legacy region, the XSAVE
  header and the LBR state component.)
- saves/restores a state component on demand, e.g. manually invokes
  the XSAVES/XRSTORS instructions to save/restore the LBR state
  to/from the buffer when perf is active and a call stack is required
  (see the sketch below).
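
A rough sketch of the on-demand save/restore (illustrative only; the
buffer layout, the LBR xfeature mask and the helper names below are
assumptions about what the later FPU/XSAVE patches in this series add,
and may differ from the actual code):

  /*
   * Sketch: save/restore the LBR state component to/from a
   * dynamically allocated XSAVE buffer that contains the legacy
   * region, the XSAVE header and the LBR state component.
   */
  static void lbr_state_save_sketch(struct xregs_state *xsave)
  {
  	/* XSAVES: dump the LBR MSRs into the buffer in one go */
  	copy_dynamic_supervisor_to_kernel(xsave, XFEATURE_MASK_LBR);
  }

  static void lbr_state_restore_sketch(struct xregs_state *xsave)
  {
  	/* XRSTORS: reload the LBR MSRs from the buffer */
  	copy_kernel_to_dynamic_supervisor(xsave, XFEATURE_MASK_LBR);
  }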

The specification of Architectural LBR can be found in the latest Intel
Architecture Instruction Set Extensions and Future Features Programming
Reference, 319433-038.

Kan Liang (21):
  x86/cpufeatures: Add Architectural LBRs feature bit
  perf/x86/intel/lbr: Add pointers for LBR enable and disable
  perf/x86/intel/lbr: Add pointer for LBR reset
  perf/x86/intel/lbr: Add pointer for LBR read
  perf/x86/intel/lbr: Add pointers for LBR save and restore
  perf/x86/intel/lbr: Factor out a new struct for generic optimization
  perf/x86/intel/lbr: Use dynamic data structure for task_ctx
  x86/msr-index: Add bunch of MSRs for Arch LBR
  perf/x86: Expose CPUID enumeration bits for arch LBR
  perf/x86/intel: Check Arch LBR MSRs
  perf/x86/intel/lbr: Support LBR_CTL
  perf/x86/intel/lbr: Support Architectural LBR
  perf/core: Factor out functions to allocate/free the task_ctx_data
  perf/core: Use kmem_cache to allocate the PMU specific data
  perf/x86/intel/lbr: Create kmem_cache for the LBR context data
  perf/x86: Remove task_ctx_size
  x86/fpu: Use proper mask to replace full instruction mask
  x86/fpu/xstate: Support dynamic supervisor feature for LBR
  x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature
  perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch
  perf/x86/intel/lbr: Support XSAVES for arch LBR read

 arch/x86/events/core.c              |   2 +-
 arch/x86/events/intel/core.c        |  50 ++-
 arch/x86/events/intel/lbr.c         | 633 +++++++++++++++++++++++++++++++-----
 arch/x86/events/perf_event.h        | 159 ++++++++-
 arch/x86/include/asm/cpufeatures.h  |   1 +
 arch/x86/include/asm/fpu/internal.h |   9 +-
 arch/x86/include/asm/fpu/types.h    |  26 ++
 arch/x86/include/asm/fpu/xstate.h   |  36 ++
 arch/x86/include/asm/msr-index.h    |  20 ++
 arch/x86/kernel/fpu/xstate.c        |  90 ++++-
 include/linux/perf_event.h          |   5 +-
 kernel/events/core.c                |  25 +-
 12 files changed, 951 insertions(+), 105 deletions(-)

-- 
2.7.4



* [PATCH 01/21] x86/cpufeatures: Add Architectural LBRs feature bit
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 02/21] perf/x86/intel/lbr: Add pointers for LBR enable and disable kan.liang
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

CPUID.(EAX=07H, ECX=0):EDX[19] indicates whether an Intel CPU supports
Architectural LBRs.

The Architectural Last Branch Records (LBR) feature enables recording
of software path history by logging taken branches and other control
flows. The feature will be supported in the perf_events subsystem.
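
With the feature bit in place, code elsewhere can gate on it in the
usual way, e.g. (illustrative sketch, not part of this patch):

  /* Sketch: only take the Arch LBR path when the CPU enumerates it */
  if (boot_cpu_has(X86_FEATURE_ARCH_LBR))
  	pr_info("Architectural LBR is supported\n");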

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index db18994..d5ce18a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -365,6 +365,7 @@
 #define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
 #define X86_FEATURE_TSX_FORCE_ABORT	(18*32+13) /* "" TSX_FORCE_ABORT */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
+#define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
-- 
2.7.4



* [PATCH 02/21] perf/x86/intel/lbr: Add pointers for LBR enable and disable
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
  2020-06-19 14:03 ` [PATCH 01/21] x86/cpufeatures: Add Architectural LBRs feature bit kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 03/21] perf/x86/intel/lbr: Add pointer for LBR reset kan.liang
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The method to enable and disable Architectural LBR is different from
that of the previous model-specific LBR, so perf has to implement
different functions.

The function pointers for LBR enable and disable are introduced. Perf
should initialize the corresponding functions at boot time.

The current model-specific LBR functions are set as default.

__intel_pmu_lbr_enable() and __intel_pmu_lbr_disable() are no longer
static functions, and the "__" prefix is removed from their names.
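
For illustration only (the Architectural LBR callbacks come later in
the series; the function names below are hypothetical), boot code can
then override the model-specific defaults so the hot paths simply call
x86_pmu.lbr_enable()/lbr_disable() without a feature check:

  /* Sketch: switch the callbacks once, e.g. from intel_pmu_init() */
  if (boot_cpu_has(X86_FEATURE_ARCH_LBR)) {
  	x86_pmu.lbr_enable  = intel_pmu_arch_lbr_enable;	/* hypothetical */
  	x86_pmu.lbr_disable = intel_pmu_arch_lbr_disable;	/* hypothetical */
  }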

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 6 ++++++
 arch/x86/events/intel/lbr.c  | 8 ++++----
 arch/x86/events/perf_event.h | 7 +++++++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 332954c..56966fc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3951,6 +3951,9 @@ static __initconst const struct x86_pmu core_pmu = {
 	.cpu_dead		= intel_pmu_cpu_dead,
 
 	.check_period		= intel_pmu_check_period,
+
+	.lbr_enable		= intel_pmu_lbr_enable,
+	.lbr_disable		= intel_pmu_lbr_disable,
 };
 
 static __initconst const struct x86_pmu intel_pmu = {
@@ -3996,6 +3999,9 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.check_period		= intel_pmu_check_period,
 
 	.aux_output_match	= intel_pmu_aux_output_match,
+
+	.lbr_enable		= intel_pmu_lbr_enable,
+	.lbr_disable		= intel_pmu_lbr_disable,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 65113b1..bdd38b6 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -150,7 +150,7 @@ static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
  * otherwise it becomes near impossible to get a reliable stack.
  */
 
-static void __intel_pmu_lbr_enable(bool pmi)
+void intel_pmu_lbr_enable(bool pmi)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	u64 debugctl, lbr_select = 0, orig_debugctl;
@@ -185,7 +185,7 @@ static void __intel_pmu_lbr_enable(bool pmi)
 		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
-static void __intel_pmu_lbr_disable(void)
+void intel_pmu_lbr_disable(void)
 {
 	u64 debugctl;
 
@@ -545,7 +545,7 @@ void intel_pmu_lbr_enable_all(bool pmi)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	if (cpuc->lbr_users)
-		__intel_pmu_lbr_enable(pmi);
+		x86_pmu.lbr_enable(pmi);
 }
 
 void intel_pmu_lbr_disable_all(void)
@@ -553,7 +553,7 @@ void intel_pmu_lbr_disable_all(void)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	if (cpuc->lbr_users)
-		__intel_pmu_lbr_disable();
+		x86_pmu.lbr_disable();
 }
 
 static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f1cd1ca..a61a076 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -679,6 +679,9 @@ struct x86_pmu {
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
 	bool		lbr_pt_coexist;		   /* (LBR|BTS) may coexist with PT */
 
+	void		(*lbr_enable)(bool pmi);
+	void		(*lbr_disable)(void);
+
 	/*
 	 * Intel PT/LBR/BTS are exclusive
 	 */
@@ -1059,8 +1062,12 @@ void intel_pmu_lbr_del(struct perf_event *event);
 
 void intel_pmu_lbr_enable_all(bool pmi);
 
+void intel_pmu_lbr_enable(bool pmi);
+
 void intel_pmu_lbr_disable_all(void);
 
+void intel_pmu_lbr_disable(void);
+
 void intel_pmu_lbr_read(void);
 
 void intel_pmu_lbr_init_core(void);
-- 
2.7.4



* [PATCH 03/21] perf/x86/intel/lbr: Add pointer for LBR reset
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
  2020-06-19 14:03 ` [PATCH 01/21] x86/cpufeatures: Add Architectural LBRs feature bit kan.liang
  2020-06-19 14:03 ` [PATCH 02/21] perf/x86/intel/lbr: Add pointers for LBR enable and disable kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 04/21] perf/x86/intel/lbr: Add pointer for LBR read kan.liang
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The method to reset Architectural LBRs is different from that of the
previous model-specific LBR, so perf has to implement a different
function.

A function pointer is introduced for LBR reset, and the LBR_FORMAT_*
enum is moved to perf_event.h. Perf should initialize the
corresponding function at boot time and avoid checking lbr_format at
run time.

The current 64-bit LBR reset function is set as default.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c |  5 +++++
 arch/x86/events/intel/lbr.c  | 20 +++-----------------
 arch/x86/events/perf_event.h | 16 ++++++++++++++++
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 56966fc..995acdb 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3954,6 +3954,7 @@ static __initconst const struct x86_pmu core_pmu = {
 
 	.lbr_enable		= intel_pmu_lbr_enable,
 	.lbr_disable		= intel_pmu_lbr_disable,
+	.lbr_reset		= intel_pmu_lbr_reset_64,
 };
 
 static __initconst const struct x86_pmu intel_pmu = {
@@ -4002,6 +4003,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 
 	.lbr_enable		= intel_pmu_lbr_enable,
 	.lbr_disable		= intel_pmu_lbr_disable,
+	.lbr_reset		= intel_pmu_lbr_reset_64,
 };
 
 static __init void intel_clovertown_quirk(void)
@@ -4628,6 +4630,9 @@ __init int intel_pmu_init(void)
 		x86_pmu.intel_cap.capabilities = capabilities;
 	}
 
+	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
+		x86_pmu.lbr_reset = intel_pmu_lbr_reset_32;
+
 	intel_ds_init();
 
 	x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs last */
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index bdd38b6..ff320d1 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -8,17 +8,6 @@
 
 #include "../perf_event.h"
 
-enum {
-	LBR_FORMAT_32		= 0x00,
-	LBR_FORMAT_LIP		= 0x01,
-	LBR_FORMAT_EIP		= 0x02,
-	LBR_FORMAT_EIP_FLAGS	= 0x03,
-	LBR_FORMAT_EIP_FLAGS2	= 0x04,
-	LBR_FORMAT_INFO		= 0x05,
-	LBR_FORMAT_TIME		= 0x06,
-	LBR_FORMAT_MAX_KNOWN    = LBR_FORMAT_TIME,
-};
-
 static const enum {
 	LBR_EIP_FLAGS		= 1,
 	LBR_TSX			= 2,
@@ -194,7 +183,7 @@ void intel_pmu_lbr_disable(void)
 	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
-static void intel_pmu_lbr_reset_32(void)
+void intel_pmu_lbr_reset_32(void)
 {
 	int i;
 
@@ -202,7 +191,7 @@ static void intel_pmu_lbr_reset_32(void)
 		wrmsrl(x86_pmu.lbr_from + i, 0);
 }
 
-static void intel_pmu_lbr_reset_64(void)
+void intel_pmu_lbr_reset_64(void)
 {
 	int i;
 
@@ -221,10 +210,7 @@ void intel_pmu_lbr_reset(void)
 	if (!x86_pmu.lbr_nr)
 		return;
 
-	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
-		intel_pmu_lbr_reset_32();
-	else
-		intel_pmu_lbr_reset_64();
+	x86_pmu.lbr_reset();
 
 	cpuc->last_task_ctx = NULL;
 	cpuc->last_log_id = 0;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a61a076..abf95ef 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -179,6 +179,17 @@ struct x86_perf_task_context;
 #define MAX_LBR_ENTRIES		32
 
 enum {
+	LBR_FORMAT_32		= 0x00,
+	LBR_FORMAT_LIP		= 0x01,
+	LBR_FORMAT_EIP		= 0x02,
+	LBR_FORMAT_EIP_FLAGS	= 0x03,
+	LBR_FORMAT_EIP_FLAGS2	= 0x04,
+	LBR_FORMAT_INFO		= 0x05,
+	LBR_FORMAT_TIME		= 0x06,
+	LBR_FORMAT_MAX_KNOWN    = LBR_FORMAT_TIME,
+};
+
+enum {
 	X86_PERF_KFREE_SHARED = 0,
 	X86_PERF_KFREE_EXCL   = 1,
 	X86_PERF_KFREE_MAX
@@ -681,6 +692,7 @@ struct x86_pmu {
 
 	void		(*lbr_enable)(bool pmi);
 	void		(*lbr_disable)(void);
+	void		(*lbr_reset)(void);
 
 	/*
 	 * Intel PT/LBR/BTS are exclusive
@@ -1056,6 +1068,10 @@ u64 lbr_from_signext_quirk_wr(u64 val);
 
 void intel_pmu_lbr_reset(void);
 
+void intel_pmu_lbr_reset_32(void);
+
+void intel_pmu_lbr_reset_64(void);
+
 void intel_pmu_lbr_add(struct perf_event *event);
 
 void intel_pmu_lbr_del(struct perf_event *event);
-- 
2.7.4



* [PATCH 04/21] perf/x86/intel/lbr: Add pointer for LBR read
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (2 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 03/21] perf/x86/intel/lbr: Add pointer for LBR reset kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 05/21] perf/x86/intel/lbr: Add pointers for LBR save and restore kan.liang
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The method to read Architectural LBRs is different from that of the
previous model-specific LBR, so perf has to implement a different
function.
A function pointer for LBR read is introduced. Perf should initialize
the corresponding function at boot time, and avoid checking lbr_format
at run time.

The current 64-bit LBR read function is set as default.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 6 +++++-
 arch/x86/events/intel/lbr.c  | 9 +++------
 arch/x86/events/perf_event.h | 5 +++++
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 995acdb..03b17d5 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3955,6 +3955,7 @@ static __initconst const struct x86_pmu core_pmu = {
 	.lbr_enable		= intel_pmu_lbr_enable,
 	.lbr_disable		= intel_pmu_lbr_disable,
 	.lbr_reset		= intel_pmu_lbr_reset_64,
+	.lbr_read		= intel_pmu_lbr_read_64,
 };
 
 static __initconst const struct x86_pmu intel_pmu = {
@@ -4004,6 +4005,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.lbr_enable		= intel_pmu_lbr_enable,
 	.lbr_disable		= intel_pmu_lbr_disable,
 	.lbr_reset		= intel_pmu_lbr_reset_64,
+	.lbr_read		= intel_pmu_lbr_read_64,
 };
 
 static __init void intel_clovertown_quirk(void)
@@ -4630,8 +4632,10 @@ __init int intel_pmu_init(void)
 		x86_pmu.intel_cap.capabilities = capabilities;
 	}
 
-	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
+	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) {
 		x86_pmu.lbr_reset = intel_pmu_lbr_reset_32;
+		x86_pmu.lbr_read = intel_pmu_lbr_read_32;
+	}
 
 	intel_ds_init();
 
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index ff320d1..d762c76 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -542,7 +542,7 @@ void intel_pmu_lbr_disable_all(void)
 		x86_pmu.lbr_disable();
 }
 
-static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
+void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
 	u64 tos = intel_pmu_lbr_tos();
@@ -579,7 +579,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
  * is the same as the linear address, allowing us to merge the LIP and EIP
  * LBR formats.
  */
-static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
+void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 {
 	bool need_info = false, call_stack = false;
 	unsigned long mask = x86_pmu.lbr_nr - 1;
@@ -683,10 +683,7 @@ void intel_pmu_lbr_read(void)
 	if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
 		return;
 
-	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
-		intel_pmu_lbr_read_32(cpuc);
-	else
-		intel_pmu_lbr_read_64(cpuc);
+	x86_pmu.lbr_read(cpuc);
 
 	intel_pmu_lbr_filter(cpuc);
 }
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index abf95ef..e2e086c0 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -693,6 +693,7 @@ struct x86_pmu {
 	void		(*lbr_enable)(bool pmi);
 	void		(*lbr_disable)(void);
 	void		(*lbr_reset)(void);
+	void		(*lbr_read)(struct cpu_hw_events *cpuc);
 
 	/*
 	 * Intel PT/LBR/BTS are exclusive
@@ -1086,6 +1087,10 @@ void intel_pmu_lbr_disable(void);
 
 void intel_pmu_lbr_read(void);
 
+void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc);
+
+void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc);
+
 void intel_pmu_lbr_init_core(void);
 
 void intel_pmu_lbr_init_nhm(void);
-- 
2.7.4



* [PATCH 05/21] perf/x86/intel/lbr: Add pointers for LBR save and restore
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (3 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 04/21] perf/x86/intel/lbr: Add pointer for LBR read kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 06/21] perf/x86/intel/lbr: Factor out a new struct for generic optimization kan.liang
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The MSRs of Architectural LBR are different from those of the previous
model-specific LBR, so perf has to implement different functions to
save and restore them.

The function pointers for LBR save and restore are introduced. Perf
should initialize the corresponding functions at boot time.

The generic optimizations, e.g. avoiding restoring the LBRs if no one
else touched them, still apply to Architectural LBR. The related code
is not moved into the model-specific functions.

The current model-specific LBR functions are set as the default.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c |  4 +++
 arch/x86/events/intel/lbr.c  | 71 +++++++++++++++++++++++++++-----------------
 arch/x86/events/perf_event.h |  6 ++++
 3 files changed, 54 insertions(+), 27 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 03b17d5..b236cff 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3956,6 +3956,8 @@ static __initconst const struct x86_pmu core_pmu = {
 	.lbr_disable		= intel_pmu_lbr_disable,
 	.lbr_reset		= intel_pmu_lbr_reset_64,
 	.lbr_read		= intel_pmu_lbr_read_64,
+	.lbr_save		= intel_pmu_lbr_save,
+	.lbr_restore		= intel_pmu_lbr_restore,
 };
 
 static __initconst const struct x86_pmu intel_pmu = {
@@ -4006,6 +4008,8 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.lbr_disable		= intel_pmu_lbr_disable,
 	.lbr_reset		= intel_pmu_lbr_reset_64,
 	.lbr_read		= intel_pmu_lbr_read_64,
+	.lbr_save		= intel_pmu_lbr_save,
+	.lbr_restore		= intel_pmu_lbr_restore,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index d762c76..18f9990 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -323,11 +323,37 @@ static inline u64 rdlbr_to(unsigned int idx)
 	return val;
 }
 
-static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+void intel_pmu_lbr_restore(void *ctx)
 {
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct x86_perf_task_context *task_ctx = ctx;
 	int i;
 	unsigned lbr_idx, mask;
+	u64 tos = task_ctx->tos;
+
+	mask = x86_pmu.lbr_nr - 1;
+	for (i = 0; i < task_ctx->valid_lbrs; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrlbr_from(lbr_idx, task_ctx->lbr_from[i]);
+		wrlbr_to  (lbr_idx, task_ctx->lbr_to[i]);
+
+		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
+	}
+
+	for (; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrlbr_from(lbr_idx, 0);
+		wrlbr_to(lbr_idx, 0);
+		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, 0);
+	}
+
+	wrmsrl(x86_pmu.lbr_tos, tos);
+}
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	u64 tos;
 
 	if (task_ctx->lbr_callstack_users == 0 ||
@@ -349,40 +375,18 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 		return;
 	}
 
-	mask = x86_pmu.lbr_nr - 1;
-	for (i = 0; i < task_ctx->valid_lbrs; i++) {
-		lbr_idx = (tos - i) & mask;
-		wrlbr_from(lbr_idx, task_ctx->lbr_from[i]);
-		wrlbr_to  (lbr_idx, task_ctx->lbr_to[i]);
-
-		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
-			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
-	}
-
-	for (; i < x86_pmu.lbr_nr; i++) {
-		lbr_idx = (tos - i) & mask;
-		wrlbr_from(lbr_idx, 0);
-		wrlbr_to(lbr_idx, 0);
-		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
-			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, 0);
-	}
+	x86_pmu.lbr_restore(task_ctx);
 
-	wrmsrl(x86_pmu.lbr_tos, tos);
 	task_ctx->lbr_stack_state = LBR_NONE;
 }
 
-static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+void intel_pmu_lbr_save(void *ctx)
 {
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct x86_perf_task_context *task_ctx = ctx;
 	unsigned lbr_idx, mask;
 	u64 tos, from;
 	int i;
 
-	if (task_ctx->lbr_callstack_users == 0) {
-		task_ctx->lbr_stack_state = LBR_NONE;
-		return;
-	}
-
 	mask = x86_pmu.lbr_nr - 1;
 	tos = intel_pmu_lbr_tos();
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
@@ -397,6 +401,19 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 	}
 	task_ctx->valid_lbrs = i;
 	task_ctx->tos = tos;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	if (task_ctx->lbr_callstack_users == 0) {
+		task_ctx->lbr_stack_state = LBR_NONE;
+		return;
+	}
+
+	x86_pmu.lbr_save(task_ctx);
+
 	task_ctx->lbr_stack_state = LBR_VALID;
 
 	cpuc->last_task_ctx = task_ctx;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e2e086c0..7c67847 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -694,6 +694,8 @@ struct x86_pmu {
 	void		(*lbr_disable)(void);
 	void		(*lbr_reset)(void);
 	void		(*lbr_read)(struct cpu_hw_events *cpuc);
+	void		(*lbr_save)(void *ctx);
+	void		(*lbr_restore)(void *ctx);
 
 	/*
 	 * Intel PT/LBR/BTS are exclusive
@@ -1091,6 +1093,10 @@ void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc);
 
 void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc);
 
+void intel_pmu_lbr_save(void *ctx);
+
+void intel_pmu_lbr_restore(void *ctx);
+
 void intel_pmu_lbr_init_core(void);
 
 void intel_pmu_lbr_init_nhm(void);
-- 
2.7.4



* [PATCH 06/21] perf/x86/intel/lbr: Factor out a new struct for generic optimization
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (4 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 05/21] perf/x86/intel/lbr: Add pointers for LBR save and restore kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 07/21] perf/x86/intel/lbr: Use dynamic data structure for task_ctx kan.liang
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

To reduce the overhead of a context switch with LBR enabled, some
generic optimizations were introduced, e.g. avoiding restoring the
LBRs if no one else touched them. These generic optimizations can also
be used by Architectural LBR later. Currently, the fields for the
generic optimizations are part of struct x86_perf_task_context, which
will be deprecated by Architectural LBR. A new structure is introduced
for the common fields of the generic optimizations, which can be
shared between Architectural LBR and the model-specific LBR.

Both 'valid_lbrs' and 'tos' are also used by the generic optimizations,
but they are not moved into the new structure, because Architectural
LBR is stack-like: the 'valid_lbrs' field, which records the valid LBR
entries, is no longer required, and the TOS MSR has been removed.

The LBR registers may be cleared in a deep C-state. If so, the generic
optimizations must not be applied and perf has to unconditionally
restore the LBR registers. A generic function, lbr_is_reset_in_cstate(),
is introduced to detect such a reset. For the model-specific LBR, the
TOS entry is used for the detection: if the from-address at the saved
TOS reads back as zero, the registers were cleared. Another method will
be introduced for Architectural LBR later.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/lbr.c  | 37 ++++++++++++++++++++-----------------
 arch/x86/events/perf_event.h | 10 +++++++---
 2 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 18f9990..f220a4c 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -351,33 +351,36 @@ void intel_pmu_lbr_restore(void *ctx)
 	wrmsrl(x86_pmu.lbr_tos, tos);
 }
 
+static bool lbr_is_reset_in_cstate(struct x86_perf_task_context *task_ctx)
+{
+	return !rdlbr_from(task_ctx->tos);
+}
+
 static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	u64 tos;
 
-	if (task_ctx->lbr_callstack_users == 0 ||
-	    task_ctx->lbr_stack_state == LBR_NONE) {
+	if (task_ctx->opt.lbr_callstack_users == 0 ||
+	    task_ctx->opt.lbr_stack_state == LBR_NONE) {
 		intel_pmu_lbr_reset();
 		return;
 	}
 
-	tos = task_ctx->tos;
 	/*
 	 * Does not restore the LBR registers, if
 	 * - No one else touched them, and
-	 * - Did not enter C6
+	 * - Was not cleared in Cstate
 	 */
 	if ((task_ctx == cpuc->last_task_ctx) &&
-	    (task_ctx->log_id == cpuc->last_log_id) &&
-	    rdlbr_from(tos)) {
-		task_ctx->lbr_stack_state = LBR_NONE;
+	    (task_ctx->opt.log_id == cpuc->last_log_id) &&
+	    !lbr_is_reset_in_cstate(task_ctx)) {
+		task_ctx->opt.lbr_stack_state = LBR_NONE;
 		return;
 	}
 
 	x86_pmu.lbr_restore(task_ctx);
 
-	task_ctx->lbr_stack_state = LBR_NONE;
+	task_ctx->opt.lbr_stack_state = LBR_NONE;
 }
 
 void intel_pmu_lbr_save(void *ctx)
@@ -407,17 +410,17 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (task_ctx->lbr_callstack_users == 0) {
-		task_ctx->lbr_stack_state = LBR_NONE;
+	if (task_ctx->opt.lbr_callstack_users == 0) {
+		task_ctx->opt.lbr_stack_state = LBR_NONE;
 		return;
 	}
 
 	x86_pmu.lbr_save(task_ctx);
 
-	task_ctx->lbr_stack_state = LBR_VALID;
+	task_ctx->opt.lbr_stack_state = LBR_VALID;
 
 	cpuc->last_task_ctx = task_ctx;
-	cpuc->last_log_id = ++task_ctx->log_id;
+	cpuc->last_log_id = ++task_ctx->opt.log_id;
 }
 
 void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
@@ -439,8 +442,8 @@ void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
 	if (!prev_ctx_data || !next_ctx_data)
 		return;
 
-	swap(prev_ctx_data->lbr_callstack_users,
-	     next_ctx_data->lbr_callstack_users);
+	swap(prev_ctx_data->opt.lbr_callstack_users,
+	     next_ctx_data->opt.lbr_callstack_users);
 }
 
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
@@ -492,7 +495,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
 
 	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
 		task_ctx = event->ctx->task_ctx_data;
-		task_ctx->lbr_callstack_users++;
+		task_ctx->opt.lbr_callstack_users++;
 	}
 
 	/*
@@ -532,7 +535,7 @@ void intel_pmu_lbr_del(struct perf_event *event)
 	if (branch_user_callstack(cpuc->br_sel) &&
 	    event->ctx->task_ctx_data) {
 		task_ctx = event->ctx->task_ctx_data;
-		task_ctx->lbr_callstack_users--;
+		task_ctx->opt.lbr_callstack_users--;
 	}
 
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7c67847..fd73c6c 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -735,15 +735,19 @@ struct x86_pmu {
 	int (*aux_output_match) (struct perf_event *event);
 };
 
+struct x86_perf_task_context_opt {
+	int lbr_callstack_users;
+	int lbr_stack_state;
+	int log_id;
+};
+
 struct x86_perf_task_context {
 	u64 lbr_from[MAX_LBR_ENTRIES];
 	u64 lbr_to[MAX_LBR_ENTRIES];
 	u64 lbr_info[MAX_LBR_ENTRIES];
 	int tos;
 	int valid_lbrs;
-	int lbr_callstack_users;
-	int lbr_stack_state;
-	int log_id;
+	struct x86_perf_task_context_opt opt;
 };
 
 #define x86_add_quirk(func_)						\
-- 
2.7.4



* [PATCH 07/21] perf/x86/intel/lbr: Use dynamic data structure for task_ctx
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (5 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 06/21] perf/x86/intel/lbr: Factor out a new struct for generic optimization kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 08/21] x86/msr-index: Add bunch of MSRs for Arch LBR kan.liang
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The type of task_ctx is hardcoded as struct x86_perf_task_context,
which doesn't fit Architectural LBR. For example, Architectural LBR
doesn't have the TOS MSR, and the number of LBR entries is variable. A
new struct will be introduced for Architectural LBR, so perf has to
determine the type of task_ctx at run time.

The task_ctx pointer is changed to 'void *'; its concrete type is
determined at run time.

The generic LBR optimization can be shared between Architectural LBR
and the model-specific LBR, and both need to access the structure for
the generic LBR optimization. A helper, task_context_opt(), is
introduced to retrieve a pointer to that structure at run time.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/lbr.c  | 58 ++++++++++++++++++++------------------------
 arch/x86/events/perf_event.h |  7 +++++-
 2 files changed, 32 insertions(+), 33 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index f220a4c..1c253ab 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -351,17 +351,17 @@ void intel_pmu_lbr_restore(void *ctx)
 	wrmsrl(x86_pmu.lbr_tos, tos);
 }
 
-static bool lbr_is_reset_in_cstate(struct x86_perf_task_context *task_ctx)
+static bool lbr_is_reset_in_cstate(void *ctx)
 {
-	return !rdlbr_from(task_ctx->tos);
+	return !rdlbr_from(((struct x86_perf_task_context *)ctx)->tos);
 }
 
-static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+static void __intel_pmu_lbr_restore(void *ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (task_ctx->opt.lbr_callstack_users == 0 ||
-	    task_ctx->opt.lbr_stack_state == LBR_NONE) {
+	if (task_context_opt(ctx)->lbr_callstack_users == 0 ||
+	    task_context_opt(ctx)->lbr_stack_state == LBR_NONE) {
 		intel_pmu_lbr_reset();
 		return;
 	}
@@ -371,16 +371,16 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 	 * - No one else touched them, and
 	 * - Was not cleared in Cstate
 	 */
-	if ((task_ctx == cpuc->last_task_ctx) &&
-	    (task_ctx->opt.log_id == cpuc->last_log_id) &&
-	    !lbr_is_reset_in_cstate(task_ctx)) {
-		task_ctx->opt.lbr_stack_state = LBR_NONE;
+	if ((ctx == cpuc->last_task_ctx) &&
+	    (task_context_opt(ctx)->log_id == cpuc->last_log_id) &&
+	    !lbr_is_reset_in_cstate(ctx)) {
+		task_context_opt(ctx)->lbr_stack_state = LBR_NONE;
 		return;
 	}
 
-	x86_pmu.lbr_restore(task_ctx);
+	x86_pmu.lbr_restore(ctx);
 
-	task_ctx->opt.lbr_stack_state = LBR_NONE;
+	task_context_opt(ctx)->lbr_stack_state = LBR_NONE;
 }
 
 void intel_pmu_lbr_save(void *ctx)
@@ -406,27 +406,27 @@ void intel_pmu_lbr_save(void *ctx)
 	task_ctx->tos = tos;
 }
 
-static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+static void __intel_pmu_lbr_save(void *ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (task_ctx->opt.lbr_callstack_users == 0) {
-		task_ctx->opt.lbr_stack_state = LBR_NONE;
+	if (task_context_opt(ctx)->lbr_callstack_users == 0) {
+		task_context_opt(ctx)->lbr_stack_state = LBR_NONE;
 		return;
 	}
 
-	x86_pmu.lbr_save(task_ctx);
+	x86_pmu.lbr_save(ctx);
 
-	task_ctx->opt.lbr_stack_state = LBR_VALID;
+	task_context_opt(ctx)->lbr_stack_state = LBR_VALID;
 
-	cpuc->last_task_ctx = task_ctx;
-	cpuc->last_log_id = ++task_ctx->opt.log_id;
+	cpuc->last_task_ctx = ctx;
+	cpuc->last_log_id = ++task_context_opt(ctx)->log_id;
 }
 
 void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
 				 struct perf_event_context *next)
 {
-	struct x86_perf_task_context *prev_ctx_data, *next_ctx_data;
+	void *prev_ctx_data, *next_ctx_data;
 
 	swap(prev->task_ctx_data, next->task_ctx_data);
 
@@ -442,14 +442,14 @@ void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
 	if (!prev_ctx_data || !next_ctx_data)
 		return;
 
-	swap(prev_ctx_data->opt.lbr_callstack_users,
-	     next_ctx_data->opt.lbr_callstack_users);
+	swap(task_context_opt(prev_ctx_data)->lbr_callstack_users,
+	     task_context_opt(next_ctx_data)->lbr_callstack_users);
 }
 
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct x86_perf_task_context *task_ctx;
+	void *task_ctx;
 
 	if (!cpuc->lbr_users)
 		return;
@@ -486,17 +486,14 @@ static inline bool branch_user_callstack(unsigned br_sel)
 void intel_pmu_lbr_add(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
-	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
-		task_ctx = event->ctx->task_ctx_data;
-		task_ctx->opt.lbr_callstack_users++;
-	}
+	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data)
+		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users++;
 
 	/*
 	 * Request pmu::sched_task() callback, which will fire inside the
@@ -527,16 +524,13 @@ void intel_pmu_lbr_add(struct perf_event *event)
 void intel_pmu_lbr_del(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 
 	if (branch_user_callstack(cpuc->br_sel) &&
-	    event->ctx->task_ctx_data) {
-		task_ctx = event->ctx->task_ctx_data;
-		task_ctx->opt.lbr_callstack_users--;
-	}
+	    event->ctx->task_ctx_data)
+		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users--;
 
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
 		cpuc->lbr_pebs_users--;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index fd73c6c..e33d348 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -246,7 +246,7 @@ struct cpu_hw_events {
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
 	u64				br_sel;
-	struct x86_perf_task_context	*last_task_ctx;
+	void				*last_task_ctx;
 	int				last_log_id;
 
 	/*
@@ -798,6 +798,11 @@ static struct perf_pmu_events_ht_attr event_attr_##v = {		\
 struct pmu *x86_get_pmu(void);
 extern struct x86_pmu x86_pmu __read_mostly;
 
+static inline struct x86_perf_task_context_opt *task_context_opt(void *ctx)
+{
+	return &((struct x86_perf_task_context *)ctx)->opt;
+}
+
 static inline bool x86_pmu_has_lbr_callstack(void)
 {
 	return  x86_pmu.lbr_sel_map &&
-- 
2.7.4



* [PATCH 08/21] x86/msr-index: Add bunch of MSRs for Arch LBR
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (6 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 07/21] perf/x86/intel/lbr: Use dynamic data structure for task_ctx kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 19:11   ` Peter Zijlstra
  2020-06-19 14:03 ` [PATCH 09/21] perf/x86: Expose CPUID enumeration bits for arch LBR kan.liang
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Add the Arch LBR related MSRs to msr-index.h.
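
As an illustrative sketch only (not part of this patch), the
definitions added below could be combined to enable Architectural LBR
with both CPL levels and all branch-type filters:

  /* Sketch: enable LBRs, capture ring 0 and ring >0, all branch types */
  static void arch_lbr_ctl_sketch(void)
  {
  	u64 lbr_ctl = ARCH_LBR_CTL_LBREN | ARCH_LBR_CTL_CPL |
  		      ARCH_LBR_CTL_FILTER;

  	wrmsrl(MSR_ARCH_LBR_CTL, lbr_ctl);
  }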

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 12c9684..7b7d82f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -156,6 +156,26 @@
 #define LBR_INFO_ABORT			BIT_ULL(61)
 #define LBR_INFO_CYCLES			0xffff
 
+#define MSR_ARCH_LBR_CTL		0x000014ce
+#define ARCH_LBR_CTL_LBREN		BIT(0)
+#define ARCH_LBR_CTL_CPL_OFFSET		1
+#define ARCH_LBR_CTL_CPL		(0x3ull << ARCH_LBR_CTL_CPL_OFFSET)
+#define ARCH_LBR_CTL_STACK_OFFSET	3
+#define ARCH_LBR_CTL_STACK		(0x1ull << ARCH_LBR_CTL_STACK_OFFSET)
+#define ARCH_LBR_CTL_FILTER_OFFSET	16
+#define ARCH_LBR_CTL_FILTER		(0x7full << ARCH_LBR_CTL_FILTER_OFFSET)
+#define MSR_ARCH_LBR_DEPTH		0x000014cf
+#define MSR_ARCH_LBR_FROM_0		0x00001500
+#define MSR_ARCH_LBR_TO_0		0x00001600
+#define MSR_ARCH_LBR_INFO_0		0x00001200
+#define ARCH_LBR_INFO_MISPRED		BIT_ULL(63)
+#define ARCH_LBR_INFO_IN_TSX		BIT_ULL(62)
+#define ARCH_LBR_INFO_TSX_ABORT		BIT_ULL(61)
+#define ARCH_LBR_INFO_CYC_CNT_VALID	BIT_ULL(60)
+#define ARCH_LBR_INFO_BR_TYPE_OFFSET	56
+#define ARCH_LBR_INFO_BR_TYPE		(0xfull << ARCH_LBR_INFO_BR_TYPE_OFFSET)
+#define ARCH_LBR_INFO_CYC_CNT		0xffff
+
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_PEBS_DATA_CFG		0x000003f2
 #define MSR_IA32_DS_AREA		0x00000600
-- 
2.7.4



* [PATCH 09/21] perf/x86: Expose CPUID enumeration bits for arch LBR
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (7 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 08/21] x86/msr-index: Add bunch of MSRs for Arch LBR kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 18:31   ` Peter Zijlstra
  2020-06-19 14:03 ` [PATCH 10/21] perf/x86/intel: Check Arch LBR MSRs kan.liang
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The LBR capabilities of Architectural LBR are retrieved from the CPUID
enumeration once at boot time. The capabilities have to be saved for
later use.

Several new fields are added to x86_pmu to record these capabilities.
The fields will be used in the following patches.
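
A minimal sketch of filling in those fields (assuming the Architectural
LBR CPUID leaf is 0x1c per the ISE reference; the actual enumeration
code is added in a later patch):

  static void arch_lbr_enumerate_sketch(void)
  {
  	unsigned int eax, ebx, ecx, edx;

  	/* CPUID leaf 0x1c enumerates the Arch LBR capabilities */
  	cpuid(0x1c, &eax, &ebx, &ecx, &edx);

  	x86_pmu.arch_lbr_eax = eax;	/* depth mask, deep C-state reset, LIP */
  	x86_pmu.arch_lbr_ebx = ebx;	/* CPL/branch filtering, call stack */
  	x86_pmu.arch_lbr_ecx = ecx;	/* mispredict, timed LBR, branch type */
  }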

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/perf_event.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e33d348..cbfc55b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -689,6 +689,50 @@ struct x86_pmu {
 	const int	*lbr_sel_map;		   /* lbr_select mappings */
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
 	bool		lbr_pt_coexist;		   /* (LBR|BTS) may coexist with PT */
+	bool		arch_lbr;		   /* Arch LBR supported */
+
+	/* Arch LBR Capabilities */
+	union {
+		struct {
+			/* Supported LBR depth values */
+			unsigned int	arch_lbr_depth_mask:8;
+
+			unsigned int	reserved:22;
+
+			/* Deep C-state Reset */
+			unsigned int	arch_lbr_deep_c_reset:1;
+
+			/* IP values contain LIP */
+			unsigned int	arch_lbr_lip:1;
+		};
+		unsigned int		arch_lbr_eax;
+	};
+	union {
+		struct {
+			/* CPL Filtering Supported */
+			unsigned int    arch_lbr_cpl:1;
+
+			/* Branch Filtering Supported */
+			unsigned int    arch_lbr_filter:1;
+
+			/* Call-stack Mode Supported */
+			unsigned int    arch_lbr_call_stack:1;
+		};
+		unsigned int            arch_lbr_ebx;
+	};
+	union {
+		struct {
+			/* Mispredict Bit Supported */
+			unsigned int    arch_lbr_mispred:1;
+
+			/* Timed LBRs Supported */
+			unsigned int    arch_lbr_timed_lbr:1;
+
+			/* Branch Type Field Supported */
+			unsigned int    arch_lbr_br_type:1;
+		};
+		unsigned int            arch_lbr_ecx;
+	};
 
 	void		(*lbr_enable)(bool pmi);
 	void		(*lbr_disable)(void);
-- 
2.7.4



* [PATCH 10/21] perf/x86/intel: Check Arch LBR MSRs
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (8 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 09/21] perf/x86: Expose CPUID enumeration bits for arch LBR kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 14:03 ` [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL kan.liang
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

KVM may not support the MSRs of Architectural LBR. Accessing those
MSRs may cause a #GP and crash the guest.

The MSRs have to be checked at guest boot time.

Using only the maximum supported Architectural LBR depth to check
MSR_ARCH_LBR_DEPTH should be good enough. The maximum depth can be
calculated as 8 * the position of the last set bit of the LBR_DEPTH
value in the CPUID enumeration.
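
For example, if the enumerated depth mask is 0x7 (depths 8, 16 and 24
supported), fls() returns 3 and the check writes 3 * 8 = 24 to
MSR_ARCH_LBR_DEPTH; see x86_pmu_get_max_arch_lbr_nr() below.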

Co-developed-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++--
 arch/x86/events/perf_event.h |  5 +++++
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b236cff..c3372bd 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4145,6 +4145,8 @@ static bool check_msr(unsigned long msr, u64 mask)
 
 	if (is_lbr_from(msr))
 		val_tmp = lbr_from_signext_quirk_wr(val_tmp);
+	else if (msr == MSR_ARCH_LBR_DEPTH)
+		val_tmp = x86_pmu_get_max_arch_lbr_nr();
 
 	if (wrmsrl_safe(msr, val_tmp) ||
 	    rdmsrl_safe(msr, &val_new))
@@ -5188,8 +5190,23 @@ __init int intel_pmu_init(void)
 	 * Check all LBT MSR here.
 	 * Disable LBR access if any LBR MSRs can not be accessed.
 	 */
-	if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos, 0x3UL))
-		x86_pmu.lbr_nr = 0;
+	if (x86_pmu.lbr_nr) {
+		if (x86_pmu.arch_lbr) {
+			u64 mask = 1;
+
+			if (x86_pmu.arch_lbr_cpl)
+				mask |= ARCH_LBR_CTL_CPL;
+			if (x86_pmu.arch_lbr_filter)
+				mask |= ARCH_LBR_CTL_FILTER;
+			if (x86_pmu.arch_lbr_call_stack)
+				mask |= ARCH_LBR_CTL_STACK;
+			if (!check_msr(MSR_ARCH_LBR_CTL, mask))
+				x86_pmu.lbr_nr = 0;
+			if (!check_msr(MSR_ARCH_LBR_DEPTH, 0))
+				x86_pmu.lbr_nr = 0;
+		} else if (!check_msr(x86_pmu.lbr_tos, 0x3UL))
+			x86_pmu.lbr_nr = 0;
+	}
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		if (!(check_msr(x86_pmu.lbr_from + i, 0xffffUL) &&
 		      check_msr(x86_pmu.lbr_to + i, 0xffffUL)))
@@ -5206,6 +5223,9 @@ __init int intel_pmu_init(void)
 	 */
 	if (x86_pmu.extra_regs) {
 		for (er = x86_pmu.extra_regs; er->msr; er++) {
+			/* Skip Arch LBR which is already verified */
+			if (x86_pmu.arch_lbr && (er->idx == EXTRA_REG_LBR))
+				continue;
 			er->extra_msr_access = check_msr(er->msr, 0x11UL);
 			/* Disable LBR select mapping */
 			if ((er->idx == EXTRA_REG_LBR) && !er->extra_msr_access)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index cbfc55b..7112c51 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -853,6 +853,11 @@ static inline bool x86_pmu_has_lbr_callstack(void)
 		x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
 }
 
+static inline int x86_pmu_get_max_arch_lbr_nr(void)
+{
+	return fls(x86_pmu.arch_lbr_depth_mask) * 8;
+}
+
 DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 int x86_perf_event_set_period(struct perf_event *event);
-- 
2.7.4



* [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (9 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 10/21] perf/x86/intel: Check Arch LBR MSRs kan.liang
@ 2020-06-19 14:03 ` kan.liang
  2020-06-19 18:40   ` Peter Zijlstra
  2020-06-19 14:04 ` [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR kan.liang
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:03 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

A new MSR, IA32_LBR_CTL, is introduced for Architectural LBR to enable
and configure the LBR registers, replacing the previous LBR_SELECT.

All the related members in struct cpu_hw_events and struct x86_pmu
have to be renamed.

Some new macros are added to reflect the layout of LBR_CTL.

The mapping from PERF_SAMPLE_BRANCH_* to the corresponding bits in the
LBR_CTL MSR is now saved in lbr_ctl_map, which is not a const value;
its contents depend on the CPUID enumeration.

A dedicated HW LBR filter is implemented for Architectural LBR. For the
previous model-specific LBR, most of the bits in LBR_SELECT operate in
suppress mode; for the bits in LBR_CTL, the polarity is inverted (a set
bit enables capturing the corresponding branch type).

For the previous model-specific LBR format 5 (LBR_FORMAT_INFO), if the
NO_CYCLES and NO_FLAGS types are set, the LBR_NO_INFO flag is set to
avoid an unnecessary LBR_INFO MSR read. Although Architectural LBR also
has a dedicated LBR_INFO MSR, perf doesn't need to check and set
LBR_NO_INFO: for Architectural LBR, the XSAVES instruction is used as
the default way to read all the LBR MSRs together, so the overhead the
flag tries to avoid no longer exists. Dropping the flag saves the extra
check in lbr_read() later and makes the code cleaner.
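
As an illustrative sketch only (the real table is built from the CPUID
bits in a later patch; the entries below are hypothetical, and unlisted
entries would be LBR_IGN or LBR_NOT_SUPP as appropriate), lbr_ctl_map
could look like:

  /* Sketch: PERF_SAMPLE_BRANCH_* shift -> LBR_CTL bits */
  static int arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
  	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= ARCH_LBR_USER,
  	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= ARCH_LBR_KERNEL,
  	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= ARCH_LBR_ANY,
  	[PERF_SAMPLE_BRANCH_COND_SHIFT]		= ARCH_LBR_JCC,
  	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= ARCH_LBR_IND_CALL,
  	[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]	= ARCH_LBR_CALL_STACK,
  };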

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c |  4 +--
 arch/x86/events/intel/lbr.c  | 73 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h | 18 ++++++++---
 3 files changed, 88 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c3372bd..6462ef2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3680,7 +3680,7 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
 {
 	cpuc->pebs_record_size = x86_pmu.pebs_record_size;
 
-	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
+	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map || x86_pmu.lbr_ctl_map) {
 		cpuc->shared_regs = allocate_shared_regs(cpu);
 		if (!cpuc->shared_regs)
 			goto err;
@@ -3778,7 +3778,7 @@ static void intel_pmu_cpu_starting(int cpu)
 		cpuc->shared_regs->refcnt++;
 	}
 
-	if (x86_pmu.lbr_sel_map)
+	if (x86_pmu.lbr_sel_map || x86_pmu.lbr_ctl_map)
 		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
 
 	if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 1c253ab..b34beb5 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -132,6 +132,44 @@ enum {
 	 X86_BR_IRQ		|\
 	 X86_BR_INT)
 
+/*
+ * Intel LBR_CTL bits
+ *
+ * Hardware branch filter for Arch LBR
+ */
+#define ARCH_LBR_KERNEL_BIT		1  /* capture at ring0 */
+#define ARCH_LBR_USER_BIT		2  /* capture at ring > 0 */
+#define ARCH_LBR_CALL_STACK_BIT		3  /* enable call stack */
+#define ARCH_LBR_JCC_BIT		16 /* capture conditional branches */
+#define ARCH_LBR_REL_JMP_BIT		17 /* capture relative jumps */
+#define ARCH_LBR_IND_JMP_BIT		18 /* capture indirect jumps */
+#define ARCH_LBR_REL_CALL_BIT		19 /* capture relative calls */
+#define ARCH_LBR_IND_CALL_BIT		20 /* capture indirect calls */
+#define ARCH_LBR_RETURN_BIT		21 /* capture near returns */
+#define ARCH_LBR_OTHER_BRANCH_BIT	22 /* capture other branches */
+
+#define ARCH_LBR_KERNEL			(1ULL << ARCH_LBR_KERNEL_BIT)
+#define ARCH_LBR_USER			(1ULL << ARCH_LBR_USER_BIT)
+#define ARCH_LBR_CALL_STACK		(1ULL << ARCH_LBR_CALL_STACK_BIT)
+#define ARCH_LBR_JCC			(1ULL << ARCH_LBR_JCC_BIT)
+#define ARCH_LBR_REL_JMP		(1ULL << ARCH_LBR_REL_JMP_BIT)
+#define ARCH_LBR_IND_JMP		(1ULL << ARCH_LBR_IND_JMP_BIT)
+#define ARCH_LBR_REL_CALL		(1ULL << ARCH_LBR_REL_CALL_BIT)
+#define ARCH_LBR_IND_CALL		(1ULL << ARCH_LBR_IND_CALL_BIT)
+#define ARCH_LBR_RETURN			(1ULL << ARCH_LBR_RETURN_BIT)
+#define ARCH_LBR_OTHER_BRANCH		(1ULL << ARCH_LBR_OTHER_BRANCH_BIT)
+
+#define ARCH_LBR_ANY			 \
+	(ARCH_LBR_JCC			|\
+	 ARCH_LBR_REL_JMP		|\
+	 ARCH_LBR_IND_JMP		|\
+	 ARCH_LBR_REL_CALL		|\
+	 ARCH_LBR_IND_CALL		|\
+	 ARCH_LBR_RETURN		|\
+	 ARCH_LBR_OTHER_BRANCH)
+
+#define ARCH_LBR_CTL_MASK			0x7f000e
+
 static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
 
 /*
@@ -814,6 +852,37 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	return 0;
 }
 
+/*
+ * Setup the HW LBR filter for arch LBR
+ * Used only when available, may not be enough to disambiguate
+ * all branches, may need the help of the SW filter
+ */
+static int intel_pmu_setup_hw_arch_lbr_filter(struct perf_event *event)
+{
+	struct hw_perf_event_extra *reg;
+	u64 br_type = event->attr.branch_sample_type;
+	u64 mask = 0, v;
+	int i;
+
+	for (i = 0; i < PERF_SAMPLE_BRANCH_MAX_SHIFT; i++) {
+		if (!(br_type & (1ULL << i)))
+			continue;
+
+		v = x86_pmu.lbr_ctl_map[i];
+		if (v == LBR_NOT_SUPP)
+			return -EOPNOTSUPP;
+
+		if (v != LBR_IGN)
+			mask |= v;
+	}
+
+	reg = &event->hw.branch_reg;
+	reg->idx = EXTRA_REG_LBR;
+	reg->config = mask;
+
+	return 0;
+}
+
 int intel_pmu_setup_lbr_filter(struct perf_event *event)
 {
 	int ret = 0;
@@ -834,7 +903,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
 	/*
 	 * setup HW LBR filter, if any
 	 */
-	if (x86_pmu.lbr_sel_map)
+	if (x86_pmu.lbr_ctl_map)
+		ret = intel_pmu_setup_hw_arch_lbr_filter(event);
+	else if (x86_pmu.lbr_sel_map)
 		ret = intel_pmu_setup_hw_lbr_filter(event);
 
 	return ret;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7112c51..1b91f2b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -244,7 +244,10 @@ struct cpu_hw_events {
 	int				lbr_pebs_users;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
-	struct er_account		*lbr_sel;
+	union {
+		struct er_account		*lbr_sel;
+		struct er_account		*lbr_ctl;
+	};
 	u64				br_sel;
 	void				*last_task_ctx;
 	int				last_log_id;
@@ -685,8 +688,12 @@ struct x86_pmu {
 	 */
 	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
 	int		lbr_nr;			   /* hardware stack size */
-	u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
+	union {
+		u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
+		u64		lbr_ctl_mask;		   /* LBR_CTL valid bits */
+	};
 	const int	*lbr_sel_map;		   /* lbr_select mappings */
+	int		*lbr_ctl_map;		   /* LBR_CTL mappings */
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
 	bool		lbr_pt_coexist;		   /* (LBR|BTS) may coexist with PT */
 	bool		arch_lbr;		   /* Arch LBR supported */
@@ -849,8 +856,11 @@ static inline struct x86_perf_task_context_opt *task_context_opt(void *ctx)
 
 static inline bool x86_pmu_has_lbr_callstack(void)
 {
-	return  x86_pmu.lbr_sel_map &&
-		x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+	if (x86_pmu.lbr_ctl_map)
+		return x86_pmu.lbr_ctl_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+	if (x86_pmu.lbr_sel_map)
+		return x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+	return false;
 }
 
 static inline int x86_pmu_get_max_arch_lbr_nr(void)
-- 
2.7.4



* [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (10 preceding siblings ...)
  2020-06-19 14:03 ` [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 19:08   ` Peter Zijlstra
  2020-06-19 14:04 ` [PATCH 13/21] perf/core: Factor out functions to allocate/free the task_ctx_data kan.liang
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Last Branch Records (LBR) enables recording of software path history by
logging taken branches and other control flows within architectural
registers. Intel CPUs have had model-specific LBRs for quite some
time, but this evolves them into an architectural feature now.

The main improvements implemented for Architectural LBR include:
- Linux kernel can support the LBR features without knowing the model
  number of the current CPU.
- Architectural LBR capabilities can be enumerated by CPUID. The
  lbr_ctl_map is based on the CPUID enumeration.
- The possible LBR depth can be retrieved from CPUID enumeration. The
  max value is written to the new MSR_ARCH_LBR_DEPTH as the number of
  LBR entries.
- A new IA32_LBR_CTL MSR is introduced to enable and configure LBRs,
  which replaces the IA32_DEBUGCTL[bit 0] and the LBR_SELECT MSR.
- Each LBR record or entry is still comprised of three MSRs,
  IA32_LBR_x_FROM_IP, IA32_LBR_x_TO_IP and IA32_LBR_x_INFO,
  but they become architectural MSRs.
- Architectural LBR is stack-like now. Entry 0 is always the youngest
  branch, entry 1 the next youngest... The TOS MSR has been removed.

Accordingly, Architectural LBR dedicated functions are implemented to
enable/disable/reset/read/save/restore LBR.
- For enable/disable, as in the previous LBR implementation, the LBR
  call stack mode does not work well with FREEZE_LBRS_ON_PMI.
  The DEBUGCTLMSR_FREEZE_LBRS_ON_PMI bit is still cleared in the LBR
  call stack mode. IA32_DEBUGCTL[bit 0] no longer has any meaning.
- For reset, writing to the ARCH_LBR_DEPTH MSR clears all Arch LBR
  entries, which is a lot faster and can improve the context switch
  latency.
- For read, the branch type information can be retrieved from
  MSR_ARCH_LBR_INFO_*, but it is not fully compatible because of the
  new OTHER_BRANCH type. Software decoding is still required for the
  OTHER_BRANCH case.
  LBR records are stored in age order.
  Add dedicated read functions for both the PEBS and non-PEBS cases.
- For save/restore, apply the fast reset (writing ARCH_LBR_DEPTH).
  Read 'lbr_from' of entry 0, instead of the TOS MSR, to check whether
  the LBR registers were reset in a deep C-state. If the 'deep C-state
  reset' bit is not set in the CPUID enumeration, skip the check.
  XSAVE support for Architectural LBR will be implemented later.

The number of LBR entries can no longer be hardcoded; it must be
retrieved from CPUID enumeration. A new structure
x86_perf_task_context_arch_lbr is introduced for Architectural LBR.
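
For reference, a minimal sketch of how the maximum depth could be
derived from the CPUID enumeration (the EAX[7:0] depth-bitmap layout
is taken from the SDM; this is only an illustration, not the exact
helper added by this series):

    /* CPUID.1CH:EAX[7:0]: bit n set => a depth of 8 * (n + 1) is valid */
    static u64 max_arch_lbr_nr_sketch(unsigned int cpuid_1c_eax)
    {
            unsigned int depth_bitmap = cpuid_1c_eax & 0xff;

            if (!depth_bitmap)
                    return 0;

            /* fls() returns the 1-based index of the highest set bit */
            return fls(depth_bitmap) * 8;
    }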

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c |   3 +
 arch/x86/events/intel/lbr.c  | 276 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h |  16 +++
 3 files changed, 294 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6462ef2..768caa9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4643,6 +4643,9 @@ __init int intel_pmu_init(void)
 		x86_pmu.lbr_read = intel_pmu_lbr_read_32;
 	}
 
+	if (boot_cpu_has(X86_FEATURE_ARCH_LBR))
+		intel_pmu_arch_lbr_init();
+
 	intel_ds_init();
 
 	x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs last */
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index b34beb5..fde23e8 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -212,6 +212,33 @@ void intel_pmu_lbr_enable(bool pmi)
 		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
+static void intel_pmu_arch_lbr_enable(bool pmi)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	u64 debugctl, lbr_ctl = 0, orig_debugctl;
+
+	if (pmi)
+		return;
+
+	if (cpuc->lbr_ctl)
+		lbr_ctl = cpuc->lbr_ctl->config & x86_pmu.lbr_ctl_mask;
+	/*
+	 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+	 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+	 * may be missed, that can lead to confusing results.
+	 */
+	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+	orig_debugctl = debugctl;
+	if (lbr_ctl & ARCH_LBR_CALL_STACK)
+		debugctl &= ~DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
+	else
+		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
+	if (orig_debugctl != debugctl)
+		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+
+	wrmsrl(MSR_ARCH_LBR_CTL, lbr_ctl | ARCH_LBR_CTL_LBREN);
+}
+
 void intel_pmu_lbr_disable(void)
 {
 	u64 debugctl;
@@ -221,6 +248,11 @@ void intel_pmu_lbr_disable(void)
 	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
+static void intel_pmu_arch_lbr_disable(void)
+{
+	wrmsrl(MSR_ARCH_LBR_CTL, 0);
+}
+
 void intel_pmu_lbr_reset_32(void)
 {
 	int i;
@@ -241,6 +273,12 @@ void intel_pmu_lbr_reset_64(void)
 	}
 }
 
+static void intel_pmu_arch_lbr_reset(void)
+{
+	/* Write to ARCH_LBR_DEPTH MSR, all LBR entries are reset to 0 */
+	wrmsrl(MSR_ARCH_LBR_DEPTH, x86_pmu.lbr_nr);
+}
+
 void intel_pmu_lbr_reset(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -389,8 +427,29 @@ void intel_pmu_lbr_restore(void *ctx)
 	wrmsrl(x86_pmu.lbr_tos, tos);
 }
 
+static void intel_pmu_arch_lbr_restore(void *ctx)
+{
+	struct x86_perf_task_context_arch_lbr *task_ctx = ctx;
+	struct x86_perf_arch_lbr_entry *entries = task_ctx->entries;
+	int i;
+
+	/* Fast reset the LBRs before restore if the call stack is not full. */
+	if (!entries[x86_pmu.lbr_nr - 1].lbr_from)
+		intel_pmu_arch_lbr_reset();
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		if (!entries[i].lbr_from)
+			break;
+		wrlbr_from(i, entries[i].lbr_from);
+		wrlbr_to(i, entries[i].lbr_to);
+		wrmsrl(MSR_ARCH_LBR_INFO_0 + i, entries[i].lbr_info);
+	}
+}
+
 static bool lbr_is_reset_in_cstate(void *ctx)
 {
+	if (x86_pmu.arch_lbr)
+		return x86_pmu.arch_lbr_deep_c_reset && !rdlbr_from(0);
 	return !rdlbr_from(((struct x86_perf_task_context *)ctx)->tos);
 }
 
@@ -444,6 +503,26 @@ void intel_pmu_lbr_save(void *ctx)
 	task_ctx->tos = tos;
 }
 
+static void intel_pmu_arch_lbr_save(void *ctx)
+{
+	struct x86_perf_task_context_arch_lbr *task_ctx = ctx;
+	struct x86_perf_arch_lbr_entry *entries = task_ctx->entries;
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		entries[i].lbr_from = rdlbr_from(i);
+		/* Only save valid branches. */
+		if (!entries[i].lbr_from)
+			break;
+		entries[i].lbr_to = rdlbr_to(i);
+		rdmsrl(MSR_ARCH_LBR_INFO_0 + i, entries[i].lbr_info);
+	}
+
+	/* LBR call stack is not full. Reset is required in restore. */
+	if (i < x86_pmu.lbr_nr)
+		entries[x86_pmu.lbr_nr - 1].lbr_from = 0;
+}
+
 static void __intel_pmu_lbr_save(void *ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -722,6 +801,92 @@ void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 	cpuc->lbr_stack.hw_idx = tos;
 }
 
+enum {
+	ARCH_LBR_BR_TYPE_JCC			= 0,
+	ARCH_LBR_BR_TYPE_NEAR_IND_JMP		= 1,
+	ARCH_LBR_BR_TYPE_NEAR_REL_JMP		= 2,
+	ARCH_LBR_BR_TYPE_NEAR_IND_CALL		= 3,
+	ARCH_LBR_BR_TYPE_NEAR_REL_CALL		= 4,
+	ARCH_LBR_BR_TYPE_NEAR_RET		= 5,
+	ARCH_LBR_BR_TYPE_KNOWN_MAX		= ARCH_LBR_BR_TYPE_NEAR_RET,
+
+	ARCH_LBR_BR_TYPE_MAP_MAX		= 16,
+};
+
+static const int arch_lbr_br_type_map[ARCH_LBR_BR_TYPE_MAP_MAX] = {
+	[ARCH_LBR_BR_TYPE_JCC]			= X86_BR_JCC,
+	[ARCH_LBR_BR_TYPE_NEAR_IND_JMP]		= X86_BR_IND_JMP,
+	[ARCH_LBR_BR_TYPE_NEAR_REL_JMP]		= X86_BR_JMP,
+	[ARCH_LBR_BR_TYPE_NEAR_IND_CALL]	= X86_BR_IND_CALL,
+	[ARCH_LBR_BR_TYPE_NEAR_REL_CALL]	= X86_BR_CALL,
+	[ARCH_LBR_BR_TYPE_NEAR_RET]		= X86_BR_RET,
+};
+
+static void __intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc, int index,
+				      u64 from, u64 to, u64 info)
+{
+	u64 mis = 0, pred = 0, in_tx = 0, abort = 0, type = 0;
+	u32 br_type, to_plm;
+	u16 cycles = 0;
+
+	if (x86_pmu.arch_lbr_mispred) {
+		mis = !!(info & ARCH_LBR_INFO_MISPRED);
+		pred = !mis;
+	}
+	in_tx = !!(info & ARCH_LBR_INFO_IN_TSX);
+	abort = !!(info & ARCH_LBR_INFO_TSX_ABORT);
+	if (x86_pmu.arch_lbr_timed_lbr &&
+	    (info & ARCH_LBR_INFO_CYC_CNT_VALID))
+		cycles = (info & ARCH_LBR_INFO_CYC_CNT);
+
+	/*
+	 * Parse the branch type recorded in LBR_x_INFO MSR.
+	 * Doesn't support OTHER_BRANCH decoding for now.
+	 * OTHER_BRANCH branch type still rely on software decoding.
+	 */
+	if (x86_pmu.arch_lbr_br_type) {
+		br_type = (info & ARCH_LBR_INFO_BR_TYPE) >> ARCH_LBR_INFO_BR_TYPE_OFFSET;
+
+		if (br_type <= ARCH_LBR_BR_TYPE_KNOWN_MAX) {
+			to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
+			type = arch_lbr_br_type_map[br_type] | to_plm;
+		}
+	}
+
+	cpuc->lbr_entries[index].from		 = from;
+	cpuc->lbr_entries[index].to		 = to;
+	cpuc->lbr_entries[index].mispred	 = mis;
+	cpuc->lbr_entries[index].predicted	 = pred;
+	cpuc->lbr_entries[index].in_tx		 = in_tx;
+	cpuc->lbr_entries[index].abort		 = abort;
+	cpuc->lbr_entries[index].cycles		 = cycles;
+	cpuc->lbr_entries[index].type		 = type;
+	cpuc->lbr_entries[index].reserved	 = 0;
+}
+
+static void intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc)
+{
+	u64 from, to, info;
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		from = rdlbr_from(i);
+		to   = rdlbr_to(i);
+
+		/*
+		 * Read LBR entries until invalid entry (0s) is detected.
+		 */
+		if (!from)
+			break;
+
+		rdmsrl(MSR_ARCH_LBR_INFO_0 + i, info);
+
+		__intel_pmu_arch_lbr_read(cpuc, i, from, to, info);
+	}
+
+	cpuc->lbr_stack.nr = i;
+}
+
 void intel_pmu_lbr_read(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1149,7 +1314,10 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 		from = cpuc->lbr_entries[i].from;
 		to = cpuc->lbr_entries[i].to;
 
-		type = branch_type(from, to, cpuc->lbr_entries[i].abort);
+		if (cpuc->lbr_entries[i].type)
+			type = cpuc->lbr_entries[i].type;
+		else
+			type = branch_type(from, to, cpuc->lbr_entries[i].abort);
 		if (type != X86_BR_NONE && (br_sel & X86_BR_ANYTX)) {
 			if (cpuc->lbr_entries[i].in_tx)
 				type |= X86_BR_IN_TX;
@@ -1184,11 +1352,37 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 	}
 }
 
+static void intel_pmu_store_pebs_arch_lbrs(struct pebs_lbr *pebs_lbr,
+					   struct cpu_hw_events *cpuc)
+{
+	struct pebs_lbr_entry *lbr;
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr = &pebs_lbr->lbr[i];
+
+		/*
+		 * Read LBR entries until invalid entry (0s) is detected.
+		 */
+		if (!lbr->from)
+			break;
+
+		__intel_pmu_arch_lbr_read(cpuc, i, lbr->from,
+					  lbr->to, lbr->info);
+	}
+
+	cpuc->lbr_stack.nr = i;
+	intel_pmu_lbr_filter(cpuc);
+}
+
 void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int i;
 
+	if (x86_pmu.arch_lbr)
+		return intel_pmu_store_pebs_arch_lbrs(lbr, cpuc);
+
 	cpuc->lbr_stack.nr = x86_pmu.lbr_nr;
 
 	/* Cannot get TOS for large PEBS */
@@ -1266,6 +1460,26 @@ static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
 	[PERF_SAMPLE_BRANCH_CALL_SHIFT]		= LBR_REL_CALL,
 };
 
+static int arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= ARCH_LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= ARCH_LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= ARCH_LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= ARCH_LBR_RETURN |
+						  ARCH_LBR_OTHER_BRANCH,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]     = ARCH_LBR_REL_CALL |
+						  ARCH_LBR_IND_CALL |
+						  ARCH_LBR_OTHER_BRANCH,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]     = ARCH_LBR_IND_CALL,
+	[PERF_SAMPLE_BRANCH_COND_SHIFT]         = ARCH_LBR_JCC,
+	[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]   = ARCH_LBR_REL_CALL |
+						  ARCH_LBR_IND_CALL |
+						  ARCH_LBR_RETURN |
+						  ARCH_LBR_CALL_STACK,
+	[PERF_SAMPLE_BRANCH_IND_JUMP_SHIFT]	= ARCH_LBR_IND_JMP,
+	[PERF_SAMPLE_BRANCH_CALL_SHIFT]		= ARCH_LBR_REL_CALL,
+};
+
 /* core */
 void __init intel_pmu_lbr_init_core(void)
 {
@@ -1411,3 +1625,63 @@ void intel_pmu_lbr_init_knl(void)
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
 		x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
+
+void __init intel_pmu_arch_lbr_init(void)
+{
+	unsigned int unused_edx;
+	u64 lbr_nr;
+
+	/* Arch LBR Capabilities */
+	cpuid(28, &x86_pmu.arch_lbr_eax, &x86_pmu.arch_lbr_ebx,
+		  &x86_pmu.arch_lbr_ecx, &unused_edx);
+
+	lbr_nr = x86_pmu_get_max_arch_lbr_nr();
+	if (!lbr_nr)
+		return;
+
+	/* Apply the max depth of Arch LBR */
+	if (wrmsrl_safe(MSR_ARCH_LBR_DEPTH, lbr_nr))
+		return;
+
+	x86_pmu.lbr_nr = lbr_nr;
+	x86_get_pmu()->task_ctx_size = sizeof(struct x86_perf_task_context_arch_lbr) +
+				       lbr_nr * sizeof(struct x86_perf_arch_lbr_entry);
+
+	x86_pmu.lbr_from = MSR_ARCH_LBR_FROM_0;
+	x86_pmu.lbr_to = MSR_ARCH_LBR_TO_0;
+
+	/* LBR callstack requires both CPL and Branch Filtering support */
+	if (!x86_pmu.arch_lbr_cpl ||
+	    !x86_pmu.arch_lbr_filter ||
+	    !x86_pmu.arch_lbr_call_stack)
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] = LBR_NOT_SUPP;
+
+	if (!x86_pmu.arch_lbr_cpl) {
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_USER_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_KERNEL_SHIFT] = LBR_NOT_SUPP;
+	} else if (!x86_pmu.arch_lbr_filter) {
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_ANY_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_COND_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_IND_JUMP_SHIFT] = LBR_NOT_SUPP;
+		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_CALL_SHIFT] = LBR_NOT_SUPP;
+	}
+
+	x86_pmu.lbr_ctl_mask = ARCH_LBR_CTL_MASK;
+	x86_pmu.lbr_ctl_map  = arch_lbr_ctl_map;
+
+	if (!x86_pmu.arch_lbr_cpl && !x86_pmu.arch_lbr_filter)
+		x86_pmu.lbr_ctl_map = NULL;
+
+	x86_pmu.lbr_enable = intel_pmu_arch_lbr_enable;
+	x86_pmu.lbr_disable = intel_pmu_arch_lbr_disable;
+	x86_pmu.lbr_reset = intel_pmu_arch_lbr_reset;
+	x86_pmu.lbr_read = intel_pmu_arch_lbr_read;
+	x86_pmu.lbr_save = intel_pmu_arch_lbr_save;
+	x86_pmu.lbr_restore = intel_pmu_arch_lbr_restore;
+
+	x86_pmu.arch_lbr = true;
+	pr_cont("Architectural LBR, ");
+}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1b91f2b..5cebc75 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -801,6 +801,17 @@ struct x86_perf_task_context {
 	struct x86_perf_task_context_opt opt;
 };
 
+struct x86_perf_arch_lbr_entry {
+	u64 lbr_from;
+	u64 lbr_to;
+	u64 lbr_info;
+};
+
+struct x86_perf_task_context_arch_lbr {
+	struct x86_perf_task_context_opt opt;
+	struct x86_perf_arch_lbr_entry  entries[0];
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
@@ -851,6 +862,9 @@ extern struct x86_pmu x86_pmu __read_mostly;
 
 static inline struct x86_perf_task_context_opt *task_context_opt(void *ctx)
 {
+	if (x86_pmu.arch_lbr)
+		return &((struct x86_perf_task_context_arch_lbr *)ctx)->opt;
+
 	return &((struct x86_perf_task_context *)ctx)->opt;
 }
 
@@ -1181,6 +1195,8 @@ void intel_pmu_lbr_init_skl(void);
 
 void intel_pmu_lbr_init_knl(void);
 
+void intel_pmu_arch_lbr_init(void);
+
 void intel_pmu_pebs_data_source_nhm(void);
 
 void intel_pmu_pebs_data_source_skl(bool pmem);
-- 
2.7.4



* [PATCH 13/21] perf/core: Factor out functions to allocate/free the task_ctx_data
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (11 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 14:04 ` [PATCH 14/21] perf/core: Use kmem_cache to allocate the PMU specific data kan.liang
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The method to allocate/free the task_ctx_data is going to be changed in
the following patch. Currently, the task_ctx_data is allocated/freed in
several different places. To avoid repeatedly modifying the same code
in each of those places, alloc_task_ctx_data() and
free_task_ctx_data() are factored out to allocate/free the
task_ctx_data. The modification then only needs to be applied once.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 kernel/events/core.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 633b4ae..a9bfe32 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1234,12 +1234,22 @@ static void get_ctx(struct perf_event_context *ctx)
 	refcount_inc(&ctx->refcount);
 }
 
+static void *alloc_task_ctx_data(struct pmu *pmu)
+{
+	return kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+}
+
+static void free_task_ctx_data(struct pmu *pmu, void *task_ctx_data)
+{
+	kfree(task_ctx_data);
+}
+
 static void free_ctx(struct rcu_head *head)
 {
 	struct perf_event_context *ctx;
 
 	ctx = container_of(head, struct perf_event_context, rcu_head);
-	kfree(ctx->task_ctx_data);
+	free_task_ctx_data(ctx->pmu, ctx->task_ctx_data);
 	kfree(ctx);
 }
 
@@ -4467,7 +4477,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		goto errout;
 
 	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
-		task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+		task_ctx_data = alloc_task_ctx_data(pmu);
 		if (!task_ctx_data) {
 			err = -ENOMEM;
 			goto errout;
@@ -4525,11 +4535,11 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		}
 	}
 
-	kfree(task_ctx_data);
+	free_task_ctx_data(pmu, task_ctx_data);
 	return ctx;
 
 errout:
-	kfree(task_ctx_data);
+	free_task_ctx_data(pmu, task_ctx_data);
 	return ERR_PTR(err);
 }
 
@@ -12406,8 +12416,7 @@ inherit_event(struct perf_event *parent_event,
 	    !child_ctx->task_ctx_data) {
 		struct pmu *pmu = child_event->pmu;
 
-		child_ctx->task_ctx_data = kzalloc(pmu->task_ctx_size,
-						   GFP_KERNEL);
+		child_ctx->task_ctx_data = alloc_task_ctx_data(pmu);
 		if (!child_ctx->task_ctx_data) {
 			free_event(child_event);
 			return ERR_PTR(-ENOMEM);
-- 
2.7.4



* [PATCH 14/21] perf/core: Use kmem_cache to allocate the PMU specific data
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (12 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 13/21] perf/core: Factor out functions to allocate/free the task_ctx_data kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 14:04 ` [PATCH 15/21] perf/x86/intel/lbr: Create kmem_cache for the LBR context data kan.liang
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Currently, the PMU specific data task_ctx_data is allocated by the
function kzalloc() in the perf generic code. When there is no specific
alignment requirement for the task_ctx_data, the method works well for
now. However, there will be a problem once a specific alignment
requirement is introduced in future features, e.g., the Architectural LBR
XSAVE feature requires 64-byte alignment. If the specific alignment
requirement is not fulfilled, the XSAVE family of instructions will fail
to save/restore the xstate to/from the task_ctx_data.

The function kzalloc() itself only guarantees a natural alignment. A
new method to allocate the task_ctx_data has to be introduced, which
has to meet the requirements as below:
- must be a generic method that can be used by different architectures,
  because the allocation of the task_ctx_data is implemented in the
  perf generic code;
- must guarantee the requested alignment (the alignment requirement is
  not changed after boot);
- must be able to allocate/free a buffer (smaller than a page size)
  dynamically;
- should not cause extra CPU overhead or space overhead.

Several options were considered as below:
- One option is to allocate a larger buffer for task_ctx_data. E.g.,
    ptr = kmalloc(size + alignment, GFP_KERNEL);
    ptr = PTR_ALIGN(ptr, alignment);
  This option causes space overhead.
- Another option is to allocate the task_ctx_data in the PMU specific
  code. To do so, several function pointers have to be added. As a
  result, both the generic structure and the PMU specific structure
  will become bigger. Besides, extra function calls are added when
  allocating/freeing the buffer. This option will increase both the
  space overhead and CPU overhead.
- The third option is to use a kmem_cache to allocate a buffer for the
  task_ctx_data. The kmem_cache can be created with a specific alignment
  requirement by the PMU at boot time. A new pointer for kmem_cache has
  to be added in the generic struct pmu, which would be used to
  dynamically allocate a buffer for the task_ctx_data at run time.
  Although the new pointer is added to the struct pmu, the existing
  variable task_ctx_size is not required anymore. The size of the
  generic structure is kept the same.

The third option which meets all the aforementioned requirements is used
to replace kzalloc() for the PMU specific data allocation. A later patch
will remove the kzalloc() method and the related variables.
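
A minimal sketch of the third option (the cache name and variables are
illustrative, not the exact code added later in this series):

    struct kmem_cache *cache;
    void *task_ctx_data;

    /* created once by the PMU at boot, with the required alignment */
    cache = kmem_cache_create("pmu_task_ctx", size, 64, 0, NULL);

    /* allocated/freed dynamically by the perf generic code at run time */
    task_ctx_data = kmem_cache_zalloc(cache, GFP_KERNEL);
    ...
    kmem_cache_free(cache, task_ctx_data);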

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h | 5 +++++
 kernel/events/core.c       | 8 +++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9c3e761..ef6440f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -425,6 +425,11 @@ struct pmu {
 	size_t				task_ctx_size;
 
 	/*
+	 * Kmem cache of PMU specific data
+	 */
+	struct kmem_cache		*task_ctx_cache;
+
+	/*
 	 * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data)
 	 * can be synchronized using this function. See Intel LBR callstack support
 	 * implementation and Perf core context switch handling callbacks for usage
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a9bfe32..165b79b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1236,12 +1236,18 @@ static void get_ctx(struct perf_event_context *ctx)
 
 static void *alloc_task_ctx_data(struct pmu *pmu)
 {
+	if (pmu->task_ctx_cache)
+		return kmem_cache_zalloc(pmu->task_ctx_cache, GFP_KERNEL);
+
 	return kzalloc(pmu->task_ctx_size, GFP_KERNEL);
 }
 
 static void free_task_ctx_data(struct pmu *pmu, void *task_ctx_data)
 {
-	kfree(task_ctx_data);
+	if (pmu->task_ctx_cache && task_ctx_data)
+		kmem_cache_free(pmu->task_ctx_cache, task_ctx_data);
+	else
+		kfree(task_ctx_data);
 }
 
 static void free_ctx(struct rcu_head *head)
-- 
2.7.4



* [PATCH 15/21] perf/x86/intel/lbr: Create kmem_cache for the LBR context data
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (13 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 14/21] perf/core: Use kmem_cache to allocate the PMU specific data kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 14:04 ` [PATCH 16/21] perf/x86: Remove task_ctx_size kan.liang
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

A new kmem_cache method is introduced to allocate the PMU specific data
task_ctx_data, which requires the PMU specific code to create a
kmem_cache.

Currently, the task_ctx_data is only used by the Intel LBR call stack
feature, which has been supported since Haswell. The kmem_cache should
only be created for Haswell and later platforms. There is no alignment
requirement for the existing platforms.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/lbr.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index fde23e8..28f0d41 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1533,9 +1533,17 @@ void __init intel_pmu_lbr_init_snb(void)
 	 */
 }
 
+static inline struct kmem_cache *
+create_lbr_kmem_cache(size_t size, size_t align)
+{
+	return kmem_cache_create("x86_lbr", size, align, 0, NULL);
+}
+
 /* haswell */
 void intel_pmu_lbr_init_hsw(void)
 {
+	size_t size = sizeof(struct x86_perf_task_context);
+
 	x86_pmu.lbr_nr	 = 16;
 	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
 	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
@@ -1544,6 +1552,8 @@ void intel_pmu_lbr_init_hsw(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
 
+	x86_get_pmu()->task_ctx_cache = create_lbr_kmem_cache(size, 0);
+
 	if (lbr_from_signext_quirk_needed())
 		static_branch_enable(&lbr_from_quirk_key);
 }
@@ -1551,6 +1561,8 @@ void intel_pmu_lbr_init_hsw(void)
 /* skylake */
 __init void intel_pmu_lbr_init_skl(void)
 {
+	size_t size = sizeof(struct x86_perf_task_context);
+
 	x86_pmu.lbr_nr	 = 32;
 	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
 	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
@@ -1559,6 +1571,8 @@ __init void intel_pmu_lbr_init_skl(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
 
+	x86_get_pmu()->task_ctx_cache = create_lbr_kmem_cache(size, 0);
+
 	/*
 	 * SW branch filter usage:
 	 * - support syscall, sysret capture.
@@ -1629,6 +1643,7 @@ void intel_pmu_lbr_init_knl(void)
 void __init intel_pmu_arch_lbr_init(void)
 {
 	unsigned int unused_edx;
+	size_t size;
 	u64 lbr_nr;
 
 	/* Arch LBR Capabilities */
@@ -1644,8 +1659,11 @@ void __init intel_pmu_arch_lbr_init(void)
 		return;
 
 	x86_pmu.lbr_nr = lbr_nr;
-	x86_get_pmu()->task_ctx_size = sizeof(struct x86_perf_task_context_arch_lbr) +
-				       lbr_nr * sizeof(struct x86_perf_arch_lbr_entry);
+
+	size = sizeof(struct x86_perf_task_context_arch_lbr) +
+	       lbr_nr * sizeof(struct x86_perf_arch_lbr_entry);
+	x86_get_pmu()->task_ctx_size = size;
+	x86_get_pmu()->task_ctx_cache = create_lbr_kmem_cache(size, 0);
 
 	x86_pmu.lbr_from = MSR_ARCH_LBR_FROM_0;
 	x86_pmu.lbr_to = MSR_ARCH_LBR_TO_0;
-- 
2.7.4



* [PATCH 16/21] perf/x86: Remove task_ctx_size
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (14 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 15/21] perf/x86/intel/lbr: Create kmem_cache for the LBR context data kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 14:04 ` [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask kan.liang
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

A new kmem_cache method has replaced kzalloc() for allocating the PMU
specific data. The task_ctx_size field is not required anymore.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c      | 1 -
 arch/x86/events/intel/lbr.c | 1 -
 include/linux/perf_event.h  | 4 ----
 kernel/events/core.c        | 4 +---
 4 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index a619763..aeb6e6d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2364,7 +2364,6 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
-	.task_ctx_size          = sizeof(struct x86_perf_task_context),
 	.swap_task_ctx		= x86_pmu_swap_task_ctx,
 	.check_period		= x86_pmu_check_period,
 
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 28f0d41..4060d3a 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1662,7 +1662,6 @@ void __init intel_pmu_arch_lbr_init(void)
 
 	size = sizeof(struct x86_perf_task_context_arch_lbr) +
 	       lbr_nr * sizeof(struct x86_perf_arch_lbr_entry);
-	x86_get_pmu()->task_ctx_size = size;
 	x86_get_pmu()->task_ctx_cache = create_lbr_kmem_cache(size, 0);
 
 	x86_pmu.lbr_from = MSR_ARCH_LBR_FROM_0;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ef6440f..dd6f3a8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -419,10 +419,6 @@ struct pmu {
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
 					bool sched_in);
-	/*
-	 * PMU specific data size
-	 */
-	size_t				task_ctx_size;
 
 	/*
 	 * Kmem cache of PMU specific data
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 165b79b..9800e99 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1239,15 +1239,13 @@ static void *alloc_task_ctx_data(struct pmu *pmu)
 	if (pmu->task_ctx_cache)
 		return kmem_cache_zalloc(pmu->task_ctx_cache, GFP_KERNEL);
 
-	return kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+	return NULL;
 }
 
 static void free_task_ctx_data(struct pmu *pmu, void *task_ctx_data)
 {
 	if (pmu->task_ctx_cache && task_ctx_data)
 		kmem_cache_free(pmu->task_ctx_cache, task_ctx_data);
-	else
-		kfree(task_ctx_data);
 }
 
 static void free_ctx(struct rcu_head *head)
-- 
2.7.4



* [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (15 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 16/21] perf/x86: Remove task_ctx_size kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 19:31   ` Peter Zijlstra
  2020-06-19 14:04 ` [PATCH 18/21] x86/fpu/xstate: Support dynamic supervisor feature for LBR kan.liang
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

When saving xstate to a kernel/user XSAVE area with the XSAVE family of
instructions, the current code applies the 'full' instruction mask (-1),
which tries to XSAVE all possible features. This method relies on
hardware to trim 'all possible' down to what is enabled in the
hardware. The code works well for now. However, there will be a
problem if some features are enabled in hardware but are not suitable
to be saved into all kernel XSAVE buffers, like task->fpu, for
performance reasons.

One such example is the Last Branch Records (LBR) state. The LBR state
only contains valuable information when LBR is explicitly enabled by
the perf subsystem, and the size of an LBR state is large (808 bytes
for now). To avoid both CPU overhead and space overhead at each context
switch, the LBR state should not be saved into task->fpu like other
state components. It should be saved/restored on demand when LBR is
enabled in the perf subsystem. The current copy_xregs_to_*() helpers
would trigger a buffer overflow in such cases.

Three sites use the '-1' instruction mask which must be updated.

Two are saving/restoring the xstate to/from a kernel-allocated XSAVE
buffer and can use 'xfeatures_mask_all', which will save/restore all of
the features present in a normal task FPU buffer.

The last one saves the register state directly to a user buffer. It could
also use 'xfeatures_mask_all'. Just as it was with the '-1' argument,
any supervisor states in the mask will be filtered out by the hardware
and not saved to the buffer.  But, to be more explicit about what is
expected to be saved, use xfeatures_mask_user() for the instruction
mask.

KVM includes the header file fpu/internal.h. To avoid an 'undefined
xfeatures_mask_all' build error, xfeatures_mask_all has to be
exported.
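
For reference, the XSAVE family of instructions takes the
requested-feature bitmap in EDX:EAX, so the 64-bit mask is simply
split into two 32-bit halves, as the updated sites do:

    u64 mask  = xfeatures_mask_user();	/* or xfeatures_mask_all */
    u32 lmask = mask;			/* EAX: low 32 bits  */
    u32 hmask = mask >> 32;		/* EDX: high 32 bits */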

Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/include/asm/fpu/internal.h | 9 ++++++---
 arch/x86/kernel/fpu/xstate.c        | 1 +
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 42159f4..0388c792 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -274,7 +274,7 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
  */
 static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate)
 {
-	u64 mask = -1;
+	u64 mask = xfeatures_mask_all;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
@@ -320,7 +320,7 @@ static inline void copy_kernel_to_xregs_booting(struct xregs_state *xstate)
  */
 static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
 {
-	u64 mask = -1;
+	u64 mask = xfeatures_mask_all;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
@@ -356,6 +356,9 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
  */
 static inline int copy_xregs_to_user(struct xregs_state __user *buf)
 {
+	u64 mask = xfeatures_mask_user();
+	u32 lmask = mask;
+	u32 hmask = mask >> 32;
 	int err;
 
 	/*
@@ -367,7 +370,7 @@ static inline int copy_xregs_to_user(struct xregs_state __user *buf)
 		return -EFAULT;
 
 	stac();
-	XSTATE_OP(XSAVE, buf, -1, -1, err);
+	XSTATE_OP(XSAVE, buf, lmask, hmask, err);
 	clac();
 
 	return err;
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 587e03f..eb2e44e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -58,6 +58,7 @@ static short xsave_cpuid_features[] __initdata = {
  * XSAVE buffer, both supervisor and user xstates.
  */
 u64 xfeatures_mask_all __read_mostly;
+EXPORT_SYMBOL_GPL(xfeatures_mask_all);
 
 static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
-- 
2.7.4



* [PATCH 18/21] x86/fpu/xstate: Support dynamic supervisor feature for LBR
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (16 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 14:04 ` [PATCH 19/21] x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature kan.liang
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Last Branch Records (LBR) registers are used to log taken branches and
other control flows. In perf with call stack mode, LBR information is
used to reconstruct a call stack. To get the complete call stack, perf
has to save/restore all LBR registers during a context switch. Due to
the large number of the LBR registers, e.g., the current platform has
96 LBR registers, this process causes a high CPU overhead. To reduce
the CPU overhead during a context switch, an LBR state component that
contains all the LBR related registers is introduced in hardware. All
LBR registers can be saved/restored together using one XSAVES/XRSTORS
instruction.

However, the kernel should not save/restore the LBR state component at
each context switch, like other state components, because of the
following unique features of LBR:
- The LBR state component only contains valuable information when LBR
  is enabled in the perf subsystem, but for most of the time, LBR is
  disabled.
- The size of the LBR state component is huge. For the current
  platform, it's 808 bytes.
If the kernel saves/restores the LBR state at each context switch, for
most of the time, it is just a waste of space and cycles.

To efficiently support the LBR state component, it is desired to have:
- only context-switch the LBR when the LBR feature is enabled in perf.
- only allocate an LBR-specific XSAVE buffer on demand.
  (Besides the LBR state, a legacy region and an XSAVE header have to be
   included in the buffer as well. There is a total of (808+576) byte
   overhead for the LBR-specific XSAVE buffer. The overhead only happens
   when the perf is actively using LBRs. There is still a space-saving,
   on average, when it replaces the constant 808 bytes of overhead for
   every task, all the time on the systems that support architectural
   LBR.)
- be able to use XSAVES/XRSTORS for accessing LBR at run time.
  However, the IA32_XSS should not be adjusted at run time.
  (The XCR0 | IA32_XSS are used to determine the requested-feature
  bitmap (RFBM) of XSAVES.)

A solution, called dynamic supervisor feature, is introduced to address
this issue, which
- does not allocate a buffer in each task->fpu;
- does not save/restore a state component at each context switch;
- sets the bit corresponding to the dynamic supervisor feature in
  IA32_XSS at boot time, and avoids setting it at run time.
- dynamically allocates a specific buffer for a state component
  on demand, e.g. only allocates LBR-specific XSAVE buffer when LBR is
  enabled in perf. (Note: The buffer has to include the LBR state
  component, a legacy region and a XSAVE header space.)
  (Implemented in a later patch)
- saves/restores a state component on demand, e.g. manually invokes
  the XSAVES/XRSTORS instruction to save/restore the LBR state
  to/from the buffer when perf is active and a call stack is required.
  (Implemented in a later patch)

A new mask XFEATURE_MASK_DYNAMIC and a helper xfeatures_mask_dynamic()
are introduced to indicate the dynamic supervisor features. For
systems that support Architectural LBR, LBR is the only dynamic
supervisor feature for now. On earlier systems, no dynamic supervisor
feature is available.
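
For reference, the requested-feature bitmap (RFBM) relation mentioned
earlier can be summarized as below (per the SDM; illustration only):

    /*
     * XSAVES requested-feature bitmap:
     *   RFBM = (XCR0 | IA32_XSS) & EDX:EAX
     * The LBR bit is set in IA32_XSS once at boot time, so a later
     * XSAVES that passes EDX:EAX == XFEATURE_MASK_LBR saves only the
     * LBR state, without touching IA32_XSS at run time.
     */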

Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/include/asm/fpu/types.h  |  7 +++++++
 arch/x86/include/asm/fpu/xstate.h | 30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/fpu/xstate.c      | 15 ++++++++++-----
 3 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f098f6c..132e9cc 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -114,6 +114,12 @@ enum xfeature {
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
+	XFEATURE_RSRVD_COMP_10,
+	XFEATURE_RSRVD_COMP_11,
+	XFEATURE_RSRVD_COMP_12,
+	XFEATURE_RSRVD_COMP_13,
+	XFEATURE_RSRVD_COMP_14,
+	XFEATURE_LBR,
 
 	XFEATURE_MAX,
 };
@@ -128,6 +134,7 @@ enum xfeature {
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 422d836..040c4d4 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -36,6 +36,27 @@
 #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (0)
 
 /*
+ * A supervisor state component may not always contain valuable information,
+ * and its size may be huge. Saving/restoring such supervisor state components
+ * at each context switch can cause high CPU and space overhead, which should
+ * be avoided. Such supervisor state components should only be saved/restored
+ * on demand. The on-demand dynamic supervisor features are set in this mask.
+ *
+ * Unlike the existing supported supervisor features, a dynamic supervisor
+ * feature does not allocate a buffer in task->fpu, and the corresponding
+ * supervisor state component cannot be saved/restored at each context switch.
+ *
+ * To support a dynamic supervisor feature, a developer should follow the
+ * dos and don'ts as below:
+ * - Do dynamically allocate a buffer for the supervisor state component.
+ * - Do manually invoke the XSAVES/XRSTORS instruction to save/restore the
+ *   state component to/from the buffer.
+ * - Don't set the bit corresponding to the dynamic supervisor feature in
+ *   IA32_XSS at run time, since it has been set at boot time.
+ */
+#define XFEATURE_MASK_DYNAMIC (XFEATURE_MASK_LBR)
+
+/*
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
@@ -43,6 +64,7 @@
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
+				      XFEATURE_MASK_DYNAMIC | \
 				      XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)
 
 #ifdef CONFIG_X86_64
@@ -63,6 +85,14 @@ static inline u64 xfeatures_mask_user(void)
 	return xfeatures_mask_all & XFEATURE_MASK_USER_SUPPORTED;
 }
 
+static inline u64 xfeatures_mask_dynamic(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_ARCH_LBR))
+		return XFEATURE_MASK_DYNAMIC & ~XFEATURE_MASK_LBR;
+
+	return XFEATURE_MASK_DYNAMIC;
+}
+
 extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 
 extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index eb2e44e..58d79f1 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -234,8 +234,10 @@ void fpu__init_cpu_xstate(void)
 	/*
 	 * MSR_IA32_XSS sets supervisor states managed by XSAVES.
 	 */
-	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor());
+	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
+				     xfeatures_mask_dynamic());
+	}
 }
 
 static bool xfeature_enabled(enum xfeature xfeature)
@@ -599,7 +601,8 @@ static void check_xstate_against_struct(int nr)
 	 */
 	if ((nr < XFEATURE_YMM) ||
 	    (nr >= XFEATURE_MAX) ||
-	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
+	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
+	    ((nr >= XFEATURE_RSRVD_COMP_10) && (nr <= XFEATURE_LBR))) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
@@ -848,8 +851,10 @@ void fpu__resume_cpu(void)
 	 * Restore IA32_XSS. The same CPUID bit enumerates support
 	 * of XSAVES and MSR_IA32_XSS.
 	 */
-	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor());
+	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
+				     xfeatures_mask_dynamic());
+	}
 }
 
 /*
-- 
2.7.4



* [PATCH 19/21] x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (17 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 18/21] x86/fpu/xstate: Support dynamic supervisor feature for LBR kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 14:04 ` [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch kan.liang
  2020-06-19 14:04 ` [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read kan.liang
  20 siblings, 0 replies; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The LBR dynamic supervisor feature will be enabled in the perf
subsystem later. A new structure, several helpers, and a macro are
added as below to facilitate enabling the feature.
- Currently, the structure for each state component is maintained in
  fpu/types.h. The structure for the new LBR state component should be
  maintained in the same place, which will be used in the following
  patch.
- The perf subsystem will only need to save/restore the LBR state.
  However, the existing helpers save all supported supervisor states to
  a kernel buffer, which is unnecessary here. Two helpers are
  introduced to only save/restore requested dynamic supervisor states.
  The supervisor features in XFEATURE_MASK_SUPERVISOR_SUPPORTED and
  XFEATURE_MASK_SUPERVISOR_UNSUPPORTED mask cannot be saved/restored
  using these helpers.
- The XSAVE buffer must be 64-byte aligned. A new macro is added to
  reflect the alignment requirement.

The structure, the helpers, and the macro will be used in the following
patch.
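
A hedged usage sketch of the two helpers (the buffer and cache names
are illustrative; the real users are added in the following patch):

    struct xregs_state *xs;

    /* 64-byte aligned, zeroed buffer large enough for the LBR state */
    xs = kmem_cache_zalloc(aligned_cache, GFP_KERNEL);

    /* save only the LBR state (plus legacy region and XSAVE header) */
    copy_dynamic_supervisor_to_kernel(xs, XFEATURE_MASK_LBR);
    ...
    /* restore it later, e.g. when the task is scheduled back in */
    copy_kernel_to_dynamic_supervisor(xs, XFEATURE_MASK_LBR);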

Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/include/asm/fpu/types.h  | 19 +++++++++++
 arch/x86/include/asm/fpu/xstate.h |  5 +++
 arch/x86/kernel/fpu/xstate.c      | 72 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 96 insertions(+)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 132e9cc..975f078 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -236,6 +236,25 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 15: Architectural LBR configuration state.
+ * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
+ */
+struct arch_lbr_entry {
+	u64 lbr_from;
+	u64 lbr_to;
+	u64 lbr_info;
+};
+
+struct arch_lbr_state {
+	u64 lbr_ctl;
+	u64 lbr_depth;
+	u64 ler_from;
+	u64 ler_to;
+	u64 ler_info;
+	struct arch_lbr_entry		entries[0];
+} __packed;
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 040c4d4..636c3ef 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -21,6 +21,8 @@
 #define XSAVE_YMM_SIZE	    256
 #define XSAVE_YMM_OFFSET    (XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET)
 
+#define XSAVE_ALIGNMENT     64
+
 /* All currently supported user features */
 #define XFEATURE_MASK_USER_SUPPORTED (XFEATURE_MASK_FP | \
 				      XFEATURE_MASK_SSE | \
@@ -106,6 +108,9 @@ int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned i
 int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
 int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf);
 void copy_supervisor_to_kernel(struct xregs_state *xsave);
+void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask);
+void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask);
+
 
 /* Validate an xstate header supplied by userspace (ptrace or sigreturn) */
 int validate_user_xstate_header(const struct xstate_header *hdr);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 58d79f1..49e0347 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1352,6 +1352,78 @@ void copy_supervisor_to_kernel(struct xregs_state *xstate)
 	}
 }
 
+/**
+ * copy_dynamic_supervisor_to_kernel() - Save dynamic supervisor states to
+ *                                       an xsave area
+ * @xstate: A pointer to an xsave area
+ * @mask: Represent the dynamic supervisor features saved into the xsave area
+ *
+ * Only the dynamic supervisor states sets in the mask are saved into the xsave
+ * area (See the comment in XFEATURE_MASK_DYNAMIC for the details of dynamic
+ * supervisor feature). Besides the dynamic supervisor states, the legacy
+ * region and XSAVE header are also saved into the xsave area. The supervisor
+ * features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
+ * XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not saved.
+ *
+ * The xsave area must be 64-bytes aligned.
+ */
+void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
+{
+	u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+	u32 lmask, hmask;
+	int err;
+
+	if (WARN_ON_FPU(!boot_cpu_has(X86_FEATURE_XSAVES)))
+		return;
+
+	if (WARN_ON_FPU(!dynamic_mask))
+		return;
+
+	lmask = dynamic_mask;
+	hmask = dynamic_mask >> 32;
+
+	XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
+
+	/* Should never fault when copying to a kernel buffer */
+	WARN_ON_FPU(err);
+}
+
+/**
+ * copy_kernel_to_dynamic_supervisor() - Restore dynamic supervisor states from
+ *                                       an xsave area
+ * @xstate: A pointer to an xsave area
+ * @mask: Represent the dynamic supervisor features restored from the xsave area
+ *
+ * Only the dynamic supervisor states sets in the mask are restored from the
+ * xsave area (See the comment in XFEATURE_MASK_DYNAMIC for the details of
+ * dynamic supervisor feature). Besides the dynamic supervisor states, the
+ * legacy region and XSAVE header are also restored from the xsave area. The
+ * supervisor features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
+ * XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not restored.
+ *
+ * The xsave area must be 64-bytes aligned.
+ */
+void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask)
+{
+	u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+	u32 lmask, hmask;
+	int err;
+
+	if (WARN_ON_FPU(!boot_cpu_has(X86_FEATURE_XSAVES)))
+		return;
+
+	if (WARN_ON_FPU(!dynamic_mask))
+		return;
+
+	lmask = dynamic_mask;
+	hmask = dynamic_mask >> 32;
+
+	XSTATE_OP(XRSTORS, xstate, lmask, hmask, err);
+
+	/* Should never fault when copying from a kernel buffer */
+	WARN_ON_FPU(err);
+}
+
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 /*
  * Report the amount of time elapsed in millisecond since last AVX512
-- 
2.7.4



* [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (18 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 19/21] x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-19 19:41   ` Peter Zijlstra
  2020-06-19 14:04 ` [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read kan.liang
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In the LBR call stack mode, LBR information is used to reconstruct a
call stack. To get the complete call stack, perf has to save/restore
all LBR registers during a context switch. Due to the large number of
LBR registers, this process causes high CPU overhead. To reduce the
CPU overhead during a context switch, use the XSAVES/XRSTORS
instructions.

Every XSAVE area must follow a canonical format: the legacy region, an
XSAVE header and the extended region. Although the LBR information is
only kept in the extended region, a space for the legacy region and
XSAVE header is still required. Add a new dedicated structure for LBR
XSAVES support.

Before enabling XSAVES support, the size of the LBR state has to be
sanity checked, because:
- the size of the software structure is calculated from the maximum
LBR depth, which is enumerated by the CPUID leaf for Arch LBR. The
size of the LBR state is enumerated by the CPUID leaf for XSAVE
support of Arch LBR. If the values from the two CPUID leaves are not
consistent, it may trigger a buffer overflow. For example, a hypervisor
may inadvertently set inconsistent values in the two emulated CPUID
leaves.
- unlike other state components, the size of an LBR state depends on the
max number of LBRs, which may vary from generation to generation.

Expose the function xfeature_size() for the sanity check.
The LBR XSAVES support will be disabled if the size of the LBR state
enumerated by CPUID doesn't match the size of the software
structure.
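
For example (illustration only, assuming the structures defined earlier
in this series and the current maximum of 32 LBR entries):

    sizeof(struct arch_lbr_state)        =  5 * 8  =  40 bytes
    32 * sizeof(struct arch_lbr_entry)   = 32 * 24 = 768 bytes
    expected size of XFEATURE_LBR        =           808 bytes

which matches the 808-byte LBR state mentioned earlier. Any mismatch
against xfeature_size(XFEATURE_LBR) disables the LBR XSAVES support.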

The XSAVE instruction requires 64-byte alignment for state buffers. A
64-byte aligned kmem_cache is created for Architectural LBR.

Add dedicated lbr_save/lbr_restore functions for LBR XSAVES support,
which invokes the corresponding xstate helpers to XSAVES/XRSTORS LBR
information at the context switch when the call stack mode is enabled.
Since the XSAVES/XRSTORS instructions will eventually be invoked, the
dedicated functions are named with an '_xsaves'/'_xrstors' suffix.

Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/lbr.c       | 78 ++++++++++++++++++++++++++++++++++++---
 arch/x86/events/perf_event.h      | 18 +++++++++
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      |  2 +-
 4 files changed, 93 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 4060d3a..dc40a76 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -446,6 +446,17 @@ static void intel_pmu_arch_lbr_restore(void *ctx)
 	}
 }
 
+/*
+ * Restore the Architecture LBR state from the xsave area in the perf
+ * context data for the task via the XRSTORS instruction.
+ */
+static void intel_pmu_arch_lbr_xrstors(void *ctx)
+{
+	struct x86_perf_task_context_arch_lbr_xsave *task_ctx = ctx;
+
+	copy_kernel_to_dynamic_supervisor(&task_ctx->xsave, XFEATURE_MASK_LBR);
+}
+
 static bool lbr_is_reset_in_cstate(void *ctx)
 {
 	if (x86_pmu.arch_lbr)
@@ -523,6 +534,17 @@ static void intel_pmu_arch_lbr_save(void *ctx)
 		entries[x86_pmu.lbr_nr - 1].lbr_from = 0;
 }
 
+/*
+ * Save the Architecture LBR state to the xsave area in the perf
+ * context data for the task via the XSAVES instruction.
+ */
+static void intel_pmu_arch_lbr_xsaves(void *ctx)
+{
+	struct x86_perf_task_context_arch_lbr_xsave *task_ctx = ctx;
+
+	copy_dynamic_supervisor_to_kernel(&task_ctx->xsave, XFEATURE_MASK_LBR);
+}
+
 static void __intel_pmu_lbr_save(void *ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1640,9 +1662,37 @@ void intel_pmu_lbr_init_knl(void)
 		x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
 
+/*
+ * LBR state size is variable based on the max number of registers.
+ * This calculates the expected state size, which should match
+ * what the hardware enumerates for the size of XFEATURE_LBR.
+ */
+static inline unsigned int get_lbr_state_size(void)
+{
+	return sizeof(struct arch_lbr_state) +
+	       x86_pmu.lbr_nr * sizeof(struct arch_lbr_entry);
+}
+
+static bool is_arch_lbr_xsave_available(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return false;
+
+	/*
+	 * Check the LBR state with the corresponding software structure.
+	 * Disable LBR XSAVES support if the size doesn't match.
+	 */
+	if (WARN_ON(xfeature_size(XFEATURE_LBR) != get_lbr_state_size()))
+		return false;
+
+	return true;
+}
+
 void __init intel_pmu_arch_lbr_init(void)
 {
+	struct pmu *pmu = x86_get_pmu();
 	unsigned int unused_edx;
+	bool arch_lbr_xsave;
 	size_t size;
 	u64 lbr_nr;
 
@@ -1660,9 +1710,21 @@ void __init intel_pmu_arch_lbr_init(void)
 
 	x86_pmu.lbr_nr = lbr_nr;
 
-	size = sizeof(struct x86_perf_task_context_arch_lbr) +
-	       lbr_nr * sizeof(struct x86_perf_arch_lbr_entry);
-	x86_get_pmu()->task_ctx_cache = create_lbr_kmem_cache(size, 0);
+	arch_lbr_xsave = is_arch_lbr_xsave_available();
+	if (arch_lbr_xsave) {
+		size = sizeof(struct x86_perf_task_context_arch_lbr_xsave) +
+		       get_lbr_state_size();
+		pmu->task_ctx_cache = create_lbr_kmem_cache(size,
+							    XSAVE_ALIGNMENT);
+	}
+
+	if (!pmu->task_ctx_cache) {
+		arch_lbr_xsave = false;
+
+		size = sizeof(struct x86_perf_task_context_arch_lbr) +
+		       lbr_nr * sizeof(struct x86_perf_arch_lbr_entry);
+		pmu->task_ctx_cache = create_lbr_kmem_cache(size, 0);
+	}
 
 	x86_pmu.lbr_from = MSR_ARCH_LBR_FROM_0;
 	x86_pmu.lbr_to = MSR_ARCH_LBR_TO_0;
@@ -1696,8 +1758,14 @@ void __init intel_pmu_arch_lbr_init(void)
 	x86_pmu.lbr_disable = intel_pmu_arch_lbr_disable;
 	x86_pmu.lbr_reset = intel_pmu_arch_lbr_reset;
 	x86_pmu.lbr_read = intel_pmu_arch_lbr_read;
-	x86_pmu.lbr_save = intel_pmu_arch_lbr_save;
-	x86_pmu.lbr_restore = intel_pmu_arch_lbr_restore;
+	if (arch_lbr_xsave) {
+		x86_pmu.lbr_save = intel_pmu_arch_lbr_xsaves;
+		x86_pmu.lbr_restore = intel_pmu_arch_lbr_xrstors;
+		pr_cont("XSAVE ");
+	} else {
+		x86_pmu.lbr_save = intel_pmu_arch_lbr_save;
+		x86_pmu.lbr_restore = intel_pmu_arch_lbr_restore;
+	}
 
 	x86_pmu.arch_lbr = true;
 	pr_cont("Architectural LBR, ");
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5cebc75..812980e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -812,6 +812,24 @@ struct x86_perf_task_context_arch_lbr {
 	struct x86_perf_arch_lbr_entry  entries[0];
 };
 
+/*
+ * The structure is dynamically allocated. The size of the LBR state may vary
+ * based on the number of LBR registers.
+ *
+ * Do not put anything after the LBR state.
+ */
+struct x86_perf_task_context_arch_lbr_xsave {
+	struct x86_perf_task_context_opt	opt;
+	union {
+		struct xregs_state		xsave;
+		struct {
+			struct fxregs_state	i387;
+			struct xstate_header	header;
+			struct arch_lbr_state	lbr;
+		};
+	};
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 636c3ef..1559554 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -103,6 +103,7 @@ extern void __init update_regset_xstate_info(unsigned int size,
 void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
+int xfeature_size(int xfeature_nr);
 int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int offset, unsigned int size);
 int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned int offset, unsigned int size);
 int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 49e0347..9c0541d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -489,7 +489,7 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
 	return ebx;
 }
 
-static int xfeature_size(int xfeature_nr)
+int xfeature_size(int xfeature_nr)
 {
 	u32 eax, ebx, ecx, edx;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read
  2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
                   ` (19 preceding siblings ...)
  2020-06-19 14:04 ` [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch kan.liang
@ 2020-06-19 14:04 ` kan.liang
  2020-06-22 18:49   ` Cyrill Gorcunov
  20 siblings, 1 reply; 40+ messages in thread
From: kan.liang @ 2020-06-19 14:04 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, x86, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, namhyung, dave.hansen,
	yu-cheng.yu, bigeasy, gorcunov, hpa, alexey.budankov, eranian,
	ak, like.xu, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Reading LBR registers in a perf NMI handler for a non-PEBS event
causes a high overhead because the number of LBR registers is huge.
To reduce the overhead, use the XSAVES instruction to read the LBR
registers instead of reading the MSRs one by one.

The XSAVES buffer used for LBR read has to be per-CPU because the NMI
handler invokes lbr_read(). The existing task_ctx_data buffer cannot be
used because it is per-task and is only allocated in the LBR call stack
mode. A new lbr_xsave pointer is introduced in cpu_hw_events as the
XSAVES buffer for LBR read.

The XSAVES buffer should be allocated only when LBR is used by a
non-PEBS event on the CPU, because the total size of the lbr_xsave
buffer is not small (~1.4KB).
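
As a rough, back-of-the-envelope estimate (assuming, for example, 32 LBR
entries of three 8-byte fields each, plus the mandatory 512-byte legacy
region and 64-byte XSAVE header):

    512 (legacy) + 64 (header) + 32 * 3 * 8 (LBR entries) = 1344 bytes

which, together with the fixed fields at the start of the LBR state
component, is consistent with the ~1.4KB figure above.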

The XSAVES buffer is allocated when a non-PEBS event is added, but it
is lazily released in x86_release_hardware() when perf releases the
entire PMU hardware resource, because perf may schedule the event
frequently, e.g. under a high context-switch rate. The lazy release
method reduces the overhead of frequently allocating/freeing the buffer.

If the lbr_xsave buffer cannot be allocated, fall back to the normal
Arch LBR lbr_read().

Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       |  1 +
 arch/x86/events/intel/lbr.c  | 58 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h |  7 ++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index aeb6e6d..3339347 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -359,6 +359,7 @@ void x86_release_hardware(void)
 	if (atomic_dec_and_mutex_lock(&pmc_refcount, &pmc_reserve_mutex)) {
 		release_pmc_hardware();
 		release_ds_buffers();
+		release_lbr_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index dc40a76..4b0042f 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -624,6 +624,7 @@ static inline bool branch_user_callstack(unsigned br_sel)
 
 void intel_pmu_lbr_add(struct perf_event *event)
 {
+	struct kmem_cache *kmem_cache = event->pmu->task_ctx_cache;
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	if (!x86_pmu.lbr_nr)
@@ -658,6 +659,28 @@ void intel_pmu_lbr_add(struct perf_event *event)
 	perf_sched_cb_inc(event->ctx->pmu);
 	if (!cpuc->lbr_users++ && !event->total_time_running)
 		intel_pmu_lbr_reset();
+
+	if (x86_pmu.arch_lbr && kmem_cache && !cpuc->lbr_xsave &&
+	    (cpuc->lbr_users != cpuc->lbr_pebs_users))
+		cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+}
+
+void release_lbr_buffers(void)
+{
+	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
+	struct cpu_hw_events *cpuc;
+	int cpu;
+
+	if (!x86_pmu.arch_lbr)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
+		if (kmem_cache && cpuc->lbr_xsave) {
+			kmem_cache_free(kmem_cache, cpuc->lbr_xsave);
+			cpuc->lbr_xsave = NULL;
+		}
+	}
 }
 
 void intel_pmu_lbr_del(struct perf_event *event)
@@ -909,6 +932,38 @@ static void intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc)
 	cpuc->lbr_stack.nr = i;
 }
 
+static void intel_pmu_arch_lbr_read_xsave(struct cpu_hw_events *cpuc)
+{
+	struct x86_perf_task_context_arch_lbr_xsave *xsave = cpuc->lbr_xsave;
+	struct arch_lbr_entry *lbr;
+	int i;
+
+	if (!xsave)
+		goto rollback;
+
+	copy_dynamic_supervisor_to_kernel(&xsave->xsave, XFEATURE_MASK_LBR);
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr = &xsave->lbr.entries[i];
+
+		/*
+		 * Read LBR entries until invalid entry (0s) is detected.
+		 */
+		if (!lbr->lbr_from)
+			break;
+
+		__intel_pmu_arch_lbr_read(cpuc, i, lbr->lbr_from,
+					  lbr->lbr_to, lbr->lbr_info);
+	}
+
+	cpuc->lbr_stack.nr = i;
+
+	return;
+
+rollback:
+	intel_pmu_arch_lbr_read(cpuc);
+}
+
 void intel_pmu_lbr_read(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1757,14 +1812,15 @@ void __init intel_pmu_arch_lbr_init(void)
 	x86_pmu.lbr_enable = intel_pmu_arch_lbr_enable;
 	x86_pmu.lbr_disable = intel_pmu_arch_lbr_disable;
 	x86_pmu.lbr_reset = intel_pmu_arch_lbr_reset;
-	x86_pmu.lbr_read = intel_pmu_arch_lbr_read;
 	if (arch_lbr_xsave) {
 		x86_pmu.lbr_save = intel_pmu_arch_lbr_xsaves;
 		x86_pmu.lbr_restore = intel_pmu_arch_lbr_xrstors;
+		x86_pmu.lbr_read = intel_pmu_arch_lbr_read_xsave;
 		pr_cont("XSAVE ");
 	} else {
 		x86_pmu.lbr_save = intel_pmu_arch_lbr_save;
 		x86_pmu.lbr_restore = intel_pmu_arch_lbr_restore;
+		x86_pmu.lbr_read = intel_pmu_arch_lbr_read;
 	}
 
 	x86_pmu.arch_lbr = true;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 812980e..b9dfc55 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -251,6 +251,7 @@ struct cpu_hw_events {
 	u64				br_sel;
 	void				*last_task_ctx;
 	int				last_log_id;
+	void				*lbr_xsave;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -1106,6 +1107,8 @@ void release_ds_buffers(void);
 
 void reserve_ds_buffers(void);
 
+void release_lbr_buffers(void);
+
 extern struct event_constraint bts_constraint;
 
 void intel_pmu_enable_bts(u64 config);
@@ -1250,6 +1253,10 @@ static inline void release_ds_buffers(void)
 {
 }
 
+static inline void release_lbr_buffers(void)
+{
+}
+
 static inline int intel_pmu_init(void)
 {
 	return 0;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/21] perf/x86: Expose CPUID enumeration bits for arch LBR
  2020-06-19 14:03 ` [PATCH 09/21] perf/x86: Expose CPUID enumeration bits for arch LBR kan.liang
@ 2020-06-19 18:31   ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 18:31 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 07:03:57AM -0700, kan.liang@linux.intel.com wrote:

> +	union {
> +		struct {
> +			/* Supported LBR depth values */
> +			unsigned int	arch_lbr_depth_mask:8;
> +
> +			unsigned int	reserved:22;
> +
> +			/* Deep C-state Reset */
> +			unsigned int	arch_lbr_deep_c_reset:1;
> +
> +			/* IP values contain LIP */
> +			unsigned int	arch_lbr_lip:1;
> +		};
> +		unsigned int		arch_lbr_eax;
> +	};
> +	union {
> +		struct {
> +			/* CPL Filtering Supported */
> +			unsigned int    arch_lbr_cpl:1;
> +
> +			/* Branch Filtering Supported */
> +			unsigned int    arch_lbr_filter:1;
> +
> +			/* Call-stack Mode Supported */
> +			unsigned int    arch_lbr_call_stack:1;
> +		};
> +		unsigned int            arch_lbr_ebx;
> +	};
> +	union {
> +		struct {
> +			/* Mispredict Bit Supported */
> +			unsigned int    arch_lbr_mispred:1;
> +
> +			/* Timed LBRs Supported */
> +			unsigned int    arch_lbr_timed_lbr:1;
> +
> +			/* Branch Type Field Supported */
> +			unsigned int    arch_lbr_br_type:1;
> +		};
> +		unsigned int            arch_lbr_ecx;
> +	};

Please, union cpuid28_e[abc]x in asm/perf_event.h right along with the
existing cpuid10_e*x unions.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL
  2020-06-19 14:03 ` [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL kan.liang
@ 2020-06-19 18:40   ` Peter Zijlstra
  2020-06-19 19:15     ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 18:40 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 07:03:59AM -0700, kan.liang@linux.intel.com wrote:
> -	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
> +	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map || x86_pmu.lbr_ctl_map) {

> +	union {
> +		u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
> +		u64		lbr_ctl_mask;		   /* LBR_CTL valid bits */
> +	};

This makes absolutely no sense. There is hoping the compiler realizes
how stupid that is and fixes it for you, but shees.

Please, just keep the old name.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR
  2020-06-19 14:04 ` [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR kan.liang
@ 2020-06-19 19:08   ` Peter Zijlstra
  2020-06-19 19:40     ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 19:08 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 07:04:00AM -0700, kan.liang@linux.intel.com wrote:

> +static void intel_pmu_arch_lbr_enable(bool pmi)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	u64 debugctl, lbr_ctl = 0, orig_debugctl;
> +
> +	if (pmi)
> +		return;
> +
> +	if (cpuc->lbr_ctl)
> +		lbr_ctl = cpuc->lbr_ctl->config & x86_pmu.lbr_ctl_mask;
> +	/*
> +	 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
> +	 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
> +	 * may be missed, that can lead to confusing results.
> +	 */
> +	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +	orig_debugctl = debugctl;
> +	if (lbr_ctl & ARCH_LBR_CALL_STACK)
> +		debugctl &= ~DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
> +	else
> +		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
> +	if (orig_debugctl != debugctl)
> +		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +
> +	wrmsrl(MSR_ARCH_LBR_CTL, lbr_ctl | ARCH_LBR_CTL_LBREN);
> +}

This is nearly a duplicate of the old one, surely we can do better?

> +static void intel_pmu_arch_lbr_restore(void *ctx)
> +{
> +	struct x86_perf_task_context_arch_lbr *task_ctx = ctx;
> +	struct x86_perf_arch_lbr_entry *entries = task_ctx->entries;
> +	int i;
> +
> +	/* Fast reset the LBRs before restore if the call stack is not full. */
> +	if (!entries[x86_pmu.lbr_nr - 1].lbr_from)
> +		intel_pmu_arch_lbr_reset();
> +
> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +		if (!entries[i].lbr_from)
> +			break;
> +		wrlbr_from(i, entries[i].lbr_from);
> +		wrlbr_to(i, entries[i].lbr_to);
> +		wrmsrl(MSR_ARCH_LBR_INFO_0 + i, entries[i].lbr_info);
> +	}
> +}

This too looks very much like the old one.

> +static void intel_pmu_arch_lbr_save(void *ctx)
> +{
> +	struct x86_perf_task_context_arch_lbr *task_ctx = ctx;
> +	struct x86_perf_arch_lbr_entry *entries = task_ctx->entries;
> +	int i;
> +
> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +		entries[i].lbr_from = rdlbr_from(i);
> +		/* Only save valid branches. */
> +		if (!entries[i].lbr_from)
> +			break;
> +		entries[i].lbr_to = rdlbr_to(i);
> +		rdmsrl(MSR_ARCH_LBR_INFO_0 + i, entries[i].lbr_info);
> +	}
> +
> +	/* LBR call stack is not full. Reset is required in restore. */
> +	if (i < x86_pmu.lbr_nr)
> +		entries[x86_pmu.lbr_nr - 1].lbr_from = 0;
> +}

And again..

> +static void __intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc, int index,
> +				      u64 from, u64 to, u64 info)
> +{
> +	u64 mis = 0, pred = 0, in_tx = 0, abort = 0, type = 0;
> +	u32 br_type, to_plm;
> +	u16 cycles = 0;
> +
> +	if (x86_pmu.arch_lbr_mispred) {
> +		mis = !!(info & ARCH_LBR_INFO_MISPRED);
> +		pred = !mis;
> +	}
> +	in_tx = !!(info & ARCH_LBR_INFO_IN_TSX);
> +	abort = !!(info & ARCH_LBR_INFO_TSX_ABORT);
> +	if (x86_pmu.arch_lbr_timed_lbr &&
> +	    (info & ARCH_LBR_INFO_CYC_CNT_VALID))
> +		cycles = (info & ARCH_LBR_INFO_CYC_CNT);
> +
> +	/*
> +	 * Parse the branch type recorded in LBR_x_INFO MSR.
> +	 * Doesn't support OTHER_BRANCH decoding for now.
> +	 * OTHER_BRANCH branch type still rely on software decoding.
> +	 */
> +	if (x86_pmu.arch_lbr_br_type) {
> +		br_type = (info & ARCH_LBR_INFO_BR_TYPE) >> ARCH_LBR_INFO_BR_TYPE_OFFSET;
> +
> +		if (br_type <= ARCH_LBR_BR_TYPE_KNOWN_MAX) {
> +			to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
> +			type = arch_lbr_br_type_map[br_type] | to_plm;
> +		}
> +	}
> +
> +	cpuc->lbr_entries[index].from		 = from;
> +	cpuc->lbr_entries[index].to		 = to;
> +	cpuc->lbr_entries[index].mispred	 = mis;
> +	cpuc->lbr_entries[index].predicted	 = pred;
> +	cpuc->lbr_entries[index].in_tx		 = in_tx;
> +	cpuc->lbr_entries[index].abort		 = abort;
> +	cpuc->lbr_entries[index].cycles		 = cycles;
> +	cpuc->lbr_entries[index].type		 = type;
> +	cpuc->lbr_entries[index].reserved	 = 0;
> +}
> +
> +static void intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc)
> +{
> +	u64 from, to, info;
> +	int i;
> +
> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +		from = rdlbr_from(i);
> +		to   = rdlbr_to(i);
> +
> +		/*
> +		 * Read LBR entries until invalid entry (0s) is detected.
> +		 */
> +		if (!from)
> +			break;
> +
> +		rdmsrl(MSR_ARCH_LBR_INFO_0 + i, info);
> +
> +		__intel_pmu_arch_lbr_read(cpuc, i, from, to, info);
> +	}
> +
> +	cpuc->lbr_stack.nr = i;
> +}

> +static void intel_pmu_store_pebs_arch_lbrs(struct pebs_lbr *pebs_lbr,
> +					   struct cpu_hw_events *cpuc)
> +{
> +	struct pebs_lbr_entry *lbr;
> +	int i;
> +
> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +		lbr = &pebs_lbr->lbr[i];
> +
> +		/*
> +		 * Read LBR entries until invalid entry (0s) is detected.
> +		 */
> +		if (!lbr->from)
> +			break;
> +
> +		__intel_pmu_arch_lbr_read(cpuc, i, lbr->from,
> +					  lbr->to, lbr->info);
> +	}
> +
> +	cpuc->lbr_stack.nr = i;
> +	intel_pmu_lbr_filter(cpuc);
> +}

Unless I'm reading cross-eyed again, that too is very similar to what we
already had.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/21] x86/msr-index: Add bunch of MSRs for Arch LBR
  2020-06-19 14:03 ` [PATCH 08/21] x86/msr-index: Add bunch of MSRs for Arch LBR kan.liang
@ 2020-06-19 19:11   ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 19:11 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 07:03:56AM -0700, kan.liang@linux.intel.com wrote:

> +#define ARCH_LBR_INFO_MISPRED		BIT_ULL(63)
> +#define ARCH_LBR_INFO_IN_TSX		BIT_ULL(62)
> +#define ARCH_LBR_INFO_TSX_ABORT		BIT_ULL(61)

That's identical to what we already have.

> +#define ARCH_LBR_INFO_CYC_CNT_VALID	BIT_ULL(60)

If you call that LBR_INFO_CYC_VALID or something, then we good there.

> +#define ARCH_LBR_INFO_BR_TYPE_OFFSET	56
> +#define ARCH_LBR_INFO_BR_TYPE		(0xfull << ARCH_LBR_INFO_BR_TYPE_OFFSET)

Same

> +#define ARCH_LBR_INFO_CYC_CNT		0xffff

And we already have that in LBR_INFO_CYCLES.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL
  2020-06-19 18:40   ` Peter Zijlstra
@ 2020-06-19 19:15     ` Liang, Kan
  2020-06-19 19:22       ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Liang, Kan @ 2020-06-19 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin



On 6/19/2020 2:40 PM, Peter Zijlstra wrote:
> On Fri, Jun 19, 2020 at 07:03:59AM -0700, kan.liang@linux.intel.com wrote:
>> -	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
>> +	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map || x86_pmu.lbr_ctl_map) {
> 
>> +	union {
>> +		u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
>> +		u64		lbr_ctl_mask;		   /* LBR_CTL valid bits */
>> +	};
> 
> This makes absolutely no sense. There is hoping the compiler realizes
> how stupid that is and fixes it for you, but shees.
> 

The lbr_ctl_map and the lbr_ctl_mask are two different things.

The lbr_ctl_map stores the mapping from PERF_SAMPLE_BRANCH_* to the 
corresponding filtering bits in the LBR_CTL MSR. It is used to replace 
the old lbr_sel_map. The mapping information in the old lbr_sel_map is 
hard-coded and has a const type. But for arch LBR, the LBR filtering 
capabilities are enumerated from CPUID, so we should not hard-code the 
mapping. That's why I added a new variable, lbr_ctl_map.

  	const int	*lbr_sel_map;		   /* lbr_select mappings */
+	int		*lbr_ctl_map;		   /* LBR_CTL mappings */


I think we cannot reuse the old lbr_sel_map for the lbr_ctl_map.
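
For what it's worth, a minimal sketch of the idea (the array and helper
names below are made up for illustration): populate the map from the
CPUID-enumerated capabilities instead of a hard-coded const table.

static int arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_MAX_SHIFT];

static void __init arch_lbr_fill_ctl_map(void)
{
	/* Only fill in filters the CPU actually enumerates. */
	if (x86_pmu.arch_lbr_call_stack)
		arch_lbr_ctl_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] =
						ARCH_LBR_CALL_STACK;

	x86_pmu.lbr_ctl_map = arch_lbr_ctl_map;
}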

Thanks,
Kan

> Please, just keep the old name.
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL
  2020-06-19 19:15     ` Liang, Kan
@ 2020-06-19 19:22       ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 19:22 UTC (permalink / raw)
  To: Liang, Kan
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 03:15:09PM -0400, Liang, Kan wrote:
> 
> 
> On 6/19/2020 2:40 PM, Peter Zijlstra wrote:
> > On Fri, Jun 19, 2020 at 07:03:59AM -0700, kan.liang@linux.intel.com wrote:
> > > -	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
> > > +	if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map || x86_pmu.lbr_ctl_map) {
> > 
> > > +	union {
> > > +		u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
> > > +		u64		lbr_ctl_mask;		   /* LBR_CTL valid bits */
> > > +	};
> > 
> > This makes absolutely no sense. There is hoping the compiler realizes
> > how stupid that is and fixes it for you, but shees.
> > 
> 
> The lbr_ctl_map and the lbr_ctl_mask are two different things.
> 
> The lbr_ctl_map stores the mapping from PERF_SAMPLE_BRANCH_* to the
> corresponding filtering bits in LBR_CTL MSR. It is used to replace the old
> lbr_sel_map. The mapping information in the old lbr_sel_map is hard coded,
> and has a const type. But for arch LBR, the LBR filtering capabilities are
> enumerated from CPUID. We should not hard code the mapping. So I add a new
> variable lbr_ctl_map.
> 
>  	const int	*lbr_sel_map;		   /* lbr_select mappings */
> +	int		*lbr_ctl_map;		   /* LBR_CTL mappings */
> 
> 
> I think we cannot reuse the old lbr_sel_map for the lbr_ctl_map.

Of course you can, you just did it, they're the exact same variable, you
just got confused with all the naming nonsense. You then got further
confused and ended up writing code that checked if a variable was not 0
twice, just to make sure.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-19 14:04 ` [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask kan.liang
@ 2020-06-19 19:31   ` Peter Zijlstra
  2020-06-22 14:52     ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 19:31 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 07:04:05AM -0700, kan.liang@linux.intel.com wrote:

> KVM includes the header file fpu/internal.h. To avoid 'undefined
> xfeatures_mask_all' compiling issue, xfeatures_mask_all has to be
> exported.

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 587e03f..eb2e44e 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -58,6 +58,7 @@ static short xsave_cpuid_features[] __initdata = {
>   * XSAVE buffer, both supervisor and user xstates.
>   */
>  u64 xfeatures_mask_all __read_mostly;
> +EXPORT_SYMBOL_GPL(xfeatures_mask_all);

*groan*...

AFAICT KVM doesn't actually use any of those functions, can't we have
something like BUILD_KVM (like BUILD_VDSO) and exclude those functions
from the KVM build?

I so detest exporting random crap because kvm..

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR
  2020-06-19 19:08   ` Peter Zijlstra
@ 2020-06-19 19:40     ` Liang, Kan
  0 siblings, 0 replies; 40+ messages in thread
From: Liang, Kan @ 2020-06-19 19:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin



On 6/19/2020 3:08 PM, Peter Zijlstra wrote:
> On Fri, Jun 19, 2020 at 07:04:00AM -0700, kan.liang@linux.intel.com wrote:
> 
>> +static void intel_pmu_arch_lbr_enable(bool pmi)
>> +{
>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> +	u64 debugctl, lbr_ctl = 0, orig_debugctl;
>> +
>> +	if (pmi)
>> +		return;
>> +
>> +	if (cpuc->lbr_ctl)
>> +		lbr_ctl = cpuc->lbr_ctl->config & x86_pmu.lbr_ctl_mask;
>> +	/*
>> +	 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
>> +	 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
>> +	 * may be missed, that can lead to confusing results.
>> +	 */
>> +	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>> +	orig_debugctl = debugctl;
>> +	if (lbr_ctl & ARCH_LBR_CALL_STACK)
>> +		debugctl &= ~DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
>> +	else
>> +		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
>> +	if (orig_debugctl != debugctl)
>> +		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>> +
>> +	wrmsrl(MSR_ARCH_LBR_CTL, lbr_ctl | ARCH_LBR_CTL_LBREN);
>> +}
> 
> This is nearly a duplicate of the old one, surely we can do better?

It's similar, but we have to deal with different MSRs and bits.

> 
>> +static void intel_pmu_arch_lbr_restore(void *ctx)
>> +{
>> +	struct x86_perf_task_context_arch_lbr *task_ctx = ctx;
>> +	struct x86_perf_arch_lbr_entry *entries = task_ctx->entries;
>> +	int i;
>> +
>> +	/* Fast reset the LBRs before restore if the call stack is not full. */
>> +	if (!entries[x86_pmu.lbr_nr - 1].lbr_from)
>> +		intel_pmu_arch_lbr_reset();
>> +
>> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
>> +		if (!entries[i].lbr_from)
>> +			break;
>> +		wrlbr_from(i, entries[i].lbr_from);
>> +		wrlbr_to(i, entries[i].lbr_to);
>> +		wrmsrl(MSR_ARCH_LBR_INFO_0 + i, entries[i].lbr_info);
>> +	}
>> +}
> 
> This too looks very much like the old one.

The difference is the reset part.
On previous platforms, we restore the saved LBRs first, then reset the 
unsaved LBR MSRs one by one.
Now, for Arch LBR, we have a fast reset method. We reset all LBRs first 
(if we know there are unsaved items), then restore the saved LBRs.
That improves performance for applications with a short call stack.

> 
>> +static void intel_pmu_arch_lbr_save(void *ctx)
>> +{
>> +	struct x86_perf_task_context_arch_lbr *task_ctx = ctx;
>> +	struct x86_perf_arch_lbr_entry *entries = task_ctx->entries;
>> +	int i;
>> +
>> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
>> +		entries[i].lbr_from = rdlbr_from(i);
>> +		/* Only save valid branches. */
>> +		if (!entries[i].lbr_from)
>> +			break;
>> +		entries[i].lbr_to = rdlbr_to(i);
>> +		rdmsrl(MSR_ARCH_LBR_INFO_0 + i, entries[i].lbr_info);
>> +	}
>> +
>> +	/* LBR call stack is not full. Reset is required in restore. */
>> +	if (i < x86_pmu.lbr_nr)
>> +		entries[x86_pmu.lbr_nr - 1].lbr_from = 0;
>> +}
> 
> And again..
> 
>> +static void __intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc, int index,
>> +				      u64 from, u64 to, u64 info)
>> +{
>> +	u64 mis = 0, pred = 0, in_tx = 0, abort = 0, type = 0;
>> +	u32 br_type, to_plm;
>> +	u16 cycles = 0;
>> +
>> +	if (x86_pmu.arch_lbr_mispred) {
>> +		mis = !!(info & ARCH_LBR_INFO_MISPRED);
>> +		pred = !mis;
>> +	}
>> +	in_tx = !!(info & ARCH_LBR_INFO_IN_TSX);
>> +	abort = !!(info & ARCH_LBR_INFO_TSX_ABORT);
>> +	if (x86_pmu.arch_lbr_timed_lbr &&
>> +	    (info & ARCH_LBR_INFO_CYC_CNT_VALID))
>> +		cycles = (info & ARCH_LBR_INFO_CYC_CNT);
>> +
>> +	/*
>> +	 * Parse the branch type recorded in LBR_x_INFO MSR.
>> +	 * Doesn't support OTHER_BRANCH decoding for now.
>> +	 * OTHER_BRANCH branch type still rely on software decoding.
>> +	 */
>> +	if (x86_pmu.arch_lbr_br_type) {
>> +		br_type = (info & ARCH_LBR_INFO_BR_TYPE) >> ARCH_LBR_INFO_BR_TYPE_OFFSET;
>> +
>> +		if (br_type <= ARCH_LBR_BR_TYPE_KNOWN_MAX) {
>> +			to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
>> +			type = arch_lbr_br_type_map[br_type] | to_plm;
>> +		}
>> +	}
>> +
>> +	cpuc->lbr_entries[index].from		 = from;
>> +	cpuc->lbr_entries[index].to		 = to;
>> +	cpuc->lbr_entries[index].mispred	 = mis;
>> +	cpuc->lbr_entries[index].predicted	 = pred;
>> +	cpuc->lbr_entries[index].in_tx		 = in_tx;
>> +	cpuc->lbr_entries[index].abort		 = abort;
>> +	cpuc->lbr_entries[index].cycles		 = cycles;
>> +	cpuc->lbr_entries[index].type		 = type;
>> +	cpuc->lbr_entries[index].reserved	 = 0;
>> +}
>> +
>> +static void intel_pmu_arch_lbr_read(struct cpu_hw_events *cpuc)
>> +{
>> +	u64 from, to, info;
>> +	int i;
>> +
>> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
>> +		from = rdlbr_from(i);
>> +		to   = rdlbr_to(i);
>> +
>> +		/*
>> +		 * Read LBR entries until invalid entry (0s) is detected.
>> +		 */
>> +		if (!from)
>> +			break;
>> +
>> +		rdmsrl(MSR_ARCH_LBR_INFO_0 + i, info);
>> +
>> +		__intel_pmu_arch_lbr_read(cpuc, i, from, to, info);
>> +	}
>> +
>> +	cpuc->lbr_stack.nr = i;
>> +}
> 
>> +static void intel_pmu_store_pebs_arch_lbrs(struct pebs_lbr *pebs_lbr,
>> +					   struct cpu_hw_events *cpuc)
>> +{
>> +	struct pebs_lbr_entry *lbr;
>> +	int i;
>> +
>> +	for (i = 0; i < x86_pmu.lbr_nr; i++) {
>> +		lbr = &pebs_lbr->lbr[i];
>> +
>> +		/*
>> +		 * Read LBR entries until invalid entry (0s) is detected.
>> +		 */
>> +		if (!lbr->from)
>> +			break;
>> +
>> +		__intel_pmu_arch_lbr_read(cpuc, i, lbr->from,
>> +					  lbr->to, lbr->info);
>> +	}
>> +
>> +	cpuc->lbr_stack.nr = i;
>> +	intel_pmu_lbr_filter(cpuc);
>> +}
> 
> Unless I'm reading cross-eyed again, that too is very similar to what we
> already had.
> 

I will try to factor out as much common code from these functions as 
possible.


Thanks,
Kan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch
  2020-06-19 14:04 ` [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch kan.liang
@ 2020-06-19 19:41   ` Peter Zijlstra
  2020-06-19 22:28     ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2020-06-19 19:41 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin

On Fri, Jun 19, 2020 at 07:04:08AM -0700, kan.liang@linux.intel.com wrote:
> The XSAVE instruction requires 64-byte alignment for state buffers. A
> 64-byte aligned kmem_cache is created for architecture LBR.

> +		pmu->task_ctx_cache = create_lbr_kmem_cache(size,
> +							    XSAVE_ALIGNMENT);

> +struct x86_perf_task_context_arch_lbr_xsave {
> +	struct x86_perf_task_context_opt	opt;
> +	union {
> +		struct xregs_state		xsave;

Due to x86_perf_task_context_opt, what guarantees you're actually at the
required alignment here?

> +		struct {
> +			struct fxregs_state	i387;
> +			struct xstate_header	header;
> +			struct arch_lbr_state	lbr;
> +		};
> +	};
> +};

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch
  2020-06-19 19:41   ` Peter Zijlstra
@ 2020-06-19 22:28     ` Liang, Kan
  0 siblings, 0 replies; 40+ messages in thread
From: Liang, Kan @ 2020-06-19 22:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin



On 6/19/2020 3:41 PM, Peter Zijlstra wrote:
> On Fri, Jun 19, 2020 at 07:04:08AM -0700, kan.liang@linux.intel.com wrote:
>> The XSAVE instruction requires 64-byte alignment for state buffers. A
>> 64-byte aligned kmem_cache is created for architecture LBR.
> 
>> +		pmu->task_ctx_cache = create_lbr_kmem_cache(size,
>> +							    XSAVE_ALIGNMENT);
> 
>> +struct x86_perf_task_context_arch_lbr_xsave {
>> +	struct x86_perf_task_context_opt	opt;
>> +	union {
>> +		struct xregs_state		xsave;
> 
> Due to x86_perf_task_context_opt, what guarantees you're actually at the
> required alignment here?

Now it relies on the compiler. The struct xregs_state has the 
'aligned(64)' attribute applied.
I think we probably need padding to get rid of the dependency on the 
compiler.

+	union {
+		struct x86_perf_task_context_opt	opt;
+		u8 padding[64];
+	};
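
For reference, a sketch of the full structure with that padding
(illustrative only, and it assumes sizeof(struct
x86_perf_task_context_opt) <= 64): the xsave member then sits at a fixed
64-byte offset, so a 64-byte aligned allocation from the kmem_cache
keeps it aligned without relying on the compiler honoring the
aligned(64) attribute inside this struct.

struct x86_perf_task_context_arch_lbr_xsave {
	union {
		struct x86_perf_task_context_opt	opt;
		u8					padding[64];
	};
	union {
		struct xregs_state		xsave;
		struct {
			struct fxregs_state	i387;
			struct xstate_header	header;
			struct arch_lbr_state	lbr;
		};
	};
};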

Thanks,
Kan

> 
>> +		struct {
>> +			struct fxregs_state	i387;
>> +			struct xstate_header	header;
>> +			struct arch_lbr_state	lbr;
>> +		};
>> +	};
>> +};

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-19 19:31   ` Peter Zijlstra
@ 2020-06-22 14:52     ` Liang, Kan
  2020-06-22 15:02       ` Dave Hansen
  0 siblings, 1 reply; 40+ messages in thread
From: Liang, Kan @ 2020-06-22 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, gorcunov, hpa, alexey.budankov, eranian, ak, like.xu,
	yao.jin



On 6/19/2020 3:31 PM, Peter Zijlstra wrote:
> On Fri, Jun 19, 2020 at 07:04:05AM -0700, kan.liang@linux.intel.com wrote:
> 
>> KVM includes the header file fpu/internal.h. To avoid 'undefined
>> xfeatures_mask_all' compiling issue, xfeatures_mask_all has to be
>> exported.
> 
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index 587e03f..eb2e44e 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -58,6 +58,7 @@ static short xsave_cpuid_features[] __initdata = {
>>    * XSAVE buffer, both supervisor and user xstates.
>>    */
>>   u64 xfeatures_mask_all __read_mostly;
>> +EXPORT_SYMBOL_GPL(xfeatures_mask_all);
> 
> *groan*...
> 
> AFAICT KVM doesn't actually use any of those functions,

It seems KVM may eventually invoke copy_xregs_to_kernel() as below.

kvm_save_current_fpu()
     copy_fpregs_to_fpstate()
         copy_xregs_to_kernel()

I think we have to export the xfeatures_mask_all.

Thanks,
Kan

> can't we have
> something like BUILD_KVM (like BUILD_VDSO) and exclude those functions
> from the KVM build?
> 
> I so detest exporting random crap because kvm..
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-22 14:52     ` Liang, Kan
@ 2020-06-22 15:02       ` Dave Hansen
  2020-06-22 17:47         ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Hansen @ 2020-06-22 15:02 UTC (permalink / raw)
  To: Liang, Kan, Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, yu-cheng.yu, bigeasy,
	gorcunov, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin

On 6/22/20 7:52 AM, Liang, Kan wrote:
>>> --- a/arch/x86/kernel/fpu/xstate.c
>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>> @@ -58,6 +58,7 @@ static short xsave_cpuid_features[] __initdata = {
>>>    * XSAVE buffer, both supervisor and user xstates.
>>>    */
>>>   u64 xfeatures_mask_all __read_mostly;
>>> +EXPORT_SYMBOL_GPL(xfeatures_mask_all);
>>
>> *groan*...
>>
>> AFAICT KVM doesn't actually use any of those functions,
> 
> It seems KVM may eventually invoke copy_xregs_to_kernel() as below.
> 
> kvm_save_current_fpu()
>     copy_fpregs_to_fpstate()
>         copy_xregs_to_kernel()
> 
> I think we have to export the xfeatures_mask_all.

I'm wondering if we should just take these copy_*regs_to_*() functions
and uninline them.  Yeah, they are basically wrapping one instruction,
but it might literally be the most heavyweight instruction in the whole ISA.

Or, maybe just make an out-of-line version for KVM to call?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-22 15:02       ` Dave Hansen
@ 2020-06-22 17:47         ` Liang, Kan
  2020-06-22 18:05           ` Dave Hansen
  0 siblings, 1 reply; 40+ messages in thread
From: Liang, Kan @ 2020-06-22 17:47 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, yu-cheng.yu, bigeasy,
	gorcunov, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin



On 6/22/2020 11:02 AM, Dave Hansen wrote:
> On 6/22/20 7:52 AM, Liang, Kan wrote:
>>>> --- a/arch/x86/kernel/fpu/xstate.c
>>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>>> @@ -58,6 +58,7 @@ static short xsave_cpuid_features[] __initdata = {
>>>>     * XSAVE buffer, both supervisor and user xstates.
>>>>     */
>>>>    u64 xfeatures_mask_all __read_mostly;
>>>> +EXPORT_SYMBOL_GPL(xfeatures_mask_all);
>>>
>>> *groan*...
>>>
>>> AFAICT KVM doesn't actually use any of those functions,
>>
>> It seems KVM may eventually invoke copy_xregs_to_kernel() as below.
>>
>> kvm_save_current_fpu()
>>      copy_fpregs_to_fpstate()
>>          copy_xregs_to_kernel()
>>
>> I think we have to export the xfeatures_mask_all.
> 
> I'm wondering if we should just take these copy_*regs_to_*() functions
> and uninline them.  Yeah, they are basically wrapping one instruction,
> but it might literally be the most heavyweight instruction in the whole ISA.
>

Thanks for the suggestions, but I'm not sure if I follow these methods.

I don't think simply removing the "inline" keyword from the 
copy_xregs_to_kernel() function would help here.
Do you mean exporting the copy_*regs_to_*()?


> Or, maybe just make an out-of-line version for KVM to call?
> 

I think the out-of-line version for KVM still needs xfeatures_mask_all, 
because the size of the vcpu's XSAVE buffer (&vcpu->arch.guest_fpu) is 
the same as that of other kernel XSAVE buffers, such as task->fpu. 
xfeatures_mask_all is also required for KVM to filter out the dynamic 
supervisor features. So even if we make an out-of-line version for KVM, 
we still have to export xfeatures_mask_all for KVM.


Thanks,
Kan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-22 17:47         ` Liang, Kan
@ 2020-06-22 18:05           ` Dave Hansen
  2020-06-22 18:46             ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Hansen @ 2020-06-22 18:05 UTC (permalink / raw)
  To: Liang, Kan, Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, yu-cheng.yu, bigeasy,
	gorcunov, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin

On 6/22/20 10:47 AM, Liang, Kan wrote:
>> I'm wondering if we should just take these copy_*regs_to_*() functions
>> and uninline them.  Yeah, they are basically wrapping one instruction,
>> but it might literally be the most heavyweight instruction in the
>> whole ISA.
> 
> Thanks for the suggestions, but I'm not sure if I follow these methods.
> 
> I don't think simply removing the "inline" key word for the
> copy_xregs_to_kernel() functions would help here.
> Do you mean exporting the copy_*regs_to_*()?

The thing that worries me here is exporting "internal" FPU state like
xfeatures_mask_all.  I'm much happier exporting a function with a much
more defined purpose.

So, yes, I'm suggesting exporting the functions, *not* the data structures.

>> Or, maybe just make an out-of-line version for KVM to call?
> 
> I think the out-of-line version for KVM still needs the
> xfeatures_mask_all. Because the size of vcpu's XSAVE buffer
> (&vcpu->arch.guest_fpu) is the same as other kernel XSAVE buffers, such
> as task->fpu. The xfeatures_mask_all is required for KVM to filter out
> the dynamic supervisor feature as well. I think even if we make an
> out-of-line version for KVM, we still have to export the
> xfeatures_mask_all for KVM.

No.

You do this in a .h file:

extern void notinline_copy_xregs_to_kernel(struct xregs_state *xstate);

And then this in a .c file:

void notinline_copy_xregs_to_kernel(struct xregs_state *xstate)
{
	copy_xregs_to_kernel(xstate);
}
EXPORT_SYMBOL_GPL(notinline_copy_xregs_to_kernel);


KVM now calls notinline_copy_xregs_to_kernel() (not what it should
really be called).  It does *not* need 'xfeatures_mask_all' exported in
this case.  That preserves the inlining for core kernel users.

It's not the prettiest situation, but it is straightforward.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask
  2020-06-22 18:05           ` Dave Hansen
@ 2020-06-22 18:46             ` Liang, Kan
  0 siblings, 0 replies; 40+ messages in thread
From: Liang, Kan @ 2020-06-22 18:46 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra
  Cc: mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, yu-cheng.yu, bigeasy,
	gorcunov, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin



On 6/22/2020 2:05 PM, Dave Hansen wrote:
> On 6/22/20 10:47 AM, Liang, Kan wrote:
>>> I'm wondering if we should just take these copy_*regs_to_*() functions
>>> and uninline them.  Yeah, they are basically wrapping one instruction,
>>> but it might literally be the most heavyweight instruction in the
>>> whole ISA.
>> Thanks for the suggestions, but I'm not sure if I follow these methods.
>>
>> I don't think simply removing the "inline" key word for the
>> copy_xregs_to_kernel() functions would help here.
>> Do you mean exporting the copy_*regs_to_*()?
> The thing that worries me here is exporting "internal" FPU state like
> xfeatures_mask_all.  I'm much happier exporting a function with a much
> more defined purpose.
> 
> So, yes, I'm suggesting exporting the functions,*not*  the data structures.
> 

I think maybe we should just export copy_fpregs_to_fpstate() as below, 
because:
- KVM directly invokes this function; copy_xregs_to_kernel() is only 
invoked indirectly through it. I think we should export the function 
which is directly used by other modules.
- copy_fpregs_to_fpstate() is a bigger function with many checks. 
Uninlining it should not impact performance.
- It's also a function, which is a safer way than exporting "internal" 
FPU state: no one except the FPU code can change the state, 
intentionally or unintentionally.


diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 0388c792..d3724dc 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -411,43 +411,7 @@ static inline int copy_kernel_to_xregs_err(struct 
xregs_state *xstate, u64 mask)
  	return err;
  }

-/*
- * These must be called with preempt disabled. Returns
- * 'true' if the FPU state is still intact and we can
- * keep registers active.
- *
- * The legacy FNSAVE instruction cleared all FPU state
- * unconditionally, so registers are essentially destroyed.
- * Modern FPU state can be kept in registers, if there are
- * no pending FP exceptions.
- */
-static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
-{
-	if (likely(use_xsave())) {
-		copy_xregs_to_kernel(&fpu->state.xsave);
-
-		/*
-		 * AVX512 state is tracked here because its use is
-		 * known to slow the max clock speed of the core.
-		 */
-		if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
-			fpu->avx512_timestamp = jiffies;
-		return 1;
-	}
-
-	if (likely(use_fxsr())) {
-		copy_fxregs_to_kernel(fpu);
-		return 1;
-	}
-
-	/*
-	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
-	 * so we have to mark them inactive:
-	 */
-	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
-
-	return 0;
-}
+extern int copy_fpregs_to_fpstate(struct fpu *fpu);

  static inline void __copy_kernel_to_fpregs(union fpregs_state 
*fpstate, u64 mask)
  {
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 06c8189..1bb7532 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -82,6 +82,45 @@ bool irq_fpu_usable(void)
  }
  EXPORT_SYMBOL(irq_fpu_usable);

+/*
+ * These must be called with preempt disabled. Returns
+ * 'true' if the FPU state is still intact and we can
+ * keep registers active.
+ *
+ * The legacy FNSAVE instruction cleared all FPU state
+ * unconditionally, so registers are essentially destroyed.
+ * Modern FPU state can be kept in registers, if there are
+ * no pending FP exceptions.
+ */
+int copy_fpregs_to_fpstate(struct fpu *fpu)
+{
+	if (likely(use_xsave())) {
+		copy_xregs_to_kernel(&fpu->state.xsave);
+
+		/*
+		 * AVX512 state is tracked here because its use is
+		 * known to slow the max clock speed of the core.
+		 */
+		if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+			fpu->avx512_timestamp = jiffies;
+		return 1;
+	}
+
+	if (likely(use_fxsr())) {
+		copy_fxregs_to_kernel(fpu);
+		return 1;
+	}
+
+	/*
+	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
+	 * so we have to mark them inactive:
+	 */
+	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
+
+	return 0;
+}
+EXPORT_SYMBOL(copy_fpregs_to_fpstate);
+
  void kernel_fpu_begin(void)
  {
  	preempt_disable();
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9c0541d..ca20029 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -58,7 +58,6 @@ static short xsave_cpuid_features[] __initdata = {
   * XSAVE buffer, both supervisor and user xstates.
   */
  u64 xfeatures_mask_all __read_mostly;
-EXPORT_SYMBOL_GPL(xfeatures_mask_all);

  static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... 
XFEATURE_MAX - 1] = -1}; static unsigned int xstate_sizes[XFEATURE_MAX] 
  = { [ 0 ... XFEATURE_MAX - 1] = -1};

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read
  2020-06-19 14:04 ` [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read kan.liang
@ 2020-06-22 18:49   ` Cyrill Gorcunov
  2020-06-22 19:11     ` Liang, Kan
  0 siblings, 1 reply; 40+ messages in thread
From: Cyrill Gorcunov @ 2020-06-22 18:49 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin

On Fri, Jun 19, 2020 at 07:04:09AM -0700, kan.liang@linux.intel.com wrote:
...
> +static void intel_pmu_arch_lbr_read_xsave(struct cpu_hw_events *cpuc)
> +{
> +	struct x86_perf_task_context_arch_lbr_xsave *xsave = cpuc->lbr_xsave;
> +	struct arch_lbr_entry *lbr;
> +	int i;
> +
> +	if (!xsave)
> +		goto rollback;

Why not make it simpler?

	if (!xsave) {
		intel_pmu_arch_lbr_read(cpuc);
		return;
	}

The goto and the "return" statement before the "rollback" label
look pretty ugly. I'm sorry, I didn't follow the series
in detail, so if you plan to add more handlers at "rollback"
then sure.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read
  2020-06-22 18:49   ` Cyrill Gorcunov
@ 2020-06-22 19:11     ` Liang, Kan
  2020-06-22 19:31       ` Cyrill Gorcunov
  0 siblings, 1 reply; 40+ messages in thread
From: Liang, Kan @ 2020-06-22 19:11 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: peterz, mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin



On 6/22/2020 2:49 PM, Cyrill Gorcunov wrote:
> On Fri, Jun 19, 2020 at 07:04:09AM -0700, kan.liang@linux.intel.com wrote:
> ...
>> +static void intel_pmu_arch_lbr_read_xsave(struct cpu_hw_events *cpuc)
>> +{
>> +	struct x86_perf_task_context_arch_lbr_xsave *xsave = cpuc->lbr_xsave;
>> +	struct arch_lbr_entry *lbr;
>> +	int i;
>> +
>> +	if (!xsave)
>> +		goto rollback;
> 
> Why not make it simplier?
> 
> 	if (!xsave) {
> 		intel_pmu_arch_lbr_read(cpuc);
> 		return;
> 	}
> 
> The goto and "return" statement before the "rollback" label
> looks pretty ugly. I'm sorry I didn't follow the series
> in details so if you plan to add more handlers at "rollback"
> then sure.
> 

There were several handlers when I first implemented the function, but 
they have since been removed. I don't think I will add more handlers in 
the next version.
I will remove the "rollback" label.

Thanks for pointing it out.

Kan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read
  2020-06-22 19:11     ` Liang, Kan
@ 2020-06-22 19:31       ` Cyrill Gorcunov
  0 siblings, 0 replies; 40+ messages in thread
From: Cyrill Gorcunov @ 2020-06-22 19:31 UTC (permalink / raw)
  To: Liang, Kan
  Cc: peterz, mingo, acme, tglx, bp, x86, linux-kernel, mark.rutland,
	alexander.shishkin, jolsa, namhyung, dave.hansen, yu-cheng.yu,
	bigeasy, hpa, alexey.budankov, eranian, ak, like.xu, yao.jin

On Mon, Jun 22, 2020 at 03:11:07PM -0400, Liang, Kan wrote:
> > 
> > The goto and "return" statement before the "rollback" label
> > looks pretty ugly. I'm sorry I didn't follow the series
> > in details so if you plan to add more handlers at "rollback"
> > then sure.
> > 
> 
> There were several handlers when I first implemented the function, but they
> are removed now. I don't think I will add more handlers in the next version.
> I will remove the "rollback" label.
> 
> Thanks for pointing it out.

This could be done on top of the series of course, no need to resend
for this sole change I think.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2020-06-22 19:31 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19 14:03 [PATCH 00/21] Support Architectural LBR kan.liang
2020-06-19 14:03 ` [PATCH 01/21] x86/cpufeatures: Add Architectural LBRs feature bit kan.liang
2020-06-19 14:03 ` [PATCH 02/21] perf/x86/intel/lbr: Add pointers for LBR enable and disable kan.liang
2020-06-19 14:03 ` [PATCH 03/21] perf/x86/intel/lbr: Add pointer for LBR reset kan.liang
2020-06-19 14:03 ` [PATCH 04/21] perf/x86/intel/lbr: Add pointer for LBR read kan.liang
2020-06-19 14:03 ` [PATCH 05/21] perf/x86/intel/lbr: Add pointers for LBR save and restore kan.liang
2020-06-19 14:03 ` [PATCH 06/21] perf/x86/intel/lbr: Factor out a new struct for generic optimization kan.liang
2020-06-19 14:03 ` [PATCH 07/21] perf/x86/intel/lbr: Use dynamic data structure for task_ctx kan.liang
2020-06-19 14:03 ` [PATCH 08/21] x86/msr-index: Add bunch of MSRs for Arch LBR kan.liang
2020-06-19 19:11   ` Peter Zijlstra
2020-06-19 14:03 ` [PATCH 09/21] perf/x86: Expose CPUID enumeration bits for arch LBR kan.liang
2020-06-19 18:31   ` Peter Zijlstra
2020-06-19 14:03 ` [PATCH 10/21] perf/x86/intel: Check Arch LBR MSRs kan.liang
2020-06-19 14:03 ` [PATCH 11/21] perf/x86/intel/lbr: Support LBR_CTL kan.liang
2020-06-19 18:40   ` Peter Zijlstra
2020-06-19 19:15     ` Liang, Kan
2020-06-19 19:22       ` Peter Zijlstra
2020-06-19 14:04 ` [PATCH 12/21] perf/x86/intel/lbr: Support Architectural LBR kan.liang
2020-06-19 19:08   ` Peter Zijlstra
2020-06-19 19:40     ` Liang, Kan
2020-06-19 14:04 ` [PATCH 13/21] perf/core: Factor out functions to allocate/free the task_ctx_data kan.liang
2020-06-19 14:04 ` [PATCH 14/21] perf/core: Use kmem_cache to allocate the PMU specific data kan.liang
2020-06-19 14:04 ` [PATCH 15/21] perf/x86/intel/lbr: Create kmem_cache for the LBR context data kan.liang
2020-06-19 14:04 ` [PATCH 16/21] perf/x86: Remove task_ctx_size kan.liang
2020-06-19 14:04 ` [PATCH 17/21] x86/fpu: Use proper mask to replace full instruction mask kan.liang
2020-06-19 19:31   ` Peter Zijlstra
2020-06-22 14:52     ` Liang, Kan
2020-06-22 15:02       ` Dave Hansen
2020-06-22 17:47         ` Liang, Kan
2020-06-22 18:05           ` Dave Hansen
2020-06-22 18:46             ` Liang, Kan
2020-06-19 14:04 ` [PATCH 18/21] x86/fpu/xstate: Support dynamic supervisor feature for LBR kan.liang
2020-06-19 14:04 ` [PATCH 19/21] x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature kan.liang
2020-06-19 14:04 ` [PATCH 20/21] perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch kan.liang
2020-06-19 19:41   ` Peter Zijlstra
2020-06-19 22:28     ` Liang, Kan
2020-06-19 14:04 ` [PATCH 21/21] perf/x86/intel/lbr: Support XSAVES for arch LBR read kan.liang
2020-06-22 18:49   ` Cyrill Gorcunov
2020-06-22 19:11     ` Liang, Kan
2020-06-22 19:31       ` Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).