* [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support
@ 2022-08-18 13:02 Jan Beulich
  2022-08-18 13:03 ` [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option Jan Beulich
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

New changes have appeared in the meantime, in particular one partly undoing
what we still haven't merged (patch 1 here).

1: add 'preferred_cstates' module argument
2: add core C6 optimization for SPR
3: add AlderLake support
4: disable IBRS during long idle
5: make SPR C1 and C1E be independent

Jan


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
@ 2022-08-18 13:03 ` Jan Beulich
  2022-10-11 16:22   ` Roger Pau Monné
  2022-10-13 11:52   ` Roger Pau Monné
  2022-08-18 13:03 ` [PATCH v3 2/5] x86/mwait-idle: add core C6 optimization for SPR Jan Beulich
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:03 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
exclusive - only one of them can be enabled. By default, 'intel_idle' driver
enables C1 and disables C1E. However, some users prefer to use C1E instead of
C1, because it saves more energy.

This patch adds a new module parameter ('preferred_cstates') for enabling C1E
and disabling C1. Here is the idea behind it.

1. This option has effect only for "mutually exclusive" C-states like C1 and
   C1E on SPR.
2. It does not have any effect on independent C-states, which do not require
   other C-states to be disabled (most states on most platforms as of today).
3. For mutually exclusive C-states, the 'intel_idle' driver always has a
   reasonable default, such as enabling C1 on SPR by default. On other
   platforms, the default may be different.
4. Users can override the default using the 'preferred_cstates' parameter.
5. The parameter accepts the preferred C-states bit-mask, similarly to the
   existing 'states_off' parameter.
6. This parameter is not limited to C1/C1E, and leaves room for supporting
   other mutually exclusive C-states, if they come in the future.

Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
to disable C1 and enable C1E, users should boot with the following kernel
argument: intel_idle.preferred_cstates=4

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git da0e58c038e6

Enable C1E (if requested) not only on the BSP's socket / package. Alter
command line option to fit our model, and extend it to also accept
string form arguments.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Also accept string form arguments for command line option. Restore
    C1E-control related enum from Linux, despite our somewhat different
    use (and bigger code churn).
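For readers following along, the two accepted forms of the new option (an integer mask, or a comma-separated list of C-state names, with bit 1 = C1 and bit 2 = C1E, matching the bit numbering used in the hunks below) can be sketched in plain C. This is a simplified, illustrative stand-in only: the real code walks icpu->state_table and uses Xen's cmdline_strcmp(), and the state list here is made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative state list; the driver uses the per-CPU cpuidle_state table. */
static const char *const state_names[] = { "C1", "C1E", "C6", NULL };

/*
 * Sketch of "preferred-cstates" parsing: a leading digit means the value
 * is the mask itself; otherwise each comma-separated name is looked up,
 * with the first table entry mapped to bit 1 (so "C1" -> 0x2, "C1E" -> 0x4).
 * Unknown names are silently ignored here; the real code reports them.
 */
unsigned int parse_preferred_cstates(const char *str)
{
	unsigned int mask = 0;

	if (isdigit((unsigned char)str[0]))
		return (unsigned int)strtoul(str, NULL, 0);

	while (*str) {
		size_t len = strcspn(str, ",");
		unsigned int bit = 1, i;

		for (i = 0; state_names[i]; ++i) {
			bit <<= 1;
			if (strlen(state_names[i]) == len &&
			    !strncmp(str, state_names[i], len)) {
				mask |= bit;
				break;
			}
		}
		str += len;
		if (*str == ',')
			++str;
	}

	return mask;
}
```

With this numbering, `preferred-cstates=4` and `preferred-cstates=C1E` are equivalent, which is why the Linux commit message above cites `intel_idle.preferred_cstates=4` for enabling C1E.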

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1912,6 +1912,12 @@ paging controls access to usermode addre
 ### ple_window (Intel)
 > `= <integer>`
 
+### preferred-cstates (x86)
+> `= ( <integer> | List of ( C1 | C1E | C2 | ... ) )`
+
+This is a mask of C-states which are to be preferred.  This option is
+applicable only on hardware where certain C-states are exclusive of one another.
+
 ### psr (Intel)
 > `= List of ( cmt:<boolean> | rmid_max:<integer> | cat:<boolean> | cos_max:<integer> | cdp:<boolean> )`
 
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -82,10 +82,29 @@ boolean_param("mwait-idle", opt_mwait_id
 
 static unsigned int mwait_substates;
 
+/*
+ * Some platforms come with mutually exclusive C-states, so that if one is
+ * enabled, the other C-states must not be used. Example: C1 and C1E on
+ * Sapphire Rapids platform. This parameter allows for selecting the
+ * preferred C-states among the groups of mutually exclusive C-states - the
+ * selected C-states will be registered, the other C-states from the mutually
+ * exclusive group won't be registered. If the platform has no mutually
+ * exclusive C-states, this parameter has no effect.
+ */
+static unsigned int __ro_after_init preferred_states_mask;
+static char __initdata preferred_states[64];
+string_param("preferred-cstates", preferred_states);
+
 #define LAPIC_TIMER_ALWAYS_RELIABLE 0xFFFFFFFF
 /* Reliable LAPIC Timer States, bit 1 for C1 etc. Default to only C1. */
 static unsigned int lapic_timer_reliable_states = (1 << 1);
 
+enum c1e_promotion {
+	C1E_PROMOTION_PRESERVE,
+	C1E_PROMOTION_ENABLE,
+	C1E_PROMOTION_DISABLE
+};
+
 struct idle_cpu {
 	const struct cpuidle_state *state_table;
 
@@ -95,7 +114,7 @@ struct idle_cpu {
 	 */
 	unsigned long auto_demotion_disable_flags;
 	bool byt_auto_demotion_disable_flag;
-	bool disable_promotion_to_c1e;
+	enum c1e_promotion c1e_promotion;
 };
 
 static const struct idle_cpu *icpu;
@@ -924,6 +943,15 @@ static void cf_check byt_auto_demotion_d
 	wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
 }
 
+static void cf_check c1e_promotion_enable(void *dummy)
+{
+	uint64_t msr_bits;
+
+	rdmsrl(MSR_IA32_POWER_CTL, msr_bits);
+	msr_bits |= 0x2;
+	wrmsrl(MSR_IA32_POWER_CTL, msr_bits);
+}
+
 static void cf_check c1e_promotion_disable(void *dummy)
 {
 	u64 msr_bits;
@@ -936,7 +964,7 @@ static void cf_check c1e_promotion_disab
 static const struct idle_cpu idle_cpu_nehalem = {
 	.state_table = nehalem_cstates,
 	.auto_demotion_disable_flags = NHM_C1_AUTO_DEMOTE | NHM_C3_AUTO_DEMOTE,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_atom = {
@@ -954,64 +982,64 @@ static const struct idle_cpu idle_cpu_li
 
 static const struct idle_cpu idle_cpu_snb = {
 	.state_table = snb_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_byt = {
 	.state_table = byt_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 	.byt_auto_demotion_disable_flag = true,
 };
 
 static const struct idle_cpu idle_cpu_cht = {
 	.state_table = cht_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 	.byt_auto_demotion_disable_flag = true,
 };
 
 static const struct idle_cpu idle_cpu_ivb = {
 	.state_table = ivb_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_ivt = {
 	.state_table = ivt_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_hsw = {
 	.state_table = hsw_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_bdw = {
 	.state_table = bdw_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_skl = {
 	.state_table = skl_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_skx = {
 	.state_table = skx_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_icx = {
        .state_table = icx_cstates,
-       .disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static struct idle_cpu __read_mostly idle_cpu_spr = {
 	.state_table = spr_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_avn = {
 	.state_table = avn_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_knl = {
@@ -1020,17 +1048,17 @@ static const struct idle_cpu idle_cpu_kn
 
 static const struct idle_cpu idle_cpu_bxt = {
 	.state_table = bxt_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_dnv = {
 	.state_table = dnv_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 static const struct idle_cpu idle_cpu_snr = {
 	.state_table = snr_cstates,
-	.disable_promotion_to_c1e = true,
+	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
 #define ICPU(model, cpu) \
@@ -1241,6 +1269,25 @@ static void __init skx_idle_state_table_
 }
 
 /*
+ * spr_idle_state_table_update - Adjust Sapphire Rapids idle states table.
+ */
+static void __init spr_idle_state_table_update(void)
+{
+	/* Check if user prefers C1E over C1. */
+	if (preferred_states_mask & BIT(2, U)) {
+		if (preferred_states_mask & BIT(1, U))
+			/* Both can't be enabled, stick to the defaults. */
+			return;
+
+		spr_cstates[0].flags |= CPUIDLE_FLAG_DISABLED;
+		spr_cstates[1].flags &= ~CPUIDLE_FLAG_DISABLED;
+
+		/* Request enabling C1E using the "C1E promotion" bit. */
+		idle_cpu_spr.c1e_promotion = C1E_PROMOTION_ENABLE;
+	}
+}
+
+/*
  * mwait_idle_state_table_update()
  *
  * Update the default state_table for this CPU-id
@@ -1261,6 +1308,9 @@ static void __init mwait_idle_state_tabl
 	case INTEL_FAM6_SKYLAKE_X:
 		skx_idle_state_table_update();
 		break;
+	case INTEL_FAM6_SAPPHIRERAPIDS_X:
+		spr_idle_state_table_update();
+		break;
 	}
 }
 
@@ -1268,6 +1318,7 @@ static int __init mwait_idle_probe(void)
 {
 	unsigned int eax, ebx, ecx;
 	const struct x86_cpu_id *id = x86_match_cpu(intel_idle_ids);
+	const char *str;
 
 	if (!id) {
 		pr_debug(PREFIX "does not run on family %d model %d\n",
@@ -1309,6 +1360,39 @@ static int __init mwait_idle_probe(void)
 	pr_debug(PREFIX "lapic_timer_reliable_states %#x\n",
 		 lapic_timer_reliable_states);
 
+	str = preferred_states;
+	if (isdigit(str[0]))
+		preferred_states_mask = simple_strtoul(str, &str, 0);
+	else if (str[0])
+	{
+		const char *ss;
+
+		do {
+			const struct cpuidle_state *state = icpu->state_table;
+			unsigned int bit = 1;
+
+			ss = strchr(str, ',');
+			if (!ss)
+				ss = strchr(str, '\0');
+
+			for (; state->name[0]; ++state) {
+				bit <<= 1;
+				if (!cmdline_strcmp(str, state->name)) {
+					preferred_states_mask |= bit;
+					break;
+				}
+			}
+			if (!state->name[0])
+				break;
+
+			str = ss + 1;
+		} while (*ss);
+
+		str -= str == ss + 1;
+	}
+	if (str[0])
+		printk("unrecognized \"preferred-cstates=%s\"\n", str);
+
 	mwait_idle_state_table_update();
 
 	return 0;
@@ -1400,8 +1484,18 @@ static int cf_check mwait_idle_cpu_init(
 	if (icpu->byt_auto_demotion_disable_flag)
 		on_selected_cpus(cpumask_of(cpu), byt_auto_demotion_disable, NULL, 1);
 
-	if (icpu->disable_promotion_to_c1e)
+	switch (icpu->c1e_promotion) {
+	case C1E_PROMOTION_DISABLE:
 		on_selected_cpus(cpumask_of(cpu), c1e_promotion_disable, NULL, 1);
+		break;
+
+	case C1E_PROMOTION_ENABLE:
+		on_selected_cpus(cpumask_of(cpu), c1e_promotion_enable, NULL, 1);
+		break;
+
+	case C1E_PROMOTION_PRESERVE:
+		break;
+	}
 
 	return NOTIFY_DONE;
 }




* [PATCH v3 2/5] x86/mwait-idle: add core C6 optimization for SPR
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
  2022-08-18 13:03 ` [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option Jan Beulich
@ 2022-08-18 13:03 ` Jan Beulich
  2022-08-18 13:04 ` [PATCH v3 3/5] x86/mwait-idle: add AlderLake support Jan Beulich
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:03 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Add a Sapphire Rapids Xeon C6 optimization, similar to what we have for Sky Lake
Xeon: if package C6 is disabled, adjust C6 exit latency and target residency to
match core C6 values, instead of using the default package C6 values.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 3a9cf77b60dc

Make sure a contradictory "preferred-cstates" wouldn't cause bypassing
of the added logic.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
---
v3: Add parentheses.
v2: Sync with the Linux side fix to the noticed issue. Re-base over
    change to earlier patch.
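The core-C6 fallback added by this patch boils down to one comparison on the package C-state limit field of MSR_PKG_CST_CONFIG_CONTROL. A hedged sketch follows: the 290/800 PC6 defaults are taken from the upstream SPR C-state table, which is not shown in this excerpt, and the MSR value is passed in rather than read so the rule is testable in isolation.

```c
#include <assert.h>
#include <stdint.h>

struct cstate_times {
	unsigned int exit_latency;      /* microseconds */
	unsigned int target_residency;  /* microseconds */
};

/*
 * Sketch of the decision in spr_idle_state_table_update(): the low 3 bits
 * of MSR_PKG_CST_CONFIG_CONTROL encode the deepest allowed package
 * C-state, and limit values below 2 rule out PC6.  In that case the C6
 * entry should advertise core-C6 numbers instead of the PC6 worst case.
 */
struct cstate_times spr_c6_times(uint64_t pkg_cst_cfg)
{
	/* PC6 worst-case defaults (per the upstream SPR table). */
	struct cstate_times t = { 290, 800 };

	if ((pkg_cst_cfg & 0x7) < 2) {
		/* PC6 disabled: use the core-C6 values from this patch. */
		t.exit_latency = 190;
		t.target_residency = 600;
	}
	return t;
}
```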

--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -1273,18 +1273,31 @@ static void __init skx_idle_state_table_
  */
 static void __init spr_idle_state_table_update(void)
 {
-	/* Check if user prefers C1E over C1. */
-	if (preferred_states_mask & BIT(2, U)) {
-		if (preferred_states_mask & BIT(1, U))
-			/* Both can't be enabled, stick to the defaults. */
-			return;
+	uint64_t msr;
 
+	/* Check if user prefers C1E over C1. */
+	if ((preferred_states_mask & BIT(2, U)) &&
+	    !(preferred_states_mask & BIT(1, U))) {
+		/* Disable C1 and enable C1E. */
 		spr_cstates[0].flags |= CPUIDLE_FLAG_DISABLED;
 		spr_cstates[1].flags &= ~CPUIDLE_FLAG_DISABLED;
 
 		/* Request enabling C1E using the "C1E promotion" bit. */
 		idle_cpu_spr.c1e_promotion = C1E_PROMOTION_ENABLE;
 	}
+
+	/*
+	 * By default, the C6 state assumes the worst-case scenario of package
+	 * C6. However, if PC6 is disabled, we update the numbers to match
+	 * core C6.
+	 */
+	rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
+
+	/* Limit value 2 and above allow for PC6. */
+	if ((msr & 0x7) < 2) {
+		spr_cstates[2].exit_latency = 190;
+		spr_cstates[2].target_residency = 600;
+	}
 }
 
 /*




* [PATCH v3 3/5] x86/mwait-idle: add AlderLake support
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
  2022-08-18 13:03 ` [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option Jan Beulich
  2022-08-18 13:03 ` [PATCH v3 2/5] x86/mwait-idle: add core C6 optimization for SPR Jan Beulich
@ 2022-08-18 13:04 ` Jan Beulich
  2022-10-13 11:55   ` Roger Pau Monné
  2022-08-18 13:04 ` [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle Jan Beulich
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

From: Zhang Rui <rui.zhang@intel.com>

Similar to SPR, the C1 and C1E states on ADL are mutually exclusive.
Only one of them can be enabled at a time.

But in contrast to SPR, which as a Xeon processor usually has strong
latency requirements, C1E is preferred on ADL for better energy
efficiency.

Add custom C-state tables for ADL with both C1 and C1E, and

 1. Enable the "C1E promotion" bit in MSR_IA32_POWER_CTL and mark C1
    with the CPUIDLE_FLAG_UNUSABLE flag, so C1 is not available by
    default.

 2. Add support for the "preferred_cstates" module parameter, so that
    users can choose to use C1 instead of C1E by booting with
    "intel_idle.preferred_cstates=2".

Separate custom C-state tables are introduced for the ADL mobile and
desktop processors, because of the exit latency differences between
these two variants, especially with respect to PC10.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[ rjw: Changelog edits, code rearrangement ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git d1cf8bbfed1e
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
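The C1-vs-C1E selection rule described in point 2 above can be expressed as a small pure function: C1 wins only when the user asked for C1 (bit 1) and not also for C1E (bit 2); any other combination keeps the ADL default of C1E. The enum values mirror those introduced earlier in the series, but the function itself is an illustrative sketch, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum c1e_promotion {
	C1E_PROMOTION_PRESERVE,
	C1E_PROMOTION_ENABLE,
	C1E_PROMOTION_DISABLE
};

/*
 * Sketch of adl_idle_state_table_update()'s rule.  Bit numbering follows
 * the "preferred-cstates" patch: bit 1 = C1, bit 2 = C1E.  Disabling
 * "C1E promotion" effectively selects C1; enabling it selects C1E.
 */
enum c1e_promotion adl_c1e_choice(unsigned int preferred_states_mask)
{
	bool want_c1  = preferred_states_mask & (1u << 1);
	bool want_c1e = preferred_states_mask & (1u << 2);

	return (want_c1 && !want_c1e) ? C1E_PROMOTION_DISABLE
	                              : C1E_PROMOTION_ENABLE;
}
```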

--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -606,6 +606,84 @@ static const struct cpuidle_state icx_cs
 };
 
 /*
+ * On AlderLake C1 has to be disabled if C1E is enabled, and vice versa.
+ * C1E is enabled only if "C1E promotion" bit is set in MSR_IA32_POWER_CTL.
+ * But in this case there is effectively no C1, because C1 requests are
+ * promoted to C1E. If the "C1E promotion" bit is cleared, then both C1
+ * and C1E requests end up with C1, so there is effectively no C1E.
+ *
+ * By default we enable C1E and disable C1 by marking it with
+ * 'CPUIDLE_FLAG_DISABLED'.
+ */
+static struct cpuidle_state __read_mostly adl_cstates[] = {
+	{
+		.name = "C1",
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_DISABLED,
+		.exit_latency = 1,
+		.target_residency = 1,
+	},
+	{
+		.name = "C1E",
+		.flags = MWAIT2flg(0x01),
+		.exit_latency = 2,
+		.target_residency = 4,
+	},
+	{
+		.name = "C6",
+		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 220,
+		.target_residency = 600,
+	},
+	{
+		.name = "C8",
+		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 280,
+		.target_residency = 800,
+	},
+	{
+		.name = "C10",
+		.flags = MWAIT2flg(0x60) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 680,
+		.target_residency = 2000,
+	},
+	{}
+};
+
+static struct cpuidle_state __read_mostly adl_l_cstates[] = {
+	{
+		.name = "C1",
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_DISABLED,
+		.exit_latency = 1,
+		.target_residency = 1,
+	},
+	{
+		.name = "C1E",
+		.flags = MWAIT2flg(0x01),
+		.exit_latency = 2,
+		.target_residency = 4,
+	},
+	{
+		.name = "C6",
+		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 170,
+		.target_residency = 500,
+	},
+	{
+		.name = "C8",
+		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 200,
+		.target_residency = 600,
+	},
+	{
+		.name = "C10",
+		.flags = MWAIT2flg(0x60) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 230,
+		.target_residency = 700,
+	},
+	{}
+};
+
+/*
  * On Sapphire Rapids Xeon C1 has to be disabled if C1E is enabled, and vice
  * versa. On SPR C1E is enabled only if "C1E promotion" bit is set in
  * MSR_IA32_POWER_CTL. But in this case there effectively no C1, because C1
@@ -1032,6 +1110,14 @@ static const struct idle_cpu idle_cpu_ic
 	.c1e_promotion = C1E_PROMOTION_DISABLE,
 };
 
+static struct idle_cpu __read_mostly idle_cpu_adl = {
+	.state_table = adl_cstates,
+};
+
+static struct idle_cpu __read_mostly idle_cpu_adl_l = {
+	.state_table = adl_l_cstates,
+};
+
 static struct idle_cpu __read_mostly idle_cpu_spr = {
 	.state_table = spr_cstates,
 	.c1e_promotion = C1E_PROMOTION_DISABLE,
@@ -1099,6 +1185,8 @@ static const struct x86_cpu_id intel_idl
 	ICPU(SKYLAKE_X,			skx),
 	ICPU(ICELAKE_X,			icx),
 	ICPU(ICELAKE_D,			icx),
+	ICPU(ALDERLAKE,			adl),
+	ICPU(ALDERLAKE_L,		adl_l),
 	ICPU(SAPPHIRERAPIDS_X,		spr),
 	ICPU(XEON_PHI_KNL,		knl),
 	ICPU(XEON_PHI_KNM,		knl),
@@ -1269,6 +1357,30 @@ static void __init skx_idle_state_table_
 }
 
 /*
+ * adl_idle_state_table_update - Adjust AlderLake idle states table.
+ */
+static void __init adl_idle_state_table_update(void)
+{
+	/* Check if user prefers C1 over C1E. */
+	if ((preferred_states_mask & BIT(1, U)) &&
+	    !(preferred_states_mask & BIT(2, U))) {
+		adl_cstates[0].flags &= ~CPUIDLE_FLAG_DISABLED;
+		adl_cstates[1].flags |= CPUIDLE_FLAG_DISABLED;
+		adl_l_cstates[0].flags &= ~CPUIDLE_FLAG_DISABLED;
+		adl_l_cstates[1].flags |= CPUIDLE_FLAG_DISABLED;
+
+		/* Disable C1E by clearing the "C1E promotion" bit. */
+		idle_cpu_adl.c1e_promotion = C1E_PROMOTION_DISABLE;
+		idle_cpu_adl_l.c1e_promotion = C1E_PROMOTION_DISABLE;
+		return;
+	}
+
+	/* Make sure C1E is enabled by default */
+	idle_cpu_adl.c1e_promotion = C1E_PROMOTION_ENABLE;
+	idle_cpu_adl_l.c1e_promotion = C1E_PROMOTION_ENABLE;
+}
+
+/*
  * spr_idle_state_table_update - Adjust Sapphire Rapids idle states table.
  */
 static void __init spr_idle_state_table_update(void)
@@ -1324,6 +1436,10 @@ static void __init mwait_idle_state_tabl
 	case INTEL_FAM6_SAPPHIRERAPIDS_X:
 		spr_idle_state_table_update();
 		break;
+	case INTEL_FAM6_ALDERLAKE:
+	case INTEL_FAM6_ALDERLAKE_L:
+		adl_idle_state_table_update();
+		break;
 	}
 }
 




* [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
                   ` (2 preceding siblings ...)
  2022-08-18 13:04 ` [PATCH v3 3/5] x86/mwait-idle: add AlderLake support Jan Beulich
@ 2022-08-18 13:04 ` Jan Beulich
  2022-10-13 12:03   ` Roger Pau Monné
  2022-08-18 13:05 ` [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent Jan Beulich
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

From: Peter Zijlstra <peterz@infradead.org>

Having IBRS enabled while the SMT sibling is idle unnecessarily slows
down the running sibling. OTOH, disabling IBRS around idle takes two
MSR writes, which will increase the idle latency.

Therefore, only disable IBRS around deeper idle states. Shallow idle
states are bounded by the tick in duration, since NOHZ is not allowed
for them by virtue of their short target residency.

Only do this for mwait-driven idle, since that keeps interrupts disabled
across idle, which makes disabling IBRS vs IRQ-entry a non-issue.

Note: C6 is a random threshold, most importantly C1 probably shouldn't
disable IBRS, benchmarking needed.

Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
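The entry/exit rule here amounts to conditionally clearing IBRS (bit 0 of IA32_SPEC_CTRL) around the idle period for states deep enough to carry the new flag. A minimal sketch, computing the MSR value instead of writing it; the function name is hypothetical, and the real code goes through spec_ctrl_enter_idle()/spec_ctrl_exit_idle() rather than touching the MSR directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS (1u << 0)  /* IA32_SPEC_CTRL.IBRS is bit 0 */

/*
 * Sketch of the idle-entry decision: states flagged CPUIDLE_FLAG_IBRS
 * (C6 and deeper in this patch) drop IBRS for the duration of idle,
 * so the sibling isn't slowed down; shallow states leave SPEC_CTRL
 * untouched to avoid the extra MSR-write latency.  On exit the saved
 * value is written back unmodified.
 */
uint32_t spec_ctrl_idle_value(uint32_t spec_ctrl, bool ibrs_disable)
{
	return ibrs_disable ? (spec_ctrl & ~SPEC_CTRL_IBRS) : spec_ctrl;
}
```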

--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -141,6 +141,12 @@ static const struct cpuidle_state {
 #define CPUIDLE_FLAG_TLB_FLUSHED	0x10000
 
 /*
+ * Disable IBRS across idle (when KERNEL_IBRS), is exclusive vs IRQ_ENABLE
+ * above.
+ */
+#define CPUIDLE_FLAG_IBRS		0x20000
+
+/*
  * MWAIT takes an 8-bit "hint" in EAX "suggesting"
  * the C-state (top nibble) and sub-state (bottom nibble)
  * 0x00 means "MWAIT(C1)", 0x10 means "MWAIT(C2)" etc.
@@ -530,31 +536,31 @@ static struct cpuidle_state __read_mostl
 	},
 	{
 		.name = "C6",
-		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
 		.exit_latency = 85,
 		.target_residency = 200,
 	},
 	{
 		.name = "C7s",
-		.flags = MWAIT2flg(0x33) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.flags = MWAIT2flg(0x33) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
 		.exit_latency = 124,
 		.target_residency = 800,
 	},
 	{
 		.name = "C8",
-		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
 		.exit_latency = 200,
 		.target_residency = 800,
 	},
 	{
 		.name = "C9",
-		.flags = MWAIT2flg(0x50) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.flags = MWAIT2flg(0x50) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
 		.exit_latency = 480,
 		.target_residency = 5000,
 	},
 	{
 		.name = "C10",
-		.flags = MWAIT2flg(0x60) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.flags = MWAIT2flg(0x60) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
 		.exit_latency = 890,
 		.target_residency = 5000,
 	},
@@ -576,7 +582,7 @@ static struct cpuidle_state __read_mostl
 	},
 	{
 		.name = "C6",
-		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
 		.exit_latency = 133,
 		.target_residency = 600,
 	},
@@ -906,6 +912,7 @@ static const struct cpuidle_state snr_cs
 static void cf_check mwait_idle(void)
 {
 	unsigned int cpu = smp_processor_id();
+	struct cpu_info *info = get_cpu_info();
 	struct acpi_processor_power *power = processor_powers[cpu];
 	struct acpi_processor_cx *cx = NULL;
 	unsigned int next_state;
@@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
 			pm_idle_save();
 		else
 		{
-			struct cpu_info *info = get_cpu_info();
-
 			spec_ctrl_enter_idle(info);
 			safe_halt();
 			spec_ctrl_exit_idle(info);
@@ -960,6 +965,11 @@ static void cf_check mwait_idle(void)
 	if ((cx->type >= 3) && errata_c6_workaround())
 		cx = power->safe_state;
 
+	if (cx->ibrs_disable) {
+		ASSERT(!cx->irq_enable_early);
+		spec_ctrl_enter_idle(info);
+	}
+
 #if 0 /* XXX Can we/do we need to do something similar on Xen? */
 	/*
 	 * leave_mm() to avoid costly and often unnecessary wakeups
@@ -991,6 +1001,10 @@ static void cf_check mwait_idle(void)
 
 	/* Now back in C0. */
 	update_idle_stats(power, cx, before, after);
+
+	if (cx->ibrs_disable)
+		spec_ctrl_exit_idle(info);
+
 	local_irq_enable();
 
 	TRACE_6D(TRC_PM_IDLE_EXIT, cx->type, after,
@@ -1603,6 +1617,8 @@ static int cf_check mwait_idle_cpu_init(
 		    /* cstate_restore_tsc() needs to be a no-op */
 		    boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
 			cx->irq_enable_early = true;
+		if (cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IBRS)
+			cx->ibrs_disable = true;
 
 		dev->count++;
 	}
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -42,7 +42,8 @@ struct acpi_processor_cx
     u8 idx;
     u8 type;         /* ACPI_STATE_Cn */
     u8 entry_method; /* ACPI_CSTATE_EM_xxx */
-    bool irq_enable_early;
+    bool irq_enable_early:1;
+    bool ibrs_disable:1;
     u32 address;
     u32 latency;
     u32 target_residency;




* [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
                   ` (3 preceding siblings ...)
  2022-08-18 13:04 ` [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle Jan Beulich
@ 2022-08-18 13:05 ` Jan Beulich
  2022-10-13 12:05   ` Roger Pau Monné
  2022-08-18 13:09 ` [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
  2022-10-13 12:25 ` [4.17?] " Jan Beulich
  6 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:05 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

This patch partially reverts the changes made by the following commit:

da0e58c038e6 intel_idle: add 'preferred_cstates' module argument

As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
C1E states were mutually exclusive, so that users could only have either C1 and
C6, or C1E and C6.

However, Intel firmware engineers managed to remove this limitation and make
C1 and C1E completely independent, just like on previous Xeon platforms.

Therefore, this patch:
 * Removes the commentary describing the old, now non-existent SPR C1E
   limitation.
 * Marks SPR C1E as available by default.
 * Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
   C1E will be available regardless of 'preferred_cstates' value.

We expect that all SPR systems are shipping with new firmware, which includes
the C1/C1E improvement.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1548fac47a11
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -689,16 +689,6 @@ static struct cpuidle_state __read_mostl
 	{}
 };
 
-/*
- * On Sapphire Rapids Xeon C1 has to be disabled if C1E is enabled, and vice
- * versa. On SPR C1E is enabled only if "C1E promotion" bit is set in
- * MSR_IA32_POWER_CTL. But in this case there effectively no C1, because C1
- * requests are promoted to C1E. If the "C1E promotion" bit is cleared, then
- * both C1 and C1E requests end up with C1, so there is effectively no C1E.
- *
- * By default we enable C1 and disable C1E by marking it with
- * 'CPUIDLE_FLAG_DISABLED'.
- */
 static struct cpuidle_state __read_mostly spr_cstates[] = {
 	{
 		.name = "C1",
@@ -708,7 +698,7 @@ static struct cpuidle_state __read_mostl
 	},
 	{
 		.name = "C1E",
-		.flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_DISABLED,
+		.flags = MWAIT2flg(0x01),
 		.exit_latency = 2,
 		.target_residency = 4,
 	},
@@ -1401,17 +1391,6 @@ static void __init spr_idle_state_table_
 {
 	uint64_t msr;
 
-	/* Check if user prefers C1E over C1. */
-	if ((preferred_states_mask & BIT(2, U)) &&
-	    !(preferred_states_mask & BIT(1, U))) {
-		/* Disable C1 and enable C1E. */
-		spr_cstates[0].flags |= CPUIDLE_FLAG_DISABLED;
-		spr_cstates[1].flags &= ~CPUIDLE_FLAG_DISABLED;
-
-		/* Request enabling C1E using the "C1E promotion" bit. */
-		idle_cpu_spr.c1e_promotion = C1E_PROMOTION_ENABLE;
-	}
-
 	/*
 	 * By default, the C6 state assumes the worst-case scenario of package
 	 * C6. However, if PC6 is disabled, we update the numbers to match




* Re: [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
                   ` (4 preceding siblings ...)
  2022-08-18 13:05 ` [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent Jan Beulich
@ 2022-08-18 13:09 ` Jan Beulich
  2022-08-19  1:02   ` Henry Wang
  2022-10-13 12:25 ` [4.17?] " Jan Beulich
  6 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-08-18 13:09 UTC (permalink / raw)
  To: Henry Wang; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

Henry,

On 18.08.2022 15:02, Jan Beulich wrote:
> New changes have appeared in the meantime, in particular one partly undoing
> what we still haven't merged (patch 1 here).
> 
> 1: add 'preferred_cstates' module argument
> 2: add core C6 optimization for SPR
> 3: add AlderLake support
> 4: disable IBRS during long idle
> 5: make SPR C1 and C1E be independent

strictly speaking patches 3-5 are late submissions. Patch 5, however,
actually corrects patch 1, and I'd prefer to keep things in the order
in which they were put in for Linux. Whether we actually want patch 4
is to be determined; if not that one should be easy to leave out. In
any event I'd like to ask for you to consider granting an exception
on these last three patches.

Thanks, Jan



* RE: [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support
  2022-08-18 13:09 ` [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
@ 2022-08-19  1:02   ` Henry Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Henry Wang @ 2022-08-19  1:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Subject: Re: [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL
> support
> 
> Henry,
> 
> On 18.08.2022 15:02, Jan Beulich wrote:
> > New changes have appeared in the meantime, in particular one partly
> undoing
> > what we still haven't merged (patch 1 here).
> >
> > 1: add 'preferred_cstates' module argument
> > 2: add core C6 optimization for SPR
> > 3: add AlderLake support
> > 4: disable IBRS during long idle
> > 5: make SPR C1 and C1E be independent
> 
> strictly speaking patches 3-5 are late submissions. Patch 5, however,
> actually corrects patch 1, and I'd prefer to keep things in the order
> in which they were put in for Linux. Whether we actually want patch 4
> is to be determined; if not that one should be easy to leave out. In
> any event I'd like to ask for you to consider granting an exception
> on these last three patches.

Thank you for letting me know about this.

Let me add this series to my list first. The feature freeze date is
Fri Sep 2, 2022 and the code freeze date is Fri Sep 23, 2022. My personal
feeling is that as long as (1) this series gets into good shape before
the code freeze and (2) the other x86 maintainers have no objection to
accepting it, it is fine with me. Also, if the x86 people want this series
in the release and it won't delay the release, I don't think there would
be any reason to refuse it :))) So please just keep these 3 patches for
now and let's see how the discussion on this series goes.

Kind regards,
Henry

> 
> Thanks, Jan


* Re: [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option
  2022-08-18 13:03 ` [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option Jan Beulich
@ 2022-10-11 16:22   ` Roger Pau Monné
  2022-10-12 13:46     ` Jan Beulich
  2022-10-13 11:52   ` Roger Pau Monné
  1 sibling, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-11 16:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Aug 18, 2022 at 03:03:33PM +0200, Jan Beulich wrote:
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> 
> On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
> exclusive - only one of them can be enabled. By default, 'intel_idle' driver
> enables C1 and disables C1E. However, some users prefer to use C1E instead of
> C1, because it saves more energy.
> 
> This patch adds a new module parameter ('preferred_cstates') for enabling C1E
> and disabling C1. Here is the idea behind it.
> 
> 1. This option has effect only for "mutually exclusive" C-states like C1 and
>    C1E on SPR.
> 2. It does not have any effect on independent C-states, which do not require
>    other C-states to be disabled (most states on most platforms as of today).
> 3. For mutually exclusive C-states, the 'intel_idle' driver always has a
>    reasonable default, such as enabling C1 on SPR by default. On other
>    platforms, the default may be different.
> 4. Users can override the default using the 'preferred_cstates' parameter.
> 5. The parameter accepts the preferred C-states bit-mask, similarly to the
>    existing 'states_off' parameter.
> 6. This parameter is not limited to C1/C1E, and leaves room for supporting
>    other mutually exclusive C-states, if they come in the future.
> 
> Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
> to disable C1 and enable C1E, users should boot with the following kernel
> argument: intel_idle.preferred_cstates=4
> 
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git da0e58c038e6
> 
> Enable C1E (if requested) not only on the BSP's socket / package. Alter
> command line option to fit our model, and extend it to also accept
> string form arguments.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Also accept string form arguments for command line option. Restore
>     C1E-control related enum from Linux, despite our somewhat different
>     use (and bigger code churn).
> 
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1912,6 +1912,12 @@ paging controls access to usermode addre
>  ### ple_window (Intel)
>  > `= <integer>`
>  
> +### preferred-cstates (x86)
> +> `= ( <integer> | List of ( C1 | C1E | C2 | ... ) )`
> +
> +This is a mask of C-states which are to be used preferably.  This option is
> +applicable only on hardware where certain C-states are exclusive of one another.
> +
>  ### psr (Intel)
>  > `= List of ( cmt:<boolean> | rmid_max:<integer> | cat:<boolean> | cos_max:<integer> | cdp:<boolean> )`
>  
> --- a/xen/arch/x86/cpu/mwait-idle.c
> +++ b/xen/arch/x86/cpu/mwait-idle.c
> @@ -82,10 +82,29 @@ boolean_param("mwait-idle", opt_mwait_id
>  
>  static unsigned int mwait_substates;
>  
> +/*
> + * Some platforms come with mutually exclusive C-states, so that if one is
> + * enabled, the other C-states must not be used. Example: C1 and C1E on
> + * Sapphire Rapids platform. This parameter allows for selecting the
> + * preferred C-states among the groups of mutually exclusive C-states - the
> + * selected C-states will be registered, the other C-states from the mutually
> + * exclusive group won't be registered. If the platform has no mutually
> + * exclusive C-states, this parameter has no effect.
> + */
> +static unsigned int __ro_after_init preferred_states_mask;
> +static char __initdata preferred_states[64];
> +string_param("preferred-cstates", preferred_states);
> +
>  #define LAPIC_TIMER_ALWAYS_RELIABLE 0xFFFFFFFF
>  /* Reliable LAPIC Timer States, bit 1 for C1 etc. Default to only C1. */
>  static unsigned int lapic_timer_reliable_states = (1 << 1);
>  
> +enum c1e_promotion {
> +	C1E_PROMOTION_PRESERVE,
> +	C1E_PROMOTION_ENABLE,
> +	C1E_PROMOTION_DISABLE
> +};
> +
>  struct idle_cpu {
>  	const struct cpuidle_state *state_table;
>  
> @@ -95,7 +114,7 @@ struct idle_cpu {
>  	 */
>  	unsigned long auto_demotion_disable_flags;
>  	bool byt_auto_demotion_disable_flag;
> -	bool disable_promotion_to_c1e;
> +	enum c1e_promotion c1e_promotion;
>  };
>  
>  static const struct idle_cpu *icpu;
> @@ -924,6 +943,15 @@ static void cf_check byt_auto_demotion_d
>  	wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
>  }
>  
> +static void cf_check c1e_promotion_enable(void *dummy)
> +{
> +	uint64_t msr_bits;
> +
> +	rdmsrl(MSR_IA32_POWER_CTL, msr_bits);
> +	msr_bits |= 0x2;
> +	wrmsrl(MSR_IA32_POWER_CTL, msr_bits);
> +}
> +
>  static void cf_check c1e_promotion_disable(void *dummy)
>  {
>  	u64 msr_bits;
> @@ -936,7 +964,7 @@ static void cf_check c1e_promotion_disab
>  static const struct idle_cpu idle_cpu_nehalem = {
>  	.state_table = nehalem_cstates,
>  	.auto_demotion_disable_flags = NHM_C1_AUTO_DEMOTE | NHM_C3_AUTO_DEMOTE,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_atom = {
> @@ -954,64 +982,64 @@ static const struct idle_cpu idle_cpu_li
>  
>  static const struct idle_cpu idle_cpu_snb = {
>  	.state_table = snb_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_byt = {
>  	.state_table = byt_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  	.byt_auto_demotion_disable_flag = true,
>  };
>  
>  static const struct idle_cpu idle_cpu_cht = {
>  	.state_table = cht_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  	.byt_auto_demotion_disable_flag = true,
>  };
>  
>  static const struct idle_cpu idle_cpu_ivb = {
>  	.state_table = ivb_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_ivt = {
>  	.state_table = ivt_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_hsw = {
>  	.state_table = hsw_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_bdw = {
>  	.state_table = bdw_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_skl = {
>  	.state_table = skl_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_skx = {
>  	.state_table = skx_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_icx = {
>         .state_table = icx_cstates,
> -       .disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static struct idle_cpu __read_mostly idle_cpu_spr = {
>  	.state_table = spr_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_avn = {
>  	.state_table = avn_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_knl = {
> @@ -1020,17 +1048,17 @@ static const struct idle_cpu idle_cpu_kn
>  
>  static const struct idle_cpu idle_cpu_bxt = {
>  	.state_table = bxt_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_dnv = {
>  	.state_table = dnv_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  static const struct idle_cpu idle_cpu_snr = {
>  	.state_table = snr_cstates,
> -	.disable_promotion_to_c1e = true,
> +	.c1e_promotion = C1E_PROMOTION_DISABLE,
>  };
>  
>  #define ICPU(model, cpu) \
> @@ -1241,6 +1269,25 @@ static void __init skx_idle_state_table_
>  }
>  
>  /*
> + * spr_idle_state_table_update - Adjust Sapphire Rapids idle states table.
> + */
> +static void __init spr_idle_state_table_update(void)
> +{
> +	/* Check if user prefers C1E over C1. */
> +	if (preferred_states_mask & BIT(2, U)) {
> +		if (preferred_states_mask & BIT(1, U))
> +			/* Both can't be enabled, stick to the defaults. */
> +			return;
> +
> +		spr_cstates[0].flags |= CPUIDLE_FLAG_DISABLED;
> +		spr_cstates[1].flags &= ~CPUIDLE_FLAG_DISABLED;
> +
> +		/* Request enabling C1E using the "C1E promotion" bit. */
> +		idle_cpu_spr.c1e_promotion = C1E_PROMOTION_ENABLE;
> +	}
> +}
> +
> +/*
>   * mwait_idle_state_table_update()
>   *
>   * Update the default state_table for this CPU-id
> @@ -1261,6 +1308,9 @@ static void __init mwait_idle_state_tabl
>  	case INTEL_FAM6_SKYLAKE_X:
>  		skx_idle_state_table_update();
>  		break;
> +	case INTEL_FAM6_SAPPHIRERAPIDS_X:
> +		spr_idle_state_table_update();
> +		break;
>  	}
>  }
>  
> @@ -1268,6 +1318,7 @@ static int __init mwait_idle_probe(void)
>  {
>  	unsigned int eax, ebx, ecx;
>  	const struct x86_cpu_id *id = x86_match_cpu(intel_idle_ids);
> +	const char *str;
>  
>  	if (!id) {
>  		pr_debug(PREFIX "does not run on family %d model %d\n",
> @@ -1309,6 +1360,39 @@ static int __init mwait_idle_probe(void)
>  	pr_debug(PREFIX "lapic_timer_reliable_states %#x\n",
>  		 lapic_timer_reliable_states);
>  
> +	str = preferred_states;
> +	if (isdigit(str[0]))
> +		preferred_states_mask = simple_strtoul(str, &str, 0);
> +	else if (str[0])
> +	{
> +		const char *ss;
> +
> +		do {
> +			const struct cpuidle_state *state = icpu->state_table;
> +			unsigned int bit = 1;
> +
> +			ss = strchr(str, ',');
> +			if (!ss)
> +				ss = strchr(str, '\0');
> +
> +			for (; state->name[0]; ++state) {
> +				bit <<= 1;
> +				if (!cmdline_strcmp(str, state->name)) {
> +					preferred_states_mask |= bit;
> +					break;
> +				}
> +			}
> +			if (!state->name[0])
> +				break;
> +
> +			str = ss + 1;
> +	    } while (*ss);
> +
> +	    str -= str == ss + 1;

I would add parentheses to the sum for clarity.

> +	}
> +	if (str[0])
> +		printk("unrecognized \"preferred-cstates=%s\"\n", str);
> +
>  	mwait_idle_state_table_update();
>  
>  	return 0;
> @@ -1400,8 +1484,18 @@ static int cf_check mwait_idle_cpu_init(
>  	if (icpu->byt_auto_demotion_disable_flag)
>  		on_selected_cpus(cpumask_of(cpu), byt_auto_demotion_disable, NULL, 1);
>  
> -	if (icpu->disable_promotion_to_c1e)
> +	switch (icpu->c1e_promotion) {
> +	case C1E_PROMOTION_DISABLE:
>  		on_selected_cpus(cpumask_of(cpu), c1e_promotion_disable, NULL, 1);
> +		break;
> +
> +	case C1E_PROMOTION_ENABLE:
> +		on_selected_cpus(cpumask_of(cpu), c1e_promotion_enable, NULL, 1);
> +		break;
> +
> +	case C1E_PROMOTION_PRESERVE:
> +		break;
> +	}

I find it kind of weird to use a notifier for this; won't it be
easier to set the C1E promotion as part of the CPU bringup process?

I see we also set other bits in the same way.

Thanks, Roger.



* Re: [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option
  2022-10-11 16:22   ` Roger Pau Monné
@ 2022-10-12 13:46     ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2022-10-12 13:46 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 11.10.2022 18:22, Roger Pau Monné wrote:
> On Thu, Aug 18, 2022 at 03:03:33PM +0200, Jan Beulich wrote:
>> @@ -1309,6 +1360,39 @@ static int __init mwait_idle_probe(void)
>>  	pr_debug(PREFIX "lapic_timer_reliable_states %#x\n",
>>  		 lapic_timer_reliable_states);
>>  
>> +	str = preferred_states;
>> +	if (isdigit(str[0]))
>> +		preferred_states_mask = simple_strtoul(str, &str, 0);
>> +	else if (str[0])
>> +	{
>> +		const char *ss;
>> +
>> +		do {
>> +			const struct cpuidle_state *state = icpu->state_table;
>> +			unsigned int bit = 1;
>> +
>> +			ss = strchr(str, ',');
>> +			if (!ss)
>> +				ss = strchr(str, '\0');
>> +
>> +			for (; state->name[0]; ++state) {
>> +				bit <<= 1;
>> +				if (!cmdline_strcmp(str, state->name)) {
>> +					preferred_states_mask |= bit;
>> +					break;
>> +				}
>> +			}
>> +			if (!state->name[0])
>> +				break;
>> +
>> +			str = ss + 1;
>> +	    } while (*ss);
>> +
>> +	    str -= str == ss + 1;
> 
> I would add parentheses to the sum for clarity.

If I was to add parentheses here, then like this:

    str -= (str == ss + 1);

Looks like I've screwed up with indentation here, though.

>> @@ -1400,8 +1484,18 @@ static int cf_check mwait_idle_cpu_init(
>>  	if (icpu->byt_auto_demotion_disable_flag)
>>  		on_selected_cpus(cpumask_of(cpu), byt_auto_demotion_disable, NULL, 1);
>>  
>> -	if (icpu->disable_promotion_to_c1e)
>> +	switch (icpu->c1e_promotion) {
>> +	case C1E_PROMOTION_DISABLE:
>>  		on_selected_cpus(cpumask_of(cpu), c1e_promotion_disable, NULL, 1);
>> +		break;
>> +
>> +	case C1E_PROMOTION_ENABLE:
>> +		on_selected_cpus(cpumask_of(cpu), c1e_promotion_enable, NULL, 1);
>> +		break;
>> +
>> +	case C1E_PROMOTION_PRESERVE:
>> +		break;
>> +	}
> 
> I find it kind of weird to use a notifier for this; won't it be
> easier to set the C1E promotion as part of the CPU bringup process?

A CPU notifier _is_ part of the CPU bringup process, isn't it? So it's
not clear to me what alternative you're thinking of.

> I see we also set other bits in the same way.

Well, yes - right here I only extend what we already have in place.
Re-working that in whichever way ought to be a separate topic imo, and
preferably not be part of a port of a commit coming from Linux.

Jan



* Re: [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option
  2022-08-18 13:03 ` [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option Jan Beulich
  2022-10-11 16:22   ` Roger Pau Monné
@ 2022-10-13 11:52   ` Roger Pau Monné
  1 sibling, 0 replies; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-13 11:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Aug 18, 2022 at 03:03:33PM +0200, Jan Beulich wrote:
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> 
> On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
> exclusive - only one of them can be enabled. By default, 'intel_idle' driver
> enables C1 and disables C1E. However, some users prefer to use C1E instead of
> C1, because it saves more energy.
> 
> This patch adds a new module parameter ('preferred_cstates') for enabling C1E
> and disabling C1. Here is the idea behind it.
> 
> 1. This option has effect only for "mutually exclusive" C-states like C1 and
>    C1E on SPR.
> 2. It does not have any effect on independent C-states, which do not require
>    other C-states to be disabled (most states on most platforms as of today).
> 3. For mutually exclusive C-states, the 'intel_idle' driver always has a
>    reasonable default, such as enabling C1 on SPR by default. On other
>    platforms, the default may be different.
> 4. Users can override the default using the 'preferred_cstates' parameter.
> 5. The parameter accepts the preferred C-states bit-mask, similarly to the
>    existing 'states_off' parameter.
> 6. This parameter is not limited to C1/C1E, and leaves room for supporting
>    other mutually exclusive C-states, if they come in the future.
> 
> Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
> to disable C1 and enable C1E, users should boot with the following kernel
> argument: intel_idle.preferred_cstates=4
> 
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git da0e58c038e6
> 
> Enable C1E (if requested) not only on the BSP's socket / package. Alter
> command line option to fit our model, and extend it to also accept
> string form arguments.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

With the indentation fixes.

Thanks, Roger.



* Re: [PATCH v3 3/5] x86/mwait-idle: add AlderLake support
  2022-08-18 13:04 ` [PATCH v3 3/5] x86/mwait-idle: add AlderLake support Jan Beulich
@ 2022-10-13 11:55   ` Roger Pau Monné
  0 siblings, 0 replies; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-13 11:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Aug 18, 2022 at 03:04:28PM +0200, Jan Beulich wrote:
> From: Zhang Rui <rui.zhang@intel.com>
> 
> Similar to SPR, the C1 and C1E states on ADL are mutually exclusive.
> Only one of them can be enabled at a time.
> 
> But contrast to SPR, which usually has a strong latency requirement
> as a Xeon processor, C1E is preferred on ADL for better energy
> efficiency.
> 
> Add custom C-state tables for ADL with both C1 and C1E, and
> 
>  1. Enable the "C1E promotion" bit in MSR_IA32_POWER_CTL and mark C1
>     with the CPUIDLE_FLAG_UNUSABLE flag, so C1 is not available by
>     default.
> 
>  2. Add support for the "preferred_cstates" module parameter, so that
>     users can choose to use C1 instead of C1E by booting with
>     "intel_idle.preferred_cstates=2".
> 
> Separate custom C-state tables are introduced for the ADL mobile and
> desktop processors, because of the exit latency differences between
> these two variants, especially with respect to PC10.
> 
> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> [ rjw: Changelog edits, code rearrangement ]
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git d1cf8bbfed1e
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-08-18 13:04 ` [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle Jan Beulich
@ 2022-10-13 12:03   ` Roger Pau Monné
  2022-10-13 12:17     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-13 12:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Aug 18, 2022 at 03:04:51PM +0200, Jan Beulich wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Having IBRS enabled while the SMT sibling is idle unnecessarily slows
> down the running sibling. OTOH, disabling IBRS around idle takes two
> MSR writes, which will increase the idle latency.
> 
> Therefore, only disable IBRS around deeper idle states. Shallow idle
> states are bounded by the tick in duration, since NOHZ is not allowed
> for them by virtue of their short target residency.
> 
> Only do this for mwait-driven idle, since that keeps interrupts disabled
> across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
> 
> Note: C6 is a random threshold, most importantly C1 probably shouldn't
> disable IBRS, benchmarking needed.
> 
> Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

One unrelated comment below.

> ---
> v3: New.
> 
> --- a/xen/arch/x86/cpu/mwait-idle.c
> +++ b/xen/arch/x86/cpu/mwait-idle.c
> @@ -141,6 +141,12 @@ static const struct cpuidle_state {
>  #define CPUIDLE_FLAG_TLB_FLUSHED	0x10000
>  
>  /*
> + * Disable IBRS across idle (when KERNEL_IBRS), is exclusive vs IRQ_ENABLE
> + * above.
> + */
> +#define CPUIDLE_FLAG_IBRS		0x20000
> +
> +/*
>   * MWAIT takes an 8-bit "hint" in EAX "suggesting"
>   * the C-state (top nibble) and sub-state (bottom nibble)
>   * 0x00 means "MWAIT(C1)", 0x10 means "MWAIT(C2)" etc.
> @@ -530,31 +536,31 @@ static struct cpuidle_state __read_mostl
>  	},
>  	{
>  		.name = "C6",
> -		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
> +		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
>  		.exit_latency = 85,
>  		.target_residency = 200,
>  	},
>  	{
>  		.name = "C7s",
> -		.flags = MWAIT2flg(0x33) | CPUIDLE_FLAG_TLB_FLUSHED,
> +		.flags = MWAIT2flg(0x33) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
>  		.exit_latency = 124,
>  		.target_residency = 800,
>  	},
>  	{
>  		.name = "C8",
> -		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED,
> +		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
>  		.exit_latency = 200,
>  		.target_residency = 800,
>  	},
>  	{
>  		.name = "C9",
> -		.flags = MWAIT2flg(0x50) | CPUIDLE_FLAG_TLB_FLUSHED,
> +		.flags = MWAIT2flg(0x50) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
>  		.exit_latency = 480,
>  		.target_residency = 5000,
>  	},
>  	{
>  		.name = "C10",
> -		.flags = MWAIT2flg(0x60) | CPUIDLE_FLAG_TLB_FLUSHED,
> +		.flags = MWAIT2flg(0x60) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
>  		.exit_latency = 890,
>  		.target_residency = 5000,
>  	},
> @@ -576,7 +582,7 @@ static struct cpuidle_state __read_mostl
>  	},
>  	{
>  		.name = "C6",
> -		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
> +		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED | CPUIDLE_FLAG_IBRS,
>  		.exit_latency = 133,
>  		.target_residency = 600,
>  	},
> @@ -906,6 +912,7 @@ static const struct cpuidle_state snr_cs
>  static void cf_check mwait_idle(void)
>  {
>  	unsigned int cpu = smp_processor_id();
> +	struct cpu_info *info = get_cpu_info();
>  	struct acpi_processor_power *power = processor_powers[cpu];
>  	struct acpi_processor_cx *cx = NULL;
>  	unsigned int next_state;
> @@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
>  			pm_idle_save();
>  		else
>  		{
> -			struct cpu_info *info = get_cpu_info();
> -
>  			spec_ctrl_enter_idle(info);
>  			safe_halt();
>  			spec_ctrl_exit_idle(info);

Do we need to disable speculation just for the hlt if there's no
C state change?

It would seem to me like the MSR writes could add a lot of latency
when there's no C state change.

Maybe I'm confused.

Thanks, Roger.



* Re: [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent
  2022-08-18 13:05 ` [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent Jan Beulich
@ 2022-10-13 12:05   ` Roger Pau Monné
  2022-10-13 12:20     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-13 12:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Aug 18, 2022 at 03:05:19PM +0200, Jan Beulich wrote:
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> 
> This patch partially reverts the changes made by the following commit:
> 
> da0e58c038e6 intel_idle: add 'preferred_cstates' module argument
> 
> As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
> C1E states were mutually exclusive, so that users could only have either C1 and
> C6, or C1E and C6.
> 
> However, Intel firmware engineers managed to remove this limitation and make C1
> and C1E to be completely independent, just like on previous Xeon platforms.
> 
> Therefore, this patch:
>  * Removes commentary describing the old, and now non-existing SPR C1E
>    limitation.
>  * Marks SPR C1E as available by default.
>  * Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
>    C1E will be available regardless of 'preferred_cstates' value.
> 
> We expect that all SPR systems are shipping with new firmware, which includes
> the C1/C1E improvement.
> 
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1548fac47a11
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

I guess we need to be careful about running this on pre-production
hardware then?

Thanks, Roger.



* Re: [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-10-13 12:03   ` Roger Pau Monné
@ 2022-10-13 12:17     ` Jan Beulich
  2022-10-13 13:10       ` Roger Pau Monné
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-10-13 12:17 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 13.10.2022 14:03, Roger Pau Monné wrote:
> On Thu, Aug 18, 2022 at 03:04:51PM +0200, Jan Beulich wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> Having IBRS enabled while the SMT sibling is idle unnecessarily slows
>> down the running sibling. OTOH, disabling IBRS around idle takes two
>> MSR writes, which will increase the idle latency.
>>
>> Therefore, only disable IBRS around deeper idle states. Shallow idle
>> states are bounded by the tick in duration, since NOHZ is not allowed
>> for them by virtue of their short target residency.
>>
>> Only do this for mwait-driven idle, since that keeps interrupts disabled
>> across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
>>
>> Note: C6 is a random threshold, most importantly C1 probably shouldn't
>> disable IBRS, benchmarking needed.
>>
>> Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Borislav Petkov <bp@suse.de>
>> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
>> Signed-off-by: Borislav Petkov <bp@suse.de>
>> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> One unrelated comment below.
> [...]
>> @@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
>>  			pm_idle_save();
>>  		else
>>  		{
>> -			struct cpu_info *info = get_cpu_info();
>> -
>>  			spec_ctrl_enter_idle(info);
>>  			safe_halt();
>>  			spec_ctrl_exit_idle(info);
> 
> Do we need to disable speculation just for the hlt if there's no
> C state change?
> 
> It would seem to me like the MSR writes could add a lot of latency
> when there's no C state change.

HLT enters (at least) C1, so is a C-state change to me as well. Plus
we may remain there for a while, and during that time we'd like to
not unduly impact the other thread.

Jan



* Re: [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent
  2022-10-13 12:05   ` Roger Pau Monné
@ 2022-10-13 12:20     ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2022-10-13 12:20 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 13.10.2022 14:05, Roger Pau Monné wrote:
> On Thu, Aug 18, 2022 at 03:05:19PM +0200, Jan Beulich wrote:
>> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>>
>> This patch partially reverts the changes made by the following commit:
>>
>> da0e58c038e6 intel_idle: add 'preferred_cstates' module argument
>>
>> As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
>> C1E states were mutually exclusive, so that users could only have either C1 and
>> C6, or C1E and C6.
>>
>> However, Intel firmware engineers managed to remove this limitation and make C1
>> and C1E to be completely independent, just like on previous Xeon platforms.
>>
>> Therefore, this patch:
>>  * Removes commentary describing the old, and now non-existing SPR C1E
>>    limitation.
>>  * Marks SPR C1E as available by default.
>>  * Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
>>    C1E will be available regardless of 'preferred_cstates' value.
>>
>> We expect that all SPR systems are shipping with new firmware, which includes
>> the C1/C1E improvement.
>>
>> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1548fac47a11
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> I guess we need to be careful of running this on pre-production
> hardware then?

Well, power savings may not be as expected there, but beyond that I don't
think there would be much of an observable effect.

Jan



* [4.17?] Re: [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support
  2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
                   ` (5 preceding siblings ...)
  2022-08-18 13:09 ` [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
@ 2022-10-13 12:25 ` Jan Beulich
  2022-10-13 12:46   ` Henry Wang
  6 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-10-13 12:25 UTC (permalink / raw)
  To: Henry Wang; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

Henry,

On 18.08.2022 15:02, Jan Beulich wrote:
> New changes have appeared in the meantime, in particular one partly undoing
> what we still haven't merged (patch 1 here).
> 
> 1: add 'preferred_cstates' module argument
> 2: add core C6 optimization for SPR
> 3: add AlderLake support
> 4: disable IBRS during long idle
> 5: make SPR C1 and C1E be independent

acks have just now finished coming in for this series of ports of earlier
Linux commits. This all targets newer hardware which of course we'd like
to properly support with 4.17. Unfortunately the series was kind of lost
with all the security work which had been ongoing. Therefore I'd like to
ask (in part also motivated by Andrew to do so) that you consider still
allowing in this series. In the (I think) not very likely case of problems
(first and foremost because of the changes coming from Linux, rather than
being homegrown) people always have the option of simply turning off the
use of this driver on their systems.

Thanks, Jan



* RE: [4.17?] Re: [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support
  2022-10-13 12:25 ` [4.17?] " Jan Beulich
@ 2022-10-13 12:46   ` Henry Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Henry Wang @ 2022-10-13 12:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Subject: [4.17?] Re: [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new)
> ADL support
> 
> Henry,
> 
> On 18.08.2022 15:02, Jan Beulich wrote:
> > New changes have appeared in the meantime, in particular one partly
> undoing
> > what we still haven't merged (patch 1 here).
> >
> > 1: add 'preferred_cstates' module argument
> > 2: add core C6 optimization for SPR
> > 3: add AlderLake support
> > 4: disable IBRS during long idle
> > 5: make SPR C1 and C1E be independent
> 
> acks have just now finished coming in for this series of ports of earlier
> Linux commits. It all targets newer hardware which of course we'd like
> to properly support with 4.17. Unfortunately the series was somewhat lost
> amid all the security work which had been ongoing. Therefore I'd like to
> ask (in part also prompted by Andrew) that you consider still allowing
> this series in. In the (I think) unlikely case of problems (first and
> foremost because the changes come from Linux rather than being homegrown),
> people always have the option of simply turning off the use of this
> driver on their systems.

I just checked the x86 gitlab board for the list and saw this series is on it. So
yes, please add my release ack. For now we still have 2 weeks before the release,
and I think that would be a reasonable time for bugs to emerge.

To set expectations, I will personally be more conservative between the RC3
and RC4 weeks, because of the fear of last-minute bugs.

Kind regards,
Henry

> 
> Thanks, Jan


* Re: [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-10-13 12:17     ` Jan Beulich
@ 2022-10-13 13:10       ` Roger Pau Monné
  2022-10-13 14:15         ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-13 13:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Oct 13, 2022 at 02:17:54PM +0200, Jan Beulich wrote:
> On 13.10.2022 14:03, Roger Pau Monné wrote:
> > On Thu, Aug 18, 2022 at 03:04:51PM +0200, Jan Beulich wrote:
> >> From: Peter Zijlstra <peterz@infradead.org>
> >>
> >> Having IBRS enabled while the SMT sibling is idle unnecessarily slows
> >> down the running sibling. OTOH, disabling IBRS around idle takes two
> >> MSR writes, which will increase the idle latency.
> >>
> >> Therefore, only disable IBRS around deeper idle states. Shallow idle
> >> states are bounded by the tick in duration, since NOHZ is not allowed
> >> for them by virtue of their short target residency.
> >>
> >> Only do this for mwait-driven idle, since that keeps interrupts disabled
> >> across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
> >>
> >> Note: C6 is a random threshold, most importantly C1 probably shouldn't
> >> disable IBRS, benchmarking needed.
> >>
> >> Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Signed-off-by: Borislav Petkov <bp@suse.de>
> >> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
> >> Signed-off-by: Borislav Petkov <bp@suse.de>
> >> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > One unrelated comment below.
> > [...]
> >> @@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
> >>  			pm_idle_save();
> >>  		else
> >>  		{
> >> -			struct cpu_info *info = get_cpu_info();
> >> -
> >>  			spec_ctrl_enter_idle(info);
> >>  			safe_halt();
> >>  			spec_ctrl_exit_idle(info);
> > 
> > Do we need to disable speculation just for the hlt if there's no
> > C state change?
> > 
> > It would seem to me like the MSR writes could add a lot of latency
> > when there's no C state change.
> 
> HLT enters (at least) C1, so is a C-state change to me as well. Plus
> we may remain there for a while, and during that time we'd like to
> not unduly impact the other thread.

OK, but it's not a "deeper C state" as mentioned in the commit
message.

Thanks, Roger.
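The IBRS-around-idle behaviour being discussed in this sub-thread can be
sketched as follows. This is a simplified model, not Xen's actual code: the
C6 threshold comes from the patch ("C6 is a random threshold"), while the
helper names and the MSR bookkeeping here are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS        (1u << 0)
#define IBRS_CSTATE_THRESHOLD 6   /* "C6 is a random threshold" per the patch */

static uint32_t spec_ctrl_msr = SPEC_CTRL_IBRS; /* IBRS enabled while running */
static unsigned int msr_writes;                 /* models the added latency */

/* Stand-in for wrmsr(MSR_SPEC_CTRL, val); counts writes to expose the cost. */
static void wrmsr_spec_ctrl(uint32_t val)
{
    spec_ctrl_msr = val;
    msr_writes++;
}

/* Drop IBRS only for C6 and deeper, so shallow states (C1/C1E) avoid the
 * two extra MSR writes that would dominate their short target residency. */
static void spec_ctrl_enter_idle_model(unsigned int target_cstate)
{
    if (target_cstate >= IBRS_CSTATE_THRESHOLD)
        wrmsr_spec_ctrl(spec_ctrl_msr & ~SPEC_CTRL_IBRS);
    /* mwait/hlt would follow here */
}

static void spec_ctrl_exit_idle_model(unsigned int target_cstate)
{
    if (target_cstate >= IBRS_CSTATE_THRESHOLD)
        wrmsr_spec_ctrl(spec_ctrl_msr | SPEC_CTRL_IBRS);
}
```

The model makes the trade-off concrete: a C1 entry/exit pair touches the MSR
zero times, while a C6 pair costs exactly the two writes the commit message
mentions.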



* Re: [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-10-13 13:10       ` Roger Pau Monné
@ 2022-10-13 14:15         ` Jan Beulich
  2022-10-14  8:03           ` Roger Pau Monné
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2022-10-13 14:15 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 13.10.2022 15:10, Roger Pau Monné wrote:
> On Thu, Oct 13, 2022 at 02:17:54PM +0200, Jan Beulich wrote:
>> On 13.10.2022 14:03, Roger Pau Monné wrote:
>>> On Thu, Aug 18, 2022 at 03:04:51PM +0200, Jan Beulich wrote:
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> Having IBRS enabled while the SMT sibling is idle unnecessarily slows
>>>> down the running sibling. OTOH, disabling IBRS around idle takes two
>>>> MSR writes, which will increase the idle latency.
>>>>
>>>> Therefore, only disable IBRS around deeper idle states. Shallow idle
>>>> states are bounded by the tick in duration, since NOHZ is not allowed
>>>> for them by virtue of their short target residency.
>>>>
>>>> Only do this for mwait-driven idle, since that keeps interrupts disabled
>>>> across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
>>>>
>>>> Note: C6 is a random threshold, most importantly C1 probably shouldn't
>>>> disable IBRS, benchmarking needed.
>>>>
>>>> Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>> Signed-off-by: Borislav Petkov <bp@suse.de>
>>>> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
>>>> Signed-off-by: Borislav Petkov <bp@suse.de>
>>>> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Thanks.
>>
>>> One unrelated comment below.
>>> [...]
>>>> @@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
>>>>  			pm_idle_save();
>>>>  		else
>>>>  		{
>>>> -			struct cpu_info *info = get_cpu_info();
>>>> -
>>>>  			spec_ctrl_enter_idle(info);
>>>>  			safe_halt();
>>>>  			spec_ctrl_exit_idle(info);
>>>
>>> Do we need to disable speculation just for the hlt if there's no
>>> C state change?
>>>
>>> It would seem to me like the MSR writes could add a lot of latency
>>> when there's no C state change.
>>
>> HLT enters (at least) C1, so is a C-state change to me as well. Plus
>> we may remain there for a while, and during that time we'd like to
>> not unduly impact the other thread.
> 
> OK, but it's not a "deeper C state" as mentioned in the commit
> message.

Correct. But it's also code not being altered by this commit.

Jan



* Re: [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-10-13 14:15         ` Jan Beulich
@ 2022-10-14  8:03           ` Roger Pau Monné
  2022-10-14  8:53             ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Roger Pau Monné @ 2022-10-14  8:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Oct 13, 2022 at 04:15:04PM +0200, Jan Beulich wrote:
> On 13.10.2022 15:10, Roger Pau Monné wrote:
> > On Thu, Oct 13, 2022 at 02:17:54PM +0200, Jan Beulich wrote:
> >> On 13.10.2022 14:03, Roger Pau Monné wrote:
> >>> On Thu, Aug 18, 2022 at 03:04:51PM +0200, Jan Beulich wrote:
> >>>> From: Peter Zijlstra <peterz@infradead.org>
> >>>>
> >>>> Having IBRS enabled while the SMT sibling is idle unnecessarily slows
> >>>> down the running sibling. OTOH, disabling IBRS around idle takes two
> >>>> MSR writes, which will increase the idle latency.
> >>>>
> >>>> Therefore, only disable IBRS around deeper idle states. Shallow idle
> >>>> states are bounded by the tick in duration, since NOHZ is not allowed
> >>>> for them by virtue of their short target residency.
> >>>>
> >>>> Only do this for mwait-driven idle, since that keeps interrupts disabled
> >>>> across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
> >>>>
> >>>> Note: C6 is a random threshold, most importantly C1 probably shouldn't
> >>>> disable IBRS, benchmarking needed.
> >>>>
> >>>> Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
> >>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>>> Signed-off-by: Borislav Petkov <bp@suse.de>
> >>>> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
> >>>> Signed-off-by: Borislav Petkov <bp@suse.de>
> >>>> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>
> >>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> >>
> >> Thanks.
> >>
> >>> One unrelated comment below.
> >>> [...]
> >>>> @@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
> >>>>  			pm_idle_save();
> >>>>  		else
> >>>>  		{
> >>>> -			struct cpu_info *info = get_cpu_info();
> >>>> -
> >>>>  			spec_ctrl_enter_idle(info);
> >>>>  			safe_halt();
> >>>>  			spec_ctrl_exit_idle(info);
> >>>
> >>> Do we need to disable speculation just for the hlt if there's no
> >>> C state change?
> >>>
> >>> It would seem to me like the MSR writes could add a lot of latency
> >>> when there's no C state change.
> >>
> >> HLT enters (at least) C1, so is a C-state change to me as well. Plus
> >> we may remain there for a while, and during that time we'd like to
> >> not unduly impact the other thread.
> > 
> > OK, but it's not a "deeper C state" as mentioned in the commit
> > message.
> 
> Correct. But it's also code not being altered by this commit.

Indeed, that's why it's an unrelated comment.  I was just wondering
whether we should drop those in a separate patch.  I'm concerned
about hitting that path in a virtualized environment, where changing
the spec controls is likely not that cheap.

Thanks, Roger.



* Re: [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle
  2022-10-14  8:03           ` Roger Pau Monné
@ 2022-10-14  8:53             ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2022-10-14  8:53 UTC (permalink / raw)
  To: Roger Pau Monné, Andrew Cooper; +Cc: xen-devel, Wei Liu

On 14.10.2022 10:03, Roger Pau Monné wrote:
> On Thu, Oct 13, 2022 at 04:15:04PM +0200, Jan Beulich wrote:
>> On 13.10.2022 15:10, Roger Pau Monné wrote:
>>> On Thu, Oct 13, 2022 at 02:17:54PM +0200, Jan Beulich wrote:
>>>> On 13.10.2022 14:03, Roger Pau Monné wrote:
>>>>> On Thu, Aug 18, 2022 at 03:04:51PM +0200, Jan Beulich wrote:
>>>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>>>
>>>>>> Having IBRS enabled while the SMT sibling is idle unnecessarily slows
>>>>>> down the running sibling. OTOH, disabling IBRS around idle takes two
>>>>>> MSR writes, which will increase the idle latency.
>>>>>>
>>>>>> Therefore, only disable IBRS around deeper idle states. Shallow idle
>>>>>> states are bounded by the tick in duration, since NOHZ is not allowed
>>>>>> for them by virtue of their short target residency.
>>>>>>
>>>>>> Only do this for mwait-driven idle, since that keeps interrupts disabled
>>>>>> across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
>>>>>>
>>>>>> Note: C6 is a random threshold, most importantly C1 probably shouldn't
>>>>>> disable IBRS, benchmarking needed.
>>>>>>
>>>>>> Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>>> Signed-off-by: Borislav Petkov <bp@suse.de>
>>>>>> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
>>>>>> Signed-off-by: Borislav Petkov <bp@suse.de>
>>>>>> Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>
>>>>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>>>>
>>>> Thanks.
>>>>
>>>>> One unrelated comment below.
>>>>> [...]
>>>>>> @@ -932,8 +939,6 @@ static void cf_check mwait_idle(void)
>>>>>>  			pm_idle_save();
>>>>>>  		else
>>>>>>  		{
>>>>>> -			struct cpu_info *info = get_cpu_info();
>>>>>> -
>>>>>>  			spec_ctrl_enter_idle(info);
>>>>>>  			safe_halt();
>>>>>>  			spec_ctrl_exit_idle(info);
>>>>>
>>>>> Do we need to disable speculation just for the hlt if there's no
>>>>> C state change?
>>>>>
>>>>> It would seem to me like the MSR writes could add a lot of latency
>>>>> when there's no C state change.
>>>>
>>>> HLT enters (at least) C1, so is a C-state change to me as well. Plus
>>>> we may remain there for a while, and during that time we'd like to
>>>> not unduly impact the other thread.
>>>
>>> OK, but it's not a "deeper C state" as mentioned in the commit
>>> message.
>>
>> Correct. But it's also code not being altered by this commit.
> 
> Indeed, that's why it's an unrelated comment.  I was just wondering
> whether we should drop those or not in a separate patch.  I'm
> concerned over hitting that path on a virtualized environment, where
> changing the spec controls is likely not that cheap.

Perhaps we want to make spec_ctrl_{enter,exit}_idle() a no-op when
we're running virtualized ourselves?

Jan
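The no-op suggested above could look roughly like this. A sketch under the
assumption that the driver can detect running virtualized (e.g. via the
CPUID hypervisor bit); the `running_virtualized` flag and the function
names are illustrative, not Xen's actual API.

```c
#include <assert.h>
#include <stdbool.h>

/* When Xen itself runs virtualized, each SPEC_CTRL MSR write likely traps
 * to the underlying hypervisor, so the idle-time toggling may cost more
 * than it saves; make it a no-op in that case. */
static bool running_virtualized;
static unsigned int msr_writes;   /* models the trapped-write cost */

static void spec_ctrl_enter_idle_model(void)
{
    if (running_virtualized)
        return;              /* skip the (trapped, expensive) MSR write */
    msr_writes++;            /* stand-in for wrmsr(MSR_SPEC_CTRL, ...) */
}

static void spec_ctrl_exit_idle_model(void)
{
    if (running_virtualized)
        return;
    msr_writes++;
}
```

On bare metal the enter/exit pair still performs its two writes; when the
flag is set, both helpers return immediately, addressing the latency
concern raised earlier in the thread.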



end of thread, other threads:[~2022-10-14  8:54 UTC | newest]

Thread overview: 22+ messages
2022-08-18 13:02 [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
2022-08-18 13:03 ` [PATCH v3 1/5] x86/mwait-idle: add 'preferred-cstates' command line option Jan Beulich
2022-10-11 16:22   ` Roger Pau Monné
2022-10-12 13:46     ` Jan Beulich
2022-10-13 11:52   ` Roger Pau Monné
2022-08-18 13:03 ` [PATCH v3 2/5] x86/mwait-idle: add core C6 optimization for SPR Jan Beulich
2022-08-18 13:04 ` [PATCH v3 3/5] x86/mwait-idle: add AlderLake support Jan Beulich
2022-10-13 11:55   ` Roger Pau Monné
2022-08-18 13:04 ` [PATCH v3 4/5] x86/mwait-idle: disable IBRS during long idle Jan Beulich
2022-10-13 12:03   ` Roger Pau Monné
2022-10-13 12:17     ` Jan Beulich
2022-10-13 13:10       ` Roger Pau Monné
2022-10-13 14:15         ` Jan Beulich
2022-10-14  8:03           ` Roger Pau Monné
2022-10-14  8:53             ` Jan Beulich
2022-08-18 13:05 ` [PATCH v3 5/5] x86/mwait-idle: make SPR C1 and C1E be independent Jan Beulich
2022-10-13 12:05   ` Roger Pau Monné
2022-10-13 12:20     ` Jan Beulich
2022-08-18 13:09 ` [PATCH v3 0/5] x86/mwait-idle: (remaining) SPR + (new) ADL support Jan Beulich
2022-08-19  1:02   ` Henry Wang
2022-10-13 12:25 ` [4.17?] " Jan Beulich
2022-10-13 12:46   ` Henry Wang
