Re: [PATCH] ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems

From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Dave Hansen <dave.hansen@intel.com>, linux-kernel@vger.kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	Len Brown <lenb@kernel.org>,
	Mario Limonciello <Mario.Limonciello@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@alien8.de>,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Subject: Re: [PATCH] ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems
Date: Sat, 24 Sep 2022 02:05:55 +0530	[thread overview]
Message-ID: <775d8042-639a-1ae6-c099-9411e6027f40@amd.com> (raw)
In-Reply-To: <20220922184745.3252932-1-dave.hansen@intel.com>

Hello Dave,

On 9/23/2022 12:17 AM, Dave Hansen wrote:
> Old, circa 2002 chipsets have a bug: they don't go idle when they are
> supposed to.  So, a workaround was added to slow the CPU down and
> ensure that the CPU waits a bit for the chipset to actually go idle.
> This workaround is ancient and has been in place in some form since
> the original kernel ACPI implementation.
> 
> But, this workaround is very painful on modern systems.  The "inl()"
> can take thousands of cycles (see Link: for some more detailed
> numbers and some fun kernel archaeology).
> 
> First and foremost, modern systems should not be using this code.
> Typical Intel systems have not used it in over a decade because it is
> horribly inferior to MWAIT-based idle.
> 
> Despite this, people do seem to be tripping over this workaround on
> AMD system today.
> 
> Limit the "dummy wait" workaround to Intel systems.  Keep Modern AMD
> systems from tripping over the workaround.  Remotely modern Intel
> systems use intel_idle instead of this code and will, in practice,
> remain unaffected by the dummy wait.

I've run 30 runs of tbench with 128 clients on a dual socket Zen3 system
(2 x 64C/128T) and do not see any massive regression like I used to when
we were hitting the dummy wait issue:

Kernel        : baseline      baseline + C2 disabled   baseline + this patch

Min (MB/s)    : 2215.06       33072.10 (+1393.05%)     30519.60 (+1277.82%)
Max (MB/s)    : 32938.80      34399.10                 32699.30
Median (MB/s) : 32191.80      33476.60                 31418.90
AMean (MB/s)  : 22448.55      33649.27 (+49.89%)       31545.93 (+40.52%)
AMean Stddev  : 17526.70      680.14                   1095.39
AMean CoefVar : 78.07%        2.02%                    3.47%

The range is well within the variation we've normally seen with tbench
on the test platform.

> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Mario Limonciello <Mario.Limonciello@amd.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Borislav Petkov <bp@alien8.de>

Can you please add a cc to stable?

Cc: stable@vger.kernel.org

> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Link: https://lore.kernel.org/all/20220921063638.2489-1-kprateek.nayak@amd.com/
> ---
>  drivers/acpi/processor_idle.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 16a1663d02d4..9f40917c49ef 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -531,10 +531,27 @@ static void wait_for_freeze(void)
>  	/* No delay is needed if we are in guest */
>  	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return;
> +	/*
> +	 * Modern (>=Nehalem) Intel systems use ACPI via intel_idle,
> +	 * not this code.  Assume that any Intel systems using this
> +	 * are ancient and may need the dummy wait.  This also assumes
> +	 * that the motivating chipset issue was Intel-only.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)

Based on Andreas's comment, this problem is not limited to Intel chipsets
and affects at least the AMD Athlon on VIA chipset (circa 2006)
(https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/)
To be on safer side, the exception could be made for AMD Fam 17h+ and also
Hygon as pointed out by Peter, where we know the dummy wait is unnecessary.
Extending the condition you proposed, we can have:

	if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON ||
	    ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
	     (boot_cpu_data.x86_model >= 0x17)))
		return;

It is not pretty by any means which is why we can use a x86_BUG_STPCLK to
limit the dummy op to only affected processors. This way, the x86 vendor
check and family check can be avoided in the acpi code. A v2 has been sent
out tackling the problem this way:
https://lore.kernel.org/lkml/20220923153801.9167-1-kprateek.nayak@amd.com/

> +		return;
>  #endif
> -	/* Dummy wait op - must do something useless after P_LVL2 read
> -	   because chipsets cannot guarantee that STPCLK# signal
> -	   gets asserted in time to freeze execution properly. */
> +	/*
> +	 * Dummy wait op - must do something useless after P_LVL2 read
> +	 * because chipsets cannot guarantee that STPCLK# signal gets
> +	 * asserted in time to freeze execution properly
> +	 *
> +	 * This workaround has been in place since the original ACPI
> +	 * implementation was merged, circa 2002.
> +	 *
> +	 * If a profile is pointing to this instruction, please first
> +	 * consider moving your system to a more modern idle
> +	 * mechanism.
> +	 */
>  	inl(acpi_gbl_FADT.xpm_timer_block.address);
>  }
>  

The patch, as it is, solves the problem we've seen on the newer AMD
platforms with large core density that use IOPORT based C-states.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek