[1/4] x86, sched: Bail out of frequency invariance if base frequency is unknown

Message ID 20200416054745.740-2-ggherdovich@suse.cz
State Accepted
Commit 9a6c2c3c7a73ce315c57c1b002caad6fcc858d0f
Series
  • Frequency invariance fixes for x86

Commit Message

Giovanni Gherdovich April 16, 2020, 5:47 a.m. UTC
Some hypervisors such as VMWare ESXi 5.5 advertise support for
X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
clock frequency of the CPU (highest non-turbo frequency), producing a
division by zero when computing the ratio turbo_freq/base_freq necessary
for frequency invariant accounting.

It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
couldn't be done in principle (not that it would make a lot of sense in a
VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
that feature.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")
---
 arch/x86/kernel/smpboot.c | 9 +++++++++
 1 file changed, 9 insertions(+)
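
The failure mode in the commit message can be modelled in user space. The sketch below is not the kernel code: compute_turbo_freq_ratio() is a hypothetical stand-in for the tail of intel_set_max_freq_ratio(), and the frequencies are passed in as plain integers instead of being read from MSRs. It only illustrates the guard the patch adds, namely bailing out before dividing by a zero base frequency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as the kernel's SCHED_CAPACITY_SCALE (1 << 10). */
#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * Hypothetical user-space stand-in for the tail of
 * intel_set_max_freq_ratio(). Returns false instead of dividing
 * when base_freq reads as zero, as it does on a hypervisor that
 * zero-fills MSR_PLATFORM_INFO.
 */
bool compute_turbo_freq_ratio(uint64_t base_freq, uint64_t turbo_freq,
			      uint64_t *ratio)
{
	assert(ratio != NULL);

	if (!base_freq)
		return false;	/* base frequency unknown: no invariance */

	*ratio = turbo_freq * SCHED_CAPACITY_SCALE / base_freq;
	return true;
}
```

With a base ratio of 24 and a turbo ratio of 30 the function stores 30 * 1024 / 24 = 1280. With a base ratio of 0 it returns false and leaves the output untouched.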

Comments

Giovanni Gherdovich April 16, 2020, 6:41 a.m. UTC | #1
+Dario Faggioli

On Thu, 2020-04-16 at 07:47 +0200, Giovanni Gherdovich wrote:
> Some hypervisors such as VMWare ESXi 5.5 advertise support for
> X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
> MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
> clock frequency of the CPU (highest non-turbo frequency), producing a
> division by zero when computing the ratio turbo_freq/base_freq necessary
> for frequency invariant accounting.
> 
> It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
> data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
> couldn't be done in principle (not that it would make a lot of sense in a
> VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
> appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
> that feature.
> 
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")
> ---
>  arch/x86/kernel/smpboot.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index fe3ab9632f3b..3a318ec9bc17 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1985,6 +1985,15 @@ static bool intel_set_max_freq_ratio(void)
>  	return false;
>  
>  out:
> +	/*
> +	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
> +	 * but then fill all MSR's with zeroes.
> +	 */
> +	if (!base_freq) {
> +		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
> +		return false;
> +	}
> +
>  	arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
>  					base_freq);
>  	arch_set_max_freq_ratio(turbo_disabled());
Rafael J. Wysocki April 16, 2020, 3:41 p.m. UTC | #2
On Thu, Apr 16, 2020 at 7:48 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> Some hypervisors such as VMWare ESXi 5.5 advertise support for
> X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
> MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
> clock frequency of the CPU (highest non-turbo frequency), producing a
> division by zero when computing the ratio turbo_freq/base_freq necessary
> for frequency invariant accounting.
>
> It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
> data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
> couldn't be done in principle (not that it would make a lot of sense in a
> VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
> appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
> that feature.
>
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")

Please feel free to add

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

to all patches in the series and I'm expecting them to be routed through tip.

Thanks!

> ---
>  arch/x86/kernel/smpboot.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index fe3ab9632f3b..3a318ec9bc17 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1985,6 +1985,15 @@ static bool intel_set_max_freq_ratio(void)
>         return false;
>
>  out:
> +       /*
> +        * Some hypervisors advertise X86_FEATURE_APERFMPERF
> +        * but then fill all MSR's with zeroes.
> +        */
> +       if (!base_freq) {
> +               pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
> +               return false;
> +       }
> +
>         arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
>                                         base_freq);
>         arch_set_max_freq_ratio(turbo_disabled());
> --
> 2.16.4
>
Ricardo Neri April 22, 2020, 5:15 p.m. UTC | #3
On Thu, Apr 16, 2020 at 07:47:42AM +0200, Giovanni Gherdovich wrote:
> Some hypervisors such as VMWare ESXi 5.5 advertise support for
> X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
> MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
> clock frequency of the CPU (highest non-turbo frequency), producing a
> division by zero when computing the ratio turbo_freq/base_freq necessary
> for frequency invariant accounting.
> 
> It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
> data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
> couldn't be done in principle (not that it would make a lot of sense in a
> VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
> appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
> that feature.
> 
> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")
> ---
>  arch/x86/kernel/smpboot.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index fe3ab9632f3b..3a318ec9bc17 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1985,6 +1985,15 @@ static bool intel_set_max_freq_ratio(void)
>  	return false;
>  
>  out:
> +	/*
> +	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
> +	 * but then fill all MSR's with zeroes.
> +	 */
> +	if (!base_freq) {
> +		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
> +		return false;
> +	}

It may be possible that MSR_TURBO_RATIO_LIMIT is also all-zeros. In
such case, turbo_freq will be also zero. If that is the case,
arch_max_freq_ratio will be zero and we will see a division by zero
exception in arch_scale_freq_tick() because mcnt is multiplied by
arch_max_freq_ratio().

Hence, you should also check for !turbo_freq.

Thanks and BR,
Ricardo
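
The second division by zero described above can be sketched as follows. This is a simplified user-space model, assuming the tick-time formula is freq_scale = (acnt << 2*SCHED_CAPACITY_SHIFT) / (mcnt * max_freq_ratio), capped at SCHED_CAPACITY_SCALE. The name model_scale_freq_tick() is hypothetical, and the overflow checks that real code would need are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical model of the per-tick scale computation. acnt and mcnt
 * are the APERF and MPERF deltas since the previous tick. Overflow
 * handling is deliberately omitted in this sketch.
 */
bool model_scale_freq_tick(uint64_t acnt, uint64_t mcnt,
			   uint64_t max_freq_ratio, uint64_t *freq_scale)
{
	uint64_t divisor;

	assert(freq_scale != NULL);

	/* A zero ratio (turbo_freq == 0) makes the divisor zero. */
	divisor = mcnt * max_freq_ratio;
	if (!divisor)
		return false;

	*freq_scale = (acnt << (2 * SCHED_CAPACITY_SHIFT)) / divisor;
	if (*freq_scale > SCHED_CAPACITY_SCALE)
		*freq_scale = SCHED_CAPACITY_SCALE;

	return true;
}
```

In this model a ratio of zero is caught at the divisor check. Without that check the division itself would fault, which is why turbo_freq needs to be validated when the ratio is set up.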
Giovanni Gherdovich April 23, 2020, 8:06 a.m. UTC | #4
On Wed, 2020-04-22 at 10:15 -0700, Ricardo Neri wrote:
> On Thu, Apr 16, 2020 at 07:47:42AM +0200, Giovanni Gherdovich wrote:
> > Some hypervisors such as VMWare ESXi 5.5 advertise support for
> > X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
> > MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
> > clock frequency of the CPU (highest non-turbo frequency), producing a
> > division by zero when computing the ratio turbo_freq/base_freq necessary
> > for frequency invariant accounting.
> > 
> > It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
> > data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
> > couldn't be done in principle (not that it would make a lot of sense in a
> > VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
> > appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
> > that feature.
> > 
> > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> > Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")
> > ---
> >  arch/x86/kernel/smpboot.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index fe3ab9632f3b..3a318ec9bc17 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1985,6 +1985,15 @@ static bool intel_set_max_freq_ratio(void)
> >  	return false;
> >  
> >  out:
> > +	/*
> > +	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
> > +	 * but then fill all MSR's with zeroes.
> > +	 */
> > +	if (!base_freq) {
> > +		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
> > +		return false;
> > +	}
> 
> It may be possible that MSR_TURBO_RATIO_LIMIT is also all-zeros. In
> such case, turbo_freq will be also zero. If that is the case,
> arch_max_freq_ratio will be zero and we will see a division by zero
> exception in arch_scale_freq_tick() because mcnt is multiplied by
> arch_max_freq_ratio().

Thanks Ricardo for clarifying this.

Follow-up question: when I see an all-zeros MSR_TURBO_RATIO_LIMIT, can I
assume the CPU doesn't support turbo boost? Or is it possible that such a CPU
has turbo boost, just the turbo ratios aren't declared in the MSR?

Some context: this feature (called "frequency invariance") wants to know
what's the max clock freq a CPU can have at any time (it needs it for some
scheduler calculations). This is hard to know precisely, because turbo can
kick in at any time and depends on many factors.  So it settles for an
"average maximum frequency", which I decided the 4 cores turbo is a good
estimate for. Now, if an all-zeros MSR_TURBO_RATIO_LIMIT means "turbo boost
unsupported", this is actually the easy case because then I know exactly what
the max freq is (base frequency). If, on the other hand, an all-zeros MSR
means "there may or may not be turbo, and you don't know how much" then I must
disable frequency invariance.


Thanks,
Giovanni
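
For illustration, the two inputs discussed above could be decoded as in the sketch below. The bit positions are assumptions based on the layout used for Core-family CPUs (base ratio in bits 15:8 of MSR_PLATFORM_INFO, 4-core turbo ratio in bits 31:24 of MSR_TURBO_RATIO_LIMIT); other CPU families use different layouts. decode_freq_ratios() is a hypothetical helper that takes raw MSR values as arguments instead of reading hardware, and it takes the conservative reading that a zero in either field means "unknown".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract the base and 4-core turbo ratios from
 * raw MSR values. Returns true only if both ratios are non-zero,
 * i.e. only if frequency invariance has the data it needs.
 */
bool decode_freq_ratios(uint64_t platform_info, uint64_t turbo_ratio_limit,
			uint64_t *base_freq, uint64_t *turbo_freq)
{
	assert(base_freq != NULL && turbo_freq != NULL);

	/* Assumed layout: bits 15:8 hold the max non-turbo ratio. */
	*base_freq = (platform_info >> 8) & 0xFF;
	/* Assumed layout: bits 31:24 hold the 4-core turbo ratio. */
	*turbo_freq = (turbo_ratio_limit >> 24) & 0xFF;

	return *base_freq != 0 && *turbo_freq != 0;
}
```

For example, a MSR_PLATFORM_INFO value of 0x1800 and a MSR_TURBO_RATIO_LIMIT value of 0x1E1F2021 decode to a base ratio of 24 and a 4-core turbo ratio of 30. All-zero MSRs decode to 0 and 0, and the helper reports the data as unusable.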
Ricardo Neri April 24, 2020, 1:32 a.m. UTC | #5
On Thu, Apr 23, 2020 at 10:06:04AM +0200, Giovanni Gherdovich wrote:
> On Wed, 2020-04-22 at 10:15 -0700, Ricardo Neri wrote:
> > On Thu, Apr 16, 2020 at 07:47:42AM +0200, Giovanni Gherdovich wrote:
> > > Some hypervisors such as VMWare ESXi 5.5 advertise support for
> > > X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
> > > MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
> > > clock frequency of the CPU (highest non-turbo frequency), producing a
> > > division by zero when computing the ratio turbo_freq/base_freq necessary
> > > for frequency invariant accounting.
> > > 
> > > It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
> > > data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
> > > couldn't be done in principle (not that it would make a lot of sense in a
> > > VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
> > > appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
> > > that feature.
> > > 
> > > Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> > > Fixes: 1567c3e3467c ("x86, sched: Add support for frequency invariance")
> > > ---
> > >  arch/x86/kernel/smpboot.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > > 
> > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > index fe3ab9632f3b..3a318ec9bc17 100644
> > > --- a/arch/x86/kernel/smpboot.c
> > > +++ b/arch/x86/kernel/smpboot.c
> > > @@ -1985,6 +1985,15 @@ static bool intel_set_max_freq_ratio(void)
> > >  	return false;
> > >  
> > >  out:
> > > +	/*
> > > +	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
> > > +	 * but then fill all MSR's with zeroes.
> > > +	 */
> > > +	if (!base_freq) {
> > > +		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
> > > +		return false;
> > > +	}
> > 
> > It may be possible that MSR_TURBO_RATIO_LIMIT is also all-zeros. In
> > such case, turbo_freq will be also zero. If that is the case,
> > arch_max_freq_ratio will be zero and we will see a division by zero
> > exception in arch_scale_freq_tick() because mcnt is multiplied by
> > arch_max_freq_ratio().
> 
> Thanks Ricardo for clarifying this.
> 
> Follow-up question: when I see an all-zeros MSR_TURBO_RATIO_LIMIT, can I
> assume the CPU doesn't support turbo boost? Or is it possible that such a CPU
> has turbo boost, just the turbo ratios aren't declared in the MSR?
> 
> Some context: this feature (called "frequency invariance") wants to know
> what's the max clock freq a CPU can have at any time (it needs it for some
> scheduler calculations). This is hard to know precisely, because turbo can
> kick in at any time and depends on many factors.  So it settles for an
> "average maximum frequency", which I decided the 4 cores turbo is a good
> estimate for. Now, if an all-zeros MSR_TURBO_RATIO_LIMIT means "turbo boost
> unsupported", this is actually the easy case because then I know exactly what
> the max freq is (base frequency). If, on the other hand, an all-zeros MSR
> means "there may or may not be turbo, and you don't know how much" then I must
> disable frequency invariance.

I'd say that there can be cases in which the CPU has turbo boost and yet the
turbo ratios are not declared in MSR_TURBO_RATIO_LIMIT. Hence, frequency
invariance should be disabled.

Thanks and BR,
Ricardo
Giovanni Gherdovich April 24, 2020, 5:53 a.m. UTC | #6
On Thu, 2020-04-23 at 18:32 -0700, Ricardo Neri wrote:
> On Thu, Apr 23, 2020 at 10:06:04AM +0200, Giovanni Gherdovich wrote:
> > > 
> > > It may be possible that MSR_TURBO_RATIO_LIMIT is also all-zeros. In
> > > such case, turbo_freq will be also zero. If that is the case,
> > > arch_max_freq_ratio will be zero and we will see a division by zero
> > > exception in arch_scale_freq_tick() because mcnt is multiplied by
> > > arch_max_freq_ratio().
> > 
> > Thanks Ricardo for clarifying this.
> > 
> > Follow-up question: when I see an all-zeros MSR_TURBO_RATIO_LIMIT, can I
> > assume the CPU doesn't support turbo boost? Or is it possible that such a CPU
> > has turbo boost, just the turbo ratios aren't declared in the MSR?
> > 
> > Some context: this feature (called "frequency invariance") wants to know
> > what's the max clock freq a CPU can have at any time (it needs it for some
> > scheduler calculations). This is hard to know precisely, because turbo can
> > kick in at any time and depends on many factors.  So it settles for an
> > "average maximum frequency", which I decided the 4 cores turbo is a good
> > estimate for. Now, if an all-zeros MSR_TURBO_RATIO_LIMIT means "turbo boost
> > unsupported", this is actually the easy case because then I know exactly what
> > the max freq is (base frequency). If, on the other hand, an all-zeros MSR
> > means "there may or may not be turbo, and you don't know how much" then I must
> > disable frequency invariance.
> 
> I'd say that there can be cases in which the CPU has turbo boost and yet the
> turbo ratios are not declared in MSR_TURBO_RATIO_LIMIT. Hence, frequency
> invariance should be disabled.

Great, thanks for the information Ricardo!

For the tip tree maintainers: Ricardo is identifying an additional corner case
I need to take care of, but this series stands on its own: the commits
correctly do what their changelog says, and fix existing bugs.

I'll send an additional patch that follows Ricardo's recommendations, and it
will apply on top of this series.


Giovanni

Patch

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fe3ab9632f3b..3a318ec9bc17 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1985,6 +1985,15 @@  static bool intel_set_max_freq_ratio(void)
 	return false;
 
 out:
+	/*
+	 * Some hypervisors advertise X86_FEATURE_APERFMPERF
+	 * but then fill all MSR's with zeroes.
+	 */
+	if (!base_freq) {
+		pr_debug("Couldn't determine cpu base frequency, necessary for scale-invariant accounting.\n");
+		return false;
+	}
+
 	arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
 					base_freq);
 	arch_set_max_freq_ratio(turbo_disabled());