All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update
@ 2022-01-18 18:56 Bjorn Andersson
  2022-01-18 18:56 ` [PATCH 2/2] arch_topology: Sanity check cpumask in " Bjorn Andersson
  2022-01-19  6:35 ` [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for " Viresh Kumar
  0 siblings, 2 replies; 8+ messages in thread
From: Bjorn Andersson @ 2022-01-18 18:56 UTC (permalink / raw)
  To: Viresh Kumar, Lukasz Luba, Vladimir Zapolskiy
  Cc: Sudeep Holla, Greg Kroah-Hartman, Rafael J. Wysocki,
	Thara Gopinath, linux-kernel, linux-arm-msm, linux-pm

In the event that the SoC is under thermal pressure while booting it's
possible for the dcvs notification to happen inbetween the cpufreq
framework calling init and it actually updating the policy's
related_cpus cpumask.

Prior to the introduction of the thermal pressure update helper an empty
cpumask would simply result in the thermal pressure of no cpus being
updated, but the new code will attempt to dereference an invalid per_cpu
variable.

Avoid this problem by using the policy's cpus cpumask instead of the
related_cpus mask, as this is initialized before the interrupt is
registered.

Fixes: 0258cb19c77d ("cpufreq: qcom-cpufreq-hw: Use new thermal pressure update function")
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
---
 drivers/cpufreq/qcom-cpufreq-hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
index 05f3d7876e44..866fba3ac6fc 100644
--- a/drivers/cpufreq/qcom-cpufreq-hw.c
+++ b/drivers/cpufreq/qcom-cpufreq-hw.c
@@ -297,7 +297,7 @@ static void qcom_lmh_dcvs_notify(struct qcom_cpufreq_data *data)
 	throttled_freq = freq_hz / HZ_PER_KHZ;
 
 	/* Update thermal pressure (the boost frequencies are accepted) */
-	arch_update_thermal_pressure(policy->related_cpus, throttled_freq);
+	arch_update_thermal_pressure(policy->cpus, throttled_freq);
 
 	/*
 	 * In the unlikely case policy is unregistered do not enable
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] arch_topology: Sanity check cpumask in thermal pressure update
  2022-01-18 18:56 [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update Bjorn Andersson
@ 2022-01-18 18:56 ` Bjorn Andersson
  2022-01-19 10:25   ` Greg Kroah-Hartman
  2022-01-19 14:43   ` Sudeep Holla
  2022-01-19  6:35 ` [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for " Viresh Kumar
  1 sibling, 2 replies; 8+ messages in thread
From: Bjorn Andersson @ 2022-01-18 18:56 UTC (permalink / raw)
  To: Sudeep Holla, Greg Kroah-Hartman, Rafael J. Wysocki,
	Viresh Kumar, Lukasz Luba, Vladimir Zapolskiy
  Cc: Thara Gopinath, linux-kernel, linux-arm-msm, linux-pm

Occasionally during boot the Qualcomm cpufreq driver was able to cause
an invalid memory access in topology_update_thermal_pressure() on the
line:

	if (max_freq <= capped_freq)

It turns out that this was caused by a race, which resulted in the
cpumask passed to the function being empty, in which case
cpumask_first() will return a cpu beyond the number of valid cpus, which
when used to access the per_cpu max_freq would return invalid pointer.

The bug in the Qualcomm cpufreq driver is being fixed, but having a
sanity check of the arguments would have saved quite a bit of time and
it's not unlikely that others will run into the same issue.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
---
 drivers/base/arch_topology.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 976154140f0b..6560a0c3b969 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -177,6 +177,9 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
 	u32 max_freq;
 	int cpu;
 
+	if (WARN_ON(cpumask_empty(cpus)))
+		return;
+
 	cpu = cpumask_first(cpus);
 	max_capacity = arch_scale_cpu_capacity(cpu);
 	max_freq = per_cpu(freq_factor, cpu);
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update
  2022-01-18 18:56 [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update Bjorn Andersson
  2022-01-18 18:56 ` [PATCH 2/2] arch_topology: Sanity check cpumask in " Bjorn Andersson
@ 2022-01-19  6:35 ` Viresh Kumar
  2022-01-19  6:40   ` Viresh Kumar
  1 sibling, 1 reply; 8+ messages in thread
From: Viresh Kumar @ 2022-01-19  6:35 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Lukasz Luba, Vladimir Zapolskiy, Sudeep Holla,
	Greg Kroah-Hartman, Rafael J. Wysocki, Thara Gopinath,
	linux-kernel, linux-arm-msm, linux-pm

On 18-01-22, 10:56, Bjorn Andersson wrote:
> In the event that the SoC is under thermal pressure while booting it's
> possible for the dcvs notification to happen inbetween the cpufreq
> framework calling init and it actually updating the policy's
> related_cpus cpumask.
> 
> Prior to the introduction of the thermal pressure update helper an empty
> cpumask would simply result in the thermal pressure of no cpus being
> updated, but the new code will attempt to dereference an invalid per_cpu
> variable.
> 
> Avoid this problem by using the policy's cpus cpumask instead of the
> related_cpus mask, as this is initialized before the interrupt is
> registered.
> 
> Fixes: 0258cb19c77d ("cpufreq: qcom-cpufreq-hw: Use new thermal pressure update function")
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> ---
>  drivers/cpufreq/qcom-cpufreq-hw.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c b/drivers/cpufreq/qcom-cpufreq-hw.c
> index 05f3d7876e44..866fba3ac6fc 100644
> --- a/drivers/cpufreq/qcom-cpufreq-hw.c
> +++ b/drivers/cpufreq/qcom-cpufreq-hw.c
> @@ -297,7 +297,7 @@ static void qcom_lmh_dcvs_notify(struct qcom_cpufreq_data *data)
>  	throttled_freq = freq_hz / HZ_PER_KHZ;
>  
>  	/* Update thermal pressure (the boost frequencies are accepted) */
> -	arch_update_thermal_pressure(policy->related_cpus, throttled_freq);
> +	arch_update_thermal_pressure(policy->cpus, throttled_freq);

policy->cpus keeps on changing with CPU hotplug and this can leave
your platform in an inconsistent state. For example, in case where you
offline a CPU from policy, other CPUs get their thermal pressure
updated, online the CPU back and all CPUs of a policy don't have the
same settings anymore.

There are few things we can do here now:

- Check for empty related_cpus and return early. Since related_cpus is
  updated only once, this shall work just fine and must not be racy.

  While at it, I think we can also do something like this in
  topology_update_thermal_pressure() instead:

  	cpu = cpumask_first(cpus);
        if (unlikely(cpu >= NR_CPUS))
                return;

- And while writing this email, I dropped all other ideas in favor of
  change to topology_update_thermal_pressure() :)

--
viresh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update
  2022-01-19  6:35 ` [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for " Viresh Kumar
@ 2022-01-19  6:40   ` Viresh Kumar
  2022-01-19 15:05     ` Bjorn Andersson
  0 siblings, 1 reply; 8+ messages in thread
From: Viresh Kumar @ 2022-01-19  6:40 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Lukasz Luba, Vladimir Zapolskiy, Sudeep Holla,
	Greg Kroah-Hartman, Rafael J. Wysocki, Thara Gopinath,
	linux-kernel, linux-arm-msm, linux-pm

On 19-01-22, 12:05, Viresh Kumar wrote:
> policy->cpus keeps on changing with CPU hotplug and this can leave
> your platform in an inconsistent state. For example, in case where you
> offline a CPU from policy, other CPUs get their thermal pressure
> updated, online the CPU back and all CPUs of a policy don't have the
> same settings anymore.
> 
> There are few things we can do here now:
> 
> - Check for empty related_cpus and return early. Since related_cpus is
>   updated only once, this shall work just fine and must not be racy.
> 
>   While at it, I think we can also do something like this in
>   topology_update_thermal_pressure() instead:
> 
>   	cpu = cpumask_first(cpus);
>         if (unlikely(cpu >= NR_CPUS))
>                 return;
> 
> - And while writing this email, I dropped all other ideas in favor of
>   change to topology_update_thermal_pressure() :)

And then I saw your second patch, which looks good as otherwise we
will not be able to catch the bug in our system where we are sending
the empty cpumask :)

So the other idea is:

- Revert, or bring back a new version of this and register the
  interrupt from there. But that is also not a very clean solution.

  commit 4bf8e582119e ("cpufreq: Remove ready() callback")

- 

-- 
viresh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] arch_topology: Sanity check cpumask in thermal pressure update
  2022-01-18 18:56 ` [PATCH 2/2] arch_topology: Sanity check cpumask in " Bjorn Andersson
@ 2022-01-19 10:25   ` Greg Kroah-Hartman
  2022-01-19 14:43   ` Sudeep Holla
  1 sibling, 0 replies; 8+ messages in thread
From: Greg Kroah-Hartman @ 2022-01-19 10:25 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Sudeep Holla, Rafael J. Wysocki, Viresh Kumar, Lukasz Luba,
	Vladimir Zapolskiy, Thara Gopinath, linux-kernel, linux-arm-msm,
	linux-pm

On Tue, Jan 18, 2022 at 10:56:12AM -0800, Bjorn Andersson wrote:
> Occasionally during boot the Qualcomm cpufreq driver was able to cause
> an invalid memory access in topology_update_thermal_pressure() on the
> line:
> 
> 	if (max_freq <= capped_freq)
> 
> It turns out that this was caused by a race, which resulted in the
> cpumask passed to the function being empty, in which case
> cpumask_first() will return a cpu beyond the number of valid cpus, which
> when used to access the per_cpu max_freq would return invalid pointer.
> 
> The bug in the Qualcomm cpufreq driver is being fixed, but having a
> sanity check of the arguments would have saved quite a bit of time and
> it's not unlikely that others will run into the same issue.
> 
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> ---
>  drivers/base/arch_topology.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 976154140f0b..6560a0c3b969 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -177,6 +177,9 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
>  	u32 max_freq;
>  	int cpu;
>  
> +	if (WARN_ON(cpumask_empty(cpus)))
> +		return;

Sorry, but I do not want to add any more WARN_ON() calls to the kernel
unless really needed.  We don't try to save the kernel from itself all
the time by validating every internal api call parameters.

thjanks,

greg k-h

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] arch_topology: Sanity check cpumask in thermal pressure update
  2022-01-18 18:56 ` [PATCH 2/2] arch_topology: Sanity check cpumask in " Bjorn Andersson
  2022-01-19 10:25   ` Greg Kroah-Hartman
@ 2022-01-19 14:43   ` Sudeep Holla
  2022-01-19 15:21     ` Bjorn Andersson
  1 sibling, 1 reply; 8+ messages in thread
From: Sudeep Holla @ 2022-01-19 14:43 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Viresh Kumar,
	Sudeep Holla, Lukasz Luba, Vladimir Zapolskiy, Thara Gopinath,
	linux-kernel, linux-arm-msm, linux-pm

On Tue, Jan 18, 2022 at 10:56:12AM -0800, Bjorn Andersson wrote:
> Occasionally during boot the Qualcomm cpufreq driver was able to cause
> an invalid memory access in topology_update_thermal_pressure() on the
> line:
> 
> 	if (max_freq <= capped_freq)
> 
> It turns out that this was caused by a race, which resulted in the
> cpumask passed to the function being empty, in which case
> cpumask_first() will return a cpu beyond the number of valid cpus, which
> when used to access the per_cpu max_freq would return invalid pointer.
> 
> The bug in the Qualcomm cpufreq driver is being fixed, but having a
> sanity check of the arguments would have saved quite a bit of time and
> it's not unlikely that others will run into the same issue.
> 
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> ---
>  drivers/base/arch_topology.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 976154140f0b..6560a0c3b969 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -177,6 +177,9 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
>  	u32 max_freq;
>  	int cpu;
>  
> +	if (WARN_ON(cpumask_empty(cpus)))
> +		return;
> +

Why can't the caller check and call this only when cpus is not empty ?
IIUC there are many such APIs that use cpumask and could result in similar
issues if called with empty cpus. Probably we could add a note that cpus
must not be empty if that helps the callers ?

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update
  2022-01-19  6:40   ` Viresh Kumar
@ 2022-01-19 15:05     ` Bjorn Andersson
  0 siblings, 0 replies; 8+ messages in thread
From: Bjorn Andersson @ 2022-01-19 15:05 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Lukasz Luba, Vladimir Zapolskiy, Sudeep Holla,
	Greg Kroah-Hartman, Rafael J. Wysocki, Thara Gopinath,
	linux-kernel, linux-arm-msm, linux-pm

On Tue 18 Jan 22:40 PST 2022, Viresh Kumar wrote:

> On 19-01-22, 12:05, Viresh Kumar wrote:
> > policy->cpus keeps on changing with CPU hotplug and this can leave
> > your platform in an inconsistent state. For example, in case where you
> > offline a CPU from policy, other CPUs get their thermal pressure
> > updated, online the CPU back and all CPUs of a policy don't have the
> > same settings anymore.
> > 

Oh, I didn't know that. Then my proposal doesn't seem that awesome.

> > There are few things we can do here now:
> > 
> > - Check for empty related_cpus and return early. Since related_cpus is
> >   updated only once, this shall work just fine and must not be racy.
> > 
> >   While at it, I think we can also do something like this in
> >   topology_update_thermal_pressure() instead:
> > 
> >   	cpu = cpumask_first(cpus);
> >         if (unlikely(cpu >= NR_CPUS))
> >                 return;
> > 
> > - And while writing this email, I dropped all other ideas in favor of
> >   change to topology_update_thermal_pressure() :)
> 
> And then I saw your second patch, which looks good as otherwise we
> will not be able to catch the bug in our system where we are sending
> the empty cpumask :)
> 
> So the other idea is:
> 
> - Revert, or bring back a new version of this and register the
>   interrupt from there. But that is also not a very clean solution.
> 
>   commit 4bf8e582119e ("cpufreq: Remove ready() callback")
> 

We could do this and keep the interrupt disabled until we hit ready().

But I found the resulting issue non-trivial to debug, so I would prefer
if arch_update_thermal_pressure() dealt with the empty cpumask. So as
you suggest in your first reply, I'll respin the second patch alone,
without the WARN_ON().

Thanks,
Bjorn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] arch_topology: Sanity check cpumask in thermal pressure update
  2022-01-19 14:43   ` Sudeep Holla
@ 2022-01-19 15:21     ` Bjorn Andersson
  0 siblings, 0 replies; 8+ messages in thread
From: Bjorn Andersson @ 2022-01-19 15:21 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Viresh Kumar, Lukasz Luba,
	Vladimir Zapolskiy, Thara Gopinath, linux-kernel, linux-arm-msm,
	linux-pm

On Wed 19 Jan 06:43 PST 2022, Sudeep Holla wrote:

> On Tue, Jan 18, 2022 at 10:56:12AM -0800, Bjorn Andersson wrote:
> > Occasionally during boot the Qualcomm cpufreq driver was able to cause
> > an invalid memory access in topology_update_thermal_pressure() on the
> > line:
> > 
> > 	if (max_freq <= capped_freq)
> > 
> > It turns out that this was caused by a race, which resulted in the
> > cpumask passed to the function being empty, in which case
> > cpumask_first() will return a cpu beyond the number of valid cpus, which
> > when used to access the per_cpu max_freq would return invalid pointer.
> > 
> > The bug in the Qualcomm cpufreq driver is being fixed, but having a
> > sanity check of the arguments would have saved quite a bit of time and
> > it's not unlikely that others will run into the same issue.
> > 
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> > ---
> >  drivers/base/arch_topology.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 976154140f0b..6560a0c3b969 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -177,6 +177,9 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
> >  	u32 max_freq;
> >  	int cpu;
> >  
> > +	if (WARN_ON(cpumask_empty(cpus)))
> > +		return;
> > +
> 
> Why can't the caller check and call this only when cpus is not empty ?
> IIUC there are many such APIs that use cpumask and could result in similar
> issues if called with empty cpus. Probably we could add a note that cpus
> must not be empty if that helps the callers ?
> 

As indicated in the commit message, it took me a while to conclude that
the cause for a memory fault on what seemed to be a comparison between
two variables on the stack was actually caused by this race - which
isn't trivially reproducible, unless you know what the bug is.

Now _I_ know better and will hopefully recognize the oops signature
right away, but my hope was to put the sanity check on this side to save
the next caller of this API some time. Updating the comment probably
would have saved me a minute or two at the end, probably as confirmation
of my findings after the fact...

If you prefer to keep topology_update_thermal_pressure() clean(er) and
exciting I can hack around the issue in the Qualcomm driver.

PS. I'm onboard with Greg's objection to the WARN_ON()...

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-01-19 15:21 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-18 18:56 [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for thermal pressure update Bjorn Andersson
2022-01-18 18:56 ` [PATCH 2/2] arch_topology: Sanity check cpumask in " Bjorn Andersson
2022-01-19 10:25   ` Greg Kroah-Hartman
2022-01-19 14:43   ` Sudeep Holla
2022-01-19 15:21     ` Bjorn Andersson
2022-01-19  6:35 ` [PATCH 1/2] cpufreq: qcom-hw: Use initialized cpumask for " Viresh Kumar
2022-01-19  6:40   ` Viresh Kumar
2022-01-19 15:05     ` Bjorn Andersson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.