[PATCH] cpufreq: flush any pending policy update work scheduled before freeing

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
@ 2019-10-17 16:35 Sudeep Holla
  2019-10-17 19:36 ` Rafael J. Wysocki
  2019-10-18  5:38 ` Viresh Kumar
  0 siblings, 2 replies; 21+ messages in thread
From: Sudeep Holla @ 2019-10-17 16:35 UTC (permalink / raw)
  To: Viresh Kumar, Rafael J . Wysocki; +Cc: Sudeep Holla, linux-pm, linux-kernel

dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
which schedule policy update work. It may end up racing with the freeing
the policy and unregistering the driver.

One possible race is as below where the cpufreq_driver is unregistered
but the scheduled work gets executed at later stage when cpufreq_driver
is NULL(i.e. after freeing the policy and driver)

Unable to handle kernel NULL pointer dereference at virtual address 0000001c
pgd = (ptrval)
[0000001c] *pgd=80000080204003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP THUMB2
Modules linked in:
CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
Hardware name: ARM-Versatile Express
Workqueue: events handle_update
PC is at cpufreq_set_policy+0x58/0x228
LR is at dev_pm_qos_read_value+0x77/0xac
Control: 70c5387d  Table: 80203000  DAC: fffffffd
Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
	(cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
	(refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
	(handle_update) from (process_one_work+0x16d/0x3cc)
	(process_one_work) from (worker_thread+0xff/0x414)
	(worker_thread) from (kthread+0xff/0x100)
	(kthread) from (ret_from_fork+0x11/0x28)

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/cpufreq/cpufreq.c | 3 +++
 1 file changed, 3 insertions(+)

Hi Rafael, Viresh,

This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
I have based this patch on -rc3 and not on top of your patches. This
only fixes the boot issue but I hit the other crashes while continuously
switching on and off the bL switcher that register/unregister the driver
Your patch series fixes them. I can based this on top of those if you
prefer.

Regards,
Sudeep

[1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index c52d6fa32aac..b703c29a84be 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
 	}
 
 	dev_pm_qos_remove_request(policy->min_freq_req);
+	/* flush the pending policy->update work before freeing the policy */
+	if (work_pending(&policy->update))
+		flush_work(&policy->update);
 	kfree(policy->min_freq_req);
 
 	cpufreq_policy_put_kobj(policy);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-17 16:35 [PATCH] cpufreq: flush any pending policy update work scheduled before freeing Sudeep Holla
@ 2019-10-17 19:36 ` Rafael J. Wysocki
  2019-10-17 21:26   ` Rafael J. Wysocki
  2019-10-18  5:47   ` Sudeep Holla
  2019-10-18  5:38 ` Viresh Kumar
  1 sibling, 2 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-17 19:36 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Viresh Kumar, Rafael J . Wysocki, Linux PM, Linux Kernel Mailing List

On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> which schedule policy update work. It may end up racing with the freeing
> the policy and unregistering the driver.
>
> One possible race is as below where the cpufreq_driver is unregistered
> but the scheduled work gets executed at later stage when cpufreq_driver
> is NULL(i.e. after freeing the policy and driver)
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> pgd = (ptrval)
> [0000001c] *pgd=80000080204003, *pmd=00000000
> Internal error: Oops: 206 [#1] SMP THUMB2
> Modules linked in:
> CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> Hardware name: ARM-Versatile Express
> Workqueue: events handle_update
> PC is at cpufreq_set_policy+0x58/0x228
> LR is at dev_pm_qos_read_value+0x77/0xac
> Control: 70c5387d  Table: 80203000  DAC: fffffffd
> Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
>         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
>         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
>         (handle_update) from (process_one_work+0x16d/0x3cc)
>         (process_one_work) from (worker_thread+0xff/0x414)
>         (worker_thread) from (kthread+0xff/0x100)
>         (kthread) from (ret_from_fork+0x11/0x28)
>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>  drivers/cpufreq/cpufreq.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> Hi Rafael, Viresh,
>
> This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> I have based this patch on -rc3 and not on top of your patches. This
> only fixes the boot issue but I hit the other crashes while continuously
> switching on and off the bL switcher that register/unregister the driver
> Your patch series fixes them. I can based this on top of those if you
> prefer.
>
> Regards,
> Sudeep
>
> [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index c52d6fa32aac..b703c29a84be 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>         }
>
>         dev_pm_qos_remove_request(policy->min_freq_req);
> +       /* flush the pending policy->update work before freeing the policy */
> +       if (work_pending(&policy->update))

Isn't this racy?

It still may be running if the pending bit is clear and we still need
to wait for it then, don't we?

Why don't you do an unconditional flush_work() here?

> +               flush_work(&policy->update);
>         kfree(policy->min_freq_req);
>
>         cpufreq_policy_put_kobj(policy);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-17 19:36 ` Rafael J. Wysocki
@ 2019-10-17 21:26   ` Rafael J. Wysocki
  2019-10-18  5:55     ` Sudeep Holla
  2019-10-18  5:47   ` Sudeep Holla
  1 sibling, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-17 21:26 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Viresh Kumar, Rafael J . Wysocki, Linux PM, Linux Kernel Mailing List

On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> >
> > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > which schedule policy update work. It may end up racing with the freeing
> > the policy and unregistering the driver.
> >
> > One possible race is as below where the cpufreq_driver is unregistered
> > but the scheduled work gets executed at later stage when cpufreq_driver
> > is NULL(i.e. after freeing the policy and driver)
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > pgd = (ptrval)
> > [0000001c] *pgd=80000080204003, *pmd=00000000
> > Internal error: Oops: 206 [#1] SMP THUMB2
> > Modules linked in:
> > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > Hardware name: ARM-Versatile Express
> > Workqueue: events handle_update
> > PC is at cpufreq_set_policy+0x58/0x228
> > LR is at dev_pm_qos_read_value+0x77/0xac
> > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> >         (handle_update) from (process_one_work+0x16d/0x3cc)
> >         (process_one_work) from (worker_thread+0xff/0x414)
> >         (worker_thread) from (kthread+0xff/0x100)
> >         (kthread) from (ret_from_fork+0x11/0x28)
> >
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > ---
> >  drivers/cpufreq/cpufreq.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > Hi Rafael, Viresh,
> >
> > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > I have based this patch on -rc3 and not on top of your patches. This
> > only fixes the boot issue but I hit the other crashes while continuously
> > switching on and off the bL switcher that register/unregister the driver
> > Your patch series fixes them. I can based this on top of those if you
> > prefer.
> >
> > Regards,
> > Sudeep
> >
> > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index c52d6fa32aac..b703c29a84be 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> >         }
> >
> >         dev_pm_qos_remove_request(policy->min_freq_req);
> > +       /* flush the pending policy->update work before freeing the policy */
> > +       if (work_pending(&policy->update))
>
> Isn't this racy?
>
> It still may be running if the pending bit is clear and we still need
> to wait for it then, don't we?
>
> Why don't you do an unconditional flush_work() here?

You may as well do a cancel_work_sync() here, because whether or not
the last update of the policy happens before it goes away is a matter
of timing in any case

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-17 16:35 [PATCH] cpufreq: flush any pending policy update work scheduled before freeing Sudeep Holla
  2019-10-17 19:36 ` Rafael J. Wysocki
@ 2019-10-18  5:38 ` Viresh Kumar
  2019-10-18  7:53   ` Rafael J. Wysocki
  1 sibling, 1 reply; 21+ messages in thread
From: Viresh Kumar @ 2019-10-18  5:38 UTC (permalink / raw)
  To: Sudeep Holla; +Cc: Rafael J . Wysocki, linux-pm, linux-kernel

On 17-10-19, 17:35, Sudeep Holla wrote:
> dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> which schedule policy update work.

I don't think that's correct. We remove the notifiers first and then
only remove the requests. Though it is possible due to the other bug
we are discussing where the notifier doesn't really get removed from
the right CPU, but even that patch didn't fix your issue.

Looks like we are still missing something ?

> It may end up racing with the freeing
> the policy and unregistering the driver.
> 
> One possible race is as below where the cpufreq_driver is unregistered
> but the scheduled work gets executed at later stage when cpufreq_driver
> is NULL(i.e. after freeing the policy and driver)
> 
> Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> pgd = (ptrval)
> [0000001c] *pgd=80000080204003, *pmd=00000000
> Internal error: Oops: 206 [#1] SMP THUMB2
> Modules linked in:
> CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> Hardware name: ARM-Versatile Express
> Workqueue: events handle_update
> PC is at cpufreq_set_policy+0x58/0x228
> LR is at dev_pm_qos_read_value+0x77/0xac
> Control: 70c5387d  Table: 80203000  DAC: fffffffd
> Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> 	(cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> 	(refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> 	(handle_update) from (process_one_work+0x16d/0x3cc)
> 	(process_one_work) from (worker_thread+0xff/0x414)
> 	(worker_thread) from (kthread+0xff/0x100)
> 	(kthread) from (ret_from_fork+0x11/0x28)
> 
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> ---
>  drivers/cpufreq/cpufreq.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> Hi Rafael, Viresh,
> 
> This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> I have based this patch on -rc3 and not on top of your patches. This
> only fixes the boot issue but I hit the other crashes while continuously
> switching on and off the bL switcher that register/unregister the driver
> Your patch series fixes them. I can based this on top of those if you
> prefer.
> 
> Regards,
> Sudeep
> 
> [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index c52d6fa32aac..b703c29a84be 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>  	}
>  
>  	dev_pm_qos_remove_request(policy->min_freq_req);
> +	/* flush the pending policy->update work before freeing the policy */
> +	if (work_pending(&policy->update))
> +		flush_work(&policy->update);

This diff surely makes sense even without the QoS stuff, this race can
still happen, very unlikely though.

And yes, you must use the other variant that Rafael suggested, we are
already doing similar thing in a bunch of cpufreq governors :)

And I will probably add this after calling
dev_pm_qos_remove_notifier() for the MAX policy as this doesn't and
shouldn't depend on removing the qos request.

>  	kfree(policy->min_freq_req);
>  
>  	cpufreq_policy_put_kobj(policy);
> -- 
> 2.17.1

-- 
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-17 19:36 ` Rafael J. Wysocki
  2019-10-17 21:26   ` Rafael J. Wysocki
@ 2019-10-18  5:47   ` Sudeep Holla
  1 sibling, 0 replies; 21+ messages in thread
From: Sudeep Holla @ 2019-10-18  5:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Rafael J . Wysocki, Linux PM,
	Linux Kernel Mailing List, Sudeep Holla

On Thu, Oct 17, 2019 at 09:36:30PM +0200, Rafael J. Wysocki wrote:
> On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> >
> > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > which schedule policy update work. It may end up racing with the freeing
> > the policy and unregistering the driver.
> >
> > One possible race is as below where the cpufreq_driver is unregistered
> > but the scheduled work gets executed at later stage when cpufreq_driver
> > is NULL(i.e. after freeing the policy and driver)
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > pgd = (ptrval)
> > [0000001c] *pgd=80000080204003, *pmd=00000000
> > Internal error: Oops: 206 [#1] SMP THUMB2
> > Modules linked in:
> > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > Hardware name: ARM-Versatile Express
> > Workqueue: events handle_update
> > PC is at cpufreq_set_policy+0x58/0x228
> > LR is at dev_pm_qos_read_value+0x77/0xac
> > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> >         (handle_update) from (process_one_work+0x16d/0x3cc)
> >         (process_one_work) from (worker_thread+0xff/0x414)
> >         (worker_thread) from (kthread+0xff/0x100)
> >         (kthread) from (ret_from_fork+0x11/0x28)
> >
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > ---
> >  drivers/cpufreq/cpufreq.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > Hi Rafael, Viresh,
> >
> > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > I have based this patch on -rc3 and not on top of your patches. This
> > only fixes the boot issue but I hit the other crashes while continuously
> > switching on and off the bL switcher that register/unregister the driver
> > Your patch series fixes them. I can based this on top of those if you
> > prefer.
> >
> > Regards,
> > Sudeep
> >
> > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index c52d6fa32aac..b703c29a84be 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> >         }
> >
> >         dev_pm_qos_remove_request(policy->min_freq_req);
> > +       /* flush the pending policy->update work before freeing the policy */
> > +       if (work_pending(&policy->update))
>
> Isn't this racy?
>
> It still may be running if the pending bit is clear and we still need
> to wait for it then, don't we?
>

Yes, we could end up in such situation.

> Why don't you do an unconditional flush_work() here?
>

Yes that should be fine.

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-17 21:26   ` Rafael J. Wysocki
@ 2019-10-18  5:55     ` Sudeep Holla
  2019-10-18  6:02       ` Viresh Kumar
  0 siblings, 1 reply; 21+ messages in thread
From: Sudeep Holla @ 2019-10-18  5:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Rafael J . Wysocki, Linux PM, Linux Kernel Mailing List

On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > >
> > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > which schedule policy update work. It may end up racing with the freeing
> > > the policy and unregistering the driver.
> > >
> > > One possible race is as below where the cpufreq_driver is unregistered
> > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > is NULL(i.e. after freeing the policy and driver)
> > >
> > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > pgd = (ptrval)
> > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > Modules linked in:
> > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > Hardware name: ARM-Versatile Express
> > > Workqueue: events handle_update
> > > PC is at cpufreq_set_policy+0x58/0x228
> > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > >         (process_one_work) from (worker_thread+0xff/0x414)
> > >         (worker_thread) from (kthread+0xff/0x100)
> > >         (kthread) from (ret_from_fork+0x11/0x28)
> > >
> > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > ---
> > >  drivers/cpufreq/cpufreq.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > Hi Rafael, Viresh,
> > >
> > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > I have based this patch on -rc3 and not on top of your patches. This
> > > only fixes the boot issue but I hit the other crashes while continuously
> > > switching on and off the bL switcher that register/unregister the driver
> > > Your patch series fixes them. I can based this on top of those if you
> > > prefer.
> > >
> > > Regards,
> > > Sudeep
> > >
> > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > >
> > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > index c52d6fa32aac..b703c29a84be 100644
> > > --- a/drivers/cpufreq/cpufreq.c
> > > +++ b/drivers/cpufreq/cpufreq.c
> > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > >         }
> > >
> > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > +       /* flush the pending policy->update work before freeing the policy */
> > > +       if (work_pending(&policy->update))
> >
> > Isn't this racy?
> >
> > It still may be running if the pending bit is clear and we still need
> > to wait for it then, don't we?
> >
> > Why don't you do an unconditional flush_work() here?
> 
> You may as well do a cancel_work_sync() here, because whether or not
> the last update of the policy happens before it goes away is a matter
> of timing in any case

In fact that's the first thing I tried to fix the issue I was seeing.
But I then thought it would be better to complete the update as the PM
QoS were getting updated back to DEFAULT values for the device. Even
this works.

What is your preference ? flush_work or cancel_work_sync ? I will
update accordingly. I may need to do some more testing with
cancel_work_sync as I just checked that quickly to confirm the race.

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  5:55     ` Sudeep Holla
@ 2019-10-18  6:02       ` Viresh Kumar
  2019-10-18  7:32         ` Rafael J. Wysocki
  2019-10-18 10:19         ` Sudeep Holla
  0 siblings, 2 replies; 21+ messages in thread
From: Viresh Kumar @ 2019-10-18  6:02 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Rafael J. Wysocki, Rafael J . Wysocki, Linux PM,
	Linux Kernel Mailing List

On 18-10-19, 06:55, Sudeep Holla wrote:
> On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > >
> > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > which schedule policy update work. It may end up racing with the freeing
> > > > the policy and unregistering the driver.
> > > >
> > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > is NULL(i.e. after freeing the policy and driver)
> > > >
> > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > pgd = (ptrval)
> > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > Modules linked in:
> > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > Hardware name: ARM-Versatile Express
> > > > Workqueue: events handle_update
> > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > >         (worker_thread) from (kthread+0xff/0x100)
> > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > >
> > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > ---
> > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > Hi Rafael, Viresh,
> > > >
> > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > switching on and off the bL switcher that register/unregister the driver
> > > > Your patch series fixes them. I can based this on top of those if you
> > > > prefer.
> > > >
> > > > Regards,
> > > > Sudeep
> > > >
> > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > >
> > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > index c52d6fa32aac..b703c29a84be 100644
> > > > --- a/drivers/cpufreq/cpufreq.c
> > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > >         }
> > > >
> > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > +       if (work_pending(&policy->update))
> > >
> > > Isn't this racy?
> > >
> > > It still may be running if the pending bit is clear and we still need
> > > to wait for it then, don't we?
> > >
> > > Why don't you do an unconditional flush_work() here?
> > 
> > You may as well do a cancel_work_sync() here, because whether or not
> > the last update of the policy happens before it goes away is a matter
> > of timing in any case
> 
> In fact that's the first thing I tried to fix the issue I was seeing.
> But I then thought it would be better to complete the update as the PM
> QoS were getting updated back to DEFAULT values for the device. Even
> this works.
> 
> What is your preference ? flush_work or cancel_work_sync ? I will
> update accordingly. I may need to do some more testing with
> cancel_work_sync as I just checked that quickly to confirm the race.

As I said in the other email, this work didn't come as a result of
removal of the qos request from cpufreq core and so must have come
from other thermal or similar events. In that case maybe doing
flush_work() is better before we remove the cpufreq driver. Though
Rafael's timing related comment makes sense as well, but now that we
have received the work before policy is removed, I will rather
complete the work and quit.

-- 
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  6:02       ` Viresh Kumar
@ 2019-10-18  7:32         ` Rafael J. Wysocki
  2019-10-18  8:03           ` Viresh Kumar
  2019-10-18 10:19         ` Sudeep Holla
  1 sibling, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-18  7:32 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Sudeep Holla, Rafael J. Wysocki, Rafael J . Wysocki, Linux PM,
	Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 8:02 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-10-19, 06:55, Sudeep Holla wrote:
> > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > > >
> > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > > which schedule policy update work. It may end up racing with the freeing
> > > > > the policy and unregistering the driver.
> > > > >
> > > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > > is NULL(i.e. after freeing the policy and driver)
> > > > >
> > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > > pgd = (ptrval)
> > > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > > Modules linked in:
> > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > > Hardware name: ARM-Versatile Express
> > > > > Workqueue: events handle_update
> > > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > > >         (worker_thread) from (kthread+0xff/0x100)
> > > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > > >
> > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > > ---
> > > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > >
> > > > > Hi Rafael, Viresh,
> > > > >
> > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > > switching on and off the bL switcher that register/unregister the driver
> > > > > Your patch series fixes them. I can based this on top of those if you
> > > > > prefer.
> > > > >
> > > > > Regards,
> > > > > Sudeep
> > > > >
> > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > > >
> > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > > index c52d6fa32aac..b703c29a84be 100644
> > > > > --- a/drivers/cpufreq/cpufreq.c
> > > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > > >         }
> > > > >
> > > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > > +       if (work_pending(&policy->update))
> > > >
> > > > Isn't this racy?
> > > >
> > > > It still may be running if the pending bit is clear and we still need
> > > > to wait for it then, don't we?
> > > >
> > > > Why don't you do an unconditional flush_work() here?
> > >
> > > You may as well do a cancel_work_sync() here, because whether or not
> > > the last update of the policy happens before it goes away is a matter
> > > of timing in any case
> >
> > In fact that's the first thing I tried to fix the issue I was seeing.
> > But I then thought it would be better to complete the update as the PM
> > QoS were getting updated back to DEFAULT values for the device. Even
> > this works.
> >
> > What is your preference ? flush_work or cancel_work_sync ? I will
> > update accordingly. I may need to do some more testing with
> > cancel_work_sync as I just checked that quickly to confirm the race.
>
> As I said in the other email, this work didn't come as a result of
> removal of the qos request from cpufreq core and so must have come
> from other thermal or similar events. In that case maybe doing
> flush_work() is better before we remove the cpufreq driver. Though
> Rafael's timing related comment makes sense as well, but now that we
> have received the work before policy is removed, I will rather
> complete the work and quit.

Well, the policy is going away, so the governor has been stopped for
it already.  Even if the limit is updated, it will not be used anyway,
so why bother with updating it?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  5:38 ` Viresh Kumar
@ 2019-10-18  7:53   ` Rafael J. Wysocki
  0 siblings, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-18  7:53 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Sudeep Holla, Rafael J . Wysocki, Linux PM, Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 7:38 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 17-10-19, 17:35, Sudeep Holla wrote:
> > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > which schedule policy update work.
>
> I don't think that's correct. We remove the notifiers first and then
> only remove the requests. Though it is possible due to the other bug
> we are discussing where the notifier doesn't really get removed from
> the right CPU, but even that patch didn't fix your issue.

Right, that async update comes from somewhere else.

> Looks like we are still missing something ?
>
> > It may end up racing with the freeing
> > the policy and unregistering the driver.
> >
> > One possible race is as below where the cpufreq_driver is unregistered
> > but the scheduled work gets executed at later stage when cpufreq_driver
> > is NULL(i.e. after freeing the policy and driver)
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > pgd = (ptrval)
> > [0000001c] *pgd=80000080204003, *pmd=00000000
> > Internal error: Oops: 206 [#1] SMP THUMB2
> > Modules linked in:
> > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > Hardware name: ARM-Versatile Express
> > Workqueue: events handle_update
> > PC is at cpufreq_set_policy+0x58/0x228
> > LR is at dev_pm_qos_read_value+0x77/0xac
> > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> >       (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> >       (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> >       (handle_update) from (process_one_work+0x16d/0x3cc)
> >       (process_one_work) from (worker_thread+0xff/0x414)
> >       (worker_thread) from (kthread+0xff/0x100)
> >       (kthread) from (ret_from_fork+0x11/0x28)
> >
> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > ---
> >  drivers/cpufreq/cpufreq.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > Hi Rafael, Viresh,
> >
> > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > I have based this patch on -rc3 and not on top of your patches. This
> > only fixes the boot issue but I hit the other crashes while continuously
> > switching on and off the bL switcher that register/unregister the driver
> > Your patch series fixes them. I can based this on top of those if you
> > prefer.
> >
> > Regards,
> > Sudeep
> >
> > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index c52d6fa32aac..b703c29a84be 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> >       }
> >
> >       dev_pm_qos_remove_request(policy->min_freq_req);
> > +     /* flush the pending policy->update work before freeing the policy */
> > +     if (work_pending(&policy->update))
> > +             flush_work(&policy->update);
>
> This diff surely makes sense even without the QoS stuff, this race can
> still happen, very unlikely though.
>
> And yes, you must use the other variant that Rafael suggested, we are
> already doing similar thing in a bunch of cpufreq governors :)
>
> And I will probably add this after calling
> dev_pm_qos_remove_notifier() for the MAX policy as this doesn't and
> shouldn't depend on removing the qos request.

Good point.

This is after taking the last CPU in the policy offline, so
policy->update cannot be scheduled from anywhere at this point.

> >       kfree(policy->min_freq_req);
> >
> >       cpufreq_policy_put_kobj(policy);
> > --

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  7:32         ` Rafael J. Wysocki
@ 2019-10-18  8:03           ` Viresh Kumar
  2019-10-18  8:19             ` Rafael J. Wysocki
  0 siblings, 1 reply; 21+ messages in thread
From: Viresh Kumar @ 2019-10-18  8:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Sudeep Holla, Rafael J . Wysocki, Linux PM, Linux Kernel Mailing List

On 18-10-19, 09:32, Rafael J. Wysocki wrote:
> Well, the policy is going away, so the governor has been stopped for
> it already.  Even if the limit is updated, it will not be used anyway,
> so why bother with updating it?

The hardware will be programmed to run on that frequency before the
policy exits, so I will say it will be used and the constraint will be
satisfied. And so I had that view.

-- 
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  8:03           ` Viresh Kumar
@ 2019-10-18  8:19             ` Rafael J. Wysocki
  2019-10-18  8:25               ` Viresh Kumar
  0 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-18  8:19 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Sudeep Holla, Rafael J . Wysocki, Linux PM,
	Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 10:03 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-10-19, 09:32, Rafael J. Wysocki wrote:
> > Well, the policy is going away, so the governor has been stopped for
> > it already.  Even if the limit is updated, it will not be used anyway,
> > so why bother with updating it?
>
> The hardware will be programmed to run on that frequency before the
> policy exits,

How exactly?

The policy is inactive, so refresh_frequency_limits() won't even run
cpufreq_set_policy() for it.

> so I will say it will be used and the constraint will be
> satisfied. And so I had that view.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  8:19             ` Rafael J. Wysocki
@ 2019-10-18  8:25               ` Viresh Kumar
  0 siblings, 0 replies; 21+ messages in thread
From: Viresh Kumar @ 2019-10-18  8:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Sudeep Holla, Rafael J . Wysocki, Linux PM, Linux Kernel Mailing List

On 18-10-19, 10:19, Rafael J. Wysocki wrote:
> On Fri, Oct 18, 2019 at 10:03 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 18-10-19, 09:32, Rafael J. Wysocki wrote:
> > > Well, the policy is going away, so the governor has been stopped for
> > > it already.  Even if the limit is updated, it will not be used anyway,
> > > so why bother with updating it?
> >
> > The hardware will be programmed to run on that frequency before the
> > policy exits,
> 
> How exactly?
> 
> The policy is inactive, so refresh_frequency_limits() won't even run
> cpufreq_set_policy() for it.

Ahh, yes. We won't change the frequency, this is all useless in that
case.

-- 
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18  6:02       ` Viresh Kumar
  2019-10-18  7:32         ` Rafael J. Wysocki
@ 2019-10-18 10:19         ` Sudeep Holla
  2019-10-18 10:37           ` Rafael J. Wysocki
  1 sibling, 1 reply; 21+ messages in thread
From: Sudeep Holla @ 2019-10-18 10:19 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Rafael J . Wysocki, Linux PM,
	Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 11:32:47AM +0530, Viresh Kumar wrote:
> On 18-10-19, 06:55, Sudeep Holla wrote:
> > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > > >
> > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > > which schedule policy update work. It may end up racing with the freeing
> > > > > the policy and unregistering the driver.
> > > > >
> > > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > > is NULL(i.e. after freeing the policy and driver)
> > > > >
> > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > > pgd = (ptrval)
> > > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > > Modules linked in:
> > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > > Hardware name: ARM-Versatile Express
> > > > > Workqueue: events handle_update
> > > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > > >         (worker_thread) from (kthread+0xff/0x100)
> > > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > > >
> > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > > ---
> > > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > >
> > > > > Hi Rafael, Viresh,
> > > > >
> > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > > switching on and off the bL switcher that register/unregister the driver
> > > > > Your patch series fixes them. I can based this on top of those if you
> > > > > prefer.
> > > > >
> > > > > Regards,
> > > > > Sudeep
> > > > >
> > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > > >
> > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > > index c52d6fa32aac..b703c29a84be 100644
> > > > > --- a/drivers/cpufreq/cpufreq.c
> > > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > > >         }
> > > > >
> > > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > > +       if (work_pending(&policy->update))
> > > >
> > > > Isn't this racy?
> > > >
> > > > It still may be running if the pending bit is clear and we still need
> > > > to wait for it then, don't we?
> > > >
> > > > Why don't you do an unconditional flush_work() here?
> > > 
> > > You may as well do a cancel_work_sync() here, because whether or not
> > > the last update of the policy happens before it goes away is a matter
> > > of timing in any case
> > 
> > In fact that's the first thing I tried to fix the issue I was seeing.
> > But I then thought it would be better to complete the update as the PM
> > QoS were getting updated back to DEFAULT values for the device. Even
> > this works.
> > 
> > What is your preference ? flush_work or cancel_work_sync ? I will
> > update accordingly. I may need to do some more testing with
> > cancel_work_sync as I just checked that quickly to confirm the race.
> 
> As I said in the other email, this work didn't come as a result of
> removal of the qos request from cpufreq core and so must have come
> from other thermal or similar events.

I don't think so. For sure not because of any thermal events. I didn't
have log handy and hence had to wait till I was next to hardware.

This is log:
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
 cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
 cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
 cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
 cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
 cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after

So if I move the call above, it still crashes as the work is getting
scheduled later.

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18 10:19         ` Sudeep Holla
@ 2019-10-18 10:37           ` Rafael J. Wysocki
  2019-10-18 11:06             ` Sudeep Holla
  0 siblings, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-18 10:37 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Viresh Kumar, Rafael J. Wysocki, Linux PM, Linux Kernel Mailing List

On Friday, October 18, 2019 12:19:24 PM CEST Sudeep Holla wrote:
> On Fri, Oct 18, 2019 at 11:32:47AM +0530, Viresh Kumar wrote:
> > On 18-10-19, 06:55, Sudeep Holla wrote:
> > > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > > > >
> > > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > > > which schedule policy update work. It may end up racing with the freeing
> > > > > > the policy and unregistering the driver.
> > > > > >
> > > > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > > > is NULL(i.e. after freeing the policy and driver)
> > > > > >
> > > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > > > pgd = (ptrval)
> > > > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > > > Modules linked in:
> > > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > > > Hardware name: ARM-Versatile Express
> > > > > > Workqueue: events handle_update
> > > > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > > > >         (worker_thread) from (kthread+0xff/0x100)
> > > > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > > > >
> > > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > > > ---
> > > > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > > > >  1 file changed, 3 insertions(+)
> > > > > >
> > > > > > Hi Rafael, Viresh,
> > > > > >
> > > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > > > switching on and off the bL switcher that register/unregister the driver
> > > > > > Your patch series fixes them. I can based this on top of those if you
> > > > > > prefer.
> > > > > >
> > > > > > Regards,
> > > > > > Sudeep
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > > > >
> > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > > > index c52d6fa32aac..b703c29a84be 100644
> > > > > > --- a/drivers/cpufreq/cpufreq.c
> > > > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > > > >         }
> > > > > >
> > > > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > > > +       if (work_pending(&policy->update))
> > > > >
> > > > > Isn't this racy?
> > > > >
> > > > > It still may be running if the pending bit is clear and we still need
> > > > > to wait for it then, don't we?
> > > > >
> > > > > Why don't you do an unconditional flush_work() here?
> > > > 
> > > > You may as well do a cancel_work_sync() here, because whether or not
> > > > the last update of the policy happens before it goes away is a matter
> > > > of timing in any case
> > > 
> > > In fact that's the first thing I tried to fix the issue I was seeing.
> > > But I then thought it would be better to complete the update as the PM
> > > QoS were getting updated back to DEFAULT values for the device. Even
> > > this works.
> > > 
> > > What is your preference ? flush_work or cancel_work_sync ? I will
> > > update accordingly. I may need to do some more testing with
> > > cancel_work_sync as I just checked that quickly to confirm the race.
> > 
> > As I said in the other email, this work didn't come as a result of
> > removal of the qos request from cpufreq core and so must have come
> > from other thermal or similar events.
> 
> I don't think so. For sure not because of any thermal events. I didn't
> have log handy and hence had to wait till I was next to hardware.
> 
> This is log:
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
>  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
>  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
>  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
>  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
>  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> 
> So if I move the call above, it still crashes as the work is getting
> scheduled later.

OK, please cancel the work after dropping the last request.

We still need to understand what is going on here, but the crash needs to be
prevented from occurring in the first place IMO.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18 10:37           ` Rafael J. Wysocki
@ 2019-10-18 11:06             ` Sudeep Holla
  2019-10-21  0:14               ` Rafael J. Wysocki
  2019-10-21  2:15               ` Viresh Kumar
  0 siblings, 2 replies; 21+ messages in thread
From: Sudeep Holla @ 2019-10-18 11:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Rafael J. Wysocki, Linux PM, Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 12:37:51PM +0200, Rafael J. Wysocki wrote:
> On Friday, October 18, 2019 12:19:24 PM CEST Sudeep Holla wrote:
> > On Fri, Oct 18, 2019 at 11:32:47AM +0530, Viresh Kumar wrote:
> > > On 18-10-19, 06:55, Sudeep Holla wrote:
> > > > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > > > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > > > > >
> > > > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > > > > which schedule policy update work. It may end up racing with the freeing
> > > > > > > the policy and unregistering the driver.
> > > > > > >
> > > > > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > > > > is NULL(i.e. after freeing the policy and driver)
> > > > > > >
> > > > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > > > > pgd = (ptrval)
> > > > > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > > > > Modules linked in:
> > > > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > > > > Hardware name: ARM-Versatile Express
> > > > > > > Workqueue: events handle_update
> > > > > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > > > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > > > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > > > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > > > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > > > > >         (worker_thread) from (kthread+0xff/0x100)
> > > > > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > > > > >
> > > > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > > > > ---
> > > > > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > > > > >  1 file changed, 3 insertions(+)
> > > > > > >
> > > > > > > Hi Rafael, Viresh,
> > > > > > >
> > > > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > > > > switching on and off the bL switcher that register/unregister the driver
> > > > > > > Your patch series fixes them. I can based this on top of those if you
> > > > > > > prefer.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sudeep
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > > > > >
> > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > > > > index c52d6fa32aac..b703c29a84be 100644
> > > > > > > --- a/drivers/cpufreq/cpufreq.c
> > > > > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > > > > >         }
> > > > > > >
> > > > > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > > > > +       if (work_pending(&policy->update))
> > > > > >
> > > > > > Isn't this racy?
> > > > > >
> > > > > > It still may be running if the pending bit is clear and we still need
> > > > > > to wait for it then, don't we?
> > > > > >
> > > > > > Why don't you do an unconditional flush_work() here?
> > > > > 
> > > > > You may as well do a cancel_work_sync() here, because whether or not
> > > > > the last update of the policy happens before it goes away is a matter
> > > > > of timing in any case
> > > > 
> > > > In fact that's the first thing I tried to fix the issue I was seeing.
> > > > But I then thought it would be better to complete the update as the PM
> > > > QoS were getting updated back to DEFAULT values for the device. Even
> > > > this works.
> > > > 
> > > > What is your preference ? flush_work or cancel_work_sync ? I will
> > > > update accordingly. I may need to do some more testing with
> > > > cancel_work_sync as I just checked that quickly to confirm the race.
> > > 
> > > As I said in the other email, this work didn't come as a result of
> > > removal of the qos request from cpufreq core and so must have come
> > > from other thermal or similar events.
> > 
> > I don't think so. For sure not because of any thermal events. I didn't
> > have log handy and hence had to wait till I was next to hardware.
> > 
> > This is log:
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
> >  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
> >  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
> >  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
> >  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
> >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> > 
> > So if I move the call above, it still crashes as the work is getting
> > scheduled later.
> 
> OK, please cancel the work after dropping the last request.
> 
> We still need to understand what is going on here, but the crash needs to be
> prevented from occurring in the first place IMO.
> 
Callstack is:

(cpufreq_notifier_max)
(notifier_call_chain)
(blocking_notifier_call_chain)
(pm_qos_update_target)
(freq_qos_apply)
(freq_qos_remove_request)
(cpufreq_policy_free)
(subsys_interface_unregister)
(cpufreq_unregister_driver)

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18 11:06             ` Sudeep Holla
@ 2019-10-21  0:14               ` Rafael J. Wysocki
  2019-10-21 10:33                 ` Sudeep Holla
  2019-10-21  2:15               ` Viresh Kumar
  1 sibling, 1 reply; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-21  0:14 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Rafael J. Wysocki, Viresh Kumar, Rafael J. Wysocki, Linux PM,
	Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 1:06 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> On Fri, Oct 18, 2019 at 12:37:51PM +0200, Rafael J. Wysocki wrote:
> > On Friday, October 18, 2019 12:19:24 PM CEST Sudeep Holla wrote:
> > > On Fri, Oct 18, 2019 at 11:32:47AM +0530, Viresh Kumar wrote:
> > > > On 18-10-19, 06:55, Sudeep Holla wrote:
> > > > > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > > > > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > > > > > >
> > > > > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > > > > > which schedule policy update work. It may end up racing with the freeing
> > > > > > > > the policy and unregistering the driver.
> > > > > > > >
> > > > > > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > > > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > > > > > is NULL(i.e. after freeing the policy and driver)
> > > > > > > >
> > > > > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > > > > > pgd = (ptrval)
> > > > > > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > > > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > > > > > Modules linked in:
> > > > > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > > > > > Hardware name: ARM-Versatile Express
> > > > > > > > Workqueue: events handle_update
> > > > > > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > > > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > > > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > > > > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > > > > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > > > > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > > > > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > > > > > >         (worker_thread) from (kthread+0xff/0x100)
> > > > > > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > > > > > >
> > > > > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > > > > > ---
> > > > > > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > > > > > >  1 file changed, 3 insertions(+)
> > > > > > > >
> > > > > > > > Hi Rafael, Viresh,
> > > > > > > >
> > > > > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > > > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > > > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > > > > > switching on and off the bL switcher that register/unregister the driver
> > > > > > > > Your patch series fixes them. I can based this on top of those if you
> > > > > > > > prefer.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Sudeep
> > > > > > > >
> > > > > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > > > > > >
> > > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > > > > > index c52d6fa32aac..b703c29a84be 100644
> > > > > > > > --- a/drivers/cpufreq/cpufreq.c
> > > > > > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > > > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > > > > > +       if (work_pending(&policy->update))
> > > > > > >
> > > > > > > Isn't this racy?
> > > > > > >
> > > > > > > It still may be running if the pending bit is clear and we still need
> > > > > > > to wait for it then, don't we?
> > > > > > >
> > > > > > > Why don't you do an unconditional flush_work() here?
> > > > > >
> > > > > > You may as well do a cancel_work_sync() here, because whether or not
> > > > > > the last update of the policy happens before it goes away is a matter
> > > > > > of timing in any case
> > > > >
> > > > > In fact that's the first thing I tried to fix the issue I was seeing.
> > > > > But I then thought it would be better to complete the update as the PM
> > > > > QoS were getting updated back to DEFAULT values for the device. Even
> > > > > this works.
> > > > >
> > > > > What is your preference ? flush_work or cancel_work_sync ? I will
> > > > > update accordingly. I may need to do some more testing with
> > > > > cancel_work_sync as I just checked that quickly to confirm the race.
> > > >
> > > > As I said in the other email, this work didn't come as a result of
> > > > removal of the qos request from cpufreq core and so must have come
> > > > from other thermal or similar events.
> > >
> > > I don't think so. For sure not because of any thermal events. I didn't
> > > have log handy and hence had to wait till I was next to hardware.
> > >
> > > This is log:
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
> > >  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
> > >  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
> > >  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
> > >  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
> > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> > >
> > > So if I move the call above, it still crashes as the work is getting
> > > scheduled later.
> >
> > OK, please cancel the work after dropping the last request.
> >
> > We still need to understand what is going on here, but the crash needs to be
> > prevented from occurring in the first place IMO.
> >
> Callstack is:
>
> (cpufreq_notifier_max)
> (notifier_call_chain)
> (blocking_notifier_call_chain)
> (pm_qos_update_target)
> (freq_qos_apply)
> (freq_qos_remove_request)
> (cpufreq_policy_free)
> (subsys_interface_unregister)
> (cpufreq_unregister_driver)

That may be due to a bug in one of my patches (it's adding one of the
notifiers to a wrong list).

Please re-test with the current linux-next branch that I've just pushed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-18 11:06             ` Sudeep Holla
  2019-10-21  0:14               ` Rafael J. Wysocki
@ 2019-10-21  2:15               ` Viresh Kumar
  2019-10-21  8:20                 ` Rafael J. Wysocki
  2019-10-21 10:27                 ` Sudeep Holla
  1 sibling, 2 replies; 21+ messages in thread
From: Viresh Kumar @ 2019-10-21  2:15 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM,
	Linux Kernel Mailing List

On 18-10-19, 12:06, Sudeep Holla wrote:
> Callstack is:
> 
> (cpufreq_notifier_max)
> (notifier_call_chain)
> (blocking_notifier_call_chain)
> (pm_qos_update_target)
> (freq_qos_apply)
> (freq_qos_remove_request)
> (cpufreq_policy_free)
> (subsys_interface_unregister)
> (cpufreq_unregister_driver)

@sudeep: I see that the patch is merged now, but as I said earlier the
reasoning isn't clear yet. Please don't stop working on this and lets
clean this once and for all.

What patches were you testing this with? My buggy patches or Rafael's
patches as well ? At least with my patches, this can happen due to the
other bug where the notifier doesn't get removed (as I said earlier),
but once that bug isn't there then this shouldn't happen, else we have
another bug in pipeline somewhere and should find it.

-- 
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-21  2:15               ` Viresh Kumar
@ 2019-10-21  8:20                 ` Rafael J. Wysocki
  2019-10-21 10:27                 ` Sudeep Holla
  1 sibling, 0 replies; 21+ messages in thread
From: Rafael J. Wysocki @ 2019-10-21  8:20 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Sudeep Holla, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM,
	Linux Kernel Mailing List

On Mon, Oct 21, 2019 at 4:15 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-10-19, 12:06, Sudeep Holla wrote:
> > Callstack is:
> >
> > (cpufreq_notifier_max)
> > (notifier_call_chain)
> > (blocking_notifier_call_chain)
> > (pm_qos_update_target)
> > (freq_qos_apply)
> > (freq_qos_remove_request)
> > (cpufreq_policy_free)
> > (subsys_interface_unregister)
> > (cpufreq_unregister_driver)
>
> @sudeep: I see that the patch is merged now, but as I said earlier the
> reasoning isn't clear yet. Please don't stop working on this and lets
> clean this once and for all.
>
> What patches were you testing this with? My buggy patches or Rafael's
> patches as well ? At least with my patches, this can happen due to the
> other bug where the notifier doesn't get removed (as I said earlier),
> but once that bug isn't there then this shouldn't happen, else we have
> another bug in pipeline somewhere and should find it.

Right.

I have found one already which should be fixed in my current
linux-next branch, so testing that in particular would be appreciated.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-21  2:15               ` Viresh Kumar
  2019-10-21  8:20                 ` Rafael J. Wysocki
@ 2019-10-21 10:27                 ` Sudeep Holla
  2019-10-21 10:55                   ` Viresh Kumar
  1 sibling, 1 reply; 21+ messages in thread
From: Sudeep Holla @ 2019-10-21 10:27 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM,
	Linux Kernel Mailing List

On Mon, Oct 21, 2019 at 07:45:51AM +0530, Viresh Kumar wrote:
> On 18-10-19, 12:06, Sudeep Holla wrote:
> > Callstack is:
> >
> > (cpufreq_notifier_max)
> > (notifier_call_chain)
> > (blocking_notifier_call_chain)
> > (pm_qos_update_target)
> > (freq_qos_apply)
> > (freq_qos_remove_request)
> > (cpufreq_policy_free)
> > (subsys_interface_unregister)
> > (cpufreq_unregister_driver)
>
> @sudeep: I see that the patch is merged now, but as I said earlier the
> reasoning isn't clear yet. Please don't stop working on this and lets
> clean this once and for all.
>

Sure.

> What patches were you testing this with? My buggy patches or Rafael's
> patches as well ? At least with my patches, this can happen due to the
> other bug where the notifier doesn't get removed (as I said earlier),
> but once that bug isn't there then this shouldn't happen, else we have
> another bug in pipeline somewhere and should find it.
>

I just tested now with today's linux-pm/bleeding-edge branch.
And even if I move cancel_work_sync just after freq_qos_remove_notifier,
it works fine now. It was not the case on Friday.

Is that what you wanted to check or something else ?

Regards,
Sudeep

-->8

diff --git i/drivers/cpufreq/cpufreq.c w/drivers/cpufreq/cpufreq.c
index 829a3764df1b..48a224a6b178 100644
--- i/drivers/cpufreq/cpufreq.c
+++ w/drivers/cpufreq/cpufreq.c
@@ -1268,6 +1268,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
        freq_qos_remove_notifier(&policy->constraints, FREQ_QOS_MIN,
                                 &policy->nb_min);

+       /* Cancel any pending policy->update work before freeing the policy. */
+       cancel_work_sync(&policy->update);
+
        if (policy->max_freq_req) {
                /*
                 * CPUFREQ_CREATE_POLICY notification is sent only after
@@ -1279,8 +1282,6 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
        }

        freq_qos_remove_request(policy->min_freq_req);
-       /* Cancel any pending policy->update work before freeing the policy. */
-       cancel_work_sync(&policy->update);
        kfree(policy->min_freq_req);

        cpufreq_policy_put_kobj(policy);


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-21  0:14               ` Rafael J. Wysocki
@ 2019-10-21 10:33                 ` Sudeep Holla
  0 siblings, 0 replies; 21+ messages in thread
From: Sudeep Holla @ 2019-10-21 10:33 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Viresh Kumar, Linux PM, Linux Kernel Mailing List

On Mon, Oct 21, 2019 at 02:14:51AM +0200, Rafael J. Wysocki wrote:
> On Fri, Oct 18, 2019 at 1:06 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> >
> > On Fri, Oct 18, 2019 at 12:37:51PM +0200, Rafael J. Wysocki wrote:
> > > On Friday, October 18, 2019 12:19:24 PM CEST Sudeep Holla wrote:
> > > > On Fri, Oct 18, 2019 at 11:32:47AM +0530, Viresh Kumar wrote:
> > > > > On 18-10-19, 06:55, Sudeep Holla wrote:
> > > > > > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote:
> > > > > > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > > > > > > >
> > > > > > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers
> > > > > > > > > which schedule policy update work. It may end up racing with the freeing
> > > > > > > > > the policy and unregistering the driver.
> > > > > > > > >
> > > > > > > > > One possible race is as below where the cpufreq_driver is unregistered
> > > > > > > > > but the scheduled work gets executed at later stage when cpufreq_driver
> > > > > > > > > is NULL(i.e. after freeing the policy and driver)
> > > > > > > > >
> > > > > > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > > > > > > > > pgd = (ptrval)
> > > > > > > > > [0000001c] *pgd=80000080204003, *pmd=00000000
> > > > > > > > > Internal error: Oops: 206 [#1] SMP THUMB2
> > > > > > > > > Modules linked in:
> > > > > > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.4.0-rc3-00006-g67f5a8081a4b #86
> > > > > > > > > Hardware name: ARM-Versatile Express
> > > > > > > > > Workqueue: events handle_update
> > > > > > > > > PC is at cpufreq_set_policy+0x58/0x228
> > > > > > > > > LR is at dev_pm_qos_read_value+0x77/0xac
> > > > > > > > > Control: 70c5387d  Table: 80203000  DAC: fffffffd
> > > > > > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval))
> > > > > > > > >         (cpufreq_set_policy) from (refresh_frequency_limits.part.24+0x37/0x48)
> > > > > > > > >         (refresh_frequency_limits.part.24) from (handle_update+0x2f/0x38)
> > > > > > > > >         (handle_update) from (process_one_work+0x16d/0x3cc)
> > > > > > > > >         (process_one_work) from (worker_thread+0xff/0x414)
> > > > > > > > >         (worker_thread) from (kthread+0xff/0x100)
> > > > > > > > >         (kthread) from (ret_from_fork+0x11/0x28)
> > > > > > > > >
> > > > > > > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> > > > > > > > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > > > > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/cpufreq/cpufreq.c | 3 +++
> > > > > > > > >  1 file changed, 3 insertions(+)
> > > > > > > > >
> > > > > > > > > Hi Rafael, Viresh,
> > > > > > > > >
> > > > > > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled.
> > > > > > > > > I have based this patch on -rc3 and not on top of your patches. This
> > > > > > > > > only fixes the boot issue but I hit the other crashes while continuously
> > > > > > > > > switching on and off the bL switcher that register/unregister the driver
> > > > > > > > > Your patch series fixes them. I can based this on top of those if you
> > > > > > > > > prefer.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Sudeep
> > > > > > > > >
> > > > > > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > > > > > > index c52d6fa32aac..b703c29a84be 100644
> > > > > > > > > --- a/drivers/cpufreq/cpufreq.c
> > > > > > > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > > > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
> > > > > > > > >         }
> > > > > > > > >
> > > > > > > > >         dev_pm_qos_remove_request(policy->min_freq_req);
> > > > > > > > > +       /* flush the pending policy->update work before freeing the policy */
> > > > > > > > > +       if (work_pending(&policy->update))
> > > > > > > >
> > > > > > > > Isn't this racy?
> > > > > > > >
> > > > > > > > It still may be running if the pending bit is clear and we still need
> > > > > > > > to wait for it then, don't we?
> > > > > > > >
> > > > > > > > Why don't you do an unconditional flush_work() here?
> > > > > > >
> > > > > > > You may as well do a cancel_work_sync() here, because whether or not
> > > > > > > the last update of the policy happens before it goes away is a matter
> > > > > > > of timing in any case
> > > > > >
> > > > > > In fact that's the first thing I tried to fix the issue I was seeing.
> > > > > > But I then thought it would be better to complete the update as the PM
> > > > > > QoS were getting updated back to DEFAULT values for the device. Even
> > > > > > this works.
> > > > > >
> > > > > > What is your preference ? flush_work or cancel_work_sync ? I will
> > > > > > update accordingly. I may need to do some more testing with
> > > > > > cancel_work_sync as I just checked that quickly to confirm the race.
> > > > >
> > > > > As I said in the other email, this work didn't come as a result of
> > > > > removal of the qos request from cpufreq core and so must have come
> > > > > from other thermal or similar events.
> > > >
> > > > I don't think so. For sure not because of any thermal events. I didn't
> > > > have log handy and hence had to wait till I was next to hardware.
> > > >
> > > > This is log:
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
> > > >  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
> > > >  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max before
> > > >  cpufreq: cpufreq_notifier_max: schedule_work(&policy->update)
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request max after
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min before
> > > >  cpufreq: cpufreq_notifier_min: schedule_work(&policy->update)
> > > >  cpufreq: cpufreq_policy_free: dev_pm_qos_remove_request min after
> > > >
> > > > So if I move the call above, it still crashes as the work is getting
> > > > scheduled later.
> > >
> > > OK, please cancel the work after dropping the last request.
> > >
> > > We still need to understand what is going on here, but the crash needs to be
> > > prevented from occurring in the first place IMO.
> > >
> > Callstack is:
> >
> > (cpufreq_notifier_max)
> > (notifier_call_chain)
> > (blocking_notifier_call_chain)
> > (pm_qos_update_target)
> > (freq_qos_apply)
> > (freq_qos_remove_request)
> > (cpufreq_policy_free)
> > (subsys_interface_unregister)
> > (cpufreq_unregister_driver)
>
> That may be due to a bug in one of my patches (it's adding one of the
> notifiers to a wrong list).
>

Ah that explains, I was wondering what changed as it's working now but
was not the case when I tried earlier and I had to keep cancel_work_sync
after dev_pm_qos_remove_request

> Please re-test with the current linux-next branch that I've just pushed.

Yes, it did that and it now works fine even if I move the cancel_work_sync
call earlier just after freq_qos_remove_notifier.

If you/Viresh prefer the call to cancel_work_sync to be moved up, that
should be fine now. I have sent the delta for reference in other reply.

--
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] cpufreq: flush any pending policy update work scheduled before freeing
  2019-10-21 10:27                 ` Sudeep Holla
@ 2019-10-21 10:55                   ` Viresh Kumar
  0 siblings, 0 replies; 21+ messages in thread
From: Viresh Kumar @ 2019-10-21 10:55 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM,
	Linux Kernel Mailing List

On 21-10-19, 11:27, Sudeep Holla wrote:
> I just tested now with today's linux-pm/bleeding-edge branch.
> And even if I move cancel_work_sync just after freq_qos_remove_notifier,
> it works fine now. It was not the case on Friday.
> 
> Is that what you wanted to check or something else ?
> 
> Regards,
> Sudeep
> 
> -->8
> 
> diff --git i/drivers/cpufreq/cpufreq.c w/drivers/cpufreq/cpufreq.c
> index 829a3764df1b..48a224a6b178 100644
> --- i/drivers/cpufreq/cpufreq.c
> +++ w/drivers/cpufreq/cpufreq.c
> @@ -1268,6 +1268,9 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>         freq_qos_remove_notifier(&policy->constraints, FREQ_QOS_MIN,
>                                  &policy->nb_min);
> 
> +       /* Cancel any pending policy->update work before freeing the policy. */
> +       cancel_work_sync(&policy->update);
> +
>         if (policy->max_freq_req) {
>                 /*
>                  * CPUFREQ_CREATE_POLICY notification is sent only after
> @@ -1279,8 +1282,6 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>         }
> 
>         freq_qos_remove_request(policy->min_freq_req);
> -       /* Cancel any pending policy->update work before freeing the policy. */
> -       cancel_work_sync(&policy->update);
>         kfree(policy->min_freq_req);
> 
>         cpufreq_policy_put_kobj(policy);

Yes, send a incremental patch for that. Thanks.

-- 
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-10-21 10:56 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-17 16:35 [PATCH] cpufreq: flush any pending policy update work scheduled before freeing Sudeep Holla
2019-10-17 19:36 ` Rafael J. Wysocki
2019-10-17 21:26   ` Rafael J. Wysocki
2019-10-18  5:55     ` Sudeep Holla
2019-10-18  6:02       ` Viresh Kumar
2019-10-18  7:32         ` Rafael J. Wysocki
2019-10-18  8:03           ` Viresh Kumar
2019-10-18  8:19             ` Rafael J. Wysocki
2019-10-18  8:25               ` Viresh Kumar
2019-10-18 10:19         ` Sudeep Holla
2019-10-18 10:37           ` Rafael J. Wysocki
2019-10-18 11:06             ` Sudeep Holla
2019-10-21  0:14               ` Rafael J. Wysocki
2019-10-21 10:33                 ` Sudeep Holla
2019-10-21  2:15               ` Viresh Kumar
2019-10-21  8:20                 ` Rafael J. Wysocki
2019-10-21 10:27                 ` Sudeep Holla
2019-10-21 10:55                   ` Viresh Kumar
2019-10-18  5:47   ` Sudeep Holla
2019-10-18  5:38 ` Viresh Kumar
2019-10-18  7:53   ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).