All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: sched-freq locking
       [not found]               ` <CAEG3pNBgpsdX8Lk7_3nHd31_bsq2sLB_f+=4_xiFFbiAWiKsxg@mail.gmail.com>
@ 2016-01-20  1:24                 ` Steve Muckle
  2016-01-20 12:18                   ` Juri Lelli
  2016-01-21  1:22                   ` Rafael J. Wysocki
  0 siblings, 2 replies; 18+ messages in thread
From: Steve Muckle @ 2016-01-20  1:24 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Vincent Guittot, Juri Lelli, Patrick Bellasi, Morten Rasmussen,
	Dietmar Eggemann, Viresh Kumar, linux-kernel, linux-pm,
	Peter Zijlstra, Rafael J. Wysocki

On 01/19/2016 03:40 PM, Michael Turquette wrote:
> Right, this was _the_ original impetus behind the design decision to
> muck around with struct cpufreq_policy in the hot path which goes al
> the way back to v1.
> 
> An alternative thought is that we can make copies of the relevant bits
> of struct cpufreq_policy that we do not expect too change often. These
> will not require any locks as they are mostly read-only data on the
> scheduler side of the interface. Or we could even go all in and just
> make local copies of the struct directly, during the GOV_START
> perhaps, with:

I believe this is a good first step as it avoids reworking a huge amount
of locking and can get us to something functionally correct. It is what
I had proposed earlier, copying the enabled CPUs and freq table in
during the governor start callback. Unless there are objections to it
I'll add it to the next schedfreq RFC.

> 
...
> 
> Well if we're going to try an optimize out every single false-positive
> wakeup then I think that the cleanest long term solution would be
> rework the per-policy locking around struct cpufreq_policy to use a
> raw spinlock.

It would be nice if the policy lock was a spinlock but I don't know how
easy that is. From a quick look at cpufreq there's a blocking notifier
chain that's called with rwsem held, so it looks messy. Potentially long
term indeed.

>> Also it'd be good I think to avoid building in an assumption that we'll
>> never want to run solely in the fast (atomic) path. Perhaps ARM won't,
>> and x86 may never use this, but it's reasonable to think another
>> platform might come along which uses cpufreq and has the capability to
>> kick off cpufreq transitions swiftly and without sleeping. Maybe ARM
>> platforms will evolve to have that capability.
> 
> The current design of the cpufreq subsystem and its interfaces have
> made this choice for us. sched-freq is just another consumer of
> cpufreq, and until cpufreq's own locking scheme is improved then we
> have no choice.

I did not word that very well - I should have said, we should avoid
building in an assumption that we never want to try and run in the fast
path.

AFAICS, once we've calculated that a frequency change is required we can
down_write_trylock(&policy->rwsem) in the fast path and go ahead with
the transition, if the trylock succeeds and the driver supports fast
path transitions. We can fall back to the slow path (waking up the
kthread) if that fails.

> This discussion is pretty useful. Should we Cc lkml to this thread?

Done (added linux-pm, PeterZ and Rafael as well).

thanks,
Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20  1:24                 ` sched-freq locking Steve Muckle
@ 2016-01-20 12:18                   ` Juri Lelli
  2016-01-20 14:50                     ` Steve Muckle
  2016-01-21  1:22                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2016-01-20 12:18 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Rafael J. Wysocki, Javi Merino,
	Punit Agrawal

[+Punit, Javi]

Hi,

On 19/01/16 17:24, Steve Muckle wrote:
> On 01/19/2016 03:40 PM, Michael Turquette wrote:
> > Right, this was _the_ original impetus behind the design decision to
> > muck around with struct cpufreq_policy in the hot path which goes al
> > the way back to v1.
> > 
> > An alternative thought is that we can make copies of the relevant bits
> > of struct cpufreq_policy that we do not expect too change often. These
> > will not require any locks as they are mostly read-only data on the
> > scheduler side of the interface. Or we could even go all in and just
> > make local copies of the struct directly, during the GOV_START
> > perhaps, with:
> 
> I believe this is a good first step as it avoids reworking a huge amount
> of locking and can get us to something functionally correct. It is what
> I had proposed earlier, copying the enabled CPUs and freq table in
> during the governor start callback. Unless there are objections to it
> I'll add it to the next schedfreq RFC.
> 

I fear that caching could break thermal. If everybody was already using
sched-freq interface to control frequency this won't probably be a
problem, but we are not yet there :(. So, IIUC by caching policy->max,
for example, we might affect what thermal expects from cpufreq.

> > 
> ...
> > 
> > Well if we're going to try an optimize out every single false-positive
> > wakeup then I think that the cleanest long term solution would be
> > rework the per-policy locking around struct cpufreq_policy to use a
> > raw spinlock.
> 
> It would be nice if the policy lock was a spinlock but I don't know how
> easy that is. From a quick look at cpufreq there's a blocking notifier
> chain that's called with rwsem held, so it looks messy. Potentially long
> term indeed.
> 

Right. Blocking notifiers are one problem, as I was saying to Peter
yesterday.

> >> Also it'd be good I think to avoid building in an assumption that we'll
> >> never want to run solely in the fast (atomic) path. Perhaps ARM won't,
> >> and x86 may never use this, but it's reasonable to think another
> >> platform might come along which uses cpufreq and has the capability to
> >> kick off cpufreq transitions swiftly and without sleeping. Maybe ARM
> >> platforms will evolve to have that capability.
> > 
> > The current design of the cpufreq subsystem and its interfaces have
> > made this choice for us. sched-freq is just another consumer of
> > cpufreq, and until cpufreq's own locking scheme is improved then we
> > have no choice.
> 
> I did not word that very well - I should have said, we should avoid
> building in an assumption that we never want to try and run in the fast
> path.
> 
> AFAICS, once we've calculated that a frequency change is required we can
> down_write_trylock(&policy->rwsem) in the fast path and go ahead with
> the transition, if the trylock succeeds and the driver supports fast
> path transitions. We can fall back to the slow path (waking up the
> kthread) if that fails.
> 
> > This discussion is pretty useful. Should we Cc lkml to this thread?
> 
> Done (added linux-pm, PeterZ and Rafael as well).
> 

This discussion is pretty interesting, yes. I'm a bit afraid people
bumped into this might have troubles understanding context, though.
And I'm not sure how to give them that context; maybe start a new thread
summarizing what has been discussed so far?

Best,

- Juri

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20 12:18                   ` Juri Lelli
@ 2016-01-20 14:50                     ` Steve Muckle
  2016-01-20 15:58                       ` Juri Lelli
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Muckle @ 2016-01-20 14:50 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Rafael J. Wysocki, Javi Merino,
	Punit Agrawal

On 01/20/2016 04:18 AM, Juri Lelli wrote:
> I fear that caching could break thermal. If everybody was already using
> sched-freq interface to control frequency this won't probably be a
> problem, but we are not yet there :(. So, IIUC by caching policy->max,
> for example, we might affect what thermal expects from cpufreq.

I think we could probably just access policy->max without rwsem, as long
as we ensure policy is valid. Cpufreq will take care to cap a frequency
request to an upper limit put in by thermal anyway so I don't think it's
a problematic race. But I haven't spent much time thinking about it.

>
...
>> Done (added linux-pm, PeterZ and Rafael as well).
>>
> 
> This discussion is pretty interesting, yes. I'm a bit afraid people
> bumped into this might have troubles understanding context, though.
> And I'm not sure how to give them that context; maybe start a new thread
> summarizing what has been discussed so far?

Yeah, that occurred to me as well. I wasn't sure whether to restart the
thread or put in the locking I had in mind and just send it with the
next schedfreq RFC series, which should be in the next few weeks. I was
going to do the latter but here is a recap of the topic from my side:

Currently schedfreq has to go through two stages of aggregation. The
first is aggregating the capacity request votes from the different sched
classes within a CPU. The second is aggregating the capacity request
votes from the CPUs within a frequency domain. I'm looking to solve the
locking issues with both of these stages as well as those with calling
into cpufreq in the fast path.

For the first stage, currently we assume that the rq lock is held. This
is the case for the existing calls into schedfreq (the load balancer
required a minor change). There are a few more hooks that need to go
into migration paths in core.c which will be slightly more painful, but
they are IMO doable.

For the second stage I believe an internal schedfreq spinlock is
required, for one thing to protect against competing updates within the
same frequency domain from different CPUs, but also so that we can
guarantee the cpufreq policy is valid, which can be done if the
governor_start/stop callbacks take the spinlock as well.

As for accessing various things in cpufreq->policy and trying to take
rwsem in the fast path, we should be able to either cache some of the
items in the governor_start() path, eliminate the references, or access
them without locking rwsem (as long as we know the policy is valid,
which we do by taking the spinlock in governor_start()). Here are the
things we currently reference in the schedfreq set capacity hook and my
thoughts on each of them:

policy->governor: this is currently tested to see if schedfreq is
enabled, but this can be tracked internally to schedfreq and set in the
governor_start/stop callbacks

policy->governor_data: used to access schedfreq's data for the policy,
could be changed to an internal schedfreq percpu pointer

policy->cpus, policy->freq_table: these can be cached internally to
schedfreq on governor_start/stop

policy->max: as mentioned above I *think* we could get away with
accessing this without locking rwsem as discussed above

policy->transition_ongoing: this isn't strictly required as schedfreq
could track internally whether it has a transition outstanding, but we
should also be okay to look at this without policy->rwsem since
schedfreq is the only one queuing transitions, again assuming we take
care to ensure policy is valid as above

Finally when actually requesting a frequency change in the fast path, we
can trylock policy->rwsem and fall back to the slow path (kthread) if
that fails.

thanks,
Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20 14:50                     ` Steve Muckle
@ 2016-01-20 15:58                       ` Juri Lelli
  2016-01-20 16:39                           ` Punit Agrawal
                                           ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Juri Lelli @ 2016-01-20 15:58 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Rafael J. Wysocki, Javi Merino,
	Punit Agrawal

On 20/01/16 06:50, Steve Muckle wrote:
> On 01/20/2016 04:18 AM, Juri Lelli wrote:
> > I fear that caching could break thermal. If everybody was already using
> > sched-freq interface to control frequency this won't probably be a
> > problem, but we are not yet there :(. So, IIUC by caching policy->max,
> > for example, we might affect what thermal expects from cpufreq.
> 
> I think we could probably just access policy->max without rwsem, as long
> as we ensure policy is valid. Cpufreq will take care to cap a frequency
> request to an upper limit put in by thermal anyway so I don't think it's
> a problematic race. But I haven't spent much time thinking about it.
> 

Mmm, can't the fact that the scheduler might think it can still request
a frequency above max be problematic?

Anyway, thermal seems to use cpufreq_set_policy() for changing max (if I
got it right Punit :)), so we might be able to intercept such sites and
do proper update of cached values.

> >
> ...
> >> Done (added linux-pm, PeterZ and Rafael as well).
> >>
> > 
> > This discussion is pretty interesting, yes. I'm a bit afraid people
> > bumped into this might have troubles understanding context, though.
> > And I'm not sure how to give them that context; maybe start a new thread
> > summarizing what has been discussed so far?
> 
> Yeah, that occurred to me as well. I wasn't sure whether to restart the
> thread or put in the locking I had in mind and just send it with the
> next schedfreq RFC series, which should be in the next few weeks. I was
> going to do the latter but here is a recap of the topic from my side:
> 

Thanks a lot for writing this down!

> Currently schedfreq has to go through two stages of aggregation. The
> first is aggregating the capacity request votes from the different sched
> classes within a CPU. The second is aggregating the capacity request
> votes from the CPUs within a frequency domain. I'm looking to solve the
> locking issues with both of these stages as well as those with calling
> into cpufreq in the fast path.
> 
> For the first stage, currently we assume that the rq lock is held. This
> is the case for the existing calls into schedfreq (the load balancer
> required a minor change). There are a few more hooks that need to go
> into migration paths in core.c which will be slightly more painful, but
> they are IMO doable.
> 
> For the second stage I believe an internal schedfreq spinlock is
> required, for one thing to protect against competing updates within the
> same frequency domain from different CPUs, but also so that we can
> guarantee the cpufreq policy is valid, which can be done if the
> governor_start/stop callbacks take the spinlock as well.
> 

Does this need to nest within the rq lock?

> As for accessing various things in cpufreq->policy and trying to take
> rwsem in the fast path, we should be able to either cache some of the
> items in the governor_start() path, eliminate the references, or access
> them without locking rwsem (as long as we know the policy is valid,

If we only need to guarantee existence of the policy, without any direct
updates, RCU might be a good fit.

> which we do by taking the spinlock in governor_start()). Here are the
> things we currently reference in the schedfreq set capacity hook and my
> thoughts on each of them:
> 
> policy->governor: this is currently tested to see if schedfreq is
> enabled, but this can be tracked internally to schedfreq and set in the
> governor_start/stop callbacks
> 
> policy->governor_data: used to access schedfreq's data for the policy,
> could be changed to an internal schedfreq percpu pointer
> 
> policy->cpus, policy->freq_table: these can be cached internally to
> schedfreq on governor_start/stop
> 
> policy->max: as mentioned above I *think* we could get away with
> accessing this without locking rwsem as discussed above
> 
> policy->transition_ongoing: this isn't strictly required as schedfreq
> could track internally whether it has a transition outstanding, but we
> should also be okay to look at this without policy->rwsem since
> schedfreq is the only one queuing transitions, again assuming we take
> care to ensure policy is valid as above
> 
> Finally when actually requesting a frequency change in the fast path, we
> can trylock policy->rwsem and fall back to the slow path (kthread) if
> that fails.
> 

OTOH, all the needed aggregation could make the fast path not so fast in
the end. So, maybe having a fast and a slow path is always good?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20 15:58                       ` Juri Lelli
@ 2016-01-20 16:39                           ` Punit Agrawal
  2016-01-20 16:46                           ` Punit Agrawal
  2016-01-20 20:46                         ` Steve Muckle
  2 siblings, 0 replies; 18+ messages in thread
From: Punit Agrawal @ 2016-01-20 16:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Michael Turquette, Vincent Guittot,
	Patrick Bellasi, Morten Rasmussen, Dietmar Eggemann,
	Viresh Kumar, linux-kernel, linux-pm, Peter Zijlstra,
	Rafael J. Wysocki, Javi Merino

Juri Lelli <Juri.Lelli@arm.com> writes:

> On 20/01/16 06:50, Steve Muckle wrote:
>> On 01/20/2016 04:18 AM, Juri Lelli wrote:
>> > I fear that caching could break thermal. If everybody was already using
>> > sched-freq interface to control frequency this won't probably be a
>> > problem, but we are not yet there :(. So, IIUC by caching policy->max,
>> > for example, we might affect what thermal expects from cpufreq.
>> 
>> I think we could probably just access policy->max without rwsem, as long
>> as we ensure policy is valid. Cpufreq will take care to cap a frequency
>> request to an upper limit put in by thermal anyway so I don't think it's
>> a problematic race. But I haven't spent much time thinking about it.
>> 
>
> Mmm, can't the fact that the scheduler might think it can still request
> a frequency above max be problematic?
>
> Anyway, thermal seems to use cpufreq_set_policy() for changing max (if I
> got it right Punit :)), so we might be able to intercept such sites and
> do proper update of cached values.

That's correct. In cpu_cooling there is a call to cpufreq_update_policy
that subsequently calls cpufreq_set_policy.

The max itself is updated via the policy change notifier registered by
the cooling device.

>
>> >
>> ...
>> >> Done (added linux-pm, PeterZ and Rafael as well).
>> >>
>> > 
>> > This discussion is pretty interesting, yes. I'm a bit afraid people
>> > bumped into this might have troubles understanding context, though.
>> > And I'm not sure how to give them that context; maybe start a new thread
>> > summarizing what has been discussed so far?
>> 
>> Yeah, that occurred to me as well. I wasn't sure whether to restart the
>> thread or put in the locking I had in mind and just send it with the
>> next schedfreq RFC series, which should be in the next few weeks. I was
>> going to do the latter but here is a recap of the topic from my side:
>> 
>
> Thanks a lot for writing this down!
>
>> Currently schedfreq has to go through two stages of aggregation. The
>> first is aggregating the capacity request votes from the different sched
>> classes within a CPU. The second is aggregating the capacity request
>> votes from the CPUs within a frequency domain. I'm looking to solve the
>> locking issues with both of these stages as well as those with calling
>> into cpufreq in the fast path.
>> 
>> For the first stage, currently we assume that the rq lock is held. This
>> is the case for the existing calls into schedfreq (the load balancer
>> required a minor change). There are a few more hooks that need to go
>> into migration paths in core.c which will be slightly more painful, but
>> they are IMO doable.
>> 
>> For the second stage I believe an internal schedfreq spinlock is
>> required, for one thing to protect against competing updates within the
>> same frequency domain from different CPUs, but also so that we can
>> guarantee the cpufreq policy is valid, which can be done if the
>> governor_start/stop callbacks take the spinlock as well.
>> 
>
> Does this need to nest within the rq lock?
>
>> As for accessing various things in cpufreq->policy and trying to take
>> rwsem in the fast path, we should be able to either cache some of the
>> items in the governor_start() path, eliminate the references, or access
>> them without locking rwsem (as long as we know the policy is valid,
>
> If we only need to guarantee existence of the policy, without any direct
> updates, RCU might be a good fit.
>
>> which we do by taking the spinlock in governor_start()). Here are the
>> things we currently reference in the schedfreq set capacity hook and my
>> thoughts on each of them:
>> 
>> policy->governor: this is currently tested to see if schedfreq is
>> enabled, but this can be tracked internally to schedfreq and set in the
>> governor_start/stop callbacks
>> 
>> policy->governor_data: used to access schedfreq's data for the policy,
>> could be changed to an internal schedfreq percpu pointer
>> 
>> policy->cpus, policy->freq_table: these can be cached internally to
>> schedfreq on governor_start/stop
>> 
>> policy->max: as mentioned above I *think* we could get away with
>> accessing this without locking rwsem as discussed above
>> 
>> policy->transition_ongoing: this isn't strictly required as schedfreq
>> could track internally whether it has a transition outstanding, but we
>> should also be okay to look at this without policy->rwsem since
>> schedfreq is the only one queuing transitions, again assuming we take
>> care to ensure policy is valid as above
>> 
>> Finally when actually requesting a frequency change in the fast path, we
>> can trylock policy->rwsem and fall back to the slow path (kthread) if
>> that fails.
>> 
>
> OTOH, all the needed aggregation could make the fast path not so fast in
> the end. So, maybe having a fast and a slow path is always good?
>
> Thanks,
>
> - Juri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
@ 2016-01-20 16:39                           ` Punit Agrawal
  0 siblings, 0 replies; 18+ messages in thread
From: Punit Agrawal @ 2016-01-20 16:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Michael Turquette, Vincent Guittot,
	Patrick Bellasi, Morten Rasmussen, Dietmar Eggemann,
	Viresh Kumar, linux-kernel, linux-pm, Peter Zijlstra,
	Rafael J. Wysocki, Javi Merino

Juri Lelli <Juri.Lelli@arm.com> writes:

> On 20/01/16 06:50, Steve Muckle wrote:
>> On 01/20/2016 04:18 AM, Juri Lelli wrote:
>> > I fear that caching could break thermal. If everybody was already using
>> > sched-freq interface to control frequency this won't probably be a
>> > problem, but we are not yet there :(. So, IIUC by caching policy->max,
>> > for example, we might affect what thermal expects from cpufreq.
>> 
>> I think we could probably just access policy->max without rwsem, as long
>> as we ensure policy is valid. Cpufreq will take care to cap a frequency
>> request to an upper limit put in by thermal anyway so I don't think it's
>> a problematic race. But I haven't spent much time thinking about it.
>> 
>
> Mmm, can't the fact that the scheduler might think it can still request
> a frequency above max be problematic?
>
> Anyway, thermal seems to use cpufreq_set_policy() for changing max (if I
> got it right Punit :)), so we might be able to intercept such sites and
> do proper update of cached values.

That's correct. In cpu_cooling there is a call to cpufreq_update_policy
that subsequently calls cpufreq_set_policy.

The max itself is updated via the policy change notifier registered by
the cooling device.

>
>> >
>> ...
>> >> Done (added linux-pm, PeterZ and Rafael as well).
>> >>
>> > 
>> > This discussion is pretty interesting, yes. I'm a bit afraid people
>> > bumped into this might have troubles understanding context, though.
>> > And I'm not sure how to give them that context; maybe start a new thread
>> > summarizing what has been discussed so far?
>> 
>> Yeah, that occurred to me as well. I wasn't sure whether to restart the
>> thread or put in the locking I had in mind and just send it with the
>> next schedfreq RFC series, which should be in the next few weeks. I was
>> going to do the latter but here is a recap of the topic from my side:
>> 
>
> Thanks a lot for writing this down!
>
>> Currently schedfreq has to go through two stages of aggregation. The
>> first is aggregating the capacity request votes from the different sched
>> classes within a CPU. The second is aggregating the capacity request
>> votes from the CPUs within a frequency domain. I'm looking to solve the
>> locking issues with both of these stages as well as those with calling
>> into cpufreq in the fast path.
>> 
>> For the first stage, currently we assume that the rq lock is held. This
>> is the case for the existing calls into schedfreq (the load balancer
>> required a minor change). There are a few more hooks that need to go
>> into migration paths in core.c which will be slightly more painful, but
>> they are IMO doable.
>> 
>> For the second stage I believe an internal schedfreq spinlock is
>> required, for one thing to protect against competing updates within the
>> same frequency domain from different CPUs, but also so that we can
>> guarantee the cpufreq policy is valid, which can be done if the
>> governor_start/stop callbacks take the spinlock as well.
>> 
>
> Does this need to nest within the rq lock?
>
>> As for accessing various things in cpufreq->policy and trying to take
>> rwsem in the fast path, we should be able to either cache some of the
>> items in the governor_start() path, eliminate the references, or access
>> them without locking rwsem (as long as we know the policy is valid,
>
> If we only need to guarantee existence of the policy, without any direct
> updates, RCU might be a good fit.
>
>> which we do by taking the spinlock in governor_start()). Here are the
>> things we currently reference in the schedfreq set capacity hook and my
>> thoughts on each of them:
>> 
>> policy->governor: this is currently tested to see if schedfreq is
>> enabled, but this can be tracked internally to schedfreq and set in the
>> governor_start/stop callbacks
>> 
>> policy->governor_data: used to access schedfreq's data for the policy,
>> could be changed to an internal schedfreq percpu pointer
>> 
>> policy->cpus, policy->freq_table: these can be cached internally to
>> schedfreq on governor_start/stop
>> 
>> policy->max: as mentioned above I *think* we could get away with
>> accessing this without locking rwsem as discussed above
>> 
>> policy->transition_ongoing: this isn't strictly required as schedfreq
>> could track internally whether it has a transition outstanding, but we
>> should also be okay to look at this without policy->rwsem since
>> schedfreq is the only one queuing transitions, again assuming we take
>> care to ensure policy is valid as above
>> 
>> Finally when actually requesting a frequency change in the fast path, we
>> can trylock policy->rwsem and fall back to the slow path (kthread) if
>> that fails.
>> 
>
> OTOH, all the needed aggregation could make the fast path not so fast in
> the end. So, maybe having a fast and a slow path is always good?
>
> Thanks,
>
> - Juri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20 15:58                       ` Juri Lelli
@ 2016-01-20 16:46                           ` Punit Agrawal
  2016-01-20 16:46                           ` Punit Agrawal
  2016-01-20 20:46                         ` Steve Muckle
  2 siblings, 0 replies; 18+ messages in thread
From: Punit Agrawal @ 2016-01-20 16:46 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Michael Turquette, Vincent Guittot,
	Patrick Bellasi, Morten Rasmussen, Dietmar Eggemann,
	Viresh Kumar, linux-kernel, linux-pm, Peter Zijlstra,
	Rafael J. Wysocki, Javi Merino

Juri Lelli <Juri.Lelli@arm.com> writes:

> On 20/01/16 06:50, Steve Muckle wrote:
>> On 01/20/2016 04:18 AM, Juri Lelli wrote:
>> > I fear that caching could break thermal. If everybody was already using
>> > sched-freq interface to control frequency this won't probably be a
>> > problem, but we are not yet there :(. So, IIUC by caching policy->max,
>> > for example, we might affect what thermal expects from cpufreq.
>> 
>> I think we could probably just access policy->max without rwsem, as long
>> as we ensure policy is valid. Cpufreq will take care to cap a frequency
>> request to an upper limit put in by thermal anyway so I don't think it's
>> a problematic race. But I haven't spent much time thinking about it.
>> 
>
> Mmm, can't the fact that the scheduler might think it can still request
> a frequency above max be problematic?
>
> Anyway, thermal seems to use cpufreq_set_policy() for changing max (if I
> got it right Punit :)), so we might be able to intercept such sites and
> do proper update of cached values.


That's correct. The cpu cooling device calls cpufreq_update_policy which
subsequently calls cpufreq_set_policy.

The max itself is changed via the policy notifier callback in the
cooling device.

>
>> >
>> ...
>> >> Done (added linux-pm, PeterZ and Rafael as well).
>> >>
>> > 
>> > This discussion is pretty interesting, yes. I'm a bit afraid people
>> > bumped into this might have troubles understanding context, though.
>> > And I'm not sure how to give them that context; maybe start a new thread
>> > summarizing what has been discussed so far?
>> 
>> Yeah, that occurred to me as well. I wasn't sure whether to restart the
>> thread or put in the locking I had in mind and just send it with the
>> next schedfreq RFC series, which should be in the next few weeks. I was
>> going to do the latter but here is a recap of the topic from my side:
>> 
>
> Thanks a lot for writing this down!
>
>> Currently schedfreq has to go through two stages of aggregation. The
>> first is aggregating the capacity request votes from the different sched
>> classes within a CPU. The second is aggregating the capacity request
>> votes from the CPUs within a frequency domain. I'm looking to solve the
>> locking issues with both of these stages as well as those with calling
>> into cpufreq in the fast path.
>> 
>> For the first stage, currently we assume that the rq lock is held. This
>> is the case for the existing calls into schedfreq (the load balancer
>> required a minor change). There are a few more hooks that need to go
>> into migration paths in core.c which will be slightly more painful, but
>> they are IMO doable.
>> 
>> For the second stage I believe an internal schedfreq spinlock is
>> required, for one thing to protect against competing updates within the
>> same frequency domain from different CPUs, but also so that we can
>> guarantee the cpufreq policy is valid, which can be done if the
>> governor_start/stop callbacks take the spinlock as well.
>> 
>
> Does this need to nest within the rq lock?
>
>> As for accessing various things in cpufreq->policy and trying to take
>> rwsem in the fast path, we should be able to either cache some of the
>> items in the governor_start() path, eliminate the references, or access
>> them without locking rwsem (as long as we know the policy is valid,
>
> If we only need to guarantee existence of the policy, without any direct
> updates, RCU might be a good fit.
>
>> which we do by taking the spinlock in governor_start()). Here are the
>> things we currently reference in the schedfreq set capacity hook and my
>> thoughts on each of them:
>> 
>> policy->governor: this is currently tested to see if schedfreq is
>> enabled, but this can be tracked internally to schedfreq and set in the
>> governor_start/stop callbacks
>> 
>> policy->governor_data: used to access schedfreq's data for the policy,
>> could be changed to an internal schedfreq percpu pointer
>> 
>> policy->cpus, policy->freq_table: these can be cached internally to
>> schedfreq on governor_start/stop
>> 
>> policy->max: as mentioned above I *think* we could get away with
>> accessing this without locking rwsem as discussed above
>> 
>> policy->transition_ongoing: this isn't strictly required as schedfreq
>> could track internally whether it has a transition outstanding, but we
>> should also be okay to look at this without policy->rwsem since
>> schedfreq is the only one queuing transitions, again assuming we take
>> care to ensure policy is valid as above
>> 
>> Finally when actually requesting a frequency change in the fast path, we
>> can trylock policy->rwsem and fall back to the slow path (kthread) if
>> that fails.
>> 
>
> OTOH, all the needed aggregation could make the fast path not so fast in
> the end. So, maybe having a fast and a slow path is always good?
>
> Thanks,
>
> - Juri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
@ 2016-01-20 16:46                           ` Punit Agrawal
  0 siblings, 0 replies; 18+ messages in thread
From: Punit Agrawal @ 2016-01-20 16:46 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Michael Turquette, Vincent Guittot,
	Patrick Bellasi, Morten Rasmussen, Dietmar Eggemann,
	Viresh Kumar, linux-kernel, linux-pm, Peter Zijlstra,
	Rafael J. Wysocki, Javi Merino

Juri Lelli <Juri.Lelli@arm.com> writes:

> On 20/01/16 06:50, Steve Muckle wrote:
>> On 01/20/2016 04:18 AM, Juri Lelli wrote:
>> > I fear that caching could break thermal. If everybody was already using
>> > sched-freq interface to control frequency this won't probably be a
>> > problem, but we are not yet there :(. So, IIUC by caching policy->max,
>> > for example, we might affect what thermal expects from cpufreq.
>> 
>> I think we could probably just access policy->max without rwsem, as long
>> as we ensure policy is valid. Cpufreq will take care to cap a frequency
>> request to an upper limit put in by thermal anyway so I don't think it's
>> a problematic race. But I haven't spent much time thinking about it.
>> 
>
> Mmm, can't the fact that the scheduler might think it can still request
> a frequency above max be problematic?
>
> Anyway, thermal seems to use cpufreq_set_policy() for changing max (if I
> got it right Punit :)), so we might be able to intercept such sites and
> do proper update of cached values.


That's correct. The cpu cooling device calls cpufreq_update_policy which
subsequently calls cpufreq_set_policy.

The max itself is changed via the policy notifier callback in the
cooling device.

>
>> >
>> ...
>> >> Done (added linux-pm, PeterZ and Rafael as well).
>> >>
>> > 
>> > This discussion is pretty interesting, yes. I'm a bit afraid people
>> > bumped into this might have troubles understanding context, though.
>> > And I'm not sure how to give them that context; maybe start a new thread
>> > summarizing what has been discussed so far?
>> 
>> Yeah, that occurred to me as well. I wasn't sure whether to restart the
>> thread or put in the locking I had in mind and just send it with the
>> next schedfreq RFC series, which should be in the next few weeks. I was
>> going to do the latter but here is a recap of the topic from my side:
>> 
>
> Thanks a lot for writing this down!
>
>> Currently schedfreq has to go through two stages of aggregation. The
>> first is aggregating the capacity request votes from the different sched
>> classes within a CPU. The second is aggregating the capacity request
>> votes from the CPUs within a frequency domain. I'm looking to solve the
>> locking issues with both of these stages as well as those with calling
>> into cpufreq in the fast path.
>> 
>> For the first stage, currently we assume that the rq lock is held. This
>> is the case for the existing calls into schedfreq (the load balancer
>> required a minor change). There are a few more hooks that need to go
>> into migration paths in core.c which will be slightly more painful, but
>> they are IMO doable.
>> 
>> For the second stage I believe an internal schedfreq spinlock is
>> required, for one thing to protect against competing updates within the
>> same frequency domain from different CPUs, but also so that we can
>> guarantee the cpufreq policy is valid, which can be done if the
>> governor_start/stop callbacks take the spinlock as well.
>> 
>
> Does this need to nest within the rq lock?
>
>> As for accessing various things in cpufreq->policy and trying to take
>> rwsem in the fast path, we should be able to either cache some of the
>> items in the governor_start() path, eliminate the references, or access
>> them without locking rwsem (as long as we know the policy is valid,
>
> If we only need to guarantee existence of the policy, without any direct
> updates, RCU might be a good fit.
>
>> which we do by taking the spinlock in governor_start()). Here are the
>> things we currently reference in the schedfreq set capacity hook and my
>> thoughts on each of them:
>> 
>> policy->governor: this is currently tested to see if schedfreq is
>> enabled, but this can be tracked internally to schedfreq and set in the
>> governor_start/stop callbacks
>> 
>> policy->governor_data: used to access schedfreq's data for the policy,
>> could be changed to an internal schedfreq percpu pointer
>> 
>> policy->cpus, policy->freq_table: these can be cached internally to
>> schedfreq on governor_start/stop
>> 
>> policy->max: as mentioned above I *think* we could get away with
>> accessing this without locking rwsem as discussed above
>> 
>> policy->transition_ongoing: this isn't strictly required as schedfreq
>> could track internally whether it has a transition outstanding, but we
>> should also be okay to look at this without policy->rwsem since
>> schedfreq is the only one queuing transitions, again assuming we take
>> care to ensure policy is valid as above
>> 
>> Finally when actually requesting a frequency change in the fast path, we
>> can trylock policy->rwsem and fall back to the slow path (kthread) if
>> that fails.
>> 
>
> OTOH, all the needed aggregation could make the fast path not so fast in
> the end. So, maybe having a fast and a slow path is always good?
>
> Thanks,
>
> - Juri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20 15:58                       ` Juri Lelli
  2016-01-20 16:39                           ` Punit Agrawal
  2016-01-20 16:46                           ` Punit Agrawal
@ 2016-01-20 20:46                         ` Steve Muckle
  2016-01-21  9:45                           ` Juri Lelli
  2 siblings, 1 reply; 18+ messages in thread
From: Steve Muckle @ 2016-01-20 20:46 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Rafael J. Wysocki, Javi Merino,
	Punit Agrawal

Hi Juri,

On 01/20/2016 07:58 AM, Juri Lelli wrote:
> On 20/01/16 06:50, Steve Muckle wrote:
>> On 01/20/2016 04:18 AM, Juri Lelli wrote:
>>> I fear that caching could break thermal. If everybody was already using
>>> sched-freq interface to control frequency this won't probably be a
>>> problem, but we are not yet there :(. So, IIUC by caching policy->max,
>>> for example, we might affect what thermal expects from cpufreq.
>>
>> I think we could probably just access policy->max without rwsem, as long
>> as we ensure policy is valid. Cpufreq will take care to cap a frequency
>> request to an upper limit put in by thermal anyway so I don't think it's
>> a problematic race. But I haven't spent much time thinking about it.
>>
> 
> Mmm, can't the fact that the scheduler might think it can still request
> a frequency above max be problematic?

I don't think so because cpufreq is going to clamp the incoming request
to be within policy->min and policy->max anyway (see
__cpufreq_driver_target() near the top). So nothing's going to
break/catch fire.

The question to me is, if the system gets throttled (or un-throttled)
just as you are evaluating capacity requests, are you actually better
off using the new policy->max value to scale? I suspect not, because the
recent load tracking statistics and hence the capacity requests will
have been calculated based on the original policy->max rather than the
one that was just written.

Perhaps a decayed version of policy->max could be maintained and used to
scale instead. Not sure if this will be enough of an issue in practice
though to warrant it...

> 
...
> 
>> Currently schedfreq has to go through two stages of aggregation. The
>> first is aggregating the capacity request votes from the different sched
>> classes within a CPU. The second is aggregating the capacity request
>> votes from the CPUs within a frequency domain. I'm looking to solve the
>> locking issues with both of these stages as well as those with calling
>> into cpufreq in the fast path.
>>
>> For the first stage, currently we assume that the rq lock is held. This
>> is the case for the existing calls into schedfreq (the load balancer
>> required a minor change). There are a few more hooks that need to go
>> into migration paths in core.c which will be slightly more painful, but
>> they are IMO doable.
>>
>> For the second stage I believe an internal schedfreq spinlock is
>> required, for one thing to protect against competing updates within the
>> same frequency domain from different CPUs, but also so that we can
>> guarantee the cpufreq policy is valid, which can be done if the
>> governor_start/stop callbacks take the spinlock as well.
>>
> 
> Does this need to nest within the rq lock?

I think it will have to because I suspect releasing the rq lock to run
the second stage, then re-acquiring it afterwards, will cause breakage
in the scheduler paths from which this is all being called.

Otherwise I don't see a requirement within schedfreq for it to be nested.

>
>> As for accessing various things in cpufreq->policy and trying to take
>> rwsem in the fast path, we should be able to either cache some of the
>> items in the governor_start() path, eliminate the references, or access
>> them without locking rwsem (as long as we know the policy is valid,
> 
> If we only need to guarantee existence of the policy, without any direct
> updates, RCU might be a good fit.

It must be guaranteed that policy->cpus and policy->freq_table are
current/synchronized with cpufreq for the duration of the second stage.
I think this precludes the use of RCU.

We're going to need an internal schedfreq spinlock to arbitrate multiple
CPUs in a frequency domain trying to go through the hot path
concurrently anyway, so we're really just extending that critical
section to the governor start and stop callbacks where we can safely
cache some of the policy data.

> 
>> which we do by taking the spinlock in governor_start()). Here are the
>> things we currently reference in the schedfreq set capacity hook and my
>> thoughts on each of them:
>>
>> policy->governor: this is currently tested to see if schedfreq is
>> enabled, but this can be tracked internally to schedfreq and set in the
>> governor_start/stop callbacks
>>
>> policy->governor_data: used to access schedfreq's data for the policy,
>> could be changed to an internal schedfreq percpu pointer
>>
>> policy->cpus, policy->freq_table: these can be cached internally to
>> schedfreq on governor_start/stop
>>
>> policy->max: as mentioned above I *think* we could get away with
>> accessing this without locking rwsem as discussed above
>>
>> policy->transition_ongoing: this isn't strictly required as schedfreq
>> could track internally whether it has a transition outstanding, but we
>> should also be okay to look at this without policy->rwsem since
>> schedfreq is the only one queuing transitions, again assuming we take
>> care to ensure policy is valid as above
>>
>> Finally when actually requesting a frequency change in the fast path, we
>> can trylock policy->rwsem and fall back to the slow path (kthread) if
>> that fails.
>>
> 
> OTOH, all the needed aggregation could make the fast path not so fast in
> the end. So, maybe having a fast and a slow path is always good?

Sorry I didn't follow... What do you mean by having a fast and a slow path?

thanks,
Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20  1:24                 ` sched-freq locking Steve Muckle
  2016-01-20 12:18                   ` Juri Lelli
@ 2016-01-21  1:22                   ` Rafael J. Wysocki
  2016-01-21  1:39                     ` Steve Muckle
  1 sibling, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2016-01-21  1:22 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Michael Turquette, Vincent Guittot, Juri Lelli, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra

On Tuesday, January 19, 2016 05:24:49 PM Steve Muckle wrote:
> On 01/19/2016 03:40 PM, Michael Turquette wrote:
> > Right, this was _the_ original impetus behind the design decision to
> > muck around with struct cpufreq_policy in the hot path which goes al
> > the way back to v1.
> > 
> > An alternative thought is that we can make copies of the relevant bits
> > of struct cpufreq_policy that we do not expect too change often. These
> > will not require any locks as they are mostly read-only data on the
> > scheduler side of the interface. Or we could even go all in and just
> > make local copies of the struct directly, during the GOV_START
> > perhaps, with:
> 
> I believe this is a good first step as it avoids reworking a huge amount
> of locking and can get us to something functionally correct. It is what
> I had proposed earlier, copying the enabled CPUs and freq table in
> during the governor start callback. Unless there are objections to it
> I'll add it to the next schedfreq RFC.
> 
> > 
> ...
> > 
> > Well if we're going to try an optimize out every single false-positive
> > wakeup then I think that the cleanest long term solution would be
> > rework the per-policy locking around struct cpufreq_policy to use a
> > raw spinlock.
> 
> It would be nice if the policy lock was a spinlock but I don't know how
> easy that is. From a quick look at cpufreq there's a blocking notifier
> chain that's called with rwsem held, so it looks messy. Potentially long
> term indeed.
> 
> >> Also it'd be good I think to avoid building in an assumption that we'll
> >> never want to run solely in the fast (atomic) path. Perhaps ARM won't,
> >> and x86 may never use this, but it's reasonable to think another
> >> platform might come along which uses cpufreq and has the capability to
> >> kick off cpufreq transitions swiftly and without sleeping. Maybe ARM
> >> platforms will evolve to have that capability.
> > 
> > The current design of the cpufreq subsystem and its interfaces have
> > made this choice for us. sched-freq is just another consumer of
> > cpufreq, and until cpufreq's own locking scheme is improved then we
> > have no choice.
> 
> I did not word that very well - I should have said, we should avoid
> building in an assumption that we never want to try and run in the fast
> path.
> 
> AFAICS, once we've calculated that a frequency change is required we can
> down_write_trylock(&policy->rwsem) in the fast path and go ahead with
> the transition, if the trylock succeeds and the driver supports fast
> path transitions. We can fall back to the slow path (waking up the
> kthread) if that fails.
> 
> > This discussion is pretty useful. Should we Cc lkml to this thread?
> 
> Done (added linux-pm, PeterZ and Rafael as well).

Thanks!

One comment here (which may be a bit off in which case please ignore it).

You seem to be thinking that sched-freq needs to be a cpufreq governor
and thus be handled in the same way as ondemand, for example.

However, this doesn't have to be the case in principle.  For example,
if we have a special driver callback specifically to work with sched-freq,
it may just use that callback and bypass (almost) all of the usual
cpufreq mechanics.  This way you may avoid worrying about the governor
locking and related ugliness entirely.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-21  1:22                   ` Rafael J. Wysocki
@ 2016-01-21  1:39                     ` Steve Muckle
  2016-01-21  1:46                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Muckle @ 2016-01-21  1:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Michael Turquette, Vincent Guittot, Juri Lelli, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra

On 01/20/2016 05:22 PM, Rafael J. Wysocki wrote:
> One comment here (which may be a bit off in which case please ignore it).
> 
> You seem to be thinking that sched-freq needs to be a cpufreq governor
> and thus be handled in the same way as ondemand, for example.

That's true, I hadn't really given much thought to the alternative you
mention below.

> 
> However, this doesn't have to be the case in principle.  For example,
> if we have a special driver callback specifically to work with sched-freq,
> it may just use that callback and bypass (almost) all of the usual
> cpufreq mechanics.  This way you may avoid worrying about the governor
> locking and related ugliness entirely.

That sounds good but I'm worried about other consequences of taking
cpufreq out of the loop. For example wouldn't we need a new way for
something like thermal to set frequency limits?

thanks,
steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-21  1:39                     ` Steve Muckle
@ 2016-01-21  1:46                       ` Rafael J. Wysocki
  2016-01-21 10:49                         ` Juri Lelli
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2016-01-21  1:46 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Michael Turquette, Vincent Guittot, Juri Lelli, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra

On Wednesday, January 20, 2016 05:39:14 PM Steve Muckle wrote:
> On 01/20/2016 05:22 PM, Rafael J. Wysocki wrote:
> > One comment here (which may be a bit off in which case please ignore it).
> > 
> > You seem to be thinking that sched-freq needs to be a cpufreq governor
> > and thus be handled in the same way as ondemand, for example.
> 
> That's true, I hadn't really given much thought to the alternative you
> mention below.
> 
> > 
> > However, this doesn't have to be the case in principle.  For example,
> > if we have a special driver callback specifically to work with sched-freq,
> > it may just use that callback and bypass (almost) all of the usual
> > cpufreq mechanics.  This way you may avoid worrying about the governor
> > locking and related ugliness entirely.
> 
> That sounds good but I'm worried about other consequences of taking
> cpufreq out of the loop. For example wouldn't we need a new way for
> something like thermal to set frequency limits?

I don't know from the top of my head, but that's at least worth investigating.

Maybe we can keep the interface for those things unchanged, but handle it
differently under the hood?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-20 20:46                         ` Steve Muckle
@ 2016-01-21  9:45                           ` Juri Lelli
  2016-01-21 19:29                             ` Steve Muckle
  0 siblings, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2016-01-21  9:45 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Rafael J. Wysocki, Javi Merino,
	Punit Agrawal

On 20/01/16 12:46, Steve Muckle wrote:
> Hi Juri,
> 
> On 01/20/2016 07:58 AM, Juri Lelli wrote:
> > On 20/01/16 06:50, Steve Muckle wrote:
> >> On 01/20/2016 04:18 AM, Juri Lelli wrote:
> >>> I fear that caching could break thermal. If everybody was already using
> >>> sched-freq interface to control frequency this won't probably be a
> >>> problem, but we are not yet there :(. So, IIUC by caching policy->max,
> >>> for example, we might affect what thermal expects from cpufreq.
> >>
> >> I think we could probably just access policy->max without rwsem, as long
> >> as we ensure policy is valid. Cpufreq will take care to cap a frequency
> >> request to an upper limit put in by thermal anyway so I don't think it's
> >> a problematic race. But I haven't spent much time thinking about it.
> >>
> > 
> > Mmm, can't the fact that the scheduler might think it can still request
> > a frequency above max be problematic?
> 
> I don't think so because cpufreq is going to clamp the incoming request
> to be within policy->min and policy->max anyway (see
> __cpufreq_driver_target() near the top). So nothing's going to
> break/catch fire.
> 

Right. I wasn't worried about this, but rather by the fact that the
scheduler might think there is enough space in a CPU (and balance tasks
accordingly) when that space has actually been restricted by thermal.
But, I also think there are ways to prevent this from happening, we just
need to be aware of such a problem.

> The question to me is, if the system gets throttled (or un-throttled)
> just as you are evaluating capacity requests, are you actually better
> off using the new policy->max value to scale? I suspect not, because the
> recent load tracking statistics and hence the capacity requests will
> have been calculated based on the original policy->max rather than the
> one that was just written.
> 
> Perhaps a decayed version of policy->max could be maintained and used to
> scale instead. Not sure if this will be enough of an issue in practice
> though to warrant it...
> 

Yeah, I can't really see a big issue here at the moment.

> > 
> ...
> > 
> >> Currently schedfreq has to go through two stages of aggregation. The
> >> first is aggregating the capacity request votes from the different sched
> >> classes within a CPU. The second is aggregating the capacity request
> >> votes from the CPUs within a frequency domain. I'm looking to solve the
> >> locking issues with both of these stages as well as those with calling
> >> into cpufreq in the fast path.
> >>
> >> For the first stage, currently we assume that the rq lock is held. This
> >> is the case for the existing calls into schedfreq (the load balancer
> >> required a minor change). There are a few more hooks that need to go
> >> into migration paths in core.c which will be slightly more painful, but
> >> they are IMO doable.
> >>
> >> For the second stage I believe an internal schedfreq spinlock is
> >> required, for one thing to protect against competing updates within the
> >> same frequency domain from different CPUs, but also so that we can
> >> guarantee the cpufreq policy is valid, which can be done if the
> >> governor_start/stop callbacks take the spinlock as well.
> >>
> > 
> > Does this need to nest within the rq lock?
> 
> I think it will have to because I suspect releasing the rq lock to run
> the second stage, then re-acquiring it afterwards, will cause breakage
> in the scheduler paths from which this is all being called.
> 

Right, that's what I was thinking too. However, we need to be aware that
we might add some overhead caused by contention on this new spinlock.

> Otherwise I don't see a requirement within schedfreq for it to be nested.
> 

OTOH, doesn't second stage happen on the kthread (when we need to
sleep)?

> >
> >> As for accessing various things in cpufreq->policy and trying to take
> >> rwsem in the fast path, we should be able to either cache some of the
> >> items in the governor_start() path, eliminate the references, or access
> >> them without locking rwsem (as long as we know the policy is valid,
> > 
> > If we only need to guarantee existence of the policy, without any direct
> > updates, RCU might be a good fit.
> 
> It must be guaranteed that policy->cpus and policy->freq_table are
> current/synchronized with cpufreq for the duration of the second stage.
> I think this precludes the use of RCU.
> 

Mmm, it seems so yes.

> We're going to need an internal schedfreq spinlock to arbitrate multiple
> CPUs in a frequency domain trying to go through the hot path
> concurrently anyway, so we're really just extending that critical
> section to the governor start and stop callbacks where we can safely
> cache some of the policy data.
> 
> > 
> >> which we do by taking the spinlock in governor_start()). Here are the
> >> things we currently reference in the schedfreq set capacity hook and my
> >> thoughts on each of them:
> >>
> >> policy->governor: this is currently tested to see if schedfreq is
> >> enabled, but this can be tracked internally to schedfreq and set in the
> >> governor_start/stop callbacks
> >>
> >> policy->governor_data: used to access schedfreq's data for the policy,
> >> could be changed to an internal schedfreq percpu pointer
> >>
> >> policy->cpus, policy->freq_table: these can be cached internally to
> >> schedfreq on governor_start/stop
> >>
> >> policy->max: as mentioned above I *think* we could get away with
> >> accessing this without locking rwsem as discussed above
> >>
> >> policy->transition_ongoing: this isn't strictly required as schedfreq
> >> could track internally whether it has a transition outstanding, but we
> >> should also be okay to look at this without policy->rwsem since
> >> schedfreq is the only one queuing transitions, again assuming we take
> >> care to ensure policy is valid as above
> >>
> >> Finally when actually requesting a frequency change in the fast path, we
> >> can trylock policy->rwsem and fall back to the slow path (kthread) if
> >> that fails.
> >>
> > 
> > OTOH, all the needed aggregation could make the fast path not so fast in
> > the end. So, maybe having a fast and a slow path is always good?
> 
> Sorry I didn't follow... What do you mean by having a fast and a slow path?
> 

Sorry, I wasn't clear enough. What I was trying to say is that having
two different paths, one fast where we aggregate locally and fire a
request and a second one slower where we aggregate per frequency domain,
compute the new request and call cpufreq, might be always desirable;
even on system that could issue frequency changes without sleeping.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-21  1:46                       ` Rafael J. Wysocki
@ 2016-01-21 10:49                         ` Juri Lelli
  2016-01-22  1:21                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Juri Lelli @ 2016-01-21 10:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Michael Turquette, Vincent Guittot,
	Patrick Bellasi, Morten Rasmussen, Dietmar Eggemann,
	Viresh Kumar, linux-kernel, linux-pm, Peter Zijlstra,
	Javi Merino, Punit Agrawal

[+Punit, Javi]

Hi Rafael,

On 21/01/16 02:46, Rafael J. Wysocki wrote:
> On Wednesday, January 20, 2016 05:39:14 PM Steve Muckle wrote:
> > On 01/20/2016 05:22 PM, Rafael J. Wysocki wrote:
> > > One comment here (which may be a bit off in which case please ignore it).
> > > 
> > > You seem to be thinking that sched-freq needs to be a cpufreq governor
> > > and thus be handled in the same way as ondemand, for example.
> > 
> > That's true, I hadn't really given much thought to the alternative you
> > mention below.
> > 
> > > 
> > > However, this doesn't have to be the case in principle.  For example,
> > > if we have a special driver callback specifically to work with sched-freq,
> > > it may just use that callback and bypass (almost) all of the usual
> > > cpufreq mechanics.  This way you may avoid worrying about the governor
> > > locking and related ugliness entirely.
> > 
> > That sounds good but I'm worried about other consequences of taking
> > cpufreq out of the loop. For example wouldn't we need a new way for
> > something like thermal to set frequency limits?
> 
> I don't know from the top of my head, but that's at least worth investigating.
> 

Yes, that's an interesting alternative that we have to think through.

> Maybe we can keep the interface for those things unchanged, but handle it
> differently under the hood?
> 

Let me see if I understand what you are proposing :). If we don't want
to duplicate too many things, maybe it is still feasible to just use
existing cpufreq mechanics to handle hotplug, sysfs, thermal, etc. (with
possibly minor modifications to be notified of events) and only create a
new method to ask the driver for frequency changes, since we will have
replicated policy and freq_table information inside sched-freq.  Is that
what you were also thinking of by saying "bypass (almost) all the usual
cpufreq mechanics"? :)

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-21  9:45                           ` Juri Lelli
@ 2016-01-21 19:29                             ` Steve Muckle
  0 siblings, 0 replies; 18+ messages in thread
From: Steve Muckle @ 2016-01-21 19:29 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Rafael J. Wysocki, Javi Merino,
	Punit Agrawal

On 01/21/2016 01:45 AM, Juri Lelli wrote:
>>> Mmm, can't the fact that the scheduler might think it can still request
>>> a frequency above max be problematic?
>>
>> I don't think so because cpufreq is going to clamp the incoming request
>> to be within policy->min and policy->max anyway (see
>> __cpufreq_driver_target() near the top). So nothing's going to
>> break/catch fire.
>>
> 
> Right. I wasn't worried about this, but rather by the fact that the
> scheduler might think there is enough space in a CPU (and balance tasks
> accordingly) when that space has actually been restricted by thermal.
> But, I also think there are ways to prevent this from happening, we just
> need to be aware of such a problem.

Ah ok. This seems to me like an issue strictly between scheduler and
thermal.

>
...
>>>> Currently schedfreq has to go through two stages of aggregation. The
>>>> first is aggregating the capacity request votes from the different sched
>>>> classes within a CPU. The second is aggregating the capacity request
>>>> votes from the CPUs within a frequency domain. I'm looking to solve the
>>>> locking issues with both of these stages as well as those with calling
>>>> into cpufreq in the fast path.
>>>>
>>>> For the first stage, currently we assume that the rq lock is held. This
>>>> is the case for the existing calls into schedfreq (the load balancer
>>>> required a minor change). There are a few more hooks that need to go
>>>> into migration paths in core.c which will be slightly more painful, but
>>>> they are IMO doable.
>>>>
>>>> For the second stage I believe an internal schedfreq spinlock is
>>>> required, for one thing to protect against competing updates within the
>>>> same frequency domain from different CPUs, but also so that we can
>>>> guarantee the cpufreq policy is valid, which can be done if the
>>>> governor_start/stop callbacks take the spinlock as well.
>>>>
>>>
>>> Does this need to nest within the rq lock?
>>
>> I think it will have to because I suspect releasing the rq lock to run
>> the second stage, then re-acquiring it afterwards, will cause breakage
>> in the scheduler paths from which this is all being called.
>>
> 
> Right, that's what I was thinking too. However, we need to be aware that
> we might add some overhead caused by contention on this new spinlock.
> 
>> Otherwise I don't see a requirement within schedfreq for it to be nested.
>>
> 
> OTOH, doesn't second stage happen on the kthread (when we need to
> sleep)?

The second stage of aggregation currently happens in the fast path as well.

The frequency transition occurs either in the fast or slow path
depending on driver capability, throttling, and whether a request is
currently in progress.

> 
...
>>> OTOH, all the needed aggregation could make the fast path not so fast in
>>> the end. So, maybe having a fast and a slow path is always good?
>>
>> Sorry I didn't follow... What do you mean by having a fast and a slow path?
> 
> Sorry, I wasn't clear enough. What I was trying to say is that having
> two different paths, one fast where we aggregate locally and fire a
> request and a second one slower where we aggregate per frequency domain,
> compute the new request and call cpufreq, might be always desirable;
> even on system that could issue frequency changes without sleeping.

I think this is going to result in too much overhead because you'll have
to wake up the kthread most of the time. Any time the local CPU's
capacity request change equates to a different required CPU frequency,
you'll need to wake up the kthread to aggregate the CPU votes. If
another CPU in the domain has a higher or equal frequency request (a
common occurance) the kthread will wake to do nothing. And
waking/context switching to/running a thread seems very costly.

The next step from that is to make a note of the current max request in
the freq domain as MikeT suggested. This would be done during
aggregation in the slow path and then the fast path could avoid waking
the kthread when its capacity change request wouldn't impact the overall
frequency. I think this will need some further complexity to work such
as a mask of CPUs which are currently requesting the max frequency (or
at least a bit saying there's more than one). But it might lessen the
work in the fast path. I'd like to get the locking addressed and send
out another RFC prior to exploring that change though. Another internal
schedfreq lock will be required either way.

thanks,
Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-21 10:49                         ` Juri Lelli
@ 2016-01-22  1:21                           ` Rafael J. Wysocki
  2016-01-27  1:54                             ` Steve Muckle
  0 siblings, 1 reply; 18+ messages in thread
From: Rafael J. Wysocki @ 2016-01-22  1:21 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Michael Turquette, Vincent Guittot,
	Patrick Bellasi, Morten Rasmussen, Dietmar Eggemann,
	Viresh Kumar, linux-kernel, linux-pm, Peter Zijlstra,
	Javi Merino, Punit Agrawal

On Thursday, January 21, 2016 10:49:58 AM Juri Lelli wrote:
> [+Punit, Javi]
> 
> Hi Rafael,
> 
> On 21/01/16 02:46, Rafael J. Wysocki wrote:
> > On Wednesday, January 20, 2016 05:39:14 PM Steve Muckle wrote:
> > > On 01/20/2016 05:22 PM, Rafael J. Wysocki wrote:
> > > > One comment here (which may be a bit off in which case please ignore it).
> > > > 
> > > > You seem to be thinking that sched-freq needs to be a cpufreq governor
> > > > and thus be handled in the same way as ondemand, for example.
> > > 
> > > That's true, I hadn't really given much thought to the alternative you
> > > mention below.
> > > 
> > > > 
> > > > However, this doesn't have to be the case in principle.  For example,
> > > > if we have a special driver callback specifically to work with sched-freq,
> > > > it may just use that callback and bypass (almost) all of the usual
> > > > cpufreq mechanics.  This way you may avoid worrying about the governor
> > > > locking and related ugliness entirely.
> > > 
> > > That sounds good but I'm worried about other consequences of taking
> > > cpufreq out of the loop. For example wouldn't we need a new way for
> > > something like thermal to set frequency limits?
> > 
> > I don't know from the top of my head, but that's at least worth investigating.
> > 
> 
> Yes, that's an interesting alternative that we have to think through.
> 
> > Maybe we can keep the interface for those things unchanged, but handle it
> > differently under the hood?
> > 
> 
> Let me see if I understand what you are proposing :). If we don't want
> to duplicate too many things, maybe it is still feasible to just use
> existing cpufreq mechanics to handle hotplug, sysfs, thermal, etc. (with
> possibly minor modifications to be notified of events) and only create a
> new method to ask the driver for frequency changes, since we will have
> replicated policy and freq_table information inside sched-freq.  Is that
> what you were also thinking of by saying "bypass (almost) all the usual
> cpufreq mechanics"? :)

Yes, it is.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-22  1:21                           ` Rafael J. Wysocki
@ 2016-01-27  1:54                             ` Steve Muckle
  2016-01-27  8:03                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 18+ messages in thread
From: Steve Muckle @ 2016-01-27  1:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Juri Lelli
  Cc: Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Javi Merino, Punit Agrawal

On 01/21/2016 05:21 PM, Rafael J. Wysocki wrote:
> On Thursday, January 21, 2016 10:49:58 AM Juri Lelli wrote:
>> [+Punit, Javi]
>>
>> Hi Rafael,
>>
>> On 21/01/16 02:46, Rafael J. Wysocki wrote:
>>> On Wednesday, January 20, 2016 05:39:14 PM Steve Muckle wrote:
>>>> On 01/20/2016 05:22 PM, Rafael J. Wysocki wrote:
>>>>> One comment here (which may be a bit off in which case please ignore it).
>>>>>
>>>>> You seem to be thinking that sched-freq needs to be a cpufreq governor
>>>>> and thus be handled in the same way as ondemand, for example.
>>>>
>>>> That's true, I hadn't really given much thought to the alternative you
>>>> mention below.
>>>>
>>>>>
>>>>> However, this doesn't have to be the case in principle.  For example,
>>>>> if we have a special driver callback specifically to work with sched-freq,
>>>>> it may just use that callback and bypass (almost) all of the usual
>>>>> cpufreq mechanics.  This way you may avoid worrying about the governor
>>>>> locking and related ugliness entirely.
>>>>
>>>> That sounds good but I'm worried about other consequences of taking
>>>> cpufreq out of the loop. For example wouldn't we need a new way for
>>>> something like thermal to set frequency limits?
>>>
>>> I don't know from the top of my head, but that's at least worth investigating.
>>>
>>
>> Yes, that's an interesting alternative that we have to think through.
>>
>>> Maybe we can keep the interface for those things unchanged, but handle it
>>> differently under the hood?
>>>
>>
>> Let me see if I understand what you are proposing :). If we don't want
>> to duplicate too many things, maybe it is still feasible to just use
>> existing cpufreq mechanics to handle hotplug, sysfs, thermal, etc. (with
>> possibly minor modifications to be notified of events) and only create a
>> new method to ask the driver for frequency changes, since we will have
>> replicated policy and freq_table information inside sched-freq.  Is that
>> what you were also thinking of by saying "bypass (almost) all the usual
>> cpufreq mechanics"? :)
> 
> Yes, it is.

I've been working on the locking in schedfreq (governor) and believe it
is now functionally correct. It'll get sent out in another RFC soon.

Having wrestled with that locking a bit I can appreciate the value of
potentially deprecating some or all of the cpufreq core. I'm also
fearful though of making the current task (creating a scheduler-based
CPU frequency scaling algorithm) more complex than it is already.

For that reason my preference would be to get the thing to a viable
state as a governor first, assuming that's possible, and then take on
restructuring to eliminate/deprecate unnecessary infrastructure. Does
this seem reasonable?

thanks,
Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sched-freq locking
  2016-01-27  1:54                             ` Steve Muckle
@ 2016-01-27  8:03                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 18+ messages in thread
From: Rafael J. Wysocki @ 2016-01-27  8:03 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Juri Lelli, Michael Turquette, Vincent Guittot, Patrick Bellasi,
	Morten Rasmussen, Dietmar Eggemann, Viresh Kumar, linux-kernel,
	linux-pm, Peter Zijlstra, Javi Merino, Punit Agrawal

On Tuesday, January 26, 2016 05:54:13 PM Steve Muckle wrote:
> On 01/21/2016 05:21 PM, Rafael J. Wysocki wrote:
> > On Thursday, January 21, 2016 10:49:58 AM Juri Lelli wrote:
> >> [+Punit, Javi]
> >>
> >> Hi Rafael,
> >>
> >> On 21/01/16 02:46, Rafael J. Wysocki wrote:
> >>> On Wednesday, January 20, 2016 05:39:14 PM Steve Muckle wrote:
> >>>> On 01/20/2016 05:22 PM, Rafael J. Wysocki wrote:
> >>>>> One comment here (which may be a bit off in which case please ignore it).
> >>>>>
> >>>>> You seem to be thinking that sched-freq needs to be a cpufreq governor
> >>>>> and thus be handled in the same way as ondemand, for example.
> >>>>
> >>>> That's true, I hadn't really given much thought to the alternative you
> >>>> mention below.
> >>>>
> >>>>>
> >>>>> However, this doesn't have to be the case in principle.  For example,
> >>>>> if we have a special driver callback specifically to work with sched-freq,
> >>>>> it may just use that callback and bypass (almost) all of the usual
> >>>>> cpufreq mechanics.  This way you may avoid worrying about the governor
> >>>>> locking and related ugliness entirely.
> >>>>
> >>>> That sounds good but I'm worried about other consequences of taking
> >>>> cpufreq out of the loop. For example wouldn't we need a new way for
> >>>> something like thermal to set frequency limits?
> >>>
> >>> I don't know from the top of my head, but that's at least worth investigating.
> >>>
> >>
> >> Yes, that's an interesting alternative that we have to think through.
> >>
> >>> Maybe we can keep the interface for those things unchanged, but handle it
> >>> differently under the hood?
> >>>
> >>
> >> Let me see if I understand what you are proposing :). If we don't want
> >> to duplicate too many things, maybe it is still feasible to just use
> >> existing cpufreq mechanics to handle hotplug, sysfs, thermal, etc. (with
> >> possibly minor modifications to be notified of events) and only create a
> >> new method to ask the driver for frequency changes, since we will have
> >> replicated policy and freq_table information inside sched-freq.  Is that
> >> what you were also thinking of by saying "bypass (almost) all the usual
> >> cpufreq mechanics"? :)
> > 
> > Yes, it is.
> 
> I've been working on the locking in schedfreq (governor) and believe it
> is now functionally correct. It'll get sent out in another RFC soon.
> 
> Having wrestled with that locking a bit I can appreciate the value of
> potentially deprecating some or all of the cpufreq core. I'm also
> fearful though of making the current task (creating a scheduler-based
> CPU frequency scaling algorithm) more complex than it is already.
> 
> For that reason my preference would be to get the thing to a viable
> state as a governor first, assuming that's possible, and then take on
> restructuring to eliminate/deprecate unnecessary infrastructure. Does
> this seem reasonable?

Quite frankly, it depends on what's in the code.

I may not like that approach in principle, but then if the resulting
code turns out to be clean and nice and has no unnecessary overhead and
so on, it may very well work for me.

That said I have some simple patches to replace timer functions with
scheduler callbacks in cpufreq as a whole, which to be honest I'd prefer
to do first, because it would allow us to eliminate the heavy use of
timers that was complained about during the Kernel Summit last year.

I haven't posted them yest as they have been under some preliminary
testing and I'm still traveling today.  However, since the patches don't
seem to break stuff and I'll be back home tomorrow, I think I'll just
post them then.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-01-27  8:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <56984C30.8040402@linaro.org>
     [not found] ` <20160115104051.GP18603@e106622-lin>
     [not found]   ` <569D568D.5000500@linaro.org>
     [not found]     ` <CAEG3pNDoDHpgO7WdC8n7C8Ri+KNKZqOi9rKHMkTV0c5dkyevOw@mail.gmail.com>
     [not found]       ` <CAKfTPtCYUWBm_=GQ-NLLZ01xJqG+83vy4uapxRLjqMx_61HSew@mail.gmail.com>
     [not found]         ` <569E90CF.9050503@linaro.org>
     [not found]           ` <CAEG3pND5kHqJhrvwVjsnCnfT-nSrV8SrUaNk2p=ZfEx0MmmB=A@mail.gmail.com>
     [not found]             ` <569EB225.4040707@linaro.org>
     [not found]               ` <CAEG3pNBgpsdX8Lk7_3nHd31_bsq2sLB_f+=4_xiFFbiAWiKsxg@mail.gmail.com>
2016-01-20  1:24                 ` sched-freq locking Steve Muckle
2016-01-20 12:18                   ` Juri Lelli
2016-01-20 14:50                     ` Steve Muckle
2016-01-20 15:58                       ` Juri Lelli
2016-01-20 16:39                         ` Punit Agrawal
2016-01-20 16:39                           ` Punit Agrawal
2016-01-20 16:46                         ` Punit Agrawal
2016-01-20 16:46                           ` Punit Agrawal
2016-01-20 20:46                         ` Steve Muckle
2016-01-21  9:45                           ` Juri Lelli
2016-01-21 19:29                             ` Steve Muckle
2016-01-21  1:22                   ` Rafael J. Wysocki
2016-01-21  1:39                     ` Steve Muckle
2016-01-21  1:46                       ` Rafael J. Wysocki
2016-01-21 10:49                         ` Juri Lelli
2016-01-22  1:21                           ` Rafael J. Wysocki
2016-01-27  1:54                             ` Steve Muckle
2016-01-27  8:03                               ` Rafael J. Wysocki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.