sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
diff mbox series

Message ID 20210122154600.1722680-1-joel@joelfernandes.org
State New, archived
Headers show
Series
  • sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Related show

Commit Message

Joel Fernandes Jan. 22, 2021, 3:46 p.m. UTC
On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
that there are a lot of calls to update_blocked_averages(). This causes
the schedule loop to slow down to taking upto 500 micro seconds at
times (due to newidle load balance). I have also seen this manifest in
the periodic balancer.

Closer look shows that the problem is caused by the following
ingredients:
1. If the system has a lot of inactive CGroups (thanks Dietmar for
suggesting to inspect /proc/sched_debug for this), this can make
__update_blocked_fair() take a long time.

2. The device has a lot of CPUs in a cluster which causes schedutil in a
shared frequency domain configuration to be slower than usual. (the load
average updates also try to update the frequency in schedutil).

3. The CPU is running at a low frequency causing the scheduler/schedutil
code paths to take longer than when running at a high CPU frequency.

The fix is simply rate limit the calls to update_blocked_averages to 20
times per second. It appears that updating the blocked average less
often is sufficient. Currently I see about 200 calls per second
sometimes, which seems overkill.

schbench shows a clear improvement with the change:

Without patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (212 total samples)
        50.0th: 210 (106 samples)
        75.0th: 619 (53 samples)
        90.0th: 665 (32 samples)
        95.0th: 703 (11 samples)
        *99.0th: 12656 (8 samples)
        99.5th: 12784 (1 samples)
        99.9th: 13424 (1 samples)
        min=15, max=13424

With patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (214 total samples)
        50.0th: 188 (108 samples)
        75.0th: 238 (53 samples)
        90.0th: 623 (32 samples)
        95.0th: 657 (12 samples)
        *99.0th: 717 (7 samples)
        99.5th: 725 (2 samples)
        99.9th: 725 (0 samples)

Cc: Paul McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Suggested-by: Dietmar Eggeman <dietmar.eggemann@arm.com>
Co-developed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Qais Yousef Jan. 22, 2021, 6:39 p.m. UTC | #1
On 01/22/21 17:56, Vincent Guittot wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> >                 return false;
> >
> > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> 
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval
> This hard coded (HZ/20) looks really like an ugly hack

This was meant as an RFC patch to discuss the problem really.

Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
processing __update_blocked_fair() and the other half in sugov_update_shared().
So roughly 50us each. Note that each function is calling an iterator in
return. Correct me if my numbers are wrong Joel.

Running on a little core on low frequency these numbers don't look too odd.
So I'm not seeing how we can speed these functions up.

But since update_sg_lb_stats() will end up with multiple calls to
update_blocked_averages() in one go, this latency adds up quickly.

One noticeable factor in Joel's system is the presence of a lot of cgroups.
Which is essentially what makes __update_blocked_fair() expensive, and it seems
to always return something has decayed so we end up with a call to
sugov_update_shared() in every call.

I think we should limit the expensive call to update_blocked_averages() but
I honestly don't know what would be the right way to do it :-/

Or maybe there's another better alternative too..

Thanks

--
Qais Yousef

> 
> >                 return true;
> >
> >         update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >
Joel Fernandes Jan. 22, 2021, 7:10 p.m. UTC | #2
Hi Vincent,

Thanks for reply. Please see the replies below:

On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > that there are a lot of calls to update_blocked_averages(). This causes
> > the schedule loop to slow down to taking upto 500 micro seconds at
> > times (due to newidle load balance). I have also seen this manifest in
> > the periodic balancer.
> >
> > Closer look shows that the problem is caused by the following
> > ingredients:
> > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > suggesting to inspect /proc/sched_debug for this), this can make
> > __update_blocked_fair() take a long time.
> 
> Inactive cgroups are removed from the list so they should not impact
> the duration

I meant blocked CGroups. According to this code, a cfs_rq can be partially
decayed and not have any tasks running on it but its load needs to be
decayed, correct? That's what I meant by 'inactive'. I can reword it to
'blocked'.

                  * There can be a lot of idle CPU cgroups.  Don't let fully
                  * decayed cfs_rqs linger on the list.
                  */
                 if (cfs_rq_is_decayed(cfs_rq))
                         list_del_leaf_cfs_rq(cfs_rq);

> > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > shared frequency domain configuration to be slower than usual. (the load
> 
> What do you mean exactly by it causes schedutil to be slower than usual ?

sugov_next_freq_shared() is order number of CPUs in the a cluster. This
system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
frequency update needs to go through utilization of other CPUs in the
cluster. I believe this could be adding to the problem but is not really
needed to optimize if we can rate limit the calls to update_blocked_averages
to begin with.

> > average updates also try to update the frequency in schedutil).
> >
> > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > code paths to take longer than when running at a high CPU frequency.
> 
> Low frequency usually means low utilization so it should happen that much.

It happens a lot as can be seen with schbench. It is super easy to reproduce.

schedule() can result in new idle balance with the CFS pick call happening
often. Here is a function graph trace.  The tracer shows
update_blocked_averages taking a lot of time.

     sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
     sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
     sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
     sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
     sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
     sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
     sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
     sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
     sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
     sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
     sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
     sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
     sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
     sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
     sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
     sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
     sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458


> > The fix is simply rate limit the calls to update_blocked_averages to 20
> > times per second. It appears that updating the blocked average less
> > often is sufficient. Currently I see about 200 calls per second
> 
> Would be good to explain why updating less often is sufficient ?

I don't know this code that well, intuitively it seems to me updating blocked
averages at such a high rate seems pointless. But I defer to your expertise
on that. Why do you feel an update is needed at least HZ times per second?
What about system with HZ=1000 or 300, that seems to be an insane rate of
updating (not to mention all the complexity of going through the leaf cgroup
list and doing the frequency updates).

> > sometimes, which seems overkill.
> >
> > schbench shows a clear improvement with the change:
> 
> Have you got more details about your test setup ?
> which platform ?
> which kernel ?

I mentioned in the commit message it is a v5.4 kernel.

The platform is Snapdragon 7c. The platform is:
https://www.qualcomm.com/products/snapdragon-7c-compute-platform

> 
> >
> > Without patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> >         50.0th: 210 (106 samples)
> >         75.0th: 619 (53 samples)
> >         90.0th: 665 (32 samples)
> >         95.0th: 703 (11 samples)
> >         *99.0th: 12656 (8 samples)
> >         99.5th: 12784 (1 samples)
> >         99.9th: 13424 (1 samples)
> >         min=15, max=13424
> >
> > With patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> >         50.0th: 188 (108 samples)
> >         75.0th: 238 (53 samples)
> >         90.0th: 623 (32 samples)
> >         95.0th: 657 (12 samples)
> >         *99.0th: 717 (7 samples)
> >         99.5th: 725 (2 samples)
> >         99.9th: 725 (0 samples)
> >
> > Cc: Paul McKenney <paulmck@kernel.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Suggested-by: Dietmar Eggeman <dietmar.eggemann@arm.com>
> > Co-developed-by: Qais Yousef <qais.yousef@arm.com>
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> >                 return false;
> >
> > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> 
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval

What prevents newidle_balance from hitting this code? Are you suggesting rate
limit that as well?

> This hard coded (HZ/20) looks really like an ugly hack

Yes, it was not a well researched number. If you have a suggestion for better
fix, let me know.

thanks,

 - Joel


> >                 return true;
> >
> >         update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >
Joel Fernandes Jan. 22, 2021, 7:14 p.m. UTC | #3
On Fri, Jan 22, 2021 at 06:39:27PM +0000, Qais Yousef wrote:
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> > 
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
> 
> This was meant as an RFC patch to discuss the problem really.

Agreed, sorry.

> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in
> return. Correct me if my numbers are wrong Joel.

Correct, and I see update_nohz_stats() itself called around 8 times during a
load balance which multiplies the overhead.

Dietmar found out also that the reason for update_nohz_stacks() being called
8 times is because in our setup, there is only 1 MC sched domain with all 8
CPUs, versus say 2 MC domains with 4 CPUs each.

> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.

Agreed.

> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.

True!

> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.

Correct.

thanks,

 - Joel

[..]
Dietmar Eggemann Jan. 25, 2021, 10:44 a.m. UTC | #4
On 22/01/2021 20:10, Joel Fernandes wrote:
> Hi Vincent,
> 
> Thanks for reply. Please see the replies below:
> 
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>> <joel@joelfernandes.org> wrote:
>>>
>>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
>>> that there are a lot of calls to update_blocked_averages(). This causes
>>> the schedule loop to slow down to taking upto 500 micro seconds at
>>> times (due to newidle load balance). I have also seen this manifest in
>>> the periodic balancer.
>>>
>>> Closer look shows that the problem is caused by the following
>>> ingredients:
>>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
>>> suggesting to inspect /proc/sched_debug for this), this can make
>>> __update_blocked_fair() take a long time.
>>
>> Inactive cgroups are removed from the list so they should not impact
>> the duration
> 
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
> 
>                   * There can be a lot of idle CPU cgroups.  Don't let fully
>                   * decayed cfs_rqs linger on the list.
>                   */
>                  if (cfs_rq_is_decayed(cfs_rq))
>                          list_del_leaf_cfs_rq(cfs_rq);
> 
>>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
>>> shared frequency domain configuration to be slower than usual. (the load
>>
>> What do you mean exactly by it causes schedutil to be slower than usual ?
> 
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.
> 
>>> average updates also try to update the frequency in schedutil).
>>>
>>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
>>> code paths to take longer than when running at a high CPU frequency.
>>
>> Low frequency usually means low utilization so it should happen that much.
> 
> It happens a lot as can be seen with schbench. It is super easy to reproduce.
> 
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
> 
>      sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
>      sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
>      sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
>      sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
>      sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
>      sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
>      sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
>      sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
>      sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
>      sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
>      sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
>      sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
>      sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
>      sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
>      sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
>      sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
>      sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458
> 
> 
>>> The fix is simply rate limit the calls to update_blocked_averages to 20
>>> times per second. It appears that updating the blocked average less
>>> often is sufficient. Currently I see about 200 calls per second
>>
>> Would be good to explain why updating less often is sufficient ?
> 
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).

I assume this is what you're seeing on your device. This is on tip sched/core
but should be close to your kernel. I glanced over the diffs in fair.c between
chromeos-5.4 and tip sched/core and didn't spot any changes in this area.

I ran on a hikey620 w/o CONFIG_SCHED_MC to mimic the 8 CPUs (8 sched groups
(sg)) in the MC domain (the only sched domain). 

Since nohz.has_blocked=1 in your newidle_balance() calls,
load_balance() -> update_sd_lb_stats() sets LBF_NOHZ_STATS and calls
update_sg_lb_stats() for each of the 8 sg's.

Since LBF_NOHZ_STATS is set, update_sg_lb_stats() calls
update_nohz_stats(..., false) per cpu in sg.

And for a lot of these 8 sg's, i.e. 8 CPUs, update_blocked_averages()
is called since none of the 3 bail-out conditions:

 (1) !rq->has_blocked_load
 (2) !cpumask_test_cpu(cpu, nohz.idle_cpus_mask) 
 (3) force && !time_after(jiffies, rq->last_blocked_load_update_tick))

trigger.

We advance nohz.next_blocked by msecs_to_jiffies(LOAD_AVG_PERIOD) (32ms,
8 jiffies w/ HZ=250) but we advance 'rq->last_blocked_load_update_tick
only to jiffies' in update_blocked_load_status().


[005] 7370.188469: bprint: nohz_balance_enter_idle: CPU5 nohz.has_blocked=1
...
[005] 7370.210068: bprint: pick_next_task_fair: CPU5
[005] 7370.210079: bprint: update_sd_lb_stats: CPU5 nohz.has_blocked=1 -> set LBF_NOHZ_STATS
[005] 7370.210082: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=5 weight=1)
[005] 7370.210085: bprint: update_nohz_stats: CPU5 cpu=5 not in nohz.idle_cpus_mask -> bail
[005] 7370.210088: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=6 weight=1)
[005] 7370.210091: bprint: update_nohz_stats: CPU5 cpu=6 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210112: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=7 weight=1)
[005] 7370.210116: bprint: update_nohz_stats: CPU5 cpu=7 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210134: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=0 weight=1)
[005] 7370.210137: bprint: update_nohz_stats: CPU5 cpu=0 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210156: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=1 weight=1)
[005] 7370.210159: bprint: update_nohz_stats: CPU5 cpu=1 not in nohz.idle_cpus_mask -> bail
[005] 7370.210162: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=2 weight=1)
[005] 7370.210165: bprint: update_nohz_stats: CPU5 cpu=2 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210183: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=3 weight=1)
[005] 7370.210186: bprint: update_nohz_stats: CPU5 cpu=3 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210205: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=4 weight=1)
[005] 7370.210207: bprint: update_nohz_stats: CPU5 cpu=4 not in nohz.idle_cpus_mask -> bail
...
[005] 7370.444704: bprint: _nohz_idle_balance: CPU5 nohz.has_blocked=0


If I understood you correctly, you want to avoid these frequent calls
to update_blocked_averages() here to further avoid invoking sched_util
via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
is set) very often in your setup.
Since you have up to 6 CPUs in a frequency domain, this could be more
costly than usual.
Vincent Guittot Jan. 25, 2021, 1:23 p.m. UTC | #5
On Fri, 22 Jan 2021 at 19:39, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
>
> This was meant as an RFC patch to discuss the problem really.
>
> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in

Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
which would mean that the update was useful at the end ?

> return. Correct me if my numbers are wrong Joel.
>
> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.
>
> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.
>
> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.
>
> I think we should limit the expensive call to update_blocked_averages() but

At the opposite, some will complain that block values  stay stall to
high value and prevent any useful adjustment.

Also update_blocked average is already rate limited with idle and busy
load_balance

Seems like the problem raised by Joel is the number of newly idle load balance

> I honestly don't know what would be the right way to do it :-/
>
> Or maybe there's another better alternative too..
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > >                 return true;
> > >
> > >         update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
Vincent Guittot Jan. 25, 2021, 2:42 p.m. UTC | #6
On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Vincent,
>
> Thanks for reply. Please see the replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > that there are a lot of calls to update_blocked_averages(). This causes
> > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > times (due to newidle load balance). I have also seen this manifest in
> > > the periodic balancer.
> > >
> > > Closer look shows that the problem is caused by the following
> > > ingredients:
> > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > suggesting to inspect /proc/sched_debug for this), this can make
> > > __update_blocked_fair() take a long time.
> >
> > Inactive cgroups are removed from the list so they should not impact
> > the duration
>
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.

How many blocked cgroups have you got ?

>
>                   * There can be a lot of idle CPU cgroups.  Don't let fully
>                   * decayed cfs_rqs linger on the list.
>                   */
>                  if (cfs_rq_is_decayed(cfs_rq))
>                          list_del_leaf_cfs_rq(cfs_rq);
>
> > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > shared frequency domain configuration to be slower than usual. (the load
> >
> > What do you mean exactly by it causes schedutil to be slower than usual ?
>
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.

Qais mentioned half of the time being used by
sugov_next_freq_shared(). Are there any frequency changes resulting in
this call ?

>
> > > average updates also try to update the frequency in schedutil).
> > >
> > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > code paths to take longer than when running at a high CPU frequency.
> >
> > Low frequency usually means low utilization so it should happen that much.
>
> It happens a lot as can be seen with schbench. It is super easy to reproduce.

Happening a lot in itself is not a problem if there is nothing else to
do so it's not a argument in itself

So why is it a problem for you ? You are mentioning newly idle load
balance so I assume that your root problem is the scheduling delay
generated by the newly idle load balance which then calls
update_blocked_averages

rate limiting the call to update_blocked_averages() will only reduce
the number of time it happens but it will not prevent it to happen.

>
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
>
>      sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
>      sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
>      sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
>      sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
>      sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
>      sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
>      sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
>      sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
>      sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
>      sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
>      sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
>      sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
>      sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
>      sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
>      sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
>      sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
>      sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458
>
>
> > > The fix is simply rate limit the calls to update_blocked_averages to 20
> > > times per second. It appears that updating the blocked average less
> > > often is sufficient. Currently I see about 200 calls per second
> >
> > Would be good to explain why updating less often is sufficient ?
>
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).

Your trace above is a full update of blocked load which is a normal
sequence that happen regularly but you don't show that this happens
too often

The call to update_blocked _average is rate limited by load_balance
interval for idle and busy load balance so those 2 cases should not be
a problem. There is no rate limit in case of newly_idle because it is
assumed that the latter is normally called only if it has been
estimated that there is enough time to run it. That's why
update_blocked_load_average is called all the time during newly idle
load balance

see in update_sd_lb_stats()
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
env->flags |= LBF_NOHZ_STATS;

We could consider removing this condition but I need time to  remind
all the side effect of this

IIUC, your real problem is that newidle_balance is running whereas a
task is about to wake up on the cpu and we don't abort quickly during
this load_balance

so we could also try to abort earlier in case of newly idle load balance

>
> > > sometimes, which seems overkill.
> > >
> > > schbench shows a clear improvement with the change:
> >
> > Have you got more details about your test setup ?
> > which platform ?
> > which kernel ?
>
> I mentioned in the commit message it is a v5.4 kernel.

I was not sure if the tests results done with this kernel because we
usually ask for results against mainline to make sure you are not
facing a problem that has solved since v5.4 has been released

>
> The platform is Snapdragon 7c. The platform is:
> https://www.qualcomm.com/products/snapdragon-7c-compute-platform
>
> >
> > >
> > > Without patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > >         50.0th: 210 (106 samples)
> > >         75.0th: 619 (53 samples)
> > >         90.0th: 665 (32 samples)
> > >         95.0th: 703 (11 samples)
> > >         *99.0th: 12656 (8 samples)
> > >         99.5th: 12784 (1 samples)
> > >         99.9th: 13424 (1 samples)
> > >         min=15, max=13424
> > >
> > > With patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > >         50.0th: 188 (108 samples)
> > >         75.0th: 238 (53 samples)
> > >         90.0th: 623 (32 samples)
> > >         95.0th: 657 (12 samples)
> > >         *99.0th: 717 (7 samples)
> > >         99.5th: 725 (2 samples)
> > >         99.9th: 725 (0 samples)
> > >
> > > Cc: Paul McKenney <paulmck@kernel.org>
> > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > Suggested-by: Dietmar Eggeman <dietmar.eggemann@arm.com>
> > > Co-developed-by: Qais Yousef <qais.yousef@arm.com>
> > > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
>
> What prevents newidle_balance from hitting this code? Are you suggesting rate
> limit that as well?
>
> > This hard coded (HZ/20) looks really like an ugly hack
>
> Yes, it was not a well researched number. If you have a suggestion for better
> fix, let me know.
>
> thanks,
>
>  - Joel
>
>
> > >                 return true;
> > >
> > >         update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
Vincent Guittot Jan. 25, 2021, 5:30 p.m. UTC | #7
On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 22/01/2021 20:10, Joel Fernandes wrote:
> > Hi Vincent,
> >
> > Thanks for reply. Please see the replies below:
> >
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >> <joel@joelfernandes.org> wrote:
> >>>
> >>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> >>> that there are a lot of calls to update_blocked_averages(). This causes
> >>> the schedule loop to slow down to taking upto 500 micro seconds at
> >>> times (due to newidle load balance). I have also seen this manifest in
> >>> the periodic balancer.
> >>>
> >>> Closer look shows that the problem is caused by the following
> >>> ingredients:
> >>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> >>> suggesting to inspect /proc/sched_debug for this), this can make
> >>> __update_blocked_fair() take a long time.
> >>
> >> Inactive cgroups are removed from the list so they should not impact
> >> the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> >
> >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> >                   * decayed cfs_rqs linger on the list.
> >                   */
> >                  if (cfs_rq_is_decayed(cfs_rq))
> >                          list_del_leaf_cfs_rq(cfs_rq);
> >
> >>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> >>> shared frequency domain configuration to be slower than usual. (the load
> >>
> >> What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> >
> >>> average updates also try to update the frequency in schedutil).
> >>>
> >>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
> >>> code paths to take longer than when running at a high CPU frequency.
> >>
> >> Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> >
> > schedule() can result in new idle balance with the CFS pick call happening
> > often. Here is a function graph trace.  The tracer shows
> > update_blocked_averages taking a lot of time.
> >
> >      sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
> >      sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
> >      sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
> >      sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
> >      sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
> >      sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
> >      sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
> >      sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
> >      sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
> >      sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
> >      sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
> >      sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
> >      sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
> >      sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
> >      sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
> >      sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
> >      sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458
> >
> >
> >>> The fix is simply rate limit the calls to update_blocked_averages to 20
> >>> times per second. It appears that updating the blocked average less
> >>> often is sufficient. Currently I see about 200 calls per second
> >>
> >> Would be good to explain why updating less often is sufficient ?
> >
> > I don't know this code that well, intuitively it seems to me updating blocked
> > averages at such a high rate seems pointless. But I defer to your expertise
> > on that. Why do you feel an update is needed at least HZ times per second?
> > What about system with HZ=1000 or 300, that seems to be an insane rate of
> > updating (not to mention all the complexity of going through the leaf cgroup
> > list and doing the frequency updates).
>
> I assume this is what you're seeing on your device. This is on tip sched/core
> but should be close to your kernel. I glanced over the diffs in fair.c between
> chromeos-5.4 and tip sched/core and didn't spot any changes in this area.
>
> I ran on a hikey620 w/o CONFIG_SCHED_MC to mimic the 8 CPUs (8 sched groups
> (sg)) in the MC domain (the only sched domain).

you can also modify the DT to emulate 1 MC level with 8 cores

>
> Since nohz.has_blocked=1 in your newidle_balance() calls,
> load_balance() -> update_sd_lb_stats() sets LBF_NOHZ_STATS and calls
> update_sg_lb_stats() for each of the 8 sg's.
>
> Since LBF_NOHZ_STATS is set, update_sg_lb_stats() calls
> update_nohz_stats(..., false) per cpu in sg.
>
> And for a lot of these 8 sg's, i.e. 8 CPUs, update_blocked_averages()
> is called since none of the 3 bail-out conditions:
>
>  (1) !rq->has_blocked_load
>  (2) !cpumask_test_cpu(cpu, nohz.idle_cpus_mask)
>  (3) force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>
> trigger.
>
> We advance nohz.next_blocked by msecs_to_jiffies(LOAD_AVG_PERIOD) (32ms,
> 8 jiffies w/ HZ=250) but we advance 'rq->last_blocked_load_update_tick
> only to jiffies' in update_blocked_load_status().
>
>
> [005] 7370.188469: bprint: nohz_balance_enter_idle: CPU5 nohz.has_blocked=1
> ...
> [005] 7370.210068: bprint: pick_next_task_fair: CPU5
> [005] 7370.210079: bprint: update_sd_lb_stats: CPU5 nohz.has_blocked=1 -> set LBF_NOHZ_STATS
> [005] 7370.210082: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=5 weight=1)
> [005] 7370.210085: bprint: update_nohz_stats: CPU5 cpu=5 not in nohz.idle_cpus_mask -> bail
> [005] 7370.210088: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=6 weight=1)
> [005] 7370.210091: bprint: update_nohz_stats: CPU5 cpu=6 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210112: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=7 weight=1)
> [005] 7370.210116: bprint: update_nohz_stats: CPU5 cpu=7 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210134: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=0 weight=1)
> [005] 7370.210137: bprint: update_nohz_stats: CPU5 cpu=0 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210156: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=1 weight=1)
> [005] 7370.210159: bprint: update_nohz_stats: CPU5 cpu=1 not in nohz.idle_cpus_mask -> bail
> [005] 7370.210162: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=2 weight=1)
> [005] 7370.210165: bprint: update_nohz_stats: CPU5 cpu=2 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210183: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=3 weight=1)
> [005] 7370.210186: bprint: update_nohz_stats: CPU5 cpu=3 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210205: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=4 weight=1)
> [005] 7370.210207: bprint: update_nohz_stats: CPU5 cpu=4 not in nohz.idle_cpus_mask -> bail
> ...
> [005] 7370.444704: bprint: _nohz_idle_balance: CPU5 nohz.has_blocked=0
>
>
> If I understood you correctly, you want to avoid these frequent calls
> to update_blocked_averages() here to further avoid invoking sched_util
> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
> is set) very often in your setup.

So It's not clear if the problem that joel wants to raise, is about:
- the running time of  update_blocked_averages
- the running time of the cpufreq_update_util which is called because
utilization has decayed during the update of blocked load
- the wake up latency because of newly_idle lb

> Since you have up to 6 CPUs in a frequency domain, this could be more
> costly than usual.
Dietmar Eggemann Jan. 25, 2021, 5:53 p.m. UTC | #8
On 25/01/2021 18:30, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 22/01/2021 20:10, Joel Fernandes wrote:
>>> Hi Vincent,
>>>
>>> Thanks for reply. Please see the replies below:
>>>
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>> <joel@joelfernandes.org> wrote:

[...]

>> If I understood you correctly, you want to avoid these frequent calls
>> to update_blocked_averages() here to further avoid invoking sched_util
>> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
>> is set) very often in your setup.
> 
> So It's not clear if the problem that joel wants to raise, is about:
> - the running time of  update_blocked_averages
> - the running time of the cpufreq_update_util which is called because
> utilization has decayed during the update of blocked load
> - the wake up latency because of newly_idle lb

Pretty much so.

IIRC his interest is driven by the fact that he saw much less activity
in newly_idle lb and therefore cpufreq_update_util on a system with the
same kernel and userspace but with less CPUs (i.e. also smaller
frequency domains) and less cgroups (with blocked load) and started
wondering why.

I assume that since he understands this environment now much better, he
should be able to come up with better test numbers to show if there is a
performance issue on his 2+6 DynamIQ system and if yes, where exactly in
this code path.
Qais Yousef Jan. 26, 2021, 4:36 p.m. UTC | #9
On 01/25/21 14:23, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 19:39, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 01/22/21 17:56, Vincent Guittot wrote:
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 04a3ce20da67..fe2dc0024db5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > >                 return false;
> > > >
> > > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> > >
> > > This condition is there to make sure to update blocked load at most
> > > once a tick in order to filter newly idle case otherwise the rate
> > > limit is already done by load balance interval
> > > This hard coded (HZ/20) looks really like an ugly hack
> >
> > This was meant as an RFC patch to discuss the problem really.
> >
> > Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> > processing __update_blocked_fair() and the other half in sugov_update_shared().
> > So roughly 50us each. Note that each function is calling an iterator in
> 
> Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
> which would mean that the update was useful at the end ?

I couldn't reproduce his problem on Juno. But I think it is not actually doing
any frequency update. sugov_update_shared() is rate limited by
sugov_should_update_freq(). Joel has a 1ms rate limit for schedutil in sysfs.
The function traces showed that it is continuously doing the full scan which
indicates that sugov_update_next_freq() is continuously bailing out at

	if else (sg_policy->next_freq == next_freq)
		return false;

> 
> > return. Correct me if my numbers are wrong Joel.
> >
> > Running on a little core on low frequency these numbers don't look too odd.
> > So I'm not seeing how we can speed these functions up.
> >
> > But since update_sg_lb_stats() will end up with multiple calls to
> > update_blocked_averages() in one go, this latency adds up quickly.
> >
> > One noticeable factor in Joel's system is the presence of a lot of cgroups.
> > Which is essentially what makes __update_blocked_fair() expensive, and it seems
> > to always return something has decayed so we end up with a call to
> > sugov_update_shared() in every call.
> >
> > I think we should limit the expensive call to update_blocked_averages() but
> 
> At the opposite, some will complain that block values  stay stall to
> high value and prevent any useful adjustment.
> 
> Also update_blocked average is already rate limited with idle and busy
> load_balance
> 
> Seems like the problem raised by Joel is the number of newly idle load balance

It could be. When Joel comments out the update_blocked_averages() or rate limit
it the big schedule delay disappears. Which is just an indication of where we
could do better. Either by making update_blocked_averages() faster, or control
how often we do it in a row. I thought the latter made more sense - though
implementation wise I'm not sure how we should do that.

We might actually help update_blocked_averages() being a bit faster by not
doing cpufreq_update_util() in every call and do it once in the last call to
update_blocked_averages(). Since it seemed the for_each_leaf_cfs_rq_safe() loop
in __update_blocked_fair() is expensive too, not sure if reducing the calls to
cpufreq_update_util() will be enough.

Thanks

--
Qais Yousef
Joel Fernandes Jan. 27, 2021, 6:43 p.m. UTC | #10
Hi Vincent,

On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > <joel@joelfernandes.org> wrote:
> > > >
> > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > times (due to newidle load balance). I have also seen this manifest in
> > > > the periodic balancer.
> > > >
> > > > Closer look shows that the problem is caused by the following
> > > > ingredients:
> > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > __update_blocked_fair() take a long time.
> > >
> > > Inactive cgroups are removed from the list so they should not impact
> > > the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> 
> How many blocked cgroups have you got ?

I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
this loop runs per new idle balance. When the problem happens I see this loop
run 35-40 times (for one single instance of newidle balance). So in total
there are at least these many cfs_rq load updates.

I also see that new idle balance can be called 200-500 times per second.

> >
> >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> >                   * decayed cfs_rqs linger on the list.
> >                   */
> >                  if (cfs_rq_is_decayed(cfs_rq))
> >                          list_del_leaf_cfs_rq(cfs_rq);
> >
> > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > shared frequency domain configuration to be slower than usual. (the load
> > >
> > > What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> 
> Qais mentioned half of the time being used by
> sugov_next_freq_shared(). Are there any frequency changes resulting in
> this call ?

I do not see a frequency update happening at the time of the problem. However
note that sugov_iowait_boost() does run even if frequency is not being
updated. IIRC, this function is also not that light weight and I am not sure
if it is a good idea to call this that often.

> > > > average updates also try to update the frequency in schedutil).
> > > >
> > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > code paths to take longer than when running at a high CPU frequency.
> > >
> > > Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> 
> Happening a lot in itself is not a problem if there is nothing else to
> do so it's not a argument in itself

It is a problem - it shows up in the preempt off critical section latency
tracer. Are you saying its Ok for preemption to be disabled on system for 500
micro seconds?  That hurts real-time applications (audio etc).

> So why is it a problem for you ? You are mentioning newly idle load
> balance so I assume that your root problem is the scheduling delay
> generated by the newly idle load balance which then calls
> update_blocked_averages

Yes, the new idle balance is when I see it happen quite often. I do see it
happen with other load balance as well, but it not that often as those LB
don't run as often as new idle balance.

> 
> rate limiting the call to update_blocked_averages() will only reduce
> the number of time it happens but it will not prevent it to happen.

Sure, but soft real-time issue can tolerate if the issue does not happen very
often. In this case though, it is frequent.

> IIUC, your real problem is that newidle_balance is running whereas a
> task is about to wake up on the cpu and we don't abort quickly during
> this load_balance

Yes.

> so we could also try to abort earlier in case of newly idle load balance

I think interrupts are disabled when the load balance runs, so there's no way
for say an audio interrupt to even run in order to wake up a task. How would
you know to abort the new idle load balance?

Could you elaborate more also on the drawback of the rate limiting patch we
posted? Do you see a side effect?

> > > > sometimes, which seems overkill.
> > > >
> > > > schbench shows a clear improvement with the change:
> > >
> > > Have you got more details about your test setup ?
> > > which platform ?
> > > which kernel ?
> >
> > I mentioned in the commit message it is a v5.4 kernel.
> 
> I was not sure if the tests results done with this kernel because we
> usually ask for results against mainline to make sure you are not
> facing a problem that has solved since v5.4 has been released

Ok, yes I have a userspace up and running only on 5.4 kernel unfortunately. I
was hoping that is recent enough for this debug.

thanks,

 - Joel
Vincent Guittot Jan. 28, 2021, 1:57 p.m. UTC | #11
Hi Joel,

On Wed, 27 Jan 2021 at 19:43, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Vincent,
>
> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > <joel@joelfernandes.org> wrote:
> > > > >
> > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > the periodic balancer.
> > > > >
> > > > > Closer look shows that the problem is caused by the following
> > > > > ingredients:
> > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > __update_blocked_fair() take a long time.
> > > >
> > > > Inactive cgroups are removed from the list so they should not impact
> > > > the duration
> > >
> > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > decayed and not have any tasks running on it but its load needs to be
> > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > 'blocked'.
> >
> > How many blocked cgroups have you got ?
>
> I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> this loop runs per new idle balance. When the problem happens I see this loop
> run 35-40 times (for one single instance of newidle balance). So in total
> there are at least these many cfs_rq load updates.

Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

>
> I also see that new idle balance can be called 200-500 times per second.

This is not surprising because newidle_balance() is called every time
the CPU is about to become idle

>
> > >
> > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > >                   * decayed cfs_rqs linger on the list.
> > >                   */
> > >                  if (cfs_rq_is_decayed(cfs_rq))
> > >                          list_del_leaf_cfs_rq(cfs_rq);
> > >
> > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > shared frequency domain configuration to be slower than usual. (the load
> > > >
> > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > >
> > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > frequency update needs to go through utilization of other CPUs in the
> > > cluster. I believe this could be adding to the problem but is not really
> > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > to begin with.
> >
> > Qais mentioned half of the time being used by
> > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > this call ?
>
> I do not see a frequency update happening at the time of the problem. However
> note that sugov_iowait_boost() does run even if frequency is not being
> updated. IIRC, this function is also not that light weight and I am not sure
> if it is a good idea to call this that often.

Scheduler can't make any assumption about how often schedutil/cpufreq
wants to be called. Some are fast and straightforward and can be
called very often to adjust frequency; Others can't handle much
updates. The rate limit mechanism in schedutil and io-boost should be
there for such purpose.

>
> > > > > average updates also try to update the frequency in schedutil).
> > > > >
> > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > code paths to take longer than when running at a high CPU frequency.
> > > >
> > > > Low frequency usually means low utilization so it should happen that much.
> > >
> > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> >
> > Happening a lot in itself is not a problem if there is nothing else to
> > do so it's not a argument in itself
>
> It is a problem - it shows up in the preempt off critical section latency

But this point is not related to the point above which is about how
often it happens.

> tracer. Are you saying its Ok for preemption to be disabled on system for 500
> micro seconds?  That hurts real-time applications (audio etc).

So. Is your problem related to real-time applications (audio etc) ?

>
> > So why is it a problem for you ? You are mentioning newly idle load
> > balance so I assume that your root problem is the scheduling delay
> > generated by the newly idle load balance which then calls
> > update_blocked_averages
>
> Yes, the new idle balance is when I see it happen quite often. I do see it
> happen with other load balance as well, but it not that often as those LB
> don't run as often as new idle balance.

The update of average blocked load has been added in newidle_balance
to take advantage of the cpu becoming idle but it seems to create a
long preempt off sequence. I 'm going to prepare a patch to move it
out the schedule path.

>
> >
> > rate limiting the call to update_blocked_averages() will only reduce
> > the number of time it happens but it will not prevent it to happen.
>
> Sure, but soft real-time issue can tolerate if the issue does not happen very
> often. In this case though, it is frequent.

Could you provide details of the problem that you are facing ? It's
not clear for me what happens in your case at the end. Have you got an
audio glitch as an example?

"Not often" doesn't really give any clue.

Also update_blocked_averages was supposed called in newlyidle_balance
when the coming idle duration is expected to be long enough

>
> > IIUC, your real problem is that newidle_balance is running whereas a
> > task is about to wake up on the cpu and we don't abort quickly during
> > this load_balance
>
> Yes.
>
> > so we could also try to abort earlier in case of newly idle load balance
>
> I think interrupts are disabled when the load balance runs, so there's no way
> for say an audio interrupt to even run in order to wake up a task. How would
> you know to abort the new idle load balance?
>
> Could you elaborate more also on the drawback of the rate limiting patch we
> posted? Do you see a side effect?

Your patch just tries to hide your problem and not to solve the root cause.

>
> > > > > sometimes, which seems overkill.
> > > > >
> > > > > schbench shows a clear improvement with the change:
> > > >
> > > > Have you got more details about your test setup ?
> > > > which platform ?
> > > > which kernel ?
> > >
> > > I mentioned in the commit message it is a v5.4 kernel.
> >
> > I was not sure if the tests results done with this kernel because we
> > usually ask for results against mainline to make sure you are not
> > facing a problem that has solved since v5.4 has been released
>
> Ok, yes I have a userspace up and running only on 5.4 kernel unfortunately. I
> was hoping that is recent enough for this debug.

more than 14 months old is not really recent... It's always good to
have a reproducer against mainline

Regards,
Vincent
>
> thanks,
>
>  - Joel
>
Joel Fernandes Jan. 28, 2021, 3:09 p.m. UTC | #12
Hi Vincent,

On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > the periodic balancer.
> > > > > >
> > > > > > Closer look shows that the problem is caused by the following
> > > > > > ingredients:
> > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > __update_blocked_fair() take a long time.
> > > > >
> > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > the duration
> > > >
> > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > decayed and not have any tasks running on it but its load needs to be
> > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > 'blocked'.
> > >
> > > How many blocked cgroups have you got ?
> >
> > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > this loop runs per new idle balance. When the problem happens I see this loop
> > run 35-40 times (for one single instance of newidle balance). So in total
> > there are at least these many cfs_rq load updates.
>
> Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

All CPUs.

> > I also see that new idle balance can be called 200-500 times per second.
>
> This is not surprising because newidle_balance() is called every time
> the CPU is about to become idle

Sure.

> > > >
> > > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > > >                   * decayed cfs_rqs linger on the list.
> > > >                   */
> > > >                  if (cfs_rq_is_decayed(cfs_rq))
> > > >                          list_del_leaf_cfs_rq(cfs_rq);
> > > >
> > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > >
> > > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > > >
> > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > frequency update needs to go through utilization of other CPUs in the
> > > > cluster. I believe this could be adding to the problem but is not really
> > > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > > to begin with.
> > >
> > > Qais mentioned half of the time being used by
> > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > this call ?
> >
> > I do not see a frequency update happening at the time of the problem. However
> > note that sugov_iowait_boost() does run even if frequency is not being
> > updated. IIRC, this function is also not that light weight and I am not sure
> > if it is a good idea to call this that often.
>
> Scheduler can't make any assumption about how often schedutil/cpufreq
> wants to be called. Some are fast and straightforward and can be
> called very often to adjust frequency; Others can't handle much
> updates. The rate limit mechanism in schedutil and io-boost should be
> there for such purpose.

Sure, I know that's the intention.

> > > > > > average updates also try to update the frequency in schedutil).
> > > > > >
> > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > >
> > > > > Low frequency usually means low utilization so it should happen that much.
> > > >
> > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > >
> > > Happening a lot in itself is not a problem if there is nothing else to
> > > do so it's not a argument in itself
> >
> > It is a problem - it shows up in the preempt off critical section latency
>
> But this point is not related to the point above which is about how
> often it happens.

It is related. A bad thing happening quite often is worse than a bad
thing happening infrequently. I agree it is not a root cause fix, but
it makes things "better".

> > tracer. Are you saying its Ok for preemption to be disabled on system for 500
> > micro seconds?  That hurts real-time applications (audio etc).
>
> So. Is your problem related to real-time applications (audio etc) ?

Yes it is. I don't have a reproducer and it could be the audio
buffering will hide the problem. But that doesn't mean that there is
no problem or that we cannot improve things. There will also likely be
power consumption improvement.

> > > So why is it a problem for you ? You are mentioning newly idle load
> > > balance so I assume that your root problem is the scheduling delay
> > > generated by the newly idle load balance which then calls
> > > update_blocked_averages
> >
> > Yes, the new idle balance is when I see it happen quite often. I do see it
> > happen with other load balance as well, but it not that often as those LB
> > don't run as often as new idle balance.
>
> The update of average blocked load has been added in newidle_balance
> to take advantage of the cpu becoming idle but it seems to create a
> long preempt off sequence. I 'm going to prepare a patch to move it
> out the schedule path.

Ok thanks, that would really help!

> > > rate limiting the call to update_blocked_averages() will only reduce
> > > the number of time it happens but it will not prevent it to happen.
> >
> > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > often. In this case though, it is frequent.
>
> Could you provide details of the problem that you are facing ? It's
> not clear for me what happens in your case at the end. Have you got an
> audio glitch as an example?
>
> "Not often" doesn't really give any clue.

I believe from my side I have provided details. I shared output from a
function graph tracer and schbench micro benchmark clearly showing the
issue and improvements. Sorry, I don't have a real-world reproducer
for this.

> Also update_blocked_averages was supposed called in newlyidle_balance
> when the coming idle duration is expected to be long enough

No, we do not want the schedule loop to take half a millisecond.

> > > IIUC, your real problem is that newidle_balance is running whereas a
> > > task is about to wake up on the cpu and we don't abort quickly during
> > > this load_balance
> >
> > Yes.
> >
> > > so we could also try to abort earlier in case of newly idle load balance
> >
> > I think interrupts are disabled when the load balance runs, so there's no way
> > for say an audio interrupt to even run in order to wake up a task. How would
> > you know to abort the new idle load balance?
> >
> > Could you elaborate more also on the drawback of the rate limiting patch we
> > posted? Do you see a side effect?
>
> Your patch just tries to hide your problem and not to solve the root cause.

Agreed, the solution presented is a band aid and not a proper fix. It
was just intended to illustrate the problem and start a discussion. I
should have marked it RFC for sure.

thanks!

- Joel
Qais Yousef Jan. 28, 2021, 4:57 p.m. UTC | #13
On 01/28/21 10:09, Joel Fernandes wrote:
> > > > Qais mentioned half of the time being used by
> > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > this call ?
> > >
> > > I do not see a frequency update happening at the time of the problem. However
> > > note that sugov_iowait_boost() does run even if frequency is not being
> > > updated. IIRC, this function is also not that light weight and I am not sure
> > > if it is a good idea to call this that often.
> >
> > Scheduler can't make any assumption about how often schedutil/cpufreq
> > wants to be called. Some are fast and straightforward and can be
> > called very often to adjust frequency; Others can't handle much
> > updates. The rate limit mechanism in schedutil and io-boost should be
> > there for such purpose.
> 
> Sure, I know that's the intention.

How about the below patch to help with reducing the impact of
sugov_update_shared()?

I basically made sure it'll bail out immediately if it gets 2 calls within the
defined rate_limit_us value.

And tweaked the way we call cpufreq_update_util() from
update_blocked_averages() too so that we first update blocked load on all cpus,
then we ask for the frequency update. Combined with above this should result to
a single call to sugov_update_shared() for each policy. Each call should see
the final state of what is the load really is.

Only compile tested!

Thanks

--
Qais Yousef

----->8-----


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6931f0cdeb80..bd83e9343749 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -109,7 +109,6 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
 		return false;
 
 	sg_policy->next_freq = next_freq;
-	sg_policy->last_freq_update_time = time;
 
 	return true;
 }
@@ -554,20 +553,23 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	raw_spin_lock(&sg_policy->update_lock);
 
+	ignore_dl_rate_limit(sg_cpu, sg_policy);
+
+	if (!sugov_should_update_freq(sg_policy, time))
+		goto out;
+
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
-	ignore_dl_rate_limit(sg_cpu, sg_policy);
-
-	if (sugov_should_update_freq(sg_policy, time)) {
-		next_f = sugov_next_freq_shared(sg_cpu, time);
+	next_f = sugov_next_freq_shared(sg_cpu, time);
 
-		if (sg_policy->policy->fast_switch_enabled)
-			sugov_fast_switch(sg_policy, time, next_f);
-		else
-			sugov_deferred_update(sg_policy, time, next_f);
-	}
+	if (sg_policy->policy->fast_switch_enabled)
+		sugov_fast_switch(sg_policy, time, next_f);
+	else
+		sugov_deferred_update(sg_policy, time, next_f);
 
+	sg_policy->last_freq_update_time = time;
+out:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..47bd8be03a6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8033,7 +8033,7 @@ static unsigned long task_h_load(struct task_struct *p)
 }
 #endif
 
-static void update_blocked_averages(int cpu)
+static void update_blocked_averages(int cpu, bool update_freq)
 {
 	bool decayed = false, done = true;
 	struct rq *rq = cpu_rq(cpu);
@@ -8046,7 +8046,7 @@ static void update_blocked_averages(int cpu)
 	decayed |= __update_blocked_fair(rq, &done);
 
 	update_blocked_load_status(rq, !done);
-	if (decayed)
+	if (decayed && update_freq)
 		cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
@@ -8384,7 +8384,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
 	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
 		return true;
 
-	update_blocked_averages(cpu);
+	update_blocked_averages(cpu, false);
 
 	return rq->has_blocked_load;
 #else
@@ -8454,6 +8454,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
+	/*
+	 * We did a bunch a blocked average updates, let cpufreq know about
+	 * them in one go. For system with a lot of cpus on a frequency domain
+	 * this will make sure we take all the changes that affects the policy
+	 * in one go.
+	 */
+	for_each_cpu_and(i, sched_group_span(group), env->cpus)
+		cpufreq_update_util(cpu_rq(i), 0);
+
 	/* Check if dst CPU is idle and preferred to this group */
 	if (env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE &&
@@ -10464,7 +10473,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 
 	/* Newly idle CPU doesn't need an update */
 	if (idle != CPU_NEWLY_IDLE) {
-		update_blocked_averages(this_cpu);
+		update_blocked_averages(this_cpu, false);
 		has_blocked_load |= this_rq->has_blocked_load;
 	}
 
@@ -10478,6 +10487,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 	ret = true;
 
 abort:
+	/*
+	 * We did a bunch a blocked average updates, let cpufreq know about
+	 * them in one go. For system with a lot of cpus on a frequency domain
+	 * this will make sure we take all the changes that affects the policy
+	 * in one go.
+	 */
+	for_each_cpu(balance_cpu, nohz.idle_cpus_mask)
+		cpufreq_update_util(cpu_rq(balance_cpu), 0);
+
 	/* There is still blocked load, enable periodic update */
 	if (has_blocked_load)
 		WRITE_ONCE(nohz.has_blocked, 1);
@@ -10603,7 +10621,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_unlock(&this_rq->lock);
 
-	update_blocked_averages(this_cpu);
+	update_blocked_averages(this_cpu, true);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
@@ -10691,7 +10709,7 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
 		return;
 
 	/* normal load balance */
-	update_blocked_averages(this_rq->cpu);
+	update_blocked_averages(this_rq->cpu, true);
 	rebalance_domains(this_rq, idle);
 }
Vincent Guittot Jan. 29, 2021, 5:27 p.m. UTC | #14
Le vendredi 29 janv. 2021 à 11:33:00 (+0100), Vincent Guittot a écrit :
> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi Vincent,
> >
> > On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > > > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > > > <joel@joelfernandes.org> wrote:
> > > > > > > >
> > > > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > > > the periodic balancer.
> > > > > > > >
> > > > > > > > Closer look shows that the problem is caused by the following
> > > > > > > > ingredients:
> > > > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > > > __update_blocked_fair() take a long time.
> > > > > > >
> > > > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > > > the duration
> > > > > >
> > > > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > > > decayed and not have any tasks running on it but its load needs to be
> > > > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > > > 'blocked'.
> > > > >
> > > > > How many blocked cgroups have you got ?
> > > >
> > > > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > > > this loop runs per new idle balance. When the problem happens I see this loop
> > > > run 35-40 times (for one single instance of newidle balance). So in total
> > > > there are at least these many cfs_rq load updates.
> > >
> > > Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?
> >
> > All CPUs.
> >
> > > > I also see that new idle balance can be called 200-500 times per second.
> > >
> > > This is not surprising because newidle_balance() is called every time
> > > the CPU is about to become idle
> >
> > Sure.
> >
> > > > > >
> > > > > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > > > > >                   * decayed cfs_rqs linger on the list.
> > > > > >                   */
> > > > > >                  if (cfs_rq_is_decayed(cfs_rq))
> > > > > >                          list_del_leaf_cfs_rq(cfs_rq);
> > > > > >
> > > > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > > > >
> > > > > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > > > > >
> > > > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > > > frequency update needs to go through utilization of other CPUs in the
> > > > > > cluster. I believe this could be adding to the problem but is not really
> > > > > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > > > > to begin with.
> > > > >
> > > > > Qais mentioned half of the time being used by
> > > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > > this call ?
> > > >
> > > > I do not see a frequency update happening at the time of the problem. However
> > > > note that sugov_iowait_boost() does run even if frequency is not being
> > > > updated. IIRC, this function is also not that light weight and I am not sure
> > > > if it is a good idea to call this that often.
> > >
> > > Scheduler can't make any assumption about how often schedutil/cpufreq
> > > wants to be called. Some are fast and straightforward and can be
> > > called very often to adjust frequency; Others can't handle much
> > > updates. The rate limit mechanism in schedutil and io-boost should be
> > > there for such purpose.
> >
> > Sure, I know that's the intention.
> >
> > > > > > > > average updates also try to update the frequency in schedutil).
> > > > > > > >
> > > > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > > > >
> > > > > > > Low frequency usually means low utilization so it should happen that much.
> > > > > >
> > > > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > > > >
> > > > > Happening a lot in itself is not a problem if there is nothing else to
> > > > > do so it's not a argument in itself
> > > >
> > > > It is a problem - it shows up in the preempt off critical section latency
> > >
> > > But this point is not related to the point above which is about how
> > > often it happens.
> >
> > It is related. A bad thing happening quite often is worse than a bad
> > thing happening infrequently. I agree it is not a root cause fix, but
> > it makes things "better".
> >
> > > > tracer. Are you saying its Ok for preemption to be disabled on system for 500
> > > > micro seconds?  That hurts real-time applications (audio etc).
> > >
> > > So. Is your problem related to real-time applications (audio etc) ?
> >
> > Yes it is. I don't have a reproducer and it could be the audio
> > buffering will hide the problem. But that doesn't mean that there is
> > no problem or that we cannot improve things. There will also likely be
> > power consumption improvement.
> >
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balance as well, but it not that often as those LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I 'm going to prepare a patch to move it
> > > out the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
> 
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
> 
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved. As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.
> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay.... If
> you look at your results for schbench -t 2 -m 2 -r 5: The  *99.0th:
> 12656 (8 samples) shows a delayed of around 12ms which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this results does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> 
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473
> 
> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400
> 
> So a quick and wrong conclusion could be to say that you should disable EAS ;-)
> 
> All that to say that your schbench's results are not relevant and I
> disagree when you said that you have provided all details of your
> problem.
> 
> Then, your function graph shows a half millisecond duration but there
> are no details about the context. It seems that your trace happens
> after that sugov just changed the freq. But it would be not surprising
> to have such a large duration when running at the lowest
> capacity/frequency of the system as an example.
> 
> All that to say that the main reason of my questions is just to make
> sure that the problem is fully understood and the results are
> relevants, which doesn't seem to be the case for schbench at least.
> 
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of csf_rq to update and I'm going
> to provide a patchset for this

The patch below moves the update of the blocked load of CPUs outside newidle_balance().

Instead, the update is done with the usual idle load balance update. I'm working on an
additonnal patch that will select this cpu that is about to become idle, instead of a
random idle cpu but this 1st step fixe the problem of lot of update in newly idle.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 32 +++-----------------------------
 1 file changed, 3 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 197a51473e0c..8200b1d4df3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7421,8 +7421,6 @@ enum migration_type {
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
-#define LBF_NOHZ_STATS	0x10
-#define LBF_NOHZ_AGAIN	0x20
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
-		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
-			env->flags |= LBF_NOHZ_AGAIN;
-
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
 		sgs->group_runnable += cpu_runnable(rq);
@@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats tmp_sgs;
 	int sg_status = 0;
 
-#ifdef CONFIG_NO_HZ_COMMON
-	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
-		env->flags |= LBF_NOHZ_STATS;
-#endif
-
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
@@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/* Tag domain that child domain prefers tasks go to siblings first */
 	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 
-#ifdef CONFIG_NO_HZ_COMMON
-	if ((env->flags & LBF_NOHZ_AGAIN) &&
-	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
-
-		WRITE_ONCE(nohz.next_blocked,
-			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
-	}
-#endif
-
 	if (env->sd->flags & SD_NUMA)
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
@@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 		return;
 
 	raw_spin_unlock(&this_rq->lock);
-	/*
-	 * This CPU is going to be idle and blocked load of idle CPUs
-	 * need to be updated. Run the ilb locally as it is a good
-	 * candidate for ilb instead of waking up another idle CPU.
-	 * Kick an normal ilb if we failed to do the update.
-	 */
-	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
-		kick_ilb(NOHZ_STATS_KICK);
+	kick_ilb(NOHZ_STATS_KICK);
 	raw_spin_lock(&this_rq->lock);
 }
 
@@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
-		nohz_newidle_balance(this_rq);
-
 		goto out;
 	}
 
@@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
+	else
+		nohz_newidle_balance(this_rq);
 
 	rq_repin_lock(this_rq, rf);
Joel Fernandes Feb. 1, 2021, 3:13 p.m. UTC | #15
On Fri, Jan 29, 2021 at 5:33 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
[...]
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balance as well, but it not that often as those LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I 'm going to prepare a patch to move it
> > > out the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved.
> As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.

I disagree. Here you are assuming that schbench is the only task
running on the system. There are other processes, daemons as well. I
see a strong correlation between commenting out
update_blocked_averages() and not seeing the latency hit at the higher
percentiles.

> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay.... If
> you look at your results for schbench -t 2 -m 2 -r 5: The  *99.0th:
> 12656 (8 samples) shows a delayed of around 12ms which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this results does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473

Sure, there could be a different problem causing these higher
percentile latencies on your device. I still think 12ms is awful.

> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400

Yes, definitely changing the governor also solves the problem (like
for example if performance governor is used). The problem happens at
low frequencies.

> So a quick and wrong conclusion could be to say that you should disable EAS ;-)

Sure, except the power management guys may come after me ;-)

[..]
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of csf_rq to update and I'm going
> to provide a patchset for this

That's fine, as long as we agree on this problem ;-) Thanks for
providing the patches and I will try them once they are ready.

> > for this.
> >
> > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > when the coming idle duration is expected to be long enough
> >
> > No, we do not want the schedule loop to take half a millisecond.
>
> keep in mind that you are scaling frequency so everything takes time
> at lowest frequency/capacity ...

Agreed, I was also thinking the same. But that doesn't change the fact
that there is room for improvement and I'm grateful to you for trying
to improve it!

thanks,

 - Joel
Dietmar Eggemann Feb. 3, 2021, 11:54 a.m. UTC | #16
On 29/01/2021 18:27, Vincent Guittot wrote:
> Le vendredi 29 janv. 2021 � 11:33:00 (+0100), Vincent Guittot a �crit :
>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>
>>> Hi Vincent,
>>>
>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>> <vincent.guittot@linaro.org> wrote:
>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>> <joel@joelfernandes.org> wrote:

[...]

>> The only point that I agree with, is that running
>> update_blocked_averages with preempt and irq off is not a good thing
>> because we don't manage the number of csf_rq to update and I'm going
>> to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additonnal patch that will select this cpu that is about to become idle, instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.

I'm trying to understand the need for this extra patch.

The patch below moves away from doing update_blocked_averages() (1) for
all CPUs in sched groups of the sched domain:

newidle_balance()->load_balance()->
find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()

to:

calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
update_nohz_stats() and for the ilb CPU.

newidle_balance() calls (1) for newidle CPU already.

What would be the benefit to choose newidle CPU as ilb CPU?

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 32 +++-----------------------------
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK	0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED	0x08
> -#define LBF_NOHZ_STATS	0x10
> -#define LBF_NOHZ_AGAIN	0x20
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
>  
> -		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> -			env->flags |= LBF_NOHZ_AGAIN;
> -
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
>  		sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> -		env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
>  		int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	/* Tag domain that child domain prefers tasks go to siblings first */
>  	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if ((env->flags & LBF_NOHZ_AGAIN) &&
> -	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> -		WRITE_ONCE(nohz.next_blocked,
> -			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -	}
> -#endif
> -
>  	if (env->sd->flags & SD_NUMA)
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  		return;
>  
>  	raw_spin_unlock(&this_rq->lock);
> -	/*
> -	 * This CPU is going to be idle and blocked load of idle CPUs
> -	 * need to be updated. Run the ilb locally as it is a good
> -	 * candidate for ilb instead of waking up another idle CPU.
> -	 * Kick an normal ilb if we failed to do the update.
> -	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);
>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		nohz_newidle_balance(this_rq);
> -
>  		goto out;
>  	}
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> +	else
> +		nohz_newidle_balance(this_rq);
>  
>  	rq_repin_lock(this_rq, rf);
Vincent Guittot Feb. 3, 2021, 1:12 p.m. UTC | #17
On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 29/01/2021 18:27, Vincent Guittot wrote:
> > Le vendredi 29 janv. 2021 � 11:33:00 (+0100), Vincent Guittot a �crit :
> >> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> >>> <vincent.guittot@linaro.org> wrote:
> >>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> >>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >>>>>>>> <joel@joelfernandes.org> wrote:
>
> [...]
>
> >> The only point that I agree with, is that running
> >> update_blocked_averages with preempt and irq off is not a good thing
> >> because we don't manage the number of csf_rq to update and I'm going
> >> to provide a patchset for this
> >
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm working on an
> > additonnal patch that will select this cpu that is about to become idle, instead of a
> > random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
>
> I'm trying to understand the need for this extra patch.
>
> The patch below moves away from doing update_blocked_averages() (1) for
> all CPUs in sched groups of the sched domain:
>
> newidle_balance()->load_balance()->
> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>
> to:
>
> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
> update_nohz_stats() and for the ilb CPU.
>
> newidle_balance() calls (1) for newidle CPU already.
>
> What would be the benefit to choose newidle CPU as ilb CPU?

To prevent waking up another idle cpu to run the update whereas
newidle cpu is already woken up and about to be idle so the best
candidate.
All the aim of the removed code was to prevent waking up an idle cpu
for doing something that could be done by the newidle cpu before it
enters idle state

>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 32 +++-----------------------------
> >  1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> >  #define LBF_NEED_BREAK       0x02
> >  #define LBF_DST_PINNED  0x04
> >  #define LBF_SOME_PINNED      0x08
> > -#define LBF_NOHZ_STATS       0x10
> > -#define LBF_NOHZ_AGAIN       0x20
> >
> >  struct lb_env {
> >       struct sched_domain     *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >       for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >               struct rq *rq = cpu_rq(i);
> >
> > -             if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> > -                     env->flags |= LBF_NOHZ_AGAIN;
> > -
> >               sgs->group_load += cpu_load(rq);
> >               sgs->group_util += cpu_util(i);
> >               sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       struct sg_lb_stats tmp_sgs;
> >       int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > -             env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> >       do {
> >               struct sg_lb_stats *sgs = &tmp_sgs;
> >               int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       /* Tag domain that child domain prefers tasks go to siblings first */
> >       sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if ((env->flags & LBF_NOHZ_AGAIN) &&
> > -         cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > -             WRITE_ONCE(nohz.next_blocked,
> > -                        jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > -     }
> > -#endif
> > -
> >       if (env->sd->flags & SD_NUMA)
> >               env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >               return;
> >
> >       raw_spin_unlock(&this_rq->lock);
> > -     /*
> > -      * This CPU is going to be idle and blocked load of idle CPUs
> > -      * need to be updated. Run the ilb locally as it is a good
> > -      * candidate for ilb instead of waking up another idle CPU.
> > -      * Kick an normal ilb if we failed to do the update.
> > -      */
> > -     if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> > -             kick_ilb(NOHZ_STATS_KICK);
> > +     kick_ilb(NOHZ_STATS_KICK);
> >       raw_spin_lock(&this_rq->lock);
> >  }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >                       update_next_balance(sd, &next_balance);
> >               rcu_read_unlock();
> >
> > -             nohz_newidle_balance(this_rq);
> > -
> >               goto out;
> >       }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >       if (pulled_task)
> >               this_rq->idle_stamp = 0;
> > +     else
> > +             nohz_newidle_balance(this_rq);
> >
> >       rq_repin_lock(this_rq, rf);
Qais Yousef Feb. 3, 2021, 5:09 p.m. UTC | #18
On 01/29/21 18:27, Vincent Guittot wrote:
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additonnal patch that will select this cpu that is about to become idle, instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 32 +++-----------------------------
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK	0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED	0x08
> -#define LBF_NOHZ_STATS	0x10
> -#define LBF_NOHZ_AGAIN	0x20
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
>  
> -		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> -			env->flags |= LBF_NOHZ_AGAIN;
> -
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
>  		sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> -		env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
>  		int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	/* Tag domain that child domain prefers tasks go to siblings first */
>  	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if ((env->flags & LBF_NOHZ_AGAIN) &&
> -	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> -		WRITE_ONCE(nohz.next_blocked,
> -			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -	}
> -#endif
> -
>  	if (env->sd->flags & SD_NUMA)
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  		return;
>  
>  	raw_spin_unlock(&this_rq->lock);
> -	/*
> -	 * This CPU is going to be idle and blocked load of idle CPUs
> -	 * need to be updated. Run the ilb locally as it is a good
> -	 * candidate for ilb instead of waking up another idle CPU.
> -	 * Kick an normal ilb if we failed to do the update.
> -	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))

Since we removed the call to this function (which uses this_rq)

> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);

And unconditionally call kick_ilb() which will find a suitable cpu to run the
lb at regardless what this_rq is.

Doesn't the below become unnecessary now?

	  10494         /*
	  10495          * This CPU doesn't want to be disturbed by scheduler
	  10496          * housekeeping
	  10497          */
	  10498         if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
	  10499                 return;
	  10500
	  10501         /* Will wake up very soon. No time for doing anything else*/
	  10502         if (this_rq->avg_idle < sysctl_sched_migration_cost)
	  10503                 return;

And we can drop this_rq arg altogether?

>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		nohz_newidle_balance(this_rq);
> -
>  		goto out;
>  	}
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> +	else
> +		nohz_newidle_balance(this_rq);

Since nohz_newidle_balance() will not do any real work now, I couldn't figure
out what moving this here achieves. Fault from my end to parse the change most
likely :-)

Joel can still test this patch as is of course. This is just an early review
since I already spent the time trying to understand it.

Thanks

--
Qais Yousef

>  
>  	rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1
Vincent Guittot Feb. 3, 2021, 5:35 p.m. UTC | #19
On Wed, 3 Feb 2021 at 18:09, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 01/29/21 18:27, Vincent Guittot wrote:
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm working on an
> > additonnal patch that will select this cpu that is about to become idle, instead of a
> > random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 32 +++-----------------------------
> >  1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> >  #define LBF_NEED_BREAK       0x02
> >  #define LBF_DST_PINNED  0x04
> >  #define LBF_SOME_PINNED      0x08
> > -#define LBF_NOHZ_STATS       0x10
> > -#define LBF_NOHZ_AGAIN       0x20
> >
> >  struct lb_env {
> >       struct sched_domain     *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >       for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >               struct rq *rq = cpu_rq(i);
> >
> > -             if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> > -                     env->flags |= LBF_NOHZ_AGAIN;
> > -
> >               sgs->group_load += cpu_load(rq);
> >               sgs->group_util += cpu_util(i);
> >               sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       struct sg_lb_stats tmp_sgs;
> >       int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > -             env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> >       do {
> >               struct sg_lb_stats *sgs = &tmp_sgs;
> >               int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       /* Tag domain that child domain prefers tasks go to siblings first */
> >       sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if ((env->flags & LBF_NOHZ_AGAIN) &&
> > -         cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > -             WRITE_ONCE(nohz.next_blocked,
> > -                        jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > -     }
> > -#endif
> > -
> >       if (env->sd->flags & SD_NUMA)
> >               env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >               return;
> >
> >       raw_spin_unlock(&this_rq->lock);
> > -     /*
> > -      * This CPU is going to be idle and blocked load of idle CPUs
> > -      * need to be updated. Run the ilb locally as it is a good
> > -      * candidate for ilb instead of waking up another idle CPU.
> > -      * Kick an normal ilb if we failed to do the update.
> > -      */
> > -     if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>
> Since we removed the call to this function (which uses this_rq)
>
> > -             kick_ilb(NOHZ_STATS_KICK);
> > +     kick_ilb(NOHZ_STATS_KICK);
>
> And unconditionally call kick_ilb() which will find a suitable cpu to run the
> lb at regardless what this_rq is.
>
> Doesn't the below become unnecessary now?

The end goal is to keep running on this cpu that is about to become idle.

This patch is mainly  there to check that Joel's problem disappears if
the update of the blocked load of the cpus is not done in the
newidle_balance context. I'm preparing few other patches on top to
clean up the full path

>
>           10494         /*
>           10495          * This CPU doesn't want to be disturbed by scheduler
>           10496          * housekeeping
>           10497          */
>           10498         if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
>           10499                 return;
>           10500
>           10501         /* Will wake up very soon. No time for doing anything else*/
>           10502         if (this_rq->avg_idle < sysctl_sched_migration_cost)
>           10503                 return;
>
> And we can drop this_rq arg altogether?
>
> >       raw_spin_lock(&this_rq->lock);
> >  }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >                       update_next_balance(sd, &next_balance);
> >               rcu_read_unlock();
> >
> > -             nohz_newidle_balance(this_rq);
> > -
> >               goto out;
> >       }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >       if (pulled_task)
> >               this_rq->idle_stamp = 0;
> > +     else
> > +             nohz_newidle_balance(this_rq);
>
> Since nohz_newidle_balance() will not do any real work now, I couldn't figure
> out what moving this here achieves. Fault from my end to parse the change most
> likely :-)

The goal is to schedule the update only if we are about to be idle and
nothing else has been queued in the meantime

>
> Joel can still test this patch as is of course. This is just an early review
> since I already spent the time trying to understand it.
>
> Thanks
>
> --
> Qais Yousef
>
> >
> >       rq_repin_lock(this_rq, rf);
> >
> > --
> > 2.17.1
Joel Fernandes Feb. 3, 2021, 7:56 p.m. UTC | #20
On Fri, Jan 29, 2021 at 06:27:27PM +0100, Vincent Guittot wrote:
[...]
> > update_blocked_averages with preempt and irq off is not a good thing
> > because we don't manage the number of csf_rq to update and I'm going
> > to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additonnal patch that will select this cpu that is about to become idle, instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

I confirmed that with this patch, I don't see the preemptoff issues related
to update_blocked_averages() anymore (tested using preemptoff tracer).

I went through the patch and it looks correct to me, I will further review it
and await further reviews from others as well, and then backport the patch to
our kernels. Thanks Vince and everyone!

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel



> ---
>  kernel/sched/fair.c | 32 +++-----------------------------
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK	0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED	0x08
> -#define LBF_NOHZ_STATS	0x10
> -#define LBF_NOHZ_AGAIN	0x20
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
>  
> -		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> -			env->flags |= LBF_NOHZ_AGAIN;
> -
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
>  		sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> -		env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
>  		int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	/* Tag domain that child domain prefers tasks go to siblings first */
>  	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if ((env->flags & LBF_NOHZ_AGAIN) &&
> -	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> -		WRITE_ONCE(nohz.next_blocked,
> -			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -	}
> -#endif
> -
>  	if (env->sd->flags & SD_NUMA)
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  		return;
>  
>  	raw_spin_unlock(&this_rq->lock);
> -	/*
> -	 * This CPU is going to be idle and blocked load of idle CPUs
> -	 * need to be updated. Run the ilb locally as it is a good
> -	 * candidate for ilb instead of waking up another idle CPU.
> -	 * Kick an normal ilb if we failed to do the update.
> -	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);
>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		nohz_newidle_balance(this_rq);
> -
>  		goto out;
>  	}
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> +	else
> +		nohz_newidle_balance(this_rq);
>  
>  	rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1
> 
> 
> > 
> > > for this.
> > >
> > > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > > when the coming idle duration is expected to be long enough
> > >
> > > No, we do not want the schedule loop to take half a millisecond.
> > 
> > keep in mind that you are scaling frequency so everything takes time
> > at lowest frequency/capacity ...
> > 
> > >
> > > > > > IIUC, your real problem is that newidle_balance is running whereas a
> > > > > > task is about to wake up on the cpu and we don't abort quickly during
> > > > > > this load_balance
> > > > >
> > > > > Yes.
> > > > >
> > > > > > so we could also try to abort earlier in case of newly idle load balance
> > > > >
> > > > > I think interrupts are disabled when the load balance runs, so there's no way
> > > > > for say an audio interrupt to even run in order to wake up a task. How would
> > > > > you know to abort the new idle load balance?
> > > > >
> > > > > Could you elaborate more also on the drawback of the rate limiting patch we
> > > > > posted? Do you see a side effect?
> > > >
> > > > Your patch just tries to hide your problem and not to solve the root cause.
> > >
> > > Agreed, the solution presented is a band aid and not a proper fix. It
> > > was just intended to illustrate the problem and start a discussion. I
> > > should have marked it RFC for sure.
> > >
> > > thanks!
> > >
> > > - Joel
Dietmar Eggemann Feb. 4, 2021, 9:47 a.m. UTC | #21
On 03/02/2021 14:12, Vincent Guittot wrote:
> On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 29/01/2021 18:27, Vincent Guittot wrote:
>>> Le vendredi 29 janv. 2021 � 11:33:00 (+0100), Vincent Guittot a �crit :
>>>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>>>> <vincent.guittot@linaro.org> wrote:
>>>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>>>> <joel@joelfernandes.org> wrote:
>>
>> [...]
>>
>>>> The only point that I agree with, is that running
>>>> update_blocked_averages with preempt and irq off is not a good thing
>>>> because we don't manage the number of csf_rq to update and I'm going
>>>> to provide a patchset for this
>>>
>>> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>>>
>>> Instead, the update is done with the usual idle load balance update. I'm working on an
>>> additonnal patch that will select this cpu that is about to become idle, instead of a
>>> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
>>
>> I'm trying to understand the need for this extra patch.
>>
>> The patch below moves away from doing update_blocked_averages() (1) for
>> all CPUs in sched groups of the sched domain:
>>
>> newidle_balance()->load_balance()->
>> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>>
>> to:
>>
>> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
>> update_nohz_stats() and for the ilb CPU.
>>
>> newidle_balance() calls (1) for newidle CPU already.
>>
>> What would be the benefit to choose newidle CPU as ilb CPU?
> 
> To prevent waking up another idle cpu to run the update whereas
> newidle cpu is already woken up and about to be idle so the best
> candidate.
> All the aim of the removed code was to prevent waking up an idle cpu
> for doing something that could be done by the newidle cpu before it
> enters idle state

Ah OK, makes sense. So the only diff would be that newidle CPU will call
(1) on nohz.idle_cpus_mask rather on on sd->span.

[...]
Qais Yousef Feb. 4, 2021, 10:45 a.m. UTC | #22
On 02/03/21 18:35, Vincent Guittot wrote:
> > >       raw_spin_unlock(&this_rq->lock);
> > > -     /*
> > > -      * This CPU is going to be idle and blocked load of idle CPUs
> > > -      * need to be updated. Run the ilb locally as it is a good
> > > -      * candidate for ilb instead of waking up another idle CPU.
> > > -      * Kick an normal ilb if we failed to do the update.
> > > -      */
> > > -     if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >
> > Since we removed the call to this function (which uses this_rq)
> >
> > > -             kick_ilb(NOHZ_STATS_KICK);
> > > +     kick_ilb(NOHZ_STATS_KICK);
> >
> > And unconditionally call kick_ilb() which will find a suitable cpu to run the
> > lb at regardless what this_rq is.
> >
> > Doesn't the below become unnecessary now?
> 
> The end goal is to keep running on this cpu that is about to become idle.
> 
> This patch is mainly  there to check that Joel's problem disappears if
> the update of the blocked load of the cpus is not done in the
> newidle_balance context. I'm preparing few other patches on top to
> clean up the full path

+1

> >
> >           10494         /*
> >           10495          * This CPU doesn't want to be disturbed by scheduler
> >           10496          * housekeeping
> >           10497          */
> >           10498         if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
> >           10499                 return;
> >           10500
> >           10501         /* Will wake up very soon. No time for doing anything else*/
> >           10502         if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >           10503                 return;
> >
> > And we can drop this_rq arg altogether?
> >
> > >       raw_spin_lock(&this_rq->lock);
> > >  }
> > >
> > > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >                       update_next_balance(sd, &next_balance);
> > >               rcu_read_unlock();
> > >
> > > -             nohz_newidle_balance(this_rq);
> > > -
> > >               goto out;
> > >       }
> > >
> > > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >
> > >       if (pulled_task)
> > >               this_rq->idle_stamp = 0;
> > > +     else
> > > +             nohz_newidle_balance(this_rq);
> >
> > Since nohz_newidle_balance() will not do any real work now, I couldn't figure
> > out what moving this here achieves. Fault from my end to parse the change most
> > likely :-)
> 
> The goal is to schedule the update only if we are about to be idle and
> nothing else has been queued in the meantime

I see. This short coming already existed and not *strictly* related to moving
update of blocked load out of newidle balance.

Thanks

--
Qais Yousef

Patch
diff mbox series

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@  static bool update_nohz_stats(struct rq *rq, bool force)
 	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 		return false;
 
-	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
 		return true;
 
 	update_blocked_averages(cpu);