sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Message ID 20210122154600.1722680-1-joel@joelfernandes.org
State New, archived
Series
  • sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Commit Message

Joel Fernandes Jan. 22, 2021, 3:46 p.m. UTC
On an octacore ARM64 device running the ChromeOS Linux kernel v5.4, I found
that there are a lot of calls to update_blocked_averages(). This causes
the schedule loop to slow down, taking up to 500 microseconds at
times (due to newidle load balance). I have also seen this manifest in
the periodic balancer.

Closer look shows that the problem is caused by the following
ingredients:
1. If the system has a lot of inactive CGroups (thanks Dietmar for
suggesting to inspect /proc/sched_debug for this), this can make
__update_blocked_fair() take a long time.

2. The device has a lot of CPUs in a cluster which causes schedutil in a
shared frequency domain configuration to be slower than usual. (the load
average updates also try to update the frequency in schedutil).

3. The CPU is running at a low frequency causing the scheduler/schedutil
code paths to take longer than when running at a high CPU frequency.

The fix is to simply rate limit the calls to update_blocked_averages() to 20
times per second. It appears that updating the blocked averages less
often is sufficient. Currently I sometimes see about 200 calls per second,
which seems like overkill.

schbench shows a clear improvement with the change:

Without patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (212 total samples)
        50.0th: 210 (106 samples)
        75.0th: 619 (53 samples)
        90.0th: 665 (32 samples)
        95.0th: 703 (11 samples)
        *99.0th: 12656 (8 samples)
        99.5th: 12784 (1 samples)
        99.9th: 13424 (1 samples)
        min=15, max=13424

With patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (214 total samples)
        50.0th: 188 (108 samples)
        75.0th: 238 (53 samples)
        90.0th: 623 (32 samples)
        95.0th: 657 (12 samples)
        *99.0th: 717 (7 samples)
        99.5th: 725 (2 samples)
        99.9th: 725 (0 samples)

Cc: Paul McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Co-developed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
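
The diff body is not expanded in this view; for convenience, here is the hunk
exactly as quoted in the replies below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
 	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 		return false;
 
-	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
 		return true;
 
 	update_blocked_averages(cpu);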

Comments

Qais Yousef Jan. 22, 2021, 6:39 p.m. UTC | #1
On 01/22/21 17:56, Vincent Guittot wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> >                 return false;
> >
> > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> 
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval
> This hard coded (HZ/20) looks really like an ugly hack

This was meant as an RFC patch to discuss the problem really.

Joel is seeing update_blocked_averages() taking ~100us. Half of it seems to be
in processing __update_blocked_fair() and the other half in sugov_update_shared(),
so roughly 50us each. Note that each function calls an iterator in turn.
Correct me if my numbers are wrong, Joel.

Running on a little core at a low frequency, these numbers don't look too odd,
so I'm not seeing how we can speed these functions up.

But since update_sg_lb_stats() will end up with multiple calls to
update_blocked_averages() in one go, this latency adds up quickly.

One noticeable factor in Joel's system is the presence of a lot of cgroups,
which is essentially what makes __update_blocked_fair() expensive. It seems
to always report that something has decayed, so we end up with a call to
sugov_update_shared() on every invocation.

I think we should limit the expensive call to update_blocked_averages() but
I honestly don't know what would be the right way to do it :-/

Or maybe there's another better alternative too..

Thanks

--
Qais Yousef

> 
> >                 return true;
> >
> >         update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >
Joel Fernandes Jan. 22, 2021, 7:10 p.m. UTC | #2
Hi Vincent,

Thanks for reply. Please see the replies below:

On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > that there are a lot of calls to update_blocked_averages(). This causes
> > the schedule loop to slow down to taking upto 500 micro seconds at
> > times (due to newidle load balance). I have also seen this manifest in
> > the periodic balancer.
> >
> > Closer look shows that the problem is caused by the following
> > ingredients:
> > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > suggesting to inspect /proc/sched_debug for this), this can make
> > __update_blocked_fair() take a long time.
> 
> Inactive cgroups are removed from the list so they should not impact
> the duration

I meant blocked CGroups. According to this code, a cfs_rq can be partially
decayed and not have any tasks running on it but its load needs to be
decayed, correct? That's what I meant by 'inactive'. I can reword it to
'blocked'.

                 /*
                  * There can be a lot of idle CPU cgroups.  Don't let fully
                  * decayed cfs_rqs linger on the list.
                  */
                 if (cfs_rq_is_decayed(cfs_rq))
                         list_del_leaf_cfs_rq(cfs_rq);
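
For context, that check sits inside the walk over the rq's leaf cfs_rq list. A
heavily simplified sketch of that loop (load propagation to parent entities and
other details elided, so treat it as an approximation rather than the exact code
in this kernel):

	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		/* Decay the PELT load/util sums of this (possibly blocked) cfs_rq. */
		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
			decayed = true;

		/*
		 * There can be a lot of idle CPU cgroups.  Don't let fully
		 * decayed cfs_rqs linger on the list.
		 */
		if (cfs_rq_is_decayed(cfs_rq))
			list_del_leaf_cfs_rq(cfs_rq);
	}

Every cgroup whose cfs_rq still carries blocked load keeps an entry on that
list, so the cost of the walk grows with the number of such cgroups.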

> > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > shared frequency domain configuration to be slower than usual. (the load
> 
> What do you mean exactly by it causes schedutil to be slower than usual ?

sugov_next_freq_shared() is on the order of the number of CPUs in a cluster. This
system is a 6+2 system with 6 CPUs in a cluster; a schedutil shared-policy
frequency update needs to go through the utilization of the other CPUs in the
cluster. I believe this could be adding to the problem, but it does not really
need to be optimized if we can rate limit the calls to update_blocked_averages()
to begin with.
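
To illustrate the point, this is roughly what the shared-policy scan looks like
in schedutil of this era (a simplified sketch with iowait boosting and other
details elided, not the exact code):

static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
{
	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
	struct cpufreq_policy *policy = sg_policy->policy;
	unsigned long util = 0, max = 1;
	unsigned int j;

	/* One pass over every CPU of the frequency domain (6 CPUs here). */
	for_each_cpu(j, policy->cpus) {
		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
		unsigned long j_util = sugov_get_util(j_sg_cpu);
		unsigned long j_max = j_sg_cpu->max;

		if (j_util * max > j_max * util) {
			util = j_util;
			max = j_max;
		}
	}

	return get_next_freq(sg_policy, util, max);
}

So every frequency re-evaluation triggered from the blocked-load update pays
for a walk over all the CPUs of the cluster.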

> > average updates also try to update the frequency in schedutil).
> >
> > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > code paths to take longer than when running at a high CPU frequency.
> 
> Low frequency usually means low utilization so it should happen that much.

It happens a lot as can be seen with schbench. It is super easy to reproduce.

schedule() can result in a newidle balance via the CFS pick path, and that
happens often. Here is a function graph trace; it shows
update_blocked_averages() taking a lot of time.

     sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
     sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
     sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
     sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
     sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
     sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
     sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
     sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
     sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
     sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
     sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
     sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
     sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
     sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
     sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
     sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
     sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
     sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
     sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
     sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458


> > The fix is simply rate limit the calls to update_blocked_averages to 20
> > times per second. It appears that updating the blocked average less
> > often is sufficient. Currently I see about 200 calls per second
> 
> Would be good to explain why updating less often is sufficient ?

I don't know this code that well; intuitively it seems to me that updating blocked
averages at such a high rate is pointless. But I defer to your expertise
on that. Why do you feel an update is needed at least HZ times per second?
What about systems with HZ=1000 or 300? That seems to be an insane rate of
updating (not to mention all the complexity of going through the leaf cgroup
list and doing the frequency updates).
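
To put numbers on it (my own back-of-the-envelope arithmetic, not something
measured):

/*
 * Existing check: at most one update per jiffy per rq from this path,
 * i.e. up to HZ updates per second (250/s at HZ=250, 1000/s at HZ=1000).
 *
 * Proposed check: at most one update per HZ/20 jiffies, i.e. roughly 50 ms:
 *   HZ=100  -> window of  5 jiffies = 50 ms
 *   HZ=250  -> window of 12 jiffies = 48 ms
 *   HZ=300  -> window of 15 jiffies = 50 ms
 *   HZ=1000 -> window of 50 jiffies = 50 ms
 * which caps this path at about 20 updates per second per rq.
 */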

> > sometimes, which seems overkill.
> >
> > schbench shows a clear improvement with the change:
> 
> Have you got more details about your test setup ?
> which platform ?
> which kernel ?

I mentioned in the commit message that it is a v5.4 kernel.

The platform is the Snapdragon 7c:
https://www.qualcomm.com/products/snapdragon-7c-compute-platform

> 
> >
> > Without patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> >         50.0th: 210 (106 samples)
> >         75.0th: 619 (53 samples)
> >         90.0th: 665 (32 samples)
> >         95.0th: 703 (11 samples)
> >         *99.0th: 12656 (8 samples)
> >         99.5th: 12784 (1 samples)
> >         99.9th: 13424 (1 samples)
> >         min=15, max=13424
> >
> > With patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> >         50.0th: 188 (108 samples)
> >         75.0th: 238 (53 samples)
> >         90.0th: 623 (32 samples)
> >         95.0th: 657 (12 samples)
> >         *99.0th: 717 (7 samples)
> >         99.5th: 725 (2 samples)
> >         99.9th: 725 (0 samples)
> >
> > Cc: Paul McKenney <paulmck@kernel.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Suggested-by: Dietmar Eggeman <dietmar.eggemann@arm.com>
> > Co-developed-by: Qais Yousef <qais.yousef@arm.com>
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> >                 return false;
> >
> > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> 
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval

What prevents newidle_balance() from hitting this code? Are you suggesting
rate limiting that as well?

> This hard coded (HZ/20) looks really like an ugly hack

Yes, it was not a well-researched number. If you have a suggestion for a
better fix, let me know.

thanks,

 - Joel


> >                 return true;
> >
> >         update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >
Joel Fernandes Jan. 22, 2021, 7:14 p.m. UTC | #3
On Fri, Jan 22, 2021 at 06:39:27PM +0000, Qais Yousef wrote:
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> > 
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
> 
> This was meant as an RFC patch to discuss the problem really.

Agreed, sorry.

> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in
> return. Correct me if my numbers are wrong Joel.

Correct, and I see update_nohz_stats() itself called around 8 times during a
load balance, which multiplies the overhead.

Dietmar also found that the reason update_nohz_stats() is called 8 times is
that, in our setup, there is only 1 MC sched domain with all 8 CPUs, versus
say 2 MC domains with 4 CPUs each.

> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.

Agreed.

> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.

True!

> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.

Correct.

thanks,

 - Joel

[..]
Dietmar Eggemann Jan. 25, 2021, 10:44 a.m. UTC | #4
On 22/01/2021 20:10, Joel Fernandes wrote:
> Hi Vincent,
> 
> Thanks for reply. Please see the replies below:
> 
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>> <joel@joelfernandes.org> wrote:
>>>
>>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
>>> that there are a lot of calls to update_blocked_averages(). This causes
>>> the schedule loop to slow down to taking upto 500 micro seconds at
>>> times (due to newidle load balance). I have also seen this manifest in
>>> the periodic balancer.
>>>
>>> Closer look shows that the problem is caused by the following
>>> ingredients:
>>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
>>> suggesting to inspect /proc/sched_debug for this), this can make
>>> __update_blocked_fair() take a long time.
>>
>> Inactive cgroups are removed from the list so they should not impact
>> the duration
> 
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
> 
>                   * There can be a lot of idle CPU cgroups.  Don't let fully
>                   * decayed cfs_rqs linger on the list.
>                   */
>                  if (cfs_rq_is_decayed(cfs_rq))
>                          list_del_leaf_cfs_rq(cfs_rq);
> 
>>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
>>> shared frequency domain configuration to be slower than usual. (the load
>>
>> What do you mean exactly by it causes schedutil to be slower than usual ?
> 
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.
> 
>>> average updates also try to update the frequency in schedutil).
>>>
>>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
>>> code paths to take longer than when running at a high CPU frequency.
>>
>> Low frequency usually means low utilization so it should happen that much.
> 
> It happens a lot as can be seen with schbench. It is super easy to reproduce.
> 
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
> 
>      sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
>      sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
>      sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
>      sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
>      sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
>      sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
>      sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
>      sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
>      sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
>      sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
>      sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
>      sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
>      sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
>      sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
>      sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
>      sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
>      sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458
> 
> 
>>> The fix is simply rate limit the calls to update_blocked_averages to 20
>>> times per second. It appears that updating the blocked average less
>>> often is sufficient. Currently I see about 200 calls per second
>>
>> Would be good to explain why updating less often is sufficient ?
> 
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).

I assume this is what you're seeing on your device. This is on tip sched/core
but should be close to your kernel. I glanced over the diffs in fair.c between
chromeos-5.4 and tip sched/core and didn't spot any changes in this area.

I ran on a hikey620 w/o CONFIG_SCHED_MC to mimic the 8 CPUs (8 sched groups
(sg)) in the MC domain (the only sched domain). 

Since nohz.has_blocked=1 in your newidle_balance() calls,
load_balance() -> update_sd_lb_stats() sets LBF_NOHZ_STATS and calls
update_sg_lb_stats() for each of the 8 sg's.

Since LBF_NOHZ_STATS is set, update_sg_lb_stats() calls
update_nohz_stats(..., false) per cpu in sg.

And for a lot of these 8 sg's, i.e. 8 CPUs, update_blocked_averages()
is called since none of the 3 bail-out conditions:

 (1) !rq->has_blocked_load
 (2) !cpumask_test_cpu(cpu, nohz.idle_cpus_mask)
 (3) !force && !time_after(jiffies, rq->last_blocked_load_update_tick)

trigger.
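
For reference, a sketch of the whole function those three conditions live in,
reconstructed from the hunk quoted earlier in the thread (so an approximation
of this kernel's fair.c, not a verbatim copy):

static bool update_nohz_stats(struct rq *rq, bool force)
{
#ifdef CONFIG_NO_HZ_COMMON
	unsigned int cpu = rq->cpu;

	if (!rq->has_blocked_load)					/* (1) */
		return false;

	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))		/* (2) */
		return false;

	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))	/* (3) */
		return true;

	update_blocked_averages(cpu);

	return rq->has_blocked_load;
#else
	return false;
#endif
}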

We advance nohz.next_blocked by msecs_to_jiffies(LOAD_AVG_PERIOD) (32ms,
8 jiffies w/ HZ=250), but we advance rq->last_blocked_load_update_tick
only to jiffies in update_blocked_load_status().


[005] 7370.188469: bprint: nohz_balance_enter_idle: CPU5 nohz.has_blocked=1
...
[005] 7370.210068: bprint: pick_next_task_fair: CPU5
[005] 7370.210079: bprint: update_sd_lb_stats: CPU5 nohz.has_blocked=1 -> set LBF_NOHZ_STATS
[005] 7370.210082: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=5 weight=1)
[005] 7370.210085: bprint: update_nohz_stats: CPU5 cpu=5 not in nohz.idle_cpus_mask -> bail
[005] 7370.210088: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=6 weight=1)
[005] 7370.210091: bprint: update_nohz_stats: CPU5 cpu=6 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210112: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=7 weight=1)
[005] 7370.210116: bprint: update_nohz_stats: CPU5 cpu=7 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210134: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=0 weight=1)
[005] 7370.210137: bprint: update_nohz_stats: CPU5 cpu=0 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210156: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=1 weight=1)
[005] 7370.210159: bprint: update_nohz_stats: CPU5 cpu=1 not in nohz.idle_cpus_mask -> bail
[005] 7370.210162: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=2 weight=1)
[005] 7370.210165: bprint: update_nohz_stats: CPU5 cpu=2 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210183: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=3 weight=1)
[005] 7370.210186: bprint: update_nohz_stats: CPU5 cpu=3 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210205: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=4 weight=1)
[005] 7370.210207: bprint: update_nohz_stats: CPU5 cpu=4 not in nohz.idle_cpus_mask -> bail
...
[005] 7370.444704: bprint: _nohz_idle_balance: CPU5 nohz.has_blocked=0


If I understood you correctly, you want to avoid these frequent calls
to update_blocked_averages() here to further avoid invoking schedutil
via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
is set) very often in your setup.
Since you have up to 6 CPUs in a frequency domain, this could be more
costly than usual.
Vincent Guittot Jan. 25, 2021, 1:23 p.m. UTC | #5
On Fri, 22 Jan 2021 at 19:39, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
>
> This was meant as an RFC patch to discuss the problem really.
>
> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in

Can I assume that a freq change happens if sugov_update_shared() takes 50us,
which would mean that the update was useful in the end?

> return. Correct me if my numbers are wrong Joel.
>
> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.
>
> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.
>
> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.
>
> I think we should limit the expensive call to update_blocked_averages() but

On the opposite side, some will complain that blocked values stay stale at a
high value and prevent any useful adjustment.

Also, update_blocked_averages() is already rate limited by the load_balance
interval for idle and busy load balance.

It seems like the problem raised by Joel is the number of newly idle load balances.

> I honestly don't know what would be the right way to do it :-/
>
> Or maybe there's another better alternative too..
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > >                 return true;
> > >
> > >         update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
Vincent Guittot Jan. 25, 2021, 2:42 p.m. UTC | #6
On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Vincent,
>
> Thanks for reply. Please see the replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > that there are a lot of calls to update_blocked_averages(). This causes
> > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > times (due to newidle load balance). I have also seen this manifest in
> > > the periodic balancer.
> > >
> > > Closer look shows that the problem is caused by the following
> > > ingredients:
> > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > suggesting to inspect /proc/sched_debug for this), this can make
> > > __update_blocked_fair() take a long time.
> >
> > Inactive cgroups are removed from the list so they should not impact
> > the duration
>
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.

How many blocked cgroups have you got ?

>
>                   * There can be a lot of idle CPU cgroups.  Don't let fully
>                   * decayed cfs_rqs linger on the list.
>                   */
>                  if (cfs_rq_is_decayed(cfs_rq))
>                          list_del_leaf_cfs_rq(cfs_rq);
>
> > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > shared frequency domain configuration to be slower than usual. (the load
> >
> > What do you mean exactly by it causes schedutil to be slower than usual ?
>
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.

Qais mentioned that half of the time is spent in
sugov_next_freq_shared(). Are there any frequency changes resulting from
this call?

>
> > > average updates also try to update the frequency in schedutil).
> > >
> > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > code paths to take longer than when running at a high CPU frequency.
> >
> > Low frequency usually means low utilization so it should happen that much.
>
> It happens a lot as can be seen with schbench. It is super easy to reproduce.

Happening a lot is not in itself a problem if there is nothing else to
do, so it's not an argument on its own.

So why is it a problem for you? You are mentioning newly idle load
balance, so I assume that your root problem is the scheduling delay
generated by the newly idle load balance, which then calls
update_blocked_averages().

Rate limiting the call to update_blocked_averages() will only reduce
the number of times it happens, but it will not prevent it from happening.

>
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
>
>      sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
>      sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
>      sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
>      sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
>      sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
>      sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
>      sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
>      sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
>      sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
>      sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
>      sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
>      sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
>      sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
>      sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
>      sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
>      sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
>      sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
>      sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
>      sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
>      sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458
>
>
> > > The fix is simply rate limit the calls to update_blocked_averages to 20
> > > times per second. It appears that updating the blocked average less
> > > often is sufficient. Currently I see about 200 calls per second
> >
> > Would be good to explain why updating less often is sufficient ?
>
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).

Your trace above is a full update of blocked load, which is a normal
sequence that happens regularly, but you don't show that this happens
too often.

The call to update_blocked_averages() is rate limited by the load_balance
interval for idle and busy load balance, so those 2 cases should not be
a problem. There is no rate limit in the newly_idle case because the
latter is normally called only if it has been estimated that there is
enough time to run it. That's why update_blocked_averages() is called
every time during newly idle load balance.

See in update_sd_lb_stats():

	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
		env->flags |= LBF_NOHZ_STATS;

We could consider removing this condition, but I need time to recall
all the side effects of this.

IIUC, your real problem is that newidle_balance() is running while a
task is about to wake up on the CPU, and we don't abort quickly during
this load_balance.

So we could also try to abort earlier in the case of newly idle load balance.

>
> > > sometimes, which seems overkill.
> > >
> > > schbench shows a clear improvement with the change:
> >
> > Have you got more details about your test setup ?
> > which platform ?
> > which kernel ?
>
> I mentioned in the commit message it is a v5.4 kernel.

I was not sure if the test results were done with this kernel, because we
usually ask for results against mainline to make sure you are not
facing a problem that has been solved since v5.4 was released.

>
> The platform is Snapdragon 7c. The platform is:
> https://www.qualcomm.com/products/snapdragon-7c-compute-platform
>
> >
> > >
> > > Without patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > >         50.0th: 210 (106 samples)
> > >         75.0th: 619 (53 samples)
> > >         90.0th: 665 (32 samples)
> > >         95.0th: 703 (11 samples)
> > >         *99.0th: 12656 (8 samples)
> > >         99.5th: 12784 (1 samples)
> > >         99.9th: 13424 (1 samples)
> > >         min=15, max=13424
> > >
> > > With patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > >         50.0th: 188 (108 samples)
> > >         75.0th: 238 (53 samples)
> > >         90.0th: 623 (32 samples)
> > >         95.0th: 657 (12 samples)
> > >         *99.0th: 717 (7 samples)
> > >         99.5th: 725 (2 samples)
> > >         99.9th: 725 (0 samples)
> > >
> > > Cc: Paul McKenney <paulmck@kernel.org>
> > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > Suggested-by: Dietmar Eggeman <dietmar.eggemann@arm.com>
> > > Co-developed-by: Qais Yousef <qais.yousef@arm.com>
> > > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
>
> What prevents newidle_balance from hitting this code? Are you suggesting rate
> limit that as well?
>
> > This hard coded (HZ/20) looks really like an ugly hack
>
> Yes, it was not a well researched number. If you have a suggestion for better
> fix, let me know.
>
> thanks,
>
>  - Joel
>
>
> > >                 return true;
> > >
> > >         update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
Vincent Guittot Jan. 25, 2021, 5:30 p.m. UTC | #7
On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 22/01/2021 20:10, Joel Fernandes wrote:
> > Hi Vincent,
> >
> > Thanks for reply. Please see the replies below:
> >
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >> <joel@joelfernandes.org> wrote:
> >>>
> >>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> >>> that there are a lot of calls to update_blocked_averages(). This causes
> >>> the schedule loop to slow down to taking upto 500 micro seconds at
> >>> times (due to newidle load balance). I have also seen this manifest in
> >>> the periodic balancer.
> >>>
> >>> Closer look shows that the problem is caused by the following
> >>> ingredients:
> >>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> >>> suggesting to inspect /proc/sched_debug for this), this can make
> >>> __update_blocked_fair() take a long time.
> >>
> >> Inactive cgroups are removed from the list so they should not impact
> >> the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> >
> >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> >                   * decayed cfs_rqs linger on the list.
> >                   */
> >                  if (cfs_rq_is_decayed(cfs_rq))
> >                          list_del_leaf_cfs_rq(cfs_rq);
> >
> >>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> >>> shared frequency domain configuration to be slower than usual. (the load
> >>
> >> What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> >
> >>> average updates also try to update the frequency in schedutil).
> >>>
> >>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
> >>> code paths to take longer than when running at a high CPU frequency.
> >>
> >> Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> >
> > schedule() can result in new idle balance with the CFS pick call happening
> > often. Here is a function graph trace.  The tracer shows
> > update_blocked_averages taking a lot of time.
> >
> >      sugov:0-2454  [002]  2657.992570: funcgraph_entry:                   |  load_balance() {
> >      sugov:0-2454  [002]  2657.992577: funcgraph_entry:                   |    update_group_capacity() {
> >      sugov:0-2454  [002]  2657.992580: funcgraph_entry:        2.656 us   |      __msecs_to_jiffies();
> >      sugov:0-2454  [002]  2657.992585: funcgraph_entry:        2.447 us   |      _raw_spin_lock_irqsave();
> >      sugov:0-2454  [002]  2657.992591: funcgraph_entry:        2.552 us   |      _raw_spin_unlock_irqrestore();
> >      sugov:0-2454  [002]  2657.992595: funcgraph_exit:       + 17.448 us  |    }
> >      sugov:0-2454  [002]  2657.992597: funcgraph_entry:        1.875 us   |    update_nohz_stats();
> >      sugov:0-2454  [002]  2657.992601: funcgraph_entry:        1.667 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992605: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992608: funcgraph_entry:      + 33.333 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992643: funcgraph_exit:       + 38.073 us  |    }
> >      sugov:0-2454  [002]  2657.992645: funcgraph_entry:        1.770 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992649: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992651: funcgraph_entry:      + 41.823 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992694: funcgraph_exit:       + 45.729 us  |    }
> >      sugov:0-2454  [002]  2657.992696: funcgraph_entry:        1.823 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992700: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992702: funcgraph_entry:      + 35.312 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992740: funcgraph_exit:       + 39.792 us  |    }
> >      sugov:0-2454  [002]  2657.992742: funcgraph_entry:        1.771 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992746: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992748: funcgraph_entry:      + 33.438 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992783: funcgraph_exit:       + 37.500 us  |    }
> >      sugov:0-2454  [002]  2657.992785: funcgraph_entry:        1.771 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992790: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992792: funcgraph_entry:      + 45.521 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992839: funcgraph_exit:       + 49.323 us  |    }
> >      sugov:0-2454  [002]  2657.992842: funcgraph_entry:        1.823 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992847: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992850: funcgraph_entry:      + 67.187 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992919: funcgraph_exit:       + 72.031 us  |    }
> >      sugov:0-2454  [002]  2657.992921: funcgraph_entry:        2.760 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992926: funcgraph_entry:                   |    update_nohz_stats() {
> >      sugov:0-2454  [002]  2657.992928: funcgraph_entry:      + 61.146 us  |      update_blocked_averages();
> >      sugov:0-2454  [002]  2657.992992: funcgraph_exit:       + 65.886 us  |    }
> >      sugov:0-2454  [002]  2657.992994: funcgraph_entry:        1.771 us   |    idle_cpu();
> >      sugov:0-2454  [002]  2657.992998: funcgraph_exit:       ! 430.209 us |  }
> >      sugov:0-2454  [002]  2657.993006: bprint:               trace_long: wtf: lb: 432916
> >      sugov:0-2454  [002]  2657.993017: bprint:               trace_long: wtf: newidle_balance: 501458
> >
> >
> >>> The fix is simply rate limit the calls to update_blocked_averages to 20
> >>> times per second. It appears that updating the blocked average less
> >>> often is sufficient. Currently I see about 200 calls per second
> >>
> >> Would be good to explain why updating less often is sufficient ?
> >
> > I don't know this code that well, intuitively it seems to me updating blocked
> > averages at such a high rate seems pointless. But I defer to your expertise
> > on that. Why do you feel an update is needed at least HZ times per second?
> > What about system with HZ=1000 or 300, that seems to be an insane rate of
> > updating (not to mention all the complexity of going through the leaf cgroup
> > list and doing the frequency updates).
>
> I assume this is what you're seeing on your device. This is on tip sched/core
> but should be close to your kernel. I glanced over the diffs in fair.c between
> chromeos-5.4 and tip sched/core and didn't spot any changes in this area.
>
> I ran on a hikey620 w/o CONFIG_SCHED_MC to mimic the 8 CPUs (8 sched groups
> (sg)) in the MC domain (the only sched domain).

you can also modify the DT to emulate 1 MC level with 8 cores

>
> Since nohz.has_blocked=1 in your newidle_balance() calls,
> load_balance() -> update_sd_lb_stats() sets LBF_NOHZ_STATS and calls
> update_sg_lb_stats() for each of the 8 sg's.
>
> Since LBF_NOHZ_STATS is set, update_sg_lb_stats() calls
> update_nohz_stats(..., false) per cpu in sg.
>
> And for a lot of these 8 sg's, i.e. 8 CPUs, update_blocked_averages()
> is called since none of the 3 bail-out conditions:
>
>  (1) !rq->has_blocked_load
>  (2) !cpumask_test_cpu(cpu, nohz.idle_cpus_mask)
>  (3) force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>
> trigger.
>
> We advance nohz.next_blocked by msecs_to_jiffies(LOAD_AVG_PERIOD) (32ms,
> 8 jiffies w/ HZ=250) but we advance 'rq->last_blocked_load_update_tick
> only to jiffies' in update_blocked_load_status().
>
>
> [005] 7370.188469: bprint: nohz_balance_enter_idle: CPU5 nohz.has_blocked=1
> ...
> [005] 7370.210068: bprint: pick_next_task_fair: CPU5
> [005] 7370.210079: bprint: update_sd_lb_stats: CPU5 nohz.has_blocked=1 -> set LBF_NOHZ_STATS
> [005] 7370.210082: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=5 weight=1)
> [005] 7370.210085: bprint: update_nohz_stats: CPU5 cpu=5 not in nohz.idle_cpus_mask -> bail
> [005] 7370.210088: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=6 weight=1)
> [005] 7370.210091: bprint: update_nohz_stats: CPU5 cpu=6 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210112: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=7 weight=1)
> [005] 7370.210116: bprint: update_nohz_stats: CPU5 cpu=7 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210134: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=0 weight=1)
> [005] 7370.210137: bprint: update_nohz_stats: CPU5 cpu=0 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210156: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=1 weight=1)
> [005] 7370.210159: bprint: update_nohz_stats: CPU5 cpu=1 not in nohz.idle_cpus_mask -> bail
> [005] 7370.210162: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=2 weight=1)
> [005] 7370.210165: bprint: update_nohz_stats: CPU5 cpu=2 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210183: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=3 weight=1)
> [005] 7370.210186: bprint: update_nohz_stats: CPU5 cpu=3 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210205: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=4 weight=1)
> [005] 7370.210207: bprint: update_nohz_stats: CPU5 cpu=4 not in nohz.idle_cpus_mask -> bail
> ...
> [005] 7370.444704: bprint: _nohz_idle_balance: CPU5 nohz.has_blocked=0
>
>
> If I understood you correctly, you want to avoid these frequent calls
> to update_blocked_averages() here to further avoid invoking sched_util
> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
> is set) very often in your setup.

So it's not clear if the problem that Joel wants to raise is about:
- the running time of update_blocked_averages()
- the running time of cpufreq_update_util(), which is called because
utilization has decayed during the update of blocked load
- the wakeup latency because of the newly_idle lb

> Since you have up to 6 CPUs in a frequency domain, this could be more
> costly than usual.
Dietmar Eggemann Jan. 25, 2021, 5:53 p.m. UTC | #8
On 25/01/2021 18:30, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 22/01/2021 20:10, Joel Fernandes wrote:
>>> Hi Vincent,
>>>
>>> Thanks for reply. Please see the replies below:
>>>
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>> <joel@joelfernandes.org> wrote:

[...]

>> If I understood you correctly, you want to avoid these frequent calls
>> to update_blocked_averages() here to further avoid invoking sched_util
>> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
>> is set) very often in your setup.
> 
> So It's not clear if the problem that joel wants to raise, is about:
> - the running time of  update_blocked_averages
> - the running time of the cpufreq_update_util which is called because
> utilization has decayed during the update of blocked load
> - the wake up latency because of newly_idle lb

Pretty much so.

IIRC his interest is driven by the fact that he saw much less activity
in the newly_idle lb, and therefore in cpufreq_update_util(), on a system
with the same kernel and userspace but with fewer CPUs (i.e. also smaller
frequency domains) and fewer cgroups (with blocked load), and started
wondering why.

I assume that since he understands this environment now much better, he
should be able to come up with better test numbers to show if there is a
performance issue on his 2+6 DynamIQ system and if yes, where exactly in
this code path.
Qais Yousef Jan. 26, 2021, 4:36 p.m. UTC | #9
On 01/25/21 14:23, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 19:39, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 01/22/21 17:56, Vincent Guittot wrote:
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 04a3ce20da67..fe2dc0024db5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > >                 return false;
> > > >
> > > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> > >
> > > This condition is there to make sure to update blocked load at most
> > > once a tick in order to filter newly idle case otherwise the rate
> > > limit is already done by load balance interval
> > > This hard coded (HZ/20) looks really like an ugly hack
> >
> > This was meant as an RFC patch to discuss the problem really.
> >
> > Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> > processing __update_blocked_fair() and the other half in sugov_update_shared().
> > So roughly 50us each. Note that each function is calling an iterator in
> 
> Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
> which would mean that the update was useful at the end ?

I couldn't reproduce his problem on Juno. But I think it is not actually doing
any frequency update. sugov_update_shared() is rate limited by
sugov_should_update_freq(). Joel has a 1ms rate limit for schedutil in sysfs.
The function traces showed that it is continuously doing the full scan which
indicates that sugov_update_next_freq() is continuously bailing out at

	else if (sg_policy->next_freq == next_freq)
		return false;
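
For reference, a simplified sketch of the path I'm describing (not verbatim
kernel code; the need_freq_update handling and a couple of other checks are
elided):

	/* Gate on rate_limit_us, which is exposed as freq_update_delay_ns. */
	static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
	{
		s64 delta_ns = time - sg_policy->last_freq_update_time;

		return delta_ns >= sg_policy->freq_update_delay_ns;
	}

	static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
					   unsigned int next_freq)
	{
		/* The full scan ran but produced the same frequency: bail out. */
		if (sg_policy->next_freq == next_freq)
			return false;

		sg_policy->next_freq = next_freq;
		sg_policy->last_freq_update_time = time;

		return true;
	}

Since last_freq_update_time is only advanced when the frequency actually
changes, the 1ms rate limit doesn't really stop the full scan from being
repeated on every call while the result stays the same.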

> 
> > return. Correct me if my numbers are wrong Joel.
> >
> > Running on a little core on low frequency these numbers don't look too odd.
> > So I'm not seeing how we can speed these functions up.
> >
> > But since update_sg_lb_stats() will end up with multiple calls to
> > update_blocked_averages() in one go, this latency adds up quickly.
> >
> > One noticeable factor in Joel's system is the presence of a lot of cgroups.
> > Which is essentially what makes __update_blocked_fair() expensive, and it seems
> > to always return something has decayed so we end up with a call to
> > sugov_update_shared() in every call.
> >
> > I think we should limit the expensive call to update_blocked_averages() but
> 
> At the opposite, some will complain that block values  stay stall to
> high value and prevent any useful adjustment.
> 
> Also update_blocked average is already rate limited with idle and busy
> load_balance
> 
> Seems like the problem raised by Joel is the number of newly idle load balance

It could be. When Joel comments out update_blocked_averages() or rate limits
it, the big schedule delay disappears. Which is just an indication of where we
could do better: either by making update_blocked_averages() faster, or by
controlling how often we do it in a row. I thought the latter made more sense,
though implementation-wise I'm not sure how we should do that.
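
Something along these lines maybe (just a sketch of the idea, reusing the
jiffies arithmetic from the RFC; the helper name and the ~20 updates/sec
value are made up for illustration):

	/*
	 * Hypothetical helper: allow at most ~20 blocked load updates per
	 * second per rq. last_blocked_load_update_tick is the jiffies stamp
	 * of the last update.
	 */
	static inline bool blocked_load_update_due(struct rq *rq)
	{
		return time_after(jiffies,
				  rq->last_blocked_load_update_tick + HZ / 20);
	}

Whether the threshold should be a hard coded fraction of HZ or derived from
the existing balance intervals is exactly the part I'm not sure about.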

We might actually help update_blocked_averages() be a bit faster by not
doing cpufreq_update_util() in every call and doing it once in the last call to
update_blocked_averages(). Since the for_each_leaf_cfs_rq_safe() loop
in __update_blocked_fair() seemed expensive too, I'm not sure reducing the
calls to cpufreq_update_util() will be enough.

Thanks

--
Qais Yousef
Joel Fernandes Jan. 27, 2021, 6:43 p.m. UTC | #10
Hi Vincent,

On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > <joel@joelfernandes.org> wrote:
> > > >
> > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > times (due to newidle load balance). I have also seen this manifest in
> > > > the periodic balancer.
> > > >
> > > > Closer look shows that the problem is caused by the following
> > > > ingredients:
> > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > __update_blocked_fair() take a long time.
> > >
> > > Inactive cgroups are removed from the list so they should not impact
> > > the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> 
> How many blocked cgroups have you got ?

I put a counter in the for_each_leaf_cfs_rq_safe() { } loop to count how many
times it runs per new idle balance. When the problem happens I see this loop
run 35-40 times (for one single instance of newidle balance). So in total
there are at least that many cfs_rq load updates.
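
Roughly this kind of instrumentation (sketch only, not the exact debug diff):

	/* Sketch: count leaf cfs_rq walks inside __update_blocked_fair(). */
	struct cfs_rq *cfs_rq, *pos;
	unsigned int nr_walked = 0;

	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		nr_walked++;
		/* existing per-cfs_rq blocked load update unchanged */
	}
	trace_printk("cpu%d walked %u leaf cfs_rqs\n", cpu_of(rq), nr_walked);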

I also see that new idle balance can be called 200-500 times per second.

> >
> >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> >                   * decayed cfs_rqs linger on the list.
> >                   */
> >                  if (cfs_rq_is_decayed(cfs_rq))
> >                          list_del_leaf_cfs_rq(cfs_rq);
> >
> > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > shared frequency domain configuration to be slower than usual. (the load
> > >
> > > What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> 
> Qais mentioned half of the time being used by
> sugov_next_freq_shared(). Are there any frequency changes resulting in
> this call ?

I do not see a frequency update happening at the time of the problem. However,
note that sugov_iowait_boost() does run even if the frequency is not being
updated. IIRC, this function is also not that lightweight and I am not sure
it is a good idea to call it that often.

> > > > average updates also try to update the frequency in schedutil).
> > > >
> > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > code paths to take longer than when running at a high CPU frequency.
> > >
> > > Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> 
> Happening a lot in itself is not a problem if there is nothing else to
> do so it's not a argument in itself

It is a problem - it shows up in the preempt-off critical section latency
tracer. Are you saying it's OK for preemption to be disabled on the system for
500 microseconds? That hurts real-time applications (audio etc.).

> So why is it a problem for you ? You are mentioning newly idle load
> balance so I assume that your root problem is the scheduling delay
> generated by the newly idle load balance which then calls
> update_blocked_averages

Yes, the new idle balance is when I see it happen quite often. I do see it
happen with the other load balances as well, but not as often, since those LBs
don't run as often as the new idle balance.

> 
> rate limiting the call to update_blocked_averages() will only reduce
> the number of time it happens but it will not prevent it to happen.

Sure, but a soft real-time workload can tolerate the issue if it does not
happen very often. In this case though, it is frequent.

> IIUC, your real problem is that newidle_balance is running whereas a
> task is about to wake up on the cpu and we don't abort quickly during
> this load_balance

Yes.

> so we could also try to abort earlier in case of newly idle load balance

I think interrupts are disabled when the load balance runs, so there's no way
for, say, an audio interrupt to even run in order to wake up a task. How would
you know to abort the new idle load balance?

Could you elaborate more also on the drawback of the rate limiting patch we
posted? Do you see a side effect?

> > > > sometimes, which seems overkill.
> > > >
> > > > schbench shows a clear improvement with the change:
> > >
> > > Have you got more details about your test setup ?
> > > which platform ?
> > > which kernel ?
> >
> > I mentioned in the commit message it is a v5.4 kernel.
> 
> I was not sure if the tests results done with this kernel because we
> usually ask for results against mainline to make sure you are not
> facing a problem that has solved since v5.4 has been released

Ok, yes, unfortunately I have a userspace up and running only on the 5.4
kernel. I was hoping that is recent enough for this debugging.

thanks,

 - Joel
Vincent Guittot Jan. 28, 2021, 1:57 p.m. UTC | #11
Hi Joel,

On Wed, 27 Jan 2021 at 19:43, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Vincent,
>
> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > <joel@joelfernandes.org> wrote:
> > > > >
> > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > the periodic balancer.
> > > > >
> > > > > Closer look shows that the problem is caused by the following
> > > > > ingredients:
> > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > __update_blocked_fair() take a long time.
> > > >
> > > > Inactive cgroups are removed from the list so they should not impact
> > > > the duration
> > >
> > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > decayed and not have any tasks running on it but its load needs to be
> > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > 'blocked'.
> >
> > How many blocked cgroups have you got ?
>
> I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> this loop runs per new idle balance. When the problem happens I see this loop
> run 35-40 times (for one single instance of newidle balance). So in total
> there are at least these many cfs_rq load updates.

Do you mean that you have 35-40 cgroups? Or does the 35-40 include all CPUs?

>
> I also see that new idle balance can be called 200-500 times per second.

This is not surprising because newidle_balance() is called every time
the CPU is about to become idle

>
> > >
> > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > >                   * decayed cfs_rqs linger on the list.
> > >                   */
> > >                  if (cfs_rq_is_decayed(cfs_rq))
> > >                          list_del_leaf_cfs_rq(cfs_rq);
> > >
> > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > shared frequency domain configuration to be slower than usual. (the load
> > > >
> > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > >
> > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > frequency update needs to go through utilization of other CPUs in the
> > > cluster. I believe this could be adding to the problem but is not really
> > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > to begin with.
> >
> > Qais mentioned half of the time being used by
> > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > this call ?
>
> I do not see a frequency update happening at the time of the problem. However
> note that sugov_iowait_boost() does run even if frequency is not being
> updated. IIRC, this function is also not that light weight and I am not sure
> if it is a good idea to call this that often.

The scheduler can't make any assumptions about how often schedutil/cpufreq
wants to be called. Some governors are fast and straightforward and can be
called very often to adjust the frequency; others can't handle many
updates. The rate limit mechanism in schedutil and io-boost should be
there for that purpose.
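
(i.e. the rate_limit_us tunable; roughly, at governor start schedutil does
something like the below, and sugov_should_update_freq() then compares the
time since the last frequency update against it; simplified, from memory)

	sg_policy->freq_update_delay_ns =
			sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;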

>
> > > > > average updates also try to update the frequency in schedutil).
> > > > >
> > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > code paths to take longer than when running at a high CPU frequency.
> > > >
> > > > Low frequency usually means low utilization so it should happen that much.
> > >
> > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> >
> > Happening a lot in itself is not a problem if there is nothing else to
> > do so it's not a argument in itself
>
> It is a problem - it shows up in the preempt off critical section latency

But this point is not related to the point above, which is about how
often it happens.

> tracer. Are you saying its Ok for preemption to be disabled on system for 500
> micro seconds?  That hurts real-time applications (audio etc).

So. Is your problem related to real-time applications (audio etc) ?

>
> > So why is it a problem for you ? You are mentioning newly idle load
> > balance so I assume that your root problem is the scheduling delay
> > generated by the newly idle load balance which then calls
> > update_blocked_averages
>
> Yes, the new idle balance is when I see it happen quite often. I do see it
> happen with other load balance as well, but it not that often as those LB
> don't run as often as new idle balance.

The update of the average blocked load has been added in newidle_balance
to take advantage of the cpu becoming idle, but it seems to create a
long preempt-off sequence. I'm going to prepare a patch to move it
out of the schedule path.

>
> >
> > rate limiting the call to update_blocked_averages() will only reduce
> > the number of time it happens but it will not prevent it to happen.
>
> Sure, but soft real-time issue can tolerate if the issue does not happen very
> often. In this case though, it is frequent.

Could you provide details of the problem that you are facing? It's
not clear to me what happens in your case in the end. Have you got an
audio glitch as an example?

"Not often" doesn't really give any clue.

Also, update_blocked_averages() was supposed to be called in newidle_balance()
when the coming idle duration is expected to be long enough.
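
For reference, the heuristic is the avg_idle check, roughly (simplified from
newidle_balance()/nohz_newidle_balance(), other conditions elided):

	/* Will wake up very soon. No time for doing anything else. */
	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;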

>
> > IIUC, your real problem is that newidle_balance is running whereas a
> > task is about to wake up on the cpu and we don't abort quickly during
> > this load_balance
>
> Yes.
>
> > so we could also try to abort earlier in case of newly idle load balance
>
> I think interrupts are disabled when the load balance runs, so there's no way
> for say an audio interrupt to even run in order to wake up a task. How would
> you know to abort the new idle load balance?
>
> Could you elaborate more also on the drawback of the rate limiting patch we
> posted? Do you see a side effect?

Your patch just tries to hide your problem rather than solve the root cause.

>
> > > > > sometimes, which seems overkill.
> > > > >
> > > > > schbench shows a clear improvement with the change:
> > > >
> > > > Have you got more details about your test setup ?
> > > > which platform ?
> > > > which kernel ?
> > >
> > > I mentioned in the commit message it is a v5.4 kernel.
> >
> > I was not sure if the tests results done with this kernel because we
> > usually ask for results against mainline to make sure you are not
> > facing a problem that has solved since v5.4 has been released
>
> Ok, yes I have a userspace up and running only on 5.4 kernel unfortunately. I
> was hoping that is recent enough for this debug.

more than 14 months old is not really recent... It's always good to
have a reproducer against mainline

Regards,
Vincent
>
> thanks,
>
>  - Joel
>
Joel Fernandes Jan. 28, 2021, 3:09 p.m. UTC | #12
Hi Vincent,

On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > the periodic balancer.
> > > > > >
> > > > > > Closer look shows that the problem is caused by the following
> > > > > > ingredients:
> > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > __update_blocked_fair() take a long time.
> > > > >
> > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > the duration
> > > >
> > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > decayed and not have any tasks running on it but its load needs to be
> > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > 'blocked'.
> > >
> > > How many blocked cgroups have you got ?
> >
> > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > this loop runs per new idle balance. When the problem happens I see this loop
> > run 35-40 times (for one single instance of newidle balance). So in total
> > there are at least these many cfs_rq load updates.
>
> Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

All CPUs.

> > I also see that new idle balance can be called 200-500 times per second.
>
> This is not surprising because newidle_balance() is called every time
> the CPU is about to become idle

Sure.

> > > >
> > > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > > >                   * decayed cfs_rqs linger on the list.
> > > >                   */
> > > >                  if (cfs_rq_is_decayed(cfs_rq))
> > > >                          list_del_leaf_cfs_rq(cfs_rq);
> > > >
> > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > >
> > > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > > >
> > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > frequency update needs to go through utilization of other CPUs in the
> > > > cluster. I believe this could be adding to the problem but is not really
> > > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > > to begin with.
> > >
> > > Qais mentioned half of the time being used by
> > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > this call ?
> >
> > I do not see a frequency update happening at the time of the problem. However
> > note that sugov_iowait_boost() does run even if frequency is not being
> > updated. IIRC, this function is also not that light weight and I am not sure
> > if it is a good idea to call this that often.
>
> Scheduler can't make any assumption about how often schedutil/cpufreq
> wants to be called. Some are fast and straightforward and can be
> called very often to adjust frequency; Others can't handle much
> updates. The rate limit mechanism in schedutil and io-boost should be
> there for such purpose.

Sure, I know that's the intention.

> > > > > > average updates also try to update the frequency in schedutil).
> > > > > >
> > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > >
> > > > > Low frequency usually means low utilization so it should happen that much.
> > > >
> > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > >
> > > Happening a lot in itself is not a problem if there is nothing else to
> > > do so it's not a argument in itself
> >
> > It is a problem - it shows up in the preempt off critical section latency
>
> But this point is not related to the point above which is about how
> often it happens.

It is related. A bad thing happening quite often is worse than a bad
thing happening infrequently. I agree it is not a root cause fix, but
it makes things "better".

> > tracer. Are you saying its Ok for preemption to be disabled on system for 500
> > micro seconds?  That hurts real-time applications (audio etc).
>
> So. Is your problem related to real-time applications (audio etc) ?

Yes it is. I don't have a reproducer, and it could be that the audio
buffering will hide the problem. But that doesn't mean that there is
no problem or that we cannot improve things. There will also likely be a
power consumption improvement.

> > > So why is it a problem for you ? You are mentioning newly idle load
> > > balance so I assume that your root problem is the scheduling delay
> > > generated by the newly idle load balance which then calls
> > > update_blocked_averages
> >
> > Yes, the new idle balance is when I see it happen quite often. I do see it
> > happen with other load balance as well, but it not that often as those LB
> > don't run as often as new idle balance.
>
> The update of average blocked load has been added in newidle_balance
> to take advantage of the cpu becoming idle but it seems to create a
> long preempt off sequence. I 'm going to prepare a patch to move it
> out the schedule path.

Ok thanks, that would really help!

> > > rate limiting the call to update_blocked_averages() will only reduce
> > > the number of time it happens but it will not prevent it to happen.
> >
> > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > often. In this case though, it is frequent.
>
> Could you provide details of the problem that you are facing ? It's
> not clear for me what happens in your case at the end. Have you got an
> audio glitch as an example?
>
> "Not often" doesn't really give any clue.

I believe from my side I have provided details. I shared output from the
function graph tracer and the schbench microbenchmark clearly showing the
issue and the improvements. Sorry, I don't have a real-world reproducer
for this.

> Also update_blocked_averages was supposed called in newlyidle_balance
> when the coming idle duration is expected to be long enough

No, we do not want the schedule loop to take half a millisecond.

> > > IIUC, your real problem is that newidle_balance is running whereas a
> > > task is about to wake up on the cpu and we don't abort quickly during
> > > this load_balance
> >
> > Yes.
> >
> > > so we could also try to abort earlier in case of newly idle load balance
> >
> > I think interrupts are disabled when the load balance runs, so there's no way
> > for say an audio interrupt to even run in order to wake up a task. How would
> > you know to abort the new idle load balance?
> >
> > Could you elaborate more also on the drawback of the rate limiting patch we
> > posted? Do you see a side effect?
>
> Your patch just tries to hide your problem and not to solve the root cause.

Agreed, the solution presented is a band-aid and not a proper fix. It
was just intended to illustrate the problem and start a discussion. I
should have marked it RFC for sure.

thanks!

- Joel
Qais Yousef Jan. 28, 2021, 4:57 p.m. UTC | #13
On 01/28/21 10:09, Joel Fernandes wrote:
> > > > Qais mentioned half of the time being used by
> > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > this call ?
> > >
> > > I do not see a frequency update happening at the time of the problem. However
> > > note that sugov_iowait_boost() does run even if frequency is not being
> > > updated. IIRC, this function is also not that light weight and I am not sure
> > > if it is a good idea to call this that often.
> >
> > Scheduler can't make any assumption about how often schedutil/cpufreq
> > wants to be called. Some are fast and straightforward and can be
> > called very often to adjust frequency; Others can't handle much
> > updates. The rate limit mechanism in schedutil and io-boost should be
> > there for such purpose.
> 
> Sure, I know that's the intention.

How about the below patch to help with reducing the impact of
sugov_update_shared()?

I basically made sure it'll bail out immediately if it gets 2 calls within the
defined rate_limit_us value.

And I tweaked the way we call cpufreq_update_util() from
update_blocked_averages() too, so that we first update the blocked load on all
cpus, then we ask for the frequency update. Combined with the above, this
should result in a single call to sugov_update_shared() for each policy. Each
call should see the final state of what the load really is.

Only compile tested!

Thanks

--
Qais Yousef

----->8-----


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6931f0cdeb80..bd83e9343749 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -109,7 +109,6 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
 		return false;
 
 	sg_policy->next_freq = next_freq;
-	sg_policy->last_freq_update_time = time;
 
 	return true;
 }
@@ -554,20 +553,23 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	raw_spin_lock(&sg_policy->update_lock);
 
+	ignore_dl_rate_limit(sg_cpu, sg_policy);
+
+	if (!sugov_should_update_freq(sg_policy, time))
+		goto out;
+
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
 
-	ignore_dl_rate_limit(sg_cpu, sg_policy);
-
-	if (sugov_should_update_freq(sg_policy, time)) {
-		next_f = sugov_next_freq_shared(sg_cpu, time);
+	next_f = sugov_next_freq_shared(sg_cpu, time);
 
-		if (sg_policy->policy->fast_switch_enabled)
-			sugov_fast_switch(sg_policy, time, next_f);
-		else
-			sugov_deferred_update(sg_policy, time, next_f);
-	}
+	if (sg_policy->policy->fast_switch_enabled)
+		sugov_fast_switch(sg_policy, time, next_f);
+	else
+		sugov_deferred_update(sg_policy, time, next_f);
 
+	sg_policy->last_freq_update_time = time;
+out:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..47bd8be03a6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8033,7 +8033,7 @@ static unsigned long task_h_load(struct task_struct *p)
 }
 #endif
 
-static void update_blocked_averages(int cpu)
+static void update_blocked_averages(int cpu, bool update_freq)
 {
 	bool decayed = false, done = true;
 	struct rq *rq = cpu_rq(cpu);
@@ -8046,7 +8046,7 @@ static void update_blocked_averages(int cpu)
 	decayed |= __update_blocked_fair(rq, &done);
 
 	update_blocked_load_status(rq, !done);
-	if (decayed)
+	if (decayed && update_freq)
 		cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
@@ -8384,7 +8384,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
 	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
 		return true;
 
-	update_blocked_averages(cpu);
+	update_blocked_averages(cpu, false);
 
 	return rq->has_blocked_load;
 #else
@@ -8454,6 +8454,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
+	/*
+	 * We did a bunch of blocked average updates, let cpufreq know about
+	 * them in one go. For systems with a lot of cpus on a frequency domain
+	 * this will make sure we take all the changes that affect the policy
+	 * in one go.
+	 */
+	for_each_cpu_and(i, sched_group_span(group), env->cpus)
+		cpufreq_update_util(cpu_rq(i), 0);
+
 	/* Check if dst CPU is idle and preferred to this group */
 	if (env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE &&
@@ -10464,7 +10473,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 
 	/* Newly idle CPU doesn't need an update */
 	if (idle != CPU_NEWLY_IDLE) {
-		update_blocked_averages(this_cpu);
+		update_blocked_averages(this_cpu, false);
 		has_blocked_load |= this_rq->has_blocked_load;
 	}
 
@@ -10478,6 +10487,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 	ret = true;
 
 abort:
+	/*
+	 * We did a bunch of blocked average updates, let cpufreq know about
+	 * them in one go. For systems with a lot of cpus on a frequency domain
+	 * this will make sure we take all the changes that affect the policy
+	 * in one go.
+	 */
+	for_each_cpu(balance_cpu, nohz.idle_cpus_mask)
+		cpufreq_update_util(cpu_rq(balance_cpu), 0);
+
 	/* There is still blocked load, enable periodic update */
 	if (has_blocked_load)
 		WRITE_ONCE(nohz.has_blocked, 1);
@@ -10603,7 +10621,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_unlock(&this_rq->lock);
 
-	update_blocked_averages(this_cpu);
+	update_blocked_averages(this_cpu, true);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
@@ -10691,7 +10709,7 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
 		return;
 
 	/* normal load balance */
-	update_blocked_averages(this_rq->cpu);
+	update_blocked_averages(this_rq->cpu, true);
 	rebalance_domains(this_rq, idle);
 }
Vincent Guittot Jan. 29, 2021, 5:27 p.m. UTC | #14
On Friday, Jan 29, 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi Vincent,
> >
> > On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > > > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > > > <joel@joelfernandes.org> wrote:
> > > > > > > >
> > > > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > > > the periodic balancer.
> > > > > > > >
> > > > > > > > Closer look shows that the problem is caused by the following
> > > > > > > > ingredients:
> > > > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > > > __update_blocked_fair() take a long time.
> > > > > > >
> > > > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > > > the duration
> > > > > >
> > > > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > > > decayed and not have any tasks running on it but its load needs to be
> > > > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > > > 'blocked'.
> > > > >
> > > > > How many blocked cgroups have you got ?
> > > >
> > > > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > > > this loop runs per new idle balance. When the problem happens I see this loop
> > > > run 35-40 times (for one single instance of newidle balance). So in total
> > > > there are at least these many cfs_rq load updates.
> > >
> > > Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?
> >
> > All CPUs.
> >
> > > > I also see that new idle balance can be called 200-500 times per second.
> > >
> > > This is not surprising because newidle_balance() is called every time
> > > the CPU is about to become idle
> >
> > Sure.
> >
> > > > > >
> > > > > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > > > > >                   * decayed cfs_rqs linger on the list.
> > > > > >                   */
> > > > > >                  if (cfs_rq_is_decayed(cfs_rq))
> > > > > >                          list_del_leaf_cfs_rq(cfs_rq);
> > > > > >
> > > > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > > > >
> > > > > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > > > > >
> > > > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > > > frequency update needs to go through utilization of other CPUs in the
> > > > > > cluster. I believe this could be adding to the problem but is not really
> > > > > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > > > > to begin with.
> > > > >
> > > > > Qais mentioned half of the time being used by
> > > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > > this call ?
> > > >
> > > > I do not see a frequency update happening at the time of the problem. However
> > > > note that sugov_iowait_boost() does run even if frequency is not being
> > > > updated. IIRC, this function is also not that light weight and I am not sure
> > > > if it is a good idea to call this that often.
> > >
> > > Scheduler can't make any assumption about how often schedutil/cpufreq
> > > wants to be called. Some are fast and straightforward and can be
> > > called very often to adjust frequency; Others can't handle much
> > > updates. The rate limit mechanism in schedutil and io-boost should be
> > > there for such purpose.
> >
> > Sure, I know that's the intention.
> >
> > > > > > > > average updates also try to update the frequency in schedutil).
> > > > > > > >
> > > > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > > > >
> > > > > > > Low frequency usually means low utilization so it should happen that much.
> > > > > >
> > > > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > > > >
> > > > > Happening a lot in itself is not a problem if there is nothing else to
> > > > > do so it's not a argument in itself
> > > >
> > > > It is a problem - it shows up in the preempt off critical section latency
> > >
> > > But this point is not related to the point above which is about how
> > > often it happens.
> >
> > It is related. A bad thing happening quite often is worse than a bad
> > thing happening infrequently. I agree it is not a root cause fix, but
> > it makes things "better".
> >
> > > > tracer. Are you saying its Ok for preemption to be disabled on system for 500
> > > > micro seconds?  That hurts real-time applications (audio etc).
> > >
> > > So. Is your problem related to real-time applications (audio etc) ?
> >
> > Yes it is. I don't have a reproducer and it could be the audio
> > buffering will hide the problem. But that doesn't mean that there is
> > no problem or that we cannot improve things. There will also likely be
> > power consumption improvement.
> >
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balance as well, but it not that often as those LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I 'm going to prepare a patch to move it
> > > out the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
> 
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
> 
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved. As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.
> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay.... If
> you look at your results for schbench -t 2 -m 2 -r 5: The  *99.0th:
> 12656 (8 samples) shows a delayed of around 12ms which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this results does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> 
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473
> 
> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400
> 
> So a quick and wrong conclusion could be to say that you should disable EAS ;-)
> 
> All that to say that your schbench's results are not relevant and I
> disagree when you said that you have provided all details of your
> problem.
> 
> Then, your function graph shows a half millisecond duration but there
> are no details about the context. It seems that your trace happens
> after that sugov just changed the freq. But it would be not surprising
> to have such a large duration when running at the lowest
> capacity/frequency of the system as an example.
> 
> All that to say that the main reason of my questions is just to make
> sure that the problem is fully understood and the results are
> relevants, which doesn't seem to be the case for schbench at least.
> 
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of csf_rq to update and I'm going
> to provide a patchset for this

The patch below moves the update of the blocked load of CPUs outside newidle_balance().

Instead, the update is done with the usual idle load balance update. I'm working on an
additional patch that will select the cpu that is about to become idle, instead of a
random idle cpu, but this 1st step fixes the problem of the many updates in newly idle.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 32 +++-----------------------------
 1 file changed, 3 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 197a51473e0c..8200b1d4df3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7421,8 +7421,6 @@ enum migration_type {
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
-#define LBF_NOHZ_STATS	0x10
-#define LBF_NOHZ_AGAIN	0x20
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
-		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
-			env->flags |= LBF_NOHZ_AGAIN;
-
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
 		sgs->group_runnable += cpu_runnable(rq);
@@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats tmp_sgs;
 	int sg_status = 0;
 
-#ifdef CONFIG_NO_HZ_COMMON
-	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
-		env->flags |= LBF_NOHZ_STATS;
-#endif
-
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
@@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/* Tag domain that child domain prefers tasks go to siblings first */
 	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 
-#ifdef CONFIG_NO_HZ_COMMON
-	if ((env->flags & LBF_NOHZ_AGAIN) &&
-	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
-
-		WRITE_ONCE(nohz.next_blocked,
-			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
-	}
-#endif
-
 	if (env->sd->flags & SD_NUMA)
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
@@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 		return;
 
 	raw_spin_unlock(&this_rq->lock);
-	/*
-	 * This CPU is going to be idle and blocked load of idle CPUs
-	 * need to be updated. Run the ilb locally as it is a good
-	 * candidate for ilb instead of waking up another idle CPU.
-	 * Kick an normal ilb if we failed to do the update.
-	 */
-	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
-		kick_ilb(NOHZ_STATS_KICK);
+	kick_ilb(NOHZ_STATS_KICK);
 	raw_spin_lock(&this_rq->lock);
 }
 
@@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
-		nohz_newidle_balance(this_rq);
-
 		goto out;
 	}
 
@@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
+	else
+		nohz_newidle_balance(this_rq);
 
 	rq_repin_lock(this_rq, rf);
Joel Fernandes Feb. 1, 2021, 3:13 p.m. UTC | #15
On Fri, Jan 29, 2021 at 5:33 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
[...]
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balance as well, but it not that often as those LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I 'm going to prepare a patch to move it
> > > out the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved.
> As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.

I disagree. Here you are assuming that schbench is the only task
running on the system. There are other processes and daemons as well. I
see a strong correlation between commenting out
update_blocked_averages() and not seeing the latency hit at the higher
percentiles.

> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay.... If
> you look at your results for schbench -t 2 -m 2 -r 5: The  *99.0th:
> 12656 (8 samples) shows a delayed of around 12ms which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this results does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473

Sure, there could be a different problem causing these higher
percentile latencies on your device. I still think 12ms is awful.

> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400

Yes, definitely changing the governor also solves the problem (for
example, if the performance governor is used). The problem happens at
low frequencies.

> So a quick and wrong conclusion could be to say that you should disable EAS ;-)

Sure, except the power management guys may come after me ;-)

[..]
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of csf_rq to update and I'm going
> to provide a patchset for this

That's fine, as long as we agree on this problem ;-) Thanks for
providing the patches and I will try them once they are ready.

> > for this.
> >
> > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > when the coming idle duration is expected to be long enough
> >
> > No, we do not want the schedule loop to take half a millisecond.
>
> keep in mind that you are scaling frequency so everything takes time
> at lowest frequency/capacity ...

Agreed, I was also thinking the same. But that doesn't change the fact
that there is room for improvement and I'm grateful to you for trying
to improve it!

thanks,

 - Joel
Dietmar Eggemann Feb. 3, 2021, 11:54 a.m. UTC | #16
On 29/01/2021 18:27, Vincent Guittot wrote:
> On Friday, Jan 29, 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>
>>> Hi Vincent,
>>>
>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>> <vincent.guittot@linaro.org> wrote:
>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>> <joel@joelfernandes.org> wrote:

[...]

>> The only point that I agree with, is that running
>> update_blocked_averages with preempt and irq off is not a good thing
>> because we don't manage the number of csf_rq to update and I'm going
>> to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additonnal patch that will select this cpu that is about to become idle, instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.

I'm trying to understand the need for this extra patch.

The patch below moves away from doing update_blocked_averages() (1) for
all CPUs in sched groups of the sched domain:

newidle_balance()->load_balance()->
find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()

to:

calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
update_nohz_stats() and for the ilb CPU.

newidle_balance() already calls (1) for the newidle CPU.

What would be the benefit of choosing the newidle CPU as the ilb CPU?

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 32 +++-----------------------------
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK	0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED	0x08
> -#define LBF_NOHZ_STATS	0x10
> -#define LBF_NOHZ_AGAIN	0x20
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
>  
> -		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> -			env->flags |= LBF_NOHZ_AGAIN;
> -
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
>  		sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> -		env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
>  		int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	/* Tag domain that child domain prefers tasks go to siblings first */
>  	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if ((env->flags & LBF_NOHZ_AGAIN) &&
> -	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> -		WRITE_ONCE(nohz.next_blocked,
> -			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -	}
> -#endif
> -
>  	if (env->sd->flags & SD_NUMA)
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  		return;
>  
>  	raw_spin_unlock(&this_rq->lock);
> -	/*
> -	 * This CPU is going to be idle and blocked load of idle CPUs
> -	 * need to be updated. Run the ilb locally as it is a good
> -	 * candidate for ilb instead of waking up another idle CPU.
> -	 * Kick an normal ilb if we failed to do the update.
> -	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);
>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		nohz_newidle_balance(this_rq);
> -
>  		goto out;
>  	}
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> +	else
> +		nohz_newidle_balance(this_rq);
>  
>  	rq_repin_lock(this_rq, rf);
Vincent Guittot Feb. 3, 2021, 1:12 p.m. UTC | #17
On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 29/01/2021 18:27, Vincent Guittot wrote:
> > On Friday, Jan 29, 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> >> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> >>> <vincent.guittot@linaro.org> wrote:
> >>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> >>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >>>>>>>> <joel@joelfernandes.org> wrote:
>
> [...]
>
> >> The only point that I agree with, is that running
> >> update_blocked_averages with preempt and irq off is not a good thing
> >> because we don't manage the number of csf_rq to update and I'm going
> >> to provide a patchset for this
> >
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm working on an
> > additonnal patch that will select this cpu that is about to become idle, instead of a
> > random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
>
> I'm trying to understand the need for this extra patch.
>
> The patch below moves away from doing update_blocked_averages() (1) for
> all CPUs in sched groups of the sched domain:
>
> newidle_balance()->load_balance()->
> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>
> to:
>
> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
> update_nohz_stats() and for the ilb CPU.
>
> newidle_balance() calls (1) for newidle CPU already.
>
> What would be the benefit to choose newidle CPU as ilb CPU?

To prevent waking up another idle cpu to run the update when the
newidle cpu is already awake and about to become idle, so it is the best
candidate.
The whole aim of the removed code was to prevent waking up an idle cpu
to do something that could be done by the newidle cpu before it
enters the idle state.

>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 32 +++-----------------------------
> >  1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> >  #define LBF_NEED_BREAK       0x02
> >  #define LBF_DST_PINNED  0x04
> >  #define LBF_SOME_PINNED      0x08
> > -#define LBF_NOHZ_STATS       0x10
> > -#define LBF_NOHZ_AGAIN       0x20
> >
> >  struct lb_env {
> >       struct sched_domain     *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >       for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >               struct rq *rq = cpu_rq(i);
> >
> > -             if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> > -                     env->flags |= LBF_NOHZ_AGAIN;
> > -
> >               sgs->group_load += cpu_load(rq);
> >               sgs->group_util += cpu_util(i);
> >               sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       struct sg_lb_stats tmp_sgs;
> >       int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > -             env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> >       do {
> >               struct sg_lb_stats *sgs = &tmp_sgs;
> >               int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       /* Tag domain that child domain prefers tasks go to siblings first */
> >       sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if ((env->flags & LBF_NOHZ_AGAIN) &&
> > -         cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > -             WRITE_ONCE(nohz.next_blocked,
> > -                        jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > -     }
> > -#endif
> > -
> >       if (env->sd->flags & SD_NUMA)
> >               env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >               return;
> >
> >       raw_spin_unlock(&this_rq->lock);
> > -     /*
> > -      * This CPU is going to be idle and blocked load of idle CPUs
> > -      * need to be updated. Run the ilb locally as it is a good
> > -      * candidate for ilb instead of waking up another idle CPU.
> > -      * Kick an normal ilb if we failed to do the update.
> > -      */
> > -     if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> > -             kick_ilb(NOHZ_STATS_KICK);
> > +     kick_ilb(NOHZ_STATS_KICK);
> >       raw_spin_lock(&this_rq->lock);
> >  }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >                       update_next_balance(sd, &next_balance);
> >               rcu_read_unlock();
> >
> > -             nohz_newidle_balance(this_rq);
> > -
> >               goto out;
> >       }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >       if (pulled_task)
> >               this_rq->idle_stamp = 0;
> > +     else
> > +             nohz_newidle_balance(this_rq);
> >
> >       rq_repin_lock(this_rq, rf);
Qais Yousef Feb. 3, 2021, 5:09 p.m. UTC | #18
On 01/29/21 18:27, Vincent Guittot wrote:
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additional patch that will select this cpu that is about to become idle, instead of a
> random idle cpu, but this 1st step fixes the problem of a lot of updates in newly idle.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 32 +++-----------------------------
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK	0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED	0x08
> -#define LBF_NOHZ_STATS	0x10
> -#define LBF_NOHZ_AGAIN	0x20
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
>  
> -		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> -			env->flags |= LBF_NOHZ_AGAIN;
> -
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
>  		sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> -		env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
>  		int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	/* Tag domain that child domain prefers tasks go to siblings first */
>  	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if ((env->flags & LBF_NOHZ_AGAIN) &&
> -	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> -		WRITE_ONCE(nohz.next_blocked,
> -			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -	}
> -#endif
> -
>  	if (env->sd->flags & SD_NUMA)
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  		return;
>  
>  	raw_spin_unlock(&this_rq->lock);
> -	/*
> -	 * This CPU is going to be idle and blocked load of idle CPUs
> -	 * need to be updated. Run the ilb locally as it is a good
> -	 * candidate for ilb instead of waking up another idle CPU.
> -	 * Kick an normal ilb if we failed to do the update.
> -	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))

Since we removed the call to this function (which uses this_rq)

> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);

And unconditionally call kick_ilb(), which will find a suitable cpu to run the
lb on regardless of what this_rq is.

Doesn't the below become unnecessary now?

	  10494         /*
	  10495          * This CPU doesn't want to be disturbed by scheduler
	  10496          * housekeeping
	  10497          */
	  10498         if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
	  10499                 return;
	  10500
	  10501         /* Will wake up very soon. No time for doing anything else*/
	  10502         if (this_rq->avg_idle < sysctl_sched_migration_cost)
	  10503                 return;

And we can drop this_rq arg altogether?
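
For reference, with the hunk above applied the function would be reduced to
roughly this (reconstructed from the diff and surrounding context, not
compiled):

        static void nohz_newidle_balance(struct rq *this_rq)
        {
                int this_cpu = this_rq->cpu;

                /* This CPU doesn't want to be disturbed by scheduler housekeeping */
                if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
                        return;

                /* Will wake up very soon. No time for doing anything else */
                if (this_rq->avg_idle < sysctl_sched_migration_cost)
                        return;

                /* Don't need to update blocked load of idle CPUs */
                if (!READ_ONCE(nohz.has_blocked) ||
                    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
                        return;

                raw_spin_unlock(&this_rq->lock);
                /* some other idle CPU from nohz.idle_cpus_mask does the update */
                kick_ilb(NOHZ_STATS_KICK);
                raw_spin_lock(&this_rq->lock);
        }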

>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		nohz_newidle_balance(this_rq);
> -
>  		goto out;
>  	}
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> +	else
> +		nohz_newidle_balance(this_rq);

Since nohz_newidle_balance() will not do any real work now, I couldn't figure
out what moving this here achieves. Most likely a failure on my end to parse
the change :-)

Joel can still test this patch as is of course. This is just an early review
since I already spent the time trying to understand it.

Thanks

--
Qais Yousef

>  
>  	rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1
Vincent Guittot Feb. 3, 2021, 5:35 p.m. UTC | #19
On Wed, 3 Feb 2021 at 18:09, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 01/29/21 18:27, Vincent Guittot wrote:
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm working on an
> > additional patch that will select this cpu that is about to become idle, instead of a
> > random idle cpu, but this 1st step fixes the problem of a lot of updates in newly idle.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 32 +++-----------------------------
> >  1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> >  #define LBF_NEED_BREAK       0x02
> >  #define LBF_DST_PINNED  0x04
> >  #define LBF_SOME_PINNED      0x08
> > -#define LBF_NOHZ_STATS       0x10
> > -#define LBF_NOHZ_AGAIN       0x20
> >
> >  struct lb_env {
> >       struct sched_domain     *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >       for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >               struct rq *rq = cpu_rq(i);
> >
> > -             if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> > -                     env->flags |= LBF_NOHZ_AGAIN;
> > -
> >               sgs->group_load += cpu_load(rq);
> >               sgs->group_util += cpu_util(i);
> >               sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       struct sg_lb_stats tmp_sgs;
> >       int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > -             env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> >       do {
> >               struct sg_lb_stats *sgs = &tmp_sgs;
> >               int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >       /* Tag domain that child domain prefers tasks go to siblings first */
> >       sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -     if ((env->flags & LBF_NOHZ_AGAIN) &&
> > -         cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > -             WRITE_ONCE(nohz.next_blocked,
> > -                        jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > -     }
> > -#endif
> > -
> >       if (env->sd->flags & SD_NUMA)
> >               env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >               return;
> >
> >       raw_spin_unlock(&this_rq->lock);
> > -     /*
> > -      * This CPU is going to be idle and blocked load of idle CPUs
> > -      * need to be updated. Run the ilb locally as it is a good
> > -      * candidate for ilb instead of waking up another idle CPU.
> > -      * Kick an normal ilb if we failed to do the update.
> > -      */
> > -     if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>
> Since we removed the call to this function (which uses this_rq)
>
> > -             kick_ilb(NOHZ_STATS_KICK);
> > +     kick_ilb(NOHZ_STATS_KICK);
>
> And unconditionally call kick_ilb() which will find a suitable cpu to run the
> lb at regardless what this_rq is.
>
> Doesn't the below become unnecessary now?

The end goal is to keep running on this cpu that is about to become idle.

This patch is mainly there to check that Joel's problem disappears if
the update of the blocked load of the cpus is not done in the
newidle_balance context. I'm preparing a few other patches on top to
clean up the full path.

>
>           10494         /*
>           10495          * This CPU doesn't want to be disturbed by scheduler
>           10496          * housekeeping
>           10497          */
>           10498         if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
>           10499                 return;
>           10500
>           10501         /* Will wake up very soon. No time for doing anything else*/
>           10502         if (this_rq->avg_idle < sysctl_sched_migration_cost)
>           10503                 return;
>
> And we can drop this_rq arg altogether?
>
> >       raw_spin_lock(&this_rq->lock);
> >  }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >                       update_next_balance(sd, &next_balance);
> >               rcu_read_unlock();
> >
> > -             nohz_newidle_balance(this_rq);
> > -
> >               goto out;
> >       }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >       if (pulled_task)
> >               this_rq->idle_stamp = 0;
> > +     else
> > +             nohz_newidle_balance(this_rq);
>
> Since nohz_newidle_balance() will not do any real work now, I couldn't figure
> out what moving this here achieves. Fault from my end to parse the change most
> likely :-)

The goal is to schedule the update only if we are about to go idle and
nothing else has been queued in the meantime.
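
Concretely, at the tail of newidle_balance() the relevant checks are roughly
(sketch from memory, not a verbatim copy):

        /*
         * The rq lock was released while walking the domains, so a task may
         * have been enqueued in the meantime; if so, behave as if we pulled
         * one and skip the deferred nohz stats update.
         */
        if (this_rq->cfs.h_nr_running && !pulled_task)
                pulled_task = 1;

        /* Same if a task from a higher priority class showed up. */
        if (this_rq->nr_running != this_rq->cfs.h_nr_running)
                pulled_task = -1;

        if (pulled_task)
                this_rq->idle_stamp = 0;
        else
                nohz_newidle_balance(this_rq);  /* really about to go idle */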

>
> Joel can still test this patch as is of course. This is just an early review
> since I already spent the time trying to understand it.
>
> Thanks
>
> --
> Qais Yousef
>
> >
> >       rq_repin_lock(this_rq, rf);
> >
> > --
> > 2.17.1
Joel Fernandes Feb. 3, 2021, 7:56 p.m. UTC | #20
On Fri, Jan 29, 2021 at 06:27:27PM +0100, Vincent Guittot wrote:
[...]
> > update_blocked_averages with preempt and irq off is not a good thing
> > because we don't manage the number of cfs_rq to update and I'm going
> > to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additional patch that will select this cpu that is about to become idle, instead of a
> random idle cpu, but this 1st step fixes the problem of a lot of updates in newly idle.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

I confirmed that with this patch, I don't see the preemptoff issues related
to update_blocked_averages() anymore (tested using preemptoff tracer).

I went through the patch and it looks correct to me, I will further review it
and await further reviews from others as well, and then backport the patch to
our kernels. Thanks Vince and everyone!

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel



> ---
>  kernel/sched/fair.c | 32 +++-----------------------------
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK	0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED	0x08
> -#define LBF_NOHZ_STATS	0x10
> -#define LBF_NOHZ_AGAIN	0x20
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
>  
> -		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> -			env->flags |= LBF_NOHZ_AGAIN;
> -
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util(i);
>  		sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> -		env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
>  		int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	/* Tag domain that child domain prefers tasks go to siblings first */
>  	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> -	if ((env->flags & LBF_NOHZ_AGAIN) &&
> -	    cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> -		WRITE_ONCE(nohz.next_blocked,
> -			   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -	}
> -#endif
> -
>  	if (env->sd->flags & SD_NUMA)
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  		return;
>  
>  	raw_spin_unlock(&this_rq->lock);
> -	/*
> -	 * This CPU is going to be idle and blocked load of idle CPUs
> -	 * need to be updated. Run the ilb locally as it is a good
> -	 * candidate for ilb instead of waking up another idle CPU.
> -	 * Kick an normal ilb if we failed to do the update.
> -	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);
>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		nohz_newidle_balance(this_rq);
> -
>  		goto out;
>  	}
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> +	else
> +		nohz_newidle_balance(this_rq);
>  
>  	rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1
> 
> 
> > 
> > > for this.
> > >
> > > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > > when the coming idle duration is expected to be long enough
> > >
> > > No, we do not want the schedule loop to take half a millisecond.
> > 
> > keep in mind that you are scaling frequency so everything takes time
> > at lowest frequency/capacity ...
> > 
> > >
> > > > > > IIUC, your real problem is that newidle_balance is running whereas a
> > > > > > task is about to wake up on the cpu and we don't abort quickly during
> > > > > > this load_balance
> > > > >
> > > > > Yes.
> > > > >
> > > > > > so we could also try to abort earlier in case of newly idle load balance
> > > > >
> > > > > I think interrupts are disabled when the load balance runs, so there's no way
> > > > > for say an audio interrupt to even run in order to wake up a task. How would
> > > > > you know to abort the new idle load balance?
> > > > >
> > > > > Could you elaborate more also on the drawback of the rate limiting patch we
> > > > > posted? Do you see a side effect?
> > > >
> > > > Your patch just tries to hide your problem and not to solve the root cause.
> > >
> > > Agreed, the solution presented is a band aid and not a proper fix. It
> > > was just intended to illustrate the problem and start a discussion. I
> > > should have marked it RFC for sure.
> > >
> > > thanks!
> > >
> > > - Joel
Dietmar Eggemann Feb. 4, 2021, 9:47 a.m. UTC | #21
On 03/02/2021 14:12, Vincent Guittot wrote:
> On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 29/01/2021 18:27, Vincent Guittot wrote:
>>> On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>>>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>>>> <vincent.guittot@linaro.org> wrote:
>>>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>>>> <joel@joelfernandes.org> wrote:
>>
>> [...]
>>
>>>> The only point that I agree with, is that running
>>>> update_blocked_averages with preempt and irq off is not a good thing
>>>> because we don't manage the number of cfs_rq to update and I'm going
>>>> to provide a patchset for this
>>>
>>> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>>>
>>> Instead, the update is done with the usual idle load balance update. I'm working on an
>>> additional patch that will select this cpu that is about to become idle, instead of a
>>> random idle cpu, but this 1st step fixes the problem of a lot of updates in newly idle.
>>
>> I'm trying to understand the need for this extra patch.
>>
>> The patch below moves away from doing update_blocked_averages() (1) for
>> all CPUs in sched groups of the sched domain:
>>
>> newidle_balance()->load_balance()->
>> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>>
>> to:
>>
>> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
>> update_nohz_stats() and for the ilb CPU.
>>
>> newidle_balance() calls (1) for newidle CPU already.
>>
>> What would be the benefit to choose newidle CPU as ilb CPU?
> 
> To prevent waking up another idle cpu to run the update whereas
> newidle cpu is already woken up and about to be idle so the best
> candidate.
> All the aim of the removed code was to prevent waking up an idle cpu
> for doing something that could be done by the newidle cpu before it
> enters idle state

Ah OK, makes sense. So the only diff would be that the newidle CPU will call
(1) on nohz.idle_cpus_mask rather than on sd->span.

[...]
Qais Yousef Feb. 4, 2021, 10:45 a.m. UTC | #22
On 02/03/21 18:35, Vincent Guittot wrote:
> > >       raw_spin_unlock(&this_rq->lock);
> > > -     /*
> > > -      * This CPU is going to be idle and blocked load of idle CPUs
> > > -      * need to be updated. Run the ilb locally as it is a good
> > > -      * candidate for ilb instead of waking up another idle CPU.
> > > -      * Kick an normal ilb if we failed to do the update.
> > > -      */
> > > -     if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >
> > Since we removed the call to this function (which uses this_rq)
> >
> > > -             kick_ilb(NOHZ_STATS_KICK);
> > > +     kick_ilb(NOHZ_STATS_KICK);
> >
> > And unconditionally call kick_ilb() which will find a suitable cpu to run the
> > lb at regardless what this_rq is.
> >
> > Doesn't the below become unnecessary now?
> 
> The end goal is to keep running on this cpu that is about to become idle.
> 
> This patch is mainly  there to check that Joel's problem disappears if
> the update of the blocked load of the cpus is not done in the
> newidle_balance context. I'm preparing few other patches on top to
> clean up the full path

+1

> >
> >           10494         /*
> >           10495          * This CPU doesn't want to be disturbed by scheduler
> >           10496          * housekeeping
> >           10497          */
> >           10498         if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
> >           10499                 return;
> >           10500
> >           10501         /* Will wake up very soon. No time for doing anything else*/
> >           10502         if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >           10503                 return;
> >
> > And we can drop this_rq arg altogether?
> >
> > >       raw_spin_lock(&this_rq->lock);
> > >  }
> > >
> > > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >                       update_next_balance(sd, &next_balance);
> > >               rcu_read_unlock();
> > >
> > > -             nohz_newidle_balance(this_rq);
> > > -
> > >               goto out;
> > >       }
> > >
> > > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >
> > >       if (pulled_task)
> > >               this_rq->idle_stamp = 0;
> > > +     else
> > > +             nohz_newidle_balance(this_rq);
> >
> > Since nohz_newidle_balance() will not do any real work now, I couldn't figure
> > out what moving this here achieves. Fault from my end to parse the change most
> > likely :-)
> 
> The goal is to schedule the update only if we are about to be idle and
> nothing else has been queued in the meantime

I see. This shortcoming already existed and is not *strictly* related to moving
the update of blocked load out of newidle balance.

Thanks

--
Qais Yousef
Tim Chen March 23, 2021, 9:37 p.m. UTC | #23
On 1/29/21 9:27 AM, Vincent Guittot wrote:
> 
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().

On a well-known database workload, we also saw a lot of overhead from doing update_blocked_averages
in newidle_balance().  So changes to reduce this overhead are very welcome.

Turning on cgroup induces a 9% throughput degradation on a 2-socket Icelake system with 40 cores per socket.

A big part of the overhead in our database workload comes from updating
blocked averages in newidle_balance, caused by I/O threads making
some CPUs go in and out of idle frequently in the following code path:

----__blkdev_direct_IO_simple
          |          
          |----io_schedule_timeout
          |          |          
          |           ----schedule_timeout
          |                     |          
          |                      ----schedule
          |                                |          
          |                                 ----__schedule
          |                                           |          
          |                                            ----pick_next_task_fair
          |                                                      |          
          |                                                       ----newidle_balance
          |                                                                 |          
                                                                             ----update_blocked_averages

We found that update_blocked_averages() now consumes the most CPU time, eating up 2% of the CPU cycles once cgroup
is turned on.

I hacked up Joel's original patch to rate limit the update of blocked
averages called from newidle_balance().  The 9% throughput degradation was reduced to
5.4%.  We'll be testing Vincent's change to see if it can give
a similar performance improvement.

Note that in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) than the default (500000), to encourage migration to idle cpus
and reduce latency.  As a result, newidle_balance often calls update_blocked_averages directly
and then tries to load_balance itself, instead of relegating
the responsibility to the idle load balancer.  (See the code snippet from newidle_balance below.)


        if (this_rq->avg_idle < sysctl_sched_migration_cost ||       <-----sched_migration_cost check
            !READ_ONCE(this_rq->rd->overload)) {

                rcu_read_lock();
                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();

                goto out;  <--- invoke idle load balancer
        }

        raw_spin_unlock(&this_rq->lock);

        update_blocked_averages(this_cpu);

	.... followed by load balance code ---


So the offload of update_blocked_averages to the idle load balancer in Vincent's patch is less
effective in this case with a small sched_migration_cost.

Looking at the code a bit more, we don't actually load balance in this code path
unless our avg_idle time exceeds some threshold.  Doing update_blocked_averages immediately
is only needed if we actually call load_balance().  If we don't do any load balancing in this code path,
we can let the idle load balancer update the blocked averages lazily.

Something like the following perhaps on top of Vincent's patch?  We haven't really tested
this change yet but want to see if this change makes sense to you.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63950d80fd0b..b93f5f52658a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	u64 curr_cost = 0;
+	bool updated_blocked_avg = false;
 
 	update_misfit_status(NULL, this_rq);
 	/*
@@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_unlock(&this_rq->lock);
 
-	update_blocked_averages(this_cpu);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
@@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
+			if (!updated_blocked_avg) {
+				update_blocked_averages(this_cpu);
+				updated_blocked_avg = true;
+			}
+
 			t0 = sched_clock_cpu(this_cpu);
 
 			pulled_task = load_balance(this_cpu, this_rq,
Vincent Guittot March 24, 2021, 1:44 p.m. UTC | #24
Hi Tim,

On Tuesday 23 March 2021 at 14:37:59 (-0700), Tim Chen wrote:
> 
> 
> On 1/29/21 9:27 AM, Vincent Guittot wrote:
> > 
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> 
> On a well known database workload, we also saw a lot of overhead to do update_blocked_averages
> in newidle_balance().  So changes to reduce this overhead is much welcomed.
> 
> Turning on cgroup induces 9% throughput degradation on a 2 socket 40 cores per socket Icelake system.  
> 
> A big part of the overhead in our database workload comes from updating
> blocked averages in newidle_balance, caused by I/O threads making
> some CPUs go in and out of idle frequently in the following code path:
> 
> ----__blkdev_direct_IO_simple
>           |          
>           |----io_schedule_timeout
>           |          |          
>           |           ----schedule_timeout
>           |                     |          
>           |                      ----schedule
>           |                                |          
>           |                                 ----__schedule
>           |                                           |          
>           |                                            ----pick_next_task_fair
>           |                                                      |          
>           |                                                       ----newidle_balance
>           |                                                                 |          
>                                                                              ----update_blocked_averages
> 
> We found update_blocked_averages() now consumed most CPU time, eating up 2% of the CPU cycles once cgroup
> gets turned on.
> 
> I hacked up Joe's original patch to rate limit the update of blocked
> averages called from newidle_balance().  The 9% throughput degradation reduced to
> 5.4%.  We'll be testing Vincent's change to see if it can give
> similar performance improvement.
> 
> Though in our test environment, sysctl_sched_migration_cost was kept
> much lower (25000) compared to the default (500000), to encourage migrations to idle cpu
> and reduce latency.  We got quite a lot of calls to update_blocked_averages directly 
> and then try to load_balance in newidle_balance instead of relegating
> the responsibility to idle load balancer.  (See code snippet in newidle_balance below)  
> 
> 
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||       <-----sched_migration_cost check
>             !READ_ONCE(this_rq->rd->overload)) {
> 
>                 rcu_read_lock();
>                 sd = rcu_dereference_check_sched_domain(this_rq->sd);
>                 if (sd)
>                         update_next_balance(sd, &next_balance);
>                 rcu_read_unlock();
> 
>                 goto out;  <--- invoke idle load balancer
>         }
> 
>         raw_spin_unlock(&this_rq->lock);
> 
>         update_blocked_averages(this_cpu);
> 
> 	.... followed by load balance code ---
> 
 
> So the update_blocked_averages offload to idle_load_balancer in Vincent's patch is less 
> effective in this case with small sched_migration_cost.
> 
> Looking at the code a bit more, we don't actually load balance every time in this code path
> unless our avg_idle time exceeds some threshold.  Doing update_blocked_averages immediately 

IIUC your problem, we call update_blocked_averages() but because of:

		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
			update_next_balance(sd, &next_balance);
			break;
		}

the for_each_domain loop stops even before running load_balance on the 1st
sched domain level, which means that update_blocked_averages() was called
unnecessarily.

And this is even more true with a small sysctl_sched_migration_cost, which allows newly
idle LB for a very small this_rq->avg_idle. One could wonder why you set sysctl_sched_migration_cost
to such a low value, lower than the max_newidle_lb_cost of the
smallest domain, but that's probably because of task_hot().

If avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
skip the spin_unlock/lock and the for_each_domain() loop entirely.

Maybe something like below:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76e33a70d575..08933e0d87ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10723,17 +10723,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         */
        rq_unpin_lock(this_rq, rf);

+       rcu_read_lock();
+       sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-           !READ_ONCE(this_rq->rd->overload)) {
+           !READ_ONCE(this_rq->rd->overload) ||
+           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

-               rcu_read_lock();
-               sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();

                goto out;
        }
+       rcu_read_unlock();
+

        raw_spin_unlock(&this_rq->lock);


> is only needed if we do call load_balance().  If we don't do any load balance in the code path,
> we can let the idle load balancer update the blocked averages lazily.
> 
> Something like the following perhaps on top of Vincent's patch?  We haven't really tested
> this change yet but want to see if this change makes sense to you.
> 
> Tim
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63950d80fd0b..b93f5f52658a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	struct sched_domain *sd;
>  	int pulled_task = 0;
>  	u64 curr_cost = 0;
> +	bool updated_blocked_avg = false;
>  
>  	update_misfit_status(NULL, this_rq);
>  	/*
> @@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_unlock(&this_rq->lock);
>  
> -	update_blocked_averages(this_cpu);
>  	rcu_read_lock();
>  	for_each_domain(this_cpu, sd) {
>  		int continue_balancing = 1;
> @@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		}
>  
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
> +			if (!updated_blocked_avg) {
> +				update_blocked_averages(this_cpu);
> +				updated_blocked_avg = true;
> +			}
> +
>  			t0 = sched_clock_cpu(this_cpu);
>  
>  			pulled_task = load_balance(this_cpu, this_rq,
>  
>
Tim Chen March 24, 2021, 4:05 p.m. UTC | #25
On 3/24/21 6:44 AM, Vincent Guittot wrote:
> Hi Tim,

> 
> IIUC your problem, we call update_blocked_averages() but because of:
> 
> 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> 			update_next_balance(sd, &next_balance);
> 			break;
> 		}
> 
> the for_each_domain loop stops even before running load_balance on the 1st
> sched domain level which means that update_blocked_averages() was called
> unnecessarily. 
> 

That's right

> And this is even more true with a small sysctl_sched_migration_cost which allows newly
> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value 
> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> smallest domain but that's probably because of task_hot().
> 
> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> skip spin_unlock/lock and for_each_domain() loop entirely
> 
> Maybe something like below:
> 

The patch makes sense.  I'll ask our benchmark team to queue this patch for testing.

Tim
Vincent Guittot April 7, 2021, 2:02 p.m. UTC | #26
Hi Tim,

On Wed, 24 Mar 2021 at 17:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> > Hi Tim,
>
> >
> > IIUC your problem, we call update_blocked_averages() but because of:
> >
> >               if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >                       update_next_balance(sd, &next_balance);
> >                       break;
> >               }
> >
> > the for_each_domain loop stops even before running load_balance on the 1st
> > sched domain level which means that update_blocked_averages() was called
> > unnecessarily.
> >
>
> That's right
>
> > And this is even more true with a small sysctl_sched_migration_cost which allows newly
> > idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> > for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> > smallest domain but that's probably because of task_hot().
> >
> > if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> > skip spin_unlock/lock and for_each_domain() loop entirely
> >
> > Maybe something like below:
> >
>
> The patch makes sense.  I'll ask our benchmark team to queue this patch for testing.

Do you have feedback from your benchmark team ?

Regards,
Vincent
>
> Tim
>
>
Tim Chen April 7, 2021, 5:19 p.m. UTC | #27
On 4/7/21 7:02 AM, Vincent Guittot wrote:
> Hi Tim,
> 
> On Wed, 24 Mar 2021 at 17:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>>
>>
>> On 3/24/21 6:44 AM, Vincent Guittot wrote:
>>> Hi Tim,
>>
>>>
>>> IIUC your problem, we call update_blocked_averages() but because of:
>>>
>>>               if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>                       update_next_balance(sd, &next_balance);
>>>                       break;
>>>               }
>>>
>>> the for_each_domain loop stops even before running load_balance on the 1st
>>> sched domain level which means that update_blocked_averages() was called
>>> unnecessarily.
>>>
>>
>> That's right
>>
>>> And this is even more true with a small sysctl_sched_migration_cost which allows newly
>>> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
>>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
>>> smallest domain but that's probably because of task_hot().
>>>
>>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
>>> skip spin_unlock/lock and for_each_domain() loop entirely
>>>
>>> Maybe something like below:
>>>
>>
>> The patch makes sense.  I'll ask our benchmark team to queue this patch for testing.
> 
> Do you have feedback from your benchmark team ?
> 

Vincent,

Thanks for following up. I just got some data back from the benchmark team.
The performance didn't change with your patch, and the overall cpu% of update_blocked_averages
also remained at about the same level.  My first thought was that perhaps this update
still didn't catch all the calls to update_blocked_averages:

        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-           !READ_ONCE(this_rq->rd->overload)) {
+           !READ_ONCE(this_rq->rd->overload) ||
+           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

To experiment, I added one more check on next_balance to further gate
the path that actually does the idle load balance on the next_balance time:

        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-           !READ_ONCE(this_rq->rd->overload)) {
+	    time_before(jiffies, this_rq->next_balance) ||	    
+           !READ_ONCE(this_rq->rd->overload) ||
+           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

I was surprised to find that the overall cpu% consumption of update_blocked_averages
and the throughput of the benchmark still didn't change much.  So I took a
peek into the profile and found that the update_blocked_averages calls shifted to the idle load balancer.
The calls to update_blocked_averages were reduced in newidle_balance, so the patch did
what we intended.  But the overall rate of calls to
update_blocked_averages remains roughly the same, shifting from
newidle_balance to run_rebalance_domains.

   100.00%  (ffffffff810cf070)
            |
            ---update_blocked_averages
               |          
               |--95.47%--run_rebalance_domains
               |          __do_softirq
               |          |          
               |          |--94.27%--asm_call_irq_on_stack
               |          |          do_softirq_own_stack
               |          |          |          
               |          |          |--93.74%--irq_exit_rcu
               |          |          |          |          
               |          |          |          |--88.20%--sysvec_apic_timer_interrupt
               |          |          |          |          asm_sysvec_apic_timer_interrupt
               |          |          |          |          |          
	       ...
	       |
               |          
                --4.53%--newidle_balance
                          pick_next_task_fair

I was expecting idle load balancer to be rate limited to 60 Hz, which
should be 15 jiffies apart on the test system with CONFIG_HZ_250.
When I did a trace on a single CPU, I saw that update_blocked_averages
is often called 1 to 4 jiffies apart, which is a much higher
rate than I expected.  I haven't taken a closer look yet.  But you may
have a better idea.  I won't have access to the test system and workload
till probably next week.

Thanks.

Tim
Vincent Guittot April 8, 2021, 2:51 p.m. UTC | #28
On Wed, 7 Apr 2021 at 19:19, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 4/7/21 7:02 AM, Vincent Guittot wrote:
> > Hi Tim,
> >
> > On Wed, 24 Mar 2021 at 17:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> >>> Hi Tim,
> >>
> >>>
> >>> IIUC your problem, we call update_blocked_averages() but because of:
> >>>
> >>>               if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >>>                       update_next_balance(sd, &next_balance);
> >>>                       break;
> >>>               }
> >>>
> >>> the for_each_domain loop stops even before running load_balance on the 1st
> >>> sched domain level which means that update_blocked_averages() was called
> >>> unnecessarily.
> >>>
> >>
> >> That's right
> >>
> >>> And this is even more true with a small sysctl_sched_migration_cost which allows newly
> >>> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> >>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> >>> smallest domain but that's probably because of task_hot().
> >>>
> >>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> >>> skip spin_unlock/lock and for_each_domain() loop entirely
> >>>
> >>> Maybe something like below:
> >>>
> >>
> >> The patch makes sense.  I'll ask our benchmark team to queue this patch for testing.
> >
> > Do you have feedback from your benchmark team ?
> >
>
> Vincent,
>
> Thanks for following up. I just got some data back from the benchmark team.
> The performance didn't change with your patch.  And the overall cpu% of update_blocked_averages
> also remain at about the same level.  My first thought was perhaps this update
> still didn't catch all the calls to update_blocked_averages
>
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -           !READ_ONCE(this_rq->rd->overload)) {
> +           !READ_ONCE(this_rq->rd->overload) ||
> +           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> To experiment, I added one more check on the next_balance to further limit
> the path to actually do idle load balance with the next_balance time.
>
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -           !READ_ONCE(this_rq->rd->overload)) {
> +           time_before(jiffies, this_rq->next_balance) ||
> +           !READ_ONCE(this_rq->rd->overload) ||
> +           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> I was suprised to find the overall cpu% consumption of update_blocked_averages
> and throughput of the benchmark still didn't change much.  So I took a
> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
> The call to update_locked_averages was reduced in newidle_balance so the patch did
> what we intended.  But the overall rate of calls to

At least we have removed the useless call to update_blocked_averages
in newidle_balance when we will not perform any newly idle load
balance.

> update_blocked_averages remain roughly the same, shifting from
> newidle_balance to run_rebalance_domains.
>
>    100.00%  (ffffffff810cf070)
>             |
>             ---update_blocked_averages
>                |
>                |--95.47%--run_rebalance_domains
>                |          __do_softirq
>                |          |
>                |          |--94.27%--asm_call_irq_on_stack
>                |          |          do_softirq_own_stack

The calls to update_blocked_averages mainly come from SCHED_SOFTIRQ
and, as a result, not from the new path
do_idle()->nohz_run_idle_balance(), which has been added by this patch
to defer the call to update_nohz_stats() until after newidle_balance and
before entering idle.
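
Roughly, the deferred path looks like this: nohz_newidle_balance() now only
tags the CPU (e.g. with a NOHZ_NEWILB_KICK bit in nohz_flags()), and do_idle()
runs the actual update outside the rq lock and with irqs on. Sketch only; the
flag name and exact checks are illustrative, not a verbatim copy of the series:

        void nohz_run_idle_balance(int cpu)
        {
                unsigned int flags;

                flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));

                /*
                 * Only update the blocked load if no SCHED_SOFTIRQ is about
                 * to run (it would do the same work) and we are still going
                 * idle.
                 */
                if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
                        _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
        }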

>                |          |          |
>                |          |          |--93.74%--irq_exit_rcu
>                |          |          |          |
>                |          |          |          |--88.20%--sysvec_apic_timer_interrupt
>                |          |          |          |          asm_sysvec_apic_timer_interrupt
>                |          |          |          |          |
>                ...
>                |
>                |
>                 --4.53%--newidle_balance
>                           pick_next_task_fair
>
> I was expecting idle load balancer to be rate limited to 60 Hz, which

Why 60Hz ?

> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> When I did a trace on a single CPU, I see that update_blocked_averages
> are often called between 1 to 4 jiffies apart, which is at a much higher
> rate than I expected.  I haven't taken a closer look yet.  But you may

2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
- the need for an update of blocked load, which should not happen more
than once every 32ms, i.e. at a rate of around 30Hz (see the snippet
after this list)
- the need for a load balance of a sched_domain. The min interval for
a sched_domain is its weight when the CPU is idle, which is usually a few
jiffies
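
The ~32ms comes from how nohz.next_blocked is armed and then checked before
raising another stats kick, roughly (paraphrased; LOAD_AVG_PERIOD is 32, used
as ms here):

        /* armed when the blocked load has been updated: */
        WRITE_ONCE(nohz.next_blocked,
                   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));

        /* checked in nohz_balancer_kick() before raising a stats kick: */
        if (READ_ONCE(nohz.has_blocked) &&
            time_after(jiffies, READ_ONCE(nohz.next_blocked)))
                flags = NOHZ_STATS_KICK;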

The only idea that I have for now is that we spend less time in
newidle_balance, which changes the dynamics of your system.

In your trace, could you check whether update_blocked_averages is called
during the tick? And is the current task the idle task?

Vincent

> have a better idea.  I won't have access to the test system and workload
> till probably next week.
>
> Thanks.
>
> Tim
Tim Chen April 8, 2021, 11:05 p.m. UTC | #29
On 4/8/21 7:51 AM, Vincent Guittot wrote:

>> I was suprised to find the overall cpu% consumption of update_blocked_averages
>> and throughput of the benchmark still didn't change much.  So I took a
>> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
>> The call to update_locked_averages was reduced in newidle_balance so the patch did
>> what we intended.  But the overall rate of calls to
> 
> At least , we have removed the useless call to update_blocked_averages
> in newidle_balance when we will not perform any newly idle load
> balance
> 
>> update_blocked_averages remain roughly the same, shifting from
>> newidle_balance to run_rebalance_domains.
>>
>>    100.00%  (ffffffff810cf070)
>>             |
>>             ---update_blocked_averages
>>                |
>>                |--95.47%--run_rebalance_domains
>>                |          __do_softirq
>>                |          |
>>                |          |--94.27%--asm_call_irq_on_stack
>>                |          |          do_softirq_own_stack
> 
> The call of  update_blocked_averages mainly comes from SCHED_SOFTIRQ.
> And as a result, not from the new path
> do_idle()->nohz_run_idle_balance() which has been added by this patch
> to defer the call to update_nohz_stats() after newlyidle_balance and
> before entering idle.
> 
>>                |          |          |
>>                |          |          |--93.74%--irq_exit_rcu
>>                |          |          |          |
>>                |          |          |          |--88.20%--sysvec_apic_timer_interrupt
>>                |          |          |          |          asm_sysvec_apic_timer_interrupt
>>                |          |          |          |          |
>>                ...
>>                |
>>                |
>>                 --4.53%--newidle_balance
>>                           pick_next_task_fair
>>
>> I was expecting idle load balancer to be rate limited to 60 Hz, which
> 
> Why 60Hz ?
> 

My thinking is we will trigger load balance only after rq->next_balance.

void trigger_load_balance(struct rq *rq)
{
        /* Don't need to rebalance while attached to NULL domain */
        if (unlikely(on_null_domain(rq)))
                return;

        if (time_after_eq(jiffies, rq->next_balance))
                raise_softirq(SCHED_SOFTIRQ);

        nohz_balancer_kick(rq);
}

And it seems like next_balance is set to be 60 Hz

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
        int continue_balancing = 1;
        int cpu = rq->cpu;
        int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
        unsigned long interval;
        struct sched_domain *sd;
        /* Earliest time when we have to do rebalance again */
        unsigned long next_balance = jiffies + 60*HZ;


>> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
>> When I did a trace on a single CPU, I see that update_blocked_averages
>> are often called between 1 to 4 jiffies apart, which is at a much higher
>> rate than I expected.  I haven't taken a closer look yet.  But you may
> 
> 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> - the need for an update of blocked load which should not happen more
> than once every 32ms which means a rate of around 30Hz
> - the need for a load balance of a sched_domain. The min interval for
> a sched_domain is its weight when the CPU is idle which is usually few
> jiffies
> 
> The only idea that I have for now is that we spend less time in
> newidle_balance which changes the dynamic of your system.
> 
> In your trace, could you check if update_blocked_averages is called
> during the tick ? and Is the current task idle task ?

Here's a snapshot of the trace. However, I didn't have the current task in my trace.
You can tell the frequency at which update_blocked_averages is called on
cpu 2 from the jiffies values.  They are quite close together (1 to 3 jiffies apart).
When I have a chance to get on the machine, I'll take another look
at the current task and at whether we got to trigger_load_balance() from scheduler_tick().


     3.505 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
     4.505 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
     6.484 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
     6.506 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
     9.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
    11.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
    11.602 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
    11.624 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    11.642 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    11.645 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    11.977 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    12.003 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    12.015 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    12.043 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
    12.567 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
    13.856 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
    13.910 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
    14.003 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
    14.159 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
    14.203 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
    14.223 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
    14.301 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
    14.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
    14.637 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
    14.666 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.059 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.083 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.100 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.103 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.150 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.227 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.248 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.311 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
    15.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
    16.140 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
    16.185 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
    16.224 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
    16.340 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
    16.384 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
    16.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
    16.993 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
    17.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
    17.630 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
    17.830 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
    18.015 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
    18.031 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
    18.036 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
    18.040 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
    18.502 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740

Thanks for taking a look.

Tim
Vincent Guittot April 9, 2021, 3:26 p.m. UTC | #30
On Fri, 9 Apr 2021 at 01:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
>
> On 4/8/21 7:51 AM, Vincent Guittot wrote:
>
> >> I was suprised to find the overall cpu% consumption of update_blocked_averages
> >> and throughput of the benchmark still didn't change much.  So I took a
> >> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
> >> The call to update_blocked_averages was reduced in newidle_balance so the patch did
> >> what we intended.  But the overall rate of calls to
> >
> > At least, we have removed the useless call to update_blocked_averages
> > in newidle_balance when we will not perform any newly idle load
> > balance
> >
> >> update_blocked_averages remains roughly the same, shifting from
> >> newidle_balance to run_rebalance_domains.
> >>
> >>    100.00%  (ffffffff810cf070)
> >>             |
> >>             ---update_blocked_averages
> >>                |
> >>                |--95.47%--run_rebalance_domains
> >>                |          __do_softirq
> >>                |          |
> >>                |          |--94.27%--asm_call_irq_on_stack
> >>                |          |          do_softirq_own_stack
> >
> > The calls to update_blocked_averages mainly come from SCHED_SOFTIRQ,
> > and as a result not from the new path
> > do_idle()->nohz_run_idle_balance(), which has been added by this patch
> > to defer the call to update_nohz_stats() until after newidle_balance
> > and before entering idle.
> >
> >>                |          |          |
> >>                |          |          |--93.74%--irq_exit_rcu
> >>                |          |          |          |
> >>                |          |          |          |--88.20%--sysvec_apic_timer_interrupt
> >>                |          |          |          |          asm_sysvec_apic_timer_interrupt
> >>                |          |          |          |          |
> >>                ...
> >>                |
> >>                |
> >>                 --4.53%--newidle_balance
> >>                           pick_next_task_fair
> >>
> >> I was expecting idle load balancer to be rate limited to 60 Hz, which
> >
> > Why 60Hz ?
> >
>
> My thinking is we will trigger load balance only after rq->next_balance.
>
> void trigger_load_balance(struct rq *rq)
> {
>         /* Don't need to rebalance while attached to NULL domain */
>         if (unlikely(on_null_domain(rq)))
>                 return;
>
>         if (time_after_eq(jiffies, rq->next_balance))
>                 raise_softirq(SCHED_SOFTIRQ);
>
>         nohz_balancer_kick(rq);
> }
>
> And it seems like next_balance is set to be 60 Hz
>
> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> {
>         int continue_balancing = 1;
>         int cpu = rq->cpu;
>         int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>         unsigned long interval;
>         struct sched_domain *sd;
>         /* Earliest time when we have to do rebalance again */
>         unsigned long next_balance = jiffies + 60*HZ;

This doesn't mean a 60 Hz period but 60*HZ jiffies, with HZ being the
number of jiffies per second. We initialize next_balance to now + 60 sec
to make sure it's far later than the next balance of any of the
sched_domains.

Then, update_next_balance() keeps track of the earliest balance that has
to happen next.
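
For reference, the helper is roughly the following (reproduced from
memory of kernel/sched/fair.c, so double-check the exact code in your
tree):

static inline void
update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
{
	unsigned long interval, next;

	/* used by idle balance, so cpu_busy = 0 */
	interval = get_sd_balance_interval(sd, 0);
	next = sd->last_balance + interval;

	/* only ever pulls *next_balance earlier, towards the soonest domain */
	if (time_after(*next_balance, next))
		*next_balance = next;
}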

>
>
> >> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> >> When I did a trace on a single CPU, I see that update_blocked_averages
> >> are often called between 1 to 4 jiffies apart, which is at a much higher
> >> rate than I expected.  I haven't taken a closer look yet.  But you may
> >
> > 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> > - the need for an update of blocked load which should not happen more
> > than once every 32ms which means a rate of around 30Hz
> > - the need for a load balance of a sched_domain. The min interval for
> > a sched_domain is its weight when the CPU is idle which is usually few
> > jiffies
> >
> > The only idea that I have for now is that we spend less time in
> > newidle_balance which changes the dynamic of your system.
> >
> > In your trace, could you check if update_blocked_averages is called
> > during the tick? And is the current task the idle task?
>
> Here's a snapshot of the trace. However I didn't have the current task in my trace.
> You can tell the frequency that update_blocked_averages is called on
> cpu 2 by the jiffies value.  They are quite close together (1 to 3 jiffies apart).
> When I have a chance to get on the machine, I'll take another look
> at the current task and whether we got to trigger_load_balance() from scheduler_tick().
>
>
>      3.505 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
>      4.505 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
>      6.484 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
>      6.506 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
>      9.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
>     11.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
>     11.602 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>     11.624 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     11.642 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     11.645 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     11.977 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     12.003 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     12.015 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     12.043 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>     12.567 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
>     13.856 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>     13.910 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>     14.003 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>     14.159 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>     14.203 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>     14.223 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>     14.301 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>     14.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
>     14.637 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
>     14.666 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.059 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.083 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.100 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.103 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.150 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.227 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.248 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.311 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>     15.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
>     16.140 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
>     16.185 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>     16.224 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>     16.340 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>     16.384 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>     16.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
>     16.993 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
>     17.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
>     17.630 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
>     17.830 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>     18.015 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>     18.031 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>     18.036 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>     18.040 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>     18.502 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740
>

I don't know exactly what you track with "next_balance=" in
probe:newidle_balance, but it always starts at the same value
0x1004fb76c (in the future) and finishes at 0x1004fb731 (in the past).
That would mean a load balance is needed during the next tick, which
explains why we then see probe:update_blocked_averages on every tick.
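
To make that concrete with the trigger_load_balance() check quoted
earlier in this thread (the comment below is mine for this scenario,
not kernel code):

	/*
	 * With rq->next_balance stuck at 0x1004fb731 while jiffies has
	 * already moved past it, this test is true on every tick, so
	 * SCHED_SOFTIRQ is raised and run_rebalance_domains() ->
	 * update_blocked_averages() ends up running once per tick.
	 */
	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);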

Also, could you check whether the tick is stopped when idle? When the
predicted idle time is short and the next wakeup is expected to happen
before the next tick, the tick is not stopped.
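
One way to check that, assuming the timer:tick_stop tracepoint is
available on your kernel, is something like:

	perf record -C 2 -e timer:tick_stop -- sleep 5
	perf script

which should show whether (and when) the tick actually gets stopped on
CPU 2 while it is idle.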

Vincent

> Thanks for taking a look.
>
> Tim
Tim Chen April 9, 2021, 5:59 p.m. UTC | #31
On 4/9/21 8:26 AM, Vincent Guittot wrote:

>>>>
>>>> I was expecting idle load balancer to be rate limited to 60 Hz, which
>>>
>>> Why 60Hz ?
>>>
>>
>> My thinking is we will trigger load balance only after rq->next_balance.
>>
>> void trigger_load_balance(struct rq *rq)
>> {
>>         /* Don't need to rebalance while attached to NULL domain */
>>         if (unlikely(on_null_domain(rq)))
>>                 return;
>>
>>         if (time_after_eq(jiffies, rq->next_balance))
>>                 raise_softirq(SCHED_SOFTIRQ);
>>
>>         nohz_balancer_kick(rq);
>> }
>>
>> And it seems like next_balance is set to be 60 Hz
>>
>> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>> {
>>         int continue_balancing = 1;
>>         int cpu = rq->cpu;
>>         int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>>         unsigned long interval;
>>         struct sched_domain *sd;
>>         /* Earliest time when we have to do rebalance again */
>>         unsigned long next_balance = jiffies + 60*HZ;
> 
> This doesn't mean a 60 Hz period but 60*HZ jiffies, with HZ being the
> number of jiffies per second. We initialize next_balance to now + 60 sec
> to make sure it's far later than the next balance of any of the
> sched_domains.
> 
> Then, update_next_balance() keeps track of the earliest balance that has
> to happen next.
> 

Thanks for pointing out my misread of the code. In this case the
balance frequency should be lower than I thought, as balances should be
60 sec apart in theory.

>> Here's a snapshot of the trace. However I didn't have the current task in my trace.
>> You can tell the frequency that update_blocked_averages is called on
>> cpu 2 by the jiffies value.  They are quite close together (1 to 3 jiffies apart).
>> When I have a chance to get on the machine, I'll take another look
>> at the current task and whether we got to trigger_load_balance() from scheduler_tick().
>>
>>
>>      3.505 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
>>      4.505 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
>>      6.484 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
>>      6.506 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
>>      9.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
>>     11.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
>>     11.602 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>>     11.624 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     11.642 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     11.645 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     11.977 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     12.003 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     12.015 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     12.043 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>     12.567 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
>>     13.856 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>>     13.910 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>     14.003 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>     14.159 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>     14.203 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>     14.223 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>     14.301 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>     14.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
>>     14.637 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
>>     14.666 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.059 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.083 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.100 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.103 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.150 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.227 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.248 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.311 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>>     15.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
>>     16.140 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
>>     16.185 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>>     16.224 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>>     16.340 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>>     16.384 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>>     16.503 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
>>     16.993 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
>>     17.504 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
>>     17.630 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
>>     17.830 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>>     18.015 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>>     18.031 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>>     18.036 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>>     18.040 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>>     18.502 (         ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740
>>
> 
> I don't know exactly what you track with "next_balance=" in

It is the rq->next_balance value as we enter the newidle_balance function.

> probe:newidle_balance, but it always starts at the same value
> 0x1004fb76c (in the future) and finishes at 0x1004fb731 (in the
> past).

This indeed is odd, as next_balance should move forward, not backward.
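
For reference, the write to rq->next_balance at the end of
newidle_balance() is roughly the following (quoted from memory, so
worth re-checking against the actual tree):

	/* Move the next balance forward */
	if (time_after(this_rq->next_balance, next_balance))
		this_rq->next_balance = next_balance;

so on this path rq->next_balance is only overwritten when the candidate
value is earlier than the current one.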

> That would mean a load balance is needed during the next tick, which
> explains why we then see probe:update_blocked_averages on every tick.

Will try to debug and find out why the next_balance has gone backwards
next time I get access to the test system.

> 
> Also, could you check whether the tick is stopped when idle? When the
> predicted idle time is short and the next wakeup is expected to happen
> before the next tick, the tick is not stopped.
> 

Will do. 

Tim

Patch
diff mbox series

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@  static bool update_nohz_stats(struct rq *rq, bool force)
 	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
 		return false;
 
-	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+	if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
 		return true;
 
 	update_blocked_averages(cpu);