* [RFC PATCH v1 0/1] select idle cpu from idle cpumask in sched domain
@ 2020-09-10 5:42 Aubrey Li
2020-09-10 5:42 ` [RFC PATCH v1 1/1] sched/fair: " Aubrey Li
0 siblings, 1 reply; 15+ messages in thread
From: Aubrey Li @ 2020-09-10 5:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, valentin.schneider
Cc: tim.c.chen, linux-kernel, Aubrey Li
I'm writing to see if it makes sense to track idle cpus in a shared cpumask
in the sched domain, so that when a task wakes up it can select an idle cpu
from this cpumask instead of scanning all the cpus in the last level cache
domain. Especially when the system is heavily loaded, the scanning cost can
be significantly reduced. The price is that atomic cpumask ops are added to
the idle entry and exit paths.
I tested the following benchmarks on an x86 4-socket system with 24 cores per
socket and 2 hyperthreads per core, 192 CPUs in total:
uperf throughput: netperf workload, tcp_nodelay, r/w size = 90

  threads    baseline-avg   %std     patch-avg   %std
   96          1             1.24     0.98        2.76
   144         1             1.13     1.35        4.01
   192         1             0.58     1.67        3.25
   240         1             2.49     1.68        3.55

hackbench: process mode, 100000 loops, 40 file descriptors per group

  group      baseline-avg   %std     patch-avg   %std
   2(80)       1            12.05     0.97        9.88
   3(120)      1            12.48     0.95       11.62
   4(160)      1            13.83     0.97       13.22
   5(200)      1             2.76     1.01        2.94

schbench: 99th percentile latency, 16 workers per message thread

  mthread    baseline-avg   %std     patch-avg   %std
   6(96)       1             1.24     0.993       1.73
   9(144)      1             0.38     0.998       0.39
   12(192)     1             1.58     0.995       1.64
   15(240)     1            51.71     0.606      37.41

sysbench mysql throughput: read/write, table size = 10,000,000

  thread     baseline-avg   %std     patch-avg   %std
   96          1             1.77     1.015       1.71
   144         1             3.39     0.998       4.05
   192         1             2.88     1.002       2.81
   240         1             2.07     1.011       2.09

kbuild: kexec reboot every time

  baseline-avg   patch-avg
   1              1
Any suggestions are highly appreciated!
Thanks,
-Aubrey
Aubrey Li (1):
sched/fair: select idle cpu from idle cpumask in sched domain
include/linux/sched/topology.h | 13 +++++++++++++
kernel/sched/fair.c | 4 +++-
kernel/sched/topology.c | 2 +-
3 files changed, 17 insertions(+), 2 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-10 5:42 [RFC PATCH v1 0/1] select idle cpu from idle cpumask in sched domain Aubrey Li
@ 2020-09-10 5:42 ` Aubrey Li
2020-09-11 16:28 ` Qais Yousef
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Aubrey Li @ 2020-09-10 5:42 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, valentin.schneider
Cc: tim.c.chen, linux-kernel, Aubrey Li, Aubrey Li
Add an idle cpumask to track idle cpus in the sched domain. When a CPU
enters idle, its corresponding bit in the idle cpumask is set, and when
the CPU exits idle, its bit is cleared.

When a task wakes up and selects an idle cpu, scanning the idle cpumask
has lower cost than scanning all the cpus in the last level cache domain,
especially when the system is heavily loaded.
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
include/linux/sched/topology.h | 13 +++++++++++++
kernel/sched/fair.c | 4 +++-
kernel/sched/topology.c | 2 +-
3 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fb11091129b3..43a641d26154 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,21 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ /*
+ * Span of all idle CPUs in this domain.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long idle_cpus_span[];
};
+static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->idle_cpus_span);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b3b59cc51d6..3b6f8a3589be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
time = cpu_clock(this);
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
sd->nohz_idle = 0;
atomic_inc(&sd->shared->nr_busy_cpus);
+ cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
rcu_read_unlock();
}
@@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
sd->nohz_idle = 1;
atomic_dec(&sd->shared->nr_busy_cpus);
+ cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..92d0aeef86bf 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.25.1
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-10 5:42 ` [RFC PATCH v1 1/1] sched/fair: " Aubrey Li
@ 2020-09-11 16:28 ` Qais Yousef
2020-09-11 23:04 ` Li, Aubrey
2020-09-13 3:59 ` Jiang Biao
2020-09-14 12:32 ` Vincent Guittot
2 siblings, 1 reply; 15+ messages in thread
From: Qais Yousef @ 2020-09-11 16:28 UTC (permalink / raw)
To: Aubrey Li
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, valentin.schneider, tim.c.chen,
linux-kernel, Aubrey Li
On 09/10/20 13:42, Aubrey Li wrote:
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
> include/linux/sched/topology.h | 13 +++++++++++++
> kernel/sched/fair.c | 4 +++-
> kernel/sched/topology.c | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
The patch looks useful. Did it help you with any particular workload? It'd be
good to expand on that in the commit message.
Thanks
--
Qais Yousef
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-11 16:28 ` Qais Yousef
@ 2020-09-11 23:04 ` Li, Aubrey
2020-09-11 23:16 ` Li, Aubrey
2020-09-14 10:31 ` Valentin Schneider
0 siblings, 2 replies; 15+ messages in thread
From: Li, Aubrey @ 2020-09-11 23:04 UTC (permalink / raw)
To: Qais Yousef, Aubrey Li
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, valentin.schneider, tim.c.chen,
linux-kernel
On 2020/9/12 0:28, Qais Yousef wrote:
> On 09/10/20 13:42, Aubrey Li wrote:
>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>> enters idle, its corresponding bit in the idle cpumask will be set,
>> and when the CPU exits idle, its bit will be cleared.
>>
>> When a task wakes up to select an idle cpu, scanning idle cpumask
>> has low cost than scanning all the cpus in last level cache domain,
>> especially when the system is heavily loaded.
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>> ---
>> include/linux/sched/topology.h | 13 +++++++++++++
>> kernel/sched/fair.c | 4 +++-
>> kernel/sched/topology.c | 2 +-
>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index fb11091129b3..43a641d26154 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> + /*
>> + * Span of all idle CPUs in this domain.
>> + *
>> + * NOTE: this field is variable length. (Allocated dynamically
>> + * by attaching extra space to the end of the structure,
>> + * depending on how many CPUs the kernel has booted up with)
>> + */
>> + unsigned long idle_cpus_span[];
>
> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
I can use the existing free code. Is there a problem with this?
>
> The patch looks useful. Did it help you with any particular workload? It'd be
> good to expand on that in the commit message.
>
Odd, that was included in the patch v1 0/1 cover letter. Did you receive it?
Thanks,
-Aubrey
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-11 23:04 ` Li, Aubrey
@ 2020-09-11 23:16 ` Li, Aubrey
2020-09-14 10:31 ` Valentin Schneider
1 sibling, 0 replies; 15+ messages in thread
From: Li, Aubrey @ 2020-09-11 23:16 UTC (permalink / raw)
To: Qais Yousef, Aubrey Li
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, valentin.schneider, tim.c.chen,
linux-kernel
On 2020/9/12 7:04, Li, Aubrey wrote:
> On 2020/9/12 0:28, Qais Yousef wrote:
>> On 09/10/20 13:42, Aubrey Li wrote:
>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>> and when the CPU exits idle, its bit will be cleared.
>>>
>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>> has low cost than scanning all the cpus in last level cache domain,
>>> especially when the system is heavily loaded.
>>>
>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>> ---
>>> include/linux/sched/topology.h | 13 +++++++++++++
>>> kernel/sched/fair.c | 4 +++-
>>> kernel/sched/topology.c | 2 +-
>>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>> index fb11091129b3..43a641d26154 100644
>>> --- a/include/linux/sched/topology.h
>>> +++ b/include/linux/sched/topology.h
>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>> atomic_t ref;
>>> atomic_t nr_busy_cpus;
>>> int has_idle_cores;
>>> + /*
>>> + * Span of all idle CPUs in this domain.
>>> + *
>>> + * NOTE: this field is variable length. (Allocated dynamically
>>> + * by attaching extra space to the end of the structure,
>>> + * depending on how many CPUs the kernel has booted up with)
>>> + */
>>> + unsigned long idle_cpus_span[];
>>
>> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
>
> I can use the existing free code. Do we have a problem of this?
>
>>
>> The patch looks useful. Did it help you with any particular workload? It'd be
>> good to expand on that in the commit message.
>>
> Odd, that included in patch v1 0/1, did you receive it?
I found it here:
https://lkml.org/lkml/2020/9/11/645
>
> Thanks,
> -Aubrey
>
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-10 5:42 ` [RFC PATCH v1 1/1] sched/fair: " Aubrey Li
2020-09-11 16:28 ` Qais Yousef
@ 2020-09-13 3:59 ` Jiang Biao
2020-09-14 12:26 ` Vincent Guittot
2020-09-14 12:32 ` Vincent Guittot
2 siblings, 1 reply; 15+ messages in thread
From: Jiang Biao @ 2020-09-13 3:59 UTC (permalink / raw)
To: Aubrey Li
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, linux-kernel, Aubrey Li
Hi, Aubrey
On Fri, 11 Sep 2020 at 23:48, Aubrey Li <aubrey.li@intel.com> wrote:
>
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
> include/linux/sched/topology.h | 13 +++++++++++++
> kernel/sched/fair.c | 4 +++-
> kernel/sched/topology.c | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..3b6f8a3589be 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
Is sds_idle_cpus() always empty if nohz=off?
Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> sd->nohz_idle = 0;
>
> atomic_inc(&sd->shared->nr_busy_cpus);
> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
> }
> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> sd->nohz_idle = 1;
>
> atomic_dec(&sd->shared->nr_busy_cpus);
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
This only works when entering/exiting tickless mode? :)
Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
Thx.
Regards,
Jiang
> unlock:
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9079d865a935..92d0aeef86bf 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.25.1
>
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-11 23:04 ` Li, Aubrey
2020-09-11 23:16 ` Li, Aubrey
@ 2020-09-14 10:31 ` Valentin Schneider
2020-09-14 11:08 ` Qais Yousef
1 sibling, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2020-09-14 10:31 UTC (permalink / raw)
To: Li, Aubrey
Cc: Qais Yousef, Aubrey Li, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
tim.c.chen, linux-kernel
On 12/09/20 00:04, Li, Aubrey wrote:
>>> +++ b/include/linux/sched/topology.h
>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>> atomic_t ref;
>>> atomic_t nr_busy_cpus;
>>> int has_idle_cores;
>>> + /*
>>> + * Span of all idle CPUs in this domain.
>>> + *
>>> + * NOTE: this field is variable length. (Allocated dynamically
>>> + * by attaching extra space to the end of the structure,
>>> + * depending on how many CPUs the kernel has booted up with)
>>> + */
>>> + unsigned long idle_cpus_span[];
>>
>> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
>
> I can use the existing free code. Do we have a problem of this?
>
Nah, flexible array members are the preferred approach here; this also
means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
allocated.
See struct numa_group, struct sched_group, struct sched_domain, struct
em_perf_domain...
>>
>> The patch looks useful. Did it help you with any particular workload? It'd be
>> good to expand on that in the commit message.
>>
> Odd, that included in patch v1 0/1, did you receive it?
>
> Thanks,
> -Aubrey
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-14 10:31 ` Valentin Schneider
@ 2020-09-14 11:08 ` Qais Yousef
2020-09-14 11:26 ` Valentin Schneider
0 siblings, 1 reply; 15+ messages in thread
From: Qais Yousef @ 2020-09-14 11:08 UTC (permalink / raw)
To: Valentin Schneider
Cc: Li, Aubrey, Aubrey Li, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
tim.c.chen, linux-kernel
On 09/14/20 11:31, Valentin Schneider wrote:
>
> On 12/09/20 00:04, Li, Aubrey wrote:
> >>> +++ b/include/linux/sched/topology.h
> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>> atomic_t ref;
> >>> atomic_t nr_busy_cpus;
> >>> int has_idle_cores;
> >>> + /*
> >>> + * Span of all idle CPUs in this domain.
> >>> + *
> >>> + * NOTE: this field is variable length. (Allocated dynamically
> >>> + * by attaching extra space to the end of the structure,
> >>> + * depending on how many CPUs the kernel has booted up with)
> >>> + */
> >>> + unsigned long idle_cpus_span[];
> >>
> >> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
> >
> > I can use the existing free code. Do we have a problem of this?
> >
>
> Nah, flexible array members are the preferred approach here; this also
Is this your opinion or a rule written somewhere I missed?
> means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
> allocated.
>
> See struct numa_group, struct sched_group, struct sched_domain, struct
> em_perf_domain...
struct root_domain, struct cpupri_vec, struct generic_pm_domain,
struct irq_common_data..
Use cpumask_var_t.
Both approaches look correct to me, so no objection in principle. cpumask_var_t
looks neater IMO and will be necessary once more than one cpumask is required
in a struct.
>
> >>
> >> The patch looks useful. Did it help you with any particular workload? It'd be
> >> good to expand on that in the commit message.
> >>
> > Odd, that included in patch v1 0/1, did you receive it?
Aubrey,
Sorry, I didn't see that, no. It's important justification that should be part
of the commit message; I think it's worth adding.
Thanks
--
Qais Yousef
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-14 11:08 ` Qais Yousef
@ 2020-09-14 11:26 ` Valentin Schneider
2020-09-14 11:31 ` Qais Yousef
0 siblings, 1 reply; 15+ messages in thread
From: Valentin Schneider @ 2020-09-14 11:26 UTC (permalink / raw)
To: Qais Yousef
Cc: Li, Aubrey, Aubrey Li, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
tim.c.chen, linux-kernel
On 14/09/20 12:08, Qais Yousef wrote:
> On 09/14/20 11:31, Valentin Schneider wrote:
>>
>> On 12/09/20 00:04, Li, Aubrey wrote:
>> >>> +++ b/include/linux/sched/topology.h
>> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>> >>> atomic_t ref;
>> >>> atomic_t nr_busy_cpus;
>> >>> int has_idle_cores;
>> >>> + /*
>> >>> + * Span of all idle CPUs in this domain.
>> >>> + *
>> >>> + * NOTE: this field is variable length. (Allocated dynamically
>> >>> + * by attaching extra space to the end of the structure,
>> >>> + * depending on how many CPUs the kernel has booted up with)
>> >>> + */
>> >>> + unsigned long idle_cpus_span[];
>> >>
>> >> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
>> >
>> > I can use the existing free code. Do we have a problem of this?
>> >
>>
>> Nah, flexible array members are the preferred approach here; this also
>
> Is this your opinion or a rule written somewhere I missed?
I don't think there's a written rule, but AIUI it is preferred by at
least Peter:
https://lore.kernel.org/linux-pm/20180612125930.GP12217@hirez.programming.kicks-ass.net/
https://lore.kernel.org/lkml/20180619110734.GO2458@hirez.programming.kicks-ass.net/
And my opinion is that, if you can, having fewer separate allocations is better.
>
>> means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
>> allocated.
>>
>> See struct numa_group, struct sched_group, struct sched_domain, struct
>> em_perf_domain...
>
> struct root_domain, struct cpupri_vec, struct generic_pm_domain,
> struct irq_common_data..
>
> Use cpumask_var_t.
>
> Both approach look correct to me, so no objection in principle. cpumask_var_t
> looks neater IMO and will be necessary once more than one cpumask are required
> in a struct.
>
You're right in that cpumask_var_t becomes necessary when you need more
than one mask. For those that use it despite requiring only one mask
(cpupri stuff, struct nohz too), I'm not sure.
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-14 11:26 ` Valentin Schneider
@ 2020-09-14 11:31 ` Qais Yousef
0 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2020-09-14 11:31 UTC (permalink / raw)
To: Valentin Schneider
Cc: Li, Aubrey, Aubrey Li, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
tim.c.chen, linux-kernel
On 09/14/20 12:26, Valentin Schneider wrote:
>
> On 14/09/20 12:08, Qais Yousef wrote:
> > On 09/14/20 11:31, Valentin Schneider wrote:
> >>
> >> On 12/09/20 00:04, Li, Aubrey wrote:
> >> >>> +++ b/include/linux/sched/topology.h
> >> >>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >> >>> atomic_t ref;
> >> >>> atomic_t nr_busy_cpus;
> >> >>> int has_idle_cores;
> >> >>> + /*
> >> >>> + * Span of all idle CPUs in this domain.
> >> >>> + *
> >> >>> + * NOTE: this field is variable length. (Allocated dynamically
> >> >>> + * by attaching extra space to the end of the structure,
> >> >>> + * depending on how many CPUs the kernel has booted up with)
> >> >>> + */
> >> >>> + unsigned long idle_cpus_span[];
> >> >>
> >> >> Can't you use cpumask_var_t and zalloc_cpumask_var() instead?
> >> >
> >> > I can use the existing free code. Do we have a problem of this?
> >> >
> >>
> >> Nah, flexible array members are the preferred approach here; this also
> >
> > Is this your opinion or a rule written somewhere I missed?
>
> I don't think there's a written rule, but AIUI it is preferred by at
> least Peter:
>
> https://lore.kernel.org/linux-pm/20180612125930.GP12217@hirez.programming.kicks-ass.net/
> https://lore.kernel.org/lkml/20180619110734.GO2458@hirez.programming.kicks-ass.net/
>
> And my opinion is that, if you can, having fewer separate allocation is better.
+1
>
> >
> >> means we don't let CONFIG_CPUMASK_OFFSTACK dictate where this gets
> >> allocated.
> >>
> >> See struct numa_group, struct sched_group, struct sched_domain, struct
> >> em_perf_domain...
> >
> > struct root_domain, struct cpupri_vec, struct generic_pm_domain,
> > struct irq_common_data..
> >
> > Use cpumask_var_t.
> >
> > Both approach look correct to me, so no objection in principle. cpumask_var_t
> > looks neater IMO and will be necessary once more than one cpumask are required
> > in a struct.
> >
>
> You're right in that cpumask_var_t becomes necessary when you need more
> than one mask. For those that use it despite requiring only one mask
> (cpupri stuff, struct nohz too), I'm not sure.
I don't have a strong opinion. cpumask_var_t is more readable and maintainable
IMO. But it's not a big deal; either form can be easily changed.
Thanks
--
Qais Yousef
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-13 3:59 ` Jiang Biao
@ 2020-09-14 12:26 ` Vincent Guittot
2020-09-15 8:47 ` Jiang Biao
0 siblings, 1 reply; 15+ messages in thread
From: Vincent Guittot @ 2020-09-14 12:26 UTC (permalink / raw)
To: Jiang Biao
Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, linux-kernel, Aubrey Li
On Sun, 13 Sep 2020 at 05:59, Jiang Biao <benbjiang@gmail.com> wrote:
>
> Hi, Aubrey
>
> On Fri, 11 Sep 2020 at 23:48, Aubrey Li <aubrey.li@intel.com> wrote:
> >
> > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, its corresponding bit in the idle cpumask will be set,
> > and when the CPU exits idle, its bit will be cleared.
> >
> > When a task wakes up to select an idle cpu, scanning idle cpumask
> > has low cost than scanning all the cpus in last level cache domain,
> > especially when the system is heavily loaded.
> >
> > Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > ---
> > include/linux/sched/topology.h | 13 +++++++++++++
> > kernel/sched/fair.c | 4 +++-
> > kernel/sched/topology.c | 2 +-
> > 3 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index fb11091129b3..43a641d26154 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > atomic_t ref;
> > atomic_t nr_busy_cpus;
> > int has_idle_cores;
> > + /*
> > + * Span of all idle CPUs in this domain.
> > + *
> > + * NOTE: this field is variable length. (Allocated dynamically
> > + * by attaching extra space to the end of the structure,
> > + * depending on how many CPUs the kernel has booted up with)
> > + */
> > + unsigned long idle_cpus_span[];
> > };
> >
> > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > +{
> > + return to_cpumask(sds->idle_cpus_span);
> > +}
> > +
> > struct sched_domain {
> > /* These fields must be setup */
> > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6b3b59cc51d6..3b6f8a3589be 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> > time = cpu_clock(this);
> >
> > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> Is the sds_idle_cpus() always empty if nohz=off?
Good point
> Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
>
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> > sd->nohz_idle = 0;
> >
> > atomic_inc(&sd->shared->nr_busy_cpus);
> > + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> > unlock:
> > rcu_read_unlock();
> > }
> > @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> > sd->nohz_idle = 1;
> >
> > atomic_dec(&sd->shared->nr_busy_cpus);
> > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> This only works when entering/exiting tickless mode? :)
> Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
set_cpu_sd_state_busy() is only called during a tick, in order to limit
the rate of updates to at most once per tick per cpu and to prevent any
storm of updates when short running tasks wake/sleep all the time. We
don't want to update a cpumask on each and every idle enter/exit.
>
> Thx.
> Regards,
> Jiang
>
> > unlock:
> > rcu_read_unlock();
> > }
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9079d865a935..92d0aeef86bf 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >
> > *per_cpu_ptr(sdd->sd, j) = sd;
> >
> > - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> > + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> > GFP_KERNEL, cpu_to_node(j));
> > if (!sds)
> > return -ENOMEM;
> > --
> > 2.25.1
> >
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-10 5:42 ` [RFC PATCH v1 1/1] sched/fair: " Aubrey Li
2020-09-11 16:28 ` Qais Yousef
2020-09-13 3:59 ` Jiang Biao
@ 2020-09-14 12:32 ` Vincent Guittot
2 siblings, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2020-09-14 12:32 UTC (permalink / raw)
To: Aubrey Li
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim Chen, linux-kernel, Aubrey Li
On Fri, 11 Sep 2020 at 15:28, Aubrey Li <aubrey.li@intel.com> wrote:
>
> Added idle cpumask to track idle cpus in sched domain. When a CPU
> enters idle, its corresponding bit in the idle cpumask will be set,
> and when the CPU exits idle, its bit will be cleared.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has low cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> ---
> include/linux/sched/topology.h | 13 +++++++++++++
> kernel/sched/fair.c | 4 +++-
> kernel/sched/topology.c | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fb11091129b3..43a641d26154 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + /*
> + * Span of all idle CPUs in this domain.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus_span[];
> };
>
> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus_span);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b3b59cc51d6..3b6f8a3589be 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> time = cpu_clock(this);
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
sched_domain_shared is set only at the shared cache level, so this works
only because select_idle_cpu() is called with sd_llc. It's worth adding
a comment to describe this.
>
> for_each_cpu_wrap(cpu, cpus, target) {
> if (!--nr)
> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> sd->nohz_idle = 0;
>
> atomic_inc(&sd->shared->nr_busy_cpus);
> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
> }
> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> sd->nohz_idle = 1;
>
> atomic_dec(&sd->shared->nr_busy_cpus);
> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> unlock:
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9079d865a935..92d0aeef86bf 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1769,7 +1769,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.25.1
>
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-14 12:26 ` Vincent Guittot
@ 2020-09-15 8:47 ` Jiang Biao
2020-09-15 9:23 ` Vincent Guittot
0 siblings, 1 reply; 15+ messages in thread
From: Jiang Biao @ 2020-09-15 8:47 UTC (permalink / raw)
To: Vincent Guittot
Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, linux-kernel, Aubrey Li
Hi, Vincent
On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Sun, 13 Sep 2020 at 05:59, Jiang Biao <benbjiang@gmail.com> wrote:
> >
> > Hi, Aubrey
> >
> > On Fri, 11 Sep 2020 at 23:48, Aubrey Li <aubrey.li@intel.com> wrote:
> > >
> > > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > > enters idle, its corresponding bit in the idle cpumask will be set,
> > > and when the CPU exits idle, its bit will be cleared.
> > >
> > > When a task wakes up to select an idle cpu, scanning the idle cpumask
> > > has lower cost than scanning all the cpus in the last level cache domain,
> > > especially when the system is heavily loaded.
> > >
> > > Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > > ---
> > > include/linux/sched/topology.h | 13 +++++++++++++
> > > kernel/sched/fair.c | 4 +++-
> > > kernel/sched/topology.c | 2 +-
> > > 3 files changed, 17 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index fb11091129b3..43a641d26154 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > > atomic_t ref;
> > > atomic_t nr_busy_cpus;
> > > int has_idle_cores;
> > > + /*
> > > + * Span of all idle CPUs in this domain.
> > > + *
> > > + * NOTE: this field is variable length. (Allocated dynamically
> > > + * by attaching extra space to the end of the structure,
> > > + * depending on how many CPUs the kernel has booted up with)
> > > + */
> > > + unsigned long idle_cpus_span[];
> > > };
> > >
> > > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > > +{
> > > + return to_cpumask(sds->idle_cpus_span);
> > > +}
> > > +
> > > struct sched_domain {
> > > /* These fields must be setup */
> > > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6b3b59cc51d6..3b6f8a3589be 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > >
> > > time = cpu_clock(this);
> > >
> > > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> > Is the sds_idle_cpus() always empty if nohz=off?
>
> Good point
>
> > Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
> >
> > >
> > > for_each_cpu_wrap(cpu, cpus, target) {
> > > if (!--nr)
> > > @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> > > sd->nohz_idle = 0;
> > >
> > > atomic_inc(&sd->shared->nr_busy_cpus);
> > > + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> > > unlock:
> > > rcu_read_unlock();
> > > }
> > > @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> > > sd->nohz_idle = 1;
> > >
> > > atomic_dec(&sd->shared->nr_busy_cpus);
> > > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > This only works when entering/exiting tickless mode? :)
> > Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
>
> set_cpu_sd_state_busy is only called during a tick in order to limit
> the rate of the update to once per tick per cpu at most and prevents
> any kind of storm of update if short running tasks wake/sleep all the
> time. We don't want to update a cpumask at each and every enter/leave
> idle.
>
Agree. But set_cpu_sd_state_busy seems not to be reached when
nohz=off, which means it will not work for that case? :)
Thx.
Regards,
Jiang
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-15 8:47 ` Jiang Biao
@ 2020-09-15 9:23 ` Vincent Guittot
2020-09-15 13:38 ` Li, Aubrey
0 siblings, 1 reply; 15+ messages in thread
From: Vincent Guittot @ 2020-09-15 9:23 UTC (permalink / raw)
To: Jiang Biao
Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, linux-kernel, Aubrey Li
On Tue, 15 Sep 2020 at 10:47, Jiang Biao <benbjiang@gmail.com> wrote:
>
> Hi, Vincent
>
> On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Sun, 13 Sep 2020 at 05:59, Jiang Biao <benbjiang@gmail.com> wrote:
> > >
> > > Hi, Aubrey
> > >
> > > On Fri, 11 Sep 2020 at 23:48, Aubrey Li <aubrey.li@intel.com> wrote:
> > > >
> > > > Added idle cpumask to track idle cpus in sched domain. When a CPU
> > > > enters idle, its corresponding bit in the idle cpumask will be set,
> > > > and when the CPU exits idle, its bit will be cleared.
> > > >
> > > > When a task wakes up to select an idle cpu, scanning the idle cpumask
> > > > has lower cost than scanning all the cpus in the last level cache domain,
> > > > especially when the system is heavily loaded.
> > > >
> > > > Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > > > ---
> > > > include/linux/sched/topology.h | 13 +++++++++++++
> > > > kernel/sched/fair.c | 4 +++-
> > > > kernel/sched/topology.c | 2 +-
> > > > 3 files changed, 17 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > > index fb11091129b3..43a641d26154 100644
> > > > --- a/include/linux/sched/topology.h
> > > > +++ b/include/linux/sched/topology.h
> > > > @@ -65,8 +65,21 @@ struct sched_domain_shared {
> > > > atomic_t ref;
> > > > atomic_t nr_busy_cpus;
> > > > int has_idle_cores;
> > > > + /*
> > > > + * Span of all idle CPUs in this domain.
> > > > + *
> > > > + * NOTE: this field is variable length. (Allocated dynamically
> > > > + * by attaching extra space to the end of the structure,
> > > > + * depending on how many CPUs the kernel has booted up with)
> > > > + */
> > > > + unsigned long idle_cpus_span[];
> > > > };
> > > >
> > > > +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> > > > +{
> > > > + return to_cpumask(sds->idle_cpus_span);
> > > > +}
> > > > +
> > > > struct sched_domain {
> > > > /* These fields must be setup */
> > > > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 6b3b59cc51d6..3b6f8a3589be 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > >
> > > > time = cpu_clock(this);
> > > >
> > > > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > > > + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> > > Is the sds_idle_cpus() always empty if nohz=off?
> >
> > Good point
> >
> > > Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
> > >
> > > >
> > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > if (!--nr)
> > > > @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
> > > > sd->nohz_idle = 0;
> > > >
> > > > atomic_inc(&sd->shared->nr_busy_cpus);
> > > > + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
> > > > unlock:
> > > > rcu_read_unlock();
> > > > }
> > > > @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
> > > > sd->nohz_idle = 1;
> > > >
> > > > atomic_dec(&sd->shared->nr_busy_cpus);
> > > > + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> > > This only works when entering/exiting tickless mode? :)
> > > Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
> >
> > set_cpu_sd_state_busy is only called during a tick in order to limit
> > the rate of the update to once per tick per cpu at most and prevents
> > any kind of storm of update if short running tasks wake/sleep all the
> > time. We don't want to update a cpumask at each and every enter/leave
> > idle.
> >
>> Agree. But set_cpu_sd_state_busy seems not to be reached when
>> nohz=off, which means it will not work for that case? :)
Yes, set_cpu_sd_state_idle/busy are nohz functions.
>
> Thx.
> Regards,
> Jiang
* Re: [RFC PATCH v1 1/1] sched/fair: select idle cpu from idle cpumask in sched domain
2020-09-15 9:23 ` Vincent Guittot
@ 2020-09-15 13:38 ` Li, Aubrey
0 siblings, 0 replies; 15+ messages in thread
From: Li, Aubrey @ 2020-09-15 13:38 UTC (permalink / raw)
To: Vincent Guittot, Jiang Biao
Cc: Aubrey Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim Chen, linux-kernel
On 2020/9/15 17:23, Vincent Guittot wrote:
> On Tue, 15 Sep 2020 at 10:47, Jiang Biao <benbjiang@gmail.com> wrote:
>>
>> Hi, Vincent
>>
>> On Mon, 14 Sep 2020 at 20:26, Vincent Guittot
>> <vincent.guittot@linaro.org> wrote:
>>>
>>> On Sun, 13 Sep 2020 at 05:59, Jiang Biao <benbjiang@gmail.com> wrote:
>>>>
>>>> Hi, Aubrey
>>>>
>>>> On Fri, 11 Sep 2020 at 23:48, Aubrey Li <aubrey.li@intel.com> wrote:
>>>>>
>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
>>>>> and when the CPU exits idle, its bit will be cleared.
>>>>>
>>>>> When a task wakes up to select an idle cpu, scanning the idle cpumask
>>>>> has lower cost than scanning all the cpus in the last level cache domain,
>>>>> especially when the system is heavily loaded.
>>>>>
>>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>>> ---
>>>>> include/linux/sched/topology.h | 13 +++++++++++++
>>>>> kernel/sched/fair.c | 4 +++-
>>>>> kernel/sched/topology.c | 2 +-
>>>>> 3 files changed, 17 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>>> index fb11091129b3..43a641d26154 100644
>>>>> --- a/include/linux/sched/topology.h
>>>>> +++ b/include/linux/sched/topology.h
>>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
>>>>> atomic_t ref;
>>>>> atomic_t nr_busy_cpus;
>>>>> int has_idle_cores;
>>>>> + /*
>>>>> + * Span of all idle CPUs in this domain.
>>>>> + *
>>>>> + * NOTE: this field is variable length. (Allocated dynamically
>>>>> + * by attaching extra space to the end of the structure,
>>>>> + * depending on how many CPUs the kernel has booted up with)
>>>>> + */
>>>>> + unsigned long idle_cpus_span[];
>>>>> };
>>>>>
>>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
>>>>> +{
>>>>> + return to_cpumask(sds->idle_cpus_span);
>>>>> +}
>>>>> +
>>>>> struct sched_domain {
>>>>> /* These fields must be setup */
>>>>> struct sched_domain __rcu *parent; /* top domain must be null terminated */
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 6b3b59cc51d6..3b6f8a3589be 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6136,7 +6136,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>>>>
>>>>> time = cpu_clock(this);
>>>>>
>>>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>>> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
>>>> Is the sds_idle_cpus() always empty if nohz=off?
>>>
>>> Good point
>>>
>>>> Do we need to initialize the idle_cpus_span with sched_domain_span(sd)?
>>>>
>>>>>
>>>>> for_each_cpu_wrap(cpu, cpus, target) {
>>>>> if (!--nr)
>>>>> @@ -10182,6 +10182,7 @@ static void set_cpu_sd_state_busy(int cpu)
>>>>> sd->nohz_idle = 0;
>>>>>
>>>>> atomic_inc(&sd->shared->nr_busy_cpus);
>>>>> + cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
>>>>> unlock:
>>>>> rcu_read_unlock();
>>>>> }
>>>>> @@ -10212,6 +10213,7 @@ static void set_cpu_sd_state_idle(int cpu)
>>>>> sd->nohz_idle = 1;
>>>>>
>>>>> atomic_dec(&sd->shared->nr_busy_cpus);
>>>>> + cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
>>>> This only works when entering/exiting tickless mode? :)
>>>> Why not update idle_cpus_span during tick_nohz_idle_enter()/exit()?
>>>
>>> set_cpu_sd_state_busy is only called during a tick in order to limit
>>> the rate of the update to once per tick per cpu at most and prevents
>>> any kind of storm of update if short running tasks wake/sleep all the
>>> time. We don't want to update a cpumask at each and every enter/leave
>>> idle.
>>>
>> Agree. But set_cpu_sd_state_busy seems not to be reached when
>> nohz=off, which means it will not work for that case? :)
>
> Yes, set_cpu_sd_state_idle/busy are nohz functions.
Thanks Biao for pointing this out.
If the shared idle cpumask is initialized with sched_domain_span(sd),
then the nohz=off case will retain the previous behavior.
Thanks,
-Aubrey
end of thread, other threads:[~2020-09-16 0:34 UTC | newest]
Thread overview: 15+ messages
2020-09-10 5:42 [RFC PATCH v1 0/1] select idle cpu from idle cpumask in sched domain Aubrey Li
2020-09-10 5:42 ` [RFC PATCH v1 1/1] sched/fair: " Aubrey Li
2020-09-11 16:28 ` Qais Yousef
2020-09-11 23:04 ` Li, Aubrey
2020-09-11 23:16 ` Li, Aubrey
2020-09-14 10:31 ` Valentin Schneider
2020-09-14 11:08 ` Qais Yousef
2020-09-14 11:26 ` Valentin Schneider
2020-09-14 11:31 ` Qais Yousef
2020-09-13 3:59 ` Jiang Biao
2020-09-14 12:26 ` Vincent Guittot
2020-09-15 8:47 ` Jiang Biao
2020-09-15 9:23 ` Vincent Guittot
2020-09-15 13:38 ` Li, Aubrey
2020-09-14 12:32 ` Vincent Guittot