linux-kernel.vger.kernel.org archive mirror
* hotplug support for arch/arc/plat-eznps platform
@ 2017-08-06  5:53 Ofer Levi(SW)
  2017-08-07  8:33 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-06  5:53 UTC (permalink / raw)
  To: rusty, vatsa, mingo, peterz, Vineet.Gupta1; +Cc: linux-kernel

I'm adding hot-plug support for the above arch and ran into a performance issue with the execution of
partition_sched_domains() - about 0.5 sec per CPU, which is unacceptable with the 4K CPUs this arch supports.
To my limited understanding, on the plat-eznps arch, where each CPU is always running a single-threaded process in
isolation mode, the above call is not necessary.
Can you please confirm (or reject) my understanding? If I'm wrong, a short explanation would be appreciated.
If I'm correct, I would be grateful if you could point me to how to skip this call in a way acceptable to the community.

Thank you.
-Ofer


* Re: hotplug support for arch/arc/plat-eznps platform
  2017-08-06  5:53 hotplug support for arch/arc/plat-eznps platform Ofer Levi(SW)
@ 2017-08-07  8:33 ` Peter Zijlstra
  2017-08-07 13:41   ` Ofer Levi(SW)
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-08-07  8:33 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: rusty, vatsa, mingo, Vineet.Gupta1, linux-kernel

On Sun, Aug 06, 2017 at 05:53:37AM +0000, Ofer Levi(SW) wrote:
> I'm adding hot-plug support for the above arch and ran into a performance issue with the execution of
> partition_sched_domains() - about 0.5 sec per CPU, which is unacceptable with the 4K CPUs this arch supports.
> To my limited understanding, on the plat-eznps arch, where each CPU is always running a single-threaded process in
> isolation mode, the above call is not necessary.
> Can you please confirm (or reject) my understanding? If I'm wrong, a short explanation would be appreciated.
> If I'm correct, I would be grateful if you could point me to how to skip this call in a way acceptable to the community.

You've failed to explain why you think hotplug should be a performance
critical path.

I'm also not seeing how it would be different from boot; you'd be
looking at a similar cost for SMP bringup.


* RE: hotplug support for arch/arc/plat-eznps platform
  2017-08-07  8:33 ` Peter Zijlstra
@ 2017-08-07 13:41   ` Ofer Levi(SW)
  2017-08-07 15:10     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-07 13:41 UTC (permalink / raw)
  To: Peter Zijlstra, rusty, vatsa, mingo; +Cc: Vineet.Gupta1, linux-kernel

> You've failed to explain why you think hotplug should be a performance
> critical path.
1. Hotplug bring-up of 4K CPUs takes 40 minutes, which is way too much for any user.
2. plat-eznps is a network processor, where bring-up time is sensitive.

> I'm also not seeing how it would be different from boot; you'd be
> looking at a similar cost for SMP bringup.
Bring-up time of the 4K CPUs during kernel boot takes 4.5 minutes.
The function in question is only executed after SMP init has been performed.
If I understand correctly, whatever this function does is performed after all CPUs
were brought up during kernel boot.


Thanks
-Ofer

* Re: hotplug support for arch/arc/plat-eznps platform
  2017-08-07 13:41   ` Ofer Levi(SW)
@ 2017-08-07 15:10     ` Peter Zijlstra
  2017-08-08  6:49       ` Ofer Levi(SW)
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-08-07 15:10 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: rusty, vatsa, mingo, Vineet.Gupta1, linux-kernel

On Mon, Aug 07, 2017 at 01:41:38PM +0000, Ofer Levi(SW) wrote:
> > You've failed to explain why you think hotplug should be a performance
> > critical path.
> 1. Hotplug bring-up of 4K CPUs takes 40 minutes, which is way too much for any user.
> 2. plat-eznps is a network processor, where bring-up time is sensitive.

But who is doing the actual hotplug? Why would you ever unplug or plug a CPU
in a time-critical situation?

> > I'm also not seeing how it would be different from boot; you'd be
> > looking at a similar cost for SMP bringup.
> Bring-up time of the 4K CPUs during kernel boot takes 4.5 minutes.
> The function in question is only executed after SMP init has been performed.
> If I understand correctly, whatever this function does is performed after all CPUs
> were brought up during kernel boot.

That doesn't make sense. If you look at smp_init(), boot brings up the CPUs
one at a time.

So how can boot be different from hot-plugging them?


* RE: hotplug support for arch/arc/plat-eznps platform
  2017-08-07 15:10     ` Peter Zijlstra
@ 2017-08-08  6:49       ` Ofer Levi(SW)
  2017-08-08 10:16         ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-08  6:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rusty, vatsa, mingo, Vineet.Gupta1, linux-kernel


> On Monday, August 7, 2017 6:10 PM +0000, Peter Zijlstra wrote:
> 
> On Mon, Aug 07, 2017 at 01:41:38PM +0000, Ofer Levi(SW) wrote:
> > > You've failed to explain why you think hotplug should be a
> > > performance critical path.
> > 1. Hotplug bring-up of 4K CPUs takes 40 minutes, which is way too much
> > for any user.
> > 2. plat-eznps is a network processor, where bring-up time is sensitive.
> 
> But who is doing the actual hotplug? Why would you ever unplug or plug a CPU
> in a time-critical situation?

The idea behind implementing hotplug for this arch is to shorten the time to traffic processing.
This way, instead of waiting ~5 min for all CPUs to boot, the application running on CPU 0 will
loop booting the other CPUs and assigning the traffic-processing application to each of them.
Outgoing traffic will build up until all CPUs are up and running at the full traffic rate.
This method allows traffic processing to start after ~20 sec instead of 5 min.
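
In shell terms, the plan on CPU 0 looks roughly like the sketch below (illustrative only; "traffic_worker" and the use of taskset for pinning are placeholders, not our actual code):

  $ for ((i=1; i<4096; i++))
    do
      # online the next CPU, then pin one traffic worker to it
      echo 1 > /sys/devices/system/cpu/cpu$i/online || continue
      taskset -c $i /usr/bin/traffic_worker &
    done

Each iteration of that loop is where the ~0.5 sec spent in partition_sched_domains() currently shows up.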

> 
> > > I'm also not seeing how it would be different from boot; you'd be
> > > looking at a similar cost for SMP bringup.
> > Bring-up time of the 4K CPUs during kernel boot takes 4.5 minutes.
> > The function in question is only executed after SMP init has been performed.
> > If I understand correctly, whatever this function does is performed
> > after all CPUs were brought up during kernel boot.
> 
> That doesn't make sense. If you look at smp_init(), boot brings up the CPUs
> one at a time.
> 
> So how can boot be different from hot-plugging them?

Please have a look at the following code in kernel/sched/core.c, sched_cpu_activate():

	if (sched_smp_initialized) {
		sched_domains_numa_masks_set(cpu);
		cpuset_cpu_active();
	}
The cpuset_cpu_active() call eventually leads to the function in question, partition_sched_domains().
When cold-booting CPUs the sched_smp_initialized flag is false and therefore partition_sched_domains() is not
executed (the flag is only set at the end of sched_init_smp(), which runs after smp_init() has already brought
up all boot-time CPUs).

This leads me back to my original questions.


Thanks.


* Re: hotplug support for arch/arc/plat-eznps platform
  2017-08-08  6:49       ` Ofer Levi(SW)
@ 2017-08-08 10:16         ` Peter Zijlstra
  2017-08-09 15:19           ` Ofer Levi(SW)
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-08-08 10:16 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

On Tue, Aug 08, 2017 at 06:49:39AM +0000, Ofer Levi(SW) wrote:

> The idea behind implementing hotplug for this arch is to shorten the time
> to traffic processing.  This way, instead of waiting ~5 min for all
> CPUs to boot, the application running on CPU 0 will loop booting the
> other CPUs and assigning the traffic-processing application to each of
> them.  Outgoing traffic will build up until all CPUs are up and running
> at the full traffic rate.  This method allows traffic processing to
> start after ~20 sec instead of 5 min.

Ah, ok. So only online is ever used. Offline is a whole other can of
worms.

> > So how can boot be different from hot-plugging them?
> 
> Please have a look at the following code in kernel/sched/core.c, sched_cpu_activate():
> 
> 	if (sched_smp_initialized) {
> 		sched_domains_numa_masks_set(cpu);
> 		cpuset_cpu_active();
> 	}

Ah, cute, I totally missed we did that. Yes that avoids endless domain
rebuilds on boot.

> The cpuset_cpu_active() call eventually leads to the function in
> question, partition_sched_domains(). When cold-booting CPUs the
> sched_smp_initialized flag is false and therefore
> partition_sched_domains() is not executed.

So you're booting with "maxcpus=1" to only online the one. And then you
want to online the rest once userspace runs.

There are two possibilities. The one I prefer (but which appears the most
broken with the current code) is using the cpuset controller.

1)

  Once you're up and running with a single CPU do:

  $ mkdir /cgroup
  $ mount none /cgroup -t cgroup -o cpuset
  $ echo 0 > /cgroup/cpuset.sched_load_balance
  $ for ((i=1;i<4096;i++))
    do
      echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done

  And then, if you want load-balancing, you can re-enable it globally,
  or only on a subset of CPUs.
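
  For example, re-enabling balancing on just a subset could look something
  like this (illustrative; the group name and CPU range are made up, and
  cpuset details may differ a bit between kernel versions):

  $ mkdir /cgroup/balanced
  $ echo 0-15 > /cgroup/balanced/cpuset.cpus
  $ echo 0 > /cgroup/balanced/cpuset.mems
  $ echo 1 > /cgroup/balanced/cpuset.sched_load_balance

  That should give CPUs 0-15 their own sched domain while the rest of the
  system stays unbalanced.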


2)

  The alternative is to use "isolcpus=1-4095" to completely kill
  load-balancing. This more or less works with the current code,
  except that it will keep rebuilding the CPU0 sched-domain, which
  is somewhat pointless (also fixed by the below patch).

  The reason I don't particularly like this option is that it's boot-time
  only; you cannot reconfigure your system at runtime, but that might
  be good enough for you.


With the attached patch, either option generates (I only have 40 CPUs):

[   44.305563] CPU0 attaching NULL sched-domain.
[   51.954872] SMP alternatives: switching to SMP code
[   51.976923] x86: Booting SMP configuration:
[   51.981602] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   52.057756] microcode: sig=0x306e4, pf=0x1, revision=0x416
[   52.064740] microcode: updated to revision 0x428, date = 2014-05-29
[   52.080854] smpboot: Booting Node 0 Processor 2 APIC 0x4
[   52.164124] smpboot: Booting Node 0 Processor 3 APIC 0x6
[   52.244615] smpboot: Booting Node 0 Processor 4 APIC 0x8
[   52.324564] smpboot: Booting Node 0 Processor 5 APIC 0x10
[   52.405407] smpboot: Booting Node 0 Processor 6 APIC 0x12
[   52.485460] smpboot: Booting Node 0 Processor 7 APIC 0x14
[   52.565333] smpboot: Booting Node 0 Processor 8 APIC 0x16
[   52.645364] smpboot: Booting Node 0 Processor 9 APIC 0x18
[   52.725314] smpboot: Booting Node 1 Processor 10 APIC 0x20
[   52.827517] smpboot: Booting Node 1 Processor 11 APIC 0x22
[   52.912271] smpboot: Booting Node 1 Processor 12 APIC 0x24
[   52.996101] smpboot: Booting Node 1 Processor 13 APIC 0x26
[   53.081239] smpboot: Booting Node 1 Processor 14 APIC 0x28
[   53.164990] smpboot: Booting Node 1 Processor 15 APIC 0x30
[   53.250146] smpboot: Booting Node 1 Processor 16 APIC 0x32
[   53.333894] smpboot: Booting Node 1 Processor 17 APIC 0x34
[   53.419026] smpboot: Booting Node 1 Processor 18 APIC 0x36
[   53.502820] smpboot: Booting Node 1 Processor 19 APIC 0x38
[   53.587938] smpboot: Booting Node 0 Processor 20 APIC 0x1
[   53.659828] microcode: sig=0x306e4, pf=0x1, revision=0x428
[   53.674857] smpboot: Booting Node 0 Processor 21 APIC 0x3
[   53.756346] smpboot: Booting Node 0 Processor 22 APIC 0x5
[   53.836793] smpboot: Booting Node 0 Processor 23 APIC 0x7
[   53.917753] smpboot: Booting Node 0 Processor 24 APIC 0x9
[   53.998717] smpboot: Booting Node 0 Processor 25 APIC 0x11
[   54.079674] smpboot: Booting Node 0 Processor 26 APIC 0x13
[   54.160636] smpboot: Booting Node 0 Processor 27 APIC 0x15
[   54.241592] smpboot: Booting Node 0 Processor 28 APIC 0x17
[   54.322553] smpboot: Booting Node 0 Processor 29 APIC 0x19
[   54.403487] smpboot: Booting Node 1 Processor 30 APIC 0x21
[   54.487676] smpboot: Booting Node 1 Processor 31 APIC 0x23
[   54.571921] smpboot: Booting Node 1 Processor 32 APIC 0x25
[   54.656508] smpboot: Booting Node 1 Processor 33 APIC 0x27
[   54.740835] smpboot: Booting Node 1 Processor 34 APIC 0x29
[   54.824466] smpboot: Booting Node 1 Processor 35 APIC 0x31
[   54.908374] smpboot: Booting Node 1 Processor 36 APIC 0x33
[   54.992322] smpboot: Booting Node 1 Processor 37 APIC 0x35
[   55.076333] smpboot: Booting Node 1 Processor 38 APIC 0x37
[   55.160249] smpboot: Booting Node 1 Processor 39 APIC 0x39


---
Subject: sched,cpuset: Avoid spurious/wrong domain rebuilds

When disabling cpuset.sched_load_balance we expect to be able to online
CPUs without generating sched_domains. However this is currently
completely broken.

What happens is that we generate the sched_domains and then destroy
them. This is because of the spurious 'default' domain build in
cpuset_update_active_cpus(). That builds a single machine-wide domain
and then schedules a work item to build the 'real' domains. The work item
then finds there are _no_ domains and destroys the lot again.

Furthermore, if there actually were cpusets, building the machine-wide
domain is actively wrong, because it would allow tasks to 'escape' their
cpuset. Also, I don't think it's needed; the scheduler really should
respect the active mask.

Also (this should probably be a separate patch) fix
partition_sched_domains() to try and preserve the existing machine-wide
domain instead of unconditionally destroying it. We do this by
attempting to allocate the new single domain; only when that fails do we
reuse the fallback_doms.

Cc: Tejun Heo <tj@kernel.org>
Almost-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/cgroup/cpuset.c  |  6 ------
 kernel/sched/topology.c | 15 ++++++++++++---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca8376e5008c..e557cdba2350 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2342,13 +2342,7 @@ void cpuset_update_active_cpus(void)
 	 * We're inside cpu hotplug critical region which usually nests
 	 * inside cgroup synchronization.  Bounce actual hotplug processing
 	 * to a work item to avoid reverse locking order.
-	 *
-	 * We still need to do partition_sched_domains() synchronously;
-	 * otherwise, the scheduler will get confused and put tasks to the
-	 * dead CPU.  Fall back to the default single domain.
-	 * cpuset_hotplug_workfn() will rebuild it as necessary.
 	 */
-	partition_sched_domains(1, NULL, NULL);
 	schedule_work(&cpuset_hotplug_work);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895aec281e..1b74b2cc5dba 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1854,7 +1854,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	/* Let the architecture update CPU core mappings: */
 	new_topology = arch_update_cpu_topology();
 
-	n = doms_new ? ndoms_new : 0;
+	if (!doms_new) {
+		WARN_ON_ONCE(dattr_new);
+		n = 0;
+		doms_new = alloc_sched_domains(1);
+		if (doms_new) {
+			n = 1;
+			cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
+		}
+	} else {
+		n = ndoms_new;
+	}
 
 	/* Destroy deleted domains: */
 	for (i = 0; i < ndoms_cur; i++) {
@@ -1870,11 +1880,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	}
 
 	n = ndoms_cur;
-	if (doms_new == NULL) {
+	if (!doms_new) {
 		n = 0;
 		doms_new = &fallback_doms;
 		cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
-		WARN_ON_ONCE(dattr_new);
 	}
 
 	/* Build new domains: */


* RE: hotplug support for arch/arc/plat-eznps platform
  2017-08-08 10:16         ` Peter Zijlstra
@ 2017-08-09 15:19           ` Ofer Levi(SW)
  2017-08-09 15:34             ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-09 15:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

I appreciate your effort and detailed reply; however, I'm still experiencing a performance hit in
partition_sched_domains(). It seems the issue is due to the large number of CPUs.
I used the suggested method 2: I patched in the diffs and used the isolcpus command-line switch to
kill load-balancing.
It did save a few hundredths of a second per CPU. When I limited the number of available CPUs
(using the present and possible CPU counts) to 48, it reduced this function's execution time dramatically:

With 4K available cpus :
[   48.890000] ## CPU16 LIVE ##: Executing Code...
[   48.910000] partition_sched_domains start
[   49.360000] partition_sched_domains end

With 48 available cpus:
[   36.950000] ## CPU16 LIVE ##: Executing Code...
[   36.950000] partition_sched_domains start
[   36.960000] partition_sched_domains end

Note that I currently use kernel version 4.8.0.17.0600.00.0000, in case this has any influence.
I would appreciate your thoughts.


Thanks
-Ofer

* Re: hotplug support for arch/arc/plat-eznps platform
  2017-08-09 15:19           ` Ofer Levi(SW)
@ 2017-08-09 15:34             ` Peter Zijlstra
  2017-08-10  7:40               ` Ofer Levi(SW)
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-08-09 15:34 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

On Wed, Aug 09, 2017 at 03:19:02PM +0000, Ofer Levi(SW) wrote:
> I appreciate your effort and detailed reply; however, I'm still experiencing a performance hit in
> partition_sched_domains(). It seems the issue is due to the large number of CPUs.
> I used the suggested method 2: I patched in the diffs and used the isolcpus command-line switch to
> kill load-balancing.
> It did save a few hundredths of a second per CPU. When I limited the number of available CPUs
> (using the present and possible CPU counts) to 48, it reduced this function's execution time dramatically:
> 
> With 4K available cpus :
> [   48.890000] ## CPU16 LIVE ##: Executing Code...
> [   48.910000] partition_sched_domains start
> [   49.360000] partition_sched_domains end
> 
> With 48 available cpus:
> [   36.950000] ## CPU16 LIVE ##: Executing Code...
> [   36.950000] partition_sched_domains start
> [   36.960000] partition_sched_domains end
> 
> Note that I currently use kernel version 4.8.0.17.0600.00.0000, in case this has any influence.
> I would appreciate your thoughts.
> 

Does something like this cure things? It seems we're doing a
possible_cpus iteration for sysctl cruft, and that will most certainly
hurt on your little toy :-)

Not sure what the more generic solution to that would be, but the below
avoids it for isolcpus.
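
As a side note, the sysctl tree in question is visible at runtime when
CONFIG_SCHED_DEBUG and CONFIG_SYSCTL are enabled, so something like:

  $ ls /proc/sys/kernel/sched_domain/ | wc -l

should show one cpuN directory per possible CPU; the whole lot is
unregistered and re-registered on every domain rebuild, which is where
the per-online cost comes from.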

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -85,6 +85,7 @@ int sysctl_sched_rt_runtime = 950000;
 
 /* CPUs with isolated domains */
 cpumask_var_t cpu_isolated_map;
+cpumask_var_t non_isolated_cpus;
 
 /*
  * __task_rq_lock - lock the rq @p resides on.
@@ -5685,8 +5686,6 @@ static inline void sched_init_smt(void)
 
 void __init sched_init_smp(void)
 {
-	cpumask_var_t non_isolated_cpus;
-
 	alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
 
 	sched_init_numa();
@@ -5697,17 +5696,17 @@ void __init sched_init_smp(void)
 	 * happen.
 	 */
 	mutex_lock(&sched_domains_mutex);
-	sched_init_domains(cpu_active_mask);
 	cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
 	if (cpumask_empty(non_isolated_cpus))
 		cpumask_set_cpu(smp_processor_id(), non_isolated_cpus);
+
+	sched_init_domains(cpu_active_mask);
 	mutex_unlock(&sched_domains_mutex);
 
 	/* Move init over to a non-isolated CPU */
 	if (set_cpus_allowed_ptr(current, non_isolated_cpus) < 0)
 		BUG();
 	sched_init_granularity();
-	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
 	init_sched_dl_class();
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -327,6 +327,8 @@ static struct ctl_table *sd_alloc_ctl_cp
 	return table;
 }
 
+extern cpumask_var_t non_isolated_cpus;
+
 static struct ctl_table_header *sd_sysctl_header;
 void register_sched_domain_sysctl(void)
 {
@@ -340,7 +342,7 @@ void register_sched_domain_sysctl(void)
 	if (entry == NULL)
 		return;
 
-	for_each_possible_cpu(i) {
+	for_each_cpu(i, non_isolated_cpus) {
 		snprintf(buf, 32, "cpu%d", i);
 		entry->procname = kstrdup(buf, GFP_KERNEL);
 		entry->mode = 0555;


* RE: hotplug support for arch/arc/plat-eznps platform
  2017-08-09 15:34             ` Peter Zijlstra
@ 2017-08-10  7:40               ` Ofer Levi(SW)
  2017-08-10  9:19                 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-10  7:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

Well, this has definitely pleased the little toy :)
Thank you. I really appreciate your time and effort.

If I may ask one more newbie question: what do I need to do for the two patches to find
their way into the official kernel tree?

Thanks
-Ofer

* Re: hotplug support for arch/arc/plat-eznps platform
  2017-08-10  7:40               ` Ofer Levi(SW)
@ 2017-08-10  9:19                 ` Peter Zijlstra
  2017-08-10 11:51                   ` Ofer Levi(SW)
  2017-08-10 15:45                   ` Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-08-10  9:19 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

On Thu, Aug 10, 2017 at 07:40:16AM +0000, Ofer Levi(SW) wrote:
> Well, this has definitely pleased the little toy :)
> Thank you. I really appreciate your time and effort.
> 
> If I may ask one more newbie question: what do I need to do for the two patches to find
> their way into the official kernel tree?

I'll split the first patch into two separate patches and line them up.

I'm not sure about this last patch; I'll speak with Ingo once he's back
to see what would be the thing to do here.

I suspect we can make it work; that sysctl stuff is only debug crud
after all, and it should never get in the way of getting work done.


* RE: hotplug support for arch/arc/plat-eznps platform
  2017-08-10  9:19                 ` Peter Zijlstra
@ 2017-08-10 11:51                   ` Ofer Levi(SW)
  2017-08-10 15:45                   ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-10 11:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

Great, thanks.

* Re: hotplug support for arch/arc/plat-eznps platform
  2017-08-10  9:19                 ` Peter Zijlstra
  2017-08-10 11:51                   ` Ofer Levi(SW)
@ 2017-08-10 15:45                   ` Peter Zijlstra
  2017-08-14  7:54                     ` Ofer Levi(SW)
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-08-10 15:45 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

On Thu, Aug 10, 2017 at 11:19:05AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 07:40:16AM +0000, Ofer Levi(SW) wrote:
> > Well, this has definitely pleased the little toy :) Thank you. I
> > Thank you. I really appreciate your time and effort.
> > 
> > If I may ask one more newbie question: what do I need to do for the two
> > patches to find their way into the official kernel tree?
> 
> I'll split the first patch into two separate patches and line them up.
> 
> I'm not sure about this last patch; I'll speak with Ingo once he's
> back to see what would be the thing to do here.
> 
> I suspect we can make it work; that sysctl stuff is only debug crud
> after all, and it should never get in the way of getting work done.

Can you test this instead of the second patch? It should have the same
effect.

BTW, what physical size does your toy have? I'm thinking it's less than
multiple racks' worth, like the SGI systems were.

---
Subject: sched/debug: Optimize sched_domain sysctl generation
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Aug 10 17:10:26 CEST 2017

Currently we unconditionally destroy all sysctl bits and regenerate
them after we've rebuilt the domains (even if that rebuild is a
no-op).

And since we unconditionally (re)build the sysctl entries for all possible
CPUs, onlining all CPUs gets us O(n^2) time. Instead, change this to
only rebuild the bits for CPUs we've actually installed new domains
on.

Reported-by: "Ofer Levi(SW)" <oferle@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c    |   68 ++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h    |    4 ++
 kernel/sched/topology.c |    1 
 3 files changed, 59 insertions(+), 14 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -327,38 +327,78 @@ static struct ctl_table *sd_alloc_ctl_cp
 	return table;
 }
 
+static cpumask_var_t sd_sysctl_cpus;
 static struct ctl_table_header *sd_sysctl_header;
+
 void register_sched_domain_sysctl(void)
 {
-	int i, cpu_num = num_possible_cpus();
-	struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
+	static struct ctl_table *cpu_entries;
+	static struct ctl_table **cpu_idx;
 	char buf[32];
+	int i;
+
+	if (!cpu_entries) {
+		cpu_entries = sd_alloc_ctl_entry(num_possible_cpus() + 1);
+		if (!cpu_entries)
+			return;
+
+		WARN_ON(sd_ctl_dir[0].child);
+		sd_ctl_dir[0].child = cpu_entries;
+	}
+
+	if (!cpu_idx) {
+		struct ctl_table *e = cpu_entries;
+
+		cpu_idx = kcalloc(nr_cpu_ids, sizeof(struct ctl_table*), GFP_KERNEL);
+		if (!cpu_idx)
+			return;
+
+		/* deal with sparse possible map */
+		for_each_possible_cpu(i) {
+			cpu_idx[i] = e;
+			e++;
+		}
+	}
 
-	WARN_ON(sd_ctl_dir[0].child);
-	sd_ctl_dir[0].child = entry;
+	if (!cpumask_available(sd_sysctl_cpus)) {
+		if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
+			return;
 
-	if (entry == NULL)
-		return;
+		/* init to possible to not have holes in @cpu_entries */
+		cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
+	}
+
+	for_each_cpu(i, sd_sysctl_cpus) {
+		struct ctl_table *e = cpu_idx[i];
+
+		if (e->child)
+			sd_free_ctl_entry(&e->child);
+
+		if (!e->procname) {
+			snprintf(buf, 32, "cpu%d", i);
+			e->procname = kstrdup(buf, GFP_KERNEL);
+		}
+		e->mode = 0555;
+		e->child = sd_alloc_ctl_cpu_table(i);
 
-	for_each_possible_cpu(i) {
-		snprintf(buf, 32, "cpu%d", i);
-		entry->procname = kstrdup(buf, GFP_KERNEL);
-		entry->mode = 0555;
-		entry->child = sd_alloc_ctl_cpu_table(i);
-		entry++;
+		__cpumask_clear_cpu(i, sd_sysctl_cpus);
 	}
 
 	WARN_ON(sd_sysctl_header);
 	sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }
 
+void dirty_sched_domain_sysctl(int cpu)
+{
+	if (cpumask_available(sd_sysctl_cpus))
+		__cpumask_set_cpu(cpu, sd_sysctl_cpus);
+}
+
 /* may be called multiple times per register */
 void unregister_sched_domain_sysctl(void)
 {
 	unregister_sysctl_table(sd_sysctl_header);
 	sd_sysctl_header = NULL;
-	if (sd_ctl_dir[0].child)
-		sd_free_ctl_entry(&sd_ctl_dir[0].child);
 }
 #endif /* CONFIG_SYSCTL */
 #endif /* CONFIG_SMP */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1120,11 +1120,15 @@ extern int group_balance_cpu(struct sche
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
 void register_sched_domain_sysctl(void);
+void dirty_sched_domain_sysctl(int cpu);
 void unregister_sched_domain_sysctl(void);
 #else
 static inline void register_sched_domain_sysctl(void)
 {
 }
+static inline void dirty_sched_domain_sysctl(int cpu)
+{
+}
 static inline void unregister_sched_domain_sysctl(void)
 {
 }
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -461,6 +461,7 @@ cpu_attach_domain(struct sched_domain *s
 	rq_attach_root(rq, rd);
 	tmp = rq->sd;
 	rcu_assign_pointer(rq->sd, sd);
+	dirty_sched_domain_sysctl(cpu);
 	destroy_sched_domains(tmp);
 
 	update_top_cache_domain(cpu);


* RE: hotplug support for arch/arc/plat-eznps platform
  2017-08-10 15:45                   ` Peter Zijlstra
@ 2017-08-14  7:54                     ` Ofer Levi(SW)
  0 siblings, 0 replies; 13+ messages in thread
From: Ofer Levi(SW) @ 2017-08-14  7:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: rusty, mingo, Vineet.Gupta1, linux-kernel, Tejun Heo

Sorry for the late response, but this patch is a step backwards: it's back to about 0.4 sec per CPU bring-up.
This is with possible, present, and isolcpus set to 16-4095.
Most of the time is spent in register_sched_domain_sysctl(), in the call:
sd_sysctl_header = register_sysctl_table(sd_ctl_root);

[   22.150000] ## CPU16 LIVE ##: Executing Code...
[   22.170000] partition_sched_domains start
[   22.220000] register_sched_domain_sysctl start
[   22.580000] register_sched_domain_sysctl end
[   22.580000] partition_sched_domains end


> BTW, what physical size does your toy have? I'm thinking it's less than
> multiple racks' worth, like the SGI systems were.
It's a single chip with 4K CPUs, capable of 400 Gbps duplex. The evaluation board is pizza-box sized.

Thanks
