* [PATCH] x86: Reduce clock calibration time during slave cpu startup
@ 2011-07-27 13:57 Jack Steiner
  2011-07-27 14:05 ` Dave Jones
  2011-08-26 23:56 ` [PATCH] " Andrew Morton
  0 siblings, 2 replies; 19+ messages in thread
From: Jack Steiner @ 2011-07-27 13:57 UTC (permalink / raw)
  To: mingo, tglx, akpm; +Cc: linux-kernel

Reduce the startup time for slave cpus.

This patch adds hooks for an arch-specific function for clock calibration.
These hooks are used on x86. They assume all cores in a physical socket
run at the same core speed. If a newly started cpu has the same phys_proc_id
as a core already active, use the already-calculated value of loops_per_jiffy.

This patch reduces the time required to start slave cpus on a 4096 cpu
system from:
	465 sec  OLD
	 62 sec NEW

This reduces boot time on a 4096p system by almost 7 minutes.  Nice...


Signed-off-by: Jack Steiner <steiner@sgi.com>


---
Note: patch assumes that all multi-core x86 processor sockets have the same
clock frequency for all cores. AFAIK, this is true & will continue
to be true for a long time. Have I overlooked anything???

Not sure who takes the patch. It's mostly x86 code but it does affect
generic code.


 arch/x86/kernel/smpboot.c |   31 ++++++++++++++++++++++++++-----
 init/calibrate.c          |   16 ++++++++++++++++
 2 files changed, 42 insertions(+), 5 deletions(-)

Index: linux/arch/x86/kernel/smpboot.c
===================================================================
--- linux.orig/arch/x86/kernel/smpboot.c	2011-07-26 08:01:11.611979781 -0500
+++ linux/arch/x86/kernel/smpboot.c	2011-07-27 08:38:04.832002562 -0500
@@ -207,23 +207,29 @@ static void __cpuinit smp_callin(void)
 	 * Need to setup vector mappings before we enable interrupts.
 	 */
 	setup_vector_irq(smp_processor_id());
+
+	/*
+	 * Save our processor parameters. Note: this information
+	 * is needed for clock calibration.
+	 */
+	smp_store_cpu_info(cpuid);
+
 	/*
 	 * Get our bogomips.
+	 * Update loops_per_jiffy in cpu_data. Previous call to
+	 * smp_store_cpu_info() stored a value that is close but not as
+	 * accurate as the value just calculated.
 	 *
 	 * Need to enable IRQs because it can take longer and then
 	 * the NMI watchdog might kill us.
 	 */
 	local_irq_enable();
 	calibrate_delay();
+	cpu_data(cpuid).loops_per_jiffy = loops_per_jiffy;
 	local_irq_disable();
 	pr_debug("Stack at about %p\n", &cpuid);
 
 	/*
-	 * Save our processor parameters
-	 */
-	smp_store_cpu_info(cpuid);
-
-	/*
 	 * This must be done before setting cpu_online_mask
 	 * or calling notify_cpu_starting.
 	 */
@@ -239,6 +245,21 @@ static void __cpuinit smp_callin(void)
 }
 
 /*
+ * Check if another cpu is in the same socket and has already been calibrated.
+ * If found, use the previous value. This assumes all cores in the same physical
+ * socket have the same core frequency.
+ */
+unsigned long __cpuinit calibrate_delay_is_known(void)
+{
+	int i, cpu = smp_processor_id();
+
+	for_each_online_cpu(i)
+		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
+			return cpu_data(i).loops_per_jiffy;
+	return 0;
+}
+
+/*
  * Activate a secondary processor.
  */
 notrace static void __cpuinit start_secondary(void *unused)
Index: linux/init/calibrate.c
===================================================================
--- linux.orig/init/calibrate.c	2011-07-26 08:01:15.571979739 -0500
+++ linux/init/calibrate.c	2011-07-27 08:39:35.691983745 -0500
@@ -243,6 +243,20 @@ recalibrate:
 	return lpj;
 }
 
+/*
+ * Check if cpu calibration delay is already known. For example,
+ * some processors with multi-core sockets may have all sockets
+ * use the same core frequency. It is not necessary to calibrate
+ * each core.
+ *
+ * Architectures should override this function if a faster calibration
+ * method is available.
+ */
+unsigned long __attribute__((weak)) __cpuinit calibrate_delay_is_known(void)
+{
+	return 0;
+}
+
 void __cpuinit calibrate_delay(void)
 {
 	unsigned long lpj;
@@ -257,6 +271,8 @@ void __cpuinit calibrate_delay(void)
 		lpj = lpj_fine;
 		pr_info("Calibrating delay loop (skipped), "
 			"value calculated using timer frequency.. ");
+	} else if ((lpj = calibrate_delay_is_known())) {
+		;
 	} else if ((lpj = calibrate_delay_direct()) != 0) {
 		if (!printed)
 			pr_info("Calibrating delay using timer "

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86: Reduce clock calibration time during slave cpu startup
  2011-07-27 13:57 [PATCH] x86: Reduce clock calibration time during slave cpu startup Jack Steiner
@ 2011-07-27 14:05 ` Dave Jones
       [not found]   ` <20110727141527.GA8453@sgi.com>
  2011-08-26 23:56 ` [PATCH] " Andrew Morton
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Jones @ 2011-07-27 14:05 UTC (permalink / raw)
  To: Jack Steiner; +Cc: mingo, tglx, akpm, linux-kernel

On Wed, Jul 27, 2011 at 08:57:31AM -0500, Jack Steiner wrote:

 > Note: patch assumes that all multi-core x86 processor sockets have the same
 > clock frequency for all cores. AFAIK, this is true & will continue
 > to be true for a long time. 

not necessarily true. (the 'long time' part might be sooner than you think)

	Dave



* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
       [not found]     ` <20110727155200.GA25381@redhat.com>
@ 2011-08-01 18:45       ` Jack Steiner
  2011-08-05 10:46         ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Jack Steiner @ 2011-08-01 18:45 UTC (permalink / raw)
  To: mingo, tglx; +Cc: davej, yinghan, linux-kernel


Reduce the startup time for slave cpus.

This patch adds hooks for an arch-specific function for clock calibration.
These hooks are used on x86. If a newly started cpu has the same phys_proc_id
as a core already active, use the already-calculated value of loops_per_jiffy.

The new code is used only on processor sockets that use the TSC for
delay loops and have X86_FEATURE_CONSTANT_TSC.


This patch reduces the time required to start slave cpus on a 4096 cpu system
from:
	465 sec  OLD
	 62 sec NEW

This reduces boot time on a 4096p system by almost 7 minutes.  Nice...


Signed-off-by: Jack Steiner <steiner@sgi.com>


---
V2 - remove assumption that all cores in a socket have the same core frequency.
     the patch is enabled only if delay() uses the TSC & socket is CONSTANT_TSC.

Also see https://lkml.org/lkml/2010/12/14/511 for a previous discussion
for similar improvement. This patch gives almost equivalent performance
and is less intrusive.


 arch/x86/kernel/smpboot.c |   16 +++++++++++-----
 arch/x86/kernel/tsc.c     |   18 ++++++++++++++++++
 init/calibrate.c          |   16 ++++++++++++++++
 3 files changed, 45 insertions(+), 5 deletions(-)

Index: linux/arch/x86/kernel/smpboot.c
===================================================================
--- linux.orig/arch/x86/kernel/smpboot.c	2011-07-26 08:01:11.611979781 -0500
+++ linux/arch/x86/kernel/smpboot.c	2011-08-01 13:25:42.544605719 -0500
@@ -207,23 +207,29 @@ static void __cpuinit smp_callin(void)
 	 * Need to setup vector mappings before we enable interrupts.
 	 */
 	setup_vector_irq(smp_processor_id());
+
+	/*
+	 * Save our processor parameters. Note: this information
+	 * is needed for clock calibration.
+	 */
+	smp_store_cpu_info(cpuid);
+
 	/*
 	 * Get our bogomips.
+	 * Update loops_per_jiffy in cpu_data. Previous call to
+	 * smp_store_cpu_info() stored a value that is close but not as
+	 * accurate as the value just calculated.
 	 *
 	 * Need to enable IRQs because it can take longer and then
 	 * the NMI watchdog might kill us.
 	 */
 	local_irq_enable();
 	calibrate_delay();
+	cpu_data(cpuid).loops_per_jiffy = loops_per_jiffy;
 	local_irq_disable();
 	pr_debug("Stack at about %p\n", &cpuid);
 
 	/*
-	 * Save our processor parameters
-	 */
-	smp_store_cpu_info(cpuid);
-
-	/*
 	 * This must be done before setting cpu_online_mask
 	 * or calling notify_cpu_starting.
 	 */
Index: linux/arch/x86/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc.c	2011-08-01 12:45:06.000000000 -0500
+++ linux/arch/x86/kernel/tsc.c	2011-08-01 13:26:20.244043188 -0500
@@ -995,3 +995,21 @@ void __init tsc_init(void)
 	check_system_tsc_reliable();
 }
 
+/*
+ * Check if another cpu is in the same socket and has already been calibrated.
+ * If found, use the previous value. This assumes all cores in the same physical
+ * socket have the same core frequency.
+ */
+unsigned long __cpuinit calibrate_delay_is_known(void)
+{
+	int i, cpu = smp_processor_id();
+
+	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
+		return 0;
+
+	for_each_online_cpu(i)
+		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
+			return cpu_data(i).loops_per_jiffy;
+	return 0;
+}
+
Index: linux/init/calibrate.c
===================================================================
--- linux.orig/init/calibrate.c	2011-07-26 08:01:15.571979739 -0500
+++ linux/init/calibrate.c	2011-07-27 08:39:35.691983745 -0500
@@ -243,6 +243,20 @@ recalibrate:
 	return lpj;
 }
 
+/*
+ * Check if cpu calibration delay is already known. For example,
+ * some processors with multi-core sockets may have all sockets
+ * use the same core frequency. It is not necessary to calibrate
+ * each core.
+ *
+ * Architectures should override this function if a faster calibration
+ * method is available.
+ */
+unsigned long __attribute__((weak)) __cpuinit calibrate_delay_is_known(void)
+{
+	return 0;
+}
+
 void __cpuinit calibrate_delay(void)
 {
 	unsigned long lpj;
@@ -257,6 +271,8 @@ void __cpuinit calibrate_delay(void)
 		lpj = lpj_fine;
 		pr_info("Calibrating delay loop (skipped), "
 			"value calculated using timer frequency.. ");
+	} else if ((lpj = calibrate_delay_is_known())) {
+		;
 	} else if ((lpj = calibrate_delay_direct()) != 0) {
 		if (!printed)
 			pr_info("Calibrating delay using timer "


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-01 18:45       ` [PATCH v2] " Jack Steiner
@ 2011-08-05 10:46         ` Ingo Molnar
  2011-08-05 13:13           ` Jack Steiner
  2011-08-05 13:16           ` Jack Steiner
  0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2011-08-05 10:46 UTC (permalink / raw)
  To: Jack Steiner; +Cc: tglx, davej, yinghan, linux-kernel


* Jack Steiner <steiner@sgi.com> wrote:

> +/*
> + * Check if another cpu is in the same socket and has already been calibrated.
> + * If found, use the previous value. This assumes all cores in the same physical
> + * socket have the same core frequency.
> + */
> +unsigned long __cpuinit calibrate_delay_is_known(void)
> +{
> +	int i, cpu = smp_processor_id();
> +
> +	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
> +		return 0;
> +
> +	for_each_online_cpu(i)
> +		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
> +			return cpu_data(i).loops_per_jiffy;

Hm, why do we have to make such an assumption? Cannot we query the 
core frequency?

Thanks,

	Ingo


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-05 10:46         ` Ingo Molnar
@ 2011-08-05 13:13           ` Jack Steiner
  2011-08-05 13:16           ` Jack Steiner
  1 sibling, 0 replies; 19+ messages in thread
From: Jack Steiner @ 2011-08-05 13:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, davej, yinghan, linux-kernel

On Fri, Aug 05, 2011 at 12:46:35PM +0200, Ingo Molnar wrote:
> 
> * Jack Steiner <steiner@sgi.com> wrote:
> 
> > +/*
> > + * Check if another cpu is in the same socket and has already been calibrated.
> > + * If found, use the previous value. This assumes all cores in the same physical
> > + * socket have the same core frequency.
> > + */
> > +unsigned long __cpuinit calibrate_delay_is_known(void)
> > +{
> > +	int i, cpu = smp_processor_id();
> > +
> > +	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
> > +		return 0;
> > +
> > +	for_each_online_cpu(i)
> > +		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
> > +			return cpu_data(i).loops_per_jiffy;
> 
> Hm, why do we have to make such an assumption? Cannot we query the 
> core frequency?

See V2 of the patch. The above assumption was removed & replaced by a check for
X86_FEATURE_CONSTANT_TSC & using the TSC for __delay(). If all cores see a
constant TSC frequency, then core frequency should not matter.

Does this make sense....


--- jack


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-05 10:46         ` Ingo Molnar
  2011-08-05 13:13           ` Jack Steiner
@ 2011-08-05 13:16           ` Jack Steiner
  2011-08-05 21:38             ` Ingo Molnar
  2011-08-06  0:21             ` [PATCH v2] " Yinghai Lu
  1 sibling, 2 replies; 19+ messages in thread
From: Jack Steiner @ 2011-08-05 13:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, davej, yinghan, linux-kernel

On Fri, Aug 05, 2011 at 12:46:35PM +0200, Ingo Molnar wrote:
> 
> * Jack Steiner <steiner@sgi.com> wrote:
> 
> > +/*
> > + * Check if another cpu is in the same socket and has already been calibrated.
> > + * If found, use the previous value. This assumes all cores in the same physical
> > + * socket have the same core frequency.
> > + */
> > +unsigned long __cpuinit calibrate_delay_is_known(void)
> > +{
> > +	int i, cpu = smp_processor_id();
> > +
> > +	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
> > +		return 0;
> > +
> > +	for_each_online_cpu(i)
> > +		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
> > +			return cpu_data(i).loops_per_jiffy;
> 
> Hm, why do we have to make such an assumption? Cannot we query the 
> core frequency?

>> See V2 of the patch. The above assumption was removed & replaced by a check for
>> X86_FEATURE_CONSTANT_TSC & using the TSC for __delay(). If all cores see a
>> constant TSC frequency, then core frequency should not matter.
>> 
>> Does this make sense....


Ahhh..  I see what you mean. I failed to update the comment in the code.
Aside from the bogus comment (I'll send a V3 with a fixed comment), does the patch
look ok.


--- jack


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-05 13:16           ` Jack Steiner
@ 2011-08-05 21:38             ` Ingo Molnar
  2011-08-07  0:36               ` Matthew Garrett
  2011-08-06  0:21             ` [PATCH v2] " Yinghai Lu
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2011-08-05 21:38 UTC (permalink / raw)
  To: Jack Steiner; +Cc: tglx, davej, yinghan, linux-kernel


* Jack Steiner <steiner@sgi.com> wrote:

> On Fri, Aug 05, 2011 at 12:46:35PM +0200, Ingo Molnar wrote:
> > 
> > * Jack Steiner <steiner@sgi.com> wrote:
> > 
> > > +/*
> > > + * Check if another cpu is in the same socket and has already been calibrated.
> > > + * If found, use the previous value. This assumes all cores in the same physical
> > > + * socket have the same core frequency.
> > > + */
> > > +unsigned long __cpuinit calibrate_delay_is_known(void)
> > > +{
> > > +	int i, cpu = smp_processor_id();
> > > +
> > > +	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
> > > +		return 0;
> > > +
> > > +	for_each_online_cpu(i)
> > > +		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
> > > +			return cpu_data(i).loops_per_jiffy;
> > 
> > Hm, why do we have to make such an assumption? Cannot we query the 
> > core frequency?
> 
> >> See V2 of the patch. The above assumption was removed & replaced by a check for
> >> X86_FEATURE_CONSTANT_TSC & using the TSC for __delay(). If all cores see a
> >> constant TSC frequency, then core frequency should not matter.
> >> 
> >> Does this make sense....
> 
> 
> Ahhh..  I see what you mean. I failed to update the comment in the code.
> Aside from the bogus comment (I'll send a V3 with a fixed comment), does the patch
> look ok.

Well, it still uses heuristics: it assumes frequency is the same when 
the cpuid data tells us that two CPUs are on the same socket, right?

Cannot we directly see the frequency(ies) of the CPU, and decide 
based on that whether the calibration result can be cached?

Thanks,

	Ingo


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-05 13:16           ` Jack Steiner
  2011-08-05 21:38             ` Ingo Molnar
@ 2011-08-06  0:21             ` Yinghai Lu
  2011-08-06  6:58               ` Ingo Molnar
                                 ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Yinghai Lu @ 2011-08-06  0:21 UTC (permalink / raw)
  To: Jack Steiner, Robin Holt; +Cc: Ingo Molnar, tglx, davej, yinghan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 419 bytes --]

On Fri, Aug 5, 2011 at 6:16 AM, Jack Steiner <steiner@sgi.com> wrote:
> Aside from the bogus comment (I'll send a V3 with a fixed comment), does the patch
> look ok.

Several months ago, Robin said that he would test an updated version:

[PATCH] x86: Make calibrate_delay run in parallel.

so is there any reason that you SGI guys stopped that path?

Please check the attached patch, updated against current tip.

Thanks

Yinghai Lu

[-- Attachment #2: Make-x86-calibrate_delay-run-in-parallel_v4.patch --]
[-- Type: text/x-patch, Size: 9170 bytes --]

[PATCH -v4] x86: Make calibrate_delay run in parallel.

On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
up the cpus.  By specifying lpj=<value>, we reduced that to 75 seconds.
Andi Kleen suggested we rework the calibrate_delay calls to run in
parallel.

-v2: from Yinghai
     two paths: one for the initial boot cpus and one for hotplug cpus.
	initial path:
	  after all cpus have booted and entered idle, use
	  smp_call_function_many to let every AP call __calibrate_delay.
	  We can not run calibrate_delay after local_irq_enable in
	  start_secondary(): at that point the cpu could be involved
	  with perf_event while the nmi_watchdog is enabled, which
	  causes strange calibration results.
	hotplug path:
	  need to add a cpu notifier block.
     add __calibrate_delay instead of changing calibrate_delay all over.
     use cpu_calibrated_delay_mask to track which cpus have finished.
     use print_lpj so each cpu's line is printed complete.

Signed-off-by: Robin Holt <holt@sgi.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/cpumask.h |    1 
 arch/x86/kernel/cpu/common.c   |    3 ++
 arch/x86/kernel/smpboot.c      |   58 ++++++++++++++++++++++++++++++++++-------
 include/linux/delay.h          |    1 
 init/calibrate.c               |   54 +++++++++++++++++++-------------------
 5 files changed, 81 insertions(+), 36 deletions(-)


--
Index: linux-2.6/arch/x86/include/asm/cpumask.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/cpumask.h
+++ linux-2.6/arch/x86/include/asm/cpumask.h
@@ -6,6 +6,7 @@
 extern cpumask_var_t cpu_callin_mask;
 extern cpumask_var_t cpu_callout_mask;
 extern cpumask_var_t cpu_initialized_mask;
+extern cpumask_var_t cpu_calibrated_delay_mask;
 extern cpumask_var_t cpu_sibling_setup_mask;
 
 extern void setup_cpu_local_masks(void);
Index: linux-2.6/arch/x86/kernel/cpu/common.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/common.c
+++ linux-2.6/arch/x86/kernel/cpu/common.c
@@ -45,6 +45,7 @@
 cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
 cpumask_var_t cpu_callin_mask;
+cpumask_var_t cpu_calibrated_delay_mask;
 
 /* representing cpus for which sibling maps can be computed */
 cpumask_var_t cpu_sibling_setup_mask;
@@ -58,6 +59,8 @@ void __init setup_cpu_local_masks(void)
 	alloc_bootmem_cpumask_var(&cpu_callin_mask);
 	set_bootmem_name("cpu_callout_mask");
 	alloc_bootmem_cpumask_var(&cpu_callout_mask);
+	set_bootmem_name("cpu_calibrated_delay_mask");
+	alloc_bootmem_cpumask_var(&cpu_calibrated_delay_mask);
 	set_bootmem_name("cpu_sibling_setup_mask");
 	alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
 }
Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6/arch/x86/kernel/smpboot.c
@@ -52,6 +52,7 @@
 #include <linux/gfp.h>
 
 #include <asm/acpi.h>
+#include <asm/cpumask.h>
 #include <asm/desc.h>
 #include <asm/nmi.h>
 #include <asm/irq.h>
@@ -210,15 +211,7 @@ static void __cpuinit smp_callin(void)
 	 * Need to setup vector mappings before we enable interrupts.
 	 */
 	setup_vector_irq(smp_processor_id());
-	/*
-	 * Get our bogomips.
-	 *
-	 * Need to enable IRQs because it can take longer and then
-	 * the NMI watchdog might kill us.
-	 */
-	local_irq_enable();
-	calibrate_delay();
-	local_irq_disable();
+
 	pr_debug("Stack at about %p\n", &cpuid);
 
 	/*
@@ -1038,6 +1031,8 @@ void __init native_smp_prepare_cpus(unsi
 	}
 	set_cpu_sibling_map(0);
 
+	/* already called earlier for boot cpu */
+	cpumask_set_cpu(0, cpu_calibrated_delay_mask);
 
 	if (smp_sanity_check(max_cpus) < 0) {
 		printk(KERN_INFO "SMP disabled\n");
@@ -1126,8 +1121,53 @@ void __init native_smp_prepare_boot_cpu(
 	per_cpu(cpu_state, me) = CPU_ONLINE;
 }
 
+static void __cpuinit calibrate_delay_fn(void *info)
+{
+	int cpu = smp_processor_id();
+
+	cpu_data(cpu).loops_per_jiffy = __calibrate_delay(cpu);
+	cpumask_set_cpu(cpu, cpu_calibrated_delay_mask);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int __cpuinit
+cal_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+
+	switch (action) {
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+		smp_call_function_single(cpu, calibrate_delay_fn, NULL, 1);
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static __cpuinitdata struct notifier_block __cpuinitdata cal_cpu_nfb = {
+	.notifier_call = cal_cpu_callback
+};
+
+static void __init register_cal_cpu_nfb(void)
+{
+	register_cpu_notifier(&cal_cpu_nfb);
+}
+#else
+static void __init register_cal_cpu_nfb(void)
+{
+}
+#endif
+
 void __init native_smp_cpus_done(unsigned int max_cpus)
 {
+	smp_call_function_many(cpu_online_mask, calibrate_delay_fn, NULL, 0);
+	while (cpumask_weight(cpu_calibrated_delay_mask) != num_online_cpus()) {
+		cpu_relax();
+		touch_nmi_watchdog();
+	}
+	register_cal_cpu_nfb();
+
 	pr_debug("Boot done.\n");
 
 	impress_friends();
Index: linux-2.6/include/linux/delay.h
===================================================================
--- linux-2.6.orig/include/linux/delay.h
+++ linux-2.6/include/linux/delay.h
@@ -43,6 +43,7 @@ static inline void ndelay(unsigned long
 
 extern unsigned long lpj_fine;
 void calibrate_delay(void);
+unsigned long __calibrate_delay(int cpu);
 void msleep(unsigned int msecs);
 unsigned long msleep_interruptible(unsigned int msecs);
 void usleep_range(unsigned long min, unsigned long max);
Index: linux-2.6/init/calibrate.c
===================================================================
--- linux-2.6.orig/init/calibrate.c
+++ linux-2.6/init/calibrate.c
@@ -31,7 +31,7 @@ __setup("lpj=", lpj_setup);
 #define DELAY_CALIBRATION_TICKS			((HZ < 100) ? 1 : (HZ/100))
 #define MAX_DIRECT_CALIBRATION_RETRIES		5
 
-static unsigned long __cpuinit calibrate_delay_direct(void)
+static unsigned long __cpuinit calibrate_delay_direct(int cpu)
 {
 	unsigned long pre_start, start, post_start;
 	unsigned long pre_end, end, post_end;
@@ -134,15 +134,9 @@ static unsigned long __cpuinit calibrate
 		good_timer_count = 0;
 		if ((measured_times[max] - estimate) <
 				(estimate - measured_times[min])) {
-			printk(KERN_NOTICE "calibrate_delay_direct() dropping "
-					"min bogoMips estimate %d = %lu\n",
-				min, measured_times[min]);
 			measured_times[min] = 0;
 			min = max;
 		} else {
-			printk(KERN_NOTICE "calibrate_delay_direct() dropping "
-					"max bogoMips estimate %d = %lu\n",
-				max, measured_times[max]);
 			measured_times[max] = 0;
 			max = min;
 		}
@@ -160,9 +154,9 @@ static unsigned long __cpuinit calibrate
 
 	}
 
-	printk(KERN_NOTICE "calibrate_delay_direct() failed to get a good "
+	printk(KERN_NOTICE "CPU%d: calibrate_delay_direct() failed to get a good "
 	       "estimate for loops_per_jiffy.\nProbably due to long platform "
-		"interrupts. Consider using \"lpj=\" boot option.\n");
+		"interrupts. Consider using \"lpj=\" boot option.\n", cpu);
 	return 0;
 }
 #else
@@ -246,32 +240,38 @@ recalibrate:
 
 static DEFINE_PER_CPU(unsigned long, cpu_loops_per_jiffy) = { 0 };
 
-void __cpuinit calibrate_delay(void)
+static void __cpuinit print_lpj(int cpu, char *str, unsigned long lpj)
+{
+       pr_info("CPU%d: Calibrating delay%s"
+               "%lu.%02lu BogoMIPS (lpj=%lu)\n", cpu, str,
+               lpj/(500000/HZ), (lpj/(5000/HZ)) % 100, lpj);
+}
+
+unsigned long __cpuinit __calibrate_delay(int cpu)
 {
 	unsigned long lpj;
-	int this_cpu = smp_processor_id();
 
-	if (per_cpu(cpu_loops_per_jiffy, this_cpu)) {
-		lpj = per_cpu(cpu_loops_per_jiffy, this_cpu);
-		pr_info("Calibrating delay loop (skipped) "
-				"already calibrated this CPU");
+	if (per_cpu(cpu_loops_per_jiffy, cpu)) {
+		lpj = per_cpu(cpu_loops_per_jiffy, cpu);
+		print_lpj(cpu, " (skipped) already calibrated this CPU ", lpj);
 	} else if (preset_lpj) {
 		lpj = preset_lpj;
-		pr_info("Calibrating delay loop (skipped) preset value.. ");
-	} else if ((this_cpu == 0) && lpj_fine) {
+		print_lpj(cpu, " (skipped) preset value.. ", lpj);
+	} else if ((cpu == 0) && lpj_fine) {
 		lpj = lpj_fine;
-		pr_info("Calibrating delay loop (skipped), "
-			"value calculated using timer frequency.. ");
-	} else if ((lpj = calibrate_delay_direct()) != 0) {
-		pr_info("Calibrating delay using timer specific routine.. ");
+		print_lpj(cpu, " loop (skipped), value calculated using timer frequency.. ", lpj);
+	} else if ((lpj = calibrate_delay_direct(cpu)) != 0) {
+		print_lpj(cpu, " using timer specific routine.. ", lpj);
 	} else {
-		pr_info("Calibrating delay loop... ");
 		lpj = calibrate_delay_converge();
+		print_lpj(cpu, " delay loop... ", lpj);
 	}
-	per_cpu(cpu_loops_per_jiffy, this_cpu) = lpj;
-	pr_cont("%lu.%02lu BogoMIPS (lpj=%lu)\n",
-			lpj/(500000/HZ),
-			(lpj/(5000/HZ)) % 100, lpj);
+	per_cpu(cpu_loops_per_jiffy, cpu) = lpj;
 
-	loops_per_jiffy = lpj;
+	return lpj;
+}
+
+void __cpuinit calibrate_delay(void)
+{
+	loops_per_jiffy = __calibrate_delay(smp_processor_id());
 }


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-06  0:21             ` [PATCH v2] " Yinghai Lu
@ 2011-08-06  6:58               ` Ingo Molnar
  2011-08-06 10:51               ` Robin Holt
  2011-08-06 14:39               ` Jack Steiner
  2 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2011-08-06  6:58 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Jack Steiner, Robin Holt, tglx, davej, yinghan, linux-kernel


* Yinghai Lu <yhlu.kernel@gmail.com> wrote:

> On Fri, Aug 5, 2011 at 6:16 AM, Jack Steiner <steiner@sgi.com> wrote:
> > Aside from the bogus comment (I'll send a V3 with a fixed comment), does the patch
> > look ok.
> 
> Several months ago, Robin said that he would test an updated version:
> 
> [PATCH] x86: Make calibrate_delay run in parallel.
> 
> so is there any reason that you SGI guys stopped that path?

Running calibration in parallel is pretty stupid: cores/threads might 
impact each other and there might be a lot of avoidable noise in the 
results.

Thanks,

	Ingo


* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-06  0:21             ` [PATCH v2] " Yinghai Lu
  2011-08-06  6:58               ` Ingo Molnar
@ 2011-08-06 10:51               ` Robin Holt
  2011-08-06 14:39               ` Jack Steiner
  2 siblings, 0 replies; 19+ messages in thread
From: Robin Holt @ 2011-08-06 10:51 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jack Steiner, Robin Holt, Ingo Molnar, tglx, davej, yinghan,
	linux-kernel

On Fri, Aug 05, 2011 at 05:21:39PM -0700, Yinghai Lu wrote:
> On Fri, Aug 5, 2011 at 6:16 AM, Jack Steiner <steiner@sgi.com> wrote:
> > Aside from the bogus comment (I'll send a V3 with a fixed comment), does the patch
> > look ok.
> 
> Several months ago, Robin said that he would test an updated version:

I am sorry I dropped that.  I was waiting for a large machine to become
available and then got loaned to another group within SGI and have not
done a very good job of keeping track of my obligations to people in
this forum.

When the time came to turn the cpu calibration stuff over to someone
else within SGI, I did tell Jack of efforts others had made in the past,
but forgot the specifics of the efforts or that they were waiting for me.

Please understand the situation has not allowed me much time and that
coupled with my inability to clearly remember how this had progressed
has allowed that effort to fall between the cracks.  I am sorry for
both your effort not being used and also wasting Jack's time in trying
to come up with a different method.

Hopefully, Jack can test this method now.

Robin

> 
> [PATCH] x86: Make calibrate_delay run in parallel.
> 
> so is there any reason that you SGI guys stopped that path?
> 
> Please check the attached patch, updated against current tip.
> 
> Thanks
> 
> Yinghai Lu

> [PATCH -v4] x86: Make calibrate_delay run in parallel.
> 
> On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> up the cpus.  By specifying lpj=<value>, we reduced that to 75 seconds.
> Andi Kleen suggested we rework the calibrate_delay calls to run in
> parallel.
> 
> -v2: from Yinghai
>      two paths: one for the initial boot cpus and one for hotplug cpus.
> 	initial path:
> 	  after all cpus have booted and entered idle, use
> 	  smp_call_function_many to let every AP call __calibrate_delay.
> 	  We can not run calibrate_delay after local_irq_enable in
> 	  start_secondary(): at that point the cpu could be involved
> 	  with perf_event while the nmi_watchdog is enabled, which
> 	  causes strange calibration results.
> 	hotplug path:
> 	  need to add a cpu notifier block.
>      add __calibrate_delay instead of changing calibrate_delay all over.
>      use cpu_calibrated_delay_mask to track which cpus have finished.
>      use print_lpj so each cpu's line is printed complete.
> 
> Signed-off-by: Robin Holt <holt@sgi.com>
> To: Andi Kleen <andi@firstfloor.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> 
> ---
>  arch/x86/include/asm/cpumask.h |    1 
>  arch/x86/kernel/cpu/common.c   |    3 ++
>  arch/x86/kernel/smpboot.c      |   58 ++++++++++++++++++++++++++++++++++-------
>  include/linux/delay.h          |    1 
>  init/calibrate.c               |   54 +++++++++++++++++++-------------------
>  5 files changed, 81 insertions(+), 36 deletions(-)
> 
> 
> --
> Index: linux-2.6/arch/x86/include/asm/cpumask.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/cpumask.h
> +++ linux-2.6/arch/x86/include/asm/cpumask.h
> @@ -6,6 +6,7 @@
>  extern cpumask_var_t cpu_callin_mask;
>  extern cpumask_var_t cpu_callout_mask;
>  extern cpumask_var_t cpu_initialized_mask;
> +extern cpumask_var_t cpu_calibrated_delay_mask;
>  extern cpumask_var_t cpu_sibling_setup_mask;
>  
>  extern void setup_cpu_local_masks(void);
> Index: linux-2.6/arch/x86/kernel/cpu/common.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> +++ linux-2.6/arch/x86/kernel/cpu/common.c
> @@ -45,6 +45,7 @@
>  cpumask_var_t cpu_initialized_mask;
>  cpumask_var_t cpu_callout_mask;
>  cpumask_var_t cpu_callin_mask;
> +cpumask_var_t cpu_calibrated_delay_mask;
>  
>  /* representing cpus for which sibling maps can be computed */
>  cpumask_var_t cpu_sibling_setup_mask;
> @@ -58,6 +59,8 @@ void __init setup_cpu_local_masks(void)
>  	alloc_bootmem_cpumask_var(&cpu_callin_mask);
>  	set_bootmem_name("cpu_callout_mask");
>  	alloc_bootmem_cpumask_var(&cpu_callout_mask);
> +	set_bootmem_name("cpu_calibrated_delay_mask");
> +	alloc_bootmem_cpumask_var(&cpu_calibrated_delay_mask);
>  	set_bootmem_name("cpu_sibling_setup_mask");
>  	alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
>  }
> Index: linux-2.6/arch/x86/kernel/smpboot.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/smpboot.c
> +++ linux-2.6/arch/x86/kernel/smpboot.c
> @@ -52,6 +52,7 @@
>  #include <linux/gfp.h>
>  
>  #include <asm/acpi.h>
> +#include <asm/cpumask.h>
>  #include <asm/desc.h>
>  #include <asm/nmi.h>
>  #include <asm/irq.h>
> @@ -210,15 +211,7 @@ static void __cpuinit smp_callin(void)
>  	 * Need to setup vector mappings before we enable interrupts.
>  	 */
>  	setup_vector_irq(smp_processor_id());
> -	/*
> -	 * Get our bogomips.
> -	 *
> -	 * Need to enable IRQs because it can take longer and then
> -	 * the NMI watchdog might kill us.
> -	 */
> -	local_irq_enable();
> -	calibrate_delay();
> -	local_irq_disable();
> +
>  	pr_debug("Stack at about %p\n", &cpuid);
>  
>  	/*
> @@ -1038,6 +1031,8 @@ void __init native_smp_prepare_cpus(unsi
>  	}
>  	set_cpu_sibling_map(0);
>  
> +	/* already called earlier for boot cpu */
> +	cpumask_set_cpu(0, cpu_calibrated_delay_mask);
>  
>  	if (smp_sanity_check(max_cpus) < 0) {
>  		printk(KERN_INFO "SMP disabled\n");
> @@ -1126,8 +1121,53 @@ void __init native_smp_prepare_boot_cpu(
>  	per_cpu(cpu_state, me) = CPU_ONLINE;
>  }
>  
> +static void __cpuinit calibrate_delay_fn(void *info)
> +{
> +	int cpu = smp_processor_id();
> +
> +	cpu_data(cpu).loops_per_jiffy = __calibrate_delay(cpu);
> +	cpumask_set_cpu(cpu, cpu_calibrated_delay_mask);
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +static int __cpuinit
> +cal_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +	int cpu = (unsigned long)hcpu;
> +
> +	switch (action) {
> +	case CPU_ONLINE:
> +	case CPU_ONLINE_FROZEN:
> +		smp_call_function_single(cpu, calibrate_delay_fn, NULL, 1);
> +		break;
> +	}
> +
> +	return NOTIFY_OK;
> +}
> +
> +static __cpuinitdata struct notifier_block __cpuinitdata cal_cpu_nfb = {
> +	.notifier_call = cal_cpu_callback
> +};
> +
> +static void __init register_cal_cpu_nfb(void)
> +{
> +	register_cpu_notifier(&cal_cpu_nfb);
> +}
> +#else
> +static void __init register_cal_cpu_nfb(void)
> +{
> +}
> +#endif
> +
>  void __init native_smp_cpus_done(unsigned int max_cpus)
>  {
> +	smp_call_function_many(cpu_online_mask, calibrate_delay_fn, NULL, 0);
> +	while (cpumask_weight(cpu_calibrated_delay_mask) != num_online_cpus()) {
> +		cpu_relax();
> +		touch_nmi_watchdog();
> +	}
> +	register_cal_cpu_nfb();
> +
>  	pr_debug("Boot done.\n");
>  
>  	impress_friends();
> Index: linux-2.6/include/linux/delay.h
> ===================================================================
> --- linux-2.6.orig/include/linux/delay.h
> +++ linux-2.6/include/linux/delay.h
> @@ -43,6 +43,7 @@ static inline void ndelay(unsigned long
>  
>  extern unsigned long lpj_fine;
>  void calibrate_delay(void);
> +unsigned long __calibrate_delay(int cpu);
>  void msleep(unsigned int msecs);
>  unsigned long msleep_interruptible(unsigned int msecs);
>  void usleep_range(unsigned long min, unsigned long max);
> Index: linux-2.6/init/calibrate.c
> ===================================================================
> --- linux-2.6.orig/init/calibrate.c
> +++ linux-2.6/init/calibrate.c
> @@ -31,7 +31,7 @@ __setup("lpj=", lpj_setup);
>  #define DELAY_CALIBRATION_TICKS			((HZ < 100) ? 1 : (HZ/100))
>  #define MAX_DIRECT_CALIBRATION_RETRIES		5
>  
> -static unsigned long __cpuinit calibrate_delay_direct(void)
> +static unsigned long __cpuinit calibrate_delay_direct(int cpu)
>  {
>  	unsigned long pre_start, start, post_start;
>  	unsigned long pre_end, end, post_end;
> @@ -134,15 +134,9 @@ static unsigned long __cpuinit calibrate
>  		good_timer_count = 0;
>  		if ((measured_times[max] - estimate) <
>  				(estimate - measured_times[min])) {
> -			printk(KERN_NOTICE "calibrate_delay_direct() dropping "
> -					"min bogoMips estimate %d = %lu\n",
> -				min, measured_times[min]);
>  			measured_times[min] = 0;
>  			min = max;
>  		} else {
> -			printk(KERN_NOTICE "calibrate_delay_direct() dropping "
> -					"max bogoMips estimate %d = %lu\n",
> -				max, measured_times[max]);
>  			measured_times[max] = 0;
>  			max = min;
>  		}
> @@ -160,9 +154,9 @@ static unsigned long __cpuinit calibrate
>  
>  	}
>  
> -	printk(KERN_NOTICE "calibrate_delay_direct() failed to get a good "
> +	printk(KERN_NOTICE "CPU%d: calibrate_delay_direct() failed to get a good "
>  	       "estimate for loops_per_jiffy.\nProbably due to long platform "
> -		"interrupts. Consider using \"lpj=\" boot option.\n");
> +		"interrupts. Consider using \"lpj=\" boot option.\n", cpu);
>  	return 0;
>  }
>  #else
> @@ -246,32 +240,38 @@ recalibrate:
>  
>  static DEFINE_PER_CPU(unsigned long, cpu_loops_per_jiffy) = { 0 };
>  
> -void __cpuinit calibrate_delay(void)
> +static void __cpuinit print_lpj(int cpu, char *str, unsigned long lpj)
> +{
> +       pr_info("CPU%d: Calibrating delay%s"
> +               "%lu.%02lu BogoMIPS (lpj=%lu)\n", cpu, str,
> +               lpj/(500000/HZ), (lpj/(5000/HZ)) % 100, lpj);
> +}
> +
> +unsigned long __cpuinit __calibrate_delay(int cpu)
>  {
>  	unsigned long lpj;
> -	int this_cpu = smp_processor_id();
>  
> -	if (per_cpu(cpu_loops_per_jiffy, this_cpu)) {
> -		lpj = per_cpu(cpu_loops_per_jiffy, this_cpu);
> -		pr_info("Calibrating delay loop (skipped) "
> -				"already calibrated this CPU");
> +	if (per_cpu(cpu_loops_per_jiffy, cpu)) {
> +		lpj = per_cpu(cpu_loops_per_jiffy, cpu);
> +		print_lpj(cpu, " (skipped) already calibrated this CPU ", lpj);
>  	} else if (preset_lpj) {
>  		lpj = preset_lpj;
> -		pr_info("Calibrating delay loop (skipped) preset value.. ");
> -	} else if ((this_cpu == 0) && lpj_fine) {
> +		print_lpj(cpu, " (skipped) preset value.. ", lpj);
> +	} else if ((cpu == 0) && lpj_fine) {
>  		lpj = lpj_fine;
> -		pr_info("Calibrating delay loop (skipped), "
> -			"value calculated using timer frequency.. ");
> -	} else if ((lpj = calibrate_delay_direct()) != 0) {
> -		pr_info("Calibrating delay using timer specific routine.. ");
> +		print_lpj(cpu, " loop (skipped), value calculated using timer frequency.. ", lpj);
> +	} else if ((lpj = calibrate_delay_direct(cpu)) != 0) {
> +		print_lpj(cpu, " using timer specific routine.. ", lpj);
>  	} else {
> -		pr_info("Calibrating delay loop... ");
>  		lpj = calibrate_delay_converge();
> +		print_lpj(cpu, " delay loop... ", lpj);
>  	}
> -	per_cpu(cpu_loops_per_jiffy, this_cpu) = lpj;
> -	pr_cont("%lu.%02lu BogoMIPS (lpj=%lu)\n",
> -			lpj/(500000/HZ),
> -			(lpj/(5000/HZ)) % 100, lpj);
> +	per_cpu(cpu_loops_per_jiffy, cpu) = lpj;
>  
> -	loops_per_jiffy = lpj;
> +	return lpj;
> +}
> +
> +void __cpuinit calibrate_delay(void)
> +{
> +	loops_per_jiffy = __calibrate_delay(smp_processor_id());
>  }


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-06  0:21             ` [PATCH v2] " Yinghai Lu
  2011-08-06  6:58               ` Ingo Molnar
  2011-08-06 10:51               ` Robin Holt
@ 2011-08-06 14:39               ` Jack Steiner
  2 siblings, 0 replies; 19+ messages in thread
From: Jack Steiner @ 2011-08-06 14:39 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Robin Holt, Ingo Molnar, tglx, davej, yinghan, linux-kernel

On Fri, Aug 05, 2011 at 05:21:39PM -0700, Yinghai Lu wrote:
> On Fri, Aug 5, 2011 at 6:16 AM, Jack Steiner <steiner@sgi.com> wrote:
> > Aside from the bogus comment (I'll send a V3 with a fixed comment), does the patch
> > look ok.
> 
> Several months ago, Robin said that he will test updated version
> 
> [PATCH] x86: Make calibrate_delay run in parallel.
> 
> so is there any reason that you SGI guys stopped that path?

I can take another look at Robin's patch. However, I thought the one
I posted was simpler & less likely to cause unexpected behavior.

I'll look in more detail on Monday.....



> 
> Please check attached patch for updated version to current tip.
> 
> Thanks
> 
> Yinghai Lu



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-05 21:38             ` Ingo Molnar
@ 2011-08-07  0:36               ` Matthew Garrett
  2011-08-08 20:44                 ` Jack Steiner
  2011-08-09 15:06                 ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Matthew Garrett @ 2011-08-07  0:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jack Steiner, tglx, davej, yinghan, linux-kernel

On Fri, Aug 05, 2011 at 11:38:36PM +0200, Ingo Molnar wrote:

> Well, it still uses heuristics: it assumes frequency is the same when 
> the cpuid data tells us that two CPUs are on the same socket, right?

If we only assume that when we have a constant TSC then it's a pretty 
safe assumption - the delay loop will be calibrated against the TSC, and 
the TSC will be constant across the package regardless of what frequency 
the cores are actually running at.

> Cannot we directly see the frequency(ies) of the CPU, and decide 
> based on that whether the calibration result can be cached?

Kind of? We can look at the aperf/mperf ratio and compare that against 
the TSC to work out the frequency. But that only gives you meaningful 
results with a constant TSC, so since you're still depending on that you 
might as well skip the extra check.

The only time this should give problems is if someone produces a 
multi-core x86 with per-core (rather than per-package) TSC and runs the 
TSC at different frequencies per-core. Since that'd probably cost more 
than not doing it, I think we're probably safe.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-07  0:36               ` Matthew Garrett
@ 2011-08-08 20:44                 ` Jack Steiner
  2011-08-09 15:06                 ` Ingo Molnar
  1 sibling, 0 replies; 19+ messages in thread
From: Jack Steiner @ 2011-08-08 20:44 UTC (permalink / raw)
  To: mingo, tglx; +Cc: davej, yinghan, linux-kernel, mjg

Reduce the startup time for slave cpus.

This patch adds hooks for an arch-specific function for clock calibration.

These hooks are used on x86. If a newly started cpu has the same phys_proc_id
as an already-active core, uses the TSC for the delay loop, and has CONSTANT_TSC,
reuse that core's already-calculated value of loops_per_jiffy.

This patch reduces the time required to start slave cpus on a 4096 cpu system
from:
	465 sec  OLD
	 62 sec NEW

This reduces boot time on a 4096p system by almost 7 minutes.  Nice...


Signed-off-by: Jack Steiner <steiner@sgi.com>


---
V2 - remove assumption that all cores in a socket have the same core frequency.
     the patch is enabled only if delay() uses the TSC & socket is CONSTANT_TSC.

V3 - Update comments & patch description. No code changes. I think (hope -:) this
     plus the comments from others on the V2 patch resolves the issues.



Also see https://lkml.org/lkml/2010/12/14/511 for a previous discussion
of a similar improvement. This patch gives almost equivalent performance
and is less intrusive.


 arch/x86/kernel/smpboot.c |   16 +++++++++++-----
 arch/x86/kernel/tsc.c     |   19 +++++++++++++++++++
 init/calibrate.c          |   15 +++++++++++++++
 3 files changed, 45 insertions(+), 5 deletions(-)

Index: linux/arch/x86/kernel/smpboot.c
===================================================================
--- linux.orig/arch/x86/kernel/smpboot.c	2011-08-05 08:24:19.492148934 -0500
+++ linux/arch/x86/kernel/smpboot.c	2011-08-08 15:21:47.919979997 -0500
@@ -207,23 +207,29 @@ static void __cpuinit smp_callin(void)
 	 * Need to setup vector mappings before we enable interrupts.
 	 */
 	setup_vector_irq(smp_processor_id());
+
+	/*
+	 * Save our processor parameters. Note: this information
+	 * is needed for clock calibration.
+	 */
+	smp_store_cpu_info(cpuid);
+
 	/*
 	 * Get our bogomips.
+	 * Update loops_per_jiffy in cpu_data. Previous call to
+	 * smp_store_cpu_info() stored a value that is close but not as
+	 * accurate as the value just calculated.
 	 *
 	 * Need to enable IRQs because it can take longer and then
 	 * the NMI watchdog might kill us.
 	 */
 	local_irq_enable();
 	calibrate_delay();
+	cpu_data(cpuid).loops_per_jiffy = loops_per_jiffy;
 	local_irq_disable();
 	pr_debug("Stack at about %p\n", &cpuid);
 
 	/*
-	 * Save our processor parameters
-	 */
-	smp_store_cpu_info(cpuid);
-
-	/*
 	 * This must be done before setting cpu_online_mask
 	 * or calling notify_cpu_starting.
 	 */
Index: linux/arch/x86/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc.c	2011-08-05 08:24:19.492148934 -0500
+++ linux/arch/x86/kernel/tsc.c	2011-08-08 15:31:43.236111159 -0500
@@ -995,3 +995,22 @@ void __init tsc_init(void)
 	check_system_tsc_reliable();
 }
 
+/*
+ * If we have a constant TSC and are using the TSC for the delay loop,
+ * we can skip clock calibration if another cpu in the same socket has already
+ * been calibrated. This assumes that CONSTANT_TSC applies to all
+ * cpus in the socket - this should be a safe assumption.
+ */
+unsigned long __cpuinit calibrate_delay_is_known(void)
+{
+	int i, cpu = smp_processor_id();
+
+	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
+		return 0;
+
+	for_each_online_cpu(i)
+		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
+			return cpu_data(i).loops_per_jiffy;
+	return 0;
+}
+
Index: linux/init/calibrate.c
===================================================================
--- linux.orig/init/calibrate.c	2011-08-05 08:24:19.492148934 -0500
+++ linux/init/calibrate.c	2011-08-08 15:33:17.003985763 -0500
@@ -246,6 +246,19 @@ recalibrate:
 
 static DEFINE_PER_CPU(unsigned long, cpu_loops_per_jiffy) = { 0 };
 
+/*
+ * Check if cpu calibration delay is already known. For example,
+ * some processors with multi-core sockets may have all cores
+ * with the same calibration delay.
+ *
+ * Architectures should override this function if a faster calibration
+ * method is available.
+ */
+unsigned long __attribute__((weak)) __cpuinit calibrate_delay_is_known(void)
+{
+	return 0;
+}
+
 void __cpuinit calibrate_delay(void)
 {
 	unsigned long lpj;
@@ -265,6 +278,8 @@ void __cpuinit calibrate_delay(void)
 		lpj = lpj_fine;
 		pr_info("Calibrating delay loop (skipped), "
 			"value calculated using timer frequency.. ");
+	} else if ((lpj = calibrate_delay_is_known())) {
+		;
 	} else if ((lpj = calibrate_delay_direct()) != 0) {
 		if (!printed)
 			pr_info("Calibrating delay using timer "

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-07  0:36               ` Matthew Garrett
  2011-08-08 20:44                 ` Jack Steiner
@ 2011-08-09 15:06                 ` Ingo Molnar
  2011-08-09 15:18                   ` Matthew Garrett
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2011-08-09 15:06 UTC (permalink / raw)
  To: Matthew Garrett; +Cc: Jack Steiner, tglx, davej, yinghan, linux-kernel


* Matthew Garrett <mjg@redhat.com> wrote:

> On Fri, Aug 05, 2011 at 11:38:36PM +0200, Ingo Molnar wrote:
> 
> > Well, it still uses heuristics: it assumes frequency is the same 
> > when the cpuid data tells us that two CPUs are on the same 
> > socket, right?
> 
> If we only assume that when we have a constant TSC then it's a 
> pretty safe assumption - the delay loop will be calibrated against 
> the TSC, and the TSC will be constant across the package regardless 
> of what frequency the cores are actually running at.

The delay loop might be calibrated against the TSC, but the amount of 
real delay we get when we loop 100,000 times will be frequency 
dependent.

What we probably want is the most conservative udelay calibration: 
have a lpj value measured on the highest possible frequency - this 
way hardware components can never be overclocked by a driver.

Or does udelay() scale with the current frequency of the CPU?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] x86: Reduce clock calibration time during slave cpu startup
  2011-08-09 15:06                 ` Ingo Molnar
@ 2011-08-09 15:18                   ` Matthew Garrett
  2011-08-11 20:14                     ` [PATCH v3] " Jack Steiner
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Garrett @ 2011-08-09 15:18 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jack Steiner, tglx, davej, yinghan, linux-kernel

On Tue, Aug 09, 2011 at 05:06:24PM +0200, Ingo Molnar wrote:

> The delay loop might be calibrated against the TSC, but the amount of 
> real delay we get when we loop 100,000 times will be frequency 
> dependent.

We don't have a situation where a system boots with one core in a 
package fixed at one frequency and another core in the same package at 
another. It's possible that they'll float independently due to cpufreq 
changes (although that's not possible with most current hardware), but 
we need to take that into account anyway.

> What we probably want is the most conservative udelay calibration: 
> have a lpj value measured on the highest possible frequency - this 
> way hardware components can never be overclocked by a driver.

There's no way to force the highest possible frequency. Calibration 
occurs before cpuidle is running, and the only way to get the maximum 
frequency on a given core is to have all the other cores in C6.

> Or does udelay() scale with the current frequency of the CPU?

If it doesn't then it's been broken for the past 8 years or so.
-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3] x86: Reduce clock calibration time during slave cpu startup
  2011-08-09 15:18                   ` Matthew Garrett
@ 2011-08-11 20:14                     ` Jack Steiner
  0 siblings, 0 replies; 19+ messages in thread
From: Jack Steiner @ 2011-08-11 20:14 UTC (permalink / raw)
  To: mingo, tglx; +Cc: davej, yinghan, linux-kernel, mjg

Reduce the startup time for slave cpus.

This patch adds hooks for an arch-specific function for clock calibration.

These hooks are used on x86. If a newly started cpu has the same phys_proc_id
as an already-active core, uses the TSC for the delay loop, and has CONSTANT_TSC,
reuse that core's already-calculated value of loops_per_jiffy.

This patch reduces the time required to start slave cpus on a 4096 cpu system
from:
	465 sec  OLD
	 62 sec NEW

This reduces boot time on a 4096p system by almost 7 minutes.  Nice...


Signed-off-by: Jack Steiner <steiner@sgi.com>


---
V2 - remove assumption that all cores in a socket have the same core frequency.
     the patch is enabled only if delay() uses the TSC & socket is CONSTANT_TSC.

V3 - Update comments & patch description. No code changes.
	[RESEND - failed to change V2 to V3 in previous posting.]

     Not sure if this is acceptable or what needs to be done next. Pointers
     appreciated.



Also see https://lkml.org/lkml/2010/12/14/511 for a previous discussion
of a similar improvement. This patch gives almost equivalent performance
and is less intrusive.


 arch/x86/kernel/smpboot.c |   16 +++++++++++-----
 arch/x86/kernel/tsc.c     |   19 +++++++++++++++++++
 init/calibrate.c          |   15 +++++++++++++++
 3 files changed, 45 insertions(+), 5 deletions(-)

Index: linux/arch/x86/kernel/smpboot.c
===================================================================
--- linux.orig/arch/x86/kernel/smpboot.c	2011-08-05 08:24:19.492148934 -0500
+++ linux/arch/x86/kernel/smpboot.c	2011-08-08 15:21:47.919979997 -0500
@@ -207,23 +207,29 @@ static void __cpuinit smp_callin(void)
 	 * Need to setup vector mappings before we enable interrupts.
 	 */
 	setup_vector_irq(smp_processor_id());
+
+	/*
+	 * Save our processor parameters. Note: this information
+	 * is needed for clock calibration.
+	 */
+	smp_store_cpu_info(cpuid);
+
 	/*
 	 * Get our bogomips.
+	 * Update loops_per_jiffy in cpu_data. Previous call to
+	 * smp_store_cpu_info() stored a value that is close but not as
+	 * accurate as the value just calculated.
 	 *
 	 * Need to enable IRQs because it can take longer and then
 	 * the NMI watchdog might kill us.
 	 */
 	local_irq_enable();
 	calibrate_delay();
+	cpu_data(cpuid).loops_per_jiffy = loops_per_jiffy;
 	local_irq_disable();
 	pr_debug("Stack at about %p\n", &cpuid);
 
 	/*
-	 * Save our processor parameters
-	 */
-	smp_store_cpu_info(cpuid);
-
-	/*
 	 * This must be done before setting cpu_online_mask
 	 * or calling notify_cpu_starting.
 	 */
Index: linux/arch/x86/kernel/tsc.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc.c	2011-08-05 08:24:19.492148934 -0500
+++ linux/arch/x86/kernel/tsc.c	2011-08-08 15:31:43.236111159 -0500
@@ -995,3 +995,22 @@ void __init tsc_init(void)
 	check_system_tsc_reliable();
 }
 
+/*
+ * If we have a constant TSC and are using the TSC for the delay loop,
+ * we can skip clock calibration if another cpu in the same socket has already
+ * been calibrated. This assumes that CONSTANT_TSC applies to all
+ * cpus in the socket - this should be a safe assumption.
+ */
+unsigned long __cpuinit calibrate_delay_is_known(void)
+{
+	int i, cpu = smp_processor_id();
+
+	if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
+		return 0;
+
+	for_each_online_cpu(i)
+		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
+			return cpu_data(i).loops_per_jiffy;
+	return 0;
+}
+
Index: linux/init/calibrate.c
===================================================================
--- linux.orig/init/calibrate.c	2011-08-05 08:24:19.492148934 -0500
+++ linux/init/calibrate.c	2011-08-08 15:33:17.003985763 -0500
@@ -246,6 +246,19 @@ recalibrate:
 
 static DEFINE_PER_CPU(unsigned long, cpu_loops_per_jiffy) = { 0 };
 
+/*
+ * Check if cpu calibration delay is already known. For example,
+ * some processors with multi-core sockets may have all cores
+ * with the same calibration delay.
+ *
+ * Architectures should override this function if a faster calibration
+ * method is available.
+ */
+unsigned long __attribute__((weak)) __cpuinit calibrate_delay_is_known(void)
+{
+	return 0;
+}
+
 void __cpuinit calibrate_delay(void)
 {
 	unsigned long lpj;
@@ -265,6 +278,8 @@ void __cpuinit calibrate_delay(void)
 		lpj = lpj_fine;
 		pr_info("Calibrating delay loop (skipped), "
 			"value calculated using timer frequency.. ");
+	} else if ((lpj = calibrate_delay_is_known())) {
+		;
 	} else if ((lpj = calibrate_delay_direct()) != 0) {
 		if (!printed)
 			pr_info("Calibrating delay using timer "

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] x86: Reduce clock calibration time during slave cpu startup
  2011-07-27 13:57 [PATCH] x86: Reduce clock calibration time during slave cpu startup Jack Steiner
  2011-07-27 14:05 ` Dave Jones
@ 2011-08-26 23:56 ` Andrew Morton
  2011-08-26 23:57   ` Andrew Morton
  2011-08-29 15:04   ` Jack Steiner
  1 sibling, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2011-08-26 23:56 UTC (permalink / raw)
  To: Jack Steiner; +Cc: mingo, tglx, linux-kernel

On Wed, 27 Jul 2011 08:57:31 -0500
Jack Steiner <steiner@sgi.com> wrote:

> Reduce the startup time for slave cpus.
> 
> This patch adds hooks for an arch-specific function for clock calibration.
> These hooks are used on x86. They assume all cores in a physical socket
> run at the same core speed. If a newly started cpu has the same phys_proc_id
> as a core already active, use the already-calculated value of loops_per_jiffy.
> 
> This patch reduces the time required to start slave cpus on a 4096 cpu
> system from:
> 	465 sec  OLD
> 	 62 sec NEW

Eight minutes is just stupid.

100ms/cpu is just stupid too.  What's the CPU doing?  Spinning around
counting ticks?  That's parallelizable.

> This reduces boot time on a 4096p system by almost 7 minutes.  Nice...
> 
> 
> Signed-off-by: Jack Steiner <steiner@sgi.com>
> 
> 
> ---
> Note: patch assumes that all multi-core x86 processor sockets have the same
> clock frequency for all cores. AFAIK, this is true & will continue
> to be true for a long time. Have I overlooked anything???

Well, Andi thinks this may become untrue relatively soon.  Then what do
we do?

>  /*
> + * Check if another cpu is in the same socket and has already been calibrated.
> + * If found, use the previous value. This assumes all cores in the same physical
> + * socket have the same core frequency.
> + */
> +unsigned long __cpuinit calibrate_delay_is_known(void)
> +{
> +	int i, cpu = smp_processor_id();
> +
> +	for_each_online_cpu(i)
> +		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)

This will always match when `i' reaches `cpu'.  Or is this cpu not
online at this time?

> +			return cpu_data(i).loops_per_jiffy;
> +	return 0;
> +}
> +
> +/*
>   * Activate a secondary processor.
>   */
>  notrace static void __cpuinit start_secondary(void *unused)
> Index: linux/init/calibrate.c
> ===================================================================
> --- linux.orig/init/calibrate.c	2011-07-26 08:01:15.571979739 -0500
> +++ linux/init/calibrate.c	2011-07-27 08:39:35.691983745 -0500
> @@ -243,6 +243,20 @@ recalibrate:
>  	return lpj;
>  }
>  
> +/*
> + * Check if cpu calibration delay is already known. For example,
> + * some processors with multi-core sockets may have all sockets
> + * use the same core frequency. It is not necessary to calibrate
> + * each core.
> + *
> + * Architectures should override this function if a faster calibration
> + * method is available.
> + */
> +unsigned long __attribute__((weak)) __cpuinit calibrate_delay_is_known(void)

__weak

> +{
> +	return 0;
> +}
> +
>  void __cpuinit calibrate_delay(void)
>  {
>  	unsigned long lpj;
> @@ -257,6 +271,8 @@ void __cpuinit calibrate_delay(void)
>  		lpj = lpj_fine;
>  		pr_info("Calibrating delay loop (skipped), "
>  			"value calculated using timer frequency.. ");
> +	} else if ((lpj = calibrate_delay_is_known())) {
> +		;
>  	} else if ((lpj = calibrate_delay_direct()) != 0) {
>  		if (!printed)
>  			pr_info("Calibrating delay using timer "


* Re: [PATCH] x86: Reduce clock calibration time during slave cpu startup
  2011-08-26 23:56 ` [PATCH] " Andrew Morton
@ 2011-08-26 23:57   ` Andrew Morton
  2011-08-29 15:04   ` Jack Steiner
  1 sibling, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2011-08-26 23:57 UTC (permalink / raw)
  To: Jack Steiner, mingo, tglx, linux-kernel

On Fri, 26 Aug 2011 16:56:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > Note: patch assumes that all multi-core x86 processor sockets have the same
> > clock frequency for all cores. AFAIK, this is true & will continue
> > to be true for a long time. Have I overlooked anything???
> 
> Well, Andi thinks this may become untrue relatively soon.  Then what do
> we do?

err, make that davej@firstfloor.org or something.


* Re: [PATCH] x86: Reduce clock calibration time during slave cpu startup
  2011-08-26 23:56 ` [PATCH] " Andrew Morton
  2011-08-26 23:57   ` Andrew Morton
@ 2011-08-29 15:04   ` Jack Steiner
  1 sibling, 0 replies; 19+ messages in thread
From: Jack Steiner @ 2011-08-29 15:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, tglx, linux-kernel

On Fri, Aug 26, 2011 at 04:56:34PM -0700, Andrew Morton wrote:
> On Wed, 27 Jul 2011 08:57:31 -0500
> Jack Steiner <steiner@sgi.com> wrote:
> 
> > Reduce the startup time for slave cpus.
> > 
> > This patch adds hooks for an arch-specific function for clock calibration.
> > These hooks are used on x86. They assume all cores in a physical socket
> > run at the same core speed. If a newly started cpu has the same phys_proc_id
> > as a core already active, use the already-calculated value of loops_per_jiffy.
> > 
> > This patch reduces the time required to start slave cpus on a 4096 cpu
> > system from:
> > 	465 sec  OLD
> > 	 62 sec NEW
> 
> Eight minutes is just stupid.

Agree. I'd like to reduce that. It currently takes about 65 minutes to
boot a 4096p system with a reasonably sized IO config (a big part
of the boot time is IO dependent). Reducing that by 8 minutes is a good
improvement, but we still have more to do. Calibration is one of the
larger contributors to boot time.


> 
> 100ms/cpu is just stupid too.  What's the CPU doing?  Spinning around
> counting ticks?  That's parallelizable.

The time is spent in the clock calibration code. It unfortunately takes a while
to calibrate to a high degree of accuracy.

Ingo was concerned that trying to calibrate in parallel would introduce error.

	Running calibration in parallel is pretty stupid: cores/threads might
	impact each other and there might be a lot of avoidable noise in the
	results.

	Thanks, Ingo



> 
> > This reduces boot time on a 4096p system by almost 7 minutes.  Nice...
> > 
> > 
> > Signed-off-by: Jack Steiner <steiner@sgi.com>
> > 
> > 
> > ---
> > Note: patch assumes that all multi-core x86 processor sockets have the same
> > clock frequency for all cores. AFAIK, this is true & will continue
> > to be true for a long time. Have I overlooked anything???
> 
> Well, Andi thinks this may become untrue relatively soon.  Then what do
> we do?

I posted a V3 version of the patch that eliminates this assumption. The new
version skips recalibration of a core only if the delay loop uses the TSC
and CONSTANT_TSC is set for the cores within the socket.

So far, I have not received any feedback. The patch is at:

	http://marc.info/?l=linux-kernel&m=131309367414891&w=2

I'll resend again.


> 
> >  /*
> > + * Check if another cpu is in the same socket and has already been calibrated.
> > + * If found, use the previous value. This assumes all cores in the same physical
> > + * socket have the same core frequency.
> > + */
> > +unsigned long __cpuinit calibrate_delay_is_known(void)
> > +{
> > +	int i, cpu = smp_processor_id();
> > +
> > +	for_each_online_cpu(i)
> > +		if (cpu_data(i).phys_proc_id == cpu_data(cpu).phys_proc_id)
> 
> This will always match when `i' reaches `cpu'.  Or is this cpu not
> online at this time?

Correct - not online.


> 
> > +			return cpu_data(i).loops_per_jiffy;
> > +	return 0;
> > +}
> > +
> > +/*
> >   * Activate a secondary processor.
> >   */
> >  notrace static void __cpuinit start_secondary(void *unused)
> > Index: linux/init/calibrate.c
> > ===================================================================
> > --- linux.orig/init/calibrate.c	2011-07-26 08:01:15.571979739 -0500
> > +++ linux/init/calibrate.c	2011-07-27 08:39:35.691983745 -0500
> > @@ -243,6 +243,20 @@ recalibrate:
> >  	return lpj;
> >  }
> >  
> > +/*
> > + * Check if cpu calibration delay is already known. For example,
> > + * some processors with multi-core sockets may have all sockets
> > + * use the same core frequency. It is not necessary to calibrate
> > + * each core.
> > + *
> > + * Architectures should override this function if a faster calibration
> > + * method is available.
> > + */
> > +unsigned long __attribute__((weak)) __cpuinit calibrate_delay_is_known(void)
> 
> __weak
> 
> > +{
> > +	return 0;
> > +}
> > +
> >  void __cpuinit calibrate_delay(void)
> >  {
> >  	unsigned long lpj;
> > @@ -257,6 +271,8 @@ void __cpuinit calibrate_delay(void)
> >  		lpj = lpj_fine;
> >  		pr_info("Calibrating delay loop (skipped), "
> >  			"value calculated using timer frequency.. ");
> > +	} else if ((lpj = calibrate_delay_is_known())) {
> > +		;
> >  	} else if ((lpj = calibrate_delay_direct()) != 0) {
> >  		if (!printed)
> >  			pr_info("Calibrating delay using timer "



Thread overview: 19+ messages
2011-07-27 13:57 [PATCH] x86: Reduce clock calibration time during slave cpu startup Jack Steiner
2011-07-27 14:05 ` Dave Jones
     [not found]   ` <20110727141527.GA8453@sgi.com>
     [not found]     ` <20110727155200.GA25381@redhat.com>
2011-08-01 18:45       ` [PATCH v2] " Jack Steiner
2011-08-05 10:46         ` Ingo Molnar
2011-08-05 13:13           ` Jack Steiner
2011-08-05 13:16           ` Jack Steiner
2011-08-05 21:38             ` Ingo Molnar
2011-08-07  0:36               ` Matthew Garrett
2011-08-08 20:44                 ` Jack Steiner
2011-08-09 15:06                 ` Ingo Molnar
2011-08-09 15:18                   ` Matthew Garrett
2011-08-11 20:14                     ` [PATCH v3] " Jack Steiner
2011-08-06  0:21             ` [PATCH v2] " Yinghai Lu
2011-08-06  6:58               ` Ingo Molnar
2011-08-06 10:51               ` Robin Holt
2011-08-06 14:39               ` Jack Steiner
2011-08-26 23:56 ` [PATCH] " Andrew Morton
2011-08-26 23:57   ` Andrew Morton
2011-08-29 15:04   ` Jack Steiner
