* [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
@ 2012-02-23  0:36 Venkatesh Pallipadi
  2012-02-23  7:50 ` Ingo Molnar
  2012-02-23  9:30 ` [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-02-23  0:36 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Suresh Siddha, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel, Venkatesh Pallipadi

smp_call_function_single and ttwu_queue_remote send an unconditional IPI
to the target CPU. However, if the target CPU is in mwait-based idle, we can
do IPI-less wakeups using the magical powers of monitor-mwait.
Doing this has certain advantages:
* Lower overhead on the async IPI send path. Measurements on Westmere-based
  systems show savings on "no wait" smp_call_function_single with an idle
  target CPU (as measured on the sender side).
  local socket smp_call_func cost goes from ~1600 to ~1200 cycles
  remote socket smp_call_func cost goes from ~2000 to ~1800 cycles
* Avoiding actual interrupts shows a measurable reduction (10%) in system
  non-idle cycles and cache-references with a micro-benchmark sending IPIs
  from one CPU to all the other, mostly idle, CPUs in the system.
* On a mostly idle system, turbostat shows a tiny decrease in C0 (active) time
  and a corresponding increase in C6 state (each row is a 10-minute average)
          %c0   %c1   %c6
  Before
  Run 1  1.49  2.88 95.63
  Run 2  1.48  2.89 95.63
  Run 3  1.50  2.91 95.59
  After
  Run 1  1.28  2.38 96.33
  Run 2  1.30  2.44 96.26
  Run 3  1.31  2.45 96.24

* As a bonus, we can avoid sched/call IPI overhead altogether in a special case:
  when CPU Y has woken up CPU X (which can take 50-100us to actually wake up
  from a deep idle state) and CPU Z wants to send an IPI to CPU X in that
  window, Z gets its wakeup for free.

We started looking at this with one of our workloads where the system is
partially busy, and we noticed some kernel hotspots in find_next_bit and
default_send_IPI_mask_sequence_phys coming from sched wakeups (futex wakeups)
and networking call functions.
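
To make the mechanism concrete: monitor/mwait wakes a CPU on any store to
the monitored cache line, so a sender can wake an mwait'ing CPU simply by
flipping a bit in the very word the target is monitoring. Reduced to a
sketch (illustrative only; the real code is in the ipiless_poke.h file
below):

  /* Target CPU, going idle: arm the monitor on its thread_info flags. */
  __monitor((void *)&current_thread_info()->flags, 0, 0);
  if (!need_wakeup())
          __mwait(ax, cx);  /* any store to the monitored line ends mwait */

  /* Sender CPU: set a flag bit in that word instead of sending an IPI. */
  atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING);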

Thanks to Suresh for the suggestion of using TIF flags instead of
having a new percpu state variable and complicated update logic.

Notes:
* This only helps when the target CPU is idle. When it is busy we will still
  send the IPI as before.
* Do we need some accounting for these wakeups exported for powertop?
* We can also eliminate the TS_POLLING flag in favor of this. But that will have
  a lot more touchpoints and is better done as a standalone change.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---

Changes since previous versions:
* RFC https://lkml.org/lkml/2012/2/6/357
  Moved the changes into arch specific code as per PeterZ suggestion
  Got rid of new per cpu state logic in favor of TIF flag bits

 arch/x86/include/asm/ipiless_poke.h |   82 +++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h  |    4 ++
 arch/x86/kernel/acpi/cstate.c       |    7 ++-
 arch/x86/kernel/process_32.c        |    2 +
 arch/x86/kernel/process_64.c        |    2 +
 arch/x86/kernel/smp.c               |   16 +++++++
 include/linux/sched.h               |    2 +
 kernel/sched/core.c                 |   13 ++++++
 8 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/ipiless_poke.h

diff --git a/arch/x86/include/asm/ipiless_poke.h b/arch/x86/include/asm/ipiless_poke.h
new file mode 100644
index 0000000..58670c7
--- /dev/null
+++ b/arch/x86/include/asm/ipiless_poke.h
@@ -0,0 +1,82 @@
+#ifndef _ASM_X86_IPILESS_POKE_H
+#define _ASM_X86_IPILESS_POKE_H
+
+#include <linux/sched.h>
+#include <asm/thread_info.h>
+
+#ifdef CONFIG_SMP
+
+DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
+
+/*
+ * Use 2 bits in idle_task's thread info flags:
+ * TIF_IPILESS_IDLE marks entry into and exit from idle states with ipiless
+ * wakeup capability.
+ * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI target CPU
+ * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already set).
+ * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle state.
+ */
+
+static inline void ipiless_idle_enter(void)
+{
+	set_thread_flag(TIF_IPILESS_IDLE);
+}
+
+static inline void ipiless_idle_exit(void)
+{
+	clear_thread_flag(TIF_IPILESS_IDLE);
+}
+
+static inline int is_ipi_pending(void)
+{
+	return unlikely(test_thread_flag(TIF_IPI_PENDING));
+}
+
+static inline int need_wakeup(void)
+{
+	return need_resched() || is_ipi_pending();
+}
+
+static inline void ipiless_pending_work(void)
+{
+	if (is_ipi_pending()) {
+		clear_thread_flag(TIF_IPI_PENDING);
+		local_bh_disable();
+		local_irq_disable();
+		generic_smp_call_function_single_interrupt();
+		__scheduler_ipi();
+		local_irq_enable();
+		local_bh_enable();
+	}
+}
+
+static inline int ipiless_magic_poke(int cpu)
+{
+	int val;
+	atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);
+
+	val = atomic_read(idle_flag);
+	if (unlikely(val & _TIF_IPI_PENDING))
+		return 1;
+
+	if (!(val & _TIF_IPILESS_IDLE))
+		return 0;
+
+	if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
+		return 1;
+
+	return 0;
+}
+
+#else
+static inline void ipiless_pending_work(void) { }
+static inline void ipiless_idle_enter(void) { }
+static inline void ipiless_idle_exit(void) { }
+
+static inline int need_wakeup(void)
+{
+	return need_resched();
+}
+#endif
+
+#endif /* _ASM_X86_IPILESS_POKE_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index bc817cd..f5cd1b8 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,8 @@ struct thread_info {
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
 #define TIF_SYSCALL_TRACEPOINT	28	/* syscall tracepoint instrumentation */
+#define TIF_IPILESS_IDLE	29	/* IPIless idle bit */
+#define TIF_IPI_PENDING		30	/* IPI pending on the CPU */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -116,6 +118,8 @@ struct thread_info {
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_IPILESS_IDLE	(1 << TIF_IPILESS_IDLE)
+#define _TIF_IPI_PENDING	(1 << TIF_IPI_PENDING)
 
 /* work to do in syscall_trace_enter() */
 #define _TIF_WORK_SYSCALL_ENTRY	\
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f50e7fb..50833ed 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
 #include <acpi/processor.h>
 #include <asm/acpi.h>
 #include <asm/mwait.h>
+#include <asm/ipiless_poke.h>
 
 /*
  * Initialize bm_flags based on the CPU cache properties
@@ -161,15 +162,17 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
  */
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
-	if (!need_resched()) {
+	ipiless_idle_enter();
+	if (!need_wakeup()) {
 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
-		if (!need_resched())
+		if (!need_wakeup())
 			__mwait(ax, cx);
 	}
+	ipiless_idle_exit();
 }
 
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 324cd72..a963e98 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -44,6 +44,7 @@
 #include <asm/system.h>
 #include <asm/ldt.h>
 #include <asm/processor.h>
+#include <asm/ipiless_poke.h>
 #include <asm/i387.h>
 #include <asm/desc.h>
 #ifdef CONFIG_MATH_EMULATION
@@ -116,6 +117,7 @@ void cpu_idle(void)
 			if (cpuidle_idle_call())
 				pm_idle();
 			start_critical_timings();
+			ipiless_pending_work();
 		}
 		rcu_idle_exit();
 		tick_nohz_idle_exit();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 753e803..93b2e5f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -44,6 +44,7 @@
 #include <asm/processor.h>
 #include <asm/i387.h>
 #include <asm/mmu_context.h>
+#include <asm/ipiless_poke.h>
 #include <asm/prctl.h>
 #include <asm/desc.h>
 #include <asm/proto.h>
@@ -153,6 +154,7 @@ void cpu_idle(void)
 			   has already called exit_idle. But some idle
 			   loops can be woken up without interrupt. */
 			__exit_idle();
+			ipiless_pending_work();
 		}
 
 		tick_nohz_idle_exit();
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index a8ff227..e66a4c8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
+#include <asm/ipiless_poke.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
 #include <asm/nmi.h>
@@ -109,6 +110,14 @@
  *	about nothing of note with C stepping upwards.
  */
 
+DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
+
+void __cpuinit thread_idle_state_setup(struct task_struct *idle, int cpu)
+{
+	per_cpu(idle_task_ti_flags, cpu) =
+				(atomic_t *)(&(task_thread_info(idle)->flags));
+}
+
 /*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
@@ -120,11 +129,18 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
+
+	if (ipiless_magic_poke(cpu))
+		return;
+
 	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
 }
 
 void native_send_call_func_single_ipi(int cpu)
 {
+	if (ipiless_magic_poke(cpu))
+		return;
+
 	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..e07ca62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2298,9 +2298,11 @@ extern char *get_task_comm(char *to, struct task_struct *tsk);
 
 #ifdef CONFIG_SMP
 void scheduler_ipi(void);
+void __scheduler_ipi(void);
 extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
 #else
 static inline void scheduler_ipi(void) { }
+static inline void __scheduler_ipi(void) { }
 static inline unsigned long wait_task_inactive(struct task_struct *p,
 					       long match_state)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..1558316 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
 	raw_spin_unlock(&rq->lock);
 }
 
+void __scheduler_ipi(void)
+{
+	if (llist_empty(&this_rq()->wake_list))
+		return;
+
+	sched_ttwu_pending();
+}
+
 void scheduler_ipi(void)
 {
 	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
@@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct task_struct *idle)
 	idle->sched_class = &idle_sched_class;
 }
 
+void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle, int cpu)
+{
+}
+
 /**
  * init_idle - set up an idle thread for a given CPU
  * @idle: task in question
@@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 
 	/* Set the preempt count _outside_ the spinlocks! */
 	task_thread_info(idle)->preempt_count = 0;
+	thread_idle_state_setup(idle, cpu);
 
 	/*
 	 * The idle tasks have their own, simple scheduling class:
-- 
1.7.7.3



* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  0:36 [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Venkatesh Pallipadi
@ 2012-02-23  7:50 ` Ingo Molnar
  2012-02-23  9:08   ` Peter Zijlstra
                     ` (2 more replies)
  2012-02-23  9:30 ` [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Peter Zijlstra
  1 sibling, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2012-02-23  7:50 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel


* Venkatesh Pallipadi <venki@google.com> wrote:

> * Do we need some accounting for these wakeups exported for powertop?

If anything, then tracepoints.

> * We can also eliminate the TS_POLLING flag in favor of this. But that will have
>   a lot more touchpoints and is better done as a standalone change.

Should most definitely be done for this series to be acceptable - 
as a preparatory patch in the series, with the feature at the 
end of the series.

> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);

That's ugly, we should access the idle task's ti flags directly.

To have efficient percpu access to the idle threads another 
clean-up is needed: we should turn idle_thread_array into a 
full-structure PER_CPU area.

For that we need a small variant of fork_idle(), which does not 
dup the init thread - pretty trivial.

fork_idle() should also make sure it does not schedule the child 
thread: thus we'd also be able to further simplify smpboot.c and 
get rid of all that extremely ugly 'struct create_idle' 
gymnastics in smpboot.c.
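
For illustration, direct access could be as simple as the sketch below,
using the existing idle_task() accessor - though that goes through
cpu_rq(cpu)->idle, which is exactly the indirection the per-cpu clean-up
above would remove:

  static inline __u32 *idle_ti_flags(int cpu)
  {
          return &task_thread_info(idle_task(cpu))->flags;
  }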

Thanks,

	Ingo


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  7:50 ` Ingo Molnar
@ 2012-02-23  9:08   ` Peter Zijlstra
  2012-02-23 20:04     ` Venki Pallipadi
  2012-02-23 20:03   ` Venki Pallipadi
  2012-03-02  0:33   ` Venki Pallipadi
  2 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2012-02-23  9:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Venkatesh Pallipadi, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	Yong Zhang, linux-kernel

On Thu, 2012-02-23 at 08:50 +0100, Ingo Molnar wrote:
> * Venkatesh Pallipadi <venki@google.com> wrote:

> > * We can also eliminate the TS_POLLING flag in favor of this. But that will have
> >   a lot more touchpoints and is better done as a standalone change.
> 
> Should most definitely be done for this series to be acceptable - 
> as a preparatory patch in the series, with the feature at the 
> end of the series.

I don't think you can: TS_POLLING still works for idle=poll, whereas
this new stuff requires monitor/mwait.



* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  0:36 [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Venkatesh Pallipadi
  2012-02-23  7:50 ` Ingo Molnar
@ 2012-02-23  9:30 ` Peter Zijlstra
  2012-02-23 19:34   ` Venki Pallipadi
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2012-02-23  9:30 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Suresh Siddha,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel

On Wed, 2012-02-22 at 16:36 -0800, Venkatesh Pallipadi wrote:

> Changes since previous versions:

>   Moved the changes into arch specific code as per PeterZ suggestion

You failed:

>  include/linux/sched.h               |    2 +
>  kernel/sched/core.c                 |   13 ++++++



> diff --git a/arch/x86/include/asm/ipiless_poke.h b/arch/x86/include/asm/ipiless_poke.h
> new file mode 100644
> index 0000000..58670c7
> --- /dev/null
> +++ b/arch/x86/include/asm/ipiless_poke.h
> @@ -0,0 +1,82 @@
> +#ifndef _ASM_X86_IPILESS_POKE_H
> +#define _ASM_X86_IPILESS_POKE_H
> +
> +#include <linux/sched.h>
> +#include <asm/thread_info.h>
> +
> +#ifdef CONFIG_SMP
> +
> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
> +
> +/*
> + * Use 2 bits in idle_task's thread info flags:
> + * TIF_IPILESS_IDLE marks entry into and exit from idle states with ipiless
> + * wakeup capability.
> + * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI target CPU
> + * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already set).
> + * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle state.
> + */
> +
> +static inline void ipiless_idle_enter(void)
> +{
> +	set_thread_flag(TIF_IPILESS_IDLE);
> +}
> +
> +static inline void ipiless_idle_exit(void)
> +{
> +	clear_thread_flag(TIF_IPILESS_IDLE);
> +}
> +
> +static inline int is_ipi_pending(void)
> +{
> +	return unlikely(test_thread_flag(TIF_IPI_PENDING));
> +}
> +
> +static inline int need_wakeup(void)
> +{
> +	return need_resched() || is_ipi_pending();
> +}
> +
> +static inline void ipiless_pending_work(void)
> +{
> +	if (is_ipi_pending()) {
> +		clear_thread_flag(TIF_IPI_PENDING);
> +		local_bh_disable();
> +		local_irq_disable();

That local_bh_disable() is completely superfluous, disabling IRQs
already kills bh.

> +		generic_smp_call_function_single_interrupt();
> +		__scheduler_ipi();

Why not scheduler_ipi()?

Also, you could keep a pending vector bitmask in a per-cpu variable to
avoid having to call all handlers. not sure that's worth it for just
two, but something to keep in mind for if/when this starts expanding.

> +		local_irq_enable();
> +		local_bh_enable();
> +	}
> +}

Why doesn't ipiless_idle_exit() call this? That would keep it isolated
to just those idle routines that actually use mwait instead of bothering
the generic idle paths with this.

> +static inline int ipiless_magic_poke(int cpu)
> +{
> +	int val;
> +	atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);

What's this atomic_t nonsense about? thread_info::flags is __u32,
casting it to atomic_t is complete rubbish.

> +
> +	val = atomic_read(idle_flag);

The __u32 version would look like: val = ACCESS_ONCE(*idle_flag);

> +	if (unlikely(val & _TIF_IPI_PENDING))
> +		return 1;
> +
> +	if (!(val & _TIF_IPILESS_IDLE))
> +		return 0;
> +
> +	if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))

The __u32 version would look like:

  if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))

Bonus win for being shorter!

> +		return 1;
> +
> +	return 0;
> +}
> +
> +#else
> +static inline void ipiless_pending_work(void) { }
> +static inline void ipiless_idle_enter(void) { }
> +static inline void ipiless_idle_exit(void) { }
> +
> +static inline int need_wakeup(void)
> +{
> +	return need_resched();
> +}
> +#endif
> +
> +#endif /* _ASM_X86_IPILESS_POKE_H */


> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index a8ff227..e66a4c8 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
>  #include <asm/mtrr.h>
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> +#include <asm/ipiless_poke.h>
>  #include <asm/proto.h>
>  #include <asm/apic.h>
>  #include <asm/nmi.h>
> @@ -109,6 +110,14 @@
>   *	about nothing of note with C stepping upwards.
>   */
>  
> +DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
> +
> +void __cpuinit thread_idle_state_setup(struct task_struct *idle, int cpu)
> +{
> +	per_cpu(idle_task_ti_flags, cpu) =
> +				(atomic_t *)(&(task_thread_info(idle)->flags));
> +}

As Ingo already pointed out, it's the architecture that actually sets up
the idle threads, so putting callbacks into the generic code to find
them is kinda backwards.

Again, *yuck* at that atomic_t business.


> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5255c9d..1558316 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
>  	raw_spin_unlock(&rq->lock);
>  }
>  
> +void __scheduler_ipi(void)
> +{
> +	if (llist_empty(&this_rq()->wake_list))
> +		return;
> +
> +	sched_ttwu_pending();
> +}

FAIL!! It should be 100% identical to a normal IPI: that calls
scheduler_ipi(), so this should too.

>  void scheduler_ipi(void)
>  {
>  	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> @@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct task_struct *idle)
>  	idle->sched_class = &idle_sched_class;
>  }
>  
> +void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle, int cpu)
> +{
> +}
> +
>  /**
>   * init_idle - set up an idle thread for a given CPU
>   * @idle: task in question
> @@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
>  
>  	/* Set the preempt count _outside_ the spinlocks! */
>  	task_thread_info(idle)->preempt_count = 0;
> +	thread_idle_state_setup(idle, cpu);

I suggest you put this in smpboot.c someplace ;-)
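
Putting the __u32 suggestions above together, the reworked poke would look
roughly like this (a sketch, assuming idle_task_ti_flags is retyped to
__u32 *):

  static inline int ipiless_magic_poke(int cpu)
  {
          __u32 val, *idle_flag = per_cpu(idle_task_ti_flags, cpu);

          val = ACCESS_ONCE(*idle_flag);
          if (unlikely(val & _TIF_IPI_PENDING))
                  return 1;       /* already poked; nothing more to do */

          if (!(val & _TIF_IPILESS_IDLE))
                  return 0;       /* not in mwait idle; send a real IPI */

          /* this store to the monitored word ends the target's mwait */
          if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
                  return 1;

          return 0;
  }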


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  9:30 ` [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Peter Zijlstra
@ 2012-02-23 19:34   ` Venki Pallipadi
  2012-02-24  5:41     ` Yong Zhang
  2012-02-27  8:45     ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Venki Pallipadi @ 2012-02-23 19:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Suresh Siddha,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel

On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, 2012-02-22 at 16:36 -0800, Venkatesh Pallipadi wrote:
>
> > Changes since previous versions:
>
> >   Moved the changes into arch specific code as per PeterZ suggestion
>
> You failed:
>
> >  include/linux/sched.h               |    2 +
> >  kernel/sched/core.c                 |   13 ++++++
>
>
>
> > diff --git a/arch/x86/include/asm/ipiless_poke.h b/arch/x86/include/asm/ipiless_poke.h
> > new file mode 100644
> > index 0000000..58670c7
> > --- /dev/null
> > +++ b/arch/x86/include/asm/ipiless_poke.h
> > @@ -0,0 +1,82 @@
> > +#ifndef _ASM_X86_IPILESS_POKE_H
> > +#define _ASM_X86_IPILESS_POKE_H
> > +
> > +#include <linux/sched.h>
> > +#include <asm/thread_info.h>
> > +
> > +#ifdef CONFIG_SMP
> > +
> > +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
> > +
> > +/*
> > + * Use 2 bits in idle_task's thread info flags:
> > + * TIF_IPILESS_IDLE marks entry into and exit from idle states with ipiless
> > + * wakeup capability.
> > + * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI target CPU
> > + * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already set).
> > + * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle state.
> > + */
> > +
> > +static inline void ipiless_idle_enter(void)
> > +{
> > +     set_thread_flag(TIF_IPILESS_IDLE);
> > +}
> > +
> > +static inline void ipiless_idle_exit(void)
> > +{
> > +     clear_thread_flag(TIF_IPILESS_IDLE);
> > +}
> > +
> > +static inline int is_ipi_pending(void)
> > +{
> > +     return unlikely(test_thread_flag(TIF_IPI_PENDING));
> > +}
> > +
> > +static inline int need_wakeup(void)
> > +{
> > +     return need_resched() || is_ipi_pending();
> > +}
> > +
> > +static inline void ipiless_pending_work(void)
> > +{
> > +     if (is_ipi_pending()) {
> > +             clear_thread_flag(TIF_IPI_PENDING);
> > +             local_bh_disable();
> > +             local_irq_disable();
>
> That local_bh_disable() is completely superfluous, disabling IRQs
> already kills bh.

The reason for local_bh_disable/enable was to check for potential
softirqs after these interrupts.

>
> > +             generic_smp_call_function_single_interrupt();
> > +             __scheduler_ipi();
>
> Why not scheduler_ipi()?

Was trying to avoid irq_enter/exit. As the work here is done in idle
thread context, I thought we could avoid enter/exit. Also, if we need
it, we should ideally do it once across the scheduler and smp_call work
together instead of in scheduler_ipi().

>
> Also, you could keep a pending vector bitmask in a per-cpu variable to
> avoid having to call all handlers. not sure that's worth it for just
> two, but something to keep in mind for if/when this starts expanding.
>

Agree. For anything more than two, it will be better to have an
additional bitmask.

> > +             local_irq_enable();
> > +             local_bh_enable();
> > +     }
> > +}
>
> Why doesn't ipiless_idle_exit() call this? That would keep it isolated
> to just those idle routines that actually use mwait instead of bothering
> the generic idle paths with this.

ipiless_idle_exit is called in the innermost part of idle entry/exit.
In the mwait case we may not even have interrupts enabled at that time,
and there is code that does idle residency timing etc. which would be
affected if we started doing the pending IPI work there. Doing it at a
higher level, along with things like tickless enter/exit, felt nicer.

>
> > +static inline int ipiless_magic_poke(int cpu)
> > +{
> > +     int val;
> > +     atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);
>
> What's this atomic_t nonsense about? thread_info::flags is __u32,
> casting it to atomic_t is complete rubbish.
>

I have to say I was not a big fan of that typecast either. I just did
not know about the ACCESS_ONCE() interface.
Thanks for the tips below. This is the first thing I am going to
change in this patch.

> > +
> > +     val = atomic_read(idle_flag);
>
> The __u32 version would look like: val = ACCESS_ONCE(*idle_flag);
>
> > +     if (unlikely(val & _TIF_IPI_PENDING))
> > +             return 1;
> > +
> > +     if (!(val & _TIF_IPILESS_IDLE))
> > +             return 0;
> > +
> > +     if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
>
> The __u32 version would look like:
>
>  if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
>
> Bonus win for being shorter!
>
> > +             return 1;
> > +
> > +     return 0;
> > +}
> > +
> > +#else
> > +static inline void ipiless_pending_work(void) { }
> > +static inline void ipiless_idle_enter(void) { }
> > +static inline void ipiless_idle_exit(void) { }
> > +
> > +static inline int need_wakeup(void)
> > +{
> > +     return need_resched();
> > +}
> > +#endif
> > +
> > +#endif /* _ASM_X86_IPILESS_POKE_H */
>
>
> > diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> > index a8ff227..e66a4c8 100644
> > --- a/arch/x86/kernel/smp.c
> > +++ b/arch/x86/kernel/smp.c
> > @@ -27,6 +27,7 @@
> >  #include <asm/mtrr.h>
> >  #include <asm/tlbflush.h>
> >  #include <asm/mmu_context.h>
> > +#include <asm/ipiless_poke.h>
> >  #include <asm/proto.h>
> >  #include <asm/apic.h>
> >  #include <asm/nmi.h>
> > @@ -109,6 +110,14 @@
> >   *   about nothing of note with C stepping upwards.
> >   */
> >
> > +DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
> > +
> > +void __cpuinit thread_idle_state_setup(struct task_struct *idle, int cpu)
> > +{
> > +     per_cpu(idle_task_ti_flags, cpu) =
> > +                             (atomic_t *)(&(task_thread_info(idle)->flags));
> > +}
>
> As Ingo already pointed out, it's the architecture that actually sets up
> the idle threads, so putting callbacks into the generic code to find
> them is kinda backwards.

Yes. I need to spend a bit more time looking at the cleanups Ingo suggested.

>
> Again, *yuck* at that atomic_t business.

They are gone now...

>
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5255c9d..1558316 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
> >       raw_spin_unlock(&rq->lock);
> >  }
> >
> > +void __scheduler_ipi(void)
> > +{
> > +     if (llist_empty(&this_rq()->wake_list))
> > +             return;
> > +
> > +     sched_ttwu_pending();
> > +}
>
> FAIL!! It should be 100% identical to a normal IPI: that calls
> scheduler_ipi(), so this should too.
>
> >  void scheduler_ipi(void)
> >  {
> >       if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > @@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct task_struct *idle)
> >       idle->sched_class = &idle_sched_class;
> >  }
> >
> > +void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle, int cpu)
> > +{
> > +}
> > +
> >  /**
> >   * init_idle - set up an idle thread for a given CPU
> >   * @idle: task in question
> > @@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
> >
> >       /* Set the preempt count _outside_ the spinlocks! */
> >       task_thread_info(idle)->preempt_count = 0;
> > +     thread_idle_state_setup(idle, cpu);
>
> I suggest you put this in smpboot.c someplace ;-)

Thanks,
Venki


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  7:50 ` Ingo Molnar
  2012-02-23  9:08   ` Peter Zijlstra
@ 2012-02-23 20:03   ` Venki Pallipadi
  2012-03-02  0:33   ` Venki Pallipadi
  2 siblings, 0 replies; 37+ messages in thread
From: Venki Pallipadi @ 2012-02-23 20:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

[ Resending without the ugly email client formatting ]

On Wed, Feb 22, 2012 at 11:50 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Venkatesh Pallipadi <venki@google.com> wrote:
>
>> * Do we need some accounting for these wakeups exported for powertop?
>
> If anything, then tracepoints.
>
>> * We can also eliminate the TS_POLLING flag in favor of this. But that will have
>>   a lot more touchpoints and is better done as a standalone change.
>
> Should most definitely be done for this series to be acceptable -
> as a preparatory patch in the series, with the feature at the
> end of the series.
>
OK. Will look at the TS_POLLING part and likely include it in the next resend.

>> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
>
> That's ugly, we should access the idle task's ti flags directly.
>
> To have efficient percpu access to the idle threads another
> clean-up is needed: we should turn idle_thread_array into a
> full-structure PER_CPU area.
>
> For that we need a small variant of fork_idle(), which does not
> dup the init thread - pretty trivial.
>
> fork_idle() should also make sure it does not schedule the child
> thread: thus we'd also be able to further simplify smpboot.c and
> get rid of all that extremely ugly 'struct create_idle'
> gymnastics in smpboot.c.
>

Hmm. Not being very familiar with that code, I will have to take a closer
look at this potential cleanup...

Thanks,
Venki

> Thanks,
>
>        Ingo


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  9:08   ` Peter Zijlstra
@ 2012-02-23 20:04     ` Venki Pallipadi
  0 siblings, 0 replies; 37+ messages in thread
From: Venki Pallipadi @ 2012-02-23 20:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

[ Resending without the ugly email client formatting ]

On Thu, Feb 23, 2012 at 1:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2012-02-23 at 08:50 +0100, Ingo Molnar wrote:
>> * Venkatesh Pallipadi <venki@google.com> wrote:
>
>> > * We can also eliminate the TS_POLLING flag in favor of this. But that will have
>> >   a lot more touchpoints and is better done as a standalone change.
>>
>> Should most definitely be done for this series to be acceptable -
>> as a preparatory patch in the series, with the feature at the
>> end of the series.
>
> I don't think you can: TS_POLLING still works for idle=poll, whereas
> this new stuff requires monitor/mwait.
>

Even poll_idle should be able to use this ipiless_wakeup. It just has
to poll for the additional IPI_PENDING bit along with need_resched.
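
A sketch of that, reusing the helpers from this patch (hypothetical, not
part of the posted series):

  static void poll_idle(void)
  {
          ipiless_idle_enter();
          local_irq_enable();
          while (!need_wakeup())          /* need_resched() || IPI pending */
                  cpu_relax();
          ipiless_idle_exit();
          ipiless_pending_work();         /* run any deferred IPI work */
  }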


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23 19:34   ` Venki Pallipadi
@ 2012-02-24  5:41     ` Yong Zhang
  2012-02-24  6:13       ` Yong Zhang
  2012-02-26  1:32       ` Paul E. McKenney
  2012-02-27  8:45     ` Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Yong Zhang @ 2012-02-24  5:41 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, linux-kernel

On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Why not scheduler_ipi()?
> 
> Was trying to avoid irq_enter/exit. As the work here is done in idle
> thread context, I thought we could avoid enter/exit. 

It seems we could not.
At least RCU needs it; see commit c5d753a55, otherwise we will get a
warning like 'RCU used illegally from extended quiescent state!'

Thanks,
Yong


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-24  5:41     ` Yong Zhang
@ 2012-02-24  6:13       ` Yong Zhang
  2012-02-27  8:38         ` Peter Zijlstra
  2012-02-26  1:32       ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: Yong Zhang @ 2012-02-24  6:13 UTC (permalink / raw)
  To: Venki Pallipadi, Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Suresh Siddha,
	Aaron Durbin, Paul Turner, linux-kernel

On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > Why not scheduler_ipi()?
> > 
> > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > thread context, I thought we could avoid enter/exit. 
> 
> It seems we could not.
> At least RCU needs it; see commit c5d753a55, otherwise we will get a
> warning like 'RCU used illegally from extended quiescent state!'

[Off topic]
This reminds me that we should have moved the irq_enter()/irq_exit() to
each arch's related irq handler.
see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2

So Peter, is someone already on it? Or is it still worth doing now?

Thanks,
Yong


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-24  5:41     ` Yong Zhang
  2012-02-24  6:13       ` Yong Zhang
@ 2012-02-26  1:32       ` Paul E. McKenney
  2012-02-27  9:06         ` Yong Zhang
  1 sibling, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2012-02-26  1:32 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	linux-kernel

On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > Why not scheduler_ipi()?
> > 
> > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > thread context, I thought we could avoid enter/exit. 
> 
> It seems we could not.
> At least RCU needs it; see commit c5d753a55, otherwise we will get a
> warning like 'RCU used illegally from extended quiescent state!'

If the use is tracing, then Steven Rostedt's patchset plus use of his
_rcuidle() tracing variants handles this:

	https://lkml.org/lkml/2012/2/7/231

If this is instead algorithmic use of RCU, a set of patches I have queued
up for 3.4 will be required.

							Thanx, Paul



* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-24  6:13       ` Yong Zhang
@ 2012-02-27  8:38         ` Peter Zijlstra
  2012-02-27  9:08           ` Yong Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2012-02-27  8:38 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Venki Pallipadi, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, linux-kernel

On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:

> This reminds me that we should have moved the irq_enter()/irq_exit() to
> each arch's related irq handler.
> see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> 
> So Peter, is someone already on it? Or is it still worth doing now?

trouble was that some people didn't feel comfortable adding that
overhead to plain resched IPIs that didn't need to do anything. So I
more or less let it sit for a while.


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23 19:34   ` Venki Pallipadi
  2012-02-24  5:41     ` Yong Zhang
@ 2012-02-27  8:45     ` Peter Zijlstra
  2012-02-27 18:17       ` Venki Pallipadi
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2012-02-27  8:45 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Suresh Siddha,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel

On Thu, 2012-02-23 at 11:34 -0800, Venki Pallipadi wrote:

> > > +             local_bh_disable();
> > > +             local_irq_disable();
> >
> > That local_bh_disable() is completely superfluous, disabling IRQs
> > already kills bh.
> 
> The reason for local_bh_disable/enable was to check for potential
> softirqs after these interrupts.

Why is that needed? We never wrap local_irq_disable() in bh muck?


> > Why doesn't ipiless_idle_exit() call this? That would keep it isolated
> > to just those idle routines that actually use mwait instead of bothering
> > the generic idle paths with this.
> 
> ipiless_idle_exit is called in the innermost part of idle entry/exit.
> In the mwait case we may not even have interrupts enabled at that time,
> and there is code that does idle residency timing etc. which would be
> affected if we started doing the pending IPI work there. Doing it at a
> higher level, along with things like tickless enter/exit, felt nicer.

But but but, an actual interrupt can be handled before the instruction
after mwait, so why would this be a problem?



* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-26  1:32       ` Paul E. McKenney
@ 2012-02-27  9:06         ` Yong Zhang
  2012-02-27 17:05           ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Yong Zhang @ 2012-02-27  9:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	linux-kernel

On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > Why not scheduler_ipi()?
> > > 
> > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > thread context, I thought we could avoid enter/exit. 
> > 
> > It seems we could not.
> > At least RCU needs it; see commit c5d753a55, otherwise we will get a
> > warning like 'RCU used illegally from extended quiescent state!'
> 
> If the use is tracing, then Steven Rostedt's patchset plus use of his
> _rcuidle() tracing variants handles this:
> 
> 	https://lkml.org/lkml/2012/2/7/231
> 
> If this is instead algorithmic use of RCU, a set of patches I have queued
> up for 3.4 will be required.

scheduler_ipi() does more than tracing. Will look at your patches :)

Thanks,
Yong


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-27  8:38         ` Peter Zijlstra
@ 2012-02-27  9:08           ` Yong Zhang
  2012-02-27  9:30             ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Yong Zhang @ 2012-02-27  9:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Venki Pallipadi, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, linux-kernel

On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> 
> > This reminds me that we should have moved the irq_enter()/irq_exit() to
> > each arch's related irq handler.
> > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > 
> > So Peter, is someone already on it? Or is it still worth doing now?
> 
> trouble was that some people didn't feel comfortable adding that
> overhead to plain resched IPIs that didn't need to do anything.

Any ideas on where the plain resched IPIs come from?

Thanks,
Yong

> So I
> more or less let it sit for a while.

-- 
Only stand for myself


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-27  9:08           ` Yong Zhang
@ 2012-02-27  9:30             ` Peter Zijlstra
  2012-02-27  9:51               ` Yong Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2012-02-27  9:30 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Venki Pallipadi, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, linux-kernel

On Mon, 2012-02-27 at 17:08 +0800, Yong Zhang wrote:
> On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> > On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> > 
> > > This reminds me that we should have moved the irq_enter()/irq_exit() to
> > > each arch's related irq handler.
> > > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > > 
> > > So Peter, is someone already on it? Or is it still worth doing now?
> > 
> > trouble was that some people didn't feel comfortable adding that
> > overhead to plain resched IPIs that didn't need to do anything.
> 
> Any ideas on where the plain resched IPIs come from?

Remote wakeups that don't queue, I guess.

IIRC I looked at it a while back and while not in the majority there
were still a few (I only added counters, didn't look where they came
from).

However, since 518cd623 there'd be a lot more, I guess...

Feel free to investigate. Also, some non-sched users of the resched IPI
exist, e.g. KVM uses it to kick a remote vcpu out of guest mode.


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-27  9:30             ` Peter Zijlstra
@ 2012-02-27  9:51               ` Yong Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yong Zhang @ 2012-02-27  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Venki Pallipadi, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, linux-kernel

On Mon, Feb 27, 2012 at 10:30:58AM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-27 at 17:08 +0800, Yong Zhang wrote:
> > On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> > > On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> > > 
> > > > This reminds me that we should have moved the irq_enter()/irq_exit() to
> > > > each arch's related irq handler.
> > > > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > > > 
> > > > So Peter, is someone already on it? Or is it still worth doing now?
> > > 
> > > trouble was that some people didn't feel comfortable adding that
> > > overhead to plain resched IPIs that didn't need to do anything.
> > 
> > Any ideas on where the plain resched IPIs come from?
> 
> Remote wakeups that don't queue, I guess.
> 
> IIRC I looked at it a while back and while not in the majority there
> were still a few (I only added counters, didn't look where they came
> from).
> 
> However, since 518cd623 there'd be a lot more, I guess...
> 
> Feel free to investigate. Also, some non-sched users of the resched IPI
> exist, e.g. KVM uses it to kick a remote vcpu out of guest mode.

Thanks for the detail.

Will try to find if there is something interesting.

Thanks,
Yong


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-27  9:06         ` Yong Zhang
@ 2012-02-27 17:05           ` Paul E. McKenney
  2012-02-28  7:12             ` Yong Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2012-02-27 17:05 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	linux-kernel

On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > Why not scheduler_ipi()?
> > > > 
> > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > thread context, I thought we could avoid enter/exit. 
> > > 
> > > It seems we could not.
> > > At least RCU needs it; see commit c5d753a55, otherwise we will get a
> > > warning like 'RCU used illegally from extended quiescent state!'
> > 
> > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > _rcuidle() tracing variants handles this:
> > 
> > 	https://lkml.org/lkml/2012/2/7/231
> > 
> > If this is instead algorithmic use of RCU, a set of patches I have queued
> > up for 3.4 will be required.
> 
> scheduler_ipi() doing more than tracing. Will look at your patches :)

Ah!  The key question is whether or not the code in question is called
both from idle and from non-idle.  This will be easiest if the code is
called only from idle, in which case you should only need this one:

	https://lkml.org/lkml/2012/2/3/498

							Thanx, Paul



* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-27  8:45     ` Peter Zijlstra
@ 2012-02-27 18:17       ` Venki Pallipadi
  0 siblings, 0 replies; 37+ messages in thread
From: Venki Pallipadi @ 2012-02-27 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Suresh Siddha,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel

On Mon, Feb 27, 2012 at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2012-02-23 at 11:34 -0800, Venki Pallipadi wrote:
>
>> > > +             local_bh_disable();
>> > > +             local_irq_disable();
>> >
>> > That local_bh_disable() is completely superfluous, disabling IRQs
>> > already kills bh.
>>
>> The reason for local_bh_disable/enable was to check for potential
>> softirqs after these interrupts.
>
> Why is that needed? We never wrap local_irq_disable() in bh muck?
>

For normal interrupts, we check for bh on interrupt return. For the idle
loop no-interrupt case, we have to call the bh handler explicitly, as
otherwise we will go back to idle until need_resched().
local_bh_enable() handles this bh call here.

>> > Why doesn't ipiless_idle_exit() call this? That would keep it isolated
>> > to just those idle routines that actually use mwait instead of bothering
>> > the generic idle paths with this.
>>
>> ipiless_idle_exit is called in the innermost part of idle entry/exit.
>> In the mwait case we may not even have interrupts enabled at that time,
>> and there is code that does idle residency timing etc. which would be
>> affected if we started doing the pending IPI work there. Doing it at a
>> higher level, along with things like tickless enter/exit, felt nicer.
>
> But but but, an actual interrupt can be handled before the instruction
> after mwait, so why would this be a problem?
>

With the most common usage of mwait(), interrupts are disabled on entry
to and exit from mwait(). There is a special flag that brings the CPU out
of mwait on an interrupt, without actually handling the interrupt. We do
C-state timing in acpi idle/intel idle and then enable interrupts, and
that's where the interrupt gets handled. My concern is that handling the
interrupt and bh work immediately after mwait exit would inflate C-state
residency timings.
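
Roughly, the flow being described (a simplified sketch, not the actual
driver code):

  local_irq_disable();
  t0 = ktime_get();
  /* ECX bit 0 makes mwait break on an interrupt even with IRQs masked */
  mwait_idle_with_hints(hints, MWAIT_ECX_INTERRUPT_BREAK);
  t1 = ktime_get();       /* C-state residency = t1 - t0, IRQs still off */
  local_irq_enable();     /* the pending interrupt is handled only here */

Doing the pending IPI work right after the mwait exit, i.e. before the t1
timestamp, is what would inflate the accounted residency.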

Thanks,
Venki


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-27 17:05           ` Paul E. McKenney
@ 2012-02-28  7:12             ` Yong Zhang
  2012-02-28 13:05               ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Yong Zhang @ 2012-02-28  7:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	linux-kernel

On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > Why not scheduler_ipi()?
> > > > > 
> > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > thread context, I thought we could avoid enter/exit. 
> > > > 
> > > > It seems we could not.
> > > > At least RCU needs it; see commit c5d753a55, otherwise we will get a
> > > > warning like 'RCU used illegally from extended quiescent state!'
> > > 
> > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > _rcuidle() tracing variants handles this:
> > > 
> > > 	https://lkml.org/lkml/2012/2/7/231
> > > 
> > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > up for 3.4 will be required.
> > 
> > scheduler_ipi() does more than tracing. Will look at your patches :)
> 
> Ah!  The key question is whether or not the code in question is called
> both from idle and from non-idle.

In fact, before this patch from Venki, the only call site of scheduler_ipi()
was the resched irq handler. Then Venki introduced __scheduler_ipi() (which
avoids irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.

But when I was testing this patch, I didn't see an explicit warning about
illegal RCU usage. The reason may be that 1) there aren't many RCU
dereferences in scheduler_ipi(), though we do tracing in it; 2) rq->lock
provides some kind of protection.
Maybe I'm being over-cautious, but it is a potential danger.

But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
in his latest version.

> This will be easiest if the code is
> called only from idle, in which case you should only need this one:
> 
> 	https://lkml.org/lkml/2012/2/3/498

Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)
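
For reference, usage per Paul's patch would look roughly like this (the
traced event name is made up):

  RCU_NONIDLE(trace_some_idle_event(cpu));

i.e. the macro brackets the wrapped statement with RCU idle-exit/idle-enter
calls, so RCU read-side usage inside it is legal from the idle loop.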

Thanks,
Yong


* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-28  7:12             ` Yong Zhang
@ 2012-02-28 13:05               ` Paul E. McKenney
  2012-02-29  6:36                 ` Yong Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2012-02-28 13:05 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	linux-kernel

On Tue, Feb 28, 2012 at 03:12:55PM +0800, Yong Zhang wrote:
> On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > > Why not scheduler_ipi()?
> > > > > > 
> > > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > > thread context, I thought we could avoid enter/exit. 
> > > > > 
> > > > > It seems we could not.
> > > > > At least RCU needs it; see commit c5d753a55, otherwise we will get a
> > > > > warning like 'RCU used illegally from extended quiescent state!'
> > > > 
> > > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > > _rcuidle() tracing variants handles this:
> > > > 
> > > > 	https://lkml.org/lkml/2012/2/7/231
> > > > 
> > > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > > up for 3.4 will be required.
> > > 
> > > scheduler_ipi() does more than tracing. Will look at your patches :)
> > 
> > Ah!  The key question is whether or not the code in question is called
> > both from idle and from non-idle.
> 
> In fact, before this patch from Venki, the only call site of scheduler_ipi()
> was the resched irq handler. Then Venki introduced __scheduler_ipi() (which
> avoids irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.

Ah, that explains why I didn't see it in my testing.  ;-)

> But when I was testing this patch, I didn't see an explicit warning about
> illegal RCU usage. The reason may be that 1) there aren't many RCU
> dereferences in scheduler_ipi(), though we do tracing in it; 2) rq->lock
> provides some kind of protection.
> Maybe I'm being over-cautious, but it is a potential danger.

Did you have CONFIG_PROVE_RCU=y when testing?

> But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
> in his latest version.

OK.

> > This will be easiest if the code is
> > called only from idle, in which case you should only need this one:
> > 
> > 	https://lkml.org/lkml/2012/2/3/498
> 
> Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)

Seems like it might be needed sooner rather than later...

							Thanx, Paul



* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-28 13:05               ` Paul E. McKenney
@ 2012-02-29  6:36                 ` Yong Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yong Zhang @ 2012-02-29  6:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Suresh Siddha, Aaron Durbin, Paul Turner,
	linux-kernel

On Tue, Feb 28, 2012 at 05:05:52AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 28, 2012 at 03:12:55PM +0800, Yong Zhang wrote:
> > On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > > > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > > > > > Why not scheduler_ipi()?
> > > > > > > 
> > > > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > > > thread context, I thought we could avoid enter/exit. 
> > > > > > 
> > > > > > It seems we could not.
> > > > > > At least RCU needs it; see commit c5d753a55, otherwise we will get a
> > > > > > warning like 'RCU used illegally from extended quiescent state!'
> > > > > 
> > > > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > > > _rcuidle() tracing variants handles this:
> > > > > 
> > > > > 	https://lkml.org/lkml/2012/2/7/231
> > > > > 
> > > > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > > > up for 3.4 will be required.
> > > > 
> > > > scheduler_ipi() does more than tracing. Will look at your patches :)
> > > 
> > > Ah!  The key question is whether or not the code in question is called
> > > both from idle and from non-idle.
> > 
> > In fact, before this patch from Venki, the only call site of scheduler_ipi()
> > was the resched irq handler. Then Venki introduced __scheduler_ipi() (which
> > avoids irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.
> 
> Ah, that explains why I didn't see it in my testing.  ;-)
> 
> > But when I was testing this patch, I didn't see an explicit warning about
> > illegal RCU usage. The reason may be that 1) there aren't many RCU
> > dereferences in scheduler_ipi(), though we do tracing in it; 2) rq->lock
> > provides some kind of protection.
> > Maybe I'm being over-cautious, but it is a potential danger.
> 
> Did you have CONFIG_PROVE_RCU=y when testing?

Yeah.

> zgrep PROVE /proc/config.gz 
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y

> 
> > But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
> > in his latest version.
> 
> OK.
> 
> > > This will be easiest if the code is
> > > called only from idle, in which case you should only need this one:
> > > 
> > > 	https://lkml.org/lkml/2012/2/3/498
> > 
> > Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)
> 
> Seems like it might be needed sooner rather than later...

;-)

Thanks,
Yong

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-02-23  7:50 ` Ingo Molnar
  2012-02-23  9:08   ` Peter Zijlstra
  2012-02-23 20:03   ` Venki Pallipadi
@ 2012-03-02  0:33   ` Venki Pallipadi
  2012-03-02  1:28     ` Suresh Siddha
  2 siblings, 1 reply; 37+ messages in thread
From: Venki Pallipadi @ 2012-03-02  0:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Suresh Siddha, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

On Wed, Feb 22, 2012 at 11:50 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Venkatesh Pallipadi <venki@google.com> wrote:
>
>> * Do we need some accounting for these wakeups exported for powertop?
>
> If so, then tracepoints.
>
>> * We can also eliminate TS_POLLING flag in favor of this. But, that will have
>>   a lot more touchpoints and is better done as a standalone change.
>
> Should most definitely be done for this series to be acceptable -
> as a preparatory patch in the series, with the feature at the
> end of the series.
>
>> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
>
> That's ugly, we should access the idle task's ti flags directly.
>
> To have efficient percpu access to the idle threads another
> clean-up is needed: we should turn idle_thread_array into a
> full-structure PER_CPU area.
>
> For that we need a small variant of fork_idle(), which does not
> dup the init thread - pretty trivial.

OK. I looked a bit deeper into this and I understand what you suggested above.
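
For concreteness, with a full-structure percpu idle task the wakeup path
could poke the idle task's thread_info directly, something like this
sketch (the flag name is made up for illustration):

	/* Reach the idle task's TIF flags without an extra percpu pointer. */
	struct thread_info *ti = task_thread_info(idle_task(cpu));

	set_ti_thread_flag(ti, TIF_IPILESS_POKE);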

>
> fork_idle() should also make sure it does not schedule the child
> thread: thus we'd also be able to further simplify smpboot.c and
> get rid of all that extremely ugly 'struct create_idle'
> gymnastics in smpboot.c.

But not this. I am not sure where fork_idle results in resched of the child.
As I saw it, fork_idle calls init_idle and that sets the affinity of
idle_task to target CPU. So, reschedule should not be a problem. What
am I missing here?

Also, I tried this silly test patch (Cut and paste... Sorry) and it
seemed to work fine both with and without CPU hotplug.

Thanks,
Venki

---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 66d250c..36b80ef 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -686,7 +686,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
                .done   = COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
        };

-       INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
+       // INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);

        alternatives_smp_switch(1);

@@ -703,12 +703,13 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
                goto do_rest;
        }

-       schedule_work(&c_idle.work);
-       wait_for_completion(&c_idle.done);
+       // schedule_work(&c_idle.work);
+       // wait_for_completion(&c_idle.done);
+       c_idle.idle = fork_idle(cpu);

        if (IS_ERR(c_idle.idle)) {
                printk("failed fork for CPU %d\n", cpu);
-               destroy_work_on_stack(&c_idle.work);
+               // destroy_work_on_stack(&c_idle.work);
                return PTR_ERR(c_idle.idle);
        }

@@ -831,7 +832,7 @@ do_rest:
                smpboot_restore_warm_reset_vector();
        }

-       destroy_work_on_stack(&c_idle.work);
+       // destroy_work_on_stack(&c_idle.work);
        return boot_error;
 }

---

>
> Thanks,
>
>        Ingo

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-03-02  0:33   ` Venki Pallipadi
@ 2012-03-02  1:28     ` Suresh Siddha
  2012-03-02  1:35       ` Venki Pallipadi
  0 siblings, 1 reply; 37+ messages in thread
From: Suresh Siddha @ 2012-03-02  1:28 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

On Thu, 2012-03-01 at 16:33 -0800, Venki Pallipadi wrote:
> >
> > fork_idle() should also make sure it does not schedule the child
> > thread: thus we'd also be able to further simplify smpboot.c and
> > get rid of all that extremely ugly 'struct create_idle'
> > gymnastics in smpboot.c.
> 
> But not this. I am not sure where fork_idle results in resched of the child.
> As I saw it, fork_idle calls init_idle and that sets the affinity of
> idle_task to target CPU. So, reschedule should not be a problem. What
> am I missing here?

I think Ingo is referring to the fact that we can't use kthread_create()
here and hence we were relying on fork_idle().

> Also, I tried this silly test patch (Cut and paste... Sorry) and it
> seemed to work fine both with and without CPU hotplug.
> 

I don't think we can do this today, as we need to make sure we have the
correct current context. With dynamic cpu hotplug, current context can
be any process and hence we were depending on the schedule_work()
context.
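
To be concrete, the dependency is that copy_process() always builds the
new task from current (schematic fragment of kernel/fork.c, trimmed):

	/* The child inherits from whoever is current at fork time. */
	p = dup_task_struct(current);

and several routines further down copy_process() likewise look at
current. Forking via schedule_work() guarantees that "whoever" is a
kworker kernel thread rather than an arbitrary process doing the online.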

thanks,
suresh

> Thanks,
> Venki
> 
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 66d250c..36b80ef 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -686,7 +686,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
>                 .done   = COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
>         };
> 
> -       INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
> +       // INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
> 
>         alternatives_smp_switch(1);
> 
> @@ -703,12 +703,13 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
>                 goto do_rest;
>         }
> 
> -       schedule_work(&c_idle.work);
> -       wait_for_completion(&c_idle.done);
> +       // schedule_work(&c_idle.work);
> +       // wait_for_completion(&c_idle.done);
> +       c_idle.idle = fork_idle(cpu);
> 
>         if (IS_ERR(c_idle.idle)) {
>                 printk("failed fork for CPU %d\n", cpu);
> -               destroy_work_on_stack(&c_idle.work);
> +               // destroy_work_on_stack(&c_idle.work);
>                 return PTR_ERR(c_idle.idle);
>         }
> 
> @@ -831,7 +832,7 @@ do_rest:
>                 smpboot_restore_warm_reset_vector();
>         }
> 
> -       destroy_work_on_stack(&c_idle.work);
> +       // destroy_work_on_stack(&c_idle.work);
>         return boot_error;
>  }
> 
> ---
> 
> >
> > Thanks,
> >
> >        Ingo



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-03-02  1:28     ` Suresh Siddha
@ 2012-03-02  1:35       ` Venki Pallipadi
  2012-03-02  1:37         ` Suresh Siddha
  0 siblings, 1 reply; 37+ messages in thread
From: Venki Pallipadi @ 2012-03-02  1:35 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

On Thu, Mar 1, 2012 at 5:28 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Thu, 2012-03-01 at 16:33 -0800, Venki Pallipadi wrote:
>> >
>> > fork_idle() should also make sure it does not schedule the child
>> > thread: thus we'd also be able to further simplify smpboot.c and
>> > get rid of all that extremely ugly 'struct create_idle'
>> > gymnastics in smpboot.c.
>>
>> But not this. I am not sure where fork_idle results in resched of the child.
>> As I saw it, fork_idle calls init_idle and that sets the affinity of
>> idle_task to target CPU. So, reschedule should not be a problem. What
>> am I missing here?
>
> I think Ingo is referring to the fact that we can't use kthread_create()
> here and hence we were relying on fork_idle().
>
>> Also, I tried this silly test patch (Cut and paste... Sorry) and it
>> seemed to work fine both with and without CPU hotplug.
>>
>
> I don't think we can do this today, as we need to make sure we have the
> correct current context. With dynamic cpu hotplug, current context can
> be any process and hence we were depending on the schedule_work()
> context.
>

schedule_work() is only done at boot time. In case of dynamic cpu
hotplug, we skip the whole fork_idle as we already have the task
struct and just do init_idle().
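
I.e., the existing logic in do_boot_cpu() is (trimmed):

	c_idle.idle = get_idle_for_cpu(cpu);
	if (c_idle.idle) {
		/* Re-online: reuse the cached idle task, no fork needed. */
		init_idle(c_idle.idle, cpu);
		goto do_rest;
	}

	/* First boot of this cpu: fork from a workqueue context. */
	schedule_work(&c_idle.work);
	wait_for_completion(&c_idle.done);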

> thanks,
> suresh
>
>> Thanks,
>> Venki
>>
>> ---
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index 66d250c..36b80ef 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -686,7 +686,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
>>                 .done   = COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
>>         };
>>
>> -       INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
>> +       // INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
>>
>>         alternatives_smp_switch(1);
>>
>> @@ -703,12 +703,13 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
>>                 goto do_rest;
>>         }
>>
>> -       schedule_work(&c_idle.work);
>> -       wait_for_completion(&c_idle.done);
>> +       // schedule_work(&c_idle.work);
>> +       // wait_for_completion(&c_idle.done);
>> +       c_idle.idle = fork_idle(cpu);
>>
>>         if (IS_ERR(c_idle.idle)) {
>>                 printk("failed fork for CPU %d\n", cpu);
>> -               destroy_work_on_stack(&c_idle.work);
>> +               // destroy_work_on_stack(&c_idle.work);
>>                 return PTR_ERR(c_idle.idle);
>>         }
>>
>> @@ -831,7 +832,7 @@ do_rest:
>>                 smpboot_restore_warm_reset_vector();
>>         }
>>
>> -       destroy_work_on_stack(&c_idle.work);
>> +       // destroy_work_on_stack(&c_idle.work);
>>         return boot_error;
>>  }
>>
>> ---
>>
>> >
>> > Thanks,
>> >
>> >        Ingo
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-03-02  1:35       ` Venki Pallipadi
@ 2012-03-02  1:37         ` Suresh Siddha
  2012-03-02  2:00           ` Venki Pallipadi
  0 siblings, 1 reply; 37+ messages in thread
From: Suresh Siddha @ 2012-03-02  1:37 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

On Thu, 2012-03-01 at 17:35 -0800, Venki Pallipadi wrote:
> On Thu, Mar 1, 2012 at 5:28 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > On Thu, 2012-03-01 at 16:33 -0800, Venki Pallipadi wrote:
> >> >
> >> > fork_idle() should also make sure it does not schedule the child
> >> > thread: thus we'd also be able to further simplify smpboot.c and
> >> > get rid of all that extremely ugly 'struct create_idle'
> >> > gymnastics in smpboot.c.
> >>
> >> But not this. I am not sure where fork_idle results in resched of the child.
> >> As I saw it, fork_idle calls init_idle and that sets the affinity of
> >> idle_task to target CPU. So, reschedule should not be a problem. What
> >> am I missing here?
> >
> > I think Ingo is referring to the fact that we can't use kthread_create()
> > here and hence we were relying on fork_idle().
> >
> >> Also, I tried this silly test patch (Cut and paste... Sorry) and it
> >> seemed to work fine both with and without CPU hotplug.
> >>
> >
> > I don't think we can do this today, as we need to make sure we have the
> > correct current context. With dynamic cpu hotplug, current context can
> > be any process and hence we were depending on the schedule_work()
> > context.
> >
> 
> schedule_work() is only done at boot time. In case of dynamic cpu
> hotplug, we skip the whole fork_idle as we already have the task
> struct and just do init_idle().
> 

What happens if we boot with "maxcpus=" and later online the remaining
cpu's? Same issue with the physical cpu-online case too, right?

thanks,
suresh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-03-02  1:37         ` Suresh Siddha
@ 2012-03-02  2:00           ` Venki Pallipadi
  2012-03-02  7:21             ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Venki Pallipadi @ 2012-03-02  2:00 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

On Thu, Mar 1, 2012 at 5:37 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Thu, 2012-03-01 at 17:35 -0800, Venki Pallipadi wrote:
>> On Thu, Mar 1, 2012 at 5:28 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>> > On Thu, 2012-03-01 at 16:33 -0800, Venki Pallipadi wrote:
>> >> >
>> >> > fork_idle() should also make sure it does not schedule the child
>> >> > thread: thus we'd also be able to further simplify smpboot.c and
>> >> > get rid of all that extremely ugly 'struct create_idle'
>> >> > gymnastics in smpboot.c.
>> >>
>> >> But not this. I am not sure where fork_idle results in resched of the child.
>> >> As I saw it, fork_idle calls init_idle and that sets the affinity of
>> >> idle_task to target CPU. So, reschedule should not be a problem. What
>> >> am I missing here?
>> >
>> > I think Ingo is referring to the fact that we can't use kthread_create()
>> > here and hence we were relying on fork_idle().
>> >
>> >> Also, I tried this silly test patch (Cut and paste... Sorry) and it
>> >> seemed to work fine both with and without CPU hotplug.
>> >>
>> >
>> > I don't think we can do this today, as we need to make sure we have the
>> > correct current context. With dynamic cpu hotplug, current context can
>> > be any process and hence we were depending on the schedule_work()
>> > context.
>> >
>>
>> schedule_work() is only done at boot time. In case of dynamic cpu
>> hotplug, we skip the whole fork_idle as we already have the task
>> struct and just do init_idle().
>>
>
> What happens if we boot with "maxcpus=" and later online the remaining
> cpu's?

Yes. This case will be problematic. So, we still need struct
create_idle work stuff even after percpu idle_task. Or was Ingo's
suggestion to do something along the lines of - init any CPU's idle
task from CPU 0's idle task?

> Same issue with the physical cpu-online case too, right?
Any fresh online at runtime, yes.

>
> thanks,
> suresh
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-03-02  2:00           ` Venki Pallipadi
@ 2012-03-02  7:21             ` Ingo Molnar
  2012-03-02 17:41               ` Suresh Siddha
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2012-03-02  7:21 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Suresh Siddha, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel


* Venki Pallipadi <venki@google.com> wrote:

> On Thu, Mar 1, 2012 at 5:37 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > On Thu, 2012-03-01 at 17:35 -0800, Venki Pallipadi wrote:
> >> On Thu, Mar 1, 2012 at 5:28 PM, Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> >> > On Thu, 2012-03-01 at 16:33 -0800, Venki Pallipadi wrote:
> >> >> >
> >> >> > fork_idle() should also make sure it does not schedule the child
> >> >> > thread: thus we'd also be able to further simplify smpboot.c and
> >> >> > get rid of all that extremely ugly 'struct create_idle'
> >> >> > gymnastics in smpboot.c.
> >> >>
> >> >> But not this. I am not sure where fork_idle results in resched of the child.
> >> >> As I saw it, fork_idle calls init_idle and that sets the affinity of
> >> >> idle_task to target CPU. So, reschedule should not be a problem. What
> >> >> am I missing here?
> >> >
> >> > I think Ingo is referring to the fact that we can't use kthread_create()
> >> > here and hence we were relying on fork_idle().
> >> >
> >> >> Also, I tried this silly test patch (Cut and paste... Sorry) and it
> >> >> seemed to work fine both with and without CPU hotplug.
> >> >>
> >> >
> >> > I don't think we can do this today, as we need to make 
> >> > sure we have the correct current context. With dynamic 
> >> > cpu hotplug, current context can be any process and hence 
> >> > we were depending on the schedule_work() context.
> >>
> >> schedule_work() is only done at boot time. In case of 
> >> dynamic cpu hotplug, we skip the whole fork_idle as we 
> >> already have the task struct and just do init_idle().
> >>
> >
> > What happens if we boot with "maxcpus=" and later online the 
> > remaining cpu's?
> 
> Yes. This case will be problematic. So, we still need struct 
> create_idle work stuff even after percpu idle_task. Or was 
> Ingo's suggestion to do something along the lines of - init 
> any CPU's idle task from CPU 0's idle task?

If the percpu area of possible CPUs is already allocated at this 
point then we should probably do that.

If not then what is the problem with doing fork_idle() from the 
hotplug process context? Where does the assymetry in the dynamic 
case come from?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1
  2012-03-02  7:21             ` Ingo Molnar
@ 2012-03-02 17:41               ` Suresh Siddha
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
  0 siblings, 1 reply; 37+ messages in thread
From: Suresh Siddha @ 2012-03-02 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Venki Pallipadi, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel

On Fri, 2012-03-02 at 08:21 +0100, Ingo Molnar wrote:
> * Venki Pallipadi <venki@google.com> wrote:
> > Yes. This case will be problematic. So, we still need struct 
> > create_idle work stuff even after percpu idle_task. Or was 
> > Ingo's suggestion to do something along the lines of - init 
> > any CPU's idle task from CPU 0's idle task?
> 
> If the percpu area of possible CPUs is already allocated at this 
> point then we should probably do that.
> 
> If not then what is the problem with doing fork_idle() from the 
> hotplug process context? Where does the asymmetry in the dynamic 
> case come from?

As far as I know, we limit the # of cpus that can be onlined to the
possible cpus determined at boot time, and this doesn't change after
boot. The percpu areas of the possible cpus are already allocated by
this point.

The main exercise, I think, is to move the init_task to the percpu area
of cpu-0 and ensure that fork_idle() does the init properly (without
allocating the memory, but doing the initialization currently done by
copy_process()/dup_task_struct() etc.)
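
Something along these lines, perhaps (pure sketch; neither the percpu
variable nor the in-place init helper exists today):

	/* Hypothetical: idle tasks as full per-cpu structures. */
	static DEFINE_PER_CPU(struct task_struct, idle_tasks);

	struct task_struct * __cpuinit fork_idle(int cpu)
	{
		struct task_struct *task = &per_cpu(idle_tasks, cpu);

		/*
		 * Do the copy_process()/dup_task_struct()-style setup
		 * in place, without allocating (hypothetical helper),
		 * then the usual idle init.
		 */
		init_idle_task_struct(task);
		init_idle(task, cpu);
		return task;
	}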

thanks,
suresh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* fork_idle from wq cleanup
  2012-03-02 17:41               ` Suresh Siddha
@ 2012-03-06 21:41                 ` Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 1/5] x86: Move fork_idle from wq and idle caching to common code Venkatesh Pallipadi
                                     ` (5 more replies)
  0 siblings, 6 replies; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-03-06 21:41 UTC (permalink / raw)
  To: Suresh Siddha, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel, Tony Luck,
	Fenghua Yu, Ralf Baechle, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens

I looked at avoiding the wq stuff. But there is no easy/clean way to do
it, as a number of routines within copy_process assume current is
the parent.

But, we can move most of the complication out of arch code into generic code
and share it across multiple archs. Here is the patch for that.

We can easily add x86 percpu idle task on top of this cleanup (except for
mini-circus to switch idle_task of CPU 0).

Sorry. The patches for ia64, mips, powerpc and s390 are untested and are in
"should work" category.

Overall diffstat looks like
---
 arch/ia64/kernel/smpboot.c |   49 ++---------------------------
 arch/mips/kernel/smp.c     |   47 +---------------------------
 arch/powerpc/kernel/smp.c  |   63 ++++----------------------------------
 arch/s390/kernel/smp.c     |   38 +++--------------------
 arch/x86/kernel/smpboot.c  |   74 ++++++---------------------------------------
 include/linux/sched.h      |    1 
 kernel/fork.c              |   48 +++++++++++++++++++++++++++++
 7 files changed, 79 insertions(+), 241 deletions(-)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/5] x86: Move fork_idle from wq and idle caching to common code
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
@ 2012-03-06 21:41                   ` Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 2/5] ia64: Use common fork_idle_from_wq in smpboot Venkatesh Pallipadi
                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-03-06 21:41 UTC (permalink / raw)
  To: Suresh Siddha, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel, Tony Luck,
	Fenghua Yu, Ralf Baechle, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens, Venkatesh Pallipadi

As a part of the cleanup suggested by Ingo here:

* move smpboot wq stuff to common code.
* move idle task caching to common code and use the existing percpu
  idle_task() as cache, instead of another percpu var or NR_CPUs array.

These can be shared across archs.

Should not have any functional impact.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/x86/kernel/smpboot.c |   74 ++++++--------------------------------------
 include/linux/sched.h     |    1 +
 kernel/fork.c             |   48 +++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 64 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 66d250c..cc714b1 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -75,20 +75,8 @@
 /* State of each CPU */
 DEFINE_PER_CPU(int, cpu_state) = { 0 };
 
-/* Store all idle threads, this can be reused instead of creating
-* a new thread. Also avoids complicated thread destroy functionality
-* for idle threads.
-*/
 #ifdef CONFIG_HOTPLUG_CPU
 /*
- * Needed only for CONFIG_HOTPLUG_CPU because __cpuinitdata is
- * removed after init for !CONFIG_HOTPLUG_CPU.
- */
-static DEFINE_PER_CPU(struct task_struct *, idle_thread_array);
-#define get_idle_for_cpu(x)      (per_cpu(idle_thread_array, x))
-#define set_idle_for_cpu(x, p)   (per_cpu(idle_thread_array, x) = (p))
-
-/*
  * We need this for trampoline_base protection from concurrent accesses when
  * off- and onlining cores wildly.
  */
@@ -106,10 +94,6 @@ void cpu_hotplug_driver_unlock(void)
 
 ssize_t arch_cpu_probe(const char *buf, size_t count) { return -1; }
 ssize_t arch_cpu_release(const char *buf, size_t count) { return -1; }
-#else
-static struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
-#define get_idle_for_cpu(x)      (idle_thread_array[(x)])
-#define set_idle_for_cpu(x, p)   (idle_thread_array[(x)] = (p))
 #endif
 
 /* Number of siblings per CPU package */
@@ -634,22 +618,6 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
 	return (send_status | accept_status);
 }
 
-struct create_idle {
-	struct work_struct work;
-	struct task_struct *idle;
-	struct completion done;
-	int cpu;
-};
-
-static void __cpuinit do_fork_idle(struct work_struct *work)
-{
-	struct create_idle *c_idle =
-		container_of(work, struct create_idle, work);
-
-	c_idle->idle = fork_idle(c_idle->cpu);
-	complete(&c_idle->done);
-}
-
 /* reduce the number of lines printed when booting a large cpu count system */
 static void __cpuinit announce_cpu(int cpu, int apicid)
 {
@@ -681,53 +649,32 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
 	unsigned long boot_error = 0;
 	unsigned long start_ip;
 	int timeout;
-	struct create_idle c_idle = {
-		.cpu	= cpu,
-		.done	= COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
-	};
-
-	INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
+	struct task_struct *idle;
 
 	alternatives_smp_switch(1);
 
-	c_idle.idle = get_idle_for_cpu(cpu);
-
-	/*
-	 * We can't use kernel_thread since we must avoid to
-	 * reschedule the child.
-	 */
-	if (c_idle.idle) {
-		c_idle.idle->thread.sp = (unsigned long) (((struct pt_regs *)
-			(THREAD_SIZE +  task_stack_page(c_idle.idle))) - 1);
-		init_idle(c_idle.idle, cpu);
-		goto do_rest;
-	}
-
-	schedule_work(&c_idle.work);
-	wait_for_completion(&c_idle.done);
-
-	if (IS_ERR(c_idle.idle)) {
+	idle = fork_idle_from_wq(cpu);
+	if (IS_ERR(idle)) {
 		printk("failed fork for CPU %d\n", cpu);
-		destroy_work_on_stack(&c_idle.work);
-		return PTR_ERR(c_idle.idle);
+		return PTR_ERR(idle);
 	}
 
-	set_idle_for_cpu(cpu, c_idle.idle);
-do_rest:
-	per_cpu(current_task, cpu) = c_idle.idle;
+	idle->thread.sp = (unsigned long) (((struct pt_regs *)
+			(THREAD_SIZE +  task_stack_page(idle))) - 1);
+	per_cpu(current_task, cpu) = idle;
 #ifdef CONFIG_X86_32
 	/* Stack for startup_32 can be just as for start_secondary onwards */
 	irq_ctx_init(cpu);
 #else
-	clear_tsk_thread_flag(c_idle.idle, TIF_FORK);
+	clear_tsk_thread_flag(idle, TIF_FORK);
 	initial_gs = per_cpu_offset(cpu);
 	per_cpu(kernel_stack, cpu) =
-		(unsigned long)task_stack_page(c_idle.idle) -
+		(unsigned long)task_stack_page(idle) -
 		KERNEL_STACK_OFFSET + THREAD_SIZE;
 #endif
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
 	initial_code = (unsigned long)start_secondary;
-	stack_start  = c_idle.idle->thread.sp;
+	stack_start  = idle->thread.sp;
 
 	/* start_ip had better be page-aligned! */
 	start_ip = trampoline_address();
@@ -831,7 +778,6 @@ do_rest:
 		smpboot_restore_warm_reset_vector();
 	}
 
-	destroy_work_on_stack(&c_idle.work);
 	return boot_error;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..357057f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2292,6 +2292,7 @@ extern int do_execve(const char *,
 		     const char __user * const __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
+struct task_struct *fork_idle_from_wq(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
 extern char *get_task_comm(char *to, struct task_struct *tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index e2cd3e2..8704237 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -67,6 +67,8 @@
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
+#include <linux/workqueue.h>
+#include <linux/sched.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1479,6 +1481,52 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	return task;
 }
 
+struct create_idle {
+	struct work_struct work;
+	struct task_struct *idle;
+	struct completion done;
+	int cpu;
+};
+
+static void __cpuinit do_fork_idle(struct work_struct *work)
+{
+	struct create_idle *c_idle =
+		container_of(work, struct create_idle, work);
+
+	c_idle->idle = fork_idle(c_idle->cpu);
+	complete(&c_idle->done);
+}
+
+struct task_struct * __cpuinit fork_idle_from_wq(int cpu)
+{
+	struct task_struct *idle = idle_task(cpu);
+	struct create_idle c_idle = {
+		.cpu	= cpu,
+		.done	= COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
+	};
+
+	/* Reuse stored idle thread when possible instead of creating
+	 * a new thread. Also avoids complicated thread destroy functionality
+	 * for idle threads.
+	 */
+	if (idle) {
+		init_idle(idle, cpu);
+		return idle;
+	}
+
+	/*
+	 * We can't use kernel_thread since we must avoid to
+	 * reschedule the child.
+	*/
+	INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
+
+	schedule_work(&c_idle.work);
+	wait_for_completion(&c_idle.done);
+	destroy_work_on_stack(&c_idle.work);
+
+	return c_idle.idle;
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/5] ia64: Use common fork_idle_from_wq in smpboot
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 1/5] x86: Move fork_idle from wq and idle caching to common code Venkatesh Pallipadi
@ 2012-03-06 21:41                   ` Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 3/5] mips: " Venkatesh Pallipadi
                                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-03-06 21:41 UTC (permalink / raw)
  To: Suresh Siddha, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel, Tony Luck,
	Fenghua Yu, Ralf Baechle, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens, Venkatesh Pallipadi

Cleanup. Instead of reimplementing fork_idle on wq and idle task caching,
use fork_idle_from_wq().

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/ia64/kernel/smpboot.c |   49 +++----------------------------------------
 1 files changed, 4 insertions(+), 45 deletions(-)

diff --git a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
index 5590979..b9026fa 100644
--- a/arch/ia64/kernel/smpboot.c
+++ b/arch/ia64/kernel/smpboot.c
@@ -95,13 +95,8 @@ struct sal_to_os_boot *sal_state_for_booting_cpu = &sal_boot_rendez_state[0];
 
 #define set_brendez_area(x) (sal_state_for_booting_cpu = &sal_boot_rendez_state[(x)]);
 
-#define get_idle_for_cpu(x)		(idle_thread_array[(x)])
-#define set_idle_for_cpu(x,p)	(idle_thread_array[(x)] = (p))
-
 #else
 
-#define get_idle_for_cpu(x)		(NULL)
-#define set_idle_for_cpu(x,p)
 #define set_brendez_area(x)
 #endif
 
@@ -481,53 +476,17 @@ struct pt_regs * __cpuinit idle_regs(struct pt_regs *regs)
 	return NULL;
 }
 
-struct create_idle {
-	struct work_struct work;
-	struct task_struct *idle;
-	struct completion done;
-	int cpu;
-};
-
-void __cpuinit
-do_fork_idle(struct work_struct *work)
-{
-	struct create_idle *c_idle =
-		container_of(work, struct create_idle, work);
-
-	c_idle->idle = fork_idle(c_idle->cpu);
-	complete(&c_idle->done);
-}
-
 static int __cpuinit
 do_boot_cpu (int sapicid, int cpu)
 {
 	int timeout;
-	struct create_idle c_idle = {
-		.work = __WORK_INITIALIZER(c_idle.work, do_fork_idle),
-		.cpu	= cpu,
-		.done	= COMPLETION_INITIALIZER(c_idle.done),
-	};
-
-	/*
-	 * We can't use kernel_thread since we must avoid to
-	 * reschedule the child.
-	 */
- 	c_idle.idle = get_idle_for_cpu(cpu);
- 	if (c_idle.idle) {
-		init_idle(c_idle.idle, cpu);
- 		goto do_rest;
-	}
-
-	schedule_work(&c_idle.work);
-	wait_for_completion(&c_idle.done);
+	struct task_struct *idle;
 
-	if (IS_ERR(c_idle.idle))
+	idle = fork_idle_from_wq(cpu);
+	if (IS_ERR(idle))
 		panic("failed fork for CPU %d", cpu);
 
-	set_idle_for_cpu(cpu, c_idle.idle);
-
-do_rest:
-	task_for_booting_cpu = c_idle.idle;
+	task_for_booting_cpu = idle;
 
 	Dprintk("Sending wakeup vector %lu to AP 0x%x/0x%x.\n", ap_wakeup_vector, cpu, sapicid);
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/5] mips: Use common fork_idle_from_wq in smpboot
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 1/5] x86: Move fork_idle from wq and idle caching to common code Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 2/5] ia64: Use common fork_idle_from_wq in smpboot Venkatesh Pallipadi
@ 2012-03-06 21:41                   ` Venkatesh Pallipadi
  2012-03-06 22:51                     ` Ralf Baechle
  2012-03-06 21:41                   ` [PATCH 4/5] powerpc: " Venkatesh Pallipadi
                                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-03-06 21:41 UTC (permalink / raw)
  To: Suresh Siddha, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel, Tony Luck,
	Fenghua Yu, Ralf Baechle, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens, Venkatesh Pallipadi

Cleanup. Instead of reimplementing fork_idle on wq and idle task caching,
use fork_idle_from_wq().

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/mips/kernel/smp.c |   47 +++--------------------------------------------
 1 files changed, 3 insertions(+), 44 deletions(-)

diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
index 32c1e95..37a4133 100644
--- a/arch/mips/kernel/smp.c
+++ b/arch/mips/kernel/smp.c
@@ -191,54 +191,13 @@ void __devinit smp_prepare_boot_cpu(void)
  * and keep control until "cpu_online(cpu)" is set.  Note: cpu is
  * physical, not logical.
  */
-static struct task_struct *cpu_idle_thread[NR_CPUS];
-
-struct create_idle {
-	struct work_struct work;
-	struct task_struct *idle;
-	struct completion done;
-	int cpu;
-};
-
-static void __cpuinit do_fork_idle(struct work_struct *work)
-{
-	struct create_idle *c_idle =
-		container_of(work, struct create_idle, work);
-
-	c_idle->idle = fork_idle(c_idle->cpu);
-	complete(&c_idle->done);
-}
-
 int __cpuinit __cpu_up(unsigned int cpu)
 {
 	struct task_struct *idle;
 
-	/*
-	 * Processor goes to start_secondary(), sets online flag
-	 * The following code is purely to make sure
-	 * Linux can schedule processes on this slave.
-	 */
-	if (!cpu_idle_thread[cpu]) {
-		/*
-		 * Schedule work item to avoid forking user task
-		 * Ported from arch/x86/kernel/smpboot.c
-		 */
-		struct create_idle c_idle = {
-			.cpu    = cpu,
-			.done   = COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
-		};
-
-		INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
-		schedule_work(&c_idle.work);
-		wait_for_completion(&c_idle.done);
-		idle = cpu_idle_thread[cpu] = c_idle.idle;
-
-		if (IS_ERR(idle))
-			panic(KERN_ERR "Fork failed for CPU %d", cpu);
-	} else {
-		idle = cpu_idle_thread[cpu];
-		init_idle(idle, cpu);
-	}
+	idle = fork_idle_from_wq(cpu);
+	if (IS_ERR(idle))
+		panic(KERN_ERR "Fork failed for CPU %d", cpu);
 
 	mp_ops->boot_secondary(cpu, idle);
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/5] powerpc: Use common fork_idle_from_wq in smpboot
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
                                     ` (2 preceding siblings ...)
  2012-03-06 21:41                   ` [PATCH 3/5] mips: " Venkatesh Pallipadi
@ 2012-03-06 21:41                   ` Venkatesh Pallipadi
  2012-03-06 21:41                   ` [PATCH 5/5] s390: " Venkatesh Pallipadi
  2012-03-07  6:06                   ` fork_idle from wq cleanup Suresh Siddha
  5 siblings, 0 replies; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-03-06 21:41 UTC (permalink / raw)
  To: Suresh Siddha, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel, Tony Luck,
	Fenghua Yu, Ralf Baechle, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens, Venkatesh Pallipadi

Cleanup. Instead of reimplementing fork_idle on wq and idle task caching,
use fork_idle_from_wq().

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/powerpc/kernel/smp.c |   63 +++++---------------------------------------
 1 files changed, 8 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 46695fe..3b9705d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -57,27 +57,11 @@
 #define DBG(fmt...)
 #endif
 
-
-/* Store all idle threads, this can be reused instead of creating
-* a new thread. Also avoids complicated thread destroy functionality
-* for idle threads.
-*/
 #ifdef CONFIG_HOTPLUG_CPU
-/*
- * Needed only for CONFIG_HOTPLUG_CPU because __cpuinitdata is
- * removed after init for !CONFIG_HOTPLUG_CPU.
- */
-static DEFINE_PER_CPU(struct task_struct *, idle_thread_array);
-#define get_idle_for_cpu(x)      (per_cpu(idle_thread_array, x))
-#define set_idle_for_cpu(x, p)   (per_cpu(idle_thread_array, x) = (p))
 
 /* State of each CPU during hotplug phases */
 static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 
-#else
-static struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
-#define get_idle_for_cpu(x)      (idle_thread_array[(x)])
-#define set_idle_for_cpu(x, p)   (idle_thread_array[(x)] = (p))
 #endif
 
 struct thread_info *secondary_ti;
@@ -429,51 +413,20 @@ int generic_check_cpu_restart(unsigned int cpu)
 }
 #endif
 
-struct create_idle {
-	struct work_struct work;
-	struct task_struct *idle;
-	struct completion done;
-	int cpu;
-};
-
-static void __cpuinit do_fork_idle(struct work_struct *work)
-{
-	struct create_idle *c_idle =
-		container_of(work, struct create_idle, work);
-
-	c_idle->idle = fork_idle(c_idle->cpu);
-	complete(&c_idle->done);
-}
-
 static int __cpuinit create_idle(unsigned int cpu)
 {
 	struct thread_info *ti;
-	struct create_idle c_idle = {
-		.cpu	= cpu,
-		.done	= COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
-	};
-	INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle);
-
-	c_idle.idle = get_idle_for_cpu(cpu);
-
-	/* We can't use kernel_thread since we must avoid to
-	 * reschedule the child. We use a workqueue because
-	 * we want to fork from a kernel thread, not whatever
-	 * userspace process happens to be trying to online us.
-	 */
-	if (!c_idle.idle) {
-		schedule_work(&c_idle.work);
-		wait_for_completion(&c_idle.done);
-	} else
-		init_idle(c_idle.idle, cpu);
-	if (IS_ERR(c_idle.idle)) {		
-		pr_err("Failed fork for CPU %u: %li", cpu, PTR_ERR(c_idle.idle));
-		return PTR_ERR(c_idle.idle);
+	struct task_struct *idle;
+
+	idle = fork_idle_from_wq(cpu);
+	if (IS_ERR(idle)) {
+		pr_err("Failed fork for CPU %u: %li", cpu, PTR_ERR(idle));
+		return PTR_ERR(idle);
 	}
-	ti = task_thread_info(c_idle.idle);
+	ti = task_thread_info(idle);
 
 #ifdef CONFIG_PPC64
-	paca[cpu].__current = c_idle.idle;
+	paca[cpu].__current = idle;
 	paca[cpu].kstack = (unsigned long)ti + THREAD_SIZE - STACK_FRAME_OVERHEAD;
 #endif
 	ti->cpu = cpu;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/5] s390: Use common fork_idle_from_wq in smpboot
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
                                     ` (3 preceding siblings ...)
  2012-03-06 21:41                   ` [PATCH 4/5] powerpc: " Venkatesh Pallipadi
@ 2012-03-06 21:41                   ` Venkatesh Pallipadi
  2012-03-07  7:00                     ` Heiko Carstens
  2012-03-07  6:06                   ` fork_idle from wq cleanup Suresh Siddha
  5 siblings, 1 reply; 37+ messages in thread
From: Venkatesh Pallipadi @ 2012-03-06 21:41 UTC (permalink / raw)
  To: Suresh Siddha, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Aaron Durbin, Paul Turner, Yong Zhang, linux-kernel, Tony Luck,
	Fenghua Yu, Ralf Baechle, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, Heiko Carstens, Venkatesh Pallipadi

Cleanup. Instead of reimplementing fork_idle on wq and idle task caching,
use fork_idle_from_wq().

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/s390/kernel/smp.c |   38 +++++---------------------------------
 1 files changed, 5 insertions(+), 33 deletions(-)

diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 2398ce6..7255a57 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -58,8 +58,6 @@
 /* logical cpu to cpu address */
 unsigned short __cpu_logical_map[NR_CPUS];
 
-static struct task_struct *current_set[NR_CPUS];
-
 static u8 smp_cpu_type;
 static int smp_use_sigp_detection;
 
@@ -562,22 +560,6 @@ int __cpuinit start_secondary(void *cpuvoid)
 	return 0;
 }
 
-struct create_idle {
-	struct work_struct work;
-	struct task_struct *idle;
-	struct completion done;
-	int cpu;
-};
-
-static void __cpuinit smp_fork_idle(struct work_struct *work)
-{
-	struct create_idle *c_idle;
-
-	c_idle = container_of(work, struct create_idle, work);
-	c_idle->idle = fork_idle(c_idle->cpu);
-	complete(&c_idle->done);
-}
-
 static int __cpuinit smp_alloc_lowcore(int cpu)
 {
 	unsigned long async_stack, panic_stack;
@@ -644,7 +626,6 @@ static void smp_free_lowcore(int cpu)
 int __cpuinit __cpu_up(unsigned int cpu)
 {
 	struct _lowcore *cpu_lowcore;
-	struct create_idle c_idle;
 	struct task_struct *idle;
 	struct stack_frame *sf;
 	u32 lowcore;
@@ -652,19 +633,11 @@ int __cpuinit __cpu_up(unsigned int cpu)
 
 	if (smp_cpu_state[cpu] != CPU_STATE_CONFIGURED)
 		return -EIO;
-	idle = current_set[cpu];
-	if (!idle) {
-		c_idle.done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done);
-		INIT_WORK_ONSTACK(&c_idle.work, smp_fork_idle);
-		c_idle.cpu = cpu;
-		schedule_work(&c_idle.work);
-		wait_for_completion(&c_idle.done);
-		if (IS_ERR(c_idle.idle))
-			return PTR_ERR(c_idle.idle);
-		idle = c_idle.idle;
-		current_set[cpu] = c_idle.idle;
-	}
-	init_idle(idle, cpu);
+
+	idle = fork_idle_from_wq(cpu);
+	if (IS_ERR(idle))
+		return PTR_ERR(idle);
+
 	if (smp_alloc_lowcore(cpu))
 		return -ENOMEM;
 	do {
@@ -836,7 +809,6 @@ void __init smp_prepare_boot_cpu(void)
 	set_cpu_present(0, true);
 	set_cpu_online(0, true);
 	S390_lowcore.percpu_offset = __per_cpu_offset[0];
-	current_set[0] = current;
 	smp_cpu_state[0] = CPU_STATE_CONFIGURED;
 	cpu_set_polarization(0, POLARIZATION_UNKNOWN);
 }
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] mips: Use common fork_idle_from_wq in smpboot
  2012-03-06 21:41                   ` [PATCH 3/5] mips: " Venkatesh Pallipadi
@ 2012-03-06 22:51                     ` Ralf Baechle
  0 siblings, 0 replies; 37+ messages in thread
From: Ralf Baechle @ 2012-03-06 22:51 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Suresh Siddha, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Aaron Durbin, Paul Turner,
	Yong Zhang, linux-kernel, Tony Luck, Fenghua Yu,
	Benjamin Herrenschmidt, Paul Mackerras, Martin Schwidefsky,
	Heiko Carstens

On Tue, Mar 06, 2012 at 01:41:12PM -0800, Venkatesh Pallipadi wrote:

> Cleanup. Instead of reimplementing fork_idle on wq and idle task caching,
> use fork_idle_from_wq().
> 
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>

Looking good; this is code that really shouldn't be replicated over
several architectures.

Acked-by: Ralf Baechle <ralf@linux-mips.org>

  Ralf

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: fork_idle from wq cleanup
  2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
                                     ` (4 preceding siblings ...)
  2012-03-06 21:41                   ` [PATCH 5/5] s390: " Venkatesh Pallipadi
@ 2012-03-07  6:06                   ` Suresh Siddha
  5 siblings, 0 replies; 37+ messages in thread
From: Suresh Siddha @ 2012-03-07  6:06 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Aaron Durbin, Paul Turner, Yong Zhang,
	linux-kernel, Tony Luck, Fenghua Yu, Ralf Baechle,
	Benjamin Herrenschmidt, Paul Mackerras, Martin Schwidefsky,
	Heiko Carstens

On Tue, 2012-03-06 at 13:41 -0800, Venkatesh Pallipadi wrote:
> I looked at avoiding the wq stuff. But there is no easy/clean way to do
> it, as a number of routines within copy_process assume current is
> the parent.
> 
> But, we can move most of the complication out of arch code into generic code
> and share it across multiple archs. Here is the patch for that.
> 
> We can easily add x86 percpu idle task on top of this cleanup (except for
> mini-circus to switch idle_task of CPU 0).
> 
> Sorry. The patches for ia64, mips, powerpc and s390 are untested and are in
> "should work" category.
> 
> Overall diffstat looks like
> ---
>  arch/ia64/kernel/smpboot.c |   49 ++---------------------------
>  arch/mips/kernel/smp.c     |   47 +---------------------------
>  arch/powerpc/kernel/smp.c  |   63 ++++----------------------------------
>  arch/s390/kernel/smp.c     |   38 +++--------------------
>  arch/x86/kernel/smpboot.c  |   74 ++++++---------------------------------------
>  include/linux/sched.h      |    1 
>  kernel/fork.c              |   48 +++++++++++++++++++++++++++++
>  7 files changed, 79 insertions(+), 241 deletions(-)
> 

Looks good to me.

Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] s390: Use common fork_idle_from_wq in smpboot
  2012-03-06 21:41                   ` [PATCH 5/5] s390: " Venkatesh Pallipadi
@ 2012-03-07  7:00                     ` Heiko Carstens
  0 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2012-03-07  7:00 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Suresh Siddha, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Aaron Durbin, Paul Turner,
	Yong Zhang, linux-kernel, Tony Luck, Fenghua Yu, Ralf Baechle,
	Benjamin Herrenschmidt, Paul Mackerras, Martin Schwidefsky

On Tue, Mar 06, 2012 at 01:41:14PM -0800, Venkatesh Pallipadi wrote:
> Cleanup. Instead of reimplementing fork_idle on wq and idle task caching,
> use fork_idle_from_wq().
> 
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> ---
>  arch/s390/kernel/smp.c |   38 +++++---------------------------------
>  1 files changed, 5 insertions(+), 33 deletions(-)

Your patch won't apply on linux-next because of rather large code changes.
Anyway, see below for a patch that applies cleanly (and works) on s390.
Thanks for cleaning this up!

Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

---
 arch/s390/kernel/smp.c |   36 +++++-------------------------------
 1 file changed, 5 insertions(+), 31 deletions(-)

--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -83,7 +83,6 @@ enum {
 
 struct pcpu {
 	struct cpu cpu;
-	struct task_struct *idle;	/* idle process for the cpu */
 	struct _lowcore *lowcore;	/* lowcore page(s) for the cpu */
 	unsigned long async_stack;	/* async stack for the cpu */
 	unsigned long panic_stack;	/* panic stack for the cpu */
@@ -724,26 +723,10 @@ static void __cpuinit smp_start_secondar
 	cpu_idle();
 }
 
-struct create_idle {
-	struct work_struct work;
-	struct task_struct *idle;
-	struct completion done;
-	int cpu;
-};
-
-static void __cpuinit smp_fork_idle(struct work_struct *work)
-{
-	struct create_idle *c_idle;
-
-	c_idle = container_of(work, struct create_idle, work);
-	c_idle->idle = fork_idle(c_idle->cpu);
-	complete(&c_idle->done);
-}
-
 /* Upping and downing of CPUs */
 int __cpuinit __cpu_up(unsigned int cpu)
 {
-	struct create_idle c_idle;
+	struct task_struct *idle;
 	struct pcpu *pcpu;
 	int rc;
 
@@ -753,22 +736,14 @@ int __cpuinit __cpu_up(unsigned int cpu)
 	if (pcpu_sigp_retry(pcpu, sigp_initial_cpu_reset, 0) !=
 	    sigp_order_code_accepted)
 		return -EIO;
-	if (!pcpu->idle) {
-		c_idle.done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done);
-		INIT_WORK_ONSTACK(&c_idle.work, smp_fork_idle);
-		c_idle.cpu = cpu;
-		schedule_work(&c_idle.work);
-		wait_for_completion(&c_idle.done);
-		if (IS_ERR(c_idle.idle))
-			return PTR_ERR(c_idle.idle);
-		pcpu->idle = c_idle.idle;
-	}
-	init_idle(pcpu->idle, cpu);
+	idle = fork_idle_from_wq(cpu);
+	if (IS_ERR(idle))
+		return PTR_ERR(idle);
 	rc = pcpu_alloc_lowcore(pcpu, cpu);
 	if (rc)
 		return rc;
 	pcpu_prepare_secondary(pcpu, cpu);
-	pcpu_attach_task(pcpu, pcpu->idle);
+	pcpu_attach_task(pcpu, idle);
 	pcpu_start_fn(pcpu, smp_start_secondary, NULL);
 	while (!cpu_online(cpu))
 		cpu_relax();
@@ -855,7 +830,6 @@ void __init smp_prepare_boot_cpu(void)
 	struct pcpu *pcpu = pcpu_devices;
 
 	boot_cpu_address = stap();
-	pcpu->idle = current;
 	pcpu->state = CPU_STATE_CONFIGURED;
 	pcpu->address = boot_cpu_address;
 	pcpu->lowcore = (struct _lowcore *)(unsigned long) store_prefix();


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-03-07  7:00 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-23  0:36 [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Venkatesh Pallipadi
2012-02-23  7:50 ` Ingo Molnar
2012-02-23  9:08   ` Peter Zijlstra
2012-02-23 20:04     ` Venki Pallipadi
2012-02-23 20:03   ` Venki Pallipadi
2012-03-02  0:33   ` Venki Pallipadi
2012-03-02  1:28     ` Suresh Siddha
2012-03-02  1:35       ` Venki Pallipadi
2012-03-02  1:37         ` Suresh Siddha
2012-03-02  2:00           ` Venki Pallipadi
2012-03-02  7:21             ` Ingo Molnar
2012-03-02 17:41               ` Suresh Siddha
2012-03-06 21:41                 ` fork_idle from wq cleanup Venkatesh Pallipadi
2012-03-06 21:41                   ` [PATCH 1/5] x86: Move fork_idle from wq and idle caching to common code Venkatesh Pallipadi
2012-03-06 21:41                   ` [PATCH 2/5] ia64: Use common fork_idle_from_wq in smpboot Venkatesh Pallipadi
2012-03-06 21:41                   ` [PATCH 3/5] mips: " Venkatesh Pallipadi
2012-03-06 22:51                     ` Ralf Baechle
2012-03-06 21:41                   ` [PATCH 4/5] powerpc: " Venkatesh Pallipadi
2012-03-06 21:41                   ` [PATCH 5/5] s390: " Venkatesh Pallipadi
2012-03-07  7:00                     ` Heiko Carstens
2012-03-07  6:06                   ` fork_idle from wq cleanup Suresh Siddha
2012-02-23  9:30 ` [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1 Peter Zijlstra
2012-02-23 19:34   ` Venki Pallipadi
2012-02-24  5:41     ` Yong Zhang
2012-02-24  6:13       ` Yong Zhang
2012-02-27  8:38         ` Peter Zijlstra
2012-02-27  9:08           ` Yong Zhang
2012-02-27  9:30             ` Peter Zijlstra
2012-02-27  9:51               ` Yong Zhang
2012-02-26  1:32       ` Paul E. McKenney
2012-02-27  9:06         ` Yong Zhang
2012-02-27 17:05           ` Paul E. McKenney
2012-02-28  7:12             ` Yong Zhang
2012-02-28 13:05               ` Paul E. McKenney
2012-02-29  6:36                 ` Yong Zhang
2012-02-27  8:45     ` Peter Zijlstra
2012-02-27 18:17       ` Venki Pallipadi
