linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/22] sched: Reduce runqueue lock contention -v5
@ 2011-03-02 17:38 Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule() Peter Zijlstra
                   ` (22 more replies)
  0 siblings, 23 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

This patch series aims to optimize remote wakeups by moving most of the
wakeup work to the remote CPU and by avoiding bouncing runqueue data
structures where possible.

If there are no more 'fun' bits left, I'll queue this work for .40.



* [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-11  1:36   ` Frank Rowand
  2011-03-11 15:07   ` [01/22] " Milton Miller
  2011-03-02 17:38 ` [PATCH 02/22] sched: Always provide p->on_cpu Peter Zijlstra
                   ` (21 subsequent siblings)
  22 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra, Russell King, Martin Schwidefsky,
	Chris Metcalf, Jesper Nilsson, Benjamin Herrenschmidt,
	Ralf Baechle

[-- Attachment #1: peter_zijlstra-sched-provide_scheduler_ipi_callback_in_response_to.patch --]
[-- Type: text/plain, Size: 17623 bytes --]

For the future rework of try_to_wake_up() we'd like to push part of that
work onto the CPU the task is actually going to run on; to do so we need
a generic callback from the existing scheduler IPI.

This patch introduces such a generic callback, scheduler_ipi(), and
implements it as a NOP.

BenH notes: PowerPC might use this IPI on offline CPUs under rare
conditions!!
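
For illustration only (not part of the patch), a minimal sketch of how an
architecture's reschedule-IPI handler ends up using the new hook; the
handler name is a placeholder:

#include <linux/interrupt.h>
#include <linux/sched.h>	/* provides scheduler_ipi(), currently a NOP */

static irqreturn_t example_resched_interrupt(int irq, void *dev_id)
{
	/*
	 * Give the scheduler core a hook at reschedule-IPI time; today
	 * this does nothing, later patches move part of the
	 * try_to_wake_up() work in here.
	 */
	scheduler_ipi();

	return IRQ_HANDLED;
}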

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/alpha/kernel/smp.c             |    3 +--
 arch/arm/kernel/smp.c               |    5 +----
 arch/blackfin/mach-common/smp.c     |    3 +++
 arch/cris/arch-v32/kernel/smp.c     |   13 ++++++++-----
 arch/ia64/kernel/irq_ia64.c         |    2 ++
 arch/ia64/xen/irq_xen.c             |   10 +++++++++-
 arch/m32r/kernel/smp.c              |    4 +---
 arch/mips/cavium-octeon/smp.c       |    2 ++
 arch/mips/kernel/smtc.c             |    2 +-
 arch/mips/mti-malta/malta-int.c     |    2 ++
 arch/mips/pmc-sierra/yosemite/smp.c |    4 ++++
 arch/mips/sgi-ip27/ip27-irq.c       |    2 ++
 arch/mips/sibyte/bcm1480/smp.c      |    7 +++----
 arch/mips/sibyte/sb1250/smp.c       |    7 +++----
 arch/mn10300/kernel/smp.c           |    5 +----
 arch/parisc/kernel/smp.c            |    5 +----
 arch/powerpc/kernel/smp.c           |    3 ++-
 arch/s390/kernel/smp.c              |    6 +++---
 arch/sh/kernel/smp.c                |    2 ++
 arch/sparc/kernel/smp_32.c          |    2 +-
 arch/sparc/kernel/smp_64.c          |    1 +
 arch/tile/kernel/smp.c              |    6 +-----
 arch/um/kernel/smp.c                |    2 +-
 arch/x86/kernel/smp.c               |    5 ++---
 arch/x86/xen/smp.c                  |    5 ++---
 include/linux/sched.h               |    1 +
 26 files changed, 60 insertions(+), 49 deletions(-)

Index: linux-2.6/arch/alpha/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/alpha/kernel/smp.c
+++ linux-2.6/arch/alpha/kernel/smp.c
@@ -585,8 +585,7 @@ handle_ipi(struct pt_regs *regs)
 
 		switch (which) {
 		case IPI_RESCHEDULE:
-			/* Reschedule callback.  Everything to be done
-			   is done by the interrupt return path.  */
+			scheduler_ipi();
 			break;
 
 		case IPI_CALL_FUNC:
Index: linux-2.6/arch/arm/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/smp.c
+++ linux-2.6/arch/arm/kernel/smp.c
@@ -561,10 +561,7 @@ asmlinkage void __exception_irq_entry do
 		break;
 
 	case IPI_RESCHEDULE:
-		/*
-		 * nothing more to do - eveything is
-		 * done on the interrupt return path
-		 */
+		scheduler_ipi();
 		break;
 
 	case IPI_CALL_FUNC:
Index: linux-2.6/arch/blackfin/mach-common/smp.c
===================================================================
--- linux-2.6.orig/arch/blackfin/mach-common/smp.c
+++ linux-2.6/arch/blackfin/mach-common/smp.c
@@ -160,6 +160,9 @@ static irqreturn_t ipi_handler_int1(int 
 	while (msg_queue->count) {
 		msg = &msg_queue->ipi_message[msg_queue->head];
 		switch (msg->type) {
+		case BFIN_IPI_RESCHEDULE:
+			scheduler_ipi();
+			break;
 		case BFIN_IPI_CALL_FUNC:
 			spin_unlock_irqrestore(&msg_queue->lock, flags);
 			ipi_call_function(cpu, msg);
Index: linux-2.6/arch/cris/arch-v32/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v32/kernel/smp.c
+++ linux-2.6/arch/cris/arch-v32/kernel/smp.c
@@ -342,15 +342,18 @@ irqreturn_t crisv32_ipi_interrupt(int ir
 
 	ipi = REG_RD(intr_vect, irq_regs[smp_processor_id()], rw_ipi);
 
+	if (ipi.vector & IPI_SCHEDULE) {
+		scheduler_ipi();
+	}
 	if (ipi.vector & IPI_CALL) {
-	         func(info);
+		func(info);
 	}
 	if (ipi.vector & IPI_FLUSH_TLB) {
-		     if (flush_mm == FLUSH_ALL)
-			 __flush_tlb_all();
-		     else if (flush_vma == FLUSH_ALL)
+		if (flush_mm == FLUSH_ALL)
+			__flush_tlb_all();
+		else if (flush_vma == FLUSH_ALL)
 			__flush_tlb_mm(flush_mm);
-		     else
+		else
 			__flush_tlb_page(flush_vma, flush_addr);
 	}
 
Index: linux-2.6/arch/ia64/kernel/irq_ia64.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/irq_ia64.c
+++ linux-2.6/arch/ia64/kernel/irq_ia64.c
@@ -31,6 +31,7 @@
 #include <linux/irq.h>
 #include <linux/ratelimit.h>
 #include <linux/acpi.h>
+#include <linux/sched.h>
 
 #include <asm/delay.h>
 #include <asm/intrinsics.h>
@@ -496,6 +497,7 @@ ia64_handle_irq (ia64_vector vector, str
 			smp_local_flush_tlb();
 			kstat_incr_irqs_this_cpu(irq, desc);
 		} else if (unlikely(IS_RESCHEDULE(vector))) {
+			scheduler_ipi();
 			kstat_incr_irqs_this_cpu(irq, desc);
 		} else {
 			ia64_setreg(_IA64_REG_CR_TPR, vector);
Index: linux-2.6/arch/ia64/xen/irq_xen.c
===================================================================
--- linux-2.6.orig/arch/ia64/xen/irq_xen.c
+++ linux-2.6/arch/ia64/xen/irq_xen.c
@@ -92,6 +92,8 @@ static unsigned short saved_irq_cnt;
 static int xen_slab_ready;
 
 #ifdef CONFIG_SMP
+#include <linux/sched.h>
+
 /* Dummy stub. Though we may check XEN_RESCHEDULE_VECTOR before __do_IRQ,
  * it ends up to issue several memory accesses upon percpu data and
  * thus adds unnecessary traffic to other paths.
@@ -99,7 +101,13 @@ static int xen_slab_ready;
 static irqreturn_t
 xen_dummy_handler(int irq, void *dev_id)
 {
+	return IRQ_HANDLED;
+}
 
+static irqreturn_t
+xen_resched_handler(int irq, void *dev_id)
+{
+	scheduler_ipi();
 	return IRQ_HANDLED;
 }
 
@@ -110,7 +118,7 @@ static struct irqaction xen_ipi_irqactio
 };
 
 static struct irqaction xen_resched_irqaction = {
-	.handler =	xen_dummy_handler,
+	.handler =	xen_resched_handler,
 	.flags =	IRQF_DISABLED,
 	.name =		"resched"
 };
Index: linux-2.6/arch/m32r/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/m32r/kernel/smp.c
+++ linux-2.6/arch/m32r/kernel/smp.c
@@ -122,8 +122,6 @@ void smp_send_reschedule(int cpu_id)
  *
  * Description:  This routine executes on CPU which received
  *               'RESCHEDULE_IPI'.
- *               Rescheduling is processed at the exit of interrupt
- *               operation.
  *
  * Born on Date: 2002.02.05
  *
@@ -138,7 +136,7 @@ void smp_send_reschedule(int cpu_id)
  *==========================================================================*/
 void smp_reschedule_interrupt(void)
 {
-	/* nothing to do */
+	scheduler_ipi();
 }
 
 /*==========================================================================*
Index: linux-2.6/arch/mips/kernel/smtc.c
===================================================================
--- linux-2.6.orig/arch/mips/kernel/smtc.c
+++ linux-2.6/arch/mips/kernel/smtc.c
@@ -930,7 +930,7 @@ static void post_direct_ipi(int cpu, str
 
 static void ipi_resched_interrupt(void)
 {
-	/* Return from interrupt should be enough to cause scheduler check */
+	scheduler_ipi();
 }
 
 static void ipi_call_interrupt(void)
Index: linux-2.6/arch/mips/sibyte/bcm1480/smp.c
===================================================================
--- linux-2.6.orig/arch/mips/sibyte/bcm1480/smp.c
+++ linux-2.6/arch/mips/sibyte/bcm1480/smp.c
@@ -20,6 +20,7 @@
 #include <linux/delay.h>
 #include <linux/smp.h>
 #include <linux/kernel_stat.h>
+#include <linux/sched.h>
 
 #include <asm/mmu_context.h>
 #include <asm/io.h>
@@ -189,10 +190,8 @@ void bcm1480_mailbox_interrupt(void)
 	/* Clear the mailbox to clear the interrupt */
 	__raw_writeq(((u64)action)<<48, mailbox_0_clear_regs[cpu]);
 
-	/*
-	 * Nothing to do for SMP_RESCHEDULE_YOURSELF; returning from the
-	 * interrupt will do the reschedule for us
-	 */
+	if (action & SMP_RESCHEDULE_YOURSELF)
+		scheduler_ipi();
 
 	if (action & SMP_CALL_FUNCTION)
 		smp_call_function_interrupt();
Index: linux-2.6/arch/mips/sibyte/sb1250/smp.c
===================================================================
--- linux-2.6.orig/arch/mips/sibyte/sb1250/smp.c
+++ linux-2.6/arch/mips/sibyte/sb1250/smp.c
@@ -21,6 +21,7 @@
 #include <linux/interrupt.h>
 #include <linux/smp.h>
 #include <linux/kernel_stat.h>
+#include <linux/sched.h>
 
 #include <asm/mmu_context.h>
 #include <asm/io.h>
@@ -177,10 +178,8 @@ void sb1250_mailbox_interrupt(void)
 	/* Clear the mailbox to clear the interrupt */
 	____raw_writeq(((u64)action) << 48, mailbox_clear_regs[cpu]);
 
-	/*
-	 * Nothing to do for SMP_RESCHEDULE_YOURSELF; returning from the
-	 * interrupt will do the reschedule for us
-	 */
+	if (action & SMP_RESCHEDULE_YOURSELF)
+		scheduler_ipi();
 
 	if (action & SMP_CALL_FUNCTION)
 		smp_call_function_interrupt();
Index: linux-2.6/arch/mn10300/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/mn10300/kernel/smp.c
+++ linux-2.6/arch/mn10300/kernel/smp.c
@@ -464,14 +464,11 @@ void smp_send_stop(void)
  * @irq: The interrupt number.
  * @dev_id: The device ID.
  *
- * We need do nothing here, since the scheduling will be effected on our way
- * back through entry.S.
- *
  * Returns IRQ_HANDLED to indicate we handled the interrupt successfully.
  */
 static irqreturn_t smp_reschedule_interrupt(int irq, void *dev_id)
 {
-	/* do nothing */
+	scheduler_ipi();
 	return IRQ_HANDLED;
 }
 
Index: linux-2.6/arch/parisc/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/smp.c
+++ linux-2.6/arch/parisc/kernel/smp.c
@@ -155,10 +155,7 @@ ipi_interrupt(int irq, void *dev_id) 
 				
 			case IPI_RESCHEDULE:
 				smp_debug(100, KERN_DEBUG "CPU%d IPI_RESCHEDULE\n", this_cpu);
-				/*
-				 * Reschedule callback.  Everything to be
-				 * done is done by the interrupt return path.
-				 */
+				scheduler_ipi();
 				break;
 
 			case IPI_CALL_FUNC:
Index: linux-2.6/arch/powerpc/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/smp.c
+++ linux-2.6/arch/powerpc/kernel/smp.c
@@ -98,6 +98,7 @@ void smp_message_recv(int msg)
 		break;
 	case PPC_MSG_RESCHEDULE:
 		/* we notice need_resched on exit */
+		scheduler_ipi();
 		break;
 	case PPC_MSG_CALL_FUNC_SINGLE:
 		generic_smp_call_function_single_interrupt();
@@ -127,7 +128,7 @@ static irqreturn_t call_function_action(
 
 static irqreturn_t reschedule_action(int irq, void *data)
 {
-	/* we just need the return path side effect of checking need_resched */
+	scheduler_ipi();
 	return IRQ_HANDLED;
 }
 
Index: linux-2.6/arch/s390/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/s390/kernel/smp.c
+++ linux-2.6/arch/s390/kernel/smp.c
@@ -165,12 +165,12 @@ static void do_ext_call_interrupt(unsign
 	kstat_cpu(smp_processor_id()).irqs[EXTINT_IPI]++;
 	/*
 	 * handle bit signal external calls
-	 *
-	 * For the ec_schedule signal we have to do nothing. All the work
-	 * is done automatically when we return from the interrupt.
 	 */
 	bits = xchg(&S390_lowcore.ext_call_fast, 0);
 
+	if (test_bit(ec_schedule, &bits))
+		scheduler_ipi();
+
 	if (test_bit(ec_call_function, &bits))
 		generic_smp_call_function_interrupt();
 
Index: linux-2.6/arch/sh/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/sh/kernel/smp.c
+++ linux-2.6/arch/sh/kernel/smp.c
@@ -20,6 +20,7 @@
 #include <linux/module.h>
 #include <linux/cpu.h>
 #include <linux/interrupt.h>
+#include <linux/sched.h>
 #include <asm/atomic.h>
 #include <asm/processor.h>
 #include <asm/system.h>
@@ -323,6 +324,7 @@ void smp_message_recv(unsigned int msg)
 		generic_smp_call_function_interrupt();
 		break;
 	case SMP_MSG_RESCHEDULE:
+		scheduler_ipi();
 		break;
 	case SMP_MSG_FUNCTION_SINGLE:
 		generic_smp_call_function_single_interrupt();
Index: linux-2.6/arch/sparc/kernel/smp_32.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/smp_32.c
+++ linux-2.6/arch/sparc/kernel/smp_32.c
@@ -125,7 +125,7 @@ struct linux_prom_registers smp_penguin_
 
 void smp_send_reschedule(int cpu)
 {
-	/* See sparc64 */
+	scheduler_ipi();
 }
 
 void smp_send_stop(void)
Index: linux-2.6/arch/sparc/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/smp_64.c
+++ linux-2.6/arch/sparc/kernel/smp_64.c
@@ -1369,6 +1369,7 @@ void smp_send_reschedule(int cpu)
 void __irq_entry smp_receive_signal_client(int irq, struct pt_regs *regs)
 {
 	clear_softint(1 << irq);
+	scheduler_ipi();
 }
 
 /* This is a nop because we capture all other cpus
Index: linux-2.6/arch/tile/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/tile/kernel/smp.c
+++ linux-2.6/arch/tile/kernel/smp.c
@@ -184,12 +184,8 @@ void flush_icache_range(unsigned long st
 /* Called when smp_send_reschedule() triggers IRQ_RESCHEDULE. */
 static irqreturn_t handle_reschedule_ipi(int irq, void *token)
 {
-	/*
-	 * Nothing to do here; when we return from interrupt, the
-	 * rescheduling will occur there. But do bump the interrupt
-	 * profiler count in the meantime.
-	 */
 	__get_cpu_var(irq_stat).irq_resched_count++;
+	scheduler_ipi();
 
 	return IRQ_HANDLED;
 }
Index: linux-2.6/arch/um/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/um/kernel/smp.c
+++ linux-2.6/arch/um/kernel/smp.c
@@ -173,7 +173,7 @@ void IPI_handler(int cpu)
 			break;
 
 		case 'R':
-			set_tsk_need_resched(current);
+			scheduler_ipi();
 			break;
 
 		case 'S':
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -194,14 +194,13 @@ static void native_stop_other_cpus(int w
 }
 
 /*
- * Reschedule call back. Nothing to do,
- * all the work is done automatically when
- * we return from the interrupt.
+ * Reschedule call back.
  */
 void smp_reschedule_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
 	inc_irq_stat(irq_resched_count);
+	scheduler_ipi();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
Index: linux-2.6/arch/x86/xen/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/xen/smp.c
+++ linux-2.6/arch/x86/xen/smp.c
@@ -46,13 +46,12 @@ static irqreturn_t xen_call_function_int
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id);
 
 /*
- * Reschedule call back. Nothing to do,
- * all the work is done automatically when
- * we return from the interrupt.
+ * Reschedule call back.
  */
 static irqreturn_t xen_reschedule_interrupt(int irq, void *dev_id)
 {
 	inc_irq_stat(irq_resched_count);
+	scheduler_ipi();
 
 	return IRQ_HANDLED;
 }
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -2182,6 +2182,7 @@ extern void set_task_comm(struct task_st
 extern char *get_task_comm(char *to, struct task_struct *tsk);
 
 #ifdef CONFIG_SMP
+static inline void scheduler_ipi(void) { }
 extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
 #else
 static inline unsigned long wait_task_inactive(struct task_struct *p,
Index: linux-2.6/arch/mips/cavium-octeon/smp.c
===================================================================
--- linux-2.6.orig/arch/mips/cavium-octeon/smp.c
+++ linux-2.6/arch/mips/cavium-octeon/smp.c
@@ -44,6 +44,8 @@ static irqreturn_t mailbox_interrupt(int
 
 	if (action & SMP_CALL_FUNCTION)
 		smp_call_function_interrupt();
+	if (action & SMP_RESCHEDULE_YOURSELF)
+		scheduler_ipi();
 
 	/* Check if we've been told to flush the icache */
 	if (action & SMP_ICACHE_FLUSH)
Index: linux-2.6/arch/mips/mti-malta/malta-int.c
===================================================================
--- linux-2.6.orig/arch/mips/mti-malta/malta-int.c
+++ linux-2.6/arch/mips/mti-malta/malta-int.c
@@ -309,6 +309,8 @@ static void ipi_call_dispatch(void)
 
 static irqreturn_t ipi_resched_interrupt(int irq, void *dev_id)
 {
+	scheduler_ipi();
+
 	return IRQ_HANDLED;
 }
 
Index: linux-2.6/arch/mips/pmc-sierra/yosemite/smp.c
===================================================================
--- linux-2.6.orig/arch/mips/pmc-sierra/yosemite/smp.c
+++ linux-2.6/arch/mips/pmc-sierra/yosemite/smp.c
@@ -55,6 +55,8 @@ void titan_mailbox_irq(void)
 
 		if (status & 0x2)
 			smp_call_function_interrupt();
+		if (status & 0x4)
+			scheduler_ipi();
 		break;
 
 	case 1:
@@ -63,6 +65,8 @@ void titan_mailbox_irq(void)
 
 		if (status & 0x2)
 			smp_call_function_interrupt();
+		if (status & 0x4)
+			scheduler_ipi();
 		break;
 	}
 }
Index: linux-2.6/arch/mips/sgi-ip27/ip27-irq.c
===================================================================
--- linux-2.6.orig/arch/mips/sgi-ip27/ip27-irq.c
+++ linux-2.6/arch/mips/sgi-ip27/ip27-irq.c
@@ -147,8 +147,10 @@ static void ip27_do_irq_mask0(void)
 #ifdef CONFIG_SMP
 	if (pend0 & (1UL << CPU_RESCHED_A_IRQ)) {
 		LOCAL_HUB_CLR_INTR(CPU_RESCHED_A_IRQ);
+		scheduler_ipi();
 	} else if (pend0 & (1UL << CPU_RESCHED_B_IRQ)) {
 		LOCAL_HUB_CLR_INTR(CPU_RESCHED_B_IRQ);
+		scheduler_ipi();
 	} else if (pend0 & (1UL << CPU_CALL_A_IRQ)) {
 		LOCAL_HUB_CLR_INTR(CPU_CALL_A_IRQ);
 		smp_call_function_interrupt();




* [PATCH 02/22] sched: Always provide p->on_cpu
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 03/22] mutex: Use p->on_cpu for the adaptive spin Peter Zijlstra
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-on_cpu.patch --]
[-- Type: text/plain, Size: 3731 bytes --]

Always provide p->on_cpu so that we can determine whether a task is
currently running on a CPU without having to lock the rq.
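
As a rough sketch (not part of the patch, helper name invented, CONFIG_SMP
builds only), the lockless check this enables looks like:

#include <linux/sched.h>

/* True while @p is actually executing on some CPU. */
static inline int example_task_is_on_cpu(struct task_struct *p)
{
	/*
	 * ->on_cpu is set in prepare_lock_switch() and cleared in
	 * finish_lock_switch() after an smp_wmb(), so seeing 0 here
	 * means the previous context switch has fully completed --
	 * and nobody needed rq->lock to make that observation.
	 */
	return p->on_cpu;
}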

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h |    4 +---
 kernel/sched.c        |   46 +++++++++++++++++++++++++++++-----------------
 2 files changed, 30 insertions(+), 20 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -845,18 +845,39 @@ static inline int task_current(struct rq
 	return rq->curr == p;
 }
 
-#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline int task_running(struct rq *rq, struct task_struct *p)
 {
+#ifdef CONFIG_SMP
+	return p->on_cpu;
+#else
 	return task_current(rq, p);
+#endif
 }
 
+#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * We can optimise this out completely for !SMP, because the
+	 * SMP rebalancing from interrupt is the only thing that cares
+	 * here.
+	 */
+	next->on_cpu = 1;
+#endif
 }
 
 static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
+	 * We must ensure this doesn't happen until the switch is completely
+	 * finished.
+	 */
+	smp_wmb();
+	prev->on_cpu = 0;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
 	rq->lock.owner = current;
@@ -872,15 +893,6 @@ static inline void finish_lock_switch(st
 }
 
 #else /* __ARCH_WANT_UNLOCKED_CTXSW */
-static inline int task_running(struct rq *rq, struct task_struct *p)
-{
-#ifdef CONFIG_SMP
-	return p->oncpu;
-#else
-	return task_current(rq, p);
-#endif
-}
-
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
 #ifdef CONFIG_SMP
@@ -889,7 +901,7 @@ static inline void prepare_lock_switch(s
 	 * SMP rebalancing from interrupt is the only thing that cares
 	 * here.
 	 */
-	next->oncpu = 1;
+	next->on_cpu = 1;
 #endif
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 	raw_spin_unlock_irq(&rq->lock);
@@ -902,12 +914,12 @@ static inline void finish_lock_switch(st
 {
 #ifdef CONFIG_SMP
 	/*
-	 * After ->oncpu is cleared, the task can be moved to a different CPU.
+	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
 	 */
 	smp_wmb();
-	prev->oncpu = 0;
+	prev->on_cpu = 0;
 #endif
 #ifndef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 	local_irq_enable();
@@ -2645,8 +2657,8 @@ void sched_fork(struct task_struct *p, i
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
-	p->oncpu = 0;
+#if defined(CONFIG_SMP)
+	p->on_cpu = 0;
 #endif
 #ifdef CONFIG_PREEMPT
 	/* Want to start with kernel preemption disabled. */
@@ -5557,8 +5569,8 @@ void __cpuinit init_idle(struct task_str
 	rcu_read_unlock();
 
 	rq->curr = rq->idle = idle;
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
-	idle->oncpu = 1;
+#if defined(CONFIG_SMP)
+	idle->on_cpu = 1;
 #endif
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1198,9 +1198,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
-#ifdef __ARCH_WANT_UNLOCKED_CTXSW
-	int oncpu;
-#endif
+	int on_cpu;
 #endif
 
 	int prio, static_prio, normal_prio;




* [PATCH 03/22] mutex: Use p->on_cpu for the adaptive spin
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule() Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 02/22] sched: Always provide p->on_cpu Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 04/22] sched: Change the ttwu success details Peter Zijlstra
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-on_cpu-use.patch --]
[-- Type: text/plain, Size: 5902 bytes --]

Since we now have p->on_cpu unconditionally available, use it to
re-implement mutex_spin_on_owner.
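
For context, a simplified sketch of the caller-side adaptive-spin pattern
this enables (loop structure only, not the exact mutex.c code; the
function name is made up):

#include <linux/mutex.h>
#include <linux/sched.h>

static int example_spin_for_lock(struct mutex *lock)
{
	struct task_struct *owner;

	for (;;) {
		/* Speculative read; mutex_spin_on_owner() revalidates it. */
		owner = ACCESS_ONCE(lock->owner);
		if (owner && !mutex_spin_on_owner(lock, owner))
			break;			/* owner stopped running */

		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
			return 1;		/* got the lock while spinning */

		if (need_resched())
			break;

		arch_mutex_cpu_relax();
	}

	return 0;				/* give up, take the slow path */
}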

Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/mutex.h |    2 -
 include/linux/sched.h |    2 -
 kernel/mutex-debug.c  |    2 -
 kernel/mutex-debug.h  |    2 -
 kernel/mutex.c        |    2 -
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   83 +++++++++++++++++++-------------------------------
 7 files changed, 39 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
-	struct thread_info	*owner;
+	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	mutex_clear_owner(lock);
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
 
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -160,7 +160,7 @@ __mutex_lock_common(struct mutex *lock, 
 	 */
 
 	for (;;) {
-		struct thread_info *owner;
+		struct task_struct *owner;
 
 		/*
 		 * If we own the BKL, then don't spin. The owner of
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -19,7 +19,7 @@
 #ifdef CONFIG_SMP
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4034,70 +4034,53 @@ asmlinkage void __sched schedule(void)
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-/*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
- */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
-{
-	unsigned int cpu;
-	struct rq *rq;
 
-	if (!sched_feat(OWNER_SPIN))
-		return 0;
+static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
+{
+	bool ret = false;
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-	/*
-	 * Need to access the cpu field knowing that
-	 * DEBUG_PAGEALLOC could have unmapped it if
-	 * the mutex owner just released it and exited.
-	 */
-	if (probe_kernel_address(&owner->cpu, cpu))
-		return 0;
-#else
-	cpu = owner->cpu;
-#endif
+	rcu_read_lock();
+	if (lock->owner != owner)
+		goto fail;
 
 	/*
-	 * Even if the access succeeded (likely case),
-	 * the cpu field may no longer be valid.
+	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
+	 * lock->owner still matches owner, if that fails, owner might
+	 * point to free()d memory, if it still matches, the rcu_read_lock()
+	 * ensures the memory stays valid.
 	 */
-	if (cpu >= nr_cpumask_bits)
-		return 0;
+	barrier();
 
-	/*
-	 * We need to validate that we can do a
-	 * get_cpu() and that we have the percpu area.
-	 */
-	if (!cpu_online(cpu))
-		return 0;
+	ret = owner->on_cpu;
+fail:
+	rcu_read_unlock();
 
-	rq = cpu_rq(cpu);
+	return ret;
+}
 
-	for (;;) {
-		/*
-		 * Owner changed, break to re-assess state.
-		 */
-		if (lock->owner != owner) {
-			/*
-			 * If the lock has switched to a different owner,
-			 * we likely have heavy contention. Return 0 to quit
-			 * optimistic spinning and not contend further:
-			 */
-			if (lock->owner)
-				return 0;
-			break;
-		}
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+{
+	if (!sched_feat(OWNER_SPIN))
+		return 0;
 
-		/*
-		 * Is that owner really running on that cpu?
-		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
+	while (owner_running(lock, owner)) {
+		if (need_resched())
 			return 0;
 
 		arch_mutex_cpu_relax();
 	}
 
+	/*
+	 * If the owner changed to another task there is likely
+	 * heavy contention, stop spinning.
+	 */
+	if (lock->owner)
+		return 0;
+
 	return 1;
 }
 #endif
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -360,7 +360,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner);
 
 struct nsproxy;
 struct user_namespace;




* [PATCH 04/22] sched: Change the ttwu success details
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 03/22] mutex: Use p->on_cpu for the adaptive spin Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 05/22] sched: Clean up ttwu stats Peter Zijlstra
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-change-ttwu-return.patch --]
[-- Type: text/plain, Size: 2464 bytes --]

try_to_wake_up() used to return success only when it had to place a task
on a runqueue; change that to return success every time we change
p->state to TASK_RUNNING, because that is the real measure of a wakeup.

As a result, success is always true for the tracepoints.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2383,10 +2383,10 @@ static inline void ttwu_activate(struct 
 	activate_task(rq, p, en_flags);
 }
 
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
-					int wake_flags, bool success)
+static void
+ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
 {
-	trace_sched_wakeup(p, success);
+	trace_sched_wakeup(p, true);
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2406,7 +2406,7 @@ static inline void ttwu_post_activation(
 	}
 #endif
 	/* if a worker is waking up, notify workqueue */
-	if ((p->flags & PF_WQ_WORKER) && success)
+	if (p->flags & PF_WQ_WORKER)
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
@@ -2505,9 +2505,9 @@ static int try_to_wake_up(struct task_st
 #endif /* CONFIG_SMP */
 	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
 		      cpu == this_cpu, en_flags);
-	success = 1;
 out_running:
-	ttwu_post_activation(p, rq, wake_flags, success);
+	ttwu_post_activation(p, rq, wake_flags);
+	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
@@ -2526,7 +2526,6 @@ static int try_to_wake_up(struct task_st
 static void try_to_wake_up_local(struct task_struct *p)
 {
 	struct rq *rq = task_rq(p);
-	bool success = false;
 
 	BUG_ON(rq != this_rq());
 	BUG_ON(p == current);
@@ -2541,9 +2540,8 @@ static void try_to_wake_up_local(struct 
 			schedstat_inc(rq, ttwu_local);
 		}
 		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
-		success = true;
 	}
-	ttwu_post_activation(p, rq, 0, success);
+	ttwu_post_activation(p, rq, 0);
 }
 
 /**
@@ -2705,7 +2703,7 @@ void wake_up_new_task(struct task_struct
 
 	rq = task_rq_lock(p, &flags);
 	activate_task(rq, p, 0);
-	trace_sched_wakeup_new(p, 1);
+	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)




* [PATCH 05/22] sched: Clean up ttwu stats
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 04/22] sched: Change the ttwu success details Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 06/22] sched: Provide p->on_rq Peter Zijlstra
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-stat.patch --]
[-- Type: text/plain, Size: 3318 bytes --]

Collect all the ttwu statistics code into a single function and ensure
it is always called for an actual wakeup (a change of p->state to
TASK_RUNNING).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   69 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 34 insertions(+), 35 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2408,21 +2408,38 @@ static void update_avg(u64 *avg, u64 sam
 }
 #endif
 
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
-				 bool is_sync, bool is_migrate, bool is_local,
-				 unsigned long en_flags)
+static void
+ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
 {
+#ifdef CONFIG_SCHEDSTATS
+	int this_cpu = smp_processor_id();
+
+	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (is_sync)
+
+	if (wake_flags & WF_SYNC)
 		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (is_migrate)
+
+	if (cpu != task_cpu(p))
 		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (is_local)
+
+#ifdef CONFIG_SMP
+	if (cpu == this_cpu) {
+		schedstat_inc(rq, ttwu_local);
 		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+	} else {
+		struct sched_domain *sd;
 
-	activate_task(rq, p, en_flags);
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+		for_each_domain(this_cpu, sd) {
+			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHEDSTATS */
 }
 
 static void
@@ -2482,12 +2499,12 @@ static int try_to_wake_up(struct task_st
 	if (!(p->state & state))
 		goto out;
 
+	cpu = task_cpu(p);
+
 	if (p->se.on_rq)
 		goto out_running;
 
-	cpu = task_cpu(p);
 	orig_cpu = cpu;
-
 #ifdef CONFIG_SMP
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
@@ -2528,27 +2545,12 @@ static int try_to_wake_up(struct task_st
 	WARN_ON(task_cpu(p) != cpu);
 	WARN_ON(p->state != TASK_WAKING);
 
-#ifdef CONFIG_SCHEDSTATS
-	schedstat_inc(rq, ttwu_count);
-	if (cpu == this_cpu)
-		schedstat_inc(rq, ttwu_local);
-	else {
-		struct sched_domain *sd;
-		for_each_domain(this_cpu, sd) {
-			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
-				schedstat_inc(sd, ttwu_wake_remote);
-				break;
-			}
-		}
-	}
-#endif /* CONFIG_SCHEDSTATS */
-
 out_activate:
 #endif /* CONFIG_SMP */
-	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
-		      cpu == this_cpu, en_flags);
+	activate_task(rq, p, en_flags);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
+	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
@@ -2576,14 +2578,11 @@ static void try_to_wake_up_local(struct 
 	if (!(p->state & TASK_NORMAL))
 		return;
 
-	if (!p->se.on_rq) {
-		if (likely(!task_running(rq, p))) {
-			schedstat_inc(rq, ttwu_count);
-			schedstat_inc(rq, ttwu_local);
-		}
-		ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
-	}
+	if (!p->se.on_rq)
+		activate_task(rq, p, ENQUEUE_WAKEUP);
+
 	ttwu_post_activation(p, rq, 0);
+	ttwu_stat(rq, p, smp_processor_id(), 0);
 }
 
 /**




* [PATCH 06/22] sched: Provide p->on_rq
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 05/22] sched: Clean up ttwu stats Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 07/22] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-onrq.patch --]
[-- Type: text/plain, Size: 9261 bytes --]

Provide a generic p->on_rq flag, because the p->se.on_rq semantics are
unfavourable for lockless wakeups yet are still needed by sched_fair.

In particular, p->on_rq is only cleared when we actually dequeue the
task in schedule() and not on any random dequeue as done by things
like __migrate_task() and __sched_setscheduler().

This also allows us to remove p->se usage from !sched_fair code.
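
A rough sketch (not from the patch; the helper mirrors __migrate_task()
in the diff below and its name is invented) of why that matters: moving a
queued task between runqueues never clears p->on_rq, only the real sleep
in schedule() does:

static void example_move_queued_task(struct rq *src_rq, struct rq *dst_rq,
				     struct task_struct *p)
{
	deactivate_task(src_rq, p, 0);	/* dequeues, but p->on_rq stays 1 */
	set_task_cpu(p, cpu_of(dst_rq));
	activate_task(dst_rq, p, 0);	/* re-queues on the new runqueue */
}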

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h   |    1 +
 kernel/sched.c          |   36 ++++++++++++++++++------------------
 kernel/sched_debug.c    |    2 +-
 kernel/sched_rt.c       |   16 ++++++++--------
 kernel/sched_stoptask.c |    2 +-
 5 files changed, 29 insertions(+), 28 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1201,6 +1201,7 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	int on_cpu;
 #endif
+	int on_rq;
 
 	int prio, static_prio, normal_prio;
 	unsigned int rt_priority;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1790,7 +1790,6 @@ static void enqueue_task(struct rq *rq, 
 	update_rq_clock(rq);
 	sched_info_queued(p);
 	p->sched_class->enqueue_task(rq, p, flags);
-	p->se.on_rq = 1;
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -1798,7 +1797,6 @@ static void dequeue_task(struct rq *rq, 
 	update_rq_clock(rq);
 	sched_info_dequeued(p);
 	p->sched_class->dequeue_task(rq, p, flags);
-	p->se.on_rq = 0;
 }
 
 /*
@@ -1811,6 +1809,7 @@ static void activate_task(struct rq *rq,
 
 	enqueue_task(rq, p, flags);
 	inc_nr_running(rq);
+	p->on_rq = 1;
 }
 
 /*
@@ -2133,7 +2132,7 @@ static void check_preempt_curr(struct rq
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
+	if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
 }
 
@@ -2208,7 +2207,7 @@ static bool migrate_task(struct task_str
 	 * If the task is not on a runqueue (and not running), then
 	 * the next wake-up will properly place the task.
 	 */
-	return p->se.on_rq || task_running(rq, p);
+	return p->on_rq || task_running(rq, p);
 }
 
 /*
@@ -2268,7 +2267,7 @@ unsigned long wait_task_inactive(struct 
 		rq = task_rq_lock(p, &flags);
 		trace_sched_wait_task(p);
 		running = task_running(rq, p);
-		on_rq = p->se.on_rq;
+		on_rq = p->on_rq;
 		ncsw = 0;
 		if (!match_state || p->state == match_state)
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
@@ -2499,7 +2498,7 @@ static int try_to_wake_up(struct task_st
 
 	cpu = task_cpu(p);
 
-	if (p->se.on_rq)
+	if (p->on_rq)
 		goto out_running;
 
 	orig_cpu = cpu;
@@ -2576,7 +2575,7 @@ static void try_to_wake_up_local(struct 
 	if (!(p->state & TASK_NORMAL))
 		return;
 
-	if (!p->se.on_rq)
+	if (!p->on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_post_activation(p, rq, 0);
@@ -2613,19 +2612,21 @@ int wake_up_state(struct task_struct *p,
  */
 static void __sched_fork(struct task_struct *p)
 {
+	p->on_rq			= 0;
+
+	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
-	p->se.on_rq = 0;
-	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
@@ -4044,7 +4045,7 @@ static inline void schedule_debug(struct
 
 static void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	if (prev->se.on_rq)
+	if (prev->on_rq)
 		update_rq_clock(rq);
 	prev->sched_class->put_prev_task(rq, prev);
 }
@@ -4123,6 +4124,7 @@ asmlinkage void __sched schedule(void)
 					try_to_wake_up_local(to_wakeup);
 			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+			prev->on_rq = 0;
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -4683,7 +4685,7 @@ void rt_mutex_setprio(struct task_struct
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
 		dequeue_task(rq, p, 0);
@@ -4731,7 +4733,7 @@ void set_user_nice(struct task_struct *p
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	if (on_rq)
 		dequeue_task(rq, p, 0);
 
@@ -4865,8 +4867,6 @@ static struct task_struct *find_process_
 static void
 __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 {
-	BUG_ON(p->se.on_rq);
-
 	p->policy = policy;
 	p->rt_priority = prio;
 	p->normal_prio = normal_prio(p);
@@ -5015,7 +5015,7 @@ static int __sched_setscheduler(struct t
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		goto recheck;
 	}
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
 		deactivate_task(rq, p, 0);
@@ -5925,7 +5925,7 @@ static int __migrate_task(struct task_st
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
-	if (p->se.on_rq) {
+	if (p->on_rq) {
 		deactivate_task(rq_src, p, 0);
 		set_task_cpu(p, dest_cpu);
 		activate_task(rq_dest, p, 0);
@@ -8296,7 +8296,7 @@ static void normalize_task(struct rq *rq
 	int old_prio = p->prio;
 	int on_rq;
 
-	on_rq = p->se.on_rq;
+	on_rq = p->on_rq;
 	if (on_rq)
 		deactivate_task(rq, p, 0);
 	__setscheduler(rq, p, SCHED_NORMAL, 0);
@@ -8642,7 +8642,7 @@ void sched_move_task(struct task_struct 
 	rq = task_rq_lock(tsk, &flags);
 
 	running = task_current(rq, tsk);
-	on_rq = tsk->se.on_rq;
+	on_rq = tsk->on_rq;
 
 	if (on_rq)
 		dequeue_task(rq, tsk, 0);
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -152,7 +152,7 @@ static void print_rq(struct seq_file *m,
 	read_lock_irqsave(&tasklist_lock, flags);
 
 	do_each_thread(g, p) {
-		if (!p->se.on_rq || task_cpu(p) != rq_cpu)
+		if (!p->on_rq || task_cpu(p) != rq_cpu)
 			continue;
 
 		print_task(m, rq, p);
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -1132,7 +1132,7 @@ static void put_prev_task_rt(struct rq *
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
 	 */
-	if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
+	if (on_rt_rq(&p->rt) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
 }
 
@@ -1283,7 +1283,7 @@ static struct rq *find_lock_lowest_rq(st
 				     !cpumask_test_cpu(lowest_rq->cpu,
 						       &task->cpus_allowed) ||
 				     task_running(rq, task) ||
-				     !task->se.on_rq)) {
+				     !task->on_rq)) {
 
 				raw_spin_unlock(&lowest_rq->lock);
 				lowest_rq = NULL;
@@ -1317,7 +1317,7 @@ static struct task_struct *pick_next_pus
 	BUG_ON(task_current(rq, p));
 	BUG_ON(p->rt.nr_cpus_allowed <= 1);
 
-	BUG_ON(!p->se.on_rq);
+	BUG_ON(!p->on_rq);
 	BUG_ON(!rt_task(p));
 
 	return p;
@@ -1463,7 +1463,7 @@ static int pull_rt_task(struct rq *this_
 		 */
 		if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!p->se.on_rq);
+			WARN_ON(!p->on_rq);
 
 			/*
 			 * There's a chance that p is higher in priority
@@ -1534,7 +1534,7 @@ static void set_cpus_allowed_rt(struct t
 	 * Update the migration status of the RQ if we have an RT task
 	 * which is running AND changing its weight value.
 	 */
-	if (p->se.on_rq && (weight != p->rt.nr_cpus_allowed)) {
+	if (p->on_rq && (weight != p->rt.nr_cpus_allowed)) {
 		struct rq *rq = task_rq(p);
 
 		if (!task_current(rq, p)) {
@@ -1604,7 +1604,7 @@ static void switched_from_rt(struct rq *
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (p->se.on_rq && !rq->rt.rt_nr_running)
+	if (p->on_rq && !rq->rt.rt_nr_running)
 		pull_rt_task(rq);
 }
 
@@ -1634,7 +1634,7 @@ static void switched_to_rt(struct rq *rq
 	 * If that current running task is also an RT task
 	 * then see if we can move to another run queue.
 	 */
-	if (p->se.on_rq && rq->curr != p) {
+	if (p->on_rq && rq->curr != p) {
 #ifdef CONFIG_SMP
 		if (rq->rt.overloaded && push_rt_task(rq) &&
 		    /* Don't resched if we changed runqueues */
@@ -1653,7 +1653,7 @@ static void switched_to_rt(struct rq *rq
 static void
 prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 {
-	if (!p->se.on_rq)
+	if (!p->on_rq)
 		return;
 
 	if (rq->curr == p) {
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -26,7 +26,7 @@ static struct task_struct *pick_next_tas
 {
 	struct task_struct *stop = rq->stop;
 
-	if (stop && stop->se.on_rq)
+	if (stop && stop->on_rq)
 		return stop;
 
 	return NULL;




* [PATCH 07/22] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 06/22] sched: Provide p->on_rq Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 08/22] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-pi_lock-wakeup.patch --]
[-- Type: text/plain, Size: 3648 bytes --]

Currently p->pi_lock already serializes p->sched_class; also put
p->cpus_allowed and try_to_wake_up() under it. This prepares the way to
do the first part of ttwu() without holding rq->lock.

With p->sched_class and p->cpus_allowed both serialized by p->pi_lock,
we can then call select_task_rq() without holding rq->lock.
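
A minimal sketch (not part of the patch; the function name is invented)
of the resulting lock ordering, p->pi_lock taken outside rq->lock:

static int example_peek_wakeup_state(struct task_struct *p)
{
	unsigned long flags;
	struct rq *rq;
	int queued;

	raw_spin_lock_irqsave(&p->pi_lock, flags);	/* serializes ttwu() */
	rq = __task_rq_lock(p);				/* then the runqueue */

	/* p->state, p->cpus_allowed and p->sched_class are stable here */
	queued = p->on_rq;

	__task_rq_unlock(rq);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

	return queued;
}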

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2301,7 +2301,7 @@ void task_oncpu_function_call(struct tas
 
 #ifdef CONFIG_SMP
 /*
- * ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
+ * ->cpus_allowed is protected by both rq->lock and p->pi_lock
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
@@ -2334,7 +2334,7 @@ static int select_fallback_rq(int cpu, s
 }
 
 /*
- * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
+ * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
 int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
@@ -2450,7 +2450,8 @@ static int try_to_wake_up(struct task_st
 	this_cpu = get_cpu();
 
 	smp_wmb();
-	rq = task_rq_lock(p, &flags);
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	rq = __task_rq_lock(p);
 	if (!(p->state & state))
 		goto out;
 
@@ -2508,7 +2509,8 @@ static int try_to_wake_up(struct task_st
 	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 out:
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	put_cpu();
 
 	return success;
@@ -4543,6 +4545,8 @@ void rt_mutex_setprio(struct task_struct
 
 	BUG_ON(prio < 0 || prio > MAX_PRIO);
 
+	lockdep_assert_held(&p->pi_lock);
+
 	rq = task_rq_lock(p, &flags);
 
 	trace_sched_pi_setprio(p, prio);
@@ -5150,7 +5154,6 @@ long sched_getaffinity(pid_t pid, struct
 {
 	struct task_struct *p;
 	unsigned long flags;
-	struct rq *rq;
 	int retval;
 
 	get_online_cpus();
@@ -5165,9 +5168,9 @@ long sched_getaffinity(pid_t pid, struct
 	if (retval)
 		goto out_unlock;
 
-	rq = task_rq_lock(p, &flags);
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	cpumask_and(mask, &p->cpus_allowed, cpu_online_mask);
-	task_rq_unlock(rq, &flags);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 out_unlock:
 	rcu_read_unlock();
@@ -5652,18 +5655,8 @@ int set_cpus_allowed_ptr(struct task_str
 	unsigned int dest_cpu;
 	int ret = 0;
 
-	/*
-	 * Serialize against TASK_WAKING so that ttwu() and wunt() can
-	 * drop the rq->lock and still rely on ->cpus_allowed.
-	 */
-again:
-	while (task_is_waking(p))
-		cpu_relax();
-	rq = task_rq_lock(p, &flags);
-	if (task_is_waking(p)) {
-		task_rq_unlock(rq, &flags);
-		goto again;
-	}
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	rq = __task_rq_lock(p);
 
 	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
 		ret = -EINVAL;
@@ -5691,13 +5684,15 @@ int set_cpus_allowed_ptr(struct task_str
 	if (migrate_task(p, rq)) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
-		task_rq_unlock(rq, &flags);
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
 		tlb_migrate_finish(p->mm);
 		return 0;
 	}
 out:
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 	return ret;
 }




* [PATCH 08/22] sched: Drop the rq argument to sched_class::select_task_rq()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 07/22] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 09/22] sched: Remove rq argument to sched_class::task_waking() Peter Zijlstra
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-select_task_rq.patch --]
[-- Type: text/plain, Size: 7460 bytes --]

In preparation for calling select_task_rq() without rq->lock held, drop
the dependency on the rq argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h   |    3 +--
 kernel/sched.c          |   20 +++++++++++---------
 kernel/sched_fair.c     |    2 +-
 kernel/sched_idletask.c |    2 +-
 kernel/sched_rt.c       |   38 ++++++++++++++++++++++++++------------
 kernel/sched_stoptask.c |    3 +--
 6 files changed, 41 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1063,8 +1063,7 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2138,13 +2138,15 @@ static int migration_cpu_stop(void *data
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static bool migrate_task(struct task_struct *p, struct rq *rq)
+static bool need_migrate_task(struct task_struct *p)
 {
 	/*
 	 * If the task is not on a runqueue (and not running), then
 	 * the next wake-up will properly place the task.
 	 */
-	return p->on_rq || task_running(rq, p);
+	bool running = p->on_rq || p->on_cpu;
+	smp_rmb(); /* finish_lock_switch() */
+	return running;
 }
 
 /*
@@ -2337,9 +2339,9 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2484,7 +2486,7 @@ static int try_to_wake_up(struct task_st
 		en_flags |= ENQUEUE_WAKING;
 	}
 
-	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
 	if (cpu != orig_cpu)
 		set_task_cpu(p, cpu);
 	__task_rq_unlock(rq);
@@ -2694,7 +2696,7 @@ void wake_up_new_task(struct task_struct
 	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
 	 * without people poking at ->cpus_allowed.
 	 */
-	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+	cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
 	set_task_cpu(p, cpu);
 
 	p->state = TASK_RUNNING;
@@ -3420,7 +3422,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	rq = task_rq_lock(p, &flags);
-	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
@@ -3428,7 +3430,7 @@ void sched_exec(void)
 	 * select_task_rq() can race against ->cpus_allowed
 	 */
 	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
-	    likely(cpu_active(dest_cpu)) && migrate_task(p, rq)) {
+	    likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
 		struct migration_arg arg = { p, dest_cpu };
 
 		task_rq_unlock(rq, &flags);
@@ -5681,7 +5683,7 @@ int set_cpus_allowed_ptr(struct task_str
 		goto out;
 
 	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
-	if (migrate_task(p, rq)) {
+	if (need_migrate_task(p)) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
 		__task_rq_unlock(rq);
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1623,7 +1623,7 @@ static int select_idle_sibling(struct ta
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -973,13 +973,23 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 {
+	struct task_struct *curr;
+	struct rq *rq;
+	int cpu;
+
 	if (sd_flag != SD_BALANCE_WAKE)
 		return smp_processor_id();
 
+	cpu = task_cpu(p);
+	rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+
 	/*
-	 * If the current task is an RT task, then
+	 * If the current task on @p's runqueue is an RT task, then
 	 * try to see if we can wake this RT task up on another
 	 * runqueue. Otherwise simply start this RT task
 	 * on its current runqueue.
@@ -993,21 +1003,25 @@ select_task_rq_rt(struct rq *rq, struct 
 	 * lock?
 	 *
 	 * For equal prio tasks, we just let the scheduler sort it out.
+	 *
+	 * Otherwise, just let it ride on the affined RQ and the
+	 * post-schedule router will push the preempted task away
+	 *
+	 * This test is optimistic, if we get it wrong the load-balancer
+	 * will have to sort it out.
 	 */
-	if (unlikely(rt_task(rq->curr)) &&
-	    (rq->curr->rt.nr_cpus_allowed < 2 ||
-	     rq->curr->prio < p->prio) &&
+	if (curr && unlikely(rt_task(curr)) &&
+	    (curr->rt.nr_cpus_allowed < 2 ||
+	     curr->prio < p->prio) &&
 	    (p->rt.nr_cpus_allowed > 1)) {
-		int cpu = find_lowest_rq(p);
+		int target = find_lowest_rq(p);
 
-		return (cpu == -1) ? task_cpu(p) : cpu;
+		if (target != -1)
+			cpu = target;
 	}
+	rcu_read_unlock();
 
-	/*
-	 * Otherwise, just let it ride on the affined RQ and the
-	 * post-schedule router will push the preempted task away
-	 */
-	return task_cpu(p);
+	return cpu;
 }
 
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -9,8 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct rq *rq, struct task_struct *p,
-		    int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 09/22] sched: Remove rq argument to sched_class::task_waking()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 08/22] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 10/22] sched: Deal with non-atomic min_vruntime reads on 32bits Peter Zijlstra
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-task_waking.patch --]
[-- Type: text/plain, Size: 2263 bytes --]

In preparation for calling this without rq->lock held, remove the
dependency on the rq argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h |   10 +++++++---
 kernel/sched.c        |    2 +-
 kernel/sched_fair.c   |    4 +++-
 3 files changed, 11 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1045,8 +1045,12 @@ struct sched_domain;
 #define WF_FORK		0x02		/* child wakeup after fork */
 
 #define ENQUEUE_WAKEUP		1
-#define ENQUEUE_WAKING		2
-#define ENQUEUE_HEAD		4
+#define ENQUEUE_HEAD		2
+#ifdef CONFIG_SMP
+#define ENQUEUE_WAKING		4	/* sched_class::task_waking was called */
+#else
+#define ENQUEUE_WAKING		0
+#endif
 
 #define DEQUEUE_SLEEP		1
 
@@ -1067,7 +1071,7 @@ struct sched_class {
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
-	void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
 
 	void (*set_cpus_allowed)(struct task_struct *p,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2481,7 +2481,7 @@ static int try_to_wake_up(struct task_st
 	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking) {
-		p->sched_class->task_waking(rq, p);
+		p->sched_class->task_waking(p);
 		en_flags |= ENQUEUE_WAKING;
 	}
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1338,11 +1338,13 @@ static void yield_task_fair(struct rq *r
 
 #ifdef CONFIG_SMP
 
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+	lockdep_assert_held(&task_rq(p)->lock);
+
 	se->vruntime -= cfs_rq->min_vruntime;
 }
 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 10/22] sched: Deal with non-atomic min_vruntime reads on 32bits
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 09/22] sched: Remove rq argument to sched_class::task_waking() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 11/22] sched: Delay task_contributes_to_load() Peter Zijlstra
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-fair-fix-wakeup.patch --]
[-- Type: text/plain, Size: 1696 bytes --]

In order to avoid reading partially updated min_vruntime values on 32bit,
implement a seqcount-like solution.
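
The idea, distilled into a pair of hypothetical helpers (sketch only;
publish_min_vruntime() and read_min_vruntime() are names made up for
illustration, the hunks below open-code the same pattern):

/* writer side, serialized by rq->lock */
static void publish_min_vruntime(struct cfs_rq *cfs_rq, u64 v)
{
	cfs_rq->min_vruntime = v;	/* may be two 32bit stores */
#ifndef CONFIG_64BIT
	smp_wmb();			/* order the value before the copy */
	cfs_rq->min_vruntime_copy = v;
#endif
}

/* lockless reader, e.g. task_waking_fair() */
static u64 read_min_vruntime(struct cfs_rq *cfs_rq)
{
#ifndef CONFIG_64BIT
	u64 val, copy;

	do {
		copy = cfs_rq->min_vruntime_copy;
		smp_rmb();		/* pairs with the smp_wmb() above */
		val = cfs_rq->min_vruntime;
	} while (val != copy);		/* raced with an update, retry */

	return val;
#else
	return cfs_rq->min_vruntime;	/* 64bit loads don't tear */
#endif
}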

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c      |    3 +++
 kernel/sched_fair.c |   19 +++++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -313,6 +313,9 @@ struct cfs_rq {
 
 	u64 exec_clock;
 	u64 min_vruntime;
+#ifndef CONFIG_64BIT
+	u64 min_vruntime_copy;
+#endif
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -365,6 +365,10 @@ static void update_min_vruntime(struct c
 	}
 
 	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+#ifndef CONFIG_64BIT
+	smp_wmb();
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }
 
 /*
@@ -1342,10 +1346,21 @@ static void task_waking_fair(struct task
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 min_vruntime;
 
-	lockdep_assert_held(&task_rq(p)->lock);
+#ifndef CONFIG_64BIT
+	u64 min_vruntime_copy;
 
-	se->vruntime -= cfs_rq->min_vruntime;
+	do {
+		min_vruntime_copy = cfs_rq->min_vruntime_copy;
+		smp_rmb();
+		min_vruntime = cfs_rq->min_vruntime;
+	} while (min_vruntime != min_vruntime_copy);
+#else
+	min_vruntime = cfs_rq->min_vruntime;
+#endif
+
+	se->vruntime -= min_vruntime;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 11/22] sched: Delay task_contributes_to_load()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 10/22] sched: Deal with non-atomic min_vruntime reads on 32bits Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 12/22] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-contribute-to-load.patch --]
[-- Type: text/plain, Size: 1804 bytes --]

In preparation for having to call task_contributes_to_load() without
holding rq->lock, we need to store the result until we do hold it and
can update the rq accounting accordingly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   16 ++++------------
 2 files changed, 5 insertions(+), 12 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1264,6 +1264,7 @@ struct task_struct {
 
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
+	unsigned sched_contributes_to_load:1;
 
 	pid_t pid;
 	pid_t tgid;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2467,18 +2467,7 @@ static int try_to_wake_up(struct task_st
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-	/*
-	 * In order to handle concurrent wakeups and release the rq->lock
-	 * we put the task in TASK_WAKING state.
-	 *
-	 * First fix up the nr_uninterruptible count:
-	 */
-	if (task_contributes_to_load(p)) {
-		if (likely(cpu_online(orig_cpu)))
-			rq->nr_uninterruptible--;
-		else
-			this_rq()->nr_uninterruptible--;
-	}
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking) {
@@ -2503,6 +2492,9 @@ static int try_to_wake_up(struct task_st
 	WARN_ON(task_cpu(p) != cpu);
 	WARN_ON(p->state != TASK_WAKING);
 
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible--;
+
 out_activate:
 #endif /* CONFIG_SMP */
 	activate_task(rq, p, en_flags);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 12/22] sched: Also serialize ttwu_local() with p->pi_lock
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 11/22] sched: Delay task_contributes_to_load() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 13/22] sched: Add p->pi_lock to task_rq_lock() Peter Zijlstra
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu_local.patch --]
[-- Type: text/plain, Size: 2516 bytes --]

Since we now serialize ttwu() using p->pi_lock, we also need to
serialize ttwu_local() with it; otherwise, once we drop rq->lock in
ttwu(), it can race with ttwu_local().
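
The unlock/relock dance in the hunk below exists because p->pi_lock
nests outside rq->lock; a hypothetical helper (relock_with_pi_lock() is
an invented name, sketch only) makes the ordering constraint explicit:

static void relock_with_pi_lock(struct rq *rq, struct task_struct *p)
{
	/*
	 * try_to_wake_up() takes p->pi_lock first and rq->lock second,
	 * so taking p->pi_lock while already holding rq->lock would be
	 * an AB-BA deadlock. Drop rq->lock and re-take both in the
	 * proper order instead.
	 */
	raw_spin_unlock(&rq->lock);
	raw_spin_lock(&p->pi_lock);
	raw_spin_lock(&rq->lock);
}

try_to_wake_up_local() below open-codes exactly this.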

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2559,9 +2559,9 @@ static int try_to_wake_up(struct task_st
  * try_to_wake_up_local - try to wake up a local task with rq lock held
  * @p: the thread to be awakened
  *
- * Put @p on the run-queue if it's not already there.  The caller must
+ * Put @p on the run-queue if it's not already there. The caller must
  * ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task.  this_rq() stays locked over invocation.
+ * the current task.
  */
 static void try_to_wake_up_local(struct task_struct *p)
 {
@@ -2569,16 +2569,21 @@ static void try_to_wake_up_local(struct 
 
 	BUG_ON(rq != this_rq());
 	BUG_ON(p == current);
-	lockdep_assert_held(&rq->lock);
+
+	raw_spin_unlock(&rq->lock);
+	raw_spin_lock(&p->pi_lock);
+	raw_spin_lock(&rq->lock);
 
 	if (!(p->state & TASK_NORMAL))
-		return;
+		goto out;
 
 	if (!p->on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_post_activation(p, rq, 0);
 	ttwu_stat(rq, p, smp_processor_id(), 0);
+out:
+	raw_spin_unlock(&p->pi_lock);
 }
 
 /**
@@ -4082,6 +4087,7 @@ pick_next_task(struct rq *rq)
  */
 asmlinkage void __sched schedule(void)
 {
+	struct task_struct *to_wakeup = NULL;
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
 	struct rq *rq;
@@ -4115,21 +4121,21 @@ asmlinkage void __sched schedule(void)
 			 * task to maintain concurrency.  If so, wake
 			 * up the task.
 			 */
-			if (prev->flags & PF_WQ_WORKER) {
-				struct task_struct *to_wakeup;
-
+			if (prev->flags & PF_WQ_WORKER)
 				to_wakeup = wq_worker_sleeping(prev, cpu);
-				if (to_wakeup)
-					try_to_wake_up_local(to_wakeup);
-			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 		}
 		switch_count = &prev->nvcsw;
 	}
 
+	/*
+	 * All three: try_to_wake_up_local(), pre_schedule() and idle_balance()
+	 * can drop rq->lock.
+	 */
+	if (to_wakeup)
+		try_to_wake_up_local(to_wakeup);
 	pre_schedule(rq, prev);
-
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 13/22] sched: Add p->pi_lock to task_rq_lock()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 12/22] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 14/22] sched: Drop rq->lock from first part of wake_up_new_task() Peter Zijlstra
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-task_rq_lock.patch --]
[-- Type: text/plain, Size: 10544 bytes --]

In order to be able to call set_task_cpu() while holding either
p->pi_lock or task_rq(p)->lock, anything that needs task_rq() to stay
stable must hold both locks.

This makes task_rq_lock() acquire both locks, and have
__task_rq_lock() validate that p->pi_lock is held. This increases the
locking overhead for most scheduler syscalls but allows reduction of
rq->lock contention for some scheduler hot paths (ttwu).
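
For illustration, a hypothetical caller of the new rule (sketch only;
example_inspect_task_rq() is an invented name):

static void example_inspect_task_rq(struct task_struct *p)
{
	unsigned long flags;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, flags);	/* outer lock */
	rq = __task_rq_lock(p);		/* inner lock, task_rq(p) now pinned */

	/*
	 * With both locks held p cannot be migrated: every
	 * set_task_cpu() caller holds at least one of the two.
	 */

	__task_rq_unlock(rq);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}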

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |  103 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 47 insertions(+), 56 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -600,7 +600,7 @@ static inline int cpu_of(struct rq *rq)
  * Return the group to which this tasks belongs.
  *
  * We use task_subsys_state_check() and extend the RCU verification
- * with lockdep_is_held(&task_rq(p)->lock) because cpu_cgroup_attach()
+ * with lockdep_is_held(&p->pi_lock) because cpu_cgroup_attach()
  * holds that lock for each task it moves into the cgroup. Therefore
  * by holding that lock, we pin the task to the current cgroup.
  */
@@ -610,7 +610,7 @@ static inline struct task_group *task_gr
 	struct cgroup_subsys_state *css;
 
 	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&task_rq(p)->lock));
+			lockdep_is_held(&p->pi_lock));
 	tg = container_of(css, struct task_group, css);
 
 	return autogroup_task_group(p, tg);
@@ -926,23 +926,15 @@ static inline void finish_lock_switch(st
 #endif /* __ARCH_WANT_UNLOCKED_CTXSW */
 
 /*
- * Check whether the task is waking, we use this to synchronize ->cpus_allowed
- * against ttwu().
- */
-static inline int task_is_waking(struct task_struct *p)
-{
-	return unlikely(p->state == TASK_WAKING);
-}
-
-/*
- * __task_rq_lock - lock the runqueue a given task resides on.
- * Must be called interrupts disabled.
+ * __task_rq_lock - lock the rq @p resides on.
  */
 static inline struct rq *__task_rq_lock(struct task_struct *p)
 	__acquires(rq->lock)
 {
 	struct rq *rq;
 
+	lockdep_assert_held(&p->pi_lock);
+
 	for (;;) {
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
@@ -953,22 +945,22 @@ static inline struct rq *__task_rq_lock(
 }
 
 /*
- * task_rq_lock - lock the runqueue a given task resides on and disable
- * interrupts. Note the ordering: we can safely lookup the task_rq without
- * explicitly disabling preemption.
+ * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
  */
 static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
+	__acquires(p->pi_lock)
 	__acquires(rq->lock)
 {
 	struct rq *rq;
 
 	for (;;) {
-		local_irq_save(*flags);
+		raw_spin_lock_irqsave(&p->pi_lock, *flags);
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
 		if (likely(rq == task_rq(p)))
 			return rq;
-		raw_spin_unlock_irqrestore(&rq->lock, *flags);
+		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
 	}
 }
 
@@ -978,10 +970,13 @@ static void __task_rq_unlock(struct rq *
 	raw_spin_unlock(&rq->lock);
 }
 
-static inline void task_rq_unlock(struct rq *rq, unsigned long *flags)
+static inline void
+task_rq_unlock(struct rq *rq, struct task_struct *p, unsigned long *flags)
 	__releases(rq->lock)
+	__releases(p->pi_lock)
 {
-	raw_spin_unlock_irqrestore(&rq->lock, *flags);
+	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
 }
 
 /*
@@ -2178,6 +2173,11 @@ void set_task_cpu(struct task_struct *p,
 	 */
 	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
 			!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));
+
+#ifdef CONFIG_LOCKDEP
+	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
+				      lockdep_is_held(&task_rq(p)->lock)));
+#endif
 #endif
 
 	trace_sched_migrate_task(p, new_cpu);
@@ -2273,7 +2273,7 @@ unsigned long wait_task_inactive(struct 
 		ncsw = 0;
 		if (!match_state || p->state == match_state)
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-		task_rq_unlock(rq, &flags);
+		task_rq_unlock(rq, p, &flags);
 
 		/*
 		 * If it changed from the expected state, bail out now.
@@ -2639,6 +2639,7 @@ static void __sched_fork(struct task_str
  */
 void sched_fork(struct task_struct *p, int clone_flags)
 {
+	unsigned long flags;
 	int cpu = get_cpu();
 
 	__sched_fork(p);
@@ -2689,9 +2690,9 @@ void sched_fork(struct task_struct *p, i
 	 *
 	 * Silence PROVE_RCU.
 	 */
-	rcu_read_lock();
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	set_task_cpu(p, cpu);
-	rcu_read_unlock();
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	if (likely(sched_info_on()))
@@ -2740,7 +2741,7 @@ void wake_up_new_task(struct task_struct
 	set_task_cpu(p, cpu);
 
 	p->state = TASK_RUNNING;
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 #endif
 
 	rq = task_rq_lock(p, &flags);
@@ -2751,7 +2752,7 @@ void wake_up_new_task(struct task_struct
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
 #endif
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 	put_cpu();
 }
 
@@ -3476,12 +3477,12 @@ void sched_exec(void)
 	    likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
 		struct migration_arg arg = { p, dest_cpu };
 
-		task_rq_unlock(rq, &flags);
+		task_rq_unlock(rq, p, &flags);
 		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
 		return;
 	}
 unlock:
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 }
 
 #endif
@@ -3518,7 +3519,7 @@ unsigned long long task_delta_exec(struc
 
 	rq = task_rq_lock(p, &flags);
 	ns = do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 
 	return ns;
 }
@@ -3536,7 +3537,7 @@ unsigned long long task_sched_runtime(st
 
 	rq = task_rq_lock(p, &flags);
 	ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 
 	return ns;
 }
@@ -3560,7 +3561,7 @@ unsigned long long thread_group_sched_ru
 	rq = task_rq_lock(p, &flags);
 	thread_group_cputime(p, &totals);
 	ns = totals.sum_exec_runtime + do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 
 	return ns;
 }
@@ -4675,16 +4676,13 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	unsigned long flags;
 	int oldprio, on_rq, running;
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
 	BUG_ON(prio < 0 || prio > MAX_PRIO);
 
-	lockdep_assert_held(&p->pi_lock);
-
-	rq = task_rq_lock(p, &flags);
+	rq = __task_rq_lock(p);
 
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
@@ -4709,7 +4707,7 @@ void rt_mutex_setprio(struct task_struct
 		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
 
 	check_class_changed(rq, p, prev_class, oldprio);
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
 }
 
 #endif
@@ -4757,7 +4755,7 @@ void set_user_nice(struct task_struct *p
 			resched_task(rq->curr);
 	}
 out_unlock:
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 }
 EXPORT_SYMBOL(set_user_nice);
 
@@ -4979,20 +4977,17 @@ static int __sched_setscheduler(struct t
 	/*
 	 * make sure no PI-waiters arrive (or leave) while we are
 	 * changing the priority of the task:
-	 */
-	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	/*
+	 *
 	 * To be able to change p->policy safely, the apropriate
 	 * runqueue lock must be held.
 	 */
-	rq = __task_rq_lock(p);
+	rq = task_rq_lock(p, &flags);
 
 	/*
 	 * Changing the policy of the stop threads its a very bad idea
 	 */
 	if (p == rq->stop) {
-		__task_rq_unlock(rq);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		task_rq_unlock(rq, p, &flags);
 		return -EINVAL;
 	}
 
@@ -5005,8 +5000,7 @@ static int __sched_setscheduler(struct t
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0 &&
 				!task_group_is_autogroup(task_group(p))) {
-			__task_rq_unlock(rq);
-			raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+			task_rq_unlock(rq, p, &flags);
 			return -EPERM;
 		}
 	}
@@ -5015,8 +5009,7 @@ static int __sched_setscheduler(struct t
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
-		__task_rq_unlock(rq);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		task_rq_unlock(rq, p, &flags);
 		goto recheck;
 	}
 	on_rq = p->on_rq;
@@ -5038,8 +5031,7 @@ static int __sched_setscheduler(struct t
 		activate_task(rq, p, 0);
 
 	check_class_changed(rq, p, prev_class, oldprio);
-	__task_rq_unlock(rq);
-	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	task_rq_unlock(rq, p, &flags);
 
 	rt_mutex_adjust_pi(p);
 
@@ -5620,7 +5612,7 @@ SYSCALL_DEFINE2(sched_rr_get_interval, p
 
 	rq = task_rq_lock(p, &flags);
 	time_slice = p->sched_class->get_rr_interval(rq, p);
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, p, &flags);
 
 	rcu_read_unlock();
 	jiffies_to_timespec(time_slice, &t);
@@ -5843,8 +5835,7 @@ int set_cpus_allowed_ptr(struct task_str
 	unsigned int dest_cpu;
 	int ret = 0;
 
-	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	rq = __task_rq_lock(p);
+	rq = task_rq_lock(p, &flags);
 
 	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
 		ret = -EINVAL;
@@ -5872,15 +5863,13 @@ int set_cpus_allowed_ptr(struct task_str
 	if (need_migrate_task(p)) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
-		__task_rq_unlock(rq);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		task_rq_unlock(rq, p, &flags);
 		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
 		tlb_migrate_finish(p->mm);
 		return 0;
 	}
 out:
-	__task_rq_unlock(rq);
-	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	task_rq_unlock(rq, p, &flags);
 
 	return ret;
 }
@@ -5908,6 +5897,7 @@ static int __migrate_task(struct task_st
 	rq_src = cpu_rq(src_cpu);
 	rq_dest = cpu_rq(dest_cpu);
 
+	raw_spin_lock(&p->pi_lock);
 	double_rq_lock(rq_src, rq_dest);
 	/* Already moved. */
 	if (task_cpu(p) != src_cpu)
@@ -5930,6 +5920,7 @@ static int __migrate_task(struct task_st
 	ret = 1;
 fail:
 	double_rq_unlock(rq_src, rq_dest);
+	raw_spin_unlock(&p->pi_lock);
 	return ret;
 }
 
@@ -8656,7 +8647,7 @@ void sched_move_task(struct task_struct 
 	if (on_rq)
 		enqueue_task(rq, tsk, 0);
 
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, tsk, &flags);
 }
 #endif /* CONFIG_CGROUP_SCHED */
 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 14/22] sched: Drop rq->lock from first part of wake_up_new_task()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 13/22] sched: Add p->pi_lock to task_rq_lock() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 15/22] sched: Drop rq->lock from sched_exec() Peter Zijlstra
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-wunt.patch --]
[-- Type: text/plain, Size: 1619 bytes --]

Since p->pi_lock now protects everything needed to call
select_task_rq(), avoid the double remote rq->lock acquisition and rely
on p->pi_lock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2726,28 +2726,18 @@ void wake_up_new_task(struct task_struct
 {
 	unsigned long flags;
 	struct rq *rq;
-	int cpu __maybe_unused = get_cpu();
 
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_SMP
-	rq = task_rq_lock(p, &flags);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Fork balancing, do it here and not earlier because:
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
-	 *
-	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
-	 * without people poking at ->cpus_allowed.
 	 */
-	cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
-	set_task_cpu(p, cpu);
-
-	p->state = TASK_RUNNING;
-	task_rq_unlock(rq, p, &flags);
+	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
 #endif
 
-	rq = task_rq_lock(p, &flags);
+	rq = __task_rq_lock(p);
 	activate_task(rq, p, 0);
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
@@ -2756,7 +2746,6 @@ void wake_up_new_task(struct task_struct
 		p->sched_class->task_woken(rq, p);
 #endif
 	task_rq_unlock(rq, p, &flags);
-	put_cpu();
 }
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 15/22] sched: Drop rq->lock from sched_exec()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 14/22] sched: Drop rq->lock from first part of wake_up_new_task() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 16/22] sched: Remove rq->lock from the first half of ttwu() Peter Zijlstra
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-exec.patch --]
[-- Type: text/plain, Size: 1557 bytes --]

Since we can now call select_task_rq() and set_task_cpu() with only
p->pi_lock held, and sched_exec() load-balancing has always been
optimistic, drop all rq->lock usage.

Oleg also noted that need_migrate_task() will always be true for
current, so don't bother calling that at all.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3454,27 +3454,22 @@ void sched_exec(void)
 {
 	struct task_struct *p = current;
 	unsigned long flags;
-	struct rq *rq;
 	int dest_cpu;
 
-	rq = task_rq_lock(p, &flags);
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
-	/*
-	 * select_task_rq() can race against ->cpus_allowed
-	 */
-	if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed) &&
-	    likely(cpu_active(dest_cpu)) && need_migrate_task(p)) {
+	if (likely(cpu_active(dest_cpu))) {
 		struct migration_arg arg = { p, dest_cpu };
 
-		task_rq_unlock(rq, p, &flags);
-		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
 		return;
 	}
 unlock:
-	task_rq_unlock(rq, p, &flags);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 }
 
 #endif



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 16/22] sched: Remove rq->lock from the first half of ttwu()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 15/22] sched: Drop rq->lock from sched_exec() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 17/22] sched: Remove rq argument from ttwu_stat() Peter Zijlstra
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-optimize.patch --]
[-- Type: text/plain, Size: 3781 bytes --]

Currently ttwu() acquires rq->lock twice: once on the task's old rq,
holding it over the p->state fiddling and the load-balance pass, after
which it drops the old rq->lock to acquire the new rq->lock.

By having serialized ttwu(), p->sched_class and p->cpus_allowed with
p->pi_lock, we can now drop the whole first rq->lock acquisition.

Serializing concurrent ttwu() calls with p->pi_lock protects p->state,
which we set to TASK_WAKING to bridge possible p->pi_lock to rq->lock
gaps and to serialize set_task_cpu() calls against task_rq_lock().

The p->pi_lock serialization of p->sched_class allows us to call
scheduling class methods without holding the rq->lock, and the
serialization of p->cpus_allowed allows us to do the load-balancing
bits without races.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   63 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 28 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2486,69 +2486,76 @@ ttwu_post_activation(struct task_struct 
  * Returns %true if @p was woken up, %false if it was already running
  * or @state didn't match @p's state.
  */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
-			  int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
-	int cpu, orig_cpu, this_cpu, success = 0;
+	int cpu, this_cpu, success = 0;
 	unsigned long flags;
-	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
 
 	this_cpu = get_cpu();
 
 	smp_wmb();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	rq = __task_rq_lock(p);
 	if (!(p->state & state))
 		goto out;
 
 	cpu = task_cpu(p);
 
-	if (p->on_rq)
-		goto out_running;
+	if (p->on_rq) {
+		rq = __task_rq_lock(p);
+		if (p->on_rq)
+			goto out_running;
+		__task_rq_unlock(rq);
+	}
 
-	orig_cpu = cpu;
 #ifdef CONFIG_SMP
-	if (unlikely(task_running(rq, p)))
-		goto out_activate;
+	while (p->on_cpu) {
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+		/*
+		 * If called from interrupt context we could have landed in the
+		 * middle of schedule(), in this case we should take care not
+		 * to spin on ->on_cpu if p is current, since that would
+		 * deadlock.
+		 */
+		if (p == current)
+			goto out_activate;
+#endif
+		cpu_relax();
+	}
+	/*
+	 * Pairs with the smp_wmb() in finish_lock_switch().
+	 */
+	smp_rmb();
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
-	if (p->sched_class->task_waking) {
+	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
-		en_flags |= ENQUEUE_WAKING;
-	}
 
 	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
-	if (cpu != orig_cpu)
-		set_task_cpu(p, cpu);
-	__task_rq_unlock(rq);
+out_activate:
+#endif /* CONFIG_SMP */
 
 	rq = cpu_rq(cpu);
 	raw_spin_lock(&rq->lock);
 
-	/*
-	 * We migrated the task without holding either rq->lock, however
-	 * since the task is not on the task list itself, nobody else
-	 * will try and migrate the task, hence the rq should match the
-	 * cpu we just moved it to.
-	 */
-	WARN_ON(task_cpu(p) != cpu);
-	WARN_ON(p->state != TASK_WAKING);
+#ifdef CONFIG_SMP
+	if (cpu != task_cpu(p))
+		set_task_cpu(p, cpu);
 
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
+#endif
 
-out_activate:
-#endif /* CONFIG_SMP */
-	activate_task(rq, p, en_flags);
+	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
 	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
-out:
 	__task_rq_unlock(rq);
+out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	put_cpu();
 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 17/22] sched: Remove rq argument from ttwu_stat()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 16/22] sched: Remove rq->lock from the first half of ttwu() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 18/22] sched: Rename ttwu_post_activation Peter Zijlstra
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu_stat-rq.patch --]
[-- Type: text/plain, Size: 1568 bytes --]

In order to call ttwu_stat() without holding rq->lock, we must remove
its rq argument. Since we need to change rq stats, account to the local
rq instead of the task rq; this is safe since we have IRQs disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2367,10 +2367,11 @@ static void update_avg(u64 *avg, u64 sam
 #endif
 
 static void
-ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
+ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
 #ifdef CONFIG_SCHEDSTATS
 	int this_cpu = smp_processor_id();
+	struct rq *rq = this_rq();
 
 	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
@@ -2491,9 +2492,10 @@ try_to_wake_up(struct task_struct *p, un
 	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
 out_running:
 	ttwu_post_activation(p, rq, wake_flags);
-	ttwu_stat(rq, p, cpu, wake_flags);
 	success = 1;
 	__task_rq_unlock(rq);
+
+	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	put_cpu();
@@ -2527,7 +2529,7 @@ static void try_to_wake_up_local(struct 
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_post_activation(p, rq, 0);
-	ttwu_stat(rq, p, smp_processor_id(), 0);
+	ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 18/22] sched: Rename ttwu_post_activation
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 17/22] sched: Remove rq argument from ttwu_stat() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 19/22] sched: Restructure ttwu some more Peter Zijlstra
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu_do_wakeup.patch --]
[-- Type: text/plain, Size: 1400 bytes --]

ttwu_post_activation() does the core wakeup: it sets TASK_RUNNING and
performs wakeup-preemption, so give it a more descriptive name.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2399,8 +2399,11 @@ ttwu_stat(struct task_struct *p, int cpu
 #endif /* CONFIG_SCHEDSTATS */
 }
 
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
 static void
-ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	trace_sched_wakeup(p, true);
 	check_preempt_curr(rq, p, wake_flags);
@@ -2492,7 +2495,7 @@ try_to_wake_up(struct task_struct *p, un
 
 	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
 out_running:
-	ttwu_post_activation(p, rq, wake_flags);
+	ttwu_do_wakeup(rq, p, wake_flags);
 	success = 1;
 	__task_rq_unlock(rq);
 
@@ -2529,7 +2532,7 @@ static void try_to_wake_up_local(struct 
 	if (!p->on_rq)
 		activate_task(rq, p, ENQUEUE_WAKEUP);
 
-	ttwu_post_activation(p, rq, 0);
+	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 19/22] sched: Restructure ttwu some more
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 18/22] sched: Rename ttwu_post_activation Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-foo.patch --]
[-- Type: text/plain, Size: 3241 bytes --]

These are the last few changes to ttwu() needed to allow queueing
wakeups on a remote cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   83 +++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 55 insertions(+), 28 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2475,6 +2475,48 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
+static void
+ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+#ifdef CONFIG_SMP
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible--;
+#endif
+
+	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
+	ttwu_do_wakeup(rq, p, wake_flags);
+}
+
+/*
+ * Called in case the task @p isn't fully descheduled from its runqueue,
+ * in this case we must do a remote wakeup. Its a 'light' wakeup though,
+ * since all we need to do is flip p->state to TASK_RUNNING, since
+ * the task is still ->on_rq.
+ */
+static int ttwu_remote(struct task_struct *p, int wake_flags)
+{
+	struct rq *rq;
+	int ret = 0;
+
+	rq = __task_rq_lock(p);
+	if (p->on_rq) {
+		ttwu_do_wakeup(rq, p, wake_flags);
+		ret = 1;
+	}
+	__task_rq_unlock(rq);
+
+	return ret;
+}
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	raw_spin_lock(&rq->lock);
+	ttwu_do_activate(rq, p, 0);
+	raw_spin_unlock(&rq->lock);
+}
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2493,27 +2535,25 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 static int
 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
-	int cpu, this_cpu, success = 0;
 	unsigned long flags;
-	struct rq *rq;
-
-	this_cpu = get_cpu();
+	int cpu, success = 0;
 
 	smp_wmb();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
 
+	success = 1; /* we're going to change ->state */
 	cpu = task_cpu(p);
 
-	if (p->on_rq) {
-		rq = __task_rq_lock(p);
-		if (p->on_rq)
-			goto out_running;
-		__task_rq_unlock(rq);
-	}
+	if (p->on_rq && ttwu_remote(p, wake_flags))
+		goto stat;
 
 #ifdef CONFIG_SMP
+	/*
+	 * If the owning (remote) cpu is still in the middle of schedule() with
+	 * this task as prev, wait until its done referencing the task.
+	 */
 	while (p->on_cpu) {
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 		/*
@@ -2539,30 +2579,17 @@ try_to_wake_up(struct task_struct *p, un
 		p->sched_class->task_waking(p);
 
 	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
-out_activate:
-#endif /* CONFIG_SMP */
-
-	rq = cpu_rq(cpu);
-	raw_spin_lock(&rq->lock);
-
-#ifdef CONFIG_SMP
-	if (cpu != task_cpu(p))
+	if (task_cpu(p) != cpu)
 		set_task_cpu(p, cpu);
 
-	if (p->sched_contributes_to_load)
-		rq->nr_uninterruptible--;
-#endif
-
-	activate_task(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
-out_running:
-	ttwu_do_wakeup(rq, p, wake_flags);
-	success = 1;
-	__task_rq_unlock(rq);
+out_activate:
+#endif /* CONFIG_SMP */
 
+	ttwu_queue(p, cpu);
+stat:
 	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-	put_cpu();
 
 	return success;
 }



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 19/22] sched: Restructure ttwu some more Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-11  1:44   ` Frank Rowand
  2011-03-02 17:38 ` [PATCH 21/22] sched: Remove need_migrate_task() Peter Zijlstra
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-ttwu-queue-remote.patch --]
[-- Type: text/plain, Size: 3766 bytes --]

Now that we've removed the rq->lock requirement from the first part of
ttwu() and can compute placement without holding any rq->lock, ensure
we execute the second half of ttwu() on the actual cpu we want the
task to run on.

This avoids having to take the remote rq->lock and do the task enqueue
from the waking cpu, saving a lot of cacheline transfers.
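
The queueing uses a lock-free multi-producer/single-consumer list per
runqueue; roughly (sketch only, slightly simplified from the hunks
below, which add ttwu_queue_remote() and sched_ttwu_pending()):

/* waker: push p onto the remote cpu's wake_list without its rq->lock */
static void ttwu_queue_remote_sketch(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *next = rq->wake_list;

	for (;;) {
		struct task_struct *old = next;

		p->wake_entry = next;			/* link p in front */
		next = cmpxchg(&rq->wake_list, old, p);
		if (next == old)
			break;				/* push succeeded */
	}

	if (!next)				/* empty -> non-empty: */
		smp_send_reschedule(cpu);	/* a single IPI is enough */
}

/* target cpu, from scheduler_ipi(): drain the list under its own lock */
static void sched_ttwu_pending_sketch(void)
{
	struct rq *rq = this_rq();
	struct task_struct *list = xchg(&rq->wake_list, NULL);

	if (!list)
		return;

	raw_spin_lock(&rq->lock);		/* local, rarely contended */
	while (list) {
		struct task_struct *p = list;

		list = list->wake_entry;
		ttwu_do_activate(rq, p, 0);
	}
	raw_spin_unlock(&rq->lock);
}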

As measured using: http://oss.oracle.com/~mason/sembench.c

$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 537953 worker burns per second
patched:   run time 30 seconds 847526 worker burns per second

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h   |    4 +--
 kernel/sched.c          |   56 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_features.h |    2 +
 3 files changed, 60 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1202,6 +1201,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *wake_entry;
 	int on_cpu;
 #endif
 	int on_rq;
@@ -2185,7 +2185,7 @@ extern void set_task_comm(struct task_st
 extern char *get_task_comm(char *to, struct task_struct *tsk);
 
 #ifdef CONFIG_SMP
-static inline void scheduler_ipi(void) { }
+void scheduler_ipi(void);
 extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
 #else
 static inline unsigned long wait_task_inactive(struct task_struct *p,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -557,6 +557,10 @@ struct rq {
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
 #endif
+
+#ifdef CONFIG_SMP
+	struct task_struct *wake_list;
+#endif
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -2511,10 +2515,61 @@ static int ttwu_remote(struct task_struc
 	return ret;
 }
 
+#ifdef CONFIG_SMP
+void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	raw_spin_lock(&rq->lock);
+
+	while (list) {
+		struct task_struct *p = list;
+		list = list->wake_entry;
+		ttwu_do_activate(rq, p, 0);
+	}
+
+	raw_spin_unlock(&rq->lock);
+}
+
+void scheduler_ipi(void)
+{
+	sched_ttwu_pending();
+}
+
+static void ttwu_queue_remote(struct task_struct *p, int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *next = rq->wake_list;
+
+	for (;;) {
+		struct task_struct *old = next;
+
+		p->wake_entry = next;
+		next = cmpxchg(&rq->wake_list, old, p);
+		if (next == old)
+			break;
+	}
+
+	if (!next)
+		smp_send_reschedule(cpu);
+}
+#endif
+
 static void ttwu_queue(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+#ifdef CONFIG_SMP
+	if (!sched_feat(TTWU_FORCE_REMOTE) && cpu != smp_processor_id()) {
+		ttwu_queue_remote(p, cpu);
+		return;
+	}
+#endif
+
 	raw_spin_lock(&rq->lock);
 	ttwu_do_activate(rq, p, 0);
 	raw_spin_unlock(&rq->lock);
@@ -6287,6 +6342,7 @@ migration_call(struct notifier_block *nf
 
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_DYING:
+		sched_ttwu_pending();
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -64,3 +64,5 @@ SCHED_FEAT(OWNER_SPIN, 1)
  * Decrement CPU power based on irq activity
  */
 SCHED_FEAT(NONIRQ_POWER, 1)
+
+SCHED_FEAT(TTWU_FORCE_REMOTE, 0)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 21/22] sched: Remove need_migrate_task()
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-02 17:38 ` [PATCH 22/22] sched: Remove TASK_WAKING Peter Zijlstra
  2011-03-11  1:51 ` [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Frank Rowand
  22 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-opt-need_migrate_task.patch --]
[-- Type: text/plain, Size: 1569 bytes --]

Oleg noticed that need_migrate_task() doesn't need the ->on_cpu check
now that ttwu() doesn't do remote enqueues for !->on_rq && ->on_cpu,
so remove the helper and replace the single instance with a direct
->on_rq test.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c |   17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2139,21 +2139,6 @@ struct migration_arg {
 static int migration_cpu_stop(void *data);
 
 /*
- * The task's runqueue lock must be held.
- * Returns true if you have to wait for migration thread.
- */
-static bool need_migrate_task(struct task_struct *p)
-{
-	/*
-	 * If the task is not on a runqueue (and not running), then
-	 * the next wake-up will properly place the task.
-	 */
-	bool running = p->on_rq || p->on_cpu;
-	smp_rmb(); /* finish_lock_switch() */
-	return running;
-}
-
-/*
  * wait_task_inactive - wait for a thread to unschedule.
  *
  * If @match_state is nonzero, it's the @p->state value just checked and
@@ -5734,7 +5719,7 @@ int set_cpus_allowed_ptr(struct task_str
 		goto out;
 
 	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
-	if (need_migrate_task(p)) {
+	if (p->on_rq) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, p, &flags);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 22/22] sched: Remove TASK_WAKING
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 21/22] sched: Remove need_migrate_task() Peter Zijlstra
@ 2011-03-02 17:38 ` Peter Zijlstra
  2011-03-11  1:49   ` Frank Rowand
  2011-03-11  1:51 ` [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Frank Rowand
  22 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-02 17:38 UTC (permalink / raw)
  To: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-remove-TASK_WAKING.patch --]
[-- Type: text/plain, Size: 2176 bytes --]

With the new locking, TASK_WAKING has become obsolete; remove it.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 fs/proc/array.c       |    1 -
 include/linux/sched.h |    5 ++---
 kernel/sched.c        |    4 ++--
 3 files changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -141,7 +141,6 @@ static const char *task_state_array[] = 
 	"X (dead)",		/*  32 */
 	"x (dead)",		/*  64 */
 	"K (wakekill)",		/* 128 */
-	"W (waking)",		/* 256 */
 };
 
 static inline const char *get_task_state(struct task_struct *tsk)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -189,10 +189,9 @@ print_cfs_rq(struct seq_file *m, int cpu
 /* in tsk->state again */
 #define TASK_DEAD		64
 #define TASK_WAKEKILL		128
-#define TASK_WAKING		256
-#define TASK_STATE_MAX		512
+#define TASK_STATE_MAX		256
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxK"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2175,7 +2175,7 @@ void set_task_cpu(struct task_struct *p,
 	 * We should never call set_task_cpu() on a blocked task,
 	 * ttwu() will sort out the placement.
 	 */
-	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
+	WARN_ON_ONCE(p->state != TASK_RUNNING &&
 			!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));
 
 #ifdef CONFIG_LOCKDEP
@@ -2613,7 +2613,7 @@ try_to_wake_up(struct task_struct *p, un
 	smp_rmb();
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
+	p->state = TASK_RUNNING;
 
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-02 17:38 ` [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule() Peter Zijlstra
@ 2011-03-11  1:36   ` Frank Rowand
  2011-03-16  8:30     ` Peter Zijlstra
  2011-03-11 15:07   ` [01/22] " Milton Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Frank Rowand @ 2011-03-11  1:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Russell King, Martin Schwidefsky,
	Chris Metcalf, Jesper Nilsson, Benjamin Herrenschmidt,
	Ralf Baechle

On 03/02/11 09:38, Peter Zijlstra wrote:
> For future rework of try_to_wake_up() we'd like to push part of that
> onto the CPU the task is actually going to run on, in order to do so we
> need a generic callback from the existing scheduler IPI.
> 
> This patch introduces such a generic callback: scheduler_ipi() and
> implements it as a NOP.
> 
> BenH notes: PowerPC might use this IPI on offline CPUs under rare
> conditions!!
> 
> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Acked-by: Chris Metcalf <cmetcalf@tilera.com>
> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <new-submission>
> ---
>  arch/alpha/kernel/smp.c             |    3 +--
>  arch/arm/kernel/smp.c               |    5 +----
>  arch/blackfin/mach-common/smp.c     |    3 +++
>  arch/cris/arch-v32/kernel/smp.c     |   13 ++++++++-----
>  arch/ia64/kernel/irq_ia64.c         |    2 ++
>  arch/ia64/xen/irq_xen.c             |   10 +++++++++-
>  arch/m32r/kernel/smp.c              |    4 +---
>  arch/mips/cavium-octeon/smp.c       |    2 ++
>  arch/mips/kernel/smtc.c             |    2 +-
>  arch/mips/mti-malta/malta-int.c     |    2 ++
>  arch/mips/pmc-sierra/yosemite/smp.c |    4 ++++
>  arch/mips/sgi-ip27/ip27-irq.c       |    2 ++
>  arch/mips/sibyte/bcm1480/smp.c      |    7 +++----
>  arch/mips/sibyte/sb1250/smp.c       |    7 +++----
>  arch/mn10300/kernel/smp.c           |    5 +----
>  arch/parisc/kernel/smp.c            |    5 +----
>  arch/powerpc/kernel/smp.c           |    3 ++-
>  arch/s390/kernel/smp.c              |    6 +++---
>  arch/sh/kernel/smp.c                |    2 ++
>  arch/sparc/kernel/smp_32.c          |    2 +-
>  arch/sparc/kernel/smp_64.c          |    1 +
>  arch/tile/kernel/smp.c              |    6 +-----
>  arch/um/kernel/smp.c                |    2 +-
>  arch/x86/kernel/smp.c               |    5 ++---
>  arch/x86/xen/smp.c                  |    5 ++---
>  include/linux/sched.h               |    1 +
>  26 files changed, 60 insertions(+), 49 deletions(-)
> 

< snip >

In this patch the tabs turned into spaces.  (None of the other
patches have this problem.)


> Index: linux-2.6/arch/sparc/kernel/smp_32.c
> ===================================================================
> --- linux-2.6.orig/arch/sparc/kernel/smp_32.c
> +++ linux-2.6/arch/sparc/kernel/smp_32.c
> @@ -125,7 +125,7 @@ struct linux_prom_registers smp_penguin_
> 
>  void smp_send_reschedule(int cpu)
>  {
> -       /* See sparc64 */
> +       scheduler_ipi();
>  }

If I understand correctly, this is calling scheduler_ipi() on the
cpu that should be sending an IPI, not on the cpu receiving the IPI.
If so, smp_send_reschedule() needs to send an IPI, and the scheduler_ipi()
call needs to be put where the IPI is processed.
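
For illustration, a rough sketch of that split (the send-side primitive is
a hypothetical placeholder; the powerpc part of this same patch uses the
same receive-side pattern):

void smp_send_reschedule(int cpu)
{
	/* sender side: only raise the IPI towards @cpu */
	arch_send_ipi(cpu, IPI_RESCHEDULE);	/* hypothetical primitive */
}

static irqreturn_t reschedule_action(int irq, void *data)
{
	/* receiver side: runs in the IPI handler on the target cpu */
	scheduler_ipi();
	return IRQ_HANDLED;
}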

-Frank


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu
  2011-03-02 17:38 ` [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
@ 2011-03-11  1:44   ` Frank Rowand
  2011-03-16  8:32     ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Frank Rowand @ 2011-03-11  1:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel

On 03/02/11 09:38, Peter Zijlstra wrote:
> Now that we've removed the rq->lock requirement from the first part of
> ttwu() and can compute placement without holding any rq->lock, ensure
> we execute the second half of ttwu() on the actual cpu we want the
> task to run on.
> 
> This avoids having to take rq->lock and doing the task enqueue
> remotely, saving lots on cacheline transfers.
> 
> As measured using: http://oss.oracle.com/~mason/sembench.c
> 
> $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
> $ ./sembench -t 2048 -w 1900 -o 0
> 
> unpatched: run time 30 seconds 537953 worker burns per second
> patched:   run time 30 seconds 847526 worker burns per second
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <new-submission>
> ---
>  include/linux/sched.h   |    4 +--
>  kernel/sched.c          |   56 ++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched_features.h |    2 +
>  3 files changed, 60 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -1202,6 +1201,7 @@ struct task_struct {
>  	int lock_depth;		/* BKL lock depth */
>  
>  #ifdef CONFIG_SMP
> +	struct task_struct *wake_entry;
>  	int on_cpu;
>  #endif
>  	int on_rq;
> @@ -2185,7 +2185,7 @@ extern void set_task_comm(struct task_st
>  extern char *get_task_comm(char *to, struct task_struct *tsk);
>  
>  #ifdef CONFIG_SMP
> -static inline void scheduler_ipi(void) { }
> +void scheduler_ipi(void);
>  extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
>  #else
>  static inline unsigned long wait_task_inactive(struct task_struct *p,
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -557,6 +557,10 @@ struct rq {
>  	unsigned int ttwu_count;
>  	unsigned int ttwu_local;
>  #endif
> +
> +#ifdef CONFIG_SMP
> +	struct task_struct *wake_list;
> +#endif
>  };
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> @@ -2511,10 +2515,61 @@ static int ttwu_remote(struct task_struc
>  	return ret;
>  }
>  
> +#ifdef CONFIG_SMP
> +void sched_ttwu_pending(void)
> +{

sched_ttwu_pending() is now only used in sched.c, so it can be static
(in the previous patch version it was called from other files).

> +	struct rq *rq = this_rq();
> +	struct task_struct *list = xchg(&rq->wake_list, NULL);
> +
> +	if (!list)
> +		return;
> +
> +	raw_spin_lock(&rq->lock);
> +
> +	while (list) {
> +		struct task_struct *p = list;
> +		list = list->wake_entry;
> +		ttwu_do_activate(rq, p, 0);
> +	}
> +
> +	raw_spin_unlock(&rq->lock);
> +}
> +
> +void scheduler_ipi(void)
> +{
> +	sched_ttwu_pending();
> +}
> +
> +static void ttwu_queue_remote(struct task_struct *p, int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	struct task_struct *next = rq->wake_list;
> +
> +	for (;;) {
> +		struct task_struct *old = next;
> +
> +		p->wake_entry = next;
> +		next = cmpxchg(&rq->wake_list, old, p);
> +		if (next == old)
> +			break;
> +	}
> +
> +	if (!next)
> +		smp_send_reschedule(cpu);
> +}
> +#endif
> +
>  static void ttwu_queue(struct task_struct *p, int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  
> +#ifdef CONFIG_SMP
> +	if (!sched_feat(TTWU_FORCE_REMOTE) && cpu != smp_processor_id()) {
> +		ttwu_queue_remote(p, cpu);
> +		return;
> +	}
> +#endif
> +
>  	raw_spin_lock(&rq->lock);
>  	ttwu_do_activate(rq, p, 0);
>  	raw_spin_unlock(&rq->lock);
> @@ -6287,6 +6342,7 @@ migration_call(struct notifier_block *nf
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  	case CPU_DYING:

Should pi_lock be locked here, so that additional wake ups can not
be put on the wake list in the window after sched_ttwu_pending()
completes, and before set_rq_offline(rq) is called?  If so, then
of course unlock pi_lock after the matching
"raw_spin_unlock_irqrestore(&rq->lock, flags);"

> +		sched_ttwu_pending();
>  		/* Update our root-domain */
>  		raw_spin_lock_irqsave(&rq->lock, flags);
>  		if (rq->rd) {
> Index: linux-2.6/kernel/sched_features.h
> ===================================================================
> --- linux-2.6.orig/kernel/sched_features.h
> +++ linux-2.6/kernel/sched_features.h
> @@ -64,3 +64,5 @@ SCHED_FEAT(OWNER_SPIN, 1)
>   * Decrement CPU power based on irq activity
>   */
>  SCHED_FEAT(NONIRQ_POWER, 1)
> +
> +SCHED_FEAT(TTWU_FORCE_REMOTE, 0)
> 
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 22/22] sched: Remove TASK_WAKING
  2011-03-02 17:38 ` [PATCH 22/22] sched: Remove TASK_WAKING Peter Zijlstra
@ 2011-03-11  1:49   ` Frank Rowand
  2011-03-16  9:53     ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Frank Rowand @ 2011-03-11  1:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel

On 03/02/11 09:38, Peter Zijlstra wrote:
> With the new locking TASK_WAKING has become obsolete, remove it.
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <new-submission>
> ---

< snip >

> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2175,7 +2175,7 @@ void set_task_cpu(struct task_struct *p,
>  	 * We should never call set_task_cpu() on a blocked task,
>  	 * ttwu() will sort out the placement.
>  	 */
> -	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
> +	WARN_ON_ONCE(p->state != TASK_RUNNING &&
>  			!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));
>  
>  #ifdef CONFIG_LOCKDEP
> @@ -2613,7 +2613,7 @@ try_to_wake_up(struct task_struct *p, un
>  	smp_rmb();
>  
>  	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> -	p->state = TASK_WAKING;
> +	p->state = TASK_RUNNING;
>  
>  	if (p->sched_class->task_waking)
>  		p->sched_class->task_waking(p);

No harm if coded as in the patch, but here is an alternate suggestion
if you like it:

The only reason left for "p->state = TASK_RUNNING;" here is when
cpu is remote.  If cpu is not remote then p->state will be set by:

   ttwu_queue()
      ttwu_do_activate()
         ttwu_do_wakeup()
            p->state = TASK_RUNNING;

It would be clearer that setting the state to TASK_RUNNING protects the
process until sched_ttwu_pending() has removed it from the wake_list if
the p->state = TASK_RUNNING assignment were moved into ttwu_queue_remote().
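
A minimal sketch of that alternative, based on the ttwu_queue_remote()
added by patch 20 earlier in this series (untested, shown only to make the
suggested placement concrete):

static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *next = rq->wake_list;

	/* mark the task runnable before it becomes visible on wake_list */
	p->state = TASK_RUNNING;

	for (;;) {
		struct task_struct *old = next;

		p->wake_entry = next;
		next = cmpxchg(&rq->wake_list, old, p);
		if (next == old)
			break;
	}

	if (!next)
		smp_send_reschedule(cpu);
}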


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/22] sched: Reduce runqueue lock contention -v5
  2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2011-03-02 17:38 ` [PATCH 22/22] sched: Remove TASK_WAKING Peter Zijlstra
@ 2011-03-11  1:51 ` Frank Rowand
  22 siblings, 0 replies; 34+ messages in thread
From: Frank Rowand @ 2011-03-11  1:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel

On 03/02/11 09:38, Peter Zijlstra wrote:
> This patch series aims to optimize remote wakeups by moving most of the
> work of the wakeup to the remote cpu and avoid bouncing runqueue data
> structures where possible.
> 
> If there are no more 'fun' bits left I'll queue this work for .40.

For patches 01 through 22:

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>

(Comments in separate replies to patches 01, 20, 22.)

The patches just keep getting better...

-Frank


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-02 17:38 ` [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule() Peter Zijlstra
  2011-03-11  1:36   ` Frank Rowand
@ 2011-03-11 15:07   ` Milton Miller
  2011-03-11 15:27     ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Milton Miller @ 2011-03-11 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Peter Zijlstra, Russell King,
	Martin Schwidefsky, Chris Metcalf, Jesper Nilsson,
	Benjamin Herrenschmidt, Ralf Baechle

On Wed, 02 Mar 2011 17:38:32 -0000, Peter Zijlstra wrote:

> For future rework of try_to_wake_up() we'd like to push part of that
> onto the CPU the task is actually going to run on, in order to do so we
> need a generic callback from the existing scheduler IPI.
> 
> This patch introduces such a generic callback: scheduler_ipi() and
> implements it as a NOP.
> 
> BenH notes: PowerPC might use this IPI on offline CPUs under rare
> conditions!!
..
> Index: linux-2.6/arch/powerpc/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/smp.c
> +++ linux-2.6/arch/powerpc/kernel/smp.c
> @@ -98,6 +98,7 @@ void smp_message_recv(int msg)
>  		break;
>  	case PPC_MSG_RESCHEDULE:
>  		/* we notice need_resched on exit */

This comment should also be removed, as it was documenting the empty action.

> +		scheduler_ipi();
>  		break;
>  	case PPC_MSG_CALL_FUNC_SINGLE:
>  		generic_smp_call_function_single_interrupt();
> @@ -127,7 +128,7 @@ static irqreturn_t call_function_action(
>  
>  static irqreturn_t reschedule_action(int irq, void *data)
>  {
> -	/* we just need the return path side effect of checking need_resched */
> +	scheduler_ipi();
>  	return IRQ_HANDLED;
>  }
>  

Thanks,
milton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-11 15:07   ` [01/22] " Milton Miller
@ 2011-03-11 15:27     ` Peter Zijlstra
  2011-03-15  3:59       ` Milton Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-11 15:27 UTC (permalink / raw)
  To: Milton Miller
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Russell King, Martin Schwidefsky,
	Chris Metcalf, Jesper Nilsson, Benjamin Herrenschmidt,
	Ralf Baechle

On Fri, 2011-03-11 at 09:07 -0600, Milton Miller wrote:
> >       case PPC_MSG_RESCHEDULE:
> >               /* we notice need_resched on exit */
> 
> This comment should also be removed, as it was documenting the empty
> action. 

But it's still true, TIF_NEED_RESCHED isn't going away and we still
notice that on the interrupt return path.
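
For readers following along, "notice on exit" refers to the usual check in
each architecture's interrupt-return path; schematically (hypothetical
helper name, the real check lives in the arch entry code):

static void irq_return_resched_check(void)
{
	/* reschedule before leaving the interrupt if the flag is set */
	if (test_thread_flag(TIF_NEED_RESCHED))
		schedule();
}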

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-11 15:27     ` Peter Zijlstra
@ 2011-03-15  3:59       ` Milton Miller
  2011-03-15  9:13         ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Milton Miller @ 2011-03-15  3:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Russell King, Martin Schwidefsky,
	Chris Metcalf, Jesper Nilsson, Benjamin Herrenschmidt,
	Ralf Baechle

On Fri, 11 Mar 2011 16:27:27 +0100, Peter Zijlstra wrote:
> 
> On Fri, 2011-03-11 at 09:07 -0600, Milton Miller wrote:
> > >       case PPC_MSG_RESCHEDULE:
> > >               /* we notice need_resched on exit */
> > > +		scheduler_ipi();
> >
> > This comment should also be removed, as it was documenting the empty
> > action.
> 
> But its still true, TIF_NEED_RESCHED isn't going away and we still
> notice that on the interrupt return path.

Just because it is true does not mean it is useful.  Why is this site
different from the twenty-odd other call sites that it needs to document
the behaviors required upon exiting the kernel from interrupt deep in this
interrupt handling call path?  Why is it not placed after the call, where
it would talk about what happens next instead of what will happen later?
For that matter, what does "on exit" mean?

I went ahead and sent this when I saw the comments questioning the
sparc 32 part of the patch and thought you might respin.  But if this is
already committed then I will submit a followup patch for consideration
after this is merged.

milton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-15  3:59       ` Milton Miller
@ 2011-03-15  9:13         ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-15  9:13 UTC (permalink / raw)
  To: Milton Miller
  Cc: Chris Mason, Frank Rowand, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Russell King, Martin Schwidefsky,
	Chris Metcalf, Jesper Nilsson, Benjamin Herrenschmidt,
	Ralf Baechle

On Mon, 2011-03-14 at 21:59 -0600, Milton Miller wrote:
> 
> I went ahead and sent this when I saw the comments questioning the
> sparc 32 part of the patch and thought you might respin.  But if this is
> already committed then I will submit a followup patch for consideration
> after this is merged. 

No, it's not yet committed, and I do indeed need to respin. If you want
that comment gone, consider it so :-)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()
  2011-03-11  1:36   ` Frank Rowand
@ 2011-03-16  8:30     ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-16  8:30 UTC (permalink / raw)
  To: frank.rowand
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel, Russell King, Martin Schwidefsky,
	Chris Metcalf, Jesper Nilsson, Benjamin Herrenschmidt,
	Ralf Baechle, davem

On Thu, 2011-03-10 at 17:36 -0800, Frank Rowand wrote:
> 
> > Index: linux-2.6/arch/sparc/kernel/smp_32.c
> > ===================================================================
> > --- linux-2.6.orig/arch/sparc/kernel/smp_32.c
> > +++ linux-2.6/arch/sparc/kernel/smp_32.c
> > @@ -125,7 +125,7 @@ struct linux_prom_registers smp_penguin_
> > 
> >  void smp_send_reschedule(int cpu)
> >  {
> > -       /* See sparc64 */
> > +       scheduler_ipi();
> >  }
> 
> If I understand correctly, this is calling scheduler_ipi() on the
> cpu that should be sending an IPI, not on the cpu receiving the IPI.
> If so, smp_send_reschedule() needs to send an IPI, and the scheduler_ipi()
> call needs to be put where the IPI is processed.

Yes, D'oh.. Dave, how does sparc32 work here? There appears to be an IPI
missing.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu
  2011-03-11  1:44   ` Frank Rowand
@ 2011-03-16  8:32     ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-16  8:32 UTC (permalink / raw)
  To: frank.rowand
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel

On Thu, 2011-03-10 at 17:44 -0800, Frank Rowand wrote:
> > @@ -6287,6 +6342,7 @@ migration_call(struct notifier_block *nf
> >  
> >  #ifdef CONFIG_HOTPLUG_CPU
> >       case CPU_DYING:
> 
> Should pi_lock be locked here, so that additional wake ups can not
> be put on the wake list in the window after sched_ttwu_pending()
> completes, and before set_rq_offline(rq) is called?  If so, then
> of course unlock pi_lock after the matching
> "raw_spin_unlock_irqrestore(&rq->lock, flags);"

The cpu should be offline already, so select_task_rq() will never return
it and hence no new tasks should end up on this list.

> > +             sched_ttwu_pending();
> >               /* Update our root-domain */ 
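
For context, the generic placement filter behind that answer looked roughly
like this at the time (paraphrased from the era's kernel/sched.c, so the
exact signature may differ from the patched tree):

static inline
int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

	/* never hand out a disallowed or offline cpu */
	if (unlikely(!cpumask_test_cpu(cpu, &p->cpus_allowed) ||
		     !cpu_online(cpu)))
		cpu = select_fallback_rq(task_cpu(p), p);

	return cpu;
}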


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 22/22] sched: Remove TASK_WAKING
  2011-03-11  1:49   ` Frank Rowand
@ 2011-03-16  9:53     ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-03-16  9:53 UTC (permalink / raw)
  To: frank.rowand
  Cc: Chris Mason, Rowand, Frank, Ingo Molnar, Thomas Gleixner,
	Mike Galbraith, Oleg Nesterov, Paul Turner, Jens Axboe,
	Yong Zhang, linux-kernel

On Thu, 2011-03-10 at 17:49 -0800, Frank Rowand wrote:
> On 03/02/11 09:38, Peter Zijlstra wrote:
> > With the new locking TASK_WAKING has become obsolete, remove it.
> > 
> > Suggested-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > LKML-Reference: <new-submission>
> > ---
> 
> < snip >
> 
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -2175,7 +2175,7 @@ void set_task_cpu(struct task_struct *p,
> >  	 * We should never call set_task_cpu() on a blocked task,
> >  	 * ttwu() will sort out the placement.
> >  	 */
> > -	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
> > +	WARN_ON_ONCE(p->state != TASK_RUNNING &&
> >  			!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));
> >  
> >  #ifdef CONFIG_LOCKDEP
> > @@ -2613,7 +2613,7 @@ try_to_wake_up(struct task_struct *p, un
> >  	smp_rmb();
> >  
> >  	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> > -	p->state = TASK_WAKING;
> > +	p->state = TASK_RUNNING;
> >  
> >  	if (p->sched_class->task_waking)
> >  		p->sched_class->task_waking(p);
> 
> No harm if coded as in the patch, but here is an alternate suggestion
> if you like it:
> 
> The only reason left for "p->state = TASK_RUNNING;" here is when
> cpu is remote.  If cpu is not remote then p->state will be set by:
> 
>    ttwu_queue()
>       ttwu_do_activate()
>          ttwu_do_wakeup()
>             p->state = TASK_RUNNING;
> 
> It would be clearer that setting the state to TASK_RUNNING protects the
> process until sched_ttwu_pending() has removed it from the wake_list if
> the p->state = TASK_RUNNING assignment were moved into ttwu_queue_remote().
> 

Yeah, it's a bit of a maze.. maybe we should just drop this and keep the
slightly redundant but clearer TASK_WAKING around.




^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-03-16  9:51 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-02 17:38 [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Peter Zijlstra
2011-03-02 17:38 ` [PATCH 01/22] sched: Provide scheduler_ipi() callback in response to smp_send_reschedule() Peter Zijlstra
2011-03-11  1:36   ` Frank Rowand
2011-03-16  8:30     ` Peter Zijlstra
2011-03-11 15:07   ` [01/22] " Milton Miller
2011-03-11 15:27     ` Peter Zijlstra
2011-03-15  3:59       ` Milton Miller
2011-03-15  9:13         ` Peter Zijlstra
2011-03-02 17:38 ` [PATCH 02/22] sched: Always provide p->on_cpu Peter Zijlstra
2011-03-02 17:38 ` [PATCH 03/22] mutex: Use p->on_cpu for the adaptive spin Peter Zijlstra
2011-03-02 17:38 ` [PATCH 04/22] sched: Change the ttwu success details Peter Zijlstra
2011-03-02 17:38 ` [PATCH 05/22] sched: Clean up ttwu stats Peter Zijlstra
2011-03-02 17:38 ` [PATCH 06/22] sched: Provide p->on_rq Peter Zijlstra
2011-03-02 17:38 ` [PATCH 07/22] sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Peter Zijlstra
2011-03-02 17:38 ` [PATCH 08/22] sched: Drop the rq argument to sched_class::select_task_rq() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 09/22] sched: Remove rq argument to sched_class::task_waking() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 10/22] sched: Deal with non-atomic min_vruntime reads on 32bits Peter Zijlstra
2011-03-02 17:38 ` [PATCH 11/22] sched: Delay task_contributes_to_load() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 12/22] sched: Also serialize ttwu_local() with p->pi_lock Peter Zijlstra
2011-03-02 17:38 ` [PATCH 13/22] sched: Add p->pi_lock to task_rq_lock() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 14/22] sched: Drop rq->lock from first part of wake_up_new_task() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 15/22] sched: Drop rq->lock from sched_exec() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 16/22] sched: Remove rq->lock from the first half of ttwu() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 17/22] sched: Remove rq argument from ttwu_stat() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 18/22] sched: Rename ttwu_post_activation Peter Zijlstra
2011-03-02 17:38 ` [PATCH 19/22] sched: Restructure ttwu some more Peter Zijlstra
2011-03-02 17:38 ` [PATCH 20/22] sched: Move the second half of ttwu() to the remote cpu Peter Zijlstra
2011-03-11  1:44   ` Frank Rowand
2011-03-16  8:32     ` Peter Zijlstra
2011-03-02 17:38 ` [PATCH 21/22] sched: Remove need_migrate_task() Peter Zijlstra
2011-03-02 17:38 ` [PATCH 22/22] sched: Remove TASK_WAKING Peter Zijlstra
2011-03-11  1:49   ` Frank Rowand
2011-03-16  9:53     ` Peter Zijlstra
2011-03-11  1:51 ` [PATCH 00/22] sched: Reduce runqueue lock contention -v5 Frank Rowand
