* [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states
@ 2013-08-14 11:55 Preeti U Murthy
  2013-08-14 11:55 ` [RFC V2 PATCH 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) Preeti U Murthy
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:55 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

On PowerPC, when CPUs enter deep idle states, their local timers are
switched off. The responsibility of waking them up at their next timer event
needs to be handed over to an external device. On architectures like x86,
an external device such as the HPET currently does this job; PowerPC has no
equivalent device. Instead, we assign the local timer of one of the CPUs to
do this job.

This patchset is an attempt to make use of the existing timer broadcast
framework in the kernel to meet the above requirement, except that the tick
broadcast device is the local timer of the boot CPU.

This patch series is ported on top of 3.11-rc1 + the cpuidle driver backend
for powernv posted by Deepthi Dharwar recently. NOHZ_FULL is disabled during
all testing.
  The current design and implementation supports the ONESHOT tick mode.
It does not yet support the PERIODIC tick mode.

The discussion around V1 of this patchset can be found here:
https://lkml.org/lkml/2013/7/25/740.

Changes since V1:
1. Dynamically pick a broadcast CPU, instead of having a dedicated one.
2. Remove the constraint of having to disable tickless idle on the broadcast CPU.

Thanks to Ben H, Frederic Weisbecker, Li Yang and Vaidyanathan Srinivasan for
all their comments and suggestions on the V1 of this patchset.

Patch[1/6], Patch[2/6]: Optimize the broadcast mechanism on ppc.
Patch[3/6]: Introduce the core of the timer offload framework on powerpc.
Patch[4/6]: Add a deep idle state to the cpuidle state table on powernv.
Patch[5/6]: Dynamically pick a broadcast CPU.
Patch[6/6]: Remove the constraint of having to disable tickless idle on the
broadcast CPU, by queueing a hrtimer exclusively to do broadcast handling.
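
For reference, the idle-entry side of this hand-off. Below is a minimal
sketch of what patches [3/6]-[6/6] arrange; the clockevents_notify() calls
are the generic 3.11 API, and the snippet is simplified, with locking and
error handling omitted:

	/* A CPU entering an idle state that stops its decrementer hands its
	 * next timer event to the broadcast framework, naps, and takes it
	 * back on wakeup; patch [4/6] does this in longnap_loop(). */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
	power7_nap();	/* decrementer is off; wakeup arrives as an IPI */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);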

---

Preeti U Murthy (4):
      cpuidle/ppc: Add timer offload framework to support deep idle states
      cpuidle/ppc: Add longnap state to the idle states on powernv
      cpuidle/ppc: Enable dynamic movement of the broadcast functionality across CPUs
      cpuidle/ppc : Queue a hrtimer on bc_cpu, explicitly to do broadcast handling

Srivatsa S. Bhat (2):
      powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
      powerpc: Implement broadcast timer interrupt as an IPI message


 arch/powerpc/include/asm/smp.h                  |    3 -
 arch/powerpc/include/asm/time.h                 |    9 ++
 arch/powerpc/kernel/smp.c                       |   23 +++--
 arch/powerpc/kernel/time.c                      |  114 +++++++++++++++++++++++
 arch/powerpc/platforms/cell/interrupt.c         |    2 
 arch/powerpc/platforms/powernv/Kconfig          |    1 
 arch/powerpc/platforms/powernv/processor_idle.c |  104 +++++++++++++++++++++
 arch/powerpc/platforms/ps3/smp.c                |    2 
 8 files changed, 246 insertions(+), 12 deletions(-)



* [RFC V2 PATCH 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
  2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
@ 2013-08-14 11:55 ` Preeti U Murthy
  2013-08-14 11:56 ` [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message Preeti U Murthy
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:55 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE
map to a common implementation - generic_smp_call_function_single_interrupt().
So we can consolidate them and save one of the IPI message slots (which are
precious, since only 4 such slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC using
PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it
can be used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/kernel/smp.c               |   12 +++++-------
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ffbaabe..51bf017 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_CALL_FUNCTION   0
+#define PPC_MSG_UNUSED		0
 #define PPC_MSG_RESCHEDULE      1
 #define PPC_MSG_CALL_FUNC_SINGLE	2
 #define PPC_MSG_DEBUGGER_BREAK  3
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 38b0ba6..bc41e9f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t call_function_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-	generic_smp_call_function_interrupt();
+	/* This slot is unused and hence available for use, if needed */
 	return IRQ_HANDLED;
 }
 
@@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
+	[PPC_MSG_UNUSED] =  unused_action, /* Slot available for future use */
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
 	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
+	[PPC_MSG_UNUSED] =  "ipi unused",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
 	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void)
 		all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
-		if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION)))
-			generic_smp_call_function_interrupt();
 		if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
 			scheduler_ipi();
 		if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 	unsigned int cpu;
 
 	for_each_cpu(cpu, mask)
-		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
+		do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..28166e4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
+	iic_request_ipi(PPC_MSG_UNUSED);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
 	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..488f069 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
 		* to index needs to be setup.
 		*/
 
-		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
+		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);



* [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message
  2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
  2013-08-14 11:55 ` [RFC V2 PATCH 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) Preeti U Murthy
@ 2013-08-14 11:56 ` Preeti U Murthy
  2013-08-22  3:10     ` Benjamin Herrenschmidt
  2013-08-14 11:56 ` [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states Preeti U Murthy
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:56 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

For scalability and performance reasons, we want the broadcast timer
interrupts to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster
than the smp_call_function mechanism because the IPI handlers are fixed
and hence they don't involve costly operations such as queueing callbacks
on the target CPU's function queue, acquiring locks for synchronization,
and so on.

Luckily we have an unused IPI message slot, so use that to implement
broadcast timer interrupts efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
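
[Editorial note, not part of the patch: the demux tests in the hunk below
rely on each of the four IPI message slots occupying one byte of the 32-bit
info->messages word. An illustration of the encoding, assuming the
big-endian layout used in smp_ipi_demux(); IPI_MSG_BIT is an illustrative
name, not something the patch defines:]

	/* Slot N is signalled through byte N of the message word, i.e.
	 * bit (24 - 8 * N) of the 32-bit value on big-endian: */
	#define IPI_MSG_BIT(msg)	(1U << (24 - 8 * (msg)))

	/* PPC_MSG_TIMER            (0) -> 0x01000000
	 * PPC_MSG_RESCHEDULE       (1) -> 0x00010000
	 * PPC_MSG_CALL_FUNC_SINGLE (2) -> 0x00000100
	 * PPC_MSG_DEBUGGER_BREAK   (3) -> 0x00000001 */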

 arch/powerpc/include/asm/smp.h          |    3 ++-
 arch/powerpc/kernel/smp.c               |   19 +++++++++++++++----
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 51bf017..d877b69 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_UNUSED		0
+#define PPC_MSG_TIMER		0
 #define PPC_MSG_RESCHEDULE      1
 #define PPC_MSG_CALL_FUNC_SINGLE	2
 #define PPC_MSG_DEBUGGER_BREAK  3
@@ -190,6 +190,7 @@ extern struct smp_ops_t *smp_ops;
 
 extern void arch_send_call_function_single_ipi(int cpu);
 extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
+extern void arch_send_tick_broadcast(const struct cpumask *mask);
 
 /* Definitions relative to the secondary CPU spin loop
  * and entry point. Not all of them exist on both 32 and
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bc41e9f..6a68ca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t timer_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	timer_interrupt();
 	return IRQ_HANDLED;
 }
 
@@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-	[PPC_MSG_UNUSED] =  unused_action, /* Slot available for future use */
+	[PPC_MSG_TIMER] =  timer_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
 	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-	[PPC_MSG_UNUSED] =  "ipi unused",
+	[PPC_MSG_TIMER] =  "ipi timer",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
 	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void)
 		all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
+		if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
+			timer_interrupt();
 		if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
 			scheduler_ipi();
 		if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
+void arch_send_tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TIMER);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 28166e4..1359113 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TIMER);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
 	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 488f069..5cb742a 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
 		* to index needs to be setup.
 		*/
 
-		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 0);
+		BUILD_BUG_ON(PPC_MSG_TIMER	      != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);



* [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states
  2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
  2013-08-14 11:55 ` [RFC V2 PATCH 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) Preeti U Murthy
  2013-08-14 11:56 ` [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message Preeti U Murthy
@ 2013-08-14 11:56 ` Preeti U Murthy
  2013-08-22  3:27     ` Benjamin Herrenschmidt
  2013-08-14 11:56 ` [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv Preeti U Murthy
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:56 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

On PowerPC, in deep idle states, the local clock event device of a CPU,
called the decrementer, gets switched off. Make use of the broadcast
framework to issue interrupts to CPUs in deep idle states on their timer
events, except that on PowerPC we do not have an external device such as
the HPET; instead, the decrementer of one of the CPUs itself serves as the
broadcast device.

Instantiate two different clock event devices for each CPU: one representing
the decrementer and another representing the broadcast device. The CPU which
registers its broadcast device becomes responsible for issuing timer
interrupts to CPUs in deep idle states, and is referred to as the broadcast
CPU/bc_cpu in the changelogs of this patchset for convenience. Such a CPU is
not allowed to enter deep idle states, where the decrementer is switched off.

For now, only the boot cpu's broadcast device gets registered as a clock event
device along with the decrementer. Hence this is the broadcast cpu.

On the broadcast CPU, on each timer interrupt, the broadcast handler is
called in addition to the regular local timer event handler. We avoid the
overhead of programming the decrementer specifically for each broadcast
event, for performance and scalability reasons: say cpuX goes into a deep
idle state; it would have to ask the broadcast CPU to reprogram its (the
broadcast CPU's) decrementer for cpuX's next local timer event, which cpuX
could do only by sending an IPI to the broadcast CPU. With many more CPUs
going into deep idle, this model of sending an IPI each time would become a
performance bottleneck and would not scale well.

Apart from this, there is no change in the way broadcast is handled today.
On a broadcast IPI, the timer interrupt event handler is called on the CPU
in the deep idle state to handle its local events.

The current design and implementation of the timer offload framework supports
the ONESHOT tick mode but not the PERIODIC mode.

Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---
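
[Editorial note, not part of the patch: a sketch of how these pieces plug
into the generic broadcast machinery as of 3.11
(kernel/time/tick-broadcast.c), in the order the wakeup chain runs:]

	/*
	 * 1. cpuX calls clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
	 *    before napping (patch [4/6] does this in longnap_loop()). Since
	 *    the decrementer is now marked CLOCK_EVT_FEAT_C3STOP, the generic
	 *    code parks cpuX in the broadcast oneshot mask.
	 *
	 * 2. On the bc_cpu, every timer_interrupt() additionally calls
	 *    bc_evt->event_handler(), which the generic framework has set to
	 *    tick_handle_oneshot_broadcast(). That handler gathers the CPUs
	 *    whose next event has expired and invokes their clockevent's
	 *    .broadcast hook, i.e. decrementer_timer_broadcast(mask), which
	 *    is arch_send_tick_broadcast(mask), a PPC_MSG_TIMER IPI.
	 *
	 * 3. The IPI is demuxed in smp_ipi_demux() on cpuX, which runs
	 *    decrementer_timer_interrupt() and so fires cpuX's local
	 *    decrementer event handler.
	 */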

 arch/powerpc/include/asm/time.h        |    3 +
 arch/powerpc/kernel/smp.c              |    4 +-
 arch/powerpc/kernel/time.c             |   81 ++++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/Kconfig |    1 
 4 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);
 
 extern void generic_calibrate_decr(void);
 
 extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)
 
 static irqreturn_t timer_action(int irq, void *data)
 {
-	timer_interrupt();
+	decrementer_timer_interrupt();
 	return IRQ_HANDLED;
 }
 
@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
 
 #ifdef __BIG_ENDIAN
 		if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
-			timer_interrupt();
+			decrementer_timer_interrupt();
 		if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
 			scheduler_ipi();
 		if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..7e858e1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include <linux/timex.h>
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+				      struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
 				 struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);
 
 struct clock_event_device decrementer_clockevent = {
 	.name           = "decrementer",
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
 	.irq            = 0,
 	.set_next_event = decrementer_set_next_event,
 	.set_mode       = decrementer_set_mode,
-	.features       = CLOCK_EVT_FEAT_ONESHOT,
+	.broadcast	= decrementer_timer_broadcast,
+	.features       = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
 
+struct clock_event_device broadcast_clockevent = {
+	.name           = "broadcast",
+	.rating         = 200,
+	.irq            = 0,
+	.set_next_event = broadcast_set_next_event,
+	.set_mode       = decrementer_set_mode,
+	.features       = CLOCK_EVT_FEAT_ONESHOT,
+};
+EXPORT_SYMBOL(broadcast_clockevent);
+
 DEFINE_PER_CPU(u64, decrementers_next_tb);
 static DEFINE_PER_CPU(struct clock_event_device, decrementers);
+static DEFINE_PER_CPU(struct clock_event_device, bc_timer);
 
+int bc_cpu;
 #define XSEC_PER_SEC (1024*1024)
 
 #ifdef CONFIG_PPC64
@@ -487,6 +504,8 @@ void timer_interrupt(struct pt_regs * regs)
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
 	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	struct clock_event_device *bc_evt = &__get_cpu_var(bc_timer);
+	int cpu = smp_processor_id();
 	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
@@ -532,6 +551,10 @@ void timer_interrupt(struct pt_regs * regs)
 		*next_tb = ~(u64)0;
 		if (evt->event_handler)
 			evt->event_handler(evt);
+		if (cpu == bc_cpu && bc_evt->event_handler) {
+			bc_evt->event_handler(bc_evt);
+		}
+
 	} else {
 		now = *next_tb - now;
 		if (now <= DECREMENTER_MAX)
@@ -806,6 +829,20 @@ static int decrementer_set_next_event(unsigned long evt,
 	return 0;
 }
 
+/*
+ * We cannot program the decrementer of a remote CPU. Hence CPUs going into
+ * deep idle states need to send IPIs to the broadcast CPU to program its
+ * decrementer for their next local event so as to receive a broadcast IPI
+ * for the same. In order to avoid the overhead of multiple CPUs from sending
+ * IPIs, this function is a nop. Instead the broadcast CPU will handle the
+ * wakeup of CPUs in deep idle states in each of its local timer interrupts.
+ */
+static int broadcast_set_next_event(unsigned long evt,
+					struct clock_event_device *dev)
+{
+	return 0;
+}
+
 static void decrementer_set_mode(enum clock_event_mode mode,
 				 struct clock_event_device *dev)
 {
@@ -813,6 +850,20 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+void decrementer_timer_interrupt(void)
+{
+	struct clock_event_device *evt;
+	evt = &per_cpu(decrementers, smp_processor_id());
+
+	if (evt->event_handler)
+		evt->event_handler(evt);
+}
+
+static void decrementer_timer_broadcast(const struct cpumask *mask)
+{
+	arch_send_tick_broadcast(mask);
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
@@ -826,6 +877,20 @@ static void register_decrementer_clockevent(int cpu)
 	clockevents_register_device(dec);
 }
 
+static void register_broadcast_clockevent(int cpu)
+{
+	struct clock_event_device *bc_evt = &per_cpu(bc_timer, cpu);
+
+	*bc_evt = broadcast_clockevent;
+	bc_evt->cpumask = cpumask_of(cpu);
+
+	printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
+		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
+
+	clockevents_register_device(bc_evt);
+	bc_cpu = cpu;
+}
+
 static void __init init_decrementer_clockevent(void)
 {
 	int cpu = smp_processor_id();
@@ -840,6 +905,19 @@ static void __init init_decrementer_clockevent(void)
 	register_decrementer_clockevent(cpu);
 }
 
+static void __init init_broadcast_clockevent(void)
+{
+	int cpu = smp_processor_id();
+
+	clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
+
+	broadcast_clockevent.max_delta_ns =
+		clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
+	broadcast_clockevent.min_delta_ns =
+		clockevent_delta2ns(2, &broadcast_clockevent);
+	register_broadcast_clockevent(cpu);
+}
+
 void secondary_cpu_time_init(void)
 {
 	/* Start the decrementer on CPUs that have manual control
@@ -916,6 +994,7 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
+	init_broadcast_clockevent();
 }
 
 
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ace2d22..e1a96eb 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -6,6 +6,7 @@ config PPC_POWERNV
 	select PPC_ICP_NATIVE
 	select PPC_P7_NAP
 	select PPC_PCI_CHOICE if EMBEDDED
+	select GENERIC_CLOCKEVENTS_BROADCAST
 	select EPAPR_BOOT
 	default y
 



* [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv
  2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
                   ` (2 preceding siblings ...)
  2013-08-14 11:56 ` [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states Preeti U Murthy
@ 2013-08-14 11:56 ` Preeti U Murthy
  2013-08-22  3:28     ` Benjamin Herrenschmidt
  2013-08-14 11:56 ` [RFC V2 PATCH 5/6] cpuidle/ppc: Enable dynamic movement of the broadcast functionality across CPUs Preeti U Murthy
  2013-08-14 11:56 ` [RFC V2 PATCH 6/6] cpuidle/ppc : Queue a hrtimer on bc_cpu, explicitly to do broadcast handling Preeti U Murthy
  5 siblings, 1 reply; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:56 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

This patch hooks into the existing broadcast framework, along with the
support that this patchset introduces for ppc and the cpuidle driver backend
for powernv (posted by Deepthi Dharwar: https://lkml.org/lkml/2013/7/23/128),
to add the sleep state as one of the deep idle states in which the
decrementer is switched off.

However, in this patch we only emulate sleep by going into a state which does
a nap with decrementer interrupts disabled, termed longnap. This keeps the
focus of this patch series on the timer broadcast framework for ppc, which is
required as a first step to enable sleep on ppc.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/platforms/powernv/processor_idle.c |   48 +++++++++++++++++++++++
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
index f43ad91a..9aca502 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -9,16 +9,18 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
+#include <asm/time.h>
 
 struct cpuidle_driver powernv_idle_driver = {
 	.name =		"powernv_idle",
 	.owner =	THIS_MODULE,
 };
 
-#define MAX_IDLE_STATE_COUNT	2
+#define MAX_IDLE_STATE_COUNT	3
 
 static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
 static struct cpuidle_device __percpu *powernv_cpuidle_devices;
@@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementers interrupts disabled.
+ * This is an initial prototype to test the timer offload framework for ppc.
+ * We will eventually introduce the sleep state once the timer offload framework
+ * for ppc is stable.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv,
+				int index)
+{
+	int cpu = dev->cpu;
+
+	unsigned long lpcr = mfspr(SPRN_LPCR);
+
+	lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt, Emulate sleep.
+	 */
+	lpcr |= LPCR_PECE0;
+
+	if (cpu != bc_cpu) {
+		mtspr(SPRN_LPCR, lpcr);
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+		power7_nap();
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+	} else {
+		/* Wakeup on a decrementer interrupt, Do a nap */
+		lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, lpcr);
+		power7_nap();
+	}
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -72,6 +111,13 @@ static struct cpuidle_state powernv_states[MAX_IDLE_STATE_COUNT] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	 { /* LongNap */
+		.name = "LongNap",
+		.desc = "LongNap",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &longnap_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,



* [RFC V2 PATCH 5/6] cpuidle/ppc: Enable dynamic movement of the broadcast functionality across CPUs
  2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
                   ` (3 preceding siblings ...)
  2013-08-14 11:56 ` [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv Preeti U Murthy
@ 2013-08-14 11:56 ` Preeti U Murthy
  2013-08-14 11:56 ` [RFC V2 PATCH 6/6] cpuidle/ppc : Queue a hrtimer on bc_cpu, explicitly to do broadcast handling Preeti U Murthy
  5 siblings, 0 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:56 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

In the current design of the timer offload framework for powerpc, there is a
dedicated broadcast CPU, which is the boot CPU. But this is not good because:

a. It disallows this CPU from being hotplugged out.

b. Overburdening this CPU with the broadcast duty can take a toll on
performance, which could worsen if this CPU is already too busy.

c. It could lead to thermal or power imbalance within the chip.

To overcome the above constraints, move the broadcast duty around among the
CPUs. The current design chooses the first CPU that attempts to go to a deep
idle state as the broadcast CPU/bc_cpu, and disallows it from entering a
deep idle state itself.

The broadcast CPU is invalidated when there are no more CPUs in the
broadcast mask. Until this point, the rest of the CPUs attempting to enter
deep idle are allowed to do so, to be woken up by the broadcast CPU. Hence
the bc_cpu variable is set and unset only by the broadcast CPU.

Protect all of the above activity with a lock in order to avoid race
conditions between readers and writers of the bc_cpu entry and the broadcast
CPUs mask. One typical scenario could be:

CPUA				CPUB

Reads that a bc_cpu exists	Is the bc_cpu; finds the broadcast mask
				empty, and invalidates bc_cpu.

Enters itself into the
broadcast mask.

Thus, CPUA would go into deep idle while broadcast handling is inactive.

Due to the dynamic movement of the broadcast CPU, the broadcast clockevent
device is now one single pseudo device capable of working for all possible
CPUs, instead of being per-CPU as it was before (there is no point in
keeping it per-CPU any more). Since it is a pseudo device, the dynamic
movement of bc_cpu does not affect its functioning. The broadcast clockevent
device's event handler is called by the bc_cpu in each of its timer
interrupts.

This patch adds hotplug notifiers to change the bc_cpu in case it goes
offline. In this case, choose the first CPU in the broadcast mask to be the
next bc_cpu, and send it an IPI to wake it up and begin handling broadcast
events thereon. This IPI is the same as the broadcast IPI; the intention in
both scenarios (hotplug and actual broadcast wakeup) is to wake up a CPU in
the broadcast mask, just for different reasons.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/time.h                 |    1 
 arch/powerpc/kernel/time.c                      |   10 ++--
 arch/powerpc/platforms/powernv/processor_idle.c |   56 +++++++++++++++++++----
 3 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 936be0d..92260c9 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -25,6 +25,7 @@ extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
 extern struct clock_event_device broadcast_clockevent;
+extern struct clock_event_device bc_timer;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7e858e1..a19c8ca 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -127,9 +127,9 @@ EXPORT_SYMBOL(broadcast_clockevent);
 
 DEFINE_PER_CPU(u64, decrementers_next_tb);
 static DEFINE_PER_CPU(struct clock_event_device, decrementers);
-static DEFINE_PER_CPU(struct clock_event_device, bc_timer);
+struct clock_event_device bc_timer;
 
-int bc_cpu;
+int bc_cpu = -1;
 #define XSEC_PER_SEC (1024*1024)
 
 #ifdef CONFIG_PPC64
@@ -504,7 +504,7 @@ void timer_interrupt(struct pt_regs * regs)
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
 	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	struct clock_event_device *bc_evt = &__get_cpu_var(bc_timer);
+	struct clock_event_device *bc_evt = &bc_timer;
 	int cpu = smp_processor_id();
 	u64 now;
 
@@ -879,10 +879,10 @@ static void register_decrementer_clockevent(int cpu)
 
 static void register_broadcast_clockevent(int cpu)
 {
-	struct clock_event_device *bc_evt = &per_cpu(bc_timer, cpu);
+	struct clock_event_device *bc_evt = &bc_timer;
 
 	*bc_evt = broadcast_clockevent;
-	bc_evt->cpumask = cpumask_of(cpu);
+	bc_evt->cpumask = cpu_possible_mask;
 
 	printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
 		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
index 9aca502..9554da6 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -10,6 +10,8 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/clockchips.h>
+#include <linux/tick.h>
+#include <linux/spinlock.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
@@ -26,6 +28,8 @@ static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
 static struct cpuidle_device __percpu *powernv_cpuidle_devices;
 static struct cpuidle_state *cpuidle_state_table;
 
+static DEFINE_SPINLOCK(longnap_idle_lock);
+
 static int snooze_loop(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
 			int index)
@@ -70,6 +74,7 @@ static int longnap_loop(struct cpuidle_device *dev,
 	int cpu = dev->cpu;
 
 	unsigned long lpcr = mfspr(SPRN_LPCR);
+	unsigned long flags;
 
 	lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
 
@@ -78,16 +83,35 @@ static int longnap_loop(struct cpuidle_device *dev,
 	 */
 	lpcr |= LPCR_PECE0;
 
-	if (cpu != bc_cpu) {
-		mtspr(SPRN_LPCR, lpcr);
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
-		power7_nap();
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+	spin_lock_irqsave(&longnap_idle_lock, flags);
+
+	if (bc_cpu != -1) {
+		if (cpu != bc_cpu) {
+			mtspr(SPRN_LPCR, lpcr);
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+			spin_unlock_irqrestore(&longnap_idle_lock, flags);
+
+			power7_nap();
+
+			spin_lock_irqsave(&longnap_idle_lock, flags);
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+			spin_unlock_irqrestore(&longnap_idle_lock, flags);
+		}
+		else {
+			if (cpumask_empty(tick_get_broadcast_oneshot_mask()))
+				bc_cpu = -1;
+			/* Wakeup on a decrementer interrupt, Do a nap */
+			lpcr |= LPCR_PECE1;
+			mtspr(SPRN_LPCR, lpcr);
+			spin_unlock_irqrestore(&longnap_idle_lock, flags);
+			power7_nap();
+		}
 	} else {
-		/* Wakeup on a decrementer interrupt, Do a nap */
-		lpcr |= LPCR_PECE1;
-		mtspr(SPRN_LPCR, lpcr);
-		power7_nap();
+		/* First CPU to go to longnap, hence become the bc_cpu. Then
+		 * exit idle and re-enter to disable tickless idle on this cpu
+		 */
+		bc_cpu = cpu;
+		spin_unlock_irqrestore(&longnap_idle_lock, flags);
 	}
 
 	return index;
@@ -124,6 +148,8 @@ static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
 			unsigned long action, void *hcpu)
 {
 	int hotcpu = (unsigned long)hcpu;
+	int cpu;
+	unsigned long flags;
 	struct cpuidle_device *dev =
 			per_cpu_ptr(powernv_cpuidle_devices, hotcpu);
 
@@ -136,6 +162,18 @@ static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
 			cpuidle_resume_and_unlock();
 			break;
 
+		case CPU_DYING:
+		case CPU_DYING_FROZEN:
+			spin_lock_irqsave(&longnap_idle_lock, flags);
+			if (hotcpu == bc_cpu)
+				bc_cpu = -1;
+			if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) {
+				cpu = cpumask_first(tick_get_broadcast_oneshot_mask());
+				bc_cpu = cpu;
+				arch_send_tick_broadcast(cpumask_of(cpu));
+			}
+			spin_unlock_irqrestore(&longnap_idle_lock, flags);
+			break;
 		case CPU_DEAD:
 		case CPU_DEAD_FROZEN:
 			cpuidle_pause_and_lock();



* [RFC V2 PATCH 6/6] cpuidle/ppc : Queue a hrtimer on bc_cpu, explicitly to do broadcast handling
  2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
                   ` (4 preceding siblings ...)
  2013-08-14 11:56 ` [RFC V2 PATCH 5/6] cpuidle/ppc: Enable dynamic movement of the broadcast functionality across CPUs Preeti U Murthy
@ 2013-08-14 11:56 ` Preeti U Murthy
  5 siblings, 0 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-14 11:56 UTC
  To: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi, benh,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

In the current design we were depending on the timer interrupt on the
bc_cpu to trigger broadcast handling. In tickless idle, timer interrupts
could be many ticks away, which could result in missed wakeups on CPUs in deep
idle states. Disabling tickless idle on the bc_cpu is not good for
powersavings.

Therefore, queue an hrtimer specifically for broadcast handling. When the
broadcast CPU is chosen, it schedules this hrtimer to fire after a jiffy, to
initiate broadcast handling. On each expiration of this hrtimer thereafter,
it is reprogrammed to fire at the time the next broadcast handling has to be
done. But if there is no pending broadcast handling to be done in the
future, the broadcast CPU is invalidated and the hrtimer is cancelled. The
above cycle repeats when the next CPU attempts to enter a sleep state.

Of course, the time at which the hrtimer fires initially can be scheduled
for time=target_residency of the deep idle state instead of a jiffy, since
CPUs going into deep idle states will not have their next wakeup event
before the target_residency time of the deep idle state.
  But this patchset is based on longnap, which is used to mimic sleep but is
actually nap with decrementer interrupts disabled; therefore its
target_residency is the same as nap's. The CPUs going into longnap will
probably need to be woken up sooner than they would have been, had they gone
into sleep. Hence the initial scheduling of the hrtimer is held at a jiffy
as of now.

There is one other significant point. On CPU hotplug, the hrtimer on the
broadcast CPU is cancelled, the bc_cpu entry is invalidated, a new
broadcast CPU is chosen as before, and an IPI is sent to it. However,
instead of using a broadcast IPI to wake it up, use
smp_call_function_single(),
because apart from just wakeup, the new broadcast CPU has to restart
the hrtimer on itself so as to continue broadcast handling.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
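
[Editorial note, not part of the patch: get_next_bc_tick() below converts
one jiffy's worth of timebase ticks to nanoseconds. A worked example with
assumed numbers, a 512 MHz timebase and HZ=100: tb_ticks_per_jiffy =
512000000 / 100 = 5120000 and tb_ticks_per_usec = 512, so next_bc_ns =
(5120000 / 512) * 1000 = 10000000 ns = 10 ms, exactly one jiffy, matching
the changelog's intent of having the hrtimer first fire a jiffy after it is
started.]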

 arch/powerpc/include/asm/time.h                 |    5 ++
 arch/powerpc/kernel/time.c                      |   47 ++++++++++++++++++++---
 arch/powerpc/platforms/powernv/processor_idle.c |   38 ++++++++++++++-----
 3 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 92260c9..b9a60eb 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -16,6 +16,7 @@
 #ifdef __KERNEL__
 #include <linux/types.h>
 #include <linux/percpu.h>
+#include <linux/ktime.h>
 
 #include <asm/processor.h>
 
@@ -26,6 +27,7 @@ extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
 extern struct clock_event_device broadcast_clockevent;
 extern struct clock_event_device bc_timer;
+extern struct hrtimer *bc_hrtimer;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
@@ -35,7 +37,10 @@ extern void decrementer_timer_interrupt(void);
 extern void generic_calibrate_decr(void);
 
 extern void set_dec_cpu6(unsigned int val);
+extern ktime_t get_next_bc_tick(void);
+extern enum hrtimer_restart handle_broadcast(struct hrtimer *hrtimer);
 extern int bc_cpu;
+extern int bc_hrtimer_initialized;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index a19c8ca..1a64d58 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -43,6 +43,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
 #include <linux/timer.h>
+#include <linux/hrtimer.h>
+#include <linux/ktime.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -128,6 +130,8 @@ EXPORT_SYMBOL(broadcast_clockevent);
 DEFINE_PER_CPU(u64, decrementers_next_tb);
 static DEFINE_PER_CPU(struct clock_event_device, decrementers);
 struct clock_event_device bc_timer;
+struct hrtimer *bc_hrtimer;
+int bc_hrtimer_initialized = 0;
 
 int bc_cpu = -1;
 #define XSEC_PER_SEC (1024*1024)
@@ -504,8 +508,6 @@ void timer_interrupt(struct pt_regs * regs)
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
 	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	struct clock_event_device *bc_evt = &bc_timer;
-	int cpu = smp_processor_id();
 	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
@@ -551,10 +553,6 @@ void timer_interrupt(struct pt_regs * regs)
 		*next_tb = ~(u64)0;
 		if (evt->event_handler)
 			evt->event_handler(evt);
-		if (cpu == bc_cpu && bc_evt->event_handler) {
-			bc_evt->event_handler(bc_evt);
-		}
-
 	} else {
 		now = *next_tb - now;
 		if (now <= DECREMENTER_MAX)
@@ -864,6 +862,42 @@ static void decrementer_timer_broadcast(const struct cpumask *mask)
 	arch_send_tick_broadcast(mask);
 }
 
+ktime_t get_next_bc_tick(void)
+{
+	u64 next_bc_ns;
+
+	next_bc_ns = (tb_ticks_per_jiffy / tb_ticks_per_usec) * 1000;
+	return ns_to_ktime(next_bc_ns);
+}
+
+enum hrtimer_restart handle_broadcast(struct hrtimer *hrtimer)
+{
+	struct clock_event_device *bc_evt = &bc_timer;
+	int cpu = smp_processor_id();
+	ktime_t interval;
+
+	u64 now = get_tb_or_rtc();
+	ktime_t now_ktime = ns_to_ktime((now / tb_ticks_per_usec) * 1000);
+
+	bc_evt->event_handler(bc_evt);
+
+	/* FIXME: There could be a race condition between the time we do this
+	 * check and invalidate the bc_cpu and CPUs check for the existence of
+	 * bc_cpu and enter longnap_loop.This region could be protected by
+	 * the longnap_idle_lock as well. But is there a better way to handle such
+	 * race conditions, like relying on the existing tick_broadcast_lock
+	 * and remove explicit locking under such circumstances as below?
+	 */
+	if (bc_evt->next_event.tv64 == KTIME_MAX) {
+		bc_cpu = -1;
+		return HRTIMER_NORESTART;
+	}
+
+	interval.tv64 = bc_evt->next_event.tv64 - now_ktime.tv64;
+	hrtimer_forward_now(hrtimer, interval);
+	return HRTIMER_RESTART;
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
@@ -888,7 +922,6 @@ static void register_broadcast_clockevent(int cpu)
 		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
 
 	clockevents_register_device(bc_evt);
-	bc_cpu = cpu;
 }
 
 static void __init init_decrementer_clockevent(void)
diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
index 9554da6..8386ef6 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -12,6 +12,7 @@
 #include <linux/clockchips.h>
 #include <linux/tick.h>
 #include <linux/spinlock.h>
+#include <linux/slab.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
@@ -98,8 +99,6 @@ static int longnap_loop(struct cpuidle_device *dev,
 			spin_unlock_irqrestore(&longnap_idle_lock, flags);
 		}
 		else {
-			if (cpumask_empty(tick_get_broadcast_oneshot_mask()))
-				bc_cpu = -1;
 			/* Wakeup on a decrementer interrupt, Do a nap */
 			lpcr |= LPCR_PECE1;
 			mtspr(SPRN_LPCR, lpcr);
@@ -107,9 +106,21 @@ static int longnap_loop(struct cpuidle_device *dev,
 			power7_nap();
 		}
 	} else {
-		/* First CPU to go to longnap, hence become the bc_cpu. Then
-		 * exit idle and re-enter to disable tickless idle on this cpu
-		 */
+		/* First CPU to go to longnap, hence become the bc_cpu.
+ 		 */
+		if (!bc_hrtimer_initialized) {
+			bc_hrtimer = kmalloc(sizeof(*bc_hrtimer), GFP_KERNEL);
+			if (!bc_hrtimer)
+				return index;
+			hrtimer_init(bc_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+			bc_hrtimer->function = handle_broadcast;
+			hrtimer_start(bc_hrtimer, get_next_bc_tick(),
+				HRTIMER_MODE_REL_PINNED);
+			bc_hrtimer_initialized = 1;
+		}
+		else
+			hrtimer_start(bc_hrtimer, get_next_bc_tick(), HRTIMER_MODE_REL_PINNED);
+
 		bc_cpu = cpu;
 		spin_unlock_irqrestore(&longnap_idle_lock, flags);
 	}
@@ -117,6 +128,11 @@ static int longnap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+static void start_hrtimer(void *data)
+{
+	hrtimer_start(bc_hrtimer, get_next_bc_tick(), HRTIMER_MODE_REL_PINNED);
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -165,12 +181,14 @@ static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
 		case CPU_DYING:
 		case CPU_DYING_FROZEN:
 			spin_lock_irqsave(&longnap_idle_lock, flags);
-			if (hotcpu == bc_cpu)
+			if (hotcpu == bc_cpu) {
 				bc_cpu = -1;
-			if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) {
-				cpu = cpumask_first(tick_get_broadcast_oneshot_mask());
-				bc_cpu = cpu;
-				arch_send_tick_broadcast(cpumask_of(cpu));
+				hrtimer_cancel(bc_hrtimer);
+				if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) {
+					cpu = cpumask_first(tick_get_broadcast_oneshot_mask());
+					bc_cpu = cpu;
+					smp_call_function_single(cpu, start_hrtimer, NULL, 0);
+				}
 			}
 			spin_unlock_irqrestore(&longnap_idle_lock, flags);
 			break;



* Re: [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message
  2013-08-14 11:56 ` [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message Preeti U Murthy
@ 2013-08-22  3:10     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-22  3:10 UTC
  To: Preeti U Murthy
  Cc: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
> -static irqreturn_t unused_action(int irq, void *data)
> +static irqreturn_t timer_action(int irq, void *data)
>  {
> -       /* This slot is unused and hence available for use, if needed
> */
> +       timer_interrupt();
>         return IRQ_HANDLED;
>  }
>  

That means we'll do irq_enter/irq_exit twice no ? And things like
may_hard_irq_enable() are also already done by do_IRQ so you
don't need timer_interrupt() to do it again.

We probably are better off breaking timer_interrupt in two:

void __timer_interrupt(struct pt_regs * regs)

Does the current stuff between irq_enter and irq_exit, timer_interrupt
does the remaining around it and calls __timer_interrupt.

Then from timer_action, you call __timer_interrupt()

Cheers,
Ben.
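
[Editorial note: a minimal sketch of the split being suggested here, a
hypothetical factoring based on the 3.11 timer_interrupt(); accounting,
tracing and the cpu-online check are omitted:]

	/* Core tick work; callers guarantee irq_enter()/irq_exit(). */
	static void __timer_interrupt(void)
	{
		u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
		struct clock_event_device *evt = &__get_cpu_var(decrementers);
		u64 now = get_tb_or_rtc();

		if (now >= *next_tb) {
			*next_tb = ~(u64)0;
			if (evt->event_handler)
				evt->event_handler(evt);
		} else {
			now = *next_tb - now;
			if (now <= DECREMENTER_MAX)
				set_dec((int)now);
		}
	}

	/* The hardware decrementer entry point keeps the surrounding
	 * work... */
	void timer_interrupt(struct pt_regs *regs)
	{
		struct pt_regs *old_regs;

		set_dec(DECREMENTER_MAX);
		may_hard_irq_enable();

		old_regs = set_irq_regs(regs);
		irq_enter();
		__timer_interrupt();
		irq_exit();
		set_irq_regs(old_regs);
	}

	/* ...while the IPI path (timer_action), already inside do_IRQ's
	 * irq_enter()/irq_exit(), calls __timer_interrupt() directly. */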





* Re: [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states
  2013-08-14 11:56 ` [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states Preeti U Murthy
@ 2013-08-22  3:27     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-22  3:27 UTC
  To: Preeti U Murthy
  Cc: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:

>  static irqreturn_t timer_action(int irq, void *data)
>  {
> -	timer_interrupt();
> +	decrementer_timer_interrupt();
>  	return IRQ_HANDLED;
>  }

I don't completely understand what you are doing here, but ...

> @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
>  
>  #ifdef __BIG_ENDIAN
>  		if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
> -			timer_interrupt();
> +			decrementer_timer_interrupt();

Why call this decrementer_* since it's specifically *not* the
decrementer ?

Makes more sense to be called broadcast_timer_interrupt() no ?

>  		if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
>  			scheduler_ipi();
>  		if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 65ab9e9..7e858e1 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -42,6 +42,7 @@
>  #include <linux/timex.h>
>  #include <linux/kernel_stat.h>
>  #include <linux/time.h>
> +#include <linux/timer.h>
>  #include <linux/init.h>
>  #include <linux/profile.h>
>  #include <linux/cpu.h>
> @@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
>  
>  static int decrementer_set_next_event(unsigned long evt,
>  				      struct clock_event_device *dev);
> +static int broadcast_set_next_event(unsigned long evt,
> +				      struct clock_event_device *dev);
>  static void decrementer_set_mode(enum clock_event_mode mode,
>  				 struct clock_event_device *dev);
> +static void decrementer_timer_broadcast(const struct cpumask *mask);
>  
>  struct clock_event_device decrementer_clockevent = {
>  	.name           = "decrementer",
> @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
>  	.irq            = 0,
>  	.set_next_event = decrementer_set_next_event,
>  	.set_mode       = decrementer_set_mode,
> -	.features       = CLOCK_EVT_FEAT_ONESHOT,
> +	.broadcast	= decrementer_timer_broadcast,
> +	.features       = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
>  };
>  EXPORT_SYMBOL(decrementer_clockevent);
>  
> +struct clock_event_device broadcast_clockevent = {
> +	.name           = "broadcast",
> +	.rating         = 200,
> +	.irq            = 0,
> +	.set_next_event = broadcast_set_next_event,
> +	.set_mode       = decrementer_set_mode,

Same here, why "decrementer" ? This event device is *not* the
decrementer right ?

Also, pardon my ignorance, by why do we need a separate
clock_event_device ? Ie what does that do ? I am not familiar with the
broadcast scheme and what .broadcast do in the "decrementer" one, so
you need to provide me at least with better explanations.

> +	.features       = CLOCK_EVT_FEAT_ONESHOT,
> +};
> +EXPORT_SYMBOL(broadcast_clockevent);
> +
>  DEFINE_PER_CPU(u64, decrementers_next_tb);
>  static DEFINE_PER_CPU(struct clock_event_device, decrementers);
> +static DEFINE_PER_CPU(struct clock_event_device, bc_timer);

Do we really need an additional per CPU here ? What does the bc_timer
actually represent ? The sender (broadcaster) or receiver ? In the
latter case, why does it have to differ from the decrementer ?

> +int bc_cpu;

A global with that name ? Exported ? That's gross...

>  #define XSEC_PER_SEC (1024*1024)
>  
>  #ifdef CONFIG_PPC64
> @@ -487,6 +504,8 @@ void timer_interrupt(struct pt_regs * regs)
>  	struct pt_regs *old_regs;
>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>  	struct clock_event_device *evt = &__get_cpu_var(decrementers);
> +	struct clock_event_device *bc_evt = &__get_cpu_var(bc_timer);
> +	int cpu = smp_processor_id();
>  	u64 now;
>  
>  	/* Ensure a positive value is written to the decrementer, or else
> @@ -532,6 +551,10 @@ void timer_interrupt(struct pt_regs * regs)
>  		*next_tb = ~(u64)0;
>  		if (evt->event_handler)
>  			evt->event_handler(evt);
> +		if (cpu == bc_cpu && bc_evt->event_handler) {
> +			bc_evt->event_handler(bc_evt);
> +		}
> +

So here, on a CPU that happens to be "bc_cpu", we call an additional
"event_handler" on every timer interrupt ? What does that handler
actually do ? How does it relate to the "broadcast" field in the
original clock source ?

>  	} else {
>  		now = *next_tb - now;
>  		if (now <= DECREMENTER_MAX)
> @@ -806,6 +829,20 @@ static int decrementer_set_next_event(unsigned long evt,
>  	return 0;
>  }
>  
> +/*
> + * We cannot program the decrementer of a remote CPU. Hence CPUs going into
> + * deep idle states need to send IPIs to the broadcast CPU to program its
> + * decrementer for their next local event so as to receive a broadcast IPI
> + * for the same. In order to avoid the overhead of multiple CPUs from sending
> + * IPIs, this function is a nop. Instead the broadcast CPU will handle the
> + * wakeup of CPUs in deep idle states in each of its local timer interrupts.
> + */
> +static int broadcast_set_next_event(unsigned long evt,
> +					struct clock_event_device *dev)
> +{
> +	return 0;
> +}
> +
>  static void decrementer_set_mode(enum clock_event_mode mode,
>  				 struct clock_event_device *dev)
>  {
> @@ -813,6 +850,20 @@ static void decrementer_set_mode(enum clock_event_mode mode,
>  		decrementer_set_next_event(DECREMENTER_MAX, dev);
>  }
>  
> +void decrementer_timer_interrupt(void)
> +{
> +	struct clock_event_device *evt;
> +	evt = &per_cpu(decrementers, smp_processor_id());
> +
> +	if (evt->event_handler)
> +		evt->event_handler(evt);
> +}

So that's what happens when we receive the broadcast... it might need
some stats ... and it's using the normal "decrementer" clock source,
so I still have a problem understanding why you need another one.

> +static void decrementer_timer_broadcast(const struct cpumask *mask)
> +{
> +	arch_send_tick_broadcast(mask);
> +}

Ok, so far so good. But that's also hooked into the normal clock
source...

>  static void register_decrementer_clockevent(int cpu)
>  {
>  	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
> @@ -826,6 +877,20 @@ static void register_decrementer_clockevent(int cpu)
>  	clockevents_register_device(dec);
>  }
>  
> +static void register_broadcast_clockevent(int cpu)
> +{
> +	struct clock_event_device *bc_evt = &per_cpu(bc_timer, cpu);
> +
> +	*bc_evt = broadcast_clockevent;
> +	bc_evt->cpumask = cpumask_of(cpu);
> +
> +	printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
> +		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
> +
> +	clockevents_register_device(bc_evt);
> +	bc_cpu = cpu;
> +}
> +
>  static void __init init_decrementer_clockevent(void)
>  {
>  	int cpu = smp_processor_id();
> @@ -840,6 +905,19 @@ static void __init init_decrementer_clockevent(void)
>  	register_decrementer_clockevent(cpu);
>  }
>  
> +static void __init init_broadcast_clockevent(void)
> +{
> +	int cpu = smp_processor_id();
> +
> +	clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
> +
> +	broadcast_clockevent.max_delta_ns =
> +		clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
> +	broadcast_clockevent.min_delta_ns =
> +		clockevent_delta2ns(2, &broadcast_clockevent);
> +	register_broadcast_clockevent(cpu);
> +}
> +
>  void secondary_cpu_time_init(void)
>  {
>  	/* Start the decrementer on CPUs that have manual control
> @@ -916,6 +994,7 @@ void __init time_init(void)
>  	clocksource_init();
>  
>  	init_decrementer_clockevent();
> +	init_broadcast_clockevent();
>  }
>  
> 
> diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
> index ace2d22..e1a96eb 100644
> --- a/arch/powerpc/platforms/powernv/Kconfig
> +++ b/arch/powerpc/platforms/powernv/Kconfig
> @@ -6,6 +6,7 @@ config PPC_POWERNV
>  	select PPC_ICP_NATIVE
>  	select PPC_P7_NAP
>  	select PPC_PCI_CHOICE if EMBEDDED
> +	select GENERIC_CLOCKEVENTS_BROADCAST
>  	select EPAPR_BOOT
>  	default y
>  
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv
  2013-08-14 11:56 ` [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv Preeti U Murthy
@ 2013-08-22  3:28     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2013-08-22  3:28 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
> This patch hooks into the existing broadcast framework along with the support
> that this patchset introduces for ppc, and the cpuidle driver backend
> for powernv (posted by Deepthi Dharwar: https://lkml.org/lkml/2013/7/23/128)
> to add sleep state as one of the deep idle states, in which the decrementer
> is switched off.
> 
> However in this patch, we only emulate sleep by going into a state which
> does a nap with the decrementer interrupts disabled, termed longnap. This
> keeps the focus on the timer broadcast framework for ppc in this series of
> patches, which is required as a first step to enable sleep on ppc.

This is only for debug / proof of concept, right ? We should use a real
sleep here.

If we need to know whether the FW supports it (PORE etc.), we shall add
a device-tree property from the FW to indicate that fact.

Cheers,
Ben.

> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
> 
>  arch/powerpc/platforms/powernv/processor_idle.c |   48 +++++++++++++++++++++++
>  1 file changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
> index f43ad91a..9aca502 100644
> --- a/arch/powerpc/platforms/powernv/processor_idle.c
> +++ b/arch/powerpc/platforms/powernv/processor_idle.c
> @@ -9,16 +9,18 @@
>  #include <linux/cpuidle.h>
>  #include <linux/cpu.h>
>  #include <linux/notifier.h>
> +#include <linux/clockchips.h>
>  
>  #include <asm/machdep.h>
>  #include <asm/runlatch.h>
> +#include <asm/time.h>
>  
>  struct cpuidle_driver powernv_idle_driver = {
>  	.name =		"powernv_idle",
>  	.owner =	THIS_MODULE,
>  };
>  
> -#define MAX_IDLE_STATE_COUNT	2
> +#define MAX_IDLE_STATE_COUNT	3
>  
>  static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
>  static struct cpuidle_device __percpu *powernv_cpuidle_devices;
> @@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
>  	return index;
>  }
>  
> +/* Emulate sleep with a long nap.
> + * During sleep, the core does not receive decrementer interrupts.
> + * Emulate sleep using a long nap with decrementer interrupts disabled.
> + * This is an initial prototype to test the timer offload framework for ppc.
> + * We will eventually introduce the sleep state once the timer offload framework
> + * for ppc is stable.
> + */
> +static int longnap_loop(struct cpuidle_device *dev,
> +				struct cpuidle_driver *drv,
> +				int index)
> +{
> +	int cpu = dev->cpu;
> +
> +	unsigned long lpcr = mfspr(SPRN_LPCR);
> +
> +	lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
> +
> +	/* Exit powersave upon external interrupt, but not decrementer
> +	 * interrupt, to emulate sleep.
> +	 */
> +	lpcr |= LPCR_PECE0;
> +
> +	if (cpu != bc_cpu) {
> +		mtspr(SPRN_LPCR, lpcr);
> +		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
> +		power7_nap();
> +		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
> +	} else {
> +		/* Wakeup on a decrementer interrupt, Do a nap */
> +		lpcr |= LPCR_PECE1;
> +		mtspr(SPRN_LPCR, lpcr);
> +		power7_nap();
> +	}
> +
> +	return index;
> +}
> +
>  /*
>   * States for dedicated partition case.
>   */
> @@ -72,6 +111,13 @@ static struct cpuidle_state powernv_states[MAX_IDLE_STATE_COUNT] = {
>  		.exit_latency = 10,
>  		.target_residency = 100,
>  		.enter = &nap_loop },
> +	 { /* LongNap */
> +		.name = "LongNap",
> +		.desc = "LongNap",
> +		.flags = CPUIDLE_FLAG_TIME_VALID,
> +		.exit_latency = 10,
> +		.target_residency = 100,
> +		.enter = &longnap_loop },
>  };
>  
>  static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message
  2013-08-22  3:10     ` Benjamin Herrenschmidt
@ 2013-08-22  5:49       ` Preeti U Murthy
  -1 siblings, 0 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-22  5:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

Hi Ben

On 08/22/2013 08:40 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
>> -static irqreturn_t unused_action(int irq, void *data)
>> +static irqreturn_t timer_action(int irq, void *data)
>>  {
>> -       /* This slot is unused and hence available for use, if needed
>> */
>> +       timer_interrupt();
>>         return IRQ_HANDLED;
>>  }
>>  
> 
> That means we'll do irq_enter/irq_exit twice no ? And things like
> may_hard_irq_enable() are also already done by do_IRQ so you
> don't need timer_interrupt() to do it again.
> 
> We probably are better off breaking timer_interrupt in two:
> 
> void __timer_interrupt(struct pt_regs * regs)
> 
> Does the current stuff between irq_enter and irq_exit, timer_interrupt
> does the remaining around it and calls __timer_interrupt.
> 
> Then from timer_action, you call __timer_interrupt()

We actually tried out this approach. The implementation was to have a
set_dec(0) in timer_action(). This would ensure that we actually do
get a timer interrupt.

But the problem with either this approach or the one that you
suggest, i.e. calling __timer_interrupt(), lies in the following flow:

do_IRQ() -> irq_exit() -> tick_irq_exit() -> tick_nohz_irq_exit()
-> tick_nohz_stop_sched_tick()

The problem lies in the function tick_nohz_stop_sched_tick(). This
function checks for the next pending timer event on this cpu, and
reprograms the decrementer (and with it decrementers_next_tb) to the
time of the next event, which is of course > now.

As a result, when timer_action() above calls __timer_interrupt(), or
tries to trigger a timer interrupt through set_dec(0), the condition
if (now >= *next_tb) in timer_interrupt() or __timer_interrupt() fails,
and the timer interrupt event handler is never called.

---> if (now >= *next_tb) {
             *next_tb = ~(u64)0;
             if (evt->event_handler)
                evt->event_handler(evt);
      } else {

The broadcast IPI is meant to make the target CPU of this IPI believe
that it woke up from a timer interrupt, and not from an IPI. (The reason
for this I will explain in the reply to the next patch.) The target CPU
should then ideally do what it would have done had it received a real
timer interrupt: call the timer interrupt event handler.

But due to the above code flow, this does not happen. Hence, as
PATCH 3/6 shows, we simply call the timer interrupt event handler
without this explicit now >= *next_tb check.

This problem arises only in the implementation of this patchset, because
a timer interrupt is pseudo-triggered from an IPI. So the effects of the
IPI handler, as above, are felt by the timer interrupt handler triggered
from that IPI.
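
To make the failure concrete, here is a minimal sketch of the suggested
split, marking where the wakeup gets lost (the names follow the existing
arch/powerpc/kernel/time.c code; this is an illustration of the problem,
not a proposed patch):

static void __timer_interrupt(void)
{
	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
	struct clock_event_device *evt = &__get_cpu_var(decrementers);
	u64 now = get_tb_or_rtc();

	if (now >= *next_tb) {
		/* Never reached when called via timer_action():
		 * tick_nohz_stop_sched_tick() has already pushed
		 * *next_tb beyond now.
		 */
		*next_tb = ~(u64)0;
		if (evt->event_handler)
			evt->event_handler(evt);
	} else {
		now = *next_tb - now;
		if (now <= DECREMENTER_MAX)
			set_dec(now);
		/* The event handler never runs: the broadcast wakeup
		 * is silently lost here.
		 */
	}
}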

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states
  2013-08-22  3:27     ` Benjamin Herrenschmidt
@ 2013-08-22  9:03       ` Preeti U Murthy
  -1 siblings, 0 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-22  9:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

Hi Ben,

On 08/22/2013 08:57 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
> 
>>  static irqreturn_t timer_action(int irq, void *data)
>>  {
>> -	timer_interrupt();
>> +	decrementer_timer_interrupt();
>>  	return IRQ_HANDLED;
>>  }
> 
> I don't completely understand what you are doing here, but ...
> 
>> @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
>>  
>>  #ifdef __BIG_ENDIAN
>>  		if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
>> -			timer_interrupt();
>> +			decrementer_timer_interrupt();
> 
> Why call this decrementer_* since it's specifically *not* the
> decrementer ?
>
> Makes more sense to be called broadcast_timer_interrupt() no ?

A broadcast IPI is meant to trigger timer interrupt handling on the
target CPUs. In deep idle states, even though the local timers of CPUs
become non-functional, this should make no difference to them, thanks to
the broadcast framework.

The broadcast framework is meant to hide the fact that CPUs in deep
idle states require external help to wake up on their expired timer
events. So in such a scenario the CPUs should wake up to find themselves
handling their timers, even though they actually woke up from an IPI.

This whole idea is conveyed by naming the handler of the broadcast IPI
decrementer_timer_interrupt().

That said, ideally it should have been called timer_interrupt(). But
since we already have the timer interrupt handler with that name, and
since we cannot call it directly for the reasons mentioned in my reply
to your review of PATCH 2/6, I named it decrementer_timer_interrupt()
to come close to conveying the idea. It calls only the timer interrupt
event handler.

> 
>>  		if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
>>  			scheduler_ipi();
>>  		if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 65ab9e9..7e858e1 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -42,6 +42,7 @@
>>  #include <linux/timex.h>
>>  #include <linux/kernel_stat.h>
>>  #include <linux/time.h>
>> +#include <linux/timer.h>
>>  #include <linux/init.h>
>>  #include <linux/profile.h>
>>  #include <linux/cpu.h>
>> @@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
>>  
>>  static int decrementer_set_next_event(unsigned long evt,
>>  				      struct clock_event_device *dev);
>> +static int broadcast_set_next_event(unsigned long evt,
>> +				      struct clock_event_device *dev);
>>  static void decrementer_set_mode(enum clock_event_mode mode,
>>  				 struct clock_event_device *dev);
>> +static void decrementer_timer_broadcast(const struct cpumask *mask);
>>  
>>  struct clock_event_device decrementer_clockevent = {
>>  	.name           = "decrementer",
>> @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
>>  	.irq            = 0,
>>  	.set_next_event = decrementer_set_next_event,
>>  	.set_mode       = decrementer_set_mode,
>> -	.features       = CLOCK_EVT_FEAT_ONESHOT,
>> +	.broadcast	= decrementer_timer_broadcast,
>> +	.features       = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
>>  };
>>  EXPORT_SYMBOL(decrementer_clockevent);
>>  
>> +struct clock_event_device broadcast_clockevent = {
>> +	.name           = "broadcast",
>> +	.rating         = 200,
>> +	.irq            = 0,
>> +	.set_next_event = broadcast_set_next_event,
>> +	.set_mode       = decrementer_set_mode,
> 
> Same here, why "decrementer"? This event device is *not* the
> decrementer, right?

You are right. In this case it should have been broadcast_set_mode. This
is because this function is associated with the sender (the broadcast CPU).

> 
> Also, pardon my ignorance, but why do we need a separate
> clock_event_device ? I.e., what does that do ? I am not familiar with the
> broadcast scheme and what .broadcast does in the "decrementer" one, so
> you need to at least provide me with better explanations.

In short, we need an additional clock event device because this
patchset tries to integrate support for deep sleep states on POWER with
the broadcast framework as it exists in the kernel today, without any
changes to that framework; we adapt to it instead. Let me now elaborate.

The broadcast framework kicks in if there is a broadcast clock event
device. We could double up the local timer/decrementer of a CPU as the
broadcast clock event device itself, but the broadcast framework
considers a clock event device an eligible candidate for broadcast only
if the feature CLOCK_EVT_FEAT_C3STOP is not set on it. This feature says
that the clock event device stops in deep idle states, whereas the
broadcast framework expects the clock event device that takes up the
responsibility of doing broadcast to be available and on at all times.
The decrementer clock event device, however, can become unavailable in
deep idle states.
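
Roughly, the eligibility check in the generic code behaves like this
(paraphrased from kernel/time/tick-broadcast.c as of 3.11; a sketch, not
the exact source):

static bool can_be_broadcast_device(struct clock_event_device *dev)
{
	/* A device that itself stops in deep idle cannot be the one
	 * that wakes everybody else up from deep idle.
	 */
	if (dev->features & CLOCK_EVT_FEAT_C3STOP)
		return false;
	return true;
}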

Now to the next question: why does the broadcast framework require a
clock event device at all? This clock event device is the equivalent of
the local timer of a CPU, except that it behaves like a global timer for
all CPUs whose local timers are non-functional. Of course, this timer
can wake up only one CPU on expiry. So the CPU chosen for this purpose
effectively has two devices targeting their interrupts at it: the local
clock event device and the broadcast clock event device. When the
broadcast clock event device interrupts the CPU, the CPU sends an IPI to
those idling CPUs whose local timers have stopped, if required, with the
intention of waking them up to handle their local timer events.

The job of the broadcast framework is to find out which CPU, among those
with their local timers switched off, has to be woken up next. The
important thing to note here is that local timers are not switched off
only in deep idle states. They can be non-functional for whatever
reason, at which point they come to the broadcast framework seeking
help: to be woken up at their local timer events.

The broadcast framework puts CPUs in this category into a mask, and each
time a new CPU enters the mask it reprograms the broadcast clock event
device to the earlier of the new wakeup and the old one. If the local
timer of a CPU were also used for this purpose, its original local
wakeup event would be overwritten.

On powerpc, since we do not have a physical clock event device in
addition to the local clock event devices, we need a pseudo device so
that we can activate the broadcast framework.

I will explain what .broadcast in the decrementer clock event device
does further down in this mail.

> 
>> +	.features       = CLOCK_EVT_FEAT_ONESHOT,
>> +};
>> +EXPORT_SYMBOL(broadcast_clockevent);
>> +
>>  DEFINE_PER_CPU(u64, decrementers_next_tb);
>>  static DEFINE_PER_CPU(struct clock_event_device, decrementers);
>> +static DEFINE_PER_CPU(struct clock_event_device, bc_timer);
> 
> Do we really need an additional per-CPU device here ? What does the bc_timer
> actually represent ? The sender (broadcaster) or the receiver ? In the
> latter case, why does it have to differ from the decrementer ?

No, we don't really need this to be per-CPU. PATCH 4/6 removes this
per_cpu instance and has a single clock event device instead. The
bc_timer is the broadcast clock event device described above. So to
answer the above question, it is associated with the sender. It needs to
be different from the decrementer for the reasons mentioned above.

> 
>> +int bc_cpu;
> 
> A global with that name ? Exported ? That's gross...

Sorry about this; I'll take care of the implementation in the next
version by adding accessor functions to set/unset it.
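
Something along these lines, for instance (a sketch; the names and the
-1 "no broadcast CPU" convention are illustrative, not final):

static int bc_cpu = -1;	/* -1: no broadcast CPU assigned */

int get_bc_cpu(void)
{
	return bc_cpu;
}

void set_bc_cpu(int cpu)
{
	bc_cpu = cpu;
}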

> 
>>  #define XSEC_PER_SEC (1024*1024)
>>  
>>  #ifdef CONFIG_PPC64
>> @@ -487,6 +504,8 @@ void timer_interrupt(struct pt_regs * regs)
>>  	struct pt_regs *old_regs;
>>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>>  	struct clock_event_device *evt = &__get_cpu_var(decrementers);
>> +	struct clock_event_device *bc_evt = &__get_cpu_var(bc_timer);
>> +	int cpu = smp_processor_id();
>>  	u64 now;
>>  
>>  	/* Ensure a positive value is written to the decrementer, or else
>> @@ -532,6 +551,10 @@ void timer_interrupt(struct pt_regs * regs)
>>  		*next_tb = ~(u64)0;
>>  		if (evt->event_handler)
>>  			evt->event_handler(evt);
>> +		if (cpu == bc_cpu && bc_evt->event_handler) {
>> +			bc_evt->event_handler(bc_evt);
>> +		}
>> +
> 
> So here, on a CPU that happens to be "bc_cpu", we call an additional
> "event_handler" on every timer interrupt ? What does that handler
> actually do ? How does it relate to the "broadcast" field in the
> original clock source ?

This handler is in charge of finding out which of the CPUs whose local
timers have been switched off need to be woken up to handle their
expired timer events. The bc_cpu then sends an IPI to such sleeping
CPUs.

The broadcast framework currently uses the *broadcast* function to send
IPIs to the CPUs in deep idle states. This function needs to be
associated with the local timers of the CPUs according to the current
design of the broadcast framework.

So the event handler of the broadcast clock event device uses the
broadcast function to send IPIs to the CPUs that need them, in addition
to finding out which CPUs to send the IPIs to and when the next wakeup
of idling CPUs should be handled.
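
In rough pseudo-C, the per-tick work on the bc_cpu amounts to the
following (idle_cpus_mask and next_event_of() are stand-ins for the
tick-broadcast internals, so treat this purely as a sketch):

static void broadcast_handler_sketch(struct clock_event_device *bc)
{
	struct cpumask to_wake;
	u64 now = get_tb_or_rtc();
	u64 next = ~(u64)0;
	int cpu;

	cpumask_clear(&to_wake);
	for_each_cpu(cpu, idle_cpus_mask) {		/* stand-in mask */
		u64 expiry = next_event_of(cpu);	/* stand-in helper */

		if (expiry <= now)
			cpumask_set_cpu(cpu, &to_wake);	/* timer expired */
		else if (expiry < next)
			next = expiry;			/* earliest pending */
	}

	if (!cpumask_empty(&to_wake))
		bc->broadcast(&to_wake);	/* -> arch_send_tick_broadcast() */

	/* The framework would now reprogram the broadcast device for
	 * "next"; on ppc that is a nop, since the bc_cpu rescans on
	 * every local timer interrupt anyway.
	 */
}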

> 
>>  	} else {
>>  		now = *next_tb - now;
>>  		if (now <= DECREMENTER_MAX)
>> @@ -806,6 +829,20 @@ static int decrementer_set_next_event(unsigned long evt,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * We cannot program the decrementer of a remote CPU. Hence CPUs going into
>> + * deep idle states need to send IPIs to the broadcast CPU to program its
>> + * decrementer for their next local event so as to receive a broadcast IPI
>> + * for the same. To avoid the overhead of multiple CPUs sending IPIs,
>> + * this function is a nop. Instead, the broadcast CPU handles the wakeup
>> + * of CPUs in deep idle states in each of its local timer interrupts.
>> + */
>> +static int broadcast_set_next_event(unsigned long evt,
>> +					struct clock_event_device *dev)
>> +{
>> +	return 0;
>> +}
>> +
>>  static void decrementer_set_mode(enum clock_event_mode mode,
>>  				 struct clock_event_device *dev)
>>  {
>> @@ -813,6 +850,20 @@ static void decrementer_set_mode(enum clock_event_mode mode,
>>  		decrementer_set_next_event(DECREMENTER_MAX, dev);
>>  }
>>  
>> +void decrementer_timer_interrupt(void)
>> +{
>> +	struct clock_event_device *evt;
>> +	evt = &per_cpu(decrementers, smp_processor_id());
>> +
>> +	if (evt->event_handler)
>> +		evt->event_handler(evt);
>> +}
> 
> So that's what happens when we receive the broadcast... it might need
> some stats ... and it's using the normal "decrementer" clock source,
> so I still have a problem understanding why you need another one.

So this is the receiver part of the broadcast. The sender needs to call
the event handler of the broadcast clock event device, while the
receiver has to invoke the event handler of the decrementer clock event
device so as to handle its local timer events.
  But you are right, we are missing stats accumulation by doing the
above; I need to take care of this.

> 
>> +static void decrementer_timer_broadcast(const struct cpumask *mask)
>> +{
>> +	arch_send_tick_broadcast(mask);
>> +}
> 
> Ok, so far so good. But that's also hooked into the normal clock
> source...

This broadcast function has to be associated with the normal clock
source according to the implementation of the broadcast framework. I am
not too sure about the reason behind this.

So the current implementation has the broadcast event handler associated
with the broadcast clock event device (where the CPUs that have to be
woken up are selected and the broadcast clock event device is
reprogrammed), and the actual job of broadcasting, i.e. sending an IPI,
is associated with the normal clock source via the .broadcast fn.

And the broadcast IPI handler on the target CPUs has to invoke the timer
handler of the normal clock source.

So only the broadcast event handler (tick_handle_oneshot_broadcast() and
tick_handle_periodic_broadcast()) is associated with the broadcast clock
source.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv
  2013-08-22  3:28     ` Benjamin Herrenschmidt
@ 2013-08-22  9:09       ` Preeti U Murthy
  -1 siblings, 0 replies; 19+ messages in thread
From: Preeti U Murthy @ 2013-08-22  9:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: fweisbec, paul.gortmaker, paulus, shangw, galak, deepthi,
	paulmck, arnd, linux-pm, rostedt, rjw, john.stultz, tglx,
	chenhui.zhao, michael, r58472, geoff, linux-kernel,
	srivatsa.bhat, schwidefsky, svaidy, linuxppc-dev

Hi Ben,

On 08/22/2013 08:58 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
>> This patch hooks into the existing broadcast framework along with the support
>> that this patchset introduces for ppc, and the cpuidle driver backend
>> for powernv (posted by Deepthi Dharwar: https://lkml.org/lkml/2013/7/23/128)
>> to add sleep state as one of the deep idle states, in which the decrementer
>> is switched off.
>>
>> However in this patch, we only emulate sleep by going into a state which
>> does a nap with the decrementer interrupts disabled, termed longnap. This
>> keeps the focus on the timer broadcast framework for ppc in this series of
>> patches, which is required as a first step to enable sleep on ppc.
> 
> This is only for debug / proof of concept right ? We should use a real
> sleep here.
> 
> If we need to know whether the FW supports it (PORE etc...) we shall add
> a device-tree property from the FW to indicate that fact.

We also need the sleep support, right? The context management, I mean.
Yes, it is a debug patch, meant first to ensure that we have the hookup
to the broadcast framework done right.
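
For the eventual real-sleep version, the firmware-capability check could
look something like this (a sketch; the node path is the usual powernv
OPAL node, but the property name is purely illustrative since no binding
is defined yet):

static bool fw_supports_sleep(void)
{
	struct device_node *np = of_find_node_by_path("/ibm,opal");
	bool ok = np && of_property_read_bool(np, "ibm,power-sleep");

	of_node_put(np);
	return ok;
}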

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread

Thread overview: 19+ messages
-- links below jump to the message on this page --
2013-08-14 11:55 [RFC V2 PATCH 0/6] cpuidle/ppc: Timer offload framework to support deep idle states Preeti U Murthy
2013-08-14 11:55 ` [RFC V2 PATCH 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) Preeti U Murthy
2013-08-14 11:56 ` [RFC V2 PATCH 2/6] powerpc: Implement broadcast timer interrupt as an IPI message Preeti U Murthy
2013-08-22  3:10   ` Benjamin Herrenschmidt
2013-08-22  5:49     ` Preeti U Murthy
2013-08-14 11:56 ` [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states Preeti U Murthy
2013-08-22  3:27   ` Benjamin Herrenschmidt
2013-08-22  9:03     ` Preeti U Murthy
2013-08-14 11:56 ` [RFC V2 PATCH 4/6] cpuidle/ppc: Add longnap state to the idle states on powernv Preeti U Murthy
2013-08-22  3:28   ` Benjamin Herrenschmidt
2013-08-22  9:09     ` Preeti U Murthy
2013-08-14 11:56 ` [RFC V2 PATCH 5/6] cpuidle/ppc: Enable dynamic movement of the broadcast functionality across CPUs Preeti U Murthy
2013-08-14 11:56 ` [RFC V2 PATCH 6/6] cpuidle/ppc : Queue a hrtimer on bc_cpu, explicitly to do broadcast handling Preeti U Murthy
