* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
@ 2013-01-10  0:20 Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
                   ` (19 more replies)
  0 siblings, 20 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

This is the first public posting of the initial big.LITTLE support code.
Included here is the code required to safely power up and down CPUs in a
b.L system, whether this is via CPU hotplug, a cpuidle driver or the
Linaro b.L in-kernel switcher[*] on top of this.  Only SMP secondary
boot and CPU hotplug support are included at this time.  Getting to this
point already represents a significant chunk of code, as illustrated by
the diffstat below.

This work was presented at Linaro Connect in Copenhagen by Dave Martin and
myself.  The presentation slides are available here:

http://www.linaro.org/documents/download/f3569407bb1fb8bde0d6da80e285b832508f92f57223c

The code is now stable on both Fast Models and Versatile Express TC2,
and is ready for public review.

Platform support is included for Fast Models implementing the
Cortex-A15x4-A7x4 and Cortex-A15x1-A7x1 configurations.  To allow
successful compilation, I also included a preliminary version of the
CCI400 driver from Lorenzo Pieralisi.

Support for actual hardware such as Vexpress TC2 should come later,
once the basic infrastructure from this series is merged.  A few DT
bindings are used but not yet documented.

This series is made of the following parts:

Low-level support code:
[PATCH 01/16] ARM: b.L: secondary kernel entry code
[PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
[PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
[PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
[PATCH 05/16] ARM: bL_head: vlock-based first man election

Adaptation layer to hook with the generic kernel infrastructure:
[PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug
[PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before
[PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a
[PATCH 09/16] ARM: vexpress: Select the correct SMP operations at

Fast Models support:
[PATCH 10/16] ARM: vexpress: introduce DCSCB support
[PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power
[PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs
[PATCH 13/16] drivers: misc: add ARM CCI support
[PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from
[PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency
[PATCH 16/16] ARM: vexpress/dcscb: probe via device tree

Here's the diffstat:

 .../big.LITTLE/cluster-pm-race-avoidance.txt    | 498 ++++++++++++++++++
 Documentation/arm/big.LITTLE/vlocks.txt         | 211 ++++++++
 arch/arm/Kconfig                                |   6 +
 arch/arm/common/Makefile                        |   3 +
 arch/arm/common/bL_entry.c                      | 278 ++++++++++
 arch/arm/common/bL_head.S                       | 232 ++++++++
 arch/arm/common/bL_platsmp.c                    |  85 +++
 arch/arm/common/gic.c                           |   6 +
 arch/arm/common/vlock.S                         | 108 ++++
 arch/arm/common/vlock.h                         |  43 ++
 arch/arm/include/asm/bL_entry.h                 | 189 +++++++
 arch/arm/include/asm/hardware/gic.h             |   2 +
 arch/arm/include/asm/mach/arch.h                |   3 +
 arch/arm/kernel/setup.c                         |   5 +-
 arch/arm/mach-vexpress/Kconfig                  |   9 +
 arch/arm/mach-vexpress/Makefile                 |   1 +
 arch/arm/mach-vexpress/core.h                   |   2 +
 arch/arm/mach-vexpress/dcscb.c                  | 257 +++++++++
 arch/arm/mach-vexpress/dcscb_setup.S            |  77 +++
 arch/arm/mach-vexpress/platsmp.c                |  12 +
 arch/arm/mach-vexpress/v2m.c                    |   2 +-
 drivers/misc/Kconfig                            |   4 +
 drivers/misc/Makefile                           |   1 +
 drivers/misc/arm-cci.c                          | 124 +++++
 include/linux/arm-cci.h                         |  30 ++
 25 files changed, 2186 insertions(+), 2 deletions(-)

Review comments are welcome!

[*] General design information on the b.L switcher can be found here:
    http://lwn.net/Articles/481055/
    However the code is only accessible to Linaro members for the
    time being.


* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10  7:12   ` Stephen Boyd
                     ` (4 more replies)
  2013-01-10  0:20 ` [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
                   ` (18 subsequent siblings)
  19 siblings, 5 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

CPUs in a big.LITTLE system have special needs when entering the kernel
due to a hotplug event or when resuming from a deep sleep mode.

The entry path is vectored per CPU so multiple CPUs can enter the
kernel in parallel without serialization.

Only the basic structure is introduced here.  This will be extended
later.

TODO: MPIDR-based indexing should eventually be made adjustable at runtime.
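
A hedged usage sketch, for illustration only (my_secondary_startup and
my_platform_release_cpu are hypothetical names; the real SMP glue comes
later in this series):

	extern void my_secondary_startup(void);		/* hypothetical */

	static int my_boot_secondary(unsigned int cpu, unsigned int cluster)
	{
		/*
		 * Tell the gate in bL_head.S where this CPU should branch
		 * once it is released.  The platform must have arranged for
		 * the CPU to execute bL_entry_point when it leaves reset.
		 */
		bL_set_entry_vector(cpu, cluster, my_secondary_startup);

		/* Now let the platform take the CPU out of reset. */
		return my_platform_release_cpu(cpu, cluster);
	}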

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/Kconfig                |  6 +++
 arch/arm/common/Makefile        |  3 ++
 arch/arm/common/bL_entry.c      | 30 +++++++++++++++
 arch/arm/common/bL_head.S       | 81 +++++++++++++++++++++++++++++++++++++++++
 arch/arm/include/asm/bL_entry.h | 35 ++++++++++++++++++
 5 files changed, 155 insertions(+)
 create mode 100644 arch/arm/common/bL_entry.c
 create mode 100644 arch/arm/common/bL_head.S
 create mode 100644 arch/arm/include/asm/bL_entry.h

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f95ba14ae3..2271f02e8e 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1579,6 +1579,12 @@ config HAVE_ARM_TWD
 	help
 	  This options enables support for the ARM timer and watchdog unit
 
+config BIG_LITTLE
+	bool "big.LITTLE support (Experimental)"
+	depends on CPU_V7 && SMP && EXPERIMENTAL
+	help
+	  This option enables support for the big.LITTLE architecture.
+
 choice
 	prompt "Memory split"
 	default VMSPLIT_3G
diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index e8a4e58f1b..50880c494f 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -13,3 +13,6 @@ obj-$(CONFIG_SHARP_PARAM)	+= sharpsl_param.o
 obj-$(CONFIG_SHARP_SCOOP)	+= scoop.o
 obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
 obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
+obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
+obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o
+obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o
diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
new file mode 100644
index 0000000000..80fff49417
--- /dev/null
+++ b/arch/arm/common/bL_entry.c
@@ -0,0 +1,30 @@
+/*
+ * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
+ *
+ * Created by:  Nicolas Pitre, March 2012
+ * Copyright:   (C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#include <asm/bL_entry.h>
+#include <asm/barrier.h>
+#include <asm/proc-fns.h>
+#include <asm/cacheflush.h>
+
+extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
+
+void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
+{
+	unsigned long val = ptr ? virt_to_phys(ptr) : 0;
+	bL_entry_vectors[cluster][cpu] = val;
+	smp_wmb();
+	__cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
+	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
+			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
+}
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
new file mode 100644
index 0000000000..9d351f2b4c
--- /dev/null
+++ b/arch/arm/common/bL_head.S
@@ -0,0 +1,81 @@
+/*
+ * arch/arm/common/bL_head.S -- big.LITTLE kernel re-entry point
+ *
+ * Created by:  Nicolas Pitre, March 2012
+ * Copyright:   (C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+#include <asm/bL_entry.h>
+
+	.macro	pr_dbg	cpu, string
+#if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
+	b	1901f
+1902:	.ascii	"CPU 0: \0CPU 1: \0CPU 2: \0CPU 3: \0"
+	.ascii	"CPU 4: \0CPU 5: \0CPU 6: \0CPU 7: \0"
+1903:	.asciz	"\string"
+	.align
+1901:	adr	r0, 1902b
+	add	r0, r0, \cpu, lsl #3
+	bl	printascii
+	adr	r0, 1903b
+	bl	printascii
+#endif
+	.endm
+
+	.arm
+	.align
+
+ENTRY(bL_entry_point)
+
+ THUMB(	adr	r12, BSYM(1f)	)
+ THUMB(	bx	r12		)
+ THUMB(	.thumb			)
+1:
+	mrc	p15, 0, r0, c0, c0, 5
+	ubfx	r9, r0, #0, #4			@ r9 = cpu
+	ubfx	r10, r0, #8, #4			@ r10 = cluster
+	mov	r3, #BL_CPUS_PER_CLUSTER
+	mla	r4, r3, r10, r9			@ r4 = canonical CPU index
+	cmp	r4, #(BL_CPUS_PER_CLUSTER * BL_NR_CLUSTERS)
+	blo	2f
+
+	/* We didn't expect this CPU.  Try to make it quiet. */
+1:	wfi
+	wfe
+	b	1b
+
+2:	pr_dbg	r4, "kernel bL_entry_point\n"
+
+	/*
+	 * MMU is off so we need to get to bL_entry_vectors in a
+	 * position independent way.
+	 */
+	adr	r5, 3f
+	ldr	r6, [r5]
+	add	r6, r5, r6			@ r6 = bL_entry_vectors
+
+bL_entry_gated:
+	ldr	r5, [r6, r4, lsl #2]		@ r5 = CPU entry vector
+	cmp	r5, #0
+	wfeeq
+	beq	bL_entry_gated
+	pr_dbg	r4, "released\n"
+	bx	r5
+
+	.align	2
+
+3:	.word	bL_entry_vectors - .
+
+ENDPROC(bL_entry_point)
+
+	.bss
+	.align	5
+
+	.type	bL_entry_vectors, #object
+ENTRY(bL_entry_vectors)
+	.space	4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
new file mode 100644
index 0000000000..ff623333a1
--- /dev/null
+++ b/arch/arm/include/asm/bL_entry.h
@@ -0,0 +1,35 @@
+/*
+ * arch/arm/include/asm/bL_entry.h
+ *
+ * Created by:  Nicolas Pitre, April 2012
+ * Copyright:   (C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef BL_ENTRY_H
+#define BL_ENTRY_H
+
+#define BL_CPUS_PER_CLUSTER	4
+#define BL_NR_CLUSTERS		2
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Platform specific code should use this symbol to set up secondary
+ * entry location for processors to use when released from reset.
+ */
+extern void bL_entry_point(void);
+
+/*
+ * This is used to indicate where the given CPU from given cluster should
+ * branch once it is ready to re-enter the kernel using ptr, or NULL if it
+ * should be gated.  A gated CPU is held in a WFE loop until its vector
+ * becomes non NULL.
+ */
+void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
+
+#endif /* ! __ASSEMBLY__ */
+#endif
-- 
1.8.0


* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10 23:08   ` Will Deacon
  2013-01-11 17:26   ` Santosh Shilimkar
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

This is the basic API used to handle the powering up/down of individual
CPUs in a big.LITTLE system.  The platform specific backend implementation
is also responsible for handling cluster-level power when the first/last
CPU in a cluster is brought up/down.
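
To illustrate the intended shape of a backend, a minimal sketch (all
my_* names are hypothetical; an actual backend for the DCSCB appears
later in this series):

	static int my_power_up(unsigned int cpu, unsigned int cluster)
	{
		/* power the cluster on if needed, then deassert CPU reset */
		return 0;
	}

	static void my_power_down(void)
	{
		/* disable coherency and caches for this CPU, then WFI */
	}

	static const struct bL_platform_power_ops my_power_ops = {
		.power_up	= my_power_up,
		.power_down	= my_power_down,
	};

	static int __init my_bL_power_init(void)
	{
		return bL_platform_power_register(&my_power_ops);
	}
	early_initcall(my_bL_power_init);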

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/common/bL_entry.c      | 88 +++++++++++++++++++++++++++++++++++++++
 arch/arm/include/asm/bL_entry.h | 92 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 180 insertions(+)

diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 80fff49417..41de0622de 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -11,11 +11,13 @@
 
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/irqflags.h>
 
 #include <asm/bL_entry.h>
 #include <asm/barrier.h>
 #include <asm/proc-fns.h>
 #include <asm/cacheflush.h>
+#include <asm/idmap.h>
 
 extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
 
@@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
 	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
 			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
 }
+
+static const struct bL_platform_power_ops *platform_ops;
+
+int __init bL_platform_power_register(const struct bL_platform_power_ops *ops)
+{
+	if (platform_ops)
+		return -EBUSY;
+	platform_ops = ops;
+	return 0;
+}
+
+int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
+{
+	if (!platform_ops)
+		return -EUNATCH;
+	might_sleep();
+	return platform_ops->power_up(cpu, cluster);
+}
+
+typedef void (*phys_reset_t)(unsigned long);
+
+void bL_cpu_power_down(void)
+{
+	phys_reset_t phys_reset;
+
+	BUG_ON(!platform_ops);
+	BUG_ON(!irqs_disabled());
+
+	/*
+	 * Do this before calling into the power_down method,
+	 * as it might not always be safe to do afterwards.
+	 */
+	setup_mm_for_reboot();
+
+	platform_ops->power_down();
+
+	/*
+	 * It is possible for a power_up request to happen concurrently
+	 * with a power_down request for the same CPU. In this case the
+	 * power_down method might not be able to actually enter a
+	 * powered down state with the WFI instruction if the power_up
+	 * method has removed the required reset condition.  The
+	 * power_down method is then allowed to return. We must perform
+	 * a re-entry in the kernel as if the power_up method just had
+	 * deasserted reset on the CPU.
+	 *
+	 * To simplify race issues, the platform specific implementation
+	 * must accommodate for the possibility of unordered calls to
+	 * power_down and power_up with a usage count. Therefore, if a
+	 * call to power_up is issued for a CPU that is not down, then
+	 * the next call to power_down must not attempt a full shutdown
+	 * but only do the minimum (normally disabling L1 cache and CPU
+	 * coherency) and return just as if a concurrent power_up request
+	 * had happened as described above.
+	 */
+
+	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
+	phys_reset(virt_to_phys(bL_entry_point));
+
+	/* should never get here */
+	BUG();
+}
+
+void bL_cpu_suspend(u64 expected_residency)
+{
+	phys_reset_t phys_reset;
+
+	BUG_ON(!platform_ops);
+	BUG_ON(!irqs_disabled());
+
+	/* Very similar to bL_cpu_power_down() */
+	setup_mm_for_reboot();
+	platform_ops->suspend(expected_residency);
+	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
+	phys_reset(virt_to_phys(bL_entry_point));
+	BUG();
+}
+
+int bL_cpu_powered_up(void)
+{
+	if (!platform_ops)
+		return -EUNATCH;
+	if (platform_ops->powered_up)
+		platform_ops->powered_up();
+	return 0;
+}
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
index ff623333a1..942d7f9f19 100644
--- a/arch/arm/include/asm/bL_entry.h
+++ b/arch/arm/include/asm/bL_entry.h
@@ -31,5 +31,97 @@ extern void bL_entry_point(void);
  */
 void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
 
+/*
+ * CPU/cluster power operations API for higher subsystems to use.
+ */
+
+/**
+ * bL_cpu_power_up - make given CPU in given cluster runnable
+ *
+ * @cpu: CPU number within given cluster
+ * @cluster: cluster number for the CPU
+ *
+ * The identified CPU is brought out of reset.  If the cluster was powered
+ * down then it is brought up as well, taking care not to let the other CPUs
+ * in the cluster run, and ensuring appropriate cluster setup.
+ *
+ * Caller must ensure the appropriate entry vector is initialized with
+ * bL_set_entry_vector() prior to calling this.
+ *
+ * This must be called in a sleepable context.  However, the implementation
+ * is strongly encouraged to return early and let the operation happen
+ * asynchronously, especially when significant delays are expected.
+ *
+ * If the operation cannot be performed then an error code is returned.
+ */
+int bL_cpu_power_up(unsigned int cpu, unsigned int cluster);
+
+/**
+ * bL_cpu_power_down - power the calling CPU down
+ *
+ * The calling CPU is powered down.
+ *
+ * If this CPU is found to be the "last man standing" in the cluster
+ * then the cluster is prepared for power-down too.
+ *
+ * This must be called with interrupts disabled.
+ *
+ * This does not return.  Re-entry in the kernel is expected via
+ * bL_entry_point.
+ */
+void bL_cpu_power_down(void);
+
+/**
+ * bL_cpu_suspend - bring the calling CPU in a suspended state
+ *
+ * @expected_residency: duration in microseconds the CPU is expected
+ *			to remain suspended, or 0 if unknown/infinity.
+ *
+ * The calling CPU is suspended.  The expected residency argument is used
+ * as a hint by the platform specific backend to implement the appropriate
+ * sleep state level according to the knowledge it has on wake-up latency
+ * for the given hardware.
+ *
+ * If this CPU is found to be the "last man standing" in the cluster
+ * then the cluster may be prepared for power-down too, if the expected
+ * residency makes it worthwhile.
+ *
+ * This must be called with interrupts disabled.
+ *
+ * This does not return.  Re-entry in the kernel is expected via
+ * bL_entry_point.
+ */
+void bL_cpu_suspend(u64 expected_residency);
+
+/**
+ * bL_cpu_powered_up - housekeeping work after a CPU has been powered up
+ *
+ * This lets the platform specific backend code perform needed housekeeping
+ * work.  This must be called by the newly activated CPU as soon as it is
+ * fully operational in kernel space, before it enables interrupts.
+ *
+ * If the operation cannot be performed then an error code is returned.
+ */
+int bL_cpu_powered_up(void);
+
+/*
+ * Platform specific methods used in the implementation of the above API.
+ */
+struct bL_platform_power_ops {
+	int (*power_up)(unsigned int cpu, unsigned int cluster);
+	void (*power_down)(void);
+	void (*suspend)(u64);
+	void (*powered_up)(void);
+};
+
+/**
+ * bL_platform_power_register - register platform specific power methods
+ *
+ * @ops: bL_platform_power_ops structure to register
+ *
+ * An error is returned if the registration has been done previously.
+ */
+int __init bL_platform_power_register(const struct bL_platform_power_ops *ops);
+
 #endif /* ! __ASSEMBLY__ */
 #endif
-- 
1.8.0


* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10 12:01   ` Dave Martin
                     ` (4 more replies)
  2013-01-10  0:20 ` [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes Nicolas Pitre
                   ` (16 subsequent siblings)
  19 siblings, 5 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Dave Martin <dave.martin@linaro.org>

This provides helper methods to coordinate between CPUs coming down
and CPUs going up, so that cluster teardown and setup operations are
never performed on the same cluster simultaneously, along with
documentation of the algorithms used.

For use in the power_down() implementation:
  * __bL_cpu_going_down(unsigned int cluster, unsigned int cpu)
  * __bL_outbound_enter_critical(unsigned int cluster)
  * __bL_outbound_leave_critical(unsigned int cluster)
  * __bL_cpu_down(unsigned int cluster, unsigned int cpu)
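
A hedged sketch of how a platform's power_down() method might use these
helpers (my_power_down(), my_last_man_check() and the teardown comments
are placeholders, not part of this patch):

	static void my_power_down(void)
	{
		unsigned int mpidr = read_cpuid_mpidr();
		unsigned int cpu = mpidr & 0xf;
		unsigned int cluster = (mpidr >> 8) & 0xf;
		bool last_man = my_last_man_check(cluster);  /* e.g. a use count */

		__bL_cpu_going_down(cpu, cluster);

		if (last_man && __bL_outbound_enter_critical(cpu, cluster)) {
			/* cluster teardown: flush caches, exit cluster coherency */
			__bL_outbound_leave_critical(cluster, CLUSTER_DOWN);
		} else {
			/* CPU-level teardown only: clean L1, exit SMP coherency */
		}

		__bL_cpu_down(cpu, cluster);

		/* the CPU can now be powered off, typically via WFI */
	}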

The power_up_setup() helper should do platform-specific setup in
preparation for turning the CPU on, such as invalidating local caches
or entering coherency.  It must be assembler for now, since it must
run before the MMU can be switched on.  It is passed the affinity level
which should be initialized.

Because the bL_cluster_sync_struct content is looked up and modified
with the cache enabled or disabled depending on the code path, it is
crucial to always ensure proper cache maintenance to update main memory
right away.  Therefore, any cached write must be followed by a cache clean
operation and any cached read must be preceded by a cache invalidate
operation on the accessed memory.

To avoid races where a reader would invalidate the cache and discard the
latest update from a writer before that writer had a chance to clean it
to RAM, we simply use cache flush (clean+invalidate) operations
everywhere.

Also, in order to prevent a cached writer from interfering with an
adjacent non-cached writer, we ensure each state variable is located in
a separate cache line.

Thanks to Nicolas Pitre and Achin Gupta for the help with this
patch.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
 .../arm/big.LITTLE/cluster-pm-race-avoidance.txt   | 498 +++++++++++++++++++++
 arch/arm/common/bL_entry.c                         | 160 +++++++
 arch/arm/common/bL_head.S                          |  88 +++-
 arch/arm/include/asm/bL_entry.h                    |  62 +++
 4 files changed, 806 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt

diff --git a/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
new file mode 100644
index 0000000000..d6151e0235
--- /dev/null
+++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
@@ -0,0 +1,498 @@
+Big.LITTLE cluster Power-up/power-down race avoidance algorithm
+===============================================================
+
+This file documents the algorithm which is used to coordinate CPU and
+cluster setup and teardown operations and to manage hardware coherency
+controls safely.
+
+The section "Rationale" explains what the algorithm is for and why it is
+needed.  "Basic model" explains general concepts using a simplified view
+of the system.  The other sections explain the actual details of the
+algorithm in use.
+
+
+Rationale
+---------
+
+In a system containing multiple CPUs, it is desirable to have the
+ability to turn off individual CPUs when the system is idle, reducing
+power consumption and thermal dissipation.
+
+In a system containing multiple clusters of CPUs, it is also desirable
+to have the ability to turn off entire clusters.
+
+Turning entire clusters off and on is a risky business, because it
+involves performing potentially destructive operations affecting a group
+of independently running CPUs, while the OS continues to run.  This
+means that we need some coordination in order to ensure that critical
+cluster-level operations are only performed when it is truly safe to do
+so.
+
+Simple locking may not be sufficient to solve this problem, because
+mechanisms like Linux spinlocks may rely on coherency mechanisms which
+are not immediately enabled when a cluster powers up.  Since enabling or
+disabling those mechanisms may itself be a non-atomic operation (such as
+writing some hardware registers and invalidating large caches), other
+methods of coordination are required in order to guarantee safe
+power-down and power-up at the cluster level.
+
+The mechanism presented in this document describes a coherent memory
+based protocol for performing the needed coordination.  It aims to be as
+lightweight as possible, while providing the required safety properties.
+
+
+Basic model
+-----------
+
+Each cluster and CPU is assigned a state, as follows:
+
+	DOWN
+	COMING_UP
+	UP
+	GOING_DOWN
+
+	    +---------> UP ----------+
+	    |                        v
+
+	COMING_UP                GOING_DOWN
+
+	    ^                        |
+	    +--------- DOWN <--------+
+
+
+DOWN:	The CPU or cluster is not coherent, and is either powered off or
+	suspended, or is ready to be powered off or suspended.
+
+COMING_UP: The CPU or cluster has committed to moving to the UP state.
+	It may be part way through the process of initialisation and
+	enabling coherency.
+
+UP:	The CPU or cluster is active and coherent at the hardware
+	level.  A CPU in this state is not necessarily being used
+	actively by the kernel.
+
+GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
+	state.  It may be part way through the process of teardown and
+	coherency exit.
+
+
+Each CPU has one of these states assigned to it at any point in time.
+The CPU states are described in the "CPU state" section, below.
+
+Each cluster is also assigned a state, but it is necessary to split the
+state value into two parts (the "cluster" state and "inbound" state) and
+to introduce additional states in order to avoid races between different
+CPUs in the cluster simultaneously modifying the state.  The cluster-
+level states are described in the "Cluster state" section.
+
+To help distinguish the CPU states from cluster states in this
+discussion, the state names are given a CPU_ prefix for the CPU states,
+and a CLUSTER_ or INBOUND_ prefix for the cluster states.
+
+
+CPU state
+---------
+
+In this algorithm, each individual core in a multi-core processor is
+referred to as a "CPU".  CPUs are assumed to be single-threaded:
+therefore, a CPU can only be doing one thing at a single point in time.
+
+This means that CPUs fit the basic model closely.
+
+The algorithm defines the following states for each CPU in the system:
+
+	CPU_DOWN
+	CPU_COMING_UP
+	CPU_UP
+	CPU_GOING_DOWN
+
+	 cluster setup and
+	CPU setup complete          policy decision
+	      +-----------> CPU_UP ------------+
+	      |                                v
+
+	CPU_COMING_UP                   CPU_GOING_DOWN
+
+	      ^                                |
+	      +----------- CPU_DOWN <----------+
+	 policy decision           CPU teardown complete
+	or hardware event
+
+
+The definitions of the four states correspond closely to the states of
+the basic model.
+
+Transitions between states occur as follows.
+
+A trigger event (spontaneous) means that the CPU can transition to the
+next state as a result of making local progress only, with no
+requirement for any external event to happen.
+
+
+CPU_DOWN:
+
+	A CPU reaches the CPU_DOWN state when it is ready for
+	power-down.  On reaching this state, the CPU will typically
+	power itself down or suspend itself, via a WFI instruction or a
+	firmware call.
+
+	Next state:	CPU_COMING_UP
+	Conditions:	none
+
+	Trigger events:
+
+		a) an explicit hardware power-up operation, resulting
+		   from a policy decision on another CPU;
+
+		b) a hardware event, such as an interrupt.
+
+
+CPU_COMING_UP:
+
+	A CPU cannot start participating in hardware coherency until the
+	cluster is set up and coherent.  If the cluster is not ready,
+	then the CPU will wait in the CPU_COMING_UP state until the
+	cluster has been set up.
+
+	Next state:	CPU_UP
+	Conditions:	The CPU's parent cluster must be in CLUSTER_UP.
+	Trigger events:	Transition of the parent cluster to CLUSTER_UP.
+
+	Refer to the "Cluster state" section for a description of the
+	CLUSTER_UP state.
+
+
+CPU_UP:
+	When a CPU reaches the CPU_UP state, it is safe for the CPU to
+	start participating in local coherency.
+
+	This is done by jumping to the kernel's CPU resume code.
+
+	Note that the definition of this state is slightly different
+	from the basic model definition: CPU_UP does not mean that the
+	CPU is coherent yet, but it does mean that it is safe to resume
+	the kernel.  The kernel handles the rest of the resume
+	procedure, so the remaining steps are not visible as part of the
+	race avoidance algorithm.
+
+	The CPU remains in this state until an explicit policy decision
+	is made to shut down or suspend the CPU.
+
+	Next state:	CPU_GOING_DOWN
+	Conditions:	none
+	Trigger events:	explicit policy decision
+
+
+CPU_GOING_DOWN:
+
+	While in this state, the CPU exits coherency, including any
+	operations required to achieve this (such as cleaning data
+	caches).
+
+	Next state:	CPU_DOWN
+	Conditions:	local CPU teardown complete
+	Trigger events:	(spontaneous)
+
+
+Cluster state
+-------------
+
+A cluster is a group of connected CPUs with some common resources.
+Because a cluster contains multiple CPUs, it can be doing multiple
+things at the same time.  This has some implications.  In particular, a
+CPU can start up while another CPU is tearing the cluster down.
+
+In this discussion, the "outbound side" is the view of the cluster state
+as seen by a CPU tearing the cluster down.  The "inbound side" is the
+view of the cluster state as seen by a CPU setting the cluster up.
+
+In order to enable safe coordination in such situations, it is important
+that a CPU which is setting up the cluster can advertise its state
+independently of the CPU which is tearing down the cluster.  For this
+reason, the cluster state is split into two parts:
+
+	"cluster" state: The global state of the cluster; or the state
+		on the outbound side:
+
+		CLUSTER_DOWN
+		CLUSTER_UP
+		CLUSTER_GOING_DOWN
+
+	"inbound" state: The state of the cluster on the inbound side.
+
+		INBOUND_NOT_COMING_UP
+		INBOUND_COMING_UP
+
+
+	The different pairings of these states result in six possible
+	states for the cluster as a whole:
+
+	                            CLUSTER_UP
+	          +==========> INBOUND_NOT_COMING_UP -------------+
+	          #                                               |
+	                                                          |
+	     CLUSTER_UP     <----+                                |
+	  INBOUND_COMING_UP      |                                v
+
+	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
+	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
+
+	    CLUSTER_DOWN         |                                |
+	  INBOUND_COMING_UP <----+                                |
+	                                                          |
+	          ^                                               |
+	          +===========     CLUSTER_DOWN      <------------+
+	                       INBOUND_NOT_COMING_UP
+
+	Transitions -----> can only be made by the outbound CPU, and
+	only involve changes to the "cluster" state.
+
+	Transitions ===##> can only be made by the inbound CPU, and only
+	involve changes to the "inbound" state, except where there is no
+	further transition possible on the outbound side (i.e., the
+	outbound CPU has put the cluster into the CLUSTER_DOWN state).
+
+	The race avoidance algorithm does not provide a way to determine
+	which exact CPUs within the cluster play these roles.  This must
+	be decided in advance by some other means.  Refer to the section
+	"Last man and first man selection" for more explanation.
+
+
+	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
+	cluster can actually be powered down.
+
+	The parallelism of the inbound and outbound CPUs is observed by
+	the existence of two different paths from CLUSTER_GOING_DOWN/
+	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
+	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
+	COMING_UP in the basic model).  The second path avoids cluster
+	teardown completely.
+
+	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
+	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
+	is trivial and merely resets the state machine ready for the
+	next cycle.
+
+	Details of the allowable transitions follow.
+
+	The next state in each case is notated
+
+		<cluster state>/<inbound state> (<transitioner>)
+
+	where the <transitioner> is the side on which the transition
+	can occur; either the inbound or the outbound side.
+
+
+CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
+
+	Next state:	CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
+	Conditions:	none
+	Trigger events:
+
+		a) an explicit hardware power-up operation, resulting
+		   from a policy decision on another CPU;
+
+		b) a hardware event, such as an interrupt.
+
+
+CLUSTER_DOWN/INBOUND_COMING_UP:
+
+	In this state, an inbound CPU sets up the cluster, including
+	enabling of hardware coherency at the cluster level and any
+	other operations (such as cache invalidation) which are required
+	in order to achieve this.
+
+	The purpose of this state is to do sufficient cluster-level
+	setup to enable other CPUs in the cluster to enter coherency
+	safely.
+
+	Next state:	CLUSTER_UP/INBOUND_COMING_UP (inbound)
+	Conditions:	cluster-level setup and hardware coherency complete
+	Trigger events:	(spontaneous)
+
+
+CLUSTER_UP/INBOUND_COMING_UP:
+
+	Cluster-level setup is complete and hardware coherency is
+	enabled for the cluster.  Other CPUs in the cluster can safely
+	enter coherency.
+
+	This is a transient state, leading immediately to
+	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs in the cluster
+	should treat these two states as equivalent.
+
+	Next state:	CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
+	Conditions:	none
+	Trigger events:	(spontaneous)
+
+
+CLUSTER_UP/INBOUND_NOT_COMING_UP:
+
+	Cluster-level setup is complete and hardware coherency is
+	enabled for the cluster.  Other CPUs in the cluster can safely
+	enter coherency.
+
+	The cluster will remain in this state until a policy decision is
+	made to power the cluster down.
+
+	Next state:	CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
+	Conditions:	none
+	Trigger events:	policy decision to power down the cluster
+
+
+CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
+
+	An outbound CPU is tearing the cluster down.  The selected CPU
+	must wait in this state until all CPUs in the cluster are in the
+	CPU_DOWN state.
+
+	When all CPUs are in the CPU_DOWN state, the cluster can be torn
+	down, for example by cleaning data caches and exiting
+	cluster-level coherency.
+
+	To avoid unnecessary teardown operations, the outbound CPU
+	should check the inbound cluster state for asynchronous
+	transitions to INBOUND_COMING_UP.  Alternatively, individual
+	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
+
+
+	Next states:
+
+	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
+		Conditions:	cluster torn down and ready to power off
+		Trigger events:	(spontaneous)
+
+	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
+		Conditions:	none
+		Trigger events:
+
+			a) an explicit hardware power-up operation,
+			   resulting from a policy decision on another
+			   CPU;
+
+			b) a hardware event, such as an interrupt.
+
+
+CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
+
+	The cluster is (or was) being torn down, but another CPU has
+	come online in the meantime and is trying to set up the cluster
+	again.
+
+	If the outbound CPU observes this state, it has two choices:
+
+		a) back out of teardown, restoring the cluster to the
+		   CLUSTER_UP state;
+
+		b) finish tearing the cluster down and put the cluster
+		   in the CLUSTER_DOWN state; the inbound CPU will
+		   set up the cluster again from there.
+
+	Choice (a) permits the removal of some latency by avoiding
+	unnecessary teardown and setup operations in situations where
+	the cluster is not really going to be powered down.
+
+
+	Next states:
+
+	CLUSTER_UP/INBOUND_COMING_UP (outbound)
+		Conditions:	cluster-level setup and hardware
+				coherency complete
+		Trigger events:	(spontaneous)
+
+	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
+		Conditions:	cluster torn down and ready to power off
+		Trigger events:	(spontaneous)
+
+
+Last man and First man selection
+--------------------------------
+
+The CPU which performs cluster tear-down operations on the outbound side
+is commonly referred to as the "last man".
+
+The CPU which performs cluster setup on the inbound side is commonly
+referred to as the "first man".
+
+The race avoidance algorithm documented above does not provide a
+mechanism to choose which CPUs should play these roles.
+
+
+Last man:
+
+When shutting down the cluster, all the CPUs involved are initially
+executing Linux and hence coherent.  Therefore, ordinary spinlocks can
+be used to select a last man safely, before the CPUs become
+non-coherent.
+
+
+First man:
+
+Because CPUs may power up asynchronously in response to external wake-up
+events, a dynamic mechanism is needed to make sure that only one CPU
+attempts to play the first man role and do the cluster-level
+initialisation: any other CPUs must wait for this to complete before
+proceeding.
+
+Cluster-level initialisation may involve actions such as configuring
+coherency controls in the bus fabric.
+
+The current implementation in bL_head.S uses a separate mutual exclusion
+mechanism to do this arbitration.  This mechanism is documented in
+detail in vlocks.txt.
+
+
+Features and Limitations
+------------------------
+
+Implementation:
+
+	The current ARM-based implementation is split between
+	arch/arm/common/bL_head.S (low-level inbound CPU operations) and
+	arch/arm/common/bL_entry.c (everything else):
+
+	__bL_cpu_going_down() signals the transition of a CPU to the
+		CPU_GOING_DOWN state.
+
+	__bL_cpu_down() signals the transition of a CPU to the CPU_DOWN
+		state.
+
+	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
+		low-level power-up code in bL_head.S.  This could
+		involve CPU-specific setup code, but in the current
+		implementation it does not.
+
+	__bL_outbound_enter_critical() and __bL_outbound_leave_critical()
+		handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
+		and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
+		the case of an aborted cluster power-down).
+
+		These functions are more complex than the __bL_cpu_*()
+		functions due to the extra inter-CPU coordination which
+		is needed for safe transitions at the cluster level.
+
+	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
+		the low-level power-up code in bL_head.S.  This
+		typically involves platform-specific setup code,
+		provided by the platform-specific power_up_setup
+		function registered via bL_cluster_sync_init.
+
+Deep topologies:
+
+	As currently described and implemented, the algorithm does not
+	support CPU topologies involving more than two levels (i.e.,
+	clusters of clusters are not supported).  The algorithm could be
+	extended by replicating the cluster-level states for the
+	additional topological levels, and modifying the transition
+	rules for the intermediate (non-outermost) cluster levels.
+
+
+Colophon
+--------
+
+Originally created and documented by Dave Martin for Linaro Limited, in
+collaboration with Nicolas Pitre and Achin Gupta.
+
+Copyright (C) 2012  Linaro Limited
+Distributed under the terms of Version 2 of the GNU General Public
+License, as defined in linux/COPYING.
diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 41de0622de..1ea4ec9df0 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
 		platform_ops->powered_up();
 	return 0;
 }
+
+struct bL_sync_struct bL_sync;
+
+static void __sync_range(volatile void *p, size_t size)
+{
+	char *_p = (char *)p;
+
+	__cpuc_flush_dcache_area(_p, size);
+	outer_flush_range(__pa(_p), __pa(_p + size));
+	outer_sync();
+}
+
+#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
+
+/*
+ * __bL_cpu_going_down: Indicates that the cpu is being torn down.
+ *    This must be called at the point of committing to teardown of a CPU.
+ *    The CPU cache (SCTLR.C bit) is expected to still be active.
+ */
+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
+{
+	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
+	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+}
+
+/*
+ * __bL_cpu_down: Indicates that cpu teardown is complete and that the
+ *    cluster can be torn down without disrupting this CPU.
+ *    To avoid deadlocks, this must be called before a CPU is powered down.
+ *    The CPU cache (SCTLR.C bit) is expected to be off.
+ */
+void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
+{
+	dsb();
+	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
+	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+	sev();
+}
+
+/*
+ * __bL_outbound_leave_critical: Leave the cluster teardown critical section.
+ * @state: the final state of the cluster:
+ *     CLUSTER_UP: no destructive teardown was done and the cluster has been
+ *         restored to the previous state (CPU cache still active); or
+ *     CLUSTER_DOWN: the cluster has been torn-down, ready for power-off
+ *         (CPU cache disabled).
+ */
+void __bL_outbound_leave_critical(unsigned int cluster, int state)
+{
+	dsb();
+	bL_sync.clusters[cluster].cluster = state;
+	sync_mem(&bL_sync.clusters[cluster].cluster);
+	sev();
+}
+
+/*
+ * __bL_outbound_enter_critical: Enter the cluster teardown critical section.
+ * This function should be called by the last man, after local CPU teardown
+ * is complete.  CPU cache expected to be active.
+ *
+ * Returns:
+ *     false: the critical section was not entered because an inbound CPU was
+ *         observed, or the cluster is already being set up;
+ *     true: the critical section was entered: it is now safe to tear down the
+ *         cluster.
+ */
+bool __bL_outbound_enter_critical(unsigned int cpu, unsigned int cluster)
+{
+	unsigned int i;
+	struct bL_cluster_sync_struct *c = &bL_sync.clusters[cluster];
+
+	/* Warn inbound CPUs that the cluster is being torn down: */
+	c->cluster = CLUSTER_GOING_DOWN;
+	sync_mem(&c->cluster);
+
+	/* Back out if the inbound cluster is already in the critical region: */
+	sync_mem(&c->inbound);
+	if (c->inbound == INBOUND_COMING_UP)
+		goto abort;
+
+	/*
+	 * Wait for all CPUs to get out of the GOING_DOWN state, so that local
+	 * teardown is complete on each CPU before tearing down the cluster.
+	 *
+	 * If any CPU has been woken up again from the DOWN state, then we
+	 * shouldn't be taking the cluster down at all: abort in that case.
+	 */
+	sync_mem(&c->cpus);
+	for (i = 0; i < BL_CPUS_PER_CLUSTER; i++) {
+		int cpustate;
+
+		if (i == cpu)
+			continue;
+
+		while (1) {
+			cpustate = c->cpus[i].cpu;
+			if (cpustate != CPU_GOING_DOWN)
+				break;
+
+			wfe();
+			sync_mem(&c->cpus[i].cpu);
+		}
+
+		switch (cpustate) {
+		case CPU_DOWN:
+			continue;
+
+		default:
+			goto abort;
+		}
+	}
+
+	dsb();
+
+	return true;
+
+abort:
+	__bL_outbound_leave_critical(cluster, CLUSTER_UP);
+	return false;
+}
+
+int __bL_cluster_state(unsigned int cluster)
+{
+	sync_mem(&bL_sync.clusters[cluster].cluster);
+	return bL_sync.clusters[cluster].cluster;
+}
+
+extern unsigned long bL_power_up_setup_phys;
+
+int __init bL_cluster_sync_init(void (*power_up_setup)(void))
+{
+	unsigned int i, j, mpidr, this_cluster;
+
+	BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
+	BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
+
+	/*
+	 * Set initial CPU and cluster states.
+	 * Only one cluster is assumed to be active at this point.
+	 */
+	for (i = 0; i < BL_NR_CLUSTERS; i++) {
+		bL_sync.clusters[i].cluster = CLUSTER_DOWN;
+		bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
+		for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
+			bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
+	}
+	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
+	this_cluster = (mpidr >> 8) & 0xf;
+	for_each_online_cpu(i)
+		bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
+	bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
+	sync_mem(&bL_sync);
+
+	if (power_up_setup) {
+		bL_power_up_setup_phys = virt_to_phys(power_up_setup);
+		sync_mem(&bL_power_up_setup_phys);
+	}
+
+	return 0;
+}
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index 9d351f2b4c..f7a64ac127 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -7,11 +7,19 @@
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
+ *
+ *
+ * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
+ * for details of the synchronisation algorithms used here.
  */
 
 #include <linux/linkage.h>
 #include <asm/bL_entry.h>
 
+.if BL_SYNC_CLUSTER_CPUS
+.error "cpus must be the first member of struct bL_cluster_sync_struct"
+.endif
+
 	.macro	pr_dbg	cpu, string
 #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
 	b	1901f
@@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
 2:	pr_dbg	r4, "kernel bL_entry_point\n"
 
 	/*
-	 * MMU is off so we need to get to bL_entry_vectors in a
+	 * MMU is off so we need to get to various variables in a
 	 * position independent way.
 	 */
 	adr	r5, 3f
-	ldr	r6, [r5]
+	ldmia	r5, {r6, r7, r8}
 	add	r6, r5, r6			@ r6 = bL_entry_vectors
+	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
+	add	r8, r5, r8			@ r8 = bL_sync
+
+	mov	r0, #BL_SYNC_CLUSTER_SIZE
+	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
+
+	@ Signal that this CPU is coming UP:
+	mov	r0, #CPU_COMING_UP
+	mov	r5, #BL_SYNC_CPU_SIZE
+	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
+	strb	r0, [r5]
+
+	dsb
+
+	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
+	@ state, because there is at least one active CPU (this CPU).
+
+	@ Check if the cluster has been set up yet:
+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	cmp	r0, #CLUSTER_UP
+	beq	cluster_already_up
+
+	@ Signal that the cluster is being brought up:
+	mov	r0, #INBOUND_COMING_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
+
+	dsb
+
+	@ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
+	@ point onwards will observe INBOUND_COMING_UP and abort.
+
+	@ Wait for any previously-pending cluster teardown operations to abort
+	@ or complete:
+cluster_teardown_wait:
+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	cmp	r0, #CLUSTER_GOING_DOWN
+	wfeeq
+	beq	cluster_teardown_wait
+
+	@ power_up_setup is responsible for setting up the cluster:
+
+	cmp	r7, #0
+	mov	r0, #1		@ second (cluster) affinity level
+	blxne	r7		@ Call power_up_setup if defined
+
+	@ Leave the cluster setup critical section:
+
+	dsb
+	mov	r0, #INBOUND_NOT_COMING_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
+	mov	r0, #CLUSTER_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	dsb
+	sev
+
+cluster_already_up:
+	@ If a platform-specific CPU setup hook is needed, it is
+	@ called from here.
+
+	cmp	r7, #0
+	mov	r0, #0		@ first (CPU) affinity level
+	blxne	r7		@ Call power_up_setup if defined
+
+	@ Mark the CPU as up:
+
+	dsb
+	mov	r0, #CPU_UP
+	strb	r0, [r5]
+	dsb
+	sev
 
 bL_entry_gated:
 	ldr	r5, [r6, r4, lsl #2]		@ r5 = CPU entry vector
@@ -70,6 +148,8 @@ bL_entry_gated:
 	.align	2
 
 3:	.word	bL_entry_vectors - .
+	.word	bL_power_up_setup_phys - 3b
+	.word	bL_sync - 3b
 
 ENDPROC(bL_entry_point)
 
@@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
 	.type	bL_entry_vectors, #object
 ENTRY(bL_entry_vectors)
 	.space	4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
+
+	.type	bL_power_up_setup_phys, #object
+ENTRY(bL_power_up_setup_phys)
+	.space  4		@ set by bL_cluster_sync_init()
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
index 942d7f9f19..167394d9a0 100644
--- a/arch/arm/include/asm/bL_entry.h
+++ b/arch/arm/include/asm/bL_entry.h
@@ -15,8 +15,37 @@
 #define BL_CPUS_PER_CLUSTER	4
 #define BL_NR_CLUSTERS		2
 
+/* Definitions for bL_cluster_sync_struct */
+#define CPU_DOWN		0x11
+#define CPU_COMING_UP		0x12
+#define CPU_UP			0x13
+#define CPU_GOING_DOWN		0x14
+
+#define CLUSTER_DOWN		0x21
+#define CLUSTER_UP		0x22
+#define CLUSTER_GOING_DOWN	0x23
+
+#define INBOUND_NOT_COMING_UP	0x31
+#define INBOUND_COMING_UP	0x32
+
+/* This is a complete guess. */
+#define __CACHE_WRITEBACK_ORDER	6
+#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
+
+/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
+#define BL_SYNC_CLUSTER_CPUS	0
+#define BL_SYNC_CPU_SIZE	__CACHE_WRITEBACK_GRANULE
+#define BL_SYNC_CLUSTER_CLUSTER \
+	(BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
+#define BL_SYNC_CLUSTER_INBOUND \
+	(BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
+#define BL_SYNC_CLUSTER_SIZE \
+	(BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
+
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 /*
  * Platform specific code should use this symbol to set up secondary
  * entry location for processors to use when released from reset.
@@ -123,5 +152,38 @@ struct bL_platform_power_ops {
  */
 int __init bL_platform_power_register(const struct bL_platform_power_ops *ops);
 
+/* Synchronisation structures for coordinating safe cluster setup/teardown: */
+
+/*
+ * When modifying this structure, make sure you update the BL_SYNC_ defines
+ * to match.
+ */
+struct bL_cluster_sync_struct {
+	/* individual CPU states */
+	struct {
+		volatile s8 cpu __aligned(__CACHE_WRITEBACK_GRANULE);
+	} cpus[BL_CPUS_PER_CLUSTER];
+
+	/* cluster state */
+	volatile s8 cluster __aligned(__CACHE_WRITEBACK_GRANULE);
+
+	/* inbound-side state */
+	volatile s8 inbound __aligned(__CACHE_WRITEBACK_GRANULE);
+};
+
+struct bL_sync_struct {
+	struct bL_cluster_sync_struct clusters[BL_NR_CLUSTERS];
+};
+
+extern unsigned long bL_sync_phys;	/* physical address of *bL_sync */
+
+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster);
+void __bL_cpu_down(unsigned int cpu, unsigned int cluster);
+void __bL_outbound_leave_critical(unsigned int cluster, int state);
+bool __bL_outbound_enter_critical(unsigned int this_cpu, unsigned int cluster);
+int __bL_cluster_state(unsigned int cluster);
+
+int __init bL_cluster_sync_init(void (*power_up_setup)(void));
+
 #endif /* ! __ASSEMBLY__ */
 #endif
-- 
1.8.0


* [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (2 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10 23:18   ` Will Deacon
  2013-01-10  0:20 ` [PATCH 05/16] ARM: bL_head: vlock-based first man election Nicolas Pitre
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Dave Martin <dave.martin@linaro.org>

This patch adds a simple low-level voting mutex implementation
to be used to arbitrate during first man selection when no load/store
exclusive instructions are usable.

For want of a better name, these are called "vlocks".  (I was
tempted to call them ballot locks, but "block" is way too confusing
an abbreviation...)

There is no function to wait for the lock to be released, and no
vlock_lock() function since we don't need these at the moment.
These could straightforwardly be added if vlocks get used for other
purposes.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
---
 Documentation/arm/big.LITTLE/vlocks.txt | 211 ++++++++++++++++++++++++++++++++
 arch/arm/common/vlock.S                 | 108 ++++++++++++++++
 arch/arm/common/vlock.h                 |  43 +++++++
 3 files changed, 362 insertions(+)
 create mode 100644 Documentation/arm/big.LITTLE/vlocks.txt
 create mode 100644 arch/arm/common/vlock.S
 create mode 100644 arch/arm/common/vlock.h

diff --git a/Documentation/arm/big.LITTLE/vlocks.txt b/Documentation/arm/big.LITTLE/vlocks.txt
new file mode 100644
index 0000000000..90672ddc6a
--- /dev/null
+++ b/Documentation/arm/big.LITTLE/vlocks.txt
@@ -0,0 +1,211 @@
+vlocks for Bare-Metal Mutual Exclusion
+======================================
+
+Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
+mechanism, with reasonable but minimal requirements on the memory
+system.
+
+These are intended to be used to coordinate critical activity among CPUs
+which are otherwise non-coherent, in situations where the hardware
+provides no other mechanism to support this and ordinary spinlocks
+cannot be used.
+
+
+vlocks make use of the atomicity provided by the memory system for
+writes to a single memory location.  To arbitrate, every CPU "votes for
+itself", by storing a unique number to a common memory location.  The
+final value seen in that memory location when all the votes have been
+cast identifies the winner.
+
+In order to make sure that the election produces an unambiguous result
+in finite time, a CPU will only enter the election in the first place if
+no winner has been chosen and the election does not appear to have
+started yet.
+
+
+Algorithm
+---------
+
+The easiest way to explain the vlocks algorithm is with some pseudo-code:
+
+
+	int currently_voting[NR_CPUS] = { 0, };
+	int last_vote = -1; /* no votes yet */
+
+	bool vlock_trylock(int this_cpu)
+	{
+		/* signal our desire to vote */
+		currently_voting[this_cpu] = 1;
+		if (last_vote != -1) {
+			/* someone already volunteered himself */
+			currently_voting[this_cpu] = 0;
+			return false; /* not ourself */
+		}
+
+		/* let's suggest ourself */
+		last_vote = this_cpu;
+		currently_voting[this_cpu] = 0;
+
+		/* then wait until everyone else is done voting */
+		for_each_cpu(i) {
+			while (currently_voting[i] != 0)
+				/* wait */;
+		}
+
+		/* result */
+		if (last_vote == this_cpu)
+			return true; /* we won */
+		return false;
+	}
+
+	void vlock_unlock(void)
+	{
+		last_vote = -1;
+	}
+
+
+The currently_voting[] array provides a way for the CPUs to determine
+whether an election is in progress, and plays a role analogous to the
+"entering" array in Lamport's bakery algorithm [1].
+
+However, once the election has started, the underlying memory system
+atomicity is used to pick the winner.  This avoids the need for a static
+priority rule to act as a tie-breaker, or any counters which could
+overflow.
+
+As long as the last_vote variable is globally visible to all CPUs, it
+will contain only one value that won't change once every CPU has cleared
+its currently_voting flag.
+
+
+Features and limitations
+------------------------
+
+ * vlocks are not intended to be fair.  In the contended case, it is the
+   _last_ CPU which attempts to get the lock which will be most likely
+   to win.
+
+   vlocks are therefore best suited to situations where it is necessary
+   to pick a unique winner, but it does not matter which CPU actually
+   wins.
+
+ * Like other similar mechanisms, vlocks will not scale well to a large
+   number of CPUs.
+
+   vlocks can be cascaded in a voting hierarchy to permit better scaling
+   if necessary, as in the following hypothetical example for 4096 CPUs:
+
+	/* first level: local election */
+	my_town = towns[(this_cpu >> 4) & 0xf];
+	I_won = vlock_trylock(my_town, this_cpu & 0xf);
+	if (I_won) {
+		/* we won the town election, let's go for the state */
+		my_state = states[(this_cpu >> 8) & 0xf];
+		I_won = vlock_lock(my_state, this_cpu & 0xf);
+		if (I_won) {
+			/* and so on */
+			I_won = vlock_lock(the_whole_country, this_cpu & 0xf);
+			if (I_won) {
+				/* ... */
+			}
+			vlock_unlock(the_whole_country);
+		}
+		vlock_unlock(my_state);
+	}
+	vlock_unlock(my_town);
+
+
+ARM implementation
+------------------
+
+The current ARM implementation [2] contains some optimisations beyond
+the basic algorithm:
+
+ * By packing the members of the currently_voting array close together,
+   we can read the whole array in one transaction (providing the number
+   of CPUs potentially contending the lock is small enough).  This
+   reduces the number of round-trips required to external memory.
+
+   In the ARM implementation, this means that we can use a single load
+   and comparison:
+
+	LDR	Rt, [Rn]
+	CMP	Rt, #0
+
+   ...in place of code equivalent to:
+
+	LDRB	Rt, [Rn]
+	CMP	Rt, #0
+	LDRBEQ	Rt, [Rn, #1]
+	CMPEQ	Rt, #0
+	LDRBEQ	Rt, [Rn, #2]
+	CMPEQ	Rt, #0
+	LDRBEQ	Rt, [Rn, #3]
+	CMPEQ	Rt, #0
+
+   This cuts down on the fast-path latency, as well as potentially
+   reducing bus contention in contended cases.
+
+   The optimisation relies on the fact that the ARM memory system
+   guarantees coherency between overlapping memory accesses of
+   different sizes, similarly to many other architectures.  Note that
+   we do not care which element of currently_voting appears in which
+   bits of Rt, so there is no need to worry about endianness in this
+   optimisation.
+
+   If there are too many CPUs to read the currently_voting array in
+   one transaction, then multiple transactions are still required.  The
+   implementation uses a simple loop of word-sized loads for this
+   case.  The number of transactions is still fewer than would be
+   required if bytes were loaded individually.
+
+
+   In principle, we could aggregate further by using LDRD or LDM, but
+   to keep the code simple this was not attempted in the initial
+   implementation.
+
+
+ * vlocks are currently only used to coordinate between CPUs which are
+   unable to enable their caches yet.  This means that the
+   implementation removes many of the barriers which would be required
+   when executing the algorithm in cached memory.
+
+   Packing of the currently_voting array does not work with cached
+   memory unless all CPUs contending the lock are cache-coherent, due
+   to cache writebacks from one CPU clobbering values written by other
+   CPUs.  (Though if all the CPUs are cache-coherent, you should
+   probably be using proper spinlocks instead anyway).
+
+
+ * The "no votes yet" value used for the last_vote variable is 0 (not
+   -1 as in the pseudocode).  This allows statically-allocated vlocks
+   to be implicitly initialised to an unlocked state simply by putting
+   them in .bss.
+
+   An offset is added to each CPU's ID for the purpose of setting this
+   variable, so that no CPU uses the value 0 for its ID.
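+
+   As a concrete sketch based on vlock.h and vlock.S (both shown below),
+   the value stored in the owner byte is the voter's CPU index plus
+   VLOCK_VOTING_OFFSET, which is also the byte offset of that CPU's
+   voting flag within the lock structure:
+
+	/* from vlock.h */
+	#define VLOCK_OWNER_OFFSET	0
+	#define VLOCK_VOTING_OFFSET	4
+	#define VLOCK_OWNER_NONE	0	/* matches a zero-filled .bss */
+
+	/* what vlock_trylock() writes to the owner byte for a given CPU: */
+	owner = cpu + VLOCK_VOTING_OFFSET;	/* never 0 for any CPU */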
+
+
+Colophon
+--------
+
+Originally created and documented by Dave Martin for Linaro Limited, for
+use in ARM-based big.LITTLE platforms, with review and input gratefully
+received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
+grabbing most of this text out of the relevant mail thread and writing
+up the pseudocode.
+
+Copyright (C) 2012  Linaro Limited
+Distributed under the terms of Version 2 of the GNU General Public
+License, as defined in linux/COPYING.
+
+
+References
+----------
+
+[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
+    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
+
+    http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
+
+[2] linux/arch/arm/common/vlock.S, www.kernel.org.
diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
new file mode 100644
index 0000000000..0a1ee3a7f5
--- /dev/null
+++ b/arch/arm/common/vlock.S
@@ -0,0 +1,108 @@
+/*
+ * vlock.S - simple voting lock implementation for ARM
+ *
+ * Created by:	Dave Martin, 2012-08-16
+ * Copyright:	(C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ *
+ * This algorithm is described in more detail in
+ * Documentation/arm/big.LITTLE/vlocks.txt.
+ */
+
+#include <linux/linkage.h>
+#include "vlock.h"
+
+#if VLOCK_VOTING_SIZE > 4
+#define FEW(x...)
+#define MANY(x...) x
+#else
+#define FEW(x...) x
+#define MANY(x...)
+#endif
+
+@ voting lock for first-man coordination
+
+.macro voting_begin rbase:req, rcpu:req, rscratch:req
+	mov	\rscratch, #1
+	strb	\rscratch, [\rbase, \rcpu]
+	dsb
+.endm
+
+.macro voting_end rbase:req, rcpu:req, rscratch:req
+	mov	\rscratch, #0
+	strb	\rscratch, [\rbase, \rcpu]
+	dsb
+	sev
+.endm
+
+/*
+ * The vlock structure must reside in Strongly-Ordered or Device memory.
+ * This implementation deliberately eliminates most of the barriers which
+ * would be required for other memory types, and assumes that independent
+ * writes to neighbouring locations within a cacheline do not interfere
+ * with one another.
+ */
+
+@ r0: lock structure base
+@ r1: CPU ID (0-based index within cluster)
+ENTRY(vlock_trylock)
+	add	r1, r1, #VLOCK_VOTING_OFFSET
+
+	voting_begin	r0, r1, r2
+
+	ldrb	r2, [r0, #VLOCK_OWNER_OFFSET]	@ check whether lock is held
+	cmp	r2, #VLOCK_OWNER_NONE
+	bne	trylock_fail			@ fail if so
+
+	strb	r1, [r0, #VLOCK_OWNER_OFFSET]	@ submit my vote
+
+	voting_end	r0, r1, r2
+
+	@ Wait for the current round of voting to finish:
+
+ MANY(	mov	r3, #VLOCK_VOTING_OFFSET			)
+0:
+ MANY(	ldr	r2, [r0, r3]					)
+ FEW(	ldr	r2, [r0, #VLOCK_VOTING_OFFSET]			)
+	cmp	r2, #0
+	wfene
+	bne	0b
+ MANY(	add	r3, r3, #4					)
+ MANY(	cmp	r3, #VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE	)
+ MANY(	bne	0b						)
+
+	@ Check who won:
+
+	ldrb	r2, [r0, #VLOCK_OWNER_OFFSET]
+	eor	r0, r1, r2			@ zero if I won, else nonzero
+	bx	lr
+
+trylock_fail:
+	voting_end	r0, r1, r2
+	mov	r0, #1				@ nonzero indicates that I lost
+	bx	lr
+ENDPROC(vlock_trylock)
+
+@ r0: lock structure base
+ENTRY(vlock_unlock)
+	mov	r1, #VLOCK_OWNER_NONE
+	dsb
+	strb	r1, [r0, #VLOCK_OWNER_OFFSET]
+	dsb
+	sev
+	bx	lr
+ENDPROC(vlock_unlock)
diff --git a/arch/arm/common/vlock.h b/arch/arm/common/vlock.h
new file mode 100644
index 0000000000..94c29a6caf
--- /dev/null
+++ b/arch/arm/common/vlock.h
@@ -0,0 +1,43 @@
+/*
+ * vlock.h - simple voting lock implementation
+ *
+ * Created by:	Dave Martin, 2012-08-16
+ * Copyright:	(C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#ifndef __VLOCK_H
+#define __VLOCK_H
+
+#include <asm/bL_entry.h>
+
+#define VLOCK_OWNER_OFFSET	0
+#define VLOCK_VOTING_OFFSET	4
+#define VLOCK_VOTING_SIZE	((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)
+#define VLOCK_SIZE		(VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
+#define VLOCK_OWNER_NONE	0
+
+#ifndef __ASSEMBLY__
+
+struct vlock {
+	char data[VLOCK_SIZE];
+};
+
+int vlock_trylock(struct vlock *lock, unsigned int owner);
+void vlock_unlock(struct vlock *lock);
+
+#endif /* __ASSEMBLY__ */
+#endif /* ! __VLOCK_H */
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 05/16] ARM: bL_head: vlock-based first man election
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (3 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Dave Martin <dave.martin@linaro.org>

Instead of requiring the first man to be elected in advance (which
can be suboptimal in some situations), this patch uses a per-
cluster mutex to co-ordinate selection of the first man.

This should also make it more feasible to reuse this code path for
asynchronous cluster resume (as in CPUidle scenarios).

Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
---
 arch/arm/common/Makefile  |  2 +-
 arch/arm/common/bL_head.S | 91 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index 50880c494f..894c2ddf9b 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
 obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
 obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
 obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o
-obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o
+obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o vlock.o
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index f7a64ac127..e70dd432e8 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -16,6 +16,8 @@
 #include <linux/linkage.h>
 #include <asm/bL_entry.h>
 
+#include "vlock.h"
+
 .if BL_SYNC_CLUSTER_CPUS
 .error "cpus must be the first member of struct bL_cluster_sync_struct"
 .endif
@@ -64,10 +66,11 @@ ENTRY(bL_entry_point)
 	 * position independent way.
 	 */
 	adr	r5, 3f
-	ldmia	r5, {r6, r7, r8}
+	ldmia	r5, {r6, r7, r8, r11}
 	add	r6, r5, r6			@ r6 = bL_entry_vectors
 	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
 	add	r8, r5, r8			@ r8 = bL_sync
+	add	r11, r5, r11			@ r11 = first_man_locks
 
 	mov	r0, #BL_SYNC_CLUSTER_SIZE
 	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
@@ -83,11 +86,25 @@ ENTRY(bL_entry_point)
 	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
 	@ state, because there is at least one active CPU (this CPU).
 
-	@ Check if the cluster has been set up yet:
+	mov	r0, #.Lvlock_size
+	mla	r11, r0, r10, r11		@ r11 = cluster first man lock
+	mov	r0, r11
+	mov	r1, r9				@ cpu
+	bl	vlock_trylock
+
+	cmp	r0, #0				@ failed to get the lock?
+	bne	cluster_setup_wait		@ wait for cluster setup if so
+
 	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
-	cmp	r0, #CLUSTER_UP
-	beq	cluster_already_up
+	cmp	r0, #CLUSTER_UP			@ cluster already up?
+	bne	cluster_setup			@ if not, set up the cluster
+
+	@ Otherwise, release the first man lock and skip setup:
+	mov	r0, r11
+	bl	vlock_unlock
+	b	cluster_setup_complete
 
+cluster_setup:
 	@ Signal that the cluster is being brought up:
 	mov	r0, #INBOUND_COMING_UP
 	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
@@ -102,26 +119,47 @@ ENTRY(bL_entry_point)
 cluster_teardown_wait:
 	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	cmp	r0, #CLUSTER_GOING_DOWN
-	wfeeq
-	beq	cluster_teardown_wait
+	bne	first_man_setup
+	wfe
+	b	cluster_teardown_wait
+
+first_man_setup:
+	@ If the outbound gave up before teardown started, skip cluster setup:
 
-	@ power_up_setup is responsible for setting up the cluster:
+	cmp	r0, #CLUSTER_UP
+	beq	cluster_setup_leave
+
+	@ power_up_setup is now responsible for setting up the cluster:
 
 	cmp	r7, #0
 	mov	r0, #1		@ second (cluster) affinity level
 	blxne	r7		@ Call power_up_setup if defined
 
+	dsb
+	mov	r0, #CLUSTER_UP
+	strb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+
+cluster_setup_leave:
 	@ Leave the cluster setup critical section:
 
-	dsb
 	mov	r0, #INBOUND_NOT_COMING_UP
 	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
-	mov	r0, #CLUSTER_UP
-	strb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
 	dsb
 	sev
 
-cluster_already_up:
+	mov	r0, r11
+	bl	vlock_unlock
+	b	cluster_setup_complete
+
+	@ In the contended case, non-first men wait here for cluster setup
+	@ to complete:
+cluster_setup_wait:
+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+	cmp	r0, #CLUSTER_UP
+	wfene
+	bne	cluster_setup_wait
+
+cluster_setup_complete:
 	@ If a platform-specific CPU setup hook is needed, it is
 	@ called from here.
 
@@ -150,11 +188,40 @@ bL_entry_gated:
 3:	.word	bL_entry_vectors - .
 	.word	bL_power_up_setup_phys - 3b
 	.word	bL_sync - 3b
+	.word	first_man_locks - 3b
 
 ENDPROC(bL_entry_point)
 
 	.bss
-	.align	5
+
+	@ Magic to size and align the first-man vlock structures
+	@ so that each does not cross a 1KB boundary.
+	@ We also must ensure that none of these shares a cacheline with
+	@ any data which might be accessed through the cache.
+
+	.equ	.Log2, 0
+	.rept	11
+		.if (1 << .Log2) < VLOCK_SIZE
+			.equ .Log2, .Log2 + 1
+		.endif
+	.endr
+	.if	.Log2 > 10
+		.error "vlock struct is too large for guaranteed barrierless access ordering"
+	.endif
+	.equ	.Lvlock_size, 1 << .Log2
+
+	@ The presence of two .align directives here is deliberate: we must
+	@ align to whichever of the two boundaries is larger:
+	.align	__CACHE_WRITEBACK_ORDER
+	.align	.Log2
+first_man_locks:
+	.rept	BL_NR_CLUSTERS
+	.space	.Lvlock_size
+	.endr
+	.size	first_man_locks, . - first_man_locks
+	.type	first_man_locks, #object
+
+	.align	__CACHE_WRITEBACK_ORDER
 
 	.type	bL_entry_vectors, #object
 ENTRY(bL_entry_vectors)
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (4 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 05/16] ARM: bL_head: vlock-based first man election Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-11 18:02   ` Santosh Shilimkar
  2013-01-14 16:35   ` Will Deacon
  2013-01-10  0:20 ` [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU Nicolas Pitre
                   ` (13 subsequent siblings)
  19 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

Now that the b.L power API is in place, we can use it for SMP secondary
bringup and CPU hotplug in a generic fashion.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/common/Makefile     |  2 +-
 arch/arm/common/bL_platsmp.c | 79 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm/common/bL_platsmp.c

diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index 894c2ddf9b..59b36db7cc 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
 obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
 obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
 obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o
-obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o vlock.o
+obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o bL_platsmp.o vlock.o
diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
new file mode 100644
index 0000000000..0acb9f4685
--- /dev/null
+++ b/arch/arm/common/bL_platsmp.c
@@ -0,0 +1,79 @@
+/*
+ * linux/arch/arm/common/bL_platsmp.c
+ *
+ * Created by:  Nicolas Pitre, November 2012
+ * Copyright:   (C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Code to handle secondary CPU bringup and hotplug for the bL power API.
+ */
+
+#include <linux/init.h>
+#include <linux/smp.h>
+
+#include <asm/bL_entry.h>
+#include <asm/smp_plat.h>
+#include <asm/hardware/gic.h>
+
+static void __init simple_smp_init_cpus(void)
+{
+	set_smp_cross_call(gic_raise_softirq);
+}
+
+static int __cpuinit bL_boot_secondary(unsigned int cpu, struct task_struct *idle)
+{
+	unsigned int pcpu, pcluster, ret;
+	extern void secondary_startup(void);
+
+	pcpu = cpu_logical_map(cpu) & 0xff;
+	pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
+	pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
+		 __func__, cpu, pcpu, pcluster);
+
+	bL_set_entry_vector(pcpu, pcluster, NULL);
+	ret = bL_cpu_power_up(pcpu, pcluster);
+	if (ret)
+		return ret;
+	bL_set_entry_vector(pcpu, pcluster, secondary_startup);
+	gic_raise_softirq(cpumask_of(cpu), 0);
+	sev();
+	return 0;
+}
+
+static void __cpuinit bL_secondary_init(unsigned int cpu)
+{
+	bL_cpu_powered_up();
+	gic_secondary_init(0);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+
+static int bL_cpu_disable(unsigned int cpu)
+{
+	/*
+	 * We assume all CPUs may be shut down.
+	 * This would be the hook to use for eventual Secure
+	 * OS migration requests.
+	 */
+	return 0;
+}
+
+static void __ref bL_cpu_die(unsigned int cpu)
+{
+	bL_cpu_power_down();
+}
+
+#endif
+
+struct smp_operations __initdata bL_smp_ops = {
+	.smp_init_cpus		= simple_smp_init_cpus,
+	.smp_boot_secondary	= bL_boot_secondary,
+	.smp_secondary_init	= bL_secondary_init,
+#ifdef CONFIG_HOTPLUG_CPU
+	.cpu_disable		= bL_cpu_disable,
+	.cpu_die		= bL_cpu_die,
+#endif
+};
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (5 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-14 16:37   ` Will Deacon
  2013-01-10  0:20 ` [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled Nicolas Pitre
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

If for whatever reason a CPU is unexpectedly awakened, it shouldn't
re-enter the kernel using whatever entry vector might have been set
by a previous operation.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/common/bL_platsmp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
index 0acb9f4685..0ae44123bf 100644
--- a/arch/arm/common/bL_platsmp.c
+++ b/arch/arm/common/bL_platsmp.c
@@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
 
 static void __ref bL_cpu_die(unsigned int cpu)
 {
+	unsigned int mpidr, pcpu, pcluster;
+	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
+	pcpu = mpidr & 0xff;
+	pcluster = (mpidr >> 8) & 0xff;
+	bL_set_entry_vector(pcpu, pcluster, NULL);
 	bL_cpu_power_down();
 }
 
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (6 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-11 18:07   ` Santosh Shilimkar
  2013-01-14 16:39   ` Will Deacon
  2013-01-10  0:20 ` [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at run-time Nicolas Pitre
                   ` (11 subsequent siblings)
  19 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

Otherwise some interrupts or IPIs might become pending and the CPU will
not enter low power mode when doing a WFI.  The effect of this is a CPU
that loops back into the kernel, goes through the first man election,
signals itself as alive, and prevents the cluster from being shut down.

This could benefit from a better solution.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/common/bL_platsmp.c        | 1 +
 arch/arm/common/gic.c               | 6 ++++++
 arch/arm/include/asm/hardware/gic.h | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
index 0ae44123bf..6a3b251b97 100644
--- a/arch/arm/common/bL_platsmp.c
+++ b/arch/arm/common/bL_platsmp.c
@@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
 	pcpu = mpidr & 0xff;
 	pcluster = (mpidr >> 8) & 0xff;
 	bL_set_entry_vector(pcpu, pcluster, NULL);
+	gic_cpu_if_down();
 	bL_cpu_power_down();
 }
 
diff --git a/arch/arm/common/gic.c b/arch/arm/common/gic.c
index 36ae03a3f5..760e8f4ca1 100644
--- a/arch/arm/common/gic.c
+++ b/arch/arm/common/gic.c
@@ -428,6 +428,12 @@ static void __cpuinit gic_cpu_init(struct gic_chip_data *gic)
 	writel_relaxed(1, base + GIC_CPU_CTRL);
 }
 
+void gic_cpu_if_down(void)
+{
+	void __iomem *cpu_base = gic_data_cpu_base(&gic_data[0]);
+	writel_relaxed(0, cpu_base + GIC_CPU_CTRL);
+}
+
 #ifdef CONFIG_CPU_PM
 /*
  * Saves the GIC distributor registers during suspend or idle.  Must be called
diff --git a/arch/arm/include/asm/hardware/gic.h b/arch/arm/include/asm/hardware/gic.h
index 4b1ce6cd47..2a7605492d 100644
--- a/arch/arm/include/asm/hardware/gic.h
+++ b/arch/arm/include/asm/hardware/gic.h
@@ -46,6 +46,8 @@ void gic_handle_irq(struct pt_regs *regs);
 void gic_cascade_irq(unsigned int gic_nr, unsigned int irq);
 void gic_raise_softirq(const struct cpumask *mask, unsigned int irq);
 
+void gic_cpu_if_down(void);
+
 static inline void gic_init(unsigned int nr, int start,
 			    void __iomem *dist , void __iomem *cpu)
 {
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at run-time
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (7 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 10/16] ARM: vexpress: introduce DCSCB support Nicolas Pitre
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Jon Medhurst <tixy@linaro.org>

Signed-off-by: Jon Medhurst <tixy@linaro.org>
---
 arch/arm/include/asm/mach/arch.h |  3 +++
 arch/arm/kernel/setup.c          |  5 ++++-
 arch/arm/mach-vexpress/core.h    |  2 ++
 arch/arm/mach-vexpress/platsmp.c | 12 ++++++++++++
 arch/arm/mach-vexpress/v2m.c     |  2 +-
 5 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/mach/arch.h b/arch/arm/include/asm/mach/arch.h
index 917d4fcfd9..3d01c6d6c3 100644
--- a/arch/arm/include/asm/mach/arch.h
+++ b/arch/arm/include/asm/mach/arch.h
@@ -17,8 +17,10 @@ struct pt_regs;
 struct smp_operations;
 #ifdef CONFIG_SMP
 #define smp_ops(ops) (&(ops))
+#define smp_init_ops(ops) (&(ops))
 #else
 #define smp_ops(ops) (struct smp_operations *)NULL
+#define smp_init_ops(ops) (void (*)(void))NULL
 #endif
 
 struct machine_desc {
@@ -42,6 +44,7 @@ struct machine_desc {
 	unsigned char		reserve_lp2 :1;	/* never has lp2	*/
 	char			restart_mode;	/* default restart mode	*/
 	struct smp_operations	*smp;		/* SMP operations	*/
+	void			(*smp_init)(void);
 	void			(*fixup)(struct tag *, char **,
 					 struct meminfo *);
 	void			(*reserve)(void);/* reserve mem blocks	*/
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 3f6cbb2e3e..41edca8582 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -768,7 +768,10 @@ void __init setup_arch(char **cmdline_p)
 	arm_dt_init_cpu_maps();
 #ifdef CONFIG_SMP
 	if (is_smp()) {
-		smp_set_ops(mdesc->smp);
+		if (mdesc->smp_init)
+			(*mdesc->smp_init)();
+		else
+			smp_set_ops(mdesc->smp);
 		smp_init_cpus();
 	}
 #endif
diff --git a/arch/arm/mach-vexpress/core.h b/arch/arm/mach-vexpress/core.h
index f134cd4a85..3a761fd76c 100644
--- a/arch/arm/mach-vexpress/core.h
+++ b/arch/arm/mach-vexpress/core.h
@@ -6,6 +6,8 @@
 
 void vexpress_dt_smp_map_io(void);
 
+void vexpress_smp_init_ops(void);
+
 extern struct smp_operations	vexpress_smp_ops;
 
 extern void vexpress_cpu_die(unsigned int cpu);
diff --git a/arch/arm/mach-vexpress/platsmp.c b/arch/arm/mach-vexpress/platsmp.c
index c5d70de9bb..e62a08b561 100644
--- a/arch/arm/mach-vexpress/platsmp.c
+++ b/arch/arm/mach-vexpress/platsmp.c
@@ -12,6 +12,7 @@
 #include <linux/errno.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/vexpress.h>
 
@@ -206,3 +207,14 @@ struct smp_operations __initdata vexpress_smp_ops = {
 	.cpu_die		= vexpress_cpu_die,
 #endif
 };
+
+void __init vexpress_smp_init_ops(void)
+{
+	struct smp_operations *ops = &vexpress_smp_ops;
+#ifdef CONFIG_BIG_LITTLE
+	extern struct smp_operations bL_smp_ops;
+	if (of_find_compatible_node(NULL, NULL, "arm,cci"))
+		ops = &bL_smp_ops;
+#endif
+	smp_set_ops(ops);
+}
diff --git a/arch/arm/mach-vexpress/v2m.c b/arch/arm/mach-vexpress/v2m.c
index 011661a6c5..34172bd504 100644
--- a/arch/arm/mach-vexpress/v2m.c
+++ b/arch/arm/mach-vexpress/v2m.c
@@ -494,7 +494,7 @@ static const char * const v2m_dt_match[] __initconst = {
 
 DT_MACHINE_START(VEXPRESS_DT, "ARM-Versatile Express")
 	.dt_compat	= v2m_dt_match,
-	.smp		= smp_ops(vexpress_smp_ops),
+	.smp_init	= smp_init_ops(vexpress_smp_init_ops),
 	.map_io		= v2m_dt_map_io,
 	.init_early	= v2m_dt_init_early,
 	.init_irq	= v2m_dt_init_irq,
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 10/16] ARM: vexpress: introduce DCSCB support
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (8 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at run-time Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-11 18:12   ` Santosh Shilimkar
  2013-01-10  0:20 ` [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation Nicolas Pitre
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

This adds basic CPU and cluster reset controls on RTSM for the
A15x4-A7x4 model configuration using the Dual Cluster System
Configuration Block (DCSCB).

The cache coherency interconnect (CCI) is not handled yet.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/mach-vexpress/Kconfig  |   8 ++
 arch/arm/mach-vexpress/Makefile |   1 +
 arch/arm/mach-vexpress/dcscb.c  | 160 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 169 insertions(+)
 create mode 100644 arch/arm/mach-vexpress/dcscb.c

diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
index 99e63f5f99..e55c02562f 100644
--- a/arch/arm/mach-vexpress/Kconfig
+++ b/arch/arm/mach-vexpress/Kconfig
@@ -53,4 +53,12 @@ config ARCH_VEXPRESS_CORTEX_A5_A9_ERRATA
 config ARCH_VEXPRESS_CA9X4
 	bool "Versatile Express Cortex-A9x4 tile"
 
+config ARCH_VEXPRESS_DCSCB
+	bool "Dual Cluster System Control Block (DCSCB) support"
+	depends on BIG_LITTLE
+	help
+	  Support for the Dual Cluster System Configuration Block (DCSCB).
+	  This is needed to provide CPU and cluster power management
+	  on RTSM.
+
 endmenu
diff --git a/arch/arm/mach-vexpress/Makefile b/arch/arm/mach-vexpress/Makefile
index 80b64971fb..2253644054 100644
--- a/arch/arm/mach-vexpress/Makefile
+++ b/arch/arm/mach-vexpress/Makefile
@@ -6,5 +6,6 @@ ccflags-$(CONFIG_ARCH_MULTIPLATFORM) := -I$(srctree)/$(src)/include \
 
 obj-y					:= v2m.o reset.o
 obj-$(CONFIG_ARCH_VEXPRESS_CA9X4)	+= ct-ca9x4.o
+obj-$(CONFIG_ARCH_VEXPRESS_DCSCB)	+= dcscb.o
 obj-$(CONFIG_SMP)			+= platsmp.o
 obj-$(CONFIG_HOTPLUG_CPU)		+= hotplug.o
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
new file mode 100644
index 0000000000..cccd943cd4
--- /dev/null
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -0,0 +1,160 @@
+/*
+ * arch/arm/mach-vexpress/dcscb.c - Dual Cluster System Control Block
+ *
+ * Created by:	Nicolas Pitre, May 2012
+ * Copyright:	(C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/io.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/vexpress.h>
+
+#include <asm/bL_entry.h>
+#include <asm/proc-fns.h>
+#include <asm/cacheflush.h>
+
+
+#define DCSCB_PHYS_BASE	0x60000000
+
+#define RST_HOLD0	0x0
+#define RST_HOLD1	0x4
+#define SYS_SWRESET	0x8
+#define RST_STAT0	0xc
+#define RST_STAT1	0x10
+#define EAG_CFG_R	0x20
+#define EAG_CFG_W	0x24
+#define KFC_CFG_R	0x28
+#define KFC_CFG_W	0x2c
+#define DCS_CFG_R	0x30
+
+/*
+ * We can't use regular spinlocks. In the switcher case, it is possible
+ * for an outbound CPU to call power_down() after its inbound counterpart
+ * is already live using the same logical CPU number which trips lockdep
+ * debugging.
+ */
+static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
+
+static void __iomem *dcscb_base;
+
+static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
+{
+	unsigned int rst_hold, cpumask = (1 << cpu);
+
+	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
+	if (cpu >= 4 || cluster >= 2)
+		return -EINVAL;
+
+	/*
+	 * Since this is called with IRQs enabled, and no arch_spin_lock_irq
+	 * variant exists, we need to disable IRQs manually here.
+	 */
+	local_irq_disable();
+	arch_spin_lock(&dcscb_lock);
+
+	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+	if (rst_hold & (1 << 8)) {
+		/* remove cluster reset and add individual CPU's reset */
+		rst_hold &= ~(1 << 8);
+		rst_hold |= 0xf;
+	}
+	rst_hold &= ~(cpumask | (cpumask << 4));
+	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+
+	arch_spin_unlock(&dcscb_lock);
+	local_irq_enable();
+
+	return 0;
+}
+
+static void dcscb_power_down(void)
+{
+	unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
+
+	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
+	cpu = mpidr & 0xff;
+	cluster = (mpidr >> 8) & 0xff;
+	cpumask = (1 << cpu);
+
+	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
+	BUG_ON(cpu >= 4 || cluster >= 2);
+
+	arch_spin_lock(&dcscb_lock);
+	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+	rst_hold |= cpumask;
+	if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
+		rst_hold |= (1 << 8);
+	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+	arch_spin_unlock(&dcscb_lock);
+	last_man = (rst_hold & (1 << 8));
+
+	/*
+	 * Now let's clean our L1 cache and shut ourself down.
+	 * If we're the last CPU in this cluster then clean L2 too.
+	 */
+
+	/*
+	 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
+	 * a preliminary flush here for those CPUs.  At least, that's
+	 * the theory -- without the extra flush, Linux explodes on
+	 * RTSM (maybe not needed anymore, to be investigated).
+	 */
+	flush_cache_louis();
+	cpu_proc_fin();
+
+	if (!last_man) {
+		flush_cache_louis();
+	} else {
+		flush_cache_all();
+		outer_flush_all();
+	}
+
+	/* Disable local coherency by clearing the ACTLR "SMP" bit: */
+	asm volatile (
+		"mrc	p15, 0, ip, c1, c0, 1 \n\t"
+		"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
+		"mcr	p15, 0, ip, c1, c0, 1"
+		: : : "ip" );
+
+	/* Now we are prepared for power-down, do it: */
+	wfi();
+
+	/* Not dead at this point?  Let our caller cope. */
+}
+
+static const struct bL_platform_power_ops dcscb_power_ops = {
+	.power_up	= dcscb_power_up,
+	.power_down	= dcscb_power_down,
+};
+
+static int __init dcscb_init(void)
+{
+	int ret;
+
+	dcscb_base = ioremap(DCSCB_PHYS_BASE, 0x1000);
+	if (!dcscb_base)
+		return -ENOMEM;
+
+	ret = bL_platform_power_register(&dcscb_power_ops);
+	if (ret) {
+		iounmap(dcscb_base);
+		return ret;
+	}
+
+	/*
+	 * Future entries into the kernel can now go
+	 * through the b.L entry vectors.
+	 */
+	vexpress_flags_set(virt_to_phys(bL_entry_point));
+
+	return 0;
+}
+
+early_initcall(dcscb_init);
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (9 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 10/16] ARM: vexpress: introduce DCSCB support Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster Nicolas Pitre
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

It is possible for a CPU to be told to power up before it managed
to power itself down.  Solve this race with a usage count as mandated
by the API definition.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/mach-vexpress/dcscb.c | 74 +++++++++++++++++++++++++++++++++---------
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index cccd943cd4..81dd443b95 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -43,6 +43,7 @@
 static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 
 static void __iomem *dcscb_base;
+static int dcscb_use_count[4][2];
 
 static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
 {
@@ -59,14 +60,27 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
 	local_irq_disable();
 	arch_spin_lock(&dcscb_lock);
 
-	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
-	if (rst_hold & (1 << 8)) {
-		/* remove cluster reset and add individual CPU's reset */
-		rst_hold &= ~(1 << 8);
-		rst_hold |= 0xf;
+	dcscb_use_count[cpu][cluster]++;
+	if (dcscb_use_count[cpu][cluster] == 1) {
+		rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+		if (rst_hold & (1 << 8)) {
+			/* remove cluster reset and add individual CPU's reset */
+			rst_hold &= ~(1 << 8);
+			rst_hold |= 0xf;
+		}
+		rst_hold &= ~(cpumask | (cpumask << 4));
+		writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+	} else if (dcscb_use_count[cpu][cluster] != 2) {
+		/*
+		 * The only possible values are:
+		 * 0 = CPU down
+		 * 1 = CPU (still) up
+		 * 2 = CPU requested to be up before it had a chance
+		 *     to actually make itself down.
+		 * Any other value is a bug.
+		 */
+		BUG();
 	}
-	rst_hold &= ~(cpumask | (cpumask << 4));
-	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
 
 	arch_spin_unlock(&dcscb_lock);
 	local_irq_enable();
@@ -76,7 +90,8 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
 
 static void dcscb_power_down(void)
 {
-	unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
+	unsigned int mpidr, cpu, cluster, rst_hold, cpumask;
+	bool last_man = false, skip_wfi = false;
 
 	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
 	cpu = mpidr & 0xff;
@@ -87,13 +102,26 @@ static void dcscb_power_down(void)
 	BUG_ON(cpu >= 4 || cluster >= 2);
 
 	arch_spin_lock(&dcscb_lock);
-	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
-	rst_hold |= cpumask;
-	if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
-		rst_hold |= (1 << 8);
-	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+	dcscb_use_count[cpu][cluster]--;
+	if (dcscb_use_count[cpu][cluster] == 0) {
+		rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+		rst_hold |= cpumask;
+		if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf) {
+			rst_hold |= (1 << 8);
+			last_man = true;
+		}
+		writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+	} else if (dcscb_use_count[cpu][cluster] == 1) {
+		/*
+		 * A power_up request went ahead of us.
+		 * Even if we do not want to shut this CPU down,
+		 * the caller expects a certain state as if the WFI
+		 * was aborted.  So let's continue with cache cleaning.
+		 */
+		skip_wfi = true;
+	} else
+		BUG();
 	arch_spin_unlock(&dcscb_lock);
-	last_man = (rst_hold & (1 << 8));
 
 	/*
 	 * Now let's clean our L1 cache and shut ourself down.
@@ -124,7 +152,8 @@ static void dcscb_power_down(void)
 		: : : "ip" );
 
 	/* Now we are prepared for power-down, do it: */
-	wfi();
+	if (!skip_wfi)
+		wfi();
 
 	/* Not dead at this point?  Let our caller cope. */
 }
@@ -134,6 +163,19 @@ static const struct bL_platform_power_ops dcscb_power_ops = {
 	.power_down	= dcscb_power_down,
 };
 
+static void __init dcscb_usage_count_init(void)
+{
+	unsigned int mpidr, cpu, cluster;
+
+	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
+	cpu = mpidr & 0xff;
+	cluster = (mpidr >> 8) & 0xff;
+
+	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
+	BUG_ON(cpu >= 4 || cluster >= 2);
+	dcscb_use_count[cpu][cluster] = 1;
+}
+
 static int __init dcscb_init(void)
 {
 	int ret;
@@ -142,6 +184,8 @@ static int __init dcscb_init(void)
 	if (!dcscb_base)
 		return -ENOMEM;
 
+	dcscb_usage_count_init();
+
 	ret = bL_platform_power_register(&dcscb_power_ops);
 	if (ret) {
 		iounmap(dcscb_base);
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (10 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10  0:20 ` [PATCH 13/16] drivers: misc: add ARM CCI support Nicolas Pitre
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

If 4 CPUs are assumed, the A15x1-A7x1 model configuration would never
shut down the initial cluster as the 0xf reset bit mask would never be
observed.  Let's construct this mask based on the number of CPUs per
cluster provided by the DCSCB config register.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/mach-vexpress/dcscb.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 81dd443b95..59b690376f 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -44,10 +44,12 @@ static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
 
 static void __iomem *dcscb_base;
 static int dcscb_use_count[4][2];
+static int dcscb_cluster_cpu_mask[2];
 
 static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
 {
 	unsigned int rst_hold, cpumask = (1 << cpu);
+	unsigned int cluster_mask = dcscb_cluster_cpu_mask[cluster];
 
 	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
 	if (cpu >= 4 || cluster >= 2)
@@ -66,7 +68,7 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
 		if (rst_hold & (1 << 8)) {
 			/* remove cluster reset and add individual CPU's reset */
 			rst_hold &= ~(1 << 8);
-			rst_hold |= 0xf;
+			rst_hold |= cluster_mask;
 		}
 		rst_hold &= ~(cpumask | (cpumask << 4));
 		writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
@@ -90,13 +92,14 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
 
 static void dcscb_power_down(void)
 {
-	unsigned int mpidr, cpu, cluster, rst_hold, cpumask;
+	unsigned int mpidr, cpu, cluster, rst_hold, cpumask, cluster_mask;
 	bool last_man = false, skip_wfi = false;
 
 	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
 	cpu = mpidr & 0xff;
 	cluster = (mpidr >> 8) & 0xff;
 	cpumask = (1 << cpu);
+	cluster_mask = dcscb_cluster_cpu_mask[cluster];
 
 	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
 	BUG_ON(cpu >= 4 || cluster >= 2);
@@ -106,7 +109,7 @@ static void dcscb_power_down(void)
 	if (dcscb_use_count[cpu][cluster] == 0) {
 		rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
 		rst_hold |= cpumask;
-		if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf) {
+		if (((rst_hold | (rst_hold >> 4)) & cluster_mask) == cluster_mask) {
 			rst_hold |= (1 << 8);
 			last_man = true;
 		}
@@ -178,12 +181,15 @@ static void __init dcscb_usage_count_init(void)
 
 static int __init dcscb_init(void)
 {
+	unsigned int cfg;
 	int ret;
 
 	dcscb_base = ioremap(DCSCB_PHYS_BASE, 0x1000);
 	if (!dcscb_base)
 		return -ENOMEM;
-
+	cfg = readl_relaxed(dcscb_base + DCS_CFG_R);
+	dcscb_cluster_cpu_mask[0] = (1 << (((cfg >> 16) >> (0 << 2)) & 0xf)) - 1;
+	dcscb_cluster_cpu_mask[1] = (1 << (((cfg >> 16) >> (1 << 2)) & 0xf)) - 1;
 	dcscb_usage_count_init();
 
 	ret = bL_platform_power_register(&dcscb_power_ops);
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 13/16] drivers: misc: add ARM CCI support
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (11 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-11 18:20   ` Santosh Shilimkar
  2013-01-10  0:20 ` [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache Nicolas Pitre
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

On ARM multi-cluster systems coherency between cores running on
different clusters is managed by the cache-coherent interconnect (CCI).
It allows broadcasting of TLB invalidates and memory barriers and it
guarantees cache coherency at system level.

This patch enables the basic infrastructure required in Linux to
handle and program the CCI component. The first implementation is
based on a platform device, its corresponding DT compatible property
and a simple programming interface.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 drivers/misc/Kconfig    |   3 ++
 drivers/misc/Makefile   |   1 +
 drivers/misc/arm-cci.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/arm-cci.h |  30 ++++++++++++++
 4 files changed, 141 insertions(+)
 create mode 100644 drivers/misc/arm-cci.c
 create mode 100644 include/linux/arm-cci.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index b151b7c1bd..30d5be1ad2 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -499,6 +499,9 @@ config USB_SWITCH_FSA9480
 	  stereo and mono audio, video, microphone and UART data to use
 	  a common connector port.
 
+config ARM_CCI
+       bool "ARM CCI driver support"
+
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 2129377c0d..d052d109f9 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -49,3 +49,4 @@ obj-y				+= carma/
 obj-$(CONFIG_USB_SWITCH_FSA9480) += fsa9480.o
 obj-$(CONFIG_ALTERA_STAPL)	+=altera-stapl/
 obj-$(CONFIG_INTEL_MEI)		+= mei/
+obj-$(CONFIG_ARM_CCI)		+= arm-cci.o
diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
new file mode 100644
index 0000000000..f329c43099
--- /dev/null
+++ b/drivers/misc/arm-cci.c
@@ -0,0 +1,107 @@
+/*
+ * CCI support
+ *
+ * Copyright (C) 2012 ARM Ltd.
+ * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/device.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/arm-cci.h>
+
+#define CCI400_EAG_OFFSET       0x4000
+#define CCI400_KF_OFFSET        0x5000
+
+#define DRIVER_NAME	"CCI"
+struct cci_drvdata {
+	void __iomem *baseaddr;
+	spinlock_t lock;
+};
+
+static struct cci_drvdata *info;
+
+void disable_cci(int cluster)
+{
+	u32 cci_reg = cluster ? CCI400_KF_OFFSET : CCI400_EAG_OFFSET;
+	writel_relaxed(0x0, info->baseaddr + cci_reg);
+
+	while (readl_relaxed(info->baseaddr + 0xc) & 0x1)
+			;
+}
+EXPORT_SYMBOL_GPL(disable_cci);
+
+static int __devinit cci_driver_probe(struct platform_device *pdev)
+{
+	struct resource *res;
+	int ret = 0;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		dev_err(&pdev->dev, "unable to allocate mem\n");
+		return -ENOMEM;
+	}
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res) {
+		dev_err(&pdev->dev, "No memory resource\n");
+		ret = -EINVAL;
+		goto mem_free;
+	}
+
+	if (!request_mem_region(res->start, resource_size(res),
+				dev_name(&pdev->dev))) {
+		dev_err(&pdev->dev, "address 0x%x in use\n", (u32) res->start);
+		ret = -EBUSY;
+		goto mem_free;
+	}
+
+	info->baseaddr = ioremap(res->start, resource_size(res));
+	if (!info->baseaddr) {
+		ret = -ENXIO;
+		goto ioremap_err;
+	}
+
+	platform_set_drvdata(pdev, info);
+
+	pr_info("CCI loaded at %p\n", info->baseaddr);
+	return ret;
+
+ioremap_err:
+	release_region(res->start, resource_size(res));
+mem_free:
+	kfree(info);
+
+	return ret;
+}
+
+static const struct of_device_id arm_cci_matches[] = {
+	{.compatible = "arm,cci"},
+	{},
+};
+
+static struct platform_driver cci_platform_driver = {
+	.driver = {
+		   .name = DRIVER_NAME,
+		   .of_match_table = arm_cci_matches,
+		  },
+	.probe = cci_driver_probe,
+};
+
+static int __init cci_init(void)
+{
+	return platform_driver_register(&cci_platform_driver);
+}
+
+core_initcall(cci_init);
diff --git a/include/linux/arm-cci.h b/include/linux/arm-cci.h
new file mode 100644
index 0000000000..ce3f705fb6
--- /dev/null
+++ b/include/linux/arm-cci.h
@@ -0,0 +1,30 @@
+/*
+ * CCI support
+ *
+ * Copyright (C) 2012 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __LINUX_ARM_CCI_H
+#define __LINUX_ARM_CCI_H
+
+#ifdef CONFIG_ARM_CCI
+extern void disable_cci(int cluster);
+#else
+static inline void disable_cci(int cluster) { }
+#endif
+
+#endif
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (12 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 13/16] drivers: misc: add ARM CCI support Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10 18:50   ` Dave Martin
  2013-01-10  0:20 ` [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Dave Martin <dave.martin@linaro.org>

Non-local variables used by the CCI management function called after
disabling the cache must be flushed out to main memory in advance,
otherwise incoherency of those values may occur if they are sitting
in the cache of some other CPU when disable_cci() executes.

This patch adds the appropriate flushing to the CCI driver to ensure
that the relevant data is available in RAM ahead of time.

Because this creates a dependency on arch-specific cacheflushing
functions, this patch also makes ARM_CCI depend on ARM.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 drivers/misc/Kconfig   |  1 +
 drivers/misc/arm-cci.c | 21 +++++++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 30d5be1ad2..b24630696c 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -501,6 +501,7 @@ config USB_SWITCH_FSA9480
 
 config ARM_CCI
        bool "ARM CCI driver support"
+	depends on ARM
 
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
index f329c43099..739e1c96d3 100644
--- a/drivers/misc/arm-cci.c
+++ b/drivers/misc/arm-cci.c
@@ -21,8 +21,16 @@
 #include <linux/slab.h>
 #include <linux/arm-cci.h>
 
-#define CCI400_EAG_OFFSET       0x4000
-#define CCI400_KF_OFFSET        0x5000
+#include <asm/cacheflush.h>
+#include <asm/memory.h>
+#include <asm/outercache.h>
+
+#include <asm/irq_regs.h>
+#include <asm/pmu.h>
+
+#define CCI400_PMCR                   0x0100
+#define CCI400_EAG_OFFSET             0x4000
+#define CCI400_KF_OFFSET              0x5000
 
 #define DRIVER_NAME	"CCI"
 struct cci_drvdata {
@@ -73,6 +81,15 @@ static int __devinit cci_driver_probe(struct platform_device *pdev)
 		goto ioremap_err;
 	}
 
+	/*
+	 * Multi-cluster systems may need this data when non-coherent, during
+	 * cluster power-up/power-down. Make sure it reaches main memory:
+	 */
+	__cpuc_flush_dcache_area(info, sizeof *info);
+	__cpuc_flush_dcache_area(&info, sizeof info);
+	outer_clean_range(virt_to_phys(info), virt_to_phys(info + 1));
+	outer_clean_range(virt_to_phys(&info), virt_to_phys(&info + 1));
+
 	platform_set_drvdata(pdev, info);
 
 	pr_info("CCI loaded at %p\n", info->baseaddr);
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (13 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10 12:05   ` Dave Martin
  2013-01-11 18:27   ` Santosh Shilimkar
  2013-01-10  0:20 ` [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree Nicolas Pitre
                   ` (4 subsequent siblings)
  19 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

From: Dave Martin <dave.martin@linaro.org>

Add the required code to properly handle race-free platform coherency exit
to the DCSCB power down method.

The power_up_setup callback is used to enable the CCI interface for
the cluster being brought up.  This must be done in assembly before
the kernel environment is entered.

Thanks to Achin Gupta and Nicolas Pitre for their help and
contributions.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/mach-vexpress/Kconfig       |  1 +
 arch/arm/mach-vexpress/Makefile      |  2 +-
 arch/arm/mach-vexpress/dcscb.c       | 90 +++++++++++++++++++++++++++---------
 arch/arm/mach-vexpress/dcscb_setup.S | 77 ++++++++++++++++++++++++++++++
 4 files changed, 146 insertions(+), 24 deletions(-)
 create mode 100644 arch/arm/mach-vexpress/dcscb_setup.S

diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
index e55c02562f..180633dda6 100644
--- a/arch/arm/mach-vexpress/Kconfig
+++ b/arch/arm/mach-vexpress/Kconfig
@@ -56,6 +56,7 @@ config ARCH_VEXPRESS_CA9X4
 config ARCH_VEXPRESS_DCSCB
 	bool "Dual Cluster System Control Block (DCSCB) support"
 	depends on BIG_LITTLE
+	select ARM_CCI
 	help
 	  Support for the Dual Cluster System Configuration Block (DCSCB).
 	  This is needed to provide CPU and cluster power management
diff --git a/arch/arm/mach-vexpress/Makefile b/arch/arm/mach-vexpress/Makefile
index 2253644054..f6e90f3272 100644
--- a/arch/arm/mach-vexpress/Makefile
+++ b/arch/arm/mach-vexpress/Makefile
@@ -6,6 +6,6 @@ ccflags-$(CONFIG_ARCH_MULTIPLATFORM) := -I$(srctree)/$(src)/include \
 
 obj-y					:= v2m.o reset.o
 obj-$(CONFIG_ARCH_VEXPRESS_CA9X4)	+= ct-ca9x4.o
-obj-$(CONFIG_ARCH_VEXPRESS_DCSCB)	+= dcscb.o
+obj-$(CONFIG_ARCH_VEXPRESS_DCSCB)	+= dcscb.o	dcscb_setup.o
 obj-$(CONFIG_SMP)			+= platsmp.o
 obj-$(CONFIG_HOTPLUG_CPU)		+= hotplug.o
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 59b690376f..95a2d0df20 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -15,6 +15,7 @@
 #include <linux/spinlock.h>
 #include <linux/errno.h>
 #include <linux/vexpress.h>
+#include <linux/arm-cci.h>
 
 #include <asm/bL_entry.h>
 #include <asm/proc-fns.h>
@@ -104,6 +105,8 @@ static void dcscb_power_down(void)
 	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
 	BUG_ON(cpu >= 4 || cluster >= 2);
 
+	__bL_cpu_going_down(cpu, cluster);
+
 	arch_spin_lock(&dcscb_lock);
 	dcscb_use_count[cpu][cluster]--;
 	if (dcscb_use_count[cpu][cluster] == 0) {
@@ -111,6 +114,7 @@ static void dcscb_power_down(void)
 		rst_hold |= cpumask;
 		if (((rst_hold | (rst_hold >> 4)) & cluster_mask) == cluster_mask) {
 			rst_hold |= (1 << 8);
+			BUG_ON(__bL_cluster_state(cluster) != CLUSTER_UP);
 			last_man = true;
 		}
 		writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
@@ -124,35 +128,71 @@ static void dcscb_power_down(void)
 		skip_wfi = true;
 	} else
 		BUG();
-	arch_spin_unlock(&dcscb_lock);
 
-	/*
-	 * Now let's clean our L1 cache and shut ourself down.
-	 * If we're the last CPU in this cluster then clean L2 too.
-	 */
-
-	/*
-	 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
-	 * a preliminary flush here for those CPUs.  At least, that's
-	 * the theory -- without the extra flush, Linux explodes on
-	 * RTSM (maybe not needed anymore, to be investigated)..
-	 */
-	flush_cache_louis();
-	cpu_proc_fin();
+	if (last_man && __bL_outbound_enter_critical(cpu, cluster)) {
+		arch_spin_unlock(&dcscb_lock);
 
-	if (!last_man) {
-		flush_cache_louis();
-	} else {
+		/*
+		 * Flush all cache levels for this cluster.
+		 *
+		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
+		 * a preliminary flush here for those CPUs.  At least, that's
+		 * the theory -- without the extra flush, Linux explodes on
+		 * RTSM (maybe not needed anymore, to be investigated).
+		 */
 		flush_cache_all();
+		cpu_proc_fin(); /* disable allocation into internal caches */
+		flush_cache_all();
+
+		/*
+		 * This is a harmless no-op.  On platforms with a real
+		 * outer cache this might either be needed or not,
+		 * depending on where the outer cache sits.
+		 */
 		outer_flush_all();
+
+		/* Disable local coherency by clearing the ACTLR "SMP" bit: */
+		asm volatile (
+			"mrc	p15, 0, ip, c1, c0, 1 \n\t"
+			"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
+			"mcr	p15, 0, ip, c1, c0, 1 \n\t"
+			"isb \n\t"
+			"dsb"
+			: : : "ip" );
+
+		/*
+		 * Disable cluster-level coherency by masking
+		 * incoming snoops and DVM messages:
+		 */
+		disable_cci(cluster);
+
+		__bL_outbound_leave_critical(cluster, CLUSTER_DOWN);
+	} else {
+		arch_spin_unlock(&dcscb_lock);
+
+		/*
+		 * Flush the local CPU cache.
+		 *
+		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
+		 * a preliminary flush here for those CPUs.  At least, that's
+		 * the theory -- without the extra flush, Linux explodes on
+		 * RTSM (maybe not needed anymore, to be investigated).
+		 */
+		flush_cache_louis();
+		cpu_proc_fin(); /* disable allocation into internal caches */
+		flush_cache_louis();
+
+		/* Disable local coherency by clearing the ACTLR "SMP" bit: */
+		asm volatile (
+			"mrc	p15, 0, ip, c1, c0, 1 \n\t"
+			"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
+			"mcr	p15, 0, ip, c1, c0, 1 \n\t"
+			"isb \n\t"
+			"dsb"
+			: : : "ip" );
 	}
 
-	/* Disable local coherency by clearing the ACTLR "SMP" bit: */
-	asm volatile (
-		"mrc	p15, 0, ip, c1, c0, 1 \n\t"
-		"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
-		"mcr	p15, 0, ip, c1, c0, 1"
-		: : : "ip" );
+	__bL_cpu_down(cpu, cluster);
 
 	/* Now we are prepared for power-down, do it: */
 	if (!skip_wfi)
@@ -179,6 +219,8 @@ static void __init dcscb_usage_count_init(void)
 	dcscb_use_count[cpu][cluster] = 1;
 }
 
+extern void dcscb_power_up_setup(void);
+
 static int __init dcscb_init(void)
 {
 	unsigned int cfg;
@@ -193,6 +235,8 @@ static int __init dcscb_init(void)
 	dcscb_usage_count_init();
 
 	ret = bL_platform_power_register(&dcscb_power_ops);
+	if (!ret)
+		ret = bL_cluster_sync_init(dcscb_power_up_setup);
 	if (ret) {
 		iounmap(dcscb_base);
 		return ret;
diff --git a/arch/arm/mach-vexpress/dcscb_setup.S b/arch/arm/mach-vexpress/dcscb_setup.S
new file mode 100644
index 0000000000..c75ee8c4db
--- /dev/null
+++ b/arch/arm/mach-vexpress/dcscb_setup.S
@@ -0,0 +1,77 @@
+/*
+ * arch/arm/mach-vexpress/dcscb_setup.S
+ *
+ * Created by:  Dave Martin, 2012-06-22
+ * Copyright:   (C) 2012  Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+
+#include <linux/linkage.h>
+#include <asm/bL_entry.h>
+
+
+#define SLAVE_SNOOPCTL_OFFSET	0
+#define SNOOPCTL_SNOOP_ENABLE	(1 << 0)
+#define SNOOPCTL_DVM_ENABLE	(1 << 1)
+
+#define CCI_STATUS_OFFSET	0xc
+#define STATUS_CHANGE_PENDING	(1 << 0)
+
+#define CCI_SLAVE_OFFSET(n)	(0x1000 + 0x1000 * (n))
+
+#define RTSM_CCI_PHYS_BASE	0x2c090000
+#define RTSM_CCI_SLAVE_A15	3
+#define RTSM_CCI_SLAVE_A7	4
+
+#define RTSM_CCI_A15_OFFSET	CCI_SLAVE_OFFSET(RTSM_CCI_SLAVE_A15)
+#define RTSM_CCI_A7_OFFSET	CCI_SLAVE_OFFSET(RTSM_CCI_SLAVE_A7)
+
+
+ENTRY(dcscb_power_up_setup)
+
+	cmp	r0, #0			@ check affinity level
+	beq	2f
+
+/*
+ * Enable cluster-level coherency, in preparation for turning on the MMU.
+ * The ACTLR SMP bit does not need to be set here, because cpu_resume()
+ * already restores that.
+ */
+
+	mrc	p15, 0, r0, c0, c0, 5	@ MPIDR
+	ubfx	r0, r0, #8, #4		@ cluster
+
+	@ A15/A7 may not require explicit L2 invalidation on reset, depending
+	@ on hardware integration decisions.
+	@ For now, this code assumes that L2 is either already invalidated, or
+	@ invalidation is not required.
+
+	ldr	r3, =RTSM_CCI_PHYS_BASE + RTSM_CCI_A15_OFFSET
+	cmp	r0, #0		@ A15 cluster?
+	addne	r3, r3, #RTSM_CCI_A7_OFFSET - RTSM_CCI_A15_OFFSET
+
+	@ r3 now points to the correct CCI slave register block
+
+	ldr	r0, [r3, #SLAVE_SNOOPCTL_OFFSET]
+	orr	r0, r0, #SNOOPCTL_SNOOP_ENABLE | SNOOPCTL_DVM_ENABLE
+	str	r0, [r3, #SLAVE_SNOOPCTL_OFFSET]	@ enable CCI snoops
+
+	@ Wait for snoop control change to complete:
+
+	ldr	r3, =RTSM_CCI_PHYS_BASE
+
+	b	1f
+0:	dsb
+1:	ldr	r0, [r3, #CCI_STATUS_OFFSET]
+	tst	r0, #STATUS_CHANGE_PENDING
+	bne	0b
+
+2:	@ Implementation-specific local CPU setup operations should go here,
+	@ if any.  In this case, there is nothing to do.
+
+	bx	lr
+ENDPROC(dcscb_power_up_setup)
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (14 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
@ 2013-01-10  0:20 ` Nicolas Pitre
  2013-01-10  0:46 ` [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Rob Herring
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  0:20 UTC (permalink / raw)
  To: linux-arm-kernel

This allows the DCSCB support to be compiled in and selected
at run time.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/mach-vexpress/dcscb.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 95a2d0df20..5990f1ff25 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -14,6 +14,7 @@
 #include <linux/io.h>
 #include <linux/spinlock.h>
 #include <linux/errno.h>
+#include <linux/of_address.h>
 #include <linux/vexpress.h>
 #include <linux/arm-cci.h>
 
@@ -22,8 +23,6 @@
 #include <asm/cacheflush.h>
 
 
-#define DCSCB_PHYS_BASE	0x60000000
-
 #define RST_HOLD0	0x0
 #define RST_HOLD1	0x4
 #define SYS_SWRESET	0x8
@@ -223,12 +222,16 @@ extern void dcscb_power_up_setup(void);
 
 static int __init dcscb_init(void)
 {
+	struct device_node *node;
 	unsigned int cfg;
 	int ret;
 
-	dcscb_base = ioremap(DCSCB_PHYS_BASE, 0x1000);
+	node = of_find_compatible_node(NULL, NULL, "arm,dcscb");
+	if (!node)
+		return -ENODEV;
+	dcscb_base = of_iomap(node, 0);
 	if (!dcscb_base)
-		return -ENOMEM;
+		return -EINVAL;
 	cfg = readl_relaxed(dcscb_base + DCS_CFG_R);
 	dcscb_cluster_cpu_mask[0] = (1 << (((cfg >> 16) >> (0 << 2)) & 0xf)) - 1;
 	dcscb_cluster_cpu_mask[1] = (1 << (((cfg >> 16) >> (1 << 2)) & 0xf)) - 1;
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (15 preceding siblings ...)
  2013-01-10  0:20 ` [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree Nicolas Pitre
@ 2013-01-10  0:46 ` Rob Herring
  2013-01-10  5:04   ` Nicolas Pitre
  2013-01-10 23:01 ` Will Deacon
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 140+ messages in thread
From: Rob Herring @ 2013-01-10  0:46 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/09/2013 06:20 PM, Nicolas Pitre wrote:
> This is the initial public posting of the initial support for big.LITTLE.
> Included here is the code required to safely power up and down CPUs in a
> b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> boot and CPU hotplug support is included at this time.  Getting to this
> point already represents a significcant chunk of code as illustrated by
> the diffstat below.
> 
> This work was presented at Linaro Connect in Copenhagen by Dave Martin and
> myself.  The presentation slides are available here:
> 
> http://www.linaro.org/documents/download/f3569407bb1fb8bde0d6da80e285b832508f92f57223c
> 
> The code is now stable on both Fast Models as well as Virtual Express TC2
> and ready for public review.
> 
> Platform support is included for Fast Models implementing the
> Cortex-A15x4-A7x4 and Cortex-A15x1-A7x1 configurations.  To allow
> successful compilation, I also included a preliminary version of the
> CCI400 driver from Lorenzo Pieralisi.
> 
> Support for actual hardware such as Vexpress TC2 should come later,
> once the basic infrastructure from this series is merged.  A few DT
> bindings are used but not yet documented.
> 
> This series is made of the following parts:
> 
> Low-level support code:
> [PATCH 01/16] ARM: b.L: secondary kernel entry code
> [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
> [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
> [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
> [PATCH 05/16] ARM: bL_head: vlock-based first man election

After a quick scan, I have a basic question. How are any of these
specific to big.LITTLE? Isn't all this just general multi-cluster support?

Rob
> 
> Adaptation layer to hook with the generic kernel infrastructure:
> [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug
> [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before
> [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a
> [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at
> 
> Fast Models support:
> [PATCH 10/16] ARM: vexpress: introduce DCSCB support
> [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power
> [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs
> [PATCH 13/16] drivers: misc: add ARM CCI support
> [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from
> [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency
> [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree
> 
> Here's the diffstat:
> 
>  .../big.LITTLE/cluster-pm-race-avoidance.txt    | 498 ++++++++++++++++++
>  Documentation/arm/big.LITTLE/vlocks.txt         | 211 ++++++++
>  arch/arm/Kconfig                                |   6 +
>  arch/arm/common/Makefile                        |   3 +
>  arch/arm/common/bL_entry.c                      | 278 ++++++++++
>  arch/arm/common/bL_head.S                       | 232 ++++++++
>  arch/arm/common/bL_platsmp.c                    |  85 +++
>  arch/arm/common/gic.c                           |   6 +
>  arch/arm/common/vlock.S                         | 108 ++++
>  arch/arm/common/vlock.h                         |  43 ++
>  arch/arm/include/asm/bL_entry.h                 | 189 +++++++
>  arch/arm/include/asm/hardware/gic.h             |   2 +
>  arch/arm/include/asm/mach/arch.h                |   3 +
>  arch/arm/kernel/setup.c                         |   5 +-
>  arch/arm/mach-vexpress/Kconfig                  |   9 +
>  arch/arm/mach-vexpress/Makefile                 |   1 +
>  arch/arm/mach-vexpress/core.h                   |   2 +
>  arch/arm/mach-vexpress/dcscb.c                  | 257 +++++++++
>  arch/arm/mach-vexpress/dcscb_setup.S            |  77 +++
>  arch/arm/mach-vexpress/platsmp.c                |  12 +
>  arch/arm/mach-vexpress/v2m.c                    |   2 +-
>  drivers/misc/Kconfig                            |   4 +
>  drivers/misc/Makefile                           |   1 +
>  drivers/misc/arm-cci.c                          | 124 +++++
>  include/linux/arm-cci.h                         |  30 ++
>  25 files changed, 2186 insertions(+), 2 deletions(-)
> 
> Review comments are welcome!
> 
> [*] General design information on the b.L switcher can be found here:
>     http://lwn.net/Articles/481055/
>     However the code is only accessible to Linaro members for the
>     time being.
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-10  0:46 ` [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Rob Herring
@ 2013-01-10  5:04   ` Nicolas Pitre
  0 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10  5:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 9 Jan 2013, Rob Herring wrote:

> On 01/09/2013 06:20 PM, Nicolas Pitre wrote:
> > This is the initial public posting of the initial support for big.LITTLE.
> > Included here is the code required to safely power up and down CPUs in a
> > b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> > Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> > boot and CPU hotplug support is included at this time.  Getting to this
> > point already represents a significcant chunk of code as illustrated by
> > the diffstat below.
> > 
> > This work was presented at Linaro Connect in Copenhagen by Dave Martin and
> > myself.  The presentation slides are available here:
> > 
> > http://www.linaro.org/documents/download/f3569407bb1fb8bde0d6da80e285b832508f92f57223c
> > 
> > The code is now stable on both Fast Models as well as Virtual Express TC2
> > and ready for public review.
> > 
> > Platform support is included for Fast Models implementing the
> > Cortex-A15x4-A7x4 and Cortex-A15x1-A7x1 configurations.  To allow
> > successful compilation, I also included a preliminary version of the
> > CCI400 driver from Lorenzo Pieralisi.
> > 
> > Support for actual hardware such as Vexpress TC2 should come later,
> > once the basic infrastructure from this series is merged.  A few DT
> > bindings are used but not yet documented.
> > 
> > This series is made of the following parts:
> > 
> > Low-level support code:
> > [PATCH 01/16] ARM: b.L: secondary kernel entry code
> > [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
> > [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
> > [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
> > [PATCH 05/16] ARM: bL_head: vlock-based first man election
> 
> After a quick scan, I have a basic question. How are any of these
> specific to big.LITTLE? Isn't all this just general multi-cluster support?

It is.  However, b.L is what brought the need for this multi-cluster 
support, and being first it gave the name.

I've been pondering naming this code differently so as not to imply 
b.L, but I couldn't come up with a good name/prefix.  The bL_ prefix is 
really hard to beat!  So I decided to put it off until there are actually 
some non-b.L users for this code, at which point we can trivially rename 
it.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
@ 2013-01-10  7:12   ` Stephen Boyd
  2013-01-10 15:30     ` Nicolas Pitre
  2013-01-10 15:34   ` Catalin Marinas
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Stephen Boyd @ 2013-01-10  7:12 UTC (permalink / raw)
  To: linux-arm-kernel

On 1/9/2013 4:20 PM, Nicolas Pitre wrote:
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index f95ba14ae3..2271f02e8e 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1579,6 +1579,12 @@ config HAVE_ARM_TWD
>  	help
>  	  This options enables support for the ARM timer and watchdog unit
>  
> +config BIG_LITTLE
> +	bool "big.LITTLE support (Experimental)"
> +	depends on CPU_V7 && SMP && EXPERIMENTAL

I thought EXPERIMENTAL was being phased out?

> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
> index e8a4e58f1b..50880c494f 100644
> --- a/arch/arm/common/Makefile
> +++ b/arch/arm/common/Makefile
> @@ -13,3 +13,6 @@ obj-$(CONFIG_SHARP_PARAM)	+= sharpsl_param.o
>  obj-$(CONFIG_SHARP_SCOOP)	+= scoop.o
>  obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>  obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
> +obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
> +obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o

This looks like non-related stuff?

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
@ 2013-01-10 12:01   ` Dave Martin
  2013-01-10 19:04     ` Nicolas Pitre
  2013-01-10 16:53   ` Catalin Marinas
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-10 12:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 09, 2013 at 07:20:38PM -0500, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
> 
> This provides helper methods to coordinate between CPUs coming down
> and CPUs going up, as well as documentation on the used algorithms,
> so that cluster teardown and setup
> operations are not done for a cluster simultaneously.
> 
> For use in the power_down() implementation:
>   * __bL_cpu_going_down(unsigned int cluster, unsigned int cpu)
>   * __bL_outbound_enter_critical(unsigned int cluster)
>   * __bL_outbound_leave_critical(unsigned int cluster)
>   * __bL_cpu_down(unsigned int cluster, unsigned int cpu)
> 
> The power_up_setup() helper should do platform-specific setup in
> preparation for turning the CPU on, such as invalidating local caches
> or entering coherency.  It must be assembler for now, since it must
> run before the MMU can be switched on.  It is passed the affinity level
> which should be initialized.
> 
> Because the bL_cluster_sync_struct content is looked-up and modified
> with the cache enabled or disabled depending on the code path, it is
> crucial to always ensure proper cache maintenance to update main memory
> right away.  Therefore, any cached write must be followed by a cache clean
> operation and any cached read must be preceded by a cache invalidate
> operation on the accessed memory.
> 
> To avoid races where a reader would invalidate the cache and discard the
> latest update from a writer before that writer had a chance to clean it
> to RAM, we simply use cache flush (clean+invalidate) operations
> everywhere.
> 
> Also, in order to prevent a cached writer from interfering with an
> adjacent non-cached writer, we ensure each state variable is located to
> a separate cache line.
> 
> Thanks to Nicolas Pitre and Achin Gupta for the help with this
> patch.
> 
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> ---
>  .../arm/big.LITTLE/cluster-pm-race-avoidance.txt   | 498 +++++++++++++++++++++
>  arch/arm/common/bL_entry.c                         | 160 +++++++
>  arch/arm/common/bL_head.S                          |  88 +++-
>  arch/arm/include/asm/bL_entry.h                    |  62 +++
>  4 files changed, 806 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt

[...]

> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c

[...]

> +int __init bL_cluster_sync_init(void (*power_up_setup)(void))

The addition of the affinity level parameter for power_up_setup means
that this prototype is not correct.

This is not a functional change, since that function is only called from
assembler anyway, but it will help avoid confusion.

This could be fixed by folding the following changes into the patch.

Cheers
---Dave

diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 1ea4ec9..05cfdd3 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -245,7 +245,8 @@ int __bL_cluster_state(unsigned int cluster)
 
 extern unsigned long bL_power_up_setup_phys;
 
-int __init bL_cluster_sync_init(void (*power_up_setup)(void))
+int __init bL_cluster_sync_init(
+	void (*power_up_setup)(unsigned int affinity_level))
 {
 	unsigned int i, j, mpidr, this_cluster;
 
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
index 167394d..c9c29b2 100644
--- a/arch/arm/include/asm/bL_entry.h
+++ b/arch/arm/include/asm/bL_entry.h
@@ -183,7 +183,8 @@ void __bL_outbound_leave_critical(unsigned int cluster, int state);
 bool __bL_outbound_enter_critical(unsigned int this_cpu, unsigned int cluster);
 int __bL_cluster_state(unsigned int cluster);
 
-int __init bL_cluster_sync_init(void (*power_up_setup)(void));
+int __init bL_cluster_sync_init(
+	void (*power_up_setup)(unsigned int affinity_level));
 
 #endif /* ! __ASSEMBLY__ */
 #endif

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-10  0:20 ` [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
@ 2013-01-10 12:05   ` Dave Martin
  2013-01-11 18:27   ` Santosh Shilimkar
  1 sibling, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-10 12:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 09, 2013 at 07:20:50PM -0500, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
> 
> Add the required code to properly handle race free platform coherency exit
> to the DCSCB power down method.
> 
> The power_up_setup callback is used to enable the CCI interface for
> the cluster being brought up.  This must be done in assembly before
> the kernel environment is entered.
> 
> Thanks to Achin Gupta and Nicolas Pitre for their help and
> contributions.
> 
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  arch/arm/mach-vexpress/Kconfig       |  1 +
>  arch/arm/mach-vexpress/Makefile      |  2 +-
>  arch/arm/mach-vexpress/dcscb.c       | 90 +++++++++++++++++++++++++++---------
>  arch/arm/mach-vexpress/dcscb_setup.S | 77 ++++++++++++++++++++++++++++++
>  4 files changed, 146 insertions(+), 24 deletions(-)
>  create mode 100644 arch/arm/mach-vexpress/dcscb_setup.S

[...]

> diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c

[...]

> @@ -179,6 +219,8 @@ static void __init dcscb_usage_count_init(void)
>  	dcscb_use_count[cpu][cluster] = 1;
>  }
>  
> +extern void dcscb_power_up_setup(void);

The following change can be folded in to match the prototype to the
underlying function.

Cheers
---Dave

diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 95a2d0d..4057f8b 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -219,7 +219,7 @@ static void __init dcscb_usage_count_init(void)
 	dcscb_use_count[cpu][cluster] = 1;
 }
 
-extern void dcscb_power_up_setup(void);
+extern void dcscb_power_up_setup(unsigned int affinity_level);
 
 static int __init dcscb_init(void)
 {

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  7:12   ` Stephen Boyd
@ 2013-01-10 15:30     ` Nicolas Pitre
  0 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 15:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 9 Jan 2013, Stephen Boyd wrote:

> On 1/9/2013 4:20 PM, Nicolas Pitre wrote:
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index f95ba14ae3..2271f02e8e 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -1579,6 +1579,12 @@ config HAVE_ARM_TWD
> >  	help
> >  	  This options enables support for the ARM timer and watchdog unit
> >  
> > +config BIG_LITTLE
> > +	bool "big.LITTLE support (Experimental)"
> > +	depends on CPU_V7 && SMP && EXPERIMENTAL
> 
> I thought EXPERIMENTAL was being phased out?

True.

> > diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
> > index e8a4e58f1b..50880c494f 100644
> > --- a/arch/arm/common/Makefile
> > +++ b/arch/arm/common/Makefile
> > @@ -13,3 +13,6 @@ obj-$(CONFIG_SHARP_PARAM)	+= sharpsl_param.o
> >  obj-$(CONFIG_SHARP_SCOOP)	+= scoop.o
> >  obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
> >  obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
> > +obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
> > +obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o
> 
> This looks like non-related stuff?

Indeed.  Rebase fallouts.

Thanks.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
  2013-01-10  7:12   ` Stephen Boyd
@ 2013-01-10 15:34   ` Catalin Marinas
  2013-01-10 16:47     ` Nicolas Pitre
  2013-01-10 23:05   ` Will Deacon
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 140+ messages in thread
From: Catalin Marinas @ 2013-01-10 15:34 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nico,

On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> --- /dev/null
> +++ b/arch/arm/common/bL_entry.c
...
> +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];

IMHO, we should keep this array linear and ignore the cluster grouping
at this stage. This information could be added to later patches that
actually need to know about the b.L topology. This would also imply
that we treat the MPIDR just as an ID without digging into its bit
layout. But I haven't looked at the other patches yet to see how this
would fit.

> +void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> +{
> +       unsigned long val = ptr ? virt_to_phys(ptr) : 0;
> +       bL_entry_vectors[cluster][cpu] = val;
> +       smp_wmb();
> +       __cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
> +       outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> +                         __pa(&bL_entry_vectors[cluster][cpu + 1]));

Why are you using the smp_wmb() here? We don't need any barrier since
data cache ops by MVA are automatically ordered in relation to stores
to the same MVA (as long as the MVA is in Normal Cacheable memory).
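
A minimal sketch of the setter with the smp_wmb() simply dropped, relying
on that ordering guarantee (illustrative only, just the quoted function
minus the barrier):

void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
{
	unsigned long val = ptr ? virt_to_phys(ptr) : 0;

	bL_entry_vectors[cluster][cpu] = val;
	/* d-cache maintenance by MVA is ordered after the store above */
	__cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
}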

> --- /dev/null
> +++ b/arch/arm/common/bL_head.S
...
> +ENTRY(bL_entry_point)
> +
> + THUMB(        adr     r12, BSYM(1f)   )
> + THUMB(        bx      r12             )
> + THUMB(        .thumb                  )
> +1:
> +       mrc     p15, 0, r0, c0, c0, 5

Minor thing, maybe a comment for this line like @ MPIDR.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10 15:34   ` Catalin Marinas
@ 2013-01-10 16:47     ` Nicolas Pitre
  2013-01-11 11:45       ` Catalin Marinas
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 16:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Catalin Marinas wrote:

> Hi Nico,
> 
> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > --- /dev/null
> > +++ b/arch/arm/common/bL_entry.c
> ...
> > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> 
> IMHO, we should keep this array linear and ignore the cluster grouping
> at this stage. This information could be added to latter patches that
> actually need to know about the b.L topology.

That's virtually all of them.  Everything b.L related is always 
expressed in terms of a cpu,cluster tuple at the low level.

> This would also imply that we treat the MPIDR just as an ID without 
> digging into its bit layout.

That makes for too large an index space.  We always end up needing to 
break the MPIDR into a cpu,cluster thing as the MPIDR bits are too 
sparse.
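
For illustration, the decomposition mirrors the ubfx instructions in
bL_head.S, i.e. affinity level 0 gives the CPU and level 1 the cluster
(the helper name below is hypothetical):

static inline void mpidr_to_cpu_cluster(unsigned int mpidr,
					unsigned int *cpu,
					unsigned int *cluster)
{
	*cpu     = mpidr & 0xf;		/* Aff0: CPU within cluster */
	*cluster = (mpidr >> 8) & 0xf;	/* Aff1: cluster */
}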

> > +void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> > +{
> > +       unsigned long val = ptr ? virt_to_phys(ptr) : 0;
> > +       bL_entry_vectors[cluster][cpu] = val;
> > +       smp_wmb();
> > +       __cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
> > +       outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> > +                         __pa(&bL_entry_vectors[cluster][cpu + 1]));
> 
> Why are you using the smp_wmb() here? We don't need any barrier since
> data cache ops by MVA are automatically ordered in relation to stores
> to the same MVA (as long as the MVA is in Normal Cacheable memory).

That was the result of monkeying the write_pen_release() code.  I'll 
remove that as the rest of the code added later doesn't use that anyway.

> > --- /dev/null
> > +++ b/arch/arm/common/bL_head.S
> ...
> > +ENTRY(bL_entry_point)
> > +
> > + THUMB(        adr     r12, BSYM(1f)   )
> > + THUMB(        bx      r12             )
> > + THUMB(        .thumb                  )
> > +1:
> > +       mrc     p15, 0, r0, c0, c0, 5
> 
> Minor thing, maybe a comment for this line like @ MPIDR.

ACK.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
  2013-01-10 12:01   ` Dave Martin
@ 2013-01-10 16:53   ` Catalin Marinas
  2013-01-10 17:59     ` Nicolas Pitre
  2013-01-10 22:32     ` Nicolas Pitre
  2013-01-10 23:13   ` Will Deacon
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-10 16:53 UTC (permalink / raw)
  To: linux-arm-kernel

On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> --- /dev/null
> +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> @@ -0,0 +1,498 @@
> +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> +===============================================================

Nice description and ascii art :).

> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>                 platform_ops->powered_up();
>         return 0;
>  }
> +
> +struct bL_sync_struct bL_sync;
> +
> +static void __sync_range(volatile void *p, size_t size)
> +{
> +       char *_p = (char *)p;
> +
> +       __cpuc_flush_dcache_area(_p, size);
> +       outer_flush_range(__pa(_p), __pa(_p + size));
> +       outer_sync();

The outer flush-range operations already contain a cache_sync, so an
additional outer_sync() operation is not necessary.

You (well, Dave) said that you use the flush instead of
clean/invalidate to avoid races with other CPUs writing the location.
However, on the same CPU you can get a speculative load into L1 after
the L1 flush but before the L2 flush, so the reader case can fail.

The sequence for readers is (note *L2* inval first):

L2 inval
L1 inval

The sequence for writers is:

L1 clean
L2 clean

The bi-directional sequence (that's what you need) is:

L1 clean
L2 clean+inval
L1 clean+inval

The last L1 op must be clean+inval in case another CPU writes to this
location to avoid discarding the write.

If you don't have an L2, you just end up with two L1 clean ops, so you
can probably put some checks.
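
For reference, a rough sketch of that bi-directional sequence with the
existing kernel helpers (function name hypothetical; the first L1 step
uses __cpuc_flush_dcache_area, a clean+invalidate, which is a safe
superset of the plain clean wanted there):

#include <asm/cacheflush.h>
#include <asm/outercache.h>

static void bl_sync_mem(void *p, size_t size)
{
	phys_addr_t pa = __pa(p);

	__cpuc_flush_dcache_area(p, size);	/* L1 clean (+inval) */
	outer_flush_range(pa, pa + size);	/* L2 clean+inval (includes cache_sync) */
	__cpuc_flush_dcache_area(p, size);	/* L1 clean+inval again: drop any
						   speculative refill without losing
						   a concurrent write */
}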

> +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> +
> +/*
> + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> + *    This must be called at the point of committing to teardown of a CPU.
> + *    The CPU cache (SCTRL.C bit) is expected to still be active.
> + */
> +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> +{
> +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> +}
> +
> +/*
> + * __bL_cpu_down: Indicates that cpu teardown is complete and that the
> + *    cluster can be torn down without disrupting this CPU.
> + *    To avoid deadlocks, this must be called before a CPU is powered down.
> + *    The CPU cache (SCTRL.C bit) is expected to be off.
> + */
> +void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
> +{
> +       dsb();
> +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
> +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> +       sev();

For the sev() here (and other functions in this patch) you need a
dsb() before. I'm not sure outer_sync() has one.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 16:53   ` Catalin Marinas
@ 2013-01-10 17:59     ` Nicolas Pitre
  2013-01-10 21:50       ` Catalin Marinas
  2013-01-10 22:32     ` Nicolas Pitre
  1 sibling, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 17:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Catalin Marinas wrote:

> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > --- /dev/null
> > +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > @@ -0,0 +1,498 @@
> > +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> > +===============================================================
> 
> Nice description and ascii art :).
> 
> > --- a/arch/arm/common/bL_entry.c
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> >                 platform_ops->powered_up();
> >         return 0;
> >  }
> > +
> > +struct bL_sync_struct bL_sync;
> > +
> > +static void __sync_range(volatile void *p, size_t size)
> > +{
> > +       char *_p = (char *)p;
> > +
> > +       __cpuc_flush_dcache_area(_p, size);
> > +       outer_flush_range(__pa(_p), __pa(_p + size));
> > +       outer_sync();
> 
> The outer flush-range operations already contain a cache_sync, so an
> additional outer_sync() operation is not necessary.
> 
> You (well, Dave) said that you use the flush instead of
> clean/invalidate to avoid races with other CPUs writing the location.

Yes.  To clarify for everyone, the issue here is that those state values 
are being written and/or read by different CPUs which may or may not 
have their cache enabled.  And in some cases the L1 cache is disabled 
but L2 is still enabled.

So a cached reader must invalidate the cache to ensure it reads an 
up-to-date value from RAM since the last update might have come from a 
CPU with its cache disabled.  But invalidating the cache might discard 
the newly updated state from a writer with an active cache before that 
writer had the chance to clean its cache to RAM.  Therefore, using a 
cache flush rather than a cache invalidate before every read solves
this race.
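
Illustrated with the sync_mem() helper from the patch above (the reader
function name is hypothetical):

static int read_cpu_state(unsigned int cpu, unsigned int cluster)
{
	/*
	 * Flush (clean+invalidate) rather than invalidate: a writer's
	 * not-yet-cleaned update is pushed out to RAM instead of being
	 * discarded, and the load below then sees the up-to-date value.
	 */
	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
	return bL_sync.clusters[cluster].cpus[cpu].cpu;
}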

> However, on the same CPU you can get a speculative load into L1 after
> the L1 flush but before the L2 flush, so the reader case can fail.
> 
> The sequence for readers is (note *L2* inval first):
> 
> L2 inval
> L1 inval

As you noted below and as I explained above, this can't be an inval 
operation as that could discard a concurrent writer's update.

> The sequence for writers is:
> 
> L1 clean
> L2 clean
> 
> The bi-directional sequence (that's what you need) is:
> 
> L1 clean
> L2 clean+inval
> L1 clean+inval
> 
> The last L1 op must be clean+inval in case another CPU writes to this
> location to avoid discarding the write.
> 
> If you don't have an L2, you just end up with two L1 clean ops, so you
> can probably put some checks.

In fact, since this is only used on A7/A15 right now, there is no outer 
cache and the outer calls are effectively no-ops.  I'm wondering if 
those should simply be removed until/unless there is some system showing 
up with a need for them.

> > +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> > +
> > +/*
> > + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> > + *    This must be called at the point of committing to teardown of a CPU.
> > + *    The CPU cache (SCTRL.C bit) is expected to still be active.
> > + */
> > +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> > +{
> > +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> > +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> > +}
> > +
> > +/*
> > + * __bL_cpu_down: Indicates that cpu teardown is complete and that the
> > + *    cluster can be torn down without disrupting this CPU.
> > + *    To avoid deadlocks, this must be called before a CPU is powered down.
> > + *    The CPU cache (SCTRL.C bit) is expected to be off.
> > + */
> > +void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
> > +{
> > +       dsb();
> > +       bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
> > +       sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> > +       sev();
> 
> For the sev() here (and other functions in this patch) you need a
> dsb() before. I'm not sure outer_sync() has one.

__cpuc_flush_dcache_area() does though, via v7_flush_kern_dcache_area.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache
  2013-01-10  0:20 ` [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache Nicolas Pitre
@ 2013-01-10 18:50   ` Dave Martin
  2013-01-10 19:13     ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-10 18:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 09, 2013 at 07:20:49PM -0500, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>

To avoid confusion, the prefix in the subject line should be "CCI", not
"TC2".  Any platform which calls disable_cci() after turning caches off
and/or disabling the SMP bit may fall foul of this otherwise ... i.e.,
any platform which has CCI.

Let me know if you want me to send you a modified patch.

Cheers
---Dave

> Non-local variables used by the CCI management function called after
> disabling the cache must be flushed out to main memory in advance,
> otherwise incoherency of those values may occur if they are sitting
> in the cache of some other CPU when cci_disable() executes.
> 
> This patch adds the appropriate flushing to the CCI driver to ensure
> that the relevant data is available in RAM ahead of time.
> 
> Because this creates a dependency on arch-specific cacheflushing
> functions, this patch also makes ARM_CCI depend on ARM.
> 
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  drivers/misc/Kconfig   |  1 +
>  drivers/misc/arm-cci.c | 21 +++++++++++++++++++--
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index 30d5be1ad2..b24630696c 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -501,6 +501,7 @@ config USB_SWITCH_FSA9480
>  
>  config ARM_CCI
>         bool "ARM CCI driver support"
> +	depends on ARM
>  
>  source "drivers/misc/c2port/Kconfig"
>  source "drivers/misc/eeprom/Kconfig"
> diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
> index f329c43099..739e1c96d3 100644
> --- a/drivers/misc/arm-cci.c
> +++ b/drivers/misc/arm-cci.c
> @@ -21,8 +21,16 @@
>  #include <linux/slab.h>
>  #include <linux/arm-cci.h>
>  
> -#define CCI400_EAG_OFFSET       0x4000
> -#define CCI400_KF_OFFSET        0x5000
> +#include <asm/cacheflush.h>
> +#include <asm/memory.h>
> +#include <asm/outercache.h>
> +
> +#include <asm/irq_regs.h>
> +#include <asm/pmu.h>
> +
> +#define CCI400_PMCR                   0x0100
> +#define CCI400_EAG_OFFSET             0x4000
> +#define CCI400_KF_OFFSET              0x5000
>  
>  #define DRIVER_NAME	"CCI"
>  struct cci_drvdata {
> @@ -73,6 +81,15 @@ static int __devinit cci_driver_probe(struct platform_device *pdev)
>  		goto ioremap_err;
>  	}
>  
> +	/*
> +	 * Multi-cluster systems may need this data when non-coherent, during
> +	 * cluster power-up/power-down. Make sure it reaches main memory:
> +	 */
> +	__cpuc_flush_dcache_area(info, sizeof *info);
> +	__cpuc_flush_dcache_area(&info, sizeof info);
> +	outer_clean_range(virt_to_phys(info), virt_to_phys(info + 1));
> +	outer_clean_range(virt_to_phys(&info), virt_to_phys(&info + 1));
> +
>  	platform_set_drvdata(pdev, info);
>  
>  	pr_info("CCI loaded at %p\n", info->baseaddr);
> -- 
> 1.8.0
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 12:01   ` Dave Martin
@ 2013-01-10 19:04     ` Nicolas Pitre
  2013-01-11 11:30       ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 19:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Dave Martin wrote:

> > +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> 
> The addition of the affinity level parameter for power_up_setup means
> that this prototype is not correct.

Indeed.

> This is not a functional change, since that function is only called from
> assembler anyway, but it will help avoid confusion.

Fixed now, as well as the DCSCB usage.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache
  2013-01-10 18:50   ` Dave Martin
@ 2013-01-10 19:13     ` Nicolas Pitre
  2013-01-11 11:38       ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 19:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Dave Martin wrote:

> On Wed, Jan 09, 2013 at 07:20:49PM -0500, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> 
> To avoid confusion, the prefix in the subject line should be "CCI", not
> "TC2".

Absolutely.  This is my mistake as I removed the TC2 changes from your 
original patch to only keep the CCI ones, but forgot to update the patch 
title.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 17:59     ` Nicolas Pitre
@ 2013-01-10 21:50       ` Catalin Marinas
  2013-01-10 22:31         ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Catalin Marinas @ 2013-01-10 21:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 10 January 2013 17:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 10 Jan 2013, Catalin Marinas wrote:
>
>> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > --- a/arch/arm/common/bL_entry.c
>> > +++ b/arch/arm/common/bL_entry.c
>> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>> >                 platform_ops->powered_up();
>> >         return 0;
>> >  }
>> > +
>> > +struct bL_sync_struct bL_sync;
>> > +
>> > +static void __sync_range(volatile void *p, size_t size)
>> > +{
>> > +       char *_p = (char *)p;
>> > +
>> > +       __cpuc_flush_dcache_area(_p, size);
>> > +       outer_flush_range(__pa(_p), __pa(_p + size));
>> > +       outer_sync();
...
>> However, on the same CPU you can get a speculative load into L1 after
>> the L1 flush but before the L2 flush, so the reader case can fail.
>>
>> The sequence for readers is (note *L2* inval first):
>>
>> L2 inval
>> L1 inval
>
> As you noted below and as I explained above, this can't be an inval
> operation as that could discard a concurrent writer's update.
>
>> The sequence for writers is:
>>
>> L1 clean
>> L2 clean
>>
>> The bi-directional sequence (that's what you need) is:
>>
>> L1 clean
>> L2 clean+inval
>> L1 clean+inval
>>
>> The last L1 op must be clean+inval in case another CPU writes to this
>> location to avoid discarding the write.
>>
>> If you don't have an L2, you just end up with two L1 clean ops, so you
>> can probably put some checks.
>
> In fact, since this is only used on A7/A15 right now, there is no outer
> cache and the outer calls are effectively no-ops.  I'm wondering if
> those should simply be removed until/unless there is some system showing
> up with a need for them.

You could. I expect multi-cluster systems to have integrated L2 cache
and avoid explicit outer cache maintenance. But is there a chance that
your patches could be generalised to existing systems with A9 (not b.L
configuration but just hotplug or cpuidle support)? I haven't finished
reading all the patches, so maybe that's not the case at all.

Anyway, my point is that if L1 is inner and L2 outer, the correct
bi-directional flushing sequence is slightly different.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 21:50       ` Catalin Marinas
@ 2013-01-10 22:31         ` Nicolas Pitre
  2013-01-11 10:36           ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 22:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Catalin Marinas wrote:

> On 10 January 2013 17:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Thu, 10 Jan 2013, Catalin Marinas wrote:
> >
> >> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> > --- a/arch/arm/common/bL_entry.c
> >> > +++ b/arch/arm/common/bL_entry.c
> >> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> >> >                 platform_ops->powered_up();
> >> >         return 0;
> >> >  }
> >> > +
> >> > +struct bL_sync_struct bL_sync;
> >> > +
> >> > +static void __sync_range(volatile void *p, size_t size)
> >> > +{
> >> > +       char *_p = (char *)p;
> >> > +
> >> > +       __cpuc_flush_dcache_area(_p, size);
> >> > +       outer_flush_range(__pa(_p), __pa(_p + size));
> >> > +       outer_sync();
> ...
> >> However, on the same CPU you can get a speculative load into L1 after
> >> the L1 flush but before the L2 flush, so the reader case can fail.
> >>
> >> The sequence for readers is (note *L2* inval first):
> >>
> >> L2 inval
> >> L1 inval
> >
> > As you noted below and as I explained above, this can't be an inval
> > operation as that could discard a concurrent writer's update.
> >
> >> The sequence for writers is:
> >>
> >> L1 clean
> >> L2 clean
> >>
> >> The bi-directional sequence (that's what you need) is:
> >>
> >> L1 clean
> >> L2 clean+inval
> >> L1 clean+inval
> >>
> >> The last L1 op must be clean+inval in case another CPU writes to this
> >> location to avoid discarding the write.
> >>
> >> If you don't have an L2, you just end up with two L1 clean ops, so you
> >> can probably put some checks.
> >
> > In fact, since this is only used on A7/A15 right now, there is no outer
> > cache and the outer calls are effectively no-ops.  I'm wondering if
> > those should simply be removed until/unless there is some system showing
> > up with a need for them.
> 
> You could. I expect multi-cluster systems to have integrated L2 cache
> and avoid explicit outer cache maintenance. But is there a chance that
> your patches could be generalised to existing systems with A9 (not b.L
> configuration but just hotplug or cpuidle support)? I haven't finished
> reading all the patches, so maybe that's not the case at all.

I suppose it could, although the special requirements put on the first 
man / last man exist only for multi-cluster systems.  OTOH, existing A9 
systems are already served by far less complex code, so it is 
really a matter of figuring out if the backend for those A9 systems 
needed by this cluster code would be simpler than the existing code, in 
which case that would certainly be beneficial.

> Anyway, my point is that if L1 is inner and L2 outer, the correct
> bi-derectional flushing sequence is slightly different.

Agreed, I'll make sure to capture that in the code somehow.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 16:53   ` Catalin Marinas
  2013-01-10 17:59     ` Nicolas Pitre
@ 2013-01-10 22:32     ` Nicolas Pitre
  1 sibling, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-10 22:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Catalin Marinas wrote:

> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > --- /dev/null
> > +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > @@ -0,0 +1,498 @@
> > +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> > +===============================================================
> 
> Nice description and ascii art :).

Credits go to Dave Martin.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (16 preceding siblings ...)
  2013-01-10  0:46 ` [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Rob Herring
@ 2013-01-10 23:01 ` Will Deacon
       [not found] ` <1357777251-13541-1-git-send-email-nicolas.pitre-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
  2013-03-07  8:27 ` Pavel Machek
  19 siblings, 0 replies; 140+ messages in thread
From: Will Deacon @ 2013-01-10 23:01 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

On Thu, Jan 10, 2013 at 12:20:35AM +0000, Nicolas Pitre wrote:
> This is the initial public posting of the initial support for big.LITTLE.
> Included here is the code required to safely power up and down CPUs in a
> b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> boot and CPU hotplug support is included at this time.  Getting to this
> point already represents a significcant chunk of code as illustrated by
> the diffstat below.

I've just started going through this, so I have some comments on the first
few patches. I'll try and get through the rest of it soon (but Christoffer
is kicking me to look at kvm again too :)

> Low-level support code:
> [PATCH 01/16] ARM: b.L: secondary kernel entry code
> [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
> [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
> [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
> [PATCH 05/16] ARM: bL_head: vlock-based first man election

I got this far, so I'll send my comments as replies to those.

> Adaptation layer to hook with the generic kernel infrastructure:
> [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug
> [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before
> [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a
> [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at
> 
> Fast Models support:
> [PATCH 10/16] ARM: vexpress: introduce DCSCB support
> [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power
> [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs
> [PATCH 13/16] drivers: misc: add ARM CCI support
> [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from
> [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency
> [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree

These last six really need Pawel Moll on CC. He's on holiday at the moment,
but he'll be back at the end of the month so I suggest pinging him then so
they don't get lost.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
  2013-01-10  7:12   ` Stephen Boyd
  2013-01-10 15:34   ` Catalin Marinas
@ 2013-01-10 23:05   ` Will Deacon
  2013-01-11  1:26     ` Nicolas Pitre
  2013-01-11 17:16   ` Santosh Shilimkar
  2013-03-07  7:37   ` Pavel Machek
  4 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-10 23:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:36AM +0000, Nicolas Pitre wrote:
> CPUs in a big.LITTLE systems have special needs when entering the kernel
> due to a hotplug event, or when resuming from a deep sleep mode.
> 
> This is vectorized so multiple CPUs can enter the kernel in parallel
> without serialization.
> 
> Only the basic structure is introduced here.  This will be extended
> later.
> 
> TODO: MPIDR based indexing should eventually be made runtime adjusted.

Agreed.

> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> new file mode 100644
> index 0000000000..80fff49417
> --- /dev/null
> +++ b/arch/arm/common/bL_entry.c
> @@ -0,0 +1,30 @@
> +/*
> + * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
> + *
> + * Created by:  Nicolas Pitre, March 2012
> + * Copyright:   (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +
> +#include <asm/bL_entry.h>
> +#include <asm/barrier.h>
> +#include <asm/proc-fns.h>
> +#include <asm/cacheflush.h>
> +
> +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];

Does this actually need to be volatile? I'd have thought a compiler
barrier in place of the smp_wmb below would be enough (following on from
Catalin's comments).

> +
> +void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> +{
> +	unsigned long val = ptr ? virt_to_phys(ptr) : 0;
> +	bL_entry_vectors[cluster][cpu] = val;
> +	smp_wmb();
> +	__cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
> +	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> +			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> +}
> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> new file mode 100644
> index 0000000000..9d351f2b4c
> --- /dev/null
> +++ b/arch/arm/common/bL_head.S
> @@ -0,0 +1,81 @@
> +/*
> + * arch/arm/common/bL_head.S -- big.LITTLE kernel re-entry point
> + *
> + * Created by:  Nicolas Pitre, March 2012
> + * Copyright:   (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/bL_entry.h>
> +
> +	.macro	pr_dbg	cpu, string
> +#if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> +	b	1901f
> +1902:	.ascii	"CPU 0: \0CPU 1: \0CPU 2: \0CPU 3: \0"
> +	.ascii	"CPU 4: \0CPU 5: \0CPU 6: \0CPU 7: \0"
> +1903:	.asciz	"\string"
> +	.align
> +1901:	adr	r0, 1902b
> +	add	r0, r0, \cpu, lsl #3
> +	bl	printascii
> +	adr	r0, 1903b
> +	bl	printascii
> +#endif
> +	.endm
> +
> +	.arm
> +	.align
> +
> +ENTRY(bL_entry_point)
> +
> + THUMB(	adr	r12, BSYM(1f)	)
> + THUMB(	bx	r12		)
> + THUMB(	.thumb			)
> +1:
> +	mrc	p15, 0, r0, c0, c0, 5
> +	ubfx	r9, r0, #0, #4			@ r9 = cpu
> +	ubfx	r10, r0, #8, #4			@ r10 = cluster
> +	mov	r3, #BL_CPUS_PER_CLUSTER
> +	mla	r4, r3, r10, r9			@ r4 = canonical CPU index
> +	cmp	r4, #(BL_CPUS_PER_CLUSTER * BL_NR_CLUSTERS)
> +	blo	2f
> +
> +	/* We didn't expect this CPU.  Try to make it quiet. */
> +1:	wfi
> +	wfe
> +	b	1b

I realise this CPU is stuck at this point, but you should have a dsb
before a wfi instruction. This could be problematic with the CCI this
early, so maybe just a comment saying that it doesn't matter because we
don't care about this core?

> +
> +2:	pr_dbg	r4, "kernel bL_entry_point\n"
> +
> +	/*
> +	 * MMU is off so we need to get to bL_entry_vectors in a
> +	 * position independent way.
> +	 */
> +	adr	r5, 3f
> +	ldr	r6, [r5]
> +	add	r6, r5, r6			@ r6 = bL_entry_vectors
> +
> +bL_entry_gated:
> +	ldr	r5, [r6, r4, lsl #2]		@ r5 = CPU entry vector
> +	cmp	r5, #0
> +	wfeeq
> +	beq	bL_entry_gated
> +	pr_dbg	r4, "released\n"
> +	bx	r5
> +
> +	.align	2
> +
> +3:	.word	bL_entry_vectors - .
> +
> +ENDPROC(bL_entry_point)
> +
> +	.bss
> +	.align	5
> +
> +	.type	bL_entry_vectors, #object
> +ENTRY(bL_entry_vectors)
> +	.space	4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER

Is there a particular reason to put this in the bss?

> diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> new file mode 100644
> index 0000000000..ff623333a1
> --- /dev/null
> +++ b/arch/arm/include/asm/bL_entry.h
> @@ -0,0 +1,35 @@
> +/*
> + * arch/arm/include/asm/bL_entry.h
> + *
> + * Created by:  Nicolas Pitre, April 2012
> + * Copyright:   (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef BL_ENTRY_H
> +#define BL_ENTRY_H
> +
> +#define BL_CPUS_PER_CLUSTER	4
> +#define BL_NR_CLUSTERS		2

Hmm, I see these have to be constant so you can allocate your space in
the assembly file. In which case, I think it's worth changing their
names to have MAX or LIMIT in them... maybe they could even be CONFIG
options?

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-10  0:20 ` [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
@ 2013-01-10 23:08   ` Will Deacon
  2013-01-11  2:30     ` Nicolas Pitre
  2013-01-11 17:26   ` Santosh Shilimkar
  1 sibling, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-10 23:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:37AM +0000, Nicolas Pitre wrote:
> This is the basic API used to handle the powering up/down of individual
> CPUs in a big.LITTLE system.  The platform specific backend implementation
> has the responsibility to also handle the cluster level power as well when
> the first/last CPU in a cluster is brought up/down.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  arch/arm/common/bL_entry.c      | 88 +++++++++++++++++++++++++++++++++++++++
>  arch/arm/include/asm/bL_entry.h | 92 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 180 insertions(+)
> 
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> index 80fff49417..41de0622de 100644
> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -11,11 +11,13 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/init.h>
> +#include <linux/irqflags.h>
>  
>  #include <asm/bL_entry.h>
>  #include <asm/barrier.h>
>  #include <asm/proc-fns.h>
>  #include <asm/cacheflush.h>
> +#include <asm/idmap.h>
>  
>  extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
>  
> @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
>  	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
>  			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
>  }
> +
> +static const struct bL_platform_power_ops *platform_ops;
> +
> +int __init bL_platform_power_register(const struct bL_platform_power_ops *ops)
> +{
> +	if (platform_ops)
> +		return -EBUSY;
> +	platform_ops = ops;
> +	return 0;
> +}
> +
> +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> +{
> +	if (!platform_ops)
> +		return -EUNATCH;

Is this the right error code?

> +	might_sleep();
> +	return platform_ops->power_up(cpu, cluster);
> +}
> +
> +typedef void (*phys_reset_t)(unsigned long);

Maybe it's worth putting this typedef in a header file somewhere. It's
also used by the soft reboot code.
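
Something along these lines, perhaps (the exact header is only a
suggestion, asm/proc-fns.h or wherever cpu_reset lives):

	/* shared by the soft reboot code and the b.L entry code */
	typedef void (*phys_reset_t)(unsigned long);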

> +
> +void bL_cpu_power_down(void)
> +{
> +	phys_reset_t phys_reset;
> +
> +	BUG_ON(!platform_ops);

Seems a bit overkill, or are we unrecoverable by this point?

> +	BUG_ON(!irqs_disabled());
> +
> +	/*
> +	 * Do this before calling into the power_down method,
> +	 * as it might not always be safe to do afterwards.
> +	 */
> +	setup_mm_for_reboot();
> +
> +	platform_ops->power_down();
> +
> +	/*
> +	 * It is possible for a power_up request to happen concurrently
> +	 * with a power_down request for the same CPU. In this case the
> +	 * power_down method might not be able to actually enter a
> +	 * powered down state with the WFI instruction if the power_up
> +	 * method has removed the required reset condition.  The
> +	 * power_down method is then allowed to return. We must perform
> +	 * a re-entry in the kernel as if the power_up method just had
> +	 * deasserted reset on the CPU.
> +	 *
> +	 * To simplify race issues, the platform specific implementation
> +	 * must accommodate for the possibility of unordered calls to
> +	 * power_down and power_up with a usage count. Therefore, if a
> +	 * call to power_up is issued for a CPU that is not down, then
> +	 * the next call to power_down must not attempt a full shutdown
> +	 * but only do the minimum (normally disabling L1 cache and CPU
> +	 * coherency) and return just as if a concurrent power_up request
> +	 * had happened as described above.
> +	 */
> +
> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> +	phys_reset(virt_to_phys(bL_entry_point));
> +
> +	/* should never get here */
> +	BUG();
> +}
> +
> +void bL_cpu_suspend(u64 expected_residency)
> +{
> +	phys_reset_t phys_reset;
> +
> +	BUG_ON(!platform_ops);
> +	BUG_ON(!irqs_disabled());
> +
> +	/* Very similar to bL_cpu_power_down() */
> +	setup_mm_for_reboot();
> +	platform_ops->suspend(expected_residency);
> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> +	phys_reset(virt_to_phys(bL_entry_point));
> +	BUG();
> +}
> +
> +int bL_cpu_powered_up(void)
> +{
> +	if (!platform_ops)
> +		return -EUNATCH;
> +	if (platform_ops->powered_up)
> +		platform_ops->powered_up();
> +	return 0;
> +}
> diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> index ff623333a1..942d7f9f19 100644
> --- a/arch/arm/include/asm/bL_entry.h
> +++ b/arch/arm/include/asm/bL_entry.h
> @@ -31,5 +31,97 @@ extern void bL_entry_point(void);
>   */
>  void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
>  
> +/*
> + * CPU/cluster power operations API for higher subsystems to use.
> + */
> +
> +/**
> + * bL_cpu_power_up - make given CPU in given cluster runnable
> + *
> + * @cpu: CPU number within given cluster
> + * @cluster: cluster number for the CPU
> + *
> + * The identified CPU is brought out of reset.  If the cluster was powered
> + * down then it is brought up as well, taking care not to let the other CPUs
> + * in the cluster run, and ensuring appropriate cluster setup.
> + *
> + * Caller must ensure the appropriate entry vector is initialized with
> + * bL_set_entry_vector() prior to calling this.
> + *
> + * This must be called in a sleepable context.  However, the implementation
> + * is strongly encouraged to return early and let the operation happen
> + * asynchronously, especially when significant delays are expected.
> + *
> + * If the operation cannot be performed then an error code is returned.
> + */
> +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster);
> +
> +/**
> + * bL_cpu_power_down - power the calling CPU down
> + *
> + * The calling CPU is powered down.
> + *
> + * If this CPU is found to be the "last man standing" in the cluster
> + * then the cluster is prepared for power-down too.
> + *
> + * This must be called with interrupts disabled.
> + *
> + * This does not return.  Re-entry in the kernel is expected via
> + * bL_entry_point.
> + */
> +void bL_cpu_power_down(void);
> +
> +/**
> + * bL_cpu_suspend - bring the calling CPU in a suspended state
> + *
> + * @expected_residency: duration in microseconds the CPU is expected
> + *			to remain suspended, or 0 if unknown/infinity.
> + *
> + * The calling CPU is suspended.  The expected residency argument is used
> + * as a hint by the platform specific backend to implement the appropriate
> + * sleep state level according to the knowledge it has on wake-up latency
> + * for the given hardware.
> + *
> + * If this CPU is found to be the "last man standing" in the cluster
> + * then the cluster may be prepared for power-down too, if the expected
> + * residency makes it worthwhile.
> + *
> + * This must be called with interrupts disabled.
> + *
> + * This does not return.  Re-entry in the kernel is expected via
> + * bL_entry_point.
> + */
> +void bL_cpu_suspend(u64 expected_residency);
> +
> +/**
> + * bL_cpu_powered_up - housekeeping work after a CPU has been powered up
> + *
> + * This lets the platform specific backend code perform needed housekeeping
> + * work.  This must be called by the newly activated CPU as soon as it is
> + * fully operational in kernel space, before it enables interrupts.
> + *
> + * If the operation cannot be performed then an error code is returned.
> + */
> +int bL_cpu_powered_up(void);
> +
> +/*
> + * Platform specific methods used in the implementation of the above API.
> + */
> +struct bL_platform_power_ops {
> +	int (*power_up)(unsigned int cpu, unsigned int cluster);
> +	void (*power_down)(void);
> +	void (*suspend)(u64);
> +	void (*powered_up)(void);
> +};

It would be good if these prototypes matched the PSCI code, then platforms
could just glue them together directly.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
  2013-01-10 12:01   ` Dave Martin
  2013-01-10 16:53   ` Catalin Marinas
@ 2013-01-10 23:13   ` Will Deacon
  2013-01-11  1:50     ` Nicolas Pitre
  2013-01-11 17:46   ` Santosh Shilimkar
  2013-01-14 17:08   ` Dave Martin
  4 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-10 23:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:38AM +0000, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
> 
> This provides helper methods to coordinate between CPUs coming down
> and CPUs going up, as well as documentation on the used algorithms,
> so that cluster teardown and setup
> operations are not done for a cluster simultaneously.

[...]

> +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> +{
> +       unsigned int i, j, mpidr, this_cluster;
> +
> +       BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
> +       BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
> +
> +       /*
> +        * Set initial CPU and cluster states.
> +        * Only one cluster is assumed to be active at this point.
> +        */
> +       for (i = 0; i < BL_NR_CLUSTERS; i++) {
> +               bL_sync.clusters[i].cluster = CLUSTER_DOWN;
> +               bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
> +               for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
> +                       bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
> +       }
> +       asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));

We have a helper for this...

> +       this_cluster = (mpidr >> 8) & 0xf;

... and also this, thanks to Lorenzo's recent patches.

> +       for_each_online_cpu(i)
> +               bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
> +       bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
> +       sync_mem(&bL_sync);
> +
> +       if (power_up_setup) {
> +               bL_power_up_setup_phys = virt_to_phys(power_up_setup);
> +               sync_mem(&bL_power_up_setup_phys);
> +       }
> +
> +       return 0;
> +}
> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> index 9d351f2b4c..f7a64ac127 100644
> --- a/arch/arm/common/bL_head.S
> +++ b/arch/arm/common/bL_head.S
> @@ -7,11 +7,19 @@
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
>   * published by the Free Software Foundation.
> + *
> + *
> + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> + * for details of the synchronisation algorithms used here.
>   */
> 
>  #include <linux/linkage.h>
>  #include <asm/bL_entry.h>
> 
> +.if BL_SYNC_CLUSTER_CPUS
> +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> +.endif
> +
>         .macro  pr_dbg  cpu, string
>  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
>         b       1901f
> @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
>  2:     pr_dbg  r4, "kernel bL_entry_point\n"
> 
>         /*
> -        * MMU is off so we need to get to bL_entry_vectors in a
> +        * MMU is off so we need to get to various variables in a
>          * position independent way.
>          */
>         adr     r5, 3f
> -       ldr     r6, [r5]
> +       ldmia   r5, {r6, r7, r8}
>         add     r6, r5, r6                      @ r6 = bL_entry_vectors
> +       ldr     r7, [r5, r7]                    @ r7 = bL_power_up_setup_phys
> +       add     r8, r5, r8                      @ r8 = bL_sync
> +
> +       mov     r0, #BL_SYNC_CLUSTER_SIZE
> +       mla     r8, r0, r10, r8                 @ r8 = bL_sync cluster base
> +
> +       @ Signal that this CPU is coming UP:
> +       mov     r0, #CPU_COMING_UP
> +       mov     r5, #BL_SYNC_CPU_SIZE
> +       mla     r5, r9, r5, r8                  @ r5 = bL_sync cpu address
> +       strb    r0, [r5]
> +
> +       dsb

Why is a dmb not enough here? In fact, the same goes for most of these
other than the one preceding the sev. Is there an interaction with the
different mappings for the cluster data that I've missed?

> +
> +       @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> +       @ state, because there is at least one active CPU (this CPU).
> +
> +       @ Check if the cluster has been set up yet:
> +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +       cmp     r0, #CLUSTER_UP
> +       beq     cluster_already_up
> +
> +       @ Signal that the cluster is being brought up:
> +       mov     r0, #INBOUND_COMING_UP
> +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> +
> +       dsb
> +
> +       @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
> +       @ point onwards will observe INBOUND_COMING_UP and abort.
> +
> +       @ Wait for any previously-pending cluster teardown operations to abort
> +       @ or complete:
> +cluster_teardown_wait:
> +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +       cmp     r0, #CLUSTER_GOING_DOWN
> +       wfeeq
> +       beq     cluster_teardown_wait
> +
> +       @ power_up_setup is responsible for setting up the cluster:
> +
> +       cmp     r7, #0
> +       mov     r0, #1          @ second (cluster) affinity level
> +       blxne   r7              @ Call power_up_setup if defined
> +
> +       @ Leave the cluster setup critical section:
> +
> +       dsb
> +       mov     r0, #INBOUND_NOT_COMING_UP
> +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> +       mov     r0, #CLUSTER_UP
> +       strb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +       dsb
> +       sev
> +
> +cluster_already_up:
> +       @ If a platform-specific CPU setup hook is needed, it is
> +       @ called from here.
> +
> +       cmp     r7, #0
> +       mov     r0, #0          @ first (CPU) affinity level
> +       blxne   r7              @ Call power_up_setup if defined
> +
> +       @ Mark the CPU as up:
> +
> +       dsb
> +       mov     r0, #CPU_UP
> +       strb    r0, [r5]
> +       dsb
> +       sev
> 
>  bL_entry_gated:
>         ldr     r5, [r6, r4, lsl #2]            @ r5 = CPU entry vector
> @@ -70,6 +148,8 @@ bL_entry_gated:
>         .align  2
> 
>  3:     .word   bL_entry_vectors - .
> +       .word   bL_power_up_setup_phys - 3b
> +       .word   bL_sync - 3b
> 
>  ENDPROC(bL_entry_point)
> 
> @@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
>         .type   bL_entry_vectors, #object
>  ENTRY(bL_entry_vectors)
>         .space  4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> +
> +       .type   bL_power_up_setup_phys, #object
> +ENTRY(bL_power_up_setup_phys)
> +       .space  4               @ set by bL_cluster_sync_init()
> diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> index 942d7f9f19..167394d9a0 100644
> --- a/arch/arm/include/asm/bL_entry.h
> +++ b/arch/arm/include/asm/bL_entry.h
> @@ -15,8 +15,37 @@
>  #define BL_CPUS_PER_CLUSTER    4
>  #define BL_NR_CLUSTERS         2
> 
> +/* Definitions for bL_cluster_sync_struct */
> +#define CPU_DOWN               0x11
> +#define CPU_COMING_UP          0x12
> +#define CPU_UP                 0x13
> +#define CPU_GOING_DOWN         0x14
> +
> +#define CLUSTER_DOWN           0x21
> +#define CLUSTER_UP             0x22
> +#define CLUSTER_GOING_DOWN     0x23
> +
> +#define INBOUND_NOT_COMING_UP  0x31
> +#define INBOUND_COMING_UP      0x32

Do these numbers signify anything? Why not 0, 1, 2 etc?

> +
> +/* This is a complete guess. */
> +#define __CACHE_WRITEBACK_ORDER        6

Is this CONFIG_ARM_L1_CACHE_SHIFT?

> +#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
> +
> +/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
> +#define BL_SYNC_CLUSTER_CPUS   0

Why not use asm-offsets.h for this?

> +#define BL_SYNC_CPU_SIZE       __CACHE_WRITEBACK_GRANULE
> +#define BL_SYNC_CLUSTER_CLUSTER \
> +       (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
> +#define BL_SYNC_CLUSTER_INBOUND \
> +       (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
> +#define BL_SYNC_CLUSTER_SIZE \
> +       (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
> +

Hmm, this looks pretty fragile to me but again, you need this stuff at
compile time. Is there an architected maximum value for the writeback
granule? Failing that, we may as well just use things like
__cacheline_aligned if we're only using the L1 alignment anyway.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
  2013-01-10  0:20 ` [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes Nicolas Pitre
@ 2013-01-10 23:18   ` Will Deacon
  2013-01-11  3:15     ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-10 23:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:39AM +0000, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
> 
> This patch adds a simple low-level voting mutex implementation
> to be used to arbitrate during first man selection when no load/store
> exclusive instructions are usable.
> 
> For want of a better name, these are called "vlocks".  (I was
> tempted to call them ballot locks, but "block" is way too confusing
> an abbreviation...)
> 
> There is no function to wait for the lock to be released, and no
> vlock_lock() function since we don't need these at the moment.
> These could straightforwardly be added if vlocks get used for other
> purposes.

[...]

> diff --git a/Documentation/arm/big.LITTLE/vlocks.txt b/Documentation/arm/big.LITTLE/vlocks.txt
> new file mode 100644
> index 0000000000..90672ddc6a
> --- /dev/null
> +++ b/Documentation/arm/big.LITTLE/vlocks.txt
> @@ -0,0 +1,211 @@
> +vlocks for Bare-Metal Mutual Exclusion
> +======================================

[...]

> +ARM implementation
> +------------------
> +
> +The current ARM implementation [2] contains a some optimisations beyond

-a

> +the basic algorithm:
> +
> + * By packing the members of the currently_voting array close together,
> +   we can read the whole array in one transaction (providing the number
> +   of CPUs potentially contending the lock is small enough).  This
> +   reduces the number of round-trips required to external memory.
> +
> +   In the ARM implementation, this means that we can use a single load
> +   and comparison:
> +
> +       LDR     Rt, [Rn]
> +       CMP     Rt, #0
> +
> +   ...in place of code equivalent to:
> +
> +       LDRB    Rt, [Rn]
> +       CMP     Rt, #0
> +       LDRBEQ  Rt, [Rn, #1]
> +       CMPEQ   Rt, #0
> +       LDRBEQ  Rt, [Rn, #2]
> +       CMPEQ   Rt, #0
> +       LDRBEQ  Rt, [Rn, #3]
> +       CMPEQ   Rt, #0
> +
> +   This cuts down on the fast-path latency, as well as potentially
> +   reducing bus contention in contended cases.
> +
> +   The optimisation relies on the fact that the ARM memory system
> +   guarantees coherency between overlapping memory accesses of
> +   different sizes, similarly to many other architectures.  Note that
> +   we do not care which element of currently_voting appears in which
> +   bits of Rt, so there is no need to worry about endianness in this
> +   optimisation.
> +
> +   If there are too many CPUs to read the currently_voting array in
> +   one transaction then multiple transactions are still required.  The
> +   implementation uses a simple loop of word-sized loads for this
> +   case.  The number of transactions is still fewer than would be
> +   required if bytes were loaded individually.
> +
> +
> +   In principle, we could aggregate further by using LDRD or LDM, but
> +   to keep the code simple this was not attempted in the initial
> +   implementation.
> +
> +
> + * vlocks are currently only used to coordinate between CPUs which are
> +   unable to enable their caches yet.  This means that the
> +   implementation removes many of the barriers which would be required
> +   when executing the algorithm in cached memory.

I think you need to elaborate on this and clearly identify the
requirements of the memory behaviour. In reality, these locks are hardly
ever usable so we don't want them cropping up in driver code and the
like!

> +
> +   packing of the currently_voting array does not work with cached
> +   memory unless all CPUs contending the lock are cache-coherent, due
> +   to cache writebacks from one CPU clobbering values written by other
> +   CPUs.  (Though if all the CPUs are cache-coherent, you should
> +   probably be using proper spinlocks instead anyway).
> +
> +
> + * The "no votes yet" value used for the last_vote variable is 0 (not
> +   -1 as in the pseudocode).  This allows statically-allocated vlocks
> +   to be implicitly initialised to an unlocked state simply by putting
> +   them in .bss.

You could also put them in their own section and initialise them to -1
there.

> +
> +   An offset is added to each CPU's ID for the purpose of setting this
> +   variable, so that no CPU uses the value 0 for its ID.
> +
> +
> +Colophon
> +--------
> +
> +Originally created and documented by Dave Martin for Linaro Limited, for
> +use in ARM-based big.LITTLE platforms, with review and input gratefully
> +received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
> +grabbing most of this text out of the relevant mail thread and writing
> +up the pseudocode.
> +
> +Copyright (C) 2012  Linaro Limited
> +Distributed under the terms of Version 2 of the GNU General Public
> +License, as defined in linux/COPYING.
> +
> +
> +References
> +----------
> +
> +[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
> +    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
> +
> +    http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
> +
> +[2] linux/arch/arm/common/vlock.S, www.kernel.org.
> diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
> new file mode 100644
> index 0000000000..0a1ee3a7f5
> --- /dev/null
> +++ b/arch/arm/common/vlock.S
> @@ -0,0 +1,108 @@
> +/*
> + * vlock.S - simple voting lock implementation for ARM
> + *
> + * Created by: Dave Martin, 2012-08-16
> + * Copyright:  (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.

Your documentation is strictly GPLv2, so there's a strange discrepancy
here.

> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> + *
> + *
> + * This algorithm is described in more detail in
> + * Documentation/arm/big.LITTLE/vlocks.txt.
> + */
> +
> +#include <linux/linkage.h>
> +#include "vlock.h"
> +
> +#if VLOCK_VOTING_SIZE > 4

4? Maybe a CONFIG option or a #define in an arch vlock.h?

> +#define FEW(x...)
> +#define MANY(x...) x
> +#else
> +#define FEW(x...) x
> +#define MANY(x...)
> +#endif
> +
> +@ voting lock for first-man coordination
> +
> +.macro voting_begin rbase:req, rcpu:req, rscratch:req
> +       mov     \rscratch, #1
> +       strb    \rscratch, [\rbase, \rcpu]
> +       dsb
> +.endm
> +
> +.macro voting_end rbase:req, rcpu:req, rscratch:req
> +       mov     \rscratch, #0
> +       strb    \rscratch, [\rbase, \rcpu]
> +       dsb
> +       sev
> +.endm
> +
> +/*
> + * The vlock structure must reside in Strongly-Ordered or Device memory.
> + * This implementation deliberately eliminates most of the barriers which
> + * would be required for other memory types, and assumes that independent
> + * writes to neighbouring locations within a cacheline do not interfere
> + * with one another.
> + */
> +
> +@ r0: lock structure base
> +@ r1: CPU ID (0-based index within cluster)
> +ENTRY(vlock_trylock)
> +       add     r1, r1, #VLOCK_VOTING_OFFSET
> +
> +       voting_begin    r0, r1, r2
> +
> +       ldrb    r2, [r0, #VLOCK_OWNER_OFFSET]   @ check whether lock is held
> +       cmp     r2, #VLOCK_OWNER_NONE
> +       bne     trylock_fail                    @ fail if so
> +
> +       strb    r1, [r0, #VLOCK_OWNER_OFFSET]   @ submit my vote
> +
> +       voting_end      r0, r1, r2
> +
> +       @ Wait for the current round of voting to finish:
> +
> + MANY( mov     r3, #VLOCK_VOTING_OFFSET                        )
> +0:
> + MANY( ldr     r2, [r0, r3]                                    )
> + FEW(  ldr     r2, [r0, #VLOCK_VOTING_OFFSET]                  )
> +       cmp     r2, #0
> +       wfene

Is there a race here? I wonder if you can end up in a situation where
everybody enters wfe and then there is nobody left to signal an event
via voting_end (if, for example the last voter sent the sev when
everybody else was simultaneously doing the cmp before the wfe)...

... actually, that's ok as long as the load from VLOCK_VOTING_OFFSET isn't speculated,
which it shouldn't be from strongly-ordered memory. Fair enough!

> +       bne     0b
> + MANY( add     r3, r3, #4                                      )
> + MANY( cmp     r3, #VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE    )
> + MANY( bne     0b                                              )
> +
> +       @ Check who won:
> +
> +       ldrb    r2, [r0, #VLOCK_OWNER_OFFSET]
> +       eor     r0, r1, r2                      @ zero if I won, else nonzero
> +       bx      lr
> +
> +trylock_fail:
> +       voting_end      r0, r1, r2
> +       mov     r0, #1                          @ nonzero indicates that I lost
> +       bx      lr
> +ENDPROC(vlock_trylock)
> +
> +@ r0: lock structure base
> +ENTRY(vlock_unlock)
> +       mov     r1, #VLOCK_OWNER_NONE
> +       dsb
> +       strb    r1, [r0, #VLOCK_OWNER_OFFSET]
> +       dsb
> +       sev
> +       bx      lr
> +ENDPROC(vlock_unlock)
> diff --git a/arch/arm/common/vlock.h b/arch/arm/common/vlock.h
> new file mode 100644
> index 0000000000..94c29a6caf
> --- /dev/null
> +++ b/arch/arm/common/vlock.h
> @@ -0,0 +1,43 @@
> +/*
> + * vlock.h - simple voting lock implementation
> + *
> + * Created by: Dave Martin, 2012-08-16
> + * Copyright:  (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> + */
> +
> +#ifndef __VLOCK_H
> +#define __VLOCK_H
> +
> +#include <asm/bL_entry.h>
> +
> +#define VLOCK_OWNER_OFFSET     0
> +#define VLOCK_VOTING_OFFSET    4

asm-offsets again?

> +#define VLOCK_VOTING_SIZE      ((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)

Huh?

> +#define VLOCK_SIZE             (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
> +#define VLOCK_OWNER_NONE       0
> +
> +#ifndef __ASSEMBLY__
> +
> +struct vlock {
> +       char data[VLOCK_SIZE];
> +};

Does this mean the struct is only single byte aligned? You do word
accesses to it in your vlock code and rely on atomicity, so I'd feel
safer if it was aligned to 4 bytes, especially since this isn't being
accessed via a normal mapping.
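
i.e. something like this (untested):

	struct vlock {
		char data[VLOCK_SIZE];
	} __attribute__ ((__aligned__(4)));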

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10 23:05   ` Will Deacon
@ 2013-01-11  1:26     ` Nicolas Pitre
  2013-01-11 10:55       ` Will Deacon
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11  1:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:36AM +0000, Nicolas Pitre wrote:
> > CPUs in a big.LITTLE systems have special needs when entering the kernel
> > due to a hotplug event, or when resuming from a deep sleep mode.
> > 
> > This is vectorized so multiple CPUs can enter the kernel in parallel
> > without serialization.
> > 
> > Only the basic structure is introduced here.  This will be extended
> > later.
> > 
> > TODO: MPIDR based indexing should eventually be made runtime adjusted.
> 
> Agreed.
> 
> > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > new file mode 100644
> > index 0000000000..80fff49417
> > --- /dev/null
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -0,0 +1,30 @@
> > +/*
> > + * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
> > + *
> > + * Created by:  Nicolas Pitre, March 2012
> > + * Copyright:   (C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/init.h>
> > +
> > +#include <asm/bL_entry.h>
> > +#include <asm/barrier.h>
> > +#include <asm/proc-fns.h>
> > +#include <asm/cacheflush.h>
> > +
> > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> 
> Does this actually need to be volatile? I'd have thought a compiler
> barrier in place of the smp_wmb below would be enough (following on from
> Catalin's comments).

Actually, I did the reverse, i.e. I removed the smp_wmb() entirely.  A
compiler barrier forces the whole world out to memory, while here we only
want this particular assignment to be pushed out.

Furthermore, I like the volatile as it flags that this is a special 
variable which in this case is also accessed from CPUs with no cache.

> > +void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> > +{
> > +	unsigned long val = ptr ? virt_to_phys(ptr) : 0;
> > +	bL_entry_vectors[cluster][cpu] = val;
> > +	smp_wmb();
> > +	__cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
> > +	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> > +			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> > +}
> > diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> > new file mode 100644
> > index 0000000000..9d351f2b4c
> > --- /dev/null
> > +++ b/arch/arm/common/bL_head.S
> > @@ -0,0 +1,81 @@
> > +/*
> > + * arch/arm/common/bL_head.S -- big.LITTLE kernel re-entry point
> > + *
> > + * Created by:  Nicolas Pitre, March 2012
> > + * Copyright:   (C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/linkage.h>
> > +#include <asm/bL_entry.h>
> > +
> > +	.macro	pr_dbg	cpu, string
> > +#if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> > +	b	1901f
> > +1902:	.ascii	"CPU 0: \0CPU 1: \0CPU 2: \0CPU 3: \0"
> > +	.ascii	"CPU 4: \0CPU 5: \0CPU 6: \0CPU 7: \0"
> > +1903:	.asciz	"\string"
> > +	.align
> > +1901:	adr	r0, 1902b
> > +	add	r0, r0, \cpu, lsl #3
> > +	bl	printascii
> > +	adr	r0, 1903b
> > +	bl	printascii
> > +#endif
> > +	.endm
> > +
> > +	.arm
> > +	.align
> > +
> > +ENTRY(bL_entry_point)
> > +
> > + THUMB(	adr	r12, BSYM(1f)	)
> > + THUMB(	bx	r12		)
> > + THUMB(	.thumb			)
> > +1:
> > +	mrc	p15, 0, r0, c0, c0, 5
> > +	ubfx	r9, r0, #0, #4			@ r9 = cpu
> > +	ubfx	r10, r0, #8, #4			@ r10 = cluster
> > +	mov	r3, #BL_CPUS_PER_CLUSTER
> > +	mla	r4, r3, r10, r9			@ r4 = canonical CPU index
> > +	cmp	r4, #(BL_CPUS_PER_CLUSTER * BL_NR_CLUSTERS)
> > +	blo	2f
> > +
> > +	/* We didn't expect this CPU.  Try to make it quiet. */
> > +1:	wfi
> > +	wfe
> > +	b	1b
> 
> I realise this CPU is stuck at this point, but you should have a dsb
> before a wfi instruction. This could be problematic with the CCI this
> early, so maybe just a comment saying that it doesn't matter because we
> don't care about this core?

Why a dsb?  No data has even been touched at this point.  And since this is
meant to be a better "b ." kind of loop, I'd rather not try to make it 
more sophisticated than it already is.  And of course it is meant to 
never be executed in practice.

> > +
> > +2:	pr_dbg	r4, "kernel bL_entry_point\n"
> > +
> > +	/*
> > +	 * MMU is off so we need to get to bL_entry_vectors in a
> > +	 * position independent way.
> > +	 */
> > +	adr	r5, 3f
> > +	ldr	r6, [r5]
> > +	add	r6, r5, r6			@ r6 = bL_entry_vectors
> > +
> > +bL_entry_gated:
> > +	ldr	r5, [r6, r4, lsl #2]		@ r5 = CPU entry vector
> > +	cmp	r5, #0
> > +	wfeeq
> > +	beq	bL_entry_gated
> > +	pr_dbg	r4, "released\n"
> > +	bx	r5
> > +
> > +	.align	2
> > +
> > +3:	.word	bL_entry_vectors - .
> > +
> > +ENDPROC(bL_entry_point)
> > +
> > +	.bss
> > +	.align	5
> > +
> > +	.type	bL_entry_vectors, #object
> > +ENTRY(bL_entry_vectors)
> > +	.space	4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> 
> Is there a particular reason to put this in the bss?

Yes, to have it zero initialized without taking up binary space.

> > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > new file mode 100644
> > index 0000000000..ff623333a1
> > --- /dev/null
> > +++ b/arch/arm/include/asm/bL_entry.h
> > @@ -0,0 +1,35 @@
> > +/*
> > + * arch/arm/include/asm/bL_entry.h
> > + *
> > + * Created by:  Nicolas Pitre, April 2012
> > + * Copyright:   (C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#ifndef BL_ENTRY_H
> > +#define BL_ENTRY_H
> > +
> > +#define BL_CPUS_PER_CLUSTER	4
> > +#define BL_NR_CLUSTERS		2
> 
> Hmm, I see these have to be constant so you can allocate your space in
> the assembly file. In which case, I think it's worth changing their
> names to have MAX or LIMIT in them...

Yes, good point.  I'll change them.
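
Probably something along these lines (exact names still to be decided):

	#define BL_MAX_CPUS_PER_CLUSTER	4
	#define BL_MAX_NR_CLUSTERS	2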

>  maybe they could even be CONFIG options?

Nah.  I prefer not adding new config options unless this is really 
necessary or useful.  For the foreseeable future, we'll see systems with
at most 2 clusters and at most 4 CPUs per cluster.  That could easily be 
revisited later if that becomes unsuitable for some new systems.

Initially I wanted all those things to be runtime sized in relation with 
the TODO item in the commit log.  That too can come later.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 23:13   ` Will Deacon
@ 2013-01-11  1:50     ` Nicolas Pitre
  2013-01-11 11:09       ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11  1:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:38AM +0000, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> > 
> > This provides helper methods to coordinate between CPUs coming down
> > and CPUs going up, as well as documentation on the used algorithms,
> > so that cluster teardown and setup
> > operations are not done for a cluster simultaneously.
> 
> [...]
> 
> > +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> > +{
> > +       unsigned int i, j, mpidr, this_cluster;
> > +
> > +       BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
> > +       BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
> > +
> > +       /*
> > +        * Set initial CPU and cluster states.
> > +        * Only one cluster is assumed to be active at this point.
> > +        */
> > +       for (i = 0; i < BL_NR_CLUSTERS; i++) {
> > +               bL_sync.clusters[i].cluster = CLUSTER_DOWN;
> > +               bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
> > +               for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
> > +                       bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
> > +       }
> > +       asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> 
> We have a helper for this...
> 
> > +       this_cluster = (mpidr >> 8) & 0xf;
> 
> ... and also this, thanks to Lorenzo's recent patches.

Indeed, I'll have a closer look at them.

> > +       for_each_online_cpu(i)
> > +               bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
> > +       bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
> > +       sync_mem(&bL_sync);
> > +
> > +       if (power_up_setup) {
> > +               bL_power_up_setup_phys = virt_to_phys(power_up_setup);
> > +               sync_mem(&bL_power_up_setup_phys);
> > +       }
> > +
> > +       return 0;
> > +}
> > diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> > index 9d351f2b4c..f7a64ac127 100644
> > --- a/arch/arm/common/bL_head.S
> > +++ b/arch/arm/common/bL_head.S
> > @@ -7,11 +7,19 @@
> >   * This program is free software; you can redistribute it and/or modify
> >   * it under the terms of the GNU General Public License version 2 as
> >   * published by the Free Software Foundation.
> > + *
> > + *
> > + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > + * for details of the synchronisation algorithms used here.
> >   */
> > 
> >  #include <linux/linkage.h>
> >  #include <asm/bL_entry.h>
> > 
> > +.if BL_SYNC_CLUSTER_CPUS
> > +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> > +.endif
> > +
> >         .macro  pr_dbg  cpu, string
> >  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> >         b       1901f
> > @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
> >  2:     pr_dbg  r4, "kernel bL_entry_point\n"
> > 
> >         /*
> > -        * MMU is off so we need to get to bL_entry_vectors in a
> > +        * MMU is off so we need to get to various variables in a
> >          * position independent way.
> >          */
> >         adr     r5, 3f
> > -       ldr     r6, [r5]
> > +       ldmia   r5, {r6, r7, r8}
> >         add     r6, r5, r6                      @ r6 = bL_entry_vectors
> > +       ldr     r7, [r5, r7]                    @ r7 = bL_power_up_setup_phys
> > +       add     r8, r5, r8                      @ r8 = bL_sync
> > +
> > +       mov     r0, #BL_SYNC_CLUSTER_SIZE
> > +       mla     r8, r0, r10, r8                 @ r8 = bL_sync cluster base
> > +
> > +       @ Signal that this CPU is coming UP:
> > +       mov     r0, #CPU_COMING_UP
> > +       mov     r5, #BL_SYNC_CPU_SIZE
> > +       mla     r5, r9, r5, r8                  @ r5 = bL_sync cpu address
> > +       strb    r0, [r5]
> > +
> > +       dsb
> 
> Why is a dmb not enough here? In fact, the same goes for most of these
> other than the one preceeding the sev. Is there an interaction with the
> different mappings for the cluster data that I've missed?

Probably Dave could comment more on this as this is his code, or Achin 
who also reviewed it.  I don't know the level of discussion that 
happened inside ARM around those barriers.

When the TC2 firmware didn't properly handle the ACP snoops, the dsb's 
couldn't be used at this point.  The replacement for a dsb was a read 
back followed by a dmb in that case, and then the general sentiment was 
that this was an A15 specific workaround which wasn't architecturally 
guaranteed on all ARMv7 compliant implementations, or something along 
those lines.

Given that the TC2 firmware properly handles the snoops now, and that 
the dsb apparently doesn't require a readback, we just decided to revert 
to having simple dsb's.

> > +
> > +       @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> > +       @ state, because there is at least one active CPU (this CPU).
> > +
> > +       @ Check if the cluster has been set up yet:
> > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > +       cmp     r0, #CLUSTER_UP
> > +       beq     cluster_already_up
> > +
> > +       @ Signal that the cluster is being brought up:
> > +       mov     r0, #INBOUND_COMING_UP
> > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > +
> > +       dsb
> > +
> > +       @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
> > +       @ point onwards will observe INBOUND_COMING_UP and abort.
> > +
> > +       @ Wait for any previously-pending cluster teardown operations to abort
> > +       @ or complete:
> > +cluster_teardown_wait:
> > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > +       cmp     r0, #CLUSTER_GOING_DOWN
> > +       wfeeq
> > +       beq     cluster_teardown_wait
> > +
> > +       @ power_up_setup is responsible for setting up the cluster:
> > +
> > +       cmp     r7, #0
> > +       mov     r0, #1          @ second (cluster) affinity level
> > +       blxne   r7              @ Call power_up_setup if defined
> > +
> > +       @ Leave the cluster setup critical section:
> > +
> > +       dsb
> > +       mov     r0, #INBOUND_NOT_COMING_UP
> > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > +       mov     r0, #CLUSTER_UP
> > +       strb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > +       dsb
> > +       sev
> > +
> > +cluster_already_up:
> > +       @ If a platform-specific CPU setup hook is needed, it is
> > +       @ called from here.
> > +
> > +       cmp     r7, #0
> > +       mov     r0, #0          @ first (CPU) affinity level
> > +       blxne   r7              @ Call power_up_setup if defined
> > +
> > +       @ Mark the CPU as up:
> > +
> > +       dsb
> > +       mov     r0, #CPU_UP
> > +       strb    r0, [r5]
> > +       dsb
> > +       sev
> > 
> >  bL_entry_gated:
> >         ldr     r5, [r6, r4, lsl #2]            @ r5 = CPU entry vector
> > @@ -70,6 +148,8 @@ bL_entry_gated:
> >         .align  2
> > 
> >  3:     .word   bL_entry_vectors - .
> > +       .word   bL_power_up_setup_phys - 3b
> > +       .word   bL_sync - 3b
> > 
> >  ENDPROC(bL_entry_point)
> > 
> > @@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
> >         .type   bL_entry_vectors, #object
> >  ENTRY(bL_entry_vectors)
> >         .space  4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> > +
> > +       .type   bL_power_up_setup_phys, #object
> > +ENTRY(bL_power_up_setup_phys)
> > +       .space  4               @ set by bL_cluster_sync_init()
> > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > index 942d7f9f19..167394d9a0 100644
> > --- a/arch/arm/include/asm/bL_entry.h
> > +++ b/arch/arm/include/asm/bL_entry.h
> > @@ -15,8 +15,37 @@
> >  #define BL_CPUS_PER_CLUSTER    4
> >  #define BL_NR_CLUSTERS         2
> > 
> > +/* Definitions for bL_cluster_sync_struct */
> > +#define CPU_DOWN               0x11
> > +#define CPU_COMING_UP          0x12
> > +#define CPU_UP                 0x13
> > +#define CPU_GOING_DOWN         0x14
> > +
> > +#define CLUSTER_DOWN           0x21
> > +#define CLUSTER_UP             0x22
> > +#define CLUSTER_GOING_DOWN     0x23
> > +
> > +#define INBOUND_NOT_COMING_UP  0x31
> > +#define INBOUND_COMING_UP      0x32
> 
> Do these numbers signify anything? Why not 0, 1, 2 etc?

Initially that's what they were.  But during debugging (as we faced a
few cache coherency issues here) it was more useful to use numbers with 
an easily distinguishable signature.  For example, a 0 may come from 
anywhere and could mean anything so that is about the worst choice.
Other than that, those numbers have no particular significance.

> > +
> > +/* This is a complete guess. */
> > +#define __CACHE_WRITEBACK_ORDER        6
> 
> Is this CONFIG_ARM_L1_CACHE_SHIFT?

No.  That has to cover L2 as well.

> > +#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
> > +
> > +/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
> > +#define BL_SYNC_CLUSTER_CPUS   0
> 
> Why not use asm-offsets.h for this?

That's how that was done initially. But that ended up cluttering 
asm-offsets.h with stuff that really is a local implementation
detail which doesn't need kernel-wide scope.  In other words, the end
result looked worse.

One could argue that they are still exposed too much as the only files 
that need to know about those defines are bL_head.S and bL_entry.c.

> > +#define BL_SYNC_CPU_SIZE       __CACHE_WRITEBACK_GRANULE
> > +#define BL_SYNC_CLUSTER_CLUSTER \
> > +       (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
> > +#define BL_SYNC_CLUSTER_INBOUND \
> > +       (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
> > +#define BL_SYNC_CLUSTER_SIZE \
> > +       (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
> > +
> 
> Hmm, this looks pretty fragile to me but again, you need this stuff at
> compile time.

There are compile time and run time assertions in bL_entry.c to ensure 
those offsets and the corresponding C structure don't get out of sync.
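
The kind of thing I mean (the exact assertions in bL_entry.c may differ
slightly from this sketch):

	BUILD_BUG_ON(offsetof(struct bL_cluster_sync_struct, cpus) !=
		     BL_SYNC_CLUSTER_CPUS);
	BUILD_BUG_ON(offsetof(struct bL_cluster_sync_struct, cluster) !=
		     BL_SYNC_CLUSTER_CLUSTER);
	BUILD_BUG_ON(offsetof(struct bL_cluster_sync_struct, inbound) !=
		     BL_SYNC_CLUSTER_INBOUND);
	BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);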

> Is there an architected maximum value for the writeback
> granule? Failing that, we may as well just use things like
> __cacheline_aligned if we're only using the L1 alignment anyway.

See above -- we need L2 alignment.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-10 23:08   ` Will Deacon
@ 2013-01-11  2:30     ` Nicolas Pitre
  2013-01-11 10:58       ` Will Deacon
  2013-01-11 11:29       ` Dave Martin
  0 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11  2:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:37AM +0000, Nicolas Pitre wrote:
> > This is the basic API used to handle the powering up/down of individual
> > CPUs in a big.LITTLE system.  The platform specific backend implementation
> > has the responsibility to also handle the cluster level power as well when
> > the first/last CPU in a cluster is brought up/down.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >  arch/arm/common/bL_entry.c      | 88 +++++++++++++++++++++++++++++++++++++++
> >  arch/arm/include/asm/bL_entry.h | 92 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 180 insertions(+)
> > 
> > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > index 80fff49417..41de0622de 100644
> > --- a/arch/arm/common/bL_entry.c
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -11,11 +11,13 @@
> >  
> >  #include <linux/kernel.h>
> >  #include <linux/init.h>
> > +#include <linux/irqflags.h>
> >  
> >  #include <asm/bL_entry.h>
> >  #include <asm/barrier.h>
> >  #include <asm/proc-fns.h>
> >  #include <asm/cacheflush.h>
> > +#include <asm/idmap.h>
> >  
> >  extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> >  
> > @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> >  	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> >  			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> >  }
> > +
> > +static const struct bL_platform_power_ops *platform_ops;
> > +
> > +int __init bL_platform_power_register(const struct bL_platform_power_ops *ops)
> > +{
> > +	if (platform_ops)
> > +		return -EBUSY;
> > +	platform_ops = ops;
> > +	return 0;
> > +}
> > +
> > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> > +{
> > +	if (!platform_ops)
> > +		return -EUNATCH;
> 
> Is this the right error code?

It is as good as any other, with some meaning to be distinguished from 
the traditional ones like -ENOMEM or -EINVAL that the platform backends 
could return.

Would you prefer another one?

> > +	might_sleep();
> > +	return platform_ops->power_up(cpu, cluster);
> > +}
> > +
> > +typedef void (*phys_reset_t)(unsigned long);
> 
> Maybe it's worth putting this typedef in a header file somewhere. It's
> also used by the soft reboot code.

Agreed.  Maybe separately from this series though.

> > +
> > +void bL_cpu_power_down(void)
> > +{
> > +	phys_reset_t phys_reset;
> > +
> > +	BUG_ON(!platform_ops);
> 
> Seems a bit overkill, or are we unrecoverable by this point?

We are.  The upper layer expects this CPU to be dead and there is no 
easy recovery possible.  This is a "should never happen" condition, and 
the kernel is badly configured otherwise.

> 
> > +	BUG_ON(!irqs_disabled());
> > +
> > +	/*
> > +	 * Do this before calling into the power_down method,
> > +	 * as it might not always be safe to do afterwards.
> > +	 */
> > +	setup_mm_for_reboot();
> > +
> > +	platform_ops->power_down();
> > +
> > +	/*
> > +	 * It is possible for a power_up request to happen concurrently
> > +	 * with a power_down request for the same CPU. In this case the
> > +	 * power_down method might not be able to actually enter a
> > +	 * powered down state with the WFI instruction if the power_up
> > +	 * method has removed the required reset condition.  The
> > +	 * power_down method is then allowed to return. We must perform
> > +	 * a re-entry in the kernel as if the power_up method just had
> > +	 * deasserted reset on the CPU.
> > +	 *
> > +	 * To simplify race issues, the platform specific implementation
> > +	 * must accommodate for the possibility of unordered calls to
> > +	 * power_down and power_up with a usage count. Therefore, if a
> > +	 * call to power_up is issued for a CPU that is not down, then
> > +	 * the next call to power_down must not attempt a full shutdown
> > +	 * but only do the minimum (normally disabling L1 cache and CPU
> > +	 * coherency) and return just as if a concurrent power_up request
> > +	 * had happened as described above.
> > +	 */
> > +
> > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > +	phys_reset(virt_to_phys(bL_entry_point));
> > +
> > +	/* should never get here */
> > +	BUG();
> > +}
> > +
> > +void bL_cpu_suspend(u64 expected_residency)
> > +{
> > +	phys_reset_t phys_reset;
> > +
> > +	BUG_ON(!platform_ops);
> > +	BUG_ON(!irqs_disabled());
> > +
> > +	/* Very similar to bL_cpu_power_down() */
> > +	setup_mm_for_reboot();
> > +	platform_ops->suspend(expected_residency);
> > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > +	phys_reset(virt_to_phys(bL_entry_point));
> > +	BUG();
> > +}
> > +
> > +int bL_cpu_powered_up(void)
> > +{
> > +	if (!platform_ops)
> > +		return -EUNATCH;
> > +	if (platform_ops->powered_up)
> > +		platform_ops->powered_up();
> > +	return 0;
> > +}
> > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > index ff623333a1..942d7f9f19 100644
> > --- a/arch/arm/include/asm/bL_entry.h
> > +++ b/arch/arm/include/asm/bL_entry.h
> > @@ -31,5 +31,97 @@ extern void bL_entry_point(void);
> >   */
> >  void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
> >  
> > +/*
> > + * CPU/cluster power operations API for higher subsystems to use.
> > + */
> > +
> > +/**
> > + * bL_cpu_power_up - make given CPU in given cluster runnable
> > + *
> > + * @cpu: CPU number within given cluster
> > + * @cluster: cluster number for the CPU
> > + *
> > + * The identified CPU is brought out of reset.  If the cluster was powered
> > + * down then it is brought up as well, taking care not to let the other CPUs
> > + * in the cluster run, and ensuring appropriate cluster setup.
> > + *
> > + * Caller must ensure the appropriate entry vector is initialized with
> > + * bL_set_entry_vector() prior to calling this.
> > + *
> > + * This must be called in a sleepable context.  However, the implementation
> > + * is strongly encouraged to return early and let the operation happen
> > + * asynchronously, especially when significant delays are expected.
> > + *
> > + * If the operation cannot be performed then an error code is returned.
> > + */
> > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster);
> > +
> > +/**
> > + * bL_cpu_power_down - power the calling CPU down
> > + *
> > + * The calling CPU is powered down.
> > + *
> > + * If this CPU is found to be the "last man standing" in the cluster
> > + * then the cluster is prepared for power-down too.
> > + *
> > + * This must be called with interrupts disabled.
> > + *
> > + * This does not return.  Re-entry in the kernel is expected via
> > + * bL_entry_point.
> > + */
> > +void bL_cpu_power_down(void);
> > +
> > +/**
> > + * bL_cpu_suspend - bring the calling CPU in a suspended state
> > + *
> > + * @expected_residency: duration in microseconds the CPU is expected
> > + *			to remain suspended, or 0 if unknown/infinity.
> > + *
> > + * The calling CPU is suspended.  The expected residency argument is used
> > + * as a hint by the platform specific backend to implement the appropriate
> > + * sleep state level according to the knowledge it has on wake-up latency
> > + * for the given hardware.
> > + *
> > + * If this CPU is found to be the "last man standing" in the cluster
> > + * then the cluster may be prepared for power-down too, if the expected
> > + * residency makes it worthwhile.
> > + *
> > + * This must be called with interrupts disabled.
> > + *
> > + * This does not return.  Re-entry in the kernel is expected via
> > + * bL_entry_point.
> > + */
> > +void bL_cpu_suspend(u64 expected_residency);
> > +
> > +/**
> > + * bL_cpu_powered_up - housekeeping work after a CPU has been powered up
> > + *
> > + * This lets the platform specific backend code perform needed housekeeping
> > + * work.  This must be called by the newly activated CPU as soon as it is
> > + * fully operational in kernel space, before it enables interrupts.
> > + *
> > + * If the operation cannot be performed then an error code is returned.
> > + */
> > +int bL_cpu_powered_up(void);
> > +
> > +/*
> > + * Platform specific methods used in the implementation of the above API.
> > + */
> > +struct bL_platform_power_ops {
> > +	int (*power_up)(unsigned int cpu, unsigned int cluster);
> > +	void (*power_down)(void);
> > +	void (*suspend)(u64);
> > +	void (*powered_up)(void);
> > +};
> 
> It would be good if these prototypes matched the PSCI code, then platforms
> could just glue them together directly.

No.

I discussed this at length with Charles (the PSCI spec author) already. 
Even in the PSCI case, a minimum PSCI backend is necessary to do some 
impedance matching between what the PSCI calls expect as arguments and 
what this kernel specific API needs to express.  For example, the UP 
method needs to always be provided with the address for bL_entry, 
irrespective of where the user of this kernel API wants execution to be 
resumed.  There might be some cases where the backend decides to
override the desired power saving state because of other kernel-induced
constraints (ongoing DMA operation for example) that PSCI doesn't (and 
should not) know about.  And the best place to arbitrate between those 
platform specific constraints is in this platform specific shim or 
backend.

Because of that, and because one feature of Linux is to not have stable 
APIs in the kernel so as to be free to adapt them to future needs, I think
it is best not to even try matching the PSCI interface here.
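
To illustrate what I mean by a shim, a PSCI based backend would look
roughly like this (the psci_cpu_on/psci_cpu_off calls below are
placeholders for whatever the PSCI client code ends up providing, and the
details are obviously not final):

	/* hypothetical PSCI client interface */
	extern int psci_cpu_on(unsigned int cpu, unsigned int cluster,
			       unsigned long entry_point);
	extern void psci_cpu_off(unsigned int power_state);

	static int psci_bL_power_up(unsigned int cpu, unsigned int cluster)
	{
		/* execution always resumes via bL_entry_point,
		 * whatever the caller asked for */
		return psci_cpu_on(cpu, cluster, virt_to_phys(bL_entry_point));
	}

	static void psci_bL_power_down(void)
	{
		/* this is where kernel-side constraints (ongoing DMA, etc.)
		 * would be arbitrated before committing to a power state */
		psci_cpu_off(0);
		/* may return if a concurrent power_up won the race */
	}

	static const struct bL_platform_power_ops psci_bL_power_ops = {
		.power_up	= psci_bL_power_up,
		.power_down	= psci_bL_power_down,
	};

... which would then simply be registered with
bL_platform_power_register(&psci_bL_power_ops).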


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
  2013-01-10 23:18   ` Will Deacon
@ 2013-01-11  3:15     ` Nicolas Pitre
  2013-01-11 11:03       ` Will Deacon
  2013-01-11 16:57       ` Dave Martin
  0 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11  3:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 10 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:39AM +0000, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> > 
> > This patch adds a simple low-level voting mutex implementation
> > to be used to arbitrate during first man selection when no load/store
> > exclusive instructions are usable.
> > 
> > For want of a better name, these are called "vlocks".  (I was
> > tempted to call them ballot locks, but "block" is way too confusing
> > an abbreviation...)
> > 
> > There is no function to wait for the lock to be released, and no
> > vlock_lock() function since we don't need these at the moment.
> > These could straightforwardly be added if vlocks get used for other
> > purposes.
> 
> [...]
> 
> > diff --git a/Documentation/arm/big.LITTLE/vlocks.txt b/Documentation/arm/big.LITTLE/vlocks.txt
> > new file mode 100644
> > index 0000000000..90672ddc6a
> > --- /dev/null
> > +++ b/Documentation/arm/big.LITTLE/vlocks.txt
> > @@ -0,0 +1,211 @@
> > +vlocks for Bare-Metal Mutual Exclusion
> > +======================================
> 
> [...]
> 
> > +ARM implementation
> > +------------------
> > +
> > +The current ARM implementation [2] contains a some optimisations beyond
> 
> -a

Fixed.

> 
> > +the basic algorithm:
> > +
> > + * By packing the members of the currently_voting array close together,
> > +   we can read the whole array in one transaction (providing the number
> > +   of CPUs potentially contending the lock is small enough).  This
> > +   reduces the number of round-trips required to external memory.
> > +
> > +   In the ARM implementation, this means that we can use a single load
> > +   and comparison:
> > +
> > +       LDR     Rt, [Rn]
> > +       CMP     Rt, #0
> > +
> > +   ...in place of code equivalent to:
> > +
> > +       LDRB    Rt, [Rn]
> > +       CMP     Rt, #0
> > +       LDRBEQ  Rt, [Rn, #1]
> > +       CMPEQ   Rt, #0
> > +       LDRBEQ  Rt, [Rn, #2]
> > +       CMPEQ   Rt, #0
> > +       LDRBEQ  Rt, [Rn, #3]
> > +       CMPEQ   Rt, #0
> > +
> > +   This cuts down on the fast-path latency, as well as potentially
> > +   reducing bus contention in contended cases.
> > +
> > +   The optimisation relies on the fact that the ARM memory system
> > +   guarantees coherency between overlapping memory accesses of
> > +   different sizes, similarly to many other architectures.  Note that
> > +   we do not care which element of currently_voting appears in which
> > +   bits of Rt, so there is no need to worry about endianness in this
> > +   optimisation.
> > +
> > +   If there are too many CPUs to read the currently_voting array in
> > +   one transaction then multiple transactions are still required.  The
> > +   implementation uses a simple loop of word-sized loads for this
> > +   case.  The number of transactions is still fewer than would be
> > +   required if bytes were loaded individually.
> > +
> > +
> > +   In principle, we could aggregate further by using LDRD or LDM, but
> > +   to keep the code simple this was not attempted in the initial
> > +   implementation.
> > +
> > +
> > + * vlocks are currently only used to coordinate between CPUs which are
> > +   unable to enable their caches yet.  This means that the
> > +   implementation removes many of the barriers which would be required
> > +   when executing the algorithm in cached memory.
> 
> I think you need to elaborate on this and clearly identify the
> requirements of the memory behaviour. In reality, these locks are hardly
> ever usable so we don't want them cropping up in driver code and the
> like!

Doesn't the following paragraph make that clear enough?

Maybe we should rip out the C interface to avoid such abuses.  I think 
that was initially added when we weren't sure if the C code had to be 
involved.

> > +   packing of the currently_voting array does not work with cached
> > +   memory unless all CPUs contending the lock are cache-coherent, due
> > +   to cache writebacks from one CPU clobbering values written by other
> > +   CPUs.  (Though if all the CPUs are cache-coherent, you should
> > +   probably be using proper spinlocks instead anyway).
> > +
> > +
> > + * The "no votes yet" value used for the last_vote variable is 0 (not
> > +   -1 as in the pseudocode).  This allows statically-allocated vlocks
> > +   to be implicitly initialised to an unlocked state simply by putting
> > +   them in .bss.
> 
> You could also put them in their own section and initialise them to -1
> there.

Same argument as for bL_vectors: That is less efficient than using .bss 
which takes no image space.  Plus the transformation for CPU 0 to work 
with this is basically free. 

> > +   An offset is added to each CPU's ID for the purpose of setting this
> > +   variable, so that no CPU uses the value 0 for its ID.
> > +
> > +
> > +Colophon
> > +--------
> > +
> > +Originally created and documented by Dave Martin for Linaro Limited, for
> > +use in ARM-based big.LITTLE platforms, with review and input gratefully
> > +received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
> > +grabbing most of this text out of the relevant mail thread and writing
> > +up the pseudocode.
> > +
> > +Copyright (C) 2012  Linaro Limited
> > +Distributed under the terms of Version 2 of the GNU General Public
> > +License, as defined in linux/COPYING.
> > +
> > +
> > +References
> > +----------
> > +
> > +[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
> > +    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
> > +
> > +    http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
> > +
> > +[2] linux/arch/arm/common/vlock.S, www.kernel.org.
> > diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
> > new file mode 100644
> > index 0000000000..0a1ee3a7f5
> > --- /dev/null
> > +++ b/arch/arm/common/vlock.S
> > @@ -0,0 +1,108 @@
> > +/*
> > + * vlock.S - simple voting lock implementation for ARM
> > + *
> > + * Created by: Dave Martin, 2012-08-16
> > + * Copyright:  (C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> 
> Your documentation is strictly GPLv2, so there's a strange discrepancy
> here.

Indeed.

@Dave: your call.

> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, write to the Free Software Foundation, Inc.,
> > + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> > + *
> > + *
> > + * This algorithm is described in more detail in
> > + * Documentation/arm/big.LITTLE/vlocks.txt.
> > + */
> > +
> > +#include <linux/linkage.h>
> > +#include "vlock.h"
> > +
> > +#if VLOCK_VOTING_SIZE > 4
> 
> 4? Maybe a CONFIG option or a #define in an arch vlock.h?

The 4 here is simply the number of bytes in a word: it determines 
whether or not a loop is needed to enumerate the voters.  That is not 
configurable.
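
With the current BL_CPUS_PER_CLUSTER of 4, VLOCK_VOTING_SIZE is exactly 
one word, so the single-load FEW() path is assembled; only a cluster 
with more than 4 CPUs would bring in the MANY() loop.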

> > +#define FEW(x...)
> > +#define MANY(x...) x
> > +#else
> > +#define FEW(x...) x
> > +#define MANY(x...)
> > +#endif
> > +
> > +@ voting lock for first-man coordination
> > +
> > +.macro voting_begin rbase:req, rcpu:req, rscratch:req
> > +       mov     \rscratch, #1
> > +       strb    \rscratch, [\rbase, \rcpu]
> > +       dsb
> > +.endm
> > +
> > +.macro voting_end rbase:req, rcpu:req, rscratch:req
> > +       mov     \rscratch, #0
> > +       strb    \rscratch, [\rbase, \rcpu]
> > +       dsb
> > +       sev
> > +.endm
> > +
> > +/*
> > + * The vlock structure must reside in Strongly-Ordered or Device memory.
> > + * This implementation deliberately eliminates most of the barriers which
> > + * would be required for other memory types, and assumes that independent
> > + * writes to neighbouring locations within a cacheline do not interfere
> > + * with one another.
> > + */
> > +
> > +@ r0: lock structure base
> > +@ r1: CPU ID (0-based index within cluster)
> > +ENTRY(vlock_trylock)
> > +       add     r1, r1, #VLOCK_VOTING_OFFSET
> > +
> > +       voting_begin    r0, r1, r2
> > +
> > +       ldrb    r2, [r0, #VLOCK_OWNER_OFFSET]   @ check whether lock is held
> > +       cmp     r2, #VLOCK_OWNER_NONE
> > +       bne     trylock_fail                    @ fail if so
> > +
> > +       strb    r1, [r0, #VLOCK_OWNER_OFFSET]   @ submit my vote
> > +
> > +       voting_end      r0, r1, r2
> > +
> > +       @ Wait for the current round of voting to finish:
> > +
> > + MANY( mov     r3, #VLOCK_VOTING_OFFSET                        )
> > +0:
> > + MANY( ldr     r2, [r0, r3]                                    )
> > + FEW(  ldr     r2, [r0, #VLOCK_VOTING_OFFSET]                  )
> > +       cmp     r2, #0
> > +       wfene
> 
> Is there a race here? I wonder if you can end up in a situation where
> everybody enters wfe and then there is nobody left to signal an event
> via voting_end (if, for example the last voter sent the sev when
> everybody else was simultaneously doing the cmp before the wfe)...
> 
> ... actually, that's ok as long as VLOCK_VOTING_OFFSET isn't speculated,
> which it shouldn't be from strongly-ordered memory. Fair enough!
> 
> > +       bne     0b
> > + MANY( add     r3, r3, #4                                      )
> > + MANY( cmp     r3, #VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE    )
> > + MANY( bne     0b                                              )
> > +
> > +       @ Check who won:
> > +
> > +       ldrb    r2, [r0, #VLOCK_OWNER_OFFSET]
> > +       eor     r0, r1, r2                      @ zero if I won, else nonzero
> > +       bx      lr
> > +
> > +trylock_fail:
> > +       voting_end      r0, r1, r2
> > +       mov     r0, #1                          @ nonzero indicates that I lost
> > +       bx      lr
> > +ENDPROC(vlock_trylock)
> > +
> > +@ r0: lock structure base
> > +ENTRY(vlock_unlock)
> > +       mov     r1, #VLOCK_OWNER_NONE
> > +       dsb
> > +       strb    r1, [r0, #VLOCK_OWNER_OFFSET]
> > +       dsb
> > +       sev
> > +       bx      lr
> > +ENDPROC(vlock_unlock)
> > diff --git a/arch/arm/common/vlock.h b/arch/arm/common/vlock.h
> > new file mode 100644
> > index 0000000000..94c29a6caf
> > --- /dev/null
> > +++ b/arch/arm/common/vlock.h
> > @@ -0,0 +1,43 @@
> > +/*
> > + * vlock.h - simple voting lock implementation
> > + *
> > + * Created by: Dave Martin, 2012-08-16
> > + * Copyright:  (C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, write to the Free Software Foundation, Inc.,
> > + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> > + */
> > +
> > +#ifndef __VLOCK_H
> > +#define __VLOCK_H
> > +
> > +#include <asm/bL_entry.h>
> > +
> > +#define VLOCK_OWNER_OFFSET     0
> > +#define VLOCK_VOTING_OFFSET    4
> 
> asm-offsets again?

Same answer.

> > +#define VLOCK_VOTING_SIZE      ((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)
> 
> Huh?

Each ballot is one byte, and we pack them into words.  So this is the 
size, in bytes, of the whole words required to hold all the ballots.
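
For instance, with BL_CPUS_PER_CLUSTER = 4 this evaluates to 
(4 + 3) / 4 * 4 = 4 bytes, i.e. a single word; a hypothetical 5-CPU 
cluster would round up to 8 bytes, i.e. two words.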

> > +#define VLOCK_SIZE             (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
> > +#define VLOCK_OWNER_NONE       0
> > +
> > +#ifndef __ASSEMBLY__
> > +
> > +struct vlock {
> > +       char data[VLOCK_SIZE];
> > +};
> 
> Does this mean the struct is only single byte aligned? You do word
> accesses to it in your vlock code and rely on atomicity, so I'd feel
> safer if it was aligned to 4 bytes, especially since this isn't being
> accessed via a normal mapping.

The structure size is always a multiple of 4 bytes.  Its alignment is 
actually much larger than 4, as it needs to span a whole cache line so 
as not to be overwritten by dirty line writebacks.

As I mentioned before, given that this structure is allocated and 
accessed only by assembly code, we could simply remove all those unused 
C definitions to avoid potential confusion and misuse.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 22:31         ` Nicolas Pitre
@ 2013-01-11 10:36           ` Dave Martin
  0 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 10:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 05:31:08PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Catalin Marinas wrote:
> 
> > On 10 January 2013 17:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > On Thu, 10 Jan 2013, Catalin Marinas wrote:
> > >
> > >> On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > >> > --- a/arch/arm/common/bL_entry.c
> > >> > +++ b/arch/arm/common/bL_entry.c
> > >> > @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> > >> >                 platform_ops->powered_up();
> > >> >         return 0;
> > >> >  }
> > >> > +
> > >> > +struct bL_sync_struct bL_sync;
> > >> > +
> > >> > +static void __sync_range(volatile void *p, size_t size)
> > >> > +{
> > >> > +       char *_p = (char *)p;
> > >> > +
> > >> > +       __cpuc_flush_dcache_area(_p, size);
> > >> > +       outer_flush_range(__pa(_p), __pa(_p + size));
> > >> > +       outer_sync();
> > ...
> > >> However, on the same CPU you can get a speculative load into L1 after
> > >> the L1 flush but before the L2 flush, so the reader case can fail.
> > >>
> > >> The sequence for readers is (note *L2* inval first):
> > >>
> > >> L2 inval
> > >> L1 inval
> > >
> > > As you noted below and as I explained above, this can't be an inval
> > > operation as that could discard a concurrent writer's update.
> > >
> > >> The sequence for writers is:
> > >>
> > >> L1 clean
> > >> L2 clean
> > >>
> > >> The bi-directional sequence (that's what you need) is:
> > >>
> > >> L1 clean
> > >> L2 clean+inval
> > >> L1 clean+inval

Agreed.  My bad, sorry... I was focusing on other aspects; plus we have
no actual outer cache, so the mis-ordering is hidden in our testing.

This code has been through a few iterations, some of which had separate
sequences for reads and writes, though possibly the ordering is still
wrong.

If our cache is enabled, we may end up with the responsibility of writing
out another CPU's dirty lines due to speculative migration, so for most
or all of the flushes here, we do need the third sequence (inner clean;
outer clean+invalidate; inner clean+invalidate).
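
Concretely, a fixed __sync_range() along those lines might look like
this (sketch only; I use clean+invalidate for both inner steps since we
have no clean-only area helper handy, which is safe if slightly stronger
than strictly required):

static void __sync_range(volatile void *p, size_t size)
{
	char *_p = (char *)p;

	/* inner clean(+invalidate): push our dirty data towards L2/RAM */
	__cpuc_flush_dcache_area(_p, size);
	/* outer clean+invalidate */
	outer_flush_range(__pa(_p), __pa(_p + size));
	outer_sync();
	/*
	 * inner clean+invalidate again, to discard anything L1 may have
	 * acquired (speculation or another CPU's writeback) between the
	 * two outer operations.
	 */
	__cpuc_flush_dcache_area(_p, size);
}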

> > >>
> > >> The last L1 op must be clean+inval in case another CPU writes to this
> > >> location to avoid discarding the write.
> > >>
> > >> If you don't have an L2, you just end up with two L1 clean ops, so you
> > >> can probably put some checks.
> > >
> > > In fact, since this is only used on A7/A15 right now, there is no outer
> > > cache and the outer calls are effectively no-ops.  I'm wondering if
> > > those should simply be removed until/unless there is some system showing
> > > up with a need for them.
> > 
> > You could. I expect multi-cluster systems to have integrated L2 cache
> > and avoid explicit outer cache maintenance. But is there a chance that
> > your patches could be generalised to existing systems with A9 (not b.L
> > configuration but just hotplug or cpuidle support)? I haven't finished
> > reading all the patches, so maybe that's not the case at all.
> 
> I suppose it could, although the special requirements put on the first 
> man / last man exist only for multi-cluster systems.  OTOH, existing A9 
> systems are already served by far less complex code, so it is really a 
> matter of figuring out whether the backend this cluster code would need 
> on those A9 systems would be simpler than the existing code, in which 
> case that would certainly be beneficial.

The outer operations just expand to nothing if there is no outer cache; the
optimisation would be that instead of L1 clean; L1 clean+inval, we just
need L1 clean+inval.

> > Anyway, my point is that if L1 is inner and L2 outer, the correct
> > bi-derectional flushing sequence is slightly different.
> 
> Agreed, I'll make sure to capture that in the code somehow.

I'll have a go at this today (but I won't over-elaborate in case you've
already done it...)

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-11  1:26     ` Nicolas Pitre
@ 2013-01-11 10:55       ` Will Deacon
  2013-01-11 11:35         ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-11 10:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 01:26:21AM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> > On Thu, Jan 10, 2013 at 12:20:36AM +0000, Nicolas Pitre wrote:
> > > +
> > > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > 
> > Does this actually need to be volatile? I'd have thought a compiler
> > barrier in place of the smp_wmb below would be enough (following on from
> > Catalin's comments).
> 
> Actually, I did the reverse i.e. I removed the smp_wmb() entirely. A 
> compiler barrier forces the whole world to memory while here we only 
> want this particular assignment to be pushed out.
> 
> Furthermore, I like the volatile as it flags that this is a special 
> variable which in this case is also accessed from CPUs with no cache.

Ok, fair enough. Given that the smp_wmb isn't needed that sounds better.

> > > +	/* We didn't expect this CPU.  Try to make it quiet. */
> > > +1:	wfi
> > > +	wfe
> > > +	b	1b
> > 
> > I realise this CPU is stuck at this point, but you should have a dsb
> > before a wfi instruction. This could be problematic with the CCI this
> > early, so maybe just a comment saying that it doesn't matter because we
> > don't care about this core?
> 
> Why a dsb?  No data was even touched at this point.  And since this is 
> meant to be a better "b ." kind of loop, I'd rather not try to make it 
> more sophisticated than it already is.  And of course it is meant to 
> never be executed in practice.

Sure, that's why I think just mentioning that we don't ever plan to boot
this CPU is a good idea (so people don't add code here later on).

> > > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > > new file mode 100644
> > > index 0000000000..ff623333a1
> > > --- /dev/null
> > > +++ b/arch/arm/include/asm/bL_entry.h
> > > @@ -0,0 +1,35 @@
> > > +/*
> > > + * arch/arm/include/asm/bL_entry.h
> > > + *
> > > + * Created by:  Nicolas Pitre, April 2012
> > > + * Copyright:   (C) 2012  Linaro Limited
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + */
> > > +
> > > +#ifndef BL_ENTRY_H
> > > +#define BL_ENTRY_H
> > > +
> > > +#define BL_CPUS_PER_CLUSTER	4
> > > +#define BL_NR_CLUSTERS		2
> > 
> > Hmm, I see these have to be constant so you can allocate your space in
> > the assembly file. In which case, I think it's worth changing their
> > names to have MAX or LIMIT in them...
> 
> Yes, good point.  I'll change them.

Thanks.

> >  maybe they could even be CONFIG options?
> 
> Nah.  I prefer not adding new config options unless this is really 
> necessary or useful.  For the foreseeable future, we'll see systems with 
> at most 2 clusters and at most 4 CPUs per cluster.  That could easily be 
> revisited later if that becomes unsuitable for some new systems.

The current GIC is limited to 8 CPUs, so 4x2 is also a realistic possibility.

> Initially I wanted all those things to be runtime-sized, in relation to 
> the TODO item in the commit log.  That too can come later.

Out of interest: how would you achieve that? I also thought about getting
this information from the device tree, but I can't see how to plug that in
with static storage.

Cheers,

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-11  2:30     ` Nicolas Pitre
@ 2013-01-11 10:58       ` Will Deacon
  2013-01-11 11:29       ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Will Deacon @ 2013-01-11 10:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 02:30:06AM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> > On Thu, Jan 10, 2013 at 12:20:37AM +0000, Nicolas Pitre wrote:
> > > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> > > +{
> > > +	if (!platform_ops)
> > > +		return -EUNATCH;
> > 
> > Is this the right error code?
> 
> It is as good as any other, with some meaning to be distinguished from 
> the traditional ones like -ENOMEM or -EINVAL that the platform backends 
> could return.
> 
> Would you prefer another one?

-ENODEV? Nothing to lose sleep over though.

> > > +	might_sleep();
> > > +	return platform_ops->power_up(cpu, cluster);
> > > +}
> > > +
> > > +typedef void (*phys_reset_t)(unsigned long);
> > 
> > Maybe it's worth putting this typedef in a header file somewhere. It's
> > also used by the soft reboot code.
> 
> Agreed.  Maybe separately from this series though.
> 
> > > +
> > > +void bL_cpu_power_down(void)
> > > +{
> > > +	phys_reset_t phys_reset;
> > > +
> > > +	BUG_ON(!platform_ops);
> > 
> > Seems a bit overkill, or are we unrecoverable by this point?
> 
> We are.  The upper layer expects this CPU to be dead and there is no 
> easy recovery possible.  This is a "should never happen" condition, and 
> the kernel is badly configured otherwise.

Okey doke, that's what I feared. The BUG_ON makes sense then.

> > > +/*
> > > + * Platform specific methods used in the implementation of the above API.
> > > + */
> > > +struct bL_platform_power_ops {
> > > +	int (*power_up)(unsigned int cpu, unsigned int cluster);
> > > +	void (*power_down)(void);
> > > +	void (*suspend)(u64);
> > > +	void (*powered_up)(void);
> > > +};
> > 
> > It would be good if these prototypes matched the PSCI code, then platforms
> > could just glue them together directly.
> 
> No.
> 
> I discussed this at length with Charles (the PSCI spec author) already. 
> Even in the PSCI case, a minimum PSCI backend is necessary to do some 
> impedance matching between what the PSCI calls expect as arguments and 
> what this kernel specific API needs to express.  For example, the UP 
> method needs to always be provided with the address for bL_entry, 
> irrespective of where the user of this kernel API wants execution to be 
> resumed.  There might be some cases where the backend might decide to 
> override the desired power saving state because of other kernel induced 
> constraints (ongoing DMA operation for example) that PSCI doesn't (and 
> should not) know about.  And the best place to arbitrate between those 
> platform specific constraints is in this platform specific shim or 
> backend.

Yes, you're right. I was thinking we could convert cpu/cluster into cpuid
automatically, but actually it's not guaranteed that the PSCI firmware will
follow the MPIDR format so we even need platform-specific marshalling for
that.

Thanks for the reply,

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
  2013-01-11  3:15     ` Nicolas Pitre
@ 2013-01-11 11:03       ` Will Deacon
  2013-01-11 16:57       ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Will Deacon @ 2013-01-11 11:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 03:15:22AM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> > On Thu, Jan 10, 2013 at 12:20:39AM +0000, Nicolas Pitre wrote:
> > > + * vlocks are currently only used to coordinate between CPUs which are
> > > +   unable to enable their caches yet.  This means that the
> > > +   implementation removes many of the barriers which would be required
> > > +   when executing the algorithm in cached memory.
> >
> > I think you need to elaborate on this and clearly identify the
> > requirements of the memory behaviour. In reality, these locks are hardly
> > ever usable so we don't want them cropping up in driver code and the
> > like!
> 
> Doesn't the following paragraph make that clear enough?

I think it misses a lot out (e.g. we require single-copy atomicity
guarantees for byte access, we require no speculation etc). Essentially,
it's an imprecise definition of strongly-ordered memory. However, see
below (the bit about removing the C implementation)...

> Maybe we should rip out the C interface to avoid such abuses.  I think
> that was initially added when we weren't sure if the C code had to be
> involved.

[...]

> > > +#include <linux/linkage.h>
> > > +#include "vlock.h"
> > > +
> > > +#if VLOCK_VOTING_SIZE > 4
> >
> > 4? Maybe a CONFIG option or a #define in an arch vlock.h?
> 
> The 4 here is actually related to the number of bytes in a word, to
> decide whether or not a loop is needed for voters enumeration.  That is
> not configurable.

BYTES_PER_LONG then (or suitable shifting of BITS_PER_LONG)?

> > > +#define VLOCK_VOTING_SIZE      ((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)
> >
> > Huh?
> 
> Each ballot is one byte, and we pack them into words.  So this is the
> size of the required words to hold all ballots.

Ok, so this could make use of BYTES_PER_LONG too, just to make the reasoning
more clear.

> > > +#define VLOCK_SIZE             (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
> > > +#define VLOCK_OWNER_NONE       0
> > > +
> > > +#ifndef __ASSEMBLY__
> > > +
> > > +struct vlock {
> > > +       char data[VLOCK_SIZE];
> > > +};
> >
> > Does this mean the struct is only single byte aligned? You do word
> > accesses to it in your vlock code and rely on atomicity, so I'd feel
> > safer if it was aligned to 4 bytes, especially since this isn't being
> > accessed via a normal mapping.
> 
> The structure size is always a multiple of 4 bytes.  Its alignment is
> actually much larger than 4 as it needs to span a whole cache line not
> to be overwritten by dirty line writeback.

That's not implied from the structure definition.

> As I mentioned before, given that this structure is allocated and
> accessed only by assembly code, we could simply remove all those unused
> C definitions to avoid potential confusion and misuse.

Yes, I think removing the C definitions is a great idea. Then, we have a
pure-asm implementation which is, as such, tied to ARM. In that case, the
documentation can just refer to ARM memory types instead of loosely defining
the required characteristics (i.e. state that device or strongly-ordered
memory is required).

Cheers,

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-11  1:50     ` Nicolas Pitre
@ 2013-01-11 11:09       ` Dave Martin
  0 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 11:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 08:50:59PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:38AM +0000, Nicolas Pitre wrote:
> > > From: Dave Martin <dave.martin@linaro.org>
> > > 
> > > This provides helper methods to coordinate between CPUs coming down
> > > and CPUs going up, as well as documentation on the used algorithms,
> > > so that cluster teardown and setup
> > > operations are not done for a cluster simultaneously.
> > 
> > [...]
> > 
> > > +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> > > +{
> > > +       unsigned int i, j, mpidr, this_cluster;
> > > +
> > > +       BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_NR_CLUSTERS != sizeof bL_sync);
> > > +       BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
> > > +
> > > +       /*
> > > +        * Set initial CPU and cluster states.
> > > +        * Only one cluster is assumed to be active at this point.
> > > +        */
> > > +       for (i = 0; i < BL_NR_CLUSTERS; i++) {
> > > +               bL_sync.clusters[i].cluster = CLUSTER_DOWN;
> > > +               bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
> > > +               for (j = 0; j < BL_CPUS_PER_CLUSTER; j++)
> > > +                       bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
> > > +       }
> > > +       asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > 
> > We have a helper for this...

Agreed, we would ideally use a single definition for that.

> > 
> > > +       this_cluster = (mpidr >> 8) & 0xf;
> > 
> > ... and also this, thanks to Lorenzo's recent patches.
> 
> Indeed, I'll have a closer look at them.
> 
> > > +       for_each_online_cpu(i)
> > > +               bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
> > > +       bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
> > > +       sync_mem(&bL_sync);
> > > +
> > > +       if (power_up_setup) {
> > > +               bL_power_up_setup_phys = virt_to_phys(power_up_setup);
> > > +               sync_mem(&bL_power_up_setup_phys);
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> > > index 9d351f2b4c..f7a64ac127 100644
> > > --- a/arch/arm/common/bL_head.S
> > > +++ b/arch/arm/common/bL_head.S
> > > @@ -7,11 +7,19 @@
> > >   * This program is free software; you can redistribute it and/or modify
> > >   * it under the terms of the GNU General Public License version 2 as
> > >   * published by the Free Software Foundation.
> > > + *
> > > + *
> > > + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> > > + * for details of the synchronisation algorithms used here.
> > >   */
> > > 
> > >  #include <linux/linkage.h>
> > >  #include <asm/bL_entry.h>
> > > 
> > > +.if BL_SYNC_CLUSTER_CPUS
> > > +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> > > +.endif
> > > +
> > >         .macro  pr_dbg  cpu, string
> > >  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> > >         b       1901f
> > > @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
> > >  2:     pr_dbg  r4, "kernel bL_entry_point\n"
> > > 
> > >         /*
> > > -        * MMU is off so we need to get to bL_entry_vectors in a
> > > +        * MMU is off so we need to get to various variables in a
> > >          * position independent way.
> > >          */
> > >         adr     r5, 3f
> > > -       ldr     r6, [r5]
> > > +       ldmia   r5, {r6, r7, r8}
> > >         add     r6, r5, r6                      @ r6 = bL_entry_vectors
> > > +       ldr     r7, [r5, r7]                    @ r7 = bL_power_up_setup_phys
> > > +       add     r8, r5, r8                      @ r8 = bL_sync
> > > +
> > > +       mov     r0, #BL_SYNC_CLUSTER_SIZE
> > > +       mla     r8, r0, r10, r8                 @ r8 = bL_sync cluster base
> > > +
> > > +       @ Signal that this CPU is coming UP:
> > > +       mov     r0, #CPU_COMING_UP
> > > +       mov     r5, #BL_SYNC_CPU_SIZE
> > > +       mla     r5, r9, r5, r8                  @ r5 = bL_sync cpu address
> > > +       strb    r0, [r5]
> > > +
> > > +       dsb
> > 
> > Why is a dmb not enough here? In fact, the same goes for most of these
> > other than the one preceeding the sev. Is there an interaction with the
> > different mappings for the cluster data that I've missed?
> 
> Probably Dave could comment more on this as this is his code, or Achin 
> who also reviewed it.  I don't know the level of discussion that 
> happened inside ARM around those barriers.
> 
> When the TC2 firmware didn't properly handle the ACP snoops, the dsb's 
> couldn't be used at this point.  The replacement for a dsb was a read 
> back followed by a dmb in that case, and then the general sentiment was 
> that this was an A15 specific workaround which wasn't architecturally 
> guaranteed on all ARMv7 compliant implementations, or something along 
> those lines.
> 
> Given that the TC2 firmware properly handles the snoops now, and that 
> the dsb apparently doesn't require a readback, we just decided to revert 
> to having simple dsb's.

I'll take another look at the code and think about this again.  This code
was initially a bit conservative.  Because we are accessing
Strongly-Ordered memory at this point,
most of your potential dmbs should actually require no barrier at all
(as in the vlock code).  I was cautious about that, but we've now seen
the principle work successfully with the vlock code (which postdates
the cluster state handling code here).

The one exception is before SEV.  Also, before WFE (opinions differ, but
since we are about to wait anyway the extra time cost of the dsb is not
really a concern here).
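
(The dsb before the sev matters because the store publishing the new
state must be visible to the other observers before the event is
signalled; otherwise a woken CPU could re-read stale state and go back
into wfe with the event already consumed.)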

> 
> > > +
> > > +       @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> > > +       @ state, because there is at least one active CPU (this CPU).
> > > +
> > > +       @ Check if the cluster has been set up yet:
> > > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > > +       cmp     r0, #CLUSTER_UP
> > > +       beq     cluster_already_up
> > > +
> > > +       @ Signal that the cluster is being brought up:
> > > +       mov     r0, #INBOUND_COMING_UP
> > > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > > +
> > > +       dsb
> > > +
> > > +       @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
> > > +       @ point onwards will observe INBOUND_COMING_UP and abort.
> > > +
> > > +       @ Wait for any previously-pending cluster teardown operations to abort
> > > +       @ or complete:
> > > +cluster_teardown_wait:
> > > +       ldrb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > > +       cmp     r0, #CLUSTER_GOING_DOWN
> > > +       wfeeq
> > > +       beq     cluster_teardown_wait
> > > +
> > > +       @ power_up_setup is responsible for setting up the cluster:
> > > +
> > > +       cmp     r7, #0
> > > +       mov     r0, #1          @ second (cluster) affinity level
> > > +       blxne   r7              @ Call power_up_setup if defined
> > > +
> > > +       @ Leave the cluster setup critical section:
> > > +
> > > +       dsb
> > > +       mov     r0, #INBOUND_NOT_COMING_UP
> > > +       strb    r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> > > +       mov     r0, #CLUSTER_UP
> > > +       strb    r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> > > +       dsb
> > > +       sev
> > > +
> > > +cluster_already_up:
> > > +       @ If a platform-specific CPU setup hook is needed, it is
> > > +       @ called from here.
> > > +
> > > +       cmp     r7, #0
> > > +       mov     r0, #0          @ first (CPU) affinity level
> > > +       blxne   r7              @ Call power_up_setup if defined
> > > +
> > > +       @ Mark the CPU as up:
> > > +
> > > +       dsb
> > > +       mov     r0, #CPU_UP
> > > +       strb    r0, [r5]
> > > +       dsb
> > > +       sev
> > > 
> > >  bL_entry_gated:
> > >         ldr     r5, [r6, r4, lsl #2]            @ r5 = CPU entry vector
> > > @@ -70,6 +148,8 @@ bL_entry_gated:
> > >         .align  2
> > > 
> > >  3:     .word   bL_entry_vectors - .
> > > +       .word   bL_power_up_setup_phys - 3b
> > > +       .word   bL_sync - 3b
> > > 
> > >  ENDPROC(bL_entry_point)
> > > 
> > > @@ -79,3 +159,7 @@ ENDPROC(bL_entry_point)
> > >         .type   bL_entry_vectors, #object
> > >  ENTRY(bL_entry_vectors)
> > >         .space  4 * BL_NR_CLUSTERS * BL_CPUS_PER_CLUSTER
> > > +
> > > +       .type   bL_power_up_setup_phys, #object
> > > +ENTRY(bL_power_up_setup_phys)
> > > +       .space  4               @ set by bL_cluster_sync_init()
> > > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > > index 942d7f9f19..167394d9a0 100644
> > > --- a/arch/arm/include/asm/bL_entry.h
> > > +++ b/arch/arm/include/asm/bL_entry.h
> > > @@ -15,8 +15,37 @@
> > >  #define BL_CPUS_PER_CLUSTER    4
> > >  #define BL_NR_CLUSTERS         2
> > > 
> > > +/* Definitions for bL_cluster_sync_struct */
> > > +#define CPU_DOWN               0x11
> > > +#define CPU_COMING_UP          0x12
> > > +#define CPU_UP                 0x13
> > > +#define CPU_GOING_DOWN         0x14
> > > +
> > > +#define CLUSTER_DOWN           0x21
> > > +#define CLUSTER_UP             0x22
> > > +#define CLUSTER_GOING_DOWN     0x23
> > > +
> > > +#define INBOUND_NOT_COMING_UP  0x31
> > > +#define INBOUND_COMING_UP      0x32
> > 
> > Do these numbers signify anything? Why not 0, 1, 2 etc?
> 
> Initially that's what they were.  But during debugging (as we faced a 
> few cache coherency issues here) it was more useful to use numbers with 
> an easily distinguishable signature.  For example, a 0 may come from 
> anywhere and could mean anything so that is about the worst choice.
> Other than that, those numbers have no particular significance.
> 
> > > +
> > > +/* This is a complete guess. */
> > > +#define __CACHE_WRITEBACK_ORDER        6
> > 
> > Is this CONFIG_ARM_L1_CACHE_SHIFT?
> 
> No.  That has to cover L2 as well.

Of course, I seem to remember that there are assumptions elsewhere in 
the kernel that 1 << CONFIG_ARM_L1_CACHE_SHIFT is (at least) the cache
writeback granule.

I prefer not to use a macro with a wholly misleading name, but I would
like a "proper" way to get this value, if there is one ... ?

One reason for adding a #define here was to document the fact that the
value used really is a guess and that we have no correct way to discover
it.

> 
> > > +#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
> > > +
> > > +/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
> > > +#define BL_SYNC_CLUSTER_CPUS   0
> > 
> > Why not use asm-offsets.h for this?
> 
> That's how that was done initially. But that ended up cluttering 
> asm-offsets.h for stuff that actually is really a local implementation 
> detail which doesn't need kernel wide scope.  In other words, the end 
> result looked worse.
> 
> One could argue that they are still exposed too much as the only files 
> that need to know about those defines are bL_head.S and bL_entry.c.
> 
> > > +#define BL_SYNC_CPU_SIZE       __CACHE_WRITEBACK_GRANULE
> > > +#define BL_SYNC_CLUSTER_CLUSTER \
> > > +       (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_CPUS_PER_CLUSTER)
> > > +#define BL_SYNC_CLUSTER_INBOUND \
> > > +       (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
> > > +#define BL_SYNC_CLUSTER_SIZE \
> > > +       (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
> > > +
> > 
> > Hmm, this looks pretty fragile to me but again, you need this stuff at
> > compile time.
> 
> There are compile time and run time assertions in bL_entry.c to ensure 
> those offsets and the corresponding C structure don't get out of sync.
> 
> > Is there an architected maximum value for the writeback
> > granule? Failing that, we may as well just use things like

There is an architectural maximum, but it is 2K (which, although "safe",
feels a bit excessive for our purposes).  A 2+3 CPU system would require
at least 22K for the synchronisation data with this assumption, rising to
28K for 4+4.  Not the end of the world for .bss data on modern hardware
with GB of DRAM, but it still feels wasteful.

Does anyone have a view on how much we care?

If there is no outer cache, the actual granule size can be determined
via CP15 at run-time; if there is an outer cache, we would also need
to find out its granule somehow.

> > __cacheline_aligned if we're only using the L1 alignment anyway.
> 
> See above -- we need L2 alignment.

This partly depends on whether __cacheline_aligned is supposed to
guarantee cache writeback granule alignment.  Is it?  At best I was highly
uncertain about this.

Cheers
---Dave 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-11  2:30     ` Nicolas Pitre
  2013-01-11 10:58       ` Will Deacon
@ 2013-01-11 11:29       ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 11:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 09:30:06PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:37AM +0000, Nicolas Pitre wrote:
> > > This is the basic API used to handle the powering up/down of individual
> > > CPUs in a big.LITTLE system.  The platform specific backend implementation
> > > has the responsibility to also handle the cluster level power as well when
> > > the first/last CPU in a cluster is brought up/down.
> > > 
> > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > ---
> > >  arch/arm/common/bL_entry.c      | 88 +++++++++++++++++++++++++++++++++++++++
> > >  arch/arm/include/asm/bL_entry.h | 92 +++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 180 insertions(+)
> > > 
> > > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > > index 80fff49417..41de0622de 100644
> > > --- a/arch/arm/common/bL_entry.c
> > > +++ b/arch/arm/common/bL_entry.c
> > > @@ -11,11 +11,13 @@
> > >  
> > >  #include <linux/kernel.h>
> > >  #include <linux/init.h>
> > > +#include <linux/irqflags.h>
> > >  
> > >  #include <asm/bL_entry.h>
> > >  #include <asm/barrier.h>
> > >  #include <asm/proc-fns.h>
> > >  #include <asm/cacheflush.h>
> > > +#include <asm/idmap.h>
> > >  
> > >  extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > >  
> > > @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> > >  	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> > >  			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> > >  }
> > > +
> > > +static const struct bL_platform_power_ops *platform_ops;
> > > +
> > > +int __init bL_platform_power_register(const struct bL_platform_power_ops *ops)
> > > +{
> > > +	if (platform_ops)
> > > +		return -EBUSY;
> > > +	platform_ops = ops;
> > > +	return 0;
> > > +}
> > > +
> > > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> > > +{
> > > +	if (!platform_ops)
> > > +		return -EUNATCH;
> > 
> > Is this the right error code?
> 
> It is as good as any other, with some meaning to be distinguished from 
> the traditional ones like -ENOMEM or -EINVAL that the platform backends 
> could return.
> 
> Would you prefer another one?
> 
> > > +	might_sleep();
> > > +	return platform_ops->power_up(cpu, cluster);
> > > +}
> > > +
> > > +typedef void (*phys_reset_t)(unsigned long);
> > 
> > Maybe it's worth putting this typedef in a header file somewhere. It's
> > also used by the soft reboot code.
> 
> Agreed.  Maybe separately from this series though.
> 
> > > +
> > > +void bL_cpu_power_down(void)
> > > +{
> > > +	phys_reset_t phys_reset;
> > > +
> > > +	BUG_ON(!platform_ops);
> > 
> > Seems a bit overkill, or are we unrecoverable by this point?
> 
> We are.  The upper layer expects this CPU to be dead and there is no 
> easy recovery possible.  This is a "should never happen" condition, and 
> the kernel is badly configured otherwise.

bL_cpu_power_down() is unconditional and does not fail.  This means
that calling this function means that:

 a) a subsequent call to bL_cpu_power_up() on this CPU will cause it
    to jump to bL_entry_point, in something resembling reset state;

 b) for all, part (or, rarely, none) of the intervening period, the
    CPU may really be turned off.

Without this BUG_ON() we would need to implement a dummy mechanism
to send the CPU to bL_entry_point at the right time.  If this happens
instantaneously (without waiting for bL_cpu_power_up()), then this
will likely lead to a spin of some sort unless it only happens
occasionally.  Also, the whole purpose of this function is to power off
the CPU, permitting power savings, so if no means has been registered to
do that, a call to bL_cpu_power_down() is certainly buggy misuse.
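
To illustrate the intended usage (a hypothetical hotplug-style caller,
just a sketch; the example_* names are mine and secondary_startup stands
for wherever execution should resume):

#include <linux/irqflags.h>
#include <asm/bL_entry.h>

extern void secondary_startup(void);	/* usual ARM secondary entry */

static int example_boot_secondary(unsigned int cpu, unsigned int cluster)
{
	/* tell bL_entry_point where this CPU should resume execution */
	bL_set_entry_vector(cpu, cluster, secondary_startup);
	return bL_cpu_power_up(cpu, cluster);
}

static void example_cpu_die(void)
{
	local_irq_disable();
	bL_cpu_power_down();	/* does not return */
}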

> 
> > 
> > > +	BUG_ON(!irqs_disabled());
> > > +
> > > +	/*
> > > +	 * Do this before calling into the power_down method,
> > > +	 * as it might not always be safe to do afterwards.
> > > +	 */
> > > +	setup_mm_for_reboot();
> > > +
> > > +	platform_ops->power_down();
> > > +
> > > +	/*
> > > +	 * It is possible for a power_up request to happen concurrently
> > > +	 * with a power_down request for the same CPU. In this case the
> > > +	 * power_down method might not be able to actually enter a
> > > +	 * powered down state with the WFI instruction if the power_up
> > > +	 * method has removed the required reset condition.  The
> > > +	 * power_down method is then allowed to return. We must perform
> > > +	 * a re-entry in the kernel as if the power_up method just had
> > > +	 * deasserted reset on the CPU.
> > > +	 *
> > > +	 * To simplify race issues, the platform specific implementation
> > > +	 * must accommodate for the possibility of unordered calls to
> > > +	 * power_down and power_up with a usage count. Therefore, if a
> > > +	 * call to power_up is issued for a CPU that is not down, then
> > > +	 * the next call to power_down must not attempt a full shutdown
> > > +	 * but only do the minimum (normally disabling L1 cache and CPU
> > > +	 * coherency) and return just as if a concurrent power_up request
> > > +	 * had happened as described above.
> > > +	 */
> > > +
> > > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > > +	phys_reset(virt_to_phys(bL_entry_point));
> > > +
> > > +	/* should never get here */
> > > +	BUG();
> > > +}
> > > +
> > > +void bL_cpu_suspend(u64 expected_residency)
> > > +{
> > > +	phys_reset_t phys_reset;
> > > +
> > > +	BUG_ON(!platform_ops);
> > > +	BUG_ON(!irqs_disabled());
> > > +
> > > +	/* Very similar to bL_cpu_power_down() */
> > > +	setup_mm_for_reboot();
> > > +	platform_ops->suspend(expected_residency);
> > > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > > +	phys_reset(virt_to_phys(bL_entry_point));
> > > +	BUG();
> > > +}
> > > +
> > > +int bL_cpu_powered_up(void)
> > > +{
> > > +	if (!platform_ops)
> > > +		return -EUNATCH;
> > > +	if (platform_ops->powered_up)
> > > +		platform_ops->powered_up();
> > > +	return 0;
> > > +}
> > > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > > index ff623333a1..942d7f9f19 100644
> > > --- a/arch/arm/include/asm/bL_entry.h
> > > +++ b/arch/arm/include/asm/bL_entry.h
> > > @@ -31,5 +31,97 @@ extern void bL_entry_point(void);
> > >   */
> > >  void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
> > >  
> > > +/*
> > > + * CPU/cluster power operations API for higher subsystems to use.
> > > + */
> > > +
> > > +/**
> > > + * bL_cpu_power_up - make given CPU in given cluster runnable
> > > + *
> > > + * @cpu: CPU number within given cluster
> > > + * @cluster: cluster number for the CPU
> > > + *
> > > + * The identified CPU is brought out of reset.  If the cluster was powered
> > > + * down then it is brought up as well, taking care not to let the other CPUs
> > > + * in the cluster run, and ensuring appropriate cluster setup.
> > > + *
> > > + * Caller must ensure the appropriate entry vector is initialized with
> > > + * bL_set_entry_vector() prior to calling this.
> > > + *
> > > + * This must be called in a sleepable context.  However, the implementation
> > > + * is strongly encouraged to return early and let the operation happen
> > > + * asynchronously, especially when significant delays are expected.
> > > + *
> > > + * If the operation cannot be performed then an error code is returned.
> > > + */
> > > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster);
> > > +
> > > +/**
> > > + * bL_cpu_power_down - power the calling CPU down
> > > + *
> > > + * The calling CPU is powered down.
> > > + *
> > > + * If this CPU is found to be the "last man standing" in the cluster
> > > + * then the cluster is prepared for power-down too.
> > > + *
> > > + * This must be called with interrupts disabled.
> > > + *
> > > + * This does not return.  Re-entry in the kernel is expected via
> > > + * bL_entry_point.
> > > + */
> > > +void bL_cpu_power_down(void);
> > > +
> > > +/**
> > > + * bL_cpu_suspend - bring the calling CPU in a suspended state
> > > + *
> > > + * @expected_residency: duration in microseconds the CPU is expected
> > > + *			to remain suspended, or 0 if unknown/infinity.
> > > + *
> > > + * The calling CPU is suspended.  The expected residency argument is used
> > > + * as a hint by the platform specific backend to implement the appropriate
> > > + * sleep state level according to the knowledge it has on wake-up latency
> > > + * for the given hardware.
> > > + *
> > > + * If this CPU is found to be the "last man standing" in the cluster
> > > + * then the cluster may be prepared for power-down too, if the expected
> > > + * residency makes it worthwhile.
> > > + *
> > > + * This must be called with interrupts disabled.
> > > + *
> > > + * This does not return.  Re-entry in the kernel is expected via
> > > + * bL_entry_point.
> > > + */
> > > +void bL_cpu_suspend(u64 expected_residency);
> > > +
> > > +/**
> > > + * bL_cpu_powered_up - housekeeping work after a CPU has been powered up
> > > + *
> > > + * This lets the platform specific backend code perform needed housekeeping
> > > + * work.  This must be called by the newly activated CPU as soon as it is
> > > + * fully operational in kernel space, before it enables interrupts.
> > > + *
> > > + * If the operation cannot be performed then an error code is returned.
> > > + */
> > > +int bL_cpu_powered_up(void);
> > > +
> > > +/*
> > > + * Platform specific methods used in the implementation of the above API.
> > > + */
> > > +struct bL_platform_power_ops {
> > > +	int (*power_up)(unsigned int cpu, unsigned int cluster);
> > > +	void (*power_down)(void);
> > > +	void (*suspend)(u64);
> > > +	void (*powered_up)(void);
> > > +};
> > 
> > It would be good if these prototypes matched the PSCI code, then platforms
> > could just glue them together directly.
> 
> No.
> 
> I discussed this at length with Charles (the PSCI spec author) already. 
> Even in the PSCI case, a minimum PSCI backend is necessary to do some 
> impedance matching between what the PSCI calls expect as arguments and 
> what this kernel specific API needs to express.  For example, the UP 
> method needs to always be provided with the address for bL_entry, 
> irrespective of where the user of this kernel API wants execution to be 
> resumed.  There might be some cases where the backend might decide to 
> override the desired power saving state because of other kernel induced 
> constraints (ongoing DMA operation for example) that PSCI doesn't (and 
> should not) know about.  And the best place to arbitrate between those 
> platform specific constraints is in this platform specific shim or 
> backend.
> 
> Because of that, and because one feature of Linux is to not have stable 
> APIs in the kernel, so as to be free to adapt them to future needs, I think 
> it is best not to even try matching the PSCI interface here.

The kernel may need to do stuff in these functions, even if the
underlying backend is PSCI, so they wouldn't just be a pass-through ...
or does the mach-virt experience convince us that there would be nothing
to do?    I feel unsure about that, but I've not looked at the mach-virt
code yet.

mach-virt lacks most of the hardware nasties which we can't ignore at
the host kernel / firmware interface.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10 19:04     ` Nicolas Pitre
@ 2013-01-11 11:30       ` Dave Martin
  0 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 11:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 02:04:02PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Dave Martin wrote:
> 
> > > +int __init bL_cluster_sync_init(void (*power_up_setup)(void))
> > 
> > The addition of the affinity level parameter for power_up_setup means
> > that this prototype is not correct.
> 
> Indeed.
> 
> > This is not a functional change, since that function is only called from
> > assembler anyway, but it will help avoid confusion.
> 
> Fixed now, as well as the DCSCB usage.
> 
> 
> Nicolas

OK, thanks
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-11 10:55       ` Will Deacon
@ 2013-01-11 11:35         ` Dave Martin
  0 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 11:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 10:55:26AM +0000, Will Deacon wrote:
> On Fri, Jan 11, 2013 at 01:26:21AM +0000, Nicolas Pitre wrote:
> > On Thu, 10 Jan 2013, Will Deacon wrote:
> > > On Thu, Jan 10, 2013 at 12:20:36AM +0000, Nicolas Pitre wrote:
> > > > +
> > > > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > > 
> > > Does this actually need to be volatile? I'd have thought a compiler
> > > barrier in place of the smp_wmb below would be enough (following on from
> > > Catalin's comments).
> > 
> > Actually, I did the reverse i.e. I removed the smp_wmb() entirely. A 
> > compiler barrier forces the whole world to memory while here we only 
> > want this particular assignment to be pushed out.
> > 
> > Furthermore, I like the volatile as it flags that this is a special 
> > variable which in this case is also accessed from CPUs with no cache.
> 
> Ok, fair enough. Given that the smp_wmb isn't needed that sounds better.
> 
> > > > +	/* We didn't expect this CPU.  Try to make it quiet. */
> > > > +1:	wfi
> > > > +	wfe
> > > > +	b	1b
> > > 
> > > I realise this CPU is stuck at this point, but you should have a dsb
> > > before a wfi instruction. This could be problematic with the CCI this
> > > early, so maybe just a comment saying that it doesn't matter because we
> > > don't care about this core?
> > 
> > Why a dsb?  No data was even touched at this point.  And since this is 
> > meant to be a better "b ." kind of loop, I'd rather not try to make it 
> > more sophisticated than it already is.  And of course it is meant to 
> > never be executed in practice.
> 
> Sure, that's why I think just mentioning that we don't ever plan to boot
> this CPU is a good idea (so people don't add code here later on).

I agree with the conclusions here.

> > > > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > > > new file mode 100644
> > > > index 0000000000..ff623333a1
> > > > --- /dev/null
> > > > +++ b/arch/arm/include/asm/bL_entry.h
> > > > @@ -0,0 +1,35 @@
> > > > +/*
> > > > + * arch/arm/include/asm/bL_entry.h
> > > > + *
> > > > + * Created by:  Nicolas Pitre, April 2012
> > > > + * Copyright:   (C) 2012  Linaro Limited
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License version 2 as
> > > > + * published by the Free Software Foundation.
> > > > + */
> > > > +
> > > > +#ifndef BL_ENTRY_H
> > > > +#define BL_ENTRY_H
> > > > +
> > > > +#define BL_CPUS_PER_CLUSTER	4
> > > > +#define BL_NR_CLUSTERS		2
> > > 
> > > Hmm, I see these have to be constant so you can allocate your space in
> > > the assembly file. In which case, I think it's worth changing their
> > > names to have MAX or LIMIT in them...
> > 
> > Yes, good point.  I'll change them.
> 
> Thanks.
> 
> > >  maybe they could even be CONFIG options?
> > 
> > Nah.  I prefer not adding new config options unless this is really 
> > necessary or useful.  For the foreseeable future, we'll see systems with 
> > at most 2 clusters and at most 4 CPUs per cluster.  That could easily be 
> > revisited later if that becomes unsuitable for some new systems.
> 
> The current GIC is limited to 8 CPUs, so 4x2 is also a realistic possibility.
> 
> > Initially I wanted all those things to be runtime-sized, in relation to 
> > the TODO item in the commit log.  That too can come later.
> 
> Out of interest: how would you achieve that? I also thought about getting
> this information from the device tree, but I can't see how to plug that in
> with static storage.

I think you would just have to bite the bullet and go dynamic in this
case. But it's not a lot of data in total with the current limits, so
this feels like overkill.

If we eventually need to go many-CPU with this code, it will need
addressing, but there are no current plans for that that I know of.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache
  2013-01-10 19:13     ` Nicolas Pitre
@ 2013-01-11 11:38       ` Dave Martin
  0 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 11:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 02:13:09PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Dave Martin wrote:
> 
> > On Wed, Jan 09, 2013 at 07:20:49PM -0500, Nicolas Pitre wrote:
> > > From: Dave Martin <dave.martin@linaro.org>
> > 
> > To avoid confusion, the prefix in the subject line should be "CCI", not
> > "TC2".
> 
> Absolutely.  This is my mistake as I removed the TC2 changes from your 
> original patch to only keep the CCI ones, but forgot to update the patch 
> title.

Oh right.  I was quite happy to believe it was my mistake :)

It makes sense to have two separate patches anyway, though.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10 16:47     ` Nicolas Pitre
@ 2013-01-11 11:45       ` Catalin Marinas
  2013-01-11 12:05         ` Lorenzo Pieralisi
  2013-01-11 12:19         ` Dave Martin
  0 siblings, 2 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-11 11:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 04:47:09PM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Catalin Marinas wrote:
> > On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > --- /dev/null
> > > +++ b/arch/arm/common/bL_entry.c
> > ...
> > > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > 
> > IMHO, we should keep this array linear and ignore the cluster grouping
> > at this stage. This information could be added to latter patches that
> > actually need to know about the b.L topology.
> 
> That's virtually all of them.  Everything b.L related is always 
> expressed in terms of a cpu,cluster tuple at the low level.
> 
> > This would also imply that we treat the MPIDR just as an ID without 
> > digging into its bit layout.
> 
> That makes for too large an index space.  We always end up needing to 
> break the MPIDR into a cpu,cluster thing as the MPIDR bits are too 
> sparse.

You could find a way to compress this with some mask and shifts. We can
look at this later if we are to generalise this to non-b.L systems.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-11 11:45       ` Catalin Marinas
@ 2013-01-11 12:05         ` Lorenzo Pieralisi
  2013-01-11 12:19         ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Lorenzo Pieralisi @ 2013-01-11 12:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 11:45:53AM +0000, Catalin Marinas wrote:
> On Thu, Jan 10, 2013 at 04:47:09PM +0000, Nicolas Pitre wrote:
> > On Thu, 10 Jan 2013, Catalin Marinas wrote:
> > > On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > > --- /dev/null
> > > > +++ b/arch/arm/common/bL_entry.c
> > > ...
> > > > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > > 
> > > IMHO, we should keep this array linear and ignore the cluster grouping
> > > at this stage. This information could be added to latter patches that
> > > actually need to know about the b.L topology.
> > 
> > That's virtually all of them.  Everything b.L related is always 
> > expressed in terms of a cpu,cluster tuple at the low level.
> > 
> > > This would also imply that we treat the MPIDR just as an ID without 
> > > digging into its bit layout.
> > 
> > That makes for too large an index space.  We always end up needing to 
> > break the MPIDR into a cpu,cluster thing as the MPIDR bits are too 
> > sparse.
> 
> You could find a way to compress this with some mask and shifts. We can
> look at this later if we are to generalise this to non-b.L systems.

The MPIDR linearization (a simple hash to convert it to a linear index) is
planned anyway since code paths like cpu_{suspend/resume} do not work for
multi-cluster systems as things stand.
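
For the two-level topology targeted here, such a hash need not be more
than a couple of masks and a multiply-accumulate.  A rough sketch,
assuming the Aff0/Aff1 field positions and the BL_CPUS_PER_CLUSTER limit
used elsewhere in this series (the real thing would derive the shifts at
runtime):

	static inline unsigned int mpidr_to_linear(unsigned int mpidr)
	{
		unsigned int cpu     = mpidr & 0xff;		/* Aff0 */
		unsigned int cluster = (mpidr >> 8) & 0xff;	/* Aff1 */

		return cluster * BL_CPUS_PER_CLUSTER + cpu;
	}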

Lorenzo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-11 11:45       ` Catalin Marinas
  2013-01-11 12:05         ` Lorenzo Pieralisi
@ 2013-01-11 12:19         ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 12:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 11:45:53AM +0000, Catalin Marinas wrote:
> On Thu, Jan 10, 2013 at 04:47:09PM +0000, Nicolas Pitre wrote:
> > On Thu, 10 Jan 2013, Catalin Marinas wrote:
> > > On 10 January 2013 00:20, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > > --- /dev/null
> > > > +++ b/arch/arm/common/bL_entry.c
> > > ...
> > > > +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > > 
> > > IMHO, we should keep this array linear and ignore the cluster grouping
> > > at this stage. This information could be added to latter patches that
> > > actually need to know about the b.L topology.
> > 
> > That's virtually all of them.  Everything b.L related is always 
> > expressed in terms of a cpu,cluster tuple at the low level.
> > 
> > > This would also imply that we treat the MPIDR just as an ID without 
> > > digging into its bit layout.
> > 
> > That makes for too large an index space.  We always end up needing to 
> > break the MPIDR into a cpu,cluster thing as the MPIDR bits are too 
> > sparse.
> 
> You could find a way to compress this with some mask and shifts. We can
> look at this later if we are to generalise this to non-b.L systems.

The b.L cluster handling code has multiple instances of this issue.
We should either try to fix them all, or defer them all as being
overkill for the foreseeable future.

For current platforms, the space saved is unlikely to be larger than
the amount of code required to implement the optimisation.

I do think we need a good, generic way to map sparsely-populated,
multidimensional topological node IDs to/from a linear space, but
we should avoid reinventing that too many times.

I think Lorenzo was already potentially looking at this issue in
relation to managing cpu_logical_map.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
  2013-01-11  3:15     ` Nicolas Pitre
  2013-01-11 11:03       ` Will Deacon
@ 2013-01-11 16:57       ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-11 16:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 10:15:22PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:39AM +0000, Nicolas Pitre wrote:
> > > From: Dave Martin <dave.martin@linaro.org>
> > > 
> > > This patch adds a simple low-level voting mutex implementation
> > > to be used to arbitrate during first man selection when no load/store
> > > exclusive instructions are usable.
> > > 
> > > For want of a better name, these are called "vlocks".  (I was
> > > tempted to call them ballot locks, but "block" is way too confusing
> > > an abbreviation...)
> > > 
> > > There is no function to wait for the lock to be released, and no
> > > vlock_lock() function since we don't need these at the moment.
> > > These could straightforwardly be added if vlocks get used for other
> > > purposes.
> > 
> > [...]
> > 
> > > diff --git a/Documentation/arm/big.LITTLE/vlocks.txt b/Documentation/arm/big.LITTLE/vlocks.txt
> > > new file mode 100644
> > > index 0000000000..90672ddc6a
> > > --- /dev/null
> > > +++ b/Documentation/arm/big.LITTLE/vlocks.txt
> > > @@ -0,0 +1,211 @@
> > > +vlocks for Bare-Metal Mutual Exclusion
> > > +======================================
> > 
> > [...]
> > 
> > > +ARM implementation
> > > +------------------
> > > +
> > > +The current ARM implementation [2] contains a some optimisations beyond
> > 
> > -a
> 
> Fixed.
> 
> > 
> > > +the basic algorithm:
> > > +
> > > + * By packing the members of the currently_voting array close together,
> > > +   we can read the whole array in one transaction (providing the number
> > > +   of CPUs potentially contending the lock is small enough).  This
> > > +   reduces the number of round-trips required to external memory.
> > > +
> > > +   In the ARM implementation, this means that we can use a single load
> > > +   and comparison:
> > > +
> > > +       LDR     Rt, [Rn]
> > > +       CMP     Rt, #0
> > > +
> > > +   ...in place of code equivalent to:
> > > +
> > > +       LDRB    Rt, [Rn]
> > > +       CMP     Rt, #0
> > > +       LDRBEQ  Rt, [Rn, #1]
> > > +       CMPEQ   Rt, #0
> > > +       LDRBEQ  Rt, [Rn, #2]
> > > +       CMPEQ   Rt, #0
> > > +       LDRBEQ  Rt, [Rn, #3]
> > > +       CMPEQ   Rt, #0
> > > +
> > > +   This cuts down on the fast-path latency, as well as potentially
> > > +   reducing bus contention in contended cases.
> > > +
> > > +   The optimisation relies on the fact that the ARM memory system
> > > +   guarantees coherency between overlapping memory accesses of
> > > +   different sizes, similarly to many other architectures.  Note that
> > > +   we do not care which element of currently_voting appears in which
> > > +   bits of Rt, so there is no need to worry about endianness in this
> > > +   optimisation.
> > > +
> > > +   If there are too many CPUs to read the currently_voting array in
> > > +   one transaction then multiple transactions are still required.  The
> > > +   implementation uses a simple loop of word-sized loads for this
> > > +   case.  The number of transactions is still fewer than would be
> > > +   required if bytes were loaded individually.
> > > +
> > > +
> > > +   In principle, we could aggregate further by using LDRD or LDM, but
> > > +   to keep the code simple this was not attempted in the initial
> > > +   implementation.
> > > +
> > > +
> > > + * vlocks are currently only used to coordinate between CPUs which are
> > > +   unable to enable their caches yet.  This means that the
> > > +   implementation removes many of the barriers which would be required
> > > +   when executing the algorithm in cached memory.
> > 
> > I think you need to elaborate on this and clearly identify the
> > requirements of the memory behaviour. In reality, these locks are hardly
> > ever usable so we don't want them cropping up in driver code and the
> > like!
> 
> Doesn't the following paragraph make that clear enough?
> 
> Maybe we should rip out the C interface to avoid such abuses.  I think 
> that was initially added when we weren't sure if the C code had to be 
> involved.
> 
> > > +   packing of the currently_voting array does not work with cached
> > > +   memory unless all CPUs contending the lock are cache-coherent, due
> > > +   to cache writebacks from one CPU clobbering values written by other
> > > +   CPUs.  (Though if all the CPUs are cache-coherent, you should
> > > +   probably be using proper spinlocks instead anyway).
> > > +
> > > +
> > > + * The "no votes yet" value used for the last_vote variable is 0 (not
> > > +   -1 as in the pseudocode).  This allows statically-allocated vlocks
> > > +   to be implicitly initialised to an unlocked state simply by putting
> > > +   them in .bss.
> > 
> > You could also put them in their own section and initialise them to -1
> > there.
> 
> Same argument as for bL_vectors: That is less efficient than using .bss 
> which takes no image space.  Plus the transformation for CPU 0 to work 
> with this is basically free. 
> 
> > > +   An offset is added to each CPU's ID for the purpose of setting this
> > > +   variable, so that no CPU uses the value 0 for its ID.
> > > +
> > > +
> > > +Colophon
> > > +--------
> > > +
> > > +Originally created and documented by Dave Martin for Linaro Limited, for
> > > +use in ARM-based big.LITTLE platforms, with review and input gratefully
> > > +received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
> > > +grabbing most of this text out of the relevant mail thread and writing
> > > +up the pseudocode.
> > > +
> > > +Copyright (C) 2012  Linaro Limited
> > > +Distributed under the terms of Version 2 of the GNU General Public
> > > +License, as defined in linux/COPYING.
> > > +
> > > +
> > > +References
> > > +----------
> > > +
> > > +[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
> > > +    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
> > > +
> > > +    http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
> > > +
> > > +[2] linux/arch/arm/common/vlock.S, www.kernel.org.
> > > diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
> > > new file mode 100644
> > > index 0000000000..0a1ee3a7f5
> > > --- /dev/null
> > > +++ b/arch/arm/common/vlock.S
> > > @@ -0,0 +1,108 @@
> > > +/*
> > > + * vlock.S - simple voting lock implementation for ARM
> > > + *
> > > + * Created by: Dave Martin, 2012-08-16
> > > + * Copyright:  (C) 2012  Linaro Limited
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; either version 2 of the License, or
> > > + * (at your option) any later version.
> > 
> > Your documentation is strictly GPLv2, so there's a strange discrepancy
> > here.
> 
> Indeed.
> 
> @Dave: your call.

This can all be strict v2.  The discrepancy was unintentional.

> 
> > > + *
> > > + * This program is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License along
> > > + * with this program; if not, write to the Free Software Foundation, Inc.,
> > > + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> > > + *
> > > + *
> > > + * This algorithm is described in more detail in
> > > + * Documentation/arm/big.LITTLE/vlocks.txt.
> > > + */
> > > +
> > > +#include <linux/linkage.h>
> > > +#include "vlock.h"
> > > +
> > > +#if VLOCK_VOTING_SIZE > 4
> > 
> > 4? Maybe a CONFIG option or a #define in an arch vlock.h?
> 
> The 4 here is actually related to the number of bytes in a word, to 
> decide whether or not a loop is needed for voters enumeration.  That is 
> not configurable.

This is arch-specific assembler, and the 4-bytes-per-word size is
a fixed property of the architecture.

We could have a comment maybe:

/*
 * Each vote occupies a byte, so if there are 4 or fewer CPUs, the whole
 * set of voting flags can be accessed with a single word access.
 */

> 
> > > +#define FEW(x...)
> > > +#define MANY(x...) x
> > > +#else
> > > +#define FEW(x...) x
> > > +#define MANY(x...)
> > > +#endif
> > > +
> > > +@ voting lock for first-man coordination
> > > +
> > > +.macro voting_begin rbase:req, rcpu:req, rscratch:req
> > > +       mov     \rscratch, #1
> > > +       strb    \rscratch, [\rbase, \rcpu]
> > > +       dsb
> > > +.endm
> > > +
> > > +.macro voting_end rbase:req, rcpu:req, rscratch:req
> > > +       mov     \rscratch, #0
> > > +       strb    \rscratch, [\rbase, \rcpu]
> > > +       dsb
> > > +       sev
> > > +.endm
> > > +
> > > +/*
> > > + * The vlock structure must reside in Strongly-Ordered or Device memory.
> > > + * This implementation deliberately eliminates most of the barriers which
> > > + * would be required for other memory types, and assumes that independent
> > > + * writes to neighbouring locations within a cacheline do not interfere
> > > + * with one another.
> > > + */
> > > +
> > > +@ r0: lock structure base
> > > +@ r1: CPU ID (0-based index within cluster)
> > > +ENTRY(vlock_trylock)
> > > +       add     r1, r1, #VLOCK_VOTING_OFFSET
> > > +
> > > +       voting_begin    r0, r1, r2
> > > +
> > > +       ldrb    r2, [r0, #VLOCK_OWNER_OFFSET]   @ check whether lock is held
> > > +       cmp     r2, #VLOCK_OWNER_NONE
> > > +       bne     trylock_fail                    @ fail if so
> > > +
> > > +       strb    r1, [r0, #VLOCK_OWNER_OFFSET]   @ submit my vote
> > > +
> > > +       voting_end      r0, r1, r2
> > > +
> > > +       @ Wait for the current round of voting to finish:
> > > +
> > > + MANY( mov     r3, #VLOCK_VOTING_OFFSET                        )
> > > +0:
> > > + MANY( ldr     r2, [r0, r3]                                    )
> > > + FEW(  ldr     r2, [r0, #VLOCK_VOTING_OFFSET]                  )
> > > +       cmp     r2, #0
> > > +       wfene
> > 
> > Is there a race here? I wonder if you can end up in a situation where
> > everybody enters wfe and then there is nobody left to signal an event
> > via voting_end (if, for example the last voter sent the sev when
> > everybody else was simultaneously doing the cmp before the wfe)...
> > 
> > ... actually, that's ok as long as VLOCK_VOTING_OFFSET isn't speculated,
> > which it shouldn't be from strongly-ordered memory. Fair enough!

Indeed.  The order of accesses to the voting flags is guaranteed by
strongly-ordered-ness.  The ordering between the strb and sev in voting_end
required a dsb, which we have.  The ordering between the external load and
wfe in the waiting code is guaranteed by S-O-ness and a control dependency.

> > 
> > > +       bne     0b
> > > + MANY( add     r3, r3, #4                                      )
> > > + MANY( cmp     r3, #VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE    )
> > > + MANY( bne     0b                                              )
> > > +
> > > +       @ Check who won:
> > > +
> > > +       ldrb    r2, [r0, #VLOCK_OWNER_OFFSET]
> > > +       eor     r0, r1, r2                      @ zero if I won, else nonzero
> > > +       bx      lr
> > > +
> > > +trylock_fail:
> > > +       voting_end      r0, r1, r2
> > > +       mov     r0, #1                          @ nonzero indicates that I lost
> > > +       bx      lr
> > > +ENDPROC(vlock_trylock)
> > > +
> > > +@ r0: lock structure base
> > > +ENTRY(vlock_unlock)
> > > +       mov     r1, #VLOCK_OWNER_NONE
> > > +       dsb
> > > +       strb    r1, [r0, #VLOCK_OWNER_OFFSET]
> > > +       dsb
> > > +       sev
> > > +       bx      lr
> > > +ENDPROC(vlock_unlock)
> > > diff --git a/arch/arm/common/vlock.h b/arch/arm/common/vlock.h
> > > new file mode 100644
> > > index 0000000000..94c29a6caf
> > > --- /dev/null
> > > +++ b/arch/arm/common/vlock.h
> > > @@ -0,0 +1,43 @@
> > > +/*
> > > + * vlock.h - simple voting lock implementation
> > > + *
> > > + * Created by: Dave Martin, 2012-08-16
> > > + * Copyright:  (C) 2012  Linaro Limited
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; either version 2 of the License, or
> > > + * (at your option) any later version.
> > > + *
> > > + * This program is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License along
> > > + * with this program; if not, write to the Free Software Foundation, Inc.,
> > > + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> > > + */
> > > +
> > > +#ifndef __VLOCK_H
> > > +#define __VLOCK_H
> > > +
> > > +#include <asm/bL_entry.h>
> > > +
> > > +#define VLOCK_OWNER_OFFSET     0
> > > +#define VLOCK_VOTING_OFFSET    4
> > 
> > asm-offsets again?
> 
> Same answer.

I did start out by adding stuff to asm-offsets, but it just ended up
looking like cruft.

asm-offsets is primarily for synchronising C structures with asm.  The
vlock structure is not accessed from C, though.

> 
> > > +#define VLOCK_VOTING_SIZE      ((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)
> > 
> > Huh?
> 
> Each ballot is one byte, and we pack them into words.  So this is the 
> size of the required words to hold all ballots.

Hopefully we don't need a comment?  I hoped this was straightforward.
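
For what it's worth, the rounding works out like this (the 6-CPU case is
only hypothetical):

	#define VLOCK_VOTING_SIZE	((BL_CPUS_PER_CLUSTER + 3) / 4 * 4)

	/* BL_CPUS_PER_CLUSTER == 4:  (4 + 3) / 4 * 4 == 4  ->  one word  */
	/* BL_CPUS_PER_CLUSTER == 6:  (6 + 3) / 4 * 4 == 8  ->  two words */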

> 
> > > +#define VLOCK_SIZE             (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
> > > +#define VLOCK_OWNER_NONE       0
> > > +
> > > +#ifndef __ASSEMBLY__
> > > +
> > > +struct vlock {
> > > +       char data[VLOCK_SIZE];
> > > +};
> > 
> > Does this mean the struct is only single byte aligned? You do word
> > accesses to it in your vlock code and rely on atomicity, so I'd feel
> > safer if it was aligned to 4 bytes, especially since this isn't being
> > accessed via a normal mapping.
> 
> The structure size is always a multiple of 4 bytes.  Its alignment is 
> actually much larger than 4 as it needs to span a whole cache line not 
> to be overwritten by dirty line writeback.
> 
> As I mentioned before, given that this structure is allocated and 
> accessed only by assembly code, we could simply remove all those unused 
> C definitions to avoid potential confusion and misuse.

Agreed.  Originally I anticipated this stuff being usable from C, but
this is so tenuous that providing C declarations may just confuse people.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
                     ` (2 preceding siblings ...)
  2013-01-10 23:05   ` Will Deacon
@ 2013-01-11 17:16   ` Santosh Shilimkar
  2013-01-11 18:10     ` Nicolas Pitre
  2013-03-07  7:37   ` Pavel Machek
  4 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 17:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> CPUs in a big.LITTLE system have special needs when entering the kernel
> due to a hotplug event, or when resuming from a deep sleep mode.
>
> This is vectorized so multiple CPUs can enter the kernel in parallel
> without serialization.
>
> Only the basic structure is introduced here.  This will be extended
> later.
>
> TODO: MPIDR based indexing should eventually be made runtime adjusted.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---

[..]

> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
> index e8a4e58f1b..50880c494f 100644
> --- a/arch/arm/common/Makefile
> +++ b/arch/arm/common/Makefile
> @@ -13,3 +13,6 @@ obj-$(CONFIG_SHARP_PARAM)	+= sharpsl_param.o
>   obj-$(CONFIG_SHARP_SCOOP)	+= scoop.o
>   obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>   obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
> +obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
> +obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o
> +obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> new file mode 100644
> index 0000000000..80fff49417
> --- /dev/null
> +++ b/arch/arm/common/bL_entry.c
> @@ -0,0 +1,30 @@
> +/*
> + * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
> + *
> + * Created by:  Nicolas Pitre, March 2012
> + * Copyright:   (C) 2012  Linaro Limited
2013 now :-)
Looks like you need to update the rest of the patches as well.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +
> +#include <asm/bL_entry.h>
> +#include <asm/barrier.h>
> +#include <asm/proc-fns.h>
> +#include <asm/cacheflush.h>
> +
> +extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> +
> +void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
> +{
> +	unsigned long val = ptr ? virt_to_phys(ptr) : 0;
> +	bL_entry_vectors[cluster][cpu] = val;
> +	smp_wmb();
> +	__cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
> +	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> +			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> +}
I had the same question about smp_wmb() as Catalin, but after following
the rest of the comments I understand it will be removed, so that's good.

> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> new file mode 100644
> index 0000000000..9d351f2b4c
> --- /dev/null
> +++ b/arch/arm/common/bL_head.S
> @@ -0,0 +1,81 @@
> +/*
> + * arch/arm/common/bL_head.S -- big.LITTLE kernel re-entry point
> + *
> + * Created by:  Nicolas Pitre, March 2012
> + * Copyright:   (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/bL_entry.h>
> +
> +	.macro	pr_dbg	cpu, string
> +#if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> +	b	1901f
> +1902:	.ascii	"CPU 0: \0CPU 1: \0CPU 2: \0CPU 3: \0"
> +	.ascii	"CPU 4: \0CPU 5: \0CPU 6: \0CPU 7: \0"
> +1903:	.asciz	"\string"
> +	.align
> +1901:	adr	r0, 1902b
> +	add	r0, r0, \cpu, lsl #3
> +	bl	printascii
> +	adr	r0, 1903b
> +	bl	printascii
> +#endif
> +	.endm
> +
> +	.arm
> +	.align
> +
> +ENTRY(bL_entry_point)
> +
> + THUMB(	adr	r12, BSYM(1f)	)
> + THUMB(	bx	r12		)
> + THUMB(	.thumb			)
> +1:
> +	mrc	p15, 0, r0, c0, c0, 5
> +	ubfx	r9, r0, #0, #4			@ r9 = cpu
> +	ubfx	r10, r0, #8, #4			@ r10 = cluster
> +	mov	r3, #BL_CPUS_PER_CLUSTER
> +	mla	r4, r3, r10, r9			@ r4 = canonical CPU index
> +	cmp	r4, #(BL_CPUS_PER_CLUSTER * BL_NR_CLUSTERS)
> +	blo	2f
> +
> +	/* We didn't expect this CPU.  Try to make it quiet. */
> +1:	wfi
> +	wfe

Why do you need a wfi followed by a wfe?
Just curious.

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-10  0:20 ` [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
  2013-01-10 23:08   ` Will Deacon
@ 2013-01-11 17:26   ` Santosh Shilimkar
  2013-01-11 18:33     ` Nicolas Pitre
  1 sibling, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 17:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> This is the basic API used to handle the powering up/down of individual
> CPUs in a big.LITTLE system.  The platform specific backend implementation
> has the responsibility to handle the cluster level power as well when
> the first/last CPU in a cluster is brought up/down.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>   arch/arm/common/bL_entry.c      | 88 +++++++++++++++++++++++++++++++++++++++
>   arch/arm/include/asm/bL_entry.h | 92 +++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 180 insertions(+)
>
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> index 80fff49417..41de0622de 100644
> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -11,11 +11,13 @@
>
>   #include <linux/kernel.h>
>   #include <linux/init.h>
> +#include <linux/irqflags.h>
>
>   #include <asm/bL_entry.h>
>   #include <asm/barrier.h>
>   #include <asm/proc-fns.h>
>   #include <asm/cacheflush.h>
> +#include <asm/idmap.h>
>
>   extern volatile unsigned long bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
>
> @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
>   	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
>   			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
>   }
> +
> +static const struct bL_platform_power_ops *platform_ops;
> +
> +int __init bL_platform_power_register(const struct bL_platform_power_ops *ops)
> +{
> +	if (platform_ops)
> +		return -EBUSY;
> +	platform_ops = ops;
> +	return 0;
> +}
> +
> +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> +{
> +	if (!platform_ops)
> +		return -EUNATCH;
> +	might_sleep();
> +	return platform_ops->power_up(cpu, cluster);
> +}
> +
> +typedef void (*phys_reset_t)(unsigned long);
> +
> +void bL_cpu_power_down(void)
> +{
> +	phys_reset_t phys_reset;
> +
> +	BUG_ON(!platform_ops);
> +	BUG_ON(!irqs_disabled());
> +
> +	/*
> +	 * Do this before calling into the power_down method,
> +	 * as it might not always be safe to do afterwards.
> +	 */
> +	setup_mm_for_reboot();
> +
> +	platform_ops->power_down();
> +
> +	/*
> +	 * It is possible for a power_up request to happen concurrently
> +	 * with a power_down request for the same CPU. In this case the
> +	 * power_down method might not be able to actually enter a
> +	 * powered down state with the WFI instruction if the power_up
> +	 * method has removed the required reset condition.  The
> +	 * power_down method is then allowed to return. We must perform
> +	 * a re-entry in the kernel as if the power_up method just had
> +	 * deasserted reset on the CPU.
> +	 *
> +	 * To simplify race issues, the platform specific implementation
> +	 * must accommodate for the possibility of unordered calls to
> +	 * power_down and power_up with a usage count. Therefore, if a
> +	 * call to power_up is issued for a CPU that is not down, then
> +	 * the next call to power_down must not attempt a full shutdown
> +	 * but only do the minimum (normally disabling L1 cache and CPU
> +	 * coherency) and return just as if a concurrent power_up request
> +	 * had happened as described above.
> +	 */
> +
> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> +	phys_reset(virt_to_phys(bL_entry_point));
> +
> +	/* should never get here */
> +	BUG();
> +}
> +
> +void bL_cpu_suspend(u64 expected_residency)
> +{
> +	phys_reset_t phys_reset;
> +
> +	BUG_ON(!platform_ops);
> +	BUG_ON(!irqs_disabled());
> +
> +	/* Very similar to bL_cpu_power_down() */
> +	setup_mm_for_reboot();
> +	platform_ops->suspend(expected_residency);
> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> +	phys_reset(virt_to_phys(bL_entry_point));
> +	BUG();
>
I might be missing the rationale behind not having a recovery path for
CPUs entering suspend if they actually end up here because of some event.
This is quite possible in many scenarios, and hence letting the CPU come
out of suspend should be possible.  Maybe the switcher code doesn't have
such a requirement, but it seemed a bit off to me.
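
FWIW, the use-count rule spelled out in the power_down comment above
boils down to something like this in a platform backend (just a sketch
of the idea, not the actual vexpress code from this series; the
cpu/cluster arguments and the locking are illustrative only):

	static int bl_use_count[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
	static DEFINE_SPINLOCK(bl_use_count_lock);

	/* the power_up method would take the lock and increment the count */

	static void example_power_down(unsigned int cpu, unsigned int cluster)
	{
		bool really_down;

		spin_lock(&bl_use_count_lock);
		really_down = (--bl_use_count[cluster][cpu] == 0);
		spin_unlock(&bl_use_count_lock);

		if (really_down) {
			/* full teardown: clean caches, exit coherency, WFI */
		} else {
			/*
			 * A power_up request raced with us: disable the L1
			 * and coherency only, then return so the caller
			 * re-enters the kernel through bL_entry_point.
			 */
		}
	}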

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
                     ` (2 preceding siblings ...)
  2013-01-10 23:13   ` Will Deacon
@ 2013-01-11 17:46   ` Santosh Shilimkar
  2013-01-11 18:07     ` Dave Martin
  2013-01-14 17:08   ` Dave Martin
  4 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 17:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
>
> This provides helper methods to coordinate between CPUs coming down
> and CPUs going up, as well as documentation on the used algorithms,
> so that cluster teardown and setup
> operations are not done for a cluster simultaneously.
>
> For use in the power_down() implementation:
>    * __bL_cpu_going_down(unsigned int cluster, unsigned int cpu)
>    * __bL_outbound_enter_critical(unsigned int cluster)
>    * __bL_outbound_leave_critical(unsigned int cluster)
>    * __bL_cpu_down(unsigned int cluster, unsigned int cpu)
>
> The power_up_setup() helper should do platform-specific setup in
> preparation for turning the CPU on, such as invalidating local caches
> or entering coherency.  It must be assembler for now, since it must
> run before the MMU can be switched on.  It is passed the affinity level
> which should be initialized.
>
> Because the bL_cluster_sync_struct content is looked-up and modified
> with the cache enabled or disabled depending on the code path, it is
> crucial to always ensure proper cache maintenance to update main memory
> right away.  Therefore, any cached write must be followed by a cache clean
> operation and any cached read must be preceded by a cache invalidate
> operation on the accessed memory.
>
> To avoid races where a reader would invalidate the cache and discard the
> latest update from a writer before that writer had a chance to clean it
> to RAM, we simply use cache flush (clean+invalidate) operations
> everywhere.
>
> Also, in order to prevent a cached writer from interfering with an
> adjacent non-cached writer, we ensure each state variable is located to
> a separate cache line.
>
> Thanks to Nicolas Pitre and Achin Gupta for the help with this
> patch.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> ---
>   .../arm/big.LITTLE/cluster-pm-race-avoidance.txt   | 498 +++++++++++++++++++++
>   arch/arm/common/bL_entry.c                         | 160 +++++++
>   arch/arm/common/bL_head.S                          |  88 +++-
>   arch/arm/include/asm/bL_entry.h                    |  62 +++
>   4 files changed, 806 insertions(+), 2 deletions(-)
>   create mode 100644 Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
>
> diff --git a/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> new file mode 100644
> index 0000000000..d6151e0235
> --- /dev/null
> +++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> @@ -0,0 +1,498 @@
> +Big.LITTLE cluster Power-up/power-down race avoidance algorithm
> +===============================================================
> +
> +This file documents the algorithm which is used to coordinate CPU and
> +cluster setup and teardown operations and to manage hardware coherency
> +controls safely.
> +
> +The section "Rationale" explains what the algorithm is for and why it is
> +needed.  "Basic model" explains general concepts using a simplified view
> +of the system.  The other sections explain the actual details of the
> +algorithm in use.
> +
> +
> +Rationale
> +---------
> +
> +In a system containing multiple CPUs, it is desirable to have the
> +ability to turn off individual CPUs when the system is idle, reducing
> +power consumption and thermal dissipation.
> +
> +In a system containing multiple clusters of CPUs, it is also desirable
> +to have the ability to turn off entire clusters.
> +
> +Turning entire clusters off and on is a risky business, because it
> +involves performing potentially destructive operations affecting a group
> +of independently running CPUs, while the OS continues to run.  This
> +means that we need some coordination in order to ensure that critical
> +cluster-level operations are only performed when it is truly safe to do
> +so.
> +
> +Simple locking may not be sufficient to solve this problem, because
> +mechanisms like Linux spinlocks may rely on coherency mechanisms which
> +are not immediately enabled when a cluster powers up.  Since enabling or
> +disabling those mechanisms may itself be a non-atomic operation (such as
> +writing some hardware registers and invalidating large caches), other
> +methods of coordination are required in order to guarantee safe
> +power-down and power-up at the cluster level.
> +
> +The mechanism presented in this document describes a coherent memory
> +based protocol for performing the needed coordination.  It aims to be as
> +lightweight as possible, while providing the required safety properties.
> +
> +
> +Basic model
> +-----------
> +
> +Each cluster and CPU is assigned a state, as follows:
> +
> +	DOWN
> +	COMING_UP
> +	UP
> +	GOING_DOWN
> +
> +	    +---------> UP ----------+
> +	    |                        v
> +
> +	COMING_UP                GOING_DOWN
> +
> +	    ^                        |
> +	    +--------- DOWN <--------+
> +
> +
> +DOWN:	The CPU or cluster is not coherent, and is either powered off or
> +	suspended, or is ready to be powered off or suspended.
> +
> +COMING_UP: The CPU or cluster has committed to moving to the UP state.
> +	It may be part way through the process of initialisation and
> +	enabling coherency.
> +
> +UP:	The CPU or cluster is active and coherent at the hardware
> +	level.  A CPU in this state is not necessarily being used
> +	actively by the kernel.
> +
> +GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
> +	state.  It may be part way through the process of teardown and
> +	coherency exit.
> +
> +
> +Each CPU has one of these states assigned to it at any point in time.
> +The CPU states are described in the "CPU state" section, below.
> +
> +Each cluster is also assigned a state, but it is necessary to split the
> +state value into two parts (the "cluster" state and "inbound" state) and
> +to introduce additional states in order to avoid races between different
> +CPUs in the cluster simultaneously modifying the state.  The cluster-
> +level states are described in the "Cluster state" section.
> +
> +To help distinguish the CPU states from cluster states in this
> +discussion, the state names are given a CPU_ prefix for the CPU states,
> +and a CLUSTER_ or INBOUND_ prefix for the cluster states.
> +
> +
> +CPU state
> +---------
> +
> +In this algorithm, each individual core in a multi-core processor is
> +referred to as a "CPU".  CPUs are assumed to be single-threaded:
> +therefore, a CPU can only be doing one thing at a single point in time.
> +
> +This means that CPUs fit the basic model closely.
> +
> +The algorithm defines the following states for each CPU in the system:
> +
> +	CPU_DOWN
> +	CPU_COMING_UP
> +	CPU_UP
> +	CPU_GOING_DOWN
> +
> +	 cluster setup and
> +	CPU setup complete          policy decision
> +	      +-----------> CPU_UP ------------+
> +	      |                                v
> +
> +	CPU_COMING_UP                   CPU_GOING_DOWN
> +
> +	      ^                                |
> +	      +----------- CPU_DOWN <----------+
> +	 policy decision           CPU teardown complete
> +	or hardware event
> +
> +
> +The definitions of the four states correspond closely to the states of
> +the basic model.
> +
> +Transitions between states occur as follows.
> +
> +A trigger event (spontaneous) means that the CPU can transition to the
> +next state as a result of making local progress only, with no
> +requirement for any external event to happen.
> +
> +
> +CPU_DOWN:
> +
> +	A CPU reaches the CPU_DOWN state when it is ready for
> +	power-down.  On reaching this state, the CPU will typically
> +	power itself down or suspend itself, via a WFI instruction or a
> +	firmware call.
> +
> +	Next state:	CPU_COMING_UP
> +	Conditions:	none
> +
> +	Trigger events:
> +
> +		a) an explicit hardware power-up operation, resulting
> +		   from a policy decision on another CPU;
> +
> +		b) a hardware event, such as an interrupt.
> +
> +
> +CPU_COMING_UP:
> +
> +	A CPU cannot start participating in hardware coherency until the
> +	cluster is set up and coherent.  If the cluster is not ready,
> +	then the CPU will wait in the CPU_COMING_UP state until the
> +	cluster has been set up.
> +
> +	Next state:	CPU_UP
> +	Conditions:	The CPU's parent cluster must be in CLUSTER_UP.
> +	Trigger events:	Transition of the parent cluster to CLUSTER_UP.
> +
> +	Refer to the "Cluster state" section for a description of the
> +	CLUSTER_UP state.
> +
> +
> +CPU_UP:
> +	When a CPU reaches the CPU_UP state, it is safe for the CPU to
> +	start participating in local coherency.
> +
> +	This is done by jumping to the kernel's CPU resume code.
> +
> +	Note that the definition of this state is slightly different
> +	from the basic model definition: CPU_UP does not mean that the
> +	CPU is coherent yet, but it does mean that it is safe to resume
> +	the kernel.  The kernel handles the rest of the resume
> +	procedure, so the remaining steps are not visible as part of the
> +	race avoidance algorithm.
> +
> +	The CPU remains in this state until an explicit policy decision
> +	is made to shut down or suspend the CPU.
> +
> +	Next state:	CPU_GOING_DOWN
> +	Conditions:	none
> +	Trigger events:	explicit policy decision
> +
> +
> +CPU_GOING_DOWN:
> +
> +	While in this state, the CPU exits coherency, including any
> +	operations required to achieve this (such as cleaning data
> +	caches).
> +
> +	Next state:	CPU_DOWN
> +	Conditions:	local CPU teardown complete
> +	Trigger events:	(spontaneous)
> +
> +
> +Cluster state
> +-------------
> +
> +A cluster is a group of connected CPUs with some common resources.
> +Because a cluster contains multiple CPUs, it can be doing multiple
> +things at the same time.  This has some implications.  In particular, a
> +CPU can start up while another CPU is tearing the cluster down.
> +
> +In this discussion, the "outbound side" is the view of the cluster state
> +as seen by a CPU tearing the cluster down.  The "inbound side" is the
> +view of the cluster state as seen by a CPU setting the cluster up.
> +
> +In order to enable safe coordination in such situations, it is important
> +that a CPU which is setting up the cluster can advertise its state
> +independently of the CPU which is tearing down the cluster.  For this
> +reason, the cluster state is split into two parts:
> +
> +	"cluster" state: The global state of the cluster; or the state
> +		on the outbound side:
> +
> +		CLUSTER_DOWN
> +		CLUSTER_UP
> +		CLUSTER_GOING_DOWN
> +
> +	"inbound" state: The state of the cluster on the inbound side.
> +
> +		INBOUND_NOT_COMING_UP
> +		INBOUND_COMING_UP
> +
> +
> +	The different pairings of these states result in six possible
> +	states for the cluster as a whole:
> +
> +	                            CLUSTER_UP
> +	          +==========> INBOUND_NOT_COMING_UP -------------+
> +	          #                                               |
> +	                                                          |
> +	     CLUSTER_UP     <----+                                |
> +	  INBOUND_COMING_UP      |                                v
> +
> +	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
> +	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
> +
> +	    CLUSTER_DOWN         |                                |
> +	  INBOUND_COMING_UP <----+                                |
> +	                                                          |
> +	          ^                                               |
> +	          +===========     CLUSTER_DOWN      <------------+
> +	                       INBOUND_NOT_COMING_UP
> +
> +	Transitions -----> can only be made by the outbound CPU, and
> +	only involve changes to the "cluster" state.
> +
> +	Transitions ===##> can only be made by the inbound CPU, and only
> +	involve changes to the "inbound" state, except where there is no
> +	further transition possible on the outbound side (i.e., the
> +	outbound CPU has put the cluster into the CLUSTER_DOWN state).
> +
> +	The race avoidance algorithm does not provide a way to determine
> +	which exact CPUs within the cluster play these roles.  This must
> +	be decided in advance by some other means.  Refer to the section
> +	"Last man and first man selection" for more explanation.
> +
> +
> +	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
> +	cluster can actually be powered down.
> +
> +	The parallelism of the inbound and outbound CPUs is observed by
> +	the existence of two different paths from CLUSTER_GOING_DOWN/
> +	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
> +	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
> +	COMING_UP in the basic model).  The second path avoids cluster
> +	teardown completely.
> +
> +	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
> +	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
> +	is trivial and merely resets the state machine ready for the
> +	next cycle.
> +
> +	Details of the allowable transitions follow.
> +
> +	The next state in each case is notated
> +
> +		<cluster state>/<inbound state> (<transitioner>)
> +
> +	where the <transitioner> is the side on which the transition
> +	can occur; either the inbound or the outbound side.
> +
> +
> +CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
> +
> +	Next state:	CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
> +	Conditions:	none
> +	Trigger events:
> +
> +		a) an explicit hardware power-up operation, resulting
> +		   from a policy decision on another CPU;
> +
> +		b) a hardware event, such as an interrupt.
> +
> +
> +CLUSTER_DOWN/INBOUND_COMING_UP:
> +
> +	In this state, an inbound CPU sets up the cluster, including
> +	enabling of hardware coherency at the cluster level and any
> +	other operations (such as cache invalidation) which are required
> +	in order to achieve this.
> +
> +	The purpose of this state is to do sufficient cluster-level
> +	setup to enable other CPUs in the cluster to enter coherency
> +	safely.
> +
> +	Next state:	CLUSTER_UP/INBOUND_COMING_UP (inbound)
> +	Conditions:	cluster-level setup and hardware coherency complete
> +	Trigger events:	(spontaneous)
> +
> +
> +CLUSTER_UP/INBOUND_COMING_UP:
> +
> +	Cluster-level setup is complete and hardware coherency is
> +	enabled for the cluster.  Other CPUs in the cluster can safely
> +	enter coherency.
> +
> +	This is a transient state, leading immediately to
> +	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
> +	should treat these two states as equivalent.
> +
> +	Next state:	CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
> +	Conditions:	none
> +	Trigger events:	(spontaneous)
> +
> +
> +CLUSTER_UP/INBOUND_NOT_COMING_UP:
> +
> +	Cluster-level setup is complete and hardware coherency is
> +	enabled for the cluster.  Other CPUs in the cluster can safely
> +	enter coherency.
> +
> +	The cluster will remain in this state until a policy decision is
> +	made to power the cluster down.
> +
> +	Next state:	CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
> +	Conditions:	none
> +	Trigger events:	policy decision to power down the cluster
> +
> +
> +CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
> +
> +	An outbound CPU is tearing the cluster down.  The selected CPU
> +	must wait in this state until all CPUs in the cluster are in the
> +	CPU_DOWN state.
> +
> +	When all CPUs are in the CPU_DOWN state, the cluster can be torn
> +	down, for example by cleaning data caches and exiting
> +	cluster-level coherency.
> +
> +	To avoid wasteful unnecessary teardown operations, the outbound
> +	should check the inbound cluster state for asynchronous
> +	transitions to INBOUND_COMING_UP.  Alternatively, individual
> +	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
> +
> +
> +	Next states:
> +
> +	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
> +		Conditions:	cluster torn down and ready to power off
> +		Trigger events:	(spontaneous)
> +
> +	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
> +		Conditions:	none
> +		Trigger events:
> +
> +			a) an explicit hardware power-up operation,
> +			   resulting from a policy decision on another
> +			   CPU;
> +
> +			b) a hardware event, such as an interrupt.
> +
> +
> +CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
> +
> +	The cluster is (or was) being torn down, but another CPU has
> +	come online in the meantime and is trying to set up the cluster
> +	again.
> +
> +	If the outbound CPU observes this state, it has two choices:
> +
> +		a) back out of teardown, restoring the cluster to the
> +		   CLUSTER_UP state;
> +
> +		b) finish tearing the cluster down and put the cluster
> +		   in the CLUSTER_DOWN state; the inbound CPU will
> +		   set up the cluster again from there.
> +
> +	Choice (a) permits the removal of some latency by avoiding
> +	unnecessary teardown and setup operations in situations where
> +	the cluster is not really going to be powered down.
> +
> +
> +	Next states:
> +
> +	CLUSTER_UP/INBOUND_COMING_UP (outbound)
> +		Conditions:	cluster-level setup and hardware
> +				coherency complete
> +		Trigger events:	(spontaneous)
> +
> +	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
> +		Conditions:	cluster torn down and ready to power off
> +		Trigger events:	(spontaneous)
> +
> +
> +Last man and First man selection
> +--------------------------------
> +
> +The CPU which performs cluster tear-down operations on the outbound side
> +is commonly referred to as the "last man".
> +
> +The CPU which performs cluster setup on the inbound side is commonly
> +referred to as the "first man".
> +
> +The race avoidance algorithm documented above does not provide a
> +mechanism to choose which CPUs should play these roles.
> +
> +
> +Last man:
> +
> +When shutting down the cluster, all the CPUs involved are initially
> +executing Linux and hence coherent.  Therefore, ordinary spinlocks can
> +be used to select a last man safely, before the CPUs become
> +non-coherent.
> +
> +
> +First man:
> +
> +Because CPUs may power up asynchronously in response to external wake-up
> +events, a dynamic mechanism is needed to make sure that only one CPU
> +attempts to play the first man role and do the cluster-level
> +initialisation: any other CPUs must wait for this to complete before
> +proceeding.
> +
> +Cluster-level initialisation may involve actions such as configuring
> +coherency controls in the bus fabric.
> +
> +The current implementation in bL_head.S uses a separate mutual exclusion
> +mechanism to do this arbitration.  This mechanism is documented in
> +detail in vlocks.txt.
> +
> +
> +Features and Limitations
> +------------------------
> +
> +Implementation:
> +
> +	The current ARM-based implementation is split between
> +	arch/arm/common/bL_head.S (low-level inbound CPU operations) and
> +	arch/arm/common/bL_entry.c (everything else):
> +
> +	__bL_cpu_going_down() signals the transition of a CPU to the
> +		CPU_GOING_DOWN state.
> +
> +	__bL_cpu_down() signals the transition of a CPU to the CPU_DOWN
> +		state.
> +
> +	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
> +		low-level power-up code in bL_head.S.  This could
> +		involve CPU-specific setup code, but in the current
> +		implementation it does not.
> +
> +	__bL_outbound_enter_critical() and __bL_outbound_leave_critical()
> +		handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
> +		and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
> +		the case of an aborted cluster power-down).
> +
> +		These functions are more complex than the __bL_cpu_*()
> +		functions due to the extra inter-CPU coordination which
> +		is needed for safe transitions at the cluster level.
> +
> +	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
> +		the low-level power-up code in bL_head.S.  This
> +		typically involves platform-specific setup code,
> +		provided by the platform-specific power_up_setup
> +		function registered via bL_cluster_sync_init.
> +
> +Deep topologies:
> +
> +	As currently described and implemented, the algorithm does not
> +	support CPU topologies involving more than two levels (i.e.,
> +	clusters of clusters are not supported).  The algorithm could be
> +	extended by replicating the cluster-level states for the
> +	additional topological levels, and modifying the transition
> +	rules for the intermediate (non-outermost) cluster levels.
> +
> +
> +Colophon
> +--------
> +
> +Originally created and documented by Dave Martin for Linaro Limited, in
> +collaboration with Nicolas Pitre and Achin Gupta.
> +
Great write-up Dave!! I might have to do a couple more passes on it to
get the overall idea, but this documentation is surely a good start for
anybody reading/reviewing the big.LITTLE switcher code.

> +Copyright (C) 2012  Linaro Limited
> +Distributed under the terms of Version 2 of the GNU General Public
> +License, as defined in linux/COPYING.
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> index 41de0622de..1ea4ec9df0 100644
> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>   		platform_ops->powered_up();
>   	return 0;
>   }
> +
> +struct bL_sync_struct bL_sync;
> +
> +static void __sync_range(volatile void *p, size_t size)
> +{
> +	char *_p = (char *)p;
> +
> +	__cpuc_flush_dcache_area(_p, size);
> +	outer_flush_range(__pa(_p), __pa(_p + size));
> +	outer_sync();
> +}
> +
> +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> +
> +/*
/** as per kerneldoc.

> + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> + *    This must be called at the point of committing to teardown of a CPU.
> + *    The CPU cache (SCTRL.C bit) is expected to still be active.
> + */
> +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> +{
> +	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> +	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> +}
> +

[..]

> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> index 9d351f2b4c..f7a64ac127 100644
> --- a/arch/arm/common/bL_head.S
> +++ b/arch/arm/common/bL_head.S
> @@ -7,11 +7,19 @@
>    * This program is free software; you can redistribute it and/or modify
>    * it under the terms of the GNU General Public License version 2 as
>    * published by the Free Software Foundation.
> + *
> + *
> + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> + * for details of the synchronisation algorithms used here.
>    */
>
>   #include <linux/linkage.h>
>   #include <asm/bL_entry.h>
>
> +.if BL_SYNC_CLUSTER_CPUS
> +.error "cpus must be the first member of struct bL_cluster_sync_struct"
> +.endif
> +
>   	.macro	pr_dbg	cpu, string
>   #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
>   	b	1901f
> @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
>   2:	pr_dbg	r4, "kernel bL_entry_point\n"
>
>   	/*
> -	 * MMU is off so we need to get to bL_entry_vectors in a
> +	 * MMU is off so we need to get to various variables in a
>   	 * position independent way.
>   	 */
>   	adr	r5, 3f
> -	ldr	r6, [r5]
> +	ldmia	r5, {r6, r7, r8}
>   	add	r6, r5, r6			@ r6 = bL_entry_vectors
> +	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
> +	add	r8, r5, r8			@ r8 = bL_sync
> +
> +	mov	r0, #BL_SYNC_CLUSTER_SIZE
> +	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
> +
> +	@ Signal that this CPU is coming UP:
> +	mov	r0, #CPU_COMING_UP
> +	mov	r5, #BL_SYNC_CPU_SIZE
> +	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
> +	strb	r0, [r5]
> +
> +	dsb
Do you really need the above dsb()?  With the MMU off, the store should
make it to main memory anyway, no?

> +
> +	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> +	@ state, because there is at least one active CPU (this CPU).
> +
> +	@ Check if the cluster has been set up yet:
> +	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> +	cmp	r0, #CLUSTER_UP
> +	beq	cluster_already_up
> +
> +	@ Signal that the cluster is being brought up:
> +	mov	r0, #INBOUND_COMING_UP
> +	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> +
> +	dsb
Same comment.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-10  0:20 ` [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
@ 2013-01-11 18:02   ` Santosh Shilimkar
  2013-01-14 18:05     ` Achin Gupta
  2013-01-14 16:35   ` Will Deacon
  1 sibling, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> Now that the b.L power API is in place, we can use it for SMP secondary
> bringup and CPU hotplug in a generic fashion.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>   arch/arm/common/Makefile     |  2 +-
>   arch/arm/common/bL_platsmp.c | 79 ++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 80 insertions(+), 1 deletion(-)
>   create mode 100644 arch/arm/common/bL_platsmp.c
>
> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
> index 894c2ddf9b..59b36db7cc 100644
> --- a/arch/arm/common/Makefile
> +++ b/arch/arm/common/Makefile
> @@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>   obj-$(CONFIG_ARM_TIMER_SP804)	+= timer-sp.o
>   obj-$(CONFIG_FIQ_GLUE)		+= fiq_glue.o fiq_glue_setup.o
>   obj-$(CONFIG_FIQ_DEBUGGER)	+= fiq_debugger.o
> -obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o vlock.o
> +obj-$(CONFIG_BIG_LITTLE)	+= bL_head.o bL_entry.o bL_platsmp.o vlock.o
> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> new file mode 100644
> index 0000000000..0acb9f4685
> --- /dev/null
> +++ b/arch/arm/common/bL_platsmp.c
> @@ -0,0 +1,79 @@
> +/*
> + * linux/arch/arm/common/bL_platsmp.c
> + *
> + * Created by:  Nicolas Pitre, November 2012
> + * Copyright:   (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Code to handle secondary CPU bringup and hotplug for the bL power API.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/smp.h>
> +
> +#include <asm/bL_entry.h>
> +#include <asm/smp_plat.h>
> +#include <asm/hardware/gic.h>
> +
> +static void __init simple_smp_init_cpus(void)
> +{
> +	set_smp_cross_call(gic_raise_softirq);
> +}
> +
> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct task_struct *idle)
> +{
> +	unsigned int pcpu, pcluster, ret;
> +	extern void secondary_startup(void);
> +
> +	pcpu = cpu_logical_map(cpu) & 0xff;
> +	pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
> +	pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
> +		 __func__, cpu, pcpu, pcluster);
> +
> +	bL_set_entry_vector(pcpu, pcluster, NULL);
> +	ret = bL_cpu_power_up(pcpu, pcluster);
> +	if (ret)
> +		return ret;
> +	bL_set_entry_vector(pcpu, pcluster, secondary_startup);
> +	gic_raise_softirq(cpumask_of(cpu), 0);
> +	sev();
The softirq should be enough to wake a CPU if it is sitting in standby
in a wfe state.  Is the additional sev() needed here?

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-10  0:20 ` [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled Nicolas Pitre
@ 2013-01-11 18:07   ` Santosh Shilimkar
  2013-01-11 19:07     ` Nicolas Pitre
  2013-01-14 16:39   ` Will Deacon
  1 sibling, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> Otherwise there might be some interrupts or IPIs becoming pending and the
> CPU will not enter low power mode when doing a WFI.  The effect of this
> is a CPU that loops back into the kernel, goes through the first man
> election, signals itself as alive, and prevents the cluster from being
> shut down.
>
> This could benefit from a better solution.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>   arch/arm/common/bL_platsmp.c        | 1 +
>   arch/arm/common/gic.c               | 6 ++++++
>   arch/arm/include/asm/hardware/gic.h | 2 ++
>   3 files changed, 9 insertions(+)
>
> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> index 0ae44123bf..6a3b251b97 100644
> --- a/arch/arm/common/bL_platsmp.c
> +++ b/arch/arm/common/bL_platsmp.c
> @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
>   	pcpu = mpidr & 0xff;
>   	pcluster = (mpidr >> 8) & 0xff;
>   	bL_set_entry_vector(pcpu, pcluster, NULL);
> +	gic_cpu_if_down();

So for a case where the CPU still doesn't power down for some reason even
after the CPU interface is disabled, it cannot listen to any SGI or PPI.
Not sure if this happens on big.LITTLE, but I have seen one such issue
on a Cortex-A9 based SoC.

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-11 17:46   ` Santosh Shilimkar
@ 2013-01-11 18:07     ` Dave Martin
  2013-01-11 18:34       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-11 18:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 11:16:18PM +0530, Santosh Shilimkar wrote:

[...]

> >+Originally created and documented by Dave Martin for Linaro Limited, in
> >+collaboration with Nicolas Pitre and Achin Gupta.
> >+
> Great write-up, Dave!! I might have to do a couple more passes on it to
> get the overall idea, but surely this documentation is a good start for
> anybody reading/reviewing the big.LITTLE switcher code.

Thanks for reading through it.  Partly, this was insurance against me
forgetting how the code worked in between writing and posting it...
but this is all quite subtle code, so it felt important to document
it thoroughly.

> 
> >+Copyright (C) 2012  Linaro Limited
> >+Distributed under the terms of Version 2 of the GNU General Public
> >+License, as defined in linux/COPYING.
> >diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> >index 41de0622de..1ea4ec9df0 100644
> >--- a/arch/arm/common/bL_entry.c
> >+++ b/arch/arm/common/bL_entry.c
> >@@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
> >  		platform_ops->powered_up();
> >  	return 0;
> >  }
> >+
> >+struct bL_sync_struct bL_sync;
> >+
> >+static void __sync_range(volatile void *p, size_t size)
> >+{
> >+	char *_p = (char *)p;
> >+
> >+	__cpuc_flush_dcache_area(_p, size);
> >+	outer_flush_range(__pa(_p), __pa(_p + size));
> >+	outer_sync();
> >+}
> >+
> >+#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> >+
> >+/*
> /** as per kerneldoc.

Does kerneldoc not require the comment to be specially formatted?

I haven't played with that, so far.

> 
> >+ * __bL_cpu_going_down: Indicates that the cpu is being torn down.
> >+ *    This must be called at the point of committing to teardown of a CPU.
> >+ *    The CPU cache (SCTRL.C bit) is expected to still be active.
> >+ */
> >+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
> >+{
> >+	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
> >+	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
> >+}
> >+
> 
> [..]
> 
> >diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
> >index 9d351f2b4c..f7a64ac127 100644
> >--- a/arch/arm/common/bL_head.S
> >+++ b/arch/arm/common/bL_head.S
> >@@ -7,11 +7,19 @@
> >   * This program is free software; you can redistribute it and/or modify
> >   * it under the terms of the GNU General Public License version 2 as
> >   * published by the Free Software Foundation.
> >+ *
> >+ *
> >+ * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
> >+ * for details of the synchronisation algorithms used here.
> >   */
> >
> >  #include <linux/linkage.h>
> >  #include <asm/bL_entry.h>
> >
> >+.if BL_SYNC_CLUSTER_CPUS
> >+.error "cpus must be the first member of struct bL_cluster_sync_struct"
> >+.endif
> >+
> >  	.macro	pr_dbg	cpu, string
> >  #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
> >  	b	1901f
> >@@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
> >  2:	pr_dbg	r4, "kernel bL_entry_point\n"
> >
> >  	/*
> >-	 * MMU is off so we need to get to bL_entry_vectors in a
> >+	 * MMU is off so we need to get to various variables in a
> >  	 * position independent way.
> >  	 */
> >  	adr	r5, 3f
> >-	ldr	r6, [r5]
> >+	ldmia	r5, {r6, r7, r8}
> >  	add	r6, r5, r6			@ r6 = bL_entry_vectors
> >+	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
> >+	add	r8, r5, r8			@ r8 = bL_sync
> >+
> >+	mov	r0, #BL_SYNC_CLUSTER_SIZE
> >+	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
> >+
> >+	@ Signal that this CPU is coming UP:
> >+	mov	r0, #CPU_COMING_UP
> >+	mov	r5, #BL_SYNC_CPU_SIZE
> >+	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
> >+	strb	r0, [r5]
> >+
> >+	dsb
> Do you really need the above dsb()? With MMU off, the store should

The short answer is "maybe not".  Some of the barriers can be
eliminated; some can be demoted to DSBs.  Others may be required but
unnecessarily duplicated e.g., between bL_head.S and vlock.S.

> anyway make it to the main memory, no?

Yes, but this raises issues about precisely what the architecture
guarantees about memory ordering in these scenarios.  The only obvious
thing about that is that it's non-obvious.

Strongly-Ordered memory is not quite the same as having explicit
barriers everywhere.

I need to have a careful think, but it should be possible to optimise
a bit here.

> 
> >+
> >+	@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
> >+	@ state, because there is at least one active CPU (this CPU).
> >+
> >+	@ Check if the cluster has been set up yet:
> >+	ldrb	r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
> >+	cmp	r0, #CLUSTER_UP
> >+	beq	cluster_already_up
> >+
> >+	@ Signal that the cluster is being brought up:
> >+	mov	r0, #INBOUND_COMING_UP
> >+	strb	r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
> >+
> >+	dsb
> Same comment.

Same answer... for now

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-11 17:16   ` Santosh Shilimkar
@ 2013-01-11 18:10     ` Nicolas Pitre
  2013-01-11 18:30       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 18:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 11 Jan 2013, Santosh Shilimkar wrote:

> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > +ENTRY(bL_entry_point)
> > +
> > + THUMB(	adr	r12, BSYM(1f)	)
> > + THUMB(	bx	r12		)
> > + THUMB(	.thumb			)
> > +1:
> > +	mrc	p15, 0, r0, c0, c0, 5
> > +	ubfx	r9, r0, #0, #4			@ r9 = cpu
> > +	ubfx	r10, r0, #8, #4			@ r10 = cluster
> > +	mov	r3, #BL_CPUS_PER_CLUSTER
> > +	mla	r4, r3, r10, r9			@ r4 = canonical CPU index
> > +	cmp	r4, #(BL_CPUS_PER_CLUSTER * BL_NR_CLUSTERS)
> > +	blo	2f
> > +
> > +	/* We didn't expect this CPU.  Try to make it quiet. */
> > +1:	wfi
> > +	wfe
> 
> Why do you need a wfe followed by wfi?
> Just curious.

If the WFI doesn't work because an interrupt is pending, then the WFE
might work better.  But as I mentioned before, this is not intended to
be used for any purpose other than the "we're really screwed so at least
let's try to cheaply quieten this CPU" case.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 10/16] ARM: vexpress: introduce DCSCB support
  2013-01-10  0:20 ` [PATCH 10/16] ARM: vexpress: introduce DCSCB support Nicolas Pitre
@ 2013-01-11 18:12   ` Santosh Shilimkar
  2013-01-11 19:13     ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> This adds basic CPU and cluster reset controls on RTSM for the
> A15x4-A7x4 model configuration using the Dual Cluster System
> Configuration Block (DCSCB).
>
> The cache coherency interconnect (CCI) is not handled yet.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>   arch/arm/mach-vexpress/Kconfig  |   8 ++
>   arch/arm/mach-vexpress/Makefile |   1 +
>   arch/arm/mach-vexpress/dcscb.c  | 160 ++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 169 insertions(+)
>   create mode 100644 arch/arm/mach-vexpress/dcscb.c
>
> diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
> index 99e63f5f99..e55c02562f 100644
> --- a/arch/arm/mach-vexpress/Kconfig
> +++ b/arch/arm/mach-vexpress/Kconfig
> @@ -53,4 +53,12 @@ config ARCH_VEXPRESS_CORTEX_A5_A9_ERRATA
>   config ARCH_VEXPRESS_CA9X4
>   	bool "Versatile Express Cortex-A9x4 tile"
>
> +config ARCH_VEXPRESS_DCSCB
> +	bool "Dual Cluster System Control Block (DCSCB) support"
> +	depends on BIG_LITTLE
> +	help
> +	  Support for the Dual Cluster System Configuration Block (DCSCB).
> +	  This is needed to provide CPU and cluster power management
> +	  on RTSM.
> +
>   endmenu
> diff --git a/arch/arm/mach-vexpress/Makefile b/arch/arm/mach-vexpress/Makefile
> index 80b64971fb..2253644054 100644
> --- a/arch/arm/mach-vexpress/Makefile
> +++ b/arch/arm/mach-vexpress/Makefile
> @@ -6,5 +6,6 @@ ccflags-$(CONFIG_ARCH_MULTIPLATFORM) := -I$(srctree)/$(src)/include \
>
>   obj-y					:= v2m.o reset.o
>   obj-$(CONFIG_ARCH_VEXPRESS_CA9X4)	+= ct-ca9x4.o
> +obj-$(CONFIG_ARCH_VEXPRESS_DCSCB)	+= dcscb.o
>   obj-$(CONFIG_SMP)			+= platsmp.o
>   obj-$(CONFIG_HOTPLUG_CPU)		+= hotplug.o
> diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
> new file mode 100644
> index 0000000000..cccd943cd4
> --- /dev/null
> +++ b/arch/arm/mach-vexpress/dcscb.c
> @@ -0,0 +1,160 @@
> +/*
> + * arch/arm/mach-vexpress/dcscb.c - Dual Cluster System Control Block
> + *
> + * Created by:	Nicolas Pitre, May 2012
> + * Copyright:	(C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/io.h>
> +#include <linux/spinlock.h>
> +#include <linux/errno.h>
> +#include <linux/vexpress.h>
> +
> +#include <asm/bL_entry.h>
> +#include <asm/proc-fns.h>
> +#include <asm/cacheflush.h>
> +
> +
> +#define DCSCB_PHYS_BASE	0x60000000
> +
> +#define RST_HOLD0	0x0
> +#define RST_HOLD1	0x4
> +#define SYS_SWRESET	0x8
> +#define RST_STAT0	0xc
> +#define RST_STAT1	0x10
> +#define EAG_CFG_R	0x20
> +#define EAG_CFG_W	0x24
> +#define KFC_CFG_R	0x28
> +#define KFC_CFG_W	0x2c
> +#define DCS_CFG_R	0x30
> +
> +/*
> + * We can't use regular spinlocks. In the switcher case, it is possible
> + * for an outbound CPU to call power_down() after its inbound counterpart
> + * is already live using the same logical CPU number which trips lockdep
> + * debugging.
> + */
> +static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
> +
> +static void __iomem *dcscb_base;
> +
> +static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
> +{
> +	unsigned int rst_hold, cpumask = (1 << cpu);
> +
> +	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
> +	if (cpu >= 4 || cluster >= 2)
> +		return -EINVAL;
> +
> +	/*
> +	 * Since this is called with IRQs enabled, and no arch_spin_lock_irq
> +	 * variant exists, we need to disable IRQs manually here.
> +	 */
> +	local_irq_disable();
> +	arch_spin_lock(&dcscb_lock);
> +
> +	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
> +	if (rst_hold & (1 << 8)) {
> +		/* remove cluster reset and add individual CPU's reset */
> +		rst_hold &= ~(1 << 8);
> +		rst_hold |= 0xf;
> +	}
> +	rst_hold &= ~(cpumask | (cpumask << 4));
> +	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
> +
> +	arch_spin_unlock(&dcscb_lock);
> +	local_irq_enable();
> +
> +	return 0;
> +}
> +
> +static void dcscb_power_down(void)
> +{
> +	unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
> +
> +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> +	cpu = mpidr & 0xff;
> +	cluster = (mpidr >> 8) & 0xff;
> +	cpumask = (1 << cpu);
> +
> +	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
> +	BUG_ON(cpu >= 4 || cluster >= 2);
> +
> +	arch_spin_lock(&dcscb_lock);
> +	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
> +	rst_hold |= cpumask;
> +	if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
> +		rst_hold |= (1 << 8);
> +	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
> +	arch_spin_unlock(&dcscb_lock);
> +	last_man = (rst_hold & (1 << 8));
> +
> +	/*
> +	 * Now let's clean our L1 cache and shut ourself down.
> +	 * If we're the last CPU in this cluster then clean L2 too.
> +	 */
> +
Did you want to have the C bit clearing code here?
> +	/*
> +	 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> +	 * a preliminary flush here for those CPUs.  At least, that's
> +	 * the theory -- without the extra flush, Linux explodes on
> +	 * RTSM (maybe not needed anymore, to be investigated)..
> +	 */
> +	flush_cache_louis();
> +	cpu_proc_fin();
> +
> +	if (!last_man) {
> +		flush_cache_louis();
> +	} else {
> +		flush_cache_all();
> +		outer_flush_all();
> +	}
> +
> +	/* Disable local coherency by clearing the ACTLR "SMP" bit: */
> +	asm volatile (
> +		"mrc	p15, 0, ip, c1, c0, 1 \n\t"
> +		"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
> +		"mcr	p15, 0, ip, c1, c0, 1"
> +		: : : "ip" );
> +
> +	/* Now we are prepared for power-down, do it: */
You need a dsb here, right?
> +	wfi();
> +
> +	/* Not dead at this point?  Let our caller cope. */
> +}
> +

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 13/16] drivers: misc: add ARM CCI support
  2013-01-10  0:20 ` [PATCH 13/16] drivers: misc: add ARM CCI support Nicolas Pitre
@ 2013-01-11 18:20   ` Santosh Shilimkar
  2013-01-11 19:22     ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>
> On ARM multi-cluster systems coherency between cores running on
> different clusters is managed by the cache-coherent interconnect (CCI).
> It allows broadcasting of TLB invalidates and memory barriers and it
> guarantees cache coherency at system level.
>
> This patch enables the basic infrastructure required in Linux to
> handle and programme the CCI component. The first implementation is
> based on a platform device, its relative DT compatible property and
> a simple programming interface.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>   drivers/misc/Kconfig    |   3 ++
>   drivers/misc/Makefile   |   1 +
>   drivers/misc/arm-cci.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++++
>   include/linux/arm-cci.h |  30 ++++++++++++++
How about 'drivers/bus/', considering CCI is an interconnect bus (though
for coherency)?

>   4 files changed, 141 insertions(+)
>   create mode 100644 drivers/misc/arm-cci.c
>   create mode 100644 include/linux/arm-cci.h
>
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index b151b7c1bd..30d5be1ad2 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -499,6 +499,9 @@ config USB_SWITCH_FSA9480
>   	  stereo and mono audio, video, microphone and UART data to use
>   	  a common connector port.
>
> +config ARM_CCI
You might want to add a dependency on ARM big.LITTLE, otherwise it will
break the build for other architectures with random configurations.

[..]

> diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
> new file mode 100644
> index 0000000000..f329c43099
> --- /dev/null
> +++ b/drivers/misc/arm-cci.c
> @@ -0,0 +1,107 @@
> +/*
> + * CCI support
> + *
> + * Copyright (C) 2012 ARM Ltd.
> + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/io.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +#include <linux/arm-cci.h>
> +
> +#define CCI400_EAG_OFFSET       0x4000
> +#define CCI400_KF_OFFSET        0x5000
> +
> +#define DRIVER_NAME	"CCI"
> +struct cci_drvdata {
> +	void __iomem *baseaddr;
> +	spinlock_t lock;
> +};
> +
> +static struct cci_drvdata *info;
> +
> +void disable_cci(int cluster)
> +{
> +	u32 cci_reg = cluster ? CCI400_KF_OFFSET : CCI400_EAG_OFFSET;
> +	writel_relaxed(0x0, info->baseaddr	+ cci_reg);
> +
> +	while (readl_relaxed(info->baseaddr + 0xc) & 0x1)
> +			;
> +}
> +EXPORT_SYMBOL_GPL(disable_cci);
> +
Is more functionality going to be added to the CCI driver?  Having this
much driver code for just a disable_cci() function seems like
overkill.

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-10  0:20 ` [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
  2013-01-10 12:05   ` Dave Martin
@ 2013-01-11 18:27   ` Santosh Shilimkar
  2013-01-11 19:28     ` Nicolas Pitre
  1 sibling, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
>
> Add the required code to properly handle race free platform coherency exit
> to the DCSCB power down method.
>
> The power_up_setup callback is used to enable the CCI interface for
> the cluster being brought up.  This must be done in assembly before
> the kernel environment is entered.
>
> Thanks to Achin Gupta and Nicolas Pitre for their help and
> contributions.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
[..]

> diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
> index 59b690376f..95a2d0df20 100644
> --- a/arch/arm/mach-vexpress/dcscb.c
> +++ b/arch/arm/mach-vexpress/dcscb.c
> @@ -15,6 +15,7 @@
>   #include <linux/spinlock.h>
>   #include <linux/errno.h>
>   #include <linux/vexpress.h>
> +#include <linux/arm-cci.h>
>
>   #include <asm/bL_entry.h>
>   #include <asm/proc-fns.h>
> @@ -104,6 +105,8 @@ static void dcscb_power_down(void)
>   	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
>   	BUG_ON(cpu >= 4 || cluster >= 2);
>
> +	__bL_cpu_going_down(cpu, cluster);
> +
>   	arch_spin_lock(&dcscb_lock);
>   	dcscb_use_count[cpu][cluster]--;
>   	if (dcscb_use_count[cpu][cluster] == 0) {
> @@ -111,6 +114,7 @@ static void dcscb_power_down(void)
>   		rst_hold |= cpumask;
>   		if (((rst_hold | (rst_hold >> 4)) & cluster_mask) == cluster_mask) {
>   			rst_hold |= (1 << 8);
> +			BUG_ON(__bL_cluster_state(cluster) != CLUSTER_UP);
>   			last_man = true;
>   		}
>   		writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
> @@ -124,35 +128,71 @@ static void dcscb_power_down(void)
>   		skip_wfi = true;
>   	} else
>   		BUG();
> -	arch_spin_unlock(&dcscb_lock);
>
> -	/*
> -	 * Now let's clean our L1 cache and shut ourself down.
> -	 * If we're the last CPU in this cluster then clean L2 too.
> -	 */
> -
> -	/*
> -	 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> -	 * a preliminary flush here for those CPUs.  At least, that's
> -	 * the theory -- without the extra flush, Linux explodes on
> -	 * RTSM (maybe not needed anymore, to be investigated)..
> -	 */
> -	flush_cache_louis();
> -	cpu_proc_fin();
> +	if (last_man && __bL_outbound_enter_critical(cpu, cluster)) {
> +		arch_spin_unlock(&dcscb_lock);
>
> -	if (!last_man) {
> -		flush_cache_louis();
> -	} else {
> +		/*
> +		 * Flush all cache levels for this cluster.
> +		 *
> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> +		 * a preliminary flush here for those CPUs.  At least, that's
> +		 * the theory -- without the extra flush, Linux explodes on
> +		 * RTSM (maybe not needed anymore, to be investigated).
> +		 */
>   		flush_cache_all();
> +		cpu_proc_fin(); /* disable allocation into internal caches*/
I see now. In the previous patch I missed the cpu_proc_fin(), which
clears the C bit.
> +		flush_cache_all();
> +
> +		/*
> +		 * This is a harmless no-op.  On platforms with a real
> +		 * outer cache this might either be needed or not,
> +		 * depending on where the outer cache sits.
> +		 */
>   		outer_flush_all();
> +
> +		/* Disable local coherency by clearing the ACTLR "SMP" bit: */
> +		asm volatile (
> +			"mrc	p15, 0, ip, c1, c0, 1 \n\t"
> +			"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
> +			"mcr	p15, 0, ip, c1, c0, 1 \n\t"
> +			"isb \n\t"
> +			"dsb"
> +			: : : "ip" );
> +
> +		/*
> +		 * Disable cluster-level coherency by masking
> +		 * incoming snoops and DVM messages:
> +		 */
> +		disable_cci(cluster);
> +
> +		__bL_outbound_leave_critical(cluster, CLUSTER_DOWN);
> +	} else {
> +		arch_spin_unlock(&dcscb_lock);
> +
> +		/*
> +		 * Flush the local CPU cache.
> +		 *
> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> +		 * a preliminary flush here for those CPUs.  At least, that's
> +		 * the theory -- without the extra flush, Linux explodes on
> +		 * RTSM (maybe not needed anymore, to be investigated).
> +		 */
This is expected if the entire code is not in one stack frame and the
additional flush is needed to avoid possible stack corruption. This
issue has been discussed in the past on the list.

> +		flush_cache_louis();
> +		cpu_proc_fin(); /* disable allocation into internal caches*/
> +		flush_cache_louis();
> +
> +		/* Disable local coherency by clearing the ACTLR "SMP" bit: */
> +		asm volatile (
> +			"mrc	p15, 0, ip, c1, c0, 1 \n\t"
> +			"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
> +			"mcr	p15, 0, ip, c1, c0, 1 \n\t"
> +			"isb \n\t"
> +			"dsb"
> +			: : : "ip" );
>   	}
>
> -	/* Disable local coherency by clearing the ACTLR "SMP" bit: */
> -	asm volatile (
> -		"mrc	p15, 0, ip, c1, c0, 1 \n\t"
> -		"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
> -		"mcr	p15, 0, ip, c1, c0, 1"
> -		: : : "ip" );
> +	__bL_cpu_down(cpu, cluster);
>
>   	/* Now we are prepared for power-down, do it: */
>   	if (!skip_wfi)

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-11 18:10     ` Nicolas Pitre
@ 2013-01-11 18:30       ` Santosh Shilimkar
  0 siblings, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 11 January 2013 11:40 PM, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>> +ENTRY(bL_entry_point)
>>> +
>>> + THUMB(	adr	r12, BSYM(1f)	)
>>> + THUMB(	bx	r12		)
>>> + THUMB(	.thumb			)
>>> +1:
>>> +	mrc	p15, 0, r0, c0, c0, 5
>>> +	ubfx	r9, r0, #0, #4			@ r9 = cpu
>>> +	ubfx	r10, r0, #8, #4			@ r10 = cluster
>>> +	mov	r3, #BL_CPUS_PER_CLUSTER
>>> +	mla	r4, r3, r10, r9			@ r4 = canonical CPU index
>>> +	cmp	r4, #(BL_CPUS_PER_CLUSTER * BL_NR_CLUSTERS)
>>> +	blo	2f
>>> +
>>> +	/* We didn't expect this CPU.  Try to make it quiet. */
>>> +1:	wfi
>>> +	wfe
>>
>> Why do you need a wfe followed by wfi?
>> Just curious.
>
> If the WFI doesn't work because an interrupt is pending, then the WFE
> might work better.  But as I mentioned before, this is not intended to
> be used for any purpose other than the "we're really screwed so at least
> let's try to cheaply quieten this CPU" case.
>
Thanks for clarification.

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-11 17:26   ` Santosh Shilimkar
@ 2013-01-11 18:33     ` Nicolas Pitre
  2013-01-11 18:41       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 18:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 11 Jan 2013, Santosh Shilimkar wrote:

> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > This is the basic API used to handle the powering up/down of individual
> > CPUs in a big.LITTLE system.  The platform specific backend implementation
> > has the responsibility to also handle the cluster level power as well when
> > the first/last CPU in a cluster is brought up/down.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >   arch/arm/common/bL_entry.c      | 88
> > +++++++++++++++++++++++++++++++++++++++
> >   arch/arm/include/asm/bL_entry.h | 92
> > +++++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 180 insertions(+)
> > 
> > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > index 80fff49417..41de0622de 100644
> > --- a/arch/arm/common/bL_entry.c
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -11,11 +11,13 @@
> > 
> >   #include <linux/kernel.h>
> >   #include <linux/init.h>
> > +#include <linux/irqflags.h>
> > 
> >   #include <asm/bL_entry.h>
> >   #include <asm/barrier.h>
> >   #include <asm/proc-fns.h>
> >   #include <asm/cacheflush.h>
> > +#include <asm/idmap.h>
> > 
> >   extern volatile unsigned long
> > bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > 
> > @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster,
> > void *ptr)
> >   	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> >   			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> >   }
> > +
> > +static const struct bL_platform_power_ops *platform_ops;
> > +
> > +int __init bL_platform_power_register(const struct bL_platform_power_ops
> > *ops)
> > +{
> > +	if (platform_ops)
> > +		return -EBUSY;
> > +	platform_ops = ops;
> > +	return 0;
> > +}
> > +
> > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> > +{
> > +	if (!platform_ops)
> > +		return -EUNATCH;
> > +	might_sleep();
> > +	return platform_ops->power_up(cpu, cluster);
> > +}
> > +
> > +typedef void (*phys_reset_t)(unsigned long);
> > +
> > +void bL_cpu_power_down(void)
> > +{
> > +	phys_reset_t phys_reset;
> > +
> > +	BUG_ON(!platform_ops);
> > +	BUG_ON(!irqs_disabled());
> > +
> > +	/*
> > +	 * Do this before calling into the power_down method,
> > +	 * as it might not always be safe to do afterwards.
> > +	 */
> > +	setup_mm_for_reboot();
> > +
> > +	platform_ops->power_down();
> > +
> > +	/*
> > +	 * It is possible for a power_up request to happen concurrently
> > +	 * with a power_down request for the same CPU. In this case the
> > +	 * power_down method might not be able to actually enter a
> > +	 * powered down state with the WFI instruction if the power_up
> > +	 * method has removed the required reset condition.  The
> > +	 * power_down method is then allowed to return. We must perform
> > +	 * a re-entry in the kernel as if the power_up method just had
> > +	 * deasserted reset on the CPU.
> > +	 *
> > +	 * To simplify race issues, the platform specific implementation
> > +	 * must accommodate for the possibility of unordered calls to
> > +	 * power_down and power_up with a usage count. Therefore, if a
> > +	 * call to power_up is issued for a CPU that is not down, then
> > +	 * the next call to power_down must not attempt a full shutdown
> > +	 * but only do the minimum (normally disabling L1 cache and CPU
> > +	 * coherency) and return just as if a concurrent power_up request
> > +	 * had happened as described above.
> > +	 */
> > +
> > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > +	phys_reset(virt_to_phys(bL_entry_point));
> > +
> > +	/* should never get here */
> > +	BUG();
> > +}
> > +
> > +void bL_cpu_suspend(u64 expected_residency)
> > +{
> > +	phys_reset_t phys_reset;
> > +
> > +	BUG_ON(!platform_ops);
> > +	BUG_ON(!irqs_disabled());
> > +
> > +	/* Very similar to bL_cpu_power_down() */
> > +	setup_mm_for_reboot();
> > +	platform_ops->suspend(expected_residency);
> > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > +	phys_reset(virt_to_phys(bL_entry_point));
> > +	BUG();
> > 
> I might be missing all the rationale behind not having a recovery path for
> CPUs entering suspend if they actually come here because of some event.
> This is pretty much possible in many scenarios, and hence letting the CPU
> come out of suspend should be possible. Maybe the switcher code doesn't
> have such a requirement, but it appeared a bit off to me.

There are two things to consider here:

1) The CPU is suspended.  CPU state is lost. Next interrupt to wake up
   the CPU will make it restart from the reset vector and re-entry in 
   the kernel will happen via bL_entry_point to deal with the various 
   cluster issues, to eventually resume kernel code via cpu_resume.  
   Obviously, the machine specific backend code would have set the
   bL_entry_point address in its machine specific reset vector in
   advance.

2) An interrupt comes along before the CPU is effectively suspended, say 
   right before the backend code executes a WFI to shut the CPU down.  
   The CPU and possibly cluster state was already set for being powered 
   off.  We cannot simply return at this point as caches are off, the 
   CPU is not coherent with the rest of the system anymore, etc.  So if 
   the platform specific backend ever returns, say because the final WFI 
   exited, then we have to go through the same arbitration process to 
   restore the CPU and cluster state as if that was a hard reset.  Hence 
   the cpu_reset call to loop back into bL_entry_point.

In either case, we simply cannot ever return from bL_cpu_suspend()
directly.  Of course, the caller is expected to have used 
bL_set_entry_vector() beforehand, most probably with cpu_resume as 
argument.
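
To make the expected call pattern concrete, here is a minimal, hypothetical
sketch of such a caller (the function names and the residency plumbing are
made up for illustration; bL_set_entry_vector()/bL_cpu_suspend() come from
this series, cpu_suspend()/cpu_resume are the existing kernel helpers):

#include <asm/bL_entry.h>
#include <asm/suspend.h>

/* Hypothetical sketch only -- not part of this series. */
static int example_bL_suspend_finisher(unsigned long expected_residency)
{
	unsigned int mpidr, cpu, cluster;

	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
	cpu = mpidr & 0xff;
	cluster = (mpidr >> 8) & 0xff;

	/*
	 * Both a normal wake-up through the reset vector and an aborted
	 * power down re-enter the kernel via bL_entry_point, which then
	 * jumps to whatever the entry vector points at -- here cpu_resume.
	 */
	bL_set_entry_vector(cpu, cluster, cpu_resume);

	bL_cpu_suspend(expected_residency);

	/* bL_cpu_suspend() never returns directly. */
	return -1;
}

static void example_enter_idle(unsigned long expected_residency)
{
	/* cpu_suspend() saves the CPU state, then runs the finisher. */
	cpu_suspend(expected_residency, example_bL_suspend_finisher);
}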


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-11 18:07     ` Dave Martin
@ 2013-01-11 18:34       ` Santosh Shilimkar
  0 siblings, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 11 January 2013 11:37 PM, Dave Martin wrote:
> On Fri, Jan 11, 2013 at 11:16:18PM +0530, Santosh Shilimkar wrote:
>
> [...]
>
>>> +Originally created and documented by Dave Martin for Linaro Limited, in
>>> +collaboration with Nicolas Pitre and Achin Gupta.
>>> +
>> Great write-up, Dave!! I might have to do a couple more passes on it to
>> get the overall idea, but surely this documentation is a good start for
>> anybody reading/reviewing the big.LITTLE switcher code.
>
> Thanks for reading through it.  Partly, this was insurance against me
> forgetting how the code worked in between writing and posting it...
> but this is all quite subtle code, so it felt important to document
> it thoroughly.
>
>>
>>> +Copyright (C) 2012  Linaro Limited
>>> +Distributed under the terms of Version 2 of the GNU General Public
>>> +License, as defined in linux/COPYING.
>>> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
>>> index 41de0622de..1ea4ec9df0 100644
>>> --- a/arch/arm/common/bL_entry.c
>>> +++ b/arch/arm/common/bL_entry.c
>>> @@ -116,3 +116,163 @@ int bL_cpu_powered_up(void)
>>>   		platform_ops->powered_up();
>>>   	return 0;
>>>   }
>>> +
>>> +struct bL_sync_struct bL_sync;
>>> +
>>> +static void __sync_range(volatile void *p, size_t size)
>>> +{
>>> +	char *_p = (char *)p;
>>> +
>>> +	__cpuc_flush_dcache_area(_p, size);
>>> +	outer_flush_range(__pa(_p), __pa(_p + size));
>>> +	outer_sync();
>>> +}
>>> +
>>> +#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
>>> +
>>> +/*
>> /** as per kerneldoc.
>
> Does kerneldoc not require the comment to be specially formatted?
>
> I haven't played with that, so far.
>
>>
>>> + * __bL_cpu_going_down: Indicates that the cpu is being torn down.
>>> + *    This must be called at the point of committing to teardown of a CPU.
>>> + *    The CPU cache (SCTRL.C bit) is expected to still be active.
>>> + */
>>> +void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
>>> +{
>>> +	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
>>> +	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
>>> +}
>>> +
>>
>> [..]
>>
>>> diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
>>> index 9d351f2b4c..f7a64ac127 100644
>>> --- a/arch/arm/common/bL_head.S
>>> +++ b/arch/arm/common/bL_head.S
>>> @@ -7,11 +7,19 @@
>>>    * This program is free software; you can redistribute it and/or modify
>>>    * it under the terms of the GNU General Public License version 2 as
>>>    * published by the Free Software Foundation.
>>> + *
>>> + *
>>> + * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
>>> + * for details of the synchronisation algorithms used here.
>>>    */
>>>
>>>   #include <linux/linkage.h>
>>>   #include <asm/bL_entry.h>
>>>
>>> +.if BL_SYNC_CLUSTER_CPUS
>>> +.error "cpus must be the first member of struct bL_cluster_sync_struct"
>>> +.endif
>>> +
>>>   	.macro	pr_dbg	cpu, string
>>>   #if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
>>>   	b	1901f
>>> @@ -52,12 +60,82 @@ ENTRY(bL_entry_point)
>>>   2:	pr_dbg	r4, "kernel bL_entry_point\n"
>>>
>>>   	/*
>>> -	 * MMU is off so we need to get to bL_entry_vectors in a
>>> +	 * MMU is off so we need to get to various variables in a
>>>   	 * position independent way.
>>>   	 */
>>>   	adr	r5, 3f
>>> -	ldr	r6, [r5]
>>> +	ldmia	r5, {r6, r7, r8}
>>>   	add	r6, r5, r6			@ r6 = bL_entry_vectors
>>> +	ldr	r7, [r5, r7]			@ r7 = bL_power_up_setup_phys
>>> +	add	r8, r5, r8			@ r8 = bL_sync
>>> +
>>> +	mov	r0, #BL_SYNC_CLUSTER_SIZE
>>> +	mla	r8, r0, r10, r8			@ r8 = bL_sync cluster base
>>> +
>>> +	@ Signal that this CPU is coming UP:
>>> +	mov	r0, #CPU_COMING_UP
>>> +	mov	r5, #BL_SYNC_CPU_SIZE
>>> +	mla	r5, r9, r5, r8			@ r5 = bL_sync cpu address
>>> +	strb	r0, [r5]
>>> +
>>> +	dsb
>> Do you really need the above dsb()? With MMU off, the store should
>
> The short answer is "maybe not".  Some of the barriers can be
> eliminated; some can be demoted to DSBs.  Others may be required but
> unnecessarily duplicated e.g., between bL_head.S and vlock.S.
>
>> anyway make it to the main memory, no?
>
> Yes, but this raises issues about precisely what the architecture
> guarantees about memory ordering in these scenarios.  The only obvious
> thing about that is that it's non-obvious.
>
Well, at least the ARM documents clearly say that memory accesses will be
treated as Strongly-Ordered with the MMU off, and that means they are
expected to make it to main memory.

> Strongly-Ordered memory is not quite the same as having explicit
> barriers everywhere.
>
> I need to have a careful think, but it should be possible to optimise
> a bit here.
>
If the CCI comes in between that rule, and if it needs a barrier to let
it flush its WB to main memory, then that's a different story.

Anyway thanks for the answer.
Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-11 18:33     ` Nicolas Pitre
@ 2013-01-11 18:41       ` Santosh Shilimkar
  2013-01-11 19:54         ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-11 18:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 January 2013 12:03 AM, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>> This is the basic API used to handle the powering up/down of individual
>>> CPUs in a big.LITTLE system.  The platform specific backend implementation
>>> has the responsibility to also handle the cluster level power as well when
>>> the first/last CPU in a cluster is brought up/down.
>>>
>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>> ---
>>>    arch/arm/common/bL_entry.c      | 88
>>> +++++++++++++++++++++++++++++++++++++++
>>>    arch/arm/include/asm/bL_entry.h | 92
>>> +++++++++++++++++++++++++++++++++++++++++
>>>    2 files changed, 180 insertions(+)
>>>
>>> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
>>> index 80fff49417..41de0622de 100644
>>> --- a/arch/arm/common/bL_entry.c
>>> +++ b/arch/arm/common/bL_entry.c
>>> @@ -11,11 +11,13 @@
>>>
>>>    #include <linux/kernel.h>
>>>    #include <linux/init.h>
>>> +#include <linux/irqflags.h>
>>>
>>>    #include <asm/bL_entry.h>
>>>    #include <asm/barrier.h>
>>>    #include <asm/proc-fns.h>
>>>    #include <asm/cacheflush.h>
>>> +#include <asm/idmap.h>
>>>
>>>    extern volatile unsigned long
>>> bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
>>>
>>> @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster,
>>> void *ptr)
>>>    	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
>>>    			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
>>>    }
>>> +
>>> +static const struct bL_platform_power_ops *platform_ops;
>>> +
>>> +int __init bL_platform_power_register(const struct bL_platform_power_ops
>>> *ops)
>>> +{
>>> +	if (platform_ops)
>>> +		return -EBUSY;
>>> +	platform_ops = ops;
>>> +	return 0;
>>> +}
>>> +
>>> +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
>>> +{
>>> +	if (!platform_ops)
>>> +		return -EUNATCH;
>>> +	might_sleep();
>>> +	return platform_ops->power_up(cpu, cluster);
>>> +}
>>> +
>>> +typedef void (*phys_reset_t)(unsigned long);
>>> +
>>> +void bL_cpu_power_down(void)
>>> +{
>>> +	phys_reset_t phys_reset;
>>> +
>>> +	BUG_ON(!platform_ops);
>>> +	BUG_ON(!irqs_disabled());
>>> +
>>> +	/*
>>> +	 * Do this before calling into the power_down method,
>>> +	 * as it might not always be safe to do afterwards.
>>> +	 */
>>> +	setup_mm_for_reboot();
>>> +
>>> +	platform_ops->power_down();
>>> +
>>> +	/*
>>> +	 * It is possible for a power_up request to happen concurrently
>>> +	 * with a power_down request for the same CPU. In this case the
>>> +	 * power_down method might not be able to actually enter a
>>> +	 * powered down state with the WFI instruction if the power_up
>>> +	 * method has removed the required reset condition.  The
>>> +	 * power_down method is then allowed to return. We must perform
>>> +	 * a re-entry in the kernel as if the power_up method just had
>>> +	 * deasserted reset on the CPU.
>>> +	 *
>>> +	 * To simplify race issues, the platform specific implementation
>>> +	 * must accommodate for the possibility of unordered calls to
>>> +	 * power_down and power_up with a usage count. Therefore, if a
>>> +	 * call to power_up is issued for a CPU that is not down, then
>>> +	 * the next call to power_down must not attempt a full shutdown
>>> +	 * but only do the minimum (normally disabling L1 cache and CPU
>>> +	 * coherency) and return just as if a concurrent power_up request
>>> +	 * had happened as described above.
>>> +	 */
>>> +
>>> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
>>> +	phys_reset(virt_to_phys(bL_entry_point));
>>> +
>>> +	/* should never get here */
>>> +	BUG();
>>> +}
>>> +
>>> +void bL_cpu_suspend(u64 expected_residency)
>>> +{
>>> +	phys_reset_t phys_reset;
>>> +
>>> +	BUG_ON(!platform_ops);
>>> +	BUG_ON(!irqs_disabled());
>>> +
>>> +	/* Very similar to bL_cpu_power_down() */
>>> +	setup_mm_for_reboot();
>>> +	platform_ops->suspend(expected_residency);
>>> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
>>> +	phys_reset(virt_to_phys(bL_entry_point));
>>> +	BUG();
>>>
>> I might be missing all the rationale behind not having a recovery path for
>> CPUs entering suspend if they actually come here because of some event.
>> This is pretty much possible in many scenarios, and hence letting the CPU
>> come out of suspend should be possible. Maybe the switcher code doesn't
>> have such a requirement, but it appeared a bit off to me.
>
> There are two things to consider here:
>
> 1) The CPU is suspended.  CPU state is lost. Next interrupt to wake up
>     the CPU will make it restart from the reset vector and re-entry in
>     the kernel will happen via bL_entry_point to deal with the various
>     cluster issues, to eventually resume kernel code via cpu_resume.
>     Obviously, the machine specific backend code would have set the
>     bL_entry_point address in its machine specific reset vector in
>     advance.
This is the successful case, and in that case you will not hit the
BUG anyway.
>
> 2) An interrupt comes along before the CPU is effectively suspended, say
>     right before the backend code executes a WFI to shut the CPU down.
>     The CPU and possibly cluster state was already set for being powered
>     off.  We cannot simply return at this point as caches are off, the
>     CPU is not coherent with the rest of the system anymore, etc.  So if
>     the platform specific backend ever returns, say because the final WFI
>     exited, then we have to go through the same arbitration process to
>     restore the CPU and cluster state as if that was a hard reset.  Hence
>     the cpu_reset call to loop back into bL_entry_point.
>
This is the one I was thinking of. Enabling the C bit and the SMP bit should
be enough for the CPU to get back to the right state, since the CPU has not
lost its context: all the registers including the SP are intact, and the CPU
should be able to resume.

> In either case, we simply cannot ever return from bL_cpu_suspend()
> directly.  Of course, the caller is expected to have used
> bL_set_entry_vector() beforehand, most probably with cpu_resume as
> argument.
>
The above might get complicated when the above situation happens on the
last CPU, where even the CCI gets disabled, and then adding the revert code
for all of that may not be worth it. Your approach is much safer.
Thanks for explaining it further.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-11 18:07   ` Santosh Shilimkar
@ 2013-01-11 19:07     ` Nicolas Pitre
  2013-01-12  6:50       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 19:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 11 Jan 2013, Santosh Shilimkar wrote:

> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > Otherwise there might be some interrupts or IPIs becoming pending and the
> > CPU will not enter low power mode when doing a WFI.  The effect of this
> > is a CPU that loops back into the kernel, goes through the first man
> > election, signals itself as alive, and prevents the cluster from being
> > shut down.
> > 
> > This could benefit from a better solution.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >   arch/arm/common/bL_platsmp.c        | 1 +
> >   arch/arm/common/gic.c               | 6 ++++++
> >   arch/arm/include/asm/hardware/gic.h | 2 ++
> >   3 files changed, 9 insertions(+)
> > 
> > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > index 0ae44123bf..6a3b251b97 100644
> > --- a/arch/arm/common/bL_platsmp.c
> > +++ b/arch/arm/common/bL_platsmp.c
> > @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
> >   	pcpu = mpidr & 0xff;
> >   	pcluster = (mpidr >> 8) & 0xff;
> >   	bL_set_entry_vector(pcpu, pcluster, NULL);
> > +	gic_cpu_if_down();
> 
> So for a case where the CPU still doesn't power down for some reason even
> after the CPU interface is disabled, it cannot listen to any SGI or PPI.
> Not sure if this happens on big.LITTLE, but I have seen one such issue
> on a Cortex-A9 based SoC.

Here the problem was the reverse, i.e. a CPU wouldn't go down because
some pending SGIs prevented that.
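
For context, since the gic.c hunk is not quoted in this sub-thread:
disabling the GIC CPU interface conceptually amounts to clearing the
enable bit(s) in GICC_CTLR for the current CPU, after which the interface
stops forwarding interrupts (including SGIs and PPIs) to that CPU.  A rough
sketch, with hypothetical names (the real gic_cpu_if_down() lives in
arch/arm/common/gic.c and presumably uses the GIC driver's own mapping):

#include <linux/io.h>

#define EXAMPLE_GIC_CPU_CTRL	0x00	/* GICC_CTLR offset */

static void __iomem *example_gic_cpu_base;	/* assumed mapped elsewhere */

static void example_gic_cpu_if_down(void)
{
	/* Clear the enable bit(s); no further interrupts reach this CPU. */
	writel_relaxed(0, example_gic_cpu_base + EXAMPLE_GIC_CPU_CTRL);
}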


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 10/16] ARM: vexpress: introduce DCSCB support
  2013-01-11 18:12   ` Santosh Shilimkar
@ 2013-01-11 19:13     ` Nicolas Pitre
  2013-01-12  6:52       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 19:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 11 Jan 2013, Santosh Shilimkar wrote:

> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > This adds basic CPU and cluster reset controls on RTSM for the
> > A15x4-A7x4 model configuration using the Dual Cluster System
> > Configuration Block (DCSCB).
> > 
> > The cache coherency interconnect (CCI) is not handled yet.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >   arch/arm/mach-vexpress/Kconfig  |   8 ++
> >   arch/arm/mach-vexpress/Makefile |   1 +
> >   arch/arm/mach-vexpress/dcscb.c  | 160
> > ++++++++++++++++++++++++++++++++++++++++
> >   3 files changed, 169 insertions(+)
> >   create mode 100644 arch/arm/mach-vexpress/dcscb.c
> > 
> > diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
> > index 99e63f5f99..e55c02562f 100644
> > --- a/arch/arm/mach-vexpress/Kconfig
> > +++ b/arch/arm/mach-vexpress/Kconfig
> > @@ -53,4 +53,12 @@ config ARCH_VEXPRESS_CORTEX_A5_A9_ERRATA
> >   config ARCH_VEXPRESS_CA9X4
> >   	bool "Versatile Express Cortex-A9x4 tile"
> > 
> > +config ARCH_VEXPRESS_DCSCB
> > +	bool "Dual Cluster System Control Block (DCSCB) support"
> > +	depends on BIG_LITTLE
> > +	help
> > +	  Support for the Dual Cluster System Configuration Block (DCSCB).
> > +	  This is needed to provide CPU and cluster power management
> > +	  on RTSM.
> > +
> >   endmenu
> > diff --git a/arch/arm/mach-vexpress/Makefile
> > b/arch/arm/mach-vexpress/Makefile
> > index 80b64971fb..2253644054 100644
> > --- a/arch/arm/mach-vexpress/Makefile
> > +++ b/arch/arm/mach-vexpress/Makefile
> > @@ -6,5 +6,6 @@ ccflags-$(CONFIG_ARCH_MULTIPLATFORM) :=
> > -I$(srctree)/$(src)/include \
> > 
> >   obj-y					:= v2m.o reset.o
> >   obj-$(CONFIG_ARCH_VEXPRESS_CA9X4)	+= ct-ca9x4.o
> > +obj-$(CONFIG_ARCH_VEXPRESS_DCSCB)	+= dcscb.o
> >   obj-$(CONFIG_SMP)			+= platsmp.o
> >   obj-$(CONFIG_HOTPLUG_CPU)		+= hotplug.o
> > diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
> > new file mode 100644
> > index 0000000000..cccd943cd4
> > --- /dev/null
> > +++ b/arch/arm/mach-vexpress/dcscb.c
> > @@ -0,0 +1,160 @@
> > +/*
> > + * arch/arm/mach-vexpress/dcscb.c - Dual Cluster System Control Block
> > + *
> > + * Created by:	Nicolas Pitre, May 2012
> > + * Copyright:	(C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/kernel.h>
> > +#include <linux/io.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/errno.h>
> > +#include <linux/vexpress.h>
> > +
> > +#include <asm/bL_entry.h>
> > +#include <asm/proc-fns.h>
> > +#include <asm/cacheflush.h>
> > +
> > +
> > +#define DCSCB_PHYS_BASE	0x60000000
> > +
> > +#define RST_HOLD0	0x0
> > +#define RST_HOLD1	0x4
> > +#define SYS_SWRESET	0x8
> > +#define RST_STAT0	0xc
> > +#define RST_STAT1	0x10
> > +#define EAG_CFG_R	0x20
> > +#define EAG_CFG_W	0x24
> > +#define KFC_CFG_R	0x28
> > +#define KFC_CFG_W	0x2c
> > +#define DCS_CFG_R	0x30
> > +
> > +/*
> > + * We can't use regular spinlocks. In the switcher case, it is possible
> > + * for an outbound CPU to call power_down() after its inbound counterpart
> > + * is already live using the same logical CPU number which trips lockdep
> > + * debugging.
> > + */
> > +static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
> > +
> > +static void __iomem *dcscb_base;
> > +
> > +static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
> > +{
> > +	unsigned int rst_hold, cpumask = (1 << cpu);
> > +
> > +	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
> > +	if (cpu >= 4 || cluster >= 2)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Since this is called with IRQs enabled, and no arch_spin_lock_irq
> > +	 * variant exists, we need to disable IRQs manually here.
> > +	 */
> > +	local_irq_disable();
> > +	arch_spin_lock(&dcscb_lock);
> > +
> > +	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
> > +	if (rst_hold & (1 << 8)) {
> > +		/* remove cluster reset and add individual CPU's reset */
> > +		rst_hold &= ~(1 << 8);
> > +		rst_hold |= 0xf;
> > +	}
> > +	rst_hold &= ~(cpumask | (cpumask << 4));
> > +	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
> > +
> > +	arch_spin_unlock(&dcscb_lock);
> > +	local_irq_enable();
> > +
> > +	return 0;
> > +}
> > +
> > +static void dcscb_power_down(void)
> > +{
> > +	unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
> > +
> > +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > +	cpu = mpidr & 0xff;
> > +	cluster = (mpidr >> 8) & 0xff;
> > +	cpumask = (1 << cpu);
> > +
> > +	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
> > +	BUG_ON(cpu >= 4 || cluster >= 2);
> > +
> > +	arch_spin_lock(&dcscb_lock);
> > +	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
> > +	rst_hold |= cpumask;
> > +	if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
> > +		rst_hold |= (1 << 8);
> > +	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
> > +	arch_spin_unlock(&dcscb_lock);
> > +	last_man = (rst_hold & (1 << 8));
> > +
> > +	/*
> > +	 * Now let's clean our L1 cache and shut ourself down.
> > +	 * If we're the last CPU in this cluster then clean L2 too.
> > +	 */
> > +
> Did you want to have the C bit clearing code here?

cpu_proc_fin() does it.
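
For reference, cpu_v7_proc_fin (arch/arm/mm/proc-v7.S) essentially clears
the SCTLR I and C bits, stopping further allocation into the caches.  A
rough C-with-inline-asm sketch of that, with a made-up helper name:

static inline void example_v7_proc_fin(void)
{
	unsigned int sctlr;

	asm volatile (
		"mrc	p15, 0, %0, c1, c0, 0	@ read SCTLR\n\t"
		"bic	%0, %0, #0x1000		@ clear I bit\n\t"
		"bic	%0, %0, #0x0004		@ clear C bit\n\t"
		"mcr	p15, 0, %0, c1, c0, 0	@ write SCTLR"
		: "=&r" (sctlr));
}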

> > +	/*
> > +	 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> > +	 * a preliminary flush here for those CPUs.  At least, that's
> > +	 * the theory -- without the extra flush, Linux explodes on
> > +	 * RTSM (maybe not needed anymore, to be investigated)..
> > +	 */
> > +	flush_cache_louis();
> > +	cpu_proc_fin();
> > +
> > +	if (!last_man) {
> > +		flush_cache_louis();
> > +	} else {
> > +		flush_cache_all();
> > +		outer_flush_all();
> > +	}
> > +
> > +	/* Disable local coherency by clearing the ACTLR "SMP" bit: */
> > +	asm volatile (
> > +		"mrc	p15, 0, ip, c1, c0, 1 \n\t"
> > +		"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
> > +		"mcr	p15, 0, ip, c1, c0, 1"
> > +		: : : "ip" );
> > +
> > +	/* Now we are prepared for power-down, do it: */
> You need a dsb here, right?

Probably.  However, this code is being refactored significantly with
subsequent patches.  This intermediate step was kept so as not to introduce
too many concepts at once.

> > +	wfi();
> > +
> > +	/* Not dead at this point?  Let our caller cope. */
> > +}
> > +
> 
> Regards
> Santosh
> 

Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 13/16] drivers: misc: add ARM CCI support
  2013-01-11 18:20   ` Santosh Shilimkar
@ 2013-01-11 19:22     ` Nicolas Pitre
  2013-01-12  6:53       ` Santosh Shilimkar
  2013-01-15 18:34       ` Dave Martin
  0 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 19:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 11 Jan 2013, Santosh Shilimkar wrote:

> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > 
> > On ARM multi-cluster systems coherency between cores running on
> > different clusters is managed by the cache-coherent interconnect (CCI).
> > It allows broadcasting of TLB invalidates and memory barriers and it
> > guarantees cache coherency at system level.
> > 
> > This patch enables the basic infrastructure required in Linux to
> > handle and programme the CCI component. The first implementation is
> > based on a platform device, its relative DT compatible property and
> > a simple programming interface.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >   drivers/misc/Kconfig    |   3 ++
> >   drivers/misc/Makefile   |   1 +
> >   drivers/misc/arm-cci.c  | 107
> > ++++++++++++++++++++++++++++++++++++++++++++++++
> >   include/linux/arm-cci.h |  30 ++++++++++++++
> How about 'drivers/bus/', considering CCI is an interconnect bus (though
> for coherency)?

Yes, I like that better.

> >   4 files changed, 141 insertions(+)
> >   create mode 100644 drivers/misc/arm-cci.c
> >   create mode 100644 include/linux/arm-cci.h
> > 
> > diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> > index b151b7c1bd..30d5be1ad2 100644
> > --- a/drivers/misc/Kconfig
> > +++ b/drivers/misc/Kconfig
> > @@ -499,6 +499,9 @@ config USB_SWITCH_FSA9480
> >   	  stereo and mono audio, video, microphone and UART data to use
> >   	  a common connector port.
> > 
> > +config ARM_CCI
> You might want to add a dependency on ARM big.LITTLE, otherwise it will
> break the build for other architectures with random configurations.

As far as this patch goes, this is buildable on other architectures too.  
The next patch changes that though.

> [..]
> 
> > diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
> > new file mode 100644
> > index 0000000000..f329c43099
> > --- /dev/null
> > +++ b/drivers/misc/arm-cci.c
> > @@ -0,0 +1,107 @@
> > +/*
> > + * CCI support
> > + *
> > + * Copyright (C) 2012 ARM Ltd.
> > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > + * kind, whether express or implied; without even the implied warranty
> > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include <linux/device.h>
> > +#include <linux/io.h>
> > +#include <linux/module.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/slab.h>
> > +#include <linux/arm-cci.h>
> > +
> > +#define CCI400_EAG_OFFSET       0x4000
> > +#define CCI400_KF_OFFSET        0x5000
> > +
> > +#define DRIVER_NAME	"CCI"
> > +struct cci_drvdata {
> > +	void __iomem *baseaddr;
> > +	spinlock_t lock;
> > +};
> > +
> > +static struct cci_drvdata *info;
> > +
> > +void disable_cci(int cluster)
> > +{
> > +	u32 cci_reg = cluster ? CCI400_KF_OFFSET : CCI400_EAG_OFFSET;
> > +	writel_relaxed(0x0, info->baseaddr	+ cci_reg);
> > +
> > +	while (readl_relaxed(info->baseaddr + 0xc) & 0x1)
> > +			;
> > +}
> > +EXPORT_SYMBOL_GPL(disable_cci);
> > +
> Is more functionality going to be added to the CCI driver?  Having this
> much driver code for just a disable_cci() function seems like
> overkill.

Yes.  More code will appear here to provide PMU functionality, etc.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-11 18:27   ` Santosh Shilimkar
@ 2013-01-11 19:28     ` Nicolas Pitre
  2013-01-12  7:21       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 19:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 11 Jan 2013, Santosh Shilimkar wrote:

> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> > 
> > +		/*
> > +		 * Flush the local CPU cache.
> > +		 *
> > +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> > +		 * a preliminary flush here for those CPUs.  At least, that's
> > +		 * the theory -- without the extra flush, Linux explodes on
> > +		 * RTSM (maybe not needed anymore, to be investigated).
> > +		 */
> This is expected if the entire code is not in one stack frame and the
> additional flush is needed to avoid possible stack corruption. This
> issue has been discussed in the past on the list.

I missed that.  Do you have a reference or pointer handy?

What is strange is that this is 100% reproducible on RTSM while this 
apparently is not an issue on real hardware so far.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
  2013-01-11 18:41       ` Santosh Shilimkar
@ 2013-01-11 19:54         ` Nicolas Pitre
  0 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-11 19:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 12 Jan 2013, Santosh Shilimkar wrote:

> On Saturday 12 January 2013 12:03 AM, Nicolas Pitre wrote:
> > On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> > 
> > > On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > > > This is the basic API used to handle the powering up/down of individual
> > > > CPUs in a big.LITTLE system.  The platform specific backend
> > > > implementation
> > > > has the responsibility to also handle the cluster level power as well
> > > > when
> > > > the first/last CPU in a cluster is brought up/down.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > > ---
> > > >    arch/arm/common/bL_entry.c      | 88
> > > > +++++++++++++++++++++++++++++++++++++++
> > > >    arch/arm/include/asm/bL_entry.h | 92
> > > > +++++++++++++++++++++++++++++++++++++++++
> > > >    2 files changed, 180 insertions(+)
> > > > 
> > > > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > > > index 80fff49417..41de0622de 100644
> > > > --- a/arch/arm/common/bL_entry.c
> > > > +++ b/arch/arm/common/bL_entry.c
> > > > @@ -11,11 +11,13 @@
> > > > 
> > > >    #include <linux/kernel.h>
> > > >    #include <linux/init.h>
> > > > +#include <linux/irqflags.h>
> > > > 
> > > >    #include <asm/bL_entry.h>
> > > >    #include <asm/barrier.h>
> > > >    #include <asm/proc-fns.h>
> > > >    #include <asm/cacheflush.h>
> > > > +#include <asm/idmap.h>
> > > > 
> > > >    extern volatile unsigned long
> > > > bL_entry_vectors[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
> > > > 
> > > > @@ -28,3 +30,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned
> > > > cluster,
> > > > void *ptr)
> > > >    	outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
> > > >    			  __pa(&bL_entry_vectors[cluster][cpu + 1]));
> > > >    }
> > > > +
> > > > +static const struct bL_platform_power_ops *platform_ops;
> > > > +
> > > > +int __init bL_platform_power_register(const struct
> > > > bL_platform_power_ops
> > > > *ops)
> > > > +{
> > > > +	if (platform_ops)
> > > > +		return -EBUSY;
> > > > +	platform_ops = ops;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
> > > > +{
> > > > +	if (!platform_ops)
> > > > +		return -EUNATCH;
> > > > +	might_sleep();
> > > > +	return platform_ops->power_up(cpu, cluster);
> > > > +}
> > > > +
> > > > +typedef void (*phys_reset_t)(unsigned long);
> > > > +
> > > > +void bL_cpu_power_down(void)
> > > > +{
> > > > +	phys_reset_t phys_reset;
> > > > +
> > > > +	BUG_ON(!platform_ops);
> > > > +	BUG_ON(!irqs_disabled());
> > > > +
> > > > +	/*
> > > > +	 * Do this before calling into the power_down method,
> > > > +	 * as it might not always be safe to do afterwards.
> > > > +	 */
> > > > +	setup_mm_for_reboot();
> > > > +
> > > > +	platform_ops->power_down();
> > > > +
> > > > +	/*
> > > > +	 * It is possible for a power_up request to happen concurrently
> > > > +	 * with a power_down request for the same CPU. In this case the
> > > > +	 * power_down method might not be able to actually enter a
> > > > +	 * powered down state with the WFI instruction if the power_up
> > > > +	 * method has removed the required reset condition.  The
> > > > +	 * power_down method is then allowed to return. We must perform
> > > > +	 * a re-entry in the kernel as if the power_up method just had
> > > > +	 * deasserted reset on the CPU.
> > > > +	 *
> > > > +	 * To simplify race issues, the platform specific implementation
> > > > +	 * must accommodate for the possibility of unordered calls to
> > > > +	 * power_down and power_up with a usage count. Therefore, if a
> > > > +	 * call to power_up is issued for a CPU that is not down, then
> > > > +	 * the next call to power_down must not attempt a full shutdown
> > > > +	 * but only do the minimum (normally disabling L1 cache and CPU
> > > > +	 * coherency) and return just as if a concurrent power_up request
> > > > +	 * had happened as described above.
> > > > +	 */
> > > > +
> > > > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > > > +	phys_reset(virt_to_phys(bL_entry_point));
> > > > +
> > > > +	/* should never get here */
> > > > +	BUG();
> > > > +}
> > > > +
> > > > +void bL_cpu_suspend(u64 expected_residency)
> > > > +{
> > > > +	phys_reset_t phys_reset;
> > > > +
> > > > +	BUG_ON(!platform_ops);
> > > > +	BUG_ON(!irqs_disabled());
> > > > +
> > > > +	/* Very similar to bL_cpu_power_down() */
> > > > +	setup_mm_for_reboot();
> > > > +	platform_ops->suspend(expected_residency);
> > > > +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> > > > +	phys_reset(virt_to_phys(bL_entry_point));
> > > > +	BUG();
> > > > 
> > > I might be missing all the rationale behind not having a recovery path
> > > for CPUs entering suspend if they actually come here because of some
> > > event.  This is quite possible in many scenarios and hence letting the
> > > CPU come out of suspend should be possible.  Maybe the switcher code
> > > doesn't have such a requirement but it appeared a bit off to me.
> > 
> > There are two things to consider here:
> > 
> > 1) The CPU is suspended.  CPU state is lost. Next interrupt to wake up
> >     the CPU will make it restart from the reset vector and re-entry in
> >     the kernel will happen via bL_entry_point to deal with the various
> >     cluster issues, to eventually resume kernel code via cpu_resume.
> >     Obviously, the machine specific backend code would have set the
> >     bL_entry_point address in its machine specific reset vector in
> >     advance.
> This is the successful case and in that case you will anyway not hit the
> BUG.
> > 
> > 2) An interrupt comes along before the CPU is effectively suspended, say
> >     right before the backend code executes a WFI to shut the CPU down.
> >     The CPU and possibly cluster state was already set for being powered
> >     off.  We cannot simply return at this point as caches are off, the
> >     CPU is not coherent with the rest of the system anymore, etc.  So if
> >     the platform specific backend ever returns, say because the final WFI
> >     exited, then we have to go through the same arbitration process to
> >     restore the CPU and cluster state as if that was a hard reset.  Hence
> >     the cpu_reset call to loop back into bL_entry_point.
> > 
> This is the one I was thinking of. Enabling the C bit and the SMP bit
> should be enough for the CPU to get back to the right state since the
> CPU has not lost its context; all the registers including SP are intact
> and the CPU should be able to resume.

No.  If we are the last man then the CCI is disabled and we cannot just 
enable the C and SMP bits anymore without turning the CCI back on.  And 
even if we are not the last man, maybe another CPU is concurrently going 
through the same code and that one _is_ the last man, in which case it 
will have waited until we are done flushing our cache to turn off the 
CCI but we don't know about that.  And yet another CPU might be coming 
up just at the same moment but this one will want to turn on the CCI and 
that has to be done in a controlled way, and that control is performed 
in the code from bL_entry_point.  So, in short, we cannot just return 
from here.

> > In either cases, we simply cannot ever return from bL_cpu_suspend()
> > directly.  Of course, the caller is expected to have used
> > bL_set_entry_vector() beforehand, most probably with cpu_resume as
> > argument.
> > 
> The above might get complicated when that situation happens on the
> last CPU, where even the CCI gets disabled, and then adding the revert
> code for all of that may not be worthwhile. Your approach is much safer.

Indeed.  Furthermore, that revert code does exist already: it is all in 
bL_head.S.  Hence the cpu_reset(bL_entry_point) call.

> Thanks for explaining it further.

No problem.  It literally took us months to get this code right, so it
might not be fully obvious to others at first look.


Nicolas
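
For illustration, a minimal sketch of how a cpuidle backend could use
this API, with the entry vector pointed at cpu_resume as mentioned
above.  All the names below (bl_powerdown_finisher, bl_enter_powerdown)
and the glue around cpu_suspend() are assumptions for illustration, not
code from this series:

	/* Hypothetical cpuidle glue -- a sketch only, names are made up. */
	static int bl_powerdown_finisher(unsigned long arg)
	{
		unsigned int mpidr, pcpu, pcluster;

		asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
		pcpu = mpidr & 0xff;
		pcluster = (mpidr >> 8) & 0xff;

		/* Arrange re-entry through bL_entry_point into cpu_resume */
		bL_set_entry_vector(pcpu, pcluster, cpu_resume);
		bL_cpu_suspend(0);	/* does not return directly */
		return 1;
	}

	static int bl_enter_powerdown(struct cpuidle_device *dev,
				      struct cpuidle_driver *drv, int idx)
	{
		cpu_pm_enter();
		/* cpu_suspend() saves the CPU context, then calls the finisher */
		cpu_suspend(0, bl_powerdown_finisher);
		/* We get back here after wakeup via bL_entry_point/cpu_resume */
		cpu_pm_exit();
		return idx;
	}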

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-11 19:07     ` Nicolas Pitre
@ 2013-01-12  6:50       ` Santosh Shilimkar
  2013-01-12 16:47         ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-12  6:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 January 2013 12:37 AM, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>> Otherwise there might be some interrupts or IPIs becoming pending and the
>>> CPU will not enter low power mode when doing a WFI.  The effect of this
>>> is a CPU that loops back into the kernel, goes through the first man
>>> election, signals itself as alive, and prevents the cluster from being
>>> shut down.
>>>
>>> This could benefit from a better solution.
>>>
>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>> ---
>>>    arch/arm/common/bL_platsmp.c        | 1 +
>>>    arch/arm/common/gic.c               | 6 ++++++
>>>    arch/arm/include/asm/hardware/gic.h | 2 ++
>>>    3 files changed, 9 insertions(+)
>>>
>>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
>>> index 0ae44123bf..6a3b251b97 100644
>>> --- a/arch/arm/common/bL_platsmp.c
>>> +++ b/arch/arm/common/bL_platsmp.c
>>> @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
>>>    	pcpu = mpidr & 0xff;
>>>    	pcluster = (mpidr >> 8) & 0xff;
>>>    	bL_set_entry_vector(pcpu, pcluster, NULL);
>>> +	gic_cpu_if_down();
>>
>> So for a case where the CPU still doesn't power down for some reason
>> even after the CPU interface is disabled, it cannot listen to any SGI
>> or PPI.  Not sure if this happens on big.LITTLE but I have seen one
>> such issue on a Cortex-A9 based SoC.
>
> Here the problem was the reverse i.e. a CPU wouldn't go down because
> some pending SGIs prevented that.
>
I understood that part. What I was saying is, with the CPU IF disabled,
if the CPU doesn't enter the intended low power state and the wakeup
mechanism on that CPU is SGI/SPI, the CPU may never wake up, which can
lead to deadlock. I have seen this scenario on OMAP, especially in the
CPUidle path. It may not be relevant for the switcher considering you
almost force the CPU to enter the low power state :-)

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 10/16] ARM: vexpress: introduce DCSCB support
  2013-01-11 19:13     ` Nicolas Pitre
@ 2013-01-12  6:52       ` Santosh Shilimkar
  0 siblings, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-12  6:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 January 2013 12:43 AM, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>> This adds basic CPU and cluster reset controls on RTSM for the
>>> A15x4-A7x4 model configuration using the Dual Cluster System
>>> Configuration Block (DCSCB).
>>>
>>> The cache coherency interconnect (CCI) is not handled yet.
>>>
>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>> ---
>>>    arch/arm/mach-vexpress/Kconfig  |   8 ++
>>>    arch/arm/mach-vexpress/Makefile |   1 +
>>>    arch/arm/mach-vexpress/dcscb.c  | 160
>>> ++++++++++++++++++++++++++++++++++++++++
>>>    3 files changed, 169 insertions(+)
>>>    create mode 100644 arch/arm/mach-vexpress/dcscb.c
>>>
[..]

>>> diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
>>> new file mode 100644
>>> index 0000000000..cccd943cd4
>>> --- /dev/null
>>> +++ b/arch/arm/mach-vexpress/dcscb.c
[..]

>>> +static void dcscb_power_down(void)
>>> +{
>>> +	unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
>>> +
>>> +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
>>> +	cpu = mpidr & 0xff;
>>> +	cluster = (mpidr >> 8) & 0xff;
>>> +	cpumask = (1 << cpu);
>>> +
>>> +	pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
>>> +	BUG_ON(cpu >= 4 || cluster >= 2);
>>> +
>>> +	arch_spin_lock(&dcscb_lock);
>>> +	rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
>>> +	rst_hold |= cpumask;
>>> +	if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
>>> +		rst_hold |= (1 << 8);
>>> +	writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
>>> +	arch_spin_unlock(&dcscb_lock);
>>> +	last_man = (rst_hold & (1 << 8));
>>> +
>>> +	/*
>>> +	 * Now let's clean our L1 cache and shut ourself down.
>>> +	 * If we're the last CPU in this cluster then clean L2 too.
>>> +	 */
>>> +
>> Did you want to have the C bit clear code here?
>
> cpu_proc_fin() does it.
>
Yep. I noticed that in the next patch when I read the comment.

>>> +	/*
>>> +	 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
>>> +	 * a preliminary flush here for those CPUs.  At least, that's
>>> +	 * the theory -- without the extra flush, Linux explodes on
>>> +	 * RTSM (maybe not needed anymore, to be investigated)..
>>> +	 */
>>> +	flush_cache_louis();
>>> +	cpu_proc_fin();
>>> +
>>> +	if (!last_man) {
>>> +		flush_cache_louis();
>>> +	} else {
>>> +		flush_cache_all();
>>> +		outer_flush_all();
>>> +	}
>>> +
>>> +	/* Disable local coherency by clearing the ACTLR "SMP" bit: */
>>> +	asm volatile (
>>> +		"mrc	p15, 0, ip, c1, c0, 1 \n\t"
>>> +		"bic	ip, ip, #(1 << 6) @ clear SMP bit \n\t"
>>> +		"mcr	p15, 0, ip, c1, c0, 1"
>>> +		: : : "ip" );
>>> +
>>> +	/* Now we are prepared for power-down, do it: */
>> You need a dsb here, right?
>
> Probably.  However this code is being refactored significantly with
> subsequent patches.  This intermediate step was kept so as not to
> introduce too many concepts at once.
>
Yes. I do see the updates in the subsequent patches.

Regards
Santosh
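
For illustration, a rough sketch of the power-down ordering that the
above discussion converges on once the CCI is taken into account.  The
exit_coherency() helper and the overall shape are placeholders, not the
actual code from the later patches:

	/* Caller is assumed to have already cleared SCTLR.C (cpu_proc_fin()). */
	static void power_down_sketch(unsigned int cluster, bool last_man)
	{
		if (last_man) {
			flush_cache_all();	/* all cache levels, incl. L2 */
			outer_flush_all();
			exit_coherency();	/* placeholder: clear ACTLR.SMP + barriers */
			disable_cci(cluster);	/* only safe after the full flush */
		} else {
			flush_cache_louis();	/* local levels of unification only */
			exit_coherency();
		}
		dsb();
		wfi();	/* the power controller takes the CPU/cluster down here */
	}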

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 13/16] drivers: misc: add ARM CCI support
  2013-01-11 19:22     ` Nicolas Pitre
@ 2013-01-12  6:53       ` Santosh Shilimkar
  2013-01-15 18:34       ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-12  6:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 January 2013 12:52 AM, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>>>
>>> On ARM multi-cluster systems coherency between cores running on
>>> different clusters is managed by the cache-coherent interconnect (CCI).
>>> It allows broadcasting of TLB invalidates and memory barriers and it
>>> guarantees cache coherency at system level.
>>>
>>> This patch enables the basic infrastructure required in Linux to
>>> handle and programme the CCI component. The first implementation is
>>> based on a platform device, its relative DT compatible property and
>>> a simple programming interface.
>>>
>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>> ---
>>>    drivers/misc/Kconfig    |   3 ++
>>>    drivers/misc/Makefile   |   1 +
>>>    drivers/misc/arm-cci.c  | 107
>>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>>    include/linux/arm-cci.h |  30 ++++++++++++++
>> How about 'drivers/bus/' considering CCI is an interconnect bus (though
>> for coherency)
>
> Yes, I like that better.
>
Great.

>>>    4 files changed, 141 insertions(+)
>>>    create mode 100644 drivers/misc/arm-cci.c
>>>    create mode 100644 include/linux/arm-cci.h
>>>
>>> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
>>> index b151b7c1bd..30d5be1ad2 100644
>>> --- a/drivers/misc/Kconfig
>>> +++ b/drivers/misc/Kconfig
>>> @@ -499,6 +499,9 @@ config USB_SWITCH_FSA9480
>>>    	  stereo and mono audio, video, microphone and UART data to use
>>>    	  a common connector port.
>>>
>>> +config ARM_CCI
>> You might want to add a dependency on ARM big.LITTLE, otherwise it will
>> break the build for other architectures with random configurations.
>
> As far as this patch goes, this is buildable on other architectures too.
> The next patch changes that though.
>
Thanks.

>> [..]
>>
>>> diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
>>> new file mode 100644
>>> index 0000000000..f329c43099
>>> --- /dev/null
>>> +++ b/drivers/misc/arm-cci.c
>>> @@ -0,0 +1,107 @@
>>> +/*
>>> + * CCI support
>>> + *
>>> + * Copyright (C) 2012 ARM Ltd.
>>> + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License version 2 as
>>> + * published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
>>> + * kind, whether express or implied; without even the implied warranty
>>> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> + */
>>> +
>>> +#include <linux/device.h>
>>> +#include <linux/io.h>
>>> +#include <linux/module.h>
>>> +#include <linux/platform_device.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/arm-cci.h>
>>> +
>>> +#define CCI400_EAG_OFFSET       0x4000
>>> +#define CCI400_KF_OFFSET        0x5000
>>> +
>>> +#define DRIVER_NAME	"CCI"
>>> +struct cci_drvdata {
>>> +	void __iomem *baseaddr;
>>> +	spinlock_t lock;
>>> +};
>>> +
>>> +static struct cci_drvdata *info;
>>> +
>>> +void disable_cci(int cluster)
>>> +{
>>> +	u32 cci_reg = cluster ? CCI400_KF_OFFSET : CCI400_EAG_OFFSET;
>>> +	writel_relaxed(0x0, info->baseaddr	+ cci_reg);
>>> +
>>> +	while (readl_relaxed(info->baseaddr + 0xc) & 0x1)
>>> +			;
>>> +}
>>> +EXPORT_SYMBOL_GPL(disable_cci);
>>> +
>> Is more functionality going to be added to the CCI driver?  Having this
>> much driver code for just a disable_cci() function seems like
>> overkill.
>
> Yes.  More code will appear here to provide PMU functionality, etc.
>
Good to know.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-11 19:28     ` Nicolas Pitre
@ 2013-01-12  7:21       ` Santosh Shilimkar
  2013-01-14 12:25         ` Lorenzo Pieralisi
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-12  7:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>> From: Dave Martin <dave.martin@linaro.org>
>>>
>>> +		/*
>>> +		 * Flush the local CPU cache.
>>> +		 *
>>> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
>>> +		 * a preliminary flush here for those CPUs.  At least, that's
>>> +		 * the theory -- without the extra flush, Linux explodes on
>>> +		 * RTSM (maybe not needed anymore, to be investigated).
>>> +		 */
>> This is expected if the entire code is not in one stack frame and the
>> additional flush is needed to avoid possible stack corruption. This
>> issue has been discussed in the past on the list.
>
> I missed that.  Do you have a reference or pointer handy?
>
> What is strange is that this is 100% reproducible on RTSM while this
> apparently is not an issue on real hardware so far.
>
I tried searching the archives and realized the discussion was in a
private email thread. There are some bits and pieces on the list but
not all the information.

The main issue RMK pointed out is: an additional L1 flush is needed
to avoid the effective change of view of memory when the C bit is
turned off, and the cache is no longer searched for local CPU accesses.

In your case dcscb_power_down() has updated the stack, which can hit
in a cache line, so the cache is dirty now. Then cpu_proc_fin() clears
the C-bit and hence for subsequent accesses the L1 cache won't be
searched. You then call flush_cache_all() which again updates the
stack but avoids searching the L1 cache, so it overwrites the
previously saved stack frame. This seems to be an issue in your case
as well.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-12  6:50       ` Santosh Shilimkar
@ 2013-01-12 16:47         ` Nicolas Pitre
  2013-01-13  4:37           ` Santosh Shilimkar
  2013-01-14 17:53           ` Lorenzo Pieralisi
  0 siblings, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-12 16:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 12 Jan 2013, Santosh Shilimkar wrote:

> On Saturday 12 January 2013 12:37 AM, Nicolas Pitre wrote:
> > On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> > 
> > > On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > > > Otherwise there might be some interrupts or IPIs becoming pending and
> > > > the CPU will not enter low power mode when doing a WFI.  The effect of
> > > > this is a CPU that loops back into the kernel, goes through the first
> > > > man election, signals itself as alive, and prevents the cluster from
> > > > being shut down.
> > > > 
> > > > This could benefit from a better solution.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > > ---
> > > >    arch/arm/common/bL_platsmp.c        | 1 +
> > > >    arch/arm/common/gic.c               | 6 ++++++
> > > >    arch/arm/include/asm/hardware/gic.h | 2 ++
> > > >    3 files changed, 9 insertions(+)
> > > > 
> > > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > > index 0ae44123bf..6a3b251b97 100644
> > > > --- a/arch/arm/common/bL_platsmp.c
> > > > +++ b/arch/arm/common/bL_platsmp.c
> > > > @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
> > > >    	pcpu = mpidr & 0xff;
> > > >    	pcluster = (mpidr >> 8) & 0xff;
> > > >    	bL_set_entry_vector(pcpu, pcluster, NULL);
> > > > +	gic_cpu_if_down();
> > > 
> > > So for a case where the CPU still doesn't power down for some reason
> > > even after the CPU interface is disabled, it cannot listen to any SGI
> > > or PPI.  Not sure if this happens on big.LITTLE but I have seen one
> > > such issue on a Cortex-A9 based SoC.
> > 
> > Here the problem was the reverse i.e. a CPU wouldn't go down because
> > some pending SGIs prevented that.
> > 
> I understood that part. What I was saying is, with the CPU IF disabled,
> if the CPU doesn't enter the intended low power state and the wakeup
> mechanism on that CPU is SGI/SPI, the CPU may never wake up, which can
> lead to deadlock. I have seen this scenario on OMAP, especially in the
> CPUidle path.

Obviously, on the CPU idle path, you should not turn off the GIC 
interface as this might lose the ability to wake the CPU up with a 
pending interrupt, if your system is so configured.

Here this is the CPU hot-unplug path and we don't want the CPU to be
woken up at all until we explicitly do something to wake it back up.

However, in theory, all interrupts should have been migrated away from 
this CPU, so there shouldn't be any need for this.  I should revisit the 
test that led me to create this patch.

> It may not be relevant for the switcher considering you almost force the
> CPU to enter the low power state :-)

The switcher doesn't use cpu_die() but calls into bL_cpu_power_down() 
directly.


Nicolas
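
For reference, a minimal sketch of what the gic_cpu_if_down() helper
could look like.  The gic_data/gic_data_cpu_base() internals and this
exact body are assumptions, since the gic.c hunk itself is not quoted
here:

	/* Sketch only: stop signalling interrupts to this CPU interface */
	void gic_cpu_if_down(void)
	{
		void __iomem *cpu_base = gic_data_cpu_base(&gic_data[0]);

		writel_relaxed(0, cpu_base + GIC_CPU_CTRL);
	}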

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-12 16:47         ` Nicolas Pitre
@ 2013-01-13  4:37           ` Santosh Shilimkar
  2013-01-14 17:53           ` Lorenzo Pieralisi
  1 sibling, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-13  4:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 January 2013 10:17 PM, Nicolas Pitre wrote:
> On Sat, 12 Jan 2013, Santosh Shilimkar wrote:
>
>> On Saturday 12 January 2013 12:37 AM, Nicolas Pitre wrote:
>>> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>>>
>>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>>> Otherwise there might be some interrupts or IPIs becoming pending and
>>>>> the CPU will not enter low power mode when doing a WFI.  The effect of
>>>>> this is a CPU that loops back into the kernel, goes through the first
>>>>> man election, signals itself as alive, and prevents the cluster from
>>>>> being shut down.
>>>>>
>>>>> This could benefit from a better solution.
>>>>>
>>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>>> ---
>>>>>     arch/arm/common/bL_platsmp.c        | 1 +
>>>>>     arch/arm/common/gic.c               | 6 ++++++
>>>>>     arch/arm/include/asm/hardware/gic.h | 2 ++
>>>>>     3 files changed, 9 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
>>>>> index 0ae44123bf..6a3b251b97 100644
>>>>> --- a/arch/arm/common/bL_platsmp.c
>>>>> +++ b/arch/arm/common/bL_platsmp.c
>>>>> @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
>>>>>     	pcpu = mpidr & 0xff;
>>>>>     	pcluster = (mpidr >> 8) & 0xff;
>>>>>     	bL_set_entry_vector(pcpu, pcluster, NULL);
>>>>> +	gic_cpu_if_down();
>>>>
>>>> So for a case where the CPU still doesn't power down for some reason
>>>> even after the CPU interface is disabled, it cannot listen to any SGI
>>>> or PPI.  Not sure if this happens on big.LITTLE but I have seen one
>>>> such issue on a Cortex-A9 based SoC.
>>>
>>> Here the problem was the reverse i.e. a CPU wouldn't go down because
>>> some pending SGIs prevented that.
>>>
>> I understood that part. What I was saying is, with the CPU IF disabled,
>> if the CPU doesn't enter the intended low power state and the wakeup
>> mechanism on that CPU is SGI/SPI, the CPU may never wake up, which can
>> lead to deadlock. I have seen this scenario on OMAP, especially in the
>> CPUidle path.
>
> Obviously, on the CPU idle path, you should not turn off the GIC
> interface as this might lose the ability to wake the CPU up with a
> pending interrupt, if your system is so configured.
>
> Here this is the CPU hot-unplug path and we don't want the CPU to be
> woken up at all until we explicitly do something to wake it back up.
>
I see.

> However, in theory, all interrupts should have been migrated away from
> this CPU, so there shouldn't be any need for this.  I should revisit the
> test that led me to create this patch.
>
That's right as far as the hot-plug path and SPIs are concerned. But
SGIs/PPIs can still wake up the CPU and there is no migration as such
since they are local to that CPU.

>> It may not be relevant for the switcher considering you almost force the
>> CPU to enter the low power state :-)
>
> The switcher doesn't use cpu_die() but calls into bL_cpu_power_down()
> directly.
>
Good to know that.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
@ 2013-01-14  9:56     ` Joseph Lo
  2013-01-10  0:20 ` [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
                       ` (18 subsequent siblings)
  19 siblings, 0 replies; 140+ messages in thread
From: Joseph Lo @ 2013-01-14  9:56 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi Nicolas,

On Thu, 2013-01-10 at 08:20 +0800, Nicolas Pitre wrote:
> This is the initial public posting of the initial support for big.LITTLE.
> Included here is the code required to safely power up and down CPUs in a
> b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> boot and CPU hotplug support is included at this time.  Getting to this
> point already represents a significcant chunk of code as illustrated by
> the diffstat below.
> 
> This work was presented at Linaro Connect in Copenhagen by Dave Martin and
> myself.  The presentation slides are available here:
> 
> http://www.linaro.org/documents/download/f3569407bb1fb8bde0d6da80e285b832508f92f57223c
> 
> The code is now stable on both Fast Models as well as Virtual Express TC2
> and ready for public review.
> 
> Platform support is included for Fast Models implementing the
> Cortex-A15x4-A7x4 and Cortex-A15x1-A7x1 configurations.  To allow
> successful compilation, I also included a preliminary version of the
> CCI400 driver from Lorenzo Pieralisi.
> 
> Support for actual hardware such as Vexpress TC2 should come later,
> once the basic infrastructure from this series is merged.  A few DT
> bindings are used but not yet documented.
> 
> This series is made of the following parts:
> 
> Low-level support code:
> [PATCH 01/16] ARM: b.L: secondary kernel entry code
> [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
> [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
> [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
> [PATCH 05/16] ARM: bL_head: vlock-based first man election
> 
> Adaptation layer to hook with the generic kernel infrastructure:
> [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug
> [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before
> [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a
> [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at
> 
> Fast Models support:
> [PATCH 10/16] ARM: vexpress: introduce DCSCB support
> [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power
> [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs
> [PATCH 13/16] drivers: misc: add ARM CCI support
> [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from
> [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency
> [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree
> 

Thanks for introducing this series.
I am taking a look at it. It introduces an algorithm for syncing the
power status of clusters and CPUs while avoiding races. Do you think
this code could have a chance to become a generic framework?

The Tegra chip series has a similar design for CPU clusters, with the
limitation that CPU0 always needs to be the last CPU to be shut down
before cluster power down as well. I believe it can also benefit from
this work. We indeed need a similar algorithm to sync CPU power status
before cluster power down and switching.

The "bL_entry.c", "bL_entry.S", "bL_entry.h", "vlock.h" and "vlock.S"
files look like they have a chance to become a common framework for ARM
platforms even with just one cluster, because some systems have
limitations around cluster power down. That's why coupled cpuidle was
introduced. And this framework could be enabled automatically where the
platform depends on it, or by menuconfig.

For example:
	select CPUS_CLUSTERS_POWER_SYNC_FRAMEWORK if SMP && CPU_PM

What do you think of this suggestion?

BTW, some questions...
1. The "bL_entry_point" looks like a first-run function for CPUs that
have just powered up, which then jumps to the original reset vector
that should have been called. Do you think this should be a function
called by the reset handler? Or, in your design, should it be called
as soon as possible once the CPU's power is resumed?

2. Should the last-man mechanism be implemented in platform specific
code, to check something like cpu_online_status and whether there is a
limitation on which specific CPU must be the last one to be powered
down?

Thanks,
Joseph

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-12  7:21       ` Santosh Shilimkar
@ 2013-01-14 12:25         ` Lorenzo Pieralisi
  2013-01-15  6:23           ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Pieralisi @ 2013-01-14 12:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jan 12, 2013 at 07:21:24AM +0000, Santosh Shilimkar wrote:
> On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
> > On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> >
> >> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> >>> From: Dave Martin <dave.martin@linaro.org>
> >>>
> >>> +		/*
> >>> +		 * Flush the local CPU cache.
> >>> +		 *
> >>> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> >>> +		 * a preliminary flush here for those CPUs.  At least, that's
> >>> +		 * the theory -- without the extra flush, Linux explodes on
> >>> +		 * RTSM (maybe not needed anymore, to be investigated).
> >>> +		 */
> >> This is expected if the entire code is not in one stack frame and the
> >> additional flush is needed to avoid possible stack corruption. This
> >> issue has been discussed in the past on the list.
> >
> > I missed that.  Do you have a reference or pointer handy?
> >
> > What is strange is that this is 100% reproducible on RTSM while this
> > apparently is not an issue on real hardware so far.
> >
> I tried searching the archives and realized the discussion was in a
> private email thread. There are some bits and pieces on the list but
> not all the information.
> 
> The main issue RMK pointed out is: an additional L1 flush is needed
> to avoid the effective change of view of memory when the C bit is
> turned off, and the cache is no longer searched for local CPU accesses.
> 
> In your case dcscb_power_down() has updated the stack, which can hit
> in a cache line, so the cache is dirty now. Then cpu_proc_fin() clears
> the C-bit and hence for subsequent accesses the L1 cache won't be
> searched. You then call flush_cache_all() which again updates the
> stack but avoids searching the L1 cache, so it overwrites the
> previously saved stack frame. This seems to be an issue in your case
> as well.

On A15/A7, even with the C bit cleared, the D-cache is still searched, so
the situation above cannot happen; if it does we are facing a HW/model bug.
If this code were run on an A9 then we would have a problem since there,
when the C bit is cleared, the D-cache is not searched (and that's why the
sequence above should be written in assembly with no data access
whatsoever), but on A15/A7 we do not have that problem.

I have been running this code on TC2 for hours on end with nary a problem.

The sequence:

- clear C bit
- clean D-cache
- exit SMP

must be written in assembly with no data access whatsoever to make it
portable across v7 implementations. I think I will write some docs and
add them to the kernel to avoid further discussion on this topic.
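
To make that concrete, a rough sketch of such an assembly-only sequence
written as a single asm block so the compiler cannot schedule stack
accesses between the steps.  The clobber list and the call to
v7_flush_dcache_louis are assumptions for illustration, not the actual
code being proposed:

	asm volatile(
		"stmfd	sp!, {fp, ip}		@ pushed while the cache is still on \n\t"
		"mrc	p15, 0, r0, c1, c0, 0	@ read SCTLR \n\t"
		"bic	r0, r0, #(1 << 2)	@ clear the C bit \n\t"
		"mcr	p15, 0, r0, c1, c0, 0	@ write SCTLR \n\t"
		"isb	\n\t"
		"bl	v7_flush_dcache_louis	@ clean/invalidate the local D-cache \n\t"
		"mrc	p15, 0, r0, c1, c0, 1	@ read ACTLR \n\t"
		"bic	r0, r0, #(1 << 6)	@ clear the SMP bit \n\t"
		"mcr	p15, 0, r0, c1, c0, 1	@ write ACTLR \n\t"
		"isb	\n\t"
		"dsb	\n\t"
		"ldmfd	sp!, {fp, ip}"
		: : : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
		      "r9", "r10", "lr", "memory");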

FYI, the thread Santosh mentioned:

http://lists.infradead.org/pipermail/linux-arm-kernel/2012-May/099791.html

Lorenzo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-14  9:56     ` Joseph Lo
@ 2013-01-14 14:05         ` Nicolas Pitre
  -1 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 14:05 UTC (permalink / raw)
  To: Joseph Lo
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Mon, 14 Jan 2013, Joseph Lo wrote:

> Hi Nicolas,
> 
> On Thu, 2013-01-10 at 08:20 +0800, Nicolas Pitre wrote:
> > This is the initial public posting of the initial support for big.LITTLE.
> > Included here is the code required to safely power up and down CPUs in a
> > b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> > Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> > boot and CPU hotplug support is included at this time.  Getting to this
> > point already represents a significcant chunk of code as illustrated by
> > the diffstat below.
> > 
> > This work was presented at Linaro Connect in Copenhagen by Dave Martin and
> > myself.  The presentation slides are available here:
> > 
> > http://www.linaro.org/documents/download/f3569407bb1fb8bde0d6da80e285b832508f92f57223c
> > 
> > The code is now stable on both Fast Models as well as Virtual Express TC2
> > and ready for public review.
> > 
> > Platform support is included for Fast Models implementing the
> > Cortex-A15x4-A7x4 and Cortex-A15x1-A7x1 configurations.  To allow
> > successful compilation, I also included a preliminary version of the
> > CCI400 driver from Lorenzo Pieralisi.
> > 
> > Support for actual hardware such as Vexpress TC2 should come later,
> > once the basic infrastructure from this series is merged.  A few DT
> > bindings are used but not yet documented.
> > 
> > This series is made of the following parts:
> > 
> > Low-level support code:
> > [PATCH 01/16] ARM: b.L: secondary kernel entry code
> > [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
> > [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
> > [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
> > [PATCH 05/16] ARM: bL_head: vlock-based first man election
> > 
> > Adaptation layer to hook with the generic kernel infrastructure:
> > [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug
> > [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before
> > [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a
> > [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at
> > 
> > Fast Models support:
> > [PATCH 10/16] ARM: vexpress: introduce DCSCB support
> > [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power
> > [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs
> > [PATCH 13/16] drivers: misc: add ARM CCI support
> > [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from
> > [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency
> > [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree
> > 
> 
> Thanks for introducing this series.
> I am taking a look at it. It introduces an algorithm for syncing the
> power status of clusters and CPUs while avoiding races. Do you think
> this code could have a chance to become a generic framework?

Yes.  As I mentioned before, the bL_ prefix is implied only by the fact 
that big.LITTLE was the motivation for creating this code.

> The Tegra chip series has a similar design for CPU clusters, with the
> limitation that CPU0 always needs to be the last CPU to be shut down
> before cluster power down as well. I believe it can also benefit from
> this work. We indeed need a similar algorithm to sync CPU power status
> before cluster power down and switching.
> 
> The "bL_entry.c", "bL_entry.S", "bL_entry.h", "vlock.h" and "vlock.S"
> files look like they have a chance to become a common framework for ARM
> platforms even with just one cluster, because some systems have
> limitations around cluster power down. That's why coupled cpuidle was
> introduced. And this framework could be enabled automatically where the
> platform depends on it, or by menuconfig.

Absolutely.


> For example:
> 	select CPUS_CLUSTERS_POWER_SYNC_FRAMEWORK if SMP && CPU_PM
> 
> What do you think of this suggestion?

I'd prefer a more concise name though.

> BTW, some questions...
> 1. The "bL_entry_point" looks like a first-run function for CPUs that
> have just powered up, which then jumps to the original reset vector
> that should have been called. Do you think this should be a function
> called by the reset handler? Or, in your design, should it be called
> as soon as possible once the CPU's power is resumed?

This should be called as soon as possible.

> 2. Should the last-man mechanism be implemented in platform specific
> code, to check something like cpu_online_status and whether there is a
> limitation on which specific CPU must be the last one to be powered down?

The selection of the last man is accomplished using a platform specific 
mechanism.  By the time this has to be done, the CPU is already dead as 
far as the Linux kernel is concerned, and therefore the generic CPU map 
is not reliable.  In the DCSCB case we simply look at the hardware reset 
register being modified to directly determine the last man.  On TC2 (not 
yet posted) we have to keep a local map of online CPUs.

In your case, the selection of the last man would simply be forced on 
CPU0.


Nicolas
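
For illustration, a rough sketch of the kind of platform-local
bookkeeping such a last-man election implies, reusing the
BL_NR_CLUSTERS/BL_CPUS_PER_CLUSTER constants from this series.  The
array, the lock and the function name are assumptions:

	/* Last-man election via a local map of "up" CPUs, updated under a
	 * raw spinlock that is still usable while caches are going away. */
	static int bl_cpu_up_map[BL_NR_CLUSTERS][BL_CPUS_PER_CLUSTER];
	static arch_spinlock_t bl_pm_lock = __ARCH_SPIN_LOCK_UNLOCKED;

	static bool mark_down_and_elect_last_man(unsigned int cpu,
						 unsigned int cluster)
	{
		bool last_man = true;
		int i;

		arch_spin_lock(&bl_pm_lock);
		bl_cpu_up_map[cluster][cpu] = 0;
		for (i = 0; i < BL_CPUS_PER_CLUSTER; i++)
			if (bl_cpu_up_map[cluster][i])
				last_man = false;
		arch_spin_unlock(&bl_pm_lock);

		return last_man;
	}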

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-10  0:20 ` [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
  2013-01-11 18:02   ` Santosh Shilimkar
@ 2013-01-14 16:35   ` Will Deacon
  2013-01-14 16:51     ` Nicolas Pitre
  1 sibling, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-14 16:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:41AM +0000, Nicolas Pitre wrote:
> Now that the b.L power API is in place, we can use it for SMP secondary
> bringup and CPU hotplug in a generic fashion.

[...]

> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> new file mode 100644
> index 0000000000..0acb9f4685
> --- /dev/null
> +++ b/arch/arm/common/bL_platsmp.c
> @@ -0,0 +1,79 @@
> +/*
> + * linux/arch/arm/mach-vexpress/bL_platsmp.c
> + *
> + * Created by:  Nicolas Pitre, November 2012
> + * Copyright:   (C) 2012  Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Code to handle secondary CPU bringup and hotplug for the bL power API.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/smp.h>
> +
> +#include <asm/bL_entry.h>
> +#include <asm/smp_plat.h>
> +#include <asm/hardware/gic.h>
> +
> +static void __init simple_smp_init_cpus(void)
> +{
> +	set_smp_cross_call(gic_raise_softirq);
> +}
> +
> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct task_struct *idle)
> +{
> +	unsigned int pcpu, pcluster, ret;
> +	extern void secondary_startup(void);
> +
> +	pcpu = cpu_logical_map(cpu) & 0xff;
> +	pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;

Again, you can probably use Lorenzo's helpers here.

> +	pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
> +		 __func__, cpu, pcpu, pcluster);
> +
> +	bL_set_entry_vector(pcpu, pcluster, NULL);

Now that you don't have a barrier in this function, you need one here.

> +	ret = bL_cpu_power_up(pcpu, pcluster);
> +	if (ret)
> +		return ret;

and here, although I confess to not understanding why you write NULL the
first time.

> +	bL_set_entry_vector(pcpu, pcluster, secondary_startup);
> +	gic_raise_softirq(cpumask_of(cpu), 0);
> +	sev();

This relies on the event register being able to be set if the target is in a
low-power (wfi) state. I'd feel safer with a dsb before the sev...

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-10  0:20 ` [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU Nicolas Pitre
@ 2013-01-14 16:37   ` Will Deacon
  2013-01-14 16:53     ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-14 16:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:42AM +0000, Nicolas Pitre wrote:
> If for whatever reason a CPU is unexpectedly awakened, it shouldn't
> re-enter the kernel using whatever entry vector might have
> been set by a previous operation.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  arch/arm/common/bL_platsmp.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> index 0acb9f4685..0ae44123bf 100644
> --- a/arch/arm/common/bL_platsmp.c
> +++ b/arch/arm/common/bL_platsmp.c
> @@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
>  
>  static void __ref bL_cpu_die(unsigned int cpu)
>  {
> +	unsigned int mpidr, pcpu, pcluster;
> +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> +	pcpu = mpidr & 0xff;
> +	pcluster = (mpidr >> 8) & 0xff;

Usual comment about helper functions :)

> +	bL_set_entry_vector(pcpu, pcluster, NULL);

Similar to the power_on story, you need a barrier here (unless you change
your platform_ops API to require barriers).

>  	bL_cpu_power_down();

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-10  0:20 ` [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled Nicolas Pitre
  2013-01-11 18:07   ` Santosh Shilimkar
@ 2013-01-14 16:39   ` Will Deacon
  2013-01-14 16:54     ` Nicolas Pitre
  1 sibling, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-14 16:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 10, 2013 at 12:20:43AM +0000, Nicolas Pitre wrote:
> Otherwise there might be some interrupts or IPIs becoming pending and the
> CPU will not enter low power mode when doing a WFI.  The effect of this
> is a CPU that loops back into the kernel, goes through the first man
> election, signals itself as alive, and prevents the cluster from being
> shut down.
> 
> This could benefit from a better solution.
> 
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
>  arch/arm/common/bL_platsmp.c        | 1 +
>  arch/arm/common/gic.c               | 6 ++++++
>  arch/arm/include/asm/hardware/gic.h | 2 ++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> index 0ae44123bf..6a3b251b97 100644
> --- a/arch/arm/common/bL_platsmp.c
> +++ b/arch/arm/common/bL_platsmp.c
> @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
>  	pcpu = mpidr & 0xff;
>  	pcluster = (mpidr >> 8) & 0xff;
>  	bL_set_entry_vector(pcpu, pcluster, NULL);
> +	gic_cpu_if_down();

I'm starting to sound like a stuck record (and not a very tuneful one at
that) but... I think you need a barrier here.

>  	bL_cpu_power_down();

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-14 16:35   ` Will Deacon
@ 2013-01-14 16:51     ` Nicolas Pitre
  2013-01-15 19:09       ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:41AM +0000, Nicolas Pitre wrote:
> > Now that the b.L power API is in place, we can use it for SMP secondary
> > bringup and CPU hotplug in a generic fashion.
> 
> [...]
> 
> > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > new file mode 100644
> > index 0000000000..0acb9f4685
> > --- /dev/null
> > +++ b/arch/arm/common/bL_platsmp.c
> > @@ -0,0 +1,79 @@
> > +/*
> > + * linux/arch/arm/mach-vexpress/bL_platsmp.c
> > + *
> > + * Created by:  Nicolas Pitre, November 2012
> > + * Copyright:   (C) 2012  Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Code to handle secondary CPU bringup and hotplug for the bL power API.
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/smp.h>
> > +
> > +#include <asm/bL_entry.h>
> > +#include <asm/smp_plat.h>
> > +#include <asm/hardware/gic.h>
> > +
> > +static void __init simple_smp_init_cpus(void)
> > +{
> > +	set_smp_cross_call(gic_raise_softirq);
> > +}
> > +
> > +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct task_struct *idle)
> > +{
> > +	unsigned int pcpu, pcluster, ret;
> > +	extern void secondary_startup(void);
> > +
> > +	pcpu = cpu_logical_map(cpu) & 0xff;
> > +	pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
> 
> Again, you can probably use Lorenzo's helpers here.

Yes, that goes for the whole series.
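
For the record, assuming Lorenzo's affinity helpers land in <asm/cputype.h>
as posted, the open-coded masks would become something like (sketch):

	#include <asm/cputype.h>

	pcpu = MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 0);
	pcluster = MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 1);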

> > +	pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
> > +		 __func__, cpu, pcpu, pcluster);
> > +
> > +	bL_set_entry_vector(pcpu, pcluster, NULL);
> 
> Now that you don't have a barrier in this function, you need one here.

Hmmm... Why?

> > +	ret = bL_cpu_power_up(pcpu, pcluster);
> > +	if (ret)
> > +		return ret;
> 
> and here, although I confess to not understanding why you write NULL the
> first time.

If for some reason the bL_cpu_power_up() call fails, I don't want this 
CPU to suddenly decide to enter the kernel if it wakes up at a later 
time when secondary_startup is not ready to deal with it anymore.

> > +	bL_set_entry_vector(pcpu, pcluster, secondary_startup);
> > +	gic_raise_softirq(cpumask_of(cpu), 0);
> > +	sev();
> 
> This relies on the event register being able to be set if the target is in a
> low-power (wfi) state. I'd feel safer with a dsb before the sev...

Sure.
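
Something like this then, I suppose (sketch, untested):

	bL_set_entry_vector(pcpu, pcluster, secondary_startup);
	gic_raise_softirq(cpumask_of(cpu), 0);
	dsb();	/* complete the vector and SGI writes before signalling the event */
	sev();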


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 16:37   ` Will Deacon
@ 2013-01-14 16:53     ` Nicolas Pitre
  2013-01-14 17:00       ` Will Deacon
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 16:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:42AM +0000, Nicolas Pitre wrote:
> > If for whatever reason a CPU is unexpectedly awakened, it shouldn't
> > re-enter the kernel by using whatever entry vector that might have
> > been set by a previous operation.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >  arch/arm/common/bL_platsmp.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > index 0acb9f4685..0ae44123bf 100644
> > --- a/arch/arm/common/bL_platsmp.c
> > +++ b/arch/arm/common/bL_platsmp.c
> > @@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
> >  
> >  static void __ref bL_cpu_die(unsigned int cpu)
> >  {
> > +	unsigned int mpidr, pcpu, pcluster;
> > +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > +	pcpu = mpidr & 0xff;
> > +	pcluster = (mpidr >> 8) & 0xff;
> 
> Usual comment about helper functions :)
> 
> > +	bL_set_entry_vector(pcpu, pcluster, NULL);
> 
> Similar to the power_on story, you need a barrier here (unless you change
> your platform_ops API to require barriers).

The bL_set_entry_vector() includes a cache flush which itself has a DSB.  
Hence my previous question.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-14 16:39   ` Will Deacon
@ 2013-01-14 16:54     ` Nicolas Pitre
  2013-01-14 17:02       ` Will Deacon
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 16:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Jan 2013, Will Deacon wrote:

> On Thu, Jan 10, 2013 at 12:20:43AM +0000, Nicolas Pitre wrote:
> > Otherwise there might be some interrupts or IPIs becoming pending and the
> > CPU will not enter low power mode when doing a WFI.  The effect of this
> > is a CPU that loops back into the kernel, goes through the first man
> > election, signals itself as alive, and prevents the cluster from being
> > shut down.
> > 
> > This could benefit from a better solution.
> > 
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> >  arch/arm/common/bL_platsmp.c        | 1 +
> >  arch/arm/common/gic.c               | 6 ++++++
> >  arch/arm/include/asm/hardware/gic.h | 2 ++
> >  3 files changed, 9 insertions(+)
> > 
> > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > index 0ae44123bf..6a3b251b97 100644
> > --- a/arch/arm/common/bL_platsmp.c
> > +++ b/arch/arm/common/bL_platsmp.c
> > @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
> >  	pcpu = mpidr & 0xff;
> >  	pcluster = (mpidr >> 8) & 0xff;
> >  	bL_set_entry_vector(pcpu, pcluster, NULL);
> > +	gic_cpu_if_down();
> 
> I'm starting to sound like a stuck record (and not a very tuneful one at
> that) but... I think you need a barrier here.

And I'm getting puzzled at the repetition.  ;-)


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 16:53     ` Nicolas Pitre
@ 2013-01-14 17:00       ` Will Deacon
  2013-01-14 17:11         ` Catalin Marinas
  2013-01-14 17:15         ` Nicolas Pitre
  0 siblings, 2 replies; 140+ messages in thread
From: Will Deacon @ 2013-01-14 17:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 04:53:41PM +0000, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:42AM +0000, Nicolas Pitre wrote:
> > > If for whatever reason a CPU is unexpectedly awakened, it shouldn't
> > > re-enter the kernel by using whatever entry vector that might have
> > > been set by a previous operation.
> > > 
> > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > ---
> > >  arch/arm/common/bL_platsmp.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > index 0acb9f4685..0ae44123bf 100644
> > > --- a/arch/arm/common/bL_platsmp.c
> > > +++ b/arch/arm/common/bL_platsmp.c
> > > @@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
> > >  
> > >  static void __ref bL_cpu_die(unsigned int cpu)
> > >  {
> > > +	unsigned int mpidr, pcpu, pcluster;
> > > +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > > +	pcpu = mpidr & 0xff;
> > > +	pcluster = (mpidr >> 8) & 0xff;
> > 
> > Usual comment about helper functions :)
> > 
> > > +	bL_set_entry_vector(pcpu, pcluster, NULL);
> > 
> > Similar to the power_on story, you need a barrier here (unless you change
> > your platform_ops API to require barriers).
> 
> The bL_set_entry_vector() includes a cache flush which itself has a DSB.  
> Hence my previous question.

For L1, sure, we always have the dsb for v7. However, for the outer-cache we
only have a dsb by virtue of a spin_unlock in l2x0.c... it seems a bit risky
to rely on that for ordering your entry_vector write with the power_on.

I think the best bet is to put a barrier in power_on, before invoking the
platform_ops function and similarly for power_off.
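
Concretely, something like this in the generic wrapper (a sketch only, and
I'm assuming the platform_ops hook from your patch 02 is called power_up):

	/* in bL_cpu_power_up(), just before the platform callback: */
	dsb();	/* order the entry vector write before the platform op */
	return platform_ops->power_up(cpu, cluster);

with the same kind of dsb() at the start of bL_cpu_power_down().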

What do you reckon?

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-14 16:54     ` Nicolas Pitre
@ 2013-01-14 17:02       ` Will Deacon
  2013-01-14 17:18         ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-14 17:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 04:54:52PM +0000, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:43AM +0000, Nicolas Pitre wrote:
> > > Otherwise there might be some interrupts or IPIs becoming pending and the
> > > CPU will not enter low power mode when doing a WFI.  The effect of this
> > > is a CPU that loops back into the kernel, goes through the first man
> > > election, signals itself as alive, and prevents the cluster from being
> > > shut down.
> > > 
> > > This could benefit from a better solution.
> > > 
> > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > ---
> > >  arch/arm/common/bL_platsmp.c        | 1 +
> > >  arch/arm/common/gic.c               | 6 ++++++
> > >  arch/arm/include/asm/hardware/gic.h | 2 ++
> > >  3 files changed, 9 insertions(+)
> > > 
> > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > index 0ae44123bf..6a3b251b97 100644
> > > --- a/arch/arm/common/bL_platsmp.c
> > > +++ b/arch/arm/common/bL_platsmp.c
> > > @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
> > >  	pcpu = mpidr & 0xff;
> > >  	pcluster = (mpidr >> 8) & 0xff;
> > >  	bL_set_entry_vector(pcpu, pcluster, NULL);
> > > +	gic_cpu_if_down();
> > 
> > I'm starting to sound like a stuck record (and not a very tuneful one at
> > that) but... I think you need a barrier here.
> 
> And I'm getting puzzled at the repetition.  ;-)

Sorry! This case is more interesting though, because you also want to order
the cpu_if_down GIC write so that it completes before we do the power_off.

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
                     ` (3 preceding siblings ...)
  2013-01-11 17:46   ` Santosh Shilimkar
@ 2013-01-14 17:08   ` Dave Martin
  2013-01-14 17:15     ` Catalin Marinas
  4 siblings, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-14 17:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 09, 2013 at 07:20:38PM -0500, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
> 
> This provides helper methods to coordinate between CPUs coming down
> and CPUs going up, as well as documentation on the used algorithms,
> so that cluster teardown and setup
> operations are not done for a cluster simultaneously.

[...]

In response to the incorrectness of the outer cache handling,
here's a supplementary patch:

>From b64f305c90e7ea585992df2d710f62ec6a7b5395 Mon Sep 17 00:00:00 2001
From: Dave Martin <dave.martin@linaro.org>
Date: Mon, 14 Jan 2013 16:25:47 +0000
Subject: [PATCH] ARM: b.L: Fix outer cache handling for coherency setup/exit helpers

This patch addresses the following issues:

  * When invalidating stale data from the cache before a read,
    outer caches must be invalidated _before_ inner caches, not
    after, otherwise stale data may be re-filled from outer to
    inner after the inner cache is flushed.

    We still retain an inner clean before touching the outer cache,
    to avoid stale data being rewritten from there into the outer
    cache after the outer cache is flushed.

  * All the sync_mem() calls synchronise either reads or writes,
    but not both.  This patch splits sync_mem() into separate
    functions for reads and writes, to avoid excessive inner
    flushes in the write case.

    The two functions are different from the original sync_mem(),
    to fix the above issues.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
NOTE: This patch is build-tested only.

 arch/arm/common/bL_entry.c |   57 ++++++++++++++++++++++++++++++++++----------
 1 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 1ea4ec9..3e1a404 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -119,16 +119,47 @@ int bL_cpu_powered_up(void)
 
 struct bL_sync_struct bL_sync;
 
-static void __sync_range(volatile void *p, size_t size)
+/*
+ * Ensure preceding writes to *p by this CPU are visible to
+ * subsequent reads by other CPUs:
+ */
+static void __sync_range_w(volatile void *p, size_t size)
 {
 	char *_p = (char *)p;
 
 	__cpuc_flush_dcache_area(_p, size);
-	outer_flush_range(__pa(_p), __pa(_p + size));
+	outer_clean_range(__pa(_p), __pa(_p + size));
 	outer_sync();
 }
 
-#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
+/*
+ * Ensure preceding writes to *p by other CPUs are visible to
+ * subsequent reads by this CPU:
+ */
+static void __sync_range_r(volatile void *p, size_t size)
+{
+	char *_p = (char *)p;
+
+#ifdef CONFIG_OUTER_CACHE
+	if (outer_cache.flush_range) {
+		/*
+		 * Ensure dirty data migrated from other CPUs into our cache
+		 * are cleaned out safely before the outer cache is cleaned:
+		 */
+		__cpuc_flush_dcache_area(_p, size);
+
+		/* Clean and invalidate stale data for *p from outer ... */
+		outer_flush_range(__pa(_p), __pa(_p + size));
+		outer_sync();
+	}
+#endif
+
+	/* ... and inner cache: */
+	__cpuc_flush_dcache_area(_p, size);
+}
+
+#define sync_w(ptr) __sync_range_w(ptr, sizeof *(ptr))
+#define sync_r(ptr) __sync_range_r(ptr, sizeof *(ptr))
 
 /*
  * __bL_cpu_going_down: Indicates that the cpu is being torn down.
@@ -138,7 +169,7 @@ static void __sync_range(volatile void *p, size_t size)
 void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
 {
 	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
-	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+	sync_w(&bL_sync.clusters[cluster].cpus[cpu].cpu);
 }
 
 /*
@@ -151,7 +182,7 @@ void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
 {
 	dsb();
 	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
-	sync_mem(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+	sync_w(&bL_sync.clusters[cluster].cpus[cpu].cpu);
 	sev();
 }
 
@@ -167,7 +198,7 @@ void __bL_outbound_leave_critical(unsigned int cluster, int state)
 {
 	dsb();
 	bL_sync.clusters[cluster].cluster = state;
-	sync_mem(&bL_sync.clusters[cluster].cluster);
+	sync_w(&bL_sync.clusters[cluster].cluster);
 	sev();
 }
 
@@ -189,10 +220,10 @@ bool __bL_outbound_enter_critical(unsigned int cpu, unsigned int cluster)
 
 	/* Warn inbound CPUs that the cluster is being torn down: */
 	c->cluster = CLUSTER_GOING_DOWN;
-	sync_mem(&c->cluster);
+	sync_w(&c->cluster);
 
 	/* Back out if the inbound cluster is already in the critical region: */
-	sync_mem(&c->inbound);
+	sync_r(&c->inbound);
 	if (c->inbound == INBOUND_COMING_UP)
 		goto abort;
 
@@ -203,7 +234,7 @@ bool __bL_outbound_enter_critical(unsigned int cpu, unsigned int cluster)
 	 * If any CPU has been woken up again from the DOWN state, then we
 	 * shouldn't be taking the cluster down at all: abort in that case.
 	 */
-	sync_mem(&c->cpus);
+	sync_r(&c->cpus);
 	for (i = 0; i < BL_CPUS_PER_CLUSTER; i++) {
 		int cpustate;
 
@@ -216,7 +247,7 @@ bool __bL_outbound_enter_critical(unsigned int cpu, unsigned int cluster)
 				break;
 
 			wfe();
-			sync_mem(&c->cpus[i].cpu);
+			sync_r(&c->cpus[i].cpu);
 		}
 
 		switch (cpustate) {
@@ -239,7 +270,7 @@ abort:
 
 int __bL_cluster_state(unsigned int cluster)
 {
-	sync_mem(&bL_sync.clusters[cluster].cluster);
+	sync_r(&bL_sync.clusters[cluster].cluster);
 	return bL_sync.clusters[cluster].cluster;
 }
 
@@ -267,11 +298,11 @@ int __init bL_cluster_sync_init(void (*power_up_setup)(void))
 	for_each_online_cpu(i)
 		bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
 	bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
-	sync_mem(&bL_sync);
+	sync_w(&bL_sync);
 
 	if (power_up_setup) {
 		bL_power_up_setup_phys = virt_to_phys(power_up_setup);
-		sync_mem(&bL_power_up_setup_phys);
+		sync_w(&bL_power_up_setup_phys);
 	}
 
 	return 0;
-- 
1.7.4.1

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 17:00       ` Will Deacon
@ 2013-01-14 17:11         ` Catalin Marinas
  2013-01-14 17:15         ` Nicolas Pitre
  1 sibling, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-14 17:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 05:00:28PM +0000, Will Deacon wrote:
> On Mon, Jan 14, 2013 at 04:53:41PM +0000, Nicolas Pitre wrote:
> > On Mon, 14 Jan 2013, Will Deacon wrote:
> > 
> > > On Thu, Jan 10, 2013 at 12:20:42AM +0000, Nicolas Pitre wrote:
> > > > If for whatever reason a CPU is unexpectedly awakened, it shouldn't
> > > > re-enter the kernel by using whatever entry vector that might have
> > > > been set by a previous operation.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > > ---
> > > >  arch/arm/common/bL_platsmp.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > > index 0acb9f4685..0ae44123bf 100644
> > > > --- a/arch/arm/common/bL_platsmp.c
> > > > +++ b/arch/arm/common/bL_platsmp.c
> > > > @@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
> > > >  
> > > >  static void __ref bL_cpu_die(unsigned int cpu)
> > > >  {
> > > > +	unsigned int mpidr, pcpu, pcluster;
> > > > +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > > > +	pcpu = mpidr & 0xff;
> > > > +	pcluster = (mpidr >> 8) & 0xff;
> > > 
> > > Usual comment about helper functions :)
> > > 
> > > > +	bL_set_entry_vector(pcpu, pcluster, NULL);
> > > 
> > > Similar to the power_on story, you need a barrier here (unless you change
> > > your platform_ops API to require barriers).
> > 
> > The bL_set_entry_vector() includes a cache flush which itself has a DSB.  
> > Hence my previous question.
> 
> For L1, sure, we always have the dsb for v7. However, for the outer-cache we
> only have a dsb by virtue of a spin_unlock in l2x0.c... it seems a bit risky
> to rely on that for ordering your entry_vector write with the power_on.

I was discussing this with Dave earlier, I think we need to fix the
outer-cache functions even for the UP case to include a barrier (for
PL310 actually we may need to read a register as the cache_wait is a
no-op). We assume that cache functions (both inner and outer) fully
complete the operation before returning and there is no additional need
for barriers.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 17:00       ` Will Deacon
  2013-01-14 17:11         ` Catalin Marinas
@ 2013-01-14 17:15         ` Nicolas Pitre
  2013-01-14 17:23           ` Will Deacon
  2013-01-14 18:26           ` Russell King - ARM Linux
  1 sibling, 2 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 17:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Jan 2013, Will Deacon wrote:

> On Mon, Jan 14, 2013 at 04:53:41PM +0000, Nicolas Pitre wrote:
> > On Mon, 14 Jan 2013, Will Deacon wrote:
> > 
> > > On Thu, Jan 10, 2013 at 12:20:42AM +0000, Nicolas Pitre wrote:
> > > > If for whatever reason a CPU is unexpectedly awakened, it shouldn't
> > > > re-enter the kernel by using whatever entry vector that might have
> > > > been set by a previous operation.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > > ---
> > > >  arch/arm/common/bL_platsmp.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > > index 0acb9f4685..0ae44123bf 100644
> > > > --- a/arch/arm/common/bL_platsmp.c
> > > > +++ b/arch/arm/common/bL_platsmp.c
> > > > @@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
> > > >  
> > > >  static void __ref bL_cpu_die(unsigned int cpu)
> > > >  {
> > > > +	unsigned int mpidr, pcpu, pcluster;
> > > > +	asm ("mrc p15, 0, %0, c0, c0, 5" : "=r" (mpidr));
> > > > +	pcpu = mpidr & 0xff;
> > > > +	pcluster = (mpidr >> 8) & 0xff;
> > > 
> > > Usual comment about helper functions :)
> > > 
> > > > +	bL_set_entry_vector(pcpu, pcluster, NULL);
> > > 
> > > Similar to the power_on story, you need a barrier here (unless you change
> > > your platform_ops API to require barriers).
> > 
> > The bL_set_entry_vector() includes a cache flush which itself has a DSB.  
> > Hence my previous question.
> 
> For L1, sure, we always have the dsb for v7. However, for the outer-cache we
> only have a dsb by virtue of a spin_unlock in l2x0.c... it seems a bit risky
> to rely on that for ordering your entry_vector write with the power_on.
> 
> I think the best bet is to put a barrier in power_on, before invoking the
> platform_ops function and similarly for power_off.
> 
> What do you reckon?

I much prefer adding barriers inside the API when they are needed for 
proper execution of the API intent.  So if I call bL_set_entry_vector(), 
I trust that by the time it returns the vector is indeed set and ready 
for use by other processors.

The same could be said about the outer cache ops.  If a DSB is needed 
for their intent to be valid, then why isn't this DSB always implied by 
the corresponding cache op calls?  And as you say, there is already one 
implied by the spinlock used there, so it is not as if things would 
change much in practice.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-14 17:08   ` Dave Martin
@ 2013-01-14 17:15     ` Catalin Marinas
  2013-01-14 18:10       ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Catalin Marinas @ 2013-01-14 17:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 05:08:51PM +0000, Dave Martin wrote:
> From b64f305c90e7ea585992df2d710f62ec6a7b5395 Mon Sep 17 00:00:00 2001
> From: Dave Martin <dave.martin@linaro.org>
> Date: Mon, 14 Jan 2013 16:25:47 +0000
> Subject: [PATCH] ARM: b.L: Fix outer cache handling for coherency setup/exit helpers
> 
> This patch addresses the following issues:
> 
>   * When invalidating stale data from the cache before a read,
>     outer caches must be invalidated _before_ inner caches, not
>     after, otherwise stale data may be re-filled from outer to
>     inner after the inner cache is flushed.
> 
>     We still retain an inner clean before touching the outer cache,
>     to avoid stale data being rewritten from there into the outer
>     cache after the outer cache is flushed.
> 
>   * All the sync_mem() calls synchronise either reads or writes,
>     but not both.  This patch splits sync_mem() into separate
>     functions for reads and writes, to avoid excessive inner
>     flushes in the write case.
> 
>     The two functions are different from the original sync_mem(),
>     to fix the above issues.
> 
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> ---
> NOTE: This patch is build-tested only.
> 
>  arch/arm/common/bL_entry.c |   57 ++++++++++++++++++++++++++++++++++----------
>  1 files changed, 44 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> index 1ea4ec9..3e1a404 100644
> --- a/arch/arm/common/bL_entry.c
> +++ b/arch/arm/common/bL_entry.c
> @@ -119,16 +119,47 @@ int bL_cpu_powered_up(void)
>  
>  struct bL_sync_struct bL_sync;
>  
> -static void __sync_range(volatile void *p, size_t size)
> +/*
> + * Ensure preceding writes to *p by this CPU are visible to
> + * subsequent reads by other CPUs:
> + */
> +static void __sync_range_w(volatile void *p, size_t size)
>  {
>  	char *_p = (char *)p;
>  
>  	__cpuc_flush_dcache_area(_p, size);
> -	outer_flush_range(__pa(_p), __pa(_p + size));
> +	outer_clean_range(__pa(_p), __pa(_p + size));
>  	outer_sync();

It's not part of your patch but I thought about commenting here. The
outer_clean_range() already has a cache_sync() operation, so no need for
the additional outer_sync().
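
i.e. the write path could simply be (untested):

static void __sync_range_w(volatile void *p, size_t size)
{
	char *_p = (char *)p;

	__cpuc_flush_dcache_area(_p, size);
	outer_clean_range(__pa(_p), __pa(_p + size));	/* includes cache_sync() */
}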

>  }
>  
> -#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> +/*
> + * Ensure preceding writes to *p by other CPUs are visible to
> + * subsequent reads by this CPU:
> + */
> +static void __sync_range_r(volatile void *p, size_t size)
> +{
> +	char *_p = (char *)p;
> +
> +#ifdef CONFIG_OUTER_CACHE
> +	if (outer_cache.flush_range) {
> +		/*
> +		 * Ensure ditry data migrated from other CPUs into our cache
> +		 * are cleaned out safely before the outer cache is cleaned:
> +		 */
> +		__cpuc_flush_dcache_area(_p, size);
> +
> +		/* Clean and invalidate stale data for *p from outer ... */
> +		outer_flush_range(__pa(_p), __pa(_p + size));
> +		outer_sync();

Same here.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-14 17:02       ` Will Deacon
@ 2013-01-14 17:18         ` Nicolas Pitre
  2013-01-14 17:24           ` Will Deacon
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 17:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Jan 2013, Will Deacon wrote:

> On Mon, Jan 14, 2013 at 04:54:52PM +0000, Nicolas Pitre wrote:
> > On Mon, 14 Jan 2013, Will Deacon wrote:
> > 
> > > On Thu, Jan 10, 2013 at 12:20:43AM +0000, Nicolas Pitre wrote:
> > > > Otherwise there might be some interrupts or IPIs becoming pending and the
> > > > CPU will not enter low power mode when doing a WFI.  The effect of this
> > > > is a CPU that loops back into the kernel, goes through the first man
> > > > election, signals itself as alive, and prevents the cluster from being
> > > > shut down.
> > > > 
> > > > This could benefit from a better solution.
> > > > 
> > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > > ---
> > > >  arch/arm/common/bL_platsmp.c        | 1 +
> > > >  arch/arm/common/gic.c               | 6 ++++++
> > > >  arch/arm/include/asm/hardware/gic.h | 2 ++
> > > >  3 files changed, 9 insertions(+)
> > > > 
> > > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > > index 0ae44123bf..6a3b251b97 100644
> > > > --- a/arch/arm/common/bL_platsmp.c
> > > > +++ b/arch/arm/common/bL_platsmp.c
> > > > @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
> > > >  	pcpu = mpidr & 0xff;
> > > >  	pcluster = (mpidr >> 8) & 0xff;
> > > >  	bL_set_entry_vector(pcpu, pcluster, NULL);
> > > > +	gic_cpu_if_down();
> > > 
> > > I'm starting to sound like a stuck record (and not a very tuneful one at
> > > that) but... I think you need a barrier here.
> > 
> > And I'm getting puzzled at the repetition.  ;-)
> 
> Sorry! This case is more interesting though, because you also want to order
> the cpu_if_down GIC write so that it completes before we do the power_off.

In this case I'm leaning toward removing that gic_cpu_if_down() 
entirely.  I'm not convinced it is necessary, and if it is then we 
probably have a bug somewhere else.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 17:15         ` Nicolas Pitre
@ 2013-01-14 17:23           ` Will Deacon
  2013-01-14 18:26           ` Russell King - ARM Linux
  1 sibling, 0 replies; 140+ messages in thread
From: Will Deacon @ 2013-01-14 17:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 05:15:07PM +0000, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Will Deacon wrote:
> > On Mon, Jan 14, 2013 at 04:53:41PM +0000, Nicolas Pitre wrote:
> > > The bL_set_entry_vector() includes a cache flush which itself has a DSB.  
> > > Hence my previous question.
> > 
> > For L1, sure, we always have the dsb for v7. However, for the outer-cache we
> > only have a dsb by virtue of a spin_unlock in l2x0.c... it seems a bit risky
> > to rely on that for ordering your entry_vector write with the power_on.
> > 
> > I think the best bet is to put a barrier in power_on, before invoking the
> > platform_ops function and similarly for power_off.
> > 
> > What do you reckon?
> 
> I much prefer adding barriers inside the API when they are needed for 
> proper execution of the API intent.  So if I call bL_set_entry_vector(), 
> I trust that by the time it returns the vector is indeed set and ready 
> for use by other processors.
> 
> The same could be said about the outer cache ops.  If a DSB is needed 
> for their intent to be valid, then why isn't this DSB always implied by 
> the corresponding cache op calls?  And as you say, there is already one 
> implied by the spinlock used there, so it is not as if things would 
> change much in practice.

Ok, so we can fix the outer_cache functions as suggested by Catalin. That
still leaves the GIC CPU interface problem in the later patch, which uses a
writel_relaxed to disable the CPU interface, so I suppose we can just put
a dsb at the end of gic_cpu_if_down().
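
i.e. assuming the helper added in patch 08 looks roughly like this (sketch only):

void gic_cpu_if_down(void)
{
	void __iomem *cpu_base = gic_data_cpu_base(&gic_data[0]);

	writel_relaxed(0, cpu_base + GIC_CPU_CTRL);
	dsb();	/* make sure the CPU interface is really off before power_off */
}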

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-14 17:18         ` Nicolas Pitre
@ 2013-01-14 17:24           ` Will Deacon
  2013-01-14 17:56             ` Lorenzo Pieralisi
  0 siblings, 1 reply; 140+ messages in thread
From: Will Deacon @ 2013-01-14 17:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 05:18:24PM +0000, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Will Deacon wrote:
> > Sorry! This case is more interesting though, because you also want to order
> > the cpu_if_down GIC write so that it completes before we do the power_off.
> 
> In this case I'm leaning toward removing that gic_cpu_if_down() 
> entirely.  I'm not convinced it is necessary, and if it is then we 
> probably have a bug somewhere else.

Or that :)

Will

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-12 16:47         ` Nicolas Pitre
  2013-01-13  4:37           ` Santosh Shilimkar
@ 2013-01-14 17:53           ` Lorenzo Pieralisi
  1 sibling, 0 replies; 140+ messages in thread
From: Lorenzo Pieralisi @ 2013-01-14 17:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jan 12, 2013 at 04:47:19PM +0000, Nicolas Pitre wrote:
> On Sat, 12 Jan 2013, Santosh Shilimkar wrote:
> 
> > On Saturday 12 January 2013 12:37 AM, Nicolas Pitre wrote:
> > > On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> > > 
> > > > On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > > > > Otherwise there might be some interrupts or IPIs becoming pending and
> > > > > the CPU will not enter low power mode when doing a WFI.  The effect of this
> > > > > is a CPU that loops back into the kernel, goes through the first man
> > > > > election, signals itself as alive, and prevents the cluster from being
> > > > > shut down.
> > > > > 
> > > > > This could benefit from a better solution.
> > > > > 
> > > > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > > > ---
> > > > >    arch/arm/common/bL_platsmp.c        | 1 +
> > > > >    arch/arm/common/gic.c               | 6 ++++++
> > > > >    arch/arm/include/asm/hardware/gic.h | 2 ++
> > > > >    3 files changed, 9 insertions(+)
> > > > > 
> > > > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > > > index 0ae44123bf..6a3b251b97 100644
> > > > > --- a/arch/arm/common/bL_platsmp.c
> > > > > +++ b/arch/arm/common/bL_platsmp.c
> > > > > @@ -68,6 +68,7 @@ static void __ref bL_cpu_die(unsigned int cpu)
> > > > >    	pcpu = mpidr & 0xff;
> > > > >    	pcluster = (mpidr >> 8) & 0xff;
> > > > >    	bL_set_entry_vector(pcpu, pcluster, NULL);
> > > > > +	gic_cpu_if_down();
> > > > 
> > > > So for a case where CPU still don't power down for some reason even
> > > > after CPU interface is disabled, can not listen to and SGI or PPI.
> > > > Not sure if this happens on big.LITTLE but i have seen one such issue
> > > > on Cortex-A9 based SOC.
> > > 
> > > Here the problem was the reverse i.e. a CPU wouldn't go down because
> > > some pending SGIs prevented that.
> > > 
> > I understood that part. What I was saying is, with the CPU IF disabled,
> > if the CPU doesn't enter the intended low power state and the wakeup
> > mechanism on that CPU is an SGI/SPI, the CPU may never wake up, which can
> > lead to deadlock. I have seen this scenario on OMAP, especially in the CPUidle path.
> 
> Obviously, on the CPU idle path, you should not turn off the GIC 
> interface as this might lose the ability to wake the CPU up with a 
> pending interrupt, if your system is so configured.

That's platform specific. On TC2, turning the GIC CPU IF off is pivotal, otherwise
the CPU receiving an IRQ can complete wfi and be reset by firmware when
executing in the middle of nowhere, leading to a system lock-up.

Disabling the GIC CPU IF must not be added to the gic_cpu_save() code,
but we do need a helper function to disable the CPU IF for platforms
that need this to happen to function properly (eg TC2).

Lorenzo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled
  2013-01-14 17:24           ` Will Deacon
@ 2013-01-14 17:56             ` Lorenzo Pieralisi
  0 siblings, 0 replies; 140+ messages in thread
From: Lorenzo Pieralisi @ 2013-01-14 17:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 05:24:01PM +0000, Will Deacon wrote:
> On Mon, Jan 14, 2013 at 05:18:24PM +0000, Nicolas Pitre wrote:
> > On Mon, 14 Jan 2013, Will Deacon wrote:
> > > Sorry! This case is more interesting though, because you also want to order
> > > the cpu_if_down GIC write so that it completes before we do the power_off.
> > 
> > In this case I'm leaning toward removing that gic_cpu_if_down() 
> > entirely.  I'm not convinced it is necessary, and if it is then we 
> > probably have a bug somewhere else.
> 
> Or that :)

In the CPU idle code path (cpu_suspend) we do need to turn off the GIC CPU IF,
hence we will have to cross that bridge when we come to it. Very soon.

Lorenzo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-11 18:02   ` Santosh Shilimkar
@ 2013-01-14 18:05     ` Achin Gupta
  2013-01-15  6:32       ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Achin Gupta @ 2013-01-14 18:05 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Santosh,

On Fri, Jan 11, 2013 at 6:02 PM, Santosh Shilimkar
<santosh.shilimkar@ti.com> wrote:
> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>
>> Now that the b.L power API is in place, we can use it for SMP secondary
>> bringup and CPU hotplug in a generic fashion.
>>
>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>> ---
>>   arch/arm/common/Makefile     |  2 +-
>>   arch/arm/common/bL_platsmp.c | 79
>> ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 80 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/arm/common/bL_platsmp.c
>>
>> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
>> index 894c2ddf9b..59b36db7cc 100644
>> --- a/arch/arm/common/Makefile
>> +++ b/arch/arm/common/Makefile
>> @@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>>   obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
>>   obj-$(CONFIG_FIQ_GLUE)                += fiq_glue.o fiq_glue_setup.o
>>   obj-$(CONFIG_FIQ_DEBUGGER)    += fiq_debugger.o
>> -obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o vlock.o
>> +obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o bL_platsmp.o
>> vlock.o
>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
>> new file mode 100644
>> index 0000000000..0acb9f4685
>> --- /dev/null
>> +++ b/arch/arm/common/bL_platsmp.c
>> @@ -0,0 +1,79 @@
>> +/*
>> + * linux/arch/arm/mach-vexpress/bL_platsmp.c
>> + *
>> + * Created by:  Nicolas Pitre, November 2012
>> + * Copyright:   (C) 2012  Linaro Limited
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * Code to handle secondary CPU bringup and hotplug for the bL power API.
>> + */
>> +
>> +#include <linux/init.h>
>> +#include <linux/smp.h>
>> +
>> +#include <asm/bL_entry.h>
>> +#include <asm/smp_plat.h>
>> +#include <asm/hardware/gic.h>
>> +
>> +static void __init simple_smp_init_cpus(void)
>> +{
>> +       set_smp_cross_call(gic_raise_softirq);
>> +}
>> +
>> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct
>> task_struct *idle)
>> +{
>> +       unsigned int pcpu, pcluster, ret;
>> +       extern void secondary_startup(void);
>> +
>> +       pcpu = cpu_logical_map(cpu) & 0xff;
>> +       pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
>> +       pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
>> +                __func__, cpu, pcpu, pcluster);
>> +
>> +       bL_set_entry_vector(pcpu, pcluster, NULL);
>> +       ret = bL_cpu_power_up(pcpu, pcluster);
>> +       if (ret)
>> +               return ret;
>> +       bL_set_entry_vector(pcpu, pcluster, secondary_startup);
>> +       gic_raise_softirq(cpumask_of(cpu), 0);
>> +       sev();
>
> softirq() should be enough to break a CPU out of standby if it is in a
> wfe state. Is that additional sev() needed here?

Not if the target CPU has its I & F bits disabled, and that would be the
case with a secondary waiting to be woken up.

thanks,
Achin

> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-14 17:15     ` Catalin Marinas
@ 2013-01-14 18:10       ` Dave Martin
  2013-01-14 21:34         ` Catalin Marinas
  0 siblings, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-14 18:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 05:15:28PM +0000, Catalin Marinas wrote:
> On Mon, Jan 14, 2013 at 05:08:51PM +0000, Dave Martin wrote:
> > From b64f305c90e7ea585992df2d710f62ec6a7b5395 Mon Sep 17 00:00:00 2001
> > From: Dave Martin <dave.martin@linaro.org>
> > Date: Mon, 14 Jan 2013 16:25:47 +0000
> > Subject: [PATCH] ARM: b.L: Fix outer cache handling for coherency setup/exit helpers
> > 
> > This patch addresses the following issues:
> > 
> >   * When invalidating stale data from the cache before a read,
> >     outer caches must be invalidated _before_ inner caches, not
> >     after, otherwise stale data may be re-filled from outer to
> >     inner after the inner cache is flushed.
> > 
> >     We still retain an inner clean before touching the outer cache,
> >     to avoid stale data being rewritten from there into the outer
> >     cache after the outer cache is flushed.
> > 
> >   * All the sync_mem() calls synchronise either reads or writes,
> >     but not both.  This patch splits sync_mem() into separate
> >     functions for reads and writes, to avoid excessive inner
> >     flushes in the write case.
> > 
> >     The two functions are different from the original sync_mem(),
> >     to fix the above issues.
> > 
> > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> > ---
> > NOTE: This patch is build-tested only.
> > 
> >  arch/arm/common/bL_entry.c |   57 ++++++++++++++++++++++++++++++++++----------
> >  1 files changed, 44 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > index 1ea4ec9..3e1a404 100644
> > --- a/arch/arm/common/bL_entry.c
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -119,16 +119,47 @@ int bL_cpu_powered_up(void)
> >  
> >  struct bL_sync_struct bL_sync;
> >  
> > -static void __sync_range(volatile void *p, size_t size)
> > +/*
> > + * Ensure preceding writes to *p by this CPU are visible to
> > + * subsequent reads by other CPUs:
> > + */
> > +static void __sync_range_w(volatile void *p, size_t size)
> >  {
> >  	char *_p = (char *)p;
> >  
> >  	__cpuc_flush_dcache_area(_p, size);
> > -	outer_flush_range(__pa(_p), __pa(_p + size));
> > +	outer_clean_range(__pa(_p), __pa(_p + size));
> >  	outer_sync();
> 
> It's not part of your patch but I thought about commenting here. The
> outer_clean_range() already has a cache_sync() operation, so no need for
> the additional outer_sync().
> 
> >  }
> >  
> > -#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> > +/*
> > + * Ensure preceding writes to *p by other CPUs are visible to
> > + * subsequent reads by this CPU:
> > + */
> > +static void __sync_range_r(volatile void *p, size_t size)
> > +{
> > +	char *_p = (char *)p;
> > +
> > +#ifdef CONFIG_OUTER_CACHE
> > +	if (outer_cache.flush_range) {
> > +		/*
> > +		 * Ensure ditry data migrated from other CPUs into our cache
> > +		 * are cleaned out safely before the outer cache is cleaned:
> > +		 */
> > +		__cpuc_flush_dcache_area(_p, size);
> > +
> > +		/* Clean and invalidate stale data for *p from outer ... */
> > +		outer_flush_range(__pa(_p), __pa(_p + size));
> > +		outer_sync();
> 
> Same here.

Ah, right.  I've seen code do this in various places, and just copy-
pasted it under the assumption that it is needed.  Our discussion about
ensuring that outer_sync() really does guarantee completion of its
effects on return still applies.

Are there any situations when outer_sync() should be called explicitly?

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 17:15         ` Nicolas Pitre
  2013-01-14 17:23           ` Will Deacon
@ 2013-01-14 18:26           ` Russell King - ARM Linux
  2013-01-14 18:49             ` Nicolas Pitre
  2013-01-15 18:40             ` Dave Martin
  1 sibling, 2 replies; 140+ messages in thread
From: Russell King - ARM Linux @ 2013-01-14 18:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 12:15:07PM -0500, Nicolas Pitre wrote:
> The same could be said about the outer cache ops.  If a DSB is needed 
> for their intent to be valid, then why isn't this DSB always implied by 
> the corresponding cache op calls?

Hmm, just been thinking about this.

The L2x0 calls do contain a DSB but it's not obvious.  They hold a
raw spinlock, and when that spinlock is dropped, we issue a dsb and
sev instruction.

Whether the other L2 implementations do this or not I'm not sure -
but the above is a requirement of the spinlock implementation, and
it just happens to provide the right behaviour for L2x0.

But... we _probably_ don't want to impose that down at the L2 cache
level of things - at least not for DMA ops, particularly for the sanity
of the scatter-list operations.  We really want to avoid
doing one DSB per scatterlist entry, and do one DSB per scatterlist
operation instead.

That does affect how the L2 cache API gets used - maybe we want to
separate out the DMA stuff from the other users so that we can have
dsbs in that path for non-DMA users.

Thoughts?

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 18:26           ` Russell King - ARM Linux
@ 2013-01-14 18:49             ` Nicolas Pitre
  2013-01-15 18:40             ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-14 18:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 14 Jan 2013, Russell King - ARM Linux wrote:

> On Mon, Jan 14, 2013 at 12:15:07PM -0500, Nicolas Pitre wrote:
> > The same could be said about the outer cache ops.  If a DSB is needed 
> > for their intent to be valid, then why isn't this DSB always implied by 
> > the corresponding cache op calls?
> 
> Hmm, just been thinking about this.
> 
> The L2x0 calls do contain a DSB but it's not obvious.  They hold a
> raw spinlock, and when that spinlock is dropped, we issue a dsb and
> sev instruction.
> 
> Whether the other L2 implementations do this or not I'm not sure -
> but the above is a requirement of the spinlock implementation, and
> it just happens to provide the right behaviour for L2x0.
> 
> But... we _probably_ don't want to impose that down at the L2 cache
> level of things - at least not for DMA ops, particularly for the sanity
> of the scatter-list operations.  We really want to avoid
> doing one DSB per scatterlist entry, and do one DSB per scatterlist
> operation instead.
> 
> That does affect how the L2 cache API gets used - maybe we want to
> separate out the DMA stuff from the other users so that we can have
> dsbs in that path for non-DMA users.
> 
> Thoughts?

The dsb or its intended effect could be confined to outer_sync() and 
then cache_sync() removed from l2x0_flush_range().  That would allow the 
sync to be applied when appropriate.  However that suffers from the same 
API intent mismatch I was talking about.

Maybe adding some asynchronous methods to outer_cache (that could 
default to the synchronous calls) where the name of the function clearly 
implies a posted operation would be a better solution.  In that case the 
effect of the operation would be assumed complete only after a 
terminating outer_sync().
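
For illustration only (names invented here, and this assumes a new
clean_range_nosync member added to struct outer_cache_fns):

/*
 * Posted variant: the effect is only guaranteed complete after a
 * subsequent outer_sync().  Defaults to the existing synchronous op
 * when the outer cache driver provides nothing better.
 */
static inline void outer_clean_range_nosync(unsigned long start,
					    unsigned long end)
{
	if (outer_cache.clean_range_nosync)
		outer_cache.clean_range_nosync(start, end);
	else if (outer_cache.clean_range)
		outer_cache.clean_range(start, end);
}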


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup
  2013-01-14 18:10       ` Dave Martin
@ 2013-01-14 21:34         ` Catalin Marinas
  0 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-14 21:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 06:10:06PM +0000, Dave Martin wrote:
> On Mon, Jan 14, 2013 at 05:15:28PM +0000, Catalin Marinas wrote:
> > On Mon, Jan 14, 2013 at 05:08:51PM +0000, Dave Martin wrote:
> > > From b64f305c90e7ea585992df2d710f62ec6a7b5395 Mon Sep 17 00:00:00 2001
> > > From: Dave Martin <dave.martin@linaro.org>
> > > Date: Mon, 14 Jan 2013 16:25:47 +0000
> > > Subject: [PATCH] ARM: b.L: Fix outer cache handling for coherency setup/exit helpers
> > > 
> > > This patch addresses the following issues:
> > > 
> > >   * When invalidating stale data from the cache before a read,
> > >     outer caches must be invalidated _before_ inner caches, not
> > >     after, otherwise stale data may be re-filled from outer to
> > >     inner after the inner cache is flushed.
> > > 
> > >     We still retain an inner clean before touching the outer cache,
> > >     to avoid stale data being rewritten from there into the outer
> > >     cache after the outer cache is flushed.
> > > 
> > >   * All the sync_mem() calls synchronise either reads or writes,
> > >     but not both.  This patch splits sync_mem() into separate
> > >     functions for reads and writes, to avoid excessive inner
> > >     flushes in the write case.
> > > 
> > >     The two functions are different from the original sync_mem(),
> > >     to fix the above issues.
> > > 
> > > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> > > ---
> > > NOTE: This patch is build-tested only.
> > > 
> > >  arch/arm/common/bL_entry.c |   57 ++++++++++++++++++++++++++++++++++----------
> > >  1 files changed, 44 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > > index 1ea4ec9..3e1a404 100644
> > > --- a/arch/arm/common/bL_entry.c
> > > +++ b/arch/arm/common/bL_entry.c
> > > @@ -119,16 +119,47 @@ int bL_cpu_powered_up(void)
> > >  
> > >  struct bL_sync_struct bL_sync;
> > >  
> > > -static void __sync_range(volatile void *p, size_t size)
> > > +/*
> > > + * Ensure preceding writes to *p by this CPU are visible to
> > > + * subsequent reads by other CPUs:
> > > + */
> > > +static void __sync_range_w(volatile void *p, size_t size)
> > >  {
> > >  	char *_p = (char *)p;
> > >  
> > >  	__cpuc_flush_dcache_area(_p, size);
> > > -	outer_flush_range(__pa(_p), __pa(_p + size));
> > > +	outer_clean_range(__pa(_p), __pa(_p + size));
> > >  	outer_sync();
> > 
> > It's not part of your patch but I thought about commenting here. The
> > outer_clean_range() already has a cache_sync() operation, so no need for
> > the additional outer_sync().
> > 
> > >  }
> > >  
> > > -#define sync_mem(ptr) __sync_range(ptr, sizeof *(ptr))
> > > +/*
> > > + * Ensure preceding writes to *p by other CPUs are visible to
> > > + * subsequent reads by this CPU:
> > > + */
> > > +static void __sync_range_r(volatile void *p, size_t size)
> > > +{
> > > +	char *_p = (char *)p;
> > > +
> > > +#ifdef CONFIG_OUTER_CACHE
> > > +	if (outer_cache.flush_range) {
> > > +		/*
> > > +		 * Ensure ditry data migrated from other CPUs into our cache
> > > +		 * are cleaned out safely before the outer cache is cleaned:
> > > +		 */
> > > +		__cpuc_flush_dcache_area(_p, size);
> > > +
> > > +		/* Clean and invalidate stale data for *p from outer ... */
> > > +		outer_flush_range(__pa(_p), __pa(_p + size));
> > > +		outer_sync();
> > 
> > Same here.
> 
> Ah, right.  I've seen code do this in various places, and just copy-
> pasted it under the assumption that it is needed.  Our discussion about
> ensuring that outer_sync() really does guarantee completion of its
> effects on return still applies.
> 
> Are there any situations when outer_sync() should be called explicitly?

outer_sync() on its own ensures the draining of the PL310 write buffer.
DSB drains the CPU write buffers but PL310 doesn't detect it, so a
separate outer_sync() is needed. In general this is required when you
write a Normal Non-cacheable buffer (but bufferable, e.g. DMA coherent)
and you want to ensure data visibility (DSB+outer_sync(), that's what
the mb() macro does).
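
For reference, that is (from memory of asm/barrier.h, when outer cache
support is in the picture):

	#define mb()	do { dsb(); outer_sync(); } while (0)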

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-14 14:05         ` Nicolas Pitre
@ 2013-01-15  2:44             ` Joseph Lo
  -1 siblings, 0 replies; 140+ messages in thread
From: Joseph Lo @ 2013-01-15  2:44 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Mon, 2013-01-14 at 22:05 +0800, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Joseph Lo wrote:
> 
> > Hi Nicolas,
> > 
> > On Thu, 2013-01-10 at 08:20 +0800, Nicolas Pitre wrote:
> > > This is the initial public posting of the initial support for big.LITTLE.
> > > Included here is the code required to safely power up and down CPUs in a
> > > b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> > > Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> > > boot and CPU hotplug support is included at this time.  Getting to this
> > > point already represents a significcant chunk of code as illustrated by
> > > the diffstat below.
> > > 
> > > 
> > 
> > Thanks for introducing this series.
> > I am taking a look at this series. It introduces an algorithm for
> > syncing and avoiding races when syncing the power status of clusters and
> > CPUs. Do you think this code could have a chance to become a generic
> > framework?
> 
> Yes.  As I mentioned before, the bL_ prefix is implied only by the fact 
> that big.LITTLE was the motivation for creating this code.
> 
> > The Tegra chip series had a similar design for CPU clusters and it had
> > a limitation that CPU0 always needs to be the last CPU to be shut down
> > before cluster power down as well. I believe it can also benefit from
> > this work. We indeed need a similar algorithm to sync CPU power status
> > before cluster power down and switching.
> > 
> > The "bL_entry.c", "bL_entry.S", "bL_entry.h", "vlock.h" and "vlock.S"
> > look like they have a chance to be a common framework for ARM platforms even
> > if they just support one cluster, because some systems have limitations on
> > cluster power down. That's why coupled cpuidle was introduced. And
> > this framework could be enabled automatically where the platform needs it, or
> > via menuconfig.
> 
> Absolutely.
> 
So do you have a plan to make it a generic framework in this series or
in later work?

(And I will add some common power sync wrapper functions based on
this framework for the whole Tegra series.)

> 
> > For ex,
> > 	select CPUS_CLUSTERS_POWER_SYNC_FRAMEWORK if SMP && CPU_PM
> > 
> > How do you think of this suggestion?
> 
> I'd prefer a more concise name though.
> 
Sure. :-)

Thanks,
Joseph

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-14 12:25         ` Lorenzo Pieralisi
@ 2013-01-15  6:23           ` Santosh Shilimkar
  2013-01-15 18:20             ` Dave Martin
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-15  6:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday 14 January 2013 05:55 PM, Lorenzo Pieralisi wrote:
> On Sat, Jan 12, 2013 at 07:21:24AM +0000, Santosh Shilimkar wrote:
>> On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
>>> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>>>
>>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>>> From: Dave Martin <dave.martin@linaro.org>
>>>>>
>>>>> +		/*
>>>>> +		 * Flush the local CPU cache.
>>>>> +		 *
>>>>> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
>>>>> +		 * a preliminary flush here for those CPUs.  At least, that's
>>>>> +		 * the theory -- without the extra flush, Linux explodes on
>>>>> +		 * RTSM (maybe not needed anymore, to be investigated).
>>>>> +		 */
>>>> This is expected if the entire code is not in one stack frame and the
>>>> additional flush is needed to avoid possible stack corruption. This
>>>> issue has been discussed in past on the list.
>>>
>>> I missed that.  Do you have a reference or pointer handy?
>>>
>>> What is strange is that this is 100% reproducible on RTSM while this
>>> apparently is not an issue on real hardware so far.
>>>
>> I tried searching archives and realized the discussion was in private
>> email thread. There are some bits and pieces on list but not all the
>> information.
>>
>> The main issue RMK pointed out is- An additional L1 flush needed
>> to avoid the effective change of view of memory when the C bit is
>> turned off, and the cache is no longer searched for local CPU accesses.
>>
>> In your case dcscb_power_down() has updated the stack which can be hit
>> in cache line and hence cache is dirty now. Then cpu_proc_fin() clears
>> the C-bit and hence for sub sequent calls the L1 cache won't be
>> searched. You then call flush_cache_all() which again updates the
>> stack but avoids searching the L1 cache. So it overwrites previous
>> saved stack frame. This seems to be an issue in your case as well.
>
> On A15/A7 even with the C bit cleared the D-cache is searched, the
> situation above cannot happen and if it does we are facing a HW/model bug.
> If this code is run on A9 then we have a problem since there, when the C bit
> is cleared D-cache is not searched (and that's why the sequence above
> should be written in assembly with no data access whatsoever), but on
> A15/A7 we do not.
>
Good point. Maybe the model has modelled A9 rather than A15 behaviour, but
in either case, let's be consistent for all ARMv7 machines, at least to
avoid people debugging similar issues. Many machines share code for ARMv7
processors, so the best thing is to stick to the sequence which works
across all ARMv7 processors.

> I have been running this code on TC2 for hours on end with nary a problem.
>
Thanks for the additional information.

> The sequence:
>
> - clear C bit
> - clean D-cache
> - exit SMP
>
> must be written in assembly with no data access whatsoever to make it
> portable across v7 implementations. I think I will write some docs and
> add them to the kernel to avoid further discussion on this topic.
>
The best thing would be to update the ARM Architecture Reference Manual,
because that is what most OS vendors refer to most of the time.

> FYI, the thread Santosh mentioned:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-May/099791.html
>
Yes, this is one of the relevant threads. Thanks.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-14 18:05     ` Achin Gupta
@ 2013-01-15  6:32       ` Santosh Shilimkar
  2013-01-15 11:18         ` Achin Gupta
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-15  6:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday 14 January 2013 11:35 PM, Achin Gupta wrote:
> Hi Santosh,
>
> On Fri, Jan 11, 2013 at 6:02 PM, Santosh Shilimkar
> <santosh.shilimkar@ti.com> wrote:
>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>
>>> Now that the b.L power API is in place, we can use it for SMP secondary
>>> bringup and CPU hotplug in a generic fashion.
>>>
>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>> ---
>>>    arch/arm/common/Makefile     |  2 +-
>>>    arch/arm/common/bL_platsmp.c | 79
>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>    2 files changed, 80 insertions(+), 1 deletion(-)
>>>    create mode 100644 arch/arm/common/bL_platsmp.c
>>>
>>> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
>>> index 894c2ddf9b..59b36db7cc 100644
>>> --- a/arch/arm/common/Makefile
>>> +++ b/arch/arm/common/Makefile
>>> @@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>>>    obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
>>>    obj-$(CONFIG_FIQ_GLUE)                += fiq_glue.o fiq_glue_setup.o
>>>    obj-$(CONFIG_FIQ_DEBUGGER)    += fiq_debugger.o
>>> -obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o vlock.o
>>> +obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o bL_platsmp.o
>>> vlock.o
>>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
>>> new file mode 100644
>>> index 0000000000..0acb9f4685
>>> --- /dev/null
>>> +++ b/arch/arm/common/bL_platsmp.c
>>> @@ -0,0 +1,79 @@
>>> +/*
>>> + * linux/arch/arm/mach-vexpress/bL_platsmp.c
>>> + *
>>> + * Created by:  Nicolas Pitre, November 2012
>>> + * Copyright:   (C) 2012  Linaro Limited
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License version 2 as
>>> + * published by the Free Software Foundation.
>>> + *
>>> + * Code to handle secondary CPU bringup and hotplug for the bL power API.
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/smp.h>
>>> +
>>> +#include <asm/bL_entry.h>
>>> +#include <asm/smp_plat.h>
>>> +#include <asm/hardware/gic.h>
>>> +
>>> +static void __init simple_smp_init_cpus(void)
>>> +{
>>> +       set_smp_cross_call(gic_raise_softirq);
>>> +}
>>> +
>>> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct
>>> task_struct *idle)
>>> +{
>>> +       unsigned int pcpu, pcluster, ret;
>>> +       extern void secondary_startup(void);
>>> +
>>> +       pcpu = cpu_logical_map(cpu) & 0xff;
>>> +       pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
>>> +       pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
>>> +                __func__, cpu, pcpu, pcluster);
>>> +
>>> +       bL_set_entry_vector(pcpu, pcluster, NULL);
>>> +       ret = bL_cpu_power_up(pcpu, pcluster);
>>> +       if (ret)
>>> +               return ret;
>>> +       bL_set_entry_vector(pcpu, pcluster, secondary_startup);
>>> +       gic_raise_softirq(cpumask_of(cpu), 0);
>>> +       sev();
>>
>> softirq() should be enough to break a CPU if it is in standby with
>> wfe state. Is that additional sev() needed here ?
>
> Not if the target cpu has its I & F bits disabled and that would be the
> case with a secondary waiting to be woken up
>
This is interesting, since the CPU is actually in a standby state and this
was not my understanding so far. Your statement at least contradicts
the ARM ARM (B1.8.12, Wait For Interrupt):
-----------------------
The processor can remain in the WFI low-power state until it is reset,
or it detects one of the following WFI wake-up events:
• a physical IRQ interrupt, regardless of the value of the CPSR.I bit
• a physical FIQ interrupt, regardless of the value of the CPSR.F bit
----------------------------------

Are you referring to some new behavior on the latest ARMv7 CPUs?

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-15  6:32       ` Santosh Shilimkar
@ 2013-01-15 11:18         ` Achin Gupta
  2013-01-15 11:26           ` Santosh Shilimkar
  2013-01-15 18:53           ` Dave Martin
  0 siblings, 2 replies; 140+ messages in thread
From: Achin Gupta @ 2013-01-15 11:18 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Santosh,

On Tue, Jan 15, 2013 at 6:32 AM, Santosh Shilimkar
<santosh.shilimkar@ti.com> wrote:
> On Monday 14 January 2013 11:35 PM, Achin Gupta wrote:
>>
>> Hi Santosh,
>>
>> On Fri, Jan 11, 2013 at 6:02 PM, Santosh Shilimkar
>> <santosh.shilimkar@ti.com> wrote:
>>>
>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>>
>>>>
>>>> Now that the b.L power API is in place, we can use it for SMP secondary
>>>> bringup and CPU hotplug in a generic fashion.
>>>>
>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>> ---
>>>>    arch/arm/common/Makefile     |  2 +-
>>>>    arch/arm/common/bL_platsmp.c | 79
>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 80 insertions(+), 1 deletion(-)
>>>>    create mode 100644 arch/arm/common/bL_platsmp.c
>>>>
>>>> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
>>>> index 894c2ddf9b..59b36db7cc 100644
>>>> --- a/arch/arm/common/Makefile
>>>> +++ b/arch/arm/common/Makefile
>>>> @@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>>>>    obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
>>>>    obj-$(CONFIG_FIQ_GLUE)                += fiq_glue.o fiq_glue_setup.o
>>>>    obj-$(CONFIG_FIQ_DEBUGGER)    += fiq_debugger.o
>>>> -obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o vlock.o
>>>> +obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o bL_platsmp.o
>>>> vlock.o
>>>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
>>>> new file mode 100644
>>>> index 0000000000..0acb9f4685
>>>> --- /dev/null
>>>> +++ b/arch/arm/common/bL_platsmp.c
>>>> @@ -0,0 +1,79 @@
>>>> +/*
>>>> + * linux/arch/arm/mach-vexpress/bL_platsmp.c
>>>> + *
>>>> + * Created by:  Nicolas Pitre, November 2012
>>>> + * Copyright:   (C) 2012  Linaro Limited
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License version 2 as
>>>> + * published by the Free Software Foundation.
>>>> + *
>>>> + * Code to handle secondary CPU bringup and hotplug for the bL power
>>>> API.
>>>> + */
>>>> +
>>>> +#include <linux/init.h>
>>>> +#include <linux/smp.h>
>>>> +
>>>> +#include <asm/bL_entry.h>
>>>> +#include <asm/smp_plat.h>
>>>> +#include <asm/hardware/gic.h>
>>>> +
>>>> +static void __init simple_smp_init_cpus(void)
>>>> +{
>>>> +       set_smp_cross_call(gic_raise_softirq);
>>>> +}
>>>> +
>>>> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct
>>>> task_struct *idle)
>>>> +{
>>>> +       unsigned int pcpu, pcluster, ret;
>>>> +       extern void secondary_startup(void);
>>>> +
>>>> +       pcpu = cpu_logical_map(cpu) & 0xff;
>>>> +       pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
>>>> +       pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
>>>> +                __func__, cpu, pcpu, pcluster);
>>>> +
>>>> +       bL_set_entry_vector(pcpu, pcluster, NULL);
>>>> +       ret = bL_cpu_power_up(pcpu, pcluster);
>>>> +       if (ret)
>>>> +               return ret;
>>>> +       bL_set_entry_vector(pcpu, pcluster, secondary_startup);
>>>> +       gic_raise_softirq(cpumask_of(cpu), 0);
>>>> +       sev();
>>>
>>>
>>> softirq() should be enough to break a CPU if it is in standby with
>>> wfe state. Is that additional sev() needed here ?
>>
>>
>> Not if the target cpu has its I & F bits disabled and that would be the
>> case with a secondary waiting to be woken up
>>
> This is interesting since CPU is actually in standby state and this
> was not my understanding so far. Your statement at least contradicts
> the ARM ARM (B1.8.12 Wait For Interrupt)
> -----------------------
> The processor can remain in the WFI low-power state until it is reset, or it
> detects one of the following WFI wake-up
> events:
> • a physical IRQ interrupt, regardless of the value of the CPSR.I bit
> • a physical FIQ interrupt, regardless of the value of the CPSR.F bit
> ----------------------------------
>
> Are you referring to some new behavior on latest ARMv7 CPUs ?

You are absolutely right about the 'wfi' behaviour. I was talking about the
effect of interrupts on a CPU that is in 'wfe'.

The power-up process takes place in two steps. The first step involves
sending an IPI, which will either:

a. cause the power controller to bring the processor out of reset, or
b. cause the processor to exit from wfi (most probably in the bootloader code).

The CPU then enters Linux (bL_entry_point) and, after doing any cluster setup,
waits in 'wfe' if its 'bL_entry_vector' has not been set yet. The 'sev' is
meant to poke the CPU once the vector has been written.

It's not required in this case, as we have already set 'bL_entry_vector',
issued a barrier and flushed the cache line. So if the incoming CPU sees a 0
in its vector, that would be a symptom of a different problem.
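
For illustration, a rough C rendering of the gate described above might look
like this (a hypothetical sketch only -- the real implementation is the
MMU-off assembly in bL_head.S, and the names below are assumed):

	typedef void (*bL_entry_fn)(void);

	static void bL_wait_at_gate_sketch(volatile bL_entry_fn *vector)
	{
		bL_entry_fn entry;

		/* Gate closed: the entry vector is still NULL, so doze in WFE. */
		while (!(entry = *vector))
			asm volatile("wfe");

		/* Gate open: jump to the vector published by the booting CPU. */
		entry();
	}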

Thanks,
Achin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-15 11:18         ` Achin Gupta
@ 2013-01-15 11:26           ` Santosh Shilimkar
  2013-01-15 18:53           ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-15 11:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 15 January 2013 04:48 PM, Achin Gupta wrote:
> Hi Santosh,
>
> On Tue, Jan 15, 2013 at 6:32 AM, Santosh Shilimkar
> <santosh.shilimkar@ti.com> wrote:
>> On Monday 14 January 2013 11:35 PM, Achin Gupta wrote:
>>>
>>> Hi Santosh,
>>>
>>> On Fri, Jan 11, 2013 at 6:02 PM, Santosh Shilimkar
>>> <santosh.shilimkar@ti.com> wrote:
>>>>
>>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>>>
>>>>>
>>>>> Now that the b.L power API is in place, we can use it for SMP secondary
>>>>> bringup and CPU hotplug in a generic fashion.
>>>>>
>>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>>> ---
>>>>>     arch/arm/common/Makefile     |  2 +-
>>>>>     arch/arm/common/bL_platsmp.c | 79
>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>     2 files changed, 80 insertions(+), 1 deletion(-)
>>>>>     create mode 100644 arch/arm/common/bL_platsmp.c
>>>>>
>>>>> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
>>>>> index 894c2ddf9b..59b36db7cc 100644
>>>>> --- a/arch/arm/common/Makefile
>>>>> +++ b/arch/arm/common/Makefile
>>>>> @@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
>>>>>     obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
>>>>>     obj-$(CONFIG_FIQ_GLUE)                += fiq_glue.o fiq_glue_setup.o
>>>>>     obj-$(CONFIG_FIQ_DEBUGGER)    += fiq_debugger.o
>>>>> -obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o vlock.o
>>>>> +obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o bL_platsmp.o
>>>>> vlock.o
>>>>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
>>>>> new file mode 100644
>>>>> index 0000000000..0acb9f4685
>>>>> --- /dev/null
>>>>> +++ b/arch/arm/common/bL_platsmp.c
>>>>> @@ -0,0 +1,79 @@
>>>>> +/*
>>>>> + * linux/arch/arm/mach-vexpress/bL_platsmp.c
>>>>> + *
>>>>> + * Created by:  Nicolas Pitre, November 2012
>>>>> + * Copyright:   (C) 2012  Linaro Limited
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or modify
>>>>> + * it under the terms of the GNU General Public License version 2 as
>>>>> + * published by the Free Software Foundation.
>>>>> + *
>>>>> + * Code to handle secondary CPU bringup and hotplug for the bL power
>>>>> API.
>>>>> + */
>>>>> +
>>>>> +#include <linux/init.h>
>>>>> +#include <linux/smp.h>
>>>>> +
>>>>> +#include <asm/bL_entry.h>
>>>>> +#include <asm/smp_plat.h>
>>>>> +#include <asm/hardware/gic.h>
>>>>> +
>>>>> +static void __init simple_smp_init_cpus(void)
>>>>> +{
>>>>> +       set_smp_cross_call(gic_raise_softirq);
>>>>> +}
>>>>> +
>>>>> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct
>>>>> task_struct *idle)
>>>>> +{
>>>>> +       unsigned int pcpu, pcluster, ret;
>>>>> +       extern void secondary_startup(void);
>>>>> +
>>>>> +       pcpu = cpu_logical_map(cpu) & 0xff;
>>>>> +       pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
>>>>> +       pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
>>>>> +                __func__, cpu, pcpu, pcluster);
>>>>> +
>>>>> +       bL_set_entry_vector(pcpu, pcluster, NULL);
>>>>> +       ret = bL_cpu_power_up(pcpu, pcluster);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>> +       bL_set_entry_vector(pcpu, pcluster, secondary_startup);
>>>>> +       gic_raise_softirq(cpumask_of(cpu), 0);
>>>>> +       sev();
>>>>
>>>>
>>>> softirq() should be enough to break a CPU if it is in standby with
>>>> wfe state. Is that additional sev() needed here ?
>>>
>>>
>>> Not if the target cpu has its I & F bits disabled and that would be the
>>> case with a secondary waiting to be woken up
>>>
>> This is interesting since CPU is actually in standby state and this
>> was not my understanding so far. Your statement at least contradicts
>> the ARM ARM (B1.8.12 Wait For Interrupt)
>> -----------------------
>> The processor can remain in the WFI low-power state until it is reset, or it
>> detects one of the following WFI wake-up
>> events:
>> • a physical IRQ interrupt, regardless of the value of the CPSR.I bit
>> • a physical FIQ interrupt, regardless of the value of the CPSR.F bit
>> ----------------------------------
>>
>> Are you referring to some new behavior on latest ARMv7 CPUs ?
>
> You are abs right about the 'wfi' behaviour. I was talking about the effect
> of interrupts on a cpu thats in 'wfe'.
>
> The power up process takes place in two steps. The first step involves
> sending an ipi which will either:
>
> a. cause the power controller to bring the processor out of reset
> b. cause the processor to exit from wfi (most probably in the bootloader code)
>
> The cpu then enters Linux (bL_entry_point) and after doing any cluster setup
> waits in 'wfe' if its 'bL_entry_vector' has not been set as yet. The
> 'sev' is meant
> to poke the cpu once this has been done.
>
Thanks for the additional information. It's clear to me now.

Regards
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-15  2:44             ` Joseph Lo
@ 2013-01-15 16:44                 ` Nicolas Pitre
  -1 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-15 16:44 UTC (permalink / raw)
  To: Joseph Lo
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Tue, 15 Jan 2013, Joseph Lo wrote:

> On Mon, 2013-01-14 at 22:05 +0800, Nicolas Pitre wrote:
> > On Mon, 14 Jan 2013, Joseph Lo wrote:
> > 
> > > Hi Nicolas,
> > > 
> > > On Thu, 2013-01-10 at 08:20 +0800, Nicolas Pitre wrote:
> > > > This is the initial public posting of the initial support for big.LITTLE.
> > > > Included here is the code required to safely power up and down CPUs in a
> > > > b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> > > > Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> > > > boot and CPU hotplug support is included at this time.  Getting to this
> > > > point already represents a significcant chunk of code as illustrated by
> > > > the diffstat below.
> > > > 
> > > > 
> > > 
> > > Thanks for introducing this series.
> > > I am taking a look at this series. It introduced an algorithm for
> > > syncing and avoid racing when syncing the power status of clusters and
> > > CPUs. Do you think these codes could have a chance to become a generic
> > > framework?
> > 
> > Yes.  As I mentioned before, the bL_ prefix is implied only by the fact 
> > that big.LITTLE was the motivation for creating this code.
> > 
> > > The Tegra chip series had a similar design for CPU clusters and it 
> > had
> > > limitation that the CPU0 always needs to be the last CPU to be shut down
> > > before cluster power down as well. I believe it can also get benefits of
> > > this works. We indeed need a similar algorithm to sync CPUs power status
> > > before cluster power down and switching.
> > > 
> > > The "bL_entry.c", "bL_entry.S", "bL_entry.h", "vlock.h" and "vlock.S"
> > > looks have a chance to be a common framework for ARM platform even if it
> > > just support one cluster. Because some systems had the limitations for
> > > cluster power down. That's why the coupled cpuidle been introduced. And
> > > this framework could be enabled automatically if platform dependent or
> > > by menuconfig.
> > 
> > Absolutely.
> > 
> So do you have a plan to make it become a generic framework in this
> series or later work?

It is already generic, except for the name.  In other words, you could 
start using this code already.

I'm still debating a good substitute for the bL_ prefix in this series 
to give it the appearance of generic code.


> (And I will add some common power sync wrapper functions that based on
> this framework for all Tegra series.)

Great!


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-15  6:23           ` Santosh Shilimkar
@ 2013-01-15 18:20             ` Dave Martin
  2013-01-16  6:33               ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-15 18:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 15, 2013 at 11:53:14AM +0530, Santosh Shilimkar wrote:
> On Monday 14 January 2013 05:55 PM, Lorenzo Pieralisi wrote:
> >On Sat, Jan 12, 2013 at 07:21:24AM +0000, Santosh Shilimkar wrote:
> >>On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
> >>>On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> >>>
> >>>>On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> >>>>>From: Dave Martin <dave.martin@linaro.org>
> >>>>>
> >>>>>+		/*
> >>>>>+		 * Flush the local CPU cache.
> >>>>>+		 *
> >>>>>+		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> >>>>>+		 * a preliminary flush here for those CPUs.  At least, that's
> >>>>>+		 * the theory -- without the extra flush, Linux explodes on
> >>>>>+		 * RTSM (maybe not needed anymore, to be investigated).
> >>>>>+		 */
> >>>>This is expected if the entire code is not in one stack frame and the
> >>>>additional flush is needed to avoid possible stack corruption. This
> >>>>issue has been discussed in past on the list.
> >>>
> >>>I missed that.  Do you have a reference or pointer handy?
> >>>
> >>>What is strange is that this is 100% reproducible on RTSM while this
> >>>apparently is not an issue on real hardware so far.
> >>>
> >>I tried searching archives and realized the discussion was in private
> >>email thread. There are some bits and pieces on list but not all the
> >>information.
> >>
> >>The main issue RMK pointed out is- An additional L1 flush needed
> >>to avoid the effective change of view of memory when the C bit is
> >>turned off, and the cache is no longer searched for local CPU accesses.
> >>
> >>In your case dcscb_power_down() has updated the stack which can be hit
> >>in cache line and hence cache is dirty now. Then cpu_proc_fin() clears
> >>the C-bit and hence for sub sequent calls the L1 cache won't be
> >>searched. You then call flush_cache_all() which again updates the
> >>stack but avoids searching the L1 cache. So it overwrites previous
> >>saved stack frame. This seems to be an issue in your case as well.
> >
> >On A15/A7 even with the C bit cleared the D-cache is searched, the
> >situation above cannot happen and if it does we are facing a HW/model bug.
> >If this code is run on A9 then we have a problem since there, when the C bit
> >is cleared D-cache is not searched (and that's why the sequence above
> >should be written in assembly with no data access whatsoever), but on
> >A15/A7 we do not.
> >
> Good point. May be model has modeled A9 and not A15 but in either
> case, lets be consistent for all ARMv7 machines at least to avoid
> people debugging similar issues. Many machines share code for ARMv7
> processors so the best things is to stick to the sequence which works
> across all ARMv7 processors.

Is it sufficient to clarify the comment to indicate that the code is
not directly reusable for other CPU combinations?

DCSCB is incredibly platform-specific, and we would not expect to
see it in other platforms.

Or do we consider the risk of people copying this code verbatim
(including the "do not copy this code" comment) too high?

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-14 14:05         ` Nicolas Pitre
@ 2013-01-15 18:31             ` Dave Martin
  -1 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-15 18:31 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Joseph Lo, linux-tegra-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Jan 14, 2013 at 09:05:25AM -0500, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Joseph Lo wrote:
> 
> > Hi Nicolas,
> > 
> > On Thu, 2013-01-10 at 08:20 +0800, Nicolas Pitre wrote:
> > > This is the initial public posting of the initial support for big.LITTLE.
> > > Included here is the code required to safely power up and down CPUs in a
> > > b.L system, whether this is via CPU hotplug, a cpuidle driver or the
> > > Linaro b.L in-kernel switcher[*] on top of this.  Only  SMP secondary
> > > boot and CPU hotplug support is included at this time.  Getting to this
> > > point already represents a significcant chunk of code as illustrated by
> > > the diffstat below.
> > > 
> > > This work was presented at Linaro Connect in Copenhagen by Dave Martin and
> > > myself.  The presentation slides are available here:
> > > 
> > > http://www.linaro.org/documents/download/f3569407bb1fb8bde0d6da80e285b832508f92f57223c
> > > 
> > > The code is now stable on both Fast Models as well as Virtual Express TC2
> > > and ready for public review.
> > > 
> > > Platform support is included for Fast Models implementing the
> > > Cortex-A15x4-A7x4 and Cortex-A15x1-A7x1 configurations.  To allow
> > > successful compilation, I also included a preliminary version of the
> > > CCI400 driver from Lorenzo Pieralisi.
> > > 
> > > Support for actual hardware such as Vexpress TC2 should come later,
> > > once the basic infrastructure from this series is merged.  A few DT
> > > bindings are used but not yet documented.
> > > 
> > > This series is made of the following parts:
> > > 
> > > Low-level support code:
> > > [PATCH 01/16] ARM: b.L: secondary kernel entry code
> > > [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API
> > > [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency
> > > [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes
> > > [PATCH 05/16] ARM: bL_head: vlock-based first man election
> > > 
> > > Adaptation layer to hook with the generic kernel infrastructure:
> > > [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug
> > > [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before
> > > [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a
> > > [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at
> > > 
> > > Fast Models support:
> > > [PATCH 10/16] ARM: vexpress: introduce DCSCB support
> > > [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power
> > > [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs
> > > [PATCH 13/16] drivers: misc: add ARM CCI support
> > > [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from
> > > [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency
> > > [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree
> > > 
> > 
> > Thanks for introducing this series.
> > I am taking a look at this series. It introduced an algorithm for
> > syncing and avoid racing when syncing the power status of clusters and
> > CPUs. Do you think these codes could have a chance to become a generic
> > framework?
> 
> Yes.  As I mentioned before, the bL_ prefix is implied only by the fact 
> that big.LITTLE was the motivation for creating this code.
> 
> > The Tegra chip series had a similar design for CPU clusters and it 
> had
> > limitation that the CPU0 always needs to be the last CPU to be shut down
> > before cluster power down as well. I believe it can also get benefits of
> > this works. We indeed need a similar algorithm to sync CPUs power status
> > before cluster power down and switching.
> > 
> > The "bL_entry.c", "bL_entry.S", "bL_entry.h", "vlock.h" and "vlock.S"
> > looks have a chance to be a common framework for ARM platform even if it
> > just support one cluster. Because some systems had the limitations for
> > cluster power down. That's why the coupled cpuidle been introduced. And
> > this framework could be enabled automatically if platform dependent or
> > by menuconfig.
> 
> Absolutely.
> 
> 
> > For ex,
> > 	select CPUS_CLUSTERS_POWER_SYNC_FRAMEWORK if SMP && CPU_PM
> > 
> > How do you think of this suggestion?
> 
> I'd prefer a more concise name though.
> 
> > BTW, some questions...
> > 1. The "bL_entry_point" looks like a first run function when CPUs just
> > power up, then jumping to original reset vector that it should be
> > called. Do you think this should be a function and be called by reset
> > handler? Or in your design, this should be called as soon as possible
> > when the CPU power be resumed?
> 
> This should be called as soon as possible.

For one thing, you can't safely turn on the MMU or do anything which may
affect any other CPU, until the code at bL_entry_point has run.

On most real hardware, the first thing to run on a powered-up CPU will
be some boot ROM or firmware, but we expect bL_entry_point to be the
initial entry point into Linux in these scenarios.

> > 2. Does the Last_man mechanism should implement in platform specific
> > code to check something like cpu_online_status and if there is a
> > limitation for the specific last CPU to be powered down?
> 
> The selection of the last man is accomplished using a platform specific 
> mechanism.  By the time this has to be done, the CPU is already dead as 
> far as the Linux kernel is concerned, and therefore the generic CPU map 
> is not reliable.  In the DCSCB case we simply look at the hardware reset 
> register being modified to directly determine the last man.  On TC2 (not 
> yet posted) we have to keep a local map of online CPUs.
> 
> In your case, the selection of the last man would simply be forced on 
> CPU0.

Things are actually simpler in your scenario, because there
is only one CPU that can possibly become the last man.  However, the
algorithm could still be re-used: it doesn't matter that it is "too
safe" for your situation, and some aspects remain important, such
as checking for CPUs unexpectedly powering up while a cluster power-
down is pending, for example.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 13/16] drivers: misc: add ARM CCI support
  2013-01-11 19:22     ` Nicolas Pitre
  2013-01-12  6:53       ` Santosh Shilimkar
@ 2013-01-15 18:34       ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-15 18:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 11, 2013 at 02:22:10PM -0500, Nicolas Pitre wrote:
> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> 
> > On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> > > From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > > 
> > > On ARM multi-cluster systems coherency between cores running on
> > > different clusters is managed by the cache-coherent interconnect (CCI).
> > > It allows broadcasting of TLB invalidates and memory barriers and it
> > > guarantees cache coherency at system level.
> > > 
> > > This patch enables the basic infrastructure required in Linux to
> > > handle and programme the CCI component. The first implementation is
> > > based on a platform device, its relative DT compatible property and
> > > a simple programming interface.
> > > 
> > > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > > ---
> > >   drivers/misc/Kconfig    |   3 ++
> > >   drivers/misc/Makefile   |   1 +
> > >   drivers/misc/arm-cci.c  | 107
> > > ++++++++++++++++++++++++++++++++++++++++++++++++
> > >   include/linux/arm-cci.h |  30 ++++++++++++++
> > How about 'drivers/bus/' considering CCI is an interconnect bus (though
> > for coherency)
> 
> Yes, I like that better.
> 
> > >   4 files changed, 141 insertions(+)
> > >   create mode 100644 drivers/misc/arm-cci.c
> > >   create mode 100644 include/linux/arm-cci.h
> > > 
> > > diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> > > index b151b7c1bd..30d5be1ad2 100644
> > > --- a/drivers/misc/Kconfig
> > > +++ b/drivers/misc/Kconfig
> > > @@ -499,6 +499,9 @@ config USB_SWITCH_FSA9480
> > >   	  stereo and mono audio, video, microphone and UART data to use
> > >   	  a common connector port.
> > > 
> > > +config ARM_CCI
> > You might want add depends on ARM big.LITTTLE otherwise it will
> > break build for other arch's with random configurations.
> 
> As far as this patch goes, this is buildable on other architectures too.  
> The next patch changes that though.
> 
> > [..]
> > 
> > > diff --git a/drivers/misc/arm-cci.c b/drivers/misc/arm-cci.c
> > > new file mode 100644
> > > index 0000000000..f329c43099
> > > --- /dev/null
> > > +++ b/drivers/misc/arm-cci.c
> > > @@ -0,0 +1,107 @@
> > > +/*
> > > + * CCI support
> > > + *
> > > + * Copyright (C) 2012 ARM Ltd.
> > > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > > + * kind, whether express or implied; without even the implied warranty
> > > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + */
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/io.h>
> > > +#include <linux/module.h>
> > > +#include <linux/platform_device.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/arm-cci.h>
> > > +
> > > +#define CCI400_EAG_OFFSET       0x4000
> > > +#define CCI400_KF_OFFSET        0x5000
> > > +
> > > +#define DRIVER_NAME	"CCI"
> > > +struct cci_drvdata {
> > > +	void __iomem *baseaddr;
> > > +	spinlock_t lock;
> > > +};
> > > +
> > > +static struct cci_drvdata *info;
> > > +
> > > +void disable_cci(int cluster)
> > > +{
> > > +	u32 cci_reg = cluster ? CCI400_KF_OFFSET : CCI400_EAG_OFFSET;
> > > +	writel_relaxed(0x0, info->baseaddr	+ cci_reg);
> > > +
> > > +	while (readl_relaxed(info->baseaddr + 0xc) & 0x1)
> > > +			;
> > > +}
> > > +EXPORT_SYMBOL_GPL(disable_cci);
> > > +
> > Is more functionality going to be added for CCI driver. Having this
> > much of driver code for just a disable_cci() functions seems like
> > overkill.
> 
> Yes.  More code will appear here to provide pmu functionalities, etc.

There's also a load of QoS configuration and other stuff which we could
control, in principle.  Normally the optimum settings are a property of
the hardware, so maybe we would always just rely on the firmware to
set up a sensible configuration.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-14 18:26           ` Russell King - ARM Linux
  2013-01-14 18:49             ` Nicolas Pitre
@ 2013-01-15 18:40             ` Dave Martin
  2013-01-16 16:06               ` Catalin Marinas
  1 sibling, 1 reply; 140+ messages in thread
From: Dave Martin @ 2013-01-15 18:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 06:26:04PM +0000, Russell King - ARM Linux wrote:
> On Mon, Jan 14, 2013 at 12:15:07PM -0500, Nicolas Pitre wrote:
> > The same could be said about the outer cache ops.  If a DSB is needed 
> > for their intent to be valid, then why isn't this DSB always implied by 
> > the corresponding cache op calls?
> 
> Hmm, just been thinking about this.
> 
> The L2x0 calls do contain a DSB but it's not obvious.  They hold a
> raw spinlock, and when that spinlock is dropped, we issue a dsb and
> sev instruction.
> 
> Whether the other L2 implementations do this or not I'm not sure -
> but the above is a requirement of the spinlock implementation, and
> it just happens to provide the right behaviour for L2x0.
> 
> But... we _probably_ don't want to impose that down at the L2 cache
> level of things - at least not for DMA ops, particular for the sanity
> of the scatter-list operating operations.  We really want to avoid
> doing one DSB per scatterlist entry, doing one DSB per scatterlist
> operation instead.
> 
> That does affect how the L2 cache API gets used - maybe we want to
> separate out the DMA stuff from the other users so that we can have
> dsbs in that path for non-DMA users.
> 
> Thoughts?

Perhaps the existing functions could be renamed to things like:

outer_XXX_flush_range()
outer_XXX_sync()

Where XXX is something like "batch" or "background".  Optionally these
could be declared somewhere separate to discourage non-DMA code from
using them.  Other code could still want to do batches of outer cache
operations efficiently, but I guess DMA is the main user.

Then we could provide simple non-background wrappers which also do
the appropriate CPU-side synchronisation, and provide the familiar
interface for that.

It might be less invasive to rename the new functions instead of the
old ones.  It partly depends on what proportion of existing uses
of these functions are incorrect (i.e., assume full synchronisation).
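
As a hypothetical sketch (the function names here are assumptions, not an
existing kernel API), the split might look something like this:

	#include <asm/outercache.h>
	#include <asm/barrier.h>

	/* "Background" variant: no implied CPU-side synchronisation. */
	static inline void outer_batch_flush_range(phys_addr_t start, phys_addr_t end)
	{
		outer_flush_range(start, end);
	}

	/* Fully synchronised wrapper for non-DMA users. */
	static inline void outer_flush_range_sync(phys_addr_t start, phys_addr_t end)
	{
		dsb();				/* order prior CPU-side accesses */
		outer_batch_flush_range(start, end);
		outer_sync();			/* drain the outer write buffer */
	}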

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-15 11:18         ` Achin Gupta
  2013-01-15 11:26           ` Santosh Shilimkar
@ 2013-01-15 18:53           ` Dave Martin
  1 sibling, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-15 18:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 15, 2013 at 11:18:44AM +0000, Achin Gupta wrote:
> Hi Santosh,
> 
> On Tue, Jan 15, 2013 at 6:32 AM, Santosh Shilimkar
> <santosh.shilimkar@ti.com> wrote:
> > On Monday 14 January 2013 11:35 PM, Achin Gupta wrote:
> >>
> >> Hi Santosh,
> >>
> >> On Fri, Jan 11, 2013 at 6:02 PM, Santosh Shilimkar
> >> <santosh.shilimkar@ti.com> wrote:
> >>>
> >>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> >>>>
> >>>>
> >>>> Now that the b.L power API is in place, we can use it for SMP secondary
> >>>> bringup and CPU hotplug in a generic fashion.
> >>>>
> >>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> >>>> ---
> >>>>    arch/arm/common/Makefile     |  2 +-
> >>>>    arch/arm/common/bL_platsmp.c | 79
> >>>> ++++++++++++++++++++++++++++++++++++++++++++
> >>>>    2 files changed, 80 insertions(+), 1 deletion(-)
> >>>>    create mode 100644 arch/arm/common/bL_platsmp.c
> >>>>
> >>>> diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
> >>>> index 894c2ddf9b..59b36db7cc 100644
> >>>> --- a/arch/arm/common/Makefile
> >>>> +++ b/arch/arm/common/Makefile
> >>>> @@ -15,4 +15,4 @@ obj-$(CONFIG_PCI_HOST_ITE8152)  += it8152.o
> >>>>    obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
> >>>>    obj-$(CONFIG_FIQ_GLUE)                += fiq_glue.o fiq_glue_setup.o
> >>>>    obj-$(CONFIG_FIQ_DEBUGGER)    += fiq_debugger.o
> >>>> -obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o vlock.o
> >>>> +obj-$(CONFIG_BIG_LITTLE)       += bL_head.o bL_entry.o bL_platsmp.o
> >>>> vlock.o
> >>>> diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> >>>> new file mode 100644
> >>>> index 0000000000..0acb9f4685
> >>>> --- /dev/null
> >>>> +++ b/arch/arm/common/bL_platsmp.c
> >>>> @@ -0,0 +1,79 @@
> >>>> +/*
> >>>> + * linux/arch/arm/mach-vexpress/bL_platsmp.c
> >>>> + *
> >>>> + * Created by:  Nicolas Pitre, November 2012
> >>>> + * Copyright:   (C) 2012  Linaro Limited
> >>>> + *
> >>>> + * This program is free software; you can redistribute it and/or modify
> >>>> + * it under the terms of the GNU General Public License version 2 as
> >>>> + * published by the Free Software Foundation.
> >>>> + *
> >>>> + * Code to handle secondary CPU bringup and hotplug for the bL power
> >>>> API.
> >>>> + */
> >>>> +
> >>>> +#include <linux/init.h>
> >>>> +#include <linux/smp.h>
> >>>> +
> >>>> +#include <asm/bL_entry.h>
> >>>> +#include <asm/smp_plat.h>
> >>>> +#include <asm/hardware/gic.h>
> >>>> +
> >>>> +static void __init simple_smp_init_cpus(void)
> >>>> +{
> >>>> +       set_smp_cross_call(gic_raise_softirq);
> >>>> +}
> >>>> +
> >>>> +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct
> >>>> task_struct *idle)
> >>>> +{
> >>>> +       unsigned int pcpu, pcluster, ret;
> >>>> +       extern void secondary_startup(void);
> >>>> +
> >>>> +       pcpu = cpu_logical_map(cpu) & 0xff;
> >>>> +       pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
> >>>> +       pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
> >>>> +                __func__, cpu, pcpu, pcluster);
> >>>> +
> >>>> +       bL_set_entry_vector(pcpu, pcluster, NULL);
> >>>> +       ret = bL_cpu_power_up(pcpu, pcluster);
> >>>> +       if (ret)
> >>>> +               return ret;
> >>>> +       bL_set_entry_vector(pcpu, pcluster, secondary_startup);
> >>>> +       gic_raise_softirq(cpumask_of(cpu), 0);
> >>>> +       sev();
> >>>
> >>>
> >>> softirq() should be enough to break a CPU if it is in standby with
> >>> wfe state. Is that additional sev() needed here ?
> >>
> >>
> >> Not if the target cpu has its I & F bits disabled and that would be the
> >> case with a secondary waiting to be woken up
> >>
> > This is interesting since CPU is actually in standby state and this
> > was not my understanding so far. Your statement at least contradicts
> > the ARM ARM (B1.8.12 Wait For Interrupt)
> > -----------------------
> > The processor can remain in the WFI low-power state until it is reset, or it
> > detects one of the following WFI wake-up
> > events:
> > • a physical IRQ interrupt, regardless of the value of the CPSR.I bit
> > • a physical FIQ interrupt, regardless of the value of the CPSR.F bit
> > ----------------------------------
> >
> > Are you referring to some new behavior on latest ARMv7 CPUs ?
> 
> You are abs right about the 'wfi' behaviour. I was talking about the effect
> of interrupts on a cpu thats in 'wfe'.
> 
> The power up process takes place in two steps. The first step involves
> sending an ipi which will either:
> 
> a. cause the power controller to bring the processor out of reset
> b. cause the processor to exit from wfi (most probably in the bootloader code)
> 
> The cpu then enters Linux (bL_entry_point) and after doing any cluster setup
> waits in 'wfe' if its 'bL_entry_vector' has not been set as yet. The
> 'sev' is meant
> to poke the cpu once this has been done.
> 
> Its not required in this case as we have already set 'bL_entry_vector' , issued
> a barrier & flushed the cache line. So if the incoming cpu sees a 0 in
> its vector
> then that would be a symptom of a different problem.

Perhaps this could be made a bit clearer by defining two helpers,
bL_entry_close_gate() and bL_entry_open_gate().  The sev() is only
applicable for opening the gate, and could be buried in
bL_entry_open_gate().

This is pretty specialised low-level code though, so I'm not sure that
the abstraction is worth it.
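
A minimal sketch of what such helpers might look like (hypothetical, not part
of the posted series; it assumes bL_set_entry_vector() flushes the vector as
it does in the patch):

	static inline void bL_entry_close_gate(unsigned int cpu, unsigned int cluster)
	{
		/* Clear the vector so a spuriously woken CPU parks in WFE. */
		bL_set_entry_vector(cpu, cluster, NULL);
	}

	static inline void bL_entry_open_gate(unsigned int cpu, unsigned int cluster,
					      void *entry)
	{
		/* bL_set_entry_vector() already flushes the vector to memory... */
		bL_set_entry_vector(cpu, cluster, entry);
		/* ...so the SEV cannot be observed before the vector is visible. */
		sev();
	}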

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support
  2013-01-14 16:51     ` Nicolas Pitre
@ 2013-01-15 19:09       ` Dave Martin
  0 siblings, 0 replies; 140+ messages in thread
From: Dave Martin @ 2013-01-15 19:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 14, 2013 at 11:51:11AM -0500, Nicolas Pitre wrote:
> On Mon, 14 Jan 2013, Will Deacon wrote:
> 
> > On Thu, Jan 10, 2013 at 12:20:41AM +0000, Nicolas Pitre wrote:
> > > Now that the b.L power API is in place, we can use it for SMP secondary
> > > bringup and CPU hotplug in a generic fashion.
> > 
> > [...]
> > 
> > > diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
> > > new file mode 100644
> > > index 0000000000..0acb9f4685
> > > --- /dev/null
> > > +++ b/arch/arm/common/bL_platsmp.c
> > > @@ -0,0 +1,79 @@
> > > +/*
> > > + * linux/arch/arm/mach-vexpress/bL_platsmp.c
> > > + *
> > > + * Created by:  Nicolas Pitre, November 2012
> > > + * Copyright:   (C) 2012  Linaro Limited
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * Code to handle secondary CPU bringup and hotplug for the bL power API.
> > > + */
> > > +
> > > +#include <linux/init.h>
> > > +#include <linux/smp.h>
> > > +
> > > +#include <asm/bL_entry.h>
> > > +#include <asm/smp_plat.h>
> > > +#include <asm/hardware/gic.h>
> > > +
> > > +static void __init simple_smp_init_cpus(void)
> > > +{
> > > +	set_smp_cross_call(gic_raise_softirq);
> > > +}
> > > +
> > > +static int __cpuinit bL_boot_secondary(unsigned int cpu, struct task_struct *idle)
> > > +{
> > > +	unsigned int pcpu, pcluster, ret;
> > > +	extern void secondary_startup(void);
> > > +
> > > +	pcpu = cpu_logical_map(cpu) & 0xff;
> > > +	pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
> > 
> > Again, you can probably use Lorenzo's helpers here.
> 
> Yes, that goes for the whole series.
> 
> > > +	pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
> > > +		 __func__, cpu, pcpu, pcluster);
> > > +
> > > +	bL_set_entry_vector(pcpu, pcluster, NULL);
> > 
> > Now that you don't have a barrier in this function, you need one here.
> 
> Hmmm... Why?

In effect, we are entering a critical section here: that's precisely
why we close the gate.

We need a barrier after bL_set_entry_vector() to make sure that no
operations from the critical section leak outside from the
perspective of the target CPU.

Similarly, we need a barrier before bL_set_entry_vector() when
opening the gate.

The corresponding barrier required in bL_head.S at bL_entry_gated
is added by my recent barriers and tidyups series.


Closing and opening the gate are a bit like taking and releasing
a lock -- which is one reason for having simple wrapper functions
to make these roles more obvious.

> 
> > > +	ret = bL_cpu_power_up(pcpu, pcluster);
> > > +	if (ret)
> > > +		return ret;
> > 
> > and here, although I confess to not understanding why you write NULL the
> > first time.
> 
> If for some reasons the bL_cpu_power_up() call fails, I don't want this 
> CPU to suddenly decide to enter the kernel if it wakes up at a later 
> time when secondary_startup is not ready to deal with it anymore.
> 
> > > +	bL_set_entry_vector(pcpu, pcluster, secondary_startup);
> > > +	gic_raise_softirq(cpumask_of(cpu), 0);
> > > +	sev();
> > 
> > This relise on the event register being able to be set if the target is in a
> > low-power (wfi) state. I'd feel safer with a dsb before the sev...

The sev() signals the update to the entry vector, which has already been
DSB'd by the flushing in bL_set_entry_vector().

Also, the relative order of gic_raise_softirq() and sev() here is
not important, provided they both follow bL_set_entry_vector().

There are no circumstances under which we could know whether the
IRQ or SEV arrives first at the destination CPU anyway.  A DSB is
insufficient since the store may still not have arrived at the GIC;
but even doing a readback from the GIC isn't enough, because the
relationships and relative speed of the underlying interrupt and
SEV signalling mechanisms are not architecturally visible.

The crucial thing is that the SEV does not arrive before the destination
CPU has observed the modification to the entry vector -- that could
cause the target CPU to stall in WFE.

If the CPU powers up after missing the SEV, that's fine, because the
observability of the non-NULL entry vector is guaranteed by the
flushing which precedes gic_raise_softirq().  SEV is only important
if the CPU powers up early, observes a NULL entry vector and goes
into WFE.

Because this code only applies with the multiprocessing extensions,
we know that the writes associated with gic_raise_softirq() will
drain and take effect eventually, but we don't care when.  We
can't know the results for sure until the target CPU re-enters the
kernel.
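
Putting it together, a hypothetical restatement of the boot path with the
ordering constraints above made explicit (using the gate helpers sketched
earlier, otherwise the same calls as the posted patch) might read:

	static int __cpuinit bL_boot_secondary_sketch(unsigned int cpu,
						      struct task_struct *idle)
	{
		extern void secondary_startup(void);
		unsigned int pcpu = cpu_logical_map(cpu) & 0xff;
		unsigned int pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
		int ret;

		/* Close the gate: nothing may enter the kernel yet. */
		bL_entry_close_gate(pcpu, pcluster);

		ret = bL_cpu_power_up(pcpu, pcluster);
		if (ret)
			return ret;	/* gate stays closed if power-up failed */

		/*
		 * Publish the (flushed) vector first, then poke the CPU; the
		 * IPI and the SEV may arrive in either order, and that is fine.
		 */
		bL_entry_open_gate(pcpu, pcluster, secondary_startup);
		gic_raise_softirq(cpumask_of(cpu), 0);

		return 0;
	}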

Cheers
---Dave

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-15 18:20             ` Dave Martin
@ 2013-01-16  6:33               ` Santosh Shilimkar
  2013-01-16 10:03                 ` Lorenzo Pieralisi
  0 siblings, 1 reply; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-16  6:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 15 January 2013 11:50 PM, Dave Martin wrote:
> On Tue, Jan 15, 2013 at 11:53:14AM +0530, Santosh Shilimkar wrote:
>> On Monday 14 January 2013 05:55 PM, Lorenzo Pieralisi wrote:
>>> On Sat, Jan 12, 2013 at 07:21:24AM +0000, Santosh Shilimkar wrote:
>>>> On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
>>>>> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>>>>>
>>>>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>>>>> From: Dave Martin <dave.martin@linaro.org>
>>>>>>>
>>>>>>> +		/*
>>>>>>> +		 * Flush the local CPU cache.
>>>>>>> +		 *
>>>>>>> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
>>>>>>> +		 * a preliminary flush here for those CPUs.  At least, that's
>>>>>>> +		 * the theory -- without the extra flush, Linux explodes on
>>>>>>> +		 * RTSM (maybe not needed anymore, to be investigated).
>>>>>>> +		 */
>>>>>> This is expected if the entire code is not in one stack frame and the
>>>>>> additional flush is needed to avoid possible stack corruption. This
>>>>>> issue has been discussed in past on the list.
>>>>>
>>>>> I missed that.  Do you have a reference or pointer handy?
>>>>>
>>>>> What is strange is that this is 100% reproducible on RTSM while this
>>>>> apparently is not an issue on real hardware so far.
>>>>>
>>>> I tried searching archives and realized the discussion was in private
>>>> email thread. There are some bits and pieces on list but not all the
>>>> information.
>>>>
>>>> The main issue RMK pointed out is- An additional L1 flush needed
>>>> to avoid the effective change of view of memory when the C bit is
>>>> turned off, and the cache is no longer searched for local CPU accesses.
>>>>
>>>> In your case dcscb_power_down() has updated the stack which can be hit
>>>> in cache line and hence cache is dirty now. Then cpu_proc_fin() clears
>>>> the C-bit and hence for sub sequent calls the L1 cache won't be
>>>> searched. You then call flush_cache_all() which again updates the
>>>> stack but avoids searching the L1 cache. So it overwrites previous
>>>> saved stack frame. This seems to be an issue in your case as well.
>>>
>>> On A15/A7 even with the C bit cleared the D-cache is searched, the
>>> situation above cannot happen and if it does we are facing a HW/model bug.
>>> If this code is run on A9 then we have a problem since there, when the C bit
>>> is cleared D-cache is not searched (and that's why the sequence above
>>> should be written in assembly with no data access whatsoever), but on
>>> A15/A7 we do not.
>>>
>> Good point. May be model has modeled A9 and not A15 but in either
>> case, lets be consistent for all ARMv7 machines at least to avoid
>> people debugging similar issues. Many machines share code for ARMv7
>> processors so the best things is to stick to the sequence which works
>> across all ARMv7 processors.
>
> Is it sufficient to clarify the comment to indicate that the code is
> not directly reusable for other CPU combinations?
>
That's not what I mean. The CPU power-down sequence is as per the
ARM specs, so there shouldn't be an issue if people find it useful
for other purposes. That's another topic though.

> DCSCB is incredibly platform-specific, and we would not expect to
> see it in other platforms.
>
> Or do we consider the risk of people copying this code verbatim
> (including the "do not copy this code" comment) too high?
>
I am not sure exactly what you mean. We are discussing the sequence
here on the basis of the additional L1 cache flush. As mentioned, what
is clearly missing is documentation: the ARM ARM (which is generic for
all of ARMv7) does not capture the interaction between the power-down
code and stack usage, which at least creates an issue on A9.
Documenting that in the code, and mainly in the ARM specs, would avoid
any further confusion.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-16  6:33               ` Santosh Shilimkar
@ 2013-01-16 10:03                 ` Lorenzo Pieralisi
  2013-01-16 10:12                   ` Santosh Shilimkar
  0 siblings, 1 reply; 140+ messages in thread
From: Lorenzo Pieralisi @ 2013-01-16 10:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 16, 2013 at 06:33:40AM +0000, Santosh Shilimkar wrote:
> On Tuesday 15 January 2013 11:50 PM, Dave Martin wrote:
> > On Tue, Jan 15, 2013 at 11:53:14AM +0530, Santosh Shilimkar wrote:
> >> On Monday 14 January 2013 05:55 PM, Lorenzo Pieralisi wrote:
> >>> On Sat, Jan 12, 2013 at 07:21:24AM +0000, Santosh Shilimkar wrote:
> >>>> On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
> >>>>> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
> >>>>>
> >>>>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
> >>>>>>> From: Dave Martin <dave.martin@linaro.org>
> >>>>>>>
> >>>>>>> +		/*
> >>>>>>> +		 * Flush the local CPU cache.
> >>>>>>> +		 *
> >>>>>>> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
> >>>>>>> +		 * a preliminary flush here for those CPUs.  At least, that's
> >>>>>>> +		 * the theory -- without the extra flush, Linux explodes on
> >>>>>>> +		 * RTSM (maybe not needed anymore, to be investigated).
> >>>>>>> +		 */
> >>>>>> This is expected if the entire code is not in one stack frame and the
> >>>>>> additional flush is needed to avoid possible stack corruption. This
> >>>>>> issue has been discussed in past on the list.
> >>>>>
> >>>>> I missed that.  Do you have a reference or pointer handy?
> >>>>>
> >>>>> What is strange is that this is 100% reproducible on RTSM while this
> >>>>> apparently is not an issue on real hardware so far.
> >>>>>
> >>>> I tried searching archives and realized the discussion was in private
> >>>> email thread. There are some bits and pieces on list but not all the
> >>>> information.
> >>>>
> >>>> The main issue RMK pointed out is- An additional L1 flush needed
> >>>> to avoid the effective change of view of memory when the C bit is
> >>>> turned off, and the cache is no longer searched for local CPU accesses.
> >>>>
> >>>> In your case dcscb_power_down() has updated the stack which can be hit
> >>>> in cache line and hence cache is dirty now. Then cpu_proc_fin() clears
> >>>> the C-bit and hence for sub sequent calls the L1 cache won't be
> >>>> searched. You then call flush_cache_all() which again updates the
> >>>> stack but avoids searching the L1 cache. So it overwrites previous
> >>>> saved stack frame. This seems to be an issue in your case as well.
> >>>
> >>> On A15/A7 even with the C bit cleared the D-cache is searched, the
> >>> situation above cannot happen and if it does we are facing a HW/model bug.
> >>> If this code is run on A9 then we have a problem since there, when the C bit
> >>> is cleared D-cache is not searched (and that's why the sequence above
> >>> should be written in assembly with no data access whatsoever), but on
> >>> A15/A7 we do not.
> >>>
> >> Good point. May be model has modeled A9 and not A15 but in either
> >> case, lets be consistent for all ARMv7 machines at least to avoid
> >> people debugging similar issues. Many machines share code for ARMv7
> >> processors so the best things is to stick to the sequence which works
> >> across all ARMv7 processors.
> >
> > Is it sufficient to clarify the comment to indicate that the code is
> > not directly reusable for other CPU combinations?
> >
> Thats not what I mean. CPU power down sequence is as per the
> ARM specs so there shouldn't be an issue in case people
> find it useful for other purposes. Thats other topc though.

If they run it on an A9 there is an issue, and as the hotplug code for
vexpress proved, copy'n'paste is a real danger.

> 
> > DCSCB is incredibly platform-specific, and we would not expect to
> > see it in other platforms.

Agreed, but it is also the first example of a power API implementation.
Stubbing this code out in an assembly file valid for all v7 implementations
is simple, provided we consider that worthwhile. I do. Or at least I can
write the sequence up in Documentation/, describing how it should be done
to be generic and what the pitfalls are.

> >
> > Or do we consider the risk of people copying this code verbatim
> > (including the "do not copy this code" comment) too high?
> >
> I am not sure what exactly you mean. We are discussing the sequence
> here on the basis of additional L1 cache flush. As mentioned
> clearly the documentation is the ARM ARM(which is generic for
> all ARMv7) missing to capture the need of the power
> down code and stack usage which at least creates an issue on
> A9. Documenting that in code and mainly in ARM specs would avoid
> any further confusions.

The power-down sequence is defined explicitly in the A15 and A7 TRMs. I do
not think they should write "do not use the stack or cacheable memory that
can result in dirty lines" in between the power-down steps. Once you know
the C bit behaviour, the coding follows. True, they might add this for
A9, and I asked for that, to no avail, for internal reasons.
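
For the archives, here is a C-level sketch of the failure mode on a CPU
where SCTLR.C=0 stops D-cache lookups (illustrative only, not the DCSCB
code):

	/* ... earlier C code has left dirty stack lines in the D-cache ... */
	cpu_proc_fin();		/* clears SCTLR.C: local accesses no longer
				   search the D-cache */
	flush_cache_all();	/* saves registers on the stack with uncached
				   accesses; cleaning the cache then writes the
				   stale dirty lines back on top of them,
				   corrupting the saved frame */

Hence the advice: once the C bit is cleared, the remainder of the
sequence must live in a single assembly routine that touches no stack
or other cacheable data.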

Documenting it in the kernel won't hurt either. And to answer Dave, I
think that copy'n'paste verbatim is a risk we should not run, unless we
are willing to be on the lookout for bugs. I can write up some documentation
that we can merge along with the power API.

Lorenzo

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
  2013-01-16 10:03                 ` Lorenzo Pieralisi
@ 2013-01-16 10:12                   ` Santosh Shilimkar
  0 siblings, 0 replies; 140+ messages in thread
From: Santosh Shilimkar @ 2013-01-16 10:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 16 January 2013 03:33 PM, Lorenzo Pieralisi wrote:
> On Wed, Jan 16, 2013 at 06:33:40AM +0000, Santosh Shilimkar wrote:
>> On Tuesday 15 January 2013 11:50 PM, Dave Martin wrote:
>>> On Tue, Jan 15, 2013 at 11:53:14AM +0530, Santosh Shilimkar wrote:
>>>> On Monday 14 January 2013 05:55 PM, Lorenzo Pieralisi wrote:
>>>>> On Sat, Jan 12, 2013 at 07:21:24AM +0000, Santosh Shilimkar wrote:
>>>>>> On Saturday 12 January 2013 12:58 AM, Nicolas Pitre wrote:
>>>>>>> On Fri, 11 Jan 2013, Santosh Shilimkar wrote:
>>>>>>>
>>>>>>>> On Thursday 10 January 2013 05:50 AM, Nicolas Pitre wrote:
>>>>>>>>> From: Dave Martin <dave.martin@linaro.org>
>>>>>>>>>
>>>>>>>>> +		/*
>>>>>>>>> +		 * Flush the local CPU cache.
>>>>>>>>> +		 *
>>>>>>>>> +		 * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
>>>>>>>>> +		 * a preliminary flush here for those CPUs.  At least, that's
>>>>>>>>> +		 * the theory -- without the extra flush, Linux explodes on
>>>>>>>>> +		 * RTSM (maybe not needed anymore, to be investigated).
>>>>>>>>> +		 */
>>>>>>>> This is expected if the entire code is not in one stack frame and the
>>>>>>>> additional flush is needed to avoid possible stack corruption. This
>>>>>>>> issue has been discussed in past on the list.
>>>>>>>
>>>>>>> I missed that.  Do you have a reference or pointer handy?
>>>>>>>
>>>>>>> What is strange is that this is 100% reproducible on RTSM while this
>>>>>>> apparently is not an issue on real hardware so far.
>>>>>>>
>>>>>> I tried searching archives and realized the discussion was in private
>>>>>> email thread. There are some bits and pieces on list but not all the
>>>>>> information.
>>>>>>
>>>>>> The main issue RMK pointed out is- An additional L1 flush needed
>>>>>> to avoid the effective change of view of memory when the C bit is
>>>>>> turned off, and the cache is no longer searched for local CPU accesses.
>>>>>>
>>>>>> In your case dcscb_power_down() has updated the stack which can be hit
>>>>>> in cache line and hence cache is dirty now. Then cpu_proc_fin() clears
>>>>>> the C-bit and hence for sub sequent calls the L1 cache won't be
>>>>>> searched. You then call flush_cache_all() which again updates the
>>>>>> stack but avoids searching the L1 cache. So it overwrites previous
>>>>>> saved stack frame. This seems to be an issue in your case as well.
>>>>>
>>>>> On A15/A7 even with the C bit cleared the D-cache is searched, the
>>>>> situation above cannot happen and if it does we are facing a HW/model bug.
>>>>> If this code is run on A9 then we have a problem since there, when the C bit
>>>>> is cleared D-cache is not searched (and that's why the sequence above
>>>>> should be written in assembly with no data access whatsoever), but on
>>>>> A15/A7 we do not.
>>>>>
>>>> Good point. May be model has modeled A9 and not A15 but in either
>>>> case, lets be consistent for all ARMv7 machines at least to avoid
>>>> people debugging similar issues. Many machines share code for ARMv7
>>>> processors so the best things is to stick to the sequence which works
>>>> across all ARMv7 processors.
>>>
>>> Is it sufficient to clarify the comment to indicate that the code is
>>> not directly reusable for other CPU combinations?
>>>
>> Thats not what I mean. CPU power down sequence is as per the
>> ARM specs so there shouldn't be an issue in case people
>> find it useful for other purposes. Thats other topc though.
>
> If they run it on an A9 there is an issue and as hotplug code for
> vexpress proved, copy'n'paste is a real danger.
>
>>
>>> DCSCB is incredibly platform-specific, and we would not expect to
>>> see it in other platforms.
>
> Agreed, but it is also the first example of power API implementation.
> Stubbing out this code in an assembly file valid for all v7 implementations
> is simple, provided we consider that worthwhile. I do. Or at least I can
> write the sequence up in /Documentation, how it should be done to be generic
> and describe the pitfalls.
>
>>>
>>> Or do we consider the risk of people copying this code verbatim
>>> (including the "do not copy this code" comment) too high?
>>>
>> I am not sure what exactly you mean. We are discussing the sequence
>> here on the basis of additional L1 cache flush. As mentioned
>> clearly the documentation is the ARM ARM(which is generic for
>> all ARMv7) missing to capture the need of the power
>> down code and stack usage which at least creates an issue on
>> A9. Documenting that in code and mainly in ARM specs would avoid
>> any further confusions.
>
> Power down sequence is defined explicitly in A15 and A7 TRMs. I do not
> think they should write "do not use the stack or cacheable memory that
> can result in dirty lines" in betweeen the power down steps. Once you
> know the C bit behaviour coding follows. True, they might add this for
> A9, and I asked that, to no avail, for internal reasons.
>
Fair enough.

> Documenting it in the kernel won't hurt either. And to answer Dave, I
> think that copy'n'paste verbatim is a risk we should not run, unless we
> are willing to be on the lookout for bugs. I can write up some documentation
> that we can merge along with the power API.
>
+1

Regards,
Santosh

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-15 16:44                 ` Nicolas Pitre
@ 2013-01-16 16:02                   ` Catalin Marinas
  -1 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-16 16:02 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: linux-tegra, linux-arm-kernel, Joseph Lo

On Tue, Jan 15, 2013 at 04:44:23PM +0000, Nicolas Pitre wrote:
> On Tue, 15 Jan 2013, Joseph Lo wrote:
> > So do you have a plan to make it become a generic framework in this
> > series or later work?
> 
> It is already generic, except for the name.  In other words, you could 
> start using this code already.
> 
> I'm still debating a good substitute for the bL_ prefix in this series 
> to give it the appearance of generic code.

mc_?

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
@ 2013-01-16 16:02                   ` Catalin Marinas
  0 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-16 16:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 15, 2013 at 04:44:23PM +0000, Nicolas Pitre wrote:
> On Tue, 15 Jan 2013, Joseph Lo wrote:
> > So do you have a plan to make it become a generic framework in this
> > series or later work?
> 
> It is already generic, except for the name.  In other words, you could 
> start using this code already.
> 
> I'm still debating a good substitute for the bL_ prefix in this series 
> to give it the appearance of generic code.

mc_?

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
  2013-01-15 18:40             ` Dave Martin
@ 2013-01-16 16:06               ` Catalin Marinas
  0 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-16 16:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 15, 2013 at 06:40:37PM +0000, Dave Martin wrote:
> On Mon, Jan 14, 2013 at 06:26:04PM +0000, Russell King - ARM Linux wrote:
> > On Mon, Jan 14, 2013 at 12:15:07PM -0500, Nicolas Pitre wrote:
> > > The same could be said about the outer cache ops.  If a DSB is needed 
> > > for their intent to be valid, then why isn't this DSB always implied by 
> > > the corresponding cache op calls?
> > 
> > Hmm, just been thinking about this.
> > 
> > The L2x0 calls do contain a DSB but it's not obvious.  They hold a
> > raw spinlock, and when that spinlock is dropped, we issue a dsb and
> > sev instruction.
> > 
> > Whether the other L2 implementations do this or not I'm not sure -
> > but the above is a requirement of the spinlock implementation, and
> > it just happens to provide the right behaviour for L2x0.
> > 
> > But... we _probably_ don't want to impose that down at the L2 cache
> > level of things - at least not for DMA ops, particular for the sanity
> > of the scatter-list operating operations.  We really want to avoid
> > doing one DSB per scatterlist entry, doing one DSB per scatterlist
> > operation instead.
> > 
> > That does affect how the L2 cache API gets used - maybe we want to
> > separate out the DMA stuff from the other users so that we can have
> > dsbs in that path for non-DMA users.
> > 
> > Thoughts?
> 
> Perhaps the existing functions could be renamed to things like:
> 
> outer_XXX_flush_range()
> outer_XXX_sync()
> 
> Where XXX is something like "batch" or "background".  Optionally these
> could be declared somewhere separate to discourage non-DMA code from
> using them.  Other code could still want to do batches of outer cache
> operations efficiently, but I guess DMA is the main user.

There can be some confusion in using the 'background' name because both
PL310 and L2X0 have background operations (the former only for the
set/way ops). But in software you need to ensure the completion of such
operations, otherwise the L2 controller behaviour can be unpredictable.

So if you just want to drop the barriers (outer_sync and dsb), maybe you
could use the 'relaxed' suffix, which matches the I/O accessors.
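
Something along these lines, purely as a naming sketch (hypothetical
functions mirroring readl()/readl_relaxed(), not existing API):

	/* batch/DMA path: no barriers implied, caller syncs once at the end */
	static inline void outer_flush_range_relaxed(unsigned long start,
						     unsigned long end)
	{
		if (outer_cache.flush_range)
			outer_cache.flush_range(start, end);
	}

	/* everyone else gets the fully ordered version */
	static inline void outer_flush_range(unsigned long start, unsigned long end)
	{
		dsb();			/* prior writes reach the L2 first */
		outer_flush_range_relaxed(start, end);
		outer_sync();		/* and the maintenance completes */
	}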

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-16 16:02                   ` Catalin Marinas
@ 2013-01-16 21:18                       ` Nicolas Pitre
  -1 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-16 21:18 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Joseph Lo, linux-tegra-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, 16 Jan 2013, Catalin Marinas wrote:

> On Tue, Jan 15, 2013 at 04:44:23PM +0000, Nicolas Pitre wrote:
> > On Tue, 15 Jan 2013, Joseph Lo wrote:
> > > So do you have a plan to make it become a generic framework in this
> > > series or later work?
> > 
> > It is already generic, except for the name.  In other words, you could 
> > start using this code already.
> > 
> > I'm still debating a good substitute for the bL_ prefix in this series 
> > to give it the appearance of generic code.
> 
> mc_?

Bah.   :-/


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
@ 2013-01-16 21:18                       ` Nicolas Pitre
  0 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-01-16 21:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 16 Jan 2013, Catalin Marinas wrote:

> On Tue, Jan 15, 2013 at 04:44:23PM +0000, Nicolas Pitre wrote:
> > On Tue, 15 Jan 2013, Joseph Lo wrote:
> > > So do you have a plan to make it become a generic framework in this
> > > series or later work?
> > 
> > It is already generic, except for the name.  In other words, you could 
> > start using this code already.
> > 
> > I'm still debating a good substitute for the bL_ prefix in this series 
> > to give it the appearance of generic code.
> 
> mc_?

Bah.   :-/


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-16 21:18                       ` Nicolas Pitre
@ 2013-01-17 17:55                           ` Catalin Marinas
  -1 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-17 17:55 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Joseph Lo, linux-tegra-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, Jan 16, 2013 at 09:18:39PM +0000, Nicolas Pitre wrote:
> On Wed, 16 Jan 2013, Catalin Marinas wrote:
> 
> > On Tue, Jan 15, 2013 at 04:44:23PM +0000, Nicolas Pitre wrote:
> > > On Tue, 15 Jan 2013, Joseph Lo wrote:
> > > > So do you have a plan to make it become a generic framework in this
> > > > series or later work?
> > > 
> > > It is already generic, except for the name.  In other words, you could 
> > > start using this code already.
> > > 
> > > I'm still debating a good substitute for the bL_ prefix in this series 
> > > to give it the appearance of generic code.
> > 
> > mc_?
> 
> Bah.   :-/

Like in multi-cluster (not "Hammer time" ;). But if they are not even
cluster dependent, a simple mp_ would also do.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
@ 2013-01-17 17:55                           ` Catalin Marinas
  0 siblings, 0 replies; 140+ messages in thread
From: Catalin Marinas @ 2013-01-17 17:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 16, 2013 at 09:18:39PM +0000, Nicolas Pitre wrote:
> On Wed, 16 Jan 2013, Catalin Marinas wrote:
> 
> > On Tue, Jan 15, 2013 at 04:44:23PM +0000, Nicolas Pitre wrote:
> > > On Tue, 15 Jan 2013, Joseph Lo wrote:
> > > > So do you have a plan to make it become a generic framework in this
> > > > series or later work?
> > > 
> > > It is already generic, except for the name.  In other words, you could 
> > > start using this code already.
> > > 
> > > I'm still debating a good substitute for the bL_ prefix in this series 
> > > to give it the appearance of generic code.
> > 
> > mc_?
> 
> Bah.   :-/

Like in multi-cluster (not "Hammer time" ;). But if they are not even
cluster dependent, a simple mp_ would also do.

-- 
Catalin

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
                     ` (3 preceding siblings ...)
  2013-01-11 17:16   ` Santosh Shilimkar
@ 2013-03-07  7:37   ` Pavel Machek
  2013-03-07  8:57     ` Nicolas Pitre
  4 siblings, 1 reply; 140+ messages in thread
From: Pavel Machek @ 2013-03-07  7:37 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1579,6 +1579,12 @@ config HAVE_ARM_TWD
>  	help
>  	  This options enables support for the ARM timer and watchdog unit
>  
> +config BIG_LITTLE
> +	bool "big.LITTLE support (Experimental)"
> +	depends on CPU_V7 && SMP && EXPERIMENTAL
> +	help
> +	  This option enables support for the big.LITTLE architecture.
> +

Perhaps a few lines about what big.LITTLE is would be nice here?

It is that "few high-performance cores + few low-power cores" on chip, right?

BTW... is it possible/good idea to run all the cores at the same time for 
max performance? From descriptions I understood that is not normally done
and I'd like to understand why....

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
                   ` (18 preceding siblings ...)
       [not found] ` <1357777251-13541-1-git-send-email-nicolas.pitre-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
@ 2013-03-07  8:27 ` Pavel Machek
  2013-03-07  9:12   ` Nicolas Pitre
  19 siblings, 1 reply; 140+ messages in thread
From: Pavel Machek @ 2013-03-07  8:27 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> Review comments are welcome!
> 
> [*] General design information on the b.L switcher can be found here:
>     http://lwn.net/Articles/481055/
>     However the code is only accessible to Linaro members for the
>     time being.

Assuming the lwn article is accurate:

Hmm. So we have hw capable of running 8 cores, and then we introduce strange
switching code, because "scheduler is not ready". Sounds like a bad idea.

I'd say:

* expose all 8 cores.

* as long as scheduler is not ready, you can offline "the other" set of cores...
	/sys/.../cpu/..../online

That way, I suspect the scheduler will be fixed rather quickly.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 01/16] ARM: b.L: secondary kernel entry code
  2013-03-07  7:37   ` Pavel Machek
@ 2013-03-07  8:57     ` Nicolas Pitre
  0 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-03-07  8:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 7 Mar 2013, Pavel Machek wrote:

> Hi!
> 
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -1579,6 +1579,12 @@ config HAVE_ARM_TWD
> >  	help
> >  	  This options enables support for the ARM timer and watchdog unit
> >  
> > +config BIG_LITTLE
> > +	bool "big.LITTLE support (Experimental)"
> > +	depends on CPU_V7 && SMP && EXPERIMENTAL
> > +	help
> > +	  This option enables support for the big.LITTLE architecture.
> > +
> 
> Perhaps few lines of what big.LITTLE is would be nice here?

I'd invite you to look at the latest series posted on the list.  This 
patch is obsolete.

> It is that "few high-performance cores + few low-power cores" on chip, right?

Right.

> BTW... is it possible/good idea to run all the cores at the same time for 
> max performance? From descriptions I understood that is not normally done
> and I'd like to understand why....

It can be done.  But then the scheduler needs to be smarter about which 
task is put on which core.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-03-07  8:27 ` Pavel Machek
@ 2013-03-07  9:12   ` Nicolas Pitre
  2013-03-07  9:40     ` Pavel Machek
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-03-07  9:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 7 Mar 2013, Pavel Machek wrote:

> Hi!
> 
> > Review comments are welcome!
> > 
> > [*] General design information on the b.L switcher can be found here:
> >     http://lwn.net/Articles/481055/
> >     However the code is only accessible to Linaro members for the
> >     time being.
> 
> Assuming the lwn article is accurate:
> 
> Hmm. So we have hw capable of running 8 cores, and then we introduce strange
> switching code, because "scheduler is not ready". Sounds like a bad idea.
> 
> I'd say:
> 
> * expose all 8 cores.

You may do that now.  However the resulting power efficiency is far from 
optimal.

> * as long as scheduler is not ready, you can offline "the other" set of cores...
> 	/sys/.../cpu/..../online

You may do that now also.  But system performance is far from optimal.

> That way, I suspect scheduler will be fixed rather quickly.

If we could fix the scheduler quickly, we would have done that instead.  
But if you have a good idea for fixing it quickly, then please join the 
people who have been working on that problem for over a year already.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-03-07  9:12   ` Nicolas Pitre
@ 2013-03-07  9:40     ` Pavel Machek
  2013-03-07  9:56       ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Pavel Machek @ 2013-03-07  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > > Review comments are welcome!
> > > 
> > > [*] General design information on the b.L switcher can be found here:
> > >     http://lwn.net/Articles/481055/
> > >     However the code is only accessible to Linaro members for the
> > >     time being.
> > 
> > Assuming the lwn article is accurate:
> > 
> > Hmm. So we have hw capable of running 8 cores, and then we introduce strange
> > switching code, because "scheduler is not ready". Sounds like a bad idea.
> > 
> > I'd say:
> > 
> > * expose all 8 cores.
> 
> You may do that now.  However the resulting power efficiency is far from 
> optimal.

Really? Assuming the lwn article is accurate, it will be exactly the
same.

I propose exposing all 8 cores, then keeping 4 of them offline in
normal operation. This should have the same characteristics as your
solution, except that the cpu does not change unexpectedly under pinned tasks.

> > * as long as scheduler is not ready, you can offline "the other" set of cores...
> > 	/sys/.../cpu/..../online
> 
> You may do that now also.  But system performance is far from optimal.

Really? Do the offlining from the cpufreq code, and the solution will be
equivalent to yours, except without the hacks.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-03-07  9:40     ` Pavel Machek
@ 2013-03-07  9:56       ` Nicolas Pitre
  2013-03-07 14:51         ` Pavel Machek
  0 siblings, 1 reply; 140+ messages in thread
From: Nicolas Pitre @ 2013-03-07  9:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 7 Mar 2013, Pavel Machek wrote:

> Hi!
> 
> > > > Review comments are welcome!
> > > > 
> > > > [*] General design information on the b.L switcher can be found here:
> > > >     http://lwn.net/Articles/481055/
> > > >     However the code is only accessible to Linaro members for the
> > > >     time being.
> > > 
> > > Assuming the lwn article is accurate:
> > > 
> > > Hmm. So we have hw capable of running 8 cores, and then we introduce strange
> > > switching code, because "scheduler is not ready". Sounds like a bad idea.
> > > 
> > > I'd say:
> > > 
> > > * expose all 8 cores.
> > 
> > You may do that now.  However the resulting power efficiency is far from 
> > optimal.
> 
> Really? Assuming the lwn article is accurate, it will be exactly the
> same.
> 
> I propose exposing all 8 cores, then keeping 4 of them offline in
> normal operation. This should have characteristics of your solution,
> except that cpu does not change unexpectedly under the pinned tasks.
> 
> > > * as long as scheduler is not ready, you can offline "the other" set of cores...
> > > 	/sys/.../cpu/..../online
> > 
> > You may do that now also.  But system performance is far from optimal.
> 
> Really? Do the offlining from the cpufreq code, and solution will be
> equivalent to yours, except the hacks.

No, it is not equivalent.  Offlining a CPU is a fairly heavy operation in 
Linux and you don't want to do that multiple times per second, for 
example, whereas the switcher "hack" is similar to a context switch in 
terms of software complexity.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-03-07  9:56       ` Nicolas Pitre
@ 2013-03-07 14:51         ` Pavel Machek
  2013-03-07 15:42           ` Nicolas Pitre
  0 siblings, 1 reply; 140+ messages in thread
From: Pavel Machek @ 2013-03-07 14:51 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > > > I'd say:
> > > > 
> > > > * expose all 8 cores.
> > > 
> > > You may do that now.  However the resulting power efficiency is far from 
> > > optimal.
> > 
> > Really? Assuming the lwn article is accurate, it will be exactly the
> > same.
> > 
> > I propose exposing all 8 cores, then keeping 4 of them offline in
> > normal operation. This should have characteristics of your solution,
> > except that cpu does not change unexpectedly under the pinned tasks.
> > 
> > > > * as long as scheduler is not ready, you can offline "the other" set of cores...
> > > > 	/sys/.../cpu/..../online
> > > 
> > > You may do that now also.  But system performance is far from optimal.
> > 
> > Really? Do the offlining from the cpufreq code, and solution will be
> > equivalent to yours, except the hacks.
> 
> No it is not equivalent.  Offlining a CPU is a fairly heavy operation in 
> Linux and you don't want to do that multiple times per second for 
> example.  Whereas the switcher "hack" is similar to a context switch in 
> terms of software complexity.

Well, why not. Logically, you _are_ saving state, offlining, onlining,
and restoring state, multiple times per second, already. Fast
offline/online benefits everyone... 

As for the hacks... what does /proc/cpuinfo show on your system? What
happens when a process requests binding to a specific cpu? I fear that
the switching hack is easy now, but will have an ugly impact in
unexpected places.

BTW I guess this should be discussed on linux-kernel, not on the arm
list.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH 00/16] big.LITTLE low-level CPU and cluster power management
  2013-03-07 14:51         ` Pavel Machek
@ 2013-03-07 15:42           ` Nicolas Pitre
  0 siblings, 0 replies; 140+ messages in thread
From: Nicolas Pitre @ 2013-03-07 15:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 7 Mar 2013, Pavel Machek wrote:

> Hi!
> 
> > > > > I'd say:
> > > > > 
> > > > > * expose all 8 cores.
> > > > 
> > > > You may do that now.  However the resulting power efficiency is far from 
> > > > optimal.
> > > 
> > > Really? Assuming the lwn article is accurate, it will be exactly the
> > > same.
> > > 
> > > I propose exposing all 8 cores, then keeping 4 of them offline in
> > > normal operation. This should have characteristics of your solution,
> > > except that cpu does not change unexpectedly under the pinned tasks.
> > > 
> > > > > * as long as scheduler is not ready, you can offline "the other" set of cores...
> > > > > 	/sys/.../cpu/..../online
> > > > 
> > > > You may do that now also.  But system performance is far from optimal.
> > > 
> > > Really? Do the offlining from the cpufreq code, and solution will be
> > > equivalent to yours, except the hacks.
> > 
> > No it is not equivalent.  Offlining a CPU is a fairly heavy operation in 
> > Linux and you don't want to do that multiple times per second for 
> > example.  Whereas the switcher "hack" is similar to a context switch in 
> > terms of software complexity.
> 
> Well, why not. Logically, you _are_ saving state, offlining, onlining,
> and restoring state, multiple times per second, already. 

Maybe "logically", but in practice there is no actual 
offlining/onlining but rather a migration.

> Fast offline/online benefits everyone...

Absolutely! You could ask tglx and co. where they are with their attempt 
at making hotplug less costly.

> As for the hacks... what does /proc/cpuinfo show on your system? What
> happens when process requests binding to specific cpu? I fear that
> switching hack is easy now, but will have unnice impact on unexpected
> places.
> 
> BTW I guess this should be discussed on linux-kernel, not on the arm
> list.

There is nothing to discuss until I'm ready to show you the code.


Nicolas

^ permalink raw reply	[flat|nested] 140+ messages in thread

end of thread, other threads:[~2013-03-07 15:42 UTC | newest]

Thread overview: 140+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-10  0:20 [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Nicolas Pitre
2013-01-10  0:20 ` [PATCH 01/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
2013-01-10  7:12   ` Stephen Boyd
2013-01-10 15:30     ` Nicolas Pitre
2013-01-10 15:34   ` Catalin Marinas
2013-01-10 16:47     ` Nicolas Pitre
2013-01-11 11:45       ` Catalin Marinas
2013-01-11 12:05         ` Lorenzo Pieralisi
2013-01-11 12:19         ` Dave Martin
2013-01-10 23:05   ` Will Deacon
2013-01-11  1:26     ` Nicolas Pitre
2013-01-11 10:55       ` Will Deacon
2013-01-11 11:35         ` Dave Martin
2013-01-11 17:16   ` Santosh Shilimkar
2013-01-11 18:10     ` Nicolas Pitre
2013-01-11 18:30       ` Santosh Shilimkar
2013-03-07  7:37   ` Pavel Machek
2013-03-07  8:57     ` Nicolas Pitre
2013-01-10  0:20 ` [PATCH 02/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
2013-01-10 23:08   ` Will Deacon
2013-01-11  2:30     ` Nicolas Pitre
2013-01-11 10:58       ` Will Deacon
2013-01-11 11:29       ` Dave Martin
2013-01-11 17:26   ` Santosh Shilimkar
2013-01-11 18:33     ` Nicolas Pitre
2013-01-11 18:41       ` Santosh Shilimkar
2013-01-11 19:54         ` Nicolas Pitre
2013-01-10  0:20 ` [PATCH 03/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
2013-01-10 12:01   ` Dave Martin
2013-01-10 19:04     ` Nicolas Pitre
2013-01-11 11:30       ` Dave Martin
2013-01-10 16:53   ` Catalin Marinas
2013-01-10 17:59     ` Nicolas Pitre
2013-01-10 21:50       ` Catalin Marinas
2013-01-10 22:31         ` Nicolas Pitre
2013-01-11 10:36           ` Dave Martin
2013-01-10 22:32     ` Nicolas Pitre
2013-01-10 23:13   ` Will Deacon
2013-01-11  1:50     ` Nicolas Pitre
2013-01-11 11:09       ` Dave Martin
2013-01-11 17:46   ` Santosh Shilimkar
2013-01-11 18:07     ` Dave Martin
2013-01-11 18:34       ` Santosh Shilimkar
2013-01-14 17:08   ` Dave Martin
2013-01-14 17:15     ` Catalin Marinas
2013-01-14 18:10       ` Dave Martin
2013-01-14 21:34         ` Catalin Marinas
2013-01-10  0:20 ` [PATCH 04/16] ARM: b.L: Add baremetal voting mutexes Nicolas Pitre
2013-01-10 23:18   ` Will Deacon
2013-01-11  3:15     ` Nicolas Pitre
2013-01-11 11:03       ` Will Deacon
2013-01-11 16:57       ` Dave Martin
2013-01-10  0:20 ` [PATCH 05/16] ARM: bL_head: vlock-based first man election Nicolas Pitre
2013-01-10  0:20 ` [PATCH 06/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
2013-01-11 18:02   ` Santosh Shilimkar
2013-01-14 18:05     ` Achin Gupta
2013-01-15  6:32       ` Santosh Shilimkar
2013-01-15 11:18         ` Achin Gupta
2013-01-15 11:26           ` Santosh Shilimkar
2013-01-15 18:53           ` Dave Martin
2013-01-14 16:35   ` Will Deacon
2013-01-14 16:51     ` Nicolas Pitre
2013-01-15 19:09       ` Dave Martin
2013-01-10  0:20 ` [PATCH 07/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU Nicolas Pitre
2013-01-14 16:37   ` Will Deacon
2013-01-14 16:53     ` Nicolas Pitre
2013-01-14 17:00       ` Will Deacon
2013-01-14 17:11         ` Catalin Marinas
2013-01-14 17:15         ` Nicolas Pitre
2013-01-14 17:23           ` Will Deacon
2013-01-14 18:26           ` Russell King - ARM Linux
2013-01-14 18:49             ` Nicolas Pitre
2013-01-15 18:40             ` Dave Martin
2013-01-16 16:06               ` Catalin Marinas
2013-01-10  0:20 ` [PATCH 08/16] ARM: bL_platsmp.c: make sure the GIC interface of a dying CPU is disabled Nicolas Pitre
2013-01-11 18:07   ` Santosh Shilimkar
2013-01-11 19:07     ` Nicolas Pitre
2013-01-12  6:50       ` Santosh Shilimkar
2013-01-12 16:47         ` Nicolas Pitre
2013-01-13  4:37           ` Santosh Shilimkar
2013-01-14 17:53           ` Lorenzo Pieralisi
2013-01-14 16:39   ` Will Deacon
2013-01-14 16:54     ` Nicolas Pitre
2013-01-14 17:02       ` Will Deacon
2013-01-14 17:18         ` Nicolas Pitre
2013-01-14 17:24           ` Will Deacon
2013-01-14 17:56             ` Lorenzo Pieralisi
2013-01-10  0:20 ` [PATCH 09/16] ARM: vexpress: Select the correct SMP operations at run-time Nicolas Pitre
2013-01-10  0:20 ` [PATCH 10/16] ARM: vexpress: introduce DCSCB support Nicolas Pitre
2013-01-11 18:12   ` Santosh Shilimkar
2013-01-11 19:13     ` Nicolas Pitre
2013-01-12  6:52       ` Santosh Shilimkar
2013-01-10  0:20 ` [PATCH 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation Nicolas Pitre
2013-01-10  0:20 ` [PATCH 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster Nicolas Pitre
2013-01-10  0:20 ` [PATCH 13/16] drivers: misc: add ARM CCI support Nicolas Pitre
2013-01-11 18:20   ` Santosh Shilimkar
2013-01-11 19:22     ` Nicolas Pitre
2013-01-12  6:53       ` Santosh Shilimkar
2013-01-15 18:34       ` Dave Martin
2013-01-10  0:20 ` [PATCH 14/16] ARM: TC2: ensure powerdown-time data is flushed from cache Nicolas Pitre
2013-01-10 18:50   ` Dave Martin
2013-01-10 19:13     ` Nicolas Pitre
2013-01-11 11:38       ` Dave Martin
2013-01-10  0:20 ` [PATCH 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
2013-01-10 12:05   ` Dave Martin
2013-01-11 18:27   ` Santosh Shilimkar
2013-01-11 19:28     ` Nicolas Pitre
2013-01-12  7:21       ` Santosh Shilimkar
2013-01-14 12:25         ` Lorenzo Pieralisi
2013-01-15  6:23           ` Santosh Shilimkar
2013-01-15 18:20             ` Dave Martin
2013-01-16  6:33               ` Santosh Shilimkar
2013-01-16 10:03                 ` Lorenzo Pieralisi
2013-01-16 10:12                   ` Santosh Shilimkar
2013-01-10  0:20 ` [PATCH 16/16] ARM: vexpress/dcscb: probe via device tree Nicolas Pitre
2013-01-10  0:46 ` [PATCH 00/16] big.LITTLE low-level CPU and cluster power management Rob Herring
2013-01-10  5:04   ` Nicolas Pitre
2013-01-10 23:01 ` Will Deacon
     [not found] ` <1357777251-13541-1-git-send-email-nicolas.pitre-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
2013-01-14  9:56   ` Joseph Lo
2013-01-14  9:56     ` Joseph Lo
     [not found]     ` <1358157392.19304.243.camel-yx3yKKdKkHfc7b1ADBJPm0n48jw8i0AO@public.gmane.org>
2013-01-14 14:05       ` Nicolas Pitre
2013-01-14 14:05         ` Nicolas Pitre
     [not found]         ` <alpine.LFD.2.02.1301140849020.6300-QuJgVwGFrdf/9pzu0YdTqQ@public.gmane.org>
2013-01-15  2:44           ` Joseph Lo
2013-01-15  2:44             ` Joseph Lo
     [not found]             ` <1358217848.8513.14.camel-yx3yKKdKkHfc7b1ADBJPm0n48jw8i0AO@public.gmane.org>
2013-01-15 16:44               ` Nicolas Pitre
2013-01-15 16:44                 ` Nicolas Pitre
2013-01-16 16:02                 ` Catalin Marinas
2013-01-16 16:02                   ` Catalin Marinas
     [not found]                   ` <20130116160242.GB31318-5wv7dgnIgG8@public.gmane.org>
2013-01-16 21:18                     ` Nicolas Pitre
2013-01-16 21:18                       ` Nicolas Pitre
     [not found]                       ` <alpine.LFD.2.02.1301161614390.6300-QuJgVwGFrdf/9pzu0YdTqQ@public.gmane.org>
2013-01-17 17:55                         ` Catalin Marinas
2013-01-17 17:55                           ` Catalin Marinas
2013-01-15 18:31           ` Dave Martin
2013-01-15 18:31             ` Dave Martin
2013-03-07  8:27 ` Pavel Machek
2013-03-07  9:12   ` Nicolas Pitre
2013-03-07  9:40     ` Pavel Machek
2013-03-07  9:56       ` Nicolas Pitre
2013-03-07 14:51         ` Pavel Machek
2013-03-07 15:42           ` Nicolas Pitre
