* Versatile Express randomly fails to boot
@ 2015-03-15 21:33 Russell King - ARM Linux
  2015-03-16  0:04 ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-15 21:33 UTC (permalink / raw)
  To: linux-arm-kernel

I have noticed recently that the Versatile Express with the CT9x4 tile
randomly fails to boot in the nightly boot tests.

Remember that these are from my build system, which is mainline plus
my development, plus arm-soc, so they're not pure -rc1, -rc2 nor -rc3.

Booting the exact same kernel image multiple times, with the same DT
blob shows the same symptom - there are no kernel messages output
after decompression.  This is a fairly recent regression - 4.0-rc1
seemed fine, the few tests that 4.0-rc2 was subjected to also look
fine, but 4.0-rc3 seems to randomly fail.

Adding in my standard DEBUG_LL hack, and a few extra printk()s reveals
that the kernel stops when trying to bring up the secondary CPUs:

...
Calibrating local timer... 200.00MHz.
Calibrating delay loop... 795.44 BogoMIPS (lpj=3977216)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 2048 (order: 1, 8192 bytes)
Mountpoint-cache hash table entries: 2048 (order: 1, 8192 bytes)
CPU: Testing write buffer coherency: ok
CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
Setting up static identity map for 0x6038ef68 - 0x6038efc0
trying to boot secondary 1
sending ipi
ipi sent, waiting
wait done
boot secondary returned 0
CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
trying to boot secondary 2

So, we see that we brought up the CPU1 just fine, and:

printk("trying to boot secondary %u\n", cpu);
        ret = smp_ops.smp_boot_secondary(cpu, idle);
printk("boot secondary returned %d\n", ret);

we reached that printk() for CPU2, but we didn't get to:

int versatile_boot_secondary(unsigned int cpu, struct task_struct *idle)
{
        spin_lock(&boot_lock);
        write_pen_release(cpu_logical_map(cpu));
printk("sending ipi\n");
        arch_send_wakeup_ipi_mask(cpumask_of(cpu));
printk("ipi sent, waiting\n");

here.

Winding the kernel back to -rc1 results in it behaving again.

I don't see anything in the diff between -rc1 and -rc3 which would
explain it.

Another boot:

CPU: Testing write buffer coherency: ok
CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
Setting up static identity map for 0x6038efd8 - 0x6038f030
trying to boot secondary 1
before spin_lock()
before write_pen_release()
sending ipi
ipi sent, waiting
wait done
BUG: spinlock lockup suspected on CPU#1, swapper/1/0
 lock: boot_lock+0x0/0x30, .magic: dead4ead, .owner: swapper/0/1, .owner_cpu: 0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.0.0-rc3+ #1
Hardware name: ARM-Versatile Express
Backtrace:
[<c0011a64>] (dump_backtrace) from [<c0011d24>] (show_stack+0x18/0x1c)
 r6:17b4c000 r5:c050d150 r4:00000000 r3:00200040
[<c0011d0c>] (show_stack) from [<c0388e00>] (dump_stack+0x74/0x90)
[<c0388d8c>] (dump_stack) from [<c038695c>] (spin_dump+0x80/0x94)
 r4:ee860000 r3:ee865a00
[<c03868dc>] (spin_dump) from [<c005a9d8>] (do_raw_spin_lock+0xec/0x1c0)
 r5:00000000 r4:17b4c000
[<c005a8ec>] (do_raw_spin_lock) from [<c038e55c>] (_raw_spin_lock+0x3c/0x44)
 r10:00000000 r9:410fc091 r8:6000406a r7:c0532d3c r6:10c0387d r5:00000015
 r4:c050d150
[<c038e520>] (_raw_spin_lock) from [<c001beb0>] (versatile_secondary_init+0x20/0x30)
 r4:c050d150
[<c001be90>] (versatile_secondary_init) from [<c00138f0>] (secondary_start_kernel+0x100/0x160)
 r4:00000001 r3:c001be90
[<c00137f0>] (secondary_start_kernel) from [<600087e4>] (0x600087e4)
 r4:8e87406a r3:c00087cc
boot secondary returned 0

This one is interesting, as it /seems/ that the boot CPU got stuck on
spin_unlock() in versatile_boot_secondary() - the unlock wasn't seen by
CPU1, but neither did CPU0 make it out of versatile_boot_secondary()
despite getting out of the pen_release==-1 loop.
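
For anyone not familiar with the pen_release/boot_lock handshake, here is
a rough single-threaded model of it.  This is only a sketch: the struct and
function names are made up for illustration, the lock is reduced to an
"owner" field, and the real code uses a spinlock, barriers and a timeout.

```c
#include <assert.h>
#include <stdbool.h>

#define PEN_CLOSED (-1)   /* pen_release value meaning "nobody is being released" */
#define LOCK_FREE  (-1)   /* boot_lock has no owner */

/* Hypothetical model state; the kernel uses a spinlock and a global. */
struct boot_model {
    int pen_release;
    int lock_owner;
};

/* Boot CPU, first half of versatile_boot_secondary(): lock, then open the pen. */
void boot_cpu_release(struct boot_model *m, int cpu)
{
    assert(cpu > 0);              /* CPU0 is the boot CPU in this model */
    m->lock_owner = 0;
    m->pen_release = cpu;
}

/*
 * Secondary CPU, versatile_secondary_init(): close the pen again, then
 * try to take boot_lock.  Returns true once the secondary would get
 * through, false while it would still be spinning on the lock.
 */
bool secondary_init(struct boot_model *m, int cpu)
{
    if (m->pen_release == cpu)
        m->pen_release = PEN_CLOSED;
    return m->lock_owner == LOCK_FREE;
}

/*
 * Boot CPU, second half: once the pen has been closed, drop boot_lock.
 * Returns false while the boot CPU would still be waiting on the pen.
 */
bool boot_cpu_finish(struct boot_model *m)
{
    if (m->pen_release != PEN_CLOSED)
        return false;
    m->lock_owner = LOCK_FREE;
    return true;
}
```

In terms of this model, the lockup above looks like secondary_init() having
closed the pen but never seeing the lock become free, i.e. the state just
before boot_cpu_finish() takes effect.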

Another interesting factor is that adding these printk()s appears to have
stopped it booting at all... :(

Arnd suggested commit 17f480342026e54000, so I reverted that, and I get
a different behaviour - instead of the spinlock lockup, I get:

trying to boot secondary 1
before spin_lock()
before write_pen_release()
sending ipi
ipi sent, waiting
wait done
boot secondary returned 0

I guess reverting it just changes the timing due to the placement of
code/data in the kernel.

I'm going to try a few other kernels to try and track down what's going
on - whether something from arm-soc or my tree is responsible for this
really weird behaviour.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* Versatile Express randomly fails to boot
  2015-03-15 21:33 Versatile Express randomly fails to boot Russell King - ARM Linux
@ 2015-03-16  0:04 ` Russell King - ARM Linux
  2015-03-16  0:42   ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-16  0:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> I'm going to try a few other kernels to try and track down what's going
> on - whether something from arm-soc or my tree is responsible for this
> really weird behaviour.

Okay, this is weird - it seems that it's caused by the FIQ oops
dumping code/FIQ changes which I've carried for many months
unchanged in my tree.

I haven't yet been able to prove which bit of those changes is
responsible.  With a build time of about 5 minutes, and a test time
of around 10 minutes (due to the number of iterations required to
prove it), it takes a while to narrow stuff down.  It's taken all
evening to work out which branch is responsible, and I'm now narrowing
it down to the commit, which looks like it's a result of something
in the change below.
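
To give a feel for why so many iterations are needed, here is a small
sketch of the arithmetic.  It assumes each boot fails independently with
a fixed probability; the 25% figure in the note below is picked purely for
illustration and is not a measurement, and the function names are made up.

```c
#include <assert.h>

/*
 * Probability that `boots` consecutive boots all pass when each boot
 * fails independently with probability p_fail.
 */
double prob_all_pass(double p_fail, unsigned int boots)
{
    double p = 1.0;
    unsigned int i;

    for (i = 0; i < boots; i++)
        p *= 1.0 - p_fail;
    return p;
}

/*
 * Smallest number of clean boots needed so that a kernel which really
 * does fail with probability p_fail would slip through unnoticed with
 * probability at most max_miss.
 */
unsigned int boots_needed(double p_fail, double max_miss)
{
    unsigned int n = 0;
    double p = 1.0;

    assert(p_fail > 0.0 && p_fail <= 1.0 && max_miss > 0.0);
    while (p > max_miss) {
        p *= 1.0 - p_fail;
        n++;
    }
    return n;
}
```

With an assumed one-in-four failure rate, five clean boots in a row would
still happen by chance roughly 24% of the time, and it takes eleven clean
boots to get that below 5%.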

This is pretty close to Daniel's patches, and we know that the GIC
on Versatile Express has problems with this stuff - I wonder if this
means we can't even probe the GIC to find out whether it's capable of
FIQ delivery via testing whether GICD_ENABLE_GRP1 can be set.
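
The probe in question boils down to "write the bit, read it back".  A
minimal sketch of that idea against a fake register follows; struct
fake_gicd and the gicd_*() helpers are hypothetical stand-ins, not the
driver's code, and the real thing goes through readl_relaxed() and
writel_relaxed() on the distributor base.

```c
#include <stdbool.h>
#include <stdint.h>

#define GICD_ENABLE       0x1
#define GICD_ENABLE_GRP1  0x2

/* Hypothetical stand-in for the distributor control register. */
struct fake_gicd {
    uint32_t ctrl;      /* current register contents */
    uint32_t writable;  /* bits this implementation lets us change */
};

/* Writes to unimplemented or prohibited bits are silently ignored. */
void gicd_write_ctrl(struct fake_gicd *d, uint32_t val)
{
    d->ctrl = val & d->writable;
}

uint32_t gicd_read_ctrl(const struct fake_gicd *d)
{
    return d->ctrl;
}

/* Try to set the group 1 enable bit and report whether it stuck. */
bool gicd_probe_grp1(struct fake_gicd *d)
{
    gicd_write_ctrl(d, GICD_ENABLE_GRP1);
    return (gicd_read_ctrl(d) & GICD_ENABLE_GRP1) != 0;
}
```

On a distributor where the bit is write-ignored, the probe reads back zero
and the grouping setup is skipped.  The open question is whether even this
much is safe on Versatile Express.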

That said, and as I said above, this exact patch has been in my kernel,
and has been built and tested over many months, so I find it hard to
believe that this really _is_ responsible.

For tonight's build, I'm going to drop the FIQ stuff from my devel tree,
which means it'll be gone from the build tree until we can, again,
figure out what the heck's going on here.

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 17e54f1df258..e0ba62117c5a 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -543,7 +543,7 @@ static void ipi_cpu_stop(unsigned int cpu)
 		cpu_relax();
 }
 
-static void ipi_cpu_backtrace(struct pt_regs *regs)
+void ipi_cpu_backtrace(struct pt_regs *regs)
 {
 	int cpu = smp_processor_id();
 
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 439138d3437e..f88af68f9345 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -60,6 +60,7 @@ static int __init user_debug_setup(char *str)
 __setup("user_debug=", user_debug_setup);
 #endif
 
+extern void ipi_cpu_backtrace(struct pt_regs *regs);
 static void dump_mem(const char *, const char *, unsigned long, unsigned long);
 
 void dump_backtrace_entry(unsigned long where, unsigned long from, unsigned long frame)
@@ -480,6 +481,9 @@ asmlinkage void __exception_irq_entry handle_fiq_as_nmi(struct pt_regs *regs)
 	nmi_enter();
 
 	/* nop. FIQ handlers for special arch/arm features can be added here. */
+#ifdef CONFIG_SMP
+	ipi_cpu_backtrace(regs);
+#endif
 
 	nmi_exit();
 
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index dda6dbc23565..786a662f9842 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -48,6 +48,8 @@
 #include "irq-gic-common.h"
 #include "irqchip.h"
 
+#define GICD_ENABLE_GRP1               0x2
+
 union gic_base {
 	void __iomem *common_base;
 	void __percpu * __iomem *percpu_base;
@@ -102,7 +104,7 @@ static struct gic_chip_data gic_data[MAX_GIC_NR] __read_mostly;
 #ifdef CONFIG_GIC_NON_BANKED
 static void __iomem *gic_get_percpu_base(union gic_base *base)
 {
-	return *__this_cpu_ptr(base->percpu_base);
+	return raw_cpu_read(*base->percpu_base);
 }
 
 static void __iomem *gic_get_common_base(union gic_base *base)
@@ -270,8 +272,7 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
 		irqnr = irqstat & GICC_IAR_INT_ID_MASK;
 
 		if (likely(irqnr > 15 && irqnr < 1021)) {
-			irqnr = irq_find_mapping(gic->domain, irqnr);
-			handle_IRQ(irqnr, regs);
+			handle_domain_irq(gic->domain, irqnr, regs);
 			continue;
 		}
 		if (irqnr < 16) {
@@ -298,8 +299,8 @@ static void gic_handle_cascade_irq(unsigned int irq, struct irq_desc *desc)
 	status = readl_relaxed(gic_data_cpu_base(chip_data) + GIC_CPU_INTACK);
 	raw_spin_unlock(&irq_controller_lock);
 
-	gic_irq = (status & 0x3ff);
-	if (gic_irq == 1023)
+	gic_irq = (status & GICC_IAR_INT_ID_MASK);
+	if (gic_irq == GICC_INT_SPURIOUS)
 		goto out;
 
 	cascade_irq = irq_find_mapping(chip_data->domain, gic_irq);
@@ -353,6 +354,25 @@ static u8 gic_get_cpumask(struct gic_chip_data *gic)
 	return mask;
 }
 
+static void gic_cpu_if_up(void)
+{
+	void __iomem *cpu_base = gic_data_cpu_base(&gic_data[0]);
+	void __iomem *dist_base = gic_data_dist_base(&gic_data[0]);
+	u32 bypass = 0;
+
+	/*
+	* Preserve bypass disable bits to be written back later
+	*/
+	bypass = readl(cpu_base + GIC_CPU_CTRL);
+	bypass &= GICC_DIS_BYPASS_MASK;
+
+	if (readl_relaxed(dist_base + GIC_DIST_CTRL) & GICD_ENABLE_GRP1)
+		bypass |= 0x1e;
+
+	writel_relaxed(bypass | GICC_ENABLE, cpu_base + GIC_CPU_CTRL);
+}
+
+
 static void __init gic_dist_init(struct gic_chip_data *gic)
 {
 	unsigned int i;
@@ -360,7 +380,7 @@ static void __init gic_dist_init(struct gic_chip_data *gic)
 	unsigned int gic_irqs = gic->gic_irqs;
 	void __iomem *base = gic_data_dist_base(gic);
 
-	writel_relaxed(0, base + GIC_DIST_CTRL);
+	writel_relaxed(GICD_DISABLE, base + GIC_DIST_CTRL);
 
 	/*
 	 * Set all global interrupts to this CPU only.
@@ -371,9 +391,19 @@ static void __init gic_dist_init(struct gic_chip_data *gic)
 	for (i = 32; i < gic_irqs; i += 4)
 		writel_relaxed(cpumask, base + GIC_DIST_TARGET + i * 4 / 4);
 
+	writel_relaxed(GICD_ENABLE_GRP1, base + GIC_DIST_CTRL);
+
+	/*
+	 * Optionally set all global interrupts to be group 1.
+	 */
+	if (readl_relaxed(base + GIC_DIST_CTRL) & GICD_ENABLE_GRP1) {
+		for (i = 32; i < gic_irqs; i += 32)
+			writel_relaxed(0xffffffff, base + GIC_DIST_IGROUP + i * 4 / 32);
+	}
+
 	gic_dist_config(base, gic_irqs, NULL);
 
-	writel_relaxed(1, base + GIC_DIST_CTRL);
+	writel_relaxed(GICD_ENABLE | GICD_ENABLE_GRP1, base + GIC_DIST_CTRL);
 }
 
 static void gic_cpu_init(struct gic_chip_data *gic)
@@ -400,14 +430,29 @@ static void gic_cpu_init(struct gic_chip_data *gic)
 
 	gic_cpu_config(dist_base, NULL);
 
-	writel_relaxed(0xf0, base + GIC_CPU_PRIMASK);
-	writel_relaxed(1, base + GIC_CPU_CTRL);
+	/*
+	 * Set all PPI and SGI interrupts to be group 1.
+	 *
+	 * If grouping is not available (not implemented or prohibited by
+	 * security mode) these registers are read-as-zero/write-ignored.
+	 */
+	if (readl_relaxed(dist_base + GIC_DIST_CTRL) & GICD_ENABLE_GRP1) {
+		writel_relaxed(0xfffffeff, dist_base + GIC_DIST_IGROUP + 0);
+		writel_relaxed(0xa0a0a000, dist_base + GIC_DIST_PRI + 8);
+	}
+
+	writel_relaxed(GICC_INT_PRI_THRESHOLD, base + GIC_CPU_PRIMASK);
+	gic_cpu_if_up();
 }
 
 void gic_cpu_if_down(void)
 {
 	void __iomem *cpu_base = gic_data_cpu_base(&gic_data[0]);
-	writel_relaxed(0, cpu_base + GIC_CPU_CTRL);
+	u32 val = 0;
+
+	val = readl(cpu_base + GIC_CPU_CTRL);
+	val &= ~GICC_ENABLE;
+	writel_relaxed(val, cpu_base + GIC_CPU_CTRL);
 }
 
 #ifdef CONFIG_CPU_PM
@@ -467,14 +512,14 @@ static void gic_dist_restore(unsigned int gic_nr)
 	if (!dist_base)
 		return;
 
-	writel_relaxed(0, dist_base + GIC_DIST_CTRL);
+	writel_relaxed(GICD_DISABLE, dist_base + GIC_DIST_CTRL);
 
 	for (i = 0; i < DIV_ROUND_UP(gic_irqs, 16); i++)
 		writel_relaxed(gic_data[gic_nr].saved_spi_conf[i],
 			dist_base + GIC_DIST_CONFIG + i * 4);
 
 	for (i = 0; i < DIV_ROUND_UP(gic_irqs, 4); i++)
-		writel_relaxed(0xa0a0a0a0,
+		writel_relaxed(GICD_INT_DEF_PRI_X4,
 			dist_base + GIC_DIST_PRI + i * 4);
 
 	for (i = 0; i < DIV_ROUND_UP(gic_irqs, 4); i++)
@@ -485,7 +530,7 @@ static void gic_dist_restore(unsigned int gic_nr)
 		writel_relaxed(gic_data[gic_nr].saved_spi_enable[i],
 			dist_base + GIC_DIST_ENABLE_SET + i * 4);
 
-	writel_relaxed(1, dist_base + GIC_DIST_CTRL);
+	writel_relaxed(GICD_ENABLE | 2, dist_base + GIC_DIST_CTRL);
 }
 
 static void gic_cpu_save(unsigned int gic_nr)
@@ -504,11 +549,11 @@ static void gic_cpu_save(unsigned int gic_nr)
 	if (!dist_base || !cpu_base)
 		return;
 
-	ptr = __this_cpu_ptr(gic_data[gic_nr].saved_ppi_enable);
+	ptr = raw_cpu_ptr(gic_data[gic_nr].saved_ppi_enable);
 	for (i = 0; i < DIV_ROUND_UP(32, 32); i++)
 		ptr[i] = readl_relaxed(dist_base + GIC_DIST_ENABLE_SET + i * 4);
 
-	ptr = __this_cpu_ptr(gic_data[gic_nr].saved_ppi_conf);
+	ptr = raw_cpu_ptr(gic_data[gic_nr].saved_ppi_conf);
 	for (i = 0; i < DIV_ROUND_UP(32, 16); i++)
 		ptr[i] = readl_relaxed(dist_base + GIC_DIST_CONFIG + i * 4);
 
@@ -530,19 +575,31 @@ static void gic_cpu_restore(unsigned int gic_nr)
 	if (!dist_base || !cpu_base)
 		return;
 
-	ptr = __this_cpu_ptr(gic_data[gic_nr].saved_ppi_enable);
+	ptr = raw_cpu_ptr(gic_data[gic_nr].saved_ppi_enable);
 	for (i = 0; i < DIV_ROUND_UP(32, 32); i++)
 		writel_relaxed(ptr[i], dist_base + GIC_DIST_ENABLE_SET + i * 4);
 
-	ptr = __this_cpu_ptr(gic_data[gic_nr].saved_ppi_conf);
+	ptr = raw_cpu_ptr(gic_data[gic_nr].saved_ppi_conf);
 	for (i = 0; i < DIV_ROUND_UP(32, 16); i++)
 		writel_relaxed(ptr[i], dist_base + GIC_DIST_CONFIG + i * 4);
 
 	for (i = 0; i < DIV_ROUND_UP(32, 4); i++)
-		writel_relaxed(0xa0a0a0a0, dist_base + GIC_DIST_PRI + i * 4);
+		writel_relaxed(GICD_INT_DEF_PRI_X4,
+					dist_base + GIC_DIST_PRI + i * 4);
 
-	writel_relaxed(0xf0, cpu_base + GIC_CPU_PRIMASK);
-	writel_relaxed(1, cpu_base + GIC_CPU_CTRL);
+	/*
+	 * Set all PPI and SGI interrupts to be group 1.
+	 *
+	 * If grouping is not available (not implemented or prohibited by
+	 * security mode) these registers are read-as-zero/write-ignored.
+	 */
+	if (readl_relaxed(dist_base + GIC_DIST_CTRL) & GICD_ENABLE_GRP1) {
+		writel_relaxed(0xfffffeff, dist_base + GIC_DIST_IGROUP + 0);
+		writel_relaxed(0xa0a0a000, dist_base + GIC_DIST_PRI + 8);
+	}
+
+	writel_relaxed(GICC_INT_PRI_THRESHOLD, cpu_base + GIC_CPU_PRIMASK);
+	gic_cpu_if_up();
 }
 
 static int gic_notifier(struct notifier_block *self, unsigned long cmd,	void *v)
@@ -600,10 +657,19 @@ static void __init gic_pm_init(struct gic_chip_data *gic)
 #endif
 
 #ifdef CONFIG_SMP
+static bool sgi_is_nonsecure(int irq, struct gic_chip_data *gic)
+{
+	void __iomem *dist_base = gic_data_dist_base(gic);
+	/* FIXME: this should be done in a more generic way */
+	return irq != 8 && readl_relaxed(dist_base + GIC_DIST_CTRL) & GICD_ENABLE_GRP1;
+}
+
 static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
 {
-	int cpu;
+	struct gic_chip_data *gic = &gic_data[0];
 	unsigned long flags, map = 0;
+	unsigned int softirq;
+	int cpu;
 
 	raw_spin_lock_irqsave(&irq_controller_lock, flags);
 
@@ -617,8 +683,14 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
 	 */
 	dmb(ishst);
 
+	softirq = map << 16 | irq;
+
+	/* SATT only has effect if we are running in the secure world */
+	if (sgi_is_nonsecure(irq, gic))
+		softirq |= 0x8000;
+
 	/* this always happens on GIC0 */
-	writel_relaxed(map << 16 | irq, gic_data_dist_base(&gic_data[0]) + GIC_DIST_SOFTINT);
+	writel_relaxed(softirq, gic_data_dist_base(gic) + GIC_DIST_SOFTINT);
 
 	raw_spin_unlock_irqrestore(&irq_controller_lock, flags);
 }
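
The value written to GIC_DIST_SOFTINT at the end of the patch can be
checked in isolation.  A minimal sketch, assuming the usual layout of
that register (CPU target list in bits 23:16, the non-secure attribute
in bit 15, which is the 0x8000 in the patch, and the SGI number in bits
3:0); sgir_value() and the SGIR_* names are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SGIR_TARGET_SHIFT 16          /* CPU target list, bits [23:16] */
#define SGIR_NSATT        (1u << 15)  /* the 0x8000 used in the patch */

/* Build the software interrupt trigger value for the given targets. */
uint32_t sgir_value(uint32_t cpu_map, unsigned int sgi, bool nonsecure)
{
    uint32_t val;

    assert(sgi < 16);                 /* only SGIs 0-15 exist */
    val = (cpu_map & 0xffu) << SGIR_TARGET_SHIFT | sgi;
    if (nonsecure)
        val |= SGIR_NSATT;
    return val;
}
```

So SGI 1 aimed at CPU1 with grouping enabled comes out as 0x00028001,
while SGI 8 (the one the patch keeps out of group 1) aimed at CPU1 and
CPU2 is 0x00060008.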


-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* Versatile Express randomly fails to boot
  2015-03-16  0:04 ` Russell King - ARM Linux
@ 2015-03-16  0:42   ` Russell King - ARM Linux
  2015-03-16  9:35     ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-16  0:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
> On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> > I'm going to try a few other kernels to try and track down what's going
> > on - whether something from arm-soc or my tree is responsible for this
> > really weird behaviour.
> 
> Okay, this is weird - it seems that it's caused by the FIQ oops
> dumping code/FIQ changes which I've carried for many months
> unchanged in my tree.

More weirdness.  Progressing forwards through my development code
showed that when I merged the patch I mentioned in the previous mail,
things started to fail.

As I also mentioned, I'd drop that branch (two patches, one adding
the IPI backtrace stuff and the second one updating the GIC to allow
it to raise FIQs on suitably equipped platforms.)  I would have
expected that to have worked, but it just failed after four boot
iterations.  So either it's not the FIQ, or it is the FIQ code _and_
also something else.  Or it has something to do with the placement
of functions in the kernel.

I'll try more stuff tomorrow, working from where I presently am
(which is basically last night's code minus the FIQ changes) by
removing other changes to see what brings us back to a working
system.

As I've already said - this is really weird because all of these
changes were also tested against -rc1... those which weren't are:

mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
mm: move randomize_et_dyn into ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
arm: factor out mmap ASLR into mmap_rnd

and a number of clkdev rework patches (to make it use clk_hw
internally.)  Neither of these should be affecting it, but that's
something I will be testing tomorrow.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* Versatile Express randomly fails to boot
  2015-03-16  0:42   ` Russell King - ARM Linux
@ 2015-03-16  9:35     ` Russell King - ARM Linux
  2015-03-16 13:04       ` Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-16  9:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
> > On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> > > I'm going to try a few other kernels to try and track down what's going
> > > on - whether something from arm-soc or my tree is responsible for this
> > > really weird behaviour.
> > 
> > Okay, this is weird - it seems that it's caused by the FIQ oops
> > dumping code/FIQ changes which I've carried for many months
> > unchanged in my tree.
> 
> More weirdness.  Progressing forwards through my development code
> showed that when I merged the patch I mentioned in the previous mail,
> things started to fail.
> 
> As I also mentioned, I'd drop that branch (two patches, one adding
> the IPI backtrace stuff and the second one updating the GIC to allow
> it to raise FIQs on suitably equipped platforms.)  I would have
> expected that to have worked, but it just failed after four boot
> iterations.  So either it's not the FIQ, or it is the FIQ code _and_
> also something else.  Or it has something to do with the placement
> of functions in the kernel.
> 
> I'll try more stuff tomorrow, working from where I presently am
> (which is basically last night's code minus the FIQ changes) by
> removing other changes to see what brings us back to a working
> system.
> 
> As I've already said - this is really weird because all of these
> changes were also tested against -rc1... those which weren't are:
> 
> mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
> mm: split ET_DYN ASLR from mmap ASLR
> mm: move randomize_et_dyn into ELF_ET_DYN_BASE
> mm: expose arch_mmap_rnd when available
> arm: factor out mmap ASLR into mmap_rnd
> 
> and a number of clkdev rework patches (to make it use clk_hw
> internally.)  Neither of these should be affecting it, but that's
> something I will be testing tomorrow.

Okay, reverting the ASLR changes and the clkdev changes annoyingly still
results in random failure.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-16  9:35     ` Russell King - ARM Linux
@ 2015-03-16 13:04       ` Russell King - ARM Linux
  2015-03-16 17:47         ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-16 13:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 16, 2015 at 09:35:53AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
> > On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
> > > On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
> > > > I'm going to try a few other kernels to try and track down what's going
> > > > on - whether something from arm-soc or my tree is responsible for this
> > > > really weird behaviour.
> > > 
> > > Okay, this is weird - it seems that it's caused by the FIQ oops
> > > dumping code/FIQ changes which I've carried for many months
> > > unchanged in my tree.
> > 
> > More weirdness.  Progressing forwards through my development code
> > showed that when I merged the patch I mentioned in the previous mail,
> > things started to fail.
> > 
> > As I also mentioned, I'd drop that branch (two patches, one adding
> > the IPI backtrace stuff and the second one updating the GIC to allow
> > it to raise FIQs on suitably equipped platforms.)  I would have
> > expected that to have worked, but it just failed after four boot
> > iterations.  So either it's not the FIQ, or it is the FIQ code _and_
> > also something else.  Or it has something to do with the placement
> > of functions in the kernel.
> > 
> > I'll try more stuff tomorrow, working from where I presently am
> > (which is basically last night's code minus the FIQ changes) by
> > removing other changes to see what brings us back to a working
> > system.
> > 
> > As I've already said - this is really weird because all of these
> > changes were also tested against -rc1... those which weren't are:
> > 
> > mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
> > mm: split ET_DYN ASLR from mmap ASLR
> > mm: move randomize_et_dyn into ELF_ET_DYN_BASE
> > mm: expose arch_mmap_rnd when available
> > arm: factor out mmap ASLR into mmap_rnd
> > 
> > and a number of clkdev rework patches (to make it use clk_hw
> > internally.)  Neither of these should be affecting it, but that's
> > something I will be testing tomorrow.
> 
> Okay, reverting the ASLR changes and the clkdev changes annoyingly still
> results in random failure.

After ruling out ASLR and clkdev, I started progressively reverting other
stuff in the build tree.  Eventually, I got down to reverting the L2C
change I've been carrying since the L2C cleanups.

With that lot reverted, which is slightly more than the previously known
good test, it booted five times without issue.

So, I thought I'd add that L2C change to the list of bad commits, and try
omitting _just_ the L2C and FIQ changes... and it still fails - on the
first test boot iteration.

I think I'm going to declare that this is all down to some obscure
hardware problem with Versatile Express, which is tickled by the layout
of the kernel against the cache, and take it out of the nightly system
(it's pointless having unstable hardware being tested; random failures
are completely meaningless.)

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-16 13:04       ` Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing Russell King - ARM Linux
@ 2015-03-16 17:47         ` Sudeep Holla
  2015-03-16 18:16           ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-16 17:47 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On 16/03/15 13:04, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 09:35:53AM +0000, Russell King - ARM Linux wrote:
>> On Mon, Mar 16, 2015 at 12:42:39AM +0000, Russell King - ARM Linux wrote:
>>> On Mon, Mar 16, 2015 at 12:04:38AM +0000, Russell King - ARM Linux wrote:
>>>> On Sun, Mar 15, 2015 at 09:33:30PM +0000, Russell King - ARM Linux wrote:
>>>>> I'm going to try a few other kernels to try and track down what's going
>>>>> on - whether something from arm-soc or my tree is responsible for this
>>>>> really weird behaviour.
>>>>
>>>> Okay, this is weird - it seems that it's caused by the FIQ oops
>>>> dumping code/FIQ changes which I've carried for many months
>>>> unchanged in my tree.
>>>
>>> More weirdness.  Progressing forwards through my development code
>>> showed that when I merged the patch I mentioned in the previous mail,
>>> things started to fail.
>>>
>>> As I also mentioned, I'd drop that branch (two patches, one adding
>>> the IPI backtrace stuff and the second one updating the GIC to allow
>>> it to raise FIQs on suitably equipped platforms.)  I would have
>>> expected that to have worked, but it just failed after four boot
>>> iterations.  So either it's not the FIQ, or it is the FIQ code _and_
>>> also something else.  Or it has something to do with the placement
>>> of functions in the kernel.
>>>
>>> I'll try more stuff tomorrow, working from where I presently am
>>> (which is basically last night's code minus the FIQ changes) by
>>> removing other changes to see what brings us back to a working
>>> system.
>>>
>>> As I've already said - this is really weird because all of these
>>> changes were also tested against -rc1... those which weren't are:
>>>
>>> mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
>>> mm: split ET_DYN ASLR from mmap ASLR
>>> mm: move randomize_et_dyn into ELF_ET_DYN_BASE
>>> mm: expose arch_mmap_rnd when available
>>> arm: factor out mmap ASLR into mmap_rnd
>>>
>>> and a number of clkdev rework patches (to make it use clk_hw
>>> internally.)  Neither of these should be affecting it, but that's
>>> something I will be testing tomorrow.
>>
>> Okay, reverting the ASLR changes and the clkdev changes annoyingly still
>> results in random failure.
>
> After ruling out ASLR and clkdev, I started progressively reverting other
> stuff in the build tree.  Eventually, I got down to reverting the L2C
> change I've been carrying since the L2C cleanups.
>
> With that lot reverted, which is slightly more than the previously known
> good test, it booted five times without issue.
>
> So, I thought I'd add that L2C change to the list of bad commits, and try
> omitting _just_ the L2C and FIQ changes... and it still fails - on the
> first test boot iteration.
>
> I think I'm going to declare that this is all down to some obscure
> hardware problem with Versatile Express, which is tickled by the layout
> of the kernel against the cache, and take it out of the nightly system
> (it's pointless having unstable hardware being tested; random failures
> are completely meaningless.)
>

I was able to see the exact same behaviour on my VExpress setup with a
CA9X4 core-tile. A few observations from my side:

1. This issue can be reproduced even on v3.19
2. As you suspected L2C, I tried disabling L2C and it seems to solve
    the issue
3. Since it's very random and enabling DEBUG_LL made it difficult to
    reproduce the issue, I tried to dump the stack using the DS5 debugger
4. The stack is always exactly the same, both on v4.0-rc* and v3.19
    kernels and across multiple runs
5. Connecting to h/w debugger, stopping and re-starting the CPUs,
    solves the issue. It's helping CPUs to get out of __radix_tree_lookup
    somehow

Stacktrace
==========
(sorry, it looks different from a standard Linux backtrace as this one is a
dump from DS5)

CPU0
----
#0 __radix_tree_lookup( root = <Value currently has no location>,
     index = 16, nodep = (struct radix_tree_node**) 0x0,
     slotp = (void***) 0x0 ) at radix-tree.c:517
#1 generic_handle_irq( irq = 16 ) at irqdesc.c:349
#2 __handle_domain_irq( domain = (struct irq_domain*) 0xBF004400,
     hwirq = 16, lookup = <Value currently has no location>,
     regs = <Value currently has no location> ) at irqdesc.c:391
#3 __raw_readl( addr = <Value optimised away by compiler> ) at io.h:121
#4 gic_handle_irq( regs = (struct pt_regs*) 0x805F1F40 ) at irq-gic.c:277
#5 [__irq_svc+0x40]

CPU1
----
#0 __radix_tree_lookup( root = <Value currently has no location>,
     index = 16, nodep = (struct radix_tree_node**) 0x0,
     slotp = (void***) 0x0 ) at radix-tree.c:517
#1 __irq_get_desc_lock( irq = <Value currently has no location>,
     flags = (long unsigned int*) 0xBF08BF94, bus = false, check = 3 )
     at irqdesc.c:544
#2 enable_percpu_irq( irq = 16, type = 0 ) at manage.c:1583
#3 twd_timer_cpu_notify(
     self = <Value not available : Undefined value in stack frame for register R0>,
     action = <Value currently has no location>,
     hcpu = <Value not available : Undefined value in stack frame for register R2> )
     at smp_twd.c:322
#4 notifier_call_chain( nl = <Value currently has no location>,
     val = <Value not available : Undefined value in stack frame for register R1>,
     v = <Value not available : Undefined value in stack frame for register R2>,
     nr_to_call = <Value not available : Undefined value in stack frame for register R3>,
     nr_calls = (int*) 0x0 ) at notifier.c:95
#5 notifier_to_errno( ret = <Value currently has no location> )
     at notifier.h:179
#6 cpu_notify( val = <Value currently has no location>,
     v = <Value currently has no location> ) at cpu.c:234
#7 secondary_start_kernel() at smp.c:367

CPU2 & CPU3
-----------
Not booted yet, still waiting in bootloader


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-16 17:47         ` Sudeep Holla
@ 2015-03-16 18:16           ` Russell King - ARM Linux
  2015-03-16 19:16             ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-16 18:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 16, 2015 at 05:47:46PM +0000, Sudeep Holla wrote:
> Hi Russell,
> 
> I was able to see the exact same behaviour on my VExpress setup with a CA9X4
> core-tile. A few observations from my side:
> 
> 1. This issue can be reproduced even on v3.19
> 2. As you suspected L2C, I tried disabling L2C and it seems to solve
>    the issue

My L2C says its cache ID is 0x410000c3 - which is indeed an L2C-310, but
with an undocumented revision ID of 3 which, as far as we can make out,
is an R1Px where x > 0.
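
For reference, this is how that ID breaks down.  The sketch assumes the
field layout I believe the kernel's cache-l2x0.h uses (implementer in
bits 31:24, part number in bits 9:6, RTL release in bits 5:0); struct
l2c_id and l2c_decode_id() are made-up names for illustration.

```c
#include <stdint.h>

/* Decoded view of the L2C cache ID register. */
struct l2c_id {
    unsigned int implementer;  /* bits [31:24], 0x41 is ARM */
    unsigned int part;         /* bits [9:6], 3 for an L2C-310 */
    unsigned int rtl;          /* bits [5:0], RTL release code */
};

struct l2c_id l2c_decode_id(uint32_t id)
{
    struct l2c_id d;

    d.implementer = id >> 24;
    d.part = (id >> 6) & 0xf;
    d.rtl = id & 0x3f;
    return d;
}
```

Feeding it 0x410000c3 gives implementer 0x41, part 3 and RTL release 3.
If memory serves, the release codes the kernel knows about are 2 for R1P0
and 4 for R2P0, which is why 3 looks like some R1Px.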

> 3. Since it's very random and enabling DEBUG_LL made it difficult to
>    reproduce the issue, I tried to dump the stack using the DS5 debugger
> 4. The stack is always exactly the same, both on v4.0-rc* and v3.19
>    kernels and across multiple runs

Hmm, I hadn't seen these failures before I moved to 4.0-rc3 - before
then my nightly boot tests (which run two boots on the platform each
night) always seemed to succeed.

> 5. Connecting to h/w debugger, stopping and re-starting the CPUs,
>    solves the issue. It's helping CPUs to get out of __radix_tree_lookup
>    somehow

Interesting.  Are the traces below from 4.0-rc3 or an older kernel?

> Stacktrace
> ==========
> (sorry it's looks different from std. Linux backtrace as this one id dump
> from DS5)
> 
> CPU 0
> ----
> #0 __radix_tree_lookup( root = <Value currently has no location>, index =
> 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
> radix-tree.c:517

Can you dump the disassembly around this location for both CPU0 and CPU1
and the register values please?  I think it would be interesting to see
if they're both stuck on exactly the same address access.

I've currently narrowed down my latest potential culprit to something in
my Cubox-i code... specifically something in my "cubox-i-sdio" or
"imx-drm^" branches.

The cubox-i-sdio branch contains Olof's modifications to MMC to support
resets and regulators associated with wifi cards, which would be built,
but we would not have executed any of the MMC code at the point where
we'd be bringing the secondary CPUs up.  The imx-drm^ changes don't
touch any file which is built into my Versatile Express kernel, so it's
unlikely to affect anything (though, I'm build-boot-testing with imx-drm^
but cubox-i-sdio dropped just to make absolutely sure.)

One thing I've tried is turning off the Cortex-A9 features - early
BRESP and full line of zeros.  That seems to make no apparent difference,
though it's hard to tell when #if 0'ing out the code, because that changes
the code placement and seems to stop the problem triggering.  I did have
a case where disabling FLZ (via #if 0'ing it out) seemed to solve it with
errata 588369 enabled, but changing the code to clear the FLZ bit instead
(which should have had the same effect) resulted in the problem
re-appearing.

I'm beginning to believe at this point that there /is/ a bug in the L2C on
the test chip, and that we're probably better off changing the Versatile
Express DT files to disable the L2C cache controller... what are your
thoughts on that?

I'm currently doing up to 8 boot tests - if I can do 8 consecutive boot
tests which all succeed, I'm declaring it a pass, otherwise it's a fail.
Generally, I've found that it fails very early (e.g. on the first boot),
but sometimes as late as the 4th.

I guess one thing we need to confirm is whether we have exactly the same
hardware and firmware versions.  Here's my board's early boot messages:

ARM V2M Boot loader v1.1.1
HBI0190 build 2313

ARM V2M Firmware v3.1.2
Build Date: Apr 16 2013

Date: Mon 30 Mar 2009
Time:     16:59:14

Cmd> reboot

Powering up system...
Daughterboard fitted to site 1.

Switching on ATXPSU...
ATX3V3: ON
VIOset: 1.8V
MBtemp: 27 degC

Configuring motherboard (rev D, var A)...
IOFPGA  config: PASSED
MUXFPGA config: PASSED
OSC CLK config: PASSED

Testing SMC devices (FPGA build 8)...
SRAM 32MB test: PASSED
VRAM  8MB test: PASSED
LAN9118   test: PASSED
USB & OTG test: PASSED
KMI1/KMI2 test: PASSED
MMC & SD  test: PASSED
DVI image test: PASSED
AACI AC97 test: PASSED
CF card   test: PASSED
UART port test: PASSED
MAC addrs test: PASSED

Reading Site 1 Board File \SITE1\HBI0191B\board.txt
DB1 JTAG configuration complete.
Setting DB1 OSCCLKS...
DB1.0 DCC 0 SPI configuration complete.

Writing SCC 0x40610000 with 0xBB8A802A
Writing SCC 0x40610001 with 0x00001F09
Writing SCC 0x40610002 with 0x00000000
DB1.0 DCC 0 SCC configuration complete.

DB SMB clock enabled.
Waiting for SITE1 CB_READY...
Testing SMB clock...
Configuring MUXFPGA for MB.
Setting DVI mode for VGA.
Releasing Daughterboard resets.
Switching MCC log to UART1.

Warning: Card Format not recognised, please check card.

ARM Versatile Express Boot Monitor
Version:    V5.2.1
Build Date: Apr  4 2013
Daughterboard Site 1: V2P-CA9 Cortex A9
Daughterboard Site 2: Not Used
Running boot script from flash - BOOTSCRIPT


U-Boot 2013.01.-rc1-00003-g43ee87aabf17-dirty (Jan 07 2014 - 00:00:38)
...

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-16 18:16           ` Russell King - ARM Linux
@ 2015-03-16 19:16             ` Sudeep Holla
  2015-03-16 19:52               ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-16 19:16 UTC (permalink / raw)
  To: linux-arm-kernel



On 16/03/15 18:16, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 05:47:46PM +0000, Sudeep Holla wrote:
>> Hi Russell,
>>
>> I was able to see exact behaviour on my VExpress setup with CA9X4 core-tile.
>> Few observations from my side:
>>
>> 1. This issue can be reproduced even on v3.19
>> 2. As you suspected L2C, I tried disabling L2C and it seems to solve
>>     the issue
>
> My L2C says it's cache ID is 0x410000c3 - which is indeed a L2C-310, but
> with an undocumented revision ID of 3, which as far as we can make out,
> it's a R1Px where x > 0.
>
>> 3. Since it's very random and enabling LL_DEBUG made it difficult to
>>     reproduce the issue, I tried to dump the stack using DS5 debugger
>> 4. The stack is exactly same always both on v4.0-rc* and v3.19 kernel
>>     and on multiple runs
>
> Hmm, I haven't seen them before I moved to 4.0-rc3 - before then my
> nightly boot tests (which run two boots on the platform each night)
> always seemed to succeed.
>
>> 5. Connecting to h/w debugger, stopping and re-starting the CPUs,
>>     solves the issue. It's helping CPUs to get out of __radix_tree_lookup
>>     somehow
>
> Interesting.  Are the traces below from 4.0-rc3 or an older kernel?
>

This one is with v3.19, but I get the exact same trace with v4.0-rc* kernels.

>> Stacktrace
>> ==========
>> (sorry it's looks different from std. Linux backtrace as this one id dump
>> from DS5)
>>
>> CPU 0
>> ----
>> #0 __radix_tree_lookup( root = <Value currently has no location>, index =
>> 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at
>> radix-tree.c:517
>
> Can you dump the disassembly around this location for both CPU0 and CPU1
> and the register values please?  I think it would be interesting to see
> if they're both stuck on exactly the same address access.
>

(with v4.0-rc4 this time)

CPU#0
=====
#0 __radix_tree_lookup( root = <Value currently has no location>, index 
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at 
radix-tree.c:517
     node = (struct radix_tree_node*) 0xBEC00001
     parent = <Value optimised away by compiler>
     height = 1
     shift = 0
     slot = <Value currently has no location>
#1 generic_handle_irq( irq = 16 ) at irqdesc.c:349
     desc = <Value optimised away by compiler>
#2 __handle_domain_irq( domain = (struct irq_domain*) 0xBF004400, hwirq 
= 16, lookup = <Value currently has no location>, regs = <Value 
currently has no location> ) at irqdesc.c:391
     old_regs = (struct pt_regs*) 0x0
     irq = <Value optimised away by compiler>
     ret = 0
#3 __raw_readl( addr = <Value optimised away by compiler> ) at io.h:121
#4 gic_handle_irq( regs = (struct pt_regs*) 0x805F1F40 ) at irq-gic.c:277
     irqstat = 2147518036
     irqnr = <Value currently has no location>
     gic = <Value optimised away by compiler>
     cpu_base = (void*) 0xC0802100
#5 [__irq_svc+0x40]

S:0x8021F80C : LSL      lr,r4,#3
S:0x8021F810 : SUB      lr,lr,r4,LSL #1
S:0x8021F814 : SUB      lr,lr,#6
S:0x8021F818 : B        {pc}+8 ; 0x8021f820
S:0x8021F81C : MOV      r5,r0
S:0x8021F820 : LSR      r12,r1,lr
S:0x8021F824 : SUB      lr,lr,#6
S:0x8021F828 : AND      r12,r12,#0x3f
S:0x8021F82C : ADD      r12,r12,#6
S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]

Core registers:
R0           0x0000003F
R1           0x00000010
R2           0x00000000
R3           0x00000000
R4           0x00000001
R5           0xBEC00000
R6           0x00000000
R7           0x00000000
R8           0xBF004400
R9           0x805F1F90
R10          0x00000001
R11          0x805EEB08
R12          0xBEC00001
SP           0x805F1EFC
LR           0x00000000
PC           0x8021F820
CPSR         0x80000193

CPU#1
=====
#0 __radix_tree_lookup( root = <Value currently has no location>, index 
= 16, nodep = (struct radix_tree_node**) 0x0, slotp = (void***) 0x0 ) at 
radix-tree.c:517
     node = (struct radix_tree_node*) 0xBEC00001
     parent = <Value optimised away by compiler>
     height = 1
     shift = 0
     slot = <Value currently has no location>
#1 __irq_get_desc_lock( irq = <Value currently has no location>, flags = 
(long unsigned int*) 0xBF08BF94, bus = false, check = 3 ) at irqdesc.c:544
     desc = <Value optimised away by compiler>
#2 enable_percpu_irq( irq = 16, type = 0 ) at manage.c:1583
     cpu = 1
     flags = <Value currently has no location>
     desc = <Value optimised away by compiler>
#3 twd_timer_cpu_notify( self = <Value not available : Undefined value 
in stack frame for register R0>, action = <Value currently has no 
location>, hcpu = <Value not available : Undefined value in stack frame 
for register R2> ) at smp_twd.c:322
#4 notifier_call_chain( nl = <Value currently has no location>, val = 
<Value not available : Undefined value in stack frame for register R1>, 
v = <Value not available : Undefined value in stack frame for register 
R2>, nr_to_call = <Value not available : Undefined value in stack frame 
for register R3>, nr_calls = (int*) 0x0 ) at notifier.c:95
     ret = <Value currently has no location>
     nb = <Value optimised away by compiler>
     next_nb = <Value optimised away by compiler>
#5 notifier_to_errno( ret = <Value currently has no location> ) at 
notifier.h:179
#6 cpu_notify( val = <Value currently has no location>, v = <Value 
currently has no location> ) at cpu.c:234
#7 secondary_start_kernel() at smp.c:367
     mm = <Value optimised away by compiler>
     cpu = 1
#8 [S:0x60008724]

Disassembly:

S:0x8021F80C : LSL      lr,r4,#3
S:0x8021F810 : SUB      lr,lr,r4,LSL #1
S:0x8021F814 : SUB      lr,lr,#6
S:0x8021F818 : B        {pc}+8 ; 0x8021f820
S:0x8021F81C : MOV      r5,r0
S:0x8021F820 : LSR      r12,r1,lr
S:0x8021F824 : SUB      lr,lr,#6
S:0x8021F828 : AND      r12,r12,#0x3f
S:0x8021F82C : ADD      r12,r12,#6
S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]

Core registers:
R0           0x0000003F
R1           0x00000010
R2           0x00000000
R3           0x00000000
R4           0x00000001
R5           0xBEC00000
R6           0xBF08BF94
R7           0x00000000
R8           0x805F92A0
R9           0x00000000
R10          0x00000000
R11          0x00000000
R12          0xBEC00001
SP           0xBF08BF6C
LR           0x00000000
PC           0x8021F820
CPSR         0x800001D3   Nzcvq_ge3ge2ge1ge0_inactive_eAIFtj_SVC

[...]

> I'm beginning to believe at this point that there /is/ a bug in the L2C on
> the test chip, and that we're probably better off changing the Versatile
> Express DT files to disable the L2C cache controller... what are your
> thoughts on that?
>

I was thinking of taking a dump of the L2C register settings and comparing
them. But currently I am facing issues booting even v3.18 on my setup;
it seems to fail somewhere else, which I need to look at.

> I'm currently doing up to 8 boot tests - if I can do 8 consecutive boot
> tests which all succeed, I'm declaring it a pass, otherwise it's a fail.
> Generally, I've found that it will fail very early (like the first) but
> sometimes up to the 4th.
>
> I guess one thing we need to confirm is whether we have exactly the same
> hardware and firmware versions.  Here's my board's early boot messages:
>

ARM V2M Boot loader v1.1.2
HBI0190 build 2313

ARM V2M Firmware v3.1.2
Build Date: Apr 16 2013

Date: Mon 16 Mar 2015
Time:     18:57:21

Powering up system...
Daughterboard fitted to site 1.

Switching on ATXPSU...
ATX3V3: ON
VIOset: 1.8V
MBtemp: 26 degC

Configuring motherboard (rev D, var A)...
IOFPGA  config: PASSED
MUXFPGA config: PASSED
OSC CLK config: PASSED

Testing SMC devices (FPGA build 8)...
SRAM 32MB test: PASSED
VRAM  8MB test: PASSED
LAN9118   test: PASSED
USB & OTG test: PASSED
KMI1/KMI2 test: PASSED
MMC & SD  test: PASSED
DVI image test: PASSED
AACI AC97 test: PASSED
CF card   test: PASSED
UART port test: PASSED
MAC addrs test: PASSED

Reading Site 1 Board File \SITE1\HBI0191B\board.txt
DB1 JTAG configuration complete.
Setting DB1 OSCCLKS...
DB1.0 DCC 0 SPI configuration complete.

Writing SCC 0x40610000 with 0xBB8A802A
Writing SCC 0x40610001 with 0x00001F09
Writing SCC 0x40610002 with 0x00000000
DB1.0 DCC 0 SCC configuration complete.

DB SMB clock enabled.
Waiting for SITE1 CB_READY...
Testing SMB clock...
Configuring MUXFPGA for MB.
Setting DVI mode for VGA.
Releasing Daughterboard resets.
Switching MCC log to UART1.
%BootMonitor-Warning, Unable to open SYSTEM.DAT


ARM Versatile Express Boot Monitor
Version:    V5.2.1
Build Date: Apr  4 2013
Daughterboard Site 1: V2P-CA9 Cortex A9
Daughterboard Site 2: Not Used
Running boot script from flash - BOOTSCRIPT


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-16 19:16             ` Sudeep Holla
@ 2015-03-16 19:52               ` Russell King - ARM Linux
  2015-03-17 12:05                 ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-16 19:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 16, 2015 at 07:16:05PM +0000, Sudeep Holla wrote:
> On 16/03/15 18:16, Russell King - ARM Linux wrote:
> >Can you dump the disassembly around this location for both CPU0 and CPU1
> >and the register values please?  I think it would be interesting to see
> >if they're both stuck on exactly the same address access.
> 
> (with v4.0-rc4 this time)

Thanks.

> CPU#0
> =====
...
> S:0x8021F80C : LSL      lr,r4,#3
> S:0x8021F810 : SUB      lr,lr,r4,LSL #1
> S:0x8021F814 : SUB      lr,lr,#6
> S:0x8021F818 : B        {pc}+8 ; 0x8021f820
> S:0x8021F81C : MOV      r5,r0
> S:0x8021F820 : LSR      r12,r1,lr
> S:0x8021F824 : SUB      lr,lr,#6
> S:0x8021F828 : AND      r12,r12,#0x3f
> S:0x8021F82C : ADD      r12,r12,#6
> S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]
> 
> Core registers:
> R0           0x0000003F
> R1           0x00000010
> R2           0x00000000
> R3           0x00000000
> R4           0x00000001
> R5           0xBEC00000
> R6           0x00000000
> R7           0x00000000
> R8           0xBF004400
> R9           0x805F1F90
> R10          0x00000001
> R11          0x805EEB08
> R12          0xBEC00001
> SP           0x805F1EFC
> LR           0x00000000
> PC           0x8021F820
> CPSR         0x80000193
> 
> CPU#1
> =====
...
> S:0x8021F80C : LSL      lr,r4,#3
> S:0x8021F810 : SUB      lr,lr,r4,LSL #1
> S:0x8021F814 : SUB      lr,lr,#6
> S:0x8021F818 : B        {pc}+8 ; 0x8021f820
> S:0x8021F81C : MOV      r5,r0
> S:0x8021F820 : LSR      r12,r1,lr
> S:0x8021F824 : SUB      lr,lr,#6
> S:0x8021F828 : AND      r12,r12,#0x3f
> S:0x8021F82C : ADD      r12,r12,#6
> S:0x8021F830 : LDR      r0,[r5,r12,LSL #2]
> 
> Core registers:
> R0           0x0000003F
> R1           0x00000010
> R2           0x00000000
> R3           0x00000000
> R4           0x00000001
> R5           0xBEC00000
> R6           0xBF08BF94
> R7           0x00000000
> R8           0x805F92A0
> R9           0x00000000
> R10          0x00000000
> R11          0x00000000
> R12          0xBEC00001
> SP           0xBF08BF6C
> LR           0x00000000
> PC           0x8021F820
> CPSR         0x800001D3   Nzcvq_ge3ge2ge1ge0_inactive_eAIFtj_SVC

And we find that both CPUs have stopped at exactly the same place, which
is an arithmetic instruction.

If I had to guess, I'd say the reason it's stopped there (exactly on a
cache line boundary) is because both CPUs are waiting for an instruction
fetch to complete into their L1 I-caches, and for some reason, the L2
cache is not satisfying the request from either CPU.  The question of
course is... why not.
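
The "cache line boundary" observation can be checked with a little
arithmetic.  A minimal sketch, assuming 32-byte L1 cache lines as on the
Cortex-A9; the macro and helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-byte L1 cache lines, as on the Cortex-A9. */
#define L1_CACHE_LINE_BYTES	32u

/* Hypothetical helper: byte offset of an address within its cache line. */
uint32_t cache_line_offset(uint32_t addr)
{
	return addr & (L1_CACHE_LINE_BYTES - 1);
}

/* Hypothetical helper: is addr the first byte of a cache line? */
int starts_cache_line(uint32_t addr)
{
	return cache_line_offset(addr) == 0;
}

/* Check the PC reported for both CPUs in the register dumps. */
void check_stopped_pc(void)
{
	/* LSR r12,r1,lr - where both CPUs are stopped. */
	assert(starts_cache_line(0x8021F820u));
	/* The branch just before it lies in the previous line. */
	assert(cache_line_offset(0x8021F818u) == 0x18u);
}
```

The stopped PC, 0x8021F820, has a zero offset within its line, while the
branch just before it at 0x8021F818 sits in the previous line.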

> >I guess one thing we need to confirm is whether we have exactly the same
> >hardware and firmware versions.  Here's my board's early boot messages:

Looks like we're broadly the same, apart from the boot loader version.
You have 1.1.2, whereas I have 1.1.1.

Coincidentally, I just looked at the disassembly of my __radix_tree_lookup:

c0199750:       e0050495        mul     r5, r5, r4
c0199754:       e2455006        sub     r5, r5, #6
c0199758:       ea000000        b       c0199760 <__radix_tree_lookup+0x70>
c019975c:       e1a0c000        mov     ip, r0
c0199760:       e1a06531        lsr     r6, r1, r5
c0199764:       e206603f        and     r6, r6, #63     ; 0x3f
c0199768:       e2866006        add     r6, r6, #6
c019976c:       e79c0106        ldr     r0, [ip, r6, lsl #2]

The code is slightly different, but notice that the alignment of the
LSR instruction is the same as yours - at first I wondered whether that's
coincidence or not.  However, taking Olof's MMC changes back out of my
tree (which results in a booting kernel) makes no difference to the
placement of this code.

The start of the read-only data section doesn't change between the working
and non-working kernels, but the location of the spinlock and some scheduler
code does change (along with all the networking code.)

There are changes in the read-only data section, and there are also changes
to a set of "descriptor.NNNNN" symbols towards the end of the data section,
which go on to change the placement of the bss section.

The diff between the System.map is unpostable - it's about 1.3MB. :(



* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-16 19:52               ` Russell King - ARM Linux
@ 2015-03-17 12:05                 ` Sudeep Holla
  2015-03-17 15:36                   ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-17 12:05 UTC (permalink / raw)
  To: linux-arm-kernel



On 16/03/15 19:52, Russell King - ARM Linux wrote:
> On Mon, Mar 16, 2015 at 07:16:05PM +0000, Sudeep Holla wrote:
>> On 16/03/15 18:16, Russell King - ARM Linux wrote:

[...]

> If I had to guess, I'd say the reason it's stopped there (exactly on a
> cache line boundary) is because both CPUs are waiting for an instruction
> fetch to complete into its L1 I-cache, and for some reason, the L2
> cache is not satisfying the request from either CPU.  The question of
> course is... why not.
>

As I had mentioned yesterday, I did compare the L2C settings between
v3.18 and later kernels and found them to be *exactly the same*.

Since you suspected issues around instruction fetching, I tried playing
around with the tag and data RAM latencies. After some experiments, I found
that after changing just the tag RAM read latency to 2 cycles, the issue we
are seeing goes away, at least on my setup. It would be good to see the
behaviour on your setup with the patch below.

The default value which bootmon programs happens to be the worst-case
scenario (8 cycles for all). Will recalls that it was changed to the
minimum value after the graphics guys complained about performance.

We need to check with the h/w guys to get the correct optimal values for
these latencies.

Regards,
Sudeep

--->8

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
index 23662b5a5e9d..030c90c1105d 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
@@ -172,7 +172,7 @@
                 interrupts = <0 43 4>;
                 cache-level = <2>;
                 arm,data-latency = <1 1 1>;
-               arm,tag-latency = <1 1 1>;
+               arm,tag-latency = <1 2 1>;
         };

         pmu {


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-17 12:05                 ` Sudeep Holla
@ 2015-03-17 15:36                   ` Russell King - ARM Linux
  2015-03-17 15:51                     ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-17 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 17, 2015 at 12:05:58PM +0000, Sudeep Holla wrote:
> As I had mentioned yesterday, I did compare the L2C settings between
> v3.18 and later kernel and found them to be *exactly same*.
> 
> Since you suspected issues around instruction fetching, I tried playing
> around the tag and data ram latencies. After some experiments, I found
> that changing just the tag ram read latency to 2 cycles, the issue we
> are seeing goes away at-least on my setup. It will be good to see the
> behaviour on your setup with the patch below.
> 
> The default value which bootmon is programming happens to be worst
> case scenario(8 cycles for all). Will recalls that it was changed to
> minimum value after graphics guys complained about performance.
> 
> We need to check with h/w guys to get the correct optimal values for
> these latencies.
> 
> Regards,
> Sudeep
> 
> --->8
> 
> diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> index 23662b5a5e9d..030c90c1105d 100644
> --- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> +++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
> @@ -172,7 +172,7 @@
>                 interrupts = <0 43 4>;
>                 cache-level = <2>;
>                 arm,data-latency = <1 1 1>;
> -               arm,tag-latency = <1 1 1>;
> +               arm,tag-latency = <1 2 1>;

I've tried <1 2 1> and <1 8 1> here, I don't see any difference.  My test
build fails on the first boot attempt for each.

I notice you're only changing the write latency here.  Is that correct?
You mention read latency above.



* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-17 15:36                   ` Russell King - ARM Linux
@ 2015-03-17 15:51                     ` Sudeep Holla
  2015-03-17 16:17                       ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-17 15:51 UTC (permalink / raw)
  To: linux-arm-kernel



On 17/03/15 15:36, Russell King - ARM Linux wrote:
> On Tue, Mar 17, 2015 at 12:05:58PM +0000, Sudeep Holla wrote:
>> As I had mentioned yesterday, I did compare the L2C settings between
>> v3.18 and later kernel and found them to be *exactly same*.
>>
>> Since you suspected issues around instruction fetching, I tried playing
>> around the tag and data ram latencies. After some experiments, I found
>> that changing just the tag ram read latency to 2 cycles, the issue we
>> are seeing goes away at-least on my setup. It will be good to see the
>> behaviour on your setup with the patch below.
>>
>> The default value which bootmon is programming happens to be worst
>> case scenario(8 cycles for all). Will recalls that it was changed to
>> minimum value after graphics guys complained about performance.
>>
>> We need to check with h/w guys to get the correct optimal values for
>> these latencies.
>>
>> Regards,
>> Sudeep
>>
>> --->8
>>
>> diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> index 23662b5a5e9d..030c90c1105d 100644
>> --- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> +++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts
>> @@ -172,7 +172,7 @@
>>                  interrupts = <0 43 4>;
>>                  cache-level = <2>;
>>                  arm,data-latency = <1 1 1>;
>> -               arm,tag-latency = <1 1 1>;
>> +               arm,tag-latency = <1 2 1>;
>
> I've tried <1 2 1> and <1 8 1> here, I don't see any difference.  My test
> build fails on the first boot attempt for each.
>

That's bad. I started with 2 cycles for all (rd/wr/setup) latencies (data
and tag RAM) and narrowed it down to this setting over multiple
experiments. I booted at least 10 times for each setting.

Since the bootmon sets 8 cycles for all the latencies, does it make
sense to try that setting to check if the issue you are seeing is
related to L2 latencies at all? Meanwhile I will continue my testing.

> I notice you're only changing the write latency here.  Is that correct?
> You mention read latency above.
>

Sorry, my bad. You are right, it's the write latency; I misread the L2C
binding document.

Regards,
Sudeep


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-17 15:51                     ` Sudeep Holla
@ 2015-03-17 16:17                       ` Russell King - ARM Linux
  2015-03-30 14:03                         ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-17 16:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 17, 2015 at 03:51:53PM +0000, Sudeep Holla wrote:
> That's bad. I started with 2 cycles for all(rd/wr/setup) latencies(data
> and tag ram) and narrowed down to this setting with multiple
> experiments. I did try booting 10 times each time at-least.
> 
> Since the bootmon sets 8 cycles for all the latencies, does it make
> sense to try that setting to check if the issue you are seeing is
> related to L2 latencies at all. Meanwhile I will continue my testing.

For me <2 2 1> works - so read and write latencies of 2, setup of 1.
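
For reference, a minimal sketch of how such a <read write setup> triplet
would end up in the L2C-310 latency control register, assuming the field
layout used by the kernel's L2C driver (setup in bits [2:0], read in bits
[6:4], write in bits [10:8], each stored as cycles minus one).  The macro
and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed L2C-310 latency control fields; each holds (cycles - 1). */
#define LATENCY_CTRL_SETUP(n)	((uint32_t)(n) << 0)
#define LATENCY_CTRL_RD(n)	((uint32_t)(n) << 4)
#define LATENCY_CTRL_WR(n)	((uint32_t)(n) << 8)

/*
 * Hypothetical helper: pack a DT-style <read write setup> triplet,
 * given in cycles (1..8), into a latency control register value.
 */
uint32_t l310_latency_encode(unsigned int rd, unsigned int wr,
			     unsigned int setup)
{
	assert(rd >= 1 && rd <= 8);
	assert(wr >= 1 && wr <= 8);
	assert(setup >= 1 && setup <= 8);

	return LATENCY_CTRL_RD(rd - 1) |
	       LATENCY_CTRL_WR(wr - 1) |
	       LATENCY_CTRL_SETUP(setup - 1);
}
```

Under these assumptions <1 1 1> encodes to 0x000, <1 2 1> to 0x100,
<2 2 1> to 0x110 and <8 8 8> to 0x777.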



* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-17 16:17                       ` Russell King - ARM Linux
@ 2015-03-30 14:03                         ` Russell King - ARM Linux
  2015-03-30 14:48                           ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-30 14:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 17, 2015 at 04:17:48PM +0000, Russell King - ARM Linux wrote:
> On Tue, Mar 17, 2015 at 03:51:53PM +0000, Sudeep Holla wrote:
> > That's bad. I started with 2 cycles for all(rd/wr/setup) latencies(data
> > and tag ram) and narrowed down to this setting with multiple
> > experiments. I did try booting 10 times each time at-least.
> > 
> > Since the bootmon sets 8 cycles for all the latencies, does it make
> > sense to try that setting to check if the issue you are seeing is
> > related to L2 latencies at all. Meanwhile I will continue my testing.
> 
> For me <2 2 1> works - so read and write latencies of 2, setup of 1.

So what's happening?  Is someone going to update the dtb to adjust the
latencies so that we can have a reliable platform?



* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-30 14:03                         ` Russell King - ARM Linux
@ 2015-03-30 14:48                           ` Sudeep Holla
  2015-03-30 15:05                             ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-30 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On 30/03/15 15:03, Russell King - ARM Linux wrote:
> On Tue, Mar 17, 2015 at 04:17:48PM +0000, Russell King - ARM Linux wrote:
>> On Tue, Mar 17, 2015 at 03:51:53PM +0000, Sudeep Holla wrote:
>>> That's bad. I started with 2 cycles for all(rd/wr/setup) latencies(data
>>> and tag ram) and narrowed down to this setting with multiple
>>> experiments. I did try booting 10 times each time at-least.
>>>
>>> Since the bootmon sets 8 cycles for all the latencies, does it make
>>> sense to try that setting to check if the issue you are seeing is
>>> related to L2 latencies at all. Meanwhile I will continue my testing.
>>
>> For me <2 2 1> works - so read and write latencies of 2, setup of 1.
>
> So what's happening?  Is someone going to update the dtb to adjust the
> latencies so that we can have a reliable platform?
>

Though <2 2 1> works fine most of the time, I did try a continuous
overnight reboot test and it failed. I kept increasing the latencies and
found that even the maximum latency of <8 8 8> could not survive a
continuous overnight reboot test; it fails with the exact same issue.

So I am not sure if we can consider it a fix. However, if we are OK with
*mostly reliable*, then we can push that change.

Regards,
Sudeep


* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-30 14:48                           ` Sudeep Holla
@ 2015-03-30 15:05                             ` Russell King - ARM Linux
  2015-03-30 15:39                               ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-03-30 15:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 30, 2015 at 03:48:08PM +0100, Sudeep Holla wrote:
> Though <2 2 1> works fine most of the time, I did try testing continuous
> reboot overnight and it failed. I kept increasing the latencies and
> found out that even max latency of <8 8 8> could not survive continuous
> overnight reboot test and it fails with exact same issue.
> 
> So I am not sure if we can consider it as a fix. However if we are OK to
> have *mostly reliable*, then we can push that change.

Okay, the issue I have is this.

Versatile Express used to boot reliably in the nightly build tests prior
to DT.  In that mode, we never configured the latency values.

Then the legacy code was removed, and I had to switch over to DT booting,
and shortly after I noticed that the platform was now randomly failing
its nightly boot tests.

Maybe we should revert the commit removing the superior legacy code,
because that seems to be the only thing that was reliable?  Maybe it was
premature to remove it until DT had proven itself?

On the other hand, if the legacy code hadn't been removed, I probably
would never have tested it - but then, from what I hear, this was a
*known* issue prior to the removal of the legacy code.  Given that the
legacy code worked totally fine, it's utterly idiotic to me to have
removed the working legacy code when DT is so unstable.

Whatever way I look at this, this problem _is_ a _regression_, and we
can't sit around and hope it magically vanishes by some means.

I think given what you've said, it suggests that there is something else
going on.  So, what we need to do is to revert the removal of the legacy
code and investigate what the differences are between the apparently
broken DT code and the working legacy code.

I have not _once_ seen this behaviour with the legacy code.



* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-30 15:05                             ` Russell King - ARM Linux
@ 2015-03-30 15:39                               ` Sudeep Holla
  2015-03-31 17:27                                 ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-30 15:39 UTC (permalink / raw)
  To: linux-arm-kernel



On 30/03/15 16:05, Russell King - ARM Linux wrote:
> On Mon, Mar 30, 2015 at 03:48:08PM +0100, Sudeep Holla wrote:
>> Though <2 2 1> works fine most of the time, I did try testing continuous
>> reboot overnight and it failed. I kept increasing the latencies and
>> found out that even max latency of <8 8 8> could not survive continuous
>> overnight reboot test and it fails with exact same issue.
>>
>> So I am not sure if we can consider it as a fix. However if we are OK to
>> have *mostly reliable*, then we can push that change.
>
> Okay, the issue I have is this.
>
> Versatile Express used to boot reliably in the nightly build tests prior
> to DT.  In that mode, we never configured the latency values.
>

I have never run in legacy mode as I am relatively new to the vexpress
platform and have used DT from the start. Just to understand better
I had a look at commit 81cc3f868d30 ("ARM: vexpress: Remove
non-DT code") and I see the below function in
arch/arm/mach-vexpress/ct-ca9x4.c. So I assume we were programming one
cycle for all the latencies, just like DT.

static void __init ca9x4_l2_init(void)
{
#ifdef CONFIG_CACHE_L2X0
	void __iomem *l2x0_base = ioremap(CT_CA9X4_L2CC, SZ_4K);

	if (l2x0_base) {
		/* set RAM latencies to 1 cycle for this core tile. */
		writel(0, l2x0_base + L310_TAG_LATENCY_CTRL);
		writel(0, l2x0_base + L310_DATA_LATENCY_CTRL);

		l2x0_init(l2x0_base, 0x00400000, 0xfe0fffff);
	} else {
		pr_err("L2C: unable to map L2 cache controller\n");
	}
#endif
}
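
To make the "one cycle" reading concrete, here is a minimal sketch of how a
DT-style latency triple would map onto the value written to the latency
control registers. It assumes the L310 field layout as I understand it
(setup in bits [2:0], read in bits [6:4], write in bits [10:8], each field
holding cycles minus one) and the <read write setup> cell order of the
binding. The helper name l310_latency_encode is made up for illustration
and is not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions in the L310 tag/data latency control registers. */
#define LATENCY_SETUP_SHIFT	0
#define LATENCY_RD_SHIFT	4
#define LATENCY_WR_SHIFT	8

/*
 * Hypothetical helper: encode a <read write setup> triple given in
 * cycles (1..8) into a register value. Each 3-bit field stores
 * "cycles - 1", so <1 1 1> encodes to 0.
 */
uint32_t l310_latency_encode(unsigned int rd, unsigned int wr,
			     unsigned int setup)
{
	assert(rd >= 1 && rd <= 8);
	assert(wr >= 1 && wr <= 8);
	assert(setup >= 1 && setup <= 8);

	return ((uint32_t)(rd - 1) << LATENCY_RD_SHIFT) |
	       ((uint32_t)(wr - 1) << LATENCY_WR_SHIFT) |
	       ((uint32_t)(setup - 1) << LATENCY_SETUP_SHIFT);
}
```

Under those assumptions, the writel(0, ...) calls above correspond to
<1 1 1>, which is why the legacy and DT settings should come out the same.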

> Then the legacy code was removed, and I had to switch over to DT booting,
> and shortly after I noticed that the platform was now randomly failing
> its nightly boot tests.
>
> Maybe we should revert the commit removing the superior legacy code,
> because that seems to be the only thing that was reliable?  Maybe it was
> premature to remove it until DT had proven itself?
>
> On the other hand, if the legacy code hadn't been removed, I probably
> would never have tested it - but then, from what I hear, this was a
> *known* issue prior to the removal of the legacy code.  Given that the
> legacy code worked totally fine, it's utterly idiotic to me to have
> removed the working legacy code when DT is soo unstable.
>
> Whatever way I look at this, this problem _is_ a _regression_, and we
> can't sit around and hope it magically vanishes by some means.
>

I agree, last time I tested it was fine with v3.18. However I have not
run the continuous overnight reboot test on that. I will first start
looking at that, just to see if it's an issue related to DT vs legacy boot.

> I think given what you've said, it suggests that there is something else
> going on.  So, what we need to do is to revert the removal of the legacy
> code and investigate what the differences are between the apparently
> broken DT code and the working legacy code.
>

Agreed, I will see if DT boot was ever stable before and
including v3.18

> I have not _once_ seen this behaviour with the legacy code.
>

OK

Regards,
Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-30 15:39                               ` Sudeep Holla
@ 2015-03-31 17:27                                 ` Sudeep Holla
  2015-04-02 14:13                                   ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-03-31 17:27 UTC (permalink / raw)
  To: linux-arm-kernel



On 30/03/15 16:39, Sudeep Holla wrote:
>
>
> On 30/03/15 16:05, Russell King - ARM Linux wrote:
>> On Mon, Mar 30, 2015 at 03:48:08PM +0100, Sudeep Holla wrote:
>>> Though <2 2 1> works fine most of the time, I did try testing continuous
>>> reboot overnight and it failed. I kept increasing the latencies and
>>> found out that even max latency of <8 8 8> could not survive continuous
>>> overnight reboot test and it fails with exact same issue.
>>>
>>> So I am not sure if we can consider it as a fix. However if we are OK to
>>> have *mostly reliable*, then we can push that change.
>>
>> Okay, the issue I have is this.
>>
>> Versatile Express used to boot reliably in the nightly build tests prior
>> to DT.  In that mode, we never configured the latency values.
>>
>
> I have never run in legacy mode as I am relatively new to vexpress
> platform and started using with DT from first. Just to understand better
> I had a look at commit 81cc3f868d30 ("ARM: vexpress: Remove
> non-DT code") and I see the below function in
> arch/arm/mach-vexpress/ct-ca9x4.c So I assume we were programming one
> cycle for all the latencies just like DT.
>

I was able to boot v3.18 without DT and I compared the L2C settings with
and w/o DT, they are identical. Also v3.18 with and w/o DT survived
overnight reboot testing.

>> Then the legacy code was removed, and I had to switch over to DT booting,
>> and shortly after I noticed that the platform was now randomly failing
>> its nightly boot tests.
>>
>> Maybe we should revert the commit removing the superior legacy code,
>> because that seems to be the only thing that was reliable?  Maybe it was
>> premature to remove it until DT had proven itself?
>>

Not sure on that as v3.18 with DT seems to be working fine and passed
overnight reboot testing.

>> On the other hand, if the legacy code hadn't been removed, I probably
>> would never have tested it - but then, from what I hear, this was a
>> *known* issue prior to the removal of the legacy code.  Given that the
>> legacy code worked totally fine, it's utterly idiotic to me to have
>> removed the working legacy code when DT is soo unstable.
>>
>> Whatever way I look at this, this problem _is_ a _regression_, and we
>> can't sit around and hope it magically vanishes by some means.
>>
>
> I agree, last time I tested it was fine with v3.18. However I have not
> run the continuous overnight reboot test on that. I will first start
> looking at that, just to see if it's an issue related to DT vs legacy boot.
>

Since v3.18 is fine in both boot modes and the problem is reproducible on
v3.19-rc1, I am trying to bisect, but I am not sure if that's feasible for
such a problem. I also found out by accident that even on mainline, with
more configs enabled, it's hard to hit the issue.

Regards,
Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-03-31 17:27                                 ` Sudeep Holla
@ 2015-04-02 14:13                                   ` Russell King - ARM Linux
  2015-04-02 17:38                                     ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2015-04-02 14:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 31, 2015 at 06:27:30PM +0100, Sudeep Holla wrote:
> Not sure on that as v3.18 with DT seems to be working fine and passed
> overnight reboot testing.

Okay, that suggests there's something post v3.18 which is causing this,
rather than it being a DT vs non-DT thing.

An extra data point which I've just found (by enabling attempts to do
hibernation on various test platforms) is that the Versatile Express
appears to be incapable of taking a CPU offline.

This crashes the entire system with sometimes random results.  Sometimes
it'll appear that a spinlock has been left owned by CPU#1 which is
offline.  Sometimes it'll silently hang.  Sometimes it'll start slowly
dumping kernel messages from the start of the kernel's ring buffer (!),
eg:

PM: freeze of devices complete after 29.342 msecs
PM: late freeze of devices complete after 6.398 msecs
PM: noirq freeze of devices complete after 5.493 msecs
Disabling non-boot CPUs ...
__cpu_disable(1)
__cpu_die(1)
handle_IPI(0)
Booting Linux on physical CPU 0x0

So far, it's not managed to take a CPU successfully offline and know that
it has.  If I disable the calls to cpu_enter_lowpower() and
cpu_leave_lowpower(), then it appears to work.

This leads me to wonder whether flush_cache_louis() works... which led me
in turn to ARM_ERRATA_643719, which is disabled in my builds.  However,
the CA9 tile has a r0p1 CA9, which allegedly suffers from this errata.

The really interesting thing is that I've never had that errata enabled
for Versatile Express - even going back to 3.14 times (I have a working
3.14 config file which clearly shows that it was disabled.)  So, I'm
wondering if we've relaxed the cache flushing in such a way that we now
expose the ineffectual flush_cache_louis() bug.

There aren't that many flush_cache_louis() calls in the kernel.  We do
have this:

commit bca7a5a04933700a8bde4ea5798119607a8b0436
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Thu Apr 18 18:15:44 2013 +0100

    ARM: cpu hotplug: remove majority of cache flushing from platforms

in conjunction with:

commit 51acdfd1fa38a2bf1003255be9f105c19fbc0176
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Thu Apr 18 18:05:29 2013 +0100

    ARM: smp: flush L1 cache in cpu_die()

which changed the flush_cache_all() to a flush_cache_louis() in the
hot unplug path.  We also have this:

commit e40678559fdf3f56ce9a349365fbf39e1f63ecc0
Author: Nicolas Pitre <nicolas.pitre@linaro.org>
Date:   Thu Nov 8 19:46:07 2012 +0100

    ARM: 7573/1: idmap: use flush_cache_louis() and flush TLBs only when necessary

which added the flush_cache_louis() for the idmap tables, but prior to
that, I don't see how we were ensuring that the page tables were visible.

I haven't tested going back to a tag latency of 1 1 1 yet.  Can you
confirm whether you have this errata enabled for your tests?

Thanks.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-04-02 14:13                                   ` Russell King - ARM Linux
@ 2015-04-02 17:38                                     ` Sudeep Holla
  2016-06-14 15:31                                       ` Jon Medhurst (Tixy)
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2015-04-02 17:38 UTC (permalink / raw)
  To: linux-arm-kernel



On 02/04/15 15:13, Russell King - ARM Linux wrote:
> On Tue, Mar 31, 2015 at 06:27:30PM +0100, Sudeep Holla wrote:
>> Not sure on that as v3.18 with DT seems to be working fine and passed
>> overnight reboot testing.
>
> Okay, that suggests there's something post v3.18 which is causing this,
> rather than it being a DT vs non-DT thing.
>

Correct. Just to be 100% sure I reverted that non-DT removal commit on
both v3.19-rc1 and v4.0-rc6 and was able to reproduce issue without DT.

> An extra data point which I've just found (by enabling attempts to do
> hibernation on various test platforms) is that the Versatile Express
> appears to be incapable of taking a CPU offline.
>
> This crashes the entire system with sometimes random results.  Sometimes
> it'll appear that a spinlock has been left owned by CPU#1 which is
> offline.  Sometimes it'll silently hang.  Sometimes it'll start slowly
> dumping kernel messages from the start of the kernel's ring buffer (!),
> eg:
>
> PM: freeze of devices complete after 29.342 msecs
> PM: late freeze of devices complete after 6.398 msecs
> PM: noirq freeze of devices complete after 5.493 msecs
> Disabling non-boot CPUs ...
> __cpu_disable(1)
> __cpu_die(1)
> handle_IPI(0)
> Booting Linux on physical CPU 0x0
>
> So far, it's not managed to take a CPU successfully offline and know that
> it has.  If I disable the calls to cpu_enter_lowpower() and
> cpu_leave_lowpower(), then it appears to work.
>
> This leads me to wonder whether flush_cache_louis() works... which led me
> in turn to ARM_ERRATA_643719, which is disabled in my builds.  However,
> the CA9 tile has a r0p1 CA9, which allegedly suffers from this errata.
>

Yes, I observed that and tested for this issue with it enabled. It doesn't
have any effect and I still hit the issue.

[...]
>
> I haven't tested going back to a tag latency of 1 1 1 yet.  Can you
> confirm whether you have this errata enabled for your tests?
>
I have now gone back to <1 1 1> latency to debug the issue as it's
easier to reproduce with those latencies.

After I failed terribly to bisect between v3.18..v3.19-rc1, as it depends
a lot on the config you choose (a lot of changes were introduced as it's
the merge window), I started looking at the code where we hit this issue,
since it's always in __radix_tree_lookup in lib/radix-tree.c while
accessing the slots, to see if it provides any more details.

Regards,
Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2015-04-02 17:38                                     ` Sudeep Holla
@ 2016-06-14 15:31                                       ` Jon Medhurst (Tixy)
  2016-06-14 15:52                                         ` Russell King - ARM Linux
  2016-06-14 16:31                                         ` Sudeep Holla
  0 siblings, 2 replies; 30+ messages in thread
From: Jon Medhurst (Tixy) @ 2016-06-14 15:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Sudeep

Over the past several days I think I've been unknowingly reproducing
many of the steps in this old discussion thread [1] about A9 Versatile
Express boot failures. It's only when I found myself looking at the L2
cache timings that I got a vague recollection and dug out this old
thread again. Was there any resolution to the issue? As far as I can
work out, the A9x4 CoreTile stopped working around Linux 3.18 (the
problem isn't 100% reproducible so it's difficult to tell).

Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
fixed things for him, also works for me. So should we update mainline
device-tree with that?

Alternatively, we could assume nobody cares about A9 as presumably Linux
has been unbootable for a year without anyone raising the issue. (The
only reason I'm looking at it is I may be making U-Boot changes for
vexpress and I wanted to test them).

But if we are going to just ignore things, I think it would be good to
delete the A9 dts, or the L2 cache entry, so other people in the future
don't waste days trying to track down the problem.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/330860.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-May/342005.html

-- 
Tixy


On Thu, 2015-04-02 at 18:38 +0100, Sudeep Holla wrote:
> 
> On 02/04/15 15:13, Russell King - ARM Linux wrote:
> > On Tue, Mar 31, 2015 at 06:27:30PM +0100, Sudeep Holla wrote:
> >> Not sure on that as v3.18 with DT seems to be working fine and passed
> >> overnight reboot testing.
> >
> > Okay, that suggests there's something post v3.18 which is causing this,
> > rather than it being a DT vs non-DT thing.
> >
> 
> Correct. Just to be 100% sure I reverted that non-DT removal commit on
> both v3.19-rc1 and v4.0-rc6 and was able to reproduce issue without DT.
> 
> > An extra data point which I've just found (by enabling attempts to do
> > hibernation on various test platforms) is that the Versatile Express
> > appears to be incapable of taking a CPU offline.
> >
> > This crashes the entire system with sometimes random results.  Sometimes
> > it'll appear that a spinlock has been left owned by CPU#1 which is
> > offline.  Sometimes it'll silently hang.  Sometimes it'll start slowly
> > dumping kernel messages from the start of the kernel's ring buffer (!),
> > eg:
> >
> > PM: freeze of devices complete after 29.342 msecs
> > PM: late freeze of devices complete after 6.398 msecs
> > PM: noirq freeze of devices complete after 5.493 msecs
> > Disabling non-boot CPUs ...
> > __cpu_disable(1)
> > __cpu_die(1)
> > handle_IPI(0)
> > Booting Linux on physical CPU 0x0
> >
> > So far, it's not managed to take a CPU successfully offline and know that
> > it has.  If I disable the calls to cpu_enter_lowpower() and
> > cpu_leave_lowpower(), then it appears to work.
> >
> > This leads me to wonder whether flush_cache_louis() works... which led me
> > in turn to ARM_ERRATA_643719, which is disabled in my builds.  However,
> > the CA9 tile has a r0p1 CA9, which allegedly suffers from this errata.
> >
> 
> Yes, I observed that and tested for this issue with it enabled. It doesn't
> have any effect and I still hit the issue.
> 
> [...]
> >
> > I haven't tested going back to a tag latency of 1 1 1 yet.  Can you
> > confirm whether you have this errata enabled for your tests?
> >
> I have now gone back to <1 1 1> latency to debug the issue as it's
> easier to reproduce with those latencies.
> 
> After I failed terribly to bisect between v3.18..v3.19-rc1, as it depends
> a lot on the config you choose(a lot of changes introduced as it's merge
> window), I started looking at the code where we hit this issue since
> it's always in __radix_tree_lookup in lib/radix-tree.c while
> accessing the slots to see if it provides any more details.
> 
> Regards,
> Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-14 15:31                                       ` Jon Medhurst (Tixy)
@ 2016-06-14 15:52                                         ` Russell King - ARM Linux
  2016-06-14 16:44                                           ` Sudeep Holla
  2016-06-14 16:31                                         ` Sudeep Holla
  1 sibling, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2016-06-14 15:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 14, 2016 at 04:31:25PM +0100, Jon Medhurst (Tixy) wrote:
> Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
> fixed things for him, also works for me. So should we update mainline
> device-tree with that?

I've proposed that several times, and there seems to be no desire to
do so.  For me, VE CT9x4 no longer boots in my nightly builds, and
my plan at the start of the year was to take it out of both the
nightly builds and the boot tests - no one within ARM seems to have
any interest in the platform.

Having a randomly failing platform due to hardware issues is not
productive to an automated boot test system, so I think we should
(a) remove it from automated testing, and (b) consider deleting
support for it from the kernel tree, as it seems there is little
interest in debugging what's happening.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-14 15:31                                       ` Jon Medhurst (Tixy)
  2016-06-14 15:52                                         ` Russell King - ARM Linux
@ 2016-06-14 16:31                                         ` Sudeep Holla
  1 sibling, 0 replies; 30+ messages in thread
From: Sudeep Holla @ 2016-06-14 16:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Tixy,

On 14/06/16 16:31, Jon Medhurst (Tixy) wrote:
> Hi Sudeep
>
> Over the past several days I think I've been unknowingly reproducing
> many of the steps in this old discussion thread [1] about A9 Versatile
> Express boot failures. It's only when I found myself looking at the L2
> cache timings that I got a vague recollection and dug out this old
> thread again. Was there any resolution to the issue? As far as I can
> work out, the A9x4 CoreTile stopped working around Linux 3.18 (the
> problem isn't 100% reproducible so it's difficult to tell).
>
> Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
> fixed things for him, also works for me. So should we update mainline
> device-tree with that?
>

That's fine by me.

> Alternatively, we could assume nobody cares about A9 as presumably Linux
> has been unbootable for a year without anyone raising the issue. (The
> only reason I'm looking at it is I may be making U-Boot changes for
> vexpress and I wanted to test them).
>

I admit I just do a boot test every release and I seem to have flashed
the DT so failed to notice any issue. That's my fault.

> But if we are going to just ignore things, I think it would be good to
> delete the A9 dts, or the L2 cache entry, so other people in the future
> don't waste days trying to track down the problem.
>

I am fine either way, if people still use it as reference, we can retain it.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-14 15:52                                         ` Russell King - ARM Linux
@ 2016-06-14 16:44                                           ` Sudeep Holla
  2016-06-14 16:49                                             ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2016-06-14 16:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On 14/06/16 16:52, Russell King - ARM Linux wrote:
> On Tue, Jun 14, 2016 at 04:31:25PM +0100, Jon Medhurst (Tixy) wrote:
>> Using "arm,tag-latency = <2 2 1>" as Russell seemed to indicate [2]
>> fixed things for him, also works for me. So should we update mainline
>> device-tree with that?
>
> I've proposed that several times, and there seems to be no desire to
> do so.

Sorry for missing that. IIRC, we didn't conclude, as <2 2 1> did fail
the continuous overnight reboot test on my setup. As I mentioned earlier,
we can change to this new value if people are able to use it reliably.

> For me, VE CT9x4 no longer boots in my nightly builds, and
> my plan at the start of the year was to take it out of both the
> nightly builds and the boot tests - no one within ARM seems to have
> any interest in the platform.
>

It's hard to get any kind of attention from hardware guys for such an
old platform.

> Having a randomly failing platform due to hardware issues is not
> productive to an automated boot test system, so I think we should
> (a) remove it from automated testing, and (b) consider deleting
> support for it from the kernel tree, as it seems there is little
> interest in debugging what's happening.
>

Even with higher latency if the platform is unusable, I agree to remove.
If you think it's usable with the updated latency(<2 2 1>) then we can
update it.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-14 16:44                                           ` Sudeep Holla
@ 2016-06-14 16:49                                             ` Russell King - ARM Linux
  2016-06-15  9:27                                               ` Jon Medhurst (Tixy)
  2016-06-15  9:27                                               ` Sudeep Holla
  0 siblings, 2 replies; 30+ messages in thread
From: Russell King - ARM Linux @ 2016-06-14 16:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
> Even with higher latency if the platform is unusable, I agree to remove.
> If you think it's usable with the updated latency(<2 2 1>) then we can
> update it.

The kernels I'm booting have that updated latency.  It used to improve
things, but for most of this year, it fails most boot attempts.  Out
of the last 21 boot attempts, all 21 attempts failed with the above
latency value.

  http://www.armlin ux.org.uk/developer/build/index.php?id=2003

(remove the space...)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-14 16:49                                             ` Russell King - ARM Linux
@ 2016-06-15  9:27                                               ` Jon Medhurst (Tixy)
  2016-06-15  9:32                                                 ` Sudeep Holla
  2016-06-15  9:27                                               ` Sudeep Holla
  1 sibling, 1 reply; 30+ messages in thread
From: Jon Medhurst (Tixy) @ 2016-06-15  9:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2016-06-14 at 17:49 +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
> > Even with higher latency if the platform is unusable, I agree to remove.
> > If you think it's usable with the updated latency(<2 2 1>) then we can
> > update it.
> 
> The kernels I'm booting have that updated latency.  It used to improve
> things, but for most of this year, it fails most boot attempts.  Out
> of the last 21 boot attempts, all 21 attempts failed with the above
> latency value.

I've done some more testing and it seems the different latency values
don't reliably fix things for me after all.

I had previously tried enabling the A9 errata workarounds that aren't
enabled for vexpress:

ARM_ERRATA_751472  Hacked to remove wrong(?) depends !ARCH_MULTIPLATFORM
ARM_ERRATA_754327
ARM_ERRATA_764369

That didn't fix things, but I noticed this morning there are PL310
errata workarounds too that we should have enabled:

PL310_ERRATA_769419
PL310_ERRATA_727915
PL310_ERRATA_588369

I'm going to do testing with different combinations to see if I can
gather evidence that any of these errata are causing the problem we're
hitting. Of course, errata workarounds as with cache timing tweaks will
affect timings in the system and so could affect the triggering of our
problem without actually being the root cause.

-- 
Tixy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-14 16:49                                             ` Russell King - ARM Linux
  2016-06-15  9:27                                               ` Jon Medhurst (Tixy)
@ 2016-06-15  9:27                                               ` Sudeep Holla
  1 sibling, 0 replies; 30+ messages in thread
From: Sudeep Holla @ 2016-06-15  9:27 UTC (permalink / raw)
  To: linux-arm-kernel



On 14/06/16 17:49, Russell King - ARM Linux wrote:
> On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
>> Even with higher latency if the platform is unusable, I agree to remove.
>> If you think it's usable with the updated latency(<2 2 1>) then we can
>> update it.
>
> The kernels I'm booting have that updated latency.  It used to improve
> things, but for most of this year, it fails most boot attempts.  Out
> of the last 21 boot attempts, all 21 attempts failed with the above
> latency value.
>

Thanks, I do see that, it's unreliable with higher latencies too. If I
increase them, it seems to work, but again as the size of the image
increases the behavior changes.

So, apart from increasing the latency or removing the DT completely, I
was thinking of a 3rd option of disabling L2CC on the Vexpress CA9 coretile.
Let me know if that's acceptable. I thought it's reasonable as we
can still keep the platform support without L2CC enabled.

Regards,
Sudeep


-->8

diff --git i/arch/arm/boot/dts/vexpress-v2p-ca9.dts w/arch/arm/boot/dts/vexpress-v2p-ca9.dts
index b608a03ee02f..9742448b4e85 100644
--- i/arch/arm/boot/dts/vexpress-v2p-ca9.dts
+++ w/arch/arm/boot/dts/vexpress-v2p-ca9.dts
@@ -174,6 +174,7 @@
                 cache-level = <2>;
                 arm,data-latency = <1 1 1>;
                 arm,tag-latency = <1 1 1>;
+               status = "disabled";
         };

         pmu {
diff --git i/arch/arm/mm/cache-l2x0.c w/arch/arm/mm/cache-l2x0.c
index c61996c256cc..569fb1f0994b 100644
--- i/arch/arm/mm/cache-l2x0.c
+++ w/arch/arm/mm/cache-l2x0.c
@@ -1750,6 +1750,9 @@ int __init l2x0_of_init(u32 aux_val, u32 aux_mask)
         if (!np)
                 return -ENODEV;

+       if (!of_device_is_available(np))
+               return -ENODEV;
+
         if (of_address_to_resource(np, 0, &res))
                 return -ENODEV;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-15  9:27                                               ` Jon Medhurst (Tixy)
@ 2016-06-15  9:32                                                 ` Sudeep Holla
  2016-06-15  9:50                                                   ` Jon Medhurst (Tixy)
  0 siblings, 1 reply; 30+ messages in thread
From: Sudeep Holla @ 2016-06-15  9:32 UTC (permalink / raw)
  To: linux-arm-kernel



On 15/06/16 10:27, Jon Medhurst (Tixy) wrote:
> On Tue, 2016-06-14 at 17:49 +0100, Russell King - ARM Linux wrote:
>> On Tue, Jun 14, 2016 at 05:44:26PM +0100, Sudeep Holla wrote:
>>> Even with higher latency if the platform is unusable, I agree to remove.
>>> If you think it's usable with the updated latency(<2 2 1>) then we can
>>> update it.
>>
>> The kernels I'm booting have that updated latency.  It used to improve
>> things, but for most of this year, it fails most boot attempts.  Out
>> of the last 21 boot attempts, all 21 attempts failed with the above
>> latency value.
>

[...]

>
> That didn't fix things, but I noticed this morning there are PL310
> errata workarounds too that we should have enabled:
>
> PL310_ERRATA_769419
> PL310_ERRATA_727915
> PL310_ERRATA_588369
>

I had all the above enabled in my config which doesn't help either.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-15  9:32                                                 ` Sudeep Holla
@ 2016-06-15  9:50                                                   ` Jon Medhurst (Tixy)
  2016-06-15  9:59                                                     ` Sudeep Holla
  0 siblings, 1 reply; 30+ messages in thread
From: Jon Medhurst (Tixy) @ 2016-06-15  9:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2016-06-15 at 10:32 +0100, Sudeep Holla wrote:
> > That didn't fix things, but I noticed this morning there are PL310
> > errata workarounds too that we should have enabled:
> >
> > PL310_ERRATA_769419
> > PL310_ERRATA_727915
> > PL310_ERRATA_588369
> >
> 
> I had all the above enabled in my config which doesn't help either.

Thanks for letting me know, that's saved me from a tedious morning of
testing. Though did you also have the A9 errata enabled?

BTW, I noticed the PL310 Cache ID reports RTL version 3, which doesn't
appear in the list of versions in cache-l2x0.h and doesn't match the TRM
shipped on the vexpress CD, which is for version 2 (r1p0).
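
For anyone wanting to repeat that check, a minimal sketch of the decode
follows. It assumes the cache ID layout as I read it in cache-l2x0.h (RTL
release in bits [5:0], part number in bits [9:6], with 3 meaning L310); the
release table is a partial list from memory and should be treated as an
assumption, and both the helper name l310_rtl_name and the example ID
values in the note below are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache ID register layout, following my reading of cache-l2x0.h. */
#define CACHE_ID_PART_MASK	(0xfu << 6)
#define CACHE_ID_PART_L310	(3u << 6)
#define CACHE_ID_RTL_MASK	0x3fu

/*
 * Hypothetical helper: map an L310 cache ID value to a release name.
 * Returns NULL when the part is not an L310 or the RTL value is not in
 * this (partial, assumed) table.
 */
const char *l310_rtl_name(uint32_t cache_id)
{
	if ((cache_id & CACHE_ID_PART_MASK) != CACHE_ID_PART_L310)
		return NULL;

	switch (cache_id & CACHE_ID_RTL_MASK) {
	case 0x00:
		return "r0p0";
	case 0x02:
		return "r1p0";
	case 0x04:
		return "r2p0";
	case 0x05:
		return "r3p0";
	default:
		return NULL;	/* e.g. RTL version 3, as seen on this tile */
	}
}
```

With an illustrative ID of 0x410000c2 this yields "r1p0", while 0x410000c3
(RTL version 3) falls through to NULL, which is the mismatch described above.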

Anyway, I'll test various Linux versions with the L2 cache removed from
device-tree to see if that reliably fixes things.

-- 
Tixy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing
  2016-06-15  9:50                                                   ` Jon Medhurst (Tixy)
@ 2016-06-15  9:59                                                     ` Sudeep Holla
  0 siblings, 0 replies; 30+ messages in thread
From: Sudeep Holla @ 2016-06-15  9:59 UTC (permalink / raw)
  To: linux-arm-kernel



On 15/06/16 10:50, Jon Medhurst (Tixy) wrote:
> On Wed, 2016-06-15 at 10:32 +0100, Sudeep Holla wrote:
>>> That didn't fix things, but I noticed this morning there are PL310
>>> errata workarounds too that we should have enabled:
>>>
>>> PL310_ERRATA_769419
>>> PL310_ERRATA_727915
>>> PL310_ERRATA_588369
>>>
>>
>> I had all the above enabled in my config which doesn't help either.
>
> Thanks for letting me know, that's saved me from a tedious morning of
> testing. Though did you also have the A9 errata enabled?
>

Yes, I tried multi_v7_defconfig, which has all the A9 errata enabled.

> BTW, I noticed the PL310 Cache ID reports RTL version 3, which doesn't
> appear in the list of versions in cache-l2x0.h and doesn't match the TRM
> shipped on the vexpress CD, which is for version 2 (r1p0).
>

Yes, I remember making similar observations in the past. IIRC, it's one
of the earliest versions used on that core tile (which is generally the
case with the first few versions of test chips).

> Anyway, I'll test various Linux versions with the L2 cache removed from
> device-tree to see if that reliably fixes things.
>

Thanks, but I would like to see what Russell thinks about this approach
before you spend more time on it.

-- 
Regards,
Sudeep


End of thread (last message: 2016-06-15  9:59 UTC)

Thread overview: 30+ messages
2015-03-15 21:33 Versatile Express randomly fails to boot Russell King - ARM Linux
2015-03-16  0:04 ` Russell King - ARM Linux
2015-03-16  0:42   ` Russell King - ARM Linux
2015-03-16  9:35     ` Russell King - ARM Linux
2015-03-16 13:04       ` Versatile Express randomly fails to boot - Versatile Express to be removed from nightly testing Russell King - ARM Linux
2015-03-16 17:47         ` Sudeep Holla
2015-03-16 18:16           ` Russell King - ARM Linux
2015-03-16 19:16             ` Sudeep Holla
2015-03-16 19:52               ` Russell King - ARM Linux
2015-03-17 12:05                 ` Sudeep Holla
2015-03-17 15:36                   ` Russell King - ARM Linux
2015-03-17 15:51                     ` Sudeep Holla
2015-03-17 16:17                       ` Russell King - ARM Linux
2015-03-30 14:03                         ` Russell King - ARM Linux
2015-03-30 14:48                           ` Sudeep Holla
2015-03-30 15:05                             ` Russell King - ARM Linux
2015-03-30 15:39                               ` Sudeep Holla
2015-03-31 17:27                                 ` Sudeep Holla
2015-04-02 14:13                                   ` Russell King - ARM Linux
2015-04-02 17:38                                     ` Sudeep Holla
2016-06-14 15:31                                       ` Jon Medhurst (Tixy)
2016-06-14 15:52                                         ` Russell King - ARM Linux
2016-06-14 16:44                                           ` Sudeep Holla
2016-06-14 16:49                                             ` Russell King - ARM Linux
2016-06-15  9:27                                               ` Jon Medhurst (Tixy)
2016-06-15  9:32                                                 ` Sudeep Holla
2016-06-15  9:50                                                   ` Jon Medhurst (Tixy)
2016-06-15  9:59                                                     ` Sudeep Holla
2016-06-15  9:27                                               ` Sudeep Holla
2016-06-14 16:31                                         ` Sudeep Holla
