All of lore.kernel.org
 help / color / mirror / Atom feed
* Cache issues in vexpress cpu shutdown (regression in 3.10)
@ 2013-06-05 11:09 Jon Medhurst (Tixy)
  2013-06-05 11:39 ` Russell King - ARM Linux
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Medhurst (Tixy) @ 2013-06-05 11:09 UTC (permalink / raw)
  To: linux-arm-kernel

I've been investigating why reboot fails on Versatile Express with the
CA9x4 CoreTile and the problem seems to get triggered by commit bca7a5a0
(ARM: cpu hotplug: remove majority of cache flushing from platforms).

Putting back the flush_cache_all() removed by this patch in
mach-vexpress/hotplug.c gets reboot working again. Without that I see
the following during shutdown:

CPU 2 is in _cpu_down called from disable_nonboot_cpus, and is spinning
in the loop:

	while (!idle_cpu(cpu))
		cpu_relax();

cpu == 1 here and idle_cpu() is constantly returning false because
rq->curr != rq->idle and it looks like the runqueue has one process:
that which issued the 'reboot' command.

CPU 1 is spinning in platform_do_lowpower and waiting for pen release to
equal 1 (it's -1). Looks like it got there via the smp_ops.cpu_die(cpu)
call in cpu_die.

CPU 0 and 3 are at wfi in cpu_v7_do_idle

Sometimes I see a different symptoms where it appears that some CPUs
reboot whilst the system still hasn't shut down. (Possibly because it
is returning from cpu_die and jumping to secondary_start_kernel?)

The cache flushing for cpu_die was moved to generic code by the commit
previous to the one mentioned above, i.e. 51acdfd1 (ARM: smp: flush L1
cache in cpu_die()). This added flush_cache_louis to the generic code so
I thought I would see what replacing these with flush_cache_all would
do...

Replacing the first flush_cache_louis in cpu_die with flush_cache_all
allows reboot to happen, but I see

   * Will now restart
  CPU1: cpu didn't die
  CPU2: cpu didn't die
  CPU3: cpu didn't die
  Restarting system.

Speculation: means the complete(&cpu_died) after that cache flush didn't
get seen?

Replacing the second flush_cache_louis instead makes every work fine; as
we would expect as it is equivalent to putting original flush_cache_all
back in the vexpress code.

I'm a bit stumped by all this as I don't see why flush_cache_louis is
apparently insufficient to get changes on one core seen by the other.

-- 
Tixy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 11:09 Cache issues in vexpress cpu shutdown (regression in 3.10) Jon Medhurst (Tixy)
@ 2013-06-05 11:39 ` Russell King - ARM Linux
  2013-06-05 11:50   ` Will Deacon
  2013-06-05 12:05   ` Lorenzo Pieralisi
  0 siblings, 2 replies; 10+ messages in thread
From: Russell King - ARM Linux @ 2013-06-05 11:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 05, 2013 at 12:09:11PM +0100, Jon Medhurst (Tixy) wrote:
> I've been investigating why reboot fails on Versatile Express with the
> CA9x4 CoreTile and the problem seems to get triggered by commit bca7a5a0
> (ARM: cpu hotplug: remove majority of cache flushing from platforms).
> 
> Putting back the flush_cache_all() removed by this patch in
> mach-vexpress/hotplug.c gets reboot working again. Without that I see
> the following during shutdown:
> 
> CPU 2 is in _cpu_down called from disable_nonboot_cpus, and is spinning
> in the loop:
> 
> 	while (!idle_cpu(cpu))
> 		cpu_relax();
> 
> cpu == 1 here and idle_cpu() is constantly returning false because
> rq->curr != rq->idle and it looks like the runqueue has one process:
> that which issued the 'reboot' command.
> 
> CPU 1 is spinning in platform_do_lowpower and waiting for pen release to
> equal 1 (it's -1). Looks like it got there via the smp_ops.cpu_die(cpu)
> call in cpu_die.

This sounds like CPU2 hasn't seen the updates to CPU1 inspite of pushing
the contents of CPU1's cache out to point of unification in the inner
sharable domain (the point where all CPUs should see the same view.)

Are you able to look at what's visible in the caches for both CPUs for
things like rq->curr for CPU 1?

I wonder if - even though we've pushed it out of CPU 1's local cache,
whether there's still something to do with the coherency stuff which
remains incomplete.

Either way, this has significant implications for everyone who uses
flush_cache_louis() in paths where the CPU loses state - it means that
something is wrong with the way data is pushed out of the CPU.

> I'm a bit stumped by all this as I don't see why flush_cache_louis is
> apparently insufficient to get changes on one core seen by the other.

Could it be that flush_cache_louis() doesn't actually do what it claims
to?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 11:39 ` Russell King - ARM Linux
@ 2013-06-05 11:50   ` Will Deacon
  2013-06-05 13:45     ` Jon Medhurst (Tixy)
  2013-06-05 12:05   ` Lorenzo Pieralisi
  1 sibling, 1 reply; 10+ messages in thread
From: Will Deacon @ 2013-06-05 11:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 05, 2013 at 12:39:12PM +0100, Russell King - ARM Linux wrote:
> On Wed, Jun 05, 2013 at 12:09:11PM +0100, Jon Medhurst (Tixy) wrote:
> > I've been investigating why reboot fails on Versatile Express with the
> > CA9x4 CoreTile and the problem seems to get triggered by commit bca7a5a0
> > (ARM: cpu hotplug: remove majority of cache flushing from platforms).
> > 
> > Putting back the flush_cache_all() removed by this patch in
> > mach-vexpress/hotplug.c gets reboot working again. Without that I see
> > the following during shutdown:
> > 
> > CPU 2 is in _cpu_down called from disable_nonboot_cpus, and is spinning
> > in the loop:
> > 
> > 	while (!idle_cpu(cpu))
> > 		cpu_relax();
> > 
> > cpu == 1 here and idle_cpu() is constantly returning false because
> > rq->curr != rq->idle and it looks like the runqueue has one process:
> > that which issued the 'reboot' command.
> > 
> > CPU 1 is spinning in platform_do_lowpower and waiting for pen release to
> > equal 1 (it's -1). Looks like it got there via the smp_ops.cpu_die(cpu)
> > call in cpu_die.
> 
> This sounds like CPU2 hasn't seen the updates to CPU1 inspite of pushing
> the contents of CPU1's cache out to point of unification in the inner
> sharable domain (the point where all CPUs should see the same view.)
> 
> Are you able to look at what's visible in the caches for both CPUs for
> things like rq->curr for CPU 1?
> 
> I wonder if - even though we've pushed it out of CPU 1's local cache,
> whether there's still something to do with the coherency stuff which
> remains incomplete.
> 
> Either way, this has significant implications for everyone who uses
> flush_cache_louis() in paths where the CPU loses state - it means that
> something is wrong with the way data is pushed out of the CPU.
> 
> > I'm a bit stumped by all this as I don't see why flush_cache_louis is
> > apparently insufficient to get changes on one core seen by the other.
> 
> Could it be that flush_cache_louis() doesn't actually do what it claims
> to?

Smells like erratum #643719, which was fixed in r1p0 of A9. The LoUIS value
in the CLIDR reports 0 instead of 1.

Tixy -- if you can confirm the above (i.e. your register has the wrong
value) then we can come up with a workaround. I'm not sure whether any
platforms other than vexpress have such an early A9 revision though...

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 11:39 ` Russell King - ARM Linux
  2013-06-05 11:50   ` Will Deacon
@ 2013-06-05 12:05   ` Lorenzo Pieralisi
  2013-06-05 19:08     ` Russell King - ARM Linux
  1 sibling, 1 reply; 10+ messages in thread
From: Lorenzo Pieralisi @ 2013-06-05 12:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 05, 2013 at 12:39:12PM +0100, Russell King - ARM Linux wrote:
> On Wed, Jun 05, 2013 at 12:09:11PM +0100, Jon Medhurst (Tixy) wrote:
> > I've been investigating why reboot fails on Versatile Express with the
> > CA9x4 CoreTile and the problem seems to get triggered by commit bca7a5a0
> > (ARM: cpu hotplug: remove majority of cache flushing from platforms).
> > 
> > Putting back the flush_cache_all() removed by this patch in
> > mach-vexpress/hotplug.c gets reboot working again. Without that I see
> > the following during shutdown:
> > 
> > CPU 2 is in _cpu_down called from disable_nonboot_cpus, and is spinning
> > in the loop:
> > 
> > 	while (!idle_cpu(cpu))
> > 		cpu_relax();
> > 
> > cpu == 1 here and idle_cpu() is constantly returning false because
> > rq->curr != rq->idle and it looks like the runqueue has one process:
> > that which issued the 'reboot' command.
> > 
> > CPU 1 is spinning in platform_do_lowpower and waiting for pen release to
> > equal 1 (it's -1). Looks like it got there via the smp_ops.cpu_die(cpu)
> > call in cpu_die.
> 
> This sounds like CPU2 hasn't seen the updates to CPU1 inspite of pushing
> the contents of CPU1's cache out to point of unification in the inner
> sharable domain (the point where all CPUs should see the same view.)
> 
> Are you able to look at what's visible in the caches for both CPUs for
> things like rq->curr for CPU 1?
> 
> I wonder if - even though we've pushed it out of CPU 1's local cache,
> whether there's still something to do with the coherency stuff which
> remains incomplete.
> 
> Either way, this has significant implications for everyone who uses
> flush_cache_louis() in paths where the CPU loses state - it means that
> something is wrong with the way data is pushed out of the CPU.
> 
> > I'm a bit stumped by all this as I don't see why flush_cache_louis is
> > apparently insufficient to get changes on one core seen by the other.
> 
> Could it be that flush_cache_louis() doesn't actually do what it claims
> to?

There is an A9 errata (fixed in r1p0) whereby CLIDR[23:21] reads as 0
where it should read as 3'b001, so basically flush_cache_louis is not
flushing anything. If that's the problem, either we add a generic fix
in v7 cache assembly or we just fix it in platform code (by calling
flush_cache_all()), since there should not be many pre-r1p0 around.

Please let me know what you think.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 11:50   ` Will Deacon
@ 2013-06-05 13:45     ` Jon Medhurst (Tixy)
  2013-06-05 13:58       ` Will Deacon
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Medhurst (Tixy) @ 2013-06-05 13:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2013-06-05 at 12:50 +0100, Will Deacon wrote:
> On Wed, Jun 05, 2013 at 12:39:12PM +0100, Russell King - ARM Linux wrote:
> > On Wed, Jun 05, 2013 at 12:09:11PM +0100, Jon Medhurst (Tixy) wrote:
> > > I've been investigating why reboot fails on Versatile Express with the
> > > CA9x4 CoreTile and the problem seems to get triggered by commit bca7a5a0
> > > (ARM: cpu hotplug: remove majority of cache flushing from platforms).
> > > 
> > > Putting back the flush_cache_all() removed by this patch in
> > > mach-vexpress/hotplug.c gets reboot working again. Without that I see
> > > the following during shutdown:
> > > 
> > > CPU 2 is in _cpu_down called from disable_nonboot_cpus, and is spinning
> > > in the loop:
> > > 
> > > 	while (!idle_cpu(cpu))
> > > 		cpu_relax();
> > > 
> > > cpu == 1 here and idle_cpu() is constantly returning false because
> > > rq->curr != rq->idle and it looks like the runqueue has one process:
> > > that which issued the 'reboot' command.
> > > 
> > > CPU 1 is spinning in platform_do_lowpower and waiting for pen release to
> > > equal 1 (it's -1). Looks like it got there via the smp_ops.cpu_die(cpu)
> > > call in cpu_die.
> > 
> > This sounds like CPU2 hasn't seen the updates to CPU1 inspite of pushing
> > the contents of CPU1's cache out to point of unification in the inner
> > sharable domain (the point where all CPUs should see the same view.)
> > 
> > Are you able to look at what's visible in the caches for both CPUs for
> > things like rq->curr for CPU 1?
> > 
> > I wonder if - even though we've pushed it out of CPU 1's local cache,
> > whether there's still something to do with the coherency stuff which
> > remains incomplete.
> > 
> > Either way, this has significant implications for everyone who uses
> > flush_cache_louis() in paths where the CPU loses state - it means that
> > something is wrong with the way data is pushed out of the CPU.
> > 
> > > I'm a bit stumped by all this as I don't see why flush_cache_louis is
> > > apparently insufficient to get changes on one core seen by the other.
> > 
> > Could it be that flush_cache_louis() doesn't actually do what it claims
> > to?
> 
> Smells like erratum #643719, which was fixed in r1p0 of A9. The LoUIS value
> in the CLIDR reports 0 instead of 1.
> 
> Tixy -- if you can confirm the above (i.e. your register has the wrong
> value) then we can come up with a workaround. I'm not sure whether any
> platforms other than vexpress have such an early A9 revision though...

Yes, you have a good nose. The LoUIS value reads zero and hacking
v7_flush_dcache_louis to force it to one makes the problems go away.

So, is the correct fix to add an errata config option to add a check for
an r0p? ARM A9 when LoUIS value reads as zero?

I'm not sure that we should just add flush_cache_all back to the
vexpress cpu_enter_lowpower because flush_cache_louis is used in several
places and could also cause problems there. Though presumably not that
serious problems if we've gone this time without anyone noticing...

-- 
Tixy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 13:45     ` Jon Medhurst (Tixy)
@ 2013-06-05 13:58       ` Will Deacon
  2013-06-05 14:13         ` Pawel Moll
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2013-06-05 13:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 05, 2013 at 02:45:02PM +0100, Jon Medhurst (Tixy) wrote:
> On Wed, 2013-06-05 at 12:50 +0100, Will Deacon wrote:
> > On Wed, Jun 05, 2013 at 12:39:12PM +0100, Russell King - ARM Linux wrote:
> > > Could it be that flush_cache_louis() doesn't actually do what it claims
> > > to?
> > 
> > Smells like erratum #643719, which was fixed in r1p0 of A9. The LoUIS value
> > in the CLIDR reports 0 instead of 1.
> > 
> > Tixy -- if you can confirm the above (i.e. your register has the wrong
> > value) then we can come up with a workaround. I'm not sure whether any
> > platforms other than vexpress have such an early A9 revision though...
> 
> Yes, you have a good nose. The LoUIS value reads zero and hacking
> v7_flush_dcache_louis to force it to one makes the problems go away.

Ok, that confirms it then.

> So, is the correct fix to add an errata config option to add a check for
> an r0p? ARM A9 when LoUIS value reads as zero?

That's the most general solution, yes.

> I'm not sure that we should just add flush_cache_all back to the
> vexpress cpu_enter_lowpower because flush_cache_louis is used in several
> places and could also cause problems there. Though presumably not that
> serious problems if we've gone this time without anyone noticing...

Hence my question earlier on: does any platform other than VE actually have
a pre r1p0 A9 in it? If we don't know, then your solution above sounds
reasonable.

Will

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 13:58       ` Will Deacon
@ 2013-06-05 14:13         ` Pawel Moll
  0 siblings, 0 replies; 10+ messages in thread
From: Pawel Moll @ 2013-06-05 14:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2013-06-05 at 14:58 +0100, Will Deacon wrote:
> Hence my question earlier on: does any platform other than VE actually have
> a pre r1p0 A9 in it? 

Not to my knowledge, no. Even the earliest OMAP4s were r1px (and there
was nothing between r0p1 and r1p0). I'll try to ask the hardware support
people - if anyone knows otherwise, it's them.

Pawe?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 12:05   ` Lorenzo Pieralisi
@ 2013-06-05 19:08     ` Russell King - ARM Linux
  2013-06-06  9:21       ` Catalin Marinas
  0 siblings, 1 reply; 10+ messages in thread
From: Russell King - ARM Linux @ 2013-06-05 19:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 05, 2013 at 01:05:39PM +0100, Lorenzo Pieralisi wrote:
> There is an A9 errata (fixed in r1p0) whereby CLIDR[23:21] reads as 0
> where it should read as 3'b001, so basically flush_cache_louis is not
> flushing anything. If that's the problem, either we add a generic fix
> in v7 cache assembly or we just fix it in platform code (by calling
> flush_cache_all()), since there should not be many pre-r1p0 around.
> 
> Please let me know what you think.

If we're providing a flush_cache_louis() function, then it better do
what it says or be removed.  So the right solution is to fix the
function rather than working around it, because over time we'll only
add more calls to flush_cache_louis() and it'll become a stumbling
block.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-05 19:08     ` Russell King - ARM Linux
@ 2013-06-06  9:21       ` Catalin Marinas
  2013-06-06  9:30         ` Lorenzo Pieralisi
  0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2013-06-06  9:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 05, 2013 at 08:08:53PM +0100, Russell King - ARM Linux wrote:
> On Wed, Jun 05, 2013 at 01:05:39PM +0100, Lorenzo Pieralisi wrote:
> > There is an A9 errata (fixed in r1p0) whereby CLIDR[23:21] reads as 0
> > where it should read as 3'b001, so basically flush_cache_louis is not
> > flushing anything. If that's the problem, either we add a generic fix
> > in v7 cache assembly or we just fix it in platform code (by calling
> > flush_cache_all()), since there should not be many pre-r1p0 around.
> > 
> > Please let me know what you think.
> 
> If we're providing a flush_cache_louis() function, then it better do
> what it says or be removed.  So the right solution is to fix the
> function rather than working around it, because over time we'll only
> add more calls to flush_cache_louis() and it'll become a stumbling
> block.

I think at some point we should define this unification level in the DT,
maybe together with the CPU topology. We use LoUIS because it seems to
be the right thing on all the existing platforms but it's nothing in the
architecture that says LoUIS is the right level of flushing for the
CPU-local caches.

-- 
Catalin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Cache issues in vexpress cpu shutdown (regression in 3.10)
  2013-06-06  9:21       ` Catalin Marinas
@ 2013-06-06  9:30         ` Lorenzo Pieralisi
  0 siblings, 0 replies; 10+ messages in thread
From: Lorenzo Pieralisi @ 2013-06-06  9:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 06, 2013 at 10:21:32AM +0100, Catalin Marinas wrote:
> On Wed, Jun 05, 2013 at 08:08:53PM +0100, Russell King - ARM Linux wrote:
> > On Wed, Jun 05, 2013 at 01:05:39PM +0100, Lorenzo Pieralisi wrote:
> > > There is an A9 errata (fixed in r1p0) whereby CLIDR[23:21] reads as 0
> > > where it should read as 3'b001, so basically flush_cache_louis is not
> > > flushing anything. If that's the problem, either we add a generic fix
> > > in v7 cache assembly or we just fix it in platform code (by calling
> > > flush_cache_all()), since there should not be many pre-r1p0 around.
> > > 
> > > Please let me know what you think.
> > 
> > If we're providing a flush_cache_louis() function, then it better do
> > what it says or be removed.  So the right solution is to fix the
> > function rather than working around it, because over time we'll only
> > add more calls to flush_cache_louis() and it'll become a stumbling
> > block.
> 
> I think at some point we should define this unification level in the DT,
> maybe together with the CPU topology. We use LoUIS because it seems to
> be the right thing on all the existing platforms but it's nothing in the
> architecture that says LoUIS is the right level of flushing for the
> CPU-local caches.

I definitely agree. For the time being, I am willing to patch the code
with an errata fix, waiting for the link between cache levels and
topology (and power domains) to get in the kernel.

Lorenzo

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-06-06  9:30 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-05 11:09 Cache issues in vexpress cpu shutdown (regression in 3.10) Jon Medhurst (Tixy)
2013-06-05 11:39 ` Russell King - ARM Linux
2013-06-05 11:50   ` Will Deacon
2013-06-05 13:45     ` Jon Medhurst (Tixy)
2013-06-05 13:58       ` Will Deacon
2013-06-05 14:13         ` Pawel Moll
2013-06-05 12:05   ` Lorenzo Pieralisi
2013-06-05 19:08     ` Russell King - ARM Linux
2013-06-06  9:21       ` Catalin Marinas
2013-06-06  9:30         ` Lorenzo Pieralisi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.