linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* Problems booting exynos5420 with >1 CPU
@ 2014-06-06  1:08 Doug Anderson
  2014-06-06  4:38 ` Tushar Behera
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Anderson @ 2014-06-06  1:08 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

When I try to boot linuxnext on my exynos5420-peach-pit chromebook I
have problems bringing up extra CPUs:

1. They don't come up

2. As they're coming up you can see U-Boot print out (!).  This sounds
similar to something that was happening during bringup where the other
CPUs didn't realize that they were not the primary CPU and started the
boot process all over again.

Can someone from Samsung take a look, or point at any in-flight
patches that will fix this?


Thanks!  Logs are below...  I'm on linuxnext 20140605 + my patch to
fix earlyprintk.  I'm using
<https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/eclass/cros-kernel/exynos5_defconfig>
but with CONFIG_NR_CPUS=8.


-Doug

---

Uncompressing Linux... done, booting the kernel.
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.15.0-rc8-next-20140605-00001-gdd2e9c6
(dianders at tictac.mtv.corp.google.com) (gcc version 4.8.x-google
20140307 (prerelease) (4.8.2_cos_gg_38c6bf0_4.8.2-r73) ) #96 SMP Thu
Jun 5 18:00:08 PDT 2014
[    0.000000] CPU: ARMv7 Processor [412fc0f3] revision 3 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[    0.000000] Machine model: Google Peach Pit Rev 6+
[    0.000000] bootconsole [earlycon0] enabled
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] On node 0 totalpages: 524288
[    0.000000] free_area_init_node: node 0, pgdat 807a0a40,
node_mem_map ee7f3000
[    0.000000]   Normal zone: 3568 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 456704 pages, LIFO batch:31
[    0.000000]   HighMem zone: 528 pages used for memmap
[    0.000000]   HighMem zone: 67584 pages, LIFO batch:15
[    0.000000] PERCPU: Embedded 9 pages/cpu @ee781000 s13184 r8192 d15488 u36864
[    0.000000] pcpu-alloc: s13184 r8192 d15488 u36864 alloc=9*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 [0] 4 [0] 5 [0] 6 [0] 7
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.
Total pages: 520720
[    0.000000] Kernel command line: cros_legacy cros_debug
console=ttySAC3,115200 debug slub_debug=FZPUA kgdboc=ttySAC3
earlyprintk root=/dev/mmcblk0p3 rootwait ro
[    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Dentry cache hash table entries: 262144 (order: 8, 1048576 bytes)
[    0.000000] Inode-cache hash table entries: 131072 (order: 7, 524288 bytes)
[    0.000000] Memory: 2069908K/2097152K available (5445K kernel code,
357K rwdata, 1544K rodata, 444K init, 895K bss, 27244K reserved,
270336K highmem)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xffe00000   (2048 kB)
[    0.000000]     vmalloc : 0xf0000000 - 0xff000000   ( 240 MB)
[    0.000000]     lowmem  : 0x80000000 - 0xef800000   (1784 MB)
[    0.000000]     pkmap   : 0x7fe00000 - 0x80000000   (   2 MB)
[    0.000000]     modules : 0x7f000000 - 0x7fe00000   (  14 MB)
[    0.000000]       .text : 0x80008000 - 0x806db78c   (6990 kB)
[    0.000000]       .init : 0x806dc000 - 0x8074b380   ( 445 kB)
[    0.000000]       .data : 0x8074c000 - 0x807a55a8   ( 358 kB)
[    0.000000]        .bss : 0x807a55a8 - 0x8088526c   ( 896 kB)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[    0.000000] Hierarchical RCU implementation.
[    0.000000] NR_IRQS:16 nr_irqs:16 16
[    0.000000] L2C: failed to init: -19
[    0.000005] sched_clock: 64 bits at 24MHz, resolution 41ns, wraps
every 2863311519744ns
[    0.008328] Console: colour dummy device 80x30
[    0.012698] Calibrating delay loop... 3581.54 BogoMIPS (lpj=8953856)
[    0.047723] pid_max: default: 32768 minimum: 301
[    0.052945] Security Framework initialized
[    0.057701] Mount-cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.064325] Mountpoint-cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.074789] CPU: Testing write buffer coherency: ok
[    0.079570] ftrace: allocating 19911 entries in 59 pages
[    0.104146] CPU0: update cpu_power 1535
[    0.107910] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.113925] Setting up static identity map for 0x204e2908 - 0x204e2960
[    1.125030] CPU1: failed to boot: -38
[    2.130043] CPU2: failed to boot: -38
[    3.135059] CPU3: failed to boot: -38


U-Boot 2013.04 (Apr 12 2014 - 13:50:09) for Peach

CPU:    Exynos5420 at 1800MHz

Board: Google Peach Pit, rev 9.0
I2C:   ready
DRAM:  2 GiB
[    4.140071] CPU4: failed to boot: -38
[    5.145086] CPU5: failed to boot: -38
[    6.150106] CPU6: failed to boot: -38
[    7.155127] CPU7: failed to boot: -38
[    7.158721] Brought up 1 CPUs
[    7.161746] SMP: Total of 1 processors activated.
[    7.166516] CPU: All CPU(s) started in SVC mode.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06  1:08 Problems booting exynos5420 with >1 CPU Doug Anderson
@ 2014-06-06  4:38 ` Tushar Behera
  2014-06-06 17:17   ` Doug Anderson
  0 siblings, 1 reply; 51+ messages in thread
From: Tushar Behera @ 2014-06-06  4:38 UTC (permalink / raw)
  To: linux-arm-kernel

On 06/06/2014 06:38 AM, Doug Anderson wrote:
> Hi,
> 
> When I try to boot linuxnext on my exynos5420-peach-pit chromebook I
> have problems bringing up extra CPUs:
> 
> 1. They don't come up
> 

[ ... ]

> [    1.125030] CPU1: failed to boot: -38
> [    2.130043] CPU2: failed to boot: -38
> [    3.135059] CPU3: failed to boot: -38
> 

With following config modifications over exynos_defconfig, I can see 4
big cores coming up.

CONFIG_MCPM=y
CONFIG_EXYNOS5420_MCPM=y

[    0.030919] Exynos MCPM support installed
[    0.060233] CPU1: Booted secondary processor
[    0.090006] CPU1: update cpu_power 1535
[    0.090010] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.100234] CPU2: Booted secondary processor
[    0.130007] CPU2: update cpu_power 1535
[    0.130011] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    0.140233] CPU3: Booted secondary processor
[    0.170008] CPU3: update cpu_power 1535
[    0.170012] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003


-- 
Tushar Behera

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06  4:38 ` Tushar Behera
@ 2014-06-06 17:17   ` Doug Anderson
  2014-06-06 17:36     ` Abhilash Kesavan
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 17:17 UTC (permalink / raw)
  To: linux-arm-kernel

Tushar,

On Thu, Jun 5, 2014 at 9:38 PM, Tushar Behera <trblinux@gmail.com> wrote:
> On 06/06/2014 06:38 AM, Doug Anderson wrote:
>> Hi,
>>
>> When I try to boot linuxnext on my exynos5420-peach-pit chromebook I
>> have problems bringing up extra CPUs:
>>
>> 1. They don't come up
>>
>
> [ ... ]
>
>> [    1.125030] CPU1: failed to boot: -38
>> [    2.130043] CPU2: failed to boot: -38
>> [    3.135059] CPU3: failed to boot: -38
>>
>
> With following config modifications over exynos_defconfig, I can see 4
> big cores coming up.
>
> CONFIG_MCPM=y
> CONFIG_EXYNOS5420_MCPM=y

Thanks!  OK, that certainly changes the behavior for me!  ...but it
doesn't work.

I also got a response from some folks that I pinged off-list.
Apparently our U-Boot doesn't set things up like the upstream kernel
expects (ugh!), so to get things booting you need the magic commands
in U-Boot:

mw.l 02073028 0  # Clear EXYNOS_START_FLAG
mw.l 10d25000 3  # Enable CCI from U-Boot

Once you do that then all 8 CPUs can come up!


Can someone at Samsung come up with a reasonable solution about how we
should solve this in a way that everyone can be happy with?  I haven't
really been involved with MCPM stuff, so I'm not sure why what landed
upstream is different than what we have in our kernel.

I guess worst case we can require that anyone running an upstream
kernel simply needs a "fixed" U-Boot, but that's always a bit of a
hassle.  We might have to go there eventually anyway given that our L2
cache timings in U-Boot are wrong (again!) and it's unlikely that
upstream will take some weird hack in the kernel to fix them...

-Doug


P.S. We also really need to do something about the configs.  There are
no exynos configs upstream that have any useful set of options (that I
know of).  ...and useful options don't seem to get selected
automatically...

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 17:17   ` Doug Anderson
@ 2014-06-06 17:36     ` Abhilash Kesavan
  2014-06-06 18:02       ` Doug Anderson
  2014-06-06 21:48       ` Nicolas Pitre
  0 siblings, 2 replies; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 17:36 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

The first change in the kernel (clearing an iRAM location) is needed
because of an unnecessary change that we are carrying in the Chrome
U-boot. There is no reason for us to have the workaround in the
mainline kernel. Rather, we should remove the check from our u-boot.
However AFAIR a clean-up patch that I had posted internally was not
accepted as we had frozen the SPL at the time.

The second change is to enable snoops for boot cluster. Internally, we
were disabling the snoops for both the clusters at power off and
enabling it in power_up_setup and power_up. However, I dropped the
approach due to problems pointed out by Nicolas here
http://www.spinics.net/lists/arm-kernel/msg324091.html related to
cpuidle. Hence, we turn it on at the u-boot.

Regards,
Abhilash

On Fri, Jun 6, 2014 at 10:47 PM, Doug Anderson <dianders@google.com> wrote:
> Tushar,
>
> On Thu, Jun 5, 2014 at 9:38 PM, Tushar Behera <trblinux@gmail.com> wrote:
>> On 06/06/2014 06:38 AM, Doug Anderson wrote:
>>> Hi,
>>>
>>> When I try to boot linuxnext on my exynos5420-peach-pit chromebook I
>>> have problems bringing up extra CPUs:
>>>
>>> 1. They don't come up
>>>
>>
>> [ ... ]
>>
>>> [    1.125030] CPU1: failed to boot: -38
>>> [    2.130043] CPU2: failed to boot: -38
>>> [    3.135059] CPU3: failed to boot: -38
>>>
>>
>> With following config modifications over exynos_defconfig, I can see 4
>> big cores coming up.
>>
>> CONFIG_MCPM=y
>> CONFIG_EXYNOS5420_MCPM=y
>
> Thanks!  OK, that certainly changes the behavior for me!  ...but it
> doesn't work.
>
> I also got a response from some folks that I pinged off-list.
> Apparently our U-Boot doesn't set things up like the upstream kernel
> expects (ugh!), so to get things booting you need the magic commands
> in U-Boot:
>
> mw.l 02073028 0  # Clear EXYNOS_START_FLAG
> mw.l 10d25000 3  # Enable CCI from U-Boot
>
> Once you do that then all 8 CPUs can come up!
>
>
> Can someone at Samsung come up with a reasonable solution about how we
> should solve this in a way that everyone can be happy with?  I haven't
> really been involved with MCPM stuff, so I'm not sure why what landed
> upstream is different than what we have in our kernel.
>
> I guess worst case we can require that anyone running an upstream
> kernel simply needs a "fixed" U-Boot, but that's always a bit of a
> hassle.  We might have to go there eventually anyway given that our L2
> cache timings in U-Boot are wrong (again!) and it's unlikely that
> upstream will take some weird hack in the kernel to fix them...
>
> -Doug
>
>
> P.S. We also really need to do something about the configs.  There are
> no exynos configs upstream that have any useful set of options (that I
> know of).  ...and useful options don't seem to get selected
> automatically...
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 17:36     ` Abhilash Kesavan
@ 2014-06-06 18:02       ` Doug Anderson
  2014-06-06 18:12         ` Abhilash Kesavan
  2014-06-06 21:48       ` Nicolas Pitre
  1 sibling, 1 reply; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 18:02 UTC (permalink / raw)
  To: linux-arm-kernel

Abhilash,

On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
<kesavan.abhilash@gmail.com> wrote:
> Hi Doug,
>
> The first change in the kernel (clearing an iRAM location) is needed
> because of an unnecessary change that we are carrying in the Chrome
> U-boot. There is no reason for us to have the workaround in the
> mainline kernel. Rather, we should remove the check from our u-boot.
> However AFAIR a clean-up patch that I had posted internally was not
> accepted as we had frozen the SPL at the time.

Ah, is that this one, or a different one?

https://chromium-review.googlesource.com/#/c/66049/


If we land that patch now it won't help since nobody is going to be
updating their read-only firmware.  We'll need to put code somewhere
that fixes it.


> The second change is to enable snoops for boot cluster. Internally, we
> were disabling the snoops for both the clusters at power off and
> enabling it in power_up_setup and power_up. However, I dropped the
> approach due to problems pointed out by Nicolas here
> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
> cpuidle. Hence, we turn it on at the u-boot.

I think I followed all that.  What you're saying is that our kernel
dynamically enables and disables snoops as needed, but Nicolas pointed
out that it was unsafe (though apparently we're not seeing problems in
our usage).

...so now the kernel doesn't touch the snoops and assumes that U-Boot
turned them on.


Ugh.

-Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 18:02       ` Doug Anderson
@ 2014-06-06 18:12         ` Abhilash Kesavan
  2014-06-06 18:20           ` Doug Anderson
  0 siblings, 1 reply; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 18:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
> Abhilash,
>
> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
> <kesavan.abhilash@gmail.com> wrote:
>> Hi Doug,
>>
>> The first change in the kernel (clearing an iRAM location) is needed
>> because of an unnecessary change that we are carrying in the Chrome
>> U-boot. There is no reason for us to have the workaround in the
>> mainline kernel. Rather, we should remove the check from our u-boot.
>> However AFAIR a clean-up patch that I had posted internally was not
>> accepted as we had frozen the SPL at the time.
>
> Ah, is that this one, or a different one?
>
> https://chromium-review.googlesource.com/#/c/66049/
Yes, this along with a kernel side change.
>
>
> If we land that patch now it won't help since nobody is going to be
> updating their read-only firmware.  We'll need to put code somewhere
> that fixes it.
We just carry the workaround fix locally until we migrate to mainline
u-boot for 5420 where the unnecessay check will not be present.
>
>
>> The second change is to enable snoops for boot cluster. Internally, we
>> were disabling the snoops for both the clusters at power off and
>> enabling it in power_up_setup and power_up. However, I dropped the
>> approach due to problems pointed out by Nicolas here
>> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
>> cpuidle. Hence, we turn it on at the u-boot.
>
> I think I followed all that.  What you're saying is that our kernel
> dynamically enables and disables snoops as needed, but Nicolas pointed
> out that it was unsafe (though apparently we're not seeing problems in
> our usage).
We did not see any problems as CPUIdle was one of the problematic
scenarios which we have not got enabled.
>
> ...so now the kernel doesn't touch the snoops and assumes that U-Boot
> turned them on.
The kernel does turn on the snoops for the incoming cluster.
>
>
> Ugh.


Regards,
Abhilash
>
> -Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 18:12         ` Abhilash Kesavan
@ 2014-06-06 18:20           ` Doug Anderson
  2014-06-06 18:31             ` Abhilash Kesavan
  2014-06-06 20:37             ` Olof Johansson
  0 siblings, 2 replies; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 18:20 UTC (permalink / raw)
  To: linux-arm-kernel

Abhilash,



On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
<kesavan.abhilash@gmail.com> wrote:
> Hi Doug,
>
> On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>> Abhilash,
>>
>> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>> <kesavan.abhilash@gmail.com> wrote:
>>> Hi Doug,
>>>
>>> The first change in the kernel (clearing an iRAM location) is needed
>>> because of an unnecessary change that we are carrying in the Chrome
>>> U-boot. There is no reason for us to have the workaround in the
>>> mainline kernel. Rather, we should remove the check from our u-boot.
>>> However AFAIR a clean-up patch that I had posted internally was not
>>> accepted as we had frozen the SPL at the time.
>>
>> Ah, is that this one, or a different one?
>>
>> https://chromium-review.googlesource.com/#/c/66049/
> Yes, this along with a kernel side change.

Can we safely take this one without the kernel-side one?


>> If we land that patch now it won't help since nobody is going to be
>> updating their read-only firmware.  We'll need to put code somewhere
>> that fixes it.
> We just carry the workaround fix locally until we migrate to mainline
> u-boot for 5420 where the unnecessay check will not be present.

I think there are people out there who want to run a mainline kernel
on existing Chromebook 2 hardware and don't want to rewrite their RO
firmware.  We need a solution for those people.


>>> The second change is to enable snoops for boot cluster. Internally, we
>>> were disabling the snoops for both the clusters at power off and
>>> enabling it in power_up_setup and power_up. However, I dropped the
>>> approach due to problems pointed out by Nicolas here
>>> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
>>> cpuidle. Hence, we turn it on at the u-boot.
>>
>> I think I followed all that.  What you're saying is that our kernel
>> dynamically enables and disables snoops as needed, but Nicolas pointed
>> out that it was unsafe (though apparently we're not seeing problems in
>> our usage).
> We did not see any problems as CPUIdle was one of the problematic
> scenarios which we have not got enabled.

Ah, makes sense!



I'm still trying to figure out all of this code, but we'll also need
to make sure whatever solution we come up with handles suspend/resume
properly.  I know SRAM is lost across suspend/resume so someone
(either the SPL from read-only memory or the kernel) must be
recreating the SRAM structures after S2R...

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 18:20           ` Doug Anderson
@ 2014-06-06 18:31             ` Abhilash Kesavan
  2014-06-06 18:56               ` Doug Anderson
  2014-06-06 20:37             ` Olof Johansson
  1 sibling, 1 reply; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 18:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

On Fri, Jun 6, 2014 at 11:50 PM, Doug Anderson <dianders@google.com> wrote:
> Abhilash,
>
>
>
> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
> <kesavan.abhilash@gmail.com> wrote:
>> Hi Doug,
>>
>> On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>>> Abhilash,
>>>
>>> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>>> <kesavan.abhilash@gmail.com> wrote:
>>>> Hi Doug,
>>>>
>>>> The first change in the kernel (clearing an iRAM location) is needed
>>>> because of an unnecessary change that we are carrying in the Chrome
>>>> U-boot. There is no reason for us to have the workaround in the
>>>> mainline kernel. Rather, we should remove the check from our u-boot.
>>>> However AFAIR a clean-up patch that I had posted internally was not
>>>> accepted as we had frozen the SPL at the time.
>>>
>>> Ah, is that this one, or a different one?
>>>
>>> https://chromium-review.googlesource.com/#/c/66049/
>> Yes, this along with a kernel side change.
>
> Can we safely take this one without the kernel-side one?
Yes, just the u-boot change should suffice.
>
>
>>> If we land that patch now it won't help since nobody is going to be
>>> updating their read-only firmware.  We'll need to put code somewhere
>>> that fixes it.
>> We just carry the workaround fix locally until we migrate to mainline
>> u-boot for 5420 where the unnecessay check will not be present.
>
> I think there are people out there who want to run a mainline kernel
> on existing Chromebook 2 hardware and don't want to rewrite their RO
> firmware.  We need a solution for those people.
Yes, I see your point. But, do you think someone who has changed the
existing fused kernel on the device to a mainline one would be averse
to applying a couple of small work-around changes as well ? Their
finding this thread and the proposed "magic" fixes may be difficult
but not the actual application I think.

How about having a page similar to
"http://www.chromium.org/chromium-os/how-tos-and-troubleshooting/using-an-upstream-kernel-on-snow"
for Peach ? We could have the work-arounds listed there.
>
>
>>>> The second change is to enable snoops for boot cluster. Internally, we
>>>> were disabling the snoops for both the clusters at power off and
>>>> enabling it in power_up_setup and power_up. However, I dropped the
>>>> approach due to problems pointed out by Nicolas here
>>>> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
>>>> cpuidle. Hence, we turn it on at the u-boot.
>>>
>>> I think I followed all that.  What you're saying is that our kernel
>>> dynamically enables and disables snoops as needed, but Nicolas pointed
>>> out that it was unsafe (though apparently we're not seeing problems in
>>> our usage).
>> We did not see any problems as CPUIdle was one of the problematic
>> scenarios which we have not got enabled.
>
> Ah, makes sense!
>
>
>
> I'm still trying to figure out all of this code, but we'll also need
> to make sure whatever solution we come up with handles suspend/resume
> properly.  I know SRAM is lost across suspend/resume so someone
> (either the SPL from read-only memory or the kernel) must be
> recreating the SRAM structures after S2R...

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 18:31             ` Abhilash Kesavan
@ 2014-06-06 18:56               ` Doug Anderson
  2014-06-06 19:09                 ` Abhilash Kesavan
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 18:56 UTC (permalink / raw)
  To: linux-arm-kernel

Abhilash,

On Fri, Jun 6, 2014 at 11:31 AM, Abhilash Kesavan
<kesavan.abhilash@gmail.com> wrote:
> Hi Doug,
>
> On Fri, Jun 6, 2014 at 11:50 PM, Doug Anderson <dianders@google.com> wrote:
>> Abhilash,
>>
>>
>>
>> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
>> <kesavan.abhilash@gmail.com> wrote:
>>> Hi Doug,
>>>
>>> On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>>>> Abhilash,
>>>>
>>>> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>>>> <kesavan.abhilash@gmail.com> wrote:
>>>>> Hi Doug,
>>>>>
>>>>> The first change in the kernel (clearing an iRAM location) is needed
>>>>> because of an unnecessary change that we are carrying in the Chrome
>>>>> U-boot. There is no reason for us to have the workaround in the
>>>>> mainline kernel. Rather, we should remove the check from our u-boot.
>>>>> However AFAIR a clean-up patch that I had posted internally was not
>>>>> accepted as we had frozen the SPL at the time.
>>>>
>>>> Ah, is that this one, or a different one?
>>>>
>>>> https://chromium-review.googlesource.com/#/c/66049/
>>> Yes, this along with a kernel side change.
>>
>> Can we safely take this one without the kernel-side one?
> Yes, just the u-boot change should suffice.
>>
>>
>>>> If we land that patch now it won't help since nobody is going to be
>>>> updating their read-only firmware.  We'll need to put code somewhere
>>>> that fixes it.
>>> We just carry the workaround fix locally until we migrate to mainline
>>> u-boot for 5420 where the unnecessay check will not be present.
>>
>> I think there are people out there who want to run a mainline kernel
>> on existing Chromebook 2 hardware and don't want to rewrite their RO
>> firmware.  We need a solution for those people.
> Yes, I see your point. But, do you think someone who has changed the
> existing fused kernel on the device to a mainline one would be averse
> to applying a couple of small work-around changes as well ? Their
> finding this thread and the proposed "magic" fixes may be difficult
> but not the actual application I think.
>
> How about having a page similar to
> "http://www.chromium.org/chromium-os/how-tos-and-troubleshooting/using-an-upstream-kernel-on-snow"
> for Peach ? We could have the work-arounds listed there.

We can (though the fewer weird things we have the better), but we
definitely need to provide workarounds that don't require people to
change their RO firmware.

Thinking all that through, I think the answer is that we want to
abandon the U-Boot change above
<https://chromium-review.googlesource.com/#/c/66049/>.  At this point
we're never going to take it at this point and there's no possible way
to do it in an RW firmware update (right?) since any workaround would
be overwritten by the SPL at resume time.

So the proper "fix" for the "mw.l 02073028 0" is a kernel fix.  ...and
if upstream doesn't want land it because it's impure then we'll just
have to list it on our "apply these hacks to your kernel".  You sent
me this as a kernel fix before and now I think I understand why (to
handle the s2r case).  Can you please post this up to the mailing
list?  Please make sure it will handle the suspend/resume case
whenever suspend/resume starts working (I haven't tested but I
remember hearing that it doesn't work).

Note: I think there are actually two possible kernel fixes (right?):
1. Have the kernel itself (re)write the code at 0x02073000 at bootup /
resume time.  I don't know of any reason why this should need to be in
U-Boot.  Maybe upstream would take this?

2. Have the kernel clear this flag to work with existing Chromebook 2 firmware.

--

The proper "fix" for the "mw.l 10d25000 3" is a U-Boot fix.  Let me
see if I can put together something that will handle this in RW
U-Boot.  Even if we don't officially ship this RW U-Boot we can always
point people at the binaries.


Does that make sense?

-Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 18:56               ` Doug Anderson
@ 2014-06-06 19:09                 ` Abhilash Kesavan
  2014-06-06 19:12                   ` Abhilash Kesavan
  2014-06-06 20:49                   ` Doug Anderson
  0 siblings, 2 replies; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 19:09 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

On Sat, Jun 7, 2014 at 12:26 AM, Doug Anderson <dianders@google.com> wrote:
> Abhilash,
>
> On Fri, Jun 6, 2014 at 11:31 AM, Abhilash Kesavan
> <kesavan.abhilash@gmail.com> wrote:
>> Hi Doug,
>>
>> On Fri, Jun 6, 2014 at 11:50 PM, Doug Anderson <dianders@google.com> wrote:
>>> Abhilash,
>>>
>>>
>>>
>>> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
>>> <kesavan.abhilash@gmail.com> wrote:
>>>> Hi Doug,
>>>>
>>>> On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>>>>> Abhilash,
>>>>>
>>>>> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>>>>> <kesavan.abhilash@gmail.com> wrote:
>>>>>> Hi Doug,
>>>>>>
>>>>>> The first change in the kernel (clearing an iRAM location) is needed
>>>>>> because of an unnecessary change that we are carrying in the Chrome
>>>>>> U-boot. There is no reason for us to have the workaround in the
>>>>>> mainline kernel. Rather, we should remove the check from our u-boot.
>>>>>> However AFAIR a clean-up patch that I had posted internally was not
>>>>>> accepted as we had frozen the SPL at the time.
>>>>>
>>>>> Ah, is that this one, or a different one?
>>>>>
>>>>> https://chromium-review.googlesource.com/#/c/66049/
>>>> Yes, this along with a kernel side change.
>>>
>>> Can we safely take this one without the kernel-side one?
>> Yes, just the u-boot change should suffice.
>>>
>>>
>>>>> If we land that patch now it won't help since nobody is going to be
>>>>> updating their read-only firmware.  We'll need to put code somewhere
>>>>> that fixes it.
>>>> We just carry the workaround fix locally until we migrate to mainline
>>>> u-boot for 5420 where the unnecessay check will not be present.
>>>
>>> I think there are people out there who want to run a mainline kernel
>>> on existing Chromebook 2 hardware and don't want to rewrite their RO
>>> firmware.  We need a solution for those people.
>> Yes, I see your point. But, do you think someone who has changed the
>> existing fused kernel on the device to a mainline one would be averse
>> to applying a couple of small work-around changes as well ? Their
>> finding this thread and the proposed "magic" fixes may be difficult
>> but not the actual application I think.
>>
>> How about having a page similar to
>> "http://www.chromium.org/chromium-os/how-tos-and-troubleshooting/using-an-upstream-kernel-on-snow"
>> for Peach ? We could have the work-arounds listed there.
>
> We can (though the fewer weird things we have the better), but we
> definitely need to provide workarounds that don't require people to
> change their RO firmware.
I do not quite agree with this. They do not need to change their RO
firmware, just modify their boot commands.
>
> Thinking all that through, I think the answer is that we want to
> abandon the U-Boot change above
> <https://chromium-review.googlesource.com/#/c/66049/>.  At this point
> we're never going to take it at this point and there's no possible way
> to do it in an RW firmware update (right?) since any workaround would
> be overwritten by the SPL at resume time.
Sure, will abandon.
>
> So the proper "fix" for the "mw.l 02073028 0" is a kernel fix.  ...and
I believe there is no "proper" fix for incorrect existing code in
non-mainline u-boot.

> if upstream doesn't want land it because it's impure then we'll just
> have to list it on our "apply these hacks to your kernel".  You sent
> me this as a kernel fix before and now I think I understand why (to
> handle the s2r case).  Can you please post this up to the mailing
> list?  Please make sure it will handle the suspend/resume case
> whenever suspend/resume starts working (I haven't tested but I
> remember hearing that it doesn't work).

Are you talking about the re-writing the mcpm entry point address post-resume ?
>
> Note: I think there are actually two possible kernel fixes (right?):
> 1. Have the kernel itself (re)write the code at 0x02073000 at bootup /
> resume time.  I don't know of any reason why this should need to be in
> U-Boot.  Maybe upstream would take this?

>
> 2. Have the kernel clear this flag to work with existing Chromebook 2 firmware.
>
> --
>
> The proper "fix" for the "mw.l 10d25000 3" is a U-Boot fix.  Let me
> see if I can put together something that will handle this in RW
> U-Boot.  Even if we don't officially ship this RW U-Boot we can always
> point people at the binaries.
>
>
> Does that make sense?
>
> -Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 19:09                 ` Abhilash Kesavan
@ 2014-06-06 19:12                   ` Abhilash Kesavan
  2014-06-06 20:49                   ` Doug Anderson
  1 sibling, 0 replies; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 19:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 7, 2014 at 12:39 AM, Abhilash Kesavan
<kesavan.abhilash@gmail.com> wrote:
> Hi Doug,
>
> On Sat, Jun 7, 2014 at 12:26 AM, Doug Anderson <dianders@google.com> wrote:
>> Abhilash,
>>
>> On Fri, Jun 6, 2014 at 11:31 AM, Abhilash Kesavan
>> <kesavan.abhilash@gmail.com> wrote:
>>> Hi Doug,
>>>
>>> On Fri, Jun 6, 2014 at 11:50 PM, Doug Anderson <dianders@google.com> wrote:
>>>> Abhilash,
>>>>
>>>>
>>>>
>>>> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
>>>> <kesavan.abhilash@gmail.com> wrote:
>>>>> Hi Doug,
>>>>>
>>>>> On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>>>>>> Abhilash,
>>>>>>
>>>>>> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>>>>>> <kesavan.abhilash@gmail.com> wrote:
>>>>>>> Hi Doug,
>>>>>>>
>>>>>>> The first change in the kernel (clearing an iRAM location) is needed
>>>>>>> because of an unnecessary change that we are carrying in the Chrome
>>>>>>> U-boot. There is no reason for us to have the workaround in the
>>>>>>> mainline kernel. Rather, we should remove the check from our u-boot.
>>>>>>> However AFAIR a clean-up patch that I had posted internally was not
>>>>>>> accepted as we had frozen the SPL at the time.
>>>>>>
>>>>>> Ah, is that this one, or a different one?
>>>>>>
>>>>>> https://chromium-review.googlesource.com/#/c/66049/
>>>>> Yes, this along with a kernel side change.
>>>>
>>>> Can we safely take this one without the kernel-side one?
>>> Yes, just the u-boot change should suffice.
>>>>
>>>>
>>>>>> If we land that patch now it won't help since nobody is going to be
>>>>>> updating their read-only firmware.  We'll need to put code somewhere
>>>>>> that fixes it.
>>>>> We just carry the workaround fix locally until we migrate to mainline
>>>>> u-boot for 5420 where the unnecessay check will not be present.
>>>>
>>>> I think there are people out there who want to run a mainline kernel
>>>> on existing Chromebook 2 hardware and don't want to rewrite their RO
>>>> firmware.  We need a solution for those people.
>>> Yes, I see your point. But, do you think someone who has changed the
>>> existing fused kernel on the device to a mainline one would be averse
>>> to applying a couple of small work-around changes as well ? Their
>>> finding this thread and the proposed "magic" fixes may be difficult
>>> but not the actual application I think.
>>>
>>> How about having a page similar to
>>> "http://www.chromium.org/chromium-os/how-tos-and-troubleshooting/using-an-upstream-kernel-on-snow"
>>> for Peach ? We could have the work-arounds listed there.
>>
>> We can (though the fewer weird things we have the better), but we
>> definitely need to provide workarounds that don't require people to
>> change their RO firmware.
> I do not quite agree with this. They do not need to change their RO
> firmware, just modify their boot commands.
>>
>> Thinking all that through, I think the answer is that we want to
>> abandon the U-Boot change above
>> <https://chromium-review.googlesource.com/#/c/66049/>.  At this point
>> we're never going to take it at this point and there's no possible way
>> to do it in an RW firmware update (right?) since any workaround would
>> be overwritten by the SPL at resume time.
> Sure, will abandon.
>>
>> So the proper "fix" for the "mw.l 02073028 0" is a kernel fix.  ...and
> I believe there is no "proper" fix for incorrect existing code in
> non-mainline u-boot.
>
>> if upstream doesn't want land it because it's impure then we'll just
>> have to list it on our "apply these hacks to your kernel".  You sent
>> me this as a kernel fix before and now I think I understand why (to
>> handle the s2r case).  Can you please post this up to the mailing
>> list?  Please make sure it will handle the suspend/resume case
>> whenever suspend/resume starts working (I haven't tested but I
>> remember hearing that it doesn't work).
>
> Are you talking about the re-writing the mcpm entry point address post-resume ?
S2R does not work right now, but patches have been posted for the
same. I did test the MCPM patches across an S2R cycle and it worked
fine. Implying the SRAM was being retained. I guess that would be
because the regulator support has not been added/enabled for Peach.
>>
>> Note: I think there are actually two possible kernel fixes (right?):
>> 1. Have the kernel itself (re)write the code at 0x02073000 at bootup /
>> resume time.  I don't know of any reason why this should need to be in
>> U-Boot.  Maybe upstream would take this?
>
>>
>> 2. Have the kernel clear this flag to work with existing Chromebook 2 firmware.
>>
>> --
>>
>> The proper "fix" for the "mw.l 10d25000 3" is a U-Boot fix.  Let me
>> see if I can put together something that will handle this in RW
>> U-Boot.  Even if we don't officially ship this RW U-Boot we can always
>> point people at the binaries.
>>
>>
>> Does that make sense?
>>
>> -Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 18:20           ` Doug Anderson
  2014-06-06 18:31             ` Abhilash Kesavan
@ 2014-06-06 20:37             ` Olof Johansson
  2014-06-06 20:46               ` Abhilash Kesavan
  1 sibling, 1 reply; 51+ messages in thread
From: Olof Johansson @ 2014-06-06 20:37 UTC (permalink / raw)
  To: linux-arm-kernel

[Adding Nico since he was involved in the original reviews]

Hi,

On Fri, Jun 06, 2014 at 11:20:56AM -0700, Doug Anderson wrote:
> Abhilash,
> 
> 
> 
> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
> <kesavan.abhilash@gmail.com> wrote:
> > Hi Doug,
> >
> > On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
> >> Abhilash,
> >>
> >> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
> >> <kesavan.abhilash@gmail.com> wrote:
> >>> Hi Doug,
> >>>
> >>> The first change in the kernel (clearing an iRAM location) is needed
> >>> because of an unnecessary change that we are carrying in the Chrome
> >>> U-boot. There is no reason for us to have the workaround in the
> >>> mainline kernel. Rather, we should remove the check from our u-boot.
> >>> However AFAIR a clean-up patch that I had posted internally was not
> >>> accepted as we had frozen the SPL at the time.
> >>
> >> Ah, is that this one, or a different one?
> >>
> >> https://chromium-review.googlesource.com/#/c/66049/
> > Yes, this along with a kernel side change.
> 
> Can we safely take this one without the kernel-side one?
> 
> 
> >> If we land that patch now it won't help since nobody is going to be
> >> updating their read-only firmware.  We'll need to put code somewhere
> >> that fixes it.
> > We just carry the workaround fix locally until we migrate to mainline
> > u-boot for 5420 where the unnecessay check will not be present.
> 
> I think there are people out there who want to run a mainline kernel
> on existing Chromebook 2 hardware and don't want to rewrite their RO
> firmware.  We need a solution for those people.

Agree. The answer to this is most definitely _not_ "install mainline u-boot".
The upstream kernel needs to be able to boot with the firmware that was shipped
on the device.

In this case it shouldn't be controversial to add this. What we need is
a one-time boot-time setup, not runtime so cpuidle shouldn't be a factor
at that time. The earlier reservations were about runtime changes and this is
quite different.

> >>> The second change is to enable snoops for boot cluster. Internally, we
> >>> were disabling the snoops for both the clusters at power off and
> >>> enabling it in power_up_setup and power_up. However, I dropped the
> >>> approach due to problems pointed out by Nicolas here
> >>> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
> >>> cpuidle. Hence, we turn it on at the u-boot.
> >>
> >> I think I followed all that.  What you're saying is that our kernel
> >> dynamically enables and disables snoops as needed, but Nicolas pointed
> >> out that it was unsafe (though apparently we're not seeing problems in
> >> our usage).
> > We did not see any problems as CPUIdle was one of the problematic
> > scenarios which we have not got enabled.
> 
> Ah, makes sense!
> 
> 
> 
> I'm still trying to figure out all of this code, but we'll also need
> to make sure whatever solution we come up with handles suspend/resume
> properly.  I know SRAM is lost across suspend/resume so someone
> (either the SPL from read-only memory or the kernel) must be
> recreating the SRAM structures after S2R...


-Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 20:37             ` Olof Johansson
@ 2014-06-06 20:46               ` Abhilash Kesavan
  2014-06-06 21:01                 ` Olof Johansson
  0 siblings, 1 reply; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 20:46 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Olof,

On Sat, Jun 7, 2014 at 2:07 AM, Olof Johansson <olof@lixom.net> wrote:
> [Adding Nico since he was involved in the original reviews]
>
> Hi,
>
> On Fri, Jun 06, 2014 at 11:20:56AM -0700, Doug Anderson wrote:
>> Abhilash,
>>
>>
>>
>> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
>> <kesavan.abhilash@gmail.com> wrote:
>> > Hi Doug,
>> >
>> > On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>> >> Abhilash,
>> >>
>> >> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>> >> <kesavan.abhilash@gmail.com> wrote:
>> >>> Hi Doug,
>> >>>
>> >>> The first change in the kernel (clearing an iRAM location) is needed
>> >>> because of an unnecessary change that we are carrying in the Chrome
>> >>> U-boot. There is no reason for us to have the workaround in the
>> >>> mainline kernel. Rather, we should remove the check from our u-boot.
>> >>> However AFAIR a clean-up patch that I had posted internally was not
>> >>> accepted as we had frozen the SPL at the time.
>> >>
>> >> Ah, is that this one, or a different one?
>> >>
>> >> https://chromium-review.googlesource.com/#/c/66049/
>> > Yes, this along with a kernel side change.
>>
>> Can we safely take this one without the kernel-side one?
>>
>>
>> >> If we land that patch now it won't help since nobody is going to be
>> >> updating their read-only firmware.  We'll need to put code somewhere
>> >> that fixes it.
>> > We just carry the workaround fix locally until we migrate to mainline
>> > u-boot for 5420 where the unnecessay check will not be present.
>>
>> I think there are people out there who want to run a mainline kernel
>> on existing Chromebook 2 hardware and don't want to rewrite their RO
>> firmware.  We need a solution for those people.
>
> Agree. The answer to this is most definitely _not_ "install mainline u-boot".
> The upstream kernel needs to be able to boot with the firmware that was shipped
> on the device.
My answer is not "use mainline u-boot" primarily because I am not sure
mainline u-boot actually works on 5420 :). My answer is keep a patch
locally (or make a trivial change to the bootcmd) for people who would
like to use an upstream kernel with the firmware on the device. Once
we do have a working mainline u-boot, that can then be used by the
interested parties.
>
> In this case it shouldn't be controversial to add this. What we need is
> a one-time boot-time setup, not runtime so cpuidle shouldn't be a factor
> at that time. The earlier reservations were about runtime changes and this is
> quite different.
I think there is some confusion here, the clearing of the iRAM
location is what I have been pushing against. It has got nothing to do
cpuidle. If it were to be done then  it would be a one time setup and
I could quite easily do it in mcpm_init.

Regards,
Abhilash
>
>> >>> The second change is to enable snoops for boot cluster. Internally, we
>> >>> were disabling the snoops for both the clusters at power off and
>> >>> enabling it in power_up_setup and power_up. However, I dropped the
>> >>> approach due to problems pointed out by Nicolas here
>> >>> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
>> >>> cpuidle. Hence, we turn it on at the u-boot.
>> >>
>> >> I think I followed all that.  What you're saying is that our kernel
>> >> dynamically enables and disables snoops as needed, but Nicolas pointed
>> >> out that it was unsafe (though apparently we're not seeing problems in
>> >> our usage).
>> > We did not see any problems as CPUIdle was one of the problematic
>> > scenarios which we have not got enabled.
>>
>> Ah, makes sense!
>>
>>
>>
>> I'm still trying to figure out all of this code, but we'll also need
>> to make sure whatever solution we come up with handles suspend/resume
>> properly.  I know SRAM is lost across suspend/resume so someone
>> (either the SPL from read-only memory or the kernel) must be
>> recreating the SRAM structures after S2R...
>
>
> -Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 19:09                 ` Abhilash Kesavan
  2014-06-06 19:12                   ` Abhilash Kesavan
@ 2014-06-06 20:49                   ` Doug Anderson
  2014-06-06 21:01                     ` Russell King - ARM Linux
  2014-06-06 21:44                     ` Doug Anderson
  1 sibling, 2 replies; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 20:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 6, 2014 at 12:09 PM, Abhilash Kesavan
<kesavan.abhilash@gmail.com> wrote:
> Hi Doug,
>
> On Sat, Jun 7, 2014 at 12:26 AM, Doug Anderson <dianders@google.com> wrote:
>> Abhilash,
>>
>> On Fri, Jun 6, 2014 at 11:31 AM, Abhilash Kesavan
>> <kesavan.abhilash@gmail.com> wrote:
>>> Hi Doug,
>>>
>>> On Fri, Jun 6, 2014 at 11:50 PM, Doug Anderson <dianders@google.com> wrote:
>>>> Abhilash,
>>>>
>>>>
>>>>
>>>> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
>>>> <kesavan.abhilash@gmail.com> wrote:
>>>>> Hi Doug,
>>>>>
>>>>> On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>>>>>> Abhilash,
>>>>>>
>>>>>> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>>>>>> <kesavan.abhilash@gmail.com> wrote:
>>>>>>> Hi Doug,
>>>>>>>
>>>>>>> The first change in the kernel (clearing an iRAM location) is needed
>>>>>>> because of an unnecessary change that we are carrying in the Chrome
>>>>>>> U-boot. There is no reason for us to have the workaround in the
>>>>>>> mainline kernel. Rather, we should remove the check from our u-boot.
>>>>>>> However AFAIR a clean-up patch that I had posted internally was not
>>>>>>> accepted as we had frozen the SPL at the time.
>>>>>>
>>>>>> Ah, is that this one, or a different one?
>>>>>>
>>>>>> https://chromium-review.googlesource.com/#/c/66049/
>>>>> Yes, this along with a kernel side change.
>>>>
>>>> Can we safely take this one without the kernel-side one?
>>> Yes, just the u-boot change should suffice.
>>>>
>>>>
>>>>>> If we land that patch now it won't help since nobody is going to be
>>>>>> updating their read-only firmware.  We'll need to put code somewhere
>>>>>> that fixes it.
>>>>> We just carry the workaround fix locally until we migrate to mainline
>>>>> u-boot for 5420 where the unnecessay check will not be present.
>>>>
>>>> I think there are people out there who want to run a mainline kernel
>>>> on existing Chromebook 2 hardware and don't want to rewrite their RO
>>>> firmware.  We need a solution for those people.
>>> Yes, I see your point. But, do you think someone who has changed the
>>> existing fused kernel on the device to a mainline one would be averse
>>> to applying a couple of small work-around changes as well ? Their
>>> finding this thread and the proposed "magic" fixes may be difficult
>>> but not the actual application I think.
>>>
>>> How about having a page similar to
>>> "http://www.chromium.org/chromium-os/how-tos-and-troubleshooting/using-an-upstream-kernel-on-snow"
>>> for Peach ? We could have the work-arounds listed there.
>>
>> We can (though the fewer weird things we have the better), but we
>> definitely need to provide workarounds that don't require people to
>> change their RO firmware.
> I do not quite agree with this. They do not need to change their RO
> firmware, just modify their boot commands.
>>
>> Thinking all that through, I think the answer is that we want to
>> abandon the U-Boot change above
>> <https://chromium-review.googlesource.com/#/c/66049/>.  At this point
>> we're never going to take it at this point and there's no possible way
>> to do it in an RW firmware update (right?) since any workaround would
>> be overwritten by the SPL at resume time.
> Sure, will abandon.
>>
>> So the proper "fix" for the "mw.l 02073028 0" is a kernel fix.  ...and
> I believe there is no "proper" fix for incorrect existing code in
> non-mainline u-boot.
>
>> if upstream doesn't want land it because it's impure then we'll just
>> have to list it on our "apply these hacks to your kernel".  You sent
>> me this as a kernel fix before and now I think I understand why (to
>> handle the s2r case).  Can you please post this up to the mailing
>> list?  Please make sure it will handle the suspend/resume case
>> whenever suspend/resume starts working (I haven't tested but I
>> remember hearing that it doesn't work).
>
> Are you talking about the re-writing the mcpm entry point address post-resume ?

Or even (as Andrew points out) just don't use it.

This works and IMHO is much cleaner because it totally removes the
U-Boot dependency.  I'll cleanup to not be so insane and post:

diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
b/arch/arm/mach-exynos/mcpm-exynos.c
index 0498d0b..9c5df7b 100644
--- a/arch/arm/mach-exynos/mcpm-exynos.c
+++ b/arch/arm/mach-exynos/mcpm-exynos.c
@@ -290,6 +290,14 @@ static void __naked
exynos_pm_power_up_setup(unsigned int affinity_level)
        "b      cci_enable_port_for_self");
 }

+static void __naked exynos_mcpm_secondary_cpu_start(void)
+{
+       asm volatile ("\n"
+       "ldr    r0, [pc, #0]\n"
+       "bx     r0\n"
+       ".word  0" );
+}
+
 static const struct of_device_id exynos_dt_mcpm_match[] = {
        { .compatible = "samsung,exynos5420" },
        { .compatible = "samsung,exynos5800" },
@@ -346,8 +354,9 @@ static int __init exynos_mcpm_init(void)
         * Future entries into the kernel can now go
         * through the cluster entry vectors.
         */
-       __raw_writel(virt_to_phys(mcpm_entry_point),
-                       ns_sram_base_addr + MCPM_BOOT_ADDR_OFFSET);
+       __raw_writel(((u32*)exynos_mcpm_secondary_cpu_start)[0],
ns_sram_base_addr);
+       __raw_writel(((u32*)exynos_mcpm_secondary_cpu_start)[1],
ns_sram_base_addr + 4);
+       __raw_writel(virt_to_phys(mcpm_entry_point), ns_sram_base_addr + 8);

-Doug

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 20:46               ` Abhilash Kesavan
@ 2014-06-06 21:01                 ` Olof Johansson
  2014-06-06 21:06                   ` Abhilash Kesavan
  2014-06-06 21:34                   ` Nicolas Pitre
  0 siblings, 2 replies; 51+ messages in thread
From: Olof Johansson @ 2014-06-06 21:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 07, 2014 at 02:16:27AM +0530, Abhilash Kesavan wrote:
> Hi Olof,
> 
> On Sat, Jun 7, 2014 at 2:07 AM, Olof Johansson <olof@lixom.net> wrote:
> > [Adding Nico since he was involved in the original reviews]
> >
> > Hi,
> >
> > On Fri, Jun 06, 2014 at 11:20:56AM -0700, Doug Anderson wrote:
> >> Abhilash,
> >>
> >>
> >>
> >> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
> >> <kesavan.abhilash@gmail.com> wrote:
> >> > Hi Doug,
> >> >
> >> > On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
> >> >> Abhilash,
> >> >>
> >> >> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
> >> >> <kesavan.abhilash@gmail.com> wrote:
> >> >>> Hi Doug,
> >> >>>
> >> >>> The first change in the kernel (clearing an iRAM location) is needed
> >> >>> because of an unnecessary change that we are carrying in the Chrome
> >> >>> U-boot. There is no reason for us to have the workaround in the
> >> >>> mainline kernel. Rather, we should remove the check from our u-boot.
> >> >>> However AFAIR a clean-up patch that I had posted internally was not
> >> >>> accepted as we had frozen the SPL at the time.
> >> >>
> >> >> Ah, is that this one, or a different one?
> >> >>
> >> >> https://chromium-review.googlesource.com/#/c/66049/
> >> > Yes, this along with a kernel side change.
> >>
> >> Can we safely take this one without the kernel-side one?
> >>
> >>
> >> >> If we land that patch now it won't help since nobody is going to be
> >> >> updating their read-only firmware.  We'll need to put code somewhere
> >> >> that fixes it.
> >> > We just carry the workaround fix locally until we migrate to mainline
> >> > u-boot for 5420 where the unnecessay check will not be present.
> >>
> >> I think there are people out there who want to run a mainline kernel
> >> on existing Chromebook 2 hardware and don't want to rewrite their RO
> >> firmware.  We need a solution for those people.
> >
> > Agree. The answer to this is most definitely _not_ "install mainline u-boot".
> > The upstream kernel needs to be able to boot with the firmware that was shipped
> > on the device.
>
> My answer is not "use mainline u-boot" primarily because I am not sure
> mainline u-boot actually works on 5420 :).

And I'm saying that's not the answer primarily because we should never require
people to update their firmware to get a usable linux system.

> My answer is keep a patch
> locally (or make a trivial change to the bootcmd) for people who would
> like to use an upstream kernel with the firmware on the device. Once
> we do have a working mainline u-boot, that can then be used by the
> interested parties.

And I am strongly NAK:ing both of those approaches. We should not require
a single out-of-tree patch because that means we have failed to make a useful
kernel for people. And it should never, ever, be a requirement for people to
reflash and risk bricking their device just to run mainline linux on it. It's
an artificial barrier of entry with high risk, and we'll be worse off for
adding it. Same for out-of-tree patches.

> > In this case it shouldn't be controversial to add this. What we need is
> > a one-time boot-time setup, not runtime so cpuidle shouldn't be a factor
> > at that time. The earlier reservations were about runtime changes and this is
> > quite different.
> I think there is some confusion here, the clearing of the iRAM
> location is what I have been pushing against. It has got nothing to do
> cpuidle. If it were to be done then  it would be a one time setup and
> I could quite easily do it in mcpm_init.

iRAM is covered on Doug's sub-thread, and I think his approach looks promising.
So, it seems like we have a solution both to enable the CCI port and to avoid
clearing iram -- we should be set?


-Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 20:49                   ` Doug Anderson
@ 2014-06-06 21:01                     ` Russell King - ARM Linux
  2014-06-06 21:12                       ` Doug Anderson
  2014-06-06 21:44                     ` Doug Anderson
  1 sibling, 1 reply; 51+ messages in thread
From: Russell King - ARM Linux @ 2014-06-06 21:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 06, 2014 at 01:49:11PM -0700, Doug Anderson wrote:
> This works and IMHO is much cleaner because it totally removes the
> U-Boot dependency.  I'll cleanup to not be so insane and post:
> 
> diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
> b/arch/arm/mach-exynos/mcpm-exynos.c
> index 0498d0b..9c5df7b 100644
> --- a/arch/arm/mach-exynos/mcpm-exynos.c
> +++ b/arch/arm/mach-exynos/mcpm-exynos.c
> @@ -290,6 +290,14 @@ static void __naked
> exynos_pm_power_up_setup(unsigned int affinity_level)
>         "b      cci_enable_port_for_self");
>  }
> 
> +static void __naked exynos_mcpm_secondary_cpu_start(void)
> +{
> +       asm volatile ("\n"
> +       "ldr    r0, [pc, #0]\n"
> +       "bx     r0\n"
> +       ".word  0" );
> +}
> +

So does it matter whether the above code gets assembled as thumb or
ARM?  How does your caller know which ISA mode to enter this fragment
in?

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:01                 ` Olof Johansson
@ 2014-06-06 21:06                   ` Abhilash Kesavan
  2014-06-06 21:34                   ` Nicolas Pitre
  1 sibling, 0 replies; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-06 21:06 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Olof,

On Sat, Jun 7, 2014 at 2:31 AM, Olof Johansson <olof@lixom.net> wrote:
> On Sat, Jun 07, 2014 at 02:16:27AM +0530, Abhilash Kesavan wrote:
>> Hi Olof,
>>
>> On Sat, Jun 7, 2014 at 2:07 AM, Olof Johansson <olof@lixom.net> wrote:
>> > [Adding Nico since he was involved in the original reviews]
>> >
>> > Hi,
>> >
>> > On Fri, Jun 06, 2014 at 11:20:56AM -0700, Doug Anderson wrote:
>> >> Abhilash,
>> >>
>> >>
>> >>
>> >> On Fri, Jun 6, 2014 at 11:12 AM, Abhilash Kesavan
>> >> <kesavan.abhilash@gmail.com> wrote:
>> >> > Hi Doug,
>> >> >
>> >> > On Fri, Jun 6, 2014 at 11:32 PM, Doug Anderson <dianders@google.com> wrote:
>> >> >> Abhilash,
>> >> >>
>> >> >> On Fri, Jun 6, 2014 at 10:36 AM, Abhilash Kesavan
>> >> >> <kesavan.abhilash@gmail.com> wrote:
>> >> >>> Hi Doug,
>> >> >>>
>> >> >>> The first change in the kernel (clearing an iRAM location) is needed
>> >> >>> because of an unnecessary change that we are carrying in the Chrome
>> >> >>> U-boot. There is no reason for us to have the workaround in the
>> >> >>> mainline kernel. Rather, we should remove the check from our u-boot.
>> >> >>> However AFAIR a clean-up patch that I had posted internally was not
>> >> >>> accepted as we had frozen the SPL at the time.
>> >> >>
>> >> >> Ah, is that this one, or a different one?
>> >> >>
>> >> >> https://chromium-review.googlesource.com/#/c/66049/
>> >> > Yes, this along with a kernel side change.
>> >>
>> >> Can we safely take this one without the kernel-side one?
>> >>
>> >>
>> >> >> If we land that patch now it won't help since nobody is going to be
>> >> >> updating their read-only firmware.  We'll need to put code somewhere
>> >> >> that fixes it.
>> >> > We just carry the workaround fix locally until we migrate to mainline
>> >> > u-boot for 5420 where the unnecessay check will not be present.
>> >>
>> >> I think there are people out there who want to run a mainline kernel
>> >> on existing Chromebook 2 hardware and don't want to rewrite their RO
>> >> firmware.  We need a solution for those people.
>> >
>> > Agree. The answer to this is most definitely _not_ "install mainline u-boot".
>> > The upstream kernel needs to be able to boot with the firmware that was shipped
>> > on the device.
>>
>> My answer is not "use mainline u-boot" primarily because I am not sure
>> mainline u-boot actually works on 5420 :).
>
> And I'm saying that's not the answer primarily because we should never require
> people to update their firmware to get a usable linux system.
>
>> My answer is keep a patch
>> locally (or make a trivial change to the bootcmd) for people who would
>> like to use an upstream kernel with the firmware on the device. Once
>> we do have a working mainline u-boot, that can then be used by the
>> interested parties.
>
> And I am strongly NAK:ing both of those approaches. We should not require
> a single out-of-tree patch because that means we have failed to make a useful
> kernel for people. And it should never, ever, be a requirement for people to
> reflash and risk bricking their device just to run mainline linux on it. It's
> an artificial barrier of entry with high risk, and we'll be worse off for
> adding it. Same for out-of-tree patches.
I have explained my reasons in the thread. I continue to believe that
we should not be adding code in the kernel that is specifically
handling an oversight that exists in the Chrome U-boot. However, since
you have NAK'ed my approaches, we are left with Doug's.
>
>> > In this case it shouldn't be controversial to add this. What we need is
>> > a one-time boot-time setup, not runtime so cpuidle shouldn't be a factor
>> > at that time. The earlier reservations were about runtime changes and this is
>> > quite different.
>> I think there is some confusion here, the clearing of the iRAM
>> location is what I have been pushing against. It has got nothing to do
>> cpuidle. If it were to be done then  it would be a one time setup and
>> I could quite easily do it in mcpm_init.
>
> iRAM is covered on Doug's sub-thread, and I think his approach looks promising.
> So, it seems like we have a solution both to enable the CCI port and to avoid
> clearing iram -- we should be set?
I'll have a look at Doug's updated patches tomorrow or on Monday.

Regards,
Abhilash
>
>
> -Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:01                     ` Russell King - ARM Linux
@ 2014-06-06 21:12                       ` Doug Anderson
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 21:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Jun 6, 2014 at 2:01 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Jun 06, 2014 at 01:49:11PM -0700, Doug Anderson wrote:
>> This works and IMHO is much cleaner because it totally removes the
>> U-Boot dependency.  I'll cleanup to not be so insane and post:
>>
>> diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
>> b/arch/arm/mach-exynos/mcpm-exynos.c
>> index 0498d0b..9c5df7b 100644
>> --- a/arch/arm/mach-exynos/mcpm-exynos.c
>> +++ b/arch/arm/mach-exynos/mcpm-exynos.c
>> @@ -290,6 +290,14 @@ static void __naked
>> exynos_pm_power_up_setup(unsigned int affinity_level)
>>         "b      cci_enable_port_for_self");
>>  }
>>
>> +static void __naked exynos_mcpm_secondary_cpu_start(void)
>> +{
>> +       asm volatile ("\n"
>> +       "ldr    r0, [pc, #0]\n"
>> +       "bx     r0\n"
>> +       ".word  0" );
>> +}
>> +
>
> So does it matter whether the above code gets assembled as thumb or
> ARM?  How does your caller know which ISA mode to enter this fragment
> in?

I'm going to take Olof's suggestion and just hardcode the instructions like:

__raw_writel(0xe59f0000, ns_sram_base_addr);     /* ldr r0, [pc, #0] */
__raw_writel(0xe12fff10, ns_sram_base_addr + 4); /* bx  r0 */
__raw_writel(virt_to_phys(mcpm_entry_point), ns_sram_base_addr + 8);

The caller always jumps to this code with "bx" but always jumps into
ARM mode.  Specifically U-Boot will have:

  branch_bx(CONFIG_EXYNOS_RELOCATE_CODE_BASE);

-Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:01                 ` Olof Johansson
  2014-06-06 21:06                   ` Abhilash Kesavan
@ 2014-06-06 21:34                   ` Nicolas Pitre
  2014-06-06 21:49                     ` Olof Johansson
  1 sibling, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-06 21:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 6 Jun 2014, Olof Johansson wrote:

> On Sat, Jun 07, 2014 at 02:16:27AM +0530, Abhilash Kesavan wrote:
> > My answer is not "use mainline u-boot" primarily because I am not sure
> > mainline u-boot actually works on 5420 :).
> 
> And I'm saying that's not the answer primarily because we should never require
> people to update their firmware to get a usable linux system.
> 
> > My answer is keep a patch
> > locally (or make a trivial change to the bootcmd) for people who would
> > like to use an upstream kernel with the firmware on the device. Once
> > we do have a working mainline u-boot, that can then be used by the
> > interested parties.
> 
> And I am strongly NAK:ing both of those approaches. We should not require
> a single out-of-tree patch because that means we have failed to make a useful
> kernel for people. And it should never, ever, be a requirement for people to
> reflash and risk bricking their device just to run mainline linux on it.

What we can do, though, is to publicly shame you all Google People very 
strongly for not making firmware updates in the field safer and easier.  
After all you must all know already that, by definition, software always 
contains bugs.  The first feature to be tested with a new 
bootloader/firmware must be the ability to successfully and safely (and 
securely) update itself.  Especially for a mainstream device like a 
Chromebook.

> It's an artificial barrier of entry with high risk, and we'll be worse 
> off for adding it. Same for out-of-tree patches.

So ... let's find the lesser of all evils ... as always.

> iRAM is covered on Doug's sub-thread, and I think his approach looks promising.
> So, it seems like we have a solution both to enable the CCI port and to avoid
> clearing iram -- we should be set?

Care to resume the proposed solution then?


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 20:49                   ` Doug Anderson
  2014-06-06 21:01                     ` Russell King - ARM Linux
@ 2014-06-06 21:44                     ` Doug Anderson
  1 sibling, 0 replies; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 21:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Jun 6, 2014 at 1:49 PM, Doug Anderson <dianders@google.com> wrote:
>> Are you talking about the re-writing the mcpm entry point address post-resume ?
>
> Or even (as Andrew points out) just don't use it.
>
> This works and IMHO is much cleaner because it totally removes the
> U-Boot dependency.  I'll cleanup to not be so insane and post:

The less insane version is at
<https://patchwork.kernel.org/patch/4313611/>.  Thanks to Andrew for
the suggestion!

-Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 17:36     ` Abhilash Kesavan
  2014-06-06 18:02       ` Doug Anderson
@ 2014-06-06 21:48       ` Nicolas Pitre
  2014-06-07  3:25         ` Abhilash Kesavan
  1 sibling, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-06 21:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 6 Jun 2014, Abhilash Kesavan wrote:

> Hi Doug,
> 
> The first change in the kernel (clearing an iRAM location) is needed
> because of an unnecessary change that we are carrying in the Chrome
> U-boot. There is no reason for us to have the workaround in the
> mainline kernel. Rather, we should remove the check from our u-boot.
> However AFAIR a clean-up patch that I had posted internally was not
> accepted as we had frozen the SPL at the time.
> 
> The second change is to enable snoops for boot cluster. Internally, we
> were disabling the snoops for both the clusters at power off and
> enabling it in power_up_setup and power_up. However, I dropped the
> approach due to problems pointed out by Nicolas here
> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
> cpuidle. Hence, we turn it on at the u-boot.

You should never *ever* play with the CCI from U-Boot.  That's for the 
MCPM layer to decide when it is safe to do so.  That's mainly the 
_whole_ reason why MCPM exists in the first place.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:34                   ` Nicolas Pitre
@ 2014-06-06 21:49                     ` Olof Johansson
  2014-06-06 21:59                       ` Doug Anderson
  2014-06-06 22:17                       ` Nicolas Pitre
  0 siblings, 2 replies; 51+ messages in thread
From: Olof Johansson @ 2014-06-06 21:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 06, 2014 at 05:34:21PM -0400, Nicolas Pitre wrote:
> On Fri, 6 Jun 2014, Olof Johansson wrote:
> 
> > On Sat, Jun 07, 2014 at 02:16:27AM +0530, Abhilash Kesavan wrote:
> > > My answer is not "use mainline u-boot" primarily because I am not sure
> > > mainline u-boot actually works on 5420 :).
> > 
> > And I'm saying that's not the answer primarily because we should never require
> > people to update their firmware to get a usable linux system.
> > 
> > > My answer is keep a patch
> > > locally (or make a trivial change to the bootcmd) for people who would
> > > like to use an upstream kernel with the firmware on the device. Once
> > > we do have a working mainline u-boot, that can then be used by the
> > > interested parties.
> > 
> > And I am strongly NAK:ing both of those approaches. We should not require
> > a single out-of-tree patch because that means we have failed to make a useful
> > kernel for people. And it should never, ever, be a requirement for people to
> > reflash and risk bricking their device just to run mainline linux on it.
> 
> What we can do, though, is to publicly shame you all Google People very 
> strongly for not making firmware updates in the field safer and easier.  
> After all you must all know already that, by definition, software always 
> contains bugs.  The first feature to be tested with a new 
> bootloader/firmware must be the ability to successfully and safely (and 
> securely) update itself.  Especially for a mainstream device like a 
> Chromebook.

The security model of Chrome OS is that it relies on a read-only firmware
that is later verifying and launching a read-write firwmare. Obviously
changes to the read-only firwmare is impossible in the field (and only
done at risk of bricking the device for hobbyists). Some changes are
not possible to work around in the read-write part of the firmware,
or for some other reason becomes unfeasible to handle there.

We're working on making the RO portion smaller, so we'll be less exposed
to this in the future, but that's a feature that will take some time
to mature. It can also be hard to motivate updating the firmware in
the field for upstream usage, since doing these updates are a large
undertaking from a QA point of view. When we can bundle them with other
high-priority changes (without adding significant risk), we try to do so.

Anyway, with that being said: The real reason we're in this situation
at this time is that the upstream review and discussion (including
patch posting) didn't happen until after RO firmware had been cut, and
some pieces were not acceptable (things that we had not realized
in our internal reviews when we were working on the product). If SoC
vendors upstream code earlier, we can be more certain that the upstream
implementation will work well with the firmware we have. That's the
_real_ solution to this situation. We're working with vendors to make
them understand this (work closer and earlier with upstream). Samsung is
making a great effort on 5420/5800, but things won't be perfect overnight.

> > It's an artificial barrier of entry with high risk, and we'll be worse 
> > off for adding it. Same for out-of-tree patches.
> 
> So ... let's find the lesser of all evils ... as always.

We're working hard on making sure a vanilla upstream kernel will work
well on this generation of ARM Chromebooks, something we never really
did with the first generation. If the response we get is "just carry it
out of tree", why would we even care about upstream at all? We're trying
to do the right thing here, even if it might require a line of code or
two here or there that isn't perfect.

> > iRAM is covered on Doug's sub-thread, and I think his approach looks promising.
> > So, it seems like we have a solution both to enable the CCI port and to avoid
> > clearing iram -- we should be set?
> 
> Care to resume the proposed solution then?

Doug has posted patches for the IRAM piece, and Andrew was going to look
at the MCPM piece. So that's already happening.


-Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:49                     ` Olof Johansson
@ 2014-06-06 21:59                       ` Doug Anderson
  2014-06-06 22:38                         ` Nicolas Pitre
  2014-06-06 22:17                       ` Nicolas Pitre
  1 sibling, 1 reply; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 21:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

To add a few things to Olof's comments:

On Fri, Jun 6, 2014 at 2:49 PM, Olof Johansson <olof@lixom.net> wrote:
>> What we can do, though, is to publicly shame you all Google People very
>> strongly for not making firmware updates in the field safer and easier.
>> After all you must all know already that, by definition, software always
>> contains bugs.  The first feature to be tested with a new
>> bootloader/firmware must be the ability to successfully and safely (and
>> securely) update itself.  Especially for a mainstream device like a
>> Chromebook.
>
> The security model of Chrome OS is that it relies on a read-only firmware
> that is later verifying and launching a read-write firwmare. Obviously
> changes to the read-only firwmare is impossible in the field (and only
> done at risk of bricking the device for hobbyists). Some changes are
> not possible to work around in the read-write part of the firmware,
> or for some other reason becomes unfeasible to handle there.
>
> We're working on making the RO portion smaller, so we'll be less exposed
> to this in the future, but that's a feature that will take some time
> to mature.

One thing to be aware is that despite the fact that these machines
just came out, their firmware has been roughly frozen (no major
changes) since last September due to various (non-technical) issues.
We almost got the massive rewrite in before the deadline (talk to
Simon Glass about this) but didn't quite make the deadline.  :(

Note that handling CPU resume in a way that can be updated by RW
firmware is non-trivial and requires some SRAM to be saved across
suspend/resume.  On the original Samsung Chromebook we didn't have
that.  On the Samsung Chromebook 2 we do (yay!) but we didn't realize
it until pretty late in the game.  Sigh.

-Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:49                     ` Olof Johansson
  2014-06-06 21:59                       ` Doug Anderson
@ 2014-06-06 22:17                       ` Nicolas Pitre
  1 sibling, 0 replies; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-06 22:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 6 Jun 2014, Olof Johansson wrote:

> On Fri, Jun 06, 2014 at 05:34:21PM -0400, Nicolas Pitre wrote:
> > On Fri, 6 Jun 2014, Olof Johansson wrote:
> > 
> > > On Sat, Jun 07, 2014 at 02:16:27AM +0530, Abhilash Kesavan wrote:
> > > > My answer is not "use mainline u-boot" primarily because I am not sure
> > > > mainline u-boot actually works on 5420 :).
> > > 
> > > And I'm saying that's not the answer primarily because we should never require
> > > people to update their firmware to get a usable linux system.
> > > 
> > > > My answer is keep a patch
> > > > locally (or make a trivial change to the bootcmd) for people who would
> > > > like to use an upstream kernel with the firmware on the device. Once
> > > > we do have a working mainline u-boot, that can then be used by the
> > > > interested parties.
> > > 
> > > And I am strongly NAK:ing both of those approaches. We should not require
> > > a single out-of-tree patch because that means we have failed to make a useful
> > > kernel for people. And it should never, ever, be a requirement for people to
> > > reflash and risk bricking their device just to run mainline linux on it.
> > 
> > What we can do, though, is to publicly shame you all Google People very 
> > strongly for not making firmware updates in the field safer and easier.  
> > After all you must all know already that, by definition, software always 
> > contains bugs.  The first feature to be tested with a new 
> > bootloader/firmware must be the ability to successfully and safely (and 
> > securely) update itself.  Especially for a mainstream device like a 
> > Chromebook.
> 
> The security model of Chrome OS is that it relies on a read-only firmware
> that is later verifying and launching a read-write firwmare.

It must do a whole lot more than that, otherwise we wouldn't be talking 
about ways to update it.

> Some changes are
> not possible to work around in the read-write part of the firmware,
> or for some other reason becomes unfeasible to handle there.

In other words, you are pleading guilty for bad design.

> We're working on making the RO portion smaller, so we'll be less exposed
> to this in the future, but that's a feature that will take some time
> to mature. It can also be hard to motivate updating the firmware in
> the field for upstream usage, since doing these updates are a large
> undertaking from a QA point of view.

Usually, when the mechanism is there, it gets happily used even before 
mainline usage is considered.

> We're working hard on making sure a vanilla upstream kernel will work
> well on this generation of ARM Chromebooks, something we never really
> did with the first generation. If the response we get is "just carry it
> out of tree", why would we even care about upstream at all? We're trying
> to do the right thing here, even if it might require a line of code or
> two here or there that isn't perfect.

Shit happens.  One line or two is tolerable.

What is not acceptable is the impossibility to fix fundamental 
correctness issues because people think their damn read-only firmware is 
bug free.  I've spent a great amount of time doing multiple rounds of 
reviews for MCPM backends.  Still it seems that people just don't get 
all the correctness subtleties before at least 5 rounds.  Yet they 
thought their code was OK on the initial rounds, otherwise why would 
they knowingly post buggy code?  If that code had been put into 
read-only firmware, say behind some PSCI interface for example, then 
your device would simply be stuck to board-bringup feature level.

I don't know enough about the actual issue with the Chromebook yet, but 
I do hope your firmware screwups aren't too serious.  The patch Doug 
just posted appears innocent enough.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:59                       ` Doug Anderson
@ 2014-06-06 22:38                         ` Nicolas Pitre
  2014-06-06 23:03                           ` Doug Anderson
  0 siblings, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-06 22:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 6 Jun 2014, Doug Anderson wrote:

> Note that handling CPU resume in a way that can be updated by RW
> firmware is non-trivial and requires some SRAM to be saved across
> suspend/resume.

Saved by the kernel or the firmware?


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 22:38                         ` Nicolas Pitre
@ 2014-06-06 23:03                           ` Doug Anderson
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Anderson @ 2014-06-06 23:03 UTC (permalink / raw)
  To: linux-arm-kernel

Nicolas,

On Fri, Jun 6, 2014 at 3:38 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 6 Jun 2014, Doug Anderson wrote:
>
>> Note that handling CPU resume in a way that can be updated by RW
>> firmware is non-trivial and requires some SRAM to be saved across
>> suspend/resume.
>
> Saved by the kernel or the firmware?

By the SoC / board.

The problem on the original ARM Chromebook:

1. After resume, the boot CPU comes up just as if it were booting from
a cold boot.  All clocks are back at their default and the SDRAM
controller is no longer initted.  SDRAM is sitting in self refresh.

2. CPU loads boot code from SPI flash.

3. Boot code needs to be read only for security reasons.  You can have
RW updates for some portion of boot code, but only the part _after_
clock init (needed for SDRAM), SDRAM reinit, cache init.  That's
because you need those parts to (quickly) run crypto verification
code.


If the SoC / board can keep its internal SRAM powered across suspend
resume you've got an out.  Now you can have the firmware stash its
"wakeup" code into a location in SRAM.  At wakeup time you can jump
straight there and that code can worry about everything.  Now a
read-write firmware update can update the wakeup code and you're all
set.  In this case wakeup code doesn't survive a full power cycle /
reboot so you don't need any extra verification like you do if you
load from persistent storage.


That all still means you need to qualify a rw firmware update, but
that's better than nothing.

-Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-06 21:48       ` Nicolas Pitre
@ 2014-06-07  3:25         ` Abhilash Kesavan
  2014-06-07 16:10           ` Nicolas Pitre
  0 siblings, 1 reply; 51+ messages in thread
From: Abhilash Kesavan @ 2014-06-07  3:25 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

The first man of the incoming cluster enables its snoops via the
power_up_setup function. During secondary boot-up, this does not occur
for the boot cluster. Hence, I enable the snoops for the boot cluster
as a one-time setup from the u-boot prompt. After secondary boot-up
there is no modification that I do. Where should this be ideally done
?

Regards,
Abhilash

On Sat, Jun 7, 2014 at 3:18 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 6 Jun 2014, Abhilash Kesavan wrote:
>
>> Hi Doug,
>>
>> The first change in the kernel (clearing an iRAM location) is needed
>> because of an unnecessary change that we are carrying in the Chrome
>> U-boot. There is no reason for us to have the workaround in the
>> mainline kernel. Rather, we should remove the check from our u-boot.
>> However AFAIR a clean-up patch that I had posted internally was not
>> accepted as we had frozen the SPL at the time.
>>
>> The second change is to enable snoops for boot cluster. Internally, we
>> were disabling the snoops for both the clusters at power off and
>> enabling it in power_up_setup and power_up. However, I dropped the
>> approach due to problems pointed out by Nicolas here
>> http://www.spinics.net/lists/arm-kernel/msg324091.html related to
>> cpuidle. Hence, we turn it on at the u-boot.
>
> You should never *ever* play with the CCI from U-Boot.  That's for the
> MCPM layer to decide when it is safe to do so.  That's mainly the
> _whole_ reason why MCPM exists in the first place.
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07  3:25         ` Abhilash Kesavan
@ 2014-06-07 16:10           ` Nicolas Pitre
  2014-06-07 17:56             ` Lorenzo Pieralisi
  2014-06-09 20:27             ` Kevin Hilman
  0 siblings, 2 replies; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-07 16:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 7 Jun 2014, Abhilash Kesavan wrote:

> Hi Nicolas,
> 
> The first man of the incoming cluster enables its snoops via the
> power_up_setup function. During secondary boot-up, this does not occur
> for the boot cluster. Hence, I enable the snoops for the boot cluster
> as a one-time setup from the u-boot prompt. After secondary boot-up
> there is no modification that I do.

OK that's good.

> Where should this be ideally done ?

If I remember correctly, the CCI can be safely activated only when the 
cache is disabled.  So that means the CCI should ideally be turned on 
for the boot cluster (and *only* for the boot CPU) by the bootloader.

Now... If you _really_ prefer to do it from the kernel to avoid 
difficulties with bootloader updates, then it should be possible to do 
it from the kernel by temporarily turning the cache off.  This is not a 
small thing but the MCPM infrastructure can be leveraged.  Here's what I 
tried on a TC2 which might just work for you as well:

diff --git a/arch/arm/common/mcpm_entry.c b/arch/arm/common/mcpm_entry.c
index 86fd60fefb..1cc49de308 100644
--- a/arch/arm/common/mcpm_entry.c
+++ b/arch/arm/common/mcpm_entry.c
@@ -12,11 +12,13 @@
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/irqflags.h>
+#include <linux/cpu_pm.h>
 
 #include <asm/mcpm.h>
 #include <asm/cacheflush.h>
 #include <asm/idmap.h>
 #include <asm/cputype.h>
+#include <asm/suspend.h>
 
 extern unsigned long mcpm_entry_vectors[MAX_NR_CLUSTERS][MAX_CPUS_PER_CLUSTER];
 
@@ -146,6 +148,44 @@ int mcpm_cpu_powered_up(void)
 	return 0;
 }
 
+static int go_down(unsigned long _arg)
+{
+	void (*cache_disable)(void) = (void *)_arg;
+	unsigned int mpidr = read_cpuid_mpidr();
+	unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
+	unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
+	phys_reset_t phys_reset;
+
+	mcpm_set_entry_vector(cpu, cluster, cpu_resume);
+	setup_mm_for_reboot();
+
+	__mcpm_cpu_going_down(cpu, cluster);
+	BUG_ON(!__mcpm_outbound_enter_critical(cpu, cluster));
+	cache_disable();
+	__mcpm_outbound_leave_critical(cluster, CLUSTER_DOWN);
+	__mcpm_cpu_down(cpu, cluster);
+
+	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
+	phys_reset(virt_to_phys(mcpm_entry_point));
+	BUG();
+}
+	
+int mcpm_loopback(void (*cache_disable)(void))
+{
+	int ret;
+
+	local_irq_disable();
+	local_fiq_disable();
+	ret = cpu_pm_enter();
+	if (!ret) {
+		ret = cpu_suspend((unsigned long)cache_disable, go_down);
+		cpu_pm_exit();
+	}
+	local_fiq_enable();
+	local_irq_enable();
+	return ret;
+}
+
 struct sync_struct mcpm_sync;
 
 /*
diff --git a/arch/arm/mach-vexpress/tc2_pm.c b/arch/arm/mach-vexpress/tc2_pm.c
index 29e7785a54..abecdd734f 100644
--- a/arch/arm/mach-vexpress/tc2_pm.c
+++ b/arch/arm/mach-vexpress/tc2_pm.c
@@ -323,6 +323,26 @@ static void __naked tc2_pm_power_up_setup(unsigned int affinity_level)
 "	b	cci_enable_port_for_self ");
 }
 
+int mcpm_loopback(void (*cache_disable)(void));
+
+static void tc2_cache_down(void)
+{
+	pr_warn("TC2: disabling cache\n");
+	if (read_cpuid_part_number() == ARM_CPU_PART_CORTEX_A15) {
+		/*
+		 * On the Cortex-A15 we need to disable
+		 * L2 prefetching before flushing the cache.
+		 */
+		asm volatile(
+		"mcr	p15, 1, %0, c15, c0, 3 \n\t"
+		"isb	\n\t"
+		"dsb	"
+		: : "r" (0x400) );
+	}
+	v7_exit_coherency_flush(all);
+	cci_disable_port_by_cpu(read_cpuid_mpidr());
+}
+
 static int __init tc2_pm_init(void)
 {
 	int ret, irq;
@@ -370,6 +390,7 @@ static int __init tc2_pm_init(void)
 	ret = mcpm_platform_register(&tc2_pm_power_ops);
 	if (!ret) {
 		mcpm_sync_init(tc2_pm_power_up_setup);
+		BUG_ON(mcpm_loopback(tc2_cache_down) != 0);
 		pr_info("TC2 power management initialized\n");
 	}
 	return ret;

Of course it is not because I'm helping you sidestepping firmware 
problems that I'll stop shaming those who came up with that firmware 
design.  ;-)


Nicolas

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 16:10           ` Nicolas Pitre
@ 2014-06-07 17:56             ` Lorenzo Pieralisi
  2014-06-07 20:06               ` Nicolas Pitre
  2014-06-09 20:27             ` Kevin Hilman
  1 sibling, 1 reply; 51+ messages in thread
From: Lorenzo Pieralisi @ 2014-06-07 17:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 07, 2014 at 05:10:27PM +0100, Nicolas Pitre wrote:
> On Sat, 7 Jun 2014, Abhilash Kesavan wrote:
> 
> > Hi Nicolas,
> > 
> > The first man of the incoming cluster enables its snoops via the
> > power_up_setup function. During secondary boot-up, this does not occur
> > for the boot cluster. Hence, I enable the snoops for the boot cluster
> > as a one-time setup from the u-boot prompt. After secondary boot-up
> > there is no modification that I do.
> 
> OK that's good.
> 
> > Where should this be ideally done ?
> 
> If I remember correctly, the CCI can be safely activated only when the 
> cache is disabled.  So that means the CCI should ideally be turned on 
> for the boot cluster (and *only* for the boot CPU) by the bootloader.

CCI ports are enabled per-cluster, so the boot loader must turn on
CCI for all clusters before the respective CPUs have a chance to
turn on their caches. It is a secure operation, this can be overriden
and probably that's what has been done, wrongly.

True, TC2 on warm-boot reenables CCI, and that's because it runs the
kernel in secure world, and again that's _wrong_.

It must be done in firmware, and I am totally against any attempt at
papering over what looks like a messed up firmware implementation with
a bunch of hacks in the kernel, because that's what the patch below is
(soft restarting a CPU to enable CCI ? and adding generic code for that ?
what's next ?)

I understand there is an issue and lots at stake here, but I do not want the
patch below to be merged in the kernel, I am sorry, it is a tad too much.

Lorenzo

> 
> Now... If you _really_ prefer to do it from the kernel to avoid 
> difficulties with bootloader updates, then it should be possible to do 
> it from the kernel by temporarily turning the cache off.  This is not a 
> small thing but the MCPM infrastructure can be leveraged.  Here's what I 
> tried on a TC2 which might just work for you as well:
> 
> diff --git a/arch/arm/common/mcpm_entry.c b/arch/arm/common/mcpm_entry.c
> index 86fd60fefb..1cc49de308 100644
> --- a/arch/arm/common/mcpm_entry.c
> +++ b/arch/arm/common/mcpm_entry.c
> @@ -12,11 +12,13 @@
>  #include <linux/kernel.h>
>  #include <linux/init.h>
>  #include <linux/irqflags.h>
> +#include <linux/cpu_pm.h>
>  
>  #include <asm/mcpm.h>
>  #include <asm/cacheflush.h>
>  #include <asm/idmap.h>
>  #include <asm/cputype.h>
> +#include <asm/suspend.h>
>  
>  extern unsigned long mcpm_entry_vectors[MAX_NR_CLUSTERS][MAX_CPUS_PER_CLUSTER];
>  
> @@ -146,6 +148,44 @@ int mcpm_cpu_powered_up(void)
>  	return 0;
>  }
>  
> +static int go_down(unsigned long _arg)
> +{
> +	void (*cache_disable)(void) = (void *)_arg;
> +	unsigned int mpidr = read_cpuid_mpidr();
> +	unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> +	unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> +	phys_reset_t phys_reset;
> +
> +	mcpm_set_entry_vector(cpu, cluster, cpu_resume);
> +	setup_mm_for_reboot();
> +
> +	__mcpm_cpu_going_down(cpu, cluster);
> +	BUG_ON(!__mcpm_outbound_enter_critical(cpu, cluster));
> +	cache_disable();
> +	__mcpm_outbound_leave_critical(cluster, CLUSTER_DOWN);
> +	__mcpm_cpu_down(cpu, cluster);
> +
> +	phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
> +	phys_reset(virt_to_phys(mcpm_entry_point));
> +	BUG();
> +}
> +	
> +int mcpm_loopback(void (*cache_disable)(void))
> +{
> +	int ret;
> +
> +	local_irq_disable();
> +	local_fiq_disable();
> +	ret = cpu_pm_enter();
> +	if (!ret) {
> +		ret = cpu_suspend((unsigned long)cache_disable, go_down);
> +		cpu_pm_exit();
> +	}
> +	local_fiq_enable();
> +	local_irq_enable();
> +	return ret;
> +}
> +
>  struct sync_struct mcpm_sync;
>  
>  /*
> diff --git a/arch/arm/mach-vexpress/tc2_pm.c b/arch/arm/mach-vexpress/tc2_pm.c
> index 29e7785a54..abecdd734f 100644
> --- a/arch/arm/mach-vexpress/tc2_pm.c
> +++ b/arch/arm/mach-vexpress/tc2_pm.c
> @@ -323,6 +323,26 @@ static void __naked tc2_pm_power_up_setup(unsigned int affinity_level)
>  "	b	cci_enable_port_for_self ");
>  }
>  
> +int mcpm_loopback(void (*cache_disable)(void));
> +
> +static void tc2_cache_down(void)
> +{
> +	pr_warn("TC2: disabling cache\n");
> +	if (read_cpuid_part_number() == ARM_CPU_PART_CORTEX_A15) {
> +		/*
> +		 * On the Cortex-A15 we need to disable
> +		 * L2 prefetching before flushing the cache.
> +		 */
> +		asm volatile(
> +		"mcr	p15, 1, %0, c15, c0, 3 \n\t"
> +		"isb	\n\t"
> +		"dsb	"
> +		: : "r" (0x400) );
> +	}
> +	v7_exit_coherency_flush(all);
> +	cci_disable_port_by_cpu(read_cpuid_mpidr());
> +}
> +
>  static int __init tc2_pm_init(void)
>  {
>  	int ret, irq;
> @@ -370,6 +390,7 @@ static int __init tc2_pm_init(void)
>  	ret = mcpm_platform_register(&tc2_pm_power_ops);
>  	if (!ret) {
>  		mcpm_sync_init(tc2_pm_power_up_setup);
> +		BUG_ON(mcpm_loopback(tc2_cache_down) != 0);
>  		pr_info("TC2 power management initialized\n");
>  	}
>  	return ret;
> 
> Of course it is not because I'm helping you sidestepping firmware 
> problems that I'll stop shaming those who came up with that firmware 
> design.  ;-)
> 
> 
> Nicolas
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 17:56             ` Lorenzo Pieralisi
@ 2014-06-07 20:06               ` Nicolas Pitre
  2014-06-07 22:36                 ` Lorenzo Pieralisi
  0 siblings, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-07 20:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 7 Jun 2014, Lorenzo Pieralisi wrote:

> On Sat, Jun 07, 2014 at 05:10:27PM +0100, Nicolas Pitre wrote:
> > On Sat, 7 Jun 2014, Abhilash Kesavan wrote:
> > 
> > > Hi Nicolas,
> > > 
> > > The first man of the incoming cluster enables its snoops via the
> > > power_up_setup function. During secondary boot-up, this does not occur
> > > for the boot cluster. Hence, I enable the snoops for the boot cluster
> > > as a one-time setup from the u-boot prompt. After secondary boot-up
> > > there is no modification that I do.
> > 
> > OK that's good.
> > 
> > > Where should this be ideally done ?
> > 
> > If I remember correctly, the CCI can be safely activated only when the 
> > cache is disabled.  So that means the CCI should ideally be turned on 
> > for the boot cluster (and *only* for the boot CPU) by the bootloader.
> 
> CCI ports are enabled per-cluster, so the boot loader must turn on
> CCI for all clusters before the respective CPUs have a chance to
> turn on their caches. It is a secure operation, this can be overriden
> and probably that's what has been done, wrongly.

Careful.  By saying "for all clusters" you might be interpreted as 
saying that the CCI must be turned on even for CPUs that are not powered 
up.

> True, TC2 on warm-boot reenables CCI, and that's because it runs the
> kernel in secure world, and again that's _wrong_.

Let me respectfully disagree.

> It must be done in firmware, and I am totally against any attempt at
> papering over what looks like a messed up firmware implementation with
> a bunch of hacks in the kernel, because that's what the patch below is
> (soft restarting a CPU to enable CCI ? and adding generic code for that ?
> what's next ?)

Are you promoting for the removal of drivers/bus/arm-cci.c ?

You do realize that the fundamental raison d'?tre for MCPM is actually 
to manage the race free enabling of the cache and CCI ?

> I understand there is an issue and lots at stake here, but I do not want the
> patch below to be merged in the kernel, I am sorry, it is a tad too much.

Lorenzo: Don't get me wrong.  The Chromebooks, and possibly to some 
extent some people at Samsung, were simply too confident in their 
ability to create absolutely bug-free firmware code to the point of not 
making its update easy in the field.  This is completely outrageous in 
my point of view.  Yet one of the reactions was to call upstream kernel 
people as purists because the kernel is so much easier to modify in 
order to cover their mess and yet that might not be accepted.  Like I 
said I won't stop shaming them publicly for their own "incompetence" 
just yet (no pun intended), but being excessively purist does not 
benefit anyone either, and for that they have a point.

*HOWEVER* I have no choice but to say that many people at ARM, including 
a couple individuals for whom I nevertheless have a lot of admiration, 
also have an incredible faith in their ability to convince themselves, 
and then turn around to preach to the world, that *more firmware* is 
going to be so much purer and solve so many more problems than it 
creates and become such a magical success that we should immediately 
dedicate our soul to the cause with no hint of a doubt.

I'm sorry to rain on your parade, but I don't believe in this one iota.

Let me repeat the MCPM story again: it took 3 people, including 2 from 
ARM, over *six* months to get everything right and stable on TC2. I 
think you also contributed to that effort as well. Subsequent MCPM 
backend contributions (yes, just the backend and not the core code) took 
at least *five* rounds of reviews in one case, and after *seven* rounds 
in another case it is still not right, despite the publicly available 
TC2 implementation to serve as a reference.

I'm sure each time a new patch set was posted, their authors honestly 
believed their code was correct.  Otherwise why would they post buggy 
code?

Now you are telling me that they should have put that code into firmware 
instead?  Can you realize what a catastrophe this would have been? Are 
you _seriously_ believing that they would be up to their 5th firmware 
revision by now?  And that updating their firmware six months after 
product launch would be as easy as updating the kernel?

Software ALWAYS has bugs, whether it is user apps, the kernel, firmware 
or boot ROM. The bigger one of those parts is, the more bugs it will 
have. And the cost to vendors for fixing those bugs grow exponentially 
down each level. For proof, we're now considering possible workarounds 
in the kernel to sidestep the difficulty with updating the firmware on a 
Chromebook.

Yet you're saying that firmware should grow code with the same 
complexity as the MCPM core, plus a machine specific backend that 
experience has proven multiple times is evidently hard to get right, 
into firmware because running Linux in secure mode is wrong?  If so we 
don't live in the same world indeed.

The day I see a firmware architecture that allows for 1) the same level 
of peer review as what we enjoy with the Linux kernel code and 2) the 
same ability to perform updates in the field as the kernel, then maybe I 
could be sold on the many advantages having generic firmware might have.  
In the meantime I consider complex firmware as a very suboptimal 
architecture with no bearing on the reality of actual short-cycled 
products, and if they prevail we'd better be ready to pile more of those 
ugly hacks in the kernel.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 20:06               ` Nicolas Pitre
@ 2014-06-07 22:36                 ` Lorenzo Pieralisi
  2014-06-07 23:53                   ` Olof Johansson
  0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Pieralisi @ 2014-06-07 22:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 07, 2014 at 09:06:36PM +0100, Nicolas Pitre wrote:
> On Sat, 7 Jun 2014, Lorenzo Pieralisi wrote:
> 
> > On Sat, Jun 07, 2014 at 05:10:27PM +0100, Nicolas Pitre wrote:
> > > On Sat, 7 Jun 2014, Abhilash Kesavan wrote:
> > > 
> > > > Hi Nicolas,
> > > > 
> > > > The first man of the incoming cluster enables its snoops via the
> > > > power_up_setup function. During secondary boot-up, this does not occur
> > > > for the boot cluster. Hence, I enable the snoops for the boot cluster
> > > > as a one-time setup from the u-boot prompt. After secondary boot-up
> > > > there is no modification that I do.
> > > 
> > > OK that's good.
> > > 
> > > > Where should this be ideally done ?
> > > 
> > > If I remember correctly, the CCI can be safely activated only when the 
> > > cache is disabled.  So that means the CCI should ideally be turned on 
> > > for the boot cluster (and *only* for the boot CPU) by the bootloader.
> > 
> > CCI ports are enabled per-cluster, so the boot loader must turn on
> > CCI for all clusters before the respective CPUs have a chance to
> > turn on their caches. It is a secure operation, this can be overriden
> > and probably that's what has been done, wrongly.
> 
> Careful.  By saying "for all clusters" you might be interpreted as 
> saying that the CCI must be turned on even for CPUs that are not powered 
> up.

Right, CCI snoops and DVMs for a cluster must be enabled by the first man
running in that cluster upon cluster power up with caches off, where the
first man must be arbitrated if multiple CPUs are racing for enabling CCI.
This is not an issue on cold boot, it is on resuming from CPUidle.

> > True, TC2 on warm-boot reenables CCI, and that's because it runs the
> > kernel in secure world, and again that's _wrong_.
> 
> Let me respectfully disagree.

You are welcome =)

> > It must be done in firmware, and I am totally against any attempt at
> > papering over what looks like a messed up firmware implementation with
> > a bunch of hacks in the kernel, because that's what the patch below is
> > (soft restarting a CPU to enable CCI ? and adding generic code for that ?
> > what's next ?)
> 
> Are you promoting for the removal of drivers/bus/arm-cci.c ?

I really do not want to go there, but I might (apart from CCI PMUs).

> You do realize that the fundamental raison d'?tre for MCPM is actually 
> to manage the race free enabling of the cache and CCI ?

Yes I do and I was willing to help implement it for TC2 to provide people
with an example on how to do it _properly_ (in secure world though, and that
was a mistake IMHO). If what we get is a workaround for every platform
going upstream, we are in a bind, seriously.

Whatever the outcome of this thread, a booting protocol update for CCI
is in order, even if we have to debate it for 6 months or more to get
an agreement.

> > I understand there is an issue and lots at stake here, but I do not want the
> > patch below to be merged in the kernel, I am sorry, it is a tad too much.
> 
> Lorenzo: Don't get me wrong.  The Chromebooks, and possibly to some 
> extent some people at Samsung, were simply too confident in their 
> ability to create absolutely bug-free firmware code to the point of not 
> making its update easy in the field.  This is completely outrageous in 
> my point of view.  Yet one of the reactions was to call upstream kernel 
> people as purists because the kernel is so much easier to modify in 
> order to cover their mess and yet that might not be accepted.  Like I 
> said I won't stop shaming them publicly for their own "incompetence" 
> just yet (no pun intended), but being excessively purist does not 
> benefit anyone either, and for that they have a point.

I do not think they do. The kernel should not become a place where firmware
bugs are fixed, if you refuse to fix the bug and this code does not get
upstream I am pretty sure next time more attention will be paid.

Booting, powering up and down cores are standard procedures, why can't we
share the same code for all v7 platforms ? Why ?

And we are not talking about a race condition hit every 10 gazillions
cycles here.

> *HOWEVER* I have no choice but to say that many people at ARM, including 
> a couple individuals for whom I nevertheless have a lot of admiration, 
> also have an incredible faith in their ability to convince themselves, 
> and then turn around to preach to the world, that *more firmware* is 
> going to be so much purer and solve so many more problems than it 
> creates and become such a magical success that we should immediately 
> dedicate our soul to the cause with no hint of a doubt.
> 
> I'm sorry to rain on your parade, but I don't believe in this one iota.
> 
> Let me repeat the MCPM story again: it took 3 people, including 2 from 
> ARM, over *six* months to get everything right and stable on TC2. I 
> think you also contributed to that effort as well. Subsequent MCPM 
> backend contributions (yes, just the backend and not the core code) took 
> at least *five* rounds of reviews in one case, and after *seven* rounds 
> in another case it is still not right, despite the publicly available 
> TC2 implementation to serve as a reference.
> 
> I'm sure each time a new patch set was posted, their authors honestly 
> believed their code was correct.  Otherwise why would they post buggy 
> code?
> 
> Now you are telling me that they should have put that code into firmware 
> instead?  Can you realize what a catastrophe this would have been? Are 
> you _seriously_ believing that they would be up to their 5th firmware 
> revision by now?  And that updating their firmware six months after 
> product launch would be as easy as updating the kernel?

If firmware messes up the DMC controller configuration, would you fix
it in the Linux kernel ? It is an unrealistic (well..) example but you should
catch my drift.

Where do we draw the line, that's my point.

> Software ALWAYS has bugs, whether it is user apps, the kernel, firmware 
> or boot ROM. The bigger one of those parts is, the more bugs it will 
> have. And the cost to vendors for fixing those bugs grow exponentially 
> down each level. For proof, we're now considering possible workarounds 
> in the kernel to sidestep the difficulty with updating the firmware on a 
> Chromebook.
> 
> Yet you're saying that firmware should grow code with the same 
> complexity as the MCPM core, plus a machine specific backend that 
> experience has proven multiple times is evidently hard to get right, 
> into firmware because running Linux in secure mode is wrong?  If so we 
> don't live in the same world indeed.
> 
> The day I see a firmware architecture that allows for 1) the same level 
> of peer review as what we enjoy with the Linux kernel code and 2) the 
> same ability to perform updates in the field as the kernel, then maybe I 
> could be sold on the many advantages having generic firmware might have.  
> In the meantime I consider complex firmware as a very suboptimal 
> architecture with no bearing on the reality of actual short-cycled 
> products, and if they prevail we'd better be ready to pile more of those 
> ugly hacks in the kernel.

Nicolas: it is not a matter of PSCI vs. MCPM, firmware vs. the kernel,
that's a debate worth having, not now.

There is a bug to solve, and you did. What I am telling you is that
I am fed up to my back teeth of having code in the kernel solving
issues that should not be solved in the kernel, if we keep doing that
hacks will become the rule not the exception. Time to draw a line,
firmware is broken, so is the platform. (BTW, if anyone can explain to
me how CPUidle works in this platform, and that's the main reason of
having MCPM in there, I am all ears).

Adding these hacks has serious maintainance consequences (eg CPUidle
code) and that's the main reason I jumped into this discussion.

Let me reiterate my point: it is not a kernel vs firmware debate, it
is about clean and maintainable code vs hackish and unmaintainable
code in the kernel.

And of top of that, people must test their code, we are not talking
about a rare race condition here, this is a blatant bug.

I have lots to add concerning the firmware implementations of power
down procedures, but let's take it offline.

I am not happy with merging your patch, I am sorry, for the reasons I've
just explained.

Lorenzo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 22:36                 ` Lorenzo Pieralisi
@ 2014-06-07 23:53                   ` Olof Johansson
  2014-06-08  0:19                     ` Russell King - ARM Linux
  2014-06-08 12:45                     ` Lorenzo Pieralisi
  0 siblings, 2 replies; 51+ messages in thread
From: Olof Johansson @ 2014-06-07 23:53 UTC (permalink / raw)
  To: linux-arm-kernel

Lorenzo,

Since you're emailing from @arm.com, some of this is to the wider
recipient and maybe not directly to you:

On Sat, Jun 7, 2014 at 3:36 PM, Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
> On Sat, Jun 07, 2014 at 09:06:36PM +0100, Nicolas Pitre wrote:
>> On Sat, 7 Jun 2014, Lorenzo Pieralisi wrote:
>>
>> > On Sat, Jun 07, 2014 at 05:10:27PM +0100, Nicolas Pitre wrote:
>> > > On Sat, 7 Jun 2014, Abhilash Kesavan wrote:
>> > >
>> > > > Hi Nicolas,
>> > > >
>> > > > The first man of the incoming cluster enables its snoops via the
>> > > > power_up_setup function. During secondary boot-up, this does not occur
>> > > > for the boot cluster. Hence, I enable the snoops for the boot cluster
>> > > > as a one-time setup from the u-boot prompt. After secondary boot-up
>> > > > there is no modification that I do.
>> > >
>> > > OK that's good.
>> > >
>> > > > Where should this be ideally done ?
>> > >
>> > > If I remember correctly, the CCI can be safely activated only when the
>> > > cache is disabled.  So that means the CCI should ideally be turned on
>> > > for the boot cluster (and *only* for the boot CPU) by the bootloader.
>> >
>> > CCI ports are enabled per-cluster, so the boot loader must turn on
>> > CCI for all clusters before the respective CPUs have a chance to
>> > turn on their caches. It is a secure operation, this can be overriden
>> > and probably that's what has been done, wrongly.
>>
>> Careful.  By saying "for all clusters" you might be interpreted as
>> saying that the CCI must be turned on even for CPUs that are not powered
>> up.
>
> Right, CCI snoops and DVMs for a cluster must be enabled by the first man
> running in that cluster upon cluster power up with caches off, where the
> first man must be arbitrated if multiple CPUs are racing for enabling CCI.
> This is not an issue on cold boot, it is on resuming from CPUidle.

That's already handled by the MCPM backend on exynos:

/*
 * Enable cluster-level coherency, in preparation for turning on the MMU.
 */
static void __naked exynos_pm_power_up_setup(unsigned int affinity_level)
{
        asm volatile ("\n"
        "cmp    r0, #1\n"
        "bxne   lr\n"
        "b      cci_enable_port_for_self");
}

>> > True, TC2 on warm-boot reenables CCI, and that's because it runs the
>> > kernel in secure world, and again that's _wrong_.
>>
>> Let me respectfully disagree.
>
> You are welcome =)
>
>> > It must be done in firmware, and I am totally against any attempt at
>> > papering over what looks like a messed up firmware implementation with
>> > a bunch of hacks in the kernel, because that's what the patch below is
>> > (soft restarting a CPU to enable CCI ? and adding generic code for that ?
>> > what's next ?)
>>
>> Are you promoting for the removal of drivers/bus/arm-cci.c ?
>
> I really do not want to go there, but I might (apart from CCI PMUs).
>
>> You do realize that the fundamental raison d'?tre for MCPM is actually
>> to manage the race free enabling of the cache and CCI ?
>
> Yes I do and I was willing to help implement it for TC2 to provide people
> with an example on how to do it _properly_ (in secure world though, and that
> was a mistake IMHO). If what we get is a workaround for every platform
> going upstream, we are in a bind, seriously.

Real life is messy. Bugs happen. Having a stance that the kernel has
to be a puritan implementation that can be done in a vacuum where bugs
in hardware and firmware don't exist (and are fixable in the right
place every single case) is not realistic. Linux is a useless piece of
software to us if that is the case.

I've seen this from several other ARM developers in the past. It's
almost like they were a couple of degrees removed from actually
working on shipping real products and dealing with real users.

> Whatever the outcome of this thread, a booting protocol update for CCI
> is in order, even if we have to debate it for 6 months or more to get
> an agreement.

Ding ding ding. Current documentation is in some very complex C
frameworks (MCPM), and some device tree bindings that obviously don't
cover this. Real documentation is obviously needed, but more than that
(see below).

I'd actually argue that you don't have a leg to stand on in your
complaints at all because:

1) There's no documentation of the requirements
2) The only existing golden platform (TC2) manages CCI similar to how
exynos does.

>> > I understand there is an issue and lots at stake here, but I do not want the
>> > patch below to be merged in the kernel, I am sorry, it is a tad too much.
>>
>> Lorenzo: Don't get me wrong.  The Chromebooks, and possibly to some
>> extent some people at Samsung, were simply too confident in their
>> ability to create absolutely bug-free firmware code to the point of not
>> making its update easy in the field.  This is completely outrageous in
>> my point of view.

I would like to clarify that firmware is updateable today. But
Chromebooks are in many ways vertically integrated products, so if a
problem is found, there's always going to be a trade-off between the
places to fix it for our own product.

I have successfully argued for fixes in firmware (or EC) for some bugs
in the past. For others we have to go with the cheaper fix. And at the
end of the day, pushing out a new firmware is a massive undertaking
compared to a kernel workaround, and doing so only because upstream
maintainers aren't happy with the way we and Samsung solved something
in our firmware+kernel combo is an argument that is hard to win -- it
doesn't affect the product at all, only those wanting to run an
upstream kernel on it.

I'm a very strong proponent of enabling upstream support for our
platform (for several reasons -- most of these are actually business
reasons for us, but also because it's the right thing to do). Finding
the trade-off for what workarounds are still reasonable to do in the
kernel for that situation is obviously hard and we're disagreeing. But
the scope for these workarounds is not large.

Personally, I'm of the opinion that if we can add a few hooks to take
care of something, or keep something contained in a piece of platform
code, then it's not a given that we have to reject the workaround. If
we have to modify core code (or substantially impact core ARM code)
then it's time to consider fixing in firmware or wherever else.

In this case, the change we're looking at is enabling the CCI port for
the boot cpu. It's perfectly containable in exynos-only code, and we
can surround it by however many comments of never ever using it as an
example for how to do it as you'll want.

>> Yet one of the reactions was to call upstream kernel
>> people as purists because the kernel is so much easier to modify in
>> order to cover their mess and yet that might not be accepted.  Like I
>> said I won't stop shaming them publicly for their own "incompetence"
>> just yet (no pun intended), but being excessively purist does not
>> benefit anyone either, and for that they have a point.
>
> I do not think they do. The kernel should not become a place where firmware
> bugs are fixed, if you refuse to fix the bug and this code does not get
> upstream I am pretty sure next time more attention will be paid.

You do realize that you have absolutely zero leverage over us on this,
right? Our product is already shipped with kernel code that fixes
this. The only parties you hurt by doing this is community developers
that want to run an upstream kernel on the platform. Apparently
several of them want to work on sorting out energy aware scheduling on
a platform that is easily available -- mostly people at Linaro and
ARM. I think it makes sense to do what we have to do to enable that
without having people tear their machines open and voiding warranties.

> Booting, powering up and down cores are standard procedures, why can't we
> share the same code for all v7 platforms ? Why ?

That's something you should ask your colleagues. ARM has historically
refused to require standards for these things, and it causes the mess
we see now. Arbitrarily changing the requirements from your part of
kernel isn't going to change the marketplace.

> And we are not talking about a race condition hit every 10 gazillions
> cycles here.
>
>> *HOWEVER* I have no choice but to say that many people at ARM, including
>> a couple individuals for whom I nevertheless have a lot of admiration,
>> also have an incredible faith in their ability to convince themselves,
>> and then turn around to preach to the world, that *more firmware* is
>> going to be so much purer and solve so many more problems than it
>> creates and become such a magical success that we should immediately
>> dedicate our soul to the cause with no hint of a doubt.
>>
>> I'm sorry to rain on your parade, but I don't believe in this one iota.
>>
>> Let me repeat the MCPM story again: it took 3 people, including 2 from
>> ARM, over *six* months to get everything right and stable on TC2. I
>> think you also contributed to that effort as well. Subsequent MCPM
>> backend contributions (yes, just the backend and not the core code) took
>> at least *five* rounds of reviews in one case, and after *seven* rounds
>> in another case it is still not right, despite the publicly available
>> TC2 implementation to serve as a reference.
>>
>> I'm sure each time a new patch set was posted, their authors honestly
>> believed their code was correct.  Otherwise why would they post buggy
>> code?
>>
>> Now you are telling me that they should have put that code into firmware
>> instead?  Can you realize what a catastrophe this would have been? Are
>> you _seriously_ believing that they would be up to their 5th firmware
>> revision by now?  And that updating their firmware six months after
>> product launch would be as easy as updating the kernel?
>
> If firmware messes up the DMC controller configuration, would you fix
> it in the Linux kernel ? It is an unrealistic (well..) example but you should
> catch my drift.

It's actually perfectly realistic, and we did exactly this in the
first ARM Chromebook. We never upstreamed the code because it only
affected a smaller number of models (it's something we did update the
RO firmware for in production).

> Where do we draw the line, that's my point.

You draw the line by giving vendors a place to do the nasty stuff that
needs to be done in a place that doesn't impact others, and where
others don't have to look. Quirk tables, fixup functions, or function
pointers that can be replaced on a specific platform if needed. When
it affects core code, you sort it out in a different way if you have
to.

There'll always be some ugly code that needs to go somewhere. This
isn't rocket science. It's something we've done in the kernel since
pretty much forever. In some cases, we can even share the quirks if
several vendors have made the same mistakes.

Maybe it's just me, but I didn't use to see this disconnected puritan
world view from people until DT came along. I don't think it's DTs
fault, but I think the requirements of DT-as-ABI has tainted the
mindset of many developers in a way that they treat everything as
needing to be polished to a perfect shine in all aspects, all the
time.

>> Software ALWAYS has bugs, whether it is user apps, the kernel, firmware
>> or boot ROM. The bigger one of those parts is, the more bugs it will
>> have. And the cost to vendors for fixing those bugs grow exponentially
>> down each level. For proof, we're now considering possible workarounds
>> in the kernel to sidestep the difficulty with updating the firmware on a
>> Chromebook.
>>
>> Yet you're saying that firmware should grow code with the same
>> complexity as the MCPM core, plus a machine specific backend that
>> experience has proven multiple times is evidently hard to get right,
>> into firmware because running Linux in secure mode is wrong?  If so we
>> don't live in the same world indeed.
>>
>> The day I see a firmware architecture that allows for 1) the same level
>> of peer review as what we enjoy with the Linux kernel code and 2) the
>> same ability to perform updates in the field as the kernel, then maybe I
>> could be sold on the many advantages having generic firmware might have.
>> In the meantime I consider complex firmware as a very suboptimal
>> architecture with no bearing on the reality of actual short-cycled
>> products, and if they prevail we'd better be ready to pile more of those
>> ugly hacks in the kernel.

I couldn't agree more. MCPM and the b.L stuff is also a brand new
concept, that the reference code was kept secret for longer than it
should have been, and few people have been through a full product
cycle of it and learned lessons. We're obviously learning several
right now on our first one.

Expecting things to be perfect from day one is not realistic.

> Nicolas: it is not a matter of PSCI vs. MCPM, firmware vs. the kernel,
> that's a debate worth having, not now.
>
> There is a bug to solve, and you did. What I am telling you is that
> I am fed up to my back teeth of having code in the kernel solving
> issues that should not be solved in the kernel, if we keep doing that
> hacks will become the rule not the exception. Time to draw a line,
> firmware is broken, so is the platform. (BTW, if anyone can explain to
> me how CPUidle works in this platform, and that's the main reason of
> having MCPM in there, I am all ears).

Punting all this to software means that there will be bugs to fix in
software. Period. In the real world, fixing those bugs will not be
practical to do in the best possible place. Life sucks.

You do know that we're arguing over a one-time setup of the boot cpu
port on the CCI and coming in and out of system suspend/resume here,
right? Everything else should already be covered by the runtime
MCPM/CCI code from Nico (together with the exynos backend). So this is
just the first CCI setup on boot as well as after S3 suspend/resume.
One-time code paths, not critical path and definitely containable
within the platform code.

> Adding these hacks has serious maintainance consequences (eg CPUidle
> code) and that's the main reason I jumped into this discussion.

> Let me reiterate my point: it is not a kernel vs firmware debate, it
> is about clean and maintainable code vs hackish and unmaintainable
> code in the kernel.

No, it's about having code that runs in the real world, versus some
random framework that doesn't actually fill a useful purpose since
nobody can make use of it without a bunch of out-of-tree code.

> And of top of that, people must test their code, we are not talking
> about a rare race condition here, this is a blatant bug.

Where's the test suite that would have caught this?

Or hell, I'll be happy to see a document that declares the requirement
of firmware to enable the CCI port on the boot cluster. You admit
above that there is none.

Samsung chose to do that in the kernel instead of firmware, nobody
involved knew better at the time and we now have to deal with it. It's
unlikely to go uncaught on the next product iteration, but things like
this happen.

> I have lots to add concerning the firmware implementations of power
> down procedures, but let's take it offline.
>
> I am not happy with merging your patch, I am sorry, for the reasons I've
> just explained.

Wow, you're going to be really, really frustrated over how the world
will start to look with all the "standardized" closed firmware
platforms and their quirks and bug workarounds we'll have to add in
the kernel.


-Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 23:53                   ` Olof Johansson
@ 2014-06-08  0:19                     ` Russell King - ARM Linux
  2014-06-08  2:52                       ` Olof Johansson
  2014-06-08 18:26                       ` Nicolas Pitre
  2014-06-08 12:45                     ` Lorenzo Pieralisi
  1 sibling, 2 replies; 51+ messages in thread
From: Russell King - ARM Linux @ 2014-06-08  0:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 07, 2014 at 04:53:34PM -0700, Olof Johansson wrote:
> You do realize that you have absolutely zero leverage over us on this,
> right? Our product is already shipped with kernel code that fixes
> this.

That is never a justification for forcing /any/ code into the kernel.

We've been here before with the iPAQs, where there were all sorts of
horrid hacks that were in the code for that device, and we said no to
it, and we kept it out of the mainline kernel, and stopped those hacks
polluting elsewhere (because people got to know on the whole that if
they used those hacks, it would bar them from mainline participation.)

There's many other instances.  Think about it - if we allowed this as
an acceptance criteria, then all that vendors have to do to get their
code into the kernel is change their development cycle: develop
product, ship product, force code into mainline as a done deal not
accepting any review comments back.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08  0:19                     ` Russell King - ARM Linux
@ 2014-06-08  2:52                       ` Olof Johansson
  2014-06-08 18:26                       ` Nicolas Pitre
  1 sibling, 0 replies; 51+ messages in thread
From: Olof Johansson @ 2014-06-08  2:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 7, 2014 at 5:19 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Sat, Jun 07, 2014 at 04:53:34PM -0700, Olof Johansson wrote:
>> You do realize that you have absolutely zero leverage over us on this,
>> right? Our product is already shipped with kernel code that fixes
>> this.
>
> That is never a justification for forcing /any/ code into the kernel.

I'm not looking to force code in, I'm just making it clear that we
have limited chance to motivate rework of this since the primary
objective of the platform has already been met: We've shipped a
product with it.

I also don't think it's really in anyone's best interest to have to
open up their device, remove write protect, download and build a
mainline u-boot and try flashing that onto the system to get it to a
state where they can use a mainline kernel. There's too much risk of
bricking your hardware if you get it wrong, and you void your
warranty.

> We've been here before with the iPAQs, where there were all sorts of
> horrid hacks that were in the code for that device, and we said no to
> it, and we kept it out of the mainline kernel, and stopped those hacks
> polluting elsewhere (because people got to know on the whole that if
> they used those hacks, it would bar them from mainline participation.)

Unfortunately, very few companies actually care about mainline
participation today to the point where I don't think they care much
any set examples. :(

> There's many other instances.  Think about it - if we allowed this as
> an acceptance criteria, then all that vendors have to do to get their
> code into the kernel is change their development cycle: develop
> product, ship product, force code into mainline as a done deal not
> accepting any review comments back.

That is of course not a suitable way of working either. Unfortunately
what is mostly happening today is that vendors are doing this: develop
product, ship product. The last step never happens. Finding a middle
ground where we can pick up some of the platform quirks without making
the kernel a big ball of hacks is of course the tricky part.

In this specific case, we're not ignoring review feedback, and the
comments we've gotten we will absolutely make sure are fixed in the
next generation of product (if/when we do another big.LITTLE platform,
etc). There's just no room to fix it for the current generation. In
hindsight, of course what should have happened is that we should have
pushed the vendor to upstream the code sooner, and we're working on
making that better in the future too.


Since we're talking about upstreaming of platform support, this is
something I'm quite passionate about so I'll rant a bit:

Looking at the general case more than just this specific instance of
Samsung Chromebooks, I think we should in general work hard (where
vendors are willing to cooperate) to make sure that we can fully
enable support for platforms to the point where they can just run
unmodified upstream. Too many of the products today aren't shipping
kernels anywhere near what's mainline: It's usually a kernel that is a
couple of years old with a few thousand patches on top. It means that
bugfixes for the platforms (and much other useful code) doesn't find
its way into the kernel, and we have a long (or nonexistent) cycle of
feedback due to it. Drivers are in some cases completely broken
because nobody actually uses the upstream versions, people post
building-but-broken code because they can't actually run and test the
patch on any existing hardware with mainline so they just forward-port
what they have in the product tree and hope for the best. Etc.

I would really like to reach a state where we have several substantial
and popular platforms that work well enough with mainline that people
can use some of the mainline-near distros to run on the machine.
Cubox-i is a great example, even though there are still some pieces
left there as well. The new Chromebooks have hardware specs that are
quite suitable to use as native development platforms (good CPUs, 4GB
ram, and micro-SD card to expand beyond the basic 16GB eMMC), so I
think it makes sense to try to enable them and pick up a minimal set
of the less than ideal code snippets for that. We'll end up with a
better supported and more solid kernel if we can get distros such as
Fedora and Debian to ship with these upstream kernels instead of the
vendor tree (which is 3.8 based in this case, and won't ever move
forward).


-Olof

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 23:53                   ` Olof Johansson
  2014-06-08  0:19                     ` Russell King - ARM Linux
@ 2014-06-08 12:45                     ` Lorenzo Pieralisi
  2014-06-08 14:34                       ` Russell King - ARM Linux
  2014-06-08 17:53                       ` Nicolas Pitre
  1 sibling, 2 replies; 51+ messages in thread
From: Lorenzo Pieralisi @ 2014-06-08 12:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 08, 2014 at 12:53:34AM +0100, Olof Johansson wrote:
> Lorenzo,
> 
> Since you're emailing from @arm.com, some of this is to the wider
> recipient and maybe not directly to you:

I am glad to reply and take blame since this is a debate definitely worth
having.

[...]

> > Right, CCI snoops and DVMs for a cluster must be enabled by the first man
> > running in that cluster upon cluster power up with caches off, where the
> > first man must be arbitrated if multiple CPUs are racing for enabling CCI.
> > This is not an issue on cold boot, it is on resuming from CPUidle.
> 
> That's already handled by the MCPM backend on exynos:
> 
> /*
>  * Enable cluster-level coherency, in preparation for turning on the MMU.
>  */
> static void __naked exynos_pm_power_up_setup(unsigned int affinity_level)
> {
>         asm volatile ("\n"
>         "cmp    r0, #1\n"
>         "bxne   lr\n"
>         "b      cci_enable_port_for_self");
> }
> 
> >> > True, TC2 on warm-boot reenables CCI, and that's because it runs the
> >> > kernel in secure world, and again that's _wrong_.
> >>
> >> Let me respectfully disagree.
> >
> > You are welcome =)
> >
> >> > It must be done in firmware, and I am totally against any attempt at
> >> > papering over what looks like a messed up firmware implementation with
> >> > a bunch of hacks in the kernel, because that's what the patch below is
> >> > (soft restarting a CPU to enable CCI ? and adding generic code for that ?
> >> > what's next ?)
> >>
> >> Are you promoting for the removal of drivers/bus/arm-cci.c ?
> >
> > I really do not want to go there, but I might (apart from CCI PMUs).
> >
> >> You do realize that the fundamental raison d'?tre for MCPM is actually
> >> to manage the race free enabling of the cache and CCI ?
> >
> > Yes I do and I was willing to help implement it for TC2 to provide people
> > with an example on how to do it _properly_ (in secure world though, and that
> > was a mistake IMHO). If what we get is a workaround for every platform
> > going upstream, we are in a bind, seriously.
> 
> Real life is messy. Bugs happen. Having a stance that the kernel has
> to be a puritan implementation that can be done in a vacuum where bugs
> in hardware and firmware don't exist (and are fixable in the right
> place every single case) is not realistic. Linux is a useless piece of
> software to us if that is the case.
> 
> I've seen this from several other ARM developers in the past. It's
> almost like they were a couple of degrees removed from actually
> working on shipping real products and dealing with real users.

Guys, do not get me wrong here. There are fixes that can be deemed
acceptable in an OS, there are fixes that can't. I just can't help thinking
that Nicolas' patch is a nasty hack (and I am far, really really far from
blaming him for that, because that's the only patch that can fix that
issue in the kernel), and he perfectly knows that.

> > Whatever the outcome of this thread, a booting protocol update for CCI
> > is in order, even if we have to debate it for 6 months or more to get
> > an agreement.
> 
> Ding ding ding. Current documentation is in some very complex C
> frameworks (MCPM), and some device tree bindings that obviously don't
> cover this. Real documentation is obviously needed, but more than that
> (see below).
> 
> I'd actually argue that you don't have a leg to stand on in your
> complaints at all because:
> 
> 1) There's no documentation of the requirements
> 2) The only existing golden platform (TC2) manages CCI similar to how
> exynos does.

1) Blame taken, nothing to say. That's why I mentioned that CCI enablement
   must be part of a boot protocol document to prevent this from happening
   again.

2) Apart from the boot cluster, but that's related to (1).

> >> > I understand there is an issue and lots at stake here, but I do not want the
> >> > patch below to be merged in the kernel, I am sorry, it is a tad too much.
> >>
> >> Lorenzo: Don't get me wrong.  The Chromebooks, and possibly to some
> >> extent some people at Samsung, were simply too confident in their
> >> ability to create absolutely bug-free firmware code to the point of not
> >> making its update easy in the field.  This is completely outrageous in
> >> my point of view.
> 
> I would like to clarify that firmware is updateable today. But
> Chromebooks are in many ways vertically integrated products, so if a
> problem is found, there's always going to be a trade-off between the
> places to fix it for our own product.
> 
> I have successfully argued for fixes in firmware (or EC) for some bugs
> in the past. For others we have to go with the cheaper fix. And at the
> end of the day, pushing out a new firmware is a massive undertaking
> compared to a kernel workaround, and doing so only because upstream
> maintainers aren't happy with the way we and Samsung solved something
> in our firmware+kernel combo is an argument that is hard to win -- it
> doesn't affect the product at all, only those wanting to run an
> upstream kernel on it.
> 
> I'm a very strong proponent of enabling upstream support for our
> platform (for several reasons -- most of these are actually business
> reasons for us, but also because it's the right thing to do). Finding
> the trade-off for what workarounds are still reasonable to do in the
> kernel for that situation is obviously hard and we're disagreeing. But
> the scope for these workarounds is not large.
> 
> Personally, I'm of the opinion that if we can add a few hooks to take
> care of something, or keep something contained in a piece of platform
> code, then it's not a given that we have to reject the workaround. If
> we have to modify core code (or substantially impact core ARM code)
> then it's time to consider fixing in firmware or wherever else.
> 
> In this case, the change we're looking at is enabling the CCI port for
> the boot cpu. It's perfectly containable in exynos-only code, and we
> can surround it by however many comments of never ever using it as an
> example for how to do it as you'll want.

I agree with what you are saying, but if for any reason someone will
copy that code to paper over yet another firmware quirk and think that's
the right thing to do, that would be grave IMHO.

So:

1) CCI requirements added to boot document
2) If Nico's patch get in we must stick a massive disclaimer summing up
   this thread in there
3) Next time the answer will be: NAK.

Do we agree ? I still have very strong feelings against (2) and I would
be very glad to avoid merging that patch.

> >> Yet one of the reactions was to call upstream kernel
> >> people as purists because the kernel is so much easier to modify in
> >> order to cover their mess and yet that might not be accepted.  Like I
> >> said I won't stop shaming them publicly for their own "incompetence"
> >> just yet (no pun intended), but being excessively purist does not
> >> benefit anyone either, and for that they have a point.
> >
> > I do not think they do. The kernel should not become a place where firmware
> > bugs are fixed, if you refuse to fix the bug and this code does not get
> > upstream I am pretty sure next time more attention will be paid.
> 
> You do realize that you have absolutely zero leverage over us on this,
> right? Our product is already shipped with kernel code that fixes
> this. The only parties you hurt by doing this is community developers
> that want to run an upstream kernel on the platform. Apparently
> several of them want to work on sorting out energy aware scheduling on
> a platform that is easily available -- mostly people at Linaro and
> ARM. I think it makes sense to do what we have to do to enable that
> without having people tear their machines open and voiding warranties.

I have zero leverage, but still the right to say what I think.
The patch fixing CCI enablement for the boot CPU is a nasty hack
and that's certainly not Nicolas' fault.

I understand your point, and I do not want to stop people from using
this platform with upstream code, actually I am the first who is happy
to see power management code getting in the mainline, but not at all costs,
because this has consequences for US.

> > Booting, powering up and down cores are standard procedures, why can't we
> > share the same code for all v7 platforms ? Why ?
> 
> That's something you should ask your colleagues. ARM has historically
> refused to require standards for these things, and it causes the mess
> we see now. Arbitrarily changing the requirements from your part of
> kernel isn't going to change the marketplace.

ARM are pushing for open trusted firmware, ARM TRMs are available to
partners with those sequences described, and I have always been willing
to support developers.

We should do more, but that does not justify these bugs, really.

> > And we are not talking about a race condition hit every 10 gazillions
> > cycles here.
> >
> >> *HOWEVER* I have no choice but to say that many people at ARM, including
> >> a couple individuals for whom I nevertheless have a lot of admiration,
> >> also have an incredible faith in their ability to convince themselves,
> >> and then turn around to preach to the world, that *more firmware* is
> >> going to be so much purer and solve so many more problems than it
> >> creates and become such a magical success that we should immediately
> >> dedicate our soul to the cause with no hint of a doubt.
> >>
> >> I'm sorry to rain on your parade, but I don't believe in this one iota.
> >>
> >> Let me repeat the MCPM story again: it took 3 people, including 2 from
> >> ARM, over *six* months to get everything right and stable on TC2. I
> >> think you also contributed to that effort as well. Subsequent MCPM
> >> backend contributions (yes, just the backend and not the core code) took
> >> at least *five* rounds of reviews in one case, and after *seven* rounds
> >> in another case it is still not right, despite the publicly available
> >> TC2 implementation to serve as a reference.
> >>
> >> I'm sure each time a new patch set was posted, their authors honestly
> >> believed their code was correct.  Otherwise why would they post buggy
> >> code?
> >>
> >> Now you are telling me that they should have put that code into firmware
> >> instead?  Can you realize what a catastrophe this would have been? Are
> >> you _seriously_ believing that they would be up to their 5th firmware
> >> revision by now?  And that updating their firmware six months after
> >> product launch would be as easy as updating the kernel?
> >
> > If firmware messes up the DMC controller configuration, would you fix
> > it in the Linux kernel ? It is an unrealistic (well..) example but you should
> > catch my drift.
> 
> It's actually perfectly realistic, and we did exactly this in the
> first ARM Chromebook. We never upstreamed the code because it only
> affected a smaller number of models (it's something we did update the
> RO firmware for in production).
> 
> > Where do we draw the line, that's my point.
> 
> You draw the line by giving vendors a place to do the nasty stuff that
> needs to be done in a place that doesn't impact others, and where
> others don't have to look. Quirk tables, fixup functions, or function
> pointers that can be replaced on a specific platform if needed. When
> it affects core code, you sort it out in a different way if you have
> to.
> 
> There'll always be some ugly code that needs to go somewhere. This
> isn't rocket science. It's something we've done in the kernel since
> pretty much forever. In some cases, we can even share the quirks if
> several vendors have made the same mistakes.
> 
> Maybe it's just me, but I didn't use to see this disconnected puritan
> world view from people until DT came along. I don't think it's DTs
> fault, but I think the requirements of DT-as-ABI has tainted the
> mindset of many developers in a way that they treat everything as
> needing to be polished to a perfect shine in all aspects, all the
> time.

Olof, it is not puritanism, it is all about upstreaming code. If we
keep accepting these hacks and we end up with mach code full of them
we have a problem, do you agree ?

You, Russell and Nico know what to do, I wanted to get my opinion
across and I think you got my point.

> >> Software ALWAYS has bugs, whether it is user apps, the kernel, firmware
> >> or boot ROM. The bigger one of those parts is, the more bugs it will
> >> have. And the cost to vendors for fixing those bugs grow exponentially
> >> down each level. For proof, we're now considering possible workarounds
> >> in the kernel to sidestep the difficulty with updating the firmware on a
> >> Chromebook.
> >>
> >> Yet you're saying that firmware should grow code with the same
> >> complexity as the MCPM core, plus a machine specific backend that
> >> experience has proven multiple times is evidently hard to get right,
> >> into firmware because running Linux in secure mode is wrong?  If so we
> >> don't live in the same world indeed.
> >>
> >> The day I see a firmware architecture that allows for 1) the same level
> >> of peer review as what we enjoy with the Linux kernel code and 2) the
> >> same ability to perform updates in the field as the kernel, then maybe I
> >> could be sold on the many advantages having generic firmware might have.
> >> In the meantime I consider complex firmware as a very suboptimal
> >> architecture with no bearing on the reality of actual short-cycled
> >> products, and if they prevail we'd better be ready to pile more of those
> >> ugly hacks in the kernel.
> 
> I couldn't agree more. MCPM and the b.L stuff is also a brand new
> concept, that the reference code was kept secret for longer than it
> should have been, and few people have been through a full product
> cycle of it and learned lessons. We're obviously learning several
> right now on our first one.
> 
> Expecting things to be perfect from day one is not realistic.

I do not buy this I am sorry. Fair enough, CCI is a new concept, but
SMP power management has been implemented in older platforms with
the same requirements, nothing new and still people are getting this
wrong.

> > Nicolas: it is not a matter of PSCI vs. MCPM, firmware vs. the kernel,
> > that's a debate worth having, not now.
> >
> > There is a bug to solve, and you did. What I am telling you is that
> > I am fed up to my back teeth of having code in the kernel solving
> > issues that should not be solved in the kernel, if we keep doing that
> > hacks will become the rule not the exception. Time to draw a line,
> > firmware is broken, so is the platform. (BTW, if anyone can explain to
> > me how CPUidle works in this platform, and that's the main reason of
> > having MCPM in there, I am all ears).
> 
> Punting all this to software means that there will be bugs to fix in
> software. Period. In the real world, fixing those bugs will not be
> practical to do in the best possible place. Life sucks.
> 
> You do know that we're arguing over a one-time setup of the boot cpu
> port on the CCI and coming in and out of system suspend/resume here,
> right? Everything else should already be covered by the runtime
> MCPM/CCI code from Nico (together with the exynos backend). So this is
> just the first CCI setup on boot as well as after S3 suspend/resume.
> One-time code paths, not critical path and definitely containable
> within the platform code.

See above. And let's hope MCPM is useful at all on this platform, which
means CPUidle can be enabled and working.

> > Adding these hacks has serious maintainance consequences (eg CPUidle
> > code) and that's the main reason I jumped into this discussion.
> 
> > Let me reiterate my point: it is not a kernel vs firmware debate, it
> > is about clean and maintainable code vs hackish and unmaintainable
> > code in the kernel.
> 
> No, it's about having code that runs in the real world, versus some
> random framework that doesn't actually fill a useful purpose since
> nobody can make use of it without a bunch of out-of-tree code.

PSCI is not a random framework, it is a standard and it runs in real
world platforms and would hide all these HW quirks where they belong.

It won't be perfect, but hey, neither is this code.

> > And of top of that, people must test their code, we are not talking
> > about a rare race condition here, this is a blatant bug.
> 
> Where's the test suite that would have caught this?
> 
> Or hell, I'll be happy to see a document that declares the requirement
> of firmware to enable the CCI port on the boot cluster. You admit
> above that there is none.

See above, blame taken but FW/HW engineers should know that a multicluster
coherent system should have coherent caches to be properly booted.

> Samsung chose to do that in the kernel instead of firmware, nobody
> involved knew better at the time and we now have to deal with it. It's
> unlikely to go uncaught on the next product iteration, but things like
> this happen.

Fair enough, provided it is the last time this happens.

> > I have lots to add concerning the firmware implementations of power
> > down procedures, but let's take it offline.
> >
> > I am not happy with merging your patch, I am sorry, for the reasons I've
> > just explained.
> 
> Wow, you're going to be really, really frustrated over how the world
> will start to look with all the "standardized" closed firmware
> platforms and their quirks and bug workarounds we'll have to add in
> the kernel.

Yes, and I will shout even louder when that will happen =)

Thanks !
Lorenzo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08 12:45                     ` Lorenzo Pieralisi
@ 2014-06-08 14:34                       ` Russell King - ARM Linux
  2014-06-08 17:53                       ` Nicolas Pitre
  1 sibling, 0 replies; 51+ messages in thread
From: Russell King - ARM Linux @ 2014-06-08 14:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 08, 2014 at 01:45:30PM +0100, Lorenzo Pieralisi wrote:
> Olof, it is not puritanism, it is all about upstreaming code. If we
> keep accepting these hacks and we end up with mach code full of them
> we have a problem, do you agree ?

To see the kind of problem that accepting hacked up code can cause, you
only have to look at Olof's build logs to see the warnings from the L2x0
cache code, which I've been totally unable to complete the cleanup of
/because/ we've historically accepted hacks into it.

I think that the legacy code is just going to have to stay (and I don't
want the warnings papered over, because I /want/ that crap to stick out
like a sore thumb), until someone can get sufficient motivation to work
out how to finally unuse the old functions.

Had I been on top of the L2 cache code earlier, and prevented these hacks
from going in (insisting that it was done properly) we would not be in
this position today where no one seems to know to fix this stuff.

This is the whole point - and nicely illustrates how easy it is for code
to become unmaintainable by accepting hacks to it.  A bit of push-back at
code acceptance time helps to save us from these problems in the future.

So, what I would want to see is not only what Lorenzo is saying (the
disclaimer in the comments) but also a technical description it, so if
the code needs to be modified in the future, we don't end up in this
kind of situation wondering what the code is doing/why the code exists.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08 12:45                     ` Lorenzo Pieralisi
  2014-06-08 14:34                       ` Russell King - ARM Linux
@ 2014-06-08 17:53                       ` Nicolas Pitre
  2014-06-09 20:47                         ` Kevin Hilman
  1 sibling, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-08 17:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 8 Jun 2014, Lorenzo Pieralisi wrote:

> On Sun, Jun 08, 2014 at 12:53:34AM +0100, Olof Johansson wrote:
> > Lorenzo,
> > 
> > Since you're emailing from @arm.com, some of this is to the wider
> > recipient and maybe not directly to you:
> 
> I am glad to reply and take blame since this is a debate definitely worth
> having.

Great.  Because I would like to steer this debate a little towards the 
genuine cause rather than sticking to some particular consequences.

> Guys, do not get me wrong here. There are fixes that can be deemed
> acceptable in an OS, there are fixes that can't. I just can't help thinking
> that Nicolas' patch is a nasty hack (and I am far, really really far from
> blaming him for that, because that's the only patch that can fix that
> issue in the kernel), and he perfectly knows that.

You know what?  The more I think about my patch, the more I consider 
this should be the standard way of setting up things unconditionally on 
_all_ platforms using MCPM.  Why? Because that's the most coherent thing 
to do!

I really think the kernel should either be responsible for the CCI or it 
should not at all.  And conversely for the bootloader.  Right now we 
have an implicit requirement that the bootloader should turn on the CCI, 
but only for cold boot, and only for the boot cluster, and not for CPU 
resuming from idle, and what other case we haven't thought about yet.  
And as noticed this requirement is not documented.

> > > Whatever the outcome of this thread, a booting protocol update for CCI
> > > is in order, even if we have to debate it for 6 months or more to get
> > > an agreement.
> > 

And in the end I don't think the CCI should have to be documented as a 
boot requirement.  Instead of having firmware implementers understand 
when they should care about the CCI and when they shouldn't, I'd much 
prefer if they hadn't to care at all. I really prefer when 
responsibility for something is well encapsulated in one place and not 
shared across layers like the firmware or the kernel depending on some 
context. The complexity of a system (and therefore the probability for 
bugs) grows with the square of the number of interrelations between 
constituent parts. So we should always strive to make the boot protocol 
_simpler_ not more complex.

And if complete responsibility for the CCI in the kernel had been 
assumed from the beginning, we wouldn't be struggling in this power play 
to determine which side should give in.  Especially since the kernel has 
all the necessary infrastructure to do it all already, and I must say in 
a rather elegant manner.

> > I'm a very strong proponent of enabling upstream support for our
> > platform (for several reasons -- most of these are actually business
> > reasons for us, but also because it's the right thing to do). Finding
> > the trade-off for what workarounds are still reasonable to do in the
> > kernel for that situation is obviously hard and we're disagreeing. But
> > the scope for these workarounds is not large.

Will people ever realize that, if the kernel was more in control of the 
hardware (isn't that the role of an OS kernel to serve as the hardware 
abstraction layer?) then we wouldn't be talking about "workarounds" but 
rather "standard fixes"?

> > In this case, the change we're looking at is enabling the CCI port for
> > the boot cpu. It's perfectly containable in exynos-only code, and we
> > can surround it by however many comments of never ever using it as an
> > example for how to do it as you'll want.

In this case, to state my opinion clearly, it is the general design that 
was flawed and the kernel should be fixed to enable the CCI for the boot 
CPU itself _when_ it knows it is going to need it.  To start with, the 
bootloader has no need what so ever for using more than one CPU, unless 
it wants to become an operating system, so it shouldn't have to care at 
all.  The kernel, if booted without the CCI information in the DTB, will 
run with only one CPU and won't rely on the CCI.  Logically the CCI 
could be left turned off in that case, possibly increasing bus 
performance and saving some power.

> I agree with what you are saying, but if for any reason someone will
> copy that code to paper over yet another firmware quirk and think that's
> the right thing to do, that would be grave IMHO.

Someone shouldn't have to copy that code because I'm getting more and 
more convinced it should be made generic and unconditional, and by doing 
so removing any possibility for firmware to get that part wrong again.  
According to my quick experiment on TC2, this took only 271 microseconds 
to perform so this is not like if that would make a significant 
difference in boot time.

> > > I do not think they do. The kernel should not become a place where firmware
> > > bugs are fixed, if you refuse to fix the bug and this code does not get
> > > upstream I am pretty sure next time more attention will be paid.

Again, this is coming about because firmware is a MAGNITUDE harder to 
fix and IMPOSSIBLE to be bug free, just like any other software. So if I 
may get back to the genuine cause for this debate: this came about 
because of TOO MUCH firmware code and encouraging people to create more 
of it is *BAD*.

Sure, in the server world you are likely to want firmware and standards 
because that helps bringing maintenance costs down.  But server 
equipment has much longer life cycles than mobile devices and somewhat 
less aggressive and complex power management to perform. Firmware for 
servers may take *time* to be developed, tested, certified, etc.  To 
illustrate this, we've been working on UEFI and ACPI for a period tat 
can be measured in years at this point.  So, hopefully by the time 
server oriented firmware is available, it would be well tested and 
relied upon for a long time.

none of the above applies to consumer products with fast development and 
short life cycles.

> I understand your point, and I do not want to stop people from using
> this platform with upstream code, actually I am the first who is happy
> to see power management code getting in the mainline, but not at all costs,
> because this has consequences for US.

And those consequences are?

> ARM are pushing for open trusted firmware, ARM TRMs are available to
> partners with those sequences described, and I have always been willing
> to support developers.

Ahhhh...  Here we are. "ARM are pushing for open trusted firmware ..."

> We should do more, but that does not justify these bugs, really.

Bugs are never justifiable, but they happen _all_ the time.

Firmware is a MAGNITUDE harder to fix, and IMPOSSIBLE to be bug free
just like any other software.

> > > Where do we draw the line, that's my point.
> > 
> > You draw the line by giving vendors a place to do the nasty stuff that
> > needs to be done in a place that doesn't impact others, and where
> > others don't have to look. Quirk tables, fixup functions, or function
> > pointers that can be replaced on a specific platform if needed. When
> > it affects core code, you sort it out in a different way if you have
> > to.

Again this is missing the point.  No line would need to be drawn if the 
core code was responsible in the first place.  DMC parameters are 
conceptually so trivial that no one should normally mess that up, and 
the firmware must do it just so that memory is usable.  So there is no 
choice but to do that in firmware.  It is a completely different story 
with complex operations which should never ever be relegated to 
firmware.

> > Maybe it's just me, but I didn't use to see this disconnected puritan
> > world view from people until DT came along. I don't think it's DTs
> > fault, but I think the requirements of DT-as-ABI has tainted the
> > mindset of many developers in a way that they treat everything as
> > needing to be polished to a perfect shine in all aspects, all the
> > time.
> 
> Olof, it is not puritanism, it is all about upstreaming code. If we
> keep accepting these hacks and we end up with mach code full of them
> we have a problem, do you agree ?

Absolutely!

So once again, let's take a step back, open our eyes and look at the 
fundamental reason why hacks are there, and how they could fundamentally 
be avoided.  And no, hoping for fewer bugs in firmware is not realistic 
if people are encouraged to create more of it.

> > Expecting things to be perfect from day one is not realistic.
> 
> I do not buy this I am sorry. Fair enough, CCI is a new concept, but
> SMP power management has been implemented in older platforms with
> the same requirements, nothing new and still people are getting this
> wrong.

Lorenzo: what you say is not exact.  People screwed SMP power management 
in the past for sure.  And they still will because requirement are 
changing all the time they're not the same.  Maybe requirements are 
somewhat stable in the server space, but in the mobile space they're 
not. So this must be implemented where it is cheapest to fix.

> > > Nicolas: it is not a matter of PSCI vs. MCPM, firmware vs. the kernel,
> > > that's a debate worth having, not now.

Why not?  I'm saying that too much firmware is a fundamental design 
mistake for consumer products.  All the rest falls off from that.  Why 
not addressing the source of the problem rather than constantly 
suffering and debating its consequences?

Again I want to clearly state that I have nothing against PSCI the 
interface spec despite the appearances.  I've reviewed its draft and 
provided comments, etc.  My point is, when taking a step back, we may 
only conclude that more firmware does not create a better system overall 
because of real life costs and constraints associated to it. So PSCI is 
not the problem, the problem is at another conceptual level.

> > > Adding these hacks has serious maintainance consequences (eg CPUidle
> > > code) and that's the main reason I jumped into this discussion.

Sorry, I don't see the connection.

> > > Let me reiterate my point: it is not a kernel vs firmware debate,

But of *course* it is, unless you're too invested in your firmware 
strategy to be able to see all the downsides.

> > > it is about clean and maintainable code vs hackish and 
> > > unmaintainable code in the kernel.

No argument there.  Unfortunately, hackish code comes about because of 
broken firmware in most cases.  Kernel code can be cleaned at any moment 
but in practice firmware code cannot.

> > No, it's about having code that runs in the real world, versus some
> > random framework that doesn't actually fill a useful purpose since
> > nobody can make use of it without a bunch of out-of-tree code.
> 
> PSCI is not a random framework, it is a standard and it runs in real
> world platforms and would hide all these HW quirks where they belong.

Which real world platforms?  I'm curious.

> > Wow, you're going to be really, really frustrated over how the world
> > will start to look with all the "standardized" closed firmware
> > platforms and their quirks and bug workarounds we'll have to add in
> > the kernel.
> 
> Yes, and I will shout even louder when that will happen =)

That _will_ happen. Such is life. And you'll have only yourself to blame 
because you pushed for bigger firmware to be created in the first place.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08  0:19                     ` Russell King - ARM Linux
  2014-06-08  2:52                       ` Olof Johansson
@ 2014-06-08 18:26                       ` Nicolas Pitre
  2014-06-08 18:29                         ` Russell King - ARM Linux
  1 sibling, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-08 18:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 8 Jun 2014, Russell King - ARM Linux wrote:

> On Sat, Jun 07, 2014 at 04:53:34PM -0700, Olof Johansson wrote:
> > You do realize that you have absolutely zero leverage over us on this,
> > right? Our product is already shipped with kernel code that fixes
> > this.
> 
> That is never a justification for forcing /any/ code into the kernel.
> 
> We've been here before with the iPAQs, where there were all sorts of
> horrid hacks that were in the code for that device, and we said no to
> it, and we kept it out of the mainline kernel, and stopped those hacks
> polluting elsewhere (because people got to know on the whole that if
> they used those hacks, it would bar them from mainline participation.)

That's different.  The iPaq had very little in terms of firmware, and 
the one it had was field upgradable with almost no risk to brick it 
(unless you wanted to hack the firmware code but that's another story).  
The reason iPaq had never been well supported in mainline can be 
attributed to laziness for not wanting to make the code into a shape 
acceptable for mainline inclusion.  But nothing fundamentally prevented 
that from happening if someone had wanted to do it.

Here we're talking about firmware-induced hacks for which, had there 
been no firmware, the kernel would be in full position to fix properly 
because that would have been a kernel bug after all.

Hence my crusade against this "you should abstract things in firmware" 
mantra. Especially for mobile devices.

But, because firmware already exists out there and it is between 
prohibitive and impossible to fix, we have no choice but to compensate 
in the kernel some times.  In this very case my approach is to render 
the firmware irrelevant globally rather than trying to work around it 
for one particular device.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08 18:26                       ` Nicolas Pitre
@ 2014-06-08 18:29                         ` Russell King - ARM Linux
  2014-06-08 18:55                           ` Nicolas Pitre
  0 siblings, 1 reply; 51+ messages in thread
From: Russell King - ARM Linux @ 2014-06-08 18:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 08, 2014 at 02:26:43PM -0400, Nicolas Pitre wrote:
> On Sun, 8 Jun 2014, Russell King - ARM Linux wrote:
> 
> > On Sat, Jun 07, 2014 at 04:53:34PM -0700, Olof Johansson wrote:
> > > You do realize that you have absolutely zero leverage over us on this,
> > > right? Our product is already shipped with kernel code that fixes
> > > this.
> > 
> > That is never a justification for forcing /any/ code into the kernel.
> > 
> > We've been here before with the iPAQs, where there were all sorts of
> > horrid hacks that were in the code for that device, and we said no to
> > it, and we kept it out of the mainline kernel, and stopped those hacks
> > polluting elsewhere (because people got to know on the whole that if
> > they used those hacks, it would bar them from mainline participation.)
> 
> That's different.  The iPaq had very little in terms of firmware, and 
> the one it had was field upgradable with almost no risk to brick it 
> (unless you wanted to hack the firmware code but that's another story).  
> The reason iPaq had never been well supported in mainline can be 
> attributed to laziness for not wanting to make the code into a shape 
> acceptable for mainline inclusion.  But nothing fundamentally prevented 
> that from happening if someone had wanted to do it.

No, I was specifically thinking about the various iPAQ specific things
like the additional platform specific ATAGs that they invented with
zero reference to mainline, and then expected them to be accepted as-is.

I gave them a very clear message over that (which was "no way") and that
was essentially the last of their mainline submissions.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08 18:29                         ` Russell King - ARM Linux
@ 2014-06-08 18:55                           ` Nicolas Pitre
  2014-06-08 19:02                             ` Russell King - ARM Linux
  0 siblings, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-08 18:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 8 Jun 2014, Russell King - ARM Linux wrote:

> On Sun, Jun 08, 2014 at 02:26:43PM -0400, Nicolas Pitre wrote:
> > On Sun, 8 Jun 2014, Russell King - ARM Linux wrote:
> > 
> > > On Sat, Jun 07, 2014 at 04:53:34PM -0700, Olof Johansson wrote:
> > > > You do realize that you have absolutely zero leverage over us on this,
> > > > right? Our product is already shipped with kernel code that fixes
> > > > this.
> > > 
> > > That is never a justification for forcing /any/ code into the kernel.
> > > 
> > > We've been here before with the iPAQs, where there were all sorts of
> > > horrid hacks that were in the code for that device, and we said no to
> > > it, and we kept it out of the mainline kernel, and stopped those hacks
> > > polluting elsewhere (because people got to know on the whole that if
> > > they used those hacks, it would bar them from mainline participation.)
> > 
> > That's different.  The iPaq had very little in terms of firmware, and 
> > the one it had was field upgradable with almost no risk to brick it 
> > (unless you wanted to hack the firmware code but that's another story).  
> > The reason iPaq had never been well supported in mainline can be 
> > attributed to laziness for not wanting to make the code into a shape 
> > acceptable for mainline inclusion.  But nothing fundamentally prevented 
> > that from happening if someone had wanted to do it.
> 
> No, I was specifically thinking about the various iPAQ specific things
> like the additional platform specific ATAGs that they invented with
> zero reference to mainline, and then expected them to be accepted as-is.
> 
> I gave them a very clear message over that (which was "no way") and that
> was essentially the last of their mainline submissions.

That's basically what I'm saying.  They were too lazy to fix their code.  
But nothing prevented that otherwise. They certainly couldn't use the 
"firmware is broken and too hard to update" excuse.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08 18:55                           ` Nicolas Pitre
@ 2014-06-08 19:02                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 51+ messages in thread
From: Russell King - ARM Linux @ 2014-06-08 19:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 08, 2014 at 02:55:03PM -0400, Nicolas Pitre wrote:
> On Sun, 8 Jun 2014, Russell King - ARM Linux wrote:
> > No, I was specifically thinking about the various iPAQ specific things
> > like the additional platform specific ATAGs that they invented with
> > zero reference to mainline, and then expected them to be accepted as-is.
> > 
> > I gave them a very clear message over that (which was "no way") and that
> > was essentially the last of their mainline submissions.
> 
> That's basically what I'm saying.  They were too lazy to fix their code.  
> But nothing prevented that otherwise. They certainly couldn't use the 
> "firmware is broken and too hard to update" excuse.

No, they just continued to use those ATAGs that they invented.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-07 16:10           ` Nicolas Pitre
  2014-06-07 17:56             ` Lorenzo Pieralisi
@ 2014-06-09 20:27             ` Kevin Hilman
  2014-06-09 20:35               ` Doug Anderson
  1 sibling, 1 reply; 51+ messages in thread
From: Kevin Hilman @ 2014-06-09 20:27 UTC (permalink / raw)
  To: linux-arm-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> On Sat, 7 Jun 2014, Abhilash Kesavan wrote:
>
>> Hi Nicolas,
>> 
>> The first man of the incoming cluster enables its snoops via the
>> power_up_setup function. During secondary boot-up, this does not occur
>> for the boot cluster. Hence, I enable the snoops for the boot cluster
>> as a one-time setup from the u-boot prompt. After secondary boot-up
>> there is no modification that I do.
>
> OK that's good.
>
>> Where should this be ideally done ?
>
> If I remember correctly, the CCI can be safely activated only when the 
> cache is disabled.  So that means the CCI should ideally be turned on 
> for the boot cluster (and *only* for the boot CPU) by the bootloader.
>
> Now... If you _really_ prefer to do it from the kernel to avoid 
> difficulties with bootloader updates, then it should be possible to do 
> it from the kernel by temporarily turning the cache off.  This is not a 
> small thing but the MCPM infrastructure can be leveraged.  Here's what I 
> tried on a TC2 which might just work for you as well:

FWIW, I dropped the u-boot hack I was using to enable CCI and tested
this patch (with a cut/paste of the TC2 specific stuff into
mach-exynos/mcpm-exynos.c) along with Doug's patch[1] and 
and confirm that all 8 cores boot up on the Chromebook2 using linux-next.

Kevin

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-June/262440.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-09 20:27             ` Kevin Hilman
@ 2014-06-09 20:35               ` Doug Anderson
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Anderson @ 2014-06-09 20:35 UTC (permalink / raw)
  To: linux-arm-kernel

Kevin and Nicolas,

On Mon, Jun 9, 2014 at 1:27 PM, Kevin Hilman <khilman@linaro.org> wrote:
> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
>
>> On Sat, 7 Jun 2014, Abhilash Kesavan wrote:
>>
>>> Hi Nicolas,
>>>
>>> The first man of the incoming cluster enables its snoops via the
>>> power_up_setup function. During secondary boot-up, this does not occur
>>> for the boot cluster. Hence, I enable the snoops for the boot cluster
>>> as a one-time setup from the u-boot prompt. After secondary boot-up
>>> there is no modification that I do.
>>
>> OK that's good.
>>
>>> Where should this be ideally done ?
>>
>> If I remember correctly, the CCI can be safely activated only when the
>> cache is disabled.  So that means the CCI should ideally be turned on
>> for the boot cluster (and *only* for the boot CPU) by the bootloader.
>>
>> Now... If you _really_ prefer to do it from the kernel to avoid
>> difficulties with bootloader updates, then it should be possible to do
>> it from the kernel by temporarily turning the cache off.  This is not a
>> small thing but the MCPM infrastructure can be leveraged.  Here's what I
>> tried on a TC2 which might just work for you as well:
>
> FWIW, I dropped the u-boot hack I was using to enable CCI and tested
> this patch (with a cut/paste of the TC2 specific stuff into
> mach-exynos/mcpm-exynos.c) along with Doug's patch[1] and
> and confirm that all 8 cores boot up on the Chromebook2 using linux-next.
>
> Kevin
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-June/262440.html

Agreed.  Nicolas's patch <https://patchwork.kernel.org/patch/4315711/>
plus the copy/paste to exynos made things boot for me, too.

-Doug

---

Reference of the copy/paste to exynos (though gmail is munging my tabs):

diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
b/arch/arm/mach-exynos/mcpm-exynos.c
index ace0ed6..218b9ff 100644
--- a/arch/arm/mach-exynos/mcpm-exynos.c
+++ b/arch/arm/mach-exynos/mcpm-exynos.c
@@ -295,6 +295,25 @@ static const struct of_device_id exynos_dt_mcpm_match[] = {
        {},
 };

+int mcpm_loopback(void (*cache_disable)(void));
+static void exynos_cache_down(void)
+{
+       pr_warn("exynos: disabling cache\n");
+       if (read_cpuid_part_number() == ARM_CPU_PART_CORTEX_A15) {
+               /*
+                * On the Cortex-A15 we need to disable
+                * L2 prefetching before flushing the cache.
+                */
+               asm volatile(
+               "mcr    p15, 1, %0, c15, c0, 3 \n\t"
+               "isb    \n\t"
+               "dsb    "
+               : : "r" (0x400) );
+       }
+       v7_exit_coherency_flush(all);
+       cci_disable_port_by_cpu(read_cpuid_mpidr());
+}
+
 static int __init exynos_mcpm_init(void)
 {
        struct device_node *node;
@@ -336,6 +355,7 @@ static int __init exynos_mcpm_init(void)
                iounmap(ns_sram_base_addr);
                return ret;
        }
+       BUG_ON(mcpm_loopback(exynos_cache_down) != 0);

        mcpm_smp_set_ops();

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-08 17:53                       ` Nicolas Pitre
@ 2014-06-09 20:47                         ` Kevin Hilman
  2014-06-09 22:26                           ` Lorenzo Pieralisi
  0 siblings, 1 reply; 51+ messages in thread
From: Kevin Hilman @ 2014-06-09 20:47 UTC (permalink / raw)
  To: linux-arm-kernel

Nicolas Pitre <nicolas.pitre@linaro.org> writes:

> On Sun, 8 Jun 2014, Lorenzo Pieralisi wrote:
>
>> On Sun, Jun 08, 2014 at 12:53:34AM +0100, Olof Johansson wrote:
>> > Lorenzo,
>> > 
>> > Since you're emailing from @arm.com, some of this is to the wider
>> > recipient and maybe not directly to you:
>> 
>> I am glad to reply and take blame since this is a debate definitely worth
>> having.
>
> Great.  Because I would like to steer this debate a little towards the 
> genuine cause rather than sticking to some particular consequences.
>
>> Guys, do not get me wrong here. There are fixes that can be deemed
>> acceptable in an OS, there are fixes that can't. I just can't help thinking
>> that Nicolas' patch is a nasty hack (and I am far, really really far from
>> blaming him for that, because that's the only patch that can fix that
>> issue in the kernel), and he perfectly knows that.
>
> You know what?  The more I think about my patch, the more I consider 
> this should be the standard way of setting up things unconditionally on 
> _all_ platforms using MCPM.  Why? Because that's the most coherent thing 
> to do!

I agree.

> I really think the kernel should either be responsible for the CCI or it 
> should not at all.  And conversely for the bootloader.  Right now we 
> have an implicit requirement that the bootloader should turn on the CCI, 
> but only for cold boot, and only for the boot cluster, and not for CPU 
> resuming from idle, and what other case we haven't thought about yet.  
> And as noticed this requirement is not documented.

In addition to being a firmware minimalist like Nico, what I find most
objectional to the bootloader approach is that even with CCI enabled by
the firmware, since it's a runtime requirement (for low-power idle or
suspend), the kernel has to handle it anyways.  So you end up with a
partial solution in the firwmare (for boot cluster only) *and* a full
solution in the kernel.  This doesn't make any sense, expecially because
the kernel might then have to do things differently on cold boot
vs. low-power idle/suspend or differently on the boot cluster vs. other
clusters.  From a maintenance PoV, this is a mess and could easily lead
to just as many SoC specific hacks that are different across platforms.

Stated more simply: If the kernel has to manage the resource at runtime
due to low-power idle/suspend.  I don't see any reason why it shouldn't
manage it at cold boot time also.

Kevin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-09 20:47                         ` Kevin Hilman
@ 2014-06-09 22:26                           ` Lorenzo Pieralisi
  2014-06-10  4:25                             ` Nicolas Pitre
  0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Pieralisi @ 2014-06-09 22:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 09, 2014 at 09:47:42PM +0100, Kevin Hilman wrote:
> Nicolas Pitre <nicolas.pitre@linaro.org> writes:
> 
> > On Sun, 8 Jun 2014, Lorenzo Pieralisi wrote:
> >
> >> On Sun, Jun 08, 2014 at 12:53:34AM +0100, Olof Johansson wrote:
> >> > Lorenzo,
> >> > 
> >> > Since you're emailing from @arm.com, some of this is to the wider
> >> > recipient and maybe not directly to you:
> >> 
> >> I am glad to reply and take blame since this is a debate definitely worth
> >> having.
> >
> > Great.  Because I would like to steer this debate a little towards the 
> > genuine cause rather than sticking to some particular consequences.

I commented on Nico's patch because I did not like how it was
implemented (at least remove the CPU PM notifier calls please, because
they are not needed). I also said that's the only thing he could do,
and I still think that's not a nice way to use the cpu_suspend API
for something it was not designed for, that's what I wanted to say,
period.

> >> Guys, do not get me wrong here. There are fixes that can be deemed
> >> acceptable in an OS, there are fixes that can't. I just can't help thinking
> >> that Nicolas' patch is a nasty hack (and I am far, really really far from
> >> blaming him for that, because that's the only patch that can fix that
> >> issue in the kernel), and he perfectly knows that.
> >
> > You know what?  The more I think about my patch, the more I consider 
> > this should be the standard way of setting up things unconditionally on 
> > _all_ platforms using MCPM.  Why? Because that's the most coherent thing 
> > to do!
> 
> I agree.
> 
> > I really think the kernel should either be responsible for the CCI or it 
> > should not at all.  And conversely for the bootloader.  Right now we 
> > have an implicit requirement that the bootloader should turn on the CCI, 
> > but only for cold boot, and only for the boot cluster, and not for CPU 
> > resuming from idle, and what other case we haven't thought about yet.  
> > And as noticed this requirement is not documented.
> 
> In addition to being a firmware minimalist like Nico, what I find most
> objectional to the bootloader approach is that even with CCI enabled by
> the firmware, since it's a runtime requirement (for low-power idle or
> suspend), the kernel has to handle it anyways.  So you end up with a
> partial solution in the firwmare (for boot cluster only) *and* a full
> solution in the kernel.  This doesn't make any sense, expecially because
> the kernel might then have to do things differently on cold boot
> vs. low-power idle/suspend or differently on the boot cluster vs. other
> clusters.  From a maintenance PoV, this is a mess and could easily lead
> to just as many SoC specific hacks that are different across platforms.
> 
> Stated more simply: If the kernel has to manage the resource at runtime
> due to low-power idle/suspend.  I don't see any reason why it shouldn't
> manage it at cold boot time also.

If I am allowed to say something, here is a couple of thoughts.

1) CCI snoops and DVM enablement are secure operations, to do them in
   non-secure world this must be overriden in firmware. You can argue,
   you can think whatever you want, that's a fact. So, to use this
   code SMP bit in SCTLR and CCI enablement must become non-secure
   operations. This is a boot requirement for MCPM, right or wrong
   it is up to platform designers to judge. If CCI and SMP enablement
   are secure operations, we should not start adding random SMC calls
   in the kernel, since managing coherency in the kernel would become
   problematic, with lots of platform quirks. We do not want that to
   happen, and I think we all agree on this.
2) (1) must be documented.
3) When I talked about consequences for CPUidle (implicit), I was referring
   to all sort of hacks we had to come up to bring devices like SPC
   (remember ? I remember very very well unfortunately for me), or whatever
   power controller up in the kernel early, too early to fit in any
   existing kernel device framework. There is still no solution to that, and
   the only way that code can exist is in mach- code. Right or wrong,
   that's a second fact and in my opinion that's not nice for the ARM
   kernel.
4) When I am talking about firmware I am talking about sequences that
   are very close to HW (disabling C bit, cleaning caches, exiting
   coherency). Erratas notwithstanding, they are being standardized at
   ARM the best we can. They might even end up being implemented in HW
   in the not so far future. I understand they are tricky, I understand
   they take lots of time to implement them and to debug them, what I
   want to say is that they are becoming standard and we _must_ reuse the
   same code for all ARM platforms. You can implement them in MCPM (see
   (1)) or in firmware (and please do not start painting me as firmware
   hugger here, I am referring to standard power down sequences that
   again, are very close to HW state machines and more importantly if they
   HAVE to run in secure world that's the only solution we have unless you
   want to split race conditions between kernel and secure world).
5) I agree that the CCI enablement in TC2 (bootmon for cold boot and
   kernel for warm-boot is wrong, nothing to say and it was not the
   reason I commented on Nico's patch - I think I explained to you
   thoroughly why now).

That's what I had to say, I hope it helps.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-09 22:26                           ` Lorenzo Pieralisi
@ 2014-06-10  4:25                             ` Nicolas Pitre
  2014-06-10  9:19                               ` Lorenzo Pieralisi
  2014-06-10 14:14                               ` Catalin Marinas
  0 siblings, 2 replies; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-10  4:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 9 Jun 2014, Lorenzo Pieralisi wrote:

> I commented on Nico's patch because I did not like how it was
> implemented (at least remove the CPU PM notifier calls please, because
> they are not needed).

OK no problem.  That's easy enough.  I added them to play it safe as a 
test patch in case some VFP content could be lost somehow by looping 
back through the CPU init code for example, and needed to be saved.

> I also said that's the only thing he could do, and I still think 
> that's not a nice way to use the cpu_suspend API for something it was 
> not designed for, that's what I wanted to say, period.

Well... Maybe it wasn't designed for that, but it certainly can be used 
for that. And with no modifications to the core code, making this 
solution fairly elegant.  This is not so different from, say, the BPF 
code being reused for seccomp_filters. BPF wasn't designed for system 
call filtering, but it happens to work well.

> If I am allowed to say something, here is a couple of thoughts.
> 
> 1) CCI snoops and DVM enablement are secure operations, to do them in
>    non-secure world this must be overriden in firmware. You can argue,
>    you can think whatever you want, that's a fact. So, to use this
>    code SMP bit in SCTLR and CCI enablement must become non-secure
>    operations. This is a boot requirement for MCPM, right or wrong
>    it is up to platform designers to judge. If CCI and SMP enablement
>    are secure operations, we should not start adding random SMC calls
>    in the kernel, since managing coherency in the kernel would become
>    problematic, with lots of platform quirks. We do not want that to
>    happen, and I think we all agree on this.

One could certainly question the need for so many controls handled in 
secure world.  But that is not the point.

Here we're talking about MCPM.  That implies the kernel has control over 
SCTLR.SMP and the CCI.  If those things aren't under the kernel's 
control, then MCPM is of no use to you.

Therefore, if you do want to use MCPM, then this implies the kernel has 
access to the CCI. And if it has access to it, then it should turn it on 
by itself in all cases to be consistent, not only in half of the cases.

> 2) (1) must be documented.

Absolutely.  But let's be coherent in the implementation so the 
documentation is as simple as it can be.

> 3) When I talked about consequences for CPUidle (implicit), I was referring
>    to all sort of hacks we had to come up to bring devices like SPC
>    (remember ? I remember very very well unfortunately for me), or whatever
>    power controller up in the kernel early, too early to fit in any
>    existing kernel device framework. There is still no solution to that, and
>    the only way that code can exist is in mach- code. Right or wrong,
>    that's a second fact and in my opinion that's not nice for the ARM
>    kernel.

I disagree.  This can perfectly be turned into driver code. If we need 
it too early for existing kernel device framework to handle this 
properly, then the solution is to extend the existing framework or 
create another one specially for that purpose.  This may not be obvious 
when TC2 is the first/only platform in that situation, but if more 
platforms have the same need then it'll be easier to abstract 
commonalities into a framework.

Saying that no framework exists today or/and upstream maintainers are 
being difficult is _not_ a reason for throwing your hands up and e.g. 
shoving all this code into firmware instead.

> 4) When I am talking about firmware I am talking about sequences that
>    are very close to HW (disabling C bit, cleaning caches, exiting
>    coherency). Erratas notwithstanding, they are being standardized at
>    ARM the best we can. They might even end up being implemented in HW
>    in the not so far future. I understand they are tricky, I understand
>    they take lots of time to implement them and to debug them, what I
>    want to say is that they are becoming standard and we _must_ reuse the
>    same code for all ARM platforms. You can implement them in MCPM (see
>    (1)) or in firmware (and please do not start painting me as firmware
>    hugger here, I am referring to standard power down sequences that
>    again, are very close to HW state machines 

That's where the disconnect lies.  On the one hand you say "I understand 
they are tricky, I understand they take lots of time to implement them 
and to debug them" and on the other hand you say "They might end up being 
implemented in HW in the not so far future."  That simply makes no 
economical sense at all!

When some operation is 1) tricky and takes time to debug, and 2) not 
performance critical (no one is trying to get in and out of idle or 
hibernation a billion times per second), then you should never ever put 
such a thing in firmware, and hardware should be completely out of the 
question!

>    and more importantly if they
>    HAVE to run in secure world that's the only solution we have unless you
>    want to split race conditions between kernel and secure world).

If they HAVE to run in secure world then your secure world architecture 
is simply misdesigned, period.  Someone must have ignored the economics 
of modern software development to have come up with this.

> 5) I agree that the CCI enablement in TC2 (bootmon for cold boot and
>    kernel for warm-boot is wrong, nothing to say and it was not the
>    reason I commented on Nico's patch - I think I explained to you
>    thoroughly why now).

OK. Let's start by agreeing on the spirit behind my patch then.  The 
actual patch implementation details are a secondary concern and open for 
discussion.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-10  4:25                             ` Nicolas Pitre
@ 2014-06-10  9:19                               ` Lorenzo Pieralisi
  2014-06-10 14:14                               ` Catalin Marinas
  1 sibling, 0 replies; 51+ messages in thread
From: Lorenzo Pieralisi @ 2014-06-10  9:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 10, 2014 at 05:25:47AM +0100, Nicolas Pitre wrote:
> On Mon, 9 Jun 2014, Lorenzo Pieralisi wrote:
> 
> > I commented on Nico's patch because I did not like how it was
> > implemented (at least remove the CPU PM notifier calls please, because
> > they are not needed).
> 
> OK no problem.  That's easy enough.  I added them to play it safe as a 
> test patch in case some VFP content could be lost somehow by looping 
> back through the CPU init code for example, and needed to be saved.

Ok, thanks.

> > I also said that's the only thing he could do, and I still think 
> > that's not a nice way to use the cpu_suspend API for something it was 
> > not designed for, that's what I wanted to say, period.
> 
> Well... Maybe it wasn't designed for that, but it certainly can be used 
> for that. And with no modifications to the core code, making this 
> solution fairly elegant.  This is not so different from, say, the BPF 
> code being reused for seccomp_filters. BPF wasn't designed for system 
> call filtering, but it happens to work well.

You defined yourself "not a small thing", you know what you are doing
and that's enough for me. Please respect that when I reviewed it I thought
that was a hack. cpu_suspend is being used for many things in the kernel
and consolidating that took a while, please comment the code, that's all I
am asking you.

> > If I am allowed to say something, here is a couple of thoughts.
> > 
> > 1) CCI snoops and DVM enablement are secure operations, to do them in
> >    non-secure world this must be overriden in firmware. You can argue,
> >    you can think whatever you want, that's a fact. So, to use this
> >    code SMP bit in SCTLR and CCI enablement must become non-secure
> >    operations. This is a boot requirement for MCPM, right or wrong
> >    it is up to platform designers to judge. If CCI and SMP enablement
> >    are secure operations, we should not start adding random SMC calls
> >    in the kernel, since managing coherency in the kernel would become
> >    problematic, with lots of platform quirks. We do not want that to
> >    happen, and I think we all agree on this.
> 
> One could certainly question the need for so many controls handled in 
> secure world.  But that is not the point.
> 
> Here we're talking about MCPM.  That implies the kernel has control over 
> SCTLR.SMP and the CCI.  If those things aren't under the kernel's 
> control, then MCPM is of no use to you.

ACTLR.SMP for the sake of precision and it was my typo, sorry. That's
all I wanted to read, so nothing to add.

> Therefore, if you do want to use MCPM, then this implies the kernel has 
> access to the CCI. And if it has access to it, then it should turn it on 
> by itself in all cases to be consistent, not only in half of the cases.

I agree.

> > 2) (1) must be documented.
> 
> Absolutely.  But let's be coherent in the implementation so the 
> documentation is as simple as it can be.

Ditto.

> > 3) When I talked about consequences for CPUidle (implicit), I was referring
> >    to all sort of hacks we had to come up to bring devices like SPC
> >    (remember ? I remember very very well unfortunately for me), or whatever
> >    power controller up in the kernel early, too early to fit in any
> >    existing kernel device framework. There is still no solution to that, and
> >    the only way that code can exist is in mach- code. Right or wrong,
> >    that's a second fact and in my opinion that's not nice for the ARM
> >    kernel.
> 
> I disagree.  This can perfectly be turned into driver code. If we need 
> it too early for existing kernel device framework to handle this 
> properly, then the solution is to extend the existing framework or 
> create another one specially for that purpose.  This may not be obvious 
> when TC2 is the first/only platform in that situation, but if more 
> platforms have the same need then it'll be easier to abstract 
> commonalities into a framework.
> 
> Saying that no framework exists today or/and upstream maintainers are 
> being difficult is _not_ a reason for throwing your hands up and e.g. 
> shoving all this code into firmware instead.

You have a point, as long as we are all aware and we do not forget this
is a major problem, not a minor one. I do want to see a consolidate
story for CPUidle for ARM and this bullet is definitely part of the
picture. On a side note, you made me smile, it sounded like I wanted
to bury SPC code in firmware or anywhere else as long as it is not in the
kernel, which in a way is a true statement since I abhor that code =)

> > 4) When I am talking about firmware I am talking about sequences that
> >    are very close to HW (disabling C bit, cleaning caches, exiting
> >    coherency). Erratas notwithstanding, they are being standardized at
> >    ARM the best we can. They might even end up being implemented in HW
> >    in the not so far future. I understand they are tricky, I understand
> >    they take lots of time to implement them and to debug them, what I
> >    want to say is that they are becoming standard and we _must_ reuse the
> >    same code for all ARM platforms. You can implement them in MCPM (see
> >    (1)) or in firmware (and please do not start painting me as firmware
> >    hugger here, I am referring to standard power down sequences that
> >    again, are very close to HW state machines 
> 
> That's where the disconnect lies.  On the one hand you say "I understand 
> they are tricky, I understand they take lots of time to implement them 
> and to debug them" and on the other hand you say "They might end up being 
> implemented in HW in the not so far future."  That simply makes no 
> economical sense at all!

I wanted to say Nico that those operations are so intrinsic for all ARM
cores you can almost think of them as part of the ISA and certainly
HW knows it better than SW when a processor has nothing to do, it does
idle core units all the time without software interaction, going power off
(on cue) could just be one step further and solve those pesky races in HW.

> When some operation is 1) tricky and takes time to debug, and 2) not 
> performance critical (no one is trying to get in and out of idle or 
> hibernation a billion times per second), then you should never ever put 
> such a thing in firmware, and hardware should be completely out of the 
> question!

We can debate this offline, it is an interesting topic in particular on
the performance critical side of things; as for firmware see my point
on security, as I already mentioned.

> >    and more importantly if they
> >    HAVE to run in secure world that's the only solution we have unless you
> >    want to split race conditions between kernel and secure world).
> 
> If they HAVE to run in secure world then your secure world architecture 
> is simply misdesigned, period.  Someone must have ignored the economics 
> of modern software development to have come up with this.

That's your opinion and I respect that. What I said, and that's a fact
not an opinion, is that if those operations are required to be secure in a
platform, there is not much you can do apart from preventing races where the
race conditions are, ie in secure world.

> > 5) I agree that the CCI enablement in TC2 (bootmon for cold boot and
> >    kernel for warm-boot is wrong, nothing to say and it was not the
> >    reason I commented on Nico's patch - I think I explained to you
> >    thoroughly why now).
> 
> OK. Let's start by agreeing on the spirit behind my patch then.  The 
> actual patch implementation details are a secondary concern and open for 
> discussion.

I agree on the spirit of the patch, my concern was about its implementation.

Thank you, this was a constructive discussion.

Lorenzo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-10  4:25                             ` Nicolas Pitre
  2014-06-10  9:19                               ` Lorenzo Pieralisi
@ 2014-06-10 14:14                               ` Catalin Marinas
  2014-06-10 16:49                                 ` Nicolas Pitre
  1 sibling, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2014-06-10 14:14 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nico,

Sorry, I can't stay away from this thread ;)

On Tue, Jun 10, 2014 at 12:25:47AM -0400, Nicolas Pitre wrote:
> On Mon, 9 Jun 2014, Lorenzo Pieralisi wrote:
> > 4) When I am talking about firmware I am talking about sequences that
> >    are very close to HW (disabling C bit, cleaning caches, exiting
> >    coherency). Erratas notwithstanding, they are being standardized at
> >    ARM the best we can. They might even end up being implemented in HW
> >    in the not so far future. I understand they are tricky, I understand
> >    they take lots of time to implement them and to debug them, what I
> >    want to say is that they are becoming standard and we _must_ reuse the
> >    same code for all ARM platforms. You can implement them in MCPM (see
> >    (1)) or in firmware (and please do not start painting me as firmware
> >    hugger here, I am referring to standard power down sequences that
> >    again, are very close to HW state machines 
> 
> That's where the disconnect lies.  On the one hand you say "I understand 
> they are tricky, I understand they take lots of time to implement them 
> and to debug them" and on the other hand you say "They might end up being 
> implemented in HW in the not so far future."  That simply makes no 
> economical sense at all!

It makes lots of sense, though not from a software maintainability
perspective. It would be nice if everything still looked like ARM7TDMI
but in the race for performance (vs power), hardware becomes more
complex and it's not just the CPU but adjacent parts like interconnects,
caches, asynchronous bridges, voltage shifters, memory controllers,
clocks/PLLs etc. Many of these are simply hidden from the high level OS
like Linux because the OS assumes certain configuration (e.g. access to
memory) and it's only the hardware itself that knows in what order they
can be turned on or off (when triggered explicitly by the OS or an
external event). Having an dedicated power controller (e.g. M-class
processor) to handle some of these is a rather flexible approach, other
bits require RTL (and usually impossible to update).

> When some operation is 1) tricky and takes time to debug, and 2) not 
> performance critical (no one is trying to get in and out of idle or 
> hibernation a billion times per second), then you should never ever put 
> such a thing in firmware, and hardware should be completely out of the 
> question!

I agree that things can go wrong (both in hardware and software, no
matter where it runs) but please don't think that such power
architecture has been specifically engineered to hide the hardware from
Linux. It's a necessity for complex systems and the optimal solution is
not always simplification (it's not just ARM+vendors doing this, just
look at the power model of modern x86 processors, hidden nicely from the
software behind a few registers while making things harder for scheduler
which cannot rely on a constant performance level; but it's a trade-off
they are happy to make).

> >    and more importantly if they
> >    HAVE to run in secure world that's the only solution we have unless you
> >    want to split race conditions between kernel and secure world).
> 
> If they HAVE to run in secure world then your secure world architecture 
> is simply misdesigned, period.  Someone must have ignored the economics 
> of modern software development to have come up with this.

That's the trade-off between software complexity and hardware cost,
gates, power consumption. You can do proper physical separation of the
secure services but this would require a separate CPU that is rarely
used and adds to the overall SoC cost. On large scale hardware
deployment, it's exactly economics that matter and these translate into
hardware cost. The software cost is irrelevant here, whether we like it
or not.

-- 
Catalin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-10 14:14                               ` Catalin Marinas
@ 2014-06-10 16:49                                 ` Nicolas Pitre
  2014-06-10 17:42                                   ` Catalin Marinas
  0 siblings, 1 reply; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-10 16:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 10 Jun 2014, Catalin Marinas wrote:

> Hi Nico,
> 
> Sorry, I can't stay away from this thread ;)

;-)

> On Tue, Jun 10, 2014 at 12:25:47AM -0400, Nicolas Pitre wrote:
> > On Mon, 9 Jun 2014, Lorenzo Pieralisi wrote:
> > > 4) When I am talking about firmware I am talking about sequences that
> > >    are very close to HW (disabling C bit, cleaning caches, exiting
> > >    coherency). Erratas notwithstanding, they are being standardized at
> > >    ARM the best we can. They might even end up being implemented in HW
> > >    in the not so far future. I understand they are tricky, I understand
> > >    they take lots of time to implement them and to debug them, what I
> > >    want to say is that they are becoming standard and we _must_ reuse the
> > >    same code for all ARM platforms. You can implement them in MCPM (see
> > >    (1)) or in firmware (and please do not start painting me as firmware
> > >    hugger here, I am referring to standard power down sequences that
> > >    again, are very close to HW state machines 
> > 
> > That's where the disconnect lies.  On the one hand you say "I understand 
> > they are tricky, I understand they take lots of time to implement them 
> > and to debug them" and on the other hand you say "They might end up being 
> > implemented in HW in the not so far future."  That simply makes no 
> > economical sense at all!
> 
> It makes lots of sense, though not from a software maintainability
> perspective. It would be nice if everything still looked like ARM7TDMI
> but in the race for performance (vs power), hardware becomes more
> complex and it's not just the CPU but adjacent parts like interconnects,
> caches, asynchronous bridges, voltage shifters, memory controllers,
> clocks/PLLs etc. Many of these are simply hidden from the high level OS
> like Linux because the OS assumes certain configuration (e.g. access to
> memory) and it's only the hardware itself that knows in what order they
> can be turned on or off (when triggered explicitly by the OS or an
> external event).

I agree when the hardware has to handle parallel dependencies ordered in 
waterfall style. In such cases there is usually no point relying on 
software to implement what is nevertheless simple determinism with 
no possible alternative usage.

But the *most* important thing is what you put in parents, so let me 
emphasize on what you just said:

	When triggered _explicitly_ by the OS or external events

> Having an dedicated power controller (e.g. M-class
> processor) to handle some of these is a rather flexible approach, other
> bits require RTL (and usually impossible to update).

The M-class processor should be treated the same way as firmware.  It 
ought to be flexible (certainly more than hardwired hardware), but it 
shares all the same downsides as firmware and the same concerns apply.

> > When some operation is 1) tricky and takes time to debug, and 2) not 
> > performance critical (no one is trying to get in and out of idle or 
> > hibernation a billion times per second), then you should never ever put 
> > such a thing in firmware, and hardware should be completely out of the 
> > question!
> 
> I agree that things can go wrong (both in hardware and software, no
> matter where it runs) but please don't think that such power
> architecture has been specifically engineered to hide the hardware from
> Linux. It's a necessity for complex systems and the optimal solution is
> not always simplification (it's not just ARM+vendors doing this, just
> look at the power model of modern x86 processors, hidden nicely from the
> software behind a few registers while making things harder for scheduler
> which cannot rely on a constant performance level; but it's a trade-off
> they are happy to make).

I'll claim that this is a bad tradeoff.  And the reason why some 
hardware architects might think it is a good one is because so far we 
really sucked at software based power management in Linux (and possibly 
other OSes as well).  Hence the (fairly recent) realization that power 
management has to be integrated and under control of the scheduler 
rather than existing as some ad hoc subsystem.

The reaction from the hardware people often is "the software is crap and 
makes our hardware look bad, we know better, so let's make it easier on 
those poor software dudes by handling power management in hardware 
instead".  But ultimately the hardware just can't predict things like 
software can.  It might do a better job than the current software state 
of affairs, but most likely not be as efficient as a proper software 
architecture.  The hardware may only be reactive, whereas the software 
can be proactive (when properly done that is).

I sense from your paragraph above that ARM might be going the same 
direction as X86 and that would be very sad.  Maybe the best compromise 
would be for all knobs to be made available to software if software 
wants to turn off the hardware auto-pilot and take control.  This way 
the hardware guys would cover their arses while still allowing for the 
possibility that software might be able to out perform the hardware 
solution.

> > >    and more importantly if they
> > >    HAVE to run in secure world that's the only solution we have unless you
> > >    want to split race conditions between kernel and secure world).
> > 
> > If they HAVE to run in secure world then your secure world architecture 
> > is simply misdesigned, period.  Someone must have ignored the economics 
> > of modern software development to have come up with this.
> 
> That's the trade-off between software complexity and hardware cost,
> gates, power consumption. You can do proper physical separation of the
> secure services but this would require a separate CPU that is rarely
> used and adds to the overall SoC cost. On large scale hardware
> deployment, it's exactly economics that matter and these translate into
> hardware cost. The software cost is irrelevant here, whether we like it
> or not.

I agree with you on the hardware cost (and the same argument applies to 
power management by the way).  But once the hardware is there, the 
software cost has to be optimized the same way.

>From a cost perspective, firmware is always a magnitude more costly to 
develop and to fix and maintain afterwards than kernel code.  So, 
without requiring full physical separation increasing the hardware cost, 
I think the software architecture would benefit from a rethought, 
possibly with the help of small and cheap hardware enhancements.  I 
really think not enough attention has been dedicated to that aspect.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-10 16:49                                 ` Nicolas Pitre
@ 2014-06-10 17:42                                   ` Catalin Marinas
  2014-06-10 19:15                                     ` Nicolas Pitre
  0 siblings, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2014-06-10 17:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 10, 2014 at 05:49:01PM +0100, Nicolas Pitre wrote:
> On Tue, 10 Jun 2014, Catalin Marinas wrote:
> > On Tue, Jun 10, 2014 at 12:25:47AM -0400, Nicolas Pitre wrote:
> > > On Mon, 9 Jun 2014, Lorenzo Pieralisi wrote:
> > > > 4) When I am talking about firmware I am talking about sequences that
> > > >    are very close to HW (disabling C bit, cleaning caches, exiting
> > > >    coherency). Erratas notwithstanding, they are being standardized at
> > > >    ARM the best we can. They might even end up being implemented in HW
> > > >    in the not so far future. I understand they are tricky, I understand
> > > >    they take lots of time to implement them and to debug them, what I
> > > >    want to say is that they are becoming standard and we _must_ reuse the
> > > >    same code for all ARM platforms. You can implement them in MCPM (see
> > > >    (1)) or in firmware (and please do not start painting me as firmware
> > > >    hugger here, I am referring to standard power down sequences that
> > > >    again, are very close to HW state machines 
> > > 
> > > That's where the disconnect lies.  On the one hand you say "I understand 
> > > they are tricky, I understand they take lots of time to implement them 
> > > and to debug them" and on the other hand you say "They might end up being 
> > > implemented in HW in the not so far future."  That simply makes no 
> > > economical sense at all!
> > 
> > It makes lots of sense, though not from a software maintainability
> > perspective. It would be nice if everything still looked like ARM7TDMI
> > but in the race for performance (vs power), hardware becomes more
> > complex and it's not just the CPU but adjacent parts like interconnects,
> > caches, asynchronous bridges, voltage shifters, memory controllers,
> > clocks/PLLs etc. Many of these are simply hidden from the high level OS
> > like Linux because the OS assumes certain configuration (e.g. access to
> > memory) and it's only the hardware itself that knows in what order they
> > can be turned on or off (when triggered explicitly by the OS or an
> > external event).
> 
> I agree when the hardware has to handle parallel dependencies ordered in 
> waterfall style. In such cases there is usually no point relying on 
> software to implement what is nevertheless simple determinism with 
> no possible alternative usage.
> 
> But the *most* important thing is what you put in parens, so let me 
> emphasize on what you just said:
> 
> 	When triggered _explicitly_ by the OS or external events

I don't think anyone is arguing that the policy should not be in the OS.
But part of the mechanism can be in the OS and part in firmware, SCP or
hardware. The kernel part can be a simple PSCI call or more complex
setup (possibly MCPM-based) which usually ends up with a WFI. This WFI,
however, triggers further hardware changes which may be handled by
dedicated power controller.

> > Having an dedicated power controller (e.g. M-class
> > processor) to handle some of these is a rather flexible approach, other
> > bits require RTL (and usually impossible to update).
> 
> The M-class processor should be treated the same way as firmware.  It 
> ought to be flexible (certainly more than hardwired hardware), but it 
> shares all the same downsides as firmware and the same concerns apply.

Yes, we can treat it as firmware, but we don't have a better alternative
to move the functionality into the kernel (well, we could at least allow
the kernel to load a binary blob and restart the controller).

> > > When some operation is 1) tricky and takes time to debug, and 2) not 
> > > performance critical (no one is trying to get in and out of idle or 
> > > hibernation a billion times per second), then you should never ever put 
> > > such a thing in firmware, and hardware should be completely out of the 
> > > question!
> > 
> > I agree that things can go wrong (both in hardware and software, no
> > matter where it runs) but please don't think that such power
> > architecture has been specifically engineered to hide the hardware from
> > Linux. It's a necessity for complex systems and the optimal solution is
> > not always simplification (it's not just ARM+vendors doing this, just
> > look at the power model of modern x86 processors, hidden nicely from the
> > software behind a few registers while making things harder for scheduler
> > which cannot rely on a constant performance level; but it's a trade-off
> > they are happy to make).
> 
> I'll claim that this is a bad tradeoff.  And the reason why some 
> hardware architects might think it is a good one is because so far we 
> really sucked at software based power management in Linux (and possibly 
> other OSes as well).  Hence the (fairly recent) realization that power 
> management has to be integrated and under control of the scheduler 
> rather than existing as some ad hoc subsystem.

But even this is a complex problem. Anyway, I don't think the (ARM at
least) hardware guys aim to take over the cpufreq or Linux scheduler
functionality. Their concern is rather mechanism rather than policy.

> The reaction from the hardware people often is "the software is crap and 
> makes our hardware look bad, we know better, so let's make it easier on 
> those poor software dudes by handling power management in hardware 
> instead".  But ultimately the hardware just can't predict things like 
> software can.  It might do a better job than the current software state 
> of affairs, but most likely not be as efficient as a proper software 
> architecture.  The hardware may only be reactive, whereas the software 
> can be proactive (when properly done that is).

Indeed. But that's not the aim of the power controller on our boards.
It's just a mechanism for safely changing sleep states, CPU frequencies
but entirely under the OS decision.

> I sense from your paragraph above that ARM might be going the same 
> direction as X86 and that would be very sad.  Maybe the best compromise 
> would be for all knobs to be made available to software if software 
> wants to turn off the hardware auto-pilot and take control.  This way 
> the hardware guys would cover their arses while still allowing for the 
> possibility that software might be able to out perform the hardware 
> solution.

I'm definitely not suggesting ARM is going the same route. Just trying
to show that ARM is slightly better here.

As a personal opinion, I like the simplicity of writing a register to
change the P-state but I don't like the non-determinism of the x86
hardware w.r.t. CPU performance. There are however some "policy" aspects
which I find interesting (like detecting whether the workload is memory
bound and automatically lowering the CPU frequency; the OS cannot react
this fast).

> > > >    and more importantly if they
> > > >    HAVE to run in secure world that's the only solution we have unless you
> > > >    want to split race conditions between kernel and secure world).
> > > 
> > > If they HAVE to run in secure world then your secure world architecture 
> > > is simply misdesigned, period.  Someone must have ignored the economics 
> > > of modern software development to have come up with this.
> > 
> > That's the trade-off between software complexity and hardware cost,
> > gates, power consumption. You can do proper physical separation of the
> > secure services but this would require a separate CPU that is rarely
> > used and adds to the overall SoC cost. On large scale hardware
> > deployment, it's exactly economics that matter and these translate into
> > hardware cost. The software cost is irrelevant here, whether we like it
> > or not.
> 
> I agree with you on the hardware cost (and the same argument applies to 
> power management by the way).  But once the hardware is there, the 
> software cost has to be optimized the same way.
> 
> From a cost perspective, firmware is always a magnitude more costly to 
> develop and to fix and maintain afterwards than kernel code.  So, 
> without requiring full physical separation increasing the hardware cost, 
> I think the software architecture would benefit from a rethought, 
> possibly with the help of small and cheap hardware enhancements.  I 
> really think not enough attention has been dedicated to that aspect.

I fully agree (and I think Linaro is well positioned for this ;),
possibly as an extension of the boot+firmware architecture).

-- 
Catalin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Problems booting exynos5420 with >1 CPU
  2014-06-10 17:42                                   ` Catalin Marinas
@ 2014-06-10 19:15                                     ` Nicolas Pitre
  0 siblings, 0 replies; 51+ messages in thread
From: Nicolas Pitre @ 2014-06-10 19:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 10 Jun 2014, Catalin Marinas wrote:
> On Tue, Jun 10, 2014 at 05:49:01PM +0100, Nicolas Pitre wrote:
> > The M-class processor should be treated the same way as firmware.  It 
> > ought to be flexible (certainly more than hardwired hardware), but it 
> > shares all the same downsides as firmware and the same concerns apply.
> 
> Yes, we can treat it as firmware, but we don't have a better alternative
> to move the functionality into the kernel (well, we could at least allow
> the kernel to load a binary blob and restart the controller).

That would address the "easy to update in the field" side of the story.  
So far I've not seen this aspect been addressed with a serious plan 
anywhere.

> > The reaction from the hardware people often is "the software is crap and 
> > makes our hardware look bad, we know better, so let's make it easier on 
> > those poor software dudes by handling power management in hardware 
> > instead".  But ultimately the hardware just can't predict things like 
> > software can.  It might do a better job than the current software state 
> > of affairs, but most likely not be as efficient as a proper software 
> > architecture.  The hardware may only be reactive, whereas the software 
> > can be proactive (when properly done that is).
> 
> Indeed. But that's not the aim of the power controller on our boards.
> It's just a mechanism for safely changing sleep states, CPU frequencies
> but entirely under the OS decision.

Sure.  But then you might want to consider that some usage scenarios 
might benefit from the ability to abort a request, or monitor the 
progress of a request for software timing purposes, or accept parallel 
requests rather than serialize them, etc.  Given the flexibility to 
extend beyond a rigid interface, the system may become even more 
efficient overall, albeit with added complexity in the implementation. 
But for that to work it has to be cheaply achievable.

> > I sense from your paragraph above that ARM might be going the same 
> > direction as X86 and that would be very sad.  Maybe the best compromise 
> > would be for all knobs to be made available to software if software 
> > wants to turn off the hardware auto-pilot and take control.  This way 
> > the hardware guys would cover their arses while still allowing for the 
> > possibility that software might be able to out perform the hardware 
> > solution.
> 
> I'm definitely not suggesting ARM is going the same route. Just trying
> to show that ARM is slightly better here.
> 
> As a personal opinion, I like the simplicity of writing a register to
> change the P-state but I don't like the non-determinism of the x86
> hardware w.r.t. CPU performance. There are however some "policy" aspects
> which I find interesting (like detecting whether the workload is memory
> bound and automatically lowering the CPU frequency; the OS cannot react
> this fast).

This is not really a policy.  This is a straight-forward waterfall 
dependency.  There is simply nothing you can do with those CPU clock 
cycles when stalled most of the time waiting for memory queries to come 
back so the choice is obvious.  If however this has a significant impact 
on code execution speed then this becomes a tradeoff and the choice to 
use this feature or not (the policy) must be implemented in software.


Nicolas

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2014-06-10 19:15 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-06  1:08 Problems booting exynos5420 with >1 CPU Doug Anderson
2014-06-06  4:38 ` Tushar Behera
2014-06-06 17:17   ` Doug Anderson
2014-06-06 17:36     ` Abhilash Kesavan
2014-06-06 18:02       ` Doug Anderson
2014-06-06 18:12         ` Abhilash Kesavan
2014-06-06 18:20           ` Doug Anderson
2014-06-06 18:31             ` Abhilash Kesavan
2014-06-06 18:56               ` Doug Anderson
2014-06-06 19:09                 ` Abhilash Kesavan
2014-06-06 19:12                   ` Abhilash Kesavan
2014-06-06 20:49                   ` Doug Anderson
2014-06-06 21:01                     ` Russell King - ARM Linux
2014-06-06 21:12                       ` Doug Anderson
2014-06-06 21:44                     ` Doug Anderson
2014-06-06 20:37             ` Olof Johansson
2014-06-06 20:46               ` Abhilash Kesavan
2014-06-06 21:01                 ` Olof Johansson
2014-06-06 21:06                   ` Abhilash Kesavan
2014-06-06 21:34                   ` Nicolas Pitre
2014-06-06 21:49                     ` Olof Johansson
2014-06-06 21:59                       ` Doug Anderson
2014-06-06 22:38                         ` Nicolas Pitre
2014-06-06 23:03                           ` Doug Anderson
2014-06-06 22:17                       ` Nicolas Pitre
2014-06-06 21:48       ` Nicolas Pitre
2014-06-07  3:25         ` Abhilash Kesavan
2014-06-07 16:10           ` Nicolas Pitre
2014-06-07 17:56             ` Lorenzo Pieralisi
2014-06-07 20:06               ` Nicolas Pitre
2014-06-07 22:36                 ` Lorenzo Pieralisi
2014-06-07 23:53                   ` Olof Johansson
2014-06-08  0:19                     ` Russell King - ARM Linux
2014-06-08  2:52                       ` Olof Johansson
2014-06-08 18:26                       ` Nicolas Pitre
2014-06-08 18:29                         ` Russell King - ARM Linux
2014-06-08 18:55                           ` Nicolas Pitre
2014-06-08 19:02                             ` Russell King - ARM Linux
2014-06-08 12:45                     ` Lorenzo Pieralisi
2014-06-08 14:34                       ` Russell King - ARM Linux
2014-06-08 17:53                       ` Nicolas Pitre
2014-06-09 20:47                         ` Kevin Hilman
2014-06-09 22:26                           ` Lorenzo Pieralisi
2014-06-10  4:25                             ` Nicolas Pitre
2014-06-10  9:19                               ` Lorenzo Pieralisi
2014-06-10 14:14                               ` Catalin Marinas
2014-06-10 16:49                                 ` Nicolas Pitre
2014-06-10 17:42                                   ` Catalin Marinas
2014-06-10 19:15                                     ` Nicolas Pitre
2014-06-09 20:27             ` Kevin Hilman
2014-06-09 20:35               ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).