* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
@ 2016-12-29 16:38 Bas van Tiel
  2016-12-29 17:02 ` Neil Armstrong
  2016-12-30  7:13 ` Jisheng Zhang
  0 siblings, 2 replies; 10+ messages in thread
From: Bas van Tiel @ 2016-12-29 16:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

When using a signal handler as a way to context switch between
different user contexts, a reproducible exception occurs on my rpi3 in
64-bit mode. (https://gist.github.com/DanGe42/7148946)

Running the context_demo program as a 32-bit ARM executable on a
64-bit kernel is OK; running it as a 32-bit or 64-bit executable on an
x86 kernel is also OK.

In the first exception the PC doesn't look correct, and the *pmd is 0.
The 2nd exception happens after running the program again; the PC is 0x0.

A successful function trace was not possible: enabling it causes a
complete kernel hangup.

Is there another way to gather more information about what is happening?

Linux (none) 4.10.0-rc1-v8+ #3 SMP PREEMPT Thu Dec 29 12:10:12 CET 2016 aarch64 GNU/Linux

[   46.350738] a.out[196]: unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046
[   46.360516] pgd = ffffffc0392cb000
[   46.365377] [000000b8] *pgd=00000000392ec003
[   46.365381] , *pud=00000000392ec003
[   46.370878] , *pmd=0000000000000000
[   46.375907]
[   46.383974]
[   46.389107] CPU: 0 PID: 196 Comm: a.out Not tainted 4.10.0-rc1-v8+ #3
[   46.397949] Hardware name: Raspberry Pi 3 Model B (DT)
[   46.406218] task: ffffffc039ad6580 task.stack: ffffffc039bfc000
[   46.413892] PC is at 0x7fb4e34810
[   46.418230] LR is at 0x400b84
[   46.422956] pc : [<0000007fb4e34810>] lr : [<0000000000400b84>] pstate: 60000000
[   46.431522] sp : 0000000000413350
[   46.436480] x29: 0000000000413350 x28: 0000000000000016
[   46.443142] x27: 0000000000000000 x26: 0000000000000020
[   46.451908] x25: 0000007fb4f35488 x24: 0000000000415f00
[   46.459641] x23: 0000000000000016 x22: 0000000000400b84
[   46.469198] x21: 0000000000413670 x20: 0000000000417030
[   46.476970] x19: 0000000000001000 x18: 0000000000000000
[   46.484744] x17: 0000007fb4e34810 x16: 0000000000411270
[   46.492175] x15: 00000000000005f1 x14: 0000000000000000
[   46.498884] x13: 0000000000000000 x12: 0000000000000000
[   46.506013] x11: 0000000000000020 x10: 0101010101010101
[   46.517164] x9 : 0000000000413670 x8 : 00000000ffffffe0
[   46.525541] x7 : 0000000000413350 x6 : 0000000000413350
[   46.533495] x5 : 00000000ffffffe0 x4 : 0000000000413730
[   46.544052] x3 : 0000000000000008 x2 : 0000000000000000
[   46.552211] x1 : 0000000000413670 x0 : 0000000000000000
[   46.558668]

2nd time startup of the executable

[  262.565147] a.out[201]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x82000006
[  262.575243] pgd = ffffffc03939a000
[  262.579948] [00000000] *pgd=000000003938f003
[  262.579951] , *pud=000000003938f003
[  262.586040] , *pmd=0000000000000000
[  262.590479]
[  262.598234]
[  262.601108] CPU: 0 PID: 201 Comm: a.out Not tainted 4.10.0-rc1-v8+ #3
[  262.609086] Hardware name: Raspberry Pi 3 Model B (DT)
[  262.615731] task: ffffffc03904a600 task.stack: ffffffc039bfc000
[  262.621768] PC is at 0x0
[  262.624300] LR is at 0x0
[  262.626835] pc : [<0000000000000000>] lr : [<0000000000000000>] pstate: 60000000
[  262.634437] sp : 00000000004159c0
[  262.637753] x29: 0000000000000000 x28: 0000000000000000
[  262.643242] x27: 0000000000000000 x26: 0000000000000000
[  262.648554] x25: 0000000000000000 x24: 0000000000000000
[  262.654033] x23: 0000000000000000 x22: 0000000000000000
[  262.659349] x21: 00000000004008f0 x20: 0000000000000000
[  262.664825] x19: 0000000000000000 x18: 0000000000000000
[  262.670145] x17: 0000007fb065b620 x16: 0000000000400b84
[  262.675622] x15: 00000000000003d1 x14: 0000000000000000
[  262.680938] x13: 0000000000000000 x12: 0000000000000000
[  262.686413] x11: 0000000000000020 x10: 0101010101010101
[  262.691835] x9 : 00000000004112c0 x8 : 0000000000000087
[  262.697159] x7 : 0000000000000000 x6 : 0000000000000000
[  262.702634] x5 : 0000000000000000 x4 : 0000000000000000
[  262.707949] x3 : 0000000000000000 x2 : 0000000000000000
[  262.713424] x1 : 0000000000000000 x0 : 0000000000000000
[  262.718739]
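
For reference, the esr values in the two reports above can be decoded by
hand. The following is only a minimal sketch in C, assuming the ARMv8-A
ESR_ELx layout (exception class in bits [31:26], write-not-read in bit 6
for data aborts, fault status code in bits [5:0]). The function names are
made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Exception class: ESR bits [31:26]. */
unsigned esr_ec(uint32_t esr) { return (esr >> 26) & 0x3f; }

/* Fault status code: ESR bits [5:0] (meaningful for aborts). */
unsigned esr_fsc(uint32_t esr) { return esr & 0x3f; }

/* Write-not-read: ESR bit 6 (meaningful for data aborts only). */
unsigned esr_wnr(uint32_t esr) { return (esr >> 6) & 0x1; }

/* Lookup level: low two bits of the fault status code. Only meaningful
 * for the fault groups named in esr_fault_type() below. */
unsigned esr_fault_level(uint32_t esr) { return esr_fsc(esr) & 0x3; }

/* Kind of abort, for the two exception classes seen in this thread. */
const char *esr_abort_kind(uint32_t esr)
{
    switch (esr_ec(esr)) {
    case 0x20: return "instruction abort from a lower EL";
    case 0x24: return "data abort from a lower EL";
    default:   return "other exception class";
    }
}

/* Fault group, taken from FSC bits [5:2]. */
const char *esr_fault_type(uint32_t esr)
{
    switch (esr_fsc(esr) >> 2) {
    case 0x1: return "translation fault";  /* FSC 0b0001LL */
    case 0x2: return "access flag fault";  /* FSC 0b0010LL */
    case 0x3: return "permission fault";   /* FSC 0b0011LL */
    default:  return "other fault";
    }
}
```

With these helpers, 0x92000046 decodes as a data abort from a lower
exception level, caused by a write, with a level 2 translation fault, and
0x82000006 as an instruction abort with a level 2 translation fault. Both
agree with the text the kernel printed.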

rpi3:
minimal kernel (64-bit, Cortex-A53, little endian, 4 KiB pages,
initramfs). I tried kernels 4.8, 4.9 and 4.10.0-rc1-v8+; the same
result occurs with each, and also with different compilers.

kernel, aarch64-linux-gnu-gcc (Linaro GCC 6.2-2016.11) 6.2.1 20161016
application, aarch64-linux-gnu-gcc (Linaro GCC 6.2-2016.11) 6.2.1 20161016

The only thing I noticed while reading through the different source files
was the definition of struct kernel_rt_sigframe
(http://osxr.org:8080/glibc/source/ports/sysdeps/unix/sysv/linux/aarch64/kernel_rt_sigframe.h?v=glibc-2.18)
compared to struct rt_sigframe (linux/arch/arm64/kernel/signal.c).

Any help or pointers to solve this issue are welcome,

regards
Bas


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2016-12-29 16:38 Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64) Bas van Tiel
@ 2016-12-29 17:02 ` Neil Armstrong
  2016-12-30  7:13 ` Jisheng Zhang
  1 sibling, 0 replies; 10+ messages in thread
From: Neil Armstrong @ 2016-12-29 17:02 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/29/2016 05:38 PM, Bas van Tiel wrote:
> Hi,
> 
> when using a signal handler as a way to context switch between
> different usercontexts a reproducible exception occurs on my rpi3 in
> 64-bit mode. (https://gist.github.com/DanGe42/7148946)
> 
> Running the context_demo program as a 32-bit ARM executable on a
> 64-bit kernel is OK, running as a 32 || 64 bit executable on an x86
> kernel is OK.
> 
> In the first exception the PC doesn't look correct, and the *pmd is 0.
> The 2nd exception happens after running the program again, the PC is 0x0.
> 
> A successful function trace was not possible -> complete kernel hangup
> when enabling.
> 
> Is there another way to gather more information about what is happening?

Hi,

The same issue was reported on Amlogic 64-bit as well: https://www.spinics.net/lists/arm-kernel/msg550204.html

Neil


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2016-12-29 16:38 Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64) Bas van Tiel
  2016-12-29 17:02 ` Neil Armstrong
@ 2016-12-30  7:13 ` Jisheng Zhang
  2016-12-30 12:21   ` Bas van Tiel
  1 sibling, 1 reply; 10+ messages in thread
From: Jisheng Zhang @ 2016-12-30  7:13 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, 29 Dec 2016 17:38:14 +0100 Bas van Tiel wrote:

> Hi,
> 
> when using a signal handler as a way to context switch between
> different usercontexts a reproducible exception occurs on my rpi3 in
> 64-bit mode. (https://gist.github.com/DanGe42/7148946)
> 
> Running the context_demo program as a 32-bit ARM executable on a
> 64-bit kernel is OK, running as a 32 || 64 bit executable on an x86
> kernel is OK.
> 
> In the first exception the PC doesn't look correct, and the *pmd is 0.
> The 2nd exception happens after running the program again, the PC is 0x0.
> 
> A successful function trace was not possible -> complete kernel hangup
> when enabling.
> 
> Is there another way to gather more information about what is happening?

I can reproduce the segmentation fault with your program on Marvell Berlin
SoCs. My kernel version is 4.1; I haven't tested 4.9, 4.10-rc1, etc.

Then I increased STACKSIZE from 4096 to 8192 in context_demo.c, and
everything works fine now. Maybe arm64 needs a slightly larger signal stack?

Thanks,
Jisheng



* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2016-12-30  7:13 ` Jisheng Zhang
@ 2016-12-30 12:21   ` Bas van Tiel
  2017-01-09 15:13     ` Catalin Marinas
  0 siblings, 1 reply; 10+ messages in thread
From: Bas van Tiel @ 2016-12-30 12:21 UTC (permalink / raw)
  To: linux-arm-kernel

>> Hi,
>>
>> when using a signal handler as a way to context switch between
>> different usercontexts a reproducible exception occurs on my rpi3 in
>> 64-bit mode. (https://gist.github.com/DanGe42/7148946)
>>
>> Running the context_demo program as a 32-bit ARM executable on a
>> 64-bit kernel is OK, running as a 32 || 64 bit executable on an x86
>> kernel is OK.
>>
>> In the first exception the PC doesn't look correct, and the *pmd is 0.
>> The 2nd exception happens after running the program again, the PC is 0x0.
>>
>> A successful function trace was not possible -> complete kernel hangup
>> when enabling.
>>
>> Is there another way to gather more information about what is happening?
>
> I can reproduce Segmentation fault with your program on Marvell berlin SoCs
> my kernel version is 4.1, I didn't tested 4.9, 4.10-rc1 etc..
>
> Then I increased the STACKSIZE from 4096 to 8192 in context_demo.c,
> everything works fine now. Maybe arm64 need a bit larger signalstack?
>

Yes, increasing STACKSIZE to 8192 helps on 4.9/4.10-rc1, but after a
while the exception still occurs, although the message is different.
The *pmd is not 0 in this case.

To trigger this scenario:
- INTERVAL set to 500 [ns]
- kernel booted with maxcpus=0
- a 'find /' command running in the shell in parallel with the program
- stdout and stderr redirected to a file

[  850.581983] a.out[173]: unhandled level 3 permission fault (11) at 0x004391f0, esr 0x8200000f
[  850.591833] pgd = ffffffc039311000
[  850.596725] [004391f0] *pgd=0000000039340003
[  850.602145] , *pud=0000000039340003
[  850.608352] , *pmd=000000003922c003
[  850.611963] , *pte=00e80000359c0f53
[  850.618111]
[  850.621102]
[  850.624032] CPU: 0 PID: 173 Comm: a.out Not tainted 4.9.0-v8+ #5
[  850.631314] Hardware name: Raspberry Pi 3 Model B (DT)
[  850.637925] task: ffffffc039a13100 task.stack: ffffffc039a14000
[  850.645314] PC is at 0x4391f0
[  850.649783] LR is at 0x4391f0
[  850.654035] pc : [<00000000004391f0>] lr : [<00000000004391f0>] pstate: 60000000
[  850.662920] sp : 0000000000420da0
[  850.667516] x29: 00000000004391f0 x28: 0000000000000000
[  850.677145] x27: 0000000000000000 x26: 0000000000000000

When I use taskset to pin the context_demo program to other cores that
are completely isolated (CONFIG_NO_HZ_FULL, isolcpus=1,2,3), it runs
continuously with the modified STACKSIZE.

regards
Bas


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2016-12-30 12:21   ` Bas van Tiel
@ 2017-01-09 15:13     ` Catalin Marinas
  2017-01-09 18:06       ` Bas van Tiel
  0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2017-01-09 15:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 30, 2016 at 01:21:00PM +0100, Bas van Tiel wrote:
> >> when using a signal handler as a way to context switch between
> >> different usercontexts a reproducible exception occurs on my rpi3 in
> >> 64-bit mode. (https://gist.github.com/DanGe42/7148946)
> >>
> >> Running the context_demo program as a 32-bit ARM executable on a
> >> 64-bit kernel is OK, running as a 32 || 64 bit executable on an x86
> >> kernel is OK.
> >>
> >> In the first exception the PC doesn't look correct, and the *pmd is 0.
> >> The 2nd exception happens after running the program again, the PC is 0x0.
> >>
> >> A successful function trace was not possible -> complete kernel hangup
> >> when enabling.
> >>
> >> Is there another way to gather more information about what is happening?
> >
> > I can reproduce Segmentation fault with your program on Marvell berlin SoCs
> > my kernel version is 4.1, I didn't tested 4.9, 4.10-rc1 etc..
> >
> > Then I increased the STACKSIZE from 4096 to 8192 in context_demo.c,
> > everything works fine now. Maybe arm64 need a bit larger signalstack?
> 
> yes, increased STACKSIZE to 8192 helps on 4.9/4,10-rc1 but after a
> while the exception still occurs, although the message is different.
> The *pmd is not 0 in this case.

I defined STACKSIZE to the kernel's SIGSTKSZ (16384) and it seems to run
fine, though I'll leave it longer/overnight (on a Juno board). With the
4K signal stack it was crashing shortly after start.

-- 
Catalin


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2017-01-09 15:13     ` Catalin Marinas
@ 2017-01-09 18:06       ` Bas van Tiel
  2017-01-10 12:14         ` Catalin Marinas
  0 siblings, 1 reply; 10+ messages in thread
From: Bas van Tiel @ 2017-01-09 18:06 UTC (permalink / raw)
  To: linux-arm-kernel

> I defined STACKSIZE to the kernel's SIGSTKSZ (16384) and it seems to run
> fine, though I'll leave it longer/overnight (on a Juno board). With the
> 4K signal stack it was crashing shortly after start.

I tried a STACKSIZE of 16384 on both the RPI3 and the PINEA64 board
and still see the same crashes. Sometimes the process
is also blocked for a long time before it crashes.

Setting the interval to 200 usec [5 kHz] helps to crash it faster.
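
To make the interval arithmetic explicit, here is a minimal sketch in C,
assuming the timer is armed through setitimer() with a struct itimerval.
The two helper names are hypothetical and do not appear in context_demo.c.

```c
#define _XOPEN_SOURCE 700
#include <sys/time.h>

/* Build a periodic itimerval from a period given in microseconds. */
struct itimerval periodic_timer_us(long period_us)
{
    struct itimerval it;

    it.it_interval.tv_sec  = period_us / 1000000;
    it.it_interval.tv_usec = period_us % 1000000;
    it.it_value = it.it_interval;   /* first expiry after one period */
    return it;
}

/* Signal rate in Hz implied by a period given in microseconds. */
double period_us_to_hz(long period_us)
{
    return 1e6 / (double)period_us;
}
```

A caller would pass the result to setitimer(ITIMER_REAL, &it, NULL) after
installing its SIGALRM handler. A 200 usec period corresponds to 5000
signals per second.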

To further isolate the issue I will create a kernel module (based on an
hrtimer) that will send a periodic signal to the registered process
and execute the same sighandler logic, to check whether the problem is
still there.

regards
Bas


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2017-01-09 18:06       ` Bas van Tiel
@ 2017-01-10 12:14         ` Catalin Marinas
  2017-01-11 14:49           ` Catalin Marinas
  0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2017-01-10 12:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jan 09, 2017 at 07:06:19PM +0100, Bas van Tiel wrote:
> > I defined STACKSIZE to the kernel's SIGSTKSZ (16384) and it seems to run
> > fine, though I'll leave it longer/overnight (on a Juno board). With the
> > 4K signal stack it was crashing shortly after start.
> 
> I tried the STACKSIZE of 16384 for both the RPI3 and the PINEA64 board
> and still see the same behaviour of crashing. Sometimes the process
> is also blocked for a long time before it crashes.
> 
> Setting the interval to 200 usec [5 Khz] will help to crash it faster.
> 
> To further isolate the issue I will create a kernel module (based on a
> hrtimer) that will sent a periodic signal to the registered process
> and execute the same sighandler logic to check if the problem is still
> there.

I lowered the interval to 100us (it was 100ms in the original file) and
I can indeed trigger a segfault easily on Juno. But it doesn't fail in the
same way every time: sometimes I get a permission fault, other times a
bad frame.

-- 
Catalin


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2017-01-10 12:14         ` Catalin Marinas
@ 2017-01-11 14:49           ` Catalin Marinas
  2017-01-11 15:33             ` Dave Martin
  0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2017-01-11 14:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 10, 2017 at 12:14:23PM +0000, Catalin Marinas wrote:
> On Mon, Jan 09, 2017 at 07:06:19PM +0100, Bas van Tiel wrote:
> > > I defined STACKSIZE to the kernel's SIGSTKSZ (16384) and it seems to run
> > > fine, though I'll leave it longer/overnight (on a Juno board). With the
> > > 4K signal stack it was crashing shortly after start.
> > 
> > I tried the STACKSIZE of 16384 for both the RPI3 and the PINEA64 board
> > and still see the same behaviour of crashing. Sometimes the process
> > is also blocked for a long time before it crashes.
> > 
> > Setting the interval to 200 usec [5 Khz] will help to crash it faster.
> > 
> > To further isolate the issue I will create a kernel module (based on a
> > hrtimer) that will sent a periodic signal to the registered process
> > and execute the same sighandler logic to check if the problem is still
> > there.
> 
> I lowered the interval to 100us (it was 100ms in the original file) and
> I can indeed trigger segfault easily on Juno. But it doesn't fail in the
> same way every time, I sometimes get permission fault, other times bad
> frame.

With 100us interval, it segfaults on x86 fairly quickly as well, so I
don't think it's a kernel issue.

-- 
Catalin


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2017-01-11 14:49           ` Catalin Marinas
@ 2017-01-11 15:33             ` Dave Martin
  2017-01-13 18:47               ` Bas van Tiel
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Martin @ 2017-01-11 15:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 11, 2017 at 02:49:03PM +0000, Catalin Marinas wrote:
> On Tue, Jan 10, 2017 at 12:14:23PM +0000, Catalin Marinas wrote:
> > On Mon, Jan 09, 2017 at 07:06:19PM +0100, Bas van Tiel wrote:
> > > > I defined STACKSIZE to the kernel's SIGSTKSZ (16384) and it seems to run
> > > > fine, though I'll leave it longer/overnight (on a Juno board). With the
> > > > 4K signal stack it was crashing shortly after start.
> > > 
> > > I tried the STACKSIZE of 16384 for both the RPI3 and the PINEA64 board
> > > and still see the same behaviour of crashing. Sometimes the process
> > > is also blocked for a long time before it crashes.
> > > 
> > > Setting the interval to 200 usec [5 Khz] will help to crash it faster.
> > > 
> > > To further isolate the issue I will create a kernel module (based on a
> > > hrtimer) that will sent a periodic signal to the registered process
> > > and execute the same sighandler logic to check if the problem is still
> > > there.
> > 
> > I lowered the interval to 100us (it was 100ms in the original file) and
> > I can indeed trigger segfault easily on Juno. But it doesn't fail in the
> > same way every time, I sometimes get permission fault, other times bad
> > frame.
> 
> With 100us interval, it segfaults on x86 fairly quickly as well, so I
> don't think it's a kernel issue.

To be able to take a signal at all, stacks need to be at least SIGSTKSZ
bytes in practice:


diff --git a/context_demo.c b/context_demo.c
index 2cc63f7..b1f3bbc 100644
--- a/context_demo.c
+++ b/context_demo.c
@@ -22,7 +22,7 @@
 
 
 #define NUMCONTEXTS 10              /* how many contexts to make */
-#define STACKSIZE 4096              /* stack size */
+#define STACKSIZE SIGSTKSZ          /* stack size */
 #define INTERVAL 100                /* timer interval in nanoseconds */
 
 sigset_t set;                       /* process wide signal mask */
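
The same sizing rule applies to an alternate signal stack set up with
sigaltstack(). A minimal sketch, not taken from context_demo.c: the
function and variable names are hypothetical, and the buffer is allocated
with malloc() because SIGSTKSZ is not guaranteed to be a compile-time
constant on newer C libraries.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>

static volatile sig_atomic_t ran_on_alt_stack;
static char *alt_base;   /* start of the alternate stack */
static size_t alt_size;  /* its size in bytes */

static void on_alarm(int sig)
{
    char probe;          /* lives on whichever stack the handler runs on */
    uintptr_t p = (uintptr_t)&probe;

    (void)sig;
    ran_on_alt_stack = p >= (uintptr_t)alt_base &&
                       p <  (uintptr_t)alt_base + alt_size;
}

/* Returns 1 if the SIGALRM handler ran on a SIGSTKSZ-byte alternate
 * stack, 0 if it ran elsewhere, -1 on a setup error. */
int handler_runs_on_sigstksz_stack(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_size = SIGSTKSZ;
    alt_base = malloc(alt_size);
    if (alt_base == NULL)
        return -1;

    ss.ss_sp = alt_base;
    ss.ss_size = alt_size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    sa.sa_handler = on_alarm;
    sa.sa_flags = SA_ONSTACK;       /* deliver on the alternate stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    ran_on_alt_stack = 0;
    raise(SIGALRM);                 /* handler runs before raise() returns */
    return ran_on_alt_stack;
}
```

The sketch leaves the alternate stack installed and never frees it, which
is fine for a demo but would need cleanup in real code.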


The other issue looks a bit subtler, to do with signal masking.

SIGALRM will be masked on entry to timer_interrupt() and restored on
return, due to the absence of SA_NODEFER from sa_flags when calling
sigaction.  (Setting SIGALRM in sa_mask also has this effect, but this
is redundant without SA_NODEFER.)
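
The masking behaviour is easy to check in isolation. A minimal sketch,
assuming a single-threaded process in which SIGALRM is not already
blocked; the names are hypothetical and not from context_demo.c:

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t blocked_in_handler = -1;

static void handler(int sig)
{
    sigset_t cur;

    /* A NULL "set" argument queries the current mask without changing it. */
    sigprocmask(SIG_BLOCK, NULL, &cur);
    blocked_in_handler = sigismember(&cur, sig);
}

/* Install the handler with the given sa_flags and an empty sa_mask,
 * deliver one SIGALRM, and return 1 if SIGALRM was blocked while the
 * handler ran, 0 if it was not, -1 on error. */
int sigalrm_blocked_in_handler(int sa_flags)
{
    struct sigaction sa;

    sa.sa_handler = handler;
    sa.sa_flags = sa_flags;
    sigemptyset(&sa.sa_mask);       /* deliberately empty */
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    blocked_in_handler = -1;
    raise(SIGALRM);                 /* handler runs before raise() returns */
    return blocked_in_handler;
}
```

Called with sa_flags of 0 this reports that SIGALRM is blocked inside the
handler; called with SA_NODEFER it reports that it is not.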

However, by explicitly clearing this signal from
signal_context.uc_sigmask, we'll enter scheduler() with SIGALRM
unmasked.  If a new SIGALRM is taken before scheduler() has called
setcontext(), we'll pile up another signal frame on signal_stack and
call scheduler() again, still on signal_stack ... and this can repeat
indefinitely.

There's no need to clear SIGALRM from the signal mask: it will be
cleared when timer_interrupt() returns after resuming an interrupted
task (as part of the signal frame restore work done by rt_sigreturn).
So:

@@ -61,7 +61,6 @@ timer_interrupt(int j, siginfo_t *si, void *old_context)
     signal_context.uc_stack.ss_sp = signal_stack;
     signal_context.uc_stack.ss_size = STACKSIZE;
     signal_context.uc_stack.ss_flags = 0;
-    sigemptyset(&signal_context.uc_sigmask);
     makecontext(&signal_context, scheduler, 1);
 
     /* save running thread, jump to scheduler */

For me, this seems to fix the problem.

It also makes sense of what we've seen: we need either short timer
intervals, slow machines, or high system load (or some combination) in
order to take enough extra signals in scheduler() to cause a stack
overflow.
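
The effect of the removed sigemptyset() call can be reproduced outside a
signal handler. This is a minimal sketch, not the demo itself: SIGALRM is
blocked by hand to mimic handler entry, fake_scheduler() is a hypothetical
stand-in for scheduler(), and the 64 KiB stack size is an arbitrary
choice.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, sched_ctx;
static int alrm_blocked_in_sched = -1;

/* Stand-in for scheduler(): only records whether SIGALRM is masked.
 * Returning from it resumes uc_link, i.e. main_ctx. */
static void fake_scheduler(void)
{
    sigset_t cur;

    sigprocmask(SIG_BLOCK, NULL, &cur);
    alrm_blocked_in_sched = sigismember(&cur, SIGALRM);
}

/* Enter fake_scheduler() on its own stack while SIGALRM is blocked in
 * the caller. If clear_mask is nonzero, empty uc_sigmask first, as the
 * line removed by the patch did. Returns 1 if SIGALRM was still blocked
 * inside fake_scheduler(), 0 if not, -1 on error. */
int sigalrm_blocked_in_scheduler(int clear_mask)
{
    sigset_t block, old;
    size_t size = 64 * 1024;
    void *stack = malloc(size);

    if (stack == NULL)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old);   /* mimic handler entry */

    getcontext(&sched_ctx);                 /* captures the current mask */
    sched_ctx.uc_stack.ss_sp = stack;
    sched_ctx.uc_stack.ss_size = size;
    sched_ctx.uc_stack.ss_flags = 0;
    sched_ctx.uc_link = &main_ctx;
    if (clear_mask)
        sigemptyset(&sched_ctx.uc_sigmask);
    makecontext(&sched_ctx, fake_scheduler, 0);

    alrm_blocked_in_sched = -1;
    swapcontext(&main_ctx, &sched_ctx);     /* back here when it returns */

    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the caller's mask */
    free(stack);
    return alrm_blocked_in_sched;
}
```

Without the sigemptyset() the new context inherits the mask captured by
getcontext(), so SIGALRM stays blocked inside the scheduler context; with
it, SIGALRM is deliverable there.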

I can't see the purpose of running scheduler() in its own context here,
except so that it doesn't contribute stack overhead to the thread
stacks (which hardly seems worthwhile, since its overhead is probably a
lot smaller than the signal overhead anyway -- maybe I'm missing
something).

makecontext() and swapcontext() are obsoleted by POSIX.1-2008 and
considered non-portable (see makecontext(3), swapcontext(3)).  Really,
the ucontext API should not be used for anything except cooperative
switching now (certainly this covered the vast majority of real-world
usage the last time I looked into it).  For anything else, pthreads
almost certainly do it better.
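
For comparison, cooperative switching with the same API needs no signals
at all: each task gives up the CPU explicitly. A minimal sketch with two
hypothetical tasks that ping-pong via swapcontext() (all names and the
64 KiB stack size are assumptions for illustration):

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <ucontext.h>

#define COOP_STACK_SIZE (64 * 1024)
#define COOP_STEPS 3

static ucontext_t main_ctx, task_ctx[2];
static char trace[2 * COOP_STEPS + 1];  /* order in which the tasks ran */
static int trace_len;

/* Each task records its letter, then yields to the other task. */
static void task(int id)
{
    for (int i = 0; i < COOP_STEPS; i++) {
        trace[trace_len++] = (char)('A' + id);
        swapcontext(&task_ctx[id], &task_ctx[1 - id]);
    }
    /* Returning resumes uc_link, i.e. main_ctx. */
}

/* Runs until task 0 returns; yields the trace string, or NULL on error. */
const char *run_cooperative_demo(void)
{
    void *stacks[2] = { NULL, NULL };

    trace_len = 0;
    for (int id = 0; id < 2; id++) {
        stacks[id] = malloc(COOP_STACK_SIZE);
        if (stacks[id] == NULL) {
            free(stacks[0]);
            return NULL;
        }
        getcontext(&task_ctx[id]);
        task_ctx[id].uc_stack.ss_sp = stacks[id];
        task_ctx[id].uc_stack.ss_size = COOP_STACK_SIZE;
        task_ctx[id].uc_stack.ss_flags = 0;
        task_ctx[id].uc_link = &main_ctx;
        makecontext(&task_ctx[id], (void (*)(void))task, 1, id);
    }

    swapcontext(&main_ctx, &task_ctx[0]);   /* start with task 0 */
    trace[trace_len] = '\0';

    /* Task 1 is still suspended in its last yield; just drop its stack. */
    free(stacks[0]);
    free(stacks[1]);
    return trace;
}
```

Because every switch happens at a point the task chose, there is no
window in which a second switch can nest on top of an unfinished one.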

Cheers
---Dave


* Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64)
  2017-01-11 15:33             ` Dave Martin
@ 2017-01-13 18:47               ` Bas van Tiel
  0 siblings, 0 replies; 10+ messages in thread
From: Bas van Tiel @ 2017-01-13 18:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 11, 2017 at 4:33 PM, Dave Martin <Dave.Martin@arm.com> wrote:
> On Wed, Jan 11, 2017 at 02:49:03PM +0000, Catalin Marinas wrote:
>> On Tue, Jan 10, 2017 at 12:14:23PM +0000, Catalin Marinas wrote:
>> > On Mon, Jan 09, 2017 at 07:06:19PM +0100, Bas van Tiel wrote:
>> > > > I defined STACKSIZE to the kernel's SIGSTKSZ (16384) and it seems to run
>> > > > fine, though I'll leave it longer/overnight (on a Juno board). With the
>> > > > 4K signal stack it was crashing shortly after start.
>> > >
>> > > I tried the STACKSIZE of 16384 for both the RPI3 and the PINEA64 board
>> > > and still see the same behaviour of crashing. Sometimes the process
>> > > is also blocked for a long time before it crashes.
>> > >
>> > > Setting the interval to 200 usec [5 Khz] will help to crash it faster.
>> > >
>> > > To further isolate the issue I will create a kernel module (based on a
>> > > hrtimer) that will sent a periodic signal to the registered process
>> > > and execute the same sighandler logic to check if the problem is still
>> > > there.
>> >
>> > I lowered the interval to 100us (it was 100ms in the original file) and
>> > I can indeed trigger segfault easily on Juno. But it doesn't fail in the
>> > same way every time, I sometimes get permission fault, other times bad
>> > frame.
>>
>> With 100us interval, it segfaults on x86 fairly quickly as well, so I
>> don't think it's a kernel issue.
>
> To be able to take a signal at all, stacks need to be at least SIGSTKSZ
> bytes in practice:
>
>
> diff --git a/context_demo.c b/context_demo.c
> index 2cc63f7..b1f3bbc 100644
> --- a/context_demo.c
> +++ b/context_demo.c
> @@ -22,7 +22,7 @@
>
>
>  #define NUMCONTEXTS 10              /* how many contexts to make */
> -#define STACKSIZE 4096              /* stack size */
> +#define STACKSIZE SIGSTKSZ          /* stack size */
>  #define INTERVAL 100                /* timer interval in nanoseconds */
>
>  sigset_t set;                       /* process wide signal mask */
>
>
> The other issue looks a bit subtler, to do with signal masking.
>
> SIGALRM will be masked on entry to timer_interrupt() and restored on
> return, due to the absence of SA_NODEFER from sa_flags when calling
> sigaction.  (Setting SIGALRM in sa_mask also has this effect, but this
> is redundant without SA_NODEFER.)
>
> However, by explicitly clearing this signal from
> signal_context.uc_sigmask, we'll enter scheduler() with SIGALRM
> unmasked.  If a new SIGALRM is taken before scheduler() has called
> setcontext(), we'll pile up another signal frame on signal_stack and
> call scheduler() again, still on signal_stack ... and this can repeat
> indefinitely.
>
> There's no need to clear SIGALRM from the signal mask: it will be
> cleared when timer_interrupt() returns after resuming an interrupted
> task (as part of the signal frame restore work done by rt_sigreturn).
> So:
>
> @@ -61,7 +61,6 @@ timer_interrupt(int j, siginfo_t *si, void *old_context)
>      signal_context.uc_stack.ss_sp = signal_stack;
>      signal_context.uc_stack.ss_size = STACKSIZE;
>      signal_context.uc_stack.ss_flags = 0;
> -    sigemptyset(&signal_context.uc_sigmask);
>      makecontext(&signal_context, scheduler, 1);
>
>      /* save running thread, jump to scheduler */
>
> For me, this seems to fix the problem.
>
> It also makes sense of what we've seen: we need either short timer
> intervals, slow machines, or high system load (or some combination) in
> order to take enough extra signals in scheduler() to cause a stack
> overflow.

So the signals pile up because handling one signal takes longer than the
interval at which new signals arrive => a userspace issue (programming
error). Thank you for the explanation and for pointing to userspace as
the cause.

In my case the unmasking of SIGALRM cannot be done, because that would
hide the fact that the processing is too slow compared to the influx.

I reran the program on the 64-bit rpi3 and was able to run for 6 hours
with an interval of 13 [us], with the contexts executing a while(1)
containing a few nops (isolcpus, sched_fifo, prio:99). Going below
13 [us], the segfault occurs.

>
> I can't see the purpose of running scheduler() in its own context here,
> except so that it doesn't contribute stack overhead to the thread
> stacks (which hardly seems worthwhile, since its overhead is probably a
> lot smaller than the signal overhead anyway -- maybe I'm missing
> something).

The reason is an HPC use case with context scheduling in one process on
a dedicated, isolated core (CONFIG_NO_HZ_FULL). The signal can be seen
as a HW/SW IRQ that takes priority over the main process whenever it
occurs.

> makecontext() and swapcontext() are obsoleted by POSIX.1-2008 and
> considered non-portable (see makecontext(3), swapcontext(3)).  Really,
> the ucontext API should not be used for anything except cooperative
> switching now (certainly this covered the vast majority of real-world
> usage the last time I looked into it).

agree

> For anything else, pthreads
> almost certainly do it better.
>

agree

--
Bas

Thread overview: 10+ messages
2016-12-29 16:38 Unhandled level 2 translation fault (11) at 0x000000b8, esr 0x92000046, rpi3 (aarch64) Bas van Tiel
2016-12-29 17:02 ` Neil Armstrong
2016-12-30  7:13 ` Jisheng Zhang
2016-12-30 12:21   ` Bas van Tiel
2017-01-09 15:13     ` Catalin Marinas
2017-01-09 18:06       ` Bas van Tiel
2017-01-10 12:14         ` Catalin Marinas
2017-01-11 14:49           ` Catalin Marinas
2017-01-11 15:33             ` Dave Martin
2017-01-13 18:47               ` Bas van Tiel
