* am335x: 5.18.x: system stalling
From: Yegor Yefremov @ 2022-05-04 10:35 UTC
  To: Linux-OMAP; +Cc: Tony Lindgren

Hi Tony, all,

Since kernel 5.18.x (5.17.x doesn't show this behavior), the system
stalls as soon as I invoke the following commands (initializing a
USB-to-CAN converter):

slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
ip link set slcan0 up

Have you already seen such an issue? Should I try to bisect this?

[   88.408578] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   88.415777]  (detected by 0, t=2602 jiffies, g=2529, q=17)
[   88.422026] rcu: All QSes seen, last rcu_sched kthread activity 2602 (-21160--23762), jiffies_till_next_fqs=1, root ->qsmask 0x0
[   88.434445] rcu: rcu_sched kthread starved for 2602 jiffies! g2529 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[   88.445274] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   88.454859] rcu: RCU grace-period kthread stack dump:
[   88.460446] task:rcu_sched       state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00000000
[   88.471840]  __schedule from schedule+0x58/0xcc
[   88.477680]  schedule from schedule_timeout+0x78/0xf8
[   88.483754]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3cc
[   88.490629]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[   88.497187]  rcu_gp_kthread from kthread+0xe4/0x104
[   88.503061]  kthread from ret_from_fork+0x14/0x28
[   88.508627] Exception stack(0xd0041fb0 to 0xd0041ff8)
[   88.514443] 1fa0:                                     00000000 00000000 00000000 00000000
[   88.523433] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   88.532374] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   88.539639] rcu: Stack dump where RCU GP kthread last ran:
[   88.545694] NMI backtrace for cpu 0
[   88.549779] CPU: 0 PID: 58 Comm: kworker/0:8 Not tainted 5.18.0-rc5 #1
[   88.557103] Hardware name: Generic AM33XX (Flattened Device Tree)
[   88.563822] Workqueue: events dbs_work_handler
[   88.569398]  unwind_backtrace from show_stack+0x10/0x14
[   88.575662]  show_stack from dump_stack_lvl+0x58/0x70
[   88.581627]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[   88.588345]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[   88.596339]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[   88.605221]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[   88.614328]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xdf0/0xf7c
[   88.622778]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[   88.630182]  update_process_times from tick_sched_handle+0x48/0x54
[   88.637293]  tick_sched_handle from tick_sched_timer+0x48/0xac
[   88.643993]  tick_sched_timer from __hrtimer_run_queues+0x244/0x4d8
[   88.651212]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[   88.658582]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[   88.666506]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[   88.675241]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[   88.682749]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[   88.689639]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[   88.696253]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[   88.703524]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[   88.710195] Exception stack(0xd0001f58 to 0xd0001fa0)
[   88.715947] 1f40:                                                       c01015c8 00000000
[   88.724939] 1f60: 0eae9000 00000000 fffffffe 60000013 ffffffff d0385d74 00000000 c2702a80
[   88.733926] 1f80: 00000002 c2702a80 00000000 d0001fa8 c01015c8 c01015d0 60000113 ffffffff
[   88.742765]  __irq_svc from __do_softirq+0xa0/0x604
[   88.748533]  __do_softirq from __irq_exit_rcu+0x138/0x178
[   88.754961]  __irq_exit_rcu from irq_exit+0x8/0x28
[   88.760758]  irq_exit from call_with_stack+0x18/0x20
[   88.766687]  call_with_stack from __irq_svc+0x9c/0xbc
[   88.772576] Exception stack(0xd0385d40 to 0xd0385d88)
[   88.778458] 5d40: 00000005 00000488 00000000 00000000 c208c0c0 00006402 c208b800 c1874ff0
[   88.787451] 5d60: 00000000 c208c0c0 c1109210 c208c0d8 00000000 d0385d90 c06e068c c06e08a4
[   88.796305] 5d80: 60000013 ffffffff
[   88.800369]  __irq_svc from omap3_noncore_dpll_program+0x3f8/0x5ec
[   88.807588]  omap3_noncore_dpll_program from clk_change_rate+0x23c/0x4f8
[   88.815375]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[   88.822936]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[   88.830056]  clk_set_rate from _set_opp+0x254/0x51c
[   88.835835]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[   88.842073]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x584/0x700
[   88.849792]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[   88.856953]  od_dbs_update from dbs_work_handler+0x2c/0x60
[   88.863441]  dbs_work_handler from process_one_work+0x284/0x72c
[   88.870411]  process_one_work from worker_thread+0x28/0x4b0
[   88.876973]  worker_thread from kthread+0xe4/0x104
[   88.882692]  kthread from ret_from_fork+0x14/0x28
[   88.888225] Exception stack(0xd0385fb0 to 0xd0385ff8)
[   88.893998] 5fa0:                                     00000000 00000000 00000000 00000000
[   88.902971] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   88.911888] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000

Regards,
Yegor

* Re: am335x: 5.18.x: system stalling
From: Tony Lindgren @ 2022-05-05  5:08 UTC
  To: Yegor Yefremov; +Cc: Linux-OMAP, linux-clk, Stephen Boyd

Hi,

* Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> Hi Tony, all,
> 
> since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> stalls as soon as I invoke the following commands (initializing
> USB-to-CAN converter):
> 
> slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> ip link set slcan0 up
> 
> Have you already seen such an issue? Should I try to bisect this?

No, I have not seen this one either. Yes, please bisect if you can.

Note that v5.18-rc1 has the revert commit 859c2c7b1d06 ("Revert "clk: Drop
the rate range on clk_put()"") that you may need to carry along in the
bisect.

Regards,

Tony


* Re: am335x: 5.18.x: system stalling
From: Yegor Yefremov @ 2022-05-11 14:16 UTC
  To: Tony Lindgren; +Cc: Linux-OMAP, linux-clk, Stephen Boyd

Hi Tony,

On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
>
> Hi,
>
> * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > Hi Tony, all,
> >
> > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > stalls as soon as I invoke the following commands (initializing
> > USB-to-CAN converter):
> >
> > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > ip link set slcan0 up
> >
> > Have you already seen such an issue? Should I try to bisect this?
>
> No have not seen this one either, yes please bisect if you can.
>
> Note that v5.18-rc1 has revert commit 859c2c7b1d06 ("Revert "clk: Drop
> the rate range on clk_put()"") that you may need to carry along in the
> bisect.

I had to skip a lot of commits due to assembler-related build issues:

/tmp/cc5p087h.s: Assembler messages:
/tmp/cc5p087h.s:500: Error: invalid literal constant: pool needs to be closer

Hence, I don't have the exact commit:

#There are only 'skip'ped commits left to test.
The first bad commit could be any of:
9cf72c358a20b95e040e6a54a03baf6d264e0719
cafc0eab168917ec9c0cd47d530a40cd40eb2928
23d9a9280efea105852de358f21d69231992ae73
9c46929e7989efacc1dd0a1dd662a839897ea2b6
5fe41793bc78d9bb47fea37d1a16984ad6cf294b
We cannot bisect more!

git bisect log
git bisect start
# good: [f443e374ae131c168a065ea1748feac6b2e76613] Linux 5.17
git bisect good f443e374ae131c168a065ea1748feac6b2e76613
# bad: [672c0c5173427e6b3e2a9bbb7be51ceeec78093a] Linux 5.18-rc5
git bisect bad 672c0c5173427e6b3e2a9bbb7be51ceeec78093a
# bad: [25fd2d41b505d0640bdfe67aa77c549de2d3c18a] selftests: kselftest framework: provide "finished" helper
git bisect bad 25fd2d41b505d0640bdfe67aa77c549de2d3c18a
# bad: [b4bc93bd76d4da32600795cd323c971f00a2e788] Merge tag 'arm-drivers-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad b4bc93bd76d4da32600795cd323c971f00a2e788
# good: [3fe2f7446f1e029b220f7f650df6d138f91651f2] Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 3fe2f7446f1e029b220f7f650df6d138f91651f2
# good: [182966e1cd74ec0e326cd376de241803ee79741b] Merge tag 'media/v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 182966e1cd74ec0e326cd376de241803ee79741b
# good: [49a24e9d9c740d3bd8b1200f225f67d45e3d68a5] Make the SOF control, PCM and PM code IPC agnostic
git bisect good 49a24e9d9c740d3bd8b1200f225f67d45e3d68a5
# bad: [8ffa5709e577385a1c8d20fb434cb02732f1d991] Merge tag 'arm-defconfig-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 8ffa5709e577385a1c8d20fb434cb02732f1d991
# good: [e6aef3496a00a12e78a571f61d98300cf0a86e6a] Merge tag 'm68knommu-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
git bisect good e6aef3496a00a12e78a571f61d98300cf0a86e6a
# bad: [9c0e6a89b592f4c4e4d769dbc22d399ab0685159] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
git bisect bad 9c0e6a89b592f4c4e4d769dbc22d399ab0685159
# skip: [cafc0eab168917ec9c0cd47d530a40cd40eb2928] ARM: v7m: enable support for IRQ stacks
git bisect skip cafc0eab168917ec9c0cd47d530a40cd40eb2928
# skip: [54f481a2308efab49d2b14c3f8263b34fdb1c65e] ARM: remove old-style irq entry
git bisect skip 54f481a2308efab49d2b14c3f8263b34fdb1c65e
# good: [8cdfdf7fe4fec5a952edfb8927ee7cc639c58184] ARM: export dump_mem() to other objects
git bisect good 8cdfdf7fe4fec5a952edfb8927ee7cc639c58184
# bad: [5fe41793bc78d9bb47fea37d1a16984ad6cf294b] ARM: 9176/1: avoid literal references in inline assembly
git bisect bad 5fe41793bc78d9bb47fea37d1a16984ad6cf294b
# good: [90890f17ccd2aa96350abd1f4d37d4667e09027f] ARM: footbridge: use GENERIC_IRQ_MULTI_HANDLER
git bisect good 90890f17ccd2aa96350abd1f4d37d4667e09027f
# good: [4e918ab13eaf40f19938659cb5a22c93172778a8] ARM: assembler: add optimized ldr/str macros to load variables from memory
git bisect good 4e918ab13eaf40f19938659cb5a22c93172778a8
# skip: [9c46929e7989efacc1dd0a1dd662a839897ea2b6] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems
git bisect skip 9c46929e7989efacc1dd0a1dd662a839897ea2b6
# good: [c2755910373bb5dfb9aa68ba2924036686815c9e] ARM: smp: defer TPIDRURO update for SMP v6 configurations too
git bisect good c2755910373bb5dfb9aa68ba2924036686815c9e
# skip: [9cf72c358a20b95e040e6a54a03baf6d264e0719] Merge tag 'arm-irq-and-vmap-stacks-for-rmk' of git://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux into devel-stable
git bisect skip 9cf72c358a20b95e040e6a54a03baf6d264e0719
# skip: [23d9a9280efea105852de358f21d69231992ae73] ARM: 9177/1: disable vmap'ed stacks on suspend-capable SMP configs
git bisect skip 23d9a9280efea105852de358f21d69231992ae73
# only skipped commits left to test
# possible first bad commit: [5fe41793bc78d9bb47fea37d1a16984ad6cf294b] ARM: 9176/1: avoid literal references in inline assembly
# possible first bad commit: [23d9a9280efea105852de358f21d69231992ae73] ARM: 9177/1: disable vmap'ed stacks on suspend-capable SMP configs
# possible first bad commit: [9cf72c358a20b95e040e6a54a03baf6d264e0719] Merge tag 'arm-irq-and-vmap-stacks-for-rmk' of git://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux into devel-stable
# possible first bad commit: [cafc0eab168917ec9c0cd47d530a40cd40eb2928] ARM: v7m: enable support for IRQ stacks
# possible first bad commit: [9c46929e7989efacc1dd0a1dd662a839897ea2b6] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems

Best regards,
Yegor
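
For readers who want a quick overview of which commits in such a log were
skipped rather than tested, here is a minimal Python sketch. The function
name summarize_bisect_log and the sample text are illustrative assumptions,
not part of git. It relies only on the "# good:", "# bad:" and "# skip:"
comment lines that "git bisect log" prints, as seen above. If a mail client
has wrapped those lines, the subject is cut at the wrap, but the hash is
still captured.

```python
import re
from collections import defaultdict

# Matches the comment lines that `git bisect log` writes before each decision,
# e.g. "# skip: [cafc0eab...] ARM: v7m: enable support for IRQ stacks".
VERDICT_RE = re.compile(r"^# (good|bad|skip): \[([0-9a-f]{40})\]\s*(.*)$")


def summarize_bisect_log(text):
    """Group commits from a `git bisect log` by verdict, preserving order."""
    verdicts = defaultdict(list)
    for line in text.splitlines():
        # Drop mail quote markers ("> ") so a quoted log parses as well.
        line = re.sub(r"^(?:>\s?)+", "", line)
        match = VERDICT_RE.match(line)
        if match:
            verdict, sha, subject = match.groups()
            verdicts[verdict].append((sha, subject))
    return dict(verdicts)


# Hypothetical excerpt; in practice, pass the full output of `git bisect log`.
sample = """\
# good: [f443e374ae131c168a065ea1748feac6b2e76613] Linux 5.17
git bisect good f443e374ae131c168a065ea1748feac6b2e76613
# bad: [672c0c5173427e6b3e2a9bbb7be51ceeec78093a] Linux 5.18-rc5
git bisect bad 672c0c5173427e6b3e2a9bbb7be51ceeec78093a
# skip: [cafc0eab168917ec9c0cd47d530a40cd40eb2928] ARM: v7m: enable support for IRQ stacks
git bisect skip cafc0eab168917ec9c0cd47d530a40cd40eb2928
"""

summary = summarize_bisect_log(sample)
for sha, subject in summary.get("skip", []):
    print(sha[:12], subject)
```

The skipped entries are the ones worth retrying once the build failure is
worked around, since they are what keeps the bisect from converging.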

* Re: am335x: 5.18.x: system stalling
From: Tony Lindgren @ 2022-05-12  5:41 UTC
  To: Yegor Yefremov, Ard Biesheuvel, Arnd Bergmann
  Cc: Linux-OMAP, linux-clk, Stephen Boyd, linux-arm-kernel

Hi,

Adding Ard and Arnd for vmap stack.

* Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> Hi Tony,
> 
> On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> >
> > Hi,
> >
> > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > Hi Tony, all,
> > >
> > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > stalls as soon as I invoke the following commands (initializing
> > > USB-to-CAN converter):
> > >
> > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > ip link set slcan0 up
> > >
> > > Have you already seen such an issue? Should I try to bisect this?
> >
> > No have not seen this one either, yes please bisect if you can.
> >
> > Note that v5.18-rc1 has revert commit 859c2c7b1d06 ("Revert "clk: Drop
> > the rate range on clk_put()"") that you may need to carry along in the
> > bisect.
> 
> I had to skip a lot of commits due to assembler related build issues:
> 
> /tmp/cc5p087h.s: Assembler messages:
> /tmp/cc5p087h.s:500: Error: invalid literal constant: pool needs to be closer
> 
> Hence, I don't have the exact commit:
> 
> #There are only 'skip'ped commits left to test.
> The first bad commit could be any of:
> 9cf72c358a20b95e040e6a54a03baf6d264e0719
> cafc0eab168917ec9c0cd47d530a40cd40eb2928
> 23d9a9280efea105852de358f21d69231992ae73
> 9c46929e7989efacc1dd0a1dd662a839897ea2b6
> 5fe41793bc78d9bb47fea37d1a16984ad6cf294b
> We cannot bisect more!

Sounds like you would need to carry some fixes along with the
bisect to avoid hitting multiple bugs. Note that for smc calls we needed
8cf8df89678a ("ARM: OMAP2+: Fix regression for smc calls for
vmap stack"), but that should only affect am3/4 for system suspend.
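
One way to carry fixes along is to apply them temporarily at each bisect
step and let "git bisect run" do the bookkeeping. The following Python
sketch assumes a hypothetical helper layout: the names bisect_verdict and
run_step, and the build and test commands passed in, are illustrative only.
It uses the documented "git bisect run" exit codes (0 for good, 125 for
skip, other codes from 1 to 127 for bad) and treats a fix that does not
apply cleanly, or a failed build, as untestable.

```python
import subprocess

# Exit codes understood by `git bisect run`: 0 marks the commit good,
# 125 asks git to skip it, and any other code from 1 to 127 marks it bad.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_verdict(fix_applied, build_ok, stall_seen):
    """Map the outcome of one bisect step to a `git bisect run` exit code."""
    if not fix_applied or not build_ok:
        return SKIP  # cannot test this commit, so do not blame it
    return BAD if stall_seen else GOOD


def run_step(fix_commits, build_cmd, test_cmd):
    """Apply the fixes on top of the current commit, build, test, clean up."""
    try:
        # Stage each fix without committing so HEAD stays on the bisect commit.
        fix_applied = all(
            subprocess.run(["git", "cherry-pick", "--no-commit", sha]).returncode == 0
            for sha in fix_commits
        )
        build_ok = fix_applied and subprocess.run(build_cmd).returncode == 0
        # test_cmd is expected to exit non-zero when the stall reproduces.
        stall_seen = build_ok and subprocess.run(test_cmd).returncode != 0
        return bisect_verdict(fix_applied, build_ok, stall_seen)
    finally:
        # Drop the temporary fixes so git can check out the next commit.
        subprocess.run(["git", "reset", "--hard"])
```

A wrapper script would end with sys.exit(run_step([...], build_cmd,
test_cmd)), listing the fix commits to carry (for example 8cf8df89678a
mentioned above), and could then be passed to "git bisect run". Since the
stall only shows up on the target board, test_cmd would have to boot the
kernel there and report the result, which this sketch leaves open.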

Maybe Ard and Arnd have some ideas about what might be going wrong here.
Basically, anything that tries to use a physical address for memory on the
stack will fail in weird ways, as we've seen for smc and wl1251.

Regards,

Tony

> > > [   88.800369]  __irq_svc from omap3_noncore_dpll_program+0x3f8/0x5ec
> > > [   88.807588]  omap3_noncore_dpll_program from clk_change_rate+0x23c/0x4f8
> > > [   88.815375]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
> > > [   88.822936]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
> > > [   88.830056]  clk_set_rate from _set_opp+0x254/0x51c
> > > [   88.835835]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
> > > [   88.842073]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x584/0x700
> > > [   88.849792]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
> > > [   88.856953]  od_dbs_update from dbs_work_handler+0x2c/0x60
> > > [   88.863441]  dbs_work_handler from process_one_work+0x284/0x72c
> > > [   88.870411]  process_one_work from worker_thread+0x28/0x4b0
> > > [   88.876973]  worker_thread from kthread+0xe4/0x104
> > > [   88.882692]  kthread from ret_from_fork+0x14/0x28
> > > [   88.888225] Exception stack(0xd0385fb0 to 0xd0385ff8)
> > > [   88.893998] 5fa0:                                     00000000
> > > 00000000 00000000 00000000
> > > [   88.902971] 5fc0: 00000000 00000000 00000000 00000000 00000000
> > > 00000000 00000000 00000000
> > > [   88.911888] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > >
> > > Regards,
> > > Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-12  5:41       ` Tony Lindgren
@ 2022-05-12  8:14         ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-12  8:14 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Yegor Yefremov, Ard Biesheuvel, Arnd Bergmann, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> Adding Ard and Arnd for vmap stack.

Thanks!

> * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:

>
> Maybe Ard and Arnd have some ideas what might be going wrong here.
> Basically anything trying to use a physical address on stack will
> fail in weird ways like we've seen for smc and wl1251.

For this, the first step should be to enable CONFIG_DMA_API_DEBUG.
If any device is getting the wrong DMA address for a stack variable,
this should print a helpful debug message to the console.
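
Before rebooting into a test kernel it can help to confirm that the option
really ended up in the build. A minimal sketch, assuming a plain-text kernel
config file (a build tree's .config, or a decompressed /proc/config.gz where
that is available); kconfig_enabled is a helper name made up for this
illustration, not an existing tool:

```shell
#!/bin/sh
# kconfig_enabled <config-file> <OPTION>
# Hypothetical helper: succeeds only if CONFIG_<OPTION>=y is present in the
# given config file. OPTION is passed without the CONFIG_ prefix.
kconfig_enabled() {
    grep -q "^CONFIG_$2=y\$" "$1"
}
```

In a kernel build tree this would be called as
"kconfig_enabled .config DMA_API_DEBUG && echo enabled". The option itself is
usually switched on with "scripts/config --enable DMA_API_DEBUG" followed by
"make olddefconfig".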

> > > > [   88.408578] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > > > [   88.415777]  (detected by 0, t=2602 jiffies, g=2529, q=17)
> > > > [   88.422026] rcu: All QSes seen, last rcu_sched kthread activity
> > > > 2602 (-21160--23762), jiffies_till_next_fqs=1, root ->qsmask 0x0
> > > > [   88.434445] rcu: rcu_sched kthread starved for 2602 jiffies! g2529
> > > > f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> > > > [   88.445274] rcu:     Unless rcu_sched kthread gets sufficient CPU
> > > > time, OOM is now expected behavior.
> > > > [   88.454859] rcu: RCU grace-period kthread stack dump:

I looked for a smoking gun in the backtrace but didn't really find anything,
so I'm guessing the problem is something that happened between the
last timer tick and the time the rcu_gp_kthread actually ran, maybe
some DMA timeout in a device driver running with interrupts disabled.

> > > > [   88.807588]  omap3_noncore_dpll_program from clk_change_rate+0x23c/0x4f8
> > > > [   88.815375]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
> > > > [   88.822936]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
> > > > [   88.830056]  clk_set_rate from _set_opp+0x254/0x51c
> > > > [   88.835835]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
> > > > [   88.842073]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x584/0x700
> > > > [   88.849792]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
> > > > [   88.856953]  od_dbs_update from dbs_work_handler+0x2c/0x60
> > > > [   88.863441]  dbs_work_handler from process_one_work+0x284/0x72c
> > > > [   88.870411]  process_one_work from worker_thread+0x28/0x4b0
> > > > [   88.876973]  worker_thread from kthread+0xe4/0x104
> > > > [   88.882692]  kthread from ret_from_fork+0x14/0x28

The only thing I see that is slightly unusual here is that the timer tick
happened exactly during the cpufreq transition. Is this always the same
backtrace when you run into the bug? What happens when you disable the omap3
cpufreq driver or set it to run at a fixed frequency?
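
One way to try the fixed-frequency case without rebuilding is the userspace
governor. A rough sketch, assuming the standard cpufreq sysfs layout with a
single policy0 directory; POLICY_DIR and pin_frequency are names invented for
this example:

```shell
#!/bin/sh
# Assumption: one cpufreq policy (policy0), as is typical on a single-core SoC.
# POLICY_DIR can be overridden, e.g. to dry-run against a scratch directory.
POLICY_DIR="${POLICY_DIR:-/sys/devices/system/cpu/cpufreq/policy0}"

# pin_frequency <kHz>
# Select the userspace governor, then request a fixed frequency in kHz.
pin_frequency() {
    printf 'userspace\n' > "$POLICY_DIR/scaling_governor" || return 1
    printf '%s\n' "$1" > "$POLICY_DIR/scaling_setspeed"
}
```

Valid values for the argument can normally be read from
scaling_available_frequencies in the same directory, if the cpufreq driver
provides that file.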

          Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-12  5:41       ` Tony Lindgren
@ 2022-05-12  8:42         ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-12  8:42 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Yegor Yefremov, Ard Biesheuvel, Arnd Bergmann, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > Hi Tony, all,
> > > >
> > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > stalls as soon as I invoke the following commands (initializing
> > > > USB-to-CAN converter):
> > > >
> > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > ip link set slcan0 up

Oh, I missed this part at first and only looked at the backtrace. Which CAN
driver are you using? It's likely a problem in the kernel driver.

CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
just see it by looking at the right source file.

       Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-12  8:42         ` Arnd Bergmann
@ 2022-05-12 10:20           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-12 10:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tony Lindgren, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Arnd,

On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > Hi Tony, all,
> > > > >
> > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > stalls as soon as I invoke the following commands (initializing
> > > > > USB-to-CAN converter):
> > > > >
> > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > ip link set slcan0 up
>
> Oh, I missed this part at first and only looked at the backtrace. Which CAN
> driver are you using? It's likely a problem in the kernel driver.

I am using the slcan driver [1].

> CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> just see it by looking at the right source file.

I'll try to get more debug info with CONFIG_DMA_API_DEBUG.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/can/slcan.c?h=v5.18-rc6

Yegor
>        Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-12 10:20           ` Yegor Yefremov
@ 2022-05-19 16:52             ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-19 16:52 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tony Lindgren, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Arnd,

On Thu, May 12, 2022 at 12:20 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> Hi Arnd,
>
> On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > > Hi Tony, all,
> > > > > >
> > > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > > stalls as soon as I invoke the following commands (initializing
> > > > > > USB-to-CAN converter):
> > > > > >
> > > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > > ip link set slcan0 up
> >
> > Oh, I missed this part at first and only looked at the backtrace. Which CAN
> > driver are you using? It's likely a problem in the kernel driver.
>
> I am using the slcan driver [1].
>
> > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > just see it by looking at the right source file.
>
> I'll try to get more debug info with CONFIG_DMA_API_DEBUG.

DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
"solved" the problem. I have tried different governors and got these
two groups:

ondemand, schedutil - cause the problem
conservative, powersave, performance and userspace - don't cause the problem

So far, I have only seen the same debug output that I initially sent,
and in most cases the system stalls without any output.
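
For anyone who wants to repeat the governor comparison, switching can be
scripted roughly like this. It is only a sketch, assuming the usual cpufreq
sysfs tree; CPUFREQ_ROOT, set_governor and show_governors are names made up
for the example:

```shell
#!/bin/sh
# Root of the cpufreq sysfs tree. CPUFREQ_ROOT is an illustrative variable so
# the helpers can also be pointed at a scratch directory for a dry run.
CPUFREQ_ROOT="${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpufreq}"

# set_governor <name>
# Write <name> to the scaling_governor file of every policy that has one.
set_governor() {
    for policy in "$CPUFREQ_ROOT"/policy*; do
        [ -w "$policy/scaling_governor" ] || continue
        printf '%s\n' "$1" > "$policy/scaling_governor"
    done
}

# show_governors
# Print "<policy>: <governor>" for every policy found.
show_governors() {
    for policy in "$CPUFREQ_ROOT"/policy*; do
        [ -r "$policy/scaling_governor" ] || continue
        printf '%s: %s\n' "${policy##*/}" "$(cat "$policy/scaling_governor")"
    done
}
```

Running "set_governor ondemand" and then the slcand and ip link commands
should then exercise the failing case, while "set_governor performance"
should exercise one of the working ones.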

Yegor
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/can/slcan.c?h=v5.18-rc6
>
> Yegor
> >        Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-19 16:52             ` Yegor Yefremov
@ 2022-05-21 19:41               ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-21 19:41 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Ard Biesheuvel, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Thu, May 19, 2022 at 5:52 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Thu, May 12, 2022 at 12:20 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > > > Hi Tony, all,
> > > > > > >
> > > > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > > > stalls as soon as I invoke the following commands (initializing
> > > > > > > USB-to-CAN converter):
> > > > > > >
> > > > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > > > ip link set slcan0 up
> > >
> > > Oh, I missed this part at first and only looked at the backtrace.
> > > Which CAN driver are you using? It's likely a problem in the kernel driver.
> >
> > I am using the slcan driver [1].

Ok, so this is just a serial port based driver, which means the follow-up
question is what you use for your uart. Is this one of the USB-serial ones
or an on-chip uart? Which driver?

> > > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > > just see it by looking at the right source file.
> >
> > I'll try to get more debug info with CONFIG_DMA_API_DEBUG.
>
> DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
> "solved" the problem. I have tried different governors and got these
> two groups:
>
> ondemand, schedutil - cause the problem
> conservative, powersave, performance and userspace - don't cause the problem
>
> So far, I have only seen the same debug output that I've initially
> sent and in most cases, the system stalls without the output.

Ok, so that sounds like it happens when you change the frequency.
I assume this means you are using drivers/cpufreq/omap-cpufreq.c?

When using the userspace governor, do you see problems when you
manually change the frequency from sysfs?

          Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-21 19:41               ` Arnd Bergmann
@ 2022-05-24 13:38                 ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-24 13:38 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tony Lindgren, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Arnd,

On Sat, May 21, 2022 at 9:41 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, May 19, 2022 at 5:52 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Thu, May 12, 2022 at 12:20 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > > On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > > > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > > > > Hi Tony, all,
> > > > > > > >
> > > > > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > > > > stalls as soon as I invoke the following commands (initializing
> > > > > > > > USB-to-CAN converter):
> > > > > > > >
> > > > > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > > > > ip link set slcan0 up
> > > >
> > > > Oh, I missed this part at first and only looked at the backtrace.
> > > > Which CAN driver are you using? It's likely a problem in the kernel driver.
> > >
> > > I am using the slcan driver [1].
>
> Ok, so this is just a serial port based driver, which means the follow-up
> question is what you use for your uart. Is this one of the USB-serial ones
> or an on-chip uart? Which driver?

This is the following chain: am335x -> musb -> ftdi_sio (FT-X flavor).

I have also tried another system with two FT4232 chips (RS232 devices)
and performed transmission tests. This had no effect; the system
didn't stall.

> > > > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > > > just see it by looking at the right source file.
> > >
> > > I'll try to get more debug info with CONFIG_DMA_API_DEBUG.
> >
> > DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
> > "solved" the problem. I have tried different governors and got these
> > two groups:
> >
> > ondemand, schedutil - cause the problem
> > conservative, powersave, performance and userspace - don't cause the problem
> >
> > So far, I have only seen the same debug output that I've initially
> > sent and in most cases, the system stalls without the output.
>
> Ok, so that sounds like it happens when you change the frequency.
> I assume this means you are using drivers/cpufreq/omap-cpufreq.c?

Yes.

> When using the userspace governor, do you see problems when you
> manually change the frequency from sysfs?

No, I can switch between 300MHz and 600MHz and perform CAN tests.
Everything goes well.
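
A manual test of this kind can be scripted. The following is a minimal
sketch, not the exact procedure used here: it assumes the standard cpufreq
sysfs files of CPU0 (scaling_governor and scaling_setspeed) and frequencies
given in kHz. The helper names and the CPUFREQ_DIR and TOGGLE_DELAY
variables are made up for illustration.

```shell
# Assumption: cpufreq policy directory of CPU0; override CPUFREQ_DIR for a dry run.
CPUFREQ_DIR=${CPUFREQ_DIR:-/sys/devices/system/cpu/cpu0/cpufreq}

# Select a governor by name, e.g. "userspace" or "ondemand".
set_governor() {
    echo "$1" > "$CPUFREQ_DIR/scaling_governor"
}

# Request a fixed frequency in kHz; only meaningful under the userspace governor.
set_freq_khz() {
    echo "$1" > "$CPUFREQ_DIR/scaling_setspeed"
}

# Alternate between two frequencies COUNT times.
# usage: toggle_freq COUNT KHZ_A KHZ_B
toggle_freq() {
    count=$1
    while [ "$count" -gt 0 ]; do
        set_freq_khz "$2"; sleep "${TOGGLE_DELAY:-1}"
        set_freq_khz "$3"; sleep "${TOGGLE_DELAY:-1}"
        count=$((count - 1))
    done
}
```

Run as root on the target while the CAN test is active, "set_governor
userspace" followed by "toggle_freq 100 300000 600000" would alternate
between 300MHz and 600MHz.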

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-24 13:38                 ` Yegor Yefremov
@ 2022-05-24 14:19                   ` Tony Lindgren
  -1 siblings, 0 replies; 115+ messages in thread
From: Tony Lindgren @ 2022-05-24 14:19 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

* Yegor Yefremov <yegorslists@googlemail.com> [220524 13:34]:
> Hi Arnd,
> 
> On Sat, May 21, 2022 at 9:41 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Thu, May 19, 2022 at 5:52 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > > On Thu, May 12, 2022 at 12:20 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > >
> > > > > On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > > > > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > > > > > Hi Tony, all,
> > > > > > > > >
> > > > > > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > > > > > stalls as soon as I invoke the following commands (initializing
> > > > > > > > > USB-to-CAN converter):
> > > > > > > > >
> > > > > > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > > > > > ip link set slcan0 up
> > > > >
> > > > > Oh, I missed this part at first and only looked at the backtrace.
> > > > > Which CAN driver are you using? It's likely a problem in the kernel driver.
> > > >
> > > > I am using the slcan driver [1].
> >
> > Ok, so this is just a serial port based driver, which means the follow-up
> > question is what you use for your uart. Is this one of the USB-serial ones
> > or an on-chip uart? Which driver?
> 
> This is the following chain: am335x -> musb -> ftdi_sio (FT-X flavor).
> 
> I have also tried another system with two FT4232 chips (RS232 devices)
> and performed transmission tests. This had no effect, the system
> didn't stall.

Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
better or worse :)
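
For reference, one way to flip this option is a one-line config fragment.
This is only a sketch: the fragment file name is arbitrary, and the merge
step assumes it is run from the top of an already configured kernel tree
that ships scripts/kconfig/merge_config.sh.

```shell
# Write a one-line Kconfig fragment (the file name is arbitrary).
cat > musb-pio.config <<'EOF'
CONFIG_MUSB_PIO_ONLY=y
EOF

# Merge it only when run from the top of a configured kernel source tree.
if [ -f .config ] && [ -x scripts/kconfig/merge_config.sh ]; then
    # -m merges the fragment into .config without running make itself.
    scripts/kconfig/merge_config.sh -m .config musb-pio.config
    # Let Kconfig resolve dependent options to their defaults.
    make ARCH=arm olddefconfig
fi
```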

> > > > > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > > > > just see it by looking at the right source file.
> > > >
> > > > I'll try to get more debug info with CONFIG_DMA_API_DEBUG.
> > >
> > > DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
> > > "solved" the problem. I have tried different governors and got these
> > > two groups:
> > >
> > > ondemand, schedutil - cause the problem
> > > conservative, powersave, performance and userspace - don't cause the problem
> > >
> > > So far, I have only seen the same debug output that I've initially
> > > sent and in most cases, the system stalls without the output.
> >
> > Ok, so that sounds like it happens when you change the frequency.
> > I assume this means you are using drivers/cpufreq/omap-cpufreq.c?
> 
> Yes.
> 
> > When using the userspace governor, do you see problems when you
> > manually change the frequency from sysfs?
> 
> No, I can switch between 300MHz and 600MHz and perform CAN tests.
> Everything goes well.

OK so not cpufreq related.

Regards,

Tony

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-24 13:38                 ` Yegor Yefremov
@ 2022-05-24 14:36                   ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-24 14:36 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Ard Biesheuvel, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Tue, May 24, 2022 at 3:38 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Sat, May 21, 2022 at 9:41 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On Thu, May 19, 2022 at 5:52 PM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > Ok, so this is just a serial port based driver, which means the follow-up
> > question is what you use for your uart. Is this one of the USB-serial ones
> > or an on-chip uart? Which driver?
>
> This is the following chain: am335x -> musb -> ftdi_sio (FT-X flavor).
>
> I have also tried another system with two FT4232 chips (RS232 devices)
> and performed transmission tests. This had no effect, the system
> didn't stall.

Ok, I see. I looked at ftdi_sio, and found a couple of slightly suspicious
code paths in the FT-X specific bits, but after looking more closely I
found nothing actually wrong with them.

It might still be worth trying more combinations of those, e.g. whether the
FT-X UART fails without the CAN adapter, or whether it fails on the other
machine.

> > > > > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > > > > just see it by looking at the right source file.
> > > >
> > > > I'll try to get more debug info with CONFIG_DMA_API_DEBUG.
> > >
> > > DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
> > > "solved" the problem. I have tried different governors and got these
> > > two groups:
> > >
> > > ondemand, schedutil - cause the problem
> > > conservative, powersave, performance and userspace - don't cause the problem
> > >
> > > So far, I have only seen the same debug output that I've initially
> > > sent and in most cases, the system stalls without the output.
> >
> > Ok, so that sounds like it happens when you change the frequency.
> > I assume this means you are using drivers/cpufreq/omap-cpufreq.c?
>
> Yes.
>
> > When using the userspace governor, do you see problems when you
> > manually change the frequency from sysfs?
>
> No, I can switch between 300MHz and 600MHz and perform CAN tests.
> Everything goes well.

One more idea: maybe this is a case where we actually run out of stack
space? Without VMAP stacks, that may easily go unnoticed, but with
VMAP stack it is supposed to produce an obvious error message with a
backtrace. If we have a callchain that involves

can_xmit -> tty -> tty_usb -> usb -> musb -> schedule -> cpufreq_update_util
 -> omap_cpufreq

we might run out of the 8KB stack area. It's probably not this, but if you
want to rule it out, try using

#define THREAD_SIZE_ORDER       2
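
To put numbers on this: assuming the 32-bit ARM definition
THREAD_SIZE = PAGE_SIZE << THREAD_SIZE_ORDER and 4 KiB pages (both are
assumptions worth checking in arch/arm/include/asm/thread_info.h of the
tree in use), order 1 gives the 8KB mentioned above and order 2 doubles it
to 16KB. The helper name below is hypothetical.

```shell
# Assumption: 4 KiB pages, as is usual on 32-bit ARM.
PAGE_SIZE=4096

# Kernel stack size in bytes for a given THREAD_SIZE_ORDER,
# assuming THREAD_SIZE = PAGE_SIZE << THREAD_SIZE_ORDER.
thread_size() {
    echo $((PAGE_SIZE << $1))
}

for order in 1 2; do
    echo "THREAD_SIZE_ORDER=$order: $(thread_size "$order") bytes"
done
```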

       Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-24 14:19                   ` Tony Lindgren
@ 2022-05-26  5:49                     ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-26  5:49 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Arnd Bergmann, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Tony,

On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
>
> * Yegor Yefremov <yegorslists@googlemail.com> [220524 13:34]:
> > Hi Arnd,
> >
> > On Sat, May 21, 2022 at 9:41 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Thu, May 19, 2022 at 5:52 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Thu, May 12, 2022 at 12:20 PM Yegor Yefremov
> > > > <yegorslists@googlemail.com> wrote:
> > > > > On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > >
> > > > > > On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > > > > > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > > > > > > Hi Tony, all,
> > > > > > > > > >
> > > > > > > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > > > > > > stalls as soon as I invoke the following commands (initializing
> > > > > > > > > > USB-to-CAN converter):
> > > > > > > > > >
> > > > > > > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > > > > > > ip link set slcan0 up
> > > > > >
> > > > > > Oh, I missed this part at first and only looked at the backtrace.
> > > > > > Which CAN driver are you using? It's likely a problem in the kernel driver.
> > > > >
> > > > > I am using the slcan driver [1].
> > >
> > > Ok, so this is just a serial port based driver, which means the follow-up
> > > question is what you use for your uart. Is this one of the USB-serial ones
> > > or an on-chip uart? Which driver?
> >
> > This is the following chain: am335x -> musb -> ftdi_sio (FT-X flavor).
> >
> > I have also tried another system with two FT4232 chips (RS232 devices)
> > and performed transmission tests. This had no effect, the system
> > didn't stall.
>
> Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> better or worse :)

PIO is always the last resort :-) And it has proven itself again: with
PIO_ONLY the system doesn't stall.

Regards,
Yegor

> > > > > > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > > > > > just see it by looking at the right source file.
> > > > >
> > > > > I'll try to get more debug info with CONFIG_DMA_API_DEBUG.
> > > >
> > > > DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
> > > > "solved" the problem. I have tried different governors and got these
> > > > two groups:
> > > >
> > > > ondemand, schedutil - cause the problem
> > > > conservative, powersave, performance and userspace - don't cause the problem
> > > >
> > > > So far, I have only seen the same debug output that I've initially
> > > > sent and in most cases, the system stalls without the output.
> > >
> > > Ok, so that sounds like it happens when you change the frequency.
> > > I assume this means you are using drivers/cpufreq/omap-cpufreq.c?
> >
> > Yes.
> >
> > > When using the userspace governor, do you see problems when you
> > > manually change the frequency from sysfs?
> >
> > No, I can switch between 300MHz and 600MHz and perform CAN tests.
> > Everything goes well.
>
> OK so not cpufreq related.
>
> Regards,
>
> Tony

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
@ 2022-05-26  5:49                     ` Yegor Yefremov
  0 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-26  5:49 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Arnd Bergmann, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Tony,

On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
>
> * Yegor Yefremov <yegorslists@googlemail.com> [220524 13:34]:
> > Hi Arnd,
> >
> > On Sat, May 21, 2022 at 9:41 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Thu, May 19, 2022 at 5:52 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Thu, May 12, 2022 at 12:20 PM Yegor Yefremov
> > > > <yegorslists@googlemail.com> wrote:
> > > > > On Thu, May 12, 2022 at 10:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > >
> > > > > > On Thu, May 12, 2022 at 7:41 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220511 14:16]:
> > > > > > > > On Thu, May 5, 2022 at 7:08 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220504 10:35]:
> > > > > > > > > > Hi Tony, all,
> > > > > > > > > >
> > > > > > > > > > since kernel 5.18.x (5.17.x doesn't show this behavior), the system
> > > > > > > > > > stalls as soon as I invoke the following commands (initializing
> > > > > > > > > > USB-to-CAN converter):
> > > > > > > > > >
> > > > > > > > > > slcand -o -s8 -t hw -S 3000000 /dev/ttyUSB0
> > > > > > > > > > ip link set slcan0 up
> > > > > >
> > > > > > Oh, I missed this part at first and only looked at the backtrace.
> > > > > > Which CAN driver
> > > > > > are you using? It's likely a problem in the kernel driver.
> > > > >
> > > > > I am using the slcan driver [1].
> > >
> > > Ok, so this is just a serial port based driver, which means the
> > > follow-up question
> > > is what you use for your uart. Is this one of the USB-serial ones or an on-chip
> > > uart? Which driver?
> >
> > This is the following chain: am335x -> musb-> ftdi_sio (FT-X flavor).
> >
> > I have also tried another system with two FT4232 chips (RS232 devices)
> > and performed transmission tests. This had no effect, the system
> > didn't stall.
>
> Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> better or worse :)

PIO is always the last resort :-) And now it proves it again. With
PIO_ONLY the system doesn't stall.

Regards,
Yegor

> > > > > > CONFIG_DMA_API_DEBUG is still likely to pinpoint the bug, but I might also
> > > > > > just see it by looking at the right source file.
> > > > >
> > > > > I'll try to get more debug info with CONFIG_DMA_API_DEBUG.
> > > >
> > > > DMA_API_DEBUG showed nothing new. But disabling the CPUfreq driver
> > > > "solved" the problem. I have tried different governors and got these
> > > > two groups:
> > > >
> > > > ondemand, schedutil - cause the problem
> > > > conservative, powersave, performance and userspace - don't cause the problem
> > > >
> > > > So far, I have only seen the same debug output that I've initially
> > > > sent and in most cases, the system stalls without the output.
> > >
> > > Ok, so that sounds like it happens when you change the frequency.
> > > I assume this means you are using drivers/cpufreq/omap-cpufreq.c?
> >
> > Yes.
> >
> > > When using the userspace governor, do you see problems when you
> > > manually change the frequency from sysfs?
> >
> > No, I can switch between 300MHz and 600MHz and perform CAN tests.
> > Everything goes well.
>
> OK so not cpufreq related.
>
> Regards,
>
> Tony

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-26  5:49                     ` Yegor Yefremov
@ 2022-05-26  6:20                       ` Tony Lindgren
  -1 siblings, 0 replies; 115+ messages in thread
From: Tony Lindgren @ 2022-05-26  6:20 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

* Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > better or worse :)
> 
> PIO is always the last resort :-) And now it proves it again. With
> PIO_ONLY the system doesn't stall.

OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
is. Or maybe there's something using stack for buffers being passed to
dma again that breaks with vmap stack.

Regards,

Tony

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-26  6:20                       ` Tony Lindgren
@ 2022-05-26  8:19                         ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-05-26  8:19 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Yegor Yefremov, Arnd Bergmann, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Thu, 26 May 2022 at 08:20, Tony Lindgren <tony@atomide.com> wrote:
>
> * Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> > On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > > better or worse :)
> >
> > PIO is always the last resort :-) And now it proves it again. With
> > PIO_ONLY the system doesn't stall.
>
> OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
> with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
> is. Or maybe there's something using stack for buffers being passed to
> dma again that breaks with vmap stack.
>

In order to confirm this theory, could you please try rebuilding your
kernel with CONFIG_VMAP_STACK disabled, and leave everything else as
before?

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-26  8:19                         ` Ard Biesheuvel
@ 2022-05-26 12:37                           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-26 12:37 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Tony Lindgren, Arnd Bergmann, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Ard,

On Thu, May 26, 2022 at 10:19 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 26 May 2022 at 08:20, Tony Lindgren <tony@atomide.com> wrote:
> >
> > * Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> > > On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > > > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > > > better or worse :)
> > >
> > > PIO is always the last resort :-) And now it proves it again. With
> > > PIO_ONLY the system doesn't stall.
> >
> > OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
> > with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
> > is. Or maybe there's something using stack for buffers being passed to
> > dma again that breaks with vmap stack.
> >
>
> In order to confirm this theory, could you please try rebuilding your
> kernel with CONFIG_VMAP_STACK disabled, and leave everything else as
> before?

I have disabled the CONFIG_VMAP_STACK option:

# zcat /proc/config.gz | grep VMAP_STACK
CONFIG_HAVE_ARCH_VMAP_STACK=y
# CONFIG_VMAP_STACK is not set

The system stalls.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-26 12:37                           ` Yegor Yefremov
@ 2022-05-26 14:15                             ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-26 14:15 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Ard Biesheuvel, Tony Lindgren, Arnd Bergmann, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Thu, May 26, 2022 at 2:37 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Thu, May 26, 2022 at 10:19 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 26 May 2022 at 08:20, Tony Lindgren <tony@atomide.com> wrote:
> > >
> > > * Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> > > > On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > > > > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > > > > better or worse :)
> > > >
> > > > PIO is always the last resort :-) And now it proves it again. With
> > > > PIO_ONLY the system doesn't stall.
> > >
> > > OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
> > > with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
> > > is. Or maybe there's something using stack for buffers being passed to
> > > dma again that breaks with vmap stack.
> > >
> >
> > In order to confirm this theory, could you please try rebuilding your
> > kernel with CONFIG_VMAP_STACK disabled, and leave everything else as
> > before?
>
> I have disabled the CONFIG_VMAP_STACK option:
>
> # zcat /proc/config.gz | grep VMAP_STACK
> CONFIG_HAVE_ARCH_VMAP_STACK=y
> # CONFIG_VMAP_STACK is not set
>
> The system stalls.

Ok, I guess that means we can stop looking for invalid DMA buffers
on stacks. Out of the original commits you listed as possible causes,
we can also rule out 23d9a9280efe ("ARM: 9177/1: disable vmap'ed
stacks on suspend-capable SMP configs") and cafc0eab1689
("ARM: v7m: enable support for IRQ stacks"). It could still be
9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for
uniprocessor systems") and 5fe41793bc78 ("ARM: 9176/1: avoid
literal references in inline assembly") or possibly the merge.

Can you post the whole .config file somewhere for reference?
In particular, do you have CONFIG_SMP, CONFIG_LD_IS_LLD
or CURRENT_POINTER_IN_TPIDRURO set?

      Arnd
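
A minimal sketch of how the three symbols asked about above could be
checked in one go, extending the zcat/grep check quoted earlier in this
message. The helper name config_state is hypothetical and not part of
the kernel tree. It assumes a plain-text config file, for example one
produced with "zcat /proc/config.gz > /tmp/running.config". A symbol
that is missing from the file entirely is reported the same way as one
that is explicitly disabled.

```shell
#!/bin/sh
# config_state FILE SYMBOL...
# Hypothetical helper: print the setting of each Kconfig symbol found in
# FILE. Symbols are given without the CONFIG_ prefix.
config_state() {
    cfg=$1
    shift
    for sym in "$@"; do
        if grep -q "^CONFIG_${sym}=" "$cfg"; then
            # Symbol has a value (=y, =m, a number or a string).
            grep "^CONFIG_${sym}=" "$cfg"
        else
            # Explicitly disabled, or absent because its dependencies
            # are not met; both cases are reported identically here.
            echo "# CONFIG_${sym} is not set"
        fi
    done
}

# Possible usage on the target (assumes /proc/config.gz is available):
#   zcat /proc/config.gz > /tmp/running.config
#   config_state /tmp/running.config SMP LD_IS_LLD CURRENT_POINTER_IN_TPIDRURO
```

The trailing "=" in the pattern keeps CONFIG_SMP from also matching
longer names such as CONFIG_SMP_ON_UP.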

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-26 14:15                             ` Arnd Bergmann
@ 2022-05-27  4:44                               ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-27  4:44 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ard Biesheuvel, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Arnd,

On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, May 26, 2022 at 2:37 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Thu, May 26, 2022 at 10:19 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Thu, 26 May 2022 at 08:20, Tony Lindgren <tony@atomide.com> wrote:
> > > >
> > > > * Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> > > > > On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > > > > > better or worse :)
> > > > >
> > > > > PIO is always the last resort :-) And now it proves it again. With
> > > > > PIO_ONLY the system doesn't stall.
> > > >
> > > > OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
> > > > with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
> > > > is. Or maybe there's something using stack for buffers being passed to
> > > > dma again that breaks with vmap stack.
> > > >
> > >
> > > In order to confirm this theory, could you please try rebuilding your
> > > kernel with CONFIG_VMAP_STACK disabled, and leave everything else as
> > > before?
> >
> > I have disabled the CONFIG_VMAP_STACK option:
> >
> > # zcat /proc/config.gz | grep VMAP_STACK
> > CONFIG_HAVE_ARCH_VMAP_STACK=y
> > # CONFIG_VMAP_STACK is not set
> >
> > The system stalls.
>
> Ok, I guess that means we can stop looking for invalid DMA buffers
> on stacks. Out of the original commits you listed as possible causes,
> we can also rule out 23d9a9280efe ("ARM: 9177/1: disable vmap'ed
> stacks on suspend-capable SMP configs") and cafc0eab1689
> ("ARM: v7m: enable support for IRQ stacks"). It could still be
> 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for
> uniprocessor systems") and 5fe41793bc78 ("ARM: 9176/1: avoid
> literal references in inline assembly") or possibly the merge.
>
> Can you post the whole .config file somewhere for reference?
> In particular, do you have CONFIG_SMP, CONFIG_LD_IS_LLD
> or CURRENT_POINTER_IN_TPIDRURO set?

This is my config [1] and this is the system in question [2].

[1] https://github.com/visionsystemsgmbh/onrisc_br_bsp/blob/master/board/vscom/baltos/linux-experimental-config
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am335x-baltos-ir5221.dts?h=v5.18

Regards,
Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27  4:44                               ` Yegor Yefremov
@ 2022-05-27  6:38                                 ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-27  6:38 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 6:44 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Thu, May 26, 2022 at 2:37 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > > On Thu, May 26, 2022 at 10:19 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Thu, 26 May 2022 at 08:20, Tony Lindgren <tony@atomide.com> wrote:
> > > > >
> > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> > > > > > On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > > > > > > better or worse :)
> > > > > >
> > > > > > PIO is always the last resort :-) And now it proves it again. With
> > > > > > PIO_ONLY the system doesn't stall.
> > > > >
> > > > > OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
> > > > > with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
> > > > > is. Or maybe there's something using stack for buffers being passed to
> > > > > dma again that breaks with vmap stack.
> > > > >
> > > >
> > > > In order to confirm this theory, could you please try rebuilding your
> > > > kernel with CONFIG_VMAP_STACK disabled, and leave everything else as
> > > > before?
> > >
> > > I have disabled the CONFIG_VMAP_STACK option:
> > >
> > > # zcat /proc/config.gz | grep VMAP_STACK
> > > CONFIG_HAVE_ARCH_VMAP_STACK=y
> > > # CONFIG_VMAP_STACK is not set
> > >
> > > The system stalls.
> >
> > Ok, I guess that means we can stop looking for invalid DMA buffers
> > on stacks. Out of the original commits you listed as possible causes,
> > we can also rule out 23d9a9280efe ("ARM: 9177/1: disable vmap'ed
> > stacks on suspend-capable SMP configs") and cafc0eab1689
> > ("ARM: v7m: enable support for IRQ stacks"). It could still be
> > 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for
> > uniprocessor systems") and 5fe41793bc78 ("ARM: 9176/1: avoid
> > literal references in inline assembly") or possibly the merge.
> >
> > Can you post the whole .config file somewhere for reference?
> > In particular, do you have CONFIG_SMP, CONFIG_LD_IS_LLD
> > or CURRENT_POINTER_IN_TPIDRURO set?
>
> This is my config [1] and this is the system in question [2].
>
> [1] https://github.com/visionsystemsgmbh/onrisc_br_bsp/blob/master/board/vscom/baltos/linux-experimental-config

Thanks! The first thing I noticed in here is that this config enables both
CONFIG_ARCH_MULTI_V6 (for OMAP2) and CONFIG_SMP, which
gets you into a couple of corner cases that nobody else hits in practice.

Can you still reproduce the problem if you turn off both of these?

       Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27  6:38                                 ` Arnd Bergmann
@ 2022-05-27  6:50                                   ` Tony Lindgren
  -1 siblings, 0 replies; 115+ messages in thread
From: Tony Lindgren @ 2022-05-27  6:50 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yegor Yefremov, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

* Arnd Bergmann <arnd@arndb.de> [220527 06:35]:
> On Fri, May 27, 2022 at 6:44 AM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Thu, May 26, 2022 at 2:37 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Thu, May 26, 2022 at 10:19 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > > > On Thu, 26 May 2022 at 08:20, Tony Lindgren <tony@atomide.com> wrote:
> > > > > >
> > > > > > * Yegor Yefremov <yegorslists@googlemail.com> [220526 05:45]:
> > > > > > > On Tue, May 24, 2022 at 4:19 PM Tony Lindgren <tony@atomide.com> wrote:
> > > > > > > > Maybe also try with CONFIG_MUSB_PIO_ONLY=y to see if it makes things
> > > > > > > > better or worse :)
> > > > > > >
> > > > > > > PIO is always the last resort :-) And now it proves it again. With
> > > > > > > PIO_ONLY the system doesn't stall.
> > > > > >
> > > > > > OK great :) So it has something to do with drivers/dma/ti/cppi41.c, or
> > > > > > with drivers/usb/musb/cppi_dma.c or whatever the dma for am335x here
> > > > > > is. Or maybe there's something using stack for buffers being passed to
> > > > > > dma again that breaks with vmap stack.
> > > > > >
> > > > >
> > > > > In order to confirm this theory, could you please try rebuilding your
> > > > > kernel with CONFIG_VMAP_STACK disabled, and leave everything else as
> > > > > before?
> > > >
> > > > I have disabled the CONFIG_VMAP_STACK option:
> > > >
> > > > # zcat /proc/config.gz | grep VMAP_STACK
> > > > CONFIG_HAVE_ARCH_VMAP_STACK=y
> > > > # CONFIG_VMAP_STACK is not set
> > > >
> > > > The system stalls.
> > >
> > > Ok, I guess that means we can stop looking for invalid DMA buffers
> > > on stacks. Out of the original commits you listed as possible causes,
> > > we can also rule out 23d9a9280efe ("ARM: 9177/1: disable vmap'ed
> > > stacks on suspend-capable SMP configs") and cafc0eab1689
> > > ("ARM: v7m: enable support for IRQ stacks"). It could still be
> > > 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for
> > > uniprocessor systems") and 5fe41793bc78 ("ARM: 9176/1: avoid
> > > literal references in inline assembly") or possibly the merge.
> > >
> > > Can you post the whole .config file somewhere for reference?
> > > In particular, do you have CONFIG_SMP, CONFIG_LD_IS_LLD
> > > or CURRENT_POINTER_IN_TPIDRURO set?
> >
> > This is my config [1] and this is the system in question [2].
> >
> > [1] https://github.com/visionsystemsgmbh/onrisc_br_bsp/blob/master/board/vscom/baltos/linux-experimental-config
> 
> Thanks! The first thing I noticed in here is that this config enables both
> CONFIG_ARCH_MULTI_V6 (for OMAP2) and CONFIG_SMP, which
> gets you into a couple of corner cases that nobody else hits in practice.
> 
> Can you still reproduce the problem if you turn off both of these?

Based on what we just discussed on #armlinux, testing before and after
commit 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for uniprocessor
systems") might be a good idea as it enables some config options that
did not get enabled earlier.

Another thing that might help is to bisect again and ensure vmap stack
config option stays disabled so issues related to vmap stack are kept
out of the way.

Regards,

Tony
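
A rough sketch of the second suggestion above, i.e. bisecting again while
keeping the vmap stack option disabled at every step. The helper name
force_option_off is hypothetical; it only edits a plain-text .config and
is not a kernel tool. The v5.17/v5.18 endpoints come from the original
report, while ARCH=arm and the build and test steps are assumptions
about the local setup.

```shell
#!/bin/sh
# force_option_off FILE SYMBOL
# Hypothetical helper: mark one Kconfig symbol (given without the CONFIG_
# prefix) as disabled in FILE, replacing any earlier setting.
force_option_off() {
    cfg=$1
    sym=$2
    # Drop any existing line for exactly this symbol ...
    grep -v -e "^CONFIG_${sym}=" -e "^# CONFIG_${sym} is not set\$" \
        "$cfg" > "$cfg.tmp"
    # ... and append the disabled form.
    echo "# CONFIG_${sym} is not set" >> "$cfg.tmp"
    mv "$cfg.tmp" "$cfg"
}

# Possible use inside a kernel tree (not executed here):
#   git bisect start
#   git bisect bad v5.18
#   git bisect good v5.17
#   # then, at every bisect step:
#   force_option_off .config VMAP_STACK
#   make ARCH=arm olddefconfig     # let Kconfig settle the remaining symbols
#   # build, boot, run the slcand test, then report the result:
#   git bisect good                # or: git bisect bad
```

On commits that predate the option, olddefconfig is expected to drop the
unknown line again, so the same step should be usable across the whole
bisect range.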

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27  6:50                                   ` Tony Lindgren
@ 2022-05-27  6:57                                     ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-27  6:57 UTC (permalink / raw)
  To: Tony Lindgren
  Cc: Arnd Bergmann, Yegor Yefremov, Ard Biesheuvel, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 8:50 AM Tony Lindgren <tony@atomide.com> wrote:
> * Arnd Bergmann <arnd@arndb.de> [220527 06:35]:
> > On Fri, May 27, 2022 at 6:44 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:

> Based on what we just discussed on #armlinux, testing before and after
> commit 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for uniprocessor
> systems") might be a good idea as it enables some config options that
> did not get enabled earlier.
>
> Another thing that might help is to bisect again and ensure vmap stack
> config option stays disabled so issues related to vmap stack are kept
> out of the way.

Unfortunately the commits around 9c46929e7989 are the ones that failed
to build according to the original report. But it's possible that the
problem has something to do with
CONFIG_CURRENT_POINTER_IN_TPIDRURO, which is disabled
in the V6+SMP config, and which in turn is required for
THREAD_INFO_IN_TASK, IRQSTACKS and STACKPROTECTOR_PER_TASK.

If any of the four are the cause of the stall, then turning off ARCH_OMAP2 and
CONFIG_CPU_V6 should show the same bug in older commits as well.

       Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27  6:57                                     ` Arnd Bergmann
@ 2022-05-27  8:17                                       ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-27  8:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tony Lindgren, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

Hi Arnd,

On Fri, May 27, 2022 at 8:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, May 27, 2022 at 8:50 AM Tony Lindgren <tony@atomide.com> wrote:
> > * Arnd Bergmann <arnd@arndb.de> [220527 06:35]:
> > > On Fri, May 27, 2022 at 6:44 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> > Based on what we just discussed on #armlinux, testing before and after
> > commit 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for uniprocessor
> > systems") might be a good idea as it enables some config options that
> > did not get enabled earlier.
> >
> > Another thing that might help is to bisect again and ensure vmap stack
> > config option stays disabled so issues related to vmap stack are kept
> > out of the way.
>
> Unfortunately the commits around 9c46929e7989 are the ones that failed
> to build according to the original report. But it's possible that the
> problem has something to do with
> CONFIG_CURRENT_POINTER_IN_TPIDRURO, which is disabled
> in the V6+SMP config, and which in turn is required for
> THREAD_INFO_IN_TASK, IRQSTACKS and STACKPROTECTOR_PER_TASK.
>
> If any of the four are the cause of the stall, then turning off ARCH_OMAP2 and
> CONFIG_CPU_V6 should show the same bug in older commits as well.

Both config options disabled:

# zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP'
# CONFIG_ARCH_MULTI_V6 is not set
CONFIG_ARCH_MULTI_V6_V7=y
# CONFIG_SMP is not set

This helped - no stalls.

Regards,
Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27  8:17                                       ` Yegor Yefremov
@ 2022-05-27  8:38                                         ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-27  8:38 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Ard Biesheuvel, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 10:17 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Fri, May 27, 2022 at 8:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Fri, May 27, 2022 at 8:50 AM Tony Lindgren <tony@atomide.com> wrote:
> > > * Arnd Bergmann <arnd@arndb.de> [220527 06:35]:
> > > > On Fri, May 27, 2022 at 6:44 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > > On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > > Based on what we just discussed on #armlinux, testing before and after
> > > commit 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for uniprocessor
> > > systems") might be a good idea as it enables some config options that
> > > did not get enabled earlier.
> > >
> > > Another thing that might help is to bisect again and ensure vmap stack
> > > config option stays disabled so issues related to vmap stack are kept
> > > out of the way.
> >
> > Unfortunately the commits around 9c46929e7989 are the ones that failed
> > to build according to the original report. But it's possible that the
> > problem has something to do with
> > CONFIG_CURRENT_POINTER_IN_TPIDRURO, which is disabled
> > in the V6+SMP config, and which in turn is required for
> > THREAD_INFO_IN_TASK, IRQSTACKS and STACKPROTECTOR_PER_TASK.
> >
> > If any of the four are the cause of the stall, then turning off ARCH_OMAP2 and
> > CONFIG_CPU_V6 should show the same bug in older commits as well.
>
> Both config options disabled:
>
> # zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP'
> # CONFIG_ARCH_MULTI_V6 is not set
> CONFIG_ARCH_MULTI_V6_V7=y
> # CONFIG_SMP is not set
>
> This helped - no stalls.

Ok, that does point back to a recent regression then, rather than something
that was already broken and just uncovered by the changed behavior.

Can you try the other combinations as well? OMAP2=y with SMP=n
and OMAP2=n with SMP=y? Hopefully that narrows it down enough that
we can look at which code paths actually changed.

        Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread
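
For reference, the four combinations requested above could be prepared
with the in-tree scripts/config helper. This is only a sketch:
matrix_cmd is a hypothetical helper name, and toggling ARCH_MULTI_V6
together with ARCH_OMAP2 is an assumption based on the options that
change in the grep output in this thread. It prints the command instead
of running it. The printed command would be run from the top of a kernel
tree, followed by "make ARCH=arm olddefconfig" before rebuilding.

```shell
#!/bin/sh
# Sketch: print the scripts/config call for one cell of the V6 x SMP matrix.
# matrix_cmd is a hypothetical name, not something from the kernel tree.
matrix_cmd() {
    # $1: y or n for the ARMv6/OMAP2 options, $2: y or n for SMP
    case "$1" in
        y) v6_flag=--enable ;;
        *) v6_flag=--disable ;;
    esac
    case "$2" in
        y) smp_flag=--enable ;;
        *) smp_flag=--disable ;;
    esac
    echo "scripts/config --file .config $v6_flag ARCH_MULTI_V6 $v6_flag ARCH_OMAP2 $smp_flag SMP"
}

# Print all four combinations, one command per line.
for v6 in y n; do
    for smp in y n; do
        matrix_cmd "$v6" "$smp"
    done
done
```

Each printed line corresponds to one kernel build to boot and test with
the slcand/ip commands from the original report.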

* Re: am335x: 5.18.x: system stalling
  2022-05-27  8:38                                         ` Arnd Bergmann
@ 2022-05-27  9:50                                           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-27  9:50 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tony Lindgren, Ard Biesheuvel, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 10:39 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, May 27, 2022 at 10:17 AM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Fri, May 27, 2022 at 8:57 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Fri, May 27, 2022 at 8:50 AM Tony Lindgren <tony@atomide.com> wrote:
> > > > * Arnd Bergmann <arnd@arndb.de> [220527 06:35]:
> > > > > On Fri, May 27, 2022 at 6:44 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > > > On Thu, May 26, 2022 at 4:16 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > > Based on what we just discussed on #armlinux, testing before and after
> > > > commit 9c46929e7989 ("ARM: implement THREAD_INFO_IN_TASK for uniprocessor
> > > > systems") might be a good idea as it enables some config options that
> > > > did not get enabled earlier.
> > > >
> > > > Another thing that might help is to bisect again and ensure vmap stack
> > > > config option stays disabled so issues related to vmap stack are kept
> > > > out of the way.
> > >
> > > Unfortunately the commits around 9c46929e7989 are the ones that failed
> > > to build according to the original report. But it's possible that the
> > > problem has something to do with
> > > CONFIG_CURRENT_POINTER_IN_TPIDRURO, which is disabled
> > > in the V6+SMP config, and which in turn is required for
> > > THREAD_INFO_IN_TASK, IRQSTACKS and STACKPROTECTOR_PER_TASK.
> > >
> > > If any of the four are the cause of the stall, then turning off ARCH_OMAP2 and
> > > CONFIG_CPU_V6 should show the same bug in older commits as well.
> >
> > Both config options disabled:
> >
> > # zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP'
> > # CONFIG_ARCH_MULTI_V6 is not set
> > CONFIG_ARCH_MULTI_V6_V7=y
> > # CONFIG_SMP is not set
> >
> > This helped - no stalls.
>
> Ok, that does point back to a recent regression then, rather than something
> that was already broken and just uncovered by the changed behavior.
>
> Can you try the other combinations as well? OMAP2=y with SMP=n
> and OMAP2=n with SMP=y? Hopefully that narrows it down enough that
> we can look at which code paths actually changed.

# zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP'
# CONFIG_ARCH_MULTI_V6 is not set
CONFIG_ARCH_MULTI_V6_V7=y
CONFIG_SMP=y
CONFIG_SMP_ON_UP=y

No stalls.

# zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP\|ARCH_OMAP2'
CONFIG_ARCH_MULTI_V6=y
CONFIG_ARCH_MULTI_V6_V7=y
CONFIG_ARCH_OMAP2=y
CONFIG_ARCH_OMAP2PLUS=y
CONFIG_ARCH_OMAP2PLUS_TYPICAL=y
# CONFIG_SMP is not set

No stalls.

As soon as I enable both SMP and OMAP2 options the system stalls.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread
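
The results above come down to one rule: the stall was only seen with
both CONFIG_ARCH_MULTI_V6=y and CONFIG_SMP=y. A minimal sketch of a check
for that combination follows. stall_combo is a hypothetical helper name,
and it expects a plain-text config file, so /proc/config.gz has to go
through zcat first.

```shell
#!/bin/sh
# Sketch: report whether a kernel config has the V6+SMP combination that
# stalled in this thread. stall_combo is a hypothetical helper name.
stall_combo() {
    cfg=$1
    # Anchored patterns so that CONFIG_ARCH_MULTI_V6_V7 and CONFIG_SMP_ON_UP
    # do not count as matches.
    if grep -q '^CONFIG_ARCH_MULTI_V6=y$' "$cfg" &&
       grep -q '^CONFIG_SMP=y$' "$cfg"; then
        echo "V6+SMP: matches the stalling combination"
    else
        echo "not V6+SMP: no stall reported for this combination"
    fi
}
```

On the target this could be used as:

  zcat /proc/config.gz > /tmp/config && stall_combo /tmp/config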

* Re: am335x: 5.18.x: system stalling
  2022-05-27  9:50                                           ` Yegor Yefremov
@ 2022-05-27 12:53                                             ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-27 12:53 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Ard Biesheuvel, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> # zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP'
> # CONFIG_ARCH_MULTI_V6 is not set
> CONFIG_ARCH_MULTI_V6_V7=y
> CONFIG_SMP=y
> CONFIG_SMP_ON_UP=y
>
> No stalls.
>
> # zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP\|ARCH_OMAP2'
> CONFIG_ARCH_MULTI_V6=y
> CONFIG_ARCH_MULTI_V6_V7=y
> CONFIG_ARCH_OMAP2=y
> CONFIG_ARCH_OMAP2PLUS=y
> CONFIG_ARCH_OMAP2PLUS_TYPICAL=y
>
> No stalls.
>
> As soon as I enable both SMP and OMAP2 options the system stalls.

Ok, that points to the SMP patching for percpu data, which doesn't happen
before the patch, and which is different between loadable modules and
the normal code.

Can you try out this patch to short-circuit the logic and always return
the offset for CPU 0? This is obviously broken on SMP machines but
would get around the bit of code that is V6+SMP specific.

        Arnd

diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 7545c87c251f..3057c5fef970 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -25,10 +25,13 @@ static inline void set_my_cpu_offset(unsigned long off)
        asm volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (off) : "memory");
 }

+extern unsigned long __per_cpu_offset[];
 static __always_inline unsigned long __my_cpu_offset(void)
 {
        unsigned long off;

+       return __per_cpu_offset[0];
+
        /*
         * Read TPIDRPRW.
         * We want to allow caching the value, so avoid using volatile and

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27 12:53                                             ` Arnd Bergmann
@ 2022-05-27 13:12                                               ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-05-27 13:12 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yegor Yefremov, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> >
> > # zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP'
> > # CONFIG_ARCH_MULTI_V6 is not set
> > CONFIG_ARCH_MULTI_V6_V7=y
> > CONFIG_SMP=y
> > CONFIG_SMP_ON_UP=y
> >
> > No stalls.
> >
> > # zcat /proc/config.gz | grep 'CONFIG_ARCH_MULTI_V6\|CONFIG_SMP\|ARCH_OMAP2'
> > CONFIG_ARCH_MULTI_V6=y
> > CONFIG_ARCH_MULTI_V6_V7=y
> > CONFIG_ARCH_OMAP2=y
> > CONFIG_ARCH_OMAP2PLUS=y
> > CONFIG_ARCH_OMAP2PLUS_TYPICAL=y
> >
> > No stalls.
> >
> > As soon as I enable both SMP and OMAP2 options the system stalls.
>
> Ok, that points to the SMP patching for percpu data, which doesn't happen
> before the patch, and which is different between loadable modules and
> the normal code.
>

Not just per-cpu data: there is also the 'current' global variable
which gets used now instead of the user thread ID register, and this
is also different between modules and the core kernel (unless
CONFIG_ARM_MODULE_PLTS is disabled)

I looked at the ftdi_sio and slcan modules, and didn't find any
references to per-CPU offsets when building them using the provided
.config. I did find some references to __current, but these seem to be
ignored (they are only emitted to satisfy the "m" inline asm
constraint in get_domain(), but the parameter is never actually used
in the assembler code)


> Can you try out this patch to short-circuit the logic and always return
> the offset for CPU 0? This is obviously broken on SMP machines but
> would get around the bit of code that is V6+SMP specific.
>
>         Arnd
>
> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
> index 7545c87c251f..3057c5fef970 100644
> --- a/arch/arm/include/asm/percpu.h
> +++ b/arch/arm/include/asm/percpu.h
> @@ -25,10 +25,13 @@ static inline void set_my_cpu_offset(unsigned long off)
>         asm volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (off) : "memory");
>  }
>
> +extern unsigned long __per_cpu_offset[];
>  static __always_inline unsigned long __my_cpu_offset(void)
>  {
>         unsigned long off;
>
> +       return __per_cpu_offset[0];
> +
>         /*
>          * Read TPIDRPRW.
>          * We want to allow caching the value, so avoid using volatile and

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27 13:12                                               ` Ard Biesheuvel
@ 2022-05-27 14:12                                                 ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-27 14:12 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Yegor Yefremov, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 3:12 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> Not just per-cpu data: there is also the 'current' global variable
> which gets used now instead of the user thread ID register, and this
> is also different between modules and the core kernel (unless
> CONFIG_ARM_MODULE_PLTS is disabled)

Right, so if the percpu hack doesn't address it, this one might:

diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
index 1e1178bf176d..306d1a4cae40 100644
--- a/arch/arm/include/asm/current.h
+++ b/arch/arm/include/asm/current.h
@@ -18,6 +18,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
 {
        struct task_struct *cur;

+       return __current;
+
 #if __has_builtin(__builtin_thread_pointer) && \
     defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
     !(defined(CONFIG_THUMB2_KERNEL) && \

> I looked at the ftdi_sio and slcan modules, and didn't find any
> references to per-CPU offsets when building them using the provided
> .config. I did find some references to __current, but these seem to be
> ignored (they are only emitted to satisfy the "m" inline asm
> constraint in get_domain(), but the parameter is never actually used
> in the assembler code)

I see some __current references in the musb driver that come from
tracepoints as well (in omap2plus_defconfig), but these also shouldn't be
active.
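
For anyone following along, here is a rough user-space model of what the
get_current() hack does. It is only a sketch: struct task_model,
current_global, fake_thread_register and FORCE_GLOBAL_PATH are made-up
names, and the "register" is an ordinary variable, since the real thread ID
register cannot be read from a normal program. The idea is that the early
return forces every caller, core kernel and module alike, onto the same
path, so a disagreement between the two paths can be ruled in or out.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct task_model {
	int pid;
};

/* Models the __current global variable. */
static struct task_model *current_global;
/* Models the thread ID register; a plain variable in this sketch. */
static struct task_model *fake_thread_register;

/* Set to 1 to mimic the debugging hack: always use the global. */
#define FORCE_GLOBAL_PATH 1

static struct task_model *get_current_model(void)
{
#if FORCE_GLOBAL_PATH
	/* Early return, like the "+ return __current;" line in the patch. */
	return current_global;
#else
	return fake_thread_register;
#endif
}

/* Models a context switch, which has to keep both copies in sync. */
static void switch_to_model(struct task_model *next)
{
	current_global = next;
	fake_thread_register = next;
}
```

If behaviour only changes when FORCE_GLOBAL_PATH is set, the two sources
must disagree somewhere; if nothing changes, the problem is probably
elsewhere.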

        Arnd

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-27 14:12                                                 ` Arnd Bergmann
@ 2022-05-28  5:48                                                   ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-28  5:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ard Biesheuvel, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Fri, May 27, 2022 at 4:13 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, May 27, 2022 at 3:12 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > Not just per-cpu data: there is also the 'current' global variable
> > which gets used now instead of the user thread ID register, and this
> > is also different between modules and the core kernel (unless
> > CONFIG_ARM_MODULE_PLTS is disabled)
>
> Right, so if the percpu hack doesn't address it, this one might:
>
> diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> index 1e1178bf176d..306d1a4cae40 100644
> --- a/arch/arm/include/asm/current.h
> +++ b/arch/arm/include/asm/current.h
> @@ -18,6 +18,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
>  {
>         struct task_struct *cur;
>
> +       return __current;
> +
>  #if __has_builtin(__builtin_thread_pointer) && \
>      defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
>      !(defined(CONFIG_THUMB2_KERNEL) && \

I have tried this patch and the system still stalls.

Yegor

> > I looked at the ftdi_sio and slcan modules, and didn't find any
> > references to per-CPU offsets when building them using the provided
> > .config. I did find some references to __current, but these seem to be
> > ignored (they are only emitted to satisfy the "m" inline asm
> > constraint in get_domain(), but the parameter is never actually used
> > in the assembler code)
>
> I see some __current references in the musb driver that come from
> tracepoints as well (in omap2plus_defconfig), but these also shouldn't be
> active.
>
>         Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28  5:48                                                   ` Yegor Yefremov
@ 2022-05-28  7:53                                                     ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-28  7:53 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Sat, May 28, 2022 at 7:48 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> On Fri, May 27, 2022 at 4:13 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Fri, May 27, 2022 at 3:12 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > >
> > > Not just per-cpu data: there is also the 'current' global variable
> > > which gets used now instead of the user thread ID register, and this
> > > is also different between modules and the core kernel (unless
> > > CONFIG_ARM_MODULE_PLTS is disabled)
> >
> > Right, so if the percpu hack doesn't address it, this one might:
> >
> > diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> > index 1e1178bf176d..306d1a4cae40 100644
> > --- a/arch/arm/include/asm/current.h
> > +++ b/arch/arm/include/asm/current.h
> > @@ -18,6 +18,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
> >  {
> >         struct task_struct *cur;
> >
> > +       return __current;
> > +
> >  #if __has_builtin(__builtin_thread_pointer) && \
> >      defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
> >      !(defined(CONFIG_THUMB2_KERNEL) && \
>
> I have tried this patch and the system still stalls.

Ok, thanks for testing. To clarify: did you test with both the get_current() and
__my_cpu_offset() changes applied, or just the get_current() one?

       Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28  7:53                                                     ` Arnd Bergmann
@ 2022-05-28  8:29                                                       ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-28  8:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ard Biesheuvel, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Sat, May 28, 2022 at 9:53 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, May 28, 2022 at 7:48 AM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> >
> > On Fri, May 27, 2022 at 4:13 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Fri, May 27, 2022 at 3:12 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > >
> > > > Not just per-cpu data: there is also the 'current' global variable
> > > > which gets used now instead of the user thread ID register, and this
> > > > is also different between modules and the core kernel (unless
> > > > CONFIG_ARM_MODULE_PLTS is disabled)
> > >
> > > Right, so if the percpu hack doesn't address it, this one might:
> > >
> > > diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> > > index 1e1178bf176d..306d1a4cae40 100644
> > > --- a/arch/arm/include/asm/current.h
> > > +++ b/arch/arm/include/asm/current.h
> > > @@ -18,6 +18,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
> > >  {
> > >         struct task_struct *cur;
> > >
> > > +       return __current;
> > > +
> > >  #if __has_builtin(__builtin_thread_pointer) && \
> > >      defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
> > >      !(defined(CONFIG_THUMB2_KERNEL) && \
> >
> > I have tried this patch and the system still stalls.
>
> Ok, thanks for testing. To clarify: did you test with both the get_current() and
> __my_cpu_offset() changes applied, or just the get_current() one?

I have tested only the get_current() one. Should I also test
__my_cpu_offset() separately and combined?

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28  8:29                                                       ` Yegor Yefremov
@ 2022-05-28  9:07                                                         ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-05-28  9:07 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Sat, 28 May 2022 at 10:29, Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> On Sat, May 28, 2022 at 9:53 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Sat, May 28, 2022 at 7:48 AM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > >
> > > On Fri, May 27, 2022 at 4:13 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Fri, May 27, 2022 at 3:12 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > >
> > > > > Not just per-cpu data: there is also the 'current' global variable
> > > > > which gets used now instead of the user thread ID register, and this
> > > > > is also different between modules and the core kernel (unless
> > > > > CONFIG_ARM_MODULE_PLTS is disabled)
> > > >
> > > > Right, so if the percpu hack doesn't address it, this one might:
> > > >
> > > > diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> > > > index 1e1178bf176d..306d1a4cae40 100644
> > > > --- a/arch/arm/include/asm/current.h
> > > > +++ b/arch/arm/include/asm/current.h
> > > > @@ -18,6 +18,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
> > > >  {
> > > >         struct task_struct *cur;
> > > >
> > > > +       return __current;
> > > > +
> > > >  #if __has_builtin(__builtin_thread_pointer) && \
> > > >      defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
> > > >      !(defined(CONFIG_THUMB2_KERNEL) && \
> > >
> > > I have tried this patch and the system still stalls.
> >
> > Ok, thanks for testing. To clarify: did you test with both the get_current() and
> > __my_cpu_offset() changes applied, or just the get_current() one?
>
> I have tested only the get_current() one. Should I also test
> __my_cpu_offset() separately and combined?
>

That would be helpful, yes.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28  9:07                                                         ` Ard Biesheuvel
@ 2022-05-28 13:01                                                           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-28 13:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Sat, 28 May 2022 at 10:29, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > On Sat, May 28, 2022 at 9:53 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Sat, May 28, 2022 at 7:48 AM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > >
> > > > On Fri, May 27, 2022 at 4:13 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > >
> > > > > On Fri, May 27, 2022 at 3:12 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > On Fri, 27 May 2022 at 14:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > On Fri, May 27, 2022 at 11:50 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > > >
> > > > > > Not just per-cpu data: there is also the 'current' global variable
> > > > > > which gets used now instead of the user thread ID register, and this
> > > > > > is also different between modules and the core kernel (unless
> > > > > > CONFIG_ARM_MODULE_PLTS is disabled)
> > > > >
> > > > > Right, so if the percpu hack doesn't address it, this one might:
> > > > >
> > > > > diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> > > > > index 1e1178bf176d..306d1a4cae40 100644
> > > > > --- a/arch/arm/include/asm/current.h
> > > > > +++ b/arch/arm/include/asm/current.h
> > > > > @@ -18,6 +18,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
> > > > >  {
> > > > >         struct task_struct *cur;
> > > > >
> > > > > +       return __current;
> > > > > +
> > > > >  #if __has_builtin(__builtin_thread_pointer) && \
> > > > >      defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
> > > > >      !(defined(CONFIG_THUMB2_KERNEL) && \
> > > >
> > > > I have tried this patch and the system still stalls.
> > >
> > > Ok, thanks for testing. To clarify: did you test with both the get_current() and
> > > __my_cpu_offset() changes applied, or just the get_current() one?
> >
> > I have tested only the get_current() one. Should I also test
> > __my_cpu_offset() separately and combined?
> >
>
> That would be helpful, yes.

  SYNC    include/config/auto.conf.cmd
  CC      kernel/bounds.s
  CALL    scripts/atomic/check-atomics.sh
In file included from ./include/linux/irqflags.h:17,
                 from ./arch/arm/include/asm/bitops.h:28,
                 from ./include/linux/bitops.h:33,
                 from ./include/linux/log2.h:12,
                 from kernel/bounds.c:13:
./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’ undeclared (first use in this function); did you mean ‘__my_cpu_offset’?
   32 |  return __per_cpu_offset[0];
      |         ^~~~~~~~~~~~~~~~
      |         __my_cpu_offset
./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier is reported only once for each function it appears in

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28 13:01                                                           ` Yegor Yefremov
@ 2022-05-28 13:13                                                             ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-28 13:13 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Ard Biesheuvel, Arnd Bergmann, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> In file included from ./include/linux/irqflags.h:17,
>                  from ./arch/arm/include/asm/bitops.h:28,
>                  from ./include/linux/bitops.h:33,
>                  from ./include/linux/log2.h:12,
>                  from kernel/bounds.c:13:
> ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’ undeclared (first use in this function); did you mean ‘__my_cpu_offset’?
>    32 |  return __per_cpu_offset[0];
>       |         ^~~~~~~~~~~~~~~~
>       |         __my_cpu_offset
> ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier is reported only once for each function it appears in

I think you just missed the line in my patch that adds the
"extern unsigned long __per_cpu_offset[];" variable declaration.
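
To illustrate why that one line matters, here is a small stand-alone
sketch. The names demo_per_cpu_offset and demo_my_cpu_offset are invented
for the example, and placing the definition at the bottom of the same file
is just a stand-in for a definition that lives in a different source file
(an assumption about the build, not something shown in this thread).
Without the extern line the compiler has not seen the identifier when it
reaches the return statement, which is what produces the "undeclared"
error quoted above.

```c
#include <stddef.h>

/* Declaration only: promises the compiler that the array exists somewhere. */
extern unsigned long demo_per_cpu_offset[];

static inline unsigned long demo_my_cpu_offset(void)
{
	/* Debug shortcut in the spirit of the hack: always use slot 0. */
	return demo_per_cpu_offset[0];
}

/* Definition; stands in for the one that would live in another file. */
unsigned long demo_per_cpu_offset[1] = { 0x1000UL };
```

Removing the extern line from this sketch should make the build fail at the
return statement in the same way.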

       Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28 13:13                                                             ` Arnd Bergmann
@ 2022-05-28 19:28                                                               ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-28 19:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ard Biesheuvel, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > In file included from ./include/linux/irqflags.h:17,
> >                  from ./arch/arm/include/asm/bitops.h:28,
> >                  from ./include/linux/bitops.h:33,
> >                  from ./include/linux/log2.h:12,
> >                  from kernel/bounds.c:13:
> > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > undeclared (first use in this function); did you mean
> > ‘__my_cpu_offset’?
> >    32 |  return __per_cpu_offset[0];
> >       |         ^~~~~~~~~~~~~~~~
> >       |         __my_cpu_offset
> > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > is reported only once for each function it appears in
>
> I think you just missed the line in my patch that adds the
> "extern unsigned long __per_cpu_offset[];" variable declaration.

My bad.

So, I tried both variants and both led to stalls.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28 19:28                                                               ` Yegor Yefremov
@ 2022-05-30 10:16                                                                 ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-05-30 10:16 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Sat, 28 May 2022 at 21:28, Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > In file included from ./include/linux/irqflags.h:17,
> > >                  from ./arch/arm/include/asm/bitops.h:28,
> > >                  from ./include/linux/bitops.h:33,
> > >                  from ./include/linux/log2.h:12,
> > >                  from kernel/bounds.c:13:
> > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > undeclared (first use in this function); did you mean
> > > ‘__my_cpu_offset’?
> > >    32 |  return __per_cpu_offset[0];
> > >       |         ^~~~~~~~~~~~~~~~
> > >       |         __my_cpu_offset
> > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > is reported only once for each function it appears in
> >
> > I think you just missed the line in my patch that adds the
> > "extern unsigned long __per_cpu_offset[];" variable declaration.
>
> My bad.
>
> So, I tried both variants and both led to stalls.
>

Could you please try running slcand under strace (and use the -F
option on slcand), and bring up the link from another terminal
session? That way, we may be able to narrow down the cause of the
stall from the strace output.

On my BB white, it never gets past

openat(AT_FDCWD, "/dev/ttyUSB0", O_RDWR|O_NOCTTY|O_NONBLOCK|O_LARGEFILE) = 3
ioctl(3, TCGETS, {B3000000 -opost -isig -icanon -echo ...}) = 0
ioctl(3, TIOCGSERIAL, 0xbec564fc)       = 0
ioctl(3, TIOCSSERIAL)                   = 0
ioctl(3, TCGETS, {B3000000 -opost -isig -icanon -echo ...}) = 0
ioctl(3, SNDCTL_TMR_STOP or TCSETSW, {B3000000 -opost -isig -icanon -echo ...}) = 0
ioctl(3, TCGETS, {B3000000 -opost -isig -icanon -echo ...}) = 0
write(3, "C\rS8\r", 5)                  = 5
write(3, "O\r", 2)                      = 2
ioctl(3, TIOCSETD, [17]

but I don't have any actual CAN-to-USB-serial hardware so I'm not sure
if I'm even able to reproduce this.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-30 10:16                                                                 ` Ard Biesheuvel
@ 2022-05-30 12:09                                                                   ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-30 12:09 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Mon, May 30, 2022 at 12:16 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Sat, 28 May 2022 at 21:28, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > In file included from ./include/linux/irqflags.h:17,
> > > >                  from ./arch/arm/include/asm/bitops.h:28,
> > > >                  from ./include/linux/bitops.h:33,
> > > >                  from ./include/linux/log2.h:12,
> > > >                  from kernel/bounds.c:13:
> > > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > > undeclared (first use in this function); did you mean
> > > > ‘__my_cpu_offset’?
> > > >    32 |  return __per_cpu_offset[0];
> > > >       |         ^~~~~~~~~~~~~~~~
> > > >       |         __my_cpu_offset
> > > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > > is reported only once for each function it appears in
> > >
> > > I think you just missed the line in my patch that adds the
> > > "extern unsigned long __per_cpu_offset[];" variable declaration.
> >
> > My bad.
> >
> > So, I tried both variants and both led to stalls.
> >
>
> Could you please try running slcand under strace (and use the -F
> option on slcand), and bring up the link from another terminal
> session? That way, we may be able to narrow down the cause of the
> stall from the strace output.
>
> On my BB white, it never gets past
>
> openat(AT_FDCWD, "/dev/ttyUSB0", O_RDWR|O_NOCTTY|O_NONBLOCK|O_LARGEFILE) = 3
> ioctl(3, TCGETS, {B3000000 -opost -isig -icanon -echo ...}) = 0
> ioctl(3, TIOCGSERIAL, 0xbec564fc)       = 0
> ioctl(3, TIOCSSERIAL)                   = 0
> ioctl(3, TCGETS, {B3000000 -opost -isig -icanon -echo ...}) = 0
> ioctl(3, SNDCTL_TMR_STOP or TCSETSW, {B3000000 -opost -isig -icanon -echo ...}) = 0
> ioctl(3, TCGETS, {B3000000 -opost -isig -icanon -echo ...}) = 0
> write(3, "C\rS8\r", 5)                  = 5
> write(3, "O\r", 2)                      = 2
> ioctl(3, TIOCSETD, [17]
>
> but I don't have any actual CAN-to-USB-serial hardware so I'm not sure
> if I'm even able to reproduce this.

Triggering the stall is not that straightforward. slcand just loads
the slcan driver and creates an slcan0 network device. This alone
doesn't lead to a stall. Only when I send some CAN frames does the
system stall, after a few seconds.

My CAN test script can also work directly with /dev/ttyUSB0 omitting
the slcan driver. In this case, the system stays stable.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-28 19:28                                                               ` Yegor Yefremov
@ 2022-05-30 13:54                                                                 ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-30 13:54 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Sat, May 28, 2022 at 9:28 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > In file included from ./include/linux/irqflags.h:17,
> > >                  from ./arch/arm/include/asm/bitops.h:28,
> > >                  from ./include/linux/bitops.h:33,
> > >                  from ./include/linux/log2.h:12,
> > >                  from kernel/bounds.c:13:
> > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > undeclared (first use in this function); did you mean
> > > ‘__my_cpu_offset’?
> > >    32 |  return __per_cpu_offset[0];
> > >       |         ^~~~~~~~~~~~~~~~
> > >       |         __my_cpu_offset
> > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > is reported only once for each function it appears in
> >
> > I think you just missed the line in my patch that adds the
> > "extern unsigned long __per_cpu_offset[];" variable declaration.
>
> So, I tried both variants and both led to stalls.

I'm running out of ideas here.  Going back to the original bisection,
I rebased Ard's patches in a way that you should be able to build the
config for each patch, and I split up the "ARM: implement
THREAD_INFO_IN_TASK for uniprocessor systems" commit in yet
another way, hoping to get something left over that points to the
bug. Can you try bisecting through the top commits of

https://kernel.org/pub/scm/linux/kernel/git/soc/soc.git am335x-stall-test

starting maybe with "52d240871760 irqchip: nvic: Use
GENERIC_IRQ_MULTI_HANDLER" as the patch that is almost certainly
going to be ok?
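
The bisection requested above could be set up roughly as follows. This is only a sketch: the repository URL, branch name and known-good commit are taken from this message, while the helper names are hypothetical and the choice of the fetched branch tip as the "bad" end is an assumption (the message does not say which commit is known to stall).

```python
import shlex

# Values quoted from the message; the helpers below are illustrative only.
REPO_URL = "https://kernel.org/pub/scm/linux/kernel/git/soc/soc.git"
BRANCH = "am335x-stall-test"
KNOWN_GOOD = "52d240871760"


def bisect_setup_commands(repo_url=REPO_URL, branch=BRANCH, good=KNOWN_GOOD):
    """Return the git invocations that start a bisection of `branch`."""
    return [
        # Fetch the test branch; its tip becomes FETCH_HEAD.
        ["git", "fetch", repo_url, branch],
        # "git bisect start <bad> <good>": assume the branch tip stalls.
        ["git", "bisect", "start", "FETCH_HEAD", good],
    ]


def verdict_command(stalled):
    """Map one test result (did the board stall?) to the next bisect step."""
    return ["git", "bisect", "bad" if stalled else "good"]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed.
    for argv in bisect_setup_commands():
        print(shlex.join(argv))
```

After each build and boot, the CAN test is run and `verdict_command(True)` or `verdict_command(False)` gives the command to feed back to git, depending on whether the stall showed up.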

At some point I fear we may have to give up and just mark the v6+SMP
configuration as broken, which is something we have considered in the
past but ended up always keeping around for the purpose of testing
omap2plus_defconfig and imx_v6_v7_defconfig. Note that on production
systems you probably don't want to use that config anyway, and should
either stick to a uniprocessor build, or disable the ARMv6 support.

         Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-30 13:54                                                                 ` Arnd Bergmann
@ 2022-05-30 15:14                                                                   ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-05-30 15:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yegor Yefremov, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Mon, 30 May 2022 at 15:54, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, May 28, 2022 at 9:28 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> >
> > On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > In file included from ./include/linux/irqflags.h:17,
> > > >                  from ./arch/arm/include/asm/bitops.h:28,
> > > >                  from ./include/linux/bitops.h:33,
> > > >                  from ./include/linux/log2.h:12,
> > > >                  from kernel/bounds.c:13:
> > > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > > undeclared (first use in this function); did you mean
> > > > ‘__my_cpu_offset’?
> > > >    32 |  return __per_cpu_offset[0];
> > > >       |         ^~~~~~~~~~~~~~~~
> > > >       |         __my_cpu_offset
> > > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > > is reported only once for each function it appears in
> > >
> > > I think you just missed the line in my patch that adds the
> > > "extern unsigned long __per_cpu_offset[];" variable declaration.
> >
> > So, I tried both variants and both led to stalls.
>
> I'm running out of ideas here.  Going back to the original bisection,
> I rebased Ard's patches in a way that you should be able to build the
> config for each patch, and I split up the "ARM: implement
> THREAD_INFO_IN_TASK for uniprocessor systems" commit in yet
> another way, hoping to get something left over that points to the
> bug. Can you try bisecting through the top commits of
>
> https://kernel.org/pub/scm/linux/kernel/git/soc/soc.git am335x-stall-test
>
> starting maybe with "52d240871760 irqchip: nvic: Use
> GENERIC_IRQ_MULTI_HANDLER" as the patch that is almost certainly
> going to be ok?
>
> At some point I fear we may have to give up and just mark the v6+SMP
> configuration as broken, which is something we have considered in the
> past but ended up always keeping around for the purpose of testing
> omap2plus_defconfig and imx_v6_v7_defconfig. Note that on production
> systems you probably don't want to use that config anyway, and should
> either stick to a uniprocessor build, or disable the ARMv6 support.
>

Yeah, I am also running out of ideas. One question, though: does the
RCU detected stall always occur in the same place? I.e., how similar
are the backtraces of the stalls between different occurrences?
Perhaps we could narrow down where in the code we are stalling, and
gain some more understanding of the root cause.
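
One way to compare the backtraces between occurrences is to reduce each log to its chain of function names and ignore the offsets. A minimal sketch, assuming the logs use the `callee from caller+0xOFF/0xSIZE` frame format seen in this thread; the function names are hypothetical and the two samples are lines copied from logs posted here:

```python
import re

# Matches unwinder lines such as
# "[  219.797138]  __schedule from schedule+0x58/0xcc"
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\S+) from (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def call_chain(log_text):
    """Return the function names of all stack frames in log order."""
    chain = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        callee, caller = match.groups()
        # Consecutive frames share a function; do not list it twice.
        if not chain or chain[-1] != callee:
            chain.append(callee)
        chain.append(caller)
    return chain


def same_path(log_a, log_b):
    """True if both logs walk through the same functions in the same order."""
    return call_chain(log_a) == call_chain(log_b)


SAMPLE_A = """\
[   88.471840]  __schedule from schedule+0x58/0xcc
[   88.477680]  schedule from schedule_timeout+0x78/0xf8
[   88.483754]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3cc
"""

SAMPLE_B = """\
[  219.797138]  __schedule from schedule+0x58/0xcc
[  219.802763]  schedule from schedule_timeout+0x78/0xf8
[  219.808847]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
"""

if __name__ == "__main__":
    print(call_chain(SAMPLE_A))
    print(same_path(SAMPLE_A, SAMPLE_B))
```

Comparing names only means that two builds with slightly different function sizes (0x3cc versus 0x3d0 in the samples) still count as the same path.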

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-30 15:14                                                                   ` Ard Biesheuvel
@ 2022-05-31  8:36                                                                     ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-31  8:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

On Mon, May 30, 2022 at 5:15 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Mon, 30 May 2022 at 15:54, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Sat, May 28, 2022 at 9:28 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > >
> > > On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > > > <yegorslists@googlemail.com> wrote:
> > > > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > In file included from ./include/linux/irqflags.h:17,
> > > > >                  from ./arch/arm/include/asm/bitops.h:28,
> > > > >                  from ./include/linux/bitops.h:33,
> > > > >                  from ./include/linux/log2.h:12,
> > > > >                  from kernel/bounds.c:13:
> > > > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > > > undeclared (first use in this function); did you mean
> > > > > ‘__my_cpu_offset’?
> > > > >    32 |  return __per_cpu_offset[0];
> > > > >       |         ^~~~~~~~~~~~~~~~
> > > > >       |         __my_cpu_offset
> > > > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > > > is reported only once for each function it appears in
> > > >
> > > > I think you just missed the line in my patch that adds the
> > > > "extern unsigned long __per_cpu_offset[];" variable declaration.
> > >
> > > So, I tried both variants and both led to stalls.
> >
> > I'm running out of ideas here.  Going back to the original bisection,
> > I rebased Ard's patches in a way that you should be able to build the
> > config for each patch, and I split up the "ARM: implement
> > THREAD_INFO_IN_TASK for uniprocessor systems" commit in yet
> > another way, hoping to get something left over that points to the
> > bug. Can you try bisecting through the top commits of
> >
> > https://kernel.org/pub/scm/linux/kernel/git/soc/soc.git am335x-stall-test
> >
> > starting maybe with "52d240871760 irqchip: nvic: Use
> > GENERIC_IRQ_MULTI_HANDLER" as the patch that is almost certainly
> > going to be ok?
> >
> > At some point I fear we may have to give up and just mark the v6+SMP
> > configuration as broken, which is something we have considered in the
> > past but ended up always keeping around for the purpose of testing
> > omap2plus_defconfig and imx_v6_v7_defconfig. Note that on production
> > systems you probably don't want to use that config anyway, and should
> > either stick to a uniprocessor build, or disable the ARMv6 support.
> >
>
> Yeah, I am also running out of ideas. One question, though: does the
> RCU detected stall always occur in the same place? I.e., how similar
> are the backtraces of the stalls between different occurrences?
> Perhaps we could narrow down where in the code we are stalling, and
> gain some more understanding of the root cause.

I have attached 4 crash logs and will start bisecting Arnd's branch.

Yegor

[-- Attachment #2: crash3.txt --]
[-- Type: text/plain, Size: 9097 bytes --]

[  219.721096] rcu: INFO: rcu_sched self-detected stall on CPU
[  219.727845] rcu:     0-...!: (2600 ticks this GP) idle=e7d/1/0x40000004 softirq=3592/3592 fqs=0
[  219.737376]  (t=2600 jiffies g=5525 q=21)
[  219.742051] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g5525 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  219.753979] rcu:     Possible timer handling issue on cpu=0 timer-softirq=2867
[  219.761534] rcu: rcu_sched kthread starved for 2600 jiffies! g5525 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[  219.772512] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  219.782043] rcu: RCU grace-period kthread stack dump:
[  219.787605] task:rcu_sched       state:I stack:    0 pid:   11 ppid:     2 flags:0x00000000
[  219.797138]  __schedule from schedule+0x58/0xcc
[  219.802763]  schedule from schedule_timeout+0x78/0xf8
[  219.808847]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[  219.815741]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[  219.822273]  rcu_gp_kthread from kthread+0xe4/0x104
[  219.828121]  kthread from ret_from_fork+0x14/0x28
[  219.833664] Exception stack(0xd0041fb0 to 0xd0041ff8)
[  219.839459] 1fa0:                                     00000000 00000000 00000000 00000000
[  219.848426] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  219.857325] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  219.864572] rcu: Stack dump where RCU GP kthread last ran:
[  219.870636] NMI backtrace for cpu 0
[  219.874702] CPU: 0 PID: 58 Comm: kworker/0:8 Tainted: G        W         5.18.0-rc7 #14
[  219.883491] Hardware name: Generic AM33XX (Flattened Device Tree)
[  219.890214] Workqueue: events dbs_work_handler
[  219.895659]  unwind_backtrace from show_stack+0x10/0x14
[  219.901897]  show_stack from dump_stack_lvl+0x58/0x70
[  219.908005]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  219.914814]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  219.922875]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  219.931828]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[  219.941030]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xa98/0xf8c
[  219.949573]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  219.957001]  update_process_times from tick_sched_handle+0x48/0x54
[  219.964167]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  219.970891]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  219.978144]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  219.985517]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  219.993529]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  220.002289]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  220.009770]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  220.016630]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  220.023279]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  220.030501]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  220.037105] Exception stack(0xd0001f58 to 0xd0001fa0)
[  220.042841] 1f40:                                                       c01015c8 00000000
[  220.051805] 1f60: 0eaec000 00000000 fffffe00 600f0013 ffffffff d0385d5c 00000000 c3744a80
[  220.060765] 1f80: 00000200 c3744a80 c208dcd8 d0001fa8 c01015c8 c01015d0 600f0113 ffffffff
[  220.069580]  __irq_svc from __do_softirq+0xa0/0x5fc
[  220.075370]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  220.081788]  __irq_exit_rcu from irq_exit+0x8/0x28
[  220.087557]  irq_exit from call_with_stack+0x18/0x20
[  220.093503]  call_with_stack from __irq_svc+0x9c/0xbc
[  220.099402] Exception stack(0xd0385d28 to 0xd0385d70)
[  220.105218] 5d20:                   c208dd04 f9e00488 c2006940 c191a2fc c208dcc0 c208a680
[  220.114175] 5d40: c208dcc0 c191a2fc 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0385d78
[  220.123033] 5d60: c06d5e5c c06d5c60 600f0013 ffffffff
[  220.128675]  __irq_svc from _omap3_noncore_dpll_lock+0x14/0xc4
[  220.135601]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  220.144176]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  220.151871]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  220.159248]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  220.166215]  clk_set_rate from _set_opp+0x214/0x528
[  220.171991]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  220.178264]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  220.186075]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  220.193319]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  220.199733]  dbs_work_handler from process_one_work+0x284/0x72c
[  220.206617]  process_one_work from worker_thread+0x28/0x4b0
[  220.213147]  worker_thread from kthread+0xe4/0x104
[  220.218844]  kthread from ret_from_fork+0x14/0x28
[  220.224350] Exception stack(0xd0385fb0 to 0xd0385ff8)
[  220.230085] 5fa0:                                     00000000 00000000 00000000 00000000
[  220.239020] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  220.247910] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  220.255832] NMI backtrace for cpu 0
[  220.260006] CPU: 0 PID: 58 Comm: kworker/0:8 Tainted: G        W         5.18.0-rc7 #14
[  220.268798] Hardware name: Generic AM33XX (Flattened Device Tree)
[  220.275513] Workqueue: events dbs_work_handler
[  220.280953]  unwind_backtrace from show_stack+0x10/0x14
[  220.287156]  show_stack from dump_stack_lvl+0x58/0x70
[  220.293215]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  220.299984]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  220.308037]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  220.316976]  trigger_single_cpu_backtrace from rcu_dump_cpu_stacks+0xf8/0x1ec
[  220.325082]  rcu_dump_cpu_stacks from rcu_sched_clock_irq+0xab8/0xf8c
[  220.332539]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  220.339919]  update_process_times from tick_sched_handle+0x48/0x54
[  220.347059]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  220.353781]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  220.361027]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  220.368390]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  220.376332]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  220.385071]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  220.392548]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  220.399389]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  220.406030]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  220.413261]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  220.419847] Exception stack(0xd0001f58 to 0xd0001fa0)
[  220.425568] 1f40:                                                       c01015c8 00000000
[  220.434531] 1f60: 0eaec000 00000000 fffffe00 600f0013 ffffffff d0385d5c 00000000 c3744a80
[  220.443512] 1f80: 00000200 c3744a80 c208dcd8 d0001fa8 c01015c8 c01015d0 600f0113 ffffffff
[  220.452310]  __irq_svc from __do_softirq+0xa0/0x5fc
[  220.458090]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  220.464449]  __irq_exit_rcu from irq_exit+0x8/0x28
[  220.470214]  irq_exit from call_with_stack+0x18/0x20
[  220.476116]  call_with_stack from __irq_svc+0x9c/0xbc
[  220.482007] Exception stack(0xd0385d28 to 0xd0385d70)
[  220.487816] 5d20:                   c208dd04 f9e00488 c2006940 c191a2fc c208dcc0 c208a680
[  220.496775] 5d40: c208dcc0 c191a2fc 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0385d78
[  220.505644] 5d60: c06d5e5c c06d5c60 600f0013 ffffffff
[  220.511288]  __irq_svc from _omap3_noncore_dpll_lock+0x14/0xc4
[  220.518136]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  220.526717]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  220.534368]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  220.541751]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  220.548696]  clk_set_rate from _set_opp+0x214/0x528
[  220.554436]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  220.560702]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  220.568481]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  220.575706]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  220.582161]  dbs_work_handler from process_one_work+0x284/0x72c
[  220.589012]  process_one_work from worker_thread+0x28/0x4b0
[  220.595530]  worker_thread from kthread+0xe4/0x104
[  220.601208]  kthread from ret_from_fork+0x14/0x28
[  220.606707] Exception stack(0xd0385fb0 to 0xd0385ff8)
[  220.612461] 5fa0:                                     00000000 00000000 00000000 00000000
[  220.621380] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  220.630266] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #3: crash4.txt --]
[-- Type: text/plain, Size: 5183 bytes --]

[   79.751404] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   79.758633]  (detected by 0, t=2602 jiffies, g=4697, q=16429)
[   79.765139] rcu: All QSes seen, last rcu_sched kthread activity 2602 (-22026--24628), jiffies_till_next_fqs=1, root ->qsmask 0x0
[   79.777563] rcu: rcu_sched kthread starved for 2602 jiffies! g4697 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[   79.788374] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   79.797901] rcu: RCU grace-period kthread stack dump:
[   79.803469] task:rcu_sched       state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00000000
[   79.814789]  __schedule from schedule+0x58/0xcc
[   79.820464]  schedule from schedule_timeout+0x78/0xf8
[   79.826524]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[   79.833419]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[   79.839968]  rcu_gp_kthread from kthread+0xe4/0x104
[   79.845802]  kthread from ret_from_fork+0x14/0x28
[   79.851344] Exception stack(0xd0041fb0 to 0xd0041ff8)
[   79.857137] 1fa0:                                     00000000 00000000 00000000 00000000
[   79.866093] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   79.875005] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   79.882257] rcu: Stack dump where RCU GP kthread last ran:
[   79.888306] NMI backtrace for cpu 0
[   79.892364] CPU: 0 PID: 58 Comm: kworker/0:8 Tainted: G        W         5.18.0-rc7 #14
[   79.901162] Hardware name: Generic AM33XX (Flattened Device Tree)
[   79.907898] Workqueue: events dbs_work_handler
[   79.913341]  unwind_backtrace from show_stack+0x10/0x14
[   79.919588]  show_stack from dump_stack_lvl+0x58/0x70
[   79.925713]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[   79.932520]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[   79.940585]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[   79.949523]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[   79.958732]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xe1c/0xf8c
[   79.967278]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[   79.974688]  update_process_times from tick_sched_handle+0x48/0x54
[   79.981854]  tick_sched_handle from tick_sched_timer+0x48/0xac
[   79.988576]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[   79.995829]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[   80.003205]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[   80.011215]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[   80.019978]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[   80.027458]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[   80.034335]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[   80.040971]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[   80.048196]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[   80.054809] Exception stack(0xd0001f58 to 0xd0001fa0)
[   80.060516] 1f40:                                                       c01015c8 00000000
[   80.069486] 1f60: 0eaec000 00000000 fffffffe 600f0013 ffffffff d0385d64 016e3600 c3744a80
[   80.078437] 1f80: 00000002 c3744a80 ffffffff d0001fa8 c01015c8 c01015d0 600f0113 ffffffff
[   80.087237]  __irq_svc from __do_softirq+0xa0/0x5fc
[   80.093025]  __do_softirq from __irq_exit_rcu+0x138/0x178
[   80.099460]  __irq_exit_rcu from irq_exit+0x8/0x28
[   80.105230]  irq_exit from call_with_stack+0x18/0x20
[   80.111159]  call_with_stack from __irq_svc+0x9c/0xbc
[   80.117055] Exception stack(0xd0385d30 to 0xd0385d78)
[   80.122806] 5d20:                                     00001901 f9e0042c 00000002 f9e00000
[   80.131771] 5d40: c208dcc0 00000000 c208a680 c191a2fc 016e3600 11e1a300 c1109210 00000000
[   80.140683] 5d60: fffffff9 d0385d80 c06d5d8c c06d30d8 600f0013 ffffffff
[   80.147898]  __irq_svc from clk_memmap_readl+0x28/0x90
[   80.154011]  clk_memmap_readl from omap3_noncore_dpll_program+0x7c/0x5e4
[   80.161766]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[   80.169435]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[   80.176806]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[   80.183777]  clk_set_rate from _set_opp+0x214/0x528
[   80.189541]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[   80.195818]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[   80.203634]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[   80.210877]  od_dbs_update from dbs_work_handler+0x2c/0x60
[   80.217323]  dbs_work_handler from process_one_work+0x284/0x72c
[   80.224217]  process_one_work from worker_thread+0x28/0x4b0
[   80.230730]  worker_thread from kthread+0xe4/0x104
[   80.236422]  kthread from ret_from_fork+0x14/0x28
[   80.241925] Exception stack(0xd0385fb0 to 0xd0385ff8)
[   80.247670] 5fa0:                                     00000000 00000000 00000000 00000000
[   80.256597] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   80.265480] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #4: crash1.txt --]
[-- Type: text/plain, Size: 9355 bytes --]

[  259.990401] rcu: INFO: rcu_sched self-detected stall on CPU
[  259.997260] rcu:     0-...!: (2600 ticks this GP) idle=5af/1/0x40000004 softirq=7041/7041 fqs=0
[  260.006798]  (t=2600 jiffies g=16825 q=11323)
[  260.011833] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g16825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  260.023878] rcu:     Possible timer handling issue on cpu=0 timer-softirq=5692
[  260.031436] rcu: rcu_sched kthread starved for 2600 jiffies! g16825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[  260.042517] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  260.052059] rcu: RCU grace-period kthread stack dump:
[  260.057621] task:rcu_sched       state:I stack:    0 pid:   11 ppid:     2 flags:0x00000000
[  260.067142]  __schedule from schedule+0x58/0xcc
[  260.072792]  schedule from schedule_timeout+0x78/0xf8
[  260.078867]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[  260.085765]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[  260.092307]  rcu_gp_kthread from kthread+0xe4/0x104
[  260.098151]  kthread from ret_from_fork+0x14/0x28
[  260.103695] Exception stack(0xd0041fb0 to 0xd0041ff8)
[  260.109490] 1fa0:                                     00000000 00000000 00000000 00000000
[  260.118466] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  260.127383] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  260.134615] rcu: Stack dump where RCU GP kthread last ran:
[  260.140672] NMI backtrace for cpu 0
[  260.144724] CPU: 0 PID: 59 Comm: kworker/0:9 Tainted: G        W         5.18.0-rc7 #14
[  260.153508] Hardware name: Generic AM33XX (Flattened Device Tree)
[  260.160237] Workqueue: events dbs_work_handler
[  260.165684]  unwind_backtrace from show_stack+0x10/0x14
[  260.171933]  show_stack from dump_stack_lvl+0x58/0x70
[  260.178024]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  260.184833]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  260.192906]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  260.201852]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[  260.211059]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xa98/0xf8c
[  260.219589]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  260.227022]  update_process_times from tick_sched_handle+0x48/0x54
[  260.234180]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  260.240922]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  260.248166]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  260.255530]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  260.263551]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  260.272314]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  260.279800]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  260.286673]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  260.293323]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  260.300520]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  260.307114] Exception stack(0xd0001f58 to 0xd0001fa0)
[  260.312847] 1f40:                                                       c01015c8 00000000
[  260.321798] 1f60: 0eaec000 00000000 fffffff8 60020013 ffffffff d0389d34 00000000 c3742a40
[  260.330763] 1f80: 00000008 c3742a40 ffffffff d0001fa8 c01015c8 c01015d0 60020113 ffffffff
[  260.339564]  __irq_svc from __do_softirq+0xa0/0x5fc
[  260.345366]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  260.351794]  __irq_exit_rcu from irq_exit+0x8/0x28
[  260.357572]  irq_exit from call_with_stack+0x18/0x20
[  260.363513]  call_with_stack from __irq_svc+0x9c/0xbc
[  260.369403] Exception stack(0xd0389d00 to 0xd0389d48)
[  260.375239] 9d00: 00000005 f9e00488 00000002 f9e00000 00000007 c208dcd8 c191a2fc c208dcc0
[  260.384200] 9d20: 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0389d50 c06d59ec c06d30d8
[  260.393026] 9d40: 60020013 ffffffff
[  260.397059]  __irq_svc from clk_memmap_readl+0x28/0x90
[  260.403180]  clk_memmap_readl from _omap3_dpll_write_clken+0x24/0x58
[  260.410566]  _omap3_dpll_write_clken from _omap3_noncore_dpll_lock+0x94/0xc4
[  260.418690]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  260.427274]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  260.434956]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  260.442340]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  260.449272]  clk_set_rate from _set_opp+0x214/0x528
[  260.455062]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  260.461336]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  260.469146]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  260.476378]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  260.482821]  dbs_work_handler from process_one_work+0x284/0x72c
[  260.489705]  process_one_work from worker_thread+0x28/0x4b0
[  260.496232]  worker_thread from kthread+0xe4/0x104
[  260.501922]  kthread from ret_from_fork+0x14/0x28
[  260.507432] Exception stack(0xd0389fb0 to 0xd0389ff8)
[  260.513179] 9fa0:                                     00000000 00000000 00000000 00000000
[  260.522110] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  260.531008] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  260.539014] NMI backtrace for cpu 0
[  260.543179] CPU: 0 PID: 59 Comm: kworker/0:9 Tainted: G        W         5.18.0-rc7 #14
[  260.551963] Hardware name: Generic AM33XX (Flattened Device Tree)
[  260.558674] Workqueue: events dbs_work_handler
[  260.564131]  unwind_backtrace from show_stack+0x10/0x14
[  260.570357]  show_stack from dump_stack_lvl+0x58/0x70
[  260.576398]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  260.583157]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  260.591223]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  260.600156]  trigger_single_cpu_backtrace from rcu_dump_cpu_stacks+0xf8/0x1ec
[  260.608284]  rcu_dump_cpu_stacks from rcu_sched_clock_irq+0xab8/0xf8c
[  260.615746]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  260.623134]  update_process_times from tick_sched_handle+0x48/0x54
[  260.630280]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  260.636991]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  260.644235]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  260.651601]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  260.659558]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  260.668288]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  260.675782]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  260.682627]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  260.689264]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  260.696486]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  260.703067] Exception stack(0xd0001f58 to 0xd0001fa0)
[  260.708780] 1f40:                                                       c01015c8 00000000
[  260.717753] 1f60: 0eaec000 00000000 fffffff8 60020013 ffffffff d0389d34 00000000 c3742a40
[  260.726702] 1f80: 00000008 c3742a40 ffffffff d0001fa8 c01015c8 c01015d0 60020113 ffffffff
[  260.735511]  __irq_svc from __do_softirq+0xa0/0x5fc
[  260.741288]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  260.747659]  __irq_exit_rcu from irq_exit+0x8/0x28
[  260.753441]  irq_exit from call_with_stack+0x18/0x20
[  260.759337]  call_with_stack from __irq_svc+0x9c/0xbc
[  260.765228] Exception stack(0xd0389d00 to 0xd0389d48)
[  260.771072] 9d00: 00000005 f9e00488 00000002 f9e00000 00000007 c208dcd8 c191a2fc c208dcc0
[  260.780031] 9d20: 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0389d50 c06d59ec c06d30d8
[  260.788844] 9d40: 60020013 ffffffff
[  260.792883]  __irq_svc from clk_memmap_readl+0x28/0x90
[  260.798946]  clk_memmap_readl from _omap3_dpll_write_clken+0x24/0x58
[  260.806301]  _omap3_dpll_write_clken from _omap3_noncore_dpll_lock+0x94/0xc4
[  260.814421]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  260.823015]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  260.830702]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  260.838091]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  260.845045]  clk_set_rate from _set_opp+0x214/0x528
[  260.850797]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  260.857071]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  260.864863]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  260.872082]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  260.878522]  dbs_work_handler from process_one_work+0x284/0x72c
[  260.885357]  process_one_work from worker_thread+0x28/0x4b0
[  260.891875]  worker_thread from kthread+0xe4/0x104
[  260.897568]  kthread from ret_from_fork+0x14/0x28
[  260.903073] Exception stack(0xd0389fb0 to 0xd0389ff8)
[  260.908814] 9fa0:                                     00000000 00000000 00000000 00000000
[  260.917745] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  260.926641] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #5: crash2.txt --]
[-- Type: text/plain, Size: 4512 bytes --]

[  112.951462] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  112.958658]  (detected by 0, t=2602 jiffies, g=4733, q=11060)
[  112.965167] rcu: All QSes seen, last rcu_sched kthread activity 2602 (-18706--21308), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  112.977570] rcu: rcu_sched kthread starved for 2602 jiffies! g4733 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[  112.988383] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  112.997921] rcu: RCU grace-period kthread stack dump:
[  113.003480] task:rcu_sched       state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00000000
[  113.014832]  __schedule from schedule+0x58/0xcc
[  113.020467]  schedule from schedule_timeout+0x78/0xf8
[  113.026535]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[  113.033431]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[  113.039978]  rcu_gp_kthread from kthread+0xe4/0x104
[  113.045824]  kthread from ret_from_fork+0x14/0x28
[  113.051356] Exception stack(0xd0041fb0 to 0xd0041ff8)
[  113.057147] 1fa0:                                     00000000 00000000 00000000 00000000
[  113.066104] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  113.075023] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  113.082279] rcu: Stack dump where RCU GP kthread last ran:
[  113.088341] NMI backtrace for cpu 0
[  113.092378] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.18.0-rc7 #14
[  113.099831] Hardware name: Generic AM33XX (Flattened Device Tree)
[  113.106561] Workqueue: events dbs_work_handler
[  113.112021]  unwind_backtrace from show_stack+0x10/0x14
[  113.118272]  show_stack from dump_stack_lvl+0x58/0x70
[  113.124379]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  113.131208]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  113.139264]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  113.148200]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[  113.157383]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xe1c/0xf8c
[  113.165916]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  113.173339]  update_process_times from tick_sched_handle+0x48/0x54
[  113.180505]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  113.187240]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  113.194469]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  113.201827]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  113.209839]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  113.218603]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  113.226084]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  113.232972]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  113.239613]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  113.246842]  generic_handle_arch_irq from call_with_stack+0x18/0x20
[  113.254073]  call_with_stack from __irq_svc+0x9c/0xbc
[  113.259973] Exception stack(0xd0395d40 to 0xd0395d88)
[  113.265824] 5d40: 00000005 f9e00488 00000000 00000000 c208dcc0 00001901 c208a680 c191a2fc
[  113.274781] 5d60: 00000000 c208dcc0 c1109210 c208dcd8 fffffff9 d0395d90 c06d5ef0 c06d6104
[  113.283594] 5d80: 60070013 ffffffff
[  113.287629]  __irq_svc from omap3_noncore_dpll_program+0x3f4/0x5e4
[  113.294907]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  113.302572]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  113.309950]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  113.316908]  clk_set_rate from _set_opp+0x260/0x528
[  113.322680]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  113.328969]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  113.336784]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  113.344024]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  113.350466]  dbs_work_handler from process_one_work+0x284/0x72c
[  113.357326]  process_one_work from worker_thread+0x28/0x4b0
[  113.363841]  worker_thread from kthread+0xe4/0x104
[  113.369532]  kthread from ret_from_fork+0x14/0x28
[  113.375038] Exception stack(0xd0395fb0 to 0xd0395ff8)
[  113.380793] 5fa0:                                     00000000 00000000 00000000 00000000
[  113.389727] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  113.398614] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
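
One way to approach Ard's question about how similar the backtraces are is to pull the interrupted function out of each log mechanically. The following is only a minimal sketch, assuming the unwinder prints frames in the "caller from callee+0xOFF/0xSIZE" form seen in the attachments above. The helper names (interrupted_functions, task_context_function) are hypothetical and not part of any kernel tooling. It treats the last "__irq_svc from ..." frame in a log as the code that was running in task context when the timer interrupt arrived.

```python
import re

# Matches unwinder lines such as
#   "[  113.287629]  __irq_svc from omap3_noncore_dpll_program+0x3f4/0x5e4"
# group 1 is the frame being unwound, group 2 is the function it returns into.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+([\w.]+) from ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def interrupted_functions(log_text):
    """Return the function named in every '__irq_svc from X' frame, in log order."""
    found = []
    for line in log_text.splitlines():
        match = FRAME.match(line)
        if match and match.group(1) == "__irq_svc":
            found.append(match.group(2))
    return found


def task_context_function(log_text):
    """Return the last interrupted function in the log, or None if there is none."""
    funcs = interrupted_functions(log_text)
    return funcs[-1] if funcs else None


# A few lines copied from crash2.txt, used here only as sample input.
SAMPLE = """\
[  113.246842]  generic_handle_arch_irq from call_with_stack+0x18/0x20
[  113.254073]  call_with_stack from __irq_svc+0x9c/0xbc
[  113.259973] Exception stack(0xd0395d40 to 0xd0395d88)
[  113.287629]  __irq_svc from omap3_noncore_dpll_program+0x3f4/0x5e4
[  113.294907]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
"""

if __name__ == "__main__":
    print(task_context_function(SAMPLE))
```

Running the same helper over each of the four attachments would give one function name per log, which makes it easier to see whether the stalls cluster in the same call path. In the logs above, every trace passes through omap3_noncore_dpll_program, called from the cpufreq dbs_work_handler work item.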


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
@ 2022-05-31  8:36                                                                     ` Yegor Yefremov
  0 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-31  8:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

On Mon, May 30, 2022 at 5:15 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Mon, 30 May 2022 at 15:54, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Sat, May 28, 2022 at 9:28 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > >
> > > On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > > > <yegorslists@googlemail.com> wrote:
> > > > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > In file included from ./include/linux/irqflags.h:17,
> > > > >                  from ./arch/arm/include/asm/bitops.h:28,
> > > > >                  from ./include/linux/bitops.h:33,
> > > > >                  from ./include/linux/log2.h:12,
> > > > >                  from kernel/bounds.c:13:
> > > > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > > > undeclared (first use in this function); did you mean
> > > > > ‘__my_cpu_offset’?
> > > > >    32 |  return __per_cpu_offset[0];
> > > > >       |         ^~~~~~~~~~~~~~~~
> > > > >       |         __my_cpu_offset
> > > > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > > > is reported only once for each function it appears in
> > > >
> > > > I think you just missed the line in my patch that adds the
> > > > "extern unsigned long __per_cpu_offset[];" variable declaration.
> > >
> > > So, I tried both variants and both led to stalls.
> >
> > I'm running out of ideas here.  Going back to the original bisection,
> > I rebased Ard's patches in a way that you should be able to build the
> > config for each patch, and I split up the "ARM: implement
> > THREAD_INFO_IN_TASK for uniprocessor systems" commit in yet
> > another way, hoping to get something left over that points to the
> > bug. Can you try bisecting through the top commits of
> >
> > https://kernel.org/pub/scm/linux/kernel/git/soc/soc.git am335x-stall-test
> >
> > starting maybe with "52d240871760 irqchip: nvic: Use
> > GENERIC_IRQ_MULTI_HANDLER" as the patch that is almost certainly
> > going to be ok?
> >
> > At some point I fear we may have to give up and just mark the v6+SMP
> > configuration as broken, which is something we have considered in the
> > past but ended up always keeping around for the purpose of testing
> > omap2plus_defconfig and imx_v6_v7_defconfig. Note that on production
> > systems you probably don't want to use that config anyway, and should
> > either stick to a uniprocessor build, or disable the ARMv6 support.
> >
>
> Yeah, I am also running out of ideas. One question, though: does the
> RCU detected stall always occur in the same place? I.e., how similar
> are the backtraces of the stalls between different occurrences?
> Perhaps we could narrow down where in the code we are stalling, and
> gain some more understanding of the root cause.

I have attached 4 crash logs and will start bisecting Arnd's branch.

Yegor

[-- Attachment #2: crash3.txt --]
[-- Type: text/plain, Size: 9097 bytes --]

[  219.721096] rcu: INFO: rcu_sched self-detected stall on CPU
[  219.727845] rcu:     0-...!: (2600 ticks this GP) idle=e7d/1/0x40000004 softirq=3592/3592 fqs=0
[  219.737376]  (t=2600 jiffies g=5525 q=21)
[  219.742051] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g5525 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  219.753979] rcu:     Possible timer handling issue on cpu=0 timer-softirq=2867
[  219.761534] rcu: rcu_sched kthread starved for 2600 jiffies! g5525 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[  219.772512] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  219.782043] rcu: RCU grace-period kthread stack dump:
[  219.787605] task:rcu_sched       state:I stack:    0 pid:   11 ppid:     2 flags:0x00000000
[  219.797138]  __schedule from schedule+0x58/0xcc
[  219.802763]  schedule from schedule_timeout+0x78/0xf8
[  219.808847]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[  219.815741]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[  219.822273]  rcu_gp_kthread from kthread+0xe4/0x104
[  219.828121]  kthread from ret_from_fork+0x14/0x28
[  219.833664] Exception stack(0xd0041fb0 to 0xd0041ff8)
[  219.839459] 1fa0:                                     00000000 00000000 00000000 00000000
[  219.848426] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  219.857325] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  219.864572] rcu: Stack dump where RCU GP kthread last ran:
[  219.870636] NMI backtrace for cpu 0
[  219.874702] CPU: 0 PID: 58 Comm: kworker/0:8 Tainted: G        W         5.18.0-rc7 #14
[  219.883491] Hardware name: Generic AM33XX (Flattened Device Tree)
[  219.890214] Workqueue: events dbs_work_handler
[  219.895659]  unwind_backtrace from show_stack+0x10/0x14
[  219.901897]  show_stack from dump_stack_lvl+0x58/0x70
[  219.908005]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  219.914814]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  219.922875]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  219.931828]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[  219.941030]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xa98/0xf8c
[  219.949573]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  219.957001]  update_process_times from tick_sched_handle+0x48/0x54
[  219.964167]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  219.970891]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  219.978144]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  219.985517]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  219.993529]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  220.002289]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  220.009770]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  220.016630]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  220.023279]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  220.030501]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  220.037105] Exception stack(0xd0001f58 to 0xd0001fa0)
[  220.042841] 1f40:                                                       c01015c8 00000000
[  220.051805] 1f60: 0eaec000 00000000 fffffe00 600f0013 ffffffff d0385d5c 00000000 c3744a80
[  220.060765] 1f80: 00000200 c3744a80 c208dcd8 d0001fa8 c01015c8 c01015d0 600f0113 ffffffff
[  220.069580]  __irq_svc from __do_softirq+0xa0/0x5fc
[  220.075370]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  220.081788]  __irq_exit_rcu from irq_exit+0x8/0x28
[  220.087557]  irq_exit from call_with_stack+0x18/0x20
[  220.093503]  call_with_stack from __irq_svc+0x9c/0xbc
[  220.099402] Exception stack(0xd0385d28 to 0xd0385d70)
[  220.105218] 5d20:                   c208dd04 f9e00488 c2006940 c191a2fc c208dcc0 c208a680
[  220.114175] 5d40: c208dcc0 c191a2fc 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0385d78
[  220.123033] 5d60: c06d5e5c c06d5c60 600f0013 ffffffff
[  220.128675]  __irq_svc from _omap3_noncore_dpll_lock+0x14/0xc4
[  220.135601]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  220.144176]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  220.151871]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  220.159248]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  220.166215]  clk_set_rate from _set_opp+0x214/0x528
[  220.171991]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  220.178264]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  220.186075]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  220.193319]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  220.199733]  dbs_work_handler from process_one_work+0x284/0x72c
[  220.206617]  process_one_work from worker_thread+0x28/0x4b0
[  220.213147]  worker_thread from kthread+0xe4/0x104
[  220.218844]  kthread from ret_from_fork+0x14/0x28
[  220.224350] Exception stack(0xd0385fb0 to 0xd0385ff8)
[  220.230085] 5fa0:                                     00000000 00000000 00000000 00000000
[  220.239020] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  220.247910] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  220.255832] NMI backtrace for cpu 0
[  220.260006] CPU: 0 PID: 58 Comm: kworker/0:8 Tainted: G        W         5.18.0-rc7 #14
[  220.268798] Hardware name: Generic AM33XX (Flattened Device Tree)
[  220.275513] Workqueue: events dbs_work_handler
[  220.280953]  unwind_backtrace from show_stack+0x10/0x14
[  220.287156]  show_stack from dump_stack_lvl+0x58/0x70
[  220.293215]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  220.299984]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  220.308037]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  220.316976]  trigger_single_cpu_backtrace from rcu_dump_cpu_stacks+0xf8/0x1ec
[  220.325082]  rcu_dump_cpu_stacks from rcu_sched_clock_irq+0xab8/0xf8c
[  220.332539]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  220.339919]  update_process_times from tick_sched_handle+0x48/0x54
[  220.347059]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  220.353781]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  220.361027]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  220.368390]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  220.376332]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  220.385071]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  220.392548]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  220.399389]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  220.406030]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  220.413261]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  220.419847] Exception stack(0xd0001f58 to 0xd0001fa0)
[  220.425568] 1f40:                                                       c01015c8 00000000
[  220.434531] 1f60: 0eaec000 00000000 fffffe00 600f0013 ffffffff d0385d5c 00000000 c3744a80
[  220.443512] 1f80: 00000200 c3744a80 c208dcd8 d0001fa8 c01015c8 c01015d0 600f0113 ffffffff
[  220.452310]  __irq_svc from __do_softirq+0xa0/0x5fc
[  220.458090]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  220.464449]  __irq_exit_rcu from irq_exit+0x8/0x28
[  220.470214]  irq_exit from call_with_stack+0x18/0x20
[  220.476116]  call_with_stack from __irq_svc+0x9c/0xbc
[  220.482007] Exception stack(0xd0385d28 to 0xd0385d70)
[  220.487816] 5d20:                   c208dd04 f9e00488 c2006940 c191a2fc c208dcc0 c208a680
[  220.496775] 5d40: c208dcc0 c191a2fc 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0385d78
[  220.505644] 5d60: c06d5e5c c06d5c60 600f0013 ffffffff
[  220.511288]  __irq_svc from _omap3_noncore_dpll_lock+0x14/0xc4
[  220.518136]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  220.526717]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  220.534368]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  220.541751]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  220.548696]  clk_set_rate from _set_opp+0x214/0x528
[  220.554436]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  220.560702]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  220.568481]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  220.575706]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  220.582161]  dbs_work_handler from process_one_work+0x284/0x72c
[  220.589012]  process_one_work from worker_thread+0x28/0x4b0
[  220.595530]  worker_thread from kthread+0xe4/0x104
[  220.601208]  kthread from ret_from_fork+0x14/0x28
[  220.606707] Exception stack(0xd0385fb0 to 0xd0385ff8)
[  220.612461] 5fa0:                                     00000000 00000000 00000000 00000000
[  220.621380] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  220.630266] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #3: crash4.txt --]
[-- Type: text/plain, Size: 5183 bytes --]

[   79.751404] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   79.758633]  (detected by 0, t=2602 jiffies, g=4697, q=16429)
[   79.765139] rcu: All QSes seen, last rcu_sched kthread activity 2602 (-22026--24628), jiffies_till_next_fqs=1, root ->qsmask 0x0
[   79.777563] rcu: rcu_sched kthread starved for 2602 jiffies! g4697 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[   79.788374] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   79.797901] rcu: RCU grace-period kthread stack dump:
[   79.803469] task:rcu_sched       state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00000000
[   79.814789]  __schedule from schedule+0x58/0xcc
[   79.820464]  schedule from schedule_timeout+0x78/0xf8
[   79.826524]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[   79.833419]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[   79.839968]  rcu_gp_kthread from kthread+0xe4/0x104
[   79.845802]  kthread from ret_from_fork+0x14/0x28
[   79.851344] Exception stack(0xd0041fb0 to 0xd0041ff8)
[   79.857137] 1fa0:                                     00000000 00000000 00000000 00000000
[   79.866093] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   79.875005] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   79.882257] rcu: Stack dump where RCU GP kthread last ran:
[   79.888306] NMI backtrace for cpu 0
[   79.892364] CPU: 0 PID: 58 Comm: kworker/0:8 Tainted: G        W         5.18.0-rc7 #14
[   79.901162] Hardware name: Generic AM33XX (Flattened Device Tree)
[   79.907898] Workqueue: events dbs_work_handler
[   79.913341]  unwind_backtrace from show_stack+0x10/0x14
[   79.919588]  show_stack from dump_stack_lvl+0x58/0x70
[   79.925713]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[   79.932520]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[   79.940585]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[   79.949523]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[   79.958732]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xe1c/0xf8c
[   79.967278]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[   79.974688]  update_process_times from tick_sched_handle+0x48/0x54
[   79.981854]  tick_sched_handle from tick_sched_timer+0x48/0xac
[   79.988576]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[   79.995829]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[   80.003205]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[   80.011215]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[   80.019978]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[   80.027458]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[   80.034335]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[   80.040971]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[   80.048196]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[   80.054809] Exception stack(0xd0001f58 to 0xd0001fa0)
[   80.060516] 1f40:                                                       c01015c8 00000000
[   80.069486] 1f60: 0eaec000 00000000 fffffffe 600f0013 ffffffff d0385d64 016e3600 c3744a80
[   80.078437] 1f80: 00000002 c3744a80 ffffffff d0001fa8 c01015c8 c01015d0 600f0113 ffffffff
[   80.087237]  __irq_svc from __do_softirq+0xa0/0x5fc
[   80.093025]  __do_softirq from __irq_exit_rcu+0x138/0x178
[   80.099460]  __irq_exit_rcu from irq_exit+0x8/0x28
[   80.105230]  irq_exit from call_with_stack+0x18/0x20
[   80.111159]  call_with_stack from __irq_svc+0x9c/0xbc
[   80.117055] Exception stack(0xd0385d30 to 0xd0385d78)
[   80.122806] 5d20:                                     00001901 f9e0042c 00000002 f9e00000
[   80.131771] 5d40: c208dcc0 00000000 c208a680 c191a2fc 016e3600 11e1a300 c1109210 00000000
[   80.140683] 5d60: fffffff9 d0385d80 c06d5d8c c06d30d8 600f0013 ffffffff
[   80.147898]  __irq_svc from clk_memmap_readl+0x28/0x90
[   80.154011]  clk_memmap_readl from omap3_noncore_dpll_program+0x7c/0x5e4
[   80.161766]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[   80.169435]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[   80.176806]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[   80.183777]  clk_set_rate from _set_opp+0x214/0x528
[   80.189541]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[   80.195818]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[   80.203634]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[   80.210877]  od_dbs_update from dbs_work_handler+0x2c/0x60
[   80.217323]  dbs_work_handler from process_one_work+0x284/0x72c
[   80.224217]  process_one_work from worker_thread+0x28/0x4b0
[   80.230730]  worker_thread from kthread+0xe4/0x104
[   80.236422]  kthread from ret_from_fork+0x14/0x28
[   80.241925] Exception stack(0xd0385fb0 to 0xd0385ff8)
[   80.247670] 5fa0:                                     00000000 00000000 00000000 00000000
[   80.256597] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   80.265480] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #4: crash1.txt --]
[-- Type: text/plain, Size: 9355 bytes --]

[  259.990401] rcu: INFO: rcu_sched self-detected stall on CPU
[  259.997260] rcu:     0-...!: (2600 ticks this GP) idle=5af/1/0x40000004 softirq=7041/7041 fqs=0
[  260.006798]  (t=2600 jiffies g=16825 q=11323)
[  260.011833] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g16825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  260.023878] rcu:     Possible timer handling issue on cpu=0 timer-softirq=5692
[  260.031436] rcu: rcu_sched kthread starved for 2600 jiffies! g16825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[  260.042517] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  260.052059] rcu: RCU grace-period kthread stack dump:
[  260.057621] task:rcu_sched       state:I stack:    0 pid:   11 ppid:     2 flags:0x00000000
[  260.067142]  __schedule from schedule+0x58/0xcc
[  260.072792]  schedule from schedule_timeout+0x78/0xf8
[  260.078867]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[  260.085765]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[  260.092307]  rcu_gp_kthread from kthread+0xe4/0x104
[  260.098151]  kthread from ret_from_fork+0x14/0x28
[  260.103695] Exception stack(0xd0041fb0 to 0xd0041ff8)
[  260.109490] 1fa0:                                     00000000 00000000 00000000 00000000
[  260.118466] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  260.127383] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  260.134615] rcu: Stack dump where RCU GP kthread last ran:
[  260.140672] NMI backtrace for cpu 0
[  260.144724] CPU: 0 PID: 59 Comm: kworker/0:9 Tainted: G        W         5.18.0-rc7 #14
[  260.153508] Hardware name: Generic AM33XX (Flattened Device Tree)
[  260.160237] Workqueue: events dbs_work_handler
[  260.165684]  unwind_backtrace from show_stack+0x10/0x14
[  260.171933]  show_stack from dump_stack_lvl+0x58/0x70
[  260.178024]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  260.184833]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  260.192906]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  260.201852]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[  260.211059]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xa98/0xf8c
[  260.219589]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  260.227022]  update_process_times from tick_sched_handle+0x48/0x54
[  260.234180]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  260.240922]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  260.248166]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  260.255530]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  260.263551]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  260.272314]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  260.279800]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  260.286673]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  260.293323]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  260.300520]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  260.307114] Exception stack(0xd0001f58 to 0xd0001fa0)
[  260.312847] 1f40:                                                       c01015c8 00000000
[  260.321798] 1f60: 0eaec000 00000000 fffffff8 60020013 ffffffff d0389d34 00000000 c3742a40
[  260.330763] 1f80: 00000008 c3742a40 ffffffff d0001fa8 c01015c8 c01015d0 60020113 ffffffff
[  260.339564]  __irq_svc from __do_softirq+0xa0/0x5fc
[  260.345366]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  260.351794]  __irq_exit_rcu from irq_exit+0x8/0x28
[  260.357572]  irq_exit from call_with_stack+0x18/0x20
[  260.363513]  call_with_stack from __irq_svc+0x9c/0xbc
[  260.369403] Exception stack(0xd0389d00 to 0xd0389d48)
[  260.375239] 9d00: 00000005 f9e00488 00000002 f9e00000 00000007 c208dcd8 c191a2fc c208dcc0
[  260.384200] 9d20: 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0389d50 c06d59ec c06d30d8
[  260.393026] 9d40: 60020013 ffffffff
[  260.397059]  __irq_svc from clk_memmap_readl+0x28/0x90
[  260.403180]  clk_memmap_readl from _omap3_dpll_write_clken+0x24/0x58
[  260.410566]  _omap3_dpll_write_clken from _omap3_noncore_dpll_lock+0x94/0xc4
[  260.418690]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  260.427274]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  260.434956]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  260.442340]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  260.449272]  clk_set_rate from _set_opp+0x214/0x528
[  260.455062]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  260.461336]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  260.469146]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  260.476378]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  260.482821]  dbs_work_handler from process_one_work+0x284/0x72c
[  260.489705]  process_one_work from worker_thread+0x28/0x4b0
[  260.496232]  worker_thread from kthread+0xe4/0x104
[  260.501922]  kthread from ret_from_fork+0x14/0x28
[  260.507432] Exception stack(0xd0389fb0 to 0xd0389ff8)
[  260.513179] 9fa0:                                     00000000 00000000 00000000 00000000
[  260.522110] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  260.531008] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  260.539014] NMI backtrace for cpu 0
[  260.543179] CPU: 0 PID: 59 Comm: kworker/0:9 Tainted: G        W         5.18.0-rc7 #14
[  260.551963] Hardware name: Generic AM33XX (Flattened Device Tree)
[  260.558674] Workqueue: events dbs_work_handler
[  260.564131]  unwind_backtrace from show_stack+0x10/0x14
[  260.570357]  show_stack from dump_stack_lvl+0x58/0x70
[  260.576398]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  260.583157]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  260.591223]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  260.600156]  trigger_single_cpu_backtrace from rcu_dump_cpu_stacks+0xf8/0x1ec
[  260.608284]  rcu_dump_cpu_stacks from rcu_sched_clock_irq+0xab8/0xf8c
[  260.615746]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  260.623134]  update_process_times from tick_sched_handle+0x48/0x54
[  260.630280]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  260.636991]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  260.644235]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  260.651601]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  260.659558]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  260.668288]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  260.675782]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  260.682627]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  260.689264]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  260.696486]  generic_handle_arch_irq from __irq_svc+0x90/0xbc
[  260.703067] Exception stack(0xd0001f58 to 0xd0001fa0)
[  260.708780] 1f40:                                                       c01015c8 00000000
[  260.717753] 1f60: 0eaec000 00000000 fffffff8 60020013 ffffffff d0389d34 00000000 c3742a40
[  260.726702] 1f80: 00000008 c3742a40 ffffffff d0001fa8 c01015c8 c01015d0 60020113 ffffffff
[  260.735511]  __irq_svc from __do_softirq+0xa0/0x5fc
[  260.741288]  __do_softirq from __irq_exit_rcu+0x138/0x178
[  260.747659]  __irq_exit_rcu from irq_exit+0x8/0x28
[  260.753441]  irq_exit from call_with_stack+0x18/0x20
[  260.759337]  call_with_stack from __irq_svc+0x9c/0xbc
[  260.765228] Exception stack(0xd0389d00 to 0xd0389d48)
[  260.771072] 9d00: 00000005 f9e00488 00000002 f9e00000 00000007 c208dcd8 c191a2fc c208dcc0
[  260.780031] 9d20: 00000000 c208dcc0 00000005 c208dcd8 fffffff9 d0389d50 c06d59ec c06d30d8
[  260.788844] 9d40: 60020013 ffffffff
[  260.792883]  __irq_svc from clk_memmap_readl+0x28/0x90
[  260.798946]  clk_memmap_readl from _omap3_dpll_write_clken+0x24/0x58
[  260.806301]  _omap3_dpll_write_clken from _omap3_noncore_dpll_lock+0x94/0xc4
[  260.814421]  _omap3_noncore_dpll_lock from omap3_noncore_dpll_program+0x14c/0x5e4
[  260.823015]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  260.830702]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  260.838091]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  260.845045]  clk_set_rate from _set_opp+0x214/0x528
[  260.850797]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  260.857071]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  260.864863]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  260.872082]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  260.878522]  dbs_work_handler from process_one_work+0x284/0x72c
[  260.885357]  process_one_work from worker_thread+0x28/0x4b0
[  260.891875]  worker_thread from kthread+0xe4/0x104
[  260.897568]  kthread from ret_from_fork+0x14/0x28
[  260.903073] Exception stack(0xd0389fb0 to 0xd0389ff8)
[  260.908814] 9fa0:                                     00000000 00000000 00000000 00000000
[  260.917745] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  260.926641] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #5: crash2.txt --]
[-- Type: text/plain, Size: 4512 bytes --]

[  112.951462] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  112.958658]  (detected by 0, t=2602 jiffies, g=4733, q=11060)
[  112.965167] rcu: All QSes seen, last rcu_sched kthread activity 2602 (-18706--21308), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  112.977570] rcu: rcu_sched kthread starved for 2602 jiffies! g4733 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[  112.988383] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  112.997921] rcu: RCU grace-period kthread stack dump:
[  113.003480] task:rcu_sched       state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00000000
[  113.014832]  __schedule from schedule+0x58/0xcc
[  113.020467]  schedule from schedule_timeout+0x78/0xf8
[  113.026535]  schedule_timeout from rcu_gp_fqs_loop+0x108/0x3d0
[  113.033431]  rcu_gp_fqs_loop from rcu_gp_kthread+0xa8/0x134
[  113.039978]  rcu_gp_kthread from kthread+0xe4/0x104
[  113.045824]  kthread from ret_from_fork+0x14/0x28
[  113.051356] Exception stack(0xd0041fb0 to 0xd0041ff8)
[  113.057147] 1fa0:                                     00000000 00000000 00000000 00000000
[  113.066104] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  113.075023] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  113.082279] rcu: Stack dump where RCU GP kthread last ran:
[  113.088341] NMI backtrace for cpu 0
[  113.092378] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.18.0-rc7 #14
[  113.099831] Hardware name: Generic AM33XX (Flattened Device Tree)
[  113.106561] Workqueue: events dbs_work_handler
[  113.112021]  unwind_backtrace from show_stack+0x10/0x14
[  113.118272]  show_stack from dump_stack_lvl+0x58/0x70
[  113.124379]  dump_stack_lvl from nmi_cpu_backtrace+0xe0/0x128
[  113.131208]  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xec/0x184
[  113.139264]  nmi_trigger_cpumask_backtrace from trigger_single_cpu_backtrace+0x20/0x2c
[  113.148200]  trigger_single_cpu_backtrace from rcu_check_gp_kthread_starvation+0xf4/0x148
[  113.157383]  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0xe1c/0xf8c
[  113.165916]  rcu_sched_clock_irq from update_process_times+0x88/0xc0
[  113.173339]  update_process_times from tick_sched_handle+0x48/0x54
[  113.180505]  tick_sched_handle from tick_sched_timer+0x48/0xac
[  113.187240]  tick_sched_timer from __hrtimer_run_queues+0x250/0x4e4
[  113.194469]  __hrtimer_run_queues from hrtimer_interrupt+0x128/0x2c8
[  113.201827]  hrtimer_interrupt from dmtimer_clockevent_interrupt+0x24/0x2c
[  113.209839]  dmtimer_clockevent_interrupt from __handle_irq_event_percpu+0x98/0x334
[  113.218603]  __handle_irq_event_percpu from handle_irq_event+0x38/0xc0
[  113.226084]  handle_irq_event from handle_level_irq+0xb4/0x1a8
[  113.232972]  handle_level_irq from handle_irq_desc+0x1c/0x2c
[  113.239613]  handle_irq_desc from generic_handle_arch_irq+0x2c/0x64
[  113.246842]  generic_handle_arch_irq from call_with_stack+0x18/0x20
[  113.254073]  call_with_stack from __irq_svc+0x9c/0xbc
[  113.259973] Exception stack(0xd0395d40 to 0xd0395d88)
[  113.265824] 5d40: 00000005 f9e00488 00000000 00000000 c208dcc0 00001901 c208a680 c191a2fc
[  113.274781] 5d60: 00000000 c208dcc0 c1109210 c208dcd8 fffffff9 d0395d90 c06d5ef0 c06d6104
[  113.283594] 5d80: 60070013 ffffffff
[  113.287629]  __irq_svc from omap3_noncore_dpll_program+0x3f4/0x5e4
[  113.294907]  omap3_noncore_dpll_program from clk_change_rate+0x238/0x4f8
[  113.302572]  clk_change_rate from clk_core_set_rate_nolock+0x1b0/0x29c
[  113.309950]  clk_core_set_rate_nolock from clk_set_rate+0x30/0x64
[  113.316908]  clk_set_rate from _set_opp+0x260/0x528
[  113.322680]  _set_opp from dev_pm_opp_set_rate+0xec/0x228
[  113.328969]  dev_pm_opp_set_rate from __cpufreq_driver_target+0x580/0x6fc
[  113.336784]  __cpufreq_driver_target from od_dbs_update+0xb4/0x168
[  113.344024]  od_dbs_update from dbs_work_handler+0x2c/0x60
[  113.350466]  dbs_work_handler from process_one_work+0x284/0x72c
[  113.357326]  process_one_work from worker_thread+0x28/0x4b0
[  113.363841]  worker_thread from kthread+0xe4/0x104
[  113.369532]  kthread from ret_from_fork+0x14/0x28
[  113.375038] Exception stack(0xd0395fb0 to 0xd0395ff8)
[  113.380793] 5fa0:                                     00000000 00000000 00000000 00000000
[  113.389727] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  113.398614] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000


[-- Attachment #6: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-31  8:36                                                                     ` Yegor Yefremov
@ 2022-05-31 14:16                                                                       ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-05-31 14:16 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Tue, May 31, 2022 at 10:36 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> On Mon, May 30, 2022 at 5:15 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Mon, 30 May 2022 at 15:54, Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Sat, May 28, 2022 at 9:28 PM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > >
> > > > On Sat, May 28, 2022 at 3:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > >
> > > > > On Sat, May 28, 2022 at 3:01 PM Yegor Yefremov
> > > > > <yegorslists@googlemail.com> wrote:
> > > > > > On Sat, May 28, 2022 at 11:07 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > In file included from ./include/linux/irqflags.h:17,
> > > > > >                  from ./arch/arm/include/asm/bitops.h:28,
> > > > > >                  from ./include/linux/bitops.h:33,
> > > > > >                  from ./include/linux/log2.h:12,
> > > > > >                  from kernel/bounds.c:13:
> > > > > > ./arch/arm/include/asm/percpu.h: In function ‘__my_cpu_offset’:
> > > > > > ./arch/arm/include/asm/percpu.h:32:9: error: ‘__per_cpu_offset’
> > > > > > undeclared (first use in this function); did you mean
> > > > > > ‘__my_cpu_offset’?
> > > > > >    32 |  return __per_cpu_offset[0];
> > > > > >       |         ^~~~~~~~~~~~~~~~
> > > > > >       |         __my_cpu_offset
> > > > > > ./arch/arm/include/asm/percpu.h:32:9: note: each undeclared identifier
> > > > > > is reported only once for each function it appears in
> > > > >
> > > > > I think you just missed the line in my patch that adds the
> > > > > "extern unsigned long __per_cpu_offset[];" variable declaration.
> > > >
> > > > So, I tried both variants and both led to stalls.
> > >
> > > I'm running out of ideas here.  Going back to the original bisection,
> > > I rebased Ard's patches in a way that you should be able to build the
> > > config for each patch, and I split up the "ARM: implement
> > > THREAD_INFO_IN_TASK for uniprocessor systems" commit in yet
> > > another way, hoping to get something left over that points to the
> > > bug. Can you try bisecting through the top commits of
> > >
> > > https://kernel.org/pub/scm/linux/kernel/git/soc/soc.git am335x-stall-test
> > >
> > > starting maybe with "52d240871760 irqchip: nvic: Use
> > > GENERIC_IRQ_MULTI_HANDLER" as the patch that is almost certainly
> > > going to be ok?
> > >
> > > At some point I fear we may have to give up and just mark the v6+SMP
> > > configuration as broken, which is something we have considered in the
> > > past but ended up always keeping around for the purpose of testing
> > > omap2plus_defconfig and imx_v6_v7_defconfig. Note that on production
> > > systems you probably don't want to use that config anyway, and should
> > > either stick to a uniprocessor build, or disable the ARMv6 support.
> > >
> >
> > Yeah, I am also running out of ideas. One question, though: does the
> > RCU detected stall always occur in the same place? I.e., how similar
> > are the backtraces of the stalls between different occurrences?
> > Perhaps we could narrow down where in the code we are stalling, and
> > gain some more understanding of the root cause.
>
> I have attached 4 crash logs and will start bisecting Arnd's branch.

My bisect results:

git bisect log
git bisect start
# good: [52d24087176055d5994ac98378426421b2d6d653] irqchip: nvic: Use GENERIC_IRQ_MULTI_HANDLER
git bisect good 52d24087176055d5994ac98378426421b2d6d653
# bad: [2d3456213319c0277ee6082946c43c3afacca9b4] [PART 2] ARM: implement THREAD_INFO_IN_TASK for uniprocessor system
git bisect bad 2d3456213319c0277ee6082946c43c3afacca9b4
# good: [20e50fc1187d82d6d9ef80c01cf8e11d476f6227] ARM: 9176/1: avoid literal references in inline assembly
git bisect good 20e50fc1187d82d6d9ef80c01cf8e11d476f6227
# good: [59f3cd822afe6445b2864d0cf1a73ca6edd24f42] ARM: smp: defer TPIDRURO update for SMP v6 configurations too
git bisect good 59f3cd822afe6445b2864d0cf1a73ca6edd24f42
# bad: [b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103] [PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems
git bisect bad b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103
# good: [dccfc18999cf4b4e518f01d5c7c578426166e5f2] ARM: v7m: enable support for IRQ stacks
git bisect good dccfc18999cf4b4e518f01d5c7c578426166e5f2
# first bad commit: [b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103] [PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems

Though commit b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103 led to a broken
kernel that didn't even show any output after the bootloader had
started it.

Commit 2d3456213319c0277ee6082946c43c3afacca9b4 showed the expected stalling.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-31 14:16                                                                       ` Yegor Yefremov
@ 2022-05-31 15:22                                                                         ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-05-31 15:22 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Ard Biesheuvel, Arnd Bergmann, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Tue, May 31, 2022 at 4:16 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Tue, May 31, 2022 at 10:36 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> # bad: [b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103] [PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems
> git bisect bad b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103
> # good: [dccfc18999cf4b4e518f01d5c7c578426166e5f2] ARM: v7m: enable support for IRQ stacks
> git bisect good dccfc18999cf4b4e518f01d5c7c578426166e5f2
> # first bad commit: [b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103] [PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems
>
> Though commit b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103 led to a broken
> kernel that didn't even show any output after the bootloader had
> started it.
>
> Commit 2d3456213319c0277ee6082946c43c3afacca9b4 showed the expected stalling.

Ok, good, so we know that the "ARM: implement THREAD_INFO_IN_TASK for
uniprocessor system" commit caused the problem then. This is what we had
already assumed, but now it's confirmed.

Too bad I screwed up that "this_cpu_offset" macro, I think it should
have been

@@ -286,7 +286,7 @@ THUMB(      fpreg   .req    r7      )
         *                   register 'rd'
         */
        .macro          this_cpu_offset, rd:req
-       mov             \rd, #0
+       ldr_va          \rd, __per_cpu_offset
        .endm

        /*
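
To illustrate why the "mov \rd, #0" version cannot match the C side,
which returns __per_cpu_offset[0], here is a minimal user-space sketch
in C. It is only a model under stated assumptions: storage[],
model_per_cpu_offset[] and the helper names are made up for this
example and are not kernel symbols, and it assumes that the offset for
CPU 0 is nonzero once the per-CPU areas have been set up, as is normally
the case on a CONFIG_SMP kernel.

```c
#include <assert.h>
#include <stddef.h>

enum { TEMPLATE = 0, CPU0 = 1 };

/*
 * storage[TEMPLATE] models the link-time copy of a per-CPU variable,
 * storage[CPU0] the copy that CPU 0 is supposed to use at run time.
 */
static int storage[2];

/* Hypothetical stand-in for the kernel's __per_cpu_offset[] table. */
static unsigned long model_per_cpu_offset[1] = { CPU0 * sizeof(int) };

/* What the C side does on a uniprocessor: return __per_cpu_offset[0]. */
static unsigned long offset_from_table(void)
{
	return model_per_cpu_offset[0];
}

/* What the broken assembler macro did: mov \rd, #0. */
static unsigned long offset_hardcoded_zero(void)
{
	return 0;
}

/* Resolve the modelled per-CPU variable through a given byte offset. */
static int *per_cpu_ptr_model(unsigned long offset)
{
	assert(offset < sizeof(storage));
	return (int *)((char *)storage + offset);
}

/* Returns 1 when both offset sources resolve to the same copy. */
static int views_agree(void)
{
	return per_cpu_ptr_model(offset_from_table()) ==
	       per_cpu_ptr_model(offset_hardcoded_zero());
}
```

With a nonzero table entry, views_agree() returns 0: a value written
through the table-based offset is not seen through the hardcoded zero
offset, so C and assembly code would be looking at different copies.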

I've pushed a modified branch now, with that fix on the broken commit,
and another change to make CONFIG_IRQSTACKS user-selectable rather
than always enabled. That should tell us if the problem is in the SMP
patching or in the irqstacks.

Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?

      Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-05-31 15:22                                                                         ` Arnd Bergmann
@ 2022-06-01  7:36                                                                           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-01  7:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ard Biesheuvel, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Tue, May 31, 2022 at 4:16 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Tue, May 31, 2022 at 10:36 AM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > # bad: [b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103] [PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems
> > git bisect bad b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103
> > # good: [dccfc18999cf4b4e518f01d5c7c578426166e5f2] ARM: v7m: enable support for IRQ stacks
> > git bisect good dccfc18999cf4b4e518f01d5c7c578426166e5f2
> > # first bad commit: [b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103] [PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems
> >
> > Though commit b6b3b4814e77d2f5a7517297e9ac1d1aa1cda103 led to a broken
> > kernel that didn't even show any output after the bootloader had
> > started it.
> >
> > Commit 2d3456213319c0277ee6082946c43c3afacca9b4 showed the expected stalling.
>
> Ok, good, so we know that the "ARM: implement THREAD_INFO_IN_TASK for
> uniprocessor system" commit caused the problem then. This is what we had
> already assumed, but now it's confirmed.
>
> Too bad I screwed up that "this_cpu_offset" macro, I think it should
> have been
>
> @@ -286,7 +286,7 @@ THUMB(      fpreg   .req    r7      )
>          *                   register 'rd'
>          */
>         .macro          this_cpu_offset, rd:req
> -       mov             \rd, #0
> +       ldr_va          \rd, __per_cpu_offset
>         .endm
>
>         /*
>
> I've pushed a modified branch now, with that fix on the broken commit,
> and another change to make CONFIG_IRQSTACKS user-selectable rather
> than always enabled. That should tell us if the problem is in the SMP
> patching or in the irqstacks.
>
> Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?

1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
2. f0191ea5c2e5 with the same config does not stall

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01  7:36                                                                           ` Yegor Yefremov
@ 2022-06-01  7:59                                                                             ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-06-01  7:59 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > I've pushed a modified branch now, with that fix on the broken commit,
> > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > than always enabled. That should tell us if the problem is in the SMP
> > patching or in the irqstacks.
> >
> > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
>
> 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> 2. f0191ea5c2e5 with the same config - not

Ok, perfect, that does narrow down the problem quite a bit: The final
patch has seven changes, all of which can be done individually because
in each case the simplified version in f0191ea5c2e5 is meant to run
the exact same instructions as the version after the change, when running
on a uniprocessor machine such as your am335x.

You have already shown earlier that the get_current() and
__my_cpu_offset() functions are not to blame here, as reverting
only those does not change the behavior.

This leaves the is_smp() check in set_current(), and the
four macros in <asm/assembler.h>. I don't see anything obviously
wrong with any of those five, but I would bet on the macros
here. Can you try bisecting into this commit, maybe reverting
the changes to set_current and get_current first, and then
narrowing it down to (hopefully) a single macro that causes the
problem?
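
The narrowing step can be sketched as a plain halving search over the
remaining changes. This is a rough sketch in C, not a tool from the
tree: find_culprit(), stalls_fn and the example stub are made-up names,
the changes are referred to only by index, and it assumes that exactly
one of the candidate changes causes the stall and that they can be
applied independently. The callback stands for the manual "build, boot,
run slcand" test.

```c
#include <assert.h>

/*
 * Callback: build and boot a kernel with the changes in 'mask' applied
 * (bit i set = change i applied) and report 1 if it stalls, 0 if not.
 */
typedef int (*stalls_fn)(unsigned int mask);

/*
 * Find the single culprit among 'n' independent changes by halving the
 * candidate range [lo, hi). Assumes exactly one change causes the stall.
 */
static int find_culprit(int n, stalls_fn stalls)
{
	int lo = 0, hi = n;

	assert(n > 0 && n < 32);
	while (hi - lo > 1) {
		int mid = lo + (hi - lo) / 2;
		/* Apply only the lower half of the remaining candidates. */
		unsigned int mask = ((1u << mid) - 1) & ~((1u << lo) - 1);

		if (stalls(mask))
			hi = mid;	/* culprit is in [lo, mid) */
		else
			lo = mid;	/* culprit is in [mid, hi) */
	}
	return lo;
}

/* Example stub standing in for a real test run; purely illustrative. */
static int example_culprit = 2;

static int example_stalls(unsigned int mask)
{
	return (mask >> example_culprit) & 1;
}
```

For four candidate changes this needs two test builds, e.g.
find_culprit(4, example_stalls). If more than one change is involved,
the single-culprit assumption breaks and the result is not meaningful.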

        Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01  7:59                                                                             ` Arnd Bergmann
@ 2022-06-01  8:08                                                                               ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-01  8:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yegor Yefremov, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > I've pushed a modified branch now, with that fix on the broken commit,
> > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > than always enabled. That should tell us if the problem is in the SMP
> > > patching or in the irqstacks.
> > >
> > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> >
> > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > 2. f0191ea5c2e5 with the same config - not
>
> Ok, perfect, that does narrow down the problem quite a bit: The final
> patch has seven changes, all of which can be done individually because
> in each case the simplified version in f0191ea5c2e5 is meant to run
> the exact same instructions as the version after the change, when running
> on a uniprocessor machine such as your am335x.
>
> You have already shown earlier that the get_current() and
> __my_cpu_offset() functions are not to blame here, as reverting
> only those does not change the behavior.
>
> This leaves the is_smp() check in set_current(), and the
> four macros in <asm/assembler.h>. I don't see anything obviously
> wrong with any of those five, but I would bet on the macros
> here. Can you try bisecting into this commit, maybe reverting
> the changes to set_current and get_current first, and then
> narrowing it down to (hopefully) a single macro that causes the
> problem?
>

set_current() is never called by the primary CPU, which is why the
is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
pointless SMP check on secondary startup path").

So that leaves only the four macros in asm/assembler.h, but I don't
see anything obviously wrong with those either.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01  8:08                                                                               ` Ard Biesheuvel
@ 2022-06-01  9:27                                                                                 ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-01  9:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yegor Yefremov, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > than always enabled. That should tell us if the problem is in the SMP
> > > > patching or in the irqstacks.
> > > >
> > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > >
> > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > 2. f0191ea5c2e5 with the same config - not
> >
> > Ok, perfect, that does narrow down the problem quite a bit: The final
> > patch has seven changes, all of which can be done individually because
> > in each case the simplified version in f0191ea5c2e5 is meant to run
> > the exact same instructions as the version after the change, when running
> > on a uniprocessor machine such as your am335x.
> >
> > You have already shown earlier that the get_current() and
> > __my_cpu_offset() functions are not to blame here, as reverting
> > only those does not change the behavior.
> >
> > This leaves the is_smp() check in set_current(), and the
> > four macros in <asm/assembler.h>. I don't see anything obviously
> > wrong with any of those five, but I would bet on the macros
> > here. Can you try bisecting into this commit, maybe reverting
> > the changes to set_current and get_current first, and then
> > narrowing it down to (hopefully) a single macro that causes the
> > problem?
> >
>
> set_current() is never called by the primary CPU, which is why the
> is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> pointless SMP check on secondary startup path").
>
> So that leaves only the four macros in asm/assembler.h, but I don't
> see anything obviously wrong with those either.

I pushed a patch on top of Arnd's branch at the link below that gets
rid of the subsections, and uses normal branches (and code patching)
to switch between the thread ID register and the LDR to retrieve the
CPU offset and the current pointer. I have no explanation whether or
why it could make a difference, but I think it's worth a try.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01  9:27                                                                                 ` Ard Biesheuvel
@ 2022-06-01 10:03                                                                                   ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-01 10:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > <yegorslists@googlemail.com> wrote:
> > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > patching or in the irqstacks.
> > > > >
> > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > >
> > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > 2. f0191ea5c2e5 with the same config does not stall
> > >
> > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > patch has seven changes, all of which can be done individually because
> > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > the exact same instructions as the version after the change, when running
> > > on a uniprocessor machine such as your am335x.
> > >
> > > You have already shown earlier that the get_current() and
> > > __my_cpu_offset() functions are not to blame here, as reverting
> > > only those does not change the behavior.
> > >
> > > This leaves the is_smp() check in set_current(), and the
> > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > wrong with any of those five, but I would bet on the macros
> > > here. Can you try bisecting into this commit, maybe reverting
> > > the changes to set_current and get_current first, and then
> > > narrowing it down to (hopefully) a single macro that causes the
> > > problem?
> > >
> >
> > set_current() is never called by the primary CPU, which is why the
> > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > pointless SMP check on secondary startup path").
> >
> > So that leaves only the four macros in asm/assembler.h, but I don't
> > see anything obviously wrong with those either.
>
> I pushed a patch on top of Arnd's branch at the link below that gets
> rid of the subsections, and uses normal branches (and code patching)
> to switch between the thread ID register and the LDR to retrieve the
> CPU offset and the current pointer. I have no explanation whether or
> why it could make a difference, but I think it's worth a try.

The link to your repo is missing.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01 10:03                                                                                   ` Yegor Yefremov
@ 2022-06-01 10:06                                                                                     ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-01 10:06 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > <yegorslists@googlemail.com> wrote:
> > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > patching or in the irqstacks.
> > > > > >
> > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > >
> > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > 2. f0191ea5c2e5 with the same config does not stall
> > > >
> > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > patch has seven changes, all of which can be done individually because
> > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > the exact same instructions as the version after the change, when running
> > > > on a uniprocessor machine such as your am335x.
> > > >
> > > > You have already shown earlier that the get_current() and
> > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > only those does not change the behavior.
> > > >
> > > > This leaves the is_smp() check in set_current(), and the
> > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > wrong with any of those five, but I would bet on the macros
> > > > here. Can you try bisecting into this commit, maybe reverting
> > > > the changes to set_current and get_current first, and then
> > > > narrowing it down to (hopefully) a single macro that causes the
> > > > problem?
> > > >
> > >
> > > set_current() is never called by the primary CPU, which is why the
> > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > pointless SMP check on secondary startup path").
> > >
> > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > see anything obviously wrong with those either.
> >
> > I pushed a patch on top of Arnd's branch at the link below that gets
> > rid of the subsections, and uses normal branches (and code patching)
> > to switch between the thread ID register and the LDR to retrieve the
> > CPU offset and the current pointer. I have no explanation whether or
> > why it could make a difference, but I think it's worth a try.
>
> The link to your repo is missing.
>

Oops, sorry :-)

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01 10:06                                                                                     ` Ard Biesheuvel
@ 2022-06-01 10:46                                                                                       ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-01 10:46 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > >
> > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > <yegorslists@googlemail.com> wrote:
> > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > patching or in the irqstacks.
> > > > > > >
> > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > >
> > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > 2. f0191ea5c2e5 with the same config does not stall
> > > > >
> > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > patch has seven changes, all of which can be done individually because
> > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > the exact same instructions as the version after the change, when running
> > > > > on a uniprocessor machine such as your am335x.
> > > > >
> > > > > You have already shown earlier that the get_current() and
> > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > only those does not change the behavior.
> > > > >
> > > > > This leaves the is_smp() check in set_current(), and the
> > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > wrong with any of those five, but I would bet on the macros
> > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > the changes to set_current and get_current first, and then
> > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > problem?
> > > > >
> > > >
> > > > set_current() is never called by the primary CPU, which is why the
> > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > pointless SMP check on secondary startup path").
> > > >
> > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > see anything obviously wrong with those either.
> > >
> > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > rid of the subsections, and uses normal branches (and code patching)
> > > to switch between the thread ID register and the LDR to retrieve the
> > > CPU offset and the current pointer. I have no explanation whether or
> > > why it could make a difference, but I think it's worth a try.
> >
> > The link to your repo is missing.
> >
>
> Oops, sorry :-)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test

I have tested your branch and it stalls:

[   69.924298] rcu: INFO: rcu_sched self-detected stall on CPU
[   69.930986] rcu:     0-...!: (2600 ticks this GP) idle=6f5/1/0x40000004 softirq=2257/2257 fqs=0
[   69.940551]  (t=2600 jiffies g=3413 q=11)
[   69.945187] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g3413 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   69.957111] rcu:     Possible timer handling issue on cpu=0 timer-softirq=1261
[   69.964668] rcu: rcu_sched kthread starved for 2600 jiffies! g3413 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   69.975638] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   69.985170] rcu: RCU grace-period kthread stack dump:
[   69.990708] task:rcu_sched       state:I stack:    0 pid:   10 ppid:     2 flags:0x00000000
[   70.000250] [<c0b683b4>] (__schedule) from [<c0b68cf8>] (schedule+0x54/0xe8)
[   70.008705] [<c0b68cf8>] (schedule) from [<c0b6f4fc>] (schedule_timeout+0xa8/0x210)
[   70.017449] [<c0b6f4fc>] (schedule_timeout) from [<c01d8594>] (rcu_gp_fqs_loop+0x118/0x6b4)
[   70.026875] [<c01d8594>] (rcu_gp_fqs_loop) from [<c01dc4c4>] (rcu_gp_kthread+0x138/0x30c)
[   70.036074] [<c01dc4c4>] (rcu_gp_kthread) from [<c0164dd8>] (kthread+0x13c/0x164)
[   70.044559] [<c0164dd8>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x44)
[   70.052732] rcu: Stack dump where RCU GP kthread last ran:
[   70.058773] NMI backtrace for cpu 0
[   70.062840] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.16.0-rc1 #1
[   70.070003] Hardware name: Generic AM33XX (Flattened Device Tree)
[   70.076698] Workqueue: events dbs_work_handler
[   70.082258] [<c01115f0>] (unwind_backtrace) from [<c010bfd4>] (show_stack+0x10/0x14)
[   70.091113] [<c010bfd4>] (show_stack) from [<d00299f0>] (0xd00299f0)
[   70.099045] NMI backtrace for cpu 0
[   70.103188] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.16.0-rc1 #1
[   70.110357] Hardware name: Generic AM33XX (Flattened Device Tree)
[   70.117027] Workqueue: events dbs_work_handler
[   70.122491] [<c01115f0>] (unwind_backtrace) from [<c010bfd4>] (show_stack+0x10/0x14)
[   70.131254] [<c010bfd4>] (show_stack) from [<d00299f0>] (0xd00299f0)
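
For anyone comparing this report with the earlier ones in the thread, here is
a minimal sketch that pulls the key figures out of an RCU stall report. It is
only an illustration: the helper name summarize() and the ASSUMED_HZ value are
hypothetical, and the tick rate (HZ) of the tested kernel is not stated in
this thread, so the conversion to seconds is a guess. The three sample lines
are copied from the report above.

```python
import re

# Sample lines copied from the stall report above (wrapped lines joined).
LOG = """\
[   69.924298] rcu: INFO: rcu_sched self-detected stall on CPU
[   69.940551]  (t=2600 jiffies g=3413 q=11)
[   69.964668] rcu: rcu_sched kthread starved for 2600 jiffies! g3413 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
"""

# Assumption: HZ is not given anywhere in the thread; 100 is only a guess.
ASSUMED_HZ = 100

# "(t=<jiffies since grace period start> g=<grace-period number> q=<queued callbacks>)"
STALL_RE = re.compile(r"\(t=(\d+) jiffies g=(\d+) q=(\d+)\)")
STARVED_RE = re.compile(r"kthread starved for (\d+) jiffies")


def summarize(log, hz=ASSUMED_HZ):
    """Return the stall figures found in an RCU stall report as a dict."""
    summary = {}
    stall = STALL_RE.search(log)
    if stall:
        jiffies, grace_period, queued = map(int, stall.groups())
        summary["stall_jiffies"] = jiffies
        summary["grace_period"] = grace_period
        summary["queued_callbacks"] = queued
        # Only meaningful if the assumed HZ matches the running kernel.
        summary["stall_seconds"] = jiffies / hz
    starved = STARVED_RE.search(log)
    if starved:
        summary["starved_jiffies"] = int(starved.group(1))
    return summary


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

Passing a different hz value to summarize() changes only the derived
stall_seconds figure; the jiffies counts are taken verbatim from the log.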

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01 10:46                                                                                       ` Yegor Yefremov
@ 2022-06-01 10:49                                                                                         ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-01 10:49 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > >
> > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > >
> > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > <yegorslists@googlemail.com> wrote:
> > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > patching or in the irqstacks.
> > > > > > > >
> > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > >
> > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > 2. f0191ea5c2e5 with the same config does not stall
> > > > > >
> > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > patch has seven changes, all of which can be done individually because
> > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > the exact same instructions as the version after the change, when running
> > > > > > on a uniprocessor machine such as your am335x.
> > > > > >
> > > > > > You have already shown earlier that the get_current() and
> > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > only those does not change the behavior.
> > > > > >
> > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > the changes to set_current and get_current first, and then
> > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > problem?
> > > > > >
> > > > >
> > > > > set_current() is never called by the primary CPU, which is why the
> > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > pointless SMP check on secondary startup path").
> > > > >
> > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > see anything obviously wrong with those either.
> > > >
> > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > rid of the subsections, and uses normal branches (and code patching)
> > > > to switch between the thread ID register and the LDR to retrieve the
> > > > CPU offset and the current pointer. I have no explanation whether or
> > > > why it could make a difference, but I think it's worth a try.
> > >
> > > The link to your repo is missing.
> > >
> >
> > Oops, sorry :-)
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
>
> I have tested your branch and it stalls:
>

OK, thanks for verifying.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
@ 2022-06-01 10:49                                                                                         ` Ard Biesheuvel
  0 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-01 10:49 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > >
> > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > >
> > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > <yegorslists@googlemail.com> wrote:
> > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > patching or in the irqstacks.
> > > > > > > >
> > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > >
> > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > 2. f0191ea5c2e5 with the same config does not stall
> > > > > >
> > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > patch has seven changes, all of which can be done individually because
> > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > the exact same instructions as the version after the change, when running
> > > > > > on a uniprocessor machine such as your am335x.
> > > > > >
> > > > > > You have already shown earlier that the get_current() and
> > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > only those does not change the behavior.
> > > > > >
> > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > the changes to set_current and get_current first, and then
> > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > problem?
> > > > > >
> > > > >
> > > > > set_current() is never called by the primary CPU, which is why the
> > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > pointless SMP check on secondary startup path").
> > > > >
> > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > see anything obviously wrong with those either.
> > > >
> > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > rid of the subsections, and uses normal branches (and code patching)
> > > > to switch between the thread ID register and the LDR to retrieve the
> > > > CPU offset and the current pointer. I have no explanation whether or
> > > > why it could make a difference, but I think it's worth a try.
> > >
> > > The link to your repo is missing.
> > >
> >
> > Oops, sorry :-)
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
>
> I have tested your branch and it stalls:
>

OK, thanks for verifying.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-01 10:49                                                                                         ` Ard Biesheuvel
@ 2022-06-02 10:17                                                                                           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-02 10:17 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Wed, Jun 1, 2022 at 12:50 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > >
> > > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > >
> > > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > > <yegorslists@googlemail.com> wrote:
> > > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > > patching or in the irqstacks.
> > > > > > > > >
> > > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > > >
> > > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > 2. f0191ea5c2e5 with the same config - not
> > > > > > >
> > > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > > patch has seven changes, all of which can be done individually because
> > > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > > the exact same instructions as the version after the change, when running
> > > > > > > on a uniprocessor machine such as your am335x.
> > > > > > >
> > > > > > > You have already shown earlier that the get_current() and
> > > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > > only those does not change the behavior.
> > > > > > >
> > > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > > the changes to set_current and get_current first, and then
> > > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > > problem?
> > > > > > >
> > > > > >
> > > > > > set_current() is never called by the primary CPU, which is why the
> > > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > > pointless SMP check on secondary startup path").
> > > > > >
> > > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > > see anything obviously wrong with those either.
> > > > >
> > > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > > rid of the subsections, and uses normal branches (and code patching)
> > > > > to switch between the thread ID register and the LDR to retrieve the
> > > > > CPU offset and the current pointer. I have no explanation whether or
> > > > > why it could make a difference, but I think it's worth a try.
> > > >
> > > > The link to your repo is missing.
> > > >
> > >
> > > Oops, sorry :-)
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
> >
> > I have tested your branch and it stalls:
> >
>
> OK, thanks for verifying.

My bisection results for f0191ea5c2e5aab29484ede0493ca385eec5472f as a base:

percpu.h: sporadic stalls
current.h: always stalls
assembler.h: no stalls
smp.c: no stalls

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-02 10:17                                                                                           ` Yegor Yefremov
@ 2022-06-02 10:37                                                                                             ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-02 10:37 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Thu, 2 Jun 2022 at 12:17, Yegor Yefremov <yegorslists@googlemail.com> wrote:
>
> On Wed, Jun 1, 2022 at 12:50 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > >
> > > On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > >
> > > > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > >
> > > > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > >
> > > > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > > > <yegorslists@googlemail.com> wrote:
> > > > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > > > patching or in the irqstacks.
> > > > > > > > > >
> > > > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > > > >
> > > > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > > 2. f0191ea5c2e5 with the same config - not
> > > > > > > >
> > > > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > > > patch has seven changes, all of which can be done individually because
> > > > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > > > the exact same instructions as the version after the change, when running
> > > > > > > > on a uniprocessor machine such as your am335x.
> > > > > > > >
> > > > > > > > You have already shown earlier that the get_current() and
> > > > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > > > only those does not change the behavior.
> > > > > > > >
> > > > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > > > the changes to set_current and get_current first, and then
> > > > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > > > problem?
> > > > > > > >
> > > > > > >
> > > > > > > set_current() is never called by the primary CPU, which is why the
> > > > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > > > pointless SMP check on secondary startup path").
> > > > > > >
> > > > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > > > see anything obviously wrong with those either.
> > > > > >
> > > > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > > > rid of the subsections, and uses normal branches (and code patching)
> > > > > > to switch between the thread ID register and the LDR to retrieve the
> > > > > > CPU offset and the current pointer. I have no explanation whether or
> > > > > > why it could make a difference, but I think it's worth a try.
> > > > >
> > > > > The link to your repo is missing.
> > > > >
> > > >
> > > > Oops, sorry :-)
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
> > >
> > > I have tested your branch and it stalls:
> > >
> >
> > OK, thanks for verifying.
>
> My bisection results for f0191ea5c2e5aab29484ede0493ca385eec5472f as a base:
>
> percpu.h: sporadic stalls
> current.h: always stalls
> assembler.h: no stalls
> smp.c: no stalls
>

So you mean that applying the changes to each of those files in
isolation to the baseline in f0191ea5c2e5aab29484ede0493ca385eec5472f
produces those results, right?

That confirms my statement that smp.c cannot be the culprit, and
appears to exonerate the pure asm pieces. I wonder if this is related
to insufficient asm constraints on the C helpers, or just the cost
model taking different decisions because the inline asm string is much
longer. In any case, this opens up a couple of avenues we could
explore to narrow this down further.

As a quick check, can you try the below snippet applied onto the
broken current.h build?

--- a/arch/arm/include/asm/current.h
+++ b/arch/arm/include/asm/current.h
@@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
            "   b       . + (2b - 0b)                           \n\t"
            "   .popsection                                     \n\t"
 #endif
-           : "=r"(cur));
+           : "=r"(cur)
+           : "Q" (*(const unsigned long *)current_stack_pointer));
 #elif __LINUX_ARM_ARCH__>= 7 || \
       !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
       (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))

Given that the problematic sequence appears to be in C code, could you
please confirm whether or not the stall is reproducible when all the
pieces that are used by the CAN stack (musb, slcan, ftdi_sio, etc.)
are built into the kernel rather than built as modules? Also, which
GCC version are you using?

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-02 10:37                                                                                             ` Ard Biesheuvel
@ 2022-06-02 12:27                                                                                               ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-02 12:27 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 2 Jun 2022 at 12:17, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> >
> > On Wed, Jun 1, 2022 at 12:50 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > >
> > > > On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > > > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > >
> > > > > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > > > > <yegorslists@googlemail.com> wrote:
> > > > > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > > > > patching or in the irqstacks.
> > > > > > > > > > >
> > > > > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > > > > >
> > > > > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > > > 2. f0191ea5c2e5 with the same config - not
> > > > > > > > >
> > > > > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > > > > patch has seven changes, all of which can be done individually because
> > > > > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > > > > the exact same instructions as the version after the change, when running
> > > > > > > > > on a uniprocessor machine such as your am335x.
> > > > > > > > >
> > > > > > > > > You have already shown earlier that the get_current() and
> > > > > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > > > > only those does not change the behavior.
> > > > > > > > >
> > > > > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > > > > the changes to set_current and get_current first, and then
> > > > > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > > > > problem?
> > > > > > > > >
> > > > > > > >
> > > > > > > > set_current() is never called by the primary CPU, which is why the
> > > > > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > > > > pointless SMP check on secondary startup path").
> > > > > > > >
> > > > > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > > > > see anything obviously wrong with those either.
> > > > > > >
> > > > > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > > > > rid of the subsections, and uses normal branches (and code patching)
> > > > > > > to switch between the thread ID register and the LDR to retrieve the
> > > > > > > CPU offset and the current pointer. I have no explanation whether or
> > > > > > > why it could make a difference, but I think it's worth a try.
> > > > > >
> > > > > > The link to your repo is missing.
> > > > > >
> > > > >
> > > > > Oops, sorry :-)
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
> > > >
> > > > I have tested your branch and it stalls:
> > > >
> > >
> > > OK, thanks for verifying.
> >
> > My bisection results for f0191ea5c2e5aab29484ede0493ca385eec5472f as a base:
> >
> > percpu.h: sporadic stalls
> > current.h: always stalls
> > assembler.h: no stalls
> > smp.c: no stalls
> >
>
> So you mean that applying the changes to each of those files in
> isolation to the baseline in f0191ea5c2e5aab29484ede0493ca385eec5472f
> produces those results, right?

Right.

> That confirms my statement that smp.c cannot be the culprit, and
> appears to exonerate the pure asm pieces. I wonder if this is related
> to insufficient asm constraints on the C helpers, or just the cost
> model taking different decisions because the inline asm string is much
> longer. In any case, this opens up a couple of avenues we could
> explore to narrow this down further.
>
> As a quick check, can you try the below snippet applied onto the
> broken current.h build?
>
> --- a/arch/arm/include/asm/current.h
> +++ b/arch/arm/include/asm/current.h
> @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct task_struct *get_current(void)
>             "   b       . + (2b - 0b)                           \n\t"
>             "   .popsection                                     \n\t"
>  #endif
> -           : "=r"(cur));
> +           : "=r"(cur)
> +           : "Q" (*(const unsigned long *)current_stack_pointer));

Where is current_stack_pointer defined?

>  #elif __LINUX_ARM_ARCH__>= 7 || \
>        !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
>        (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))
>
> Given that the problematic sequence appears to be in C code, could you
> please confirm whether or not the stall is reproducible when all the
> pieces that are used by the CAN stack (musb, slcan, ftdi_sio, etc.)
> are built into the kernel rather than built as modules? Also, which
> GCC version are you using?

For now, the CAN stack parts are built as modules. I'll try to compile them in.

I'm using GCC 10.x

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-02 12:27                                                                                               ` Yegor Yefremov
@ 2022-06-03  8:54                                                                                                 ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-03  8:54 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

[-- Attachment #1: Type: text/plain, Size: 6571 bytes --]

On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 2 Jun 2022 at 12:17, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > >
> > > On Wed, Jun 1, 2022 at 12:50 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > > > On Wed, 1 Jun 2022 at 12:46, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > >
> > > > > On Wed, Jun 1, 2022 at 12:06 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > >
> > > > > > On Wed, 1 Jun 2022 at 12:04, Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 1, 2022 at 11:28 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Wed, 1 Jun 2022 at 10:08, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 1 Jun 2022 at 09:59, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 1, 2022 at 9:36 AM Yegor Yefremov
> > > > > > > > > > <yegorslists@googlemail.com> wrote:
> > > > > > > > > > > On Tue, May 31, 2022 at 5:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > > > > > > > > I've pushed a modified branch now, with that fix on the broken commit,
> > > > > > > > > > > > and another change to make CONFIG_IRQSTACKS user-selectable rather
> > > > > > > > > > > > than always enabled. That should tell us if the problem is in the SMP
> > > > > > > > > > > > patching or in the irqstacks.
> > > > > > > > > > > >
> > > > > > > > > > > > Can you test the top of this branch with CONFIG_IRQSTACKS disabled,
> > > > > > > > > > > > and (if that still stalls) retest the fixed commit f0191ea5c2e5 ("[PART 1]
> > > > > > > > > > > > ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems")?
> > > > > > > > > > >
> > > > > > > > > > > 1. the top of this branch with CONFIG_IRQSTACKS disabled stalls
> > > > > > > > > > > 2. f0191ea5c2e5 with the same config - not
> > > > > > > > > >
> > > > > > > > > > Ok, perfect, that does narrow down the problem quite a bit: The final
> > > > > > > > > > patch has seven changes, all of which can be done individually because
> > > > > > > > > > in each case the simplified version in f0191ea5c2e5 is meant to run
> > > > > > > > > > the exact same instructions as the version after the change, when running
> > > > > > > > > > on a uniprocessor machine such as your am335x.
> > > > > > > > > >
> > > > > > > > > > You have already shown earlier that the get_current() and
> > > > > > > > > > __my_cpu_offset() functions are not to blame here, as reverting
> > > > > > > > > > only those does not change the behavior.
> > > > > > > > > >
> > > > > > > > > > This leaves the is_smp() check in set_current(), and the
> > > > > > > > > > four macros in <asm/assembler.h>. I don't see anything obviously
> > > > > > > > > > wrong with any of those five, but I would bet on the macros
> > > > > > > > > > here. Can you try bisecting into this commit, maybe reverting
> > > > > > > > > > the changes to set_current and get_current first, and then
> > > > > > > > > > narrowing it down to (hopefully) a single macro that causes the
> > > > > > > > > > problem?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > set_current() is never called by the primary CPU, which is why the
> > > > > > > > > is_smp() check was removed from there in 57a420435edcb0b94 ("ARM: drop
> > > > > > > > > pointless SMP check on secondary startup path").
> > > > > > > > >
> > > > > > > > > So that leaves only the four macros in asm/assembler.h, but I don't
> > > > > > > > > see anything obviously wrong with those either.
> > > > > > > >
> > > > > > > > I pushed a patch on top of Arnd's branch at the link below that gets
> > > > > > > > rid of the subsections, and uses normal branches (and code patching)
> > > > > > > > to switch between the thread ID register and the LDR to retrieve the
> > > > > > > > CPU offset and the current pointer. I have no explanation whether or
> > > > > > > > why it could make a difference, but I think it's worth a try.
> > > > > > >
> > > > > > > The link to your repo is missing.
> > > > > > >
> > > > > >
> > > > > > Oops, sorry :-)
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
> > > > >
> > > > > I have tested your branch and it stalls:
> > > > >
> > > >
> > > > OK, thanks for verifying.
> > >
> > > My bisection results for f0191ea5c2e5aab29484ede0493ca385eec5472f as a base:
> > >
> > > percpu.h: sporadic stalls
> > > current.h: always stalls
> > > assembler.h: no stalls
> > > smp.c: no stalls
> > >
> >
> > So you mean that applying the changes to each of those files in
> > isolation to the baseline in f0191ea5c2e5aab29484ede0493ca385eec5472f
> > produces those results, right?
>
> Right.
>
> > That confirms my statement that smp.c cannot be the culprit, and
> > appears to exonerate the pure asm pieces. I wonder if this is related
> > to insufficient asm constraints on the C helpers, or just the cost
> > model taking different decisions because the inline asm string is much
> > longer. In any case, this opens up a couple of avenues we could
> > explore to narrow this down further.
> >
> > As a quick check, can you try the below snippet applied onto the
> > broken current.h build?
> >
> > --- a/arch/arm/include/asm/current.h
> > +++ b/arch/arm/include/asm/current.h
> > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct
> > task_struct *get_current(void)
> >             "   b       . + (2b - 0b)                           \n\t"
> >             "   .popsection                                     \n\t"
> >  #endif
> > -           : "=r"(cur));
> > +           : "=r"(cur)
> > +           : "Q" (*(const unsigned long *)current_stack_pointer));
>
> Where is the current_stack_pointer defined?
>
> >  #elif __LINUX_ARM_ARCH__>= 7 || \
> >        !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
> >        (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))
> >
> > Given that the problematic sequence appears to be in C code, could you
> > please confirm whether or not the stall is reproducible when all the
> > > pieces that are used by the CAN stack (musb, slcan, ftdi_sio, etc)
> > are built into the kernel rather than built as modules? Also, which
> > GCC version are you using?
>
> For now, the CAN stack parts are built as modules. I'll try to compile them in.
>
> I'm using GCC 10.x

I have tried your patch (see the attachment) and the system stalls.

Will try GCC 11.x and also compiled-in drivers.

[-- Attachment #2: current_v2.patch --]
[-- Type: application/octet-stream, Size: 2211 bytes --]

diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
index 677e217a9e65..ad38dcc1c8a0 100644
--- a/arch/arm/include/asm/current.h
+++ b/arch/arm/include/asm/current.h
@@ -14,9 +14,57 @@ struct task_struct;
 
 extern struct task_struct *__current;
 
+register unsigned long current_stack_pointer asm ("sp");
+
 static inline __attribute_const__ struct task_struct *get_current(void)
 {
-	return __current;
+	struct task_struct *cur;
+
+#if __has_builtin(__builtin_thread_pointer) && \
+    defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) && \
+    !(defined(CONFIG_THUMB2_KERNEL) && \
+      defined(CONFIG_CC_IS_CLANG) && CONFIG_CLANG_VERSION < 130001)
+	/*
+	 * Use the __builtin helper when available - this results in better
+	 * code, especially when using GCC in combination with the per-task
+	 * stack protector, as the compiler will recognize that it needs to
+	 * load the TLS register only once in every function.
+	 *
+	 * Clang < 13.0.1 gets this wrong for Thumb2 builds:
+	 * https://github.com/ClangBuiltLinux/linux/issues/1485
+	 */
+	cur = __builtin_thread_pointer();
+#elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP)
+	asm("0:	mrc p15, 0, %0, c13, c0, 3			\n\t"
+#ifdef CONFIG_CPU_V6
+	    "1:							\n\t"
+	    "	.subsection 1					\n\t"
+#if !(defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS)) && \
+    !(defined(CONFIG_LD_IS_LLD) && CONFIG_LLD_VERSION < 140000)
+	    "2: " LOAD_SYM_ARMV6(%0, __current) "		\n\t"
+	    "	b	1b					\n\t"
+#else
+	    "2:	ldr	%0, 3f					\n\t"
+	    "	ldr	%0, [%0]				\n\t"
+	    "	b	1b					\n\t"
+	    "3:	.long	__current				\n\t"
+#endif
+	    "	.previous					\n\t"
+	    "	.pushsection \".alt.smp.init\", \"a\"		\n\t"
+	    "	.long	0b - .					\n\t"
+	    "	b	. + (2b - 0b)				\n\t"
+	    "	.popsection					\n\t"
+#endif
+           : "=r"(cur)
+           : "Q" (*(const unsigned long *)current_stack_pointer));
+#elif __LINUX_ARM_ARCH__>= 7 || \
+      (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS)) || \
+      (defined(CONFIG_LD_IS_LLD) && CONFIG_LLD_VERSION < 140000)
+	cur = __current;
+#else
+	asm(LOAD_SYM_ARMV6(%0, __current) : "=r"(cur));
+#endif
+	return cur;
 }
 
 #define current get_current()
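
For readers following the `.alt.smp.init` part of the attachment above: each entry is a pair of words, a relative offset back to label `0:` (".long 0b - .") and a replacement instruction ("b . + (2b - 0b)"). On a uniprocessor boot the replacement is expected to be copied over the `mrc`, so that execution branches to the `ldr` sequence at label `2:`. The following is a minimal userspace sketch of that idea only. The names `arm_b`, `fixup_smp_on_up` and `demo_patched_word`, and the toy image layout, are assumptions made for illustration and are not the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define INSN_MRC_TPIDRURO_R0 0xee1d0f70u /* mrc p15, 0, r0, c13, c0, 3 */

/* Encode an A32 "b" whose target is `off` bytes away from the branch itself. */
static uint32_t arm_b(int32_t off)
{
    /* The CPU reads PC as (branch address + 8); imm24 counts words. */
    return 0xea000000u | ((uint32_t)((off - 8) / 4) & 0x00ffffffu);
}

/* Apply every (relative offset, replacement) pair in image[tbl .. tbl+n_words). */
static void fixup_smp_on_up(uint32_t *image, size_t tbl, size_t n_words)
{
    for (size_t e = tbl; e + 1 < tbl + n_words; e += 2) {
        int32_t rel = (int32_t)image[e];                 /* ".long 0b - ." */
        size_t target = (size_t)((ptrdiff_t)e + rel / 4);

        image[target] = image[e + 1];                    /* "b . + (2b - 0b)" */
    }
}

/* Build a toy image, patch it, and return the word that ends up at label 0. */
static uint32_t demo_patched_word(void)
{
    uint32_t image[8] = {0};
    const size_t l0 = 1, l2 = 4, tbl = 6;  /* word indices of 0:, 2: and the table */

    image[l0] = INSN_MRC_TPIDRURO_R0;
    image[tbl] = (uint32_t)(((int32_t)l0 - (int32_t)tbl) * 4);
    image[tbl + 1] = arm_b(((int32_t)l2 - (int32_t)l0) * 4);

    fixup_smp_on_up(image, tbl, 2);
    return image[l0];
}
```

If the patching step is skipped, or runs after the first call to get_current(), the word at label `0:` is still the `mrc`, and the function reads TPIDRURO instead of loading `__current`.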

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-03  8:54                                                                                                 ` Yegor Yefremov
@ 2022-06-03  9:32                                                                                                   ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-06-03  9:32 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Ard Biesheuvel, Arnd Bergmann, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, Jun 3, 2022 at 10:54 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@kernel.org> wrote:

> > > That confirms my statement that smp.c cannot be the culprit, and
> > > appears to exonerate the pure asm pieces. I wonder if this is related
> > > to insufficient asm constraints on the C helpers, or just the cost
> > > model taking different decisions because the inline asm string is much
> > > longer. In any case, this opens up a couple of avenues we could
> > > explore to narrow this down further.
> > >
> > > As a quick check, can you try the below snippet applied onto the
> > > broken current.h build?
> > >
> > > --- a/arch/arm/include/asm/current.h
> > > +++ b/arch/arm/include/asm/current.h
> > > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct
> > > task_struct *get_current(void)
> > >             "   b       . + (2b - 0b)                           \n\t"
> > >             "   .popsection                                     \n\t"
> > >  #endif
> > > -           : "=r"(cur));
> > > +           : "=r"(cur)
> > > +           : "Q" (*(const unsigned long *)current_stack_pointer));
> >
> > Where is the current_stack_pointer defined?
> >
> > >  #elif __LINUX_ARM_ARCH__>= 7 || \
> > >        !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
> > >        (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))
> > >
> > > Given that the problematic sequence appears to be in C code, could you
> > > please confirm whether or not the stall is reproducible when all the
> > > pieces that are used by the CAN stack (musb, slcan, ftdi_sio, etc)
> > > are built into the kernel rather than built as modules? Also, which
> > > GCC version are you using?
> >
> > For now, the CAN stack parts are built as modules. I'll try to compile them in.
> >
> > I'm using GCC 10.x
>
> I have tried your patch (see the attachment) and the system stalls.

This is with only get_current() patched on top of the working
f0191ea5c2e5 ("[PART 1] ARM: implement THREAD_INFO_IN_TASK for
uniprocessor systems"), right?

My best theory right now is that something in get_current() is wrong and
causes it to return the wrong task pointer, which in turn lets
current->preempt_count get out of sync. This may be related to the cppi41
dmaengine tasklet, and it effectively disables further softirqs, including
the timer that triggers the RCU grace period.

When we finally switch tasks to the cpufreq worker thread, softirqs
can happen again because of the task switch, and at the next IRQ
the timer detects the stall.
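
The theory above can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: `struct task`, `bh_disable`, `bh_enable`, `softirqs_allowed` and `demo` are hypothetical names, and the two constants only mirror the idea of the kernel's softirq bits in preempt_count. The point is that if the "disable" and the matching "enable" are applied through different task pointers, the count of the task that is really running never returns to zero.

```c
#include <stdbool.h>

/* Toy stand-ins: softirq-disable nesting is tracked in a bit field of the count. */
#define SOFTIRQ_MASK            0xff00
#define SOFTIRQ_DISABLE_OFFSET  0x0200

struct task {
    int preempt_count;
};

static void bh_disable(struct task *t)
{
    t->preempt_count += SOFTIRQ_DISABLE_OFFSET;
}

static void bh_enable(struct task *t)
{
    t->preempt_count -= SOFTIRQ_DISABLE_OFFSET;
}

static bool softirqs_allowed(const struct task *t)
{
    return (t->preempt_count & SOFTIRQ_MASK) == 0;
}

/*
 * Task `a` is the one really running. With stale_pointer set, the "enable"
 * is applied to a different task `b`, as if get_current() had returned the
 * wrong pointer between the two calls.
 */
static bool demo(bool stale_pointer)
{
    struct task a = { 0 }, b = { 0 };

    bh_disable(&a);
    bh_enable(stale_pointer ? &b : &a);

    return softirqs_allowed(&a);
}
```

In this model demo(false) leaves softirqs allowed for `a`, while demo(true) leaves them blocked until some other task with a clean count is scheduled, which matches the description of the stall ending at the next task switch.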

> Will try GCC 11.x and also compiled-in drivers.

Ok. Maybe make sure all drivers are built-in here. I see both the CAN
layer and the cppi41 driver use softirqs, so to be on the safe side,
try to get to a running kernel that has no modules loaded at all at
the time you expect the stall.


One thing that could possibly go wrong with get_current() would be that
it fails to get patched for some reason, or it gets patched only after
it was already called. Since you run on an ARMv7 CPU as opposed to
an actual OMAP2410/ARM1136r0, it would then try to load the
variable from the uninitialized TPIDRURO register. If that happens,
the one-liner below should tell you exactly where, by triggering an
Oops. You can apply the patch on top for testing, it should have no
other effects if the patching part works correctly.
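
As background for the one-liner below: the word 0xe7f001f2 that replaces the `mrc` is a permanently undefined A32 instruction (UDF), so executing it unpatched raises an undefined-instruction exception instead of silently reading TPIDRURO. A small sketch of how such a word can be recognised follows. The helper names `is_arm_udf` and `arm_udf_imm16` are made up for this example, and the bit layout is the A32 UDF encoding as I understand it from the ARM architecture manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* A32 UDF: cond=1110, bits 27:20 = 0111 1111, bits 7:4 = 1111. */
static bool is_arm_udf(uint32_t insn)
{
    return (insn & 0xfff000f0u) == 0xe7f000f0u;
}

/* The 16-bit immediate is split into imm12 (bits 19:8) and imm4 (bits 3:0). */
static uint16_t arm_udf_imm16(uint32_t insn)
{
    return (uint16_t)(((insn >> 4) & 0xfff0u) | (insn & 0xfu));
}
```

With these helpers, 0xe7f001f2 is classified as UDF with immediate 0x12, while 0xee1d0f70 (the encoding of "mrc p15, 0, r0, c13, c0, 3") is not.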

        Arnd

8<---

diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
index 2f9d79214b25..2a15832793c4 100644
--- a/arch/arm/include/asm/current.h
+++ b/arch/arm/include/asm/current.h
@@ -33,7 +33,7 @@ static inline __attribute_const__ struct task_struct
*get_current(void)
         */
        cur = __builtin_thread_pointer();
 #elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP)
-       asm("0: mrc p15, 0, %0, c13, c0, 3                      \n\t"
+       asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap
 #ifdef CONFIG_CPU_V6
            "1:                                                 \n\t"
            "   .subsection 1                                   \n\t"

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
@ 2022-06-03  9:32                                                                                                   ` Arnd Bergmann
  0 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-06-03  9:32 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Ard Biesheuvel, Arnd Bergmann, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, Jun 3, 2022 at 10:54 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@kernel.org> wrote:

> > > That confirms my statement that smp.c cannot be the culprit, and
> > > appears to exonerate the pure asm pieces. I wonder if this is related
> > > to insufficient asm constraints on the C helpers, or just the cost
> > > model taking different decisions because the inline asm string is much
> > > longer. In any case, this opens up a couple of avenues we could
> > > explore to narrow this down further.
> > >
> > > As a quick check, can you try the below snippet applied onto the
> > > broken current.h build?
> > >
> > > --- a/arch/arm/include/asm/current.h
> > > +++ b/arch/arm/include/asm/current.h
> > > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct
> > > task_struct *get_current(void)
> > >             "   b       . + (2b - 0b)                           \n\t"
> > >             "   .popsection                                     \n\t"
> > >  #endif
> > > -           : "=r"(cur));
> > > +           : "=r"(cur)
> > > +           : "Q" (*(const unsigned long *)current_stack_pointer));
> >
> > Where is the current_stack_pointer defined?
> >
> > >  #elif __LINUX_ARM_ARCH__>= 7 || \
> > >        !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
> > >        (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))
> > >
> > > Given that the problematic sequence appears to be in C code, could you
> > > please confirm whether or not the stall is reproducible when all the
> > > pieces that are used by the CAN stack (musb, slcan, ftdi_sio, etc)
> > > are built into the kernel rather than built as modules? Also, which
> > > GCC version are you using?
> >
> > For now, the CAN stack parts are built as modules. I'll try to compile them in.
> >
> > I'm using GCC 10.x
>
> I have tried your patch (see the attachment) and the system stalls.

This is with only get_current() patched on top of the working
f0191ea5c2e5 ("[PART 1] ARM: implement THREAD_INFO_IN_TASK for
uniprocessor systems"), right?

My best theory right now is that something in get_current() is wrong and
causes it to return the wrong task pointer, which in turn leads to
current->preempt_count getting out of sync. This may be related to the cppi41
dmaengine tasklet and effectively disables further softirqs, including the timer
that triggers the RCU grace period.

When we finally switch tasks to the cpufreq worker thread, softirqs
can happen again because of the task switch, and at the next IRQ
the timer detects the stall.
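
To make the theory above concrete, here is a rough userspace model in C. It
is only a sketch of the suspected failure mode, not kernel code and not a
confirmed diagnosis: the model_* names are made up for illustration, and the
offset and mask values are assumed to mirror the softirq field of
preempt_count. If the softirq count is raised on the real task but lowered
on whatever a bad get_current() returns, the real task keeps looking as if
it were still in softirq context, while a freshly scheduled task is
unaffected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the softirq field of preempt_count (bits 8..15). */
#define MODEL_SOFTIRQ_OFFSET 0x100u
#define MODEL_SOFTIRQ_MASK   0xff00u

/* Hypothetical stand-in for the per-task preempt counter. */
struct model_task {
	uint32_t preempt_count;
};

/* Entering softirq processing bumps the softirq field of "current". */
static void model_softirq_enter(struct model_task *cur)
{
	cur->preempt_count += MODEL_SOFTIRQ_OFFSET;
}

/* Leaving softirq processing drops it again, also on "current". */
static void model_softirq_exit(struct model_task *cur)
{
	cur->preempt_count -= MODEL_SOFTIRQ_OFFSET;
}

/* Non-zero softirq field: further softirqs are held off for this task. */
static bool model_in_softirq(const struct model_task *cur)
{
	return (cur->preempt_count & MODEL_SOFTIRQ_MASK) != 0;
}

/*
 * The suspected bug: the enter side sees the real task, the exit side
 * sees a bogus pointer, so the real task's count never returns to zero.
 */
static void model_mismatched_softirq(struct model_task *real,
				     struct model_task *bogus)
{
	model_softirq_enter(real);
	model_softirq_exit(bogus);
}
```

In this model the counter only becomes sane again once a different task,
with its own untouched count, is running, which would match the stall
ending at the switch to the cpufreq worker.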

> Will try GCC 11.x and also compiled-in drivers.

Ok. Maybe make sure all drivers are built-in here. I see both the CAN
layer and the cppi41 driver use softirqs, so to be on the safe side,
try to get to a running kernel that has no modules loaded at all at
the time you expect the stall.


One thing that could possibly go wrong with get_current() would be that
it fails to get patched for some reason, or it gets patched only after
it was already called. Since you run on an ARMv7 CPU as opposed to
an actual OMAP2410/ARM1136r0, it would then try to load the
variable from the uninitialized TPIDRURO register. If that happens,
the one-liner below should tell you exactly where, by triggering an
Oops. You can apply the patch on top for testing, it should have no
other effects if the patching part works correctly.
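
For anyone who wants to double-check the magic number in the one-liner, the
small C sketch below decodes it. It assumes the A32 UDF encoding
(cond 0111 1111 imm12 1111 imm4) and the standard A32 MRC encoding; the
a32_* helper names are hypothetical and exist only for this illustration.
Under those assumptions 0xe7f001f2 is a permanently undefined instruction
with immediate 0x12, and the instruction it replaces assembles to
0xee1d0f70 when the destination register is r0.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if insn matches the A32 UDF pattern: xxxx 0111 1111 ... 1111 xxxx. */
static bool a32_is_udf(uint32_t insn)
{
	return (insn & 0x0ff000f0u) == 0x07f000f0u;
}

/* Reassemble the 16-bit UDF immediate from imm12 (bits 19:8) and imm4 (bits 3:0). */
static uint16_t a32_udf_imm16(uint32_t insn)
{
	return (uint16_t)(((insn >> 4) & 0xfff0u) | (insn & 0xfu));
}

/* Encode "mrc p<cp>, <opc1>, r<rt>, c<crn>, c<crm>, <opc2>" with condition AL. */
static uint32_t a32_mrc_encode(unsigned cp, unsigned opc1, unsigned rt,
			       unsigned crn, unsigned crm, unsigned opc2)
{
	return 0xee100010u |
	       ((uint32_t)(opc1 & 0x7u) << 21) |
	       ((uint32_t)(crn & 0xfu) << 16) |
	       ((uint32_t)(rt & 0xfu) << 12) |
	       ((uint32_t)(cp & 0xfu) << 8) |
	       ((uint32_t)(opc2 & 0x7u) << 5) |
	       (uint32_t)(crm & 0xfu);
}
```

So if the mrc is never patched out in some call site, the replacement word
should trap there instead of silently reading TPIDRURO.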

        Arnd

8<---

diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
index 2f9d79214b25..2a15832793c4 100644
--- a/arch/arm/include/asm/current.h
+++ b/arch/arm/include/asm/current.h
@@ -33,7 +33,7 @@ static inline __attribute_const__ struct task_struct *get_current(void)
         */
        cur = __builtin_thread_pointer();
 #elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP)
-       asm("0: mrc p15, 0, %0, c13, c0, 3                      \n\t"
+       asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap
 #ifdef CONFIG_CPU_V6
            "1:                                                 \n\t"
            "   .subsection 1                                   \n\t"

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-03  9:32                                                                                                   ` Arnd Bergmann
@ 2022-06-03 19:11                                                                                                     ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-03 19:11 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ard Biesheuvel, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Fri, Jun 3, 2022 at 11:32 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Jun 3, 2022 at 10:54 AM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> > On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > > > That confirms my statement that smp.c cannot be the culprit, and
> > > > appears to exonerate the pure asm pieces. I wonder if this is related
> > > > to insufficient asm constraints on the C helpers, or just the cost
> > > > model taking different decisions because the inline asm string is much
> > > > longer. In any case, this opens up a couple of avenues we could
> > > > explore to narrow this down further.
> > > >
> > > > As a quick check, can you try the below snippet applied onto the
> > > > broken current.h build?
> > > >
> > > > --- a/arch/arm/include/asm/current.h
> > > > +++ b/arch/arm/include/asm/current.h
> > > > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct
> > > > task_struct *get_current(void)
> > > >             "   b       . + (2b - 0b)                           \n\t"
> > > >             "   .popsection                                     \n\t"
> > > >  #endif
> > > > -           : "=r"(cur));
> > > > +           : "=r"(cur)
> > > > +           : "Q" (*(const unsigned long *)current_stack_pointer));
> > >
> > > Where is the current_stack_pointer defined?
> > >
> > > >  #elif __LINUX_ARM_ARCH__>= 7 || \
> > > >        !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \
> > > >        (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS))
> > > >
> > > > Given that the problematic sequence appears to be in C code, could you
> > > > please confirm whether or not the stall is reproducible when all the
> > > > pieces that are used by the CAN stack (musb, slcan, ftdi_sio, etc.)
> > > > are built into the kernel rather than built as modules? Also, which
> > > > GCC version are you using?
> > >
> > > For now, the CAN stack parts are built as modules. I'll try to compile them in.
> > >
> > > I'm using GCC 10.x
> >
> > I have tried your patch (see the attachment) and the system stalls.
>
> This is with only get_current() patched on top of the working
> f0191ea5c2e5 ("[PART 1] ARM: implement THREAD_INFO_IN_TASK for
> uniprocessor systems"), right?
>
> My best theory right now is that something in get_current() is wrong and
> causes it to return the wrong task pointer, which in turn leads to
> current->preempt_count getting out of sync. This may be related to the cppi41
> dmaengine tasklet and effectively disables further softirqs, including the timer
> that triggers the RCU grace period.
>
> When we finally switch tasks to the cpufreq worker thread, softirqs
> can happen again because of the task switch, and at the next IRQ
> the timer detects the stall.
>
> > Will try GCC 11.x and also compiled-in drivers.
>
> Ok. Maybe make sure all drivers are built-in here. I see both the CAN
> layer and the cppi41 driver use softirqs, so to be on the safe side,
> try to get to a running kernel that has no modules loaded at all at
> the time you expect the stall.
>
>
> One thing that could possibly go wrong with get_current() would be that
> it fails to get patched for some reason, or it gets patched only after
> it was already called. Since you run on an ARMv7 CPU as opposed to
> an actual OMAP2410/ARM1136r0, it would then try to load the
> variable from the uninitialized TPIDRURO register. If that happens,
> the one-liner below should tell you exactly where, by triggering an
> Oops. You can apply the patch on top for testing, it should have no
> other effects if the patching part works correctly.
>
>         Arnd
>
> 8<---
>
> diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h
> index 2f9d79214b25..2a15832793c4 100644
> --- a/arch/arm/include/asm/current.h
> +++ b/arch/arm/include/asm/current.h
> @@ -33,7 +33,7 @@ static inline __attribute_const__ struct task_struct
> *get_current(void)
>          */
>         cur = __builtin_thread_pointer();
>  #elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP)
> -       asm("0: mrc p15, 0, %0, c13, c0, 3                      \n\t"
> +       asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap
>  #ifdef CONFIG_CPU_V6
>             "1:                                                 \n\t"
>             "   .subsection 1                                   \n\t"

With compiled-in drivers the system doesn't stall. All other tests and
related outputs will come next week.

Have a nice weekend.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-03 19:11                                                                                                     ` Yegor Yefremov
@ 2022-06-03 20:46                                                                                                       ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-06-03 20:46 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Arnd Bergmann, Ard Biesheuvel, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
>
> With compiled-in drivers the system doesn't stall. All other tests and
> related outputs will come next week.

Ah, nice!

It's probably a reasonable assumption that the smp-patched get_current()
is (at least sometimes) broken in modules but working in the kernel itself.
I suppose that means in the worst case we can hot-fix the issue by
having an 'extern' version of get_current() for the case of
armv6+smp+module ;-)

Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
mail. If that gives you an oops for the module case, then we know
that the patching doesn't work at all and you don't have to try anything
else. Otherwise it's more likely that an incorrect instruction sequence
is patched in.

        Arnd

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-03 20:46                                                                                                       ` Arnd Bergmann
@ 2022-06-05 14:59                                                                                                         ` Ard Biesheuvel
  -1 siblings, 0 replies; 115+ messages in thread
From: Ard Biesheuvel @ 2022-06-05 14:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yegor Yefremov, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov
> <yegorslists@googlemail.com> wrote:
> >
> > With compiled-in drivers the system doesn't stall. All other tests and
> > related outputs will come next week.
>
> Ah, nice!
>
> It's probably a reasonable assumption that the smp-patched get_current()
> is (at least sometimes) broken in modules but working in the kernel itself.
> I suppose that means in the worst case we can hot-fix the issue by
> having an 'extern' version of get_current() for the case of
> armv6+smp+module ;-)
>

I've coded something up along those lines, and pushed it to my
am335x-stall-test branch.

> Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> mail. If that gives you an oops for the module case, then we know
> that the patching doesn't work at all and you don't have to try anything
> else, otherwise it's more likely that an incorrect instruction sequence
> is patched in.
>

Yeah, I'd be really surprised if the patching misses some occurrences,
so I have no clue what is going on here.

Yegor, can you please try my branch with the original config (i.e.,
slcan and ftdi_sio as modules)?

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-05 14:59                                                                                                         ` Ard Biesheuvel
@ 2022-06-07  8:55                                                                                                           ` Yegor Yefremov
  -1 siblings, 0 replies; 115+ messages in thread
From: Yegor Yefremov @ 2022-06-07  8:55 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Arnd Bergmann, Tony Lindgren, Linux-OMAP, linux-clk,
	Stephen Boyd, Linux ARM

On Sun, Jun 5, 2022 at 4:59 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov
> > <yegorslists@googlemail.com> wrote:
> > >
> > > With compiled-in drivers the system doesn't stall. All other tests and
> > > related outputs will come next week.
> >
> > Ah, nice!
> >
> > It's probably a reasonable assumption that the smp-patched get_current()
> > is (at least sometimes) broken in modules but working in the kernel itself.
> > I suppose that means in the worst case we can hot-fix the issue by
> > having an 'extern' version of get_current() for the case of
> > armv6+smp+module ;-)
> >
>
> I've coded something up along those lines, and pushed it to my
> am335x-stall-test branch.
>
> > Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> > mail. If that gives you an oops for the module case, then we know
> > that the patching doesn't work at all and you don't have to try anything
> > else, otherwise it's more likely that an incorrect instruction sequence
> > is patched in.
> >
>
> Yeah, I'd be really surprised if the patching misses some occurrences,
> so I have no clue what is going on here.
>
> Yegor, can you please try my branch with the original config (i.e.,
> slcan and ftdi_sio as modules)?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test

@Arnd: I have applied your patch with this change:

asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap

But it revealed nothing new:

[   50.754130] rcu: INFO: rcu_sched self-detected stall on CPU
[   50.760834] rcu:     0-...!: (2600 ticks this GP) idle=ec9/1/0x40000004 softirq=1852/1852 fqs=0
[   50.770407]  (t=2600 jiffies g=2577 q=17)
[   50.775046] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   50.786961] rcu:     Possible timer handling issue on cpu=0 timer-softirq=872
[   50.794429] rcu: rcu_sched kthread starved for 2600 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   50.805403] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   50.814927] rcu: RCU grace-period kthread stack dump:
[   50.820464] task:rcu_sched       state:I stack:    0 pid:   10 ppid:     2 flags:0x00000000
[   50.830019] [<c0b683d4>] (__schedule) from [<c0b68d18>] (schedule+0x54/0xe8)
[   50.838470] [<c0b68d18>] (schedule) from [<c0b6f51c>] (schedule_timeout+0xa8/0x210)
[   50.847208] [<c0b6f51c>] (schedule_timeout) from [<c01d85b4>] (rcu_gp_fqs_loop+0x118/0x6b4)
[   50.856631] [<c01d85b4>] (rcu_gp_fqs_loop) from [<c01dc4e4>] (rcu_gp_kthread+0x138/0x30c)
[   50.865832] [<c01dc4e4>] (rcu_gp_kthread) from [<c0164df8>] (kthread+0x13c/0x164)
[   50.874315] [<c0164df8>] (kthread) from [<c0100140>] (ret_from_fork+0x14/0x34)
[   50.882477] rcu: Stack dump where RCU GP kthread last ran:
[   50.888512] NMI backtrace for cpu 0
[   50.892575] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[   50.899912] Hardware name: Generic AM33XX (Flattened Device Tree)
[   50.906610] Workqueue: events dbs_work_handler
[   50.912202] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[   50.921035] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)
[   50.928943] NMI backtrace for cpu 0
[   50.933084] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[   50.940419] Hardware name: Generic AM33XX (Flattened Device Tree)
[   50.947083] Workqueue: events dbs_work_handler
[   50.952574] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[   50.961334] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)

@Ard: I have tried your branch
(21b6671c82d4df52ea0c7837705331acb375c5c8). The system still stalls.

Yegor

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: am335x: 5.18.x: system stalling
  2022-06-07  8:55                                                                                                           ` Yegor Yefremov
@ 2022-08-12  7:35                                                                                                             ` Arnd Bergmann
  -1 siblings, 0 replies; 115+ messages in thread
From: Arnd Bergmann @ 2022-08-12  7:35 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: Ard Biesheuvel, Arnd Bergmann, Tony Lindgren, Linux-OMAP,
	linux-clk, Stephen Boyd, Linux ARM

On Tue, Jun 7, 2022 at 10:55 AM Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> On Sun, Jun 5, 2022 at 4:59 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov <yegorslists@googlemail.com> wrote:
> > > >
> > > > With compiled-in drivers the system doesn't stall. All other tests and
> > > > related outputs will come next week.
> > >
> > > Ah, nice!
> > >
> > > It's probably a reasonable assumption that the smp-patched get_current()
> > > is (at least sometimes) broken in modules but working in the kernel itself.
> > > I suppose that means in the worst case we can hot-fix the issue by
> > > having an 'extern' version of get_current() for the case of
> > > armv6+smp+module ;-)
> > >
> >
> > I've coded something up along those lines, and pushed it to my
> > am335x-stall-test branch.
> >
> > > Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> > > mail. If that gives you an oops for the module case, then we know
> > > that the patching doesn't work at all and you don't have to try anything
> > > else, otherwise it's more likely that an incorrect instruction sequence
> > > is patched in.
> > >
> >
> > Yeah, I'd be really surprised if the patching misses some occurrences,
> > so I have no clue what is going on here.
> >
> > Yegor, can you please try my branch with the original config (i.e.,
> > slcan and ftdi_sio as modules)?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test
>
> @Arnd: I have applied your patch with this change:
>
> asm("0: .long 0xe7f001f2                        \n\t" // BUG() trap
>
> But it revealed nothing new:
>
> [   50.754130] rcu: INFO: rcu_sched self-detected stall on CPU
>
> @Ard: I have tried your branch
> (21b6671c82d4df52ea0c7837705331acb375c5c8). The system still stalls.

Getting back to this old thread, as we never found out what is
actually going on.

It seems we are still stuck trying to figure out why a kernel with ARMv6
support and SMP patching is broken, or if the same bug might also affect
other configurations without ARMv6 support. This is of course very
unfortunate, but unless someone has an idea for how to debug the problem
further, I suppose we should at least prevent that broken configuration and
disallow enabling CONFIG_SMP in combination with ARMv6 (pre-ARMv6K)
CPUs, to keep others from running into the same problem.

Any other suggestions?

        Arnd
