xenomai.lists.linux.dev archive mirror
* EVL Kernel Debugging
@ 2023-04-26 19:52 Russell Johnson
  2023-04-27  7:58 ` Philippe Gerum
From: Russell Johnson @ 2023-04-26 19:52 UTC (permalink / raw)
  To: xenomai; +Cc: Dave Rolenc


Has there been any successful use of kdb or kgdb with the evl kernel?

We are currently using 5.15.98evl-g1541335eef8b, and have not had much luck
getting kdb or kgdb to work. We see the start of a kdb session, but the
serial port eventually hangs.

We are connecting the unit under test (running evl kernel) over a serial
port to a secondary machine. 

I think we have all the necessary settings in the kernel config for GDB/KDB:

[root@localhost boot]# cat config-5.15.98evl-g1541335eef8b-dirty|grep GDB
CONFIG_CFG80211_REQUIRE_SIGNED_REGDB=y
CONFIG_CFG80211_USE_KERNEL_REGDB_KEYS=y
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_GDB_SCRIPTS is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_HONOUR_BLOCKLIST=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y

Our command line is as follows:
BOOT_IMAGE=/vmlinuz-5.15.98evl-g1541335eef8b-dirty
root=UUID=8748ad87-3ef2-48fe-8d3d-fb2ef72a8f13 ro crashkernel=auto fips=1
kgdboc=ttyS0,115200

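As a sanity check (this uses kgdboc's standard sysfs module parameter, so it
should apply as-is), the setting can be verified and even changed at runtime
on the target:

# should echo back ttyS0,115200 if the boot parameter was accepted
cat /sys/module/kgdboc/parameters/kgdboc
# kgdboc can also be reconfigured without a reboot
echo ttyS0,115200 > /sys/module/kgdboc/parameters/kgdboc
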
On the secondary machine, we connect with minicom or screen over the serial
port. 

The first issue is that magic sysrq over serial (ctrl-a f g with minicom,
for example) doesn't work, even with the proper mask written to
/proc/sys/kernel/sysrq (we tried "1", which should enable all magic-sysrq
features). Doing echo g > /proc/sysrq-trigger from the evl system does
seem to work, but that isn't ideal: we'd rather break in from the secondary
system when the target is hung. We think we have the correct kernel config
for magic sysrq over serial:

[root@localhost boot]# cat config-5.15.98evl-g1541335eef8b-dirty|grep
MAGIC_SYS
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
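
As a cross-check of the mask and the break sequence (standard sysrq knobs;
with screen, the break is sent with ctrl-a b rather than minicom's
ctrl-a f):

# on the target: verify the mask actually took; 1 enables all functions
cat /proc/sys/kernel/sysrq
echo 1 > /proc/sys/kernel/sysrq
# local fallback that does work for us:
echo g > /proc/sysrq-trigger
# from the secondary machine with screen: ctrl-a b sends the break,
# then type g shortly afterwards to request kdb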

After the magic sysrq g is issued, the serial connection shows some kdb
output, but it is not stable, usually hanging but sometimes giving a kdb
prompt once or twice. One time we were able to issue the "kgdb" command
within kdb and attempt to connect via gdb, but after "target remote
/dev/ttyS0" within gdb, the gdb process just hung.
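
For reference, here is the gdb side we are attempting (a minimal sketch;
note that minicom/screen must release /dev/ttyS0 before gdb attaches, since
two readers on the port will corrupt the kgdb protocol):

# on the secondary machine, using the vmlinux matching the running kernel
stty -F /dev/ttyS0 115200
gdb ./vmlinux \
    -ex 'set serial baud 115200' \
    -ex 'target remote /dev/ttyS0'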

Do you have any suggestions on debugging a hard hang in the evl environment?
We get a CPU STUCK when restarting an evl-enabled app multiple times, and
one way to get more insight into this problem is with a kernel debugger.
With the kernel debugger not working, it seems difficult to get any
kernel-level insight. 




* Re: EVL Kernel Debugging
  2023-04-26 19:52 EVL Kernel Debugging Russell Johnson
@ 2023-04-27  7:58 ` Philippe Gerum
  2023-04-28 20:47   ` Dave Rolenc
From: Philippe Gerum @ 2023-04-27  7:58 UTC (permalink / raw)
  To: Russell Johnson; +Cc: xenomai, Dave Rolenc


Russell Johnson <russell.johnson@kratosdefense.com> writes:

> Has there been any successful use of kdb or kgdb with the evl kernel?
>
> We are currently using 5.15.98evl-g1541335eef8b, and have not had much luck
> getting kdb or kgdb to work. We see the start of a kdb session, but the
> serial port eventually hangs.
>
> We are connecting the unit under test (running evl kernel) over a serial
> port to a secondary machine. 
>
> I think we have all the necessary settings in the kernel config for GDB/KDB:
>
> [root@localhost boot]# cat config-5.15.98evl-g1541335eef8b-dirty|grep GDB
> CONFIG_CFG80211_REQUIRE_SIGNED_REGDB=y
> CONFIG_CFG80211_USE_KERNEL_REGDB_KEYS=y
> # CONFIG_SERIAL_KGDB_NMI is not set
> # CONFIG_GDB_SCRIPTS is not set
> CONFIG_HAVE_ARCH_KGDB=y
> CONFIG_KGDB=y
> CONFIG_KGDB_HONOUR_BLOCKLIST=y
> CONFIG_KGDB_SERIAL_CONSOLE=y
> CONFIG_KGDB_TESTS=y
> # CONFIG_KGDB_TESTS_ON_BOOT is not set
> CONFIG_KGDB_LOW_LEVEL_TRAP=y
> CONFIG_KGDB_KDB=y
>
> Our command line is as follows:
> BOOT_IMAGE=/vmlinuz-5.15.98evl-g1541335eef8b-dirty
> root=UUID=8748ad87-3ef2-48fe-8d3d-fb2ef72a8f13 ro crashkernel=auto fips=1
> kgdboc=ttyS0,115200
>
> On the secondary machine, we connect with minicom or screen over the serial
> port. 
>
> The first issue is that magic sysrq over serial (ctrl-a f g with minicom,
> for example) doesn't work, even with the proper mask written to
> /proc/sys/kernel/sysrq (we tried "1", which should enable all magic-sysrq
> features). Doing echo g > /proc/sysrq-trigger from the evl system does
> seem to work, but that isn't ideal: we'd rather break in from the secondary
> system when the target is hung. We think we have the correct kernel config
> for magic sysrq over serial:
>
> [root@localhost boot]# cat config-5.15.98evl-g1541335eef8b-dirty|grep
> MAGIC_SYS
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
> CONFIG_MAGIC_SYSRQ_SERIAL=y
> CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
>
> After the magic sysrq g is issued, the serial connection shows some kdb
> output, but it is not stable, usually hanging but sometimes giving a kdb
> prompt once or twice. One time we were able to issue the "kgdb" command
> within kdb and attempt to connect via gdb, but after "target remote
> /dev/ttyS0" within gdb, the gdb process just hung.
>

This is more of a Dovetail issue than an EVL one. I don't use kernel
debuggers, so I must admit that Dovetail + KGDB support did not get much
attention and certainly no testing from my side.

> Do you have any suggestions on debugging a hard hang in the evl environment?
> We get a CPU STUCK when restarting an evl-enabled app multiple times, and
> one way to get more insight into this problem is with a kernel debugger.
> With the kernel debugger not working, it seems difficult to get any
> kernel-level insight. 
>

With x86, you could try passing nmi_watchdog=1 via the kernel cmdline to
enable the APIC watchdog on the CPUs, _only for the purpose of
debugging_ because this is likely going to make the latency figures
skyrocket (setting nmi_watchdog=0 is a common recommendation on x86 for
a real-time configuration). But if the application logic can bear with
degraded response time, with luck you might get a kernel backtrace
exposing the culprit.
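
The same switch is also reachable at runtime through the standard procfs
knobs, which makes for a quicker experiment than rebooting:

# 1 = NMI watchdog enabled, 0 = disabled
cat /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
# hard-lockup threshold in seconds before a backtrace is emitted
cat /proc/sys/kernel/watchdog_thresh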

-- 
Philippe.


* Re: EVL Kernel Debugging
  2023-04-27  7:58 ` Philippe Gerum
@ 2023-04-28 20:47   ` Dave Rolenc
  2023-05-06 10:56     ` Philippe Gerum
From: Dave Rolenc @ 2023-04-28 20:47 UTC (permalink / raw)
  To: xenomai; +Cc: Russell Johnson

>> We get a CPU STUCK when restarting an evl-enabled app multiple times, 
>> and one way to get more insight into this problem is with a kernel debugger.
>> With the kernel debugger not working, it seems difficult to get any 
>> kernel-level insight.

> With x86, you could try passing nmi_watchdog=1 via the kernel cmdline
> to enable the APIC watchdog on the CPUs, _only for the purpose of
> debugging_ because this is likely going to make the latency figures
> skyrocket (setting nmi_watchdog=0 is a common recommendation on x86
> for a real-time configuration). But if the application logic can bear
> with degraded response time, with luck you might get a kernel
> backtrace exposing the culprit.

With this approach, we did end up with some stack traces. They mostly look like this:

sync_current_irq_stage (kernel/irq/pipeline.c:922 kernel/irq/pipeline.c:1288)
__inband_irq_enable (arch/x86/include/asm/irqflags.h:41 arch/x86/include/asm/irqflags.h:91 kernel/irq/pipeline.c:287)
inband_irq_enable (kernel/irq/pipeline.c:317 (discriminator 9))
_raw_spin_unlock_irq (kernel/locking/spinlock.c:203)
rwsem_down_write_slowpath (arch/x86/include/asm/current.h:15 (discriminator 1) kernel/locking/rwsem.c:1136 (discriminator 1))
down_write (kernel/locking/rwsem.c:1535)
kernfs_activate (fs/kernfs/dir.c:1302)
kernfs_add_one (fs/kernfs/dir.c:774)
kernfs_create_dir_ns (fs/kernfs/dir.c:1001)
sysfs_create_dir_ns (fs/sysfs/dir.c:62)
kobject_add_internal (lib/kobject.c:89 (discriminator 11) lib/kobject.c:255 (discriminator 11))
kobject_add (lib/kobject.c:390 lib/kobject.c:442)
? _raw_spin_unlock (kernel/locking/spinlock.c:187)
device_add (drivers/base/core.c:3329)
? __init_waitqueue_head (kernel/sched/wait.c:13)
device_register (drivers/base/core.c:3476)
create_sys_device (kernel/evl/factory.c:312)
create_element_device (kernel/evl/factory.c:439)
ioctl_clone_device (kernel/evl/factory.c:559)
__x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:89)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:118)

I have to dig a little deeper into the origin of the ioctl from
userspace. The top of the trace seems to vary a little bit above
the inband_irq_enable. For example, here is another trace from
the stuck CPU where the sync_current_irq_stage call is missing:

__inband_irq_enable (arch/x86/include/asm/irqflags.h:41 arch/x86/include/asm/irqflags.h:91 kernel/irq/pipeline.c:287)
inband_irq_enable (kernel/irq/pipeline.c:317 (discriminator 9))
_raw_spin_unlock_irq (kernel/locking/spinlock.c:203)
rwsem_down_write_slowpath (arch/x86/include/asm/current.h:15 (discriminator 1) kernel/locking/rwsem.c:1136 (discriminator 1))
down_write (kernel/locking/rwsem.c:1535)
kernfs_activate (fs/kernfs/dir.c:1302)
kernfs_add_one (fs/kernfs/dir.c:774)
kernfs_create_dir_ns (fs/kernfs/dir.c:1001)
sysfs_create_dir_ns (fs/sysfs/dir.c:62)
kobject_add_internal (lib/kobject.c:89 (discriminator 11) lib/kobject.c:255 (discriminator 11))
kobject_add (lib/kobject.c:390 lib/kobject.c:442)
? _raw_spin_unlock (kernel/locking/spinlock.c:187)
device_add (drivers/base/core.c:3329)
? __init_waitqueue_head (kernel/sched/wait.c:13)
device_register (drivers/base/core.c:3476)
create_sys_device (kernel/evl/factory.c:312)
create_element_device (kernel/evl/factory.c:439)
ioctl_clone_device (kernel/evl/factory.c:559)
__x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:89)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:118)

Any thoughts on what may be causing this?

--
David Rolenc
Principal Engineer 
Kratos Defense & Security Solutions, Inc.

* Re: EVL Kernel Debugging
  2023-04-28 20:47   ` Dave Rolenc
@ 2023-05-06 10:56     ` Philippe Gerum
From: Philippe Gerum @ 2023-05-06 10:56 UTC (permalink / raw)
  To: Dave Rolenc; +Cc: xenomai, Russell Johnson


Dave Rolenc <Dave.Rolenc@kratosdefense.com> writes:

>>> We get a CPU STUCK when restarting an evl-enabled app multiple times, 
>>> and one way to get more insight into this problem is with a kernel debugger.
>>> With the kernel debugger not working, it seems difficult to get any 
>>> kernel-level insight.
>
>> With x86, you could try passing nmi_watchdog=1 via the kernel cmdline
>> to enable the APIC watchdog on the CPUs, _only for the purpose of
>> debugging_ because this is likely going to make the latency figures
>> skyrocket (setting nmi_watchdog=0 is a common recommendation on x86
>> for a real-time configuration). But if the application logic can bear
>> with degraded response time, with luck you might get a kernel
>> backtrace exposing the culprit.
>
> With this approach, we did end up with some stack traces. They mostly look like this:
>
> sync_current_irq_stage (kernel/irq/pipeline.c:922 kernel/irq/pipeline.c:1288)
> __inband_irq_enable (arch/x86/include/asm/irqflags.h:41 arch/x86/include/asm/irqflags.h:91 kernel/irq/pipeline.c:287)
> inband_irq_enable (kernel/irq/pipeline.c:317 (discriminator 9))
> _raw_spin_unlock_irq (kernel/locking/spinlock.c:203)
> rwsem_down_write_slowpath (arch/x86/include/asm/current.h:15 (discriminator 1) kernel/locking/rwsem.c:1136 (discriminator 1))
> down_write (kernel/locking/rwsem.c:1535)

For some reason, this rwsem does not seem to be released in time, causing
the hang. This happens in the kernfs internals, but a reasonable assumption
is that EVL might be causing it, maybe due to some EVL callback invoked by
kernfs while holding this lock, stalling unexpectedly. This bug might be
more likely when EVL elements are created at a high rate, e.g. when an app
starts up and quickly instantiates a truckload of EVL resources in a row,
as if some race occurred. Anyway, the plan is to reproduce first, then find
out whether this scenario happens.
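
Meanwhile, the in-kernel hung task detector could help pinpoint which tasks
stay blocked on that rwsem, with backtraces; a sketch, assuming
CONFIG_DETECT_HUNG_TASK=y (and CONFIG_LOCKDEP=y for the last one) in your
config:

# tasks stuck in D state longer than this many seconds get reported
cat /proc/sys/kernel/hung_task_timeout_secs
# optionally panic on a hung task, e.g. to capture a kdump image
echo 1 > /proc/sys/kernel/hung_task_panic
# lockdep statistics, if lock debugging is enabled
cat /proc/lockdep_stats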

> kernfs_activate (fs/kernfs/dir.c:1302)
> kernfs_add_one (fs/kernfs/dir.c:774)
> kernfs_create_dir_ns (fs/kernfs/dir.c:1001)
> sysfs_create_dir_ns (fs/sysfs/dir.c:62)
> kobject_add_internal (lib/kobject.c:89 (discriminator 11) lib/kobject.c:255 (discriminator 11))
> kobject_add (lib/kobject.c:390 lib/kobject.c:442)
> ? _raw_spin_unlock (kernel/locking/spinlock.c:187)
> device_add (drivers/base/core.c:3329)
> ? __init_waitqueue_head (kernel/sched/wait.c:13)
> device_register (drivers/base/core.c:3476)
> create_sys_device (kernel/evl/factory.c:312)
> create_element_device (kernel/evl/factory.c:439)
> ioctl_clone_device (kernel/evl/factory.c:559)
> __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:89)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:118)
>
> I have to dig a little deeper into the origin of the ioctl from
> userspace.

The origin is evl_create_element() in libevl. This happens every time a
new EVL element is created, such as monitors (backing sema4, events, flags
and mutexes), threads, proxies, and so on. ioctl(EVL_IOC_CLONE) is the
source.
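
If you want to watch these requests from userspace, something along these
lines should reveal them at application start-up (the application name is a
placeholder, and /dev/evl is the usual location of the EVL device
hierarchy):

# the factories and their "clone" devices live under /dev/evl
ls /dev/evl/
strace -f -e trace=openat,ioctl ./your-evl-app 2>&1 | grep -i evl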

> The top of the trace seems to vary a little bit above
> the inband_irq_enable. For example, here is another trace from
> the stuck CPU where the sync_current_irq_stage call is missing:
>
> __inband_irq_enable (arch/x86/include/asm/irqflags.h:41 arch/x86/include/asm/irqflags.h:91 kernel/irq/pipeline.c:287)
> inband_irq_enable (kernel/irq/pipeline.c:317 (discriminator 9))
> _raw_spin_unlock_irq (kernel/locking/spinlock.c:203)
> rwsem_down_write_slowpath (arch/x86/include/asm/current.h:15 (discriminator 1) kernel/locking/rwsem.c:1136 (discriminator 1))
> down_write (kernel/locking/rwsem.c:1535)
> kernfs_activate (fs/kernfs/dir.c:1302)
> kernfs_add_one (fs/kernfs/dir.c:774)
> kernfs_create_dir_ns (fs/kernfs/dir.c:1001)
> sysfs_create_dir_ns (fs/sysfs/dir.c:62)
> kobject_add_internal (lib/kobject.c:89 (discriminator 11) lib/kobject.c:255 (discriminator 11))
> kobject_add (lib/kobject.c:390 lib/kobject.c:442)
> ? _raw_spin_unlock (kernel/locking/spinlock.c:187)
> device_add (drivers/base/core.c:3329)
> ? __init_waitqueue_head (kernel/sched/wait.c:13)
> device_register (drivers/base/core.c:3476)
> create_sys_device (kernel/evl/factory.c:312)
> create_element_device (kernel/evl/factory.c:439)
> ioctl_clone_device (kernel/evl/factory.c:559)
> __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:89)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:118)
>
> Any thoughts on what may be causing this?

A kernel device is mated to each EVL element; this is what gives us a
sysfs representation for each of them (e.g. evl-ps reads those [thread]
device files to figure out what is running in the system).
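
From the shell, this looks like the following (exact sysfs paths may vary
across EVL releases, so treat this as a sketch):

# the evl front-end runs evl-ps underneath
evl ps
# per-element attributes are exported via sysfs, e.g. for threads:
ls /sys/devices/virtual/thread/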

-- 
Philippe.

