xenomai.lists.linux.dev archive mirror
From: Philippe Gerum <rpm@xenomai.org>
To: Dave Rolenc <Dave.Rolenc@kratosdefense.com>
Cc: "xenomai@lists.linux.dev" <xenomai@lists.linux.dev>,
	Russell Johnson <russell.johnson@kratosdefense.com>
Subject: Re: EVL Kernel Debugging
Date: Sat, 06 May 2023 12:56:53 +0200
Message-ID: <87o7mxpv5q.fsf@xenomai.org>
In-Reply-To: <PH1P110MB1666BA1533E0614496E7B678FE6B9@PH1P110MB1666.NAMP110.PROD.OUTLOOK.COM>


Dave Rolenc <Dave.Rolenc@kratosdefense.com> writes:

>>> We get a CPU STUCK when restarting an evl-enabled app multiple times, 
>>> and one way to get more insight into this problem is with a kernel debugger.
>>> With the kernel debugger not working, it seems difficult to get any 
>>> kernel-level insight.
>
>> With x86, you could try passing nmi_watchdog=1 via the kernel cmdline
>> to enable the APIC watchdog on the CPUs, _only for the purpose of
>> debugging_ because this is likely going to make the latency figures
>> skyrocket (setting nmi_watchdog=0 is a common recommendation on x86
>> for a real-time configuration). But if the application logic can bear
>> with degraded response time, with luck you might get a kernel
>> backtrace exposing the culprit.
>
> With this approach, we did end up with some stack traces. They mostly look like this:
>
> sync_current_irq_stage (kernel/irq/pipeline.c:922 kernel/irq/pipeline.c:1288)
> __inband_irq_enable (arch/x86/include/asm/irqflags.h:41 arch/x86/include/asm/irqflags.h:91 kernel/irq/pipeline.c:287)
> inband_irq_enable (kernel/irq/pipeline.c:317 (discriminator 9))
> _raw_spin_unlock_irq (kernel/locking/spinlock.c:203)
> rwsem_down_write_slowpath (arch/x86/include/asm/current.h:15 (discriminator 1) kernel/locking/rwsem.c:1136 (discriminator 1))
> down_write (kernel/locking/rwsem.c:1535)

For some reason, the rwsem above does not seem to be released in time,
causing the hang. This happens inside the kernfs internals, but a
reasonable assumption is that EVL might be causing this, maybe due to
some EVL callback which kernfs invokes while holding this lock, and
which would stall unexpectedly. This bug might be more likely if EVL
elements are created at a high rate, such as when an app starts and
quickly instantiates a truckload of EVL resources in a row, which would
point at a race. Anyway, the plan is to reproduce first, then find out
whether this scenario actually happens.
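
As a starting point for the reproduction step, here is a minimal sketch
of a restart loop in Python. It only assumes that the hang shows up as a
run which never completes; the function name, the run count and the
timeout are hypothetical, nothing here comes from libevl or from the
application.

```python
import subprocess

def restart_loop(cmd, runs, timeout_s):
    """Run cmd `runs` times in a row; return (completed, hung).

    A run which does not exit within timeout_s seconds is killed and
    counted as hung. Exit codes are ignored on purpose: only runs which
    never complete are of interest here.
    """
    completed = hung = 0
    for _ in range(runs):
        try:
            subprocess.run(cmd, timeout=timeout_s, check=False,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
            completed += 1
        except subprocess.TimeoutExpired:
            hung += 1
    return completed, hung

# Hypothetical usage, restarting the app 100 times with a 10s budget:
#   restart_loop(["/path/to/evl-app"], runs=100, timeout_s=10.0)
```

Mind that a task spinning in kernel space may not react to the kill
signal, in which case the harness itself blocks on that run. The
watchdog backtrace remains the real evidence; the loop only automates
the restarts.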

> kernfs_activate (fs/kernfs/dir.c:1302)
> kernfs_add_one (fs/kernfs/dir.c:774)
> kernfs_create_dir_ns (fs/kernfs/dir.c:1001)
> sysfs_create_dir_ns (fs/sysfs/dir.c:62)
> kobject_add_internal (lib/kobject.c:89 (discriminator 11) lib/kobject.c:255 (discriminator 11))
> kobject_add (lib/kobject.c:390 lib/kobject.c:442)
> ? _raw_spin_unlock (kernel/locking/spinlock.c:187)
> device_add (drivers/base/core.c:3329)
> ? __init_waitqueue_head (kernel/sched/wait.c:13)
> device_register (drivers/base/core.c:3476)
> create_sys_device (kernel/evl/factory.c:312)
> create_element_device (kernel/evl/factory.c:439)
> ioctl_clone_device (kernel/evl/factory.c:559)
>  __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:89)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:118)
>
> I have to dig a little deeper about the origin of the ioctl from
> userspace.

The origin is evl_create_element(), in libevl. This happens every time a
new EVL element is created, such as a monitor (for
semaphores/events/flags/mutexes), a thread, a proxy, and so on. The
ioctl(EVL_IOC_CLONE) request is the source.
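
To spot where EVL sits in such a decoded backtrace, one could filter the
frames by source path. The sketch below is a hypothetical helper, not
part of any EVL tool; it assumes the "function (file:line ...)" layout
seen in the traces quoted in this thread, with an optional "> " quote
prefix and a "? " marker for unreliable frames.

```python
import re

# One frame per line: optional quote prefix, optional "? " marker,
# the function name, then the location list in parentheses.
FRAME_RE = re.compile(r"^\s*(?:>\s*)?(\?\s+)?(\w+)\s+\((.*)\)\s*$")
PATH_RE = re.compile(r"([\w./-]+):\d+")

def parse_frames(text):
    """Return a list of (function, [source paths], unreliable) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue  # not a backtrace frame
        unreliable, func, locations = m.groups()
        frames.append((func, PATH_RE.findall(locations), bool(unreliable)))
    return frames

def evl_frames(frames):
    """Names of the reliable frames located under kernel/evl/."""
    return [func for func, paths, unreliable in frames
            if not unreliable
            and any(p.startswith("kernel/evl/") for p in paths)]
```

Applied to the traces in this thread, the EVL frames would be
create_sys_device, create_element_device and ioctl_clone_device, the
last one being the kernel entry point reached from the ioctl.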

> The top of the trace seems to vary a little bit above
> the inband_irq_enable. For example, here is another trace from
> the stuck CPU where the sync_current_irq_stage call is missing:
>
> __inband_irq_enable (arch/x86/include/asm/irqflags.h:41 arch/x86/include/asm/irqflags.h:91 kernel/irq/pipeline.c:287)
> inband_irq_enable (kernel/irq/pipeline.c:317 (discriminator 9))
> _raw_spin_unlock_irq (kernel/locking/spinlock.c:203)
> rwsem_down_write_slowpath (arch/x86/include/asm/current.h:15 (discriminator 1) kernel/locking/rwsem.c:1136 (discriminator 1))
> down_write (kernel/locking/rwsem.c:1535)
> kernfs_activate (fs/kernfs/dir.c:1302)
> kernfs_add_one (fs/kernfs/dir.c:774)
> kernfs_create_dir_ns (fs/kernfs/dir.c:1001)
> sysfs_create_dir_ns (fs/sysfs/dir.c:62)
> kobject_add_internal (lib/kobject.c:89 (discriminator 11) lib/kobject.c:255 (discriminator 11))
> kobject_add (lib/kobject.c:390 lib/kobject.c:442)
> ? _raw_spin_unlock (kernel/locking/spinlock.c:187)
> device_add (drivers/base/core.c:3329)
> ? __init_waitqueue_head (kernel/sched/wait.c:13)
> device_register (drivers/base/core.c:3476)
> create_sys_device (kernel/evl/factory.c:312)
> create_element_device (kernel/evl/factory.c:439)
> ioctl_clone_device (kernel/evl/factory.c:559)
> __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:89)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:118)
>
> Any thoughts on what may be causing this?

A kernel device is mated to each EVL element; this is what gives us a
sysfs representation for each of them (e.g. evl-ps reads those [thread]
device files to figure out what is running in the system). Registering
that device is the device_register() -> kernfs path visible in your
traces.
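
Since each element is backed by a device, listing the device
directories before and after an application restart may tell whether
devices pile up. A minimal sketch follows; the sysfs root is passed in
by the caller, and the /sys/devices/virtual/thread path shown in the
usage comment is an assumption to check against your kernel, not
something stated in this thread.

```python
import os

def list_element_devices(sysfs_root):
    """Return the sorted names of the device directories under sysfs_root.

    Returns an empty list if the root does not exist, e.g. when no
    element of that kind was ever created.
    """
    try:
        with os.scandir(sysfs_root) as entries:
            return sorted(e.name for e in entries if e.is_dir())
    except FileNotFoundError:
        return []

# Hypothetical usage (path is an assumption, adjust as needed):
#   before = list_element_devices("/sys/devices/virtual/thread")
#   ... restart the application ...
#   after = list_element_devices("/sys/devices/virtual/thread")
#   leftovers = sorted(set(after) - set(before))
```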

-- 
Philippe.

Thread overview: 4+ messages
2023-04-26 19:52 EVL Kernel Debugging Russell Johnson
2023-04-27  7:58 ` Philippe Gerum
2023-04-28 20:47   ` Dave Rolenc
2023-05-06 10:56     ` Philippe Gerum [this message]
