* x86 kernel Oops in Xeno-3.1/3.2
@ 2022-01-03  7:29 C Smith
  2022-01-03  7:38 ` Jan Kiszka
  0 siblings, 1 reply; 8+ messages in thread
From: C Smith @ 2022-01-03  7:29 UTC (permalink / raw)
  To: Xenomai List, Jan Kiszka, Philippe Gerum

I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1).
In numerous tests I can't keep a machine running for more than a day
before it hard-locks (no keyboard/mouse/ping); frequently the kernel
Oopses within 4-6 hours. I have tried two identical motherboards,
changed RAM, and tried another manufacturer's motherboard in a third
computer.

* Can someone supply me with a known-good x86 kernel 4.19.89 config
so I can compare and try those settings? I will attach my kernel
config to this email, in the hope that someone can spot something
wrong with it.

Specs: Intel i5-4590 CPU, Advantech motherboard with Intel Q87
chipset, 8 GB RAM, Moxa 4-port PCI card with 16750 UARTs, two
motherboard 16550 UARTs (in ISA memory range), Peak PCI CAN card,
Xenomai 3.1 (also Xenomai 3.2.1). Distro: RHEL 8, with an
ipipe-patched 4.19.89 kernel built from kernel.org sources.

Sometimes onscreen (in a text terminal) I get this Oops:

kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
BUG: unable to handle kernel paging request at ...
PGD ... P4D ... PUD ... PMD ...
Oops: 0011 [#1] SMP PTI
CPU: 1 PID: 3539 Comm: gui Tainted: G OE 4.19.89xeno3.1-i64x8632 #2
Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
BIOS 4.6.5 08/29/2017
I-pipe domain: Linux
RIP: ... : ...
Code: Bad RIP value.

This means the instruction pointer is pointing into a data area. That
is bad, and I think it is caused by Cobalt not restoring the
stack/registers correctly during a context switch.
Other times I get:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
in: __xnsched_run.part.63
CPU: 1 PID: 2409 Comm: appnrtB Tainted: G OE 4.19.89xeno3.1-i64x8632 #2
Hardware name: To be filled by O.E.M. To be filled by O.E.M., BIOS 4.6.5 04/23/2021
I-pipe domain: Linux
Call Trace:
<IRQ>
dump_stack+0x95/0x...
panic+0x.../0x246
? ___xnsched_run.part.63+0x5c4/0x4d0
__stack_chk_fail+0x19/0x28
___xnsched_run.part.63+0x5c4/0x5d8
? release_ioapic_irq+0x3f/0x58
? __ipipe_end_fasteoi_irq+0x.../0x38
xnintr_edge_vec_handler+0x.../0x558
__ipipe_do_sync_pipeline+0x.../0x...
dispatch_irq_head+0xe6/0x118
__ipipe_dispatch_irq+0x1bc/0x1e8
__ipipe_handle_irq+0x198/0x208
? common_interrupt+0xf/0x2c
</IRQ>

The accompanying stack trace seems to implicate an ipipe interrupt
handler as causing the problem. I'm using xeno_16550A.ko interrupts on
an isolated interrupt level (IRQ 18).

Interestingly, the Cobalt scheduler and my RT userspace app are still
running after this, even though the Linux kernel is halted. I proved
this on an oscilloscope: I can see serial packets going into and out
of the serial ports at the expected periodic time base.

(Note that the text of these kernel faults above is reconstructed with
OCR so some addresses are not complete. The computer is hard-locked in
a text terminal when these happen. I can supply the full JPG pictures
or re-type addresses if you like.)

The application scenario which causes the above problems: the primary
app, “apprt2”, is a 32-bit userspace app (compiled with -m32) running
on CPU core 1 (fixed affinity), on 64-bit Xenomai 3.1 with the ipipe
patch applied for x86 kernel 4.19.89. It shares memory via mmap() with
an RTDM module (“modrt1”), but nothing is happening in “modrt1” at
present (no interrupts, etc.). There are also two non-RT Linux
userspace apps attached to the same shared memory via mmap(), but they
do very little during these tests. I have attached several (1-6) RS232
serial devices and one CAN device, all communicating with “apprt2”.
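
For reference, the shared-memory attachment is just the standard RTDM
mmap pattern, roughly as in the sketch below (simplified; the device
name, region size, and layout are illustrative, not the real modrt1
interface):

/* Sketch: a Cobalt userspace app attaching to shared memory exported
 * by an RTDM driver. Built against libcobalt, so open()/mmap() are the
 * wrapped POSIX calls. Device name and size are made up for illustration. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define SHM_SIZE 4096  /* assumed size of the driver's shared region */

int main(void)
{
    int fd = open("/dev/rtdm/modrt1", O_RDWR);  /* RTDM devices live under /dev/rtdm */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Works only if the RTDM driver implements a mmap handler. */
    void *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* ... exchange data with the driver and the other apps through *shm ... */

    munmap(shm, SHM_SIZE);
    close(fd);
    return 0;
}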

The system does not fault (for 48+ hours) when no peripheral
connections are present (Serial/CAN). The faults happen with Serial
traffic, whether the CAN device is attached or not. The CAN device
alone with no Serial does not cause the fault (tested for 48+ hours),
and the fault has also happened when the motherboard serial ports were
used, so the PCI Moxa code is not implicated.

Note that in order to get 32-bit userspace support to fully work I had
to manually patch the 16550A.c serial driver with the 32-bit
“compatibility” patch from the xenomai mailing list. That works OK and
my apps can communicate fine for hours. The serial packets in my
applications carry CRC checks, so we know if data ever gets corrupted.
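
For what it's worth, the apps drive the ports through the plain RTDM
serial profile, along the lines of this sketch (device index, baud
rate, and timeout are examples only; the header path may differ
depending on how Xenomai was installed):

/* Sketch: opening and configuring a xeno_16550A port via the RTDM
 * serial (rtser) profile. Settings here are illustrative examples. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <rtdm/serial.h>   /* rtser_config, RTSER_* definitions */

int main(void)
{
    int fd = open("/dev/rtdm/rtser0", O_RDWR);  /* first xeno_16550A port */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct rtser_config cfg;
    memset(&cfg, 0, sizeof(cfg));
    cfg.config_mask = RTSER_SET_BAUD | RTSER_SET_TIMEOUT_RX;
    cfg.baud_rate   = 115200;
    cfg.rx_timeout  = 100000000;  /* 100 ms read timeout, in ns */
    if (ioctl(fd, RTSER_RTIOC_SET_CONFIG, &cfg) < 0) {
        perror("RTSER_RTIOC_SET_CONFIG");
        close(fd);
        return 1;
    }

    unsigned char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));  /* RT read, blocks in primary mode */
    if (n > 0) {
        /* ... verify the packet CRC, then hand it to the application ... */
    }

    close(fd);
    return 0;
}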

Note that my apps have been running OK 32-bit on Xenomai v2.6 for two
years. Also I ran my apps compiled as 64 bit on Xenomai v3.0.12 and
did not get any faults in a test lasting 21+ hours (serial driver
only, no CAN).

Since I imagine Xenomai developers prefer to debug on recent builds, I
also tested this on Xenomai 3.2.1 with my apps recompiled as 64-bit. I
still get kernel Oopses with Xeno 3.2.1:

kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
BUG: unable to handle kernel paging request at ...
PGD ... P4D ... PUD ... PMD ...
Oops: 0011 [#1] SMP PTI
CPU: 1 PID: 3539 Comm: appnrtA Tainted: G OE 4.19.89xeno3.1-i64x8632 #2
Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
BIOS 4.6.5 08/29/2017
I-pipe domain: Linux
RIP: … : ...
Code: Bad RIP value.
…

* Is there some way to instrument the Cobalt kernel to debug this? It
seems impossible to get any debug data from /proc/xenomai because the
Linux kernel has Oopsed.

A debugging problem: occasionally, with my apps compiled 64-bit on
Xeno 3.1 or 3.2, a test runs 24+ hours OK (but would fault eventually,
or in another test). So I get 'false positives' and it takes weeks to
make progress. It is easiest to generate a kernel Oops rapidly on Xeno
3.1 with my apps compiled 32-bit. So, to expedite the testing process,
may I propose that we keep compiling 32-bit, instrument Xeno 3.1
(kernel 4.19.89), and ultimately port the fix to Xeno 3.2 (kernel
4.19.89)?

Thanks.  -C Smith
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config_4.19.89-20211206
Type: application/octet-stream
Size: 190113 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20220102/c6ee52df/attachment.obj>


* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-03  7:29 x86 kernel Oops in Xeno-3.1/3.2 C Smith
@ 2022-01-03  7:38 ` Jan Kiszka
  2022-01-03 21:12   ` C Smith
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kiszka @ 2022-01-03  7:38 UTC (permalink / raw)
  To: C Smith, Xenomai List, Philippe Gerum

On 03.01.22 08:29, C Smith wrote:
> I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1). [...]

The issue is only with 4.19-ipipe kernels? Are you able to test also
with 5.4-ipipe or 5.10/15-dovetail?

Can you also spend an extra UART for a kernel console so that crash
dumps may have a better chance to be reported?

Regarding reference configurations: See also
https://source.denx.de/Xenomai/xenomai-images/-/tree/master/recipes-kernel/linux/files.
Not optimal ones, but tested.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux



* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-03  7:38 ` Jan Kiszka
@ 2022-01-03 21:12   ` C Smith
  2022-01-04  7:05     ` Jan Kiszka
  0 siblings, 1 reply; 8+ messages in thread
From: C Smith @ 2022-01-03 21:12 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai List, Philippe Gerum

On Sun, Jan 2, 2022 at 11:38 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
>
> On 03.01.22 08:29, C Smith wrote:
> > I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1). [...]
>
> The issue is only with 4.19-ipipe kernels?

Yes all of the oopses were on 4.19.89 ipipe kernels (x86).

>Are you able to test also
> with 5.4-ipipe or 5.10/15-dovetail?

Yes I can test with both of those. I'll do that shortly.

> Can you also spend an extra UART for a kernel console so that crash
> dumps may have a better chance to be reported?

I can spare a serial port for a console, but I believe I already have
complete crash dumps in photos that show what has been happening in my
tests this month.
See this picture of a test w/ my  RT apps compiled 32 bit on Xeno-3.1,
getting an NX protection fault from Dec 10th:
https://drive.google.com/file/d/15QYgfa73mVr3vhGdPyrQsghG1WeMFZlL/view?usp=sharing

Here is another crash dump from Dec 30, in which my RT apps are
compiled 64 bit running on Xeno 3.1,
getting a Kernel panic this time:
https://drive.google.com/file/d/1h7fePxUnrlm5H4PKpKALrQ_TK_dpqXj6/view?usp=sharing

> Regarding reference configurations: See also
> https://source.denx.de/Xenomai/xenomai-images/-/tree/master/recipes-kernel/linux/files.
> Not optimal ones, but tested.

I can't seem to find kernel configs in that file tree. Can you guide
me to where an x86 kernel config is, so I can diff it against mine ?
Maybe I can build one of these qemu images, but it is a lower priority
as I need to do some other tests for you first like running
with kernel 5.4 ipipe patch and then Dovetail.
I fear that the qemu image would not be a useful test because there
wouldn't be serial ports or serial interrupts, right?

thanks  -C Smith

> Jan
> --
> Siemens AG, T RDA IOT
> Corporate Competence Center Embedded Linux



* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-03 21:12   ` C Smith
@ 2022-01-04  7:05     ` Jan Kiszka
  2022-01-04  7:44       ` C Smith
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kiszka @ 2022-01-04  7:05 UTC (permalink / raw)
  To: C Smith; +Cc: Xenomai List, Philippe Gerum

On 03.01.22 22:12, C Smith wrote:
> On Sun, Jan 2, 2022 at 11:38 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>
>> On 03.01.22 08:29, C Smith wrote:
>>> I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1). [...]
>>
>> The issue is only with 4.19-ipipe kernels?
> 
> Yes all of the oopses were on 4.19.89 ipipe kernels (x86).
> 
>> Are you able to test also
>> with 5.4-ipipe or 5.10/15-dovetail?
> 
> Yes I can test with both of those. I'll do that shortly.
> 
>> Can you also spend an extra UART for a kernel console so that crash
>> dumps may have a better chance to be reported?
> 
> I can spare a serial port for a terminal, but I believe I have
> complete crash dumps I can show
> you already in photos, so as to show you what has been happening
> historically in my tests this month.

The major drawback of screen-reported crashes is that you only have what
is on the frozen screen, nothing from the past before that. Plus, you
can't search in that.

> See this picture of a test w/ my  RT apps compiled 32 bit on Xeno-3.1,
> getting an NX protection fault from Dec 10th:
> https://drive.google.com/file/d/15QYgfa73mVr3vhGdPyrQsghG1WeMFZlL/view?usp=sharing
> 
> Here is another crash dump from Dec 30, in which my RT apps are
> compiled 64 bit running on Xeno 3.1,
> getting a Kernel panic this time:
> https://drive.google.com/file/d/1h7fePxUnrlm5H4PKpKALrQ_TK_dpqXj6/view?usp=sharing
> 
>> Regarding reference configurations: See also
>> https://source.denx.de/Xenomai/xenomai-images/-/tree/master/recipes-kernel/linux/files.
>> Not optimal ones, but tested.
> 
> I can't seem to find kernel configs in that file tree. Can you guide
> me to where an x86 kernel config is, so I can diff it against mine ?

https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig

That's a defconfig, so run "make olddefconfig" against it first.

> Maybe I can build one of these qemu images, but it is a lower priority
> as I need to do some other tests for you first like running
> with kernel 5.4 ipipe patch and then Dovetail.
> I fear that the qemu image would not be a useful test because there
> wouldn't be serial ports or serial interrupts, right?

There are as well, in fact. The first UART's output is redirected to the
console when you run start-qemu.sh. You can append a second UART via the
command line using QEMU options, and then you could even direct that
virtual UART to a real one of the host system.

The major issue with reproducing in QEMU[/KVM] is, though, that the
timings will suffer, and applications may even fail to run when
deadlines are missed. But if you could reproduce in QEMU, we may
simplify the reproduction to just sharing your VM image.

Jan

-- 
Siemens AG, Technology
Competence Center Embedded Linux



* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-04  7:05     ` Jan Kiszka
@ 2022-01-04  7:44       ` C Smith
  2022-01-06  8:19         ` C Smith
  0 siblings, 1 reply; 8+ messages in thread
From: C Smith @ 2022-01-04  7:44 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai List, Philippe Gerum

On Mon, Jan 3, 2022 at 11:05 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
>
> On 03.01.22 22:12, C Smith wrote:
> > On Sun, Jan 2, 2022 at 11:38 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >>
> >> On 03.01.22 08:29, C Smith wrote:
> >>> I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1). [...]
> >>
> >> The issue is only with 4.19-ipipe kernels?
> >
> > Yes all of the oopses were on 4.19.89 ipipe kernels (x86).
> >
> >> Are you able to test also
> >> with 5.4-ipipe or 5.10/15-dovetail?
> >
> > Yes I can test with both of those. I'll do that shortly.
> >
> >> Can you also spend an extra UART for a kernel console so that crash
> >> dumps may have a better chance to be reported?
> >
> > I can spare a serial port for a terminal, but I believe I have
> > complete crash dumps I can show
> > you already in photos, so as to show you what has been happening
> > historically in my tests this month.
>
> The major drawback of screen-reported crashes is that you only have what
> is on the frozen screen, nothing from the past before that. Plus, you
> can't search in that.

Agreed, just showing you history. I have my kernel outputting to a
serial terminal now - I like it!
Here is kernel dump output from today. There was no serial activity
during this so maybe the serial driver is absolved? I was running
'switchtest -2s 200' during this:

[  427.925103] apprt2: External pulse period: 0 ns, frame divisor: 0
[ 1926.021851] kernel tried to execute NX-protected page - exploit
attempt? (uid: 1000)
[ 1926.029897] BUG: unable to handle kernel paging request at ffff95b8d4439d40
[ 1926.037164] PGD 35801067 P4D 35801067 PUD 2099a4063 PMD 20eede063
PTE 8000000154439063
[ 1926.045405] Oops: 0011 [#1] SMP PTI
[ 1926.049218] CPU: 2 PID: 2323 Comm: appnrt1 Tainted: G           OE
   4.19.89xeno3.1-i64x8632 #2
[ 1926.058430] Hardware name: To be filled by O.E.M. To be filled by
O.E.M./SHARKBAY, BIOS 4.6.5 08/29/2017
[ 1926.068268] I-pipe domain: Linux
[ 1926.071861] RIP: 0010:0xffff95b8d4439d40
[ 1926.076156] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 <80> 00 02 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00
[ 1926.095348] RSP: 95b03ea0:000000000001a220 EFLAGS: 00003046
[ 1926.101356] RAX: ffff95b8e2bd8000 RBX: ffffffffa10b6040 RCX: 0000000000000000
[ 1926.108937] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95b8e2a2f1c0
[ 1926.116529] RBP: ffff95b995b15210 R08: 0000000000031980 R09: 00000000000009dc
[ 1926.124127] R10: 00000000000009dc R11: 0000000000000000 R12: ffff95b995b03e80
[ 1926.131733] R13: ffff95b995b03e80 R14: 000000000002c720 R15: 0000000000000046
[ 1926.139341] FS:  0000000000000000(0000) GS:ffff95b995b00000(0063)
knlGS:00000000f6c2e740
[ 1926.147918] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 1926.154160] CR2: ffff95b8d4439d40 CR3: 0000000162a22006 CR4: 00000000001606a0
[ 1926.161801] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1926.169450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1926.177095] Call Trace:
[ 1926.180054] Modules linked in: modrt1(OE) devlink xt_CHECKSUM
ipt_MASQUERADE xt_conntrack nft_compat nf_nat_tftp nft_objref
nf_conntrack_tftp nft_counter tun bridge stp llc rpcsec_gss_krb5
auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv6
nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6
nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
nft_chain_route_ipv4 ip_set nf_tables nfnetlink xeno_rtipc snd_pcm_oss
snd_mixer_oss sunrpc snd_hda_codec_hdmi snd_hda_intel snd_hda_codec
snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm intel_rapl
snd_timer intel_powerclamp snd coretemp crc32_pclmul xeno_can_peak_pci
joydev mei_wdt xeno_can_sja1000
[ 1926.253264]  mei_me intel_cstate rt_igb soundcore xeno_can(E)
rt_e1000e intel_uncore iTCO_wdt intel_rapl_perf iTCO_vendor_support
rtnet mei video lpc_ich radeon drm_kms_helper syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm crc32c_intel drm serio_raw igb e1000e
ata_generic pata_acpi i2c_algo_bit fuse [last unloaded: i2c_i801]
[ 1926.283792] CR2: ffff95b8d4439d40
[ 1926.287879] ---[ end trace 00b88b101da84af3 ]---
[ 1926.293275] RIP: 0010:0xffff95b8d4439d40
[ 1926.297985] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 <80> 00 02 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00
[ 1926.317585] RSP: 95b03ea0:000000000001a220 EFLAGS: 00003046
[ 1926.317586] RAX: ffff95b8e2bd8000 RBX: ffffffffa10b6040 RCX: 0000000000000000
[ 1926.317588] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95b8e2a2f1c0
[ 1926.317589] RBP: ffff95b995b15210 R08: 0000000000031980 R09: 00000000000009dc
[ 1926.317590] R10: 00000000000009dc R11: 0000000000000000 R12: ffff95b995b03e80
[ 1926.317591] R13: ffff95b995b03e80 R14: 000000000002c720 R15: 0000000000000046
[ 1926.317593] FS:  0000000000000000(0000) GS:ffff95b995b00000(0063)
knlGS:00000000f6c2e740
[ 1926.317594] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 1926.317595] CR2: ffff95b8d4439d40 CR3: 0000000162a22006 CR4: 00000000001606a0
[ 1926.317597] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1926.317598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> > See this picture of a test w/ my  RT apps compiled 32 bit on Xeno-3.1,
> > getting an NX protection fault from Dec 10th:
> > https://drive.google.com/file/d/15QYgfa73mVr3vhGdPyrQsghG1WeMFZlL/view?usp=sharing
> >
> > Here is another crash dump from Dec 30, in which my RT apps are
> > compiled 64 bit running on Xeno 3.1,
> > getting a Kernel panic this time:
> > https://drive.google.com/file/d/1h7fePxUnrlm5H4PKpKALrQ_TK_dpqXj6/view?usp=sharing
> >
> >> Regarding reference configurations: See also
> >> https://source.denx.de/Xenomai/xenomai-images/-/tree/master/recipes-kernel/linux/files.
> >> Not optimal ones, but tested.
> >
> > I can't seem to find kernel configs in that file tree. Can you guide
> > me to where an x86 kernel config is, so I can diff it against mine ?
>
> https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig
>
> That's a defconfig, so run "make olddefconfig" against it first.

OK I will diff it against my config tomorrow...

> > Maybe I can build one of these qemu images, but it is a lower priority
> > as I need to do some other tests for you first like running
> > with kernel 5.4 ipipe patch and then Dovetail.
> > I fear that the qemu image would not be a useful test because there
> > wouldn't be serial ports or serial interrupts, right?
>
> There are as well, in fact. The first UART's output is redirected to the
> console when you run start-qemu.sh. You can append a second UART via the
> command line using QEMU options, and then you could even direct that
> virtual UART to a real one of the host system.
>
> The major issue with reproducing in QEMU[/KVM] is, though, that the
> timings will suffer, and applications may even fail to run when
> deadlines are missed. But if you could reproduce in QEMU, we may
> simplify the reproduction to just sharing your VM image.
>
> Jan
> --
> Siemens AG, Technology
> Competence Center Embedded Linux

Thanks  -C Smith



* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-04  7:44       ` C Smith
@ 2022-01-06  8:19         ` C Smith
  2022-01-09 16:35           ` Philippe Gerum
  0 siblings, 1 reply; 8+ messages in thread
From: C Smith @ 2022-01-06  8:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai List, Philippe Gerum

On Mon, Jan 3, 2022 at 11:44 PM C Smith <csmithquestions@gmail.com> wrote:
>
> [...]

An update: switchtest run alone does not cause a kernel Oops for 18+
hours, but when my periodic RT task runs concurrently with switchtest
the kernel Oopses within 2-3 hours. So the theory is: concurrent
context switching causes the problem. I'm still running more tests,
commenting out the device driver (and similar) code in my RT task, to
hopefully narrow it down further.
-C Smith
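
(For illustration: a minimal sketch of the kind of periodic loop
described above, not the actual apprt2 code; the 1 ms period, the
priority and the names are assumptions. With the Cobalt POSIX wrappers
linked in at build time, clock_nanosleep() should be handled by the
co-kernel, so a loop like this running next to switchtest exercises
concurrent Cobalt context switching.)

    #include <pthread.h>
    #include <sched.h>
    #include <time.h>

    static void *periodic(void *arg)
    {
        struct timespec next;

        (void)arg;
        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
            /* advance the absolute wakeup time by 1 ms (assumed period) */
            next.tv_nsec += 1000000;
            if (next.tv_nsec >= 1000000000) {
                next.tv_nsec -= 1000000000;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            /* serial/CAN I/O would happen here in the real application */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param prm = { .sched_priority = 80 };  /* assumed prio */

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &prm);
        pthread_create(&tid, &attr, periodic, NULL);
        pthread_join(tid, NULL);
        return 0;
    }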


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-06  8:19         ` C Smith
@ 2022-01-09 16:35           ` Philippe Gerum
  2022-01-26  8:37             ` C Smith
  0 siblings, 1 reply; 8+ messages in thread
From: Philippe Gerum @ 2022-01-09 16:35 UTC (permalink / raw)
  To: C Smith; +Cc: Jan Kiszka, Xenomai List


C Smith <csmithquestions@gmail.com> writes:

> On Mon, Jan 3, 2022 at 11:44 PM C Smith <csmithquestions@gmail.com> wrote:
>>
>> On Mon, Jan 3, 2022 at 11:05 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> >
>> > On 03.01.22 22:12, C Smith wrote:
>> > > On Sun, Jan 2, 2022 at 11:38 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> > >>
>> > >> On 03.01.22 08:29, C Smith wrote:
>> > >>> I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1).
>> > >>> In numerous tests, I can't keep a computer running for more than a day
>> > >>> before the computer hard-locks (no kbd/mouse/ping). Frequently the
>> > >>> kernel Oopses within 4-6 hours. I have tried 2 identical motherboards,
>> > >>> changed RAM, and tried another manufacturer's motherboard on a 3rd
>> > >>> computer.
>> > >>>
>> > >>> * Can someone supply me with a known successful x68 kernel 4.19.89
>> > >>> config so I can compare and try those settings? I will attach my
>> > >>> kernel config to this email, in hopes someone can see something wrong
>> > >>> with them.
>> > >>>
>> > >>> Specs:  Intel i5-4590 CPU, Advantech motherboard with Q87 intel
>> > >>> chipset, 8G RAM, Moxa 4-port PCI card w/ 16750 UARTs, 2 motherboard
>> > >>> 16550 UARTS (in ISA memory range), Peak PCI CAN card, Xenomai 3.1
>> > >>> (also xeno 3.2.1), Distro: RHEL8, with xenomai ipipe-patched 4.19.89
>> > >>> kernel from kernel.org source.
>> > >>>
>> > >>> Sometimes onscreen (in a text terminal) I get this Oops:
>> > >>>
>> > >>> kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
>> > >>> BUG: unable to handle kernel paging request at ...
>> > >>> PGD ... P4D ... PUD .. PHD ...
>> > >>> Oops: 0011 [#1] SMP PTI
>> > >>> CPU: 1 P1D: 3539 Comm: gui Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
>> > >>> Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
>> > >>> BIOS 4.6.5 08/29/2017
>> > >>> I-pipe domain: Linux
>> > >>> RIP: ... : ...
>> > >>> Code: Bad RIP value.
>> > >>>
>> > >>> Which means the Instruction Pointer is in a Data area. That is bad,
>> > >>> and I think it is caused by Cobalt code not restoring the
>> > >>> stack/registers correctly during a context switch.
>> > >>> Other times I get :
>> > >>>
>> > >>> Kernel Panic - not syncing: stack-protector: Kernel stack is corrupted
>> > >>> in: __xnsched_run.part.63 h -
>> > >>> CPU: 1 PID: 2409 Comm: appnrtB Tainted: G OE 4.19.89Nen03.1-i64x8632 #2
>> > >>> Hardware name: To be filled by 0.E.M. To be filled by OEM, BIOS 4.6.5 04/23/2021
>> > >>> I-pipe domain: Linux
>> > >>> Call Trace:
>> > >>> <IRQ>
>> > >>> dump_stack+8x95/8xna
>> > >>> panic+8xe§l8x246
>> > >>> ? ___xnsched_run.part.63+8x5c4/8x4d0
>> > >>> __stack_chhk_fail+8x19x8x28
>> > >>> ___xnsched_run.part.63+8x§c4/Bx§d8
>> > >>> ? release_ioapic_irq+8x3f/8x58
>> > >>> ? __ipipe_end_fasteoi_irq+BNZZ/8x38
>> > >>> xnintr;edge_vec_handler+BXBIA/8x558
>> > >>> __ipipe_do_sync_pipeline+8xS/ana
>> > >>> dispatch_irq_head+8xe6/Bx118
>> > >>> __ipipe_dispatch_irq+ax1bc/Bx1e8
>> > >>> __ipipe_handle_irq+8x198/x208
>> > >>> ? common_interrupt+8xf/Bx2c
>> > >>> </IRQ>
>> > >>>
>> > >>> The accompanying stack trace seems to implicate an ipipe interrupt
>> > >>> handler as causing the problem. I'm using xeno_16550A.ko interrupts on
>> > >>> an isolated interrupt level (IRQ 18).
>> > >>>
>> > >>> Interestingly, the Cobalt scheduler and my RT userspace app are still
>> > >>> running after this, even though the Linux kernel is halted. I proved
>> > >>> this on an oscilloscope: I can see serial packets going into and out
>> > >>> of the serial ports at the expected periodic time base.
>> > >>>
>> > >>> (Note that the text of these kernel faults above is reconstructed with
>> > >>> OCR so some addresses are not complete. The computer is hard-locked in
>> > >>> a text terminal when these happen. I can supply the full JPG pictures
>> > >>> or re-type addresses if you like.)
>> > >>>
>> > >>> The application scenario which causes the above problems:  The primary
>> > >>> app, “apprt2”, is a 32-bit userspace app (compiled -m32) running on
>> > >>> CPU core 1 (by fixed affinity), on 64 bit Xenomai 3.1 with ipipe patch
>> > >>> applied for x86 kernel 4.19.89. It has shared memory via mmap() with
>> > >>> an RTDM module (“modrt1”) but nothing is happening in “modrt1” at
>> > >>> present, no interrupts etc. There are also two non-RT userspace linux
>> > >>> apps which have attached to the same shared memory via mmap() but
>> > >>> those are doing nothing much during these tests. I have attached
>> > >>> several (1-6) RS232 serial devices and one CAN device all
>> > >>> communicating with “apprt2”.
>> > >>>
>> > >>> The system does not fault (for 48+ hours) when no peripheral
>> > >>> connections are present (Serial/CAN). The faults happen with Serial
>> > >>> traffic, whether the CAN device is attached or not. The CAN device
>> > >>> alone with no Serial does not cause the fault (tested for 48+ hours),
>> > >>> and the fault has also happened when the motherboard serial ports were
>> > >>> used, so the PCI Moxa code is not implicated.
>> > >>>
>> > >>> Note that in order to get 32-bit userspace support to fully work I had
>> > >>> to manually patch the 16550A.c serial driver with the 32 bit
>> > >>> “compatibility” patch from the xenomai mailing list. That works OK and
>> > >>> my apps can communicate fine for hours. The serial packets in my
>> > >>> applications have CRC checks so we know if data ever gets corrupted.
>> > >>>
>> > >>> Note that my apps have been running OK 32-bit on Xenomai v2.6 for two
>> > >>> years. Also I ran my apps compiled as 64 bit on Xenomai v3.0.12 and
>> > >>> did not get any faults in a test lasting 21+ hours (serial driver
>> > >>> only, no CAN).
>> > >>>
>> > >>> Since I imagine Xenomai developers prefer to debug on recent builds, I
>> > >>> also tested this on Xenomai 3.2.1 and I recompiled my apps 64 bit.  I
>> > >>> still get kernel Oopses with Xeno 3.2.1 :
>> > >>>
>> > >>> kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
>> > >>> BUG: unable to handle kernel paging request at ...
>> > >>> PGD ... P4D ... PUD ... PMD ...
>> > >>> Oops: 0011 [#1] SMP PTI
>> > >>> CPU: 1 P1D: 3539 Comm: appnrtA Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
>> > >>> Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
>> > >>> BIOS 4.6.5 08/29/2017
>> > >>> I-pipe domain: Linux
>> > >>> RIP: … : ...
>> > >>> Code: Bad RIP value.
>> > >>> …
>> > >>>
>> > >>> * Is there some way to instrument the Cobalt kernel to debug this ? It
>> > >>> seems impossible to get any debug data from /proc/xenomai because the
>> > >>> Linux kernel is Oopsed.
>> > >>>
>> > >>> A debugging problem:  occasionally with my apps compiled 64 bit on
>> > >>> Xeno 3.1 or 3.2 the tests run 24+ hours OK (but would fault
>> > >>> eventually, or in another test). So I get 'false positives' and it
>> > >>> takes weeks to make progress.  It is easiest to generate a kernel Oops
>> > >>> rapidly on Xeno 3.1 with my apps compiled 32 bit. So to expedite the
>> > >>> testing process may I propose to keep compiling 32 bit and we
>> > >>> instrument Xeno-3.1 (k4.19.89), and ultimately port the fix to
>> > >>> xeno-3.2 (k4.19.89)?
>> > >>>
>> > >>> Thanks.  -C Smith
>> > >>
>> > >> The issue is only with 4.19-ipipe kernels?
>> > >
>> > > Yes all of the oopses were on 4.19.89 ipipe kernels (x86).
>> > >
>> > >> Are you able to test also
>> > >> with 5.4-ipipe or 5.10/15-dovetail?
>> > >
>> > > Yes I can test with both of those. I'll do that shortly.
>> > >
>> > >> Can you also spend an extra UART for a kernel console so that crash
>> > >> dumps may have a better chance to be reported?
>> > >
>> > > I can spare a serial port for a terminal, but I believe I have
>> > > complete crash dumps I can show
>> > > you already in photos, so as to show you what has been happening
>> > > historically in my tests this month.
>> >
>> > The major drawback of screen-reported crashes is that you only have what
>> > is on the frozen screen, nothing from the past before that. Plus, you
>> > can't search in that.
>>
>> Agreed, just showing you history. I have my kernel outputting to a
>> serial terminal now - I like it!
>> Here is kernel dump output from today. There was no serial activity
>> during this so maybe the serial driver is absolved? I was running
>> 'switchtest -2s 200' during this:
>>
>> [  427.925103] apprt2: External pulse period: 0 ns, frame divisor: 0
>> [ 1926.021851] kernel tried to execute NX-protected page - exploit
>> attempt? (uid: 1000)
>> [ 1926.029897] BUG: unable to handle kernel paging request at ffff95b8d4439d40
>> [ 1926.037164] PGD 35801067 P4D 35801067 PUD 2099a4063 PMD 20eede063
>> PTE 8000000154439063
>> [ 1926.045405] Oops: 0011 [#1] SMP PTI
>> [ 1926.049218] CPU: 2 PID: 2323 Comm: appnrt1 Tainted: G           OE
>>    4.19.89xeno3.1-i64x8632 #2
>> [ 1926.058430] Hardware name: To be filled by O.E.M. To be filled by
>> O.E.M./SHARKBAY, BIOS 4.6.5 08/29/2017
>> [ 1926.068268] I-pipe domain: Linux
>> [ 1926.071861] RIP: 0010:0xffff95b8d4439d40
>> [ 1926.076156] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 <80> 00 02 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00
>> [ 1926.095348] RSP: 95b03ea0:000000000001a220 EFLAGS: 00003046
>> [ 1926.101356] RAX: ffff95b8e2bd8000 RBX: ffffffffa10b6040 RCX: 0000000000000000
>> [ 1926.108937] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95b8e2a2f1c0
>> [ 1926.116529] RBP: ffff95b995b15210 R08: 0000000000031980 R09: 00000000000009dc
>> [ 1926.124127] R10: 00000000000009dc R11: 0000000000000000 R12: ffff95b995b03e80
>> [ 1926.131733] R13: ffff95b995b03e80 R14: 000000000002c720 R15: 0000000000000046
>> [ 1926.139341] FS:  0000000000000000(0000) GS:ffff95b995b00000(0063)
>> knlGS:00000000f6c2e740
>> [ 1926.147918] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>> [ 1926.154160] CR2: ffff95b8d4439d40 CR3: 0000000162a22006 CR4: 00000000001606a0
>> [ 1926.161801] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1926.169450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1926.177095] Call Trace:
>> [ 1926.180054] Modules linked in: modrt1(OE) devlink xt_CHECKSUM
>> ipt_MASQUERADE xt_conntrack nft_compat nf_nat_tftp nft_objref
>> nf_conntrack_tftp nft_counter tun bridge stp llc rpcsec_gss_krb5
>> auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache nft_fib_inet
>> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv6
>> nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6
>> nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
>> nft_chain_route_ipv4 ip_set nf_tables nfnetlink xeno_rtipc snd_pcm_oss
>> snd_mixer_oss sunrpc snd_hda_codec_hdmi snd_hda_intel snd_hda_codec
>> snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm intel_rapl
>> snd_timer intel_powerclamp snd coretemp crc32_pclmul xeno_can_peak_pci
>> joydev mei_wdt xeno_can_sja1000
>> [ 1926.253264]  mei_me intel_cstate rt_igb soundcore xeno_can(E)
>> rt_e1000e intel_uncore iTCO_wdt intel_rapl_perf iTCO_vendor_support
>> rtnet mei video lpc_ich radeon drm_kms_helper syscopyarea sysfillrect
>> sysimgblt fb_sys_fops ttm crc32c_intel drm serio_raw igb e1000e
>> ata_generic pata_acpi i2c_algo_bit fuse [last unloaded: i2c_i801]
>> [ 1926.283792] CR2: ffff95b8d4439d40
>> [ 1926.287879] ---[ end trace 00b88b101da84af3 ]---
>> [ 1926.293275] RIP: 0010:0xffff95b8d4439d40
>> [ 1926.297985] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 <80> 00 02 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00
>> [ 1926.317585] RSP: 95b03ea0:000000000001a220 EFLAGS: 00003046
>> [ 1926.317586] RAX: ffff95b8e2bd8000 RBX: ffffffffa10b6040 RCX: 0000000000000000
>> [ 1926.317588] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95b8e2a2f1c0
>> [ 1926.317589] RBP: ffff95b995b15210 R08: 0000000000031980 R09: 00000000000009dc
>> [ 1926.317590] R10: 00000000000009dc R11: 0000000000000000 R12: ffff95b995b03e80
>> [ 1926.317591] R13: ffff95b995b03e80 R14: 000000000002c720 R15: 0000000000000046
>> [ 1926.317593] FS:  0000000000000000(0000) GS:ffff95b995b00000(0063)
>> knlGS:00000000f6c2e740
>> [ 1926.317594] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>> [ 1926.317595] CR2: ffff95b8d4439d40 CR3: 0000000162a22006 CR4: 00000000001606a0
>> [ 1926.317597] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1926.317598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>
>> > > See this picture of a test w/ my  RT apps compiled 32 bit on Xeno-3.1,
>> > > getting an NX protection fault from Dec 10th:
>> > > https://drive.google.com/file/d/15QYgfa73mVr3vhGdPyrQsghG1WeMFZlL/view?usp=sharing
>> > >
>> > > Here is another crash dump from Dec 30, in which my RT apps are
>> > > compiled 64 bit running on Xeno 3.1,
>> > > getting a Kernel panic this time:
>> > > https://drive.google.com/file/d/1h7fePxUnrlm5H4PKpKALrQ_TK_dpqXj6/view?usp=sharing
>> > >
>> > >> Regarding reference configurations: See also
>> > >> https://source.denx.de/Xenomai/xenomai-images/-/tree/master/recipes-kernel/linux/files.
>> > >> Not optimal ones, but tested.
>> > >
>> > > I can't seem to find kernel configs in that file tree. Can you guide
>> > > me to where an x86 kernel config is, so I can diff it against mine ?
>> >
>> > https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig
>> >
>> > That's a defconfig, so run "make olddefconfig" against it first.
>>
>> OK I will diff it against my config tomorrow...
>>
>> > > Maybe I can build one of these qemu images, but it is a lower priority
>> > > as I need to do some other tests for you first like running
>> > > with kernel 5.4 ipipe patch and then Dovetail.
>> > > I fear that the qemu image would not be a useful test because there
>> > > wouldn't be serial ports or serial interrupts, right?
>> >
>> > There are as well, in fact. The first UART's output is redirected to the
>> > console when you run start-qemu.sh. You can append a second UART via the
>> > command line using QEMU options, and then you could even direct that
>> > virtual UART to a real one of the host system.
>> >
>> > The major issue with reproducing in QEMU[/KVM] is, though, that the
>> > timings will suffer, and applications may even fail to run when
>> > deadlines are missed. But if you could reproduce in QEMU, we may
>> > simplify the reproduction to just sharing your VM image.
>> >
>> > Jan
>> > --
>> > Siemens AG, Technology
>> > Competence Center Embedded Linux
>>
>> Thanks  -C Smith
>
> An update: switchtest does not cause a kernel Oops for more than 18+
> hours when run alone, but when my periodic RT task is run concurrent
> with switchtest the kernel Oopses within 2-3 hours . So the theory is:
> concurrent context switching causes the problem. I'm still doing more
> tests to comment out device driver etc. code in my RT task and
> hopefully narrow it down further.
> -C Smith

I don't think context switching per se is the problem (we would have
noticed it earlier, I believe), because switchtest does exactly that:
switching contexts in a fairly hectic way across many threads (e.g. 90+
on a quad-core CPU). IOW, running switchtest alone should be enough to
trigger this kind of bug.

Question: has any of the threads in the application been forcibly
assigned a CPU affinity mask which allows it to run on more than a
single CPU (i.e. via sched_setaffinity(2))? If so, is that thread
subject to primary/secondary stage transitions during normal operation?

-- 
Philippe.
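
(For illustration: a minimal sketch of the distinction the question
draws. A mask with a single bit pins the calling thread to one core;
adding a second CPU to the mask is the case asked about, since the
thread may then migrate. sched_setaffinity(2) is used as named above;
the CPU numbers are arbitrary.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(1, &set);        /* one bit set: pinned to core 1 only */
        /* CPU_SET(2, &set); */  /* a second bit would allow migration
                                    between cores, the case asked about */

        /* pid 0 means the calling thread */
        if (sched_setaffinity(0, sizeof(set), &set))
            perror("sched_setaffinity");

        return 0;
    }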


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: x86 kernel Oops in Xeno-3.1/3.2
  2022-01-09 16:35           ` Philippe Gerum
@ 2022-01-26  8:37             ` C Smith
  0 siblings, 0 replies; 8+ messages in thread
From: C Smith @ 2022-01-26  8:37 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, Xenomai List

On Sun, Jan 9, 2022 at 8:49 AM Philippe Gerum <rpm@xenomai.org> wrote:
>
>
> C Smith <csmithquestions@gmail.com> writes:
>
> > On Mon, Jan 3, 2022 at 11:44 PM C Smith <csmithquestions@gmail.com> wrote:
> >>
> >> On Mon, Jan 3, 2022 at 11:05 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >> >
> >> > On 03.01.22 22:12, C Smith wrote:
> >> > > On Sun, Jan 2, 2022 at 11:38 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >> > >>
> >> > >> On 03.01.22 08:29, C Smith wrote:
> >> > >>> I have been getting kernel Oopses with x86 Xenomai 3.1 (and 3.2.1).
> >> > >>> In numerous tests, I can't keep a computer running for more than a day
> >> > >>> before the computer hard-locks (no kbd/mouse/ping). Frequently the
> >> > >>> kernel Oopses within 4-6 hours. I have tried 2 identical motherboards,
> >> > >>> changed RAM, and tried another manufacturer's motherboard on a 3rd
> >> > >>> computer.
> >> > >>>
> >> > >>> * Can someone supply me with a known successful x68 kernel 4.19.89
> >> > >>> config so I can compare and try those settings? I will attach my
> >> > >>> kernel config to this email, in hopes someone can see something wrong
> >> > >>> with them.
> >> > >>>
> >> > >>> Specs:  Intel i5-4590 CPU, Advantech motherboard with Q87 intel
> >> > >>> chipset, 8G RAM, Moxa 4-port PCI card w/ 16750 UARTs, 2 motherboard
> >> > >>> 16550 UARTS (in ISA memory range), Peak PCI CAN card, Xenomai 3.1
> >> > >>> (also xeno 3.2.1), Distro: RHEL8, with xenomai ipipe-patched 4.19.89
> >> > >>> kernel from kernel.org source.
> >> > >>>
> >> > >>> Sometimes onscreen (in a text terminal) I get this Oops:
> >> > >>>
> >> > >>> kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
> >> > >>> BUG: unable to handle kernel paging request at ...
> >> > >>> PGD ... P4D ... PUD .. PHD ...
> >> > >>> Oops: 0011 [#1] SMP PTI
> >> > >>> CPU: 1 P1D: 3539 Comm: gui Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
> >> > >>> Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
> >> > >>> BIOS 4.6.5 08/29/2017
> >> > >>> I-pipe domain: Linux
> >> > >>> RIP: ... : ...
> >> > >>> Code: Bad RIP value.
> >> > >>>
> >> > >>> Which means the Instruction Pointer is in a Data area. That is bad,
> >> > >>> and I think it is caused by Cobalt code not restoring the
> >> > >>> stack/registers correctly during a context switch.
> >> > >>> Other times I get :
> >> > >>>
> >> > >>> Kernel Panic - not syncing: stack-protector: Kernel stack is corrupted
> >> > >>> in: __xnsched_run.part.63 h -
> >> > >>> CPU: 1 PID: 2409 Comm: appnrtB Tainted: G OE 4.19.89Nen03.1-i64x8632 #2
> >> > >>> Hardware name: To be filled by 0.E.M. To be filled by OEM, BIOS 4.6.5 04/23/2021
> >> > >>> I-pipe domain: Linux
> >> > >>> Call Trace:
> >> > >>> <IRQ>
> >> > >>> dump_stack+8x95/8xna
> >> > >>> panic+8xe§l8x246
> >> > >>> ? ___xnsched_run.part.63+8x5c4/8x4d0
> >> > >>> __stack_chhk_fail+8x19x8x28
> >> > >>> ___xnsched_run.part.63+8x§c4/Bx§d8
> >> > >>> ? release_ioapic_irq+8x3f/8x58
> >> > >>> ? __ipipe_end_fasteoi_irq+BNZZ/8x38
> >> > >>> xnintr;edge_vec_handler+BXBIA/8x558
> >> > >>> __ipipe_do_sync_pipeline+8xS/ana
> >> > >>> dispatch_irq_head+8xe6/Bx118
> >> > >>> __ipipe_dispatch_irq+ax1bc/Bx1e8
> >> > >>> __ipipe_handle_irq+8x198/x208
> >> > >>> ? common_interrupt+8xf/Bx2c
> >> > >>> </IRQ>
> >> > >>>
> >> > >>> The accompanying stack trace seems to implicate an ipipe interrupt
> >> > >>> handler as causing the problem. I'm using xeno_16550A.ko interrupts on
> >> > >>> an isolated interrupt level (IRQ 18).
> >> > >>>
> >> > >>> Interestingly, the Cobalt scheduler and my RT userspace app are still
> >> > >>> running after this, even though the Linux kernel is halted. I proved
> >> > >>> this on an oscilloscope: I can see serial packets going into and out
> >> > >>> of the serial ports at the expected periodic time base.
> >> > >>>
> >> > >>> (Note that the text of these kernel faults above is reconstructed with
> >> > >>> OCR so some addresses are not complete. The computer is hard-locked in
> >> > >>> a text terminal when these happen. I can supply the full JPG pictures
> >> > >>> or re-type addresses if you like.)
> >> > >>>
> >> > >>> The application scenario which causes the above problems:  The primary
> >> > >>> app, “apprt2”, is a 32-bit userspace app (compiled -m32) running on
> >> > >>> CPU core 1 (by fixed affinity), on 64 bit Xenomai 3.1 with ipipe patch
> >> > >>> applied for x86 kernel 4.19.89. It has shared memory via mmap() with
> >> > >>> an RTDM module (“modrt1”) but nothing is happening in “modrt1” at
> >> > >>> present, no interrupts etc. There are also two non-RT userspace linux
> >> > >>> apps which have attached to the same shared memory via mmap() but
> >> > >>> those are doing nothing much during these tests. I have attached
> >> > >>> several (1-6) RS232 serial devices and one CAN device all
> >> > >>> communicating with “apprt2”.
> >> > >>>
> >> > >>> The system does not fault (for 48+ hours) when no peripheral
> >> > >>> connections are present (Serial/CAN). The faults happen with Serial
> >> > >>> traffic, whether the CAN device is attached or not. The CAN device
> >> > >>> alone with no Serial does not cause the fault (tested for 48+ hours),
> >> > >>> and the fault has also happened when the motherboard serial ports were
> >> > >>> used, so the PCI Moxa code is not implicated.
> >> > >>>
> >> > >>> Note that in order to get 32-bit userspace support to fully work I had
> >> > >>> to manually patch the 16550A.c serial driver with the 32 bit
> >> > >>> “compatibility” patch from the xenomai mailing list. That works OK and
> >> > >>> my apps can communicate fine for hours. The serial packets in my
> >> > >>> applications have CRC checks so we know if data ever gets corrupted.
> >> > >>>
> >> > >>> Note that my apps have been running OK 32-bit on Xenomai v2.6 for two
> >> > >>> years. Also I ran my apps compiled as 64 bit on Xenomai v3.0.12 and
> >> > >>> did not get any faults in a test lasting 21+ hours (serial driver
> >> > >>> only, no CAN).
> >> > >>>
> >> > >>> Since I imagine Xenomai developers prefer to debug on recent builds, I
> >> > >>> also tested this on Xenomai 3.2.1 and I recompiled my apps 64 bit.  I
> >> > >>> still get kernel Oopses with Xeno 3.2.1 :
> >> > >>>
> >> > >>> kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
> >> > >>> BUG: unable to handle kernel paging request at ...
> >> > >>> PGD ... P4D ... PUD ... PMD ...
> >> > >>> Oops: 0011 [#1] SMP PTI
> >> > >>> CPU: 1 P1D: 3539 Comm: appnrtA Tainted: G OE 4.19.89xeno3.1-i64x3832 #2
> >> > >>> Hardware name: To be filled by O.E.M. To be filled by O.E.M./SHARKBAY,
> >> > >>> BIOS 4.6.5 08/29/2017
> >> > >>> I-pipe domain: Linux
> >> > >>> RIP: … : ...
> >> > >>> Code: Bad RIP value.
> >> > >>> …
> >> > >>>
> >> > >>> * Is there some way to instrument the Cobalt kernel to debug this ? It
> >> > >>> seems impossible to get any debug data from /proc/xenomai because the
> >> > >>> Linux kernel is Oopsed.
> >> > >>>
> >> > >>> A debugging problem:  occasionally with my apps compiled 64 bit on
> >> > >>> Xeno 3.1 or 3.2 the tests run 24+ hours OK (but would fault
> >> > >>> eventually, or in another test). So I get 'false positives' and it
> >> > >>> takes weeks to make progress.  It is easiest to generate a kernel Oops
> >> > >>> rapidly on Xeno 3.1 with my apps compiled 32 bit. So to expedite the
> >> > >>> testing process may I propose to keep compiling 32 bit and we
> >> > >>> instrument Xeno-3.1 (k4.19.89), and ultimately port the fix to
> >> > >>> xeno-3.2 (k4.19.89)?
> >> > >>>
> >> > >>> Thanks.  -C Smith
> >> > >>
> >> > >> The issue is only with 4.19-ipipe kernels?
> >> > >
> >> > > Yes all of the oopses were on 4.19.89 ipipe kernels (x86).
> >> > >
> >> > >> Are you able to test also
> >> > >> with 5.4-ipipe or 5.10/15-dovetail?
> >> > >
> >> > > Yes I can test with both of those. I'll do that shortly.
> >> > >
> >> > >> Can you also spend an extra UART for a kernel console so that crash
> >> > >> dumps may have a better chance to be reported?
> >> > >
> >> > > I can spare a serial port for a terminal, but I believe I have
> >> > > complete crash dumps I can show
> >> > > you already in photos, so as to show you what has been happening
> >> > > historically in my tests this month.
> >> >
> >> > The major drawback of screen-reported crashes is that you only have what
> >> > is on the frozen screen, nothing from the past before that. Plus, you
> >> > can't search in that.
> >>
> >> Agreed, just showing you history. I have my kernel outputting to a
> >> serial terminal now - I like it!
> >> Here is kernel dump output from today. There was no serial activity
> >> during this so maybe the serial driver is absolved? I was running
> >> 'switchtest -2s 200' during this:
> >>
> >> [  427.925103] apprt2: External pulse period: 0 ns, frame divisor: 0
> >> [ 1926.021851] kernel tried to execute NX-protected page - exploit
> >> attempt? (uid: 1000)
> >> [ 1926.029897] BUG: unable to handle kernel paging request at ffff95b8d4439d40
> >> [ 1926.037164] PGD 35801067 P4D 35801067 PUD 2099a4063 PMD 20eede063
> >> PTE 8000000154439063
> >> [ 1926.045405] Oops: 0011 [#1] SMP PTI
> >> [ 1926.049218] CPU: 2 PID: 2323 Comm: appnrt1 Tainted: G           OE
> >>    4.19.89xeno3.1-i64x8632 #2
> >> [ 1926.058430] Hardware name: To be filled by O.E.M. To be filled by
> >> O.E.M./SHARKBAY, BIOS 4.6.5 08/29/2017
> >> [ 1926.068268] I-pipe domain: Linux
> >> [ 1926.071861] RIP: 0010:0xffff95b8d4439d40
> >> [ 1926.076156] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00 00 <80> 00 02 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00
> >> [ 1926.095348] RSP: 95b03ea0:000000000001a220 EFLAGS: 00003046
> >> [ 1926.101356] RAX: ffff95b8e2bd8000 RBX: ffffffffa10b6040 RCX: 0000000000000000
> >> [ 1926.108937] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95b8e2a2f1c0
> >> [ 1926.116529] RBP: ffff95b995b15210 R08: 0000000000031980 R09: 00000000000009dc
> >> [ 1926.124127] R10: 00000000000009dc R11: 0000000000000000 R12: ffff95b995b03e80
> >> [ 1926.131733] R13: ffff95b995b03e80 R14: 000000000002c720 R15: 0000000000000046
> >> [ 1926.139341] FS:  0000000000000000(0000) GS:ffff95b995b00000(0063)
> >> knlGS:00000000f6c2e740
> >> [ 1926.147918] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> >> [ 1926.154160] CR2: ffff95b8d4439d40 CR3: 0000000162a22006 CR4: 00000000001606a0
> >> [ 1926.161801] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [ 1926.169450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> [ 1926.177095] Call Trace:
> >> [ 1926.180054] Modules linked in: modrt1(OE) devlink xt_CHECKSUM
> >> ipt_MASQUERADE xt_conntrack nft_compat nf_nat_tftp nft_objref
> >> nf_conntrack_tftp nft_counter tun bridge stp llc rpcsec_gss_krb5
> >> auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache nft_fib_inet
> >> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv6
> >> nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6
> >> nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat
> >> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> >> nft_chain_route_ipv4 ip_set nf_tables nfnetlink xeno_rtipc snd_pcm_oss
> >> snd_mixer_oss sunrpc snd_hda_codec_hdmi snd_hda_intel snd_hda_codec
> >> snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm intel_rapl
> >> snd_timer intel_powerclamp snd coretemp crc32_pclmul xeno_can_peak_pci
> >> joydev mei_wdt xeno_can_sja1000
> >> [ 1926.253264]  mei_me intel_cstate rt_igb soundcore xeno_can(E)
> >> rt_e1000e intel_uncore iTCO_wdt intel_rapl_perf iTCO_vendor_support
> >> rtnet mei video lpc_ich radeon drm_kms_helper syscopyarea sysfillrect
> >> sysimgblt fb_sys_fops ttm crc32c_intel drm serio_raw igb e1000e
> >> ata_generic pata_acpi i2c_algo_bit fuse [last unloaded: i2c_i801]
> >> [ 1926.283792] CR2: ffff95b8d4439d40
> >> [ 1926.287879] ---[ end trace 00b88b101da84af3 ]---
> >> [ 1926.293275] RIP: 0010:0xffff95b8d4439d40
> >> [ 1926.297985] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00 00 <80> 00 02 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00
> >> [ 1926.317585] RSP: 95b03ea0:000000000001a220 EFLAGS: 00003046
> >> [ 1926.317586] RAX: ffff95b8e2bd8000 RBX: ffffffffa10b6040 RCX: 0000000000000000
> >> [ 1926.317588] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95b8e2a2f1c0
> >> [ 1926.317589] RBP: ffff95b995b15210 R08: 0000000000031980 R09: 00000000000009dc
> >> [ 1926.317590] R10: 00000000000009dc R11: 0000000000000000 R12: ffff95b995b03e80
> >> [ 1926.317591] R13: ffff95b995b03e80 R14: 000000000002c720 R15: 0000000000000046
> >> [ 1926.317593] FS:  0000000000000000(0000) GS:ffff95b995b00000(0063)
> >> knlGS:00000000f6c2e740
> >> [ 1926.317594] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> >> [ 1926.317595] CR2: ffff95b8d4439d40 CR3: 0000000162a22006 CR4: 00000000001606a0
> >> [ 1926.317597] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [ 1926.317598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>
> >> > > See this picture of a test w/ my  RT apps compiled 32 bit on Xeno-3.1,
> >> > > getting an NX protection fault from Dec 10th:
> >> > > https://drive.google.com/file/d/15QYgfa73mVr3vhGdPyrQsghG1WeMFZlL/view?usp=sharing
> >> > >
> >> > > Here is another crash dump from Dec 30, in which my RT apps are
> >> > > compiled 64 bit running on Xeno 3.1,
> >> > > getting a Kernel panic this time:
> >> > > https://drive.google.com/file/d/1h7fePxUnrlm5H4PKpKALrQ_TK_dpqXj6/view?usp=sharing
> >> > >
> >> > >> Regarding reference configurations: See also
> >> > >> https://source.denx.de/Xenomai/xenomai-images/-/tree/master/recipes-kernel/linux/files.
> >> > >> Not optimal ones, but tested.
> >> > >
> >> > > I can't seem to find kernel configs in that file tree. Can you guide
> >> > > me to where an x86 kernel config is, so I can diff it against mine ?
> >> >
> >> > https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig
> >> >
> >> > That's a defconfig, so run "make olddefconfig" against it first.
> >>
> >> OK I will diff it against my config tomorrow...
> >>
> >> > > Maybe I can build one of these qemu images, but it is a lower priority
> >> > > as I need to do some other tests for you first like running
> >> > > with kernel 5.4 ipipe patch and then Dovetail.
> >> > > I fear that the qemu image would not be a useful test because there
> >> > > wouldn't be serial ports or serial interrupts, right?
> >> >
> >> > There are as well, in fact. The first UART's output is redirected to the
> >> > console when you run start-qemu.sh. You can append a second UART via the
> >> > command line using QEMU options, and then you could even direct that
> >> > virtual UART to a real one of the host system.
> >> >
> >> > The major issue with reproducing in QEMU[/KVM] is, though, that the
> >> > timings will suffer, and applications may even fail to run when
> >> > deadlines are missed. But if you could reproduce in QEMU, we may
> >> > simplify the reproduction to just sharing your VM image.
> >> >
> >> > Jan
> >> > --
> >> > Siemens AG, Technology
> >> > Competence Center Embedded Linux
> >>
> >> Thanks  -C Smith
> >
> > An update: switchtest does not cause a kernel Oops for more than 18+
> > hours when run alone, but when my periodic RT task is run concurrent
> > with switchtest the kernel Oopses within 2-3 hours . So the theory is:
> > concurrent context switching causes the problem. I'm still doing more
> > tests to comment out device driver etc. code in my RT task and
> > hopefully narrow it down further.
> > -C Smith
>
> I don't think context switching per se is a problem (we would have
> noticed it earlier, I believe), because switchtest does exactly that:
> switching contexts in a fairly hectic way over many threads (e.g. 90+ on
> a quad-core CPU). IOW, running switchtest alone should be enough to
> trigger such kind of bugs.
>
> Question: has any of the threads in the application been forcibly set a
> CPU affinity mask allowing it to run on more than a single CPU (i.e. via
> sched_setaffinity(2))? If so, is that thread subject to
> primary/secondary stage transitions during normal operations?
> --
> Philippe.

In answer to your question: "no" about sched_setaffinity(2) and "no"
about primary/secondary stage transitions.

I have run many more tests, and this kernel Oops happens with
ipipe-patched kernels 4.19.89, 4.19.207 and 5.4.151, on Xenomai 3.2 and
3.1, and with Xenomai compiled both 32 bit and 64 bit. I need to try
Dovetail next. I am cutting my apps down to execute as little code as
possible, in an effort to produce a reproducible test case.

Incidentally, switchtest running alongside my periodic RT userspace app
actually runs OK for 24 hours at a time, but only if I do not run my
non-RT Linux userspace app. The three apps in combination seem to cause
the kernel Oops within a few hours. That non-RT userspace app is linked
to RTDM so it can share memory with my RT app via mmap(). The non-RT
userspace app accesses the shared memory and enters the Cobalt wrappers
when it does a usleep(), but I can't see why that would contribute to a
corrupted kernel stack.

-C Smith
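
(For illustration: a minimal sketch of the non-RT side of the setup
described above, i.e. a plain Linux process mapping the RTDM module's
shared memory and sleeping. The device path /dev/rtdm/modrt1 and the
4 KiB mapping size are assumptions; the point is that the usleep() is
the call which enters the Cobalt wrappers when the program is built
against libcobalt, as described.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHM_SIZE 4096   /* assumed size of the shared region */

    int main(void)
    {
        /* device node name assumed; RTDM named devices appear under /dev/rtdm */
        int fd = open("/dev/rtdm/modrt1", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        /* read/write the shared region, then yield; with the Cobalt POSIX
           wrappers linked in, this usleep() is the call that enters the
           co-kernel, as described above */
        usleep(1000);

        munmap(shm, SHM_SIZE);
        close(fd);
        return 0;
    }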


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-01-26  8:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-03  7:29 x86 kernel Oops in Xeno-3.1/3.2 C Smith
2022-01-03  7:38 ` Jan Kiszka
2022-01-03 21:12   ` C Smith
2022-01-04  7:05     ` Jan Kiszka
2022-01-04  7:44       ` C Smith
2022-01-06  8:19         ` C Smith
2022-01-09 16:35           ` Philippe Gerum
2022-01-26  8:37             ` C Smith
