From: Philippe Gerum <rpm@xenomai.org>
To: Russell Johnson <russell.johnson@kratosdefense.com>
Cc: "xenomai@lists.linux.dev" <xenomai@lists.linux.dev>,
	Bryan Butler <Bryan.Butler@kratosdefense.com>,
	Shawn McManus <shawn.mcmanus@kratosdefense.com>
Subject: Re: EVL Memory
Date: Wed, 09 Nov 2022 18:20:40 +0100	[thread overview]
Message-ID: <877d04dm52.fsf@xenomai.org> (raw)
In-Reply-To: <PH1P110MB1050307D7FC58A2B99A40094E23E9@PH1P110MB1050.NAMP110.PROD.OUTLOOK.COM>


Russell Johnson <russell.johnson@kratosdefense.com> writes:

> [[S/MIME Signed Part:Undecided]]
> Hello,
>
>  
>
> We have been running into some memory issues with our realtime EVL application. Recently, we are seeing a core dump write immediately on
> startup, and when running in the debugger,  the call stack just doesn’t make any sense – it almost seems random. I have the latest EVL
> kernel built with all the EVL debug on including the EVL_DEBUG_MEM option, and I also turned on KASAN. I see no output at all from the
> kernel when this core dump happens except for a couple basic lines:
>
>  
>
> EVL: RxMain switching in-band [pid=4327, excpt=14, user_pc=0x0]
>
> RxMain[4327]: segfault at 0 ip 0000000000000000 sp 00007f7ab5305468 error 14 in realtime_sfhm[400000+161000]
>
> Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
>
>  
>
> I wouldn’t say those lines are very helpful. I know that exception 14 on x86 is a page fault and that a pc value of 0 does not seem good
> (memory corruption?). 
>

Yes, this very much looks like a stack overwrite. Since you are running
on x86, return addresses are popped from the stack, so a corrupted
stack could lead to that funky user_pc=0x0 if the IP register is
reloaded with a bogus value.

>  
>
> I wanted to get your opinion on our overall memory setup and see if there are obvious issues or if you had any recommendations of things to
> try.

Checking and controlling the thread stack consumption comes to mind,
along with looking for unconstrained writes to stack-based variables.

> So we set up 1 EVL heap on startup of the application, and this heap is used for all dynamic allocations throughout the entire app as we
> also overrode global new/delete to use the EVL heap rather than malloc/free. The reason we overrode global new/delete is that we have a lot
> of STL objects as well as third party libraries that we are using in this app and it would become a very long and challenging process to actually
> go through and modify all those to use custom allocators.

Hopefully the STL won't use hidden synchronization via *libc mutexes,
otherwise this would be an issue. Re-routing new()/delete() away from
malloc()/free() to a dedicated allocator is common practice.

> So, in main(), we spawn a “Main” EVL thread. All other threads are spawned from
> this parent EVL thread. When the EVL Main thread starts, it sets a static flag that enables using the evl heap in the global news/deletes, and
> every library and process for the realtime app is created and started. We also prefault the EVL heap after it is initialized/created in order to try
> to avoid page faults while running the realtime loop. However, we still see occasional page faults in our realtime EVL threads while running our
> main realtime loop. I don’t understand how there could be a page fault if the EVL heap is large enough (verified) and prefaulted.
>
>  
>
> Now back to the issue mentioned at the beginning of this email – we are unsure at this time if the page fault error seen there is cause, result,
> or primary error causing these core dumps.

This page fault is a symptom, not the root issue: the kernel detects
that the app has done something weird, like jumping to 0x0. If the
stack is wrecked, then registers other than IP might be trashed on
reload, leading to other forms of bad accesses. But they would all
originate from the same stack corruption issue.

> Typically when I see the occasional page faults while running other EVL threads in the app, I do
> not see core dumps – just the log from the kernel. The only reason why I figured I would at least run this past you was that I did not see any
> core dumps when disabling the flag in the global new/delete and just using malloc/free (of course I saw some in band switches, but that was
> it).

You may want to make sure to set ulimit to allow unlimited core dumps
for the process.

>
>  
>
> The issue of occasional page faults, handling dynamic memory allocation (using global new/delete), and now these possible memory
> corruption seg faults is becoming a larger concern for us. We would like to make sure that we are understanding how to use memory in an
> EVL application properly, and we would be interested to know if there are any recommended ways of tracking these down with an EVL
> application. I have tried to build/run with the gcc address sanitizer, but I was seeing issues attaching EVL threads when this was enabled. I
> have also tried running valgrind, but that has produced nothing useful. And, of course, I have run in gdb and the stack traces are not helpful.
> At this point, any guidance, thoughts, and/or recommendations would be greatly appreciated. I added some more clear/specific questions at
> the bottom.
>

An EVL application can run over valgrind; you may want to check whether
that helps.

>  
>
> A few specific questions:
>
> 1 Is this a reasonable model to use for an EVL application, or do you expect the model to revolve more around static allocation?
>

No requirement for static allocation if that fits the bill.

> 2 Are you aware of people using EVL overriding the global new/delete to use the EVL heap?
>

Yes.

> 3 Do you have any tools for debugging the EVL heap or have you adapted any existing tools (such as valgrind) to debug the EVL heap?
>

As mentioned earlier, valgrind works out of the box for EVL apps.

> 4 Is there any known way of protecting against a stack overflow?
>

EVL threads are merely plain threads on real-time steroids, so the
read-only canary area / red zone the *libc sets up in order to detect
overflows is there too. So maybe the stack is not overflowing, but
rather being trashed by a buffer overflow on some automatic variable.

> 5 PC value of 0 is never valid and we have no evidence that we have an uninitialized pointer in our C++ code. Is there anyway to use info from
>  EVL to help track down this issue?
>

No, it's the worst case: we lose track due to the bad jump, and there
is no link register to locate the caller on this architecture.

> 6 We are allocating 2GB of EVL Heap memory and pre-faulting all of it on startup. We also use pthread_attr_setstacksize to pre-allocate the
>  stack for each EVL thread we have. EVL still says that we get rare page faults. How is this possible? Are we missing something? (I have
>  attached our heap pre-faulting logic)
>

To answer this, we'd need to find out which kind of page fault this is,
whether it is a PTE miss, and which code actually triggers it. This
would be a Dovetail/x86 issue if anything; EVL leaves all the mm work
to the in-band kernel.

-- 
Philippe.

Thread overview: 55+ messages
     [not found] <PH1P110MB1050AD875FCD924D827A0A54E23E9@PH1P110MB1050.NAMP110.PROD.OUTLOOK.COM>
     [not found] ` <PH1P110MB10508DD9688B7D0E82AB30CDE23E9@PH1P110MB1050.NAMP110.PROD.OUTLOOK.COM>
2022-11-09 17:04   ` EVL Memory Russell Johnson
2022-11-09 17:20     ` Philippe Gerum [this message]
2022-11-11 21:34       ` [External] - " Russell Johnson
2022-11-14  9:53         ` Philippe Gerum
2022-11-14 22:42           ` Russell Johnson
2022-11-15  8:33             ` Philippe Gerum
2022-11-15 17:05               ` Russell Johnson
2022-11-15 18:36                 ` Philippe Gerum
2022-11-16 15:48                   ` Philippe Gerum
2022-11-16 21:37                     ` Russell Johnson
2022-11-17 16:48                       ` Philippe Gerum
2022-11-17 16:57                         ` Russell Johnson
2022-11-17 17:03                           ` Philippe Gerum
2022-11-17 17:37                             ` Russell Johnson
2022-11-18  8:06                               ` Philippe Gerum
2022-11-18 21:08                                 ` Russell Johnson
2022-11-17 22:19                             ` Russell Johnson
2022-11-18  8:02                               ` Philippe Gerum
2022-11-18  8:08                         ` Philippe Gerum
2022-11-19 16:37                           ` Russell Johnson
2022-11-19 16:42                             ` Philippe Gerum
2022-11-19 16:50                               ` Russell Johnson
2022-11-19 18:11                               ` Russell Johnson
2022-11-20  8:25                                 ` Philippe Gerum
2022-11-21 15:56                             ` Philippe Gerum
2022-11-21 18:33                               ` Bryan Butler
2022-11-28 15:21                                 ` Russell Johnson
2022-11-28 16:49                                   ` Philippe Gerum
2022-11-28 20:59                                     ` Russell Johnson
     [not found]                                       ` <0082bff2d91b0125ac60050159d3003e64b45bffa35e0c4f0ed9799e38b97b8c@mu>
2022-11-30 15:57                                         ` Philippe Gerum
2022-12-01 14:36                                           ` Philippe Gerum
2022-12-01 20:01                                             ` Russell Johnson
2022-12-02  9:18                                               ` Philippe Gerum
2022-12-02 15:12                                                 ` Russell Johnson
2022-12-02 15:27                                                   ` Philippe Gerum
2022-12-02 15:38                                                     ` Philippe Gerum
2022-12-02 20:50                                                       ` Russell Johnson
2022-12-03 11:37                                                         ` Philippe Gerum
2022-12-02 15:48                                                     ` Russell Johnson
2022-12-02 16:50                                                       ` Philippe Gerum
2022-12-02 17:22                                                       ` Philippe Gerum
2022-12-02 22:26                                                         ` Russell Johnson
2022-12-03 11:37                                                           ` Philippe Gerum
2022-12-03 15:44                                                             ` Philippe Gerum
2022-12-04 11:05                                                               ` Philippe Gerum
2022-12-04 18:05                                                                 ` Philippe Gerum
2022-12-04 18:43                                                                   ` Russell Johnson
2022-12-05  6:53                                                                   ` Russell Johnson
2022-12-05  6:59                                                                     ` Russell Johnson
2022-12-05  8:24                                                                       ` Philippe Gerum
2022-12-05 16:31                                                                         ` Russell Johnson
2022-12-05 16:38                                                                           ` Russell Johnson
2022-12-05 17:01                                                                             ` Philippe Gerum
2022-12-05  8:45                                                                     ` Philippe Gerum
2022-11-14 23:33           ` Russell Johnson
