Hi Sebastian, > > > > > > > The kernel version 4.1.37-rt43 is working fine, the kernel > > > > > > > versions > > > > > > > 4.4.39-rt50 and > > > > > > 4.8.11-rt7 show the same strange behavior as 4.9.0-rt1. > > > > > > > Something on the way between 4.1.37-rt43 and 4.4.39-rt50 > > > > > > > seems to cause the > > > > > > trouble. > > > > > > > > > > > > can you check if one of the earlier v4.4-RT releases (maybe > > > > > > start with > > > > > > -rt2 or -rt1) also shows this behaviour? > > > > > > If not, would you have a testcase? > > > > > I tested with 4.4.0-rt2. > > > > > Here this issue occurs very rarely. I ran the gdb for about 30 > > > > > times, and here I got one hit. > > > > > With the other kernel versions I got the issue in 20-40% of the cases. > > > > > > > > > > I will try out the other 4.4.0 releases... > > > > > > > > I ran more tests. The kernel 4.4.1-rt4 is working fairly fine > > > > (only about one hit every 30 runs) but the kernel 4.4.1-rt5 is > > > > causing the issue very > > > often. > > > > > > and 4.1.37-rt43 shows not SIGSTOPS in 100 runs? > > That's right. > > I used 4.1.37-rt43 and ran the test for 200 times. There was no single issue. I have extended my test executable to use the kernel tracing feature to get more details. Here is the output of my gdb session: ------------------ BEGIN ------------------------- pid: 17969 [New Thread 0x7ffff7616700 (LWP 17973)] [New Thread 0x7ffff6e15700 (LWP 17974)] [New Thread 0x7ffff6614700 (LWP 17975)] [New Thread 0x7ffff5e13700 (LWP 17976)] End of thread 0 End of thread 2 End of thread 1 Thread 0 joined Thread 1 joined Thread 2 joined [Thread 0x7ffff6614700 (LWP 17975) exited] [Thread 0x7ffff6e15700 (LWP 17974) exited] [Thread 0x7ffff7616700 (LWP 17973) exited] ^C Thread 1 "gdb-issue" received signal SIGINT, Interrupt. 0x00007ffff79bd6bd in pthread_join (threadid=140737318565632, thread_return=0x0) at pthread_join.c:90 90 pthread_join.c: No such file or directory. (gdb) info thr Id Target Id Frame * 1 Thread 0x7ffff7fe0700 (LWP 17969) "gdb-issue" 0x00007ffff79bd6bd in pthread_join (threadid=140737318565632, thread_return=0x0) at pthread_join.c:90 5 Thread 0x7ffff5e13700 (LWP 17976) "gdb-issue" 0x00007ffff79c5b7b in __waitpid (pid=18160, stat_loc=0x7ffff5e12ea0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:29 (gdb) thr 5 [Switching to thread 5 (Thread 0x7ffff5e13700 (LWP 17976))] #0 0x00007ffff79c5b7b in __waitpid (pid=18160, stat_loc=0x7ffff5e12ea0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:29 29 ../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory. (gdb) bt #0 0x00007ffff79c5b7b in __waitpid (pid=18160, stat_loc=0x7ffff5e12ea0, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:29 #1 0x00005555555551ce in thread_func (arg=0x555555756130 ) at gdb-issue.c:107 #2 0x00007ffff79bc464 in start_thread (arg=0x7ffff5e13700) at pthread_create.c:333 #3 0x00007ffff76ff9df in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105 (gdb) ------------------ END ------------------------- The program hangs as it cannot join thread 3 (LWP 17976). This thread has forked (pid=18160) to call "lspci -xxx" but it waits forever for the termination of pid 18160. I have attached an extract of the kernel trace where this situation is reflected. ERROR: One interesting thing can be observed. The forked process 18160 is listed with the following lines in the trace: <...>-18160 [003] d...112 19076.865212: sched_waking: comm=gdb-issue pid=17969 prio=120 target_cpu=000 <...>-18160 [003] d...212 19076.865213: sched_wakeup: comm=gdb-issue pid=17969 prio=120 target_cpu=000 <...>-18160 [003] d...2.. 19076.874033: sched_switch: prev_comm=gdb-issue prev_pid=18160 prev_prio=39 prev_state=T|K ==> next_comm=gdb-issue next_pid=17976 next_prio=39 <...>-18160 [003] d...2.. 19076.874033: tlb_flush: pages:-1 reason:flush on task switch (0) <...>-18160 [003] d...2.. 19076.874034: x86_fpu_regs_deactivated: x86/fpu: ffff8801fcfaf600 fpregs_active: 0 fpstate_active: 1 counter: 4 xfeatures: 6 xcomp_bv: 0 <...>-18160 [003] d...2.. 19076.874034: x86_fpu_regs_activated: x86/fpu: ffff88040b027600 fpregs_active: 1 fpstate_active: 1 counter: 113 xfeatures: 6 xcomp_bv: 0 <...>-18160 [003] d...2.. 19076.874034: x86_fpu_regs_activated: x86/fpu: ffff88040b027600 fpregs_active: 1 fpstate_active: 1 counter: 113 xfeatures: 6 xcomp_bv: 0 <...>-18160 [003] d...2.. 19076.874034: write_msr: c0000100, value 7ffff5e13700 SUCCESS: Within the previous loop the thread 17976 also forked to pid=18150 to do the very same. In this case it has been successfully. The lines for pid=18150 in the trace looks like this: <...>-18150 [003] d...2.. 19076.846386: sched_switch: prev_comm=sh prev_pid=18150 prev_prio=39 prev_state=S ==> next_comm=gdb-issue next_pid=17976 next_prio=39 <...>-18150 [003] d...2.. 19076.846387: tlb_flush: pages:-1 reason:flush on task switch (0) <...>-18150 [003] d...2.. 19076.846387: x86_fpu_regs_deactivated: x86/fpu: ffff8801fcfa8a00 fpregs_active: 0 fpstate_active: 1 counter: 0 xfeatures: 2 xcomp_bv: 0 <...>-18150 [003] d...2.. 19076.846388: x86_fpu_regs_activated: x86/fpu: ffff88040b027600 fpregs_active: 1 fpstate_active: 1 counter: 105 xfeatures: 6 xcomp_bv: 0 <...>-18150 [003] d...2.. 19076.846388: x86_fpu_regs_activated: x86/fpu: ffff88040b027600 fpregs_active: 1 fpstate_active: 1 counter: 105 xfeatures: 6 xcomp_bv: 0 <...>-18150 [003] d...2.. 19076.846388: write_msr: c0000100, value 7ffff5e13700 In all the successful loops the lines in the trace look like the lines in the successful (2nd) example below. In the error case there is an additional sched_waking and sched_wakeup in the trace. Also there is always "prev_state=S" within the forked process. In the error case, there is "prev_state=T|K". Please find the trace extract and the test executable in the attachment. Any idea or feedback on the issue is highly welcome! Thanks Best regards Mathias