linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* System Hang With 5.15.79-rt54 Patch Set
@ 2022-12-15 18:39 Joseph Salisbury
  2022-12-15 23:39 ` Joseph Salisbury
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Salisbury @ 2022-12-15 18:39 UTC (permalink / raw)
  To: linux-rt-users; +Cc: williams, bigeasy

Hello,

A bug has been reported[0] against a linux-5.15.y based Ubuntu kernel 
that has the 5.15.79-rt54 real-time patch set applied.

This bug is causing a system hang when running disk tests with the 
stress-ng[1] test suite.  A 'hung task' call trace is being generated[2] 
just prior to the system hanging.  It appears the hang is happening 
during a call to schedule.

To see if this was a regression, I also tested prior real-time patch 
sets, going as far back to rt38 and all versions exhibit this bug. I 
will test back further to see if this is in fact a regression and a 
kernel bisect can be done.  I also confirmed the same kernel without the 
rt54 patch set applied does not exhibit this bug.

I will continue to debug this issue, but I wanted to send this message 
upstream in case anyone else runs into it.  Any specific tests, tracing 
or debugging steps would be greatly appreciated. Otherwise, I'll report 
back to this thread with any findings.

Thanks,

Joe


[0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
[1] https://github.com/ColinIanKing/stress-ng
[2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2022-12-15 18:39 System Hang With 5.15.79-rt54 Patch Set Joseph Salisbury
@ 2022-12-15 23:39 ` Joseph Salisbury
  2023-01-04 20:47   ` Joseph Salisbury
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Salisbury @ 2022-12-15 23:39 UTC (permalink / raw)
  To: linux-rt-users; +Cc: williams, bigeasy



On 12/15/22 13:39, Joseph Salisbury wrote:
> Hello,
>
> A bug has been reported[0] against a linux-5.15.y based Ubuntu kernel 
> that has the 5.15.79-rt54 real-time patch set applied.
>
> This bug is causing a system hang when running disk tests with the 
> stress-ng[1] test suite.  A 'hung task' call trace is being 
> generated[2] just prior to the system hanging.  It appears the hang is 
> happening during a call to schedule.
>
> To see if this was a regression, I also tested prior real-time patch 
> sets, going as far back to rt38 and all versions exhibit this bug. I 
> will test back further to see if this is in fact a regression and a 
> kernel bisect can be done.  I also confirmed the same kernel without 
> the rt54 patch set applied does not exhibit this bug.
>
> I will continue to debug this issue, but I wanted to send this message 
> upstream in case anyone else runs into it.  Any specific tests, 
> tracing or debugging steps would be greatly appreciated. Otherwise, 
> I'll report back to this thread with any findings.
>
> Thanks,
>
> Joe
>
>
> [0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
> [1] https://github.com/ColinIanKing/stress-ng
> [2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt

I should have also mentioned, I will also test the latest available 
patch set (6.1-rc7-rt5) to see if it also exhibits the bug.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2022-12-15 23:39 ` Joseph Salisbury
@ 2023-01-04 20:47   ` Joseph Salisbury
  2023-01-18 18:52     ` Joseph Salisbury
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Salisbury @ 2023-01-04 20:47 UTC (permalink / raw)
  To: linux-rt-users; +Cc: williams, bigeasy



On 12/15/22 18:39, Joseph Salisbury wrote:
>
>
> On 12/15/22 13:39, Joseph Salisbury wrote:
>> Hello,
>>
>> A bug has been reported[0] against a linux-5.15.y based Ubuntu kernel 
>> that has the 5.15.79-rt54 real-time patch set applied.
>>
>> This bug is causing a system hang when running disk tests with the 
>> stress-ng[1] test suite.  A 'hung task' call trace is being 
>> generated[2] just prior to the system hanging.  It appears the hang 
>> is happening during a call to schedule.
>>
>> To see if this was a regression, I also tested prior real-time patch 
>> sets, going as far back to rt38 and all versions exhibit this bug. I 
>> will test back further to see if this is in fact a regression and a 
>> kernel bisect can be done.  I also confirmed the same kernel without 
>> the rt54 patch set applied does not exhibit this bug.
>>
>> I will continue to debug this issue, but I wanted to send this 
>> message upstream in case anyone else runs into it.  Any specific 
>> tests, tracing or debugging steps would be greatly appreciated. 
>> Otherwise, I'll report back to this thread with any findings.
>>
>> Thanks,
>>
>> Joe
>>
>>
>> [0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
>> [1] https://github.com/ColinIanKing/stress-ng
>> [2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt
>
> I should have also mentioned, I will also test the latest available 
> patch set (6.1-rc7-rt5) to see if it also exhibits the bug.
>
>
This bug/system hang does still happen with the v6.1-rc7 kernel and the 
rt5 real-time patch.  I also confirmed the bug/system hang does not 
happen with just v6.1-rc7 without the real-time patch applied.

I will bisect through the rt patches to see if I can find the specific 
patch(s) that causes the bug.  I will use the v5.15 kernel and it's rt 
patches, since v5.15 has the patches broken out individually, and they 
can be applied one at a time with 'git am'.

I'll provide further updates during the bisect.

Thanks,

Joe



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2023-01-04 20:47   ` Joseph Salisbury
@ 2023-01-18 18:52     ` Joseph Salisbury
  2023-02-16 17:15       ` Sebastian Andrzej Siewior
  2023-02-28 16:50       ` Joseph Salisbury
  0 siblings, 2 replies; 8+ messages in thread
From: Joseph Salisbury @ 2023-01-18 18:52 UTC (permalink / raw)
  To: linux-rt-users; +Cc: williams, bigeasy



On 1/4/23 15:47, Joseph Salisbury wrote:
>
>
> On 12/15/22 18:39, Joseph Salisbury wrote:
>>
>>
>> On 12/15/22 13:39, Joseph Salisbury wrote:
>>> Hello,
>>>
>>> A bug has been reported[0] against a linux-5.15.y based Ubuntu 
>>> kernel that has the 5.15.79-rt54 real-time patch set applied.
>>>
>>> This bug is causing a system hang when running disk tests with the 
>>> stress-ng[1] test suite.  A 'hung task' call trace is being 
>>> generated[2] just prior to the system hanging.  It appears the hang 
>>> is happening during a call to schedule.
>>>
>>> To see if this was a regression, I also tested prior real-time patch 
>>> sets, going as far back to rt38 and all versions exhibit this bug. I 
>>> will test back further to see if this is in fact a regression and a 
>>> kernel bisect can be done.  I also confirmed the same kernel without 
>>> the rt54 patch set applied does not exhibit this bug.
>>>
>>> I will continue to debug this issue, but I wanted to send this 
>>> message upstream in case anyone else runs into it.  Any specific 
>>> tests, tracing or debugging steps would be greatly appreciated. 
>>> Otherwise, I'll report back to this thread with any findings.
>>>
>>> Thanks,
>>>
>>> Joe
>>>
>>>
>>> [0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
>>> [1] https://github.com/ColinIanKing/stress-ng
>>> [2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt
>>
>> I should have also mentioned, I will also test the latest available 
>> patch set (6.1-rc7-rt5) to see if it also exhibits the bug.
>>
>>
> This bug/system hang does still happen with the v6.1-rc7 kernel and 
> the rt5 real-time patch.  I also confirmed the bug/system hang does 
> not happen with just v6.1-rc7 without the real-time patch applied.
>
> I will bisect through the rt patches to see if I can find the specific 
> patch(s) that causes the bug.  I will use the v5.15 kernel and it's rt 
> patches, since v5.15 has the patches broken out individually, and they 
> can be applied one at a time with 'git am'.
>
> I'll provide further updates during the bisect.
>
> Thanks,
>
> Joe
>
>

I've been bisecting this bug down. It appears the bug has existed back 
to v5.15-rc3-rt5. I'll test older versions, but it may be that this bug 
has always existed with PREEMPT_RT enabled.

I will test some of the real-time patch releases prior to v5.15 to 
confirm this issue has always existed or not.

This is the current message I see on the console when the system locks up:

[ 4363.605933] INFO: task journal-offline:23780 blocked for more than 
622 seconds.
[ 4363.605936]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.605938] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606051] INFO: task stress-ng:23298 blocked for more than 622 
seconds.
[ 4363.606052]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606121] INFO: task stress-ng:23299 blocked for more than 622 
seconds.
[ 4363.606122]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606183] INFO: task stress-ng:23300 blocked for more than 622 
seconds.
[ 4363.606184]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606235] INFO: task stress-ng:23302 blocked for more than 622 
seconds.
[ 4363.606236]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606290] INFO: task stress-ng:23305 blocked for more than 622 
seconds.
[ 4363.606291]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606346] INFO: task stress-ng:23309 blocked for more than 622 
seconds.
[ 4363.606347]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606393] INFO: task stress-ng:23314 blocked for more than 622 
seconds.
[ 4363.606394]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606440] INFO: task stress-ng:23319 blocked for more than 622 
seconds.
[ 4363.606440]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4363.606484] INFO: task stress-ng:23323 blocked for more than 622 
seconds.
[ 4363.606485]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 4708.553134] systemd[1]: Failed to start Journal Service.
[ 4770.052389] systemd[1]: systemd-timesyncd.service: Watchdog timeout 
(limit 3min)!


I'll add more details to this thread as I continue.





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2023-01-18 18:52     ` Joseph Salisbury
@ 2023-02-16 17:15       ` Sebastian Andrzej Siewior
  2023-03-10 21:09         ` Joseph Salisbury
  2023-02-28 16:50       ` Joseph Salisbury
  1 sibling, 1 reply; 8+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-02-16 17:15 UTC (permalink / raw)
  To: Joseph Salisbury; +Cc: linux-rt-users, williams

On 2023-01-18 13:52:21 [-0500], Joseph Salisbury wrote:
> I'll add more details to this thread as I continue.

Any update on this?
Does the system really hang? The dmesg says:
|[ 8235.110075] INFO: task stress-ng:9466 blocked for more than 122 seconds.

which means stress-ng is blocked for quite some time due to I/O
according to the backtrace. This appears once for each stress-ng
process, 10 times in total. It does not repeat and the system runs at
least until

| [50733.471625] hid-generic 0003:03F0:7029.0004: input,hidraw1: USB HID v1.10 Mouse [iLO Virtual Keyboard] on usb-0000:01:00.4-1/input1

~11h after that report. 
Based on that it looks like the system complained about slow I/O but did
not hang as it completed its task.

Sebastian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2023-01-18 18:52     ` Joseph Salisbury
  2023-02-16 17:15       ` Sebastian Andrzej Siewior
@ 2023-02-28 16:50       ` Joseph Salisbury
  1 sibling, 0 replies; 8+ messages in thread
From: Joseph Salisbury @ 2023-02-28 16:50 UTC (permalink / raw)
  To: linux-rt-users; +Cc: williams, bigeasy, rostedt, tglx



On 1/18/23 13:52, Joseph Salisbury wrote:
>
>
> On 1/4/23 15:47, Joseph Salisbury wrote:
>>
>>
>> On 12/15/22 18:39, Joseph Salisbury wrote:
>>>
>>>
>>> On 12/15/22 13:39, Joseph Salisbury wrote:
>>>> Hello,
>>>>
>>>> A bug has been reported[0] against a linux-5.15.y based Ubuntu 
>>>> kernel that has the 5.15.79-rt54 real-time patch set applied.
>>>>
>>>> This bug is causing a system hang when running disk tests with the 
>>>> stress-ng[1] test suite.  A 'hung task' call trace is being 
>>>> generated[2] just prior to the system hanging.  It appears the hang 
>>>> is happening during a call to schedule.
>>>>
>>>> To see if this was a regression, I also tested prior real-time 
>>>> patch sets, going as far back to rt38 and all versions exhibit this 
>>>> bug. I will test back further to see if this is in fact a 
>>>> regression and a kernel bisect can be done.  I also confirmed the 
>>>> same kernel without the rt54 patch set applied does not exhibit 
>>>> this bug.
>>>>
>>>> I will continue to debug this issue, but I wanted to send this 
>>>> message upstream in case anyone else runs into it.  Any specific 
>>>> tests, tracing or debugging steps would be greatly appreciated. 
>>>> Otherwise, I'll report back to this thread with any findings.
>>>>
>>>> Thanks,
>>>>
>>>> Joe
>>>>
>>>>
>>>> [0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
>>>> [1] https://github.com/ColinIanKing/stress-ng
>>>> [2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt
>>>
>>> I should have also mentioned, I will also test the latest available 
>>> patch set (6.1-rc7-rt5) to see if it also exhibits the bug.
>>>
>>>
>> This bug/system hang does still happen with the v6.1-rc7 kernel and 
>> the rt5 real-time patch.  I also confirmed the bug/system hang does 
>> not happen with just v6.1-rc7 without the real-time patch applied.
>>
>> I will bisect through the rt patches to see if I can find the 
>> specific patch(s) that causes the bug.  I will use the v5.15 kernel 
>> and it's rt patches, since v5.15 has the patches broken out 
>> individually, and they can be applied one at a time with 'git am'.
>>
>> I'll provide further updates during the bisect.
>>
>> Thanks,
>>
>> Joe
>>
>>
>
> I've been bisecting this bug down. It appears the bug has existed back 
> to v5.15-rc3-rt5. I'll test older versions, but it may be that this 
> bug has always existed with PREEMPT_RT enabled.
>
> I will test some of the real-time patch releases prior to v5.15 to 
> confirm this issue has always existed or not.
>
> This is the current message I see on the console when the system locks 
> up:
>
> [ 4363.605933] INFO: task journal-offline:23780 blocked for more than 
> 622 seconds.
> [ 4363.605936]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.605938] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606051] INFO: task stress-ng:23298 blocked for more than 622 
> seconds.
> [ 4363.606052]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606121] INFO: task stress-ng:23299 blocked for more than 622 
> seconds.
> [ 4363.606122]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606183] INFO: task stress-ng:23300 blocked for more than 622 
> seconds.
> [ 4363.606184]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606235] INFO: task stress-ng:23302 blocked for more than 622 
> seconds.
> [ 4363.606236]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606290] INFO: task stress-ng:23305 blocked for more than 622 
> seconds.
> [ 4363.606291]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606346] INFO: task stress-ng:23309 blocked for more than 622 
> seconds.
> [ 4363.606347]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606393] INFO: task stress-ng:23314 blocked for more than 622 
> seconds.
> [ 4363.606394]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606440] INFO: task stress-ng:23319 blocked for more than 622 
> seconds.
> [ 4363.606440]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4363.606484] INFO: task stress-ng:23323 blocked for more than 622 
> seconds.
> [ 4363.606485]       Not tainted 5.15.85-051585-realtime #202301172024
> [ 4363.606486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [ 4708.553134] systemd[1]: Failed to start Journal Service.
> [ 4770.052389] systemd[1]: systemd-timesyncd.service: Watchdog timeout 
> (limit 3min)!
>
>
> I'll add more details to this thread as I continue.
>
>
>
>

A bisect has not provided additional detail. This issue does not appear 
to be a regression and appears to have always existed.

I was able to get additional debug info, by enabling 
CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG. 
Enabling these configs shows a circular locking issue[0] and a call 
trace[1].

I will analyze this data for the next steps.  Any other suggestions 
would be greatly appreciated.

Thanks,

Joe

[0] https://launchpadlibrarian.net/653810449/locking_issue.txt
[1] https://launchpadlibrarian.net/653810490/call_trace.txt



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2023-02-16 17:15       ` Sebastian Andrzej Siewior
@ 2023-03-10 21:09         ` Joseph Salisbury
  2023-03-13 15:11           ` Alison Chaiken
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Salisbury @ 2023-03-10 21:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-users, williams, bigeasy, rostedt, tglx



On 2/16/23 12:15, Sebastian Andrzej Siewior wrote:
> On 2023-01-18 13:52:21 [-0500], Joseph Salisbury wrote:
>> I'll add more details to this thread as I continue.
> Any update on this?
> Does the system really hang? The dmesg says:
> |[ 8235.110075] INFO: task stress-ng:9466 blocked for more than 122 seconds.
>
> which means stress-ng is blocked for quite some time due to I/O
> according to the backtrace. This appears once for each stress-ng
> process, 10 times in total. It does not repeat and the system runs at
> least until
>
> | [50733.471625] hid-generic 0003:03F0:7029.0004: input,hidraw1: USB HID v1.10 Mouse [iLO Virtual Keyboard] on usb-0000:01:00.4-1/input1
>
> ~11h after that report.
> Based on that it looks like the system complained about slow I/O but did
> not hang as it completed its task.
>
> Sebastian
Hi Sebastian,

Sorry I missed this mail due to a mail filtering issue.

A bisect has not provided additional detail. This issue does not appear 
to be a regression and appears to have always existed.

I was able to get additional debug info, by enabling 
CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG. 
Enabling these configs shows a circular locking issue[0] and a call 
trace[1].

I don't think the circular locking report is related.  I think your 
correct that the system is not actually hanging.  The interactive 
response make it seem like it's hung.  For example, once the issue 
starts to happen, no other interactive commands can be issues without 
taking at least days (I never waiting more that 3 days :-) )  I'm also 
not able to log in or log out while the system "Appears" hung.  I was 
able to get a sysrq-W while the system was in this state[2].

I think I may have starting investigating too deep at first (By 
bisecting and enabling trace, etc).  I stepped back and looked at the 
high level stats.  The stress-ng test is started with one process for 
each core, and there are 96 of them.  I looked at top[3] during a hang, 
and many of the stress-ng processes are running 'R'. However, a sysrq-q 
also shows many stress-ng processes are 'D' in uninterruptible sleep.  
What also sticks out to me is all the stress-ng processes are running as 
root with a priority of 20. Looking back at one of the call traces[1], I 
see jbd2 stuck in an uninterruptible state:
...
[ 4461.908213] task:journal-offline state:D stack:    0 pid:17541 
ppid:     1 flags:0x00000226
...


The jdb2 kernel thread also runs with a priority of 20[4].  When the 
hang happens, jbd2 is also stuck in an uninterruptible state(As well as 
systemd-journal):
1521 root      20   0       0      0      0 D   0.0   0.0   4:10.48 
jbd2/sda2-8
1593 root      19  -1   64692  15832  14512 D   0.0   0.1   0:01.54 
systemd-journal



I don't yet know why running the test the same way for a generic kernel 
does not cause this behavior when it does for a preempt-rt kernel.  
Maybe it's a case of priority 'Sameness' and not priority inversion :-) ?

I tried to pin all of the stress-ng threads to cores 1-95 and the kernel 
threads to a housekeeping cpu, 0. I recall though that there are certain 
kernel threads that need to run on every core and kworker is one of 
them.  Output from cmdline:
"BOOT_IMAGE=/boot/vmlinuz-5.15.0-1033-realtime 
root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0 
console=ttyS0,115200 skew_tick=1 isolcpus=managed_irq,domain,1-95 
intel_pstate=disable nosoftlockup tsc=nowatchdog 
crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M"

However, even with this pinning, stress-ng ends up running on cpu 0, per 
the ps output[4].  This may be why it is interfering with jbd2.

I'll see if I can modify the test to run as a non-root user or with a 
lower priority.  I could also try bumping the priority of jdb2. Maybe 
one of these would allow the journal to complete it's work and the test 
to finish?

Could it be that that the system is not hung, it is just waiting to 
complete I/O, which will never happen since the jdb2 threads are stuck. 
In this case, this is not a bug, but a test that is not configured 
correctly for a real-time system.  Does that sound plausible?  If you 
think that is the case, I'll talk with the bug reporter and assist them 
with running the test properly for a real-time system.



Thanks for your assistance!

Joe


[0] https://launchpadlibrarian.net/653810449/locking_issue.txt
[1] https://launchpadlibrarian.net/653810490/call_trace.txt
[2] https://launchpadlibrarian.net/655372944/sysrq-w.txt
[3] https://launchpadlibrarian.net/655374168/top-during-hang.txt
[4] https://launchpadlibrarian.net/655380123/ps-test-running.txt

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: System Hang With 5.15.79-rt54 Patch Set
  2023-03-10 21:09         ` Joseph Salisbury
@ 2023-03-13 15:11           ` Alison Chaiken
  0 siblings, 0 replies; 8+ messages in thread
From: Alison Chaiken @ 2023-03-13 15:11 UTC (permalink / raw)
  To: Joseph Salisbury
  Cc: Sebastian Andrzej Siewior, linux-rt-users, williams, rostedt, tglx

> On 2/16/23 12:15, Sebastian Andrzej Siewior wrote:
> > On 2023-01-18 13:52:21 [-0500], Joseph Salisbury wrote:
> >> I'll add more details to this thread as I continue.
> > Any update on this?
> > Does the system really hang? The dmesg says:
> > |[ 8235.110075] INFO: task stress-ng:9466 blocked for more than 122 seconds.
> >
> > which means stress-ng is blocked for quite some time due to I/O
> > according to the backtrace. This appears once for each stress-ng
> > process, 10 times in total. It does not repeat and the system runs at
> > least until
> >
> > | [50733.471625] hid-generic 0003:03F0:7029.0004: input,hidraw1: USB HID v1.10 Mouse [iLO Virtual Keyboard] on usb-0000:01:00.4-1/input1
> >
> > ~11h after that report.
> > Based on that it looks like the system complained about slow I/O but did
> > not hang as it completed its task.
> >
> > Sebastian

On Fri, Mar 10, 2023 at 1:09 PM Joseph Salisbury  wrote:
> A bisect has not provided additional detail. This issue does not appear
> to be a regression and appears to have always existed.
>
> I was able to get additional debug info, by enabling
> CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG.
> Enabling these configs shows a circular locking issue[0] and a call
> trace[1].
>
> I don't think the circular locking report is related.  I think your
> correct that the system is not actually hanging.  The interactive
> response make it seem like it's hung.  For example, once the issue
> starts to happen, no other interactive commands can be issues without
> taking at least days (I never waiting more that 3 days :-) )  I'm also
> not able to log in or log out while the system "Appears" hung.  I was
> able to get a sysrq-W while the system was in this state[2].
>
> I think I may have starting investigating too deep at first (By
> bisecting and enabling trace, etc).  I stepped back and looked at the
> high level stats.  The stress-ng test is started with one process for
> each core, and there are 96 of them.  I looked at top[3] during a hang,
> and many of the stress-ng processes are running 'R'. However, a sysrq-q
> also shows many stress-ng processes are 'D' in uninterruptible sleep.
> What also sticks out to me is all the stress-ng processes are running as
> root with a priority of 20. Looking back at one of the call traces[1], I
> see jbd2 stuck in an uninterruptible state:
> ...
> [ 4461.908213] task:journal-offline state:D stack:    0 pid:17541
> ppid:     1 flags:0x00000226
> ...
>
>
> The jdb2 kernel thread also runs with a priority of 20[4].  When the
> hang happens, jbd2 is also stuck in an uninterruptible state(As well as
> systemd-journal):
> 1521 root      20   0       0      0      0 D   0.0   0.0   4:10.48
> jbd2/sda2-8
> 1593 root      19  -1   64692  15832  14512 D   0.0   0.1   0:01.54
> systemd-journal
>
>
>
> I don't yet know why running the test the same way for a generic kernel
> does not cause this behavior when it does for a preempt-rt kernel.
> Maybe it's a case of priority 'Sameness' and not priority inversion :-) ?
>
> I tried to pin all of the stress-ng threads to cores 1-95 and the kernel
> threads to a housekeeping cpu, 0. I recall though that there are certain
> kernel threads that need to run on every core and kworker is one of
> them.  Output from cmdline:
> "BOOT_IMAGE=/boot/vmlinuz-5.15.0-1033-realtime
> root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0
> console=ttyS0,115200 skew_tick=1 isolcpus=managed_irq,domain,1-95
> intel_pstate=disable nosoftlockup tsc=nowatchdog
> crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M"
>
> However, even with this pinning, stress-ng ends up running on cpu 0, per
> the ps output[4].  This may be why it is interfering with jbd2.
>
> I'll see if I can modify the test to run as a non-root user or with a
> lower priority.  I could also try bumping the priority of jdb2. Maybe
> one of these would allow the journal to complete it's work and the test
> to finish?
>
> Could it be that that the system is not hung, it is just waiting to
> complete I/O, which will never happen since the jdb2 threads are stuck.
> In this case, this is not a bug, but a test that is not configured
> correctly for a real-time system.  Does that sound plausible?  If you
> think that is the case, I'll talk with the bug reporter and assist them
> with running the test properly for a real-time system.

Have you tried checking for low-memory during the test?   Maybe the
system is unable to write because of slow page-cache allocations (see
https://www.socallinuxexpo.org/sites/default/files/presentations/Exploring%20%20Linux%20Memory%20Usage%20and%20%20Disk%20IO%20performance%20version%203.pdf)
or perhaps there is massive inter-NUMA-node rebalancing going on in
such a large system?      Turning on CONFIG_PSI is a relatively easy
way to monitor memory problems.   Also, have you tried connecting to
systemd-journald with GDB during the test, to see what it is doing?
Or tried calculating if the bandwidth to your storage devices is
simply maxed out?

-- Alison Chaiken
Aurora Innovation
achaiken@aurora.tech

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-03-13 15:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-15 18:39 System Hang With 5.15.79-rt54 Patch Set Joseph Salisbury
2022-12-15 23:39 ` Joseph Salisbury
2023-01-04 20:47   ` Joseph Salisbury
2023-01-18 18:52     ` Joseph Salisbury
2023-02-16 17:15       ` Sebastian Andrzej Siewior
2023-03-10 21:09         ` Joseph Salisbury
2023-03-13 15:11           ` Alison Chaiken
2023-02-28 16:50       ` Joseph Salisbury

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).