Dear Linux folks,

When building a Linux kernel (for example 5.1.2) on a 128-thread AMD EPYC server with 126, 127, or 128 threads, the server *sometimes* becomes unusable, and logging in over the network is no longer possible. Only logging in over tty1 works, and the server needs to be rebooted.

```
[ 0.000000] Linux version 4.19.19.mx64.244 (root@theinternet.molgen.mpg.de) (gcc version 7.3.0 (GCC)) #1 SMP Tue Feb 5 13:01:13 CET 2019
[…]
[2418051.367223] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[2418051.367231] rcu: 30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=14323
[2418051.367235] rcu: 94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=14323
[2418051.367236] rcu: (detected by 2, t=60002 jiffies, g=298982765, q=7633949)
[2418051.367254] Sending NMI from CPU 2 to CPUs 30:
[2418061.370201] Sending NMI from CPU 2 to CPUs 94:
[2418071.372935] rcu: rcu_sched kthread starved for 20004 jiffies! g298982765 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=106
[2418071.372936] rcu: RCU grace-period kthread stack dump:
[2418071.372938] rcu_sched R running task 0 11 2 0x80000000
[2418071.372940] Call Trace:
[2418071.372947] ? _raw_spin_unlock_irqrestore+0xa/0x10
[2418071.372950] ? force_qs_rnp+0x11e/0x140
[2418071.372952] ? rcu_gp_kthread+0x62b/0xdf0
[2418071.372953] ? __schedule+0x1f8/0x7b0
[2418071.372955] ? rcu_gp_slow.isra.40.part.41+0x30/0x30
[2418071.372957] ? kthread+0x113/0x130
[2418071.372958] ? kthread_park+0x90/0x90
[2418071.372960] ? ret_from_fork+0x22/0x40
[2418231.372935] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[2418231.372943] rcu: 30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=52808
[2418231.372946] rcu: 94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=52808
[2418231.372947] rcu: (detected by 5, t=240007 jiffies, g=298982765, q=8914782)
[2418231.372959] Sending NMI from CPU 5 to CPUs 30:
[2418241.375808] Sending NMI from CPU 5 to CPUs 94:
[2418251.378374] rcu: rcu_sched kthread starved for 20002 jiffies! g298982765 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=60
[2418251.378376] rcu: RCU grace-period kthread stack dump:
[2418251.378378] rcu_sched R running task 0 11 2 0x80000000
[2418251.378381] Call Trace:
[2418251.378388] ? _raw_spin_unlock_irqrestore+0xa/0x10
[2418251.378392] ? force_qs_rnp+0x11e/0x140
[2418251.378393] ? rcu_gp_kthread+0x62b/0xdf0
[2418251.378395] ? __schedule+0x1f8/0x7b0
[2418251.378397] ? rcu_gp_slow.isra.40.part.41+0x30/0x30
[2418251.378399] ? kthread+0x113/0x130
[2418251.378400] ? kthread_park+0x90/0x90
[2418251.378402] ? ret_from_fork+0x22/0x40
[2418411.378841] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[2418411.378849] rcu: 30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=91376
[2418411.378852] rcu: 94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=91376
[2418411.378853] rcu: (detected by 3, t=420012 jiffies, g=298982765, q=10176682)
[2418411.378866] Sending NMI from CPU 3 to CPUs 30:
[2418421.381889] Sending NMI from CPU 3 to CPUs 94:
[2418431.384518] rcu: rcu_sched kthread starved for 20004 jiffies! g298982765 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=107
[2418431.384520] rcu: RCU grace-period kthread stack dump:
[2418431.384521] rcu_sched R running task 0 11 2 0x80000000
[2418431.384523] Call Trace:
[2418431.384530] ? _raw_spin_unlock_irqrestore+0xa/0x10
[2418431.384533] ? force_qs_rnp+0x11e/0x140
[2418431.384535] ? rcu_gp_kthread+0x62b/0xdf0
[2418431.384537] ? __schedule+0x1f8/0x7b0
[2418431.384538] ? rcu_gp_slow.isra.40.part.41+0x30/0x30
[2418431.384540] ? kthread+0x113/0x130
[2418431.384541] ? kthread_park+0x90/0x90
[2418431.384543] ? ret_from_fork+0x22/0x40
[…]
```

Do you see anything in the attached logs that could cause this?

Kind regards,

Paul