linux-rt-users.vger.kernel.org archive mirror
* 5.4.13-rt7 stall on CPU?
@ 2020-06-27 13:30 Udo van den Heuvel
  2020-07-03 19:49 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Udo van den Heuvel @ 2020-06-27 13:30 UTC (permalink / raw)
  To: RT

Hello,

Found this in /var/log/messages:

Jun 25 16:31:39 vuurmuur pppd[1522583]: local  LL address
fe80::ed36:3ac4:4115:e23e
Jun 25 16:31:39 vuurmuur pppd[1522583]: remote LL address
fe80::2a8a:1cff:fee0:9484
Jun 26 04:50:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt
self-detected stall on CPU
Jun 26 04:50:24 vuurmuur kernel: 002: rcu:      2-....: (5336 ticks this
GP) idle=f6a/1/0x4000000000000002 softirq=347363113/347363115 fqs=2430
Jun 26 04:50:24 vuurmuur kernel: 002:   (t=5250 jiffies g=608224341 q=1297)
Jun 26 04:50:24 vuurmuur kernel: 002: NMI backtrace for cpu 2
Jun 26 04:50:24 vuurmuur kernel: 002: CPU: 2 PID: 3468730 Comm: ntpd
Tainted: G           O      5.4.13-rt7 #9
Jun 26 04:50:24 vuurmuur kernel: 002: Hardware name: To Be Filled By
O.E.M. To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jun 26 04:50:24 vuurmuur kernel: 002: Call Trace:
Jun 26 04:50:24 vuurmuur kernel: 002:  <IRQ>
Jun 26 04:50:24 vuurmuur kernel: 002:  dump_stack+0x50/0x70
Jun 26 04:50:24 vuurmuur kernel: 002:  nmi_cpu_backtrace.cold+0x14/0x53
Jun 26 04:50:24 vuurmuur kernel: 002:  ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jun 26 04:50:24 vuurmuur kernel: 002:
nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jun 26 04:50:24 vuurmuur kernel: 002:  rcu_dump_cpu_stacks+0x8b/0xb9
Jun 26 04:50:24 vuurmuur kernel: 002:  rcu_sched_clock_irq.cold+0x17e/0x4fd
Jun 26 04:50:24 vuurmuur kernel: 002:  ? account_system_index_time+0xa6/0xd0
Jun 26 04:50:24 vuurmuur kernel: 002:  update_process_times+0x1f/0x50
Jun 26 04:50:24 vuurmuur kernel: 002:  tick_sched_timer+0x5a/0x1c0
Jun 26 04:50:24 vuurmuur kernel: 002:  ?
tick_switch_to_oneshot.cold+0x74/0x74
Jun 26 04:50:24 vuurmuur kernel: 002:  __hrtimer_run_queues+0xba/0x1b0
Jun 26 04:50:24 vuurmuur kernel: 002:  hrtimer_interrupt+0x108/0x230
Jun 26 04:50:24 vuurmuur kernel: 002:  smp_apic_timer_interrupt+0x61/0xa0
Jun 26 04:50:24 vuurmuur kernel: 002:  apic_timer_interrupt+0xf/0x20
Jun 26 04:50:24 vuurmuur kernel: 002:  </IRQ>
Jun 26 04:50:24 vuurmuur kernel: 002: RIP: 0010:__fget_light+0x3d/0x60
Jun 26 04:50:24 vuurmuur kernel: 002: Code: ca 75 2e 48 8b 50 50 8b 02
39 c7 73 21 89 f9 48 39 c1 48 19 c0 21 c7 48 8b 42 08 48 8d 04 f8 48 8b
00 48 85 c0 74 07 85 70 7c <75> 02 f3 c3 31 c0 c3 ba 01 00 00 00 e8 22
fe ff ff 48 85 c0 74 ee
Jun 26 04:50:24 vuurmuur kernel: 002: RSP: 0018:ffffc90001ebb930 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jun 26 04:50:24 vuurmuur kernel: 002: RAX: ffff888059007c00 RBX:
000000007fff0010 RCX: 000000000000001a
Jun 26 04:50:24 vuurmuur kernel: 002: RDX: ffff888136edb458 RSI:
0000000000004000 RDI: 000000000000001a
Jun 26 04:50:24 vuurmuur kernel: 002: RBP: ffffc90001ebbcd0 R08:
00000000000000db R09: ffffffff81639c01
Jun 26 04:50:24 vuurmuur kernel: 002: R10: 0000000000000001 R11:
0000000000000000 R12: 0000000004000000
Jun 26 04:50:24 vuurmuur kernel: 002: R13: 000000000000001a R14:
000000000000001f R15: 0000000000000000
Jun 26 04:50:24 vuurmuur kernel: 002:  ? sock_ioctl+0x381/0x440
Jun 26 04:50:24 vuurmuur kernel: 002:  do_select+0x350/0x7a0
Jun 26 04:50:24 vuurmuur kernel: 002:  ?
select_estimate_accuracy+0x100/0x100
Jun 26 04:50:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jun 26 04:50:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? put_prev_task_rt+0x22/0x140
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? put_prev_task_rt+0x22/0x140
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jun 26 04:50:24 vuurmuur kernel: 002:  core_sys_select+0x1d0/0x380
Jun 26 04:50:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jun 26 04:50:24 vuurmuur kernel: 002:  ? tty_ldisc_ref_wait+0x27/0x70
Jun 26 04:50:24 vuurmuur kernel: 002:  ? __ldsem_down_read_nested+0x5e/0x240
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jun 26 04:50:24 vuurmuur kernel: 002:  ? set_user_sigmask+0x62/0x90
Jun 26 04:50:24 vuurmuur kernel: 002:  __x64_sys_pselect6+0x141/0x190
Jun 26 04:50:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jun 26 04:50:24 vuurmuur kernel: 002:  ? sigprocmask+0x6d/0x90
Jun 26 04:50:24 vuurmuur kernel: 002:  do_syscall_64+0x77/0x440
Jun 26 04:50:24 vuurmuur kernel: 002:  ? schedule+0x3b/0xb0
Jun 26 04:50:24 vuurmuur kernel: 002:
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 26 04:50:24 vuurmuur kernel: 002: RIP: 0033:0x7fba5ae5a096
Jun 26 04:50:24 vuurmuur kernel: 002: Code: e8 9f dd f8 ff 4c 8b 0c 24
4c 8b 44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b
74 24 18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8
c8 dd f8 ff 8b 04 24 eb
Jun 26 04:50:24 vuurmuur kernel: 002: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jun 26 04:50:24 vuurmuur kernel: 002: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jun 26 04:50:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jun 26 04:50:24 vuurmuur kernel: 002: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jun 26 04:50:24 vuurmuur kernel: 002: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jun 26 04:50:24 vuurmuur kernel: 002: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jun 26 05:03:01 vuurmuur named[1433212]: received control channel
command 'flush'


What went wrong?
How bad is this?
How to avoid?

Kind regards,
Udo

* Re: 5.4.13-rt7 stall on CPU?
  2020-06-27 13:30 5.4.13-rt7 stall on CPU? Udo van den Heuvel
@ 2020-07-03 19:49 ` Sebastian Andrzej Siewior
  2020-07-04  3:43   ` Udo van den Heuvel
  0 siblings, 1 reply; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-07-03 19:49 UTC (permalink / raw)
  To: Udo van den Heuvel; +Cc: RT

On 2020-06-27 15:30:20 [+0200], Udo van den Heuvel wrote:
> Hello,
Hi,

> Found this in /var/log/messages:
> 
> Jun 25 16:31:39 vuurmuur pppd[1522583]: local  LL address fe80::ed36:3ac4:4115:e23e
> Jun 25 16:31:39 vuurmuur pppd[1522583]: remote LL address fe80::2a8a:1cff:fee0:9484
> Jun 26 04:50:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt self-detected stall on CPU
> Jun 26 04:50:24 vuurmuur kernel: 002: rcu:      2-....: (5336 ticks this GP) idle=f6a/1/0x4000000000000002 softirq=347363113/347363115 fqs=2430
> Jun 26 04:50:24 vuurmuur kernel: 002:   (t=5250 jiffies g=608224341 q=1297)
> Jun 26 04:50:24 vuurmuur kernel: 002: NMI backtrace for cpu 2
> Jun 26 04:50:24 vuurmuur kernel: 002: RIP: 0010:__fget_light+0x3d/0x60
> Jun 26 04:50:24 vuurmuur kernel: 002: Code: ca 75 2e 48 8b 50 50 8b 02 39 c7 73 21 89 f9 48 39 c1 48 19 c0 21 c7 48 8b 42 08 48 8d 04 f8 48 8b 00 48 85 c0 74 07 85 70 7c <75> 02 f3 c3 31 c0 c3 ba 01 00 00 00 e8 22 fe ff ff 48 85 c0 74 ee
> Jun 26 04:50:24 vuurmuur kernel: 002:  do_select+0x350/0x7a0
> Jun 26 04:50:24 vuurmuur kernel: 002:  core_sys_select+0x1d0/0x380
> Jun 26 04:50:24 vuurmuur kernel: 002:  __x64_sys_pselect6+0x141/0x190
> Jun 26 05:03:01 vuurmuur named[1433212]: received control channel command 'flush'
> 
> 
> What went wrong?

ntpd entered the kernel via pselect(). Inside that syscall it looped at
some point and RCU could not make any progress. Assuming you have
CONFIG_HZ=250, that means no progress for 5250/250 = 21 seconds. The
stall piled up 1297 callbacks. The situation later resolved itself,
since this "rcu_preempt self-detected stall" did not appear again.

> How bad is this?
Each callback would free a data structure, i.e. give memory back to the
system. Since ntpd caused an RCU stall, the system could not release
that memory. If the situation does not get resolved, you will eventually
run out of memory.

> How to avoid?
Can you reproduce this, or was it a one-time thing?
I *think* this happened within the loop in __fget_files(). That function
is inlined into __fget_light(), and its loop contains an RCU section, so
it would make sense.
Do you run anything at an elevated priority on the system? I don't know
what the other side was doing, but somehow one of the file descriptors
(probably a network socket) was about to be closed while this side
tried to poll() on it.

> Kind regards,
> Udo

Sebastian

* Re: 5.4.13-rt7 stall on CPU?
  2020-07-03 19:49 ` Sebastian Andrzej Siewior
@ 2020-07-04  3:43   ` Udo van den Heuvel
  2020-07-07 17:47     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Udo van den Heuvel @ 2020-07-04  3:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: RT

On 03-07-2020 21:49, Sebastian Andrzej Siewior wrote:
>> How to avoid?
> Can you reproduce this, or was it a one-time thing?

So far one time.

> I *think* this happened within the loop in __fget_files(). That function
> is inlined into __fget_light(), and its loop contains an RCU section, so
> it would make sense.
> Do you run anything at an elevated priority on the system?

ntpd (ntpsec) elevates itself.

> I don't know
> what the other side was doing, but somehow one of the file descriptors
> (probably a network socket) was about to be closed while this side
> tried to poll() on it.

Thanks for explaining.

Kind regards,
Udo

* Re: 5.4.13-rt7 stall on CPU?
  2020-07-04  3:43   ` Udo van den Heuvel
@ 2020-07-07 17:47     ` Sebastian Andrzej Siewior
  2020-07-20  9:21       ` Udo van den Heuvel
  0 siblings, 1 reply; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-07-07 17:47 UTC (permalink / raw)
  To: Udo van den Heuvel; +Cc: RT

On 2020-07-04 05:43:32 [+0200], Udo van den Heuvel wrote:
> > I *think* this happened within the loop in __fget_files(). That function
> > is inlined into __fget_light(), and its loop contains an RCU section, so
> > it would make sense.
> > Do you run anything at an elevated priority on the system?
> 
> ntpd (ntpsec) elevates itself.

To the highest possible. Wonderful.

> > I don't know
> > what the other side was doing, but somehow one of the file descriptors
> > (probably a network socket) was about to be closed while this side
> > tried to poll() on it.
> 
> Thanks for explaining.

I'm puzzled. On dup(), and likewise on replacement or close, the file
pointer in the table is replaced (with a new pointer whose reference
count has already been incremented, or with NULL) and only then is the
old file pointer's reference count decremented. So the zero count could
be observed once, but the retry should then see the new pointer. It
should not loop for seconds. Besides, dup2() is only used at startup.

> Kind regards,
> Udo

Sebastian

* Re: 5.4.13-rt7 stall on CPU?
  2020-07-07 17:47     ` Sebastian Andrzej Siewior
@ 2020-07-20  9:21       ` Udo van den Heuvel
  0 siblings, 0 replies; 5+ messages in thread
From: Udo van den Heuvel @ 2020-07-20  9:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: RT

Hello Sebastian, all,

I found another one of these in my logs:

Jul 19 05:33:05 vuurmuur named[2363288]: dispatch 0x7f1e3d7fa6c0:
shutting down due to TCP receive error: 2001:500:200::b#53: connection reset
Jul 20 02:20:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt
self-detected stall on CPU
Jul 20 02:20:24 vuurmuur kernel: 002: rcu:      2-....: (5250 ticks this
GP) idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372
Jul 20 02:20:24 vuurmuur kernel: 002:   (t=5250 jiffies g=701989125 q=336)
Jul 20 02:20:24 vuurmuur kernel: 002: NMI backtrace for cpu 2
Jul 20 02:20:24 vuurmuur kernel: 002: CPU: 2 PID: 3468730 Comm: ntpd
Tainted: G           O      5.4.13-rt7 #9
Jul 20 02:20:24 vuurmuur kernel: 002: Hardware name: To Be Filled By
O.E.M. To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jul 20 02:20:24 vuurmuur kernel: 002: Call Trace:
Jul 20 02:20:24 vuurmuur kernel: 002:  <IRQ>
Jul 20 02:20:24 vuurmuur kernel: 002:  dump_stack+0x50/0x70
Jul 20 02:20:24 vuurmuur kernel: 002:  nmi_cpu_backtrace.cold+0x14/0x53
Jul 20 02:20:24 vuurmuur kernel: 002:  ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jul 20 02:20:24 vuurmuur kernel: 002:
nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jul 20 02:20:24 vuurmuur kernel: 002:  rcu_dump_cpu_stacks+0x8b/0xb9
Jul 20 02:20:24 vuurmuur kernel: 002:  rcu_sched_clock_irq.cold+0x17e/0x4fd
Jul 20 02:20:24 vuurmuur kernel: 002:  ? account_system_index_time+0xa6/0xd0
Jul 20 02:20:24 vuurmuur kernel: 002:  update_process_times+0x1f/0x50
Jul 20 02:20:24 vuurmuur kernel: 002:  tick_sched_timer+0x5a/0x1c0
Jul 20 02:20:24 vuurmuur kernel: 002:  ?
tick_switch_to_oneshot.cold+0x74/0x74
Jul 20 02:20:24 vuurmuur kernel: 002:  __hrtimer_run_queues+0xba/0x1b0
Jul 20 02:20:24 vuurmuur kernel: 002:  hrtimer_interrupt+0x108/0x230
Jul 20 02:20:24 vuurmuur kernel: 002:  smp_apic_timer_interrupt+0x61/0xa0
Jul 20 02:20:24 vuurmuur kernel: 002:  apic_timer_interrupt+0xf/0x20
Jul 20 02:20:24 vuurmuur kernel: 002:  </IRQ>
Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0010:lock_release+0x114/0x1a0
Jul 20 02:20:24 vuurmuur kernel: 002: Code: 0a 50 6e 01 75 0e 44 2b 7c
24 08 44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07
00 00 00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00
75 73 48 8b 6c 24 18 4c
Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 0018:ffffc90001ebb8f0 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffff888054294140 RBX:
000000000000001f RCX: ffffc90001ebb8fc
Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
ffffffff8203bee0 RDI: ffff8880542948f0
Jul 20 02:20:24 vuurmuur kernel: 002: RBP: ffff888054294140 R08:
0000000000020019 R09: 0000000000000000
Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000001 R11:
0000000000000000 R12: ffffffff8203bee0
Jul 20 02:20:24 vuurmuur kernel: 002: R13: ffffffff8117b0a2 R14:
0000000000000246 R15: 0000000000000001
Jul 20 02:20:24 vuurmuur kernel: 002:  ? do_select+0x132/0x7a0
Jul 20 02:20:24 vuurmuur kernel: 002:  do_select+0x14f/0x7a0
Jul 20 02:20:24 vuurmuur kernel: 002:  ?
select_estimate_accuracy+0x100/0x100
Jul 20 02:20:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? put_prev_task_rt+0x22/0x140
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? preempt_schedule_irq+0x31/0x50
Jul 20 02:20:24 vuurmuur kernel: 002:  ? retint_kernel+0x1b/0x1b
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  core_sys_select+0x1d0/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  ? tty_ldisc_ref_wait+0x27/0x70
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __ldsem_down_read_nested+0x5e/0x240
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? set_user_sigmask+0x62/0x90
Jul 20 02:20:24 vuurmuur kernel: 002:  __x64_sys_pselect6+0x141/0x190
Jul 20 02:20:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: 002:  ? sigprocmask+0x6d/0x90
Jul 20 02:20:24 vuurmuur kernel: 002:  do_syscall_64+0x77/0x440
Jul 20 02:20:24 vuurmuur kernel: 002:  ? nmi_handle+0xc1/0xe0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? unregister_nmi_handler+0xb0/0xb0
Jul 20 02:20:24 vuurmuur kernel: 002:
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0033:0x7fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: 002: Code: e8 9f dd f8 ff 4c 8b 0c 24
4c 8b 44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b
74 24 18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8
c8 dd f8 ff 8b 04 24 eb
Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jul 20 02:20:24 vuurmuur kernel: 002: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jul 20 02:20:24 vuurmuur kernel: 002: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jul 20 02:20:24 vuurmuur kernel: rcu: INFO: rcu_preempt self-detected
stall on CPU
Jul 20 02:20:24 vuurmuur kernel: rcu:      2-....: (5250 ticks this GP)
idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372
Jul 20 02:20:24 vuurmuur kernel:   (t=5250 jiffies g=701989125 q=336)
Jul 20 02:20:24 vuurmuur kernel: NMI backtrace for cpu 2
Jul 20 02:20:24 vuurmuur kernel: CPU: 2 PID: 3468730 Comm: ntpd Tainted:
G           O      5.4.13-rt7 #9
Jul 20 02:20:24 vuurmuur kernel: Hardware name: To Be Filled By O.E.M.
To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jul 20 02:20:24 vuurmuur kernel: Call Trace:
Jul 20 02:20:24 vuurmuur kernel: <IRQ>
Jul 20 02:20:24 vuurmuur kernel: dump_stack+0x50/0x70
Jul 20 02:20:24 vuurmuur kernel: nmi_cpu_backtrace.cold+0x14/0x53
Jul 20 02:20:24 vuurmuur kernel: ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jul 20 02:20:24 vuurmuur kernel: nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jul 20 02:20:24 vuurmuur kernel: rcu_dump_cpu_stacks+0x8b/0xb9
Jul 20 02:20:24 vuurmuur kernel: rcu_sched_clock_irq.cold+0x17e/0x4fd
Jul 20 02:20:24 vuurmuur kernel: ? account_system_index_time+0xa6/0xd0
Jul 20 02:20:24 vuurmuur kernel: update_process_times+0x1f/0x50
Jul 20 02:20:24 vuurmuur kernel: tick_sched_timer+0x5a/0x1c0
Jul 20 02:20:24 vuurmuur kernel: ? tick_switch_to_oneshot.cold+0x74/0x74
Jul 20 02:20:24 vuurmuur kernel: __hrtimer_run_queues+0xba/0x1b0
Jul 20 02:20:24 vuurmuur kernel: hrtimer_interrupt+0x108/0x230
Jul 20 02:20:24 vuurmuur kernel: smp_apic_timer_interrupt+0x61/0xa0
Jul 20 02:20:24 vuurmuur kernel: apic_timer_interrupt+0xf/0x20
Jul 20 02:20:24 vuurmuur kernel: </IRQ>
Jul 20 02:20:24 vuurmuur kernel: RIP: 0010:lock_release+0x114/0x1a0
Jul 20 02:20:24 vuurmuur kernel: Code: 0a 50 6e 01 75 0e 44 2b 7c 24 08
44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07 00 00
00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 73
48 8b 6c 24 18 4c
Jul 20 02:20:24 vuurmuur kernel: RSP: 0018:ffffc90001ebb8f0 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jul 20 02:20:24 vuurmuur kernel: RAX: ffff888054294140 RBX:
000000000000001f RCX: ffffc90001ebb8fc
Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI:
ffffffff8203bee0 RDI: ffff8880542948f0
Jul 20 02:20:24 vuurmuur kernel: RBP: ffff888054294140 R08:
0000000000020019 R09: 0000000000000000
Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000001 R11:
0000000000000000 R12: ffffffff8203bee0
Jul 20 02:20:24 vuurmuur kernel: R13: ffffffff8117b0a2 R14:
0000000000000246 R15: 0000000000000001
Jul 20 02:20:24 vuurmuur kernel: ? do_select+0x132/0x7a0
Jul 20 02:20:24 vuurmuur kernel: do_select+0x14f/0x7a0
Jul 20 02:20:24 vuurmuur kernel: ? select_estimate_accuracy+0x100/0x100
Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? put_prev_task_rt+0x22/0x140
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: ? preempt_schedule_irq+0x31/0x50
Jul 20 02:20:24 vuurmuur kernel: ? retint_kernel+0x1b/0x1b
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: core_sys_select+0x1d0/0x380
Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: ? tty_ldisc_ref_wait+0x27/0x70
Jul 20 02:20:24 vuurmuur kernel: ? __ldsem_down_read_nested+0x5e/0x240
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? set_user_sigmask+0x62/0x90
Jul 20 02:20:24 vuurmuur kernel: __x64_sys_pselect6+0x141/0x190
Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: ? sigprocmask+0x6d/0x90
Jul 20 02:20:24 vuurmuur kernel: do_syscall_64+0x77/0x440
Jul 20 02:20:24 vuurmuur kernel: ? nmi_handle+0xc1/0xe0
Jul 20 02:20:24 vuurmuur kernel: ? unregister_nmi_handler+0xb0/0xb0
Jul 20 02:20:24 vuurmuur kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 20 02:20:24 vuurmuur kernel: RIP: 0033:0x7fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: Code: e8 9f dd f8 ff 4c 8b 0c 24 4c 8b
44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b 74 24
18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8 c8 dd
f8 ff 8b 04 24 eb
Jul 20 02:20:24 vuurmuur kernel: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jul 20 02:20:24 vuurmuur kernel: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jul 20 02:20:24 vuurmuur kernel: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jul 20 02:20:24 vuurmuur kernel: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jul 20 05:03:01 vuurmuur named[2363288]: received control channel
command 'flush'

Again it appears to concern ntpd.
Still on 5.4.13-rt7.

Should I boot into a newer kernel to verify the issue is still there?

Kind regards,
Udo

