linux-kernel.vger.kernel.org archive mirror
* [rcu_sched stall] regression/miss-config ?
@ 2016-05-15 21:18 Santosh Shilimkar
  2016-05-16  4:35 ` santosh.shilimkar
  0 siblings, 1 reply; 11+ messages in thread
From: Santosh Shilimkar @ 2016-05-15 21:18 UTC (permalink / raw)
  To: paulmck, linux-kernel; +Cc: Sasha Levin

Hi Paul,

I was asking Sasha about [1] since other folks at Oracle have
also stumbled upon similar RCU stalls with the v4.1 kernel in
different workloads. A similar issue with RDS was reported to
me as well, and after looking at [1], [2], [3] and [4], I
thought of reaching out to see if you can help us understand
this issue better.

I have also included the RCU-specific config [5] used in these
tests. The issue is very hard to reproduce, but one data point
is that it reproduces on systems with a larger number of CPUs
(64+). The same workload on fewer than 64 CPUs does not show
the issue. I was also told that using the SLAB allocator
instead of SLUB makes a difference, but I haven't verified
that for RDS.

Let me know your thoughts. Thanks in advance !!

Regards,
Santosh

[1] https://lkml.org/lkml/2014/12/14/304
[2] log 1: http://pastebin.uk.oracle.com/iUr9qE
[3] log 2: http://pastebin.uk.oracle.com/Oe3cr5
[4] log 3: http://pastebin.uk.oracle.com/bMYLkD
[5] rcu config: http://pastebin.uk.oracle.com/e7NXTW

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-15 21:18 [rcu_sched stall] regression/miss-config ? Santosh Shilimkar
@ 2016-05-16  4:35 ` santosh.shilimkar
  2016-05-16 12:03   ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: santosh.shilimkar @ 2016-05-16  4:35 UTC (permalink / raw)
  To: paulmck, linux-kernel; +Cc: Sasha Levin

On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> Hi Paul,
>
> I was asking Sasha about [1] since other folks at Oracle have
> also stumbled upon similar RCU stalls with the v4.1 kernel in
> different workloads. A similar issue with RDS was reported to
> me as well, and after looking at [1], [2], [3] and [4], I
> thought of reaching out to see if you can help us understand
> this issue better.
>
> I have also included the RCU-specific config [5] used in these
> tests. The issue is very hard to reproduce, but one data point
> is that it reproduces on systems with a larger number of CPUs
> (64+). The same workload on fewer than 64 CPUs does not show
> the issue. I was also told that using the SLAB allocator
> instead of SLUB makes a difference, but I haven't verified
> that for RDS.
>
> Let me know your thoughts. Thanks in advance !!
>
One of my colleagues told me that the pastebin server I used
is Oracle-internal only, so I am adding the relevant logs
inline with this email.

Regards,
Santosh
>
> [1] https://lkml.org/lkml/2014/12/14/304


[2]  Log 1 snippet:
-----------------------------------------------------------------
  INFO: rcu_sched self-detected stall on CPU
  INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies 
g=66023 c=66022 q=0)
  Task dump for CPU 54:
  ksoftirqd/54    R  running task        0   389      2 0x00000008
   0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
   ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
   ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
  Call Trace:
   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
   [<ffffffff810e9772>] update_process_times+0x42/0x70
   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
   <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
   [<ffffffff8118e775>] __free_pages+0x25/0x40
   [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
   [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
   [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
   [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
   [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
   [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
   [<ffffffff81084b31>] tasklet_action+0xb1/0xc0
   [<ffffffff810854a2>] __do_softirq+0xf2/0x2f0
   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
   [<ffffffff810856cd>] run_ksoftirqd+0x2d/0x50
   [<ffffffff810a4a46>] smpboot_thread_fn+0x116/0x170
   [<ffffffff810a11ae>] kthread+0xce/0xf0
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
   [<ffffffff816c1b62>] ret_from_fork+0x42/0x70
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
  Task dump for CPU 66:
  ksoftirqd/66    R  running task        0   474      2 0x00000008
   04208040664afe58 ffff88ff664ac008 000000010005d97a 0000000a39b2e200
   0000000000000018 0000000000000006 ffff88ff664a1c00 ffff88ff669a5260
   ffff88ff664a1c00 ffffffff81aa4ec0 ffffffff810a4930 0000000000000000
  Call Trace:
   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
   [<ffffffff810856cd>] ? run_ksoftirqd+0x2d/0x50
   [<ffffffff810a4a46>] ? smpboot_thread_fn+0x116/0x170
   [<ffffffff810a11ae>] ? kthread+0xce/0xf0
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
   [<ffffffff816c1b62>] ? ret_from_fork+0x42/0x70
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
   { 66}  (t=60011 jiffies g=66023 c=66022 q=0)
  Task dump for CPU 54:
  ksoftirqd/54    R  running task        0   389      2 0x00000008
   04208040669e3e58 ffff88ff669e0008 000000010005d981 0000000a39b28e00
   0000000000000000 0000000000000006 ffff88ff669d1c00 ffff88ff669a5020
   ffff88ff669d1c00 ffffffff81aa4ec0 ffffffff810a4930 0000000000000000
  Call Trace:
   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
   [<ffffffff810856cd>] ? run_ksoftirqd+0x2d/0x50
   [<ffffffff810a4a46>] ? smpboot_thread_fn+0x116/0x170
   [<ffffffff810a11ae>] ? kthread+0xce/0xf0
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
   [<ffffffff816c1b62>] ? ret_from_fork+0x42/0x70
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
  Task dump for CPU 66:
  ksoftirqd/66    R  running task        0   474      2 0x00000008
   0000000000000003 ffff88ff7f703d38 ffffffff810a8621 0000000000000042
   ffffffff81ab6540 ffff88ff7f703d58 ffffffff810a86cf 0000000000000086
   ffffffff81ab6a40 ffff88ff7f703d88 ffffffff810e3ad3 ffffffff81ab6540
  Call Trace:
   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
   [<ffffffff810e9772>] update_process_times+0x42/0x70
   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
   <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
   [<ffffffff8118e775>] __free_pages+0x25/0x40
   [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
   [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
   [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
   [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
   [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
   [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
   [<ffffffff81084b31>] tasklet_action+0xb1/0xc0
   [<ffffffff810854a2>] __do_softirq+0xf2/0x2f0
   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
   [<ffffffff810856cd>] run_ksoftirqd+0x2d/0x50
   [<ffffffff810a4a46>] smpboot_thread_fn+0x116/0x170
   [<ffffffff810a11ae>] kthread+0xce/0xf0
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
   [<ffffffff816c1b62>] ret_from_fork+0x42/0x70
   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
-----------------------------------------------------------------

[3] log 2 snippet
-------------------------------------------------------------------
  INFO: rcu_sched self-detected stall on CPU
  INFO: rcu_sched detected stalls on CPUs/tasks: { 77} (detected by 207, 
t=60002 jiffies, g=40822, c=40821, q=0)
  Task dump for CPU 77:
  rds-stress      R  running task        0  2930   2867 0x0000008a
   0000000000000018 0000000000000286 ffff893e00000000 ffffffff810ec828
   ffff883f5628d160 0000000000000282 ffff89407ffda440 00000000f4d139b8
   0000000100000000 ffff89407ffdae08 0000000000000000 000000004dcc9f40
  Call Trace:
   [<ffffffff810ec828>] ? hrtimer_try_to_cancel+0x48/0xe0
   [<ffffffff8118d7b3>] ? __alloc_pages_nodemask+0x193/0x2b0
   [<ffffffff810c52ac>] ? remove_wait_queue+0x3c/0x50
   [<ffffffff811d747f>] ? alloc_pages_current+0xaf/0x170
   [<ffffffffa2135298>] ? rds_page_remainder_alloc+0x48/0x2a0 [rds]
   [<ffffffffa212f06d>] ? rds_message_copy_from_user+0x6d/0x130 [rds]
   [<ffffffffa2133040>] ? rds_sendmsg+0x470/0x990 [rds]
   [<ffffffff811ffa00>] ? rw_copy_check_uvector+0xa0/0x130
   [<ffffffff815c9ffd>] ? sock_sendmsg+0x4d/0x60
   [<ffffffff815cc93a>] ? ___sys_sendmsg+0x30a/0x330
   [<ffffffff811b77fc>] ? handle_pte_fault+0x20c/0x230
   [<ffffffff815ccb69>] ? __sys_sendmsg+0x49/0x90
   [<ffffffff81023611>] ? syscall_trace_leave+0xf1/0x140
   [<ffffffff815ccbc9>] ? SyS_sendmsg+0x19/0x20
   [<ffffffff816c176e>] ? system_call_fastpath+0x12/0x71
   { 77}  (t=60011 jiffies g=40822 c=40821 q=0)
  Task dump for CPU 77:
  rds-stress      R  running task        0  2930   2867 0x0000008a
   000000000000000e ffff893f7f5439c0 ffffffff810a8621 000000000000004d
   ffffffff81ab6540 ffff893f7f5439e0 ffffffff810a86cf 0000000000000082
   ffffffff81ab6a40 ffff893f7f543a10 ffffffff810e3ad3 ffffffff81ab6540
  Call Trace:
   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
   [<ffffffff810e9772>] update_process_times+0x42/0x70
   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
   [<ffffffff811df679>] ? __free_slab+0xd9/0x1b0
   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
   [<ffffffff8118db64>] ? free_one_page+0x164/0x380
   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
   [<ffffffff8118e775>] __free_pages+0x25/0x40
   [<ffffffffa212f4f0>] rds_message_purge+0x60/0x150 [rds]
   [<ffffffffa212f624>] rds_message_put+0x44/0x80 [rds]
   [<ffffffffa217d5b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
   [<ffffffffa18fe273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
   [<ffffffffa21766f1>] poll_cq+0xa1/0xe0 [rds_rdma]
   [<ffffffffa2177489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
   [<ffffffff81084b31>] tasklet_action+0xb1/0xc0
   [<ffffffff810854a2>] __do_softirq+0xf2/0x2f0
   [<ffffffff810594f4>] ? native_apic_msr_eoi_write+0x14/0x20
   [<ffffffff81085815>] irq_exit+0x125/0x130
   [<ffffffff816c45f5>] do_IRQ+0x65/0x110
   [<ffffffff816c22ae>] common_interrupt+0x6e/0x6e
   <EOI>  [<ffffffff8118bf28>] ? get_page_from_freelist+0x2b8/0x750
   [<ffffffff810ec828>] ? hrtimer_try_to_cancel+0x48/0xe0
   [<ffffffff8118d7b3>] __alloc_pages_nodemask+0x193/0x2b0
   [<ffffffff810c52ac>] ? remove_wait_queue+0x3c/0x50
   [<ffffffff811d747f>] alloc_pages_current+0xaf/0x170
   [<ffffffffa2135298>] rds_page_remainder_alloc+0x48/0x2a0 [rds]
   [<ffffffffa212f06d>] rds_message_copy_from_user+0x6d/0x130 [rds]
   [<ffffffffa2133040>] rds_sendmsg+0x470/0x990 [rds]
   [<ffffffff811ffa00>] ? rw_copy_check_uvector+0xa0/0x130
   [<ffffffff815c9ffd>] sock_sendmsg+0x4d/0x60
   [<ffffffff815cc93a>] ___sys_sendmsg+0x30a/0x330
   [<ffffffff811b77fc>] ? handle_pte_fault+0x20c/0x230
   [<ffffffff815ccb69>] __sys_sendmsg+0x49/0x90
   [<ffffffff81023611>] ? syscall_trace_leave+0xf1/0x140
   [<ffffffff815ccbc9>] SyS_sendmsg+0x19/0x20
   [<ffffffff816c176e>] system_call_fastpath+0x12/0x71
------------------------------------------------------

[4] LOG 3:
------------------------------------------------------
  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
0, t=240007 jiffies, g=449043, c=449042, q=0)
  All QSes seen, last rcu_sched kthread activity 240007
(4299336825-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
  ora_lms0_orcltw R  running task        0 22303      1
0x00000080
   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
ffffffff810b038f
   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
ffffffff810e8f96
   0000000000000000 ffffffff810b4deb ffff88010e817800
ffff88010e8185c0
  Call Trace:
   <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
   [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
   [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
   [<ffffffff810ee582>] update_process_times+0x42/0x70
   [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
   [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
   [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
   [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
   [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
   [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
   [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
   [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
   [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
   [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
   [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
   [<ffffffff81085bba>] ? __do_softirq+0x18a/0x2d0
   [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
   [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
   [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
   <EOI>
  rcu_sched kthread starved for 240007 jiffies!
  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
0, t=420012 jiffies, g=449043, c=449042, q=0)
  All QSes seen, last rcu_sched kthread activity 420012
(4299516830-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
  ora_lms0_orcltw R  running task        0 22303      1
0x00000080
   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
ffffffff810b038f
   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
ffffffff810e8f96
   0000000000000000 ffff88010e817800 0000000000017800
ffff88010e8185c0
  Call Trace:
   <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
   [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
   [<ffffffff810ee582>] update_process_times+0x42/0x70
   [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
   [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
   [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
   [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
   [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
   [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
   [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
   [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
   [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
   [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
   [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
   [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
   [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
   [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
   <EOI>  [<ffffffff81024847>] ?
do_audit_syscall_entry+0x67/0x70
   [<ffffffff81025eb3>] syscall_trace_enter_phase1+0x143/0x1a0
   [<ffffffff8112ac86>] ? __audit_syscall_exit+0x1e6/0x280
   [<ffffffff81026216>] ? syscall_trace_leave+0xc6/0x120
   [<ffffffff8170d69a>] tracesys+0xd/0x44
  rcu_sched kthread starved for 420012 jiffies!
  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
0, t=600017 jiffies, g=449043, c=449042, q=0)
  All QSes seen, last rcu_sched kthread activity 600017
(4299696835-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
  ora_lms0_orcltw R  running task        0 22303      1
0x00000080
   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
ffffffff810b038f
   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
ffffffff810e8f96
   0000000000000000 ffffffff810b4deb ffff88010e817800
ffff88010e8185c0
Call Trace:
  <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
  [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
  [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
  [<ffffffff810ee582>] update_process_times+0x42/0x70
  [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
  [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
  [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
  [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
  [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
  [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
  [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
  [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
  [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
  [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
  [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
  [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
  [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
  [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
  <EOI>
rcu_sched kthread starved for 600017 jiffies!
------------------------------------------------------

[5] RCU config
-----------------------------------------------
# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_USER_QS=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_KTHREAD_PRIO=0
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
# CONFIG_RCU_EXPEDITE_BOOT is not set
# RCU Debugging
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
--------------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-16  4:35 ` santosh.shilimkar
@ 2016-05-16 12:03   ` Paul E. McKenney
  2016-05-16 16:33     ` Santosh Shilimkar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2016-05-16 12:03 UTC (permalink / raw)
  To: santosh.shilimkar; +Cc: linux-kernel, Sasha Levin

On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
> On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> >Hi Paul,
> >
> >I was asking Sasha about [1] since other folks at Oracle have
> >also stumbled upon similar RCU stalls with the v4.1 kernel in
> >different workloads. A similar issue with RDS was reported to
> >me as well, and after looking at [1], [2], [3] and [4], I
> >thought of reaching out to see if you can help us understand
> >this issue better.
> >
> >I have also included the RCU-specific config [5] used in these
> >tests. The issue is very hard to reproduce, but one data point
> >is that it reproduces on systems with a larger number of CPUs
> >(64+). The same workload on fewer than 64 CPUs does not show
> >the issue. I was also told that using the SLAB allocator
> >instead of SLUB makes a difference, but I haven't verified
> >that for RDS.
> >
> >Let me know your thoughts. Thanks in advance !!
> >
> One of my colleagues told me that the pastebin server I used
> is Oracle-internal only, so I am adding the relevant logs
> inline with this email.
> 
> Regards,
> Santosh
> >
> >[1] https://lkml.org/lkml/2014/12/14/304
> 
> 
> [2]  Log 1 snippet:
> -----------------------------------------------------------------
>  INFO: rcu_sched self-detected stall on CPU
>  INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
> g=66023 c=66022 q=0)
>  Task dump for CPU 54:
>  ksoftirqd/54    R  running task        0   389      2 0x00000008
>   0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
>   ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
>   ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
>  Call Trace:
>   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
>   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
>   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
>   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
>   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
>   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
>   [<ffffffff810e9772>] update_process_times+0x42/0x70
>   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
>   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
>   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
>   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
>   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
>   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
>   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
>   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
>   <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
>   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
>   [<ffffffff8118e775>] __free_pages+0x25/0x40
>   [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
>   [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
>   [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
>   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
>   [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
>   [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
>   [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]

The most likely possibility is that there is a 60-second-long loop in
one of the above functions.  This is within bottom-half execution, so
unfortunately the usual trick of placing cond_resched_rcu_qs() within this
loop, but outside of any RCU read-side critical section does not work.

Therefore, if there really is a loop here, one fix would be to
periodically unwind back out to run_ksoftirqd(), but setting up so that
the work would be continued later.  Another fix might be to move this
from tasklet context to workqueue context, where cond_resched_rcu_qs()
can be used -- however, this looks a bit like networking code, which
does not always take kindly to being run in process context (though
careful use of local_bh_disable() and local_bh_enable() can sometimes
overcome this issue).  A third fix, which works only if this code does
not use RCU and does not invoke any code that does use RCU, is to tell
RCU that it should ignore this code (which will require a little work
on RCU, as it currently does not tolerate this sort of thing aside from
the idle threads).  In this last approach, event-tracing calls must use
the _nonidle suffix.
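
To make the workqueue option concrete, here is a minimal sketch of
pushing the CQE handling into a work item; the structure, the batch
poll helper, and the rds_ib_* names below are illustrative assumptions,
not the actual RDS code:

------------------------------------------------------------------
/* Hypothetical sketch only -- names, fields, and helpers are made up. */
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/bottom_half.h>
#include <linux/rcupdate.h>

struct rds_ib_conn_ctx {
	struct work_struct send_work;
	/* ... completion-queue state ... */
};

/* Poll a bounded batch of send CQEs; return true if more remain. */
static bool rds_ib_poll_send_cq_batch(struct rds_ib_conn_ctx *ic)
{
	/* Would call ib_poll_cq() and the send-completion handler here. */
	return false;
}

static void rds_ib_send_cqe_work(struct work_struct *work)
{
	struct rds_ib_conn_ctx *ic =
		container_of(work, struct rds_ib_conn_ctx, send_work);

	for (;;) {
		bool more;

		local_bh_disable();	/* networking code often expects BH context */
		more = rds_ib_poll_send_cq_batch(ic);
		local_bh_enable();

		if (!more)
			break;

		/* Outside local_bh_disable() and any RCU read-side
		 * critical section, so RCU sees a quiescent state here. */
		cond_resched_rcu_qs();
	}
}
------------------------------------------------------------------

The completion interrupt would then queue_work() this handler rather
than scheduling the tasklet.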

I am not familiar with the RDS code, so I cannot be more specific.
That said, given the copyrights on those files, I bet you have a
colleague or two who understands the code quite deeply.  ;-)

							Thanx, Paul

>   [<ffffffff81084b31>] tasklet_action+0xb1/0xc0
>   [<ffffffff810854a2>] __do_softirq+0xf2/0x2f0
>   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
>   [<ffffffff810856cd>] run_ksoftirqd+0x2d/0x50
>   [<ffffffff810a4a46>] smpboot_thread_fn+0x116/0x170
>   [<ffffffff810a11ae>] kthread+0xce/0xf0
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>   [<ffffffff816c1b62>] ret_from_fork+0x42/0x70
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>  Task dump for CPU 66:
>  ksoftirqd/66    R  running task        0   474      2 0x00000008
>   04208040664afe58 ffff88ff664ac008 000000010005d97a 0000000a39b2e200
>   0000000000000018 0000000000000006 ffff88ff664a1c00 ffff88ff669a5260
>   ffff88ff664a1c00 ffffffff81aa4ec0 ffffffff810a4930 0000000000000000
>  Call Trace:
>   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
>   [<ffffffff810856cd>] ? run_ksoftirqd+0x2d/0x50
>   [<ffffffff810a4a46>] ? smpboot_thread_fn+0x116/0x170
>   [<ffffffff810a11ae>] ? kthread+0xce/0xf0
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>   [<ffffffff816c1b62>] ? ret_from_fork+0x42/0x70
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>   { 66}  (t=60011 jiffies g=66023 c=66022 q=0)
>  Task dump for CPU 54:
>  ksoftirqd/54    R  running task        0   389      2 0x00000008
>   04208040669e3e58 ffff88ff669e0008 000000010005d981 0000000a39b28e00
>   0000000000000000 0000000000000006 ffff88ff669d1c00 ffff88ff669a5020
>   ffff88ff669d1c00 ffffffff81aa4ec0 ffffffff810a4930 0000000000000000
>  Call Trace:
>   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
>   [<ffffffff810856cd>] ? run_ksoftirqd+0x2d/0x50
>   [<ffffffff810a4a46>] ? smpboot_thread_fn+0x116/0x170
>   [<ffffffff810a11ae>] ? kthread+0xce/0xf0
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>   [<ffffffff816c1b62>] ? ret_from_fork+0x42/0x70
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>  Task dump for CPU 66:
>  ksoftirqd/66    R  running task        0   474      2 0x00000008
>   0000000000000003 ffff88ff7f703d38 ffffffff810a8621 0000000000000042
>   ffffffff81ab6540 ffff88ff7f703d58 ffffffff810a86cf 0000000000000086
>   ffffffff81ab6a40 ffff88ff7f703d88 ffffffff810e3ad3 ffffffff81ab6540
>  Call Trace:
>   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
>   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
>   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
>   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
>   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
>   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
>   [<ffffffff810e9772>] update_process_times+0x42/0x70
>   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
>   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
>   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
>   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
>   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
>   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
>   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
>   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
>   <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
>   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
>   [<ffffffff8118e775>] __free_pages+0x25/0x40
>   [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
>   [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
>   [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
>   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
>   [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
>   [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
>   [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
>   [<ffffffff81084b31>] tasklet_action+0xb1/0xc0
>   [<ffffffff810854a2>] __do_softirq+0xf2/0x2f0
>   [<ffffffff810a4930>] ? smpboot_create_threads+0x80/0x80
>   [<ffffffff810856cd>] run_ksoftirqd+0x2d/0x50
>   [<ffffffff810a4a46>] smpboot_thread_fn+0x116/0x170
>   [<ffffffff810a11ae>] kthread+0xce/0xf0
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
>   [<ffffffff816c1b62>] ret_from_fork+0x42/0x70
>   [<ffffffff810a10e0>] ? kthread_freezable_should_stop+0x70/0x70
> -----------------------------------------------------------------
> 
> [3] log 2 snippet
> -------------------------------------------------------------------
>  INFO: rcu_sched self-detected stall on CPU
>  INFO: rcu_sched detected stalls on CPUs/tasks: { 77} (detected by
> 207, t=60002 jiffies, g=40822, c=40821, q=0)
>  Task dump for CPU 77:
>  rds-stress      R  running task        0  2930   2867 0x0000008a
>   0000000000000018 0000000000000286 ffff893e00000000 ffffffff810ec828
>   ffff883f5628d160 0000000000000282 ffff89407ffda440 00000000f4d139b8
>   0000000100000000 ffff89407ffdae08 0000000000000000 000000004dcc9f40
>  Call Trace:
>   [<ffffffff810ec828>] ? hrtimer_try_to_cancel+0x48/0xe0
>   [<ffffffff8118d7b3>] ? __alloc_pages_nodemask+0x193/0x2b0
>   [<ffffffff810c52ac>] ? remove_wait_queue+0x3c/0x50
>   [<ffffffff811d747f>] ? alloc_pages_current+0xaf/0x170
>   [<ffffffffa2135298>] ? rds_page_remainder_alloc+0x48/0x2a0 [rds]
>   [<ffffffffa212f06d>] ? rds_message_copy_from_user+0x6d/0x130 [rds]
>   [<ffffffffa2133040>] ? rds_sendmsg+0x470/0x990 [rds]
>   [<ffffffff811ffa00>] ? rw_copy_check_uvector+0xa0/0x130
>   [<ffffffff815c9ffd>] ? sock_sendmsg+0x4d/0x60
>   [<ffffffff815cc93a>] ? ___sys_sendmsg+0x30a/0x330
>   [<ffffffff811b77fc>] ? handle_pte_fault+0x20c/0x230
>   [<ffffffff815ccb69>] ? __sys_sendmsg+0x49/0x90
>   [<ffffffff81023611>] ? syscall_trace_leave+0xf1/0x140
>   [<ffffffff815ccbc9>] ? SyS_sendmsg+0x19/0x20
>   [<ffffffff816c176e>] ? system_call_fastpath+0x12/0x71
>   { 77}  (t=60011 jiffies g=40822 c=40821 q=0)
>  Task dump for CPU 77:
>  rds-stress      R  running task        0  2930   2867 0x0000008a
>   000000000000000e ffff893f7f5439c0 ffffffff810a8621 000000000000004d
>   ffffffff81ab6540 ffff893f7f5439e0 ffffffff810a86cf 0000000000000082
>   ffffffff81ab6a40 ffff893f7f543a10 ffffffff810e3ad3 ffffffff81ab6540
>  Call Trace:
>   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
>   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
>   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
>   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
>   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
>   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
>   [<ffffffff810e9772>] update_process_times+0x42/0x70
>   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
>   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
>   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
>   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
>   [<ffffffff811df679>] ? __free_slab+0xd9/0x1b0
>   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
>   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
>   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
>   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
>   [<ffffffff8118db64>] ? free_one_page+0x164/0x380
>   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
>   [<ffffffff8118e775>] __free_pages+0x25/0x40
>   [<ffffffffa212f4f0>] rds_message_purge+0x60/0x150 [rds]
>   [<ffffffffa212f624>] rds_message_put+0x44/0x80 [rds]
>   [<ffffffffa217d5b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
>   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
>   [<ffffffffa18fe273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
>   [<ffffffffa21766f1>] poll_cq+0xa1/0xe0 [rds_rdma]
>   [<ffffffffa2177489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
>   [<ffffffff81084b31>] tasklet_action+0xb1/0xc0
>   [<ffffffff810854a2>] __do_softirq+0xf2/0x2f0
>   [<ffffffff810594f4>] ? native_apic_msr_eoi_write+0x14/0x20
>   [<ffffffff81085815>] irq_exit+0x125/0x130
>   [<ffffffff816c45f5>] do_IRQ+0x65/0x110
>   [<ffffffff816c22ae>] common_interrupt+0x6e/0x6e
>   <EOI>  [<ffffffff8118bf28>] ? get_page_from_freelist+0x2b8/0x750
>   [<ffffffff810ec828>] ? hrtimer_try_to_cancel+0x48/0xe0
>   [<ffffffff8118d7b3>] __alloc_pages_nodemask+0x193/0x2b0
>   [<ffffffff810c52ac>] ? remove_wait_queue+0x3c/0x50
>   [<ffffffff811d747f>] alloc_pages_current+0xaf/0x170
>   [<ffffffffa2135298>] rds_page_remainder_alloc+0x48/0x2a0 [rds]
>   [<ffffffffa212f06d>] rds_message_copy_from_user+0x6d/0x130 [rds]
>   [<ffffffffa2133040>] rds_sendmsg+0x470/0x990 [rds]
>   [<ffffffff811ffa00>] ? rw_copy_check_uvector+0xa0/0x130
>   [<ffffffff815c9ffd>] sock_sendmsg+0x4d/0x60
>   [<ffffffff815cc93a>] ___sys_sendmsg+0x30a/0x330
>   [<ffffffff811b77fc>] ? handle_pte_fault+0x20c/0x230
>   [<ffffffff815ccb69>] __sys_sendmsg+0x49/0x90
>   [<ffffffff81023611>] ? syscall_trace_leave+0xf1/0x140
>   [<ffffffff815ccbc9>] SyS_sendmsg+0x19/0x20
>   [<ffffffff816c176e>] system_call_fastpath+0x12/0x71
> ------------------------------------------------------
> 
> [4] LOG 3:
> ------------------------------------------------------
>  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
> 0, t=240007 jiffies, g=449043, c=449042, q=0)
>  All QSes seen, last rcu_sched kthread activity 240007
> (4299336825-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
>  ora_lms0_orcltw R  running task        0 22303      1
> 0x00000080
>   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
> ffffffff810b038f
>   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
> ffffffff810e8f96
>   0000000000000000 ffffffff810b4deb ffff88010e817800
> ffff88010e8185c0
>  Call Trace:
>   <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
>   [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
>   [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
>   [<ffffffff810ee582>] update_process_times+0x42/0x70
>   [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
>   [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
>   [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
>   [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
>   [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
>   [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
>   [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
>   [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
>   [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
>   [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
>   [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
>   [<ffffffff81085bba>] ? __do_softirq+0x18a/0x2d0
>   [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
>   [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
>   [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
>   <EOI>
>  rcu_sched kthread starved for 240007 jiffies!
>  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
> 0, t=420012 jiffies, g=449043, c=449042, q=0)
>  All QSes seen, last rcu_sched kthread activity 420012
> (4299516830-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
>  ora_lms0_orcltw R  running task        0 22303      1
> 0x00000080
>   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
> ffffffff810b038f
>   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
> ffffffff810e8f96
>   0000000000000000 ffff88010e817800 0000000000017800
> ffff88010e8185c0
>  Call Trace:
>   <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
>   [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
>   [<ffffffff810ee582>] update_process_times+0x42/0x70
>   [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
>   [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
>   [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
>   [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
>   [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
>   [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
>   [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
>   [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
>   [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
>   [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
>   [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
>   [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
>   [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
>   [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
>   <EOI>  [<ffffffff81024847>] ?
> do_audit_syscall_entry+0x67/0x70
>   [<ffffffff81025eb3>] syscall_trace_enter_phase1+0x143/0x1a0
>   [<ffffffff8112ac86>] ? __audit_syscall_exit+0x1e6/0x280
>   [<ffffffff81026216>] ? syscall_trace_leave+0xc6/0x120
>   [<ffffffff8170d69a>] tracesys+0xd/0x44
>  rcu_sched kthread starved for 420012 jiffies!
>  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
> 0, t=600017 jiffies, g=449043, c=449042, q=0)
>  All QSes seen, last rcu_sched kthread activity 600017
> (4299696835-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
>  ora_lms0_orcltw R  running task        0 22303      1
> 0x00000080
>   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
> ffffffff810b038f
>   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
> ffffffff810e8f96
>   0000000000000000 ffffffff810b4deb ffff88010e817800
> ffff88010e8185c0
> Call Trace:
>  <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
>  [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
>  [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
>  [<ffffffff810ee582>] update_process_times+0x42/0x70
>  [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
>  [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
>  [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
>  [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
>  [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
>  [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
>  [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
>  [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
>  [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
>  [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
>  [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
>  [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
>  [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
>  [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
>  <EOI>
> rcu_sched kthread starved for 600017 jiffies!
> ------------------------------------------------------
> 
> [5] RCU config
> -----------------------------------------------
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> CONFIG_SRCU=y
> # CONFIG_TASKS_RCU is not set
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_USER_QS=y
> CONFIG_RCU_FANOUT=64
> CONFIG_RCU_FANOUT_LEAF=16
> # CONFIG_RCU_FANOUT_EXACT is not set
> # CONFIG_RCU_FAST_NO_HZ is not set
> # CONFIG_TREE_RCU_TRACE is not set
> CONFIG_RCU_KTHREAD_PRIO=0
> CONFIG_RCU_NOCB_CPU=y
> # CONFIG_RCU_NOCB_CPU_NONE is not set
> # CONFIG_RCU_NOCB_CPU_ZERO is not set
> CONFIG_RCU_NOCB_CPU_ALL=y
> # CONFIG_RCU_EXPEDITE_BOOT is not set
> # RCU Debugging
> # CONFIG_PROVE_RCU is not set
> # CONFIG_SPARSE_RCU_POINTER is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=60
> # CONFIG_RCU_CPU_STALL_INFO is not set
> # CONFIG_RCU_TRACE is not set
> --------------------------------------------------
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-16 12:03   ` Paul E. McKenney
@ 2016-05-16 16:33     ` Santosh Shilimkar
  2016-05-16 17:34       ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Santosh Shilimkar @ 2016-05-16 16:33 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, Sasha Levin

On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
> On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
>> On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
>>> Hi Paul,
>>>
>>> I was asking Sasha about [1] since other folks at Oracle have
>>> also stumbled upon similar RCU stalls with the v4.1 kernel in
>>> different workloads. A similar issue with RDS was reported to
>>> me as well, and after looking at [1], [2], [3] and [4], I
>>> thought of reaching out to see if you can help us understand
>>> this issue better.
>>>
>>> I have also included the RCU-specific config [5] used in these
>>> tests. The issue is very hard to reproduce, but one data point
>>> is that it reproduces on systems with a larger number of CPUs
>>> (64+). The same workload on fewer than 64 CPUs does not show
>>> the issue. I was also told that using the SLAB allocator
>>> instead of SLUB makes a difference, but I haven't verified
>>> that for RDS.
>>>
>>> Let me know your thoughts. Thanks in advance !!
>>>
>> One of my colleagues told me that the pastebin server I used
>> is Oracle-internal only, so I am adding the relevant logs
>> inline with this email.
>>

[...]

>>> [1] https://lkml.org/lkml/2014/12/14/304
>>
>>
>> [2]  Log 1 snippet:
>> -----------------------------------------------------------------
>>  INFO: rcu_sched self-detected stall on CPU
>>  INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
>> g=66023 c=66022 q=0)
>>  Task dump for CPU 54:
>>  ksoftirqd/54    R  running task        0   389      2 0x00000008
>>   0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
>>   ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
>>   ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
>>  Call Trace:
>>   <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
>>   [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
>>   [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
>>   [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
>>   [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
>>   [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
>>   [<ffffffff810e9772>] update_process_times+0x42/0x70
>>   [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
>>   [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
>>   [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
>>   [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
>>   [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
>>   [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
>>   [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
>>   [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
>>   <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
>>   [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
>>   [<ffffffff8118e775>] __free_pages+0x25/0x40
>>   [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
>>   [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
>>   [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
>>   [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
>>   [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
>>   [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
>>   [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
>
> The most likely possibility is that there is a 60-second-long loop in
> one of the above functions.  This is within bottom-half execution, so
> unfortunately the usual trick of placing cond_resched_rcu_qs() within this
> loop, but outside of any RCU read-side critical section does not work.
>
First of all, thanks for the explanation.

There is no loop that can last for 60 seconds in the above code,
since it is just a completion-queue handler used to free up buffers,
much like a NIC driver's bottom half (NAPI). It runs in tasklet
context for latency reasons, which RDS cares about most. Just to get
your attention, the RCU stall is also seen with XEN code; a log for
it is at the end of this email.

Another important observation: for RDS, if we avoid higher-order
page allocations, the issue has not been reproducible so far. In
other words, with PAGE_SIZE (4K, get_order(bytes) == 0) allocations,
the system continues to run without any issue, so the loop scenario
is more or less ruled out.

To be specific, with PAGE_SIZE allocations, alloc_pages() just
allocates a single page, and __free_page() is used instead of
__free_pages(); see the snippet below.

------------------
if (bytes >= PAGE_SIZE)
	page = alloc_pages(gfp, get_order(bytes));

.....

(rm->data.op_sg[i].length <= PAGE_SIZE) ?
__free_page(sg_page(&rm->data.op_sg[i])) :
__free_pages(sg_page(&rm->data.op_sg[i]), 
get_order(rm->data.op_sg[i].length));
----------------------------
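
For completeness, a rough sketch of what such an order-0-only
allocation could look like is below; the helper name and its use of a
scatterlist are illustrative assumptions, not the actual RDS change:

------------------
/* Illustrative sketch: allocate the message body as individual
 * 0-order pages so that every sg entry is at most PAGE_SIZE and
 * the free path always takes the __free_page() branch above. */
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/scatterlist.h>

static int rds_alloc_sg_order0(struct scatterlist *sg, unsigned int nents,
			       size_t bytes, gfp_t gfp)
{
	unsigned int i;

	for (i = 0; i < nents && bytes; i++) {
		size_t len = min_t(size_t, bytes, PAGE_SIZE);
		struct page *page = alloc_page(gfp);	/* never higher order */

		if (!page)
			return -ENOMEM;
		sg_set_page(&sg[i], page, len, 0);
		bytes -= len;
	}
	return bytes ? -EMSGSIZE : 0;
}
----------------------------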


> Therefore, if there really is a loop here, one fix would be to
> periodically unwind back out to run_ksoftirqd(), but setting up so that
> the work would be continued later.  Another fix might be to move this
> from tasklet context to workqueue context, where cond_resched_rcu_qs()
> can be used -- however, this looks a bit like networking code, which
> does not always take kindly to being run in process context (though
> careful use of local_bh_disable() and local_bh_enable() can sometimes
> overcome this issue).  A third fix, which works only if this code does
> not use RCU and does not invoke any code that does use RCU, is to tell
> RCU that it should ignore this code (which will require a little work
> on RCU, as it currently does not tolerate this sort of thing aside from
> the idle threads).  In this last approach, event-tracing calls must use
> the _nonidle suffix.
>
> I am not familiar with the RDS code, so I cannot be more specific.

No worries. Since we saw the issue with XEN too, I was suspecting
that we either didn't have the RCU tree config set up correctly, or
that some important RCU patch(es) missing from v4.1 made it into
later kernels.

The only common denominator I saw between these two different
use cases (RDS and XEN) was the 'hrtimer_interrupt()' chain,
which I believe triggers the rcu_sched chain.

I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
get more information out of the system. Do you suggest any other
RCU option(s)/patch(es) that could help us capture more
information to understand the issue better? I wasn't sure about
options like "RCU_KTHREAD_PRIO", TASKS_RCU, etc. The RCU config
used is also included at the end of this email.

Regards,
Santosh

>> [4] LOG 3:
>> ------------------------------------------------------
>>  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
>> 0, t=240007 jiffies, g=449043, c=449042, q=0)
>>  All QSes seen, last rcu_sched kthread activity 240007
>> (4299336825-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
>>  ora_lms0_orcltw R  running task        0 22303      1
>> 0x00000080
>>   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
>> ffffffff810b038f
>>   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
>> ffffffff810e8f96
>>   0000000000000000 ffffffff810b4deb ffff88010e817800
>> ffff88010e8185c0
>>  Call Trace:
>>   <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
>>   [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
>>   [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
>>   [<ffffffff810ee582>] update_process_times+0x42/0x70
>>   [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
>>   [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
>>   [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
>>   [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
>>   [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
>>   [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
>>   [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
>>   [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
>>   [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
>>   [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
>>   [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
>>   [<ffffffff81085bba>] ? __do_softirq+0x18a/0x2d0
>>   [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
>>   [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
>>   [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
>>   <EOI>
>>  rcu_sched kthread starved for 240007 jiffies!
>>  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
>> 0, t=420012 jiffies, g=449043, c=449042, q=0)
>>  All QSes seen, last rcu_sched kthread activity 420012
>> (4299516830-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
>>  ora_lms0_orcltw R  running task        0 22303      1
>> 0x00000080
>>   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
>> ffffffff810b038f
>>   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
>> ffffffff810e8f96
>>   0000000000000000 ffff88010e817800 0000000000017800
>> ffff88010e8185c0
>>  Call Trace:
>>   <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
>>   [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
>>   [<ffffffff810ee582>] update_process_times+0x42/0x70
>>   [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
>>   [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
>>   [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
>>   [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
>>   [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
>>   [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
>>   [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
>>   [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
>>   [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
>>   [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
>>   [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
>>   [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
>>   [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
>>   [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
>>   <EOI>  [<ffffffff81024847>] ?
>> do_audit_syscall_entry+0x67/0x70
>>   [<ffffffff81025eb3>] syscall_trace_enter_phase1+0x143/0x1a0
>>   [<ffffffff8112ac86>] ? __audit_syscall_exit+0x1e6/0x280
>>   [<ffffffff81026216>] ? syscall_trace_leave+0xc6/0x120
>>   [<ffffffff8170d69a>] tracesys+0xd/0x44
>>  rcu_sched kthread starved for 420012 jiffies!
>>  INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
>> 0, t=600017 jiffies, g=449043, c=449042, q=0)
>>  All QSes seen, last rcu_sched kthread activity 600017
>> (4299696835-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
>>  ora_lms0_orcltw R  running task        0 22303      1
>> 0x00000080
>>   ffff8800c4651c00 00000000e7110062 ffff88010e803c18
>> ffffffff810b038f
>>   ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
>> ffffffff810e8f96
>>   0000000000000000 ffffffff810b4deb ffff88010e817800
>> ffff88010e8185c0
>> Call Trace:
>>  <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
>>  [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
>>  [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
>>  [<ffffffff810ee582>] update_process_times+0x42/0x70
>>  [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
>>  [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
>>  [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
>>  [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
>>  [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
>>  [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
>>  [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
>>  [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
>>  [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
>>  [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
>>  [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
>>  [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
>>  [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
>>  [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
>>  <EOI>
>> rcu_sched kthread starved for 600017 jiffies!
 >> ------------------------------------------------------
 >>
 >> [5] RCU config
 >> -----------------------------------------------
 >> # RCU Subsystem
 >> CONFIG_TREE_RCU=y
 >> CONFIG_SRCU=y
 >> # CONFIG_TASKS_RCU is not set
 >> CONFIG_RCU_STALL_COMMON=y
 >> CONFIG_RCU_USER_QS=y
 >> CONFIG_RCU_FANOUT=64
 >> CONFIG_RCU_FANOUT_LEAF=16
 >> # CONFIG_RCU_FANOUT_EXACT is not set
 >> # CONFIG_RCU_FAST_NO_HZ is not set
 >> # CONFIG_TREE_RCU_TRACE is not set
 >> CONFIG_RCU_KTHREAD_PRIO=0
 >> CONFIG_RCU_NOCB_CPU=y
 >> # CONFIG_RCU_NOCB_CPU_NONE is not set
 >> # CONFIG_RCU_NOCB_CPU_ZERO is not set
 >> CONFIG_RCU_NOCB_CPU_ALL=y
 >> # CONFIG_RCU_EXPEDITE_BOOT is not set
 >> # RCU Debugging
 >> # CONFIG_PROVE_RCU is not set
 >> # CONFIG_SPARSE_RCU_POINTER is not set
 >> # CONFIG_RCU_TORTURE_TEST is not set
 >> CONFIG_RCU_CPU_STALL_TIMEOUT=60
 >> # CONFIG_RCU_CPU_STALL_INFO is not set
 >> # CONFIG_RCU_TRACE is not set
 >> --------------------------------------------------
 >>
 >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-16 16:33     ` Santosh Shilimkar
@ 2016-05-16 17:34       ` Paul E. McKenney
  2016-05-16 19:49         ` Santosh Shilimkar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2016-05-16 17:34 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: linux-kernel, Sasha Levin

On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
> >On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
> >>On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> >>>Hi Paul,
> >>>
> >>>I was asking Sasha about [1] since other folks at Oracle have
> >>>also stumbled upon similar RCU stalls with the v4.1 kernel in
> >>>different workloads. A similar issue with RDS was reported to
> >>>me as well, and after looking at [1], [2], [3] and [4], I
> >>>thought of reaching out to see if you can help us understand
> >>>this issue better.
> >>>
> >>>I have also included the RCU-specific config [5] used in these
> >>>tests. The issue is very hard to reproduce, but one data point
> >>>is that it reproduces on systems with a larger number of CPUs
> >>>(64+). The same workload on fewer than 64 CPUs does not show
> >>>the issue. I was also told that using the SLAB allocator
> >>>instead of SLUB makes a difference, but I haven't verified
> >>>that for RDS.
> >>>
> >>>Let me know your thoughts. Thanks in advance !!
> >>>
> >>One of my colleagues told me that the pastebin server I used
> >>is Oracle-internal only, so I am adding the relevant logs
> >>inline with this email.
> >>
> 
> [...]
> 
> >>>[1] https://lkml.org/lkml/2014/12/14/304
> >>
> >>
> >>[2]  Log 1 snippet:
> >>-----------------------------------------------------------------
> >> INFO: rcu_sched self-detected stall on CPU
> >> INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
> >>g=66023 c=66022 q=0)
> >> Task dump for CPU 54:
> >> ksoftirqd/54    R  running task        0   389      2 0x00000008
> >>  0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
> >>  ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
> >>  ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
> >> Call Trace:
> >>  <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
> >>  [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
> >>  [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
> >>  [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
> >>  [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
> >>  [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
> >>  [<ffffffff810e9772>] update_process_times+0x42/0x70
> >>  [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
> >>  [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
> >>  [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
> >>  [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
> >>  [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
> >>  [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
> >>  [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
> >>  [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
> >>  <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
> >>  [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
> >>  [<ffffffff8118e775>] __free_pages+0x25/0x40
> >>  [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
> >>  [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
> >>  [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
> >>  [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
> >>  [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
> >>  [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
> >>  [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
> >
> >The most likely possibility is that there is a 60-second-long loop in
> >one of the above functions.  This is within bottom-half execution, so
> >unfortunately the usual trick of placing cond_resched_rcu_qs() within this
> >loop, but outside of any RCU read-side critical section does not work.
> >
> First of all, thanks for the explanation.
> 
> There is no loop that can last for 60 seconds in the above code,
> since it is just a completion-queue handler used to free up buffers,
> much like a NIC driver's bottom half (NAPI). It runs in tasklet
> context for latency reasons, which RDS cares about most. Just to get
> your attention, the RCU stall is also seen with XEN code; a log for
> it is at the end of this email.
> 
> Another important observation: for RDS, if we avoid higher-order
> page allocations, the issue has not been reproducible so far. In
> other words, with PAGE_SIZE (4K, get_order(bytes) == 0) allocations,
> the system continues to run without any issue, so the loop scenario
> is more or less ruled out.
> 
> To be specific, with PAGE_SIZE allocations, alloc_pages() just
> allocates a single page, and __free_page() is used instead of
> __free_pages(); see the snippet below.
> 
> ------------------
> if (bytes >= PAGE_SIZE)
> 	page = alloc_pages(gfp, get_order(bytes));
> 
> .....
> 
> (rm->data.op_sg[i].length <= PAGE_SIZE) ?
> __free_page(sg_page(&rm->data.op_sg[i])) :
> __free_pages(sg_page(&rm->data.op_sg[i]),
> get_order(rm->data.op_sg[i].length));
> ----------------------------

This sounds like something to take up with the mm folks.
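
(For readers following along, a minimal userspace sketch of the order
arithmetic that the quoted snippet relies on -- purely an illustration
assuming 4K pages, not RDS or mm code; order_for() is a hypothetical
stand-in for the kernel's get_order():)

------------------
#include <stdio.h>

/* Stand-in for get_order(): smallest order such that
 * (4096 << order) covers "bytes". */
static int order_for(unsigned long bytes)
{
	unsigned long pages = (bytes + 4095) / 4096;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	/* 4096 -> 0 (__free_page path), 8192 -> 1, 65536 -> 4,
	 * i.e. anything above PAGE_SIZE becomes a higher-order free. */
	printf("%d %d %d\n", order_for(4096), order_for(8192), order_for(65536));
	return 0;
}
------------------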

> >Therefore, if there really is a loop here, one fix would be to
> >periodically unwind back out to run_ksoftirqd(), but setting up so that
> >the work would be continued later.  Another fix might be to move this
> >from tasklet context to workqueue context, where cond_resched_rcu_qs()
> >can be used -- however, this looks a bit like networking code, which
> >does not always take kindly to being run in process context (though
> >careful use of local_bh_disable() and local_bh_enable() can sometimes
> >overcome this issue).  A third fix, which works only if this code does
> >not use RCU and does not invoke any code that does use RCU, is to tell
> >RCU that it should ignore this code (which will require a little work
> >on RCU, as it currently does not tolerate this sort of thing aside from
> >the idle threads).  In this last approach, event-tracing calls must use
> >the _nonidle suffix.
> >
> >I am not familiar with the RDS code, so I cannot be more specific.
> 
> No worries. Since we saw the issue with XEN too, I was suspecting
> that somehow we didn't have RCU_TREE config setup correctly or
> some important RCU patch(s) missing in v4.1 which made it in
> later kernels.
> 
> The only common denominator I saw between these two different
> usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
> which I believe triggers the rcu_sched() chain.

Exactly!

> I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
> get more information out of the system. Do you suggest any other
> RCU option(s)/patch(s) which we can help us to capture more
> information to understand the issue better. I wasn't sure about
> options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
> is also mentioned end of the email.

Hmmm...  I just now noticed the "All QSes seen" message below.
That can happen if the load is quite high, and as you say, one
thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
This is not free, as it will mean more context switches involving
the RCU grace-period kthreads.

Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
smaller than 60, say, 21.  This will cause the stall warnings to leave
less time before splatting.
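
For convenience, the two suggestions above as a .config fragment --
same values as suggested, nothing new; if needed, the stall timeout can
also be shortened on an already-built kernel via the
rcupdate.rcu_cpu_stall_timeout= boot parameter:

CONFIG_RCU_KTHREAD_PRIO=1
CONFIG_RCU_CPU_STALL_TIMEOUT=21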

Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
you need more housekeeping CPUs than you currently have configured.

							Thanx, Paul

> Regards,
> Santosh
> 
> >>[5] LOG 3:
> >>------------------------------------------------------
> >> INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
> >>0, t=240007 jiffies, g=449043, c=449042, q=0)
> >> All QSes seen, last rcu_sched kthread activity 240007
> >>(4299336825-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
> >> ora_lms0_orcltw R  running task        0 22303      1
> >>0x00000080
> >>  ffff8800c4651c00 00000000e7110062 ffff88010e803c18
> >>ffffffff810b038f
> >>  ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
> >>ffffffff810e8f96
> >>  0000000000000000 ffffffff810b4deb ffff88010e817800
> >>ffff88010e8185c0
> >> Call Trace:
> >>  <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
> >>  [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
> >>  [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
> >>  [<ffffffff810ee582>] update_process_times+0x42/0x70
> >>  [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
> >>  [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
> >>  [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
> >>  [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
> >>  [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
> >>  [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
> >>  [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
> >>  [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
> >>  [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
> >>  [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
> >>  [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
> >>  [<ffffffff81085bba>] ? __do_softirq+0x18a/0x2d0
> >>  [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
> >>  [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
> >>  [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
> >>  <EOI>
> >> rcu_sched kthread starved for 240007 jiffies!
> >> INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
> >>0, t=420012 jiffies, g=449043, c=449042, q=0)
> >> All QSes seen, last rcu_sched kthread activity 420012
> >>(4299516830-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
> >> ora_lms0_orcltw R  running task        0 22303      1
> >>0x00000080
> >>  ffff8800c4651c00 00000000e7110062 ffff88010e803c18
> >>ffffffff810b038f
> >>  ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
> >>ffffffff810e8f96
> >>  0000000000000000 ffff88010e817800 0000000000017800
> >>ffff88010e8185c0
> >> Call Trace:
> >>  <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
> >>  [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
> >>  [<ffffffff810ee582>] update_process_times+0x42/0x70
> >>  [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
> >>  [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
> >>  [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
> >>  [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
> >>  [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
> >>  [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
> >>  [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
> >>  [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
> >>  [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
> >>  [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
> >>  [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
> >>  [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
> >>  [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
> >>  [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
> >>  <EOI>  [<ffffffff81024847>] ?
> >>do_audit_syscall_entry+0x67/0x70
> >>  [<ffffffff81025eb3>] syscall_trace_enter_phase1+0x143/0x1a0
> >>  [<ffffffff8112ac86>] ? __audit_syscall_exit+0x1e6/0x280
> >>  [<ffffffff81026216>] ? syscall_trace_leave+0xc6/0x120
> >>  [<ffffffff8170d69a>] tracesys+0xd/0x44
> >> rcu_sched kthread starved for 420012 jiffies!
> >> INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by
> >>0, t=600017 jiffies, g=449043, c=449042, q=0)
> >> All QSes seen, last rcu_sched kthread activity 600017
> >>(4299696835-4299096818), jiffies_till_next_fqs=3, root ->qsmask 0x0
> >> ora_lms0_orcltw R  running task        0 22303      1
> >>0x00000080
> >>  ffff8800c4651c00 00000000e7110062 ffff88010e803c18
> >>ffffffff810b038f
> >>  ffff88010e8185c0 ffffffff81b22c40 ffff88010e803c98
> >>ffffffff810e8f96
> >>  0000000000000000 ffffffff810b4deb ffff88010e817800
> >>ffff88010e8185c0
> >>Call Trace:
> >> <IRQ>  [<ffffffff810b038f>] sched_show_task+0xaf/0x120
> >> [<ffffffff810e8f96>] rcu_check_callbacks+0x7e6/0x7f0
> >> [<ffffffff810b4deb>] ? account_user_time+0x8b/0xa0
> >> [<ffffffff810ee582>] update_process_times+0x42/0x70
> >> [<ffffffff810fea65>] tick_sched_handle.isra.18+0x25/0x60
> >> [<ffffffff810feae4>] tick_sched_timer+0x44/0x80
> >> [<ffffffff810ef307>] __run_hrtimer+0x77/0x1d0
> >> [<ffffffff810feaa0>] ? tick_sched_handle.isra.18+0x60/0x60
> >> [<ffffffff810ef6e3>] hrtimer_interrupt+0x103/0x230
> >> [<ffffffff8100a64d>] xen_timer_interrupt+0x3d/0x170
> >> [<ffffffff814469a9>] ? add_interrupt_randomness+0x49/0x200
> >> [<ffffffff810dd1ae>] handle_irq_event_percpu+0x3e/0x1a0
> >> [<ffffffff810e0cfd>] handle_percpu_irq+0x3d/0x60
> >> [<ffffffff810dc7ab>] generic_handle_irq+0x2b/0x40
> >> [<ffffffff813fe91f>] evtchn_2l_handle_events+0x26f/0x280
> >> [<ffffffff813fbe8f>] __xen_evtchn_do_upcall+0x4f/0x90
> >> [<ffffffff813fddf4>] xen_evtchn_do_upcall+0x34/0x50
> >> [<ffffffff8170f35e>] xen_hvm_callback_vector+0x6e/0x80
> >> <EOI>
> >>rcu_sched kthread starved for 600017 jiffies!
> >> ------------------------------------------------------
> >>
> >> [5] RCU config
> >> -----------------------------------------------
> >> # RCU Subsystem
> >> CONFIG_TREE_RCU=y
> >> CONFIG_SRCU=y
> >> # CONFIG_TASKS_RCU is not set
> >> CONFIG_RCU_STALL_COMMON=y
> >> CONFIG_RCU_USER_QS=y
> >> CONFIG_RCU_FANOUT=64
> >> CONFIG_RCU_FANOUT_LEAF=16
> >> # CONFIG_RCU_FANOUT_EXACT is not set
> >> # CONFIG_RCU_FAST_NO_HZ is not set
> >> # CONFIG_TREE_RCU_TRACE is not set
> >> CONFIG_RCU_KTHREAD_PRIO=0
> >> CONFIG_RCU_NOCB_CPU=y
> >> # CONFIG_RCU_NOCB_CPU_NONE is not set
> >> # CONFIG_RCU_NOCB_CPU_ZERO is not set
> >> CONFIG_RCU_NOCB_CPU_ALL=y
> >> # CONFIG_RCU_EXPEDITE_BOOT is not set
> >> # RCU Debugging
> >> # CONFIG_PROVE_RCU is not set
> >> # CONFIG_SPARSE_RCU_POINTER is not set
> >> # CONFIG_RCU_TORTURE_TEST is not set
> >> CONFIG_RCU_CPU_STALL_TIMEOUT=60
> >> # CONFIG_RCU_CPU_STALL_INFO is not set
> >> # CONFIG_RCU_TRACE is not set
> >> --------------------------------------------------
> >>
> >
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-16 17:34       ` Paul E. McKenney
@ 2016-05-16 19:49         ` Santosh Shilimkar
  2016-05-17  0:58           ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Santosh Shilimkar @ 2016-05-16 19:49 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, Sasha Levin

On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
>> On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
>>> On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
>>>> On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
>>>>> Hi Paul,
>>>>>
>>>>> I was asking Sasha about [1] since other folks in Oracle
>>>>> also stumbled upon similar RCU stalls with v4.1 kernel in
>>>>> different workloads. I was reported similar issue with
>>>>> RDS as well and looking at [1], [2], [3] and [4], thought
>>>>> of reaching out to see if you can help us to understand
>>>>> this issue better.
>>>>>
>>>>> Have also included RCU specific config used in these
>>>>> test(s). Its very hard to reproduce the issue but one of
>>>>> the data point is, it reproduces on systems with larger
>>>>> CPUs(64+). Same workload with less than 64 CPUs, don't
>>>>> show the issue. Someone also told me, making use of
>>>>> SLAB instead SLUB allocator makes difference but I
>>>>> haven't verified that part for RDS.
>>>>>
>>>>> Let me know your thoughts. Thanks in advance !!
>>>>>
>>>> One of my colleague told me the pastebin server I used
>>>> is Oracle internal only so adding the relevant logs along
>>>> with email.
>>>>
>>
>> [...]
>>
>>>>> [1] https://lkml.org/lkml/2014/12/14/304
>>>>
>>>>
>>>> [2]  Log 1 snippet:
>>>> -----------------------------------------------------------------
>>>> INFO: rcu_sched self-detected stall on CPU
>>>> INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
>>>> g=66023 c=66022 q=0)
>>>> Task dump for CPU 54:
>>>> ksoftirqd/54    R  running task        0   389      2 0x00000008
>>>>  0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
>>>>  ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
>>>>  ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
>>>> Call Trace:
>>>>  <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
>>>>  [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
>>>>  [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
>>>>  [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
>>>>  [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
>>>>  [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
>>>>  [<ffffffff810e9772>] update_process_times+0x42/0x70
>>>>  [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
>>>>  [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
>>>>  [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
>>>>  [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
>>>>  [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
>>>>  [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
>>>>  [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
>>>>  [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
>>>>  <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
>>>>  [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
>>>>  [<ffffffff8118e775>] __free_pages+0x25/0x40
>>>>  [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
>>>>  [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
>>>>  [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
>>>>  [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
>>>>  [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
>>>>  [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
>>>>  [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
>>>
>>> The most likely possibility is that there is a 60-second-long loop in
>>> one of the above functions.  This is within bottom-half execution, so
>>> unfortunately the usual trick of placing cond_resched_rcu_qs() within this
>>> loop, but outside of any RCU read-side critical section does not work.
>>>
>> First of all thanks for explanation.
>>
>> There is no loop which can last for 60 seconds in above code since
>> its just completion queue handler used to free up buffers much like
>> NIC
>> drivers bottom half(NAPI). Its done in tasklet context for latency
>> reasons which RDS care most. Just to get your attention, the RCU
>> stall is also seen with XEN code too. Log for it end of the email.
>>
>> Another important observation is, for RDS if we avoid higher
>> order page(s) allocation, issue is not reproducible so far.
>> In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations,
>> the system continues to run without any issue, so the loop scenario
>> is ruled out more or less.
>>
>> To be specific, with PAGE_SIZE allocations, alloc_pages()
>> is just allocating a page and __free_page() is used
>> instead of __free_pages() from below snippet.
>>
>> ------------------
>> if (bytes >= PAGE_SIZE)
>> 	page = alloc_pages(gfp, get_order(bytes));
>>
>> .....
>>
>> (rm->data.op_sg[i].length <= PAGE_SIZE) ?
>> __free_page(sg_page(&rm->data.op_sg[i])) :
>> __free_pages(sg_page(&rm->data.op_sg[i]),
>> get_order(rm->data.op_sg[i].length));
>> ----------------------------
>
> This sounds like something to take up with the mm folks.
>
Sure. Will do once the link between the two issues is established.

>>> Therefore, if there really is a loop here, one fix would be to
>>> periodically unwind back out to run_ksoftirqd(), but setting up so that
>>> the work would be continued later.  Another fix might be to move this
>>> from tasklet context to workqueue context, where cond_resched_rcu_qs()
>>> can be used -- however, this looks a bit like networking code, which
>>> does not always take kindly to being run in process context (though
>>> careful use of local_bh_disable() and local_bh_enable() can sometimes
>>> overcome this issue).  A third fix, which works only if this code does
>>> not use RCU and does not invoke any code that does use RCU, is to tell
>>> RCU that it should ignore this code (which will require a little work
>>> on RCU, as it currently does not tolerate this sort of thing aside from
>>> the idle threads).  In this last approach, event-tracing calls must use
>>> the _nonidle suffix.
>>>
>>> I am not familiar with the RDS code, so I cannot be more specific.
>>
>> No worries. Since we saw the issue with XEN too, I was suspecting
>> that somehow we didn't have RCU_TREE config setup correctly or
>> some important RCU patch(s) missing in v4.1 which made it in
>> later kernels.
>>
>> The only common denominator I saw between these two different
>> usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
>> which I believe triggers the rcu_sched() chain.
>
> Exactly!
>
>> I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
>> get more information out of the system. Do you suggest any other
>> RCU option(s)/patch(s) which we can help us to capture more
>> information to understand the issue better. I wasn't sure about
>> options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
>> is also mentioned end of the email.
>
> Hmmm...  I just now noticed the "All QSes seen" message below.
> That can happen if the load is quite high, and could as you say
> one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
> This is not free, as it will mean more context switches involving
> the RCU grace-period kthreads.
>
> Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
> smaller than 60, say, 21.  This will cause the stall warnings to leave
> less time before splatting.
>
OK, will try with CONFIG_RCU_KTHREAD_PRIO=1 and
CONFIG_RCU_CPU_STALL_TIMEOUT=21. Indeed, the usecases generate very
high load on the systems.

> Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> you need more housekeeping CPUs than you currently have configured.
>
Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
housekeeping? Seems like without that, the clock-event code will just use
CPU0 for things like broadcasting, which might become a bottleneck.
This could explain the hrtimer_interrupt() path getting slowed
down because of the housekeeping bottleneck.

$cat .config | grep NO_HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
CONFIG_NO_HZ=y
# CONFIG_RCU_FAST_NO_HZ is not set
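
(A quick way to double-check this on the affected machines -- offered
here as a suggestion, not something from the original exchange: if the
command below prints nothing and CONFIG_NO_HZ_FULL_ALL stays unset, no
CPUs are actually running adaptive-tick and every CPU remains available
for housekeeping.)

$ grep -o 'nohz_full=[^ ]*' /proc/cmdline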


Regards,
Santosh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-16 19:49         ` Santosh Shilimkar
@ 2016-05-17  0:58           ` Paul E. McKenney
  2016-05-17 13:46             ` santosh.shilimkar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2016-05-17  0:58 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: linux-kernel, Sasha Levin

On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
> On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> >On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> >>On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
> >>>On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
> >>>>On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> >>>>>Hi Paul,
> >>>>>
> >>>>>I was asking Sasha about [1] since other folks in Oracle
> >>>>>also stumbled upon similar RCU stalls with v4.1 kernel in
> >>>>>different workloads. I was reported similar issue with
> >>>>>RDS as well and looking at [1], [2], [3] and [4], thought
> >>>>>of reaching out to see if you can help us to understand
> >>>>>this issue better.
> >>>>>
> >>>>>Have also included RCU specific config used in these
> >>>>>test(s). Its very hard to reproduce the issue but one of
> >>>>>the data point is, it reproduces on systems with larger
> >>>>>CPUs(64+). Same workload with less than 64 CPUs, don't
> >>>>>show the issue. Someone also told me, making use of
> >>>>>SLAB instead SLUB allocator makes difference but I
> >>>>>haven't verified that part for RDS.
> >>>>>
> >>>>>Let me know your thoughts. Thanks in advance !!
> >>>>>
> >>>>One of my colleague told me the pastebin server I used
> >>>>is Oracle internal only so adding the relevant logs along
> >>>>with email.
> >>>>
> >>
> >>[...]
> >>
> >>>>>[1] https://lkml.org/lkml/2014/12/14/304
> >>>>
> >>>>
> >>>>[2]  Log 1 snippet:
> >>>>-----------------------------------------------------------------
> >>>>INFO: rcu_sched self-detected stall on CPU
> >>>>INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
> >>>>g=66023 c=66022 q=0)
> >>>>Task dump for CPU 54:
> >>>>ksoftirqd/54    R  running task        0   389      2 0x00000008
> >>>> 0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
> >>>> ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
> >>>> ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
> >>>>Call Trace:
> >>>> <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
> >>>> [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
> >>>> [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
> >>>> [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
> >>>> [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
> >>>> [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
> >>>> [<ffffffff810e9772>] update_process_times+0x42/0x70
> >>>> [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
> >>>> [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
> >>>> [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
> >>>> [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
> >>>> [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
> >>>> [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
> >>>> [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
> >>>> [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
> >>>> <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
> >>>> [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
> >>>> [<ffffffff8118e775>] __free_pages+0x25/0x40
> >>>> [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
> >>>> [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
> >>>> [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
> >>>> [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
> >>>> [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
> >>>> [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
> >>>> [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
> >>>
> >>>The most likely possibility is that there is a 60-second-long loop in
> >>>one of the above functions.  This is within bottom-half execution, so
> >>>unfortunately the usual trick of placing cond_resched_rcu_qs() within this
> >>>loop, but outside of any RCU read-side critical section does not work.
> >>>
> >>First of all thanks for explanation.
> >>
> >>There is no loop which can last for 60 seconds in above code since
> >>its just completion queue handler used to free up buffers much like
> >>NIC
> >>drivers bottom half(NAPI). Its done in tasklet context for latency
> >>reasons which RDS care most. Just to get your attention, the RCU
> >>stall is also seen with XEN code too. Log for it end of the email.
> >>
> >>Another important observation is, for RDS if we avoid higher
> >>order page(s) allocation, issue is not reproducible so far.
> >>In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations,
> >>the system continues to run without any issue, so the loop scenario
> >>is ruled out more or less.
> >>
> >>To be specific, with PAGE_SIZE allocations, alloc_pages()
> >>is just allocating a page and __free_page() is used
> >>instead of __free_pages() from below snippet.
> >>
> >>------------------
> >>if (bytes >= PAGE_SIZE)
> >>	page = alloc_pages(gfp, get_order(bytes));
> >>
> >>.....
> >>
> >>(rm->data.op_sg[i].length <= PAGE_SIZE) ?
> >>__free_page(sg_page(&rm->data.op_sg[i])) :
> >>__free_pages(sg_page(&rm->data.op_sg[i]),
> >>get_order(rm->data.op_sg[i].length));
> >>----------------------------
> >
> >This sounds like something to take up with the mm folks.
> >
> Sure. Will do once the link between two issues is established.

Fair enough!

> >>>Therefore, if there really is a loop here, one fix would be to
> >>>periodically unwind back out to run_ksoftirqd(), but setting up so that
> >>>the work would be continued later.  Another fix might be to move this
> >>>from tasklet context to workqueue context, where cond_resched_rcu_qs()
> >>>can be used -- however, this looks a bit like networking code, which
> >>>does not always take kindly to being run in process context (though
> >>>careful use of local_bh_disable() and local_bh_enable() can sometimes
> >>>overcome this issue).  A third fix, which works only if this code does
> >>>not use RCU and does not invoke any code that does use RCU, is to tell
> >>>RCU that it should ignore this code (which will require a little work
> >>>on RCU, as it currently does not tolerate this sort of thing aside from
> >>>the idle threads).  In this last approach, event-tracing calls must use
> >>>the _nonidle suffix.
> >>>
> >>>I am not familiar with the RDS code, so I cannot be more specific.
> >>
> >>No worries. Since we saw the issue with XEN too, I was suspecting
> >>that somehow we didn't have RCU_TREE config setup correctly or
> >>some important RCU patch(s) missing in v4.1 which made it in
> >>later kernels.
> >>
> >>The only common denominator I saw between these two different
> >>usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
> >>which I believe triggers the rcu_sched() chain.
> >
> >Exactly!
> >
> >>I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
> >>get more information out of the system. Do you suggest any other
> >>RCU option(s)/patch(s) which we can help us to capture more
> >>information to understand the issue better. I wasn't sure about
> >>options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
> >>is also mentioned end of the email.
> >
> >Hmmm...  I just now noticed the "All QSes seen" message below.
> >That can happen if the load is quite high, and could as you say
> >one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
> >This is not free, as it will mean more context switches involving
> >the RCU grace-period kthreads.
> >
> >Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
> >smaller than 60, say, 21.  This will cause the stall warnings to leave
> >less time before splatting.
> >
> OK will try with CONFIG_RCU_KTHREAD_PRIO=1 &
> CONFIG_RCU_CPU_STALL_TIMEOUT =21 . Indeed the usecases generate very
> high load on the systems.
> 
> >Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> >you need more housekeeping CPUs than you currently have configured.
> >
> Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
> book keeping. Seems like without that clock-event code will just use
> CPU0 for things like broadcasting which might become bottleneck.
> This could explain connect the hrtimer_interrupt() path getting slowed
> down because of book keeping bottleneck.
> 
> $cat .config | grep NO_HZ
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> # CONFIG_NO_HZ_FULL_ALL is not set
> # CONFIG_NO_HZ_FULL_SYSIDLE is not set
> CONFIG_NO_HZ=y
> # CONFIG_RCU_FAST_NO_HZ is not set

Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
housekeeping tasks, including the RCU grace-period kthreads.  So you are
booting without any nohz_full boot parameter?  You can end up with the
same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
that you can get with CONFIG_NO_HZ_FULL_ALL=y.
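
To illustrate the equivalence with a hypothetical example (CPU numbering
not taken from the reported systems): on a 64-CPU box, booting with

	nohz_full=1-63

leaves only CPU 0 for housekeeping, which is the same shortage that
CONFIG_NO_HZ_FULL_ALL=y creates by reserving just the boot CPU.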

							Thanx, Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-17  0:58           ` Paul E. McKenney
@ 2016-05-17 13:46             ` santosh.shilimkar
  2016-05-17 19:15               ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: santosh.shilimkar @ 2016-05-17 13:46 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, Sasha Levin

On 5/16/16 5:58 PM, Paul E. McKenney wrote:
> On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
>> On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
>>> On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:

[...]

>>> Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
>>> you need more housekeeping CPUs than you currently have configured.
>>>
>> Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
>> book keeping. Seems like without that clock-event code will just use
>> CPU0 for things like broadcasting which might become bottleneck.
>> This could explain connect the hrtimer_interrupt() path getting slowed
>> down because of book keeping bottleneck.
>>
>> $cat .config | grep NO_HZ
>> CONFIG_NO_HZ_COMMON=y
>> # CONFIG_NO_HZ_IDLE is not set
>> CONFIG_NO_HZ_FULL=y
>> # CONFIG_NO_HZ_FULL_ALL is not set
>> # CONFIG_NO_HZ_FULL_SYSIDLE is not set
>> CONFIG_NO_HZ=y
>> # CONFIG_RCU_FAST_NO_HZ is not set
>
> Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
> housekeeping tasks, including the RCU grace-period kthreads.  So you are
> booting without any nohz_full boot parameter?  You can end up with the
> same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
> that you can with CONFIG_NO_HZ_FULL_ALL=y.
>
I see. Yes, the systems are booting without the nohz_full boot parameter.
Will try to add more housekeeping CPUs and update the thread
after verification, since it takes time to reproduce the issue.

Thanks for the discussion so far, Paul. It's been very insightful for me.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-17 13:46             ` santosh.shilimkar
@ 2016-05-17 19:15               ` Paul E. McKenney
  2016-05-19 23:45                 ` Santosh Shilimkar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2016-05-17 19:15 UTC (permalink / raw)
  To: santosh.shilimkar; +Cc: linux-kernel, Sasha Levin

On Tue, May 17, 2016 at 06:46:22AM -0700, santosh.shilimkar@oracle.com wrote:
> On 5/16/16 5:58 PM, Paul E. McKenney wrote:
> >On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
> >>On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> >>>On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> 
> [...]
> 
> >>>Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> >>>you need more housekeeping CPUs than you currently have configured.
> >>>
> >>Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
> >>book keeping. Seems like without that clock-event code will just use
> >>CPU0 for things like broadcasting which might become bottleneck.
> >>This could explain connect the hrtimer_interrupt() path getting slowed
> >>down because of book keeping bottleneck.
> >>
> >>$cat .config | grep NO_HZ
> >>CONFIG_NO_HZ_COMMON=y
> >># CONFIG_NO_HZ_IDLE is not set
> >>CONFIG_NO_HZ_FULL=y
> >># CONFIG_NO_HZ_FULL_ALL is not set
> >># CONFIG_NO_HZ_FULL_SYSIDLE is not set
> >>CONFIG_NO_HZ=y
> >># CONFIG_RCU_FAST_NO_HZ is not set
> >
> >Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
> >housekeeping tasks, including the RCU grace-period kthreads.  So you are
> >booting without any nohz_full boot parameter?  You can end up with the
> >same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
> >that you can with CONFIG_NO_HZ_FULL_ALL=y.
> >
> I see. Yes, the systems are booting without nohz_full boot parameter.
> Will try to add more CPUs to it & update the thread
> after the verification since it takes time to reproduce the issue.
> 
> Thanks for discussion so far Paul. Its very insightful for me.

Please let me know how things go with further testing, especially with
the priority setting.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-17 19:15               ` Paul E. McKenney
@ 2016-05-19 23:45                 ` Santosh Shilimkar
  2016-05-20  4:05                   ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Santosh Shilimkar @ 2016-05-19 23:45 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, Sasha Levin

Hi Paul,

On 5/17/2016 12:15 PM, Paul E. McKenney wrote:
> On Tue, May 17, 2016 at 06:46:22AM -0700, santosh.shilimkar@oracle.com wrote:
>> On 5/16/16 5:58 PM, Paul E. McKenney wrote:
>>> On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
>>>> On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
>>>>> On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
>>
>> [...]
>>
>>>>> Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
>>>>> you need more housekeeping CPUs than you currently have configured.
>>>>>
>>>> Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
>>>> book keeping. Seems like without that clock-event code will just use
>>>> CPU0 for things like broadcasting which might become bottleneck.
>>>> This could explain connect the hrtimer_interrupt() path getting slowed
>>>> down because of book keeping bottleneck.
>>>>
>>>> $cat .config | grep NO_HZ
>>>> CONFIG_NO_HZ_COMMON=y
>>>> # CONFIG_NO_HZ_IDLE is not set
>>>> CONFIG_NO_HZ_FULL=y
>>>> # CONFIG_NO_HZ_FULL_ALL is not set
>>>> # CONFIG_NO_HZ_FULL_SYSIDLE is not set
>>>> CONFIG_NO_HZ=y
>>>> # CONFIG_RCU_FAST_NO_HZ is not set
>>>
>>> Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
>>> housekeeping tasks, including the RCU grace-period kthreads.  So you are
>>> booting without any nohz_full boot parameter?  You can end up with the
>>> same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
>>> that you can with CONFIG_NO_HZ_FULL_ALL=y.
>>>
>> I see. Yes, the systems are booting without nohz_full boot parameter.
>> Will try to add more CPUs to it & update the thread
>> after the verification since it takes time to reproduce the issue.
>>
>> Thanks for discussion so far Paul. Its very insightful for me.
>
> Please let me know how things go with further testing, especially with
> the priority setting.
>
Sorry for the delay. I managed to get information about the XEN usecase
custom config as discussed above. To reduce variables, I disabled
"CONFIG_NO_HZ_FULL" altogether, so the effective setting was:

CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_TREE_RCU_TRACE=y
CONFIG_RCU_KTHREAD_PRIO=1
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_TRACE=y

Unfortunately the XEN test still failed; the log is at the end of
the email. This test is a bit peculiar, though, since
it's a database running in a VM with 1 or 2 CPUs. One
suspicion is that because the database RT processes are
hogging the CPU(s), the kernel RCU thread is not getting a chance
to run, which eventually results in the stall. Does that
make sense?

Please note that it's a non-preempt kernel running RT processes. ;-)

# cat .config | grep PREEMPT
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

Regards,
Santosh
...
....
rcu_sched kthread starved for 399032 jiffies!
INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, 
t=462037 jiffies, g=118888, c=118887, q=0)
All QSes seen, last rcu_sched kthread activity 462037 
(4296277632-4295815595), jiffies_till_next_fqs=3, root ->qsmask 0x0
ocssd.bin       R  running task        0 15375      1 0x00000000
  0000000000000000 ffff8800ec003bc8 ffffffff810a8581 ffffffff81abf980
  000000000001d068 ffff8800ec003c28 ffffffff810e9c98 0000000000000000
  0000000000000086 0000000000000000 0000000000000086 0000000000000082
Call Trace:
  <IRQ>  [<ffffffff810a8581>] sched_show_task+0xb1/0x120
  [<ffffffff810e9c98>] print_other_cpu_stall+0x288/0x2d0
  [<ffffffff810e9e60>] __rcu_pending+0x180/0x230
  [<ffffffff810e9fa5>] rcu_check_callbacks+0x95/0x140
  [<ffffffff810edf92>] update_process_times+0x42/0x70
  [<ffffffff810ffc79>] tick_sched_handle+0x39/0x80
  [<ffffffff810ffee2>] tick_sched_timer+0x52/0xa0
  [<ffffffff810f0424>] __run_hrtimer+0x74/0x1d0
  [<ffffffff810ffe90>] ? tick_nohz_handler+0xc0/0xc0
  [<ffffffff810f07b2>] hrtimer_interrupt+0x102/0x240
  [<ffffffff8100aaae>] xen_timer_interrupt+0x2e/0x130
  [<ffffffff8143347a>] ? add_interrupt_randomness+0x3a/0x1f0
  [<ffffffff81384290>] ? store_cursor_blink+0xc0/0xc0
  [<ffffffff810d9de4>] handle_irq_event_percpu+0x54/0x1b0
  [<ffffffff810dd477>] handle_percpu_irq+0x47/0x70
  [<ffffffff810d9d27>] generic_handle_irq+0x27/0x40
  [<ffffffff813e53ca>] evtchn_2l_handle_events+0x25a/0x260
  [<ffffffff81085541>] ? __do_softirq+0x191/0x2f0
  [<ffffffff813e2c2f>] __xen_evtchn_do_upcall+0x4f/0x90
  [<ffffffff813e3404>] xen_evtchn_do_upcall+0x34/0x50
  [<ffffffff816c761e>] xen_hvm_callback_vector+0x6e/0x80
  <EOI>
rcu_sched kthread starved for 462037 jiffies!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [rcu_sched stall] regression/miss-config ?
  2016-05-19 23:45                 ` Santosh Shilimkar
@ 2016-05-20  4:05                   ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2016-05-20  4:05 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: linux-kernel, Sasha Levin

On Thu, May 19, 2016 at 04:45:26PM -0700, Santosh Shilimkar wrote:
> Hi Paul,
> 
> On 5/17/2016 12:15 PM, Paul E. McKenney wrote:
> >On Tue, May 17, 2016 at 06:46:22AM -0700, santosh.shilimkar@oracle.com wrote:
> >>On 5/16/16 5:58 PM, Paul E. McKenney wrote:
> >>>On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
> >>>>On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> >>>>>On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> >>
> >>[...]
> >>
> >>>>>Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> >>>>>you need more housekeeping CPUs than you currently have configured.
> >>>>>
> >>>>Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
> >>>>book keeping. Seems like without that clock-event code will just use
> >>>>CPU0 for things like broadcasting which might become bottleneck.
> >>>>This could explain connect the hrtimer_interrupt() path getting slowed
> >>>>down because of book keeping bottleneck.
> >>>>
> >>>>$cat .config | grep NO_HZ
> >>>>CONFIG_NO_HZ_COMMON=y
> >>>># CONFIG_NO_HZ_IDLE is not set
> >>>>CONFIG_NO_HZ_FULL=y
> >>>># CONFIG_NO_HZ_FULL_ALL is not set
> >>>># CONFIG_NO_HZ_FULL_SYSIDLE is not set
> >>>>CONFIG_NO_HZ=y
> >>>># CONFIG_RCU_FAST_NO_HZ is not set
> >>>
> >>>Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
> >>>housekeeping tasks, including the RCU grace-period kthreads.  So you are
> >>>booting without any nohz_full boot parameter?  You can end up with the
> >>>same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
> >>>that you can with CONFIG_NO_HZ_FULL_ALL=y.
> >>>
> >>I see. Yes, the systems are booting without nohz_full boot parameter.
> >>Will try to add more CPUs to it & update the thread
> >>after the verification since it takes time to reproduce the issue.
> >>
> >>Thanks for discussion so far Paul. Its very insightful for me.
> >
> >Please let me know how things go with further testing, especially with
> >the priority setting.
> >
> Sorry for delay. I manage to get information about XEN usecase
> custom config as discussed above. To reduce variables, I disabled
> "CONFIG_NO_HZ_FULL" altogether. So the effective setting was:
> 
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_TREE_RCU_TRACE=y
> CONFIG_RCU_KTHREAD_PRIO=1
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_TRACE=y
> 
> Unfortunately the XEN test still failed. Log end of
> the email. This test(s) is bit peculiar though since
> its database running in VM with 1 or 2 CPUs. One of
> the suspect is because the database RT processes are
> hogging the CPU(s), kernel RCU thread is not getting chance
> to run which eventually results in stall. Does it
> make sense ?
> 
> Please note that its non-preempt kernel using RT processes. ;-)

If you have enough real-time processes to consume all CPU, you will
indeed starve the grace-period kthread, so what you see below would
then be expected behavior.

Try setting CONFIG_RCU_KTHREAD_PRIO to be larger than the real-time
priority of your processes.
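
As a runtime cross-check, here is a sketch (not from this thread; the
SCHED_FIFO priority 11 below is only an illustrative number and should
be chosen above whatever the database's RT tasks actually run at):

# ps -eo pid,cls,rtprio,comm | grep -E 'rcu_sched|ocssd'  # compare kthread vs. RT task priorities
# chrt -f -p 11 $(pgrep -x rcu_sched)                     # temporarily raise the grace-period kthread

The chrt route lasts only until the next reboot; the Kconfig knob above
is the persistent fix.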

							Thanx, Paul

> # cat .config | grep PREEMPT
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> 
> Regards,
> Santosh
> ...
> ....
> rcu_sched kthread starved for 399032 jiffies!
> INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0,
> t=462037 jiffies, g=118888, c=118887, q=0)
> All QSes seen, last rcu_sched kthread activity 462037
> (4296277632-4295815595), jiffies_till_next_fqs=3, root ->qsmask 0x0
> ocssd.bin       R  running task        0 15375      1 0x00000000
>  0000000000000000 ffff8800ec003bc8 ffffffff810a8581 ffffffff81abf980
>  000000000001d068 ffff8800ec003c28 ffffffff810e9c98 0000000000000000
>  0000000000000086 0000000000000000 0000000000000086 0000000000000082
> Call Trace:
>  <IRQ>  [<ffffffff810a8581>] sched_show_task+0xb1/0x120
>  [<ffffffff810e9c98>] print_other_cpu_stall+0x288/0x2d0
>  [<ffffffff810e9e60>] __rcu_pending+0x180/0x230
>  [<ffffffff810e9fa5>] rcu_check_callbacks+0x95/0x140
>  [<ffffffff810edf92>] update_process_times+0x42/0x70
>  [<ffffffff810ffc79>] tick_sched_handle+0x39/0x80
>  [<ffffffff810ffee2>] tick_sched_timer+0x52/0xa0
>  [<ffffffff810f0424>] __run_hrtimer+0x74/0x1d0
>  [<ffffffff810ffe90>] ? tick_nohz_handler+0xc0/0xc0
>  [<ffffffff810f07b2>] hrtimer_interrupt+0x102/0x240
>  [<ffffffff8100aaae>] xen_timer_interrupt+0x2e/0x130
>  [<ffffffff8143347a>] ? add_interrupt_randomness+0x3a/0x1f0
>  [<ffffffff81384290>] ? store_cursor_blink+0xc0/0xc0
>  [<ffffffff810d9de4>] handle_irq_event_percpu+0x54/0x1b0
>  [<ffffffff810dd477>] handle_percpu_irq+0x47/0x70
>  [<ffffffff810d9d27>] generic_handle_irq+0x27/0x40
>  [<ffffffff813e53ca>] evtchn_2l_handle_events+0x25a/0x260
>  [<ffffffff81085541>] ? __do_softirq+0x191/0x2f0
>  [<ffffffff813e2c2f>] __xen_evtchn_do_upcall+0x4f/0x90
>  [<ffffffff813e3404>] xen_evtchn_do_upcall+0x34/0x50
>  [<ffffffff816c761e>] xen_hvm_callback_vector+0x6e/0x80
>  <EOI>
> rcu_sched kthread starved for 462037 jiffies!
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-05-20  4:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-15 21:18 [rcu_sched stall] regression/miss-config ? Santosh Shilimkar
2016-05-16  4:35 ` santosh.shilimkar
2016-05-16 12:03   ` Paul E. McKenney
2016-05-16 16:33     ` Santosh Shilimkar
2016-05-16 17:34       ` Paul E. McKenney
2016-05-16 19:49         ` Santosh Shilimkar
2016-05-17  0:58           ` Paul E. McKenney
2016-05-17 13:46             ` santosh.shilimkar
2016-05-17 19:15               ` Paul E. McKenney
2016-05-19 23:45                 ` Santosh Shilimkar
2016-05-20  4:05                   ` Paul E. McKenney
