[trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings
diff mbox series

Message ID 20180109185222.GO3668920@devbig577.frc2.facebook.com
State New, archived
Headers show
Series
  • [trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings
Related show

Commit Message

Tejun Heo Jan. 9, 2018, 6:52 p.m. UTC
>From c1d1ae75ba4bde7c833560248b3f561085fa850e Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 9 Jan 2018 10:38:17 -0800
Subject: [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings

When RCU stall warning triggers, it can print out a lot of messages
while holding spinlocks.  If the console device is slow (e.g. an
actual or IPMI serial console), it may end up triggering NMI hard
lockup watchdog like the following.

*** CPU printking while holding RCU spinlock

  PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
   #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
   #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
   #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
   #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
   #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
      [exception RIP: io_serial_in+21]
      RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
      RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
      RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
      RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
      R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
      R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
  --- <NMI exception stack> ---
   #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
   #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
   #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
   #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
   #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
  #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
  #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
  #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
  #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
  #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
  #15 [ffff881fff943d70] printk at ffffffff8119a8a4
  #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
  #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
  #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
  #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
  #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
  #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b

*** CPU triggering the hardlockup watchdog

  PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
   #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
   #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
   #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
   #3 [ffff883fff105b08] panic at ffffffff8119a6ae
   #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
   #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
   #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
   #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
   #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
   #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
  #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
  #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
  #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
      [exception RIP: queued_spin_lock_slowpath+248]
      RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
      RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
      RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
      R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
      R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
  #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
  #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
  #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
  #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
  #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
  #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
  #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
  --- <IRQ stack> ---

Avoid spuriously triggering NMI hardlockup watchdog by touching it
from the print functions.  show_state_filter() shares the same problem
and solution.

v2: Relocate the comment to where it belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/rcu/tree_plugin.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Paul E. McKenney Jan. 9, 2018, 8:05 p.m. UTC | #1
On Tue, Jan 09, 2018 at 10:52:22AM -0800, Tejun Heo wrote:
> >From c1d1ae75ba4bde7c833560248b3f561085fa850e Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Tue, 9 Jan 2018 10:38:17 -0800
> Subject: [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings
> 
> When RCU stall warning triggers, it can print out a lot of messages
> while holding spinlocks.  If the console device is slow (e.g. an
> actual or IPMI serial console), it may end up triggering NMI hard
> lockup watchdog like the following.

Queued for testing and review, thank you!

							Thanx, Paul

> *** CPU printking while holding RCU spinlock
> 
>   PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
>    #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
>    #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
>    #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
>    #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
>    #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: io_serial_in+21]
>       RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
>       RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
>       RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
>       RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
>       R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
>       R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>   --- <NMI exception stack> ---
>    #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
>    #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
>    #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
>    #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
>    #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
>   #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
>   #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
>   #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
>   #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
>   #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
>   #15 [ffff881fff943d70] printk at ffffffff8119a8a4
>   #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
>   #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
>   #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
>   #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
>   #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
>   #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b
> 
> *** CPU triggering the hardlockup watchdog
> 
>   PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
>    #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
>    #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
>    #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
>    #3 [ffff883fff105b08] panic at ffffffff8119a6ae
>    #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
>    #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
>    #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
>    #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
>    #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
>    #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
>   #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
>   #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
>   #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
>       [exception RIP: queued_spin_lock_slowpath+248]
>       RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
>       RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
>       RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
>       RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
>       R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
>       R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   --- <NMI exception stack> ---
>   #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
>   #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
>   #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
>   #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
>   #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
>   #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
>   #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
>   --- <IRQ stack> ---
> 
> Avoid spuriously triggering NMI hardlockup watchdog by touching it
> from the print functions.  show_state_filter() shares the same problem
> and solution.
> 
> v2: Relocate the comment to where it belongs.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/rcu/tree_plugin.h | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index db85ca3..bf05bf9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
>  	}
>  	t = list_entry(rnp->gp_tasks->prev,
>  		       struct task_struct, rcu_node_entry);
> -	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
> +	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
> +		/*
> +		 * We could be printing a lot while holding a spinlock.
> +		 * Avoid triggering hard lockup.
> +		 */
> +		touch_nmi_watchdog();
>  		sched_show_task(t);
> +	}
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
> 
> @@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
>  	char *ticks_title;
>  	unsigned long ticks_value;
> 
> +	/*
> +	 * We could be printing a lot while holding a spinlock.  Avoid
> +	 * triggering hard lockup.
> +	 */
> +	touch_nmi_watchdog();
> +
>  	if (rsp->gpnum == rdp->gpnum) {
>  		ticks_title = "ticks this GP";
>  		ticks_value = rdp->ticks_this_gp;
> -- 
> 2.9.5
>

Patch
diff mbox series

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3..bf05bf9 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -561,8 +561,14 @@  static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 	}
 	t = list_entry(rnp->gp_tasks->prev,
 		       struct task_struct, rcu_node_entry);
-	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
+	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
+		/*
+		 * We could be printing a lot while holding a spinlock.
+		 * Avoid triggering hard lockup.
+		 */
+		touch_nmi_watchdog();
 		sched_show_task(t);
+	}
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
@@ -1678,6 +1684,12 @@  static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
 	char *ticks_title;
 	unsigned long ticks_value;
 
+	/*
+	 * We could be printing a lot while holding a spinlock.  Avoid
+	 * triggering hard lockup.
+	 */
+	touch_nmi_watchdog();
+
 	if (rsp->gpnum == rdp->gpnum) {
 		ticks_title = "ticks this GP";
 		ticks_value = rdp->ticks_this_gp;