From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751032AbdBTU6M convert rfc822-to-8bit (ORCPT ); Mon, 20 Feb 2017 15:58:12 -0500 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:52687 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1750844AbdBTU6K (ORCPT ); Mon, 20 Feb 2017 15:58:10 -0500 Date: Mon, 20 Feb 2017 12:58:02 -0800 From: "Paul E. McKenney" To: Sumit Gemini Cc: ralf.hildebrandt@charite.de, linux-kernel@vger.kernel.org Subject: Re: Need your help on rcu_bh_state stall on CPU 0 (t=0 jiffies) Reply-To: paulmck@linux.vnet.ibm.com References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: 8BIT In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 X-Content-Scanned: Fidelis XPS MAILER x-cbid: 17022020-0016-0000-0000-00000639C61D X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00006652; HX=3.00000240; KW=3.00000007; PH=3.00000004; SC=3.00000204; SDB=6.00824694; UDB=6.00403726; IPR=6.00602149; BA=6.00005158; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00014363; XFM=3.00000011; UTC=2017-02-20 20:58:06 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 17022020-0017-0000-0000-000037910356 Message-Id: <20170220205802.GS30506@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-02-20_17:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1612050000 definitions=main-1702200194 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 20, 2017 at 03:26:32PM +0530, Sumit Gemini wrote: > Hi Paul, > > > I got you while reading 'http://www.serverphorums.com/read.php?12,468996". I'm > facing same problem but not able to detect what exectly the cause of > it. > > > I'm observed weird problem, I got rcu_bh_state on cpu0 but my system > was in ideal state when i got rcu_bh_state (Cpu stall) problem. After > 7 second my system recovered itself from stall to active. But i did > not get what happen at that time. One thing that you might look into below... > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state detected stall on CPU 0 (t=0 jiffies) > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending NMI to all CPUs: > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI backtrace for cpu 0 > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0 > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules linked in: nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP] > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0, comm: swapper Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP: 0010:[] [] native_read_tsc+0x2/0xf > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP: 0018:ffff88043ee03db0 EFLAGS: 00000803 > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX: 0000000037185395 RBX: 00000000000003e9 RCX: 0000000000000001 > Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking applications node id:2020f ret val:0 > Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface status:0 Role:STANDBY > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role: STANDBY > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, > OwnNodeId = 131343, SupervisionTime = 60 Is this "WA IMMND DOWN" a normal occurrence? Perhaps it is pointing at a root cause. > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX: 0000000000bf0977 RSI: 0000000000000002 RDI: 0000000000032bd4 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP: 0000000000032bd4 R08: 0000000000000000 R09: ffffffff819232b0 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10: 7fffffffffffffff R11: 7fffffffffffffff R12: 0000000000000000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13: ffffffff819232b0 R14: 0000000000000001 R15: ffffffff81065c28 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS: 0000000000000000(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2: 000000000069e034 CR3: 0000000001805000 CR4: 00000000000406f0 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180d020) > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack: > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] ffffffff81200eb5 ffffffff81200f44 00000000000003e9 0000000000001000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 ffffffff819232b0 ffffffff81017698 7fffffffffffffff > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 0000000000000002 ffffffff81017fdf 0000000000000001 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call Trace: > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] paravirt_read_tsc+0x5/0x8 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] delay_tsc+0x1d/0x68 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] native_safe_apic_wait_icr_idle+0x27/0x32 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] default_send_IPI_dest_field.constprop.0+0x19/0x4d > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] default_send_IPI_mask_sequence_phys+0x38/0x6a > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] arch_trigger_all_cpu_backtrace+0x4d/0x7b > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] check_cpu_stall+0x66/0xdb > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] rcu_pending+0x1a/0x10a > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] rcu_check_callbacks+0x9d/0xae > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] update_process_times+0x31/0x63 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] tick_sched_timer+0x6a/0x90 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] __run_hrtimer+0xa4/0x148 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] hrtimer_interrupt+0xdb/0x19a > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] smp_apic_timer_interrupt+0x6e/0x80 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] apic_timer_interrupt+0x6e/0x80 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] intel_idle+0xdd/0x104 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] cpuidle_idle_call+0xda/0x169 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] cpu_idle+0x51/0x95 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] start_kernel+0x388/0x393 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [] x86_64_start_kernel+0xcf/0xdc > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74 03 e6 80 c3 e6 ed c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6 70 e4 71 c3 40 88 f0 e6 70 40 88 f8 e6 71 c3 0f 31 > Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89 d0 48 c1 e0 20 48 09 c8 c3 41 56 41 55 41 54 55 53 > > . > . > . > . > . > . > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI backtrace for cpu 31 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules linked in: nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent > blowfish cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP] > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0, comm: kworker/0:1 Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP: 0010:[] [] intel_idle+0xac/0x104 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP: 0018:ffff880425e33eb8 EFLAGS: 00000046 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX: 0000000000000030 RBX: 0000000000000010 RCX: 0000000000000001 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX: 0000000000000000 RSI: ffff880425e33fd8 RDI: ffffffff81810500 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP: 0000000000000030 R08: 000000000000006d R09: 0000000000034b10 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10: ffff88083eded830 R11: ffff88083eded830 R12: 149739342cb2ca49 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13: 0000000000000004 R14: 000000000000001f R15: 0000000000000000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS: 0000000000000000(0000) GS:ffff88083ede0000(0000) knlGS:0000000000000000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2: 00007fe89df67120 CR3: 0000000001805000 CR4: 00000000000406e0 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process kworker/0:1 (pid: 0, threadinfo ffff880425e32000, task ffff880425e30580) > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack: > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link <1.1.1:bond0-1.1.2:bond0>, requested by peer > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost link <1.1.1:bond0-1.1.2:bond0> on network plane A > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost contact with <1.1.2> > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 000000000cdd3a47 0000000000000000 000000000cdd3a47 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] ffff880425e33fd8 0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 ffffffff81304773 ffffffff819232b0 ffff880425e33fd8 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace: > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [] cpuidle_idle_call+0xda/0x169 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [] cpu_idle+0x51/0x95 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28 e0 ff ff 80 e2 08 75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 48 89 e8 0f 01 c9 3f 6e e2 ff 4c 29 e0 48 89 c7 e8 10 ae e0 ff 48 69 e8 40 42 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace: > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [] cpuidle_idle_call+0xda/0x169 > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [] cpu_idle+0x51/0x95 Since NMIs were sent to all CPUs, it is not surprising that one was found to be idle. Or is there something that I am missing in CPU 31's backtrace? > I'm using suse kernel 3.1.10 That is going back a ways! There are quite a few patches to RCU CPU stall warnings that could be applied. It might be easier to move to a newer kernel. Alternatively, if you are convinced that these stall warnings are false positives, you could boot with the rcu_cpu_stall_suppress kernel boot parameter set, or rebuild the kernel with a much larger value of the RCU_CPU_STALL_TIMEOUT Kconfig option. Thanx, Paul