From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932821AbcHJBPf (ORCPT ); Tue, 9 Aug 2016 21:15:35 -0400 Received: from szxga03-in.huawei.com ([119.145.14.66]:60392 "EHLO szxga03-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932481AbcHJBPd (ORCPT ); Tue, 9 Aug 2016 21:15:33 -0400 Subject: Re: [PATCH] rcu: Fix soft lockup for rcu_nocb_kthread To: References: <57610368.7080905@huawei.com> <20160615154913.GC3923@linux.vnet.ibm.com> <576242AB.5010204@huawei.com> <20160616141920.GO3923@linux.vnet.ibm.com> CC: , , , , "linux-kernel@vger.kernel.org" From: Ding Tianhong Message-ID: <57AA7FAA.1030801@huawei.com> Date: Wed, 10 Aug 2016 09:13:14 +0800 User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Thunderbird/38.5.1 MIME-Version: 1.0 In-Reply-To: <20160616141920.GO3923@linux.vnet.ibm.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.177.22.246] X-CFilter-Loop: Reflected X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A020202.57AA7FB1.00B0,ss=1,re=0.000,recu=0.000,reip=0.000,cl=1,cld=1,fgs=0, ip=0.0.0.0, so=2013-05-26 15:14:31, dmn=2013-03-21 17:37:32 X-Mirapoint-Loop-Id: caa10a00c219d67b04dd9a379914034a Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2016/6/16 22:19, Paul E. McKenney wrote: > On Thu, Jun 16, 2016 at 02:09:47PM +0800, Ding Tianhong wrote: >> On 2016/6/15 23:49, Paul E. McKenney wrote: >>> On Wed, Jun 15, 2016 at 03:27:36PM +0800, Ding Tianhong wrote: >>>> I met this problem when using the Testgine to send package to ixgbevf nic >>>> by this steps: >>>> 1. Connect to ixgbevf, and set the speed to 10Gb/s, it could work fine. >>>> 2. Then use ifconfig to down the nic and up again, loop for several times. >>>> 3. The system panic by soft lockup. >>> >>> Good catch, queued for review and testing. But what .config was your >>> kernel built with? >>> >> >> I use the redhat7.1 defconfig to build my kernel, and the RCU config is this: >> 120 # >> 121 # RCU Subsystem >> 122 # >> 123 CONFIG_TREE_RCU=y >> 124 # CONFIG_PREEMPT_RCU is not set >> 125 CONFIG_RCU_STALL_COMMON=y >> 126 CONFIG_CONTEXT_TRACKING=y >> 127 CONFIG_RCU_USER_QS=y >> 128 # CONFIG_CONTEXT_TRACKING_FORCE is not set >> 129 CONFIG_RCU_FANOUT=64 >> 130 CONFIG_RCU_FANOUT_LEAF=16 >> 131 # CONFIG_RCU_FANOUT_EXACT is not set >> 132 # CONFIG_RCU_FAST_NO_HZ is not set >> 133 # CONFIG_TREE_RCU_TRACE is not set >> 134 CONFIG_RCU_NOCB_CPU=y >> 135 CONFIG_RCU_NOCB_CPU_ALL=y >> 136 CONFIG_BUILD_BIN2C=y > > Thank you! You were running with preemption disabled, so your system > would indeed be very susceptible to this problem. > >>> Also, I did tweak both the commit log and the patch. Your cond_resched() >>> would prevent soft lockups, but not RCU stalls, so I substituted >>> cond_resched_rcu_qs(). Please let me know if either of those changes >>> causes problems at your end. >> >> Looks fine to me, I will apply this to my branch and test it, thanks. > > Please let me know how it goes! > > Thanx, Paul > Hi Paul: It has been a long time after applying this patch, and didn't found any problem, I believe this patch is fine, thanks. Ding >> Ding >> >>> >>> Thanx, Paul >>> >>> ------------------------------------------------------------------------ >>> >>> commit c317cf19b34c0d2787b787c38bd2c8fe433215da >>> Author: Ding Tianhong >>> Date: Wed Jun 15 15:27:36 2016 +0800 >>> >>> rcu: Fix soft lockup for rcu_nocb_kthread >>> >>> Carrying out the following steps results in a softlockup in the >>> RCU callback-offload (rcuo) kthreads: >>> >>> 1. Connect to ixgbevf, and set the speed to 10Gb/s. >>> 2. Use ifconfig to bring the nic up and down repeatedly. >>> >>> [ 317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready >>> [ 368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15] >>> [ 368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 >>> [ 368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: ffff88057dd9c000 >>> [ 368.106005] RIP: 0010:[] [] fib_table_lookup+0x14/0x390 >>> [ 368.106005] RSP: 0018:ffff88061fc83ce8 EFLAGS: 00000286 >>> [ 368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: 0000000000000001 >>> [ 368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: ffff880036d11a00 >>> [ 368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: 0000000000000000 >>> [ 368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: ffff88061fc83c58 >>> [ 368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: 00000000020155c0 >>> [ 368.106005] FS: 0000000000000000(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000 >>> [ 368.106005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: 00000000000407e0 >>> [ 368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [ 368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >>> [ 368.106005] Stack: >>> [ 368.106005] 00000000010000c0 ffff88057b766000 ffff8802e380b000 ffff88057af03e00 >>> [ 368.106005] ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 ffffffff814ee146 >>> [ 368.106005] ffff8802e380af00 00000000e380af00 ffffffff819e0900 020155c0010000c0 >>> [ 368.106005] Call Trace: >>> [ 368.106005] >>> [ 368.106005] >>> [ 368.106005] [] ip_route_input_noref+0x516/0xbd0 >>> [ 368.106005] [] ? skb_release_data+0xd6/0x110 >>> [ 368.106005] [] ? kfree_skb+0x3a/0xa0 >>> [ 368.106005] [] ip_rcv_finish+0x29f/0x350 >>> [ 368.106005] [] ip_rcv+0x234/0x380 >>> [ 368.106005] [] __netif_receive_skb_core+0x676/0x870 >>> [ 368.106005] [] __netif_receive_skb+0x18/0x60 >>> [ 368.106005] [] process_backlog+0xae/0x180 >>> [ 368.106005] [] net_rx_action+0x152/0x240 >>> [ 368.106005] [] __do_softirq+0xef/0x280 >>> [ 368.106005] [] call_softirq+0x1c/0x30 >>> [ 368.106005] >>> [ 368.106005] >>> [ 368.106005] [] do_softirq+0x65/0xa0 >>> [ 368.106005] [] local_bh_enable+0x94/0xa0 >>> [ 368.106005] [] rcu_nocb_kthread+0x232/0x370 >>> [ 368.106005] [] ? wake_up_bit+0x30/0x30 >>> [ 368.106005] [] ? rcu_start_gp+0x40/0x40 >>> [ 368.106005] [] kthread+0xcf/0xe0 >>> [ 368.106005] [] ? kthread_create_on_node+0x140/0x140 >>> [ 368.106005] [] ret_from_fork+0x58/0x90 >>> [ 368.106005] [] ? kthread_create_on_node+0x140/0x140 >>> >>> ==================================cut here============================== >>> >>> It turns out that the rcuos callback-offload kthread is busy processing >>> a very large quantity of RCU callbacks, and it is not reliquishing the >>> CPU while doing so. This commit therefore adds an cond_resched_rcu_qs() >>> within the loop to allow other tasks to run. >>> >>> Signed-off-by: Ding Tianhong >>> [ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ] >>> Signed-off-by: Paul E. McKenney >>> >>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h >>> index 0082fce402a0..85c5a883c6e3 100644 >>> --- a/kernel/rcu/tree_plugin.h >>> +++ b/kernel/rcu/tree_plugin.h >>> @@ -2173,6 +2173,7 @@ static int rcu_nocb_kthread(void *arg) >>> cl++; >>> c++; >>> local_bh_enable(); >>> + cond_resched_rcu_qs(); >>> list = next; >>> } >>> trace_rcu_batch_end(rdp->rsp->name, c, !!list, 0, 0, 1); >>> >>> >>> . >>> >> >> > > > . >