From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from www262.sakura.ne.jp ([202.181.97.72]:57865 "EHLO www262.sakura.ne.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752699AbeFTBP4 (ORCPT ); Tue, 19 Jun 2018 21:15:56 -0400 Message-Id: <201806200115.w5K1FgnG084419@www262.sakura.ne.jp> Subject: Re: INFO: task hung in =?ISO-2022-JP?B?X19nZXRfc3VwZXI=?= From: Tetsuo Handa To: Dmitry Vyukov Cc: syzbot , syzkaller-bugs , linux-fsdevel , LKML , Al Viro MIME-Version: 1.0 Date: Wed, 20 Jun 2018 10:15:42 +0900 References: <9d0092e6-142a-7168-7b3c-084faa9268d5@i-love.sakura.ne.jp> In-Reply-To: Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Dmitry Vyukov wrote: > On Tue, Jun 19, 2018 at 4:10 PM, Tetsuo Handa > wrote: > > On 2018/06/19 20:53, Dmitry Vyukov wrote: > >> On Tue, Jun 19, 2018 at 1:44 PM, Tetsuo Handa > >> wrote: > >>> This bug report is getting no feedback, but I guess that this bug is in > >>> block or mm or locking layer rather than fs layer. > >>> > >>> NMI backtrace for this bug tends to report that sb_bread() from fill_super() > >>> from mount_bdev() is stalling is the cause of keep holding s_umount_key for > >>> more than 120 seconds. What is strange is that NMI backtrace for this bug tends > >>> to point at rcu_read_lock()/pagecache_get_page()/radix_tree_deref_slot()/ > >>> rcu_read_unlock() which is expected not to stall. > >>> > >>> Since CONFIG_RCU_CPU_STALL_TIMEOUT is set to 120 (and actually +5 due to > >>> CONFIG_PROVE_RCU=y) which is longer than CONFIG_DEFAULT_HUNG_TASK_TIMEOUT, > >>> maybe setting CONFIG_RCU_CPU_STALL_TIMEOUT to smaller values (e.g. 25) can > >>> give us some hints... > >> > >> If an rcu stall is the true root cause of this, then I guess would see > >> "rcu stall" bug too. Rcu stall is detected after 120 seconds, but task > >> hang after 120-240 seconds. So rcu stall has much higher chances to be > >> detected. Do you see the corresponding "rcu stall" bug? > > > > RCU stall is detected after 125 seconds due to CONFIG_PROVE_RCU=y > > (e.g. https://syzkaller.appspot.com/bug?id=1fac0fd91219f3f2a03d6fa7deafc95fbed79cc2 ). > > > > I didn't find the corresponding "rcu stall" bug. But it is not required > > that one RCU stall takes longer than 120 seconds. > > > > down(); // Will take 120 seconds due to multiple RCU stalls > > rcu_read_lock(): > > do_something(); > > rcu_read_unlock(): // Took 30 seconds for unknown reason. > > rcu_read_lock(): > > do_something(); > > rcu_read_unlock(): // Took 30 seconds for unknown reason. > > rcu_read_lock(): > > do_something(); > > rcu_read_unlock(): // Took 30 seconds for unknown reason. > > rcu_read_lock(): > > do_something(); > > rcu_read_unlock(): // Took 30 seconds for unknown reason. > > up(); > > > You think this is another false positive? > Like this one https://github.com/google/syzkaller/issues/516#issuecomment-395685629 > ? According to https://syzkaller.appspot.com/text?tag=CrashLog&x=11db16c4400000 from "INFO: rcu detected stall in __process_echoes": [ 859.630022] INFO: rcu_sched self-detected stall on CPU [ 859.635509] 0-....: (124999 ticks this GP) idle=30e/1/4611686018427387906 softirq=287964/287964 fqs=31234 [ 859.645716] (t=125000 jiffies g=156333 c=156332 q=555) (...snipped...) [ 860.266660] ? process_one_work+0x1ba0/0x1ba0 [ 860.271135] ? kthread_bind+0x40/0x40 [ 860.274927] ret_from_fork+0x3a/0x50 [ 861.152252] INFO: task kworker/u4:2:59 blocked for more than 120 seconds. [ 861.159245] Not tainted 4.18.0-rc1+ #109 [ 861.163851] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. RCU stall was reported immediately before khungtaskd fires. Since syzbot assigns only 2 CPUs, it might not be rare case that a hung process was unable to run just because somebody else kept occupying CPU resources. Well, "BUG: soft lockup in __process_echoes" will be a dup of "INFO: rcu detected stall in __process_echoes". I wonder why softlockup detector waited for 135 seconds. Well, "BUG: soft lockup in shrink_dcache_parent (2)" and "BUG: soft lockup in snd_virmidi_output_trigger" and "BUG: soft lockup in smp_call_function_many" and "BUG: soft lockup in do_raw_spin_unlock (2)" as well waited for 134 seconds while "BUG: soft lockup in d_walk" waited for only 22 seconds... Anyway, I think that in some cases RCU stalls/soft lockups are the cause of hung tasks.