Re: [syzbot] INFO: rcu detected stall in ext4_file_write_iter (4)

From: Waiman Long <longman@redhat.com>
To: Hillf Danton <hdanton@sina.com>,
	syzbot <syzbot+03464269af631f4a4bdf@syzkaller.appspotmail.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	syzkaller-bugs@googlegroups.com
Subject: Re: [syzbot] INFO: rcu detected stall in ext4_file_write_iter (4)
Date: Mon, 27 Dec 2021 22:14:23 -0500	[thread overview]
Message-ID: <6396f046-b292-1a73-8339-d32b611d9b7f@redhat.com> (raw)
In-Reply-To: <20211225005253.1962-1-hdanton@sina.com>

On 12/24/21 19:52, Hillf Danton wrote:
> On Wed, 22 Dec 2021 09:33:26 -0800
>> syzbot found the following issue on:
>>
>> HEAD commit:    9eaa88c7036e Merge tag 'libata-5.16-rc6' of git://git.kern..
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=142d4143b00000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=e3bdfd29b408d1b6
>> dashboard link: https://syzkaller.appspot.com/bug?extid=03464269af631f4a4bdf
>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11b1537db00000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=125de549b00000
>>
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+03464269af631f4a4bdf@syzkaller.appspotmail.com
>>
>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> 	(detected by 0, t=10502 jiffies, g=5509, q=170)
>> rcu: All QSes seen, last rcu_preempt kthread activity 8642 (4295000735-4294992093), jiffies_till_next_fqs=1, root ->qsmask 0x0
>> rcu: rcu_preempt kthread starved for 8643 jiffies! g5509 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
>> rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
>> rcu: RCU grace-period kthread stack dump:
>> task:rcu_preempt     state:R  running task     stack:28208 pid:   14 ppid:     2 flags:0x00004000
>> Call Trace:
>>   <TASK>
>>   context_switch kernel/sched/core.c:4972 [inline]
>>   context_switch kernel/sched/core.c:4972 [inline] kernel/sched/core.c:6253
>>   __schedule+0xa9a/0x4940 kernel/sched/core.c:6253 kernel/sched/core.c:6253
>>   schedule+0xd2/0x260 kernel/sched/core.c:6326 kernel/sched/core.c:6326
>>   schedule_timeout+0x14a/0x2a0 kernel/time/timer.c:1881 kernel/time/timer.c:1881
>>   rcu_gp_fqs_loop+0x186/0x810 kernel/rcu/tree.c:1955 kernel/rcu/tree.c:1955
>>   rcu_gp_kthread+0x1de/0x320 kernel/rcu/tree.c:2128 kernel/rcu/tree.c:2128
>>   kthread+0x405/0x4f0 kernel/kthread.c:327 kernel/kthread.c:327
>>   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 arch/x86/entry/entry_64.S:295
>>   </TASK>
>> rcu: Stack dump where RCU GP kthread last ran:
>> NMI backtrace for cpu 0
>> CPU: 0 PID: 3672 Comm: syz-executor549 Not tainted 5.16.0-rc5-syzkaller #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> Call Trace:
>>   <IRQ>
>>   __dump_stack lib/dump_stack.c:88 [inline]
>>   __dump_stack lib/dump_stack.c:88 [inline] lib/dump_stack.c:106
>>   dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 lib/dump_stack.c:106
>>   nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:111 lib/nmi_backtrace.c:111
>>   nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62 lib/nmi_backtrace.c:62
>>   trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
>>   trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline] kernel/rcu/tree_stall.h:481
>>   rcu_check_gp_kthread_starvation.cold+0x1fb/0x200 kernel/rcu/tree_stall.h:481 kernel/rcu/tree_stall.h:481
>>   print_other_cpu_stall kernel/rcu/tree_stall.h:586 [inline]
>>   check_cpu_stall kernel/rcu/tree_stall.h:729 [inline]
>>   rcu_pending kernel/rcu/tree.c:3878 [inline]
>>   print_other_cpu_stall kernel/rcu/tree_stall.h:586 [inline] kernel/rcu/tree.c:2597
>>   check_cpu_stall kernel/rcu/tree_stall.h:729 [inline] kernel/rcu/tree.c:2597
>>   rcu_pending kernel/rcu/tree.c:3878 [inline] kernel/rcu/tree.c:2597
>>   rcu_sched_clock_irq+0x2125/0x2200 kernel/rcu/tree.c:2597 kernel/rcu/tree.c:2597
>>   update_process_times+0x16d/0x200 kernel/time/timer.c:1785 kernel/time/timer.c:1785
>>   tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226 kernel/time/tick-sched.c:226
>>   tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1428 kernel/time/tick-sched.c:1428
>>   __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
>>   __run_hrtimer kernel/time/hrtimer.c:1685 [inline] kernel/time/hrtimer.c:1749
>>   __hrtimer_run_queues+0x1c0/0xe50 kernel/time/hrtimer.c:1749 kernel/time/hrtimer.c:1749
>>   hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811 kernel/time/hrtimer.c:1811
>>   local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
>>   local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline] arch/x86/kernel/apic/apic.c:1103
>>   __sysvec_apic_timer_interrupt+0x146/0x530 arch/x86/kernel/apic/apic.c:1103 arch/x86/kernel/apic/apic.c:1103
>>   sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1097 arch/x86/kernel/apic/apic.c:1097
>>   </IRQ>
>>   <TASK>
>>   asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 arch/x86/include/asm/idtentry.h:638
>> RIP: 0010:owner_on_cpu kernel/locking/rwsem.c:667 [inline]
> The lock acquirer was spinning long enough to make RCU stall, and the
> question is, if this is not reproducible, how was the report triggered?
> Anything that syzbot is special in this case?
>
> Thanks
> 		Hillf

The test was running on a CONFIG_PREEMPT kernel. So if the syzkaller 
kthread is running at a higher priority than the rcu_preempt kthread, it 
is possible for the rcu_preempt kthread to be starved of cpu time. The 
rwsem optimistic spinning code will relinquish the cpu if there is a 
higher priority thread running. Since rcu_preempt kthread did not seem 
to be able to get the cpu, I suspect that it is probably caused by the 
syzkaller thread having a higher priority.

Cheers,
Longman