Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()

From: Zqiang <qiang.zhang1211@gmail.com>
To: Takashi Iwai <tiwai@suse.de>
Cc: tiwai@suse.com, alsa-devel@alsa-project.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
Date: Wed, 3 Nov 2021 13:41:47 +0800	[thread overview]
Message-ID: <b8e27c66-2fe3-aae0-a19c-e87a3e56625e@gmail.com> (raw)
In-Reply-To: <s5hcznipwsm.wl-tiwai@suse.de>

On 2021/11/2 下午8:27, Takashi Iwai wrote:
> On Tue, 02 Nov 2021 12:20:32 +0100,
> Zqiang wrote:
>>
>> On 2021/11/2 下午6:31, Takashi Iwai wrote:
>>> On Tue, 02 Nov 2021 10:41:57 +0100,
>>> Zqiang wrote:
>>>> On 2021/11/2 下午4:33, Takashi Iwai wrote:
>>>>> On Tue, 02 Nov 2021 04:32:22 +0100,
>>>>> Zqiang wrote:
>>>>>> If we have a lot of cell object, this cycle may take a long time, and
>>>>>> trigger RCU stall. insert a conditional reschedule point to fix it.
>>>>>>
>>>>>> rcu: INFO: rcu_preempt self-detected stall on CPU
>>>>>> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
>>>>>> 	softirq=16474/16475 fqs=4916
>>>>>> 	(t=10500 jiffies g=19249 q=192515)
>>>>>> NMI backtrace for cpu 1
>>>>>> ......
>>>>>> asm_sysvec_apic_timer_interrupt
>>>>>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
>>>>>> spin_unlock_irqrestore
>>>>>> snd_seq_prioq_cell_out+0x1dc/0x360
>>>>>> snd_seq_check_queue+0x1a6/0x3f0
>>>>>> snd_seq_enqueue_event+0x1ed/0x3e0
>>>>>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
>>>>>> snd_seq_write+0x2db/0x510
>>>>>> vfs_write+0x1c4/0x900
>>>>>> ksys_write+0x171/0x1d0
>>>>>> do_syscall_64+0x35/0xb0
>>>>>>
>>>>>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
>>>>>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
>>>>>> ---
>>>>>>     sound/core/seq/seq_queue.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
>>>>>> index d6c02dea976c..f5b1e4562a64 100644
>>>>>> --- a/sound/core/seq/seq_queue.c
>>>>>> +++ b/sound/core/seq/seq_queue.c
>>>>>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>>>     		if (!cell)
>>>>>>     			break;
>>>>>>     		snd_seq_dispatch_event(cell, atomic, hop);
>>>>>> +		cond_resched();
>>>>>>     	}
>>>>>>       	/* Process time queue... */
>>>>>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>>>     		if (!cell)
>>>>>>     			break;
>>>>>>     		snd_seq_dispatch_event(cell, atomic, hop);
>>>>>> +		cond_resched();
>>>>> It's good to have cond_resched() in those places but it must be done
>>>>> more carefully, as the code path may be called from the non-atomic
>>>>> context, too.  That is, it must have a check of atomic argument, and
>>>>> cond_resched() is applied only when atomic==false.
>>>>>
>>>>> But I still wonder how this gets a RCU stall out of sudden.  Looking
>>>>> through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
>>>>> it's triggered by many cases since the end of September...
>>>> I did not find useful information from the log,  through calltrace, I
>>>> guess it may be triggered by the long cycle time, which caused the
>>>> static state of the RCU to
>>>>
>>>> not be reported in time.
>>> Yes, I understand that logic.  But I wonder why this gets triggered
>>> *now* out of sudden.  The code has been present over decades, and I
>>> don't think the similar test case must have been performed by fuzzer.
>>>
>>>> I ignore the atomic parameter check,  I will resend v2 .   in
>>>> no-atomic context, we can insert
>>>>
>>>> cond_resched() to avoid this situation, but in atomic context,
>>>>
>>>> the RCU stall maybe still trigger.
>>> Right, so maybe it's better to have an upper limit for the processed
>>> cells, something like below (totally untested).
>>>
>>> Could you reproduce the problem locally?  Otherwise it's all nothing
>>> but a guess...
>> yes, this is just a guess.  I haven't reproduced locally, limiting the
>> number of cycles is a suitable modification,
>>
>> but the  MAX_CELL_PROCESSES_IN_QUEUE is an experience value.
> Yes, that's why we need the reproducer in anyway before moving
> forward.  The problem is that the patch looks as if it were fixing the
> RCU stall, but we haven't verified it at all that it is really the
> cause.  Even we haven't checked whether it's really the too many cells
> queued, or just because the concurrent queuing made the function
> re-running.

Thanks your explanation,  I think we can send your changes out and wait 
for syzbot to test.

thanks

Zqiang

>
> Takashi