linux-kernel.vger.kernel.org archive mirror
* [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
@ 2021-11-02  3:32 Zqiang
  2021-11-02  8:33 ` Takashi Iwai
  0 siblings, 1 reply; 7+ messages in thread
From: Zqiang @ 2021-11-02  3:32 UTC (permalink / raw)
  To: tiwai; +Cc: alsa-devel, linux-kernel, Zqiang

If there are a lot of cell objects, this loop may take a long time and
trigger an RCU stall. Insert a conditional reschedule point to fix it.

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
	softirq=16474/16475 fqs=4916
	(t=10500 jiffies g=19249 q=192515)
NMI backtrace for cpu 1
......
asm_sysvec_apic_timer_interrupt
RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
spin_unlock_irqrestore
snd_seq_prioq_cell_out+0x1dc/0x360
snd_seq_check_queue+0x1a6/0x3f0
snd_seq_enqueue_event+0x1ed/0x3e0
snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
snd_seq_write+0x2db/0x510
vfs_write+0x1c4/0x900
ksys_write+0x171/0x1d0
do_syscall_64+0x35/0xb0

Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
---
 sound/core/seq/seq_queue.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
index d6c02dea976c..f5b1e4562a64 100644
--- a/sound/core/seq/seq_queue.c
+++ b/sound/core/seq/seq_queue.c
@@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
 		if (!cell)
 			break;
 		snd_seq_dispatch_event(cell, atomic, hop);
+		cond_resched();
 	}
 
 	/* Process time queue... */
@@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
 		if (!cell)
 			break;
 		snd_seq_dispatch_event(cell, atomic, hop);
+		cond_resched();
 	}
 
 	/* free lock */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
  2021-11-02  3:32 [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write() Zqiang
@ 2021-11-02  8:33 ` Takashi Iwai
  2021-11-02  9:41   ` Zqiang
  0 siblings, 1 reply; 7+ messages in thread
From: Takashi Iwai @ 2021-11-02  8:33 UTC (permalink / raw)
  To: Zqiang; +Cc: tiwai, alsa-devel, linux-kernel

On Tue, 02 Nov 2021 04:32:22 +0100,
Zqiang wrote:
> 
> If we have a lot of cell object, this cycle may take a long time, and
> trigger RCU stall. insert a conditional reschedule point to fix it.
> 
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
> 	softirq=16474/16475 fqs=4916
> 	(t=10500 jiffies g=19249 q=192515)
> NMI backtrace for cpu 1
> ......
> asm_sysvec_apic_timer_interrupt
> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
> spin_unlock_irqrestore
> snd_seq_prioq_cell_out+0x1dc/0x360
> snd_seq_check_queue+0x1a6/0x3f0
> snd_seq_enqueue_event+0x1ed/0x3e0
> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
> snd_seq_write+0x2db/0x510
> vfs_write+0x1c4/0x900
> ksys_write+0x171/0x1d0
> do_syscall_64+0x35/0xb0
> 
> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> ---
>  sound/core/seq/seq_queue.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
> index d6c02dea976c..f5b1e4562a64 100644
> --- a/sound/core/seq/seq_queue.c
> +++ b/sound/core/seq/seq_queue.c
> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>  		if (!cell)
>  			break;
>  		snd_seq_dispatch_event(cell, atomic, hop);
> +		cond_resched();
>  	}
>  
>  	/* Process time queue... */
> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>  		if (!cell)
>  			break;
>  		snd_seq_dispatch_event(cell, atomic, hop);
> +		cond_resched();


It's good to have cond_resched() in those places, but it must be done
more carefully, as the code path may also be called from atomic
context.  That is, it must check the atomic argument, and
cond_resched() should be applied only when atomic==false.
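
For illustration only, an untested sketch of that check, reusing the
loop body shown in the patch:

		if (!cell)
			break;
		snd_seq_dispatch_event(cell, atomic, hop);
		if (!atomic)
			cond_resched();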

But I still wonder how this gets an RCU stall all of a sudden.  Looking
through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
it's been triggered by many cases since the end of September...


thanks,

Takashi

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
  2021-11-02  8:33 ` Takashi Iwai
@ 2021-11-02  9:41   ` Zqiang
  2021-11-02 10:31     ` Takashi Iwai
  0 siblings, 1 reply; 7+ messages in thread
From: Zqiang @ 2021-11-02  9:41 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: tiwai, alsa-devel, linux-kernel


On 2021/11/2 4:33 PM, Takashi Iwai wrote:
> On Tue, 02 Nov 2021 04:32:22 +0100,
> Zqiang wrote:
>> If we have a lot of cell object, this cycle may take a long time, and
>> trigger RCU stall. insert a conditional reschedule point to fix it.
>>
>> rcu: INFO: rcu_preempt self-detected stall on CPU
>> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
>> 	softirq=16474/16475 fqs=4916
>> 	(t=10500 jiffies g=19249 q=192515)
>> NMI backtrace for cpu 1
>> ......
>> asm_sysvec_apic_timer_interrupt
>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
>> spin_unlock_irqrestore
>> snd_seq_prioq_cell_out+0x1dc/0x360
>> snd_seq_check_queue+0x1a6/0x3f0
>> snd_seq_enqueue_event+0x1ed/0x3e0
>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
>> snd_seq_write+0x2db/0x510
>> vfs_write+0x1c4/0x900
>> ksys_write+0x171/0x1d0
>> do_syscall_64+0x35/0xb0
>>
>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
>> ---
>>   sound/core/seq/seq_queue.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
>> index d6c02dea976c..f5b1e4562a64 100644
>> --- a/sound/core/seq/seq_queue.c
>> +++ b/sound/core/seq/seq_queue.c
>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>   		if (!cell)
>>   			break;
>>   		snd_seq_dispatch_event(cell, atomic, hop);
>> +		cond_resched();
>>   	}
>>   
>>   	/* Process time queue... */
>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>   		if (!cell)
>>   			break;
>>   		snd_seq_dispatch_event(cell, atomic, hop);
>> +		cond_resched();
>
> It's good to have cond_resched() in those places but it must be done
> more carefully, as the code path may be called from the non-atomic
> context, too.  That is, it must have a check of atomic argument, and
> cond_resched() is applied only when atomic==false.
>
> But I still wonder how this gets a RCU stall out of sudden.  Looking
> through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
> it's triggered by many cases since the end of September...

I did not find useful information in the log.  From the call trace, I
guess it may be triggered by the long loop, which keeps the RCU
quiescent state from being reported in time.

I missed the atomic parameter check; I will resend a v2.  In non-atomic
context we can insert cond_resched() to avoid this situation, but in
atomic context the RCU stall may still be triggered.
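
For reference, the tick-queue loop in snd_seq_check_queue() looks
roughly like this (paraphrased; the exact call site may differ
slightly):

	/* Process tick queue... */
	for (;;) {
		cell = snd_seq_prioq_cell_out(q->tickq, &cur_tick);
		if (!cell)
			break;
		snd_seq_dispatch_event(cell, atomic, hop);
		/*
		 * No reschedule point here: with a large enough backlog
		 * of cells, this CPU stays busy long enough for RCU to
		 * report a stall.
		 */
	}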

thanks
Zqiang

>
>
> thanks,
>
> Takashi





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
  2021-11-02  9:41   ` Zqiang
@ 2021-11-02 10:31     ` Takashi Iwai
  2021-11-02 11:20       ` Zqiang
  0 siblings, 1 reply; 7+ messages in thread
From: Takashi Iwai @ 2021-11-02 10:31 UTC (permalink / raw)
  To: Zqiang; +Cc: tiwai, alsa-devel, linux-kernel

On Tue, 02 Nov 2021 10:41:57 +0100,
Zqiang wrote:
> 
> 
> On 2021/11/2 下午4:33, Takashi Iwai wrote:
> > On Tue, 02 Nov 2021 04:32:22 +0100,
> > Zqiang wrote:
> >> If we have a lot of cell object, this cycle may take a long time, and
> >> trigger RCU stall. insert a conditional reschedule point to fix it.
> >>
> >> rcu: INFO: rcu_preempt self-detected stall on CPU
> >> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
> >> 	softirq=16474/16475 fqs=4916
> >> 	(t=10500 jiffies g=19249 q=192515)
> >> NMI backtrace for cpu 1
> >> ......
> >> asm_sysvec_apic_timer_interrupt
> >> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
> >> spin_unlock_irqrestore
> >> snd_seq_prioq_cell_out+0x1dc/0x360
> >> snd_seq_check_queue+0x1a6/0x3f0
> >> snd_seq_enqueue_event+0x1ed/0x3e0
> >> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
> >> snd_seq_write+0x2db/0x510
> >> vfs_write+0x1c4/0x900
> >> ksys_write+0x171/0x1d0
> >> do_syscall_64+0x35/0xb0
> >>
> >> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
> >> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> >> ---
> >>   sound/core/seq/seq_queue.c | 2 ++
> >>   1 file changed, 2 insertions(+)
> >>
> >> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
> >> index d6c02dea976c..f5b1e4562a64 100644
> >> --- a/sound/core/seq/seq_queue.c
> >> +++ b/sound/core/seq/seq_queue.c
> >> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
> >>   		if (!cell)
> >>   			break;
> >>   		snd_seq_dispatch_event(cell, atomic, hop);
> >> +		cond_resched();
> >>   	}
> >>     	/* Process time queue... */
> >> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
> >>   		if (!cell)
> >>   			break;
> >>   		snd_seq_dispatch_event(cell, atomic, hop);
> >> +		cond_resched();
> >
> > It's good to have cond_resched() in those places but it must be done
> > more carefully, as the code path may be called from the non-atomic
> > context, too.  That is, it must have a check of atomic argument, and
> > cond_resched() is applied only when atomic==false.
> >
> > But I still wonder how this gets a RCU stall out of sudden.  Looking
> > through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
> > it's triggered by many cases since the end of September...
> 
> I did not find useful information from the log,  through calltrace, I
> guess it may be triggered by the long cycle time, which caused the
> static state of the RCU to
> 
> not be reported in time.

Yes, I understand that logic.  But I wonder why this gets triggered
*now*, all of a sudden.  The code has been present for decades, and I
don't think the similar test case must have been performed by the
fuzzer.

> I ignore the atomic parameter check,  I will resend v2 .   in
> no-atomic context, we can insert
> 
> cond_resched() to avoid this situation, but in atomic context,
> 
> the RCU stall maybe still trigger.

Right, so maybe it's better to have an upper limit for the processed
cells, something like below (totally untested).

Could you reproduce the problem locally?  Otherwise it's all nothing
but a guess...


thanks,

Takashi

---

diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
index d6c02dea976c..7f796ee62ee7 100644
--- a/sound/core/seq/seq_queue.c
+++ b/sound/core/seq/seq_queue.c
@@ -235,12 +235,15 @@ struct snd_seq_queue *snd_seq_queue_find_name(char *name)
 
 /* -------------------------------------------------------- */
 
+#define MAX_CELL_PROCESSES_IN_QUEUE	1000
+
 void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
 {
 	unsigned long flags;
 	struct snd_seq_event_cell *cell;
 	snd_seq_tick_time_t cur_tick;
 	snd_seq_real_time_t cur_time;
+	int processed = 0;
 
 	if (q == NULL)
 		return;
@@ -263,6 +266,8 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
 		if (!cell)
 			break;
 		snd_seq_dispatch_event(cell, atomic, hop);
+		if (++processed >= MAX_CELL_PROCESSES_IN_QUEUE)
+			return; /* the rest processed at the next batch */
 	}
 
 	/* Process time queue... */
@@ -272,6 +277,8 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
 		if (!cell)
 			break;
 		snd_seq_dispatch_event(cell, atomic, hop);
+		if (++processed >= MAX_CELL_PROCESSES_IN_QUEUE)
+			return; /* the rest processed at the next batch */
 	}
 
 	/* free lock */

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
  2021-11-02 10:31     ` Takashi Iwai
@ 2021-11-02 11:20       ` Zqiang
  2021-11-02 12:27         ` Takashi Iwai
  0 siblings, 1 reply; 7+ messages in thread
From: Zqiang @ 2021-11-02 11:20 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: tiwai, alsa-devel, linux-kernel


On 2021/11/2 6:31 PM, Takashi Iwai wrote:
> On Tue, 02 Nov 2021 10:41:57 +0100,
> Zqiang wrote:
>>
>> On 2021/11/2 下午4:33, Takashi Iwai wrote:
>>> On Tue, 02 Nov 2021 04:32:22 +0100,
>>> Zqiang wrote:
>>>> If we have a lot of cell object, this cycle may take a long time, and
>>>> trigger RCU stall. insert a conditional reschedule point to fix it.
>>>>
>>>> rcu: INFO: rcu_preempt self-detected stall on CPU
>>>> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
>>>> 	softirq=16474/16475 fqs=4916
>>>> 	(t=10500 jiffies g=19249 q=192515)
>>>> NMI backtrace for cpu 1
>>>> ......
>>>> asm_sysvec_apic_timer_interrupt
>>>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
>>>> spin_unlock_irqrestore
>>>> snd_seq_prioq_cell_out+0x1dc/0x360
>>>> snd_seq_check_queue+0x1a6/0x3f0
>>>> snd_seq_enqueue_event+0x1ed/0x3e0
>>>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
>>>> snd_seq_write+0x2db/0x510
>>>> vfs_write+0x1c4/0x900
>>>> ksys_write+0x171/0x1d0
>>>> do_syscall_64+0x35/0xb0
>>>>
>>>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
>>>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
>>>> ---
>>>>    sound/core/seq/seq_queue.c | 2 ++
>>>>    1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
>>>> index d6c02dea976c..f5b1e4562a64 100644
>>>> --- a/sound/core/seq/seq_queue.c
>>>> +++ b/sound/core/seq/seq_queue.c
>>>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>    		if (!cell)
>>>>    			break;
>>>>    		snd_seq_dispatch_event(cell, atomic, hop);
>>>> +		cond_resched();
>>>>    	}
>>>>      	/* Process time queue... */
>>>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>    		if (!cell)
>>>>    			break;
>>>>    		snd_seq_dispatch_event(cell, atomic, hop);
>>>> +		cond_resched();
>>> It's good to have cond_resched() in those places but it must be done
>>> more carefully, as the code path may be called from the non-atomic
>>> context, too.  That is, it must have a check of atomic argument, and
>>> cond_resched() is applied only when atomic==false.
>>>
>>> But I still wonder how this gets a RCU stall out of sudden.  Looking
>>> through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
>>> it's triggered by many cases since the end of September...
>> I did not find useful information from the log,  through calltrace, I
>> guess it may be triggered by the long cycle time, which caused the
>> static state of the RCU to
>>
>> not be reported in time.
> Yes, I understand that logic.  But I wonder why this gets triggered
> *now* out of sudden.  The code has been present over decades, and I
> don't think the similar test case must have been performed by fuzzer.
>
>> I ignore the atomic parameter check,  I will resend v2 .   in
>> no-atomic context, we can insert
>>
>> cond_resched() to avoid this situation, but in atomic context,
>>
>> the RCU stall maybe still trigger.
> Right, so maybe it's better to have an upper limit for the processed
> cells, something like below (totally untested).
>
> Could you reproduce the problem locally?  Otherwise it's all nothing
> but a guess...

Yes, this is just a guess; I haven't reproduced it locally.  Limiting
the number of processed cells is a suitable modification, but
MAX_CELL_PROCESSES_IN_QUEUE is an empirical value.

thanks

Zqiang

>
> thanks,
>
> Takashi
>
> ---
>
> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
> index d6c02dea976c..7f796ee62ee7 100644
> --- a/sound/core/seq/seq_queue.c
> +++ b/sound/core/seq/seq_queue.c
> @@ -235,12 +235,15 @@ struct snd_seq_queue *snd_seq_queue_find_name(char *name)
>   
>   /* -------------------------------------------------------- */
>   
> +#define MAX_CELL_PROCESSES_IN_QUEUE	1000
> +
>   void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>   {
>   	unsigned long flags;
>   	struct snd_seq_event_cell *cell;
>   	snd_seq_tick_time_t cur_tick;
>   	snd_seq_real_time_t cur_time;
> +	int processed = 0;
>   
>   	if (q == NULL)
>   		return;
> @@ -263,6 +266,8 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>   		if (!cell)
>   			break;
>   		snd_seq_dispatch_event(cell, atomic, hop);
> +		if (++processed >= MAX_CELL_PROCESSES_IN_QUEUE)
> +			return; /* the rest processed at the next batch */
>   	}
>   
>   	/* Process time queue... */
> @@ -272,6 +277,8 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>   		if (!cell)
>   			break;
>   		snd_seq_dispatch_event(cell, atomic, hop);
> +		if (++processed >= MAX_CELL_PROCESSES_IN_QUEUE)
> +			return; /* the rest processed at the next batch */
>   	}
>   
>   	/* free lock */

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
  2021-11-02 11:20       ` Zqiang
@ 2021-11-02 12:27         ` Takashi Iwai
  2021-11-03  5:41           ` Zqiang
  0 siblings, 1 reply; 7+ messages in thread
From: Takashi Iwai @ 2021-11-02 12:27 UTC (permalink / raw)
  To: Zqiang; +Cc: tiwai, alsa-devel, linux-kernel

On Tue, 02 Nov 2021 12:20:32 +0100,
Zqiang wrote:
> 
> 
> On 2021/11/2 下午6:31, Takashi Iwai wrote:
> > On Tue, 02 Nov 2021 10:41:57 +0100,
> > Zqiang wrote:
> >>
> >> On 2021/11/2 下午4:33, Takashi Iwai wrote:
> >>> On Tue, 02 Nov 2021 04:32:22 +0100,
> >>> Zqiang wrote:
> >>>> If we have a lot of cell object, this cycle may take a long time, and
> >>>> trigger RCU stall. insert a conditional reschedule point to fix it.
> >>>>
> >>>> rcu: INFO: rcu_preempt self-detected stall on CPU
> >>>> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
> >>>> 	softirq=16474/16475 fqs=4916
> >>>> 	(t=10500 jiffies g=19249 q=192515)
> >>>> NMI backtrace for cpu 1
> >>>> ......
> >>>> asm_sysvec_apic_timer_interrupt
> >>>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
> >>>> spin_unlock_irqrestore
> >>>> snd_seq_prioq_cell_out+0x1dc/0x360
> >>>> snd_seq_check_queue+0x1a6/0x3f0
> >>>> snd_seq_enqueue_event+0x1ed/0x3e0
> >>>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
> >>>> snd_seq_write+0x2db/0x510
> >>>> vfs_write+0x1c4/0x900
> >>>> ksys_write+0x171/0x1d0
> >>>> do_syscall_64+0x35/0xb0
> >>>>
> >>>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
> >>>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> >>>> ---
> >>>>    sound/core/seq/seq_queue.c | 2 ++
> >>>>    1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
> >>>> index d6c02dea976c..f5b1e4562a64 100644
> >>>> --- a/sound/core/seq/seq_queue.c
> >>>> +++ b/sound/core/seq/seq_queue.c
> >>>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
> >>>>    		if (!cell)
> >>>>    			break;
> >>>>    		snd_seq_dispatch_event(cell, atomic, hop);
> >>>> +		cond_resched();
> >>>>    	}
> >>>>      	/* Process time queue... */
> >>>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
> >>>>    		if (!cell)
> >>>>    			break;
> >>>>    		snd_seq_dispatch_event(cell, atomic, hop);
> >>>> +		cond_resched();
> >>> It's good to have cond_resched() in those places but it must be done
> >>> more carefully, as the code path may be called from the non-atomic
> >>> context, too.  That is, it must have a check of atomic argument, and
> >>> cond_resched() is applied only when atomic==false.
> >>>
> >>> But I still wonder how this gets a RCU stall out of sudden.  Looking
> >>> through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
> >>> it's triggered by many cases since the end of September...
> >> I did not find useful information from the log,  through calltrace, I
> >> guess it may be triggered by the long cycle time, which caused the
> >> static state of the RCU to
> >>
> >> not be reported in time.
> > Yes, I understand that logic.  But I wonder why this gets triggered
> > *now* out of sudden.  The code has been present over decades, and I
> > don't think the similar test case must have been performed by fuzzer.
> >
> >> I ignore the atomic parameter check,  I will resend v2 .   in
> >> no-atomic context, we can insert
> >>
> >> cond_resched() to avoid this situation, but in atomic context,
> >>
> >> the RCU stall maybe still trigger.
> > Right, so maybe it's better to have an upper limit for the processed
> > cells, something like below (totally untested).
> >
> > Could you reproduce the problem locally?  Otherwise it's all nothing
> > but a guess...
> 
> yes, this is just a guess.  I haven't reproduced locally, limiting the
> number of cycles is a suitable modification,
> 
> but the  MAX_CELL_PROCESSES_IN_QUEUE is an experience value.

Yes, that's why we need the reproducer anyway before moving forward.
The problem is that the patch looks as if it were fixing the RCU
stall, but we haven't verified at all that this is really the cause.
We haven't even checked whether it's really too many queued cells, or
whether concurrent queuing keeps the function re-running.


Takashi

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()
  2021-11-02 12:27         ` Takashi Iwai
@ 2021-11-03  5:41           ` Zqiang
  0 siblings, 0 replies; 7+ messages in thread
From: Zqiang @ 2021-11-03  5:41 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: tiwai, alsa-devel, linux-kernel


On 2021/11/2 8:27 PM, Takashi Iwai wrote:
> On Tue, 02 Nov 2021 12:20:32 +0100,
> Zqiang wrote:
>>
>> On 2021/11/2 下午6:31, Takashi Iwai wrote:
>>> On Tue, 02 Nov 2021 10:41:57 +0100,
>>> Zqiang wrote:
>>>> On 2021/11/2 下午4:33, Takashi Iwai wrote:
>>>>> On Tue, 02 Nov 2021 04:32:22 +0100,
>>>>> Zqiang wrote:
>>>>>> If we have a lot of cell object, this cycle may take a long time, and
>>>>>> trigger RCU stall. insert a conditional reschedule point to fix it.
>>>>>>
>>>>>> rcu: INFO: rcu_preempt self-detected stall on CPU
>>>>>> rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
>>>>>> 	softirq=16474/16475 fqs=4916
>>>>>> 	(t=10500 jiffies g=19249 q=192515)
>>>>>> NMI backtrace for cpu 1
>>>>>> ......
>>>>>> asm_sysvec_apic_timer_interrupt
>>>>>> RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
>>>>>> spin_unlock_irqrestore
>>>>>> snd_seq_prioq_cell_out+0x1dc/0x360
>>>>>> snd_seq_check_queue+0x1a6/0x3f0
>>>>>> snd_seq_enqueue_event+0x1ed/0x3e0
>>>>>> snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
>>>>>> snd_seq_write+0x2db/0x510
>>>>>> vfs_write+0x1c4/0x900
>>>>>> ksys_write+0x171/0x1d0
>>>>>> do_syscall_64+0x35/0xb0
>>>>>>
>>>>>> Reported-by: syzbot+bb950e68b400ab4f65f8@syzkaller.appspotmail.com
>>>>>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
>>>>>> ---
>>>>>>     sound/core/seq/seq_queue.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
>>>>>> index d6c02dea976c..f5b1e4562a64 100644
>>>>>> --- a/sound/core/seq/seq_queue.c
>>>>>> +++ b/sound/core/seq/seq_queue.c
>>>>>> @@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>>>     		if (!cell)
>>>>>>     			break;
>>>>>>     		snd_seq_dispatch_event(cell, atomic, hop);
>>>>>> +		cond_resched();
>>>>>>     	}
>>>>>>       	/* Process time queue... */
>>>>>> @@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
>>>>>>     		if (!cell)
>>>>>>     			break;
>>>>>>     		snd_seq_dispatch_event(cell, atomic, hop);
>>>>>> +		cond_resched();
>>>>> It's good to have cond_resched() in those places but it must be done
>>>>> more carefully, as the code path may be called from the non-atomic
>>>>> context, too.  That is, it must have a check of atomic argument, and
>>>>> cond_resched() is applied only when atomic==false.
>>>>>
>>>>> But I still wonder how this gets a RCU stall out of sudden.  Looking
>>>>> through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
>>>>> it's triggered by many cases since the end of September...
>>>> I did not find useful information from the log,  through calltrace, I
>>>> guess it may be triggered by the long cycle time, which caused the
>>>> static state of the RCU to
>>>>
>>>> not be reported in time.
>>> Yes, I understand that logic.  But I wonder why this gets triggered
>>> *now* out of sudden.  The code has been present over decades, and I
>>> don't think the similar test case must have been performed by fuzzer.
>>>
>>>> I ignore the atomic parameter check,  I will resend v2 .   in
>>>> no-atomic context, we can insert
>>>>
>>>> cond_resched() to avoid this situation, but in atomic context,
>>>>
>>>> the RCU stall maybe still trigger.
>>> Right, so maybe it's better to have an upper limit for the processed
>>> cells, something like below (totally untested).
>>>
>>> Could you reproduce the problem locally?  Otherwise it's all nothing
>>> but a guess...
>> yes, this is just a guess.  I haven't reproduced locally, limiting the
>> number of cycles is a suitable modification,
>>
>> but the  MAX_CELL_PROCESSES_IN_QUEUE is an experience value.
> Yes, that's why we need the reproducer in anyway before moving
> forward.  The problem is that the patch looks as if it were fixing the
> RCU stall, but we haven't verified it at all that it is really the
> cause.  Even we haven't checked whether it's really the too many cells
> queued, or just because the concurrent queuing made the function
> re-running.

Thanks for your explanation.  I think we can send your changes out and
wait for syzbot to test them.
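
For reference (assuming the usual syzbot workflow; the tree below is
only an example), the change can be test-requested by replying to the
syzbot report with the patch inline and a line like:

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git for-next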

thanks

Zqiang

>
> Takashi

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-11-03  5:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-02  3:32 [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write() Zqiang
2021-11-02  8:33 ` Takashi Iwai
2021-11-02  9:41   ` Zqiang
2021-11-02 10:31     ` Takashi Iwai
2021-11-02 11:20       ` Zqiang
2021-11-02 12:27         ` Takashi Iwai
2021-11-03  5:41           ` Zqiang
