All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-17 22:23 ` Bard Liao
  0 siblings, 0 replies; 16+ messages in thread
From: Bard Liao @ 2020-08-17 22:23 UTC (permalink / raw)
  To: alsa-devel, vkoul
  Cc: vinod.koul, linux-kernel, tiwai, broonie, gregkh, jank,
	srinivas.kandagatla, rander.wang, ranjani.sridharan, hui.wang,
	pierre-louis.bossart, sanyog.r.kale, mengdong.lin, bard.liao

From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

In system suspend stress cases, the SOF CI reports timeouts. The root
cause is that an alert is generated while the system suspends. The
interrupt handling generates transactions on the bus that will never
be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume.
This error doesn't seem too problematic since it happens in a work
queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a
system suspend, or when disabling interrupts, we should make sure the
current transaction can complete, and prevent new work from being
queued.

BugLink: https://github.com/thesofproject/linux/issues/2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
---
 drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
 drivers/soundwire/cadence_master.h |  1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
index 24eafe0aa1c3..1330ffc47596 100644
--- a/drivers/soundwire/cadence_master.c
+++ b/drivers/soundwire/cadence_master.c
@@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
 			     CDNS_MCP_INT_SLAVE_MASK, 0);
 
 		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
-		schedule_work(&cdns->work);
+
+		/*
+		 * Deal with possible race condition between interrupt
+		 * handling and disabling interrupts on suspend.
+		 *
+		 * If the master is in the process of disabling
+		 * interrupts, don't schedule a workqueue
+		 */
+		if (cdns->interrupt_enabled)
+			schedule_work(&cdns->work);
 	}
 
 	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
@@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
 		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
 		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
 	}
+	cdns->interrupt_enabled = state;
+
+	/*
+	 * Complete any on-going status updates before updating masks,
+	 * and cancel queued status updates.
+	 *
+	 * There could be a race with a new interrupt thrown before
+	 * the 3 mask updates below are complete, so in the interrupt
+	 * we use the 'interrupt_enabled' status to prevent new work
+	 * from being queued.
+	 */
+	if (!state)
+		cancel_work_sync(&cdns->work);
 
 	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
 	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
index fdec62b912d3..4d1aab5b5ec2 100644
--- a/drivers/soundwire/cadence_master.h
+++ b/drivers/soundwire/cadence_master.h
@@ -133,6 +133,7 @@ struct sdw_cdns {
 
 	bool link_up;
 	unsigned int msg_count;
+	bool interrupt_enabled;
 
 	struct work_struct work;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-17 22:23 ` Bard Liao
  0 siblings, 0 replies; 16+ messages in thread
From: Bard Liao @ 2020-08-17 22:23 UTC (permalink / raw)
  To: alsa-devel, vkoul
  Cc: pierre-louis.bossart, vinod.koul, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, rander.wang, bard.liao

From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

In system suspend stress cases, the SOF CI reports timeouts. The root
cause is that an alert is generated while the system suspends. The
interrupt handling generates transactions on the bus that will never
be handled because the interrupts are disabled in parallel.

As a result, the transaction never completes and times out on resume.
This error doesn't seem too problematic since it happens in a work
queue, and the system recovers without issues.

Nevertheless, this race condition should not happen. When doing a
system suspend, or when disabling interrupts, we should make sure the
current transaction can complete, and prevent new work from being
queued.

BugLink: https://github.com/thesofproject/linux/issues/2344
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
---
 drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
 drivers/soundwire/cadence_master.h |  1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
index 24eafe0aa1c3..1330ffc47596 100644
--- a/drivers/soundwire/cadence_master.c
+++ b/drivers/soundwire/cadence_master.c
@@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
 			     CDNS_MCP_INT_SLAVE_MASK, 0);
 
 		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
-		schedule_work(&cdns->work);
+
+		/*
+		 * Deal with possible race condition between interrupt
+		 * handling and disabling interrupts on suspend.
+		 *
+		 * If the master is in the process of disabling
+		 * interrupts, don't schedule a workqueue
+		 */
+		if (cdns->interrupt_enabled)
+			schedule_work(&cdns->work);
 	}
 
 	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
@@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
 		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
 		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
 	}
+	cdns->interrupt_enabled = state;
+
+	/*
+	 * Complete any on-going status updates before updating masks,
+	 * and cancel queued status updates.
+	 *
+	 * There could be a race with a new interrupt thrown before
+	 * the 3 mask updates below are complete, so in the interrupt
+	 * we use the 'interrupt_enabled' status to prevent new work
+	 * from being queued.
+	 */
+	if (!state)
+		cancel_work_sync(&cdns->work);
 
 	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
 	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
index fdec62b912d3..4d1aab5b5ec2 100644
--- a/drivers/soundwire/cadence_master.h
+++ b/drivers/soundwire/cadence_master.h
@@ -133,6 +133,7 @@ struct sdw_cdns {
 
 	bool link_up;
 	unsigned int msg_count;
+	bool interrupt_enabled;
 
 	struct work_struct work;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-17 22:23 ` Bard Liao
@ 2020-08-19  9:06   ` Vinod Koul
  -1 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-08-19  9:06 UTC (permalink / raw)
  To: Bard Liao
  Cc: alsa-devel, linux-kernel, tiwai, broonie, gregkh, jank,
	srinivas.kandagatla, rander.wang, ranjani.sridharan, hui.wang,
	pierre-louis.bossart, sanyog.r.kale, mengdong.lin, bard.liao

On 18-08-20, 06:23, Bard Liao wrote:
> From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> 
> In system suspend stress cases, the SOF CI reports timeouts. The root
> cause is that an alert is generated while the system suspends. The
> interrupt handling generates transactions on the bus that will never
> be handled because the interrupts are disabled in parallel.
> 
> As a result, the transaction never completes and times out on resume.
> This error doesn't seem too problematic since it happens in a work
> queue, and the system recovers without issues.
> 
> Nevertheless, this race condition should not happen. When doing a
> system suspend, or when disabling interrupts, we should make sure the
> current transaction can complete, and prevent new work from being
> queued.
> 
> BugLink: https://github.com/thesofproject/linux/issues/2344
> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
> Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
> ---
>  drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
>  drivers/soundwire/cadence_master.h |  1 +
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
> index 24eafe0aa1c3..1330ffc47596 100644
> --- a/drivers/soundwire/cadence_master.c
> +++ b/drivers/soundwire/cadence_master.c
> @@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
>  			     CDNS_MCP_INT_SLAVE_MASK, 0);
>  
>  		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
> -		schedule_work(&cdns->work);
> +
> +		/*
> +		 * Deal with possible race condition between interrupt
> +		 * handling and disabling interrupts on suspend.
> +		 *
> +		 * If the master is in the process of disabling
> +		 * interrupts, don't schedule a workqueue
> +		 */
> +		if (cdns->interrupt_enabled)
> +			schedule_work(&cdns->work);

would it not make sense to mask the interrupts first and then cancel the
work? that way you are guaranteed that after this call you dont have
interrupts and work scheduled?

>  	}
>  
>  	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
> @@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
>  		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
>  		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
>  	}
> +	cdns->interrupt_enabled = state;
> +
> +	/*
> +	 * Complete any on-going status updates before updating masks,
> +	 * and cancel queued status updates.
> +	 *
> +	 * There could be a race with a new interrupt thrown before
> +	 * the 3 mask updates below are complete, so in the interrupt
> +	 * we use the 'interrupt_enabled' status to prevent new work
> +	 * from being queued.
> +	 */
> +	if (!state)
> +		cancel_work_sync(&cdns->work);
>  
>  	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
>  	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
> diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
> index fdec62b912d3..4d1aab5b5ec2 100644
> --- a/drivers/soundwire/cadence_master.h
> +++ b/drivers/soundwire/cadence_master.h
> @@ -133,6 +133,7 @@ struct sdw_cdns {
>  
>  	bool link_up;
>  	unsigned int msg_count;
> +	bool interrupt_enabled;
>  
>  	struct work_struct work;
>  
> -- 
> 2.17.1

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-19  9:06   ` Vinod Koul
  0 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-08-19  9:06 UTC (permalink / raw)
  To: Bard Liao
  Cc: pierre-louis.bossart, alsa-devel, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, rander.wang, bard.liao

On 18-08-20, 06:23, Bard Liao wrote:
> From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> 
> In system suspend stress cases, the SOF CI reports timeouts. The root
> cause is that an alert is generated while the system suspends. The
> interrupt handling generates transactions on the bus that will never
> be handled because the interrupts are disabled in parallel.
> 
> As a result, the transaction never completes and times out on resume.
> This error doesn't seem too problematic since it happens in a work
> queue, and the system recovers without issues.
> 
> Nevertheless, this race condition should not happen. When doing a
> system suspend, or when disabling interrupts, we should make sure the
> current transaction can complete, and prevent new work from being
> queued.
> 
> BugLink: https://github.com/thesofproject/linux/issues/2344
> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
> Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
> ---
>  drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
>  drivers/soundwire/cadence_master.h |  1 +
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
> index 24eafe0aa1c3..1330ffc47596 100644
> --- a/drivers/soundwire/cadence_master.c
> +++ b/drivers/soundwire/cadence_master.c
> @@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
>  			     CDNS_MCP_INT_SLAVE_MASK, 0);
>  
>  		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
> -		schedule_work(&cdns->work);
> +
> +		/*
> +		 * Deal with possible race condition between interrupt
> +		 * handling and disabling interrupts on suspend.
> +		 *
> +		 * If the master is in the process of disabling
> +		 * interrupts, don't schedule a workqueue
> +		 */
> +		if (cdns->interrupt_enabled)
> +			schedule_work(&cdns->work);

would it not make sense to mask the interrupts first and then cancel the
work? that way you are guaranteed that after this call you dont have
interrupts and work scheduled?

>  	}
>  
>  	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
> @@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
>  		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
>  		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
>  	}
> +	cdns->interrupt_enabled = state;
> +
> +	/*
> +	 * Complete any on-going status updates before updating masks,
> +	 * and cancel queued status updates.
> +	 *
> +	 * There could be a race with a new interrupt thrown before
> +	 * the 3 mask updates below are complete, so in the interrupt
> +	 * we use the 'interrupt_enabled' status to prevent new work
> +	 * from being queued.
> +	 */
> +	if (!state)
> +		cancel_work_sync(&cdns->work);
>  
>  	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
>  	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
> diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
> index fdec62b912d3..4d1aab5b5ec2 100644
> --- a/drivers/soundwire/cadence_master.h
> +++ b/drivers/soundwire/cadence_master.h
> @@ -133,6 +133,7 @@ struct sdw_cdns {
>  
>  	bool link_up;
>  	unsigned int msg_count;
> +	bool interrupt_enabled;
>  
>  	struct work_struct work;
>  
> -- 
> 2.17.1

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-19  9:06   ` Vinod Koul
  (?)
@ 2020-08-19 12:51   ` Pierre-Louis Bossart
  2020-08-21  5:07       ` Vinod Koul
  -1 siblings, 1 reply; 16+ messages in thread
From: Pierre-Louis Bossart @ 2020-08-19 12:51 UTC (permalink / raw)
  To: Vinod Koul, Bard Liao
  Cc: alsa-devel, tiwai, gregkh, linux-kernel, ranjani.sridharan,
	hui.wang, broonie, srinivas.kandagatla, jank, mengdong.lin,
	sanyog.r.kale, rander.wang, bard.liao



On 8/19/20 4:06 AM, Vinod Koul wrote:
> On 18-08-20, 06:23, Bard Liao wrote:
>> From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
>>
>> In system suspend stress cases, the SOF CI reports timeouts. The root
>> cause is that an alert is generated while the system suspends. The
>> interrupt handling generates transactions on the bus that will never
>> be handled because the interrupts are disabled in parallel.
>>
>> As a result, the transaction never completes and times out on resume.
>> This error doesn't seem too problematic since it happens in a work
>> queue, and the system recovers without issues.
>>
>> Nevertheless, this race condition should not happen. When doing a
>> system suspend, or when disabling interrupts, we should make sure the
>> current transaction can complete, and prevent new work from being
>> queued.
>>
>> BugLink: https://github.com/thesofproject/linux/issues/2344
>> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
>> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
>> Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
>> Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
>> ---
>>   drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
>>   drivers/soundwire/cadence_master.h |  1 +
>>   2 files changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
>> index 24eafe0aa1c3..1330ffc47596 100644
>> --- a/drivers/soundwire/cadence_master.c
>> +++ b/drivers/soundwire/cadence_master.c
>> @@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
>>   			     CDNS_MCP_INT_SLAVE_MASK, 0);
>>   
>>   		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
>> -		schedule_work(&cdns->work);
>> +
>> +		/*
>> +		 * Deal with possible race condition between interrupt
>> +		 * handling and disabling interrupts on suspend.
>> +		 *
>> +		 * If the master is in the process of disabling
>> +		 * interrupts, don't schedule a workqueue
>> +		 */
>> +		if (cdns->interrupt_enabled)
>> +			schedule_work(&cdns->work);
> 
> would it not make sense to mask the interrupts first and then cancel the
> work? that way you are guaranteed that after this call you dont have
> interrupts and work scheduled?

cancel_work_sync() will either
a) wait until the current work completes, or
b) prevent a new one from starting.

there's no way to really 'abort' a workqueue, 'cancel' means either 
complete or don't start.

if you disable the interrupts then cancel the work, you have a risk of 
not letting the work complete if it already started (case a).

The race is
a) the interrupt thread (this function) starts
b) the work is scheduled and starts
c) the suspend handler starts and disables interrupts in [1] below.
d) the work initiates transactions which will never complete since 
Cadence interrupts have been disabled.

So the idea was that before disabling interrupts, the suspend handler 
changes the status, and then calls cancel_work_sync(). the status is 
also used to prevent a new work from being scheduled if you already know 
the suspend is on-going. The test on the status above is not strictly 
necessary, I believe the sequence is safe without it but it avoid 
starting a useless work.

If you want to follow the flow it's better to start with what the 
suspend handler does below first, then look at how the interrupt thread 
might interfere. The diff format does not help, might be also easier to 
apply the patch and look at the rest of the code, e.g the 3 mask updates 
mentioned below are not included in the diff.

> 
>>   	}
>>   
>>   	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
>> @@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
>>   		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
>>   		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
>>   	}

[1]

>> +	cdns->interrupt_enabled = state;
>> +
>> +	/*
>> +	 * Complete any on-going status updates before updating masks,
>> +	 * and cancel queued status updates.
>> +	 *
>> +	 * There could be a race with a new interrupt thrown before
>> +	 * the 3 mask updates below are complete, so in the interrupt
>> +	 * we use the 'interrupt_enabled' status to prevent new work
>> +	 * from being queued.
>> +	 */
>> +	if (!state)
>> +		cancel_work_sync(&cdns->work);
>>   
>>   	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
>>   	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
>> diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
>> index fdec62b912d3..4d1aab5b5ec2 100644
>> --- a/drivers/soundwire/cadence_master.h
>> +++ b/drivers/soundwire/cadence_master.h
>> @@ -133,6 +133,7 @@ struct sdw_cdns {
>>   
>>   	bool link_up;
>>   	unsigned int msg_count;
>> +	bool interrupt_enabled;
>>   
>>   	struct work_struct work;
>>   
>> -- 
>> 2.17.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-19 12:51   ` Pierre-Louis Bossart
@ 2020-08-21  5:07       ` Vinod Koul
  0 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-08-21  5:07 UTC (permalink / raw)
  To: Pierre-Louis Bossart
  Cc: Bard Liao, alsa-devel, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, rander.wang, bard.liao

On 19-08-20, 07:51, Pierre-Louis Bossart wrote:
> 
> 
> On 8/19/20 4:06 AM, Vinod Koul wrote:
> > On 18-08-20, 06:23, Bard Liao wrote:
> > > From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> > > 
> > > In system suspend stress cases, the SOF CI reports timeouts. The root
> > > cause is that an alert is generated while the system suspends. The
> > > interrupt handling generates transactions on the bus that will never
> > > be handled because the interrupts are disabled in parallel.
> > > 
> > > As a result, the transaction never completes and times out on resume.
> > > This error doesn't seem too problematic since it happens in a work
> > > queue, and the system recovers without issues.
> > > 
> > > Nevertheless, this race condition should not happen. When doing a
> > > system suspend, or when disabling interrupts, we should make sure the
> > > current transaction can complete, and prevent new work from being
> > > queued.
> > > 
> > > BugLink: https://github.com/thesofproject/linux/issues/2344
> > > Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> > > Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
> > > Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
> > > Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
> > > ---
> > >   drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
> > >   drivers/soundwire/cadence_master.h |  1 +
> > >   2 files changed, 24 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
> > > index 24eafe0aa1c3..1330ffc47596 100644
> > > --- a/drivers/soundwire/cadence_master.c
> > > +++ b/drivers/soundwire/cadence_master.c
> > > @@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
> > >   			     CDNS_MCP_INT_SLAVE_MASK, 0);
> > >   		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
> > > -		schedule_work(&cdns->work);
> > > +
> > > +		/*
> > > +		 * Deal with possible race condition between interrupt
> > > +		 * handling and disabling interrupts on suspend.
> > > +		 *
> > > +		 * If the master is in the process of disabling
> > > +		 * interrupts, don't schedule a workqueue
> > > +		 */
> > > +		if (cdns->interrupt_enabled)
> > > +			schedule_work(&cdns->work);
> > 
> > would it not make sense to mask the interrupts first and then cancel the
> > work? that way you are guaranteed that after this call you dont have
> > interrupts and work scheduled?
> 
> cancel_work_sync() will either
> a) wait until the current work completes, or
> b) prevent a new one from starting.
> 
> there's no way to really 'abort' a workqueue, 'cancel' means either complete
> or don't start.

Quite right, as that is how everyone deals with it. Stop the irq from
firing first and then wait until work is cancelled or completed, hence
cancel_work_sync()

> if you disable the interrupts then cancel the work, you have a risk of not
> letting the work complete if it already started (case a).
> 
> The race is
> a) the interrupt thread (this function) starts
> b) the work is scheduled and starts
> c) the suspend handler starts and disables interrupts in [1] below.
> d) the work initiates transactions which will never complete since Cadence
> interrupts have been disabled.

Would it not be better to let work handle the case of interrupts
disabled and not initiates transactions which wont complete here? That
sounds more reasonable to do rather than complete the work which anyone
doesn't matter as you are suspending

> So the idea was that before disabling interrupts, the suspend handler
> changes the status, and then calls cancel_work_sync(). the status is also
> used to prevent a new work from being scheduled if you already know the
> suspend is on-going. The test on the status above is not strictly necessary,
> I believe the sequence is safe without it but it avoid starting a useless
> work.
> 
> If you want to follow the flow it's better to start with what the suspend
> handler does below first, then look at how the interrupt thread might
> interfere. The diff format does not help, might be also easier to apply the
> patch and look at the rest of the code, e.g the 3 mask updates mentioned
> below are not included in the diff.
> 
> > 
> > >   	}
> > >   	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
> > > @@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
> > >   		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
> > >   		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
> > >   	}
> 
> [1]
> 
> > > +	cdns->interrupt_enabled = state;
> > > +
> > > +	/*
> > > +	 * Complete any on-going status updates before updating masks,
> > > +	 * and cancel queued status updates.
> > > +	 *
> > > +	 * There could be a race with a new interrupt thrown before
> > > +	 * the 3 mask updates below are complete, so in the interrupt
> > > +	 * we use the 'interrupt_enabled' status to prevent new work
> > > +	 * from being queued.
> > > +	 */
> > > +	if (!state)
> > > +		cancel_work_sync(&cdns->work);
> > >   	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
> > >   	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
> > > diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
> > > index fdec62b912d3..4d1aab5b5ec2 100644
> > > --- a/drivers/soundwire/cadence_master.h
> > > +++ b/drivers/soundwire/cadence_master.h
> > > @@ -133,6 +133,7 @@ struct sdw_cdns {
> > >   	bool link_up;
> > >   	unsigned int msg_count;
> > > +	bool interrupt_enabled;
> > >   	struct work_struct work;
> > > -- 
> > > 2.17.1
> > 

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-21  5:07       ` Vinod Koul
  0 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-08-21  5:07 UTC (permalink / raw)
  To: Pierre-Louis Bossart
  Cc: alsa-devel, tiwai, gregkh, linux-kernel, ranjani.sridharan,
	hui.wang, broonie, srinivas.kandagatla, jank, mengdong.lin,
	sanyog.r.kale, Bard Liao, rander.wang, bard.liao

On 19-08-20, 07:51, Pierre-Louis Bossart wrote:
> 
> 
> On 8/19/20 4:06 AM, Vinod Koul wrote:
> > On 18-08-20, 06:23, Bard Liao wrote:
> > > From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> > > 
> > > In system suspend stress cases, the SOF CI reports timeouts. The root
> > > cause is that an alert is generated while the system suspends. The
> > > interrupt handling generates transactions on the bus that will never
> > > be handled because the interrupts are disabled in parallel.
> > > 
> > > As a result, the transaction never completes and times out on resume.
> > > This error doesn't seem too problematic since it happens in a work
> > > queue, and the system recovers without issues.
> > > 
> > > Nevertheless, this race condition should not happen. When doing a
> > > system suspend, or when disabling interrupts, we should make sure the
> > > current transaction can complete, and prevent new work from being
> > > queued.
> > > 
> > > BugLink: https://github.com/thesofproject/linux/issues/2344
> > > Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> > > Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
> > > Reviewed-by: Rander Wang <rander.wang@linux.intel.com>
> > > Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
> > > ---
> > >   drivers/soundwire/cadence_master.c | 24 +++++++++++++++++++++++-
> > >   drivers/soundwire/cadence_master.h |  1 +
> > >   2 files changed, 24 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/soundwire/cadence_master.c b/drivers/soundwire/cadence_master.c
> > > index 24eafe0aa1c3..1330ffc47596 100644
> > > --- a/drivers/soundwire/cadence_master.c
> > > +++ b/drivers/soundwire/cadence_master.c
> > > @@ -791,7 +791,16 @@ irqreturn_t sdw_cdns_irq(int irq, void *dev_id)
> > >   			     CDNS_MCP_INT_SLAVE_MASK, 0);
> > >   		int_status &= ~CDNS_MCP_INT_SLAVE_MASK;
> > > -		schedule_work(&cdns->work);
> > > +
> > > +		/*
> > > +		 * Deal with possible race condition between interrupt
> > > +		 * handling and disabling interrupts on suspend.
> > > +		 *
> > > +		 * If the master is in the process of disabling
> > > +		 * interrupts, don't schedule a workqueue
> > > +		 */
> > > +		if (cdns->interrupt_enabled)
> > > +			schedule_work(&cdns->work);
> > 
> > would it not make sense to mask the interrupts first and then cancel the
> > work? that way you are guaranteed that after this call you dont have
> > interrupts and work scheduled?
> 
> cancel_work_sync() will either
> a) wait until the current work completes, or
> b) prevent a new one from starting.
> 
> there's no way to really 'abort' a workqueue, 'cancel' means either complete
> or don't start.

Quite right, as that is how everyone deals with it. Stop the irq from
firing first and then wait until work is cancelled or completed, hence
cancel_work_sync()

> if you disable the interrupts then cancel the work, you have a risk of not
> letting the work complete if it already started (case a).
> 
> The race is
> a) the interrupt thread (this function) starts
> b) the work is scheduled and starts
> c) the suspend handler starts and disables interrupts in [1] below.
> d) the work initiates transactions which will never complete since Cadence
> interrupts have been disabled.

Would it not be better to let work handle the case of interrupts
disabled and not initiates transactions which wont complete here? That
sounds more reasonable to do rather than complete the work which anyone
doesn't matter as you are suspending

> So the idea was that before disabling interrupts, the suspend handler
> changes the status, and then calls cancel_work_sync(). the status is also
> used to prevent a new work from being scheduled if you already know the
> suspend is on-going. The test on the status above is not strictly necessary,
> I believe the sequence is safe without it but it avoid starting a useless
> work.
> 
> If you want to follow the flow it's better to start with what the suspend
> handler does below first, then look at how the interrupt thread might
> interfere. The diff format does not help, might be also easier to apply the
> patch and look at the rest of the code, e.g the 3 mask updates mentioned
> below are not included in the diff.
> 
> > 
> > >   	}
> > >   	cdns_writel(cdns, CDNS_MCP_INTSTAT, int_status);
> > > @@ -924,6 +933,19 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
> > >   		slave_state = cdns_readl(cdns, CDNS_MCP_SLAVE_INTSTAT1);
> > >   		cdns_writel(cdns, CDNS_MCP_SLAVE_INTSTAT1, slave_state);
> > >   	}
> 
> [1]
> 
> > > +	cdns->interrupt_enabled = state;
> > > +
> > > +	/*
> > > +	 * Complete any on-going status updates before updating masks,
> > > +	 * and cancel queued status updates.
> > > +	 *
> > > +	 * There could be a race with a new interrupt thrown before
> > > +	 * the 3 mask updates below are complete, so in the interrupt
> > > +	 * we use the 'interrupt_enabled' status to prevent new work
> > > +	 * from being queued.
> > > +	 */
> > > +	if (!state)
> > > +		cancel_work_sync(&cdns->work);
> > >   	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
> > >   	cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
> > > diff --git a/drivers/soundwire/cadence_master.h b/drivers/soundwire/cadence_master.h
> > > index fdec62b912d3..4d1aab5b5ec2 100644
> > > --- a/drivers/soundwire/cadence_master.h
> > > +++ b/drivers/soundwire/cadence_master.h
> > > @@ -133,6 +133,7 @@ struct sdw_cdns {
> > >   	bool link_up;
> > >   	unsigned int msg_count;
> > > +	bool interrupt_enabled;
> > >   	struct work_struct work;
> > > -- 
> > > 2.17.1
> > 

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-21  5:07       ` Vinod Koul
@ 2020-08-21 15:17         ` Pierre-Louis Bossart
  -1 siblings, 0 replies; 16+ messages in thread
From: Pierre-Louis Bossart @ 2020-08-21 15:17 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Bard Liao, alsa-devel, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, rander.wang, bard.liao



>> cancel_work_sync() will either
>> a) wait until the current work completes, or
>> b) prevent a new one from starting.
>>
>> there's no way to really 'abort' a workqueue, 'cancel' means either complete
>> or don't start.
> 
> Quite right, as that is how everyone deals with it. Stop the irq from
> firing first and then wait until work is cancelled or completed, hence
> cancel_work_sync()

No, this cannot work... The work queue in progress will initiate 
transactions which would never complete because the interrupts are disabled.

>> if you disable the interrupts then cancel the work, you have a risk of not
>> letting the work complete if it already started (case a).
>>
>> The race is
>> a) the interrupt thread (this function) starts
>> b) the work is scheduled and starts
>> c) the suspend handler starts and disables interrupts in [1] below.
>> d) the work initiates transactions which will never complete since Cadence
>> interrupts have been disabled.
> 
> Would it not be better to let work handle the case of interrupts
> disabled and not initiates transactions which wont complete here? That
> sounds more reasonable to do rather than complete the work which anyone
> doesn't matter as you are suspending

A in-progress workqueue has no notion that interrupts are disabled, nor 
that the device is in the process of suspending. It writes into a fifo 
and blocks with wait_for_completion(). the complete() is done in an 
interrupt thread, triggered when the RX Fifo reaches a watermark.

So if you disable interrupts, the complete() never happens.

The safe way to do things with the current code is to let the workqueue 
complete, then disable interrupts.

We only disable interrupts on suspend, we don't need to test if 
interrupts are enabled for every single byte that's transmitted on the 
bus. Not to mention that this additional test would be racy as hell and 
require yet another synchronization primitive making the code more 
complicated.

So yes, the current transactions will complete and will be ignored, but 
it's a lot better than trying to prevent these transactions from 
happening with extra layers of complexity that will impact *every* 
transaction.

BTW I looked at another alternative based on the msg lock, so that 
interrupts cannot be disabled while a message is being sent. However 
because a workqueue may send multiple messages, e.g. to read registers 
are non-contiguous locations, there is no way to guarantee what happens 
how messages and interrupt disabling are scheduled, so there'd still be 
a case of a workqueue not completing and being stuck on a mutex_lock(), 
not so good either.

In short, this is the simplest way to fix the timeout on resume.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-21 15:17         ` Pierre-Louis Bossart
  0 siblings, 0 replies; 16+ messages in thread
From: Pierre-Louis Bossart @ 2020-08-21 15:17 UTC (permalink / raw)
  To: Vinod Koul
  Cc: alsa-devel, tiwai, gregkh, linux-kernel, ranjani.sridharan,
	hui.wang, broonie, srinivas.kandagatla, jank, mengdong.lin,
	sanyog.r.kale, Bard Liao, rander.wang, bard.liao



>> cancel_work_sync() will either
>> a) wait until the current work completes, or
>> b) prevent a new one from starting.
>>
>> there's no way to really 'abort' a workqueue, 'cancel' means either complete
>> or don't start.
> 
> Quite right, as that is how everyone deals with it. Stop the irq from
> firing first and then wait until work is cancelled or completed, hence
> cancel_work_sync()

No, this cannot work... The work queue in progress will initiate 
transactions which would never complete because the interrupts are disabled.

>> if you disable the interrupts then cancel the work, you have a risk of not
>> letting the work complete if it already started (case a).
>>
>> The race is
>> a) the interrupt thread (this function) starts
>> b) the work is scheduled and starts
>> c) the suspend handler starts and disables interrupts in [1] below.
>> d) the work initiates transactions which will never complete since Cadence
>> interrupts have been disabled.
> 
> Would it not be better to let work handle the case of interrupts
> disabled and not initiates transactions which wont complete here? That
> sounds more reasonable to do rather than complete the work which anyone
> doesn't matter as you are suspending

A in-progress workqueue has no notion that interrupts are disabled, nor 
that the device is in the process of suspending. It writes into a fifo 
and blocks with wait_for_completion(). the complete() is done in an 
interrupt thread, triggered when the RX Fifo reaches a watermark.

So if you disable interrupts, the complete() never happens.

The safe way to do things with the current code is to let the workqueue 
complete, then disable interrupts.

We only disable interrupts on suspend, we don't need to test if 
interrupts are enabled for every single byte that's transmitted on the 
bus. Not to mention that this additional test would be racy as hell and 
require yet another synchronization primitive making the code more 
complicated.

So yes, the current transactions will complete and will be ignored, but 
it's a lot better than trying to prevent these transactions from 
happening with extra layers of complexity that will impact *every* 
transaction.

BTW I looked at another alternative based on the msg lock, so that 
interrupts cannot be disabled while a message is being sent. However 
because a workqueue may send multiple messages, e.g. to read registers 
are non-contiguous locations, there is no way to guarantee what happens 
how messages and interrupt disabling are scheduled, so there'd still be 
a case of a workqueue not completing and being stuck on a mutex_lock(), 
not so good either.

In short, this is the simplest way to fix the timeout on resume.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-21 15:17         ` Pierre-Louis Bossart
@ 2020-08-28  8:00           ` Vinod Koul
  -1 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-08-28  8:00 UTC (permalink / raw)
  To: Pierre-Louis Bossart
  Cc: Bard Liao, alsa-devel, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, rander.wang, bard.liao

On 21-08-20, 10:17, Pierre-Louis Bossart wrote:
> 
> 
> > > cancel_work_sync() will either
> > > a) wait until the current work completes, or
> > > b) prevent a new one from starting.
> > > 
> > > there's no way to really 'abort' a workqueue, 'cancel' means either complete
> > > or don't start.
> > 
> > Quite right, as that is how everyone deals with it. Stop the irq from
> > firing first and then wait until work is cancelled or completed, hence
> > cancel_work_sync()
> 
> No, this cannot work... The work queue in progress will initiate
> transactions which would never complete because the interrupts are disabled.

Okay that makes sense then

> > > if you disable the interrupts then cancel the work, you have a risk of not
> > > letting the work complete if it already started (case a).
> > > 
> > > The race is
> > > a) the interrupt thread (this function) starts
> > > b) the work is scheduled and starts
> > > c) the suspend handler starts and disables interrupts in [1] below.
> > > d) the work initiates transactions which will never complete since Cadence
> > > interrupts have been disabled.
> > 
> > Would it not be better to let work handle the case of interrupts
> > disabled and not initiates transactions which wont complete here? That
> > sounds more reasonable to do rather than complete the work which anyone
> > doesn't matter as you are suspending
> 
> A in-progress workqueue has no notion that interrupts are disabled, nor that
> the device is in the process of suspending. It writes into a fifo and blocks
> with wait_for_completion(). the complete() is done in an interrupt thread,
> triggered when the RX Fifo reaches a watermark.
> 
> So if you disable interrupts, the complete() never happens.
> 
> The safe way to do things with the current code is to let the workqueue
> complete, then disable interrupts.
> 
> We only disable interrupts on suspend, we don't need to test if interrupts
> are enabled for every single byte that's transmitted on the bus. Not to
> mention that this additional test would be racy as hell and require yet
> another synchronization primitive making the code more complicated.
> 
> So yes, the current transactions will complete and will be ignored, but it's
> a lot better than trying to prevent these transactions from happening with
> extra layers of complexity that will impact *every* transaction.
> 
> BTW I looked at another alternative based on the msg lock, so that
> interrupts cannot be disabled while a message is being sent. However because
> a workqueue may send multiple messages, e.g. to read registers are
> non-contiguous locations, there is no way to guarantee what happens how
> messages and interrupt disabling are scheduled, so there'd still be a case
> of a workqueue not completing and being stuck on a mutex_lock(), not so good
> either.
> 
> In short, this is the simplest way to fix the timeout on resume.

Is this timeout for suspend or resume? Somehow I was under the
assumption that it is former? Or is the result seen on resume?

Rereading the race describe above in steps, I think this should be
handled in step c above. Btw is that suspend or runtime suspend which
causes this? Former would be bigger issue as we should not have work
running when we return from suspend call. Latter should be dealt with
anyway as device might be off after suspend.

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-28  8:00           ` Vinod Koul
  0 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-08-28  8:00 UTC (permalink / raw)
  To: Pierre-Louis Bossart
  Cc: alsa-devel, tiwai, gregkh, linux-kernel, ranjani.sridharan,
	hui.wang, broonie, srinivas.kandagatla, jank, mengdong.lin,
	sanyog.r.kale, Bard Liao, rander.wang, bard.liao

On 21-08-20, 10:17, Pierre-Louis Bossart wrote:
> 
> 
> > > cancel_work_sync() will either
> > > a) wait until the current work completes, or
> > > b) prevent a new one from starting.
> > > 
> > > there's no way to really 'abort' a workqueue, 'cancel' means either complete
> > > or don't start.
> > 
> > Quite right, as that is how everyone deals with it. Stop the irq from
> > firing first and then wait until work is cancelled or completed, hence
> > cancel_work_sync()
> 
> No, this cannot work... The work queue in progress will initiate
> transactions which would never complete because the interrupts are disabled.

Okay that makes sense then

> > > if you disable the interrupts then cancel the work, you have a risk of not
> > > letting the work complete if it already started (case a).
> > > 
> > > The race is
> > > a) the interrupt thread (this function) starts
> > > b) the work is scheduled and starts
> > > c) the suspend handler starts and disables interrupts in [1] below.
> > > d) the work initiates transactions which will never complete since Cadence
> > > interrupts have been disabled.
> > 
> > Would it not be better to let work handle the case of interrupts
> > disabled and not initiates transactions which wont complete here? That
> > sounds more reasonable to do rather than complete the work which anyone
> > doesn't matter as you are suspending
> 
> A in-progress workqueue has no notion that interrupts are disabled, nor that
> the device is in the process of suspending. It writes into a fifo and blocks
> with wait_for_completion(). the complete() is done in an interrupt thread,
> triggered when the RX Fifo reaches a watermark.
> 
> So if you disable interrupts, the complete() never happens.
> 
> The safe way to do things with the current code is to let the workqueue
> complete, then disable interrupts.
> 
> We only disable interrupts on suspend, we don't need to test if interrupts
> are enabled for every single byte that's transmitted on the bus. Not to
> mention that this additional test would be racy as hell and require yet
> another synchronization primitive making the code more complicated.
> 
> So yes, the current transactions will complete and will be ignored, but it's
> a lot better than trying to prevent these transactions from happening with
> extra layers of complexity that will impact *every* transaction.
> 
> BTW I looked at another alternative based on the msg lock, so that
> interrupts cannot be disabled while a message is being sent. However because
> a workqueue may send multiple messages, e.g. to read registers are
> non-contiguous locations, there is no way to guarantee what happens how
> messages and interrupt disabling are scheduled, so there'd still be a case
> of a workqueue not completing and being stuck on a mutex_lock(), not so good
> either.
> 
> In short, this is the simplest way to fix the timeout on resume.

Is this timeout for suspend or resume? Somehow I was under the
assumption that it is former? Or is the result seen on resume?

Rereading the race describe above in steps, I think this should be
handled in step c above. Btw is that suspend or runtime suspend which
causes this? Former would be bigger issue as we should not have work
running when we return from suspend call. Latter should be dealt with
anyway as device might be off after suspend.

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-28  8:00           ` Vinod Koul
@ 2020-08-28 15:14             ` Pierre-Louis Bossart
  -1 siblings, 0 replies; 16+ messages in thread
From: Pierre-Louis Bossart @ 2020-08-28 15:14 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Bard Liao, alsa-devel, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, rander.wang, bard.liao




> Is this timeout for suspend or resume? Somehow I was under the
> assumption that it is former? Or is the result seen on resume?
> 
> Rereading the race describe above in steps, I think this should be
> handled in step c above. Btw is that suspend or runtime suspend which
> causes this? Former would be bigger issue as we should not have work
> running when we return from suspend call. Latter should be dealt with
> anyway as device might be off after suspend.

This happens with a system suspend. Because we disable the interrupts, 
the workqueue never completes, and we have a timeout on system resume.

That's why we want to prevent the workqueue from starting, or let it 
complete, but not have this zombie state where we suspend but there's 
still a wait for completion that times out later. The point here is 
really  making sure the workqueue is not used before suspend.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-08-28 15:14             ` Pierre-Louis Bossart
  0 siblings, 0 replies; 16+ messages in thread
From: Pierre-Louis Bossart @ 2020-08-28 15:14 UTC (permalink / raw)
  To: Vinod Koul
  Cc: alsa-devel, tiwai, gregkh, linux-kernel, ranjani.sridharan,
	hui.wang, broonie, srinivas.kandagatla, jank, mengdong.lin,
	sanyog.r.kale, Bard Liao, rander.wang, bard.liao




> Is this timeout for suspend or resume? Somehow I was under the
> assumption that it is former? Or is the result seen on resume?
> 
> Rereading the race describe above in steps, I think this should be
> handled in step c above. Btw is that suspend or runtime suspend which
> causes this? Former would be bigger issue as we should not have work
> running when we return from suspend call. Latter should be dealt with
> anyway as device might be off after suspend.

This happens with a system suspend. Because we disable the interrupts, 
the workqueue never completes, and we have a timeout on system resume.

That's why we want to prevent the workqueue from starting, or let it 
complete, but not have this zombie state where we suspend but there's 
still a wait for completion that times out later. The point here is 
really  making sure the workqueue is not used before suspend.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-08-28 15:14             ` Pierre-Louis Bossart
  (?)
@ 2020-09-08 11:58             ` Jaroslav Kysela
  2020-09-09  7:59                 ` Vinod Koul
  -1 siblings, 1 reply; 16+ messages in thread
From: Jaroslav Kysela @ 2020-09-08 11:58 UTC (permalink / raw)
  To: Pierre-Louis Bossart, Vinod Koul
  Cc: alsa-devel, tiwai, gregkh, linux-kernel, ranjani.sridharan,
	hui.wang, broonie, srinivas.kandagatla, jank, mengdong.lin,
	sanyog.r.kale, Bard Liao, rander.wang, bard.liao

Dne 28. 08. 20 v 17:14 Pierre-Louis Bossart napsal(a):
> 
> 
> 
>> Is this timeout for suspend or resume? Somehow I was under the
>> assumption that it is former? Or is the result seen on resume?
>>
>> Rereading the race describe above in steps, I think this should be
>> handled in step c above. Btw is that suspend or runtime suspend which
>> causes this? Former would be bigger issue as we should not have work
>> running when we return from suspend call. Latter should be dealt with
>> anyway as device might be off after suspend.
> 
> This happens with a system suspend. Because we disable the interrupts, 
> the workqueue never completes, and we have a timeout on system resume.
> 
> That's why we want to prevent the workqueue from starting, or let it 
> complete, but not have this zombie state where we suspend but there's 
> still a wait for completion that times out later. The point here is 
> really  making sure the workqueue is not used before suspend.
> 

Vinod, there is no acceptance progress on this. The patch is really straight
and for the Intel controller. They know what they're doing. I would apply
this. The code can be refined at anytime. It's a fix. I tested it and I can
confirm, that it fixes the issue. It's a vital patch for 5.10 to enable
finally SoundWire drivers for the Intel hardware.

Acked-by: Jaroslav Kysela <perex@perex.cz>


-- 
Jaroslav Kysela <perex@perex.cz>
Linux Sound Maintainer; ALSA Project; Red Hat, Inc.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
  2020-09-08 11:58             ` Jaroslav Kysela
@ 2020-09-09  7:59                 ` Vinod Koul
  0 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-09-09  7:59 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Pierre-Louis Bossart, alsa-devel, tiwai, gregkh, linux-kernel,
	ranjani.sridharan, hui.wang, broonie, srinivas.kandagatla, jank,
	mengdong.lin, sanyog.r.kale, Bard Liao, rander.wang, bard.liao

On 08-09-20, 13:58, Jaroslav Kysela wrote:
> Dne 28. 08. 20 v 17:14 Pierre-Louis Bossart napsal(a):
> > 
> > 
> > 
> >> Is this timeout for suspend or resume? Somehow I was under the
> >> assumption that it is former? Or is the result seen on resume?
> >>
> >> Rereading the race describe above in steps, I think this should be
> >> handled in step c above. Btw is that suspend or runtime suspend which
> >> causes this? Former would be bigger issue as we should not have work
> >> running when we return from suspend call. Latter should be dealt with
> >> anyway as device might be off after suspend.
> > 
> > This happens with a system suspend. Because we disable the interrupts, 
> > the workqueue never completes, and we have a timeout on system resume.
> > 
> > That's why we want to prevent the workqueue from starting, or let it 
> > complete, but not have this zombie state where we suspend but there's 
> > still a wait for completion that times out later. The point here is 
> > really  making sure the workqueue is not used before suspend.
> > 
> 
> Vinod, there is no acceptance progress on this. The patch is really straight
> and for the Intel controller. They know what they're doing. I would apply
> this. The code can be refined at anytime. It's a fix. I tested it and I can
> confirm, that it fixes the issue. It's a vital patch for 5.10 to enable
> finally SoundWire drivers for the Intel hardware.

I do feel that there is something else going on, but not able to pin
point, anyway this fixes the issue so I am applying it now
> 
> Acked-by: Jaroslav Kysela <perex@perex.cz>

Thanks for ack

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
@ 2020-09-09  7:59                 ` Vinod Koul
  0 siblings, 0 replies; 16+ messages in thread
From: Vinod Koul @ 2020-09-09  7:59 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: alsa-devel, tiwai, gregkh, ranjani.sridharan,
	Pierre-Louis Bossart, hui.wang, broonie, srinivas.kandagatla,
	bard.liao, jank, mengdong.lin, sanyog.r.kale, Bard Liao,
	rander.wang, linux-kernel

On 08-09-20, 13:58, Jaroslav Kysela wrote:
> Dne 28. 08. 20 v 17:14 Pierre-Louis Bossart napsal(a):
> > 
> > 
> > 
> >> Is this timeout for suspend or resume? Somehow I was under the
> >> assumption that it is former? Or is the result seen on resume?
> >>
> >> Rereading the race describe above in steps, I think this should be
> >> handled in step c above. Btw is that suspend or runtime suspend which
> >> causes this? Former would be bigger issue as we should not have work
> >> running when we return from suspend call. Latter should be dealt with
> >> anyway as device might be off after suspend.
> > 
> > This happens with a system suspend. Because we disable the interrupts, 
> > the workqueue never completes, and we have a timeout on system resume.
> > 
> > That's why we want to prevent the workqueue from starting, or let it 
> > complete, but not have this zombie state where we suspend but there's 
> > still a wait for completion that times out later. The point here is 
> > really  making sure the workqueue is not used before suspend.
> > 
> 
> Vinod, there is no acceptance progress on this. The patch is really straight
> and for the Intel controller. They know what they're doing. I would apply
> this. The code can be refined at anytime. It's a fix. I tested it and I can
> confirm, that it fixes the issue. It's a vital patch for 5.10 to enable
> finally SoundWire drivers for the Intel hardware.

I do feel that there is something else going on, but not able to pin
point, anyway this fixes the issue so I am applying it now
> 
> Acked-by: Jaroslav Kysela <perex@perex.cz>

Thanks for ack

-- 
~Vinod

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-09-09  8:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-17 22:23 [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts Bard Liao
2020-08-17 22:23 ` Bard Liao
2020-08-19  9:06 ` Vinod Koul
2020-08-19  9:06   ` Vinod Koul
2020-08-19 12:51   ` Pierre-Louis Bossart
2020-08-21  5:07     ` Vinod Koul
2020-08-21  5:07       ` Vinod Koul
2020-08-21 15:17       ` Pierre-Louis Bossart
2020-08-21 15:17         ` Pierre-Louis Bossart
2020-08-28  8:00         ` Vinod Koul
2020-08-28  8:00           ` Vinod Koul
2020-08-28 15:14           ` Pierre-Louis Bossart
2020-08-28 15:14             ` Pierre-Louis Bossart
2020-09-08 11:58             ` Jaroslav Kysela
2020-09-09  7:59               ` Vinod Koul
2020-09-09  7:59                 ` Vinod Koul

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.