linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mailbox: cancel timer before starting it
@ 2020-09-23 12:39 Jerome Brunet
  2020-09-24 16:50 ` Kevin Hilman
  2020-10-15 13:46 ` Ionela Voinescu
  0 siblings, 2 replies; 18+ messages in thread
From: Jerome Brunet @ 2020-09-23 12:39 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Jerome Brunet, Kevin Hilman, linux-amlogic, linux-kernel, Da Xue

If the txdone is done by polling, it is possible for msg_submit() to start
the timer while txdone_hrtimer() callback is running. If the timer needs
recheduling, it could already be enqueued by the time hrtimer_forward_now()
is called, leading hrtimer to loudly complain.

WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
Hardware name: Libre Computer AML-S805X-AC (DT)
Workqueue: events_freezable_power_ thermal_zone_device_check
pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
pc : hrtimer_forward+0xc4/0x110
lr : txdone_hrtimer+0xf8/0x118
[...]

Canceling the timer before starting it ensure that the timer callback is
not running when the timer is started, solving this race condition.

Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
Reported-by: Da Xue <da@libre.computer>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
---
 drivers/mailbox/mailbox.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 0b821a5b2db8..34f9ab01caef 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
 exit:
 	spin_unlock_irqrestore(&chan->lock, flags);
 
-	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
-		/* kick start the timer immediately to avoid delays */
+	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
+		/* Disable the timer if already active ... */
+		hrtimer_cancel(&chan->mbox->poll_hrt);
+
+		/* ... and kick start it immediately to avoid delays */
 		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+	}
 }
 
 static void tx_tick(struct mbox_chan *chan, int r)
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-09-23 12:39 [PATCH] mailbox: cancel timer before starting it Jerome Brunet
@ 2020-09-24 16:50 ` Kevin Hilman
  2020-10-15 13:46 ` Ionela Voinescu
  1 sibling, 0 replies; 18+ messages in thread
From: Kevin Hilman @ 2020-09-24 16:50 UTC (permalink / raw)
  To: Jerome Brunet, Jassi Brar
  Cc: Jerome Brunet, linux-amlogic, linux-kernel, Da Xue

Jerome Brunet <jbrunet@baylibre.com> writes:

> If the txdone is done by polling, it is possible for msg_submit() to start
> the timer while txdone_hrtimer() callback is running. If the timer needs
> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> is called, leading hrtimer to loudly complain.
>
> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> Hardware name: Libre Computer AML-S805X-AC (DT)
> Workqueue: events_freezable_power_ thermal_zone_device_check
> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> pc : hrtimer_forward+0xc4/0x110
> lr : txdone_hrtimer+0xf8/0x118
> [...]
>
> Canceling the timer before starting it ensure that the timer callback is
> not running when the timer is started, solving this race condition.
>
> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> Reported-by: Da Xue <da@libre.computer>
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>

Reviewed-by: Kevin Hilman <khilman@baylibre.com>

Kevin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-09-23 12:39 [PATCH] mailbox: cancel timer before starting it Jerome Brunet
  2020-09-24 16:50 ` Kevin Hilman
@ 2020-10-15 13:46 ` Ionela Voinescu
  2020-10-15 13:58   ` Jerome Brunet
  1 sibling, 1 reply; 18+ messages in thread
From: Ionela Voinescu @ 2020-10-15 13:46 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Jassi Brar, Kevin Hilman, linux-amlogic, Da Xue, linux-kernel

Hi guys,

On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> If the txdone is done by polling, it is possible for msg_submit() to start
> the timer while txdone_hrtimer() callback is running. If the timer needs
> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> is called, leading hrtimer to loudly complain.
> 
> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> Hardware name: Libre Computer AML-S805X-AC (DT)
> Workqueue: events_freezable_power_ thermal_zone_device_check
> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> pc : hrtimer_forward+0xc4/0x110
> lr : txdone_hrtimer+0xf8/0x118
> [...]
> 
> Canceling the timer before starting it ensure that the timer callback is
> not running when the timer is started, solving this race condition.
> 
> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> Reported-by: Da Xue <da@libre.computer>
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> ---
>  drivers/mailbox/mailbox.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 0b821a5b2db8..34f9ab01caef 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>  exit:
>  	spin_unlock_irqrestore(&chan->lock, flags);
>  
> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> -		/* kick start the timer immediately to avoid delays */
> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> +		/* Disable the timer if already active ... */
> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> +
> +		/* ... and kick start it immediately to avoid delays */
>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> +	}
>  }
>  
>  static void tx_tick(struct mbox_chan *chan, int r)

I've tracked a regression back to this commit. Details to reproduce:


 - HEAD: (linux-next)
   * 62c04453381e  Jerome Brunet   3 weeks ago mailbox: cancel timer before starting it

 - Platform: arm64 Juno R0 and Juno R2 [1]

 - Partial log:
	[    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd030]
	[    0.000000] Linux version 5.9.0-rc8-01722-g62c04453381e () (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #175 SMP PREEMPT Thu Oct 15 14:17:41 BST 2020
	[    0.000000] Machine model: ARM Juno development board (r0)
	[..]
	[    1.714340] mhu 2b1f0000.mhu: ARM MHU Mailbox registered
	[    1.722768] NET: Registered protocol family 17
	[    1.727364] 9pnet: Installing 9P2000 support
	[    1.731689] Key type dns_resolver registered
	[    1.735474] usb 1-1: new high-speed USB device number 2 using ehci-platform
	[    1.736407] registered taskstats version 1
	[    1.747061] Loading compiled-in X.509 certificates
	[    1.755885] scpi_protocol scpi: SCP Protocol 1.2 Firmware 1.21.0 version
	[    1.770484] cpu cpu0: EM: created perf domain
	[    1.778505] cpu cpu1: EM: created perf domain
	[    1.807449] scpi_clocks scpi:clocks: failed to register clock 'pxlclk'
	[    1.897593] hub 1-1:1.0: USB hub found
	[    1.901656] hub 1-1:1.0: 4 ports detected
	[    2.559453] atkbd serio0: keyboard reset failed on 1c060000.kmi
	[   22.787431] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
	[   22.793536] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=2626
	[   22.802421]  (detected by 2, t=5255 jiffies, g=-991, q=9)
	[   22.807823] Task dump for CPU 1:
	[   22.811049] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   22.820980] Call trace:
	[   22.823429]  __switch_to+0x138/0x198
	[   23.583444] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 5443 jiffies s: 49 root: 0x2/.
	[   23.593995] rcu: blocking rcu_node structures:
	[   23.598449] Task dump for CPU 1:
	[   23.601680] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   23.611619] Call trace:
	[   23.614064]  __switch_to+0x138/0x198
	[   85.807430] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
	[   85.813534] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=10502
	[   85.822506]  (detected by 2, t=21009 jiffies, g=-991, q=9)
	[   85.827994] Task dump for CPU 1:
	[   85.831220] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   85.841150] Call trace:
	[   85.843596]  __switch_to+0x138/0x198
	[   87.071446] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 21315 jiffies s: 49 root: 0x2/.
	[   87.082088] rcu: blocking rcu_node structures:
	[   87.086540] Task dump for CPU 1:
	[   87.089773] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   87.099708] Call trace:
	[   87.102155]  __switch_to+0x138/0x198
	[  148.827442] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
	[  148.833565] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=18377
	[  148.842543]  (detected by 4, t=36762 jiffies, g=-991, q=9)
	[  148.848037] Task dump for CPU 1:
	[  148.851268] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[  148.861207] Call trace:
	[  148.863663]  __switch_to+0x138/0x198
	[  150.559443] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 37187 jiffies s: 49 root: 0x2/.
	[  150.570082] rcu: blocking rcu_node structures:
	[  150.574535] Task dump for CPU 1:
	[  150.577767] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[  150.587705] Call trace:
	[  150.590151]  __switch_to+0x138/0x198

 - Commit working as expected:
   * 558e4c36ec9f  Krzysztof Kozlowski     7 weeks ago     maiblox: mediatek: Fix handling of platform_get_irq() error


[1] https://developer.arm.com/tools-and-software/development-boards/juno-development-board

Thank you,
Ionela.

> -- 
> 2.25.4
> 
> 
> _______________________________________________
> linux-amlogic mailing list
> linux-amlogic@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-amlogic

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-15 13:46 ` Ionela Voinescu
@ 2020-10-15 13:58   ` Jerome Brunet
  2020-10-15 14:29     ` Ionela Voinescu
  2020-10-15 18:45     ` Jassi Brar
  0 siblings, 2 replies; 18+ messages in thread
From: Jerome Brunet @ 2020-10-15 13:58 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Jassi Brar, Kevin Hilman, linux-amlogic, Da Xue, linux-kernel


On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:

> Hi guys,
>
> On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
>> If the txdone is done by polling, it is possible for msg_submit() to start
>> the timer while txdone_hrtimer() callback is running. If the timer needs
>> recheduling, it could already be enqueued by the time hrtimer_forward_now()
>> is called, leading hrtimer to loudly complain.
>> 
>> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
>> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
>> Hardware name: Libre Computer AML-S805X-AC (DT)
>> Workqueue: events_freezable_power_ thermal_zone_device_check
>> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
>> pc : hrtimer_forward+0xc4/0x110
>> lr : txdone_hrtimer+0xf8/0x118
>> [...]
>> 
>> Canceling the timer before starting it ensure that the timer callback is
>> not running when the timer is started, solving this race condition.
>> 
>> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
>> Reported-by: Da Xue <da@libre.computer>
>> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
>> ---
>>  drivers/mailbox/mailbox.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
>> index 0b821a5b2db8..34f9ab01caef 100644
>> --- a/drivers/mailbox/mailbox.c
>> +++ b/drivers/mailbox/mailbox.c
>> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>>  exit:
>>  	spin_unlock_irqrestore(&chan->lock, flags);
>>  
>> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> -		/* kick start the timer immediately to avoid delays */
>> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> +		/* Disable the timer if already active ... */
>> +		hrtimer_cancel(&chan->mbox->poll_hrt);
>> +
>> +		/* ... and kick start it immediately to avoid delays */
>>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> +	}
>>  }
>>  
>>  static void tx_tick(struct mbox_chan *chan, int r)
>
> I've tracked a regression back to this commit. Details to reproduce:

Hi Ionela,

I don't have access to your platform and I don't get what is going on
from the log below.

Could you please give us a bit more details about what is going on ?

All this patch does is add hrtimer_cancel().
* It is needed if the timer had already been started, which is
  appropriate AFAIU
* It is a NO-OP is the timer is not active.

>
>
>  - HEAD: (linux-next)
>    * 62c04453381e  Jerome Brunet   3 weeks ago mailbox: cancel timer before starting it
>
>  - Platform: arm64 Juno R0 and Juno R2 [1]
>
>  - Partial log:
> 	[    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd030]
> 	[    0.000000] Linux version 5.9.0-rc8-01722-g62c04453381e () (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #175 SMP PREEMPT Thu Oct 15 14:17:41 BST 2020
> 	[    0.000000] Machine model: ARM Juno development board (r0)
> 	[..]
> 	[    1.714340] mhu 2b1f0000.mhu: ARM MHU Mailbox registered
> 	[    1.722768] NET: Registered protocol family 17
> 	[    1.727364] 9pnet: Installing 9P2000 support
> 	[    1.731689] Key type dns_resolver registered
> 	[    1.735474] usb 1-1: new high-speed USB device number 2 using ehci-platform
> 	[    1.736407] registered taskstats version 1
> 	[    1.747061] Loading compiled-in X.509 certificates
> 	[    1.755885] scpi_protocol scpi: SCP Protocol 1.2 Firmware 1.21.0 version
> 	[    1.770484] cpu cpu0: EM: created perf domain
> 	[    1.778505] cpu cpu1: EM: created perf domain
> 	[    1.807449] scpi_clocks scpi:clocks: failed to register clock 'pxlclk'
> 	[    1.897593] hub 1-1:1.0: USB hub found
> 	[    1.901656] hub 1-1:1.0: 4 ports detected
> 	[    2.559453] atkbd serio0: keyboard reset failed on 1c060000.kmi
> 	[   22.787431] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 	[   22.793536] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=2626
> 	[   22.802421]  (detected by 2, t=5255 jiffies, g=-991, q=9)
> 	[   22.807823] Task dump for CPU 1:
> 	[   22.811049] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   22.820980] Call trace:
> 	[   22.823429]  __switch_to+0x138/0x198
> 	[   23.583444] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 5443 jiffies s: 49 root: 0x2/.
> 	[   23.593995] rcu: blocking rcu_node structures:
> 	[   23.598449] Task dump for CPU 1:
> 	[   23.601680] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   23.611619] Call trace:
> 	[   23.614064]  __switch_to+0x138/0x198
> 	[   85.807430] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 	[   85.813534] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=10502
> 	[   85.822506]  (detected by 2, t=21009 jiffies, g=-991, q=9)
> 	[   85.827994] Task dump for CPU 1:
> 	[   85.831220] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   85.841150] Call trace:
> 	[   85.843596]  __switch_to+0x138/0x198
> 	[   87.071446] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 21315 jiffies s: 49 root: 0x2/.
> 	[   87.082088] rcu: blocking rcu_node structures:
> 	[   87.086540] Task dump for CPU 1:
> 	[   87.089773] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   87.099708] Call trace:
> 	[   87.102155]  __switch_to+0x138/0x198
> 	[  148.827442] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 	[  148.833565] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=18377
> 	[  148.842543]  (detected by 4, t=36762 jiffies, g=-991, q=9)
> 	[  148.848037] Task dump for CPU 1:
> 	[  148.851268] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[  148.861207] Call trace:
> 	[  148.863663]  __switch_to+0x138/0x198
> 	[  150.559443] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 37187 jiffies s: 49 root: 0x2/.
> 	[  150.570082] rcu: blocking rcu_node structures:
> 	[  150.574535] Task dump for CPU 1:
> 	[  150.577767] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[  150.587705] Call trace:
> 	[  150.590151]  __switch_to+0x138/0x198
>
>  - Commit working as expected:
>    * 558e4c36ec9f  Krzysztof Kozlowski     7 weeks ago     maiblox: mediatek: Fix handling of platform_get_irq() error
>
>
> [1] https://developer.arm.com/tools-and-software/development-boards/juno-development-board
>
> Thank you,
> Ionela.
>
>> -- 
>> 2.25.4
>> 
>> 
>> _______________________________________________
>> linux-amlogic mailing list
>> linux-amlogic@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-amlogic


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-15 13:58   ` Jerome Brunet
@ 2020-10-15 14:29     ` Ionela Voinescu
  2020-10-16  8:44       ` Sudeep Holla
  2020-10-15 18:45     ` Jassi Brar
  1 sibling, 1 reply; 18+ messages in thread
From: Ionela Voinescu @ 2020-10-15 14:29 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Sudeep Holla, Jassi Brar, Kevin Hilman, linux-amlogic, Da Xue,
	linux-kernel

Hi Jerome,

On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> 
> On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> 
> > Hi guys,
> >
> > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> >> If the txdone is done by polling, it is possible for msg_submit() to start
> >> the timer while txdone_hrtimer() callback is running. If the timer needs
> >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> >> is called, leading hrtimer to loudly complain.
> >> 
> >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> >> Hardware name: Libre Computer AML-S805X-AC (DT)
> >> Workqueue: events_freezable_power_ thermal_zone_device_check
> >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> >> pc : hrtimer_forward+0xc4/0x110
> >> lr : txdone_hrtimer+0xf8/0x118
> >> [...]
> >> 
> >> Canceling the timer before starting it ensure that the timer callback is
> >> not running when the timer is started, solving this race condition.
> >> 
> >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> >> Reported-by: Da Xue <da@libre.computer>
> >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> >> ---
> >>  drivers/mailbox/mailbox.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> >> index 0b821a5b2db8..34f9ab01caef 100644
> >> --- a/drivers/mailbox/mailbox.c
> >> +++ b/drivers/mailbox/mailbox.c
> >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >>  exit:
> >>  	spin_unlock_irqrestore(&chan->lock, flags);
> >>  
> >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> -		/* kick start the timer immediately to avoid delays */
> >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> +		/* Disable the timer if already active ... */
> >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> >> +
> >> +		/* ... and kick start it immediately to avoid delays */
> >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> +	}
> >>  }
> >>  
> >>  static void tx_tick(struct mbox_chan *chan, int r)
> >
> > I've tracked a regression back to this commit. Details to reproduce:
> 
> Hi Ionela,
> 
> I don't have access to your platform and I don't get what is going on
> from the log below.
> 
> Could you please give us a bit more details about what is going on ?
> 

I'm not familiar with the mailbox subsystem, so the best I can do right
now is to add Sudeep to Cc, in case this conflicts in some way with the
ARM MHU patches [1].

In the meantime I'll get some traces and get more familiar with the
code.

[1]
https://lore.kernel.org/linux-arm-kernel/20201009113129.uqw5zrqztjgw6vga@bogus/

Hope it helps,
Ionela.

> All this patch does is add hrtimer_cancel().
> * It is needed if the timer had already been started, which is
>   appropriate AFAIU
> * It is a NO-OP is the timer is not active.
> 
> >
> >
> >  - HEAD: (linux-next)
> >    * 62c04453381e  Jerome Brunet   3 weeks ago mailbox: cancel timer before starting it
> >
> >  - Platform: arm64 Juno R0 and Juno R2 [1]
> >
> >  - Partial log:
> > 	[    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd030]
> > 	[    0.000000] Linux version 5.9.0-rc8-01722-g62c04453381e () (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #175 SMP PREEMPT Thu Oct 15 14:17:41 BST 2020
> > 	[    0.000000] Machine model: ARM Juno development board (r0)
> > 	[..]
> > 	[    1.714340] mhu 2b1f0000.mhu: ARM MHU Mailbox registered
> > 	[    1.722768] NET: Registered protocol family 17
> > 	[    1.727364] 9pnet: Installing 9P2000 support
> > 	[    1.731689] Key type dns_resolver registered
> > 	[    1.735474] usb 1-1: new high-speed USB device number 2 using ehci-platform
> > 	[    1.736407] registered taskstats version 1
> > 	[    1.747061] Loading compiled-in X.509 certificates
> > 	[    1.755885] scpi_protocol scpi: SCP Protocol 1.2 Firmware 1.21.0 version
> > 	[    1.770484] cpu cpu0: EM: created perf domain
> > 	[    1.778505] cpu cpu1: EM: created perf domain
> > 	[    1.807449] scpi_clocks scpi:clocks: failed to register clock 'pxlclk'
> > 	[    1.897593] hub 1-1:1.0: USB hub found
> > 	[    1.901656] hub 1-1:1.0: 4 ports detected
> > 	[    2.559453] atkbd serio0: keyboard reset failed on 1c060000.kmi
> > 	[   22.787431] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 	[   22.793536] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=2626
> > 	[   22.802421]  (detected by 2, t=5255 jiffies, g=-991, q=9)
> > 	[   22.807823] Task dump for CPU 1:
> > 	[   22.811049] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   22.820980] Call trace:
> > 	[   22.823429]  __switch_to+0x138/0x198
> > 	[   23.583444] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 5443 jiffies s: 49 root: 0x2/.
> > 	[   23.593995] rcu: blocking rcu_node structures:
> > 	[   23.598449] Task dump for CPU 1:
> > 	[   23.601680] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   23.611619] Call trace:
> > 	[   23.614064]  __switch_to+0x138/0x198
> > 	[   85.807430] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 	[   85.813534] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=10502
> > 	[   85.822506]  (detected by 2, t=21009 jiffies, g=-991, q=9)
> > 	[   85.827994] Task dump for CPU 1:
> > 	[   85.831220] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   85.841150] Call trace:
> > 	[   85.843596]  __switch_to+0x138/0x198
> > 	[   87.071446] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 21315 jiffies s: 49 root: 0x2/.
> > 	[   87.082088] rcu: blocking rcu_node structures:
> > 	[   87.086540] Task dump for CPU 1:
> > 	[   87.089773] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   87.099708] Call trace:
> > 	[   87.102155]  __switch_to+0x138/0x198
> > 	[  148.827442] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 	[  148.833565] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=18377
> > 	[  148.842543]  (detected by 4, t=36762 jiffies, g=-991, q=9)
> > 	[  148.848037] Task dump for CPU 1:
> > 	[  148.851268] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[  148.861207] Call trace:
> > 	[  148.863663]  __switch_to+0x138/0x198
> > 	[  150.559443] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 37187 jiffies s: 49 root: 0x2/.
> > 	[  150.570082] rcu: blocking rcu_node structures:
> > 	[  150.574535] Task dump for CPU 1:
> > 	[  150.577767] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[  150.587705] Call trace:
> > 	[  150.590151]  __switch_to+0x138/0x198
> >
> >  - Commit working as expected:
> >    * 558e4c36ec9f  Krzysztof Kozlowski     7 weeks ago     maiblox: mediatek: Fix handling of platform_get_irq() error
> >
> >
> > [1] https://developer.arm.com/tools-and-software/development-boards/juno-development-board
> >
> > Thank you,
> > Ionela.
> >
> >> -- 
> >> 2.25.4
> >> 
> >> 
> >> _______________________________________________
> >> linux-amlogic mailing list
> >> linux-amlogic@lists.infradead.org
> >> http://lists.infradead.org/mailman/listinfo/linux-amlogic
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-15 13:58   ` Jerome Brunet
  2020-10-15 14:29     ` Ionela Voinescu
@ 2020-10-15 18:45     ` Jassi Brar
  2020-10-16  8:52       ` Ionela Voinescu
  1 sibling, 1 reply; 18+ messages in thread
From: Jassi Brar @ 2020-10-15 18:45 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Ionela Voinescu, Kevin Hilman, open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List

On Thu, Oct 15, 2020 at 8:58 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> > Hi guys,
> >
> > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> >> If the txdone is done by polling, it is possible for msg_submit() to start
> >> the timer while txdone_hrtimer() callback is running. If the timer needs
> >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> >> is called, leading hrtimer to loudly complain.
> >>
> >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> >> Hardware name: Libre Computer AML-S805X-AC (DT)
> >> Workqueue: events_freezable_power_ thermal_zone_device_check
> >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> >> pc : hrtimer_forward+0xc4/0x110
> >> lr : txdone_hrtimer+0xf8/0x118
> >> [...]
> >>
> >> Canceling the timer before starting it ensure that the timer callback is
> >> not running when the timer is started, solving this race condition.
> >>
> >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> >> Reported-by: Da Xue <da@libre.computer>
> >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> >> ---
> >>  drivers/mailbox/mailbox.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> >> index 0b821a5b2db8..34f9ab01caef 100644
> >> --- a/drivers/mailbox/mailbox.c
> >> +++ b/drivers/mailbox/mailbox.c
> >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >>  exit:
> >>      spin_unlock_irqrestore(&chan->lock, flags);
> >>
> >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> -            /* kick start the timer immediately to avoid delays */
> >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> +            /* Disable the timer if already active ... */
> >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> +
> >> +            /* ... and kick start it immediately to avoid delays */
> >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> +    }
> >>  }
> >>
> >>  static void tx_tick(struct mbox_chan *chan, int r)
> >
> > I've tracked a regression back to this commit. Details to reproduce:
>
> Hi Ionela,
>
> I don't have access to your platform and I don't get what is going on
> from the log below.
>
> Could you please give us a bit more details about what is going on ?
>
> All this patch does is add hrtimer_cancel().
> * It is needed if the timer had already been started, which is
>   appropriate AFAIU
> * It is a NO-OP is the timer is not active.
>
Can you please try using hrtimer_try_to_cancel() instead ?

thanks

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-15 14:29     ` Ionela Voinescu
@ 2020-10-16  8:44       ` Sudeep Holla
  2020-10-16  9:02         ` Jerome Brunet
  0 siblings, 1 reply; 18+ messages in thread
From: Sudeep Holla @ 2020-10-16  8:44 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Jerome Brunet, Jassi Brar, Kevin Hilman, linux-amlogic, Da Xue,
	linux-kernel

On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
> Hi Jerome,
> 
> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> > 
> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > 
> > > Hi guys,
> > >
> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> > >> If the txdone is done by polling, it is possible for msg_submit() to start
> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> > >> is called, leading hrtimer to loudly complain.
> > >> 
> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> > >> pc : hrtimer_forward+0xc4/0x110
> > >> lr : txdone_hrtimer+0xf8/0x118
> > >> [...]
> > >> 
> > >> Canceling the timer before starting it ensure that the timer callback is
> > >> not running when the timer is started, solving this race condition.
> > >> 
> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> > >> Reported-by: Da Xue <da@libre.computer>
> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> > >> ---
> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> > >> 
> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> > >> index 0b821a5b2db8..34f9ab01caef 100644
> > >> --- a/drivers/mailbox/mailbox.c
> > >> +++ b/drivers/mailbox/mailbox.c
> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >>  exit:
> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
> > >>  
> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> -		/* kick start the timer immediately to avoid delays */
> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> +		/* Disable the timer if already active ... */
> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> +
> > >> +		/* ... and kick start it immediately to avoid delays */
> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> +	}
> > >>  }
> > >>  
> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >
> > > I've tracked a regression back to this commit. Details to reproduce:
> > 
> > Hi Ionela,
> > 
> > I don't have access to your platform and I don't get what is going on
> > from the log below.
> > 
> > Could you please give us a bit more details about what is going on ?
> > 
> 
> I'm not familiar with the mailbox subsystem, so the best I can do right
> now is to add Sudeep to Cc, in case this conflicts in some way with the
> ARM MHU patches [1].
>

Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
MHU driver as is limiting the number of channels to be used.

> In the meantime I'll get some traces and get more familiar with the
> code.
>

I will try that too.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-15 18:45     ` Jassi Brar
@ 2020-10-16  8:52       ` Ionela Voinescu
  2020-10-16  9:00         ` Jerome Brunet
  2020-10-16 11:52         ` Sudeep Holla
  0 siblings, 2 replies; 18+ messages in thread
From: Ionela Voinescu @ 2020-10-16  8:52 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Jerome Brunet, Kevin Hilman, open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List

On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
[..]
> > >> --- a/drivers/mailbox/mailbox.c
> > >> +++ b/drivers/mailbox/mailbox.c
> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >>  exit:
> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> > >>
> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> -            /* kick start the timer immediately to avoid delays */
> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> +            /* Disable the timer if already active ... */
> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> +
> > >> +            /* ... and kick start it immediately to avoid delays */
> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> +    }
> > >>  }
> > >>
> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >
> > > I've tracked a regression back to this commit. Details to reproduce:
> >
> > Hi Ionela,
> >
> > I don't have access to your platform and I don't get what is going on
> > from the log below.
> >
> > Could you please give us a bit more details about what is going on ?
> >
> > All this patch does is add hrtimer_cancel().
> > * It is needed if the timer had already been started, which is
> >   appropriate AFAIU
> > * It is a NO-OP is the timer is not active.
> >
> Can you please try using hrtimer_try_to_cancel() instead ?
> 

Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
this limit how effective this change is? AFAIU, this will possibly only
reduce the chances for the race condition, but not solve it.

Thanks,
Ionela.

> thanks

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  8:52       ` Ionela Voinescu
@ 2020-10-16  9:00         ` Jerome Brunet
  2020-10-16 13:52           ` Sudeep Holla
  2020-10-16 17:33           ` Jassi Brar
  2020-10-16 11:52         ` Sudeep Holla
  1 sibling, 2 replies; 18+ messages in thread
From: Jerome Brunet @ 2020-10-16  9:00 UTC (permalink / raw)
  To: Ionela Voinescu, Jassi Brar
  Cc: Kevin Hilman, open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List


On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:

> On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> [..]
>> > >> --- a/drivers/mailbox/mailbox.c
>> > >> +++ b/drivers/mailbox/mailbox.c
>> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>> > >>  exit:
>> > >>      spin_unlock_irqrestore(&chan->lock, flags);
>> > >>
>> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> > >> -            /* kick start the timer immediately to avoid delays */
>> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> > >> +            /* Disable the timer if already active ... */
>> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
>> > >> +
>> > >> +            /* ... and kick start it immediately to avoid delays */
>> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> > >> +    }
>> > >>  }
>> > >>
>> > >>  static void tx_tick(struct mbox_chan *chan, int r)
>> > >
>> > > I've tracked a regression back to this commit. Details to reproduce:
>> >
>> > Hi Ionela,
>> >
>> > I don't have access to your platform and I don't get what is going on
>> > from the log below.
>> >
>> > Could you please give us a bit more details about what is going on ?
>> >
>> > All this patch does is add hrtimer_cancel().
>> > * It is needed if the timer had already been started, which is
>> >   appropriate AFAIU
>> > * It is a NO-OP is the timer is not active.
>> >
>> Can you please try using hrtimer_try_to_cancel() instead ?
>> 
>
> Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> this limit how effective this change is? AFAIU, this will possibly only
> reduce the chances for the race condition, but not solve it.
>

It is also my understanding, hrtimer_try_to_cancel() would remove a
timer which as not already started but would return withtout doing
anything if the callback is already running ... which is the original
problem


> Thanks,
> Ionela.
>
>> thanks


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  8:44       ` Sudeep Holla
@ 2020-10-16  9:02         ` Jerome Brunet
  2020-10-16  9:34           ` Sudeep Holla
  0 siblings, 1 reply; 18+ messages in thread
From: Jerome Brunet @ 2020-10-16  9:02 UTC (permalink / raw)
  To: Sudeep Holla, Ionela Voinescu
  Cc: Jassi Brar, Kevin Hilman, linux-amlogic, Da Xue, linux-kernel


On Fri 16 Oct 2020 at 10:44, Sudeep Holla <sudeep.holla@arm.com> wrote:

> On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
>> Hi Jerome,
>> 
>> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
>> > 
>> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>> > 
>> > > Hi guys,
>> > >
>> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
>> > >> If the txdone is done by polling, it is possible for msg_submit() to start
>> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
>> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
>> > >> is called, leading hrtimer to loudly complain.
>> > >> 
>> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
>> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
>> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
>> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
>> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
>> > >> pc : hrtimer_forward+0xc4/0x110
>> > >> lr : txdone_hrtimer+0xf8/0x118
>> > >> [...]
>> > >> 
>> > >> Canceling the timer before starting it ensure that the timer callback is
>> > >> not running when the timer is started, solving this race condition.
>> > >> 
>> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
>> > >> Reported-by: Da Xue <da@libre.computer>
>> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
>> > >> ---
>> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
>> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
>> > >> 
>> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
>> > >> index 0b821a5b2db8..34f9ab01caef 100644
>> > >> --- a/drivers/mailbox/mailbox.c
>> > >> +++ b/drivers/mailbox/mailbox.c
>> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>> > >>  exit:
>> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
>> > >>  
>> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> > >> -		/* kick start the timer immediately to avoid delays */
>> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> > >> +		/* Disable the timer if already active ... */
>> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
>> > >> +
>> > >> +		/* ... and kick start it immediately to avoid delays */
>> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> > >> +	}
>> > >>  }
>> > >>  
>> > >>  static void tx_tick(struct mbox_chan *chan, int r)
>> > >
>> > > I've tracked a regression back to this commit. Details to reproduce:
>> > 
>> > Hi Ionela,
>> > 
>> > I don't have access to your platform and I don't get what is going on
>> > from the log below.
>> > 
>> > Could you please give us a bit more details about what is going on ?
>> > 
>> 
>> I'm not familiar with the mailbox subsystem, so the best I can do right
>> now is to add Sudeep to Cc, in case this conflicts in some way with the
>> ARM MHU patches [1].
>>
>
> Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
> MHU driver as is limiting the number of channels to be used.
>
>> In the meantime I'll get some traces and get more familiar with the
>> code.
>>
>
> I will try that too.

BTW, this issue was originally reported on amlogic platforms which also
use arm,mhu mailbox driver.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  9:02         ` Jerome Brunet
@ 2020-10-16  9:34           ` Sudeep Holla
  2020-10-16 11:47             ` Sudeep Holla
  0 siblings, 1 reply; 18+ messages in thread
From: Sudeep Holla @ 2020-10-16  9:34 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Ionela Voinescu, Jassi Brar, Sudeep Holla, Kevin Hilman,
	linux-amlogic, Da Xue, linux-kernel

On Fri, Oct 16, 2020 at 11:02:02AM +0200, Jerome Brunet wrote:
> 
> On Fri 16 Oct 2020 at 10:44, Sudeep Holla <sudeep.holla@arm.com> wrote:
> 
> > On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
> >> Hi Jerome,
> >> 
> >> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> >> > 
> >> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >> > 
> >> > > Hi guys,
> >> > >
> >> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> >> > >> If the txdone is done by polling, it is possible for msg_submit() to start
> >> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
> >> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> >> > >> is called, leading hrtimer to loudly complain.
> >> > >> 
> >> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> >> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> >> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
> >> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
> >> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> >> > >> pc : hrtimer_forward+0xc4/0x110
> >> > >> lr : txdone_hrtimer+0xf8/0x118
> >> > >> [...]
> >> > >> 
> >> > >> Canceling the timer before starting it ensure that the timer callback is
> >> > >> not running when the timer is started, solving this race condition.
> >> > >> 
> >> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> >> > >> Reported-by: Da Xue <da@libre.computer>
> >> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> >> > >> ---
> >> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
> >> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >> > >> 
> >> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> >> > >> index 0b821a5b2db8..34f9ab01caef 100644
> >> > >> --- a/drivers/mailbox/mailbox.c
> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> > >>  exit:
> >> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
> >> > >>  
> >> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> > >> -		/* kick start the timer immediately to avoid delays */
> >> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> > >> +		/* Disable the timer if already active ... */
> >> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> >> > >> +
> >> > >> +		/* ... and kick start it immediately to avoid delays */
> >> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> > >> +	}
> >> > >>  }
> >> > >>  
> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> > >
> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> > 
> >> > Hi Ionela,
> >> > 
> >> > I don't have access to your platform and I don't get what is going on
> >> > from the log below.
> >> > 
> >> > Could you please give us a bit more details about what is going on ?
> >> > 
> >> 
> >> I'm not familiar with the mailbox subsystem, so the best I can do right
> >> now is to add Sudeep to Cc, in case this conflicts in some way with the
> >> ARM MHU patches [1].
> >>
> >
> > Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
> > MHU driver as is limiting the number of channels to be used.
> >
> >> In the meantime I'll get some traces and get more familiar with the
> >> code.
> >>
> >
> > I will try that too.
> 
> BTW, this issue was originally reported on amlogic platforms which also
> use arm,mhu mailbox driver.
> 

OK. Anyway just noticed that hrtimer_cancel uses  hrtimer_try_to_cancel
and hrtimer_cancel_wait_running. The latter is just cpu_relax() if
PREEMPT_RT=n, so you may still have issue if the hrtimer is still active
or restarts in the meantime.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  9:34           ` Sudeep Holla
@ 2020-10-16 11:47             ` Sudeep Holla
  0 siblings, 0 replies; 18+ messages in thread
From: Sudeep Holla @ 2020-10-16 11:47 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Ionela Voinescu, Jassi Brar, Kevin Hilman, linux-amlogic, Da Xue,
	linux-kernel

On Fri, Oct 16, 2020 at 10:34:21AM +0100, Sudeep Holla wrote:
> On Fri, Oct 16, 2020 at 11:02:02AM +0200, Jerome Brunet wrote:
> > 
> > On Fri 16 Oct 2020 at 10:44, Sudeep Holla <sudeep.holla@arm.com> wrote:
> > 
> > > On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
> > >> Hi Jerome,
> > >> 
> > >> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> > >> > 
> > >> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > >> > 
> > >> > > Hi guys,
> > >> > >
> > >> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> > >> > >> If the txdone is done by polling, it is possible for msg_submit() to start
> > >> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
> > >> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> > >> > >> is called, leading hrtimer to loudly complain.
> > >> > >> 
> > >> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> > >> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> > >> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
> > >> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
> > >> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> > >> > >> pc : hrtimer_forward+0xc4/0x110
> > >> > >> lr : txdone_hrtimer+0xf8/0x118
> > >> > >> [...]
> > >> > >> 
> > >> > >> Canceling the timer before starting it ensure that the timer callback is
> > >> > >> not running when the timer is started, solving this race condition.
> > >> > >> 
> > >> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> > >> > >> Reported-by: Da Xue <da@libre.computer>
> > >> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> > >> > >> ---
> > >> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
> > >> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> > >> > >> 
> > >> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> > >> > >> index 0b821a5b2db8..34f9ab01caef 100644
> > >> > >> --- a/drivers/mailbox/mailbox.c
> > >> > >> +++ b/drivers/mailbox/mailbox.c
> > >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >> > >>  exit:
> > >> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
> > >> > >>  
> > >> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> > >> -		/* kick start the timer immediately to avoid delays */
> > >> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> > >> +		/* Disable the timer if already active ... */
> > >> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> > >> +
> > >> > >> +		/* ... and kick start it immediately to avoid delays */
> > >> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> > >> +	}
> > >> > >>  }
> > >> > >>  
> > >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >> > >
> > >> > > I've tracked a regression back to this commit. Details to reproduce:
> > >> > 
> > >> > Hi Ionela,
> > >> > 
> > >> > I don't have access to your platform and I don't get what is going on
> > >> > from the log below.
> > >> > 
> > >> > Could you please give us a bit more details about what is going on ?
> > >> > 
> > >> 
> > >> I'm not familiar with the mailbox subsystem, so the best I can do right
> > >> now is to add Sudeep to Cc, in case this conflicts in some way with the
> > >> ARM MHU patches [1].
> > >>
> > >
> > > Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
> > > MHU driver as is limiting the number of channels to be used.
> > >
> > >> In the meantime I'll get some traces and get more familiar with the
> > >> code.
> > >>
> > >
> > > I will try that too.
> > 
> > BTW, this issue was originally reported on amlogic platforms which also
> > use arm,mhu mailbox driver.
> > 
> 
> OK. Anyway just noticed that hrtimer_cancel uses  hrtimer_try_to_cancel
> and hrtimer_cancel_wait_running. The latter is just cpu_relax() if
> PREEMPT_RT=n, so you may still have issue if the hrtimer is still active
> or restarts in the meantime. 
>

Scratch that, I failed to see the loop in hrtimer_cancel earlier.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  8:52       ` Ionela Voinescu
  2020-10-16  9:00         ` Jerome Brunet
@ 2020-10-16 11:52         ` Sudeep Holla
  1 sibling, 0 replies; 18+ messages in thread
From: Sudeep Holla @ 2020-10-16 11:52 UTC (permalink / raw)
  To: Ionela Voinescu, Sudeep Holla
  Cc: Jassi Brar, Jerome Brunet, Kevin Hilman,
	open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List

On Fri, Oct 16, 2020 at 12:32 PM Ionela Voinescu
<ionela.voinescu@arm.com> wrote:
>
> On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> [..]
> > > >> --- a/drivers/mailbox/mailbox.c
> > > >> +++ b/drivers/mailbox/mailbox.c
> > > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > > >>  exit:
> > > >>      spin_unlock_irqrestore(&chan->lock, flags);
> > > >>
> > > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > > >> -            /* kick start the timer immediately to avoid delays */
> > > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > > >> +            /* Disable the timer if already active ... */
> > > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> > > >> +
> > > >> +            /* ... and kick start it immediately to avoid delays */
> > > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > > >> +    }
> > > >>  }
> > > >>
> > > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > > >
> > > > I've tracked a regression back to this commit. Details to reproduce:
> > >
> > > Hi Ionela,
> > >
> > > I don't have access to your platform and I don't get what is going on
> > > from the log below.
> > >
> > > Could you please give us a bit more details about what is going on ?
> > >
> > > All this patch does is add hrtimer_cancel().
> > > * It is needed if the timer had already been started, which is
> > >   appropriate AFAIU
> > > * It is a NO-OP is the timer is not active.
> > >
> > Can you please try using hrtimer_try_to_cancel() instead ?
> >
>
> Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> this limit how effective this change is? AFAIU, this will possibly only
> reduce the chances for the race condition, but not solve it.
>

Indeed, I tried the same and got a lot of -ETIME failures. hrtimer_cancel
uses hrtimer_try_to_cancel in loop until it succeeds so that the following
hrtimer_start can be done w/o any pending timers.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  9:00         ` Jerome Brunet
@ 2020-10-16 13:52           ` Sudeep Holla
  2020-10-16 17:33           ` Jassi Brar
  1 sibling, 0 replies; 18+ messages in thread
From: Sudeep Holla @ 2020-10-16 13:52 UTC (permalink / raw)
  To: Jerome Brunet, Sudeep Holla
  Cc: Ionela Voinescu, Jassi Brar, Kevin Hilman,
	open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List

On Fri, Oct 16, 2020 at 1:15 PM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > [..]
> >> > >> --- a/drivers/mailbox/mailbox.c
> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> > >>  exit:
> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> >> > >>
> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> > >> -            /* kick start the timer immediately to avoid delays */
> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> > >> +            /* Disable the timer if already active ... */
> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> > >> +
> >> > >> +            /* ... and kick start it immediately to avoid delays */
> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> > >> +    }
> >> > >>  }
> >> > >>
> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> > >
> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> >
> >> > Hi Ionela,
> >> >
> >> > I don't have access to your platform and I don't get what is going on
> >> > from the log below.
> >> >
> >> > Could you please give us a bit more details about what is going on ?
> >> >
> >> > All this patch does is add hrtimer_cancel().
> >> > * It is needed if the timer had already been started, which is
> >> >   appropriate AFAIU
> >> > * It is a NO-OP is the timer is not active.
> >> >
> >> Can you please try using hrtimer_try_to_cancel() instead ?
> >>
> >
> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > this limit how effective this change is? AFAIU, this will possibly only
> > reduce the chances for the race condition, but not solve it.
> >
>
> It is also my understanding, hrtimer_try_to_cancel() would remove a
> timer which as not already started but would return withtout doing
> anything if the callback is already running ... which is the original
> problem
>


There seem to be some races. It always hangs in the hrtimer_cancel.
Logging some extra messages makes it progress for a while and finally
get stuck in the loop. I wonder if there is a race between cancel
and handler execution.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16  9:00         ` Jerome Brunet
  2020-10-16 13:52           ` Sudeep Holla
@ 2020-10-16 17:33           ` Jassi Brar
  2020-10-16 18:54             ` Jerome Brunet
  1 sibling, 1 reply; 18+ messages in thread
From: Jassi Brar @ 2020-10-16 17:33 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Ionela Voinescu, Kevin Hilman, open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List

On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > [..]
> >> > >> --- a/drivers/mailbox/mailbox.c
> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> > >>  exit:
> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> >> > >>
> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> > >> -            /* kick start the timer immediately to avoid delays */
> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> > >> +            /* Disable the timer if already active ... */
> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> > >> +
> >> > >> +            /* ... and kick start it immediately to avoid delays */
> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> > >> +    }
> >> > >>  }
> >> > >>
> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> > >
> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> >
> >> > Hi Ionela,
> >> >
> >> > I don't have access to your platform and I don't get what is going on
> >> > from the log below.
> >> >
> >> > Could you please give us a bit more details about what is going on ?
> >> >
> >> > All this patch does is add hrtimer_cancel().
> >> > * It is needed if the timer had already been started, which is
> >> >   appropriate AFAIU
> >> > * It is a NO-OP is the timer is not active.
> >> >
> >> Can you please try using hrtimer_try_to_cancel() instead ?
> >>
> >
> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > this limit how effective this change is? AFAIU, this will possibly only
> > reduce the chances for the race condition, but not solve it.
> >
>
> It is also my understanding, hrtimer_try_to_cancel() would remove a
> timer which as not already started but would return withtout doing
> anything if the callback is already running ... which is the original
> problem
>
If we are running in the callback path, hrtimer_try_to_cancel will
return -1, in which case we could skip hrtimer_start.
Anyways, I think simply checking for hrtimer_active should effect the same.
I have submitted a patch, of course not tested.

Thanks

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16 17:33           ` Jassi Brar
@ 2020-10-16 18:54             ` Jerome Brunet
  2020-10-16 19:27               ` Jassi Brar
  0 siblings, 1 reply; 18+ messages in thread
From: Jerome Brunet @ 2020-10-16 18:54 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Ionela Voinescu, Kevin Hilman, open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List


On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@gmail.com> wrote:

> On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
>>
>>
>> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>>
>> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
>> > [..]
>> >> > >> --- a/drivers/mailbox/mailbox.c
>> >> > >> +++ b/drivers/mailbox/mailbox.c
>> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>> >> > >>  exit:
>> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
>> >> > >>
>> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> >> > >> -            /* kick start the timer immediately to avoid delays */
>> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> >> > >> +            /* Disable the timer if already active ... */
>> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
>> >> > >> +
>> >> > >> +            /* ... and kick start it immediately to avoid delays */
>> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> >> > >> +    }
>> >> > >>  }
>> >> > >>
>> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
>> >> > >
>> >> > > I've tracked a regression back to this commit. Details to reproduce:
>> >> >
>> >> > Hi Ionela,
>> >> >
>> >> > I don't have access to your platform and I don't get what is going on
>> >> > from the log below.
>> >> >
>> >> > Could you please give us a bit more details about what is going on ?
>> >> >
>> >> > All this patch does is add hrtimer_cancel().
>> >> > * It is needed if the timer had already been started, which is
>> >> >   appropriate AFAIU
>> >> > * It is a NO-OP is the timer is not active.
>> >> >
>> >> Can you please try using hrtimer_try_to_cancel() instead ?
>> >>
>> >
>> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
>> > this limit how effective this change is? AFAIU, this will possibly only
>> > reduce the chances for the race condition, but not solve it.
>> >
>>
>> It is also my understanding, hrtimer_try_to_cancel() would remove a
>> timer which as not already started but would return withtout doing
>> anything if the callback is already running ... which is the original
>> problem
>>
> If we are running in the callback path, hrtimer_try_to_cancel will
> return -1, in which case we could skip hrtimer_start.
> Anyways, I think simply checking for hrtimer_active should effect the same.
> I have submitted a patch, of course not tested.

Yes it sloves this race but ...

If a race is possible between a timer callback rescheduling itself (which
is not that uncommon) and another thread trying to cancel it, maybe
there is something worth fixing in hrtimer ? Also, mailbox calls
hrtimer_cancel() in unregister ... are we confident this would work ?

Any fix is by me - yours avoid killing and restarting the timer :) but
it feels like we are working around an issue that might bite us back
later on.

>
> Thanks


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16 18:54             ` Jerome Brunet
@ 2020-10-16 19:27               ` Jassi Brar
  2022-07-19  7:08                 ` Da Xue
  0 siblings, 1 reply; 18+ messages in thread
From: Jassi Brar @ 2020-10-16 19:27 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Ionela Voinescu, Kevin Hilman, open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List

On Fri, Oct 16, 2020 at 1:54 PM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> > On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
> >>
> >>
> >> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >>
> >> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> >> > [..]
> >> >> > >> --- a/drivers/mailbox/mailbox.c
> >> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> >> > >>  exit:
> >> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> >> >> > >>
> >> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> >> > >> -            /* kick start the timer immediately to avoid delays */
> >> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> >> > >> +            /* Disable the timer if already active ... */
> >> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> >> > >> +
> >> >> > >> +            /* ... and kick start it immediately to avoid delays */
> >> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> >> > >> +    }
> >> >> > >>  }
> >> >> > >>
> >> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> >> > >
> >> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> >> >
> >> >> > Hi Ionela,
> >> >> >
> >> >> > I don't have access to your platform and I don't get what is going on
> >> >> > from the log below.
> >> >> >
> >> >> > Could you please give us a bit more details about what is going on ?
> >> >> >
> >> >> > All this patch does is add hrtimer_cancel().
> >> >> > * It is needed if the timer had already been started, which is
> >> >> >   appropriate AFAIU
> >> >> > * It is a NO-OP is the timer is not active.
> >> >> >
> >> >> Can you please try using hrtimer_try_to_cancel() instead ?
> >> >>
> >> >
> >> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> >> > this limit how effective this change is? AFAIU, this will possibly only
> >> > reduce the chances for the race condition, but not solve it.
> >> >
> >>
> >> It is also my understanding, hrtimer_try_to_cancel() would remove a
> >> timer which as not already started but would return withtout doing
> >> anything if the callback is already running ... which is the original
> >> problem
> >>
> > If we are running in the callback path, hrtimer_try_to_cancel will
> > return -1, in which case we could skip hrtimer_start.
> > Anyways, I think simply checking for hrtimer_active should effect the same.
> > I have submitted a patch, of course not tested.
>
> Yes it sloves this race but ...
>
Thanks for confirmation.

> If a race is possible between a timer callback rescheduling itself (which
> is not that uncommon) and another thread trying to cancel it
>
In our case, we should not be cancelling+restarting the timer in the
first place, because txdone_hrtimer will take care of it via
hrtimer_forward_now.

>, maybe
> there is something worth fixing in hrtimer ? Also, mailbox calls
> hrtimer_cancel() in unregister ... are we confident this would work ?
>
Yes. After unregister() every channel is supposed to die and so must
its resources.

-jassi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mailbox: cancel timer before starting it
  2020-10-16 19:27               ` Jassi Brar
@ 2022-07-19  7:08                 ` Da Xue
  0 siblings, 0 replies; 18+ messages in thread
From: Da Xue @ 2022-07-19  7:08 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Jerome Brunet, Ionela Voinescu, Kevin Hilman,
	open list:ARM/Amlogic Meson...,
	Da Xue, Linux Kernel Mailing List, Qi Duan

On Fri, Oct 16, 2020 at 3:27 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> On Fri, Oct 16, 2020 at 1:54 PM Jerome Brunet <jbrunet@baylibre.com> wrote:
> >
> >
> > On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > > On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
> > >>
> > >>
> > >> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > >>
> > >> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > >> > [..]
> > >> >> > >> --- a/drivers/mailbox/mailbox.c
> > >> >> > >> +++ b/drivers/mailbox/mailbox.c
> > >> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >> >> > >>  exit:
> > >> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> > >> >> > >>
> > >> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> >> > >> -            /* kick start the timer immediately to avoid delays */
> > >> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> >> > >> +            /* Disable the timer if already active ... */
> > >> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> >> > >> +
> > >> >> > >> +            /* ... and kick start it immediately to avoid delays */
> > >> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> >> > >> +    }
> > >> >> > >>  }
> > >> >> > >>
> > >> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >> >> > >
> > >> >> > > I've tracked a regression back to this commit. Details to reproduce:
> > >> >> >
> > >> >> > Hi Ionela,
> > >> >> >
> > >> >> > I don't have access to your platform and I don't get what is going on
> > >> >> > from the log below.
> > >> >> >
> > >> >> > Could you please give us a bit more details about what is going on ?
> > >> >> >
> > >> >> > All this patch does is add hrtimer_cancel().
> > >> >> > * It is needed if the timer had already been started, which is
> > >> >> >   appropriate AFAIU
> > >> >> > * It is a NO-OP is the timer is not active.
> > >> >> >
> > >> >> Can you please try using hrtimer_try_to_cancel() instead ?
> > >> >>
> > >> >
> > >> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > >> > this limit how effective this change is? AFAIU, this will possibly only
> > >> > reduce the chances for the race condition, but not solve it.
> > >> >
> > >>
> > >> It is also my understanding, hrtimer_try_to_cancel() would remove a
> > >> timer which as not already started but would return withtout doing
> > >> anything if the callback is already running ... which is the original
> > >> problem
> > >>
> > > If we are running in the callback path, hrtimer_try_to_cancel will
> > > return -1, in which case we could skip hrtimer_start.
> > > Anyways, I think simply checking for hrtimer_active should effect the same.
> > > I have submitted a patch, of course not tested.
> >
> > Yes it sloves this race but ...
> >
> Thanks for confirmation.
>
> > If a race is possible between a timer callback rescheduling itself (which
> > is not that uncommon) and another thread trying to cancel it
> >
> In our case, we should not be cancelling+restarting the timer in the
> first place, because txdone_hrtimer will take care of it via
> hrtimer_forward_now.
>
> >, maybe
> > there is something worth fixing in hrtimer ? Also, mailbox calls
> > hrtimer_cancel() in unregister ... are we confident this would work ?
> >
> Yes. After unregister() every channel is supposed to die and so must
> its resources.
>
> -jassi

Hi,

I'm running into an issue where I get "rcu: INFO: rcu_preempt detected
stalls on CPUs/tasks" when booting Linux distributions once every 5 or
so reboots. The system sticks on systemd starting for ~20 seconds
before rcu_preempt prints a message. This only happens on Amlogic (I
tested S805X/S905X/S905D) and I do not get this problem on my
Allwinner or Rockchip boards with the same kernel and image. I have
tried 5.15, 5.18, and 5.19. I dumped a bunch of logs here from the
three kernel versions but the content is the same.
https://github.com/libre-computer-project/libretech-linux/issues/5

+ mount -n -o move /sys /root/sys
+ mount -n -o move /proc /root/proc
+ exec run-init /root /sbin/init
[   26.485689] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   26.486217] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P1
[   26.492332]  (detected by 3, t=5252 jiffies, g=-315, q=546)
[   26.497846] task:run-init        state:R  running task     stack:
 0 pid:    1 ppid:     0 flags:0x00000004
[   26.507719] Call trace:
[   26.510063]  dump_backtrace.part.6+0xdc/0xe8
[   26.514306]  show_stack+0x18/0x68
[   26.517574]  sched_show_task+0x154/0x198
[   26.521461]  rcu_sched_clock_irq+0xd10/0xdd8
[   26.525690]  update_process_times+0x9c/0xd0
[   26.529829]  tick_sched_handle.isra.19+0x34/0x58
[   26.534405]  tick_sched_timer+0x4c/0xa8
[   26.538192]  __hrtimer_run_queues+0x120/0x1b8
[   26.542510]  hrtimer_interrupt+0xd8/0x238
[   26.546474]  arch_timer_handler_phys+0x2c/0x50
[   26.550877]  handle_percpu_devid_irq+0x84/0x138
[   26.555363]  generic_handle_domain_irq+0x2c/0x48
[   26.559935]  gic_handle_irq+0xa4/0xc0
[   26.563549]  call_on_irq_stack+0x2c/0x58
[   26.567432]  do_interrupt_handler+0x80/0x88
[   26.571574]  el1_interrupt+0x38/0x70
[   26.575105]  el1h_64_irq_handler+0x18/0x28
[   26.582437]  el1h_64_irq+0x64/0x68
[   26.585712]  xas_descend+0x0/0x88
[   26.588947]  xas_find+0x190/0x1d8
[   26.592182]  find_lock_entries+0x7c/0x280
[   26.596121]  truncate_inode_pages_range+0xa8/0x490
[   26.600843]  truncate_inode_pages_final+0x58/0x78
[   26.605453]  evict+0x144/0x150
[   26.608471]  iput+0x128/0x190
[   26.611442]  do_unlinkat+0x17c/0x2a8
[   26.614817]  __arm64_sys_unlinkat+0x40/0x88
[   26.618928]  invoke_syscall+0x44/0x100
[   26.622599]  el0_svc_common.constprop.3+0x6c/0xf0
[   26.627237]  do_el0_svc+0x24/0x88
[   26.630473]  el0_svc+0x20/0x60
[   26.633447]  el0t_64_sync_handler+0x90/0xb8
[   26.637562]  el0t_64_sync+0x170/0x174
[   33.757714] VCC_CARD: disabling

I have turned on systemd debug and it does not print anything. The
message is the same every time and I rebuilt the filesystem multiple
times with both ext4 and btrfs roots across the devices.

I'm not sure if this is an extension or corner case of the origin
hr_timer problem. Any ideas?

Best,
Da

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-07-19  7:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23 12:39 [PATCH] mailbox: cancel timer before starting it Jerome Brunet
2020-09-24 16:50 ` Kevin Hilman
2020-10-15 13:46 ` Ionela Voinescu
2020-10-15 13:58   ` Jerome Brunet
2020-10-15 14:29     ` Ionela Voinescu
2020-10-16  8:44       ` Sudeep Holla
2020-10-16  9:02         ` Jerome Brunet
2020-10-16  9:34           ` Sudeep Holla
2020-10-16 11:47             ` Sudeep Holla
2020-10-15 18:45     ` Jassi Brar
2020-10-16  8:52       ` Ionela Voinescu
2020-10-16  9:00         ` Jerome Brunet
2020-10-16 13:52           ` Sudeep Holla
2020-10-16 17:33           ` Jassi Brar
2020-10-16 18:54             ` Jerome Brunet
2020-10-16 19:27               ` Jassi Brar
2022-07-19  7:08                 ` Da Xue
2020-10-16 11:52         ` Sudeep Holla

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).