* [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
@ 2021-03-19  8:00 Wang Qing
  2021-03-20 18:20 ` Tejun Heo
  2021-03-22 10:59 ` Petr Mladek
  0 siblings, 2 replies; 9+ messages in thread
From: Wang Qing @ 2021-03-19  8:00 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Petr Mladek, Wang Qing,
	Santosh Sivaraj, linux-kernel

When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu is
updated, while an unbound worker_pool running on that core uses
wq_watchdog_touched to determine whether it has locked up. This can lead to
a false lockup report.

My suggestion is to update both when touch_softlockup_watchdog() is called:
use wq_watchdog_touched_cpu to check bound worker_pools, and
wq_watchdog_touched to check unbound worker_pools.

Signed-off-by: Wang Qing <wangqing@vivo.com>
---
 kernel/watchdog.c  |  5 +++--
 kernel/workqueue.c | 17 ++++++-----------
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7110906..107bc38
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
 	 * update as well, the only side effect might be a cycle delay for
 	 * the softlockup check.
 	 */
-	for_each_cpu(cpu, &watchdog_allowed_mask)
+	for_each_cpu(cpu, &watchdog_allowed_mask) {
 		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
-	wq_watchdog_touch(-1);
+		wq_watchdog_touch(cpu);
+	}
 }
 
 void touch_softlockup_watchdog_sync(void)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0d150da..be08295
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 			continue;
 
 		/* get the latest of pool and touched timestamps */
+		if (pool->cpu >= 0)
+			touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
+		else
+			touched = READ_ONCE(wq_watchdog_touched);
 		pool_ts = READ_ONCE(pool->watchdog_ts);
-		touched = READ_ONCE(wq_watchdog_touched);
 
 		if (time_after(pool_ts, touched))
 			ts = pool_ts;
 		else
 			ts = touched;
 
-		if (pool->cpu >= 0) {
-			unsigned long cpu_touched =
-				READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
-						  pool->cpu));
-			if (time_after(cpu_touched, ts))
-				ts = cpu_touched;
-		}
-
 		/* did we stall? */
 		if (time_after(jiffies, ts + thresh)) {
 			lockup_detected = true;
@@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
 {
 	if (cpu >= 0)
 		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
-	else
-		wq_watchdog_touched = jiffies;
+
+	wq_watchdog_touched = jiffies;
 }
 
 static void wq_watchdog_set_thresh(unsigned long thresh)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-19  8:00 [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking Wang Qing
@ 2021-03-20 18:20 ` Tejun Heo
  2021-03-22  1:47   ` 王擎
  2021-03-22 10:59 ` Petr Mladek
  1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2021-03-20 18:20 UTC (permalink / raw)
  To: Wang Qing
  Cc: Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Petr Mladek, Santosh Sivaraj,
	linux-kernel

On Fri, Mar 19, 2021 at 04:00:36PM +0800, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> updated, while the unbound worker_pool running on its core uses 
> wq_watchdog_touched to determine whether locked up. This may be mischecked.

Can you please elaborate here, preferably with a concrete scenario where the
new code is better?

Thanks.

-- 
tejun


* Re:Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-20 18:20 ` Tejun Heo
@ 2021-03-22  1:47   ` 王擎
  0 siblings, 0 replies; 9+ messages in thread
From: 王擎 @ 2021-03-22  1:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Petr Mladek, Santosh Sivaraj,
	linux-kernel


>On Fri, Mar 19, 2021 at 04:00:36PM +0800, Wang Qing wrote:
>> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
>> updated, while the unbound worker_pool running on its core uses 
>> wq_watchdog_touched to determine whether locked up. This may be mischecked.
>
>Can you please elaborate here, preferably with a concrete scenario where the
>new code is better?

The previous code is problematic for judging whether an unbound pool is locked up:
when an expected stall occurs on a single CPU, only wq_watchdog_touched_cpu is
updated. However, the unbound pool uses wq_watchdog_touched to determine whether
it is locked up, so when the unbound pool runs in a scenario where a CPU stall
is expected, a misjudgment occurs, because wq_watchdog_touched is only updated
when all CPUs stall as expected.

So we need to update wq_watchdog_touched in touch_softlockup_watchdog() as well;
it is then used only for unbound pools, and this does not affect bound pools in
any way.

Thanks,

Qing

>
>Thanks.
>
>-- 
>tejun




* Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-19  8:00 [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking Wang Qing
  2021-03-20 18:20 ` Tejun Heo
@ 2021-03-22 10:59 ` Petr Mladek
  2021-03-23 12:37   ` 王擎
  1 sibling, 1 reply; 9+ messages in thread
From: Petr Mladek @ 2021-03-22 10:59 UTC (permalink / raw)
  To: Wang Qing
  Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Santosh Sivaraj, linux-kernel

On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> updated, while the unbound worker_pool running on its core uses 
> wq_watchdog_touched to determine whether locked up. This may be mischecked.

In other words, unbound workqueues are not aware of the more common
touch_softlockup_watchdog() because it updates only
wq_watchdog_touched_cpu for the affected CPU. As a result,
the workqueue watchdog might report a lockup in an unbound workqueue
even though it is blocked by known slow code.

> My suggestion is to update both when touch_softlockup_watchdog() is called, 
> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
> to check unbound worker_pool.
> 
> Signed-off-by: Wang Qing <wangqing@vivo.com>
> ---
>  kernel/watchdog.c  |  5 +++--
>  kernel/workqueue.c | 17 ++++++-----------
>  2 files changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 7110906..107bc38
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>  	 * update as well, the only side effect might be a cycle delay for
>  	 * the softlockup check.
>  	 */
> -	for_each_cpu(cpu, &watchdog_allowed_mask)
> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> -	wq_watchdog_touch(-1);
> +		wq_watchdog_touch(cpu);

Note that wq_watchdog_touch(cpu) now always updates
wq_watchdog_touched. This loop will therefore write the same jiffies
value to that variable once per CPU.

> +	}
>  }
>  
>  void touch_softlockup_watchdog_sync(void)
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 0d150da..be08295
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
>  			continue;
>  
>  		/* get the latest of pool and touched timestamps */
> +		if (pool->cpu >= 0)
> +			touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
> +		else
> +			touched = READ_ONCE(wq_watchdog_touched);
>  		pool_ts = READ_ONCE(pool->watchdog_ts);
> -		touched = READ_ONCE(wq_watchdog_touched);
>  
>  		if (time_after(pool_ts, touched))
>  			ts = pool_ts;
>  		else
>  			ts = touched;
>  
> -		if (pool->cpu >= 0) {
> -			unsigned long cpu_touched =
> -				READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
> -						  pool->cpu));
> -			if (time_after(cpu_touched, ts))
> -				ts = cpu_touched;
> -		}
> -
>  		/* did we stall? */
>  		if (time_after(jiffies, ts + thresh)) {
>  			lockup_detected = true;
> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>  {
>  	if (cpu >= 0)
>  		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> -	else
> -		wq_watchdog_touched = jiffies;
> +
> +	wq_watchdog_touched = jiffies;
>  }
>  
>  static void wq_watchdog_set_thresh(unsigned long thresh)

This last hunk is enough to fix the problem. wq_watchdog_touched will
get updated also from cpu-specific touch_softlockup_watchdog().

The original patch simplifies the logic of wq_watchdog_timer_fn(),
but it adds unnecessary assignments in
touch_all_softlockup_watchdogs().

I do not have a strong opinion on which solution is better. I slightly
prefer to keep only this last hunk.

Best Regards,
Petr


* Re:Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-22 10:59 ` Petr Mladek
@ 2021-03-23 12:37   ` 王擎
  2021-03-23 15:06     ` Petr Mladek
  0 siblings, 1 reply; 9+ messages in thread
From: 王擎 @ 2021-03-23 12:37 UTC (permalink / raw)
  To: Petr Mladek, Tejun Heo
  Cc: Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Santosh Sivaraj, linux-kernel


>On Fri 2021-03-19 16:00:36, Wang Qing wrote:
>> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
>> updated, while the unbound worker_pool running on its core uses 
>> wq_watchdog_touched to determine whether locked up. This may be mischecked.
>
>By other words, unbound workqueues are not aware of the more common
>touch_softlockup_watchdog() because it updates only
>wq_watchdog_touched_cpu for the affected CPU. As a result,
>the workqueue watchdog might report lockup in unbound workqueue
>even though it is blocked by a known slow code.

Yes, this is the problem I'm talking about.
>
>> My suggestion is to update both when touch_softlockup_watchdog() is called, 
>> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
>> to check unbound worker_pool.
>> 
>> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> ---
>>  kernel/watchdog.c  |  5 +++--
>>  kernel/workqueue.c | 17 ++++++-----------
>>  2 files changed, 9 insertions(+), 13 deletions(-)
>> 
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 7110906..107bc38
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>>  	 * update as well, the only side effect might be a cycle delay for
>>  	 * the softlockup check.
>>  	 */
>> -	for_each_cpu(cpu, &watchdog_allowed_mask)
>> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>> -	wq_watchdog_touch(-1);
>> +		wq_watchdog_touch(cpu);
>
>Note that wq_watchdog_touch(cpu) newly always updates
>wq_watchdog_touched. This cycle will set the same jiffies
>value cpu-times to the same variable.
>
Although there is a bit of redundancy here, this is the most concise way
to implement it, and it certainly will not affect performance.

>> +	}
>>  }
>>  
>>  void touch_softlockup_watchdog_sync(void)
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 0d150da..be08295
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
>>  			continue;
>>  
>>  		/* get the latest of pool and touched timestamps */
>> +		if (pool->cpu >= 0)
>> +			touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
>> +		else
>> +			touched = READ_ONCE(wq_watchdog_touched);
>>  		pool_ts = READ_ONCE(pool->watchdog_ts);
>> -		touched = READ_ONCE(wq_watchdog_touched);
>>  
>>  		if (time_after(pool_ts, touched))
>>  			ts = pool_ts;
>>  		else
>>  			ts = touched;
>>  
>> -		if (pool->cpu >= 0) {
>> -			unsigned long cpu_touched =
>> -				READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
>> -						  pool->cpu));
>> -			if (time_after(cpu_touched, ts))
>> -				ts = cpu_touched;
>> -		}
>> -
>>  		/* did we stall? */
>>  		if (time_after(jiffies, ts + thresh)) {
>>  			lockup_detected = true;
>> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>>  {
>>  	if (cpu >= 0)
>>  		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
>> -	else
>> -		wq_watchdog_touched = jiffies;
>> +
>> +	wq_watchdog_touched = jiffies;
>>  }
>>  
>>  static void wq_watchdog_set_thresh(unsigned long thresh)
>
>This last hunk is enough to fix the problem. wq_watchdog_touched will
>get updated also from cpu-specific touch_softlockup_watchdog().

It is not enough, in fact: because wq_watchdog_touched is now updated in
wq_watchdog_touch(), it can no longer be used to judge the bound pools, so we
must update every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs()
for the bound check.

Thanks,
Qing Wang
>
>The original patch simplified the logic of wq_watchdog_timer_fn().
>But it added un-necessary assignments into
>touch_all_softlockup_watchdogs(void).
>
>I do not have strong opinion what solution is better. I slightly
>prefer to keep only this last hunk.
>
>Best Regards,
>Petr




* Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-23 12:37   ` 王擎
@ 2021-03-23 15:06     ` Petr Mladek
  2021-03-24  2:16       ` 王擎
  0 siblings, 1 reply; 9+ messages in thread
From: Petr Mladek @ 2021-03-23 15:06 UTC (permalink / raw)
  To: 王擎
  Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Santosh Sivaraj, linux-kernel

On Tue 2021-03-23 20:37:35, 王擎 wrote:
> 
> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> >> updated, while the unbound worker_pool running on its core uses 
> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
> >
> >By other words, unbound workqueues are not aware of the more common
> >touch_softlockup_watchdog() because it updates only
> >wq_watchdog_touched_cpu for the affected CPU. As a result,
> >the workqueue watchdog might report lockup in unbound workqueue
> >even though it is blocked by a known slow code.
> 
> Yes, this is the problem I'm talking about.

I thought more about it. This patch prevents a false positive.
Could it bring the opposite problem and hide real problems?

I mean that an unbound workqueue might get blocked on CPU A
because of a real softlockup, but we might not notice it because
CPU B is touched. Well, there are other ways to detect
this situation, e.g. the softlockup watchdog.


> >> My suggestion is to update both when touch_softlockup_watchdog() is called, 
> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
> >> to check unbound worker_pool.
> >> 
> >> Signed-off-by: Wang Qing <wangqing@vivo.com>
> >> ---
> >>  kernel/watchdog.c  |  5 +++--
> >>  kernel/workqueue.c | 17 ++++++-----------
> >>  2 files changed, 9 insertions(+), 13 deletions(-)
> >> 
> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >> index 7110906..107bc38
> >> --- a/kernel/watchdog.c
> >> +++ b/kernel/watchdog.c
> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
> >>  	 * update as well, the only side effect might be a cycle delay for
> >>  	 * the softlockup check.
> >>  	 */
> >> -	for_each_cpu(cpu, &watchdog_allowed_mask)
> >> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
> >>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> >> -	wq_watchdog_touch(-1);
> >> +		wq_watchdog_touch(cpu);
> >
> >Note that wq_watchdog_touch(cpu) newly always updates
> >wq_watchdog_touched. This cycle will set the same jiffies
> >value cpu-times to the same variable.
> >
> Although there is a bit of redundancy here, but the most concise way of 
> implementation, and it is certain that it will not affect performance.
> 
> >> +	}
> >>  }
> >>  
> >>  void touch_softlockup_watchdog_sync(void)
> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> >> index 0d150da..be08295
> >> --- a/kernel/workqueue.c
> >> +++ b/kernel/workqueue.c
> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
> >>  {
> >>  	if (cpu >= 0)
> >>  		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> >> -	else
> >> -		wq_watchdog_touched = jiffies;
> >> +
> >> +	wq_watchdog_touched = jiffies;
> >>  }
> >>  
> >>  static void wq_watchdog_set_thresh(unsigned long thresh)
> >
> >This last hunk is enough to fix the problem. wq_watchdog_touched will
> >get updated also from cpu-specific touch_softlockup_watchdog().
> 
> Not enough in fact, because wq_watchdog_touched was updated in wq_watchdog_touch(), 
> so wq_watchdog_touched can no longer be used to judge the bound pool, we must update 
> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound judgment.

I see. Your patch makes sense.

I would just improve the commit message. It should better explain
why all the twists are needed. I would write something like:

<proposal>
Subject: workqueue/watchdog: Make unbound workqueues aware of
touch_softlockup_watchdog()

There are two workqueue-specific watchdog timestamps:

    + @wq_watchdog_touched_cpu (per-CPU) updated by
      touch_softlockup_watchdog()

    + @wq_watchdog_touched (global) updated by
      touch_all_softlockup_watchdogs()

wq_watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by known slow code.

Solution:

touch_softlockup_watchdog() must touch also the global
@wq_watchdog_touched timestamp.

The global timestamp can no longer be used for bound workqueues
because it is updated on all CPUs. Instead, bound workqueues
have to check only @wq_watchdog_touched_cpu, and this per-CPU timestamp
has to be updated for all CPUs in touch_all_softlockup_watchdogs().

Beware:

The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.

It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.
</proposal>

How does that sound?

Best Regards,
Petr


* Re:Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-23 15:06     ` Petr Mladek
@ 2021-03-24  2:16       ` 王擎
  2021-03-24  8:50         ` Petr Mladek
  0 siblings, 1 reply; 9+ messages in thread
From: 王擎 @ 2021-03-24  2:16 UTC (permalink / raw)
  To: Petr Mladek, Tejun Heo
  Cc: Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Santosh Sivaraj, linux-kernel


>On Tue 2021-03-23 20:37:35, 王擎 wrote:
>> 
>> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
>> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
>> >> updated, while the unbound worker_pool running on its core uses 
>> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
>> >
>> >By other words, unbound workqueues are not aware of the more common
>> >touch_softlockup_watchdog() because it updates only
>> >wq_watchdog_touched_cpu for the affected CPU. As a result,
>> >the workqueue watchdog might report lockup in unbound workqueue
>> >even though it is blocked by a known slow code.
>> 
>> Yes, this is the problem I'm talking about.
>
>I thought more about it. This patch prevents a false positive.
>Could it bring an opposite problem and hide real problems?
>
>I mean that an unbound workqueue might get blocked on CPU A
>because of a real softlockup. But we might not notice it because
>CPU B is touched. Well, there are other ways how to detect
>this situation, e.g. the softlockup watchdog.
>
>
>> >> My suggestion is to update both when touch_softlockup_watchdog() is called, 
>> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
>> >> to check unbound worker_pool.
>> >> 
>> >> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> >> ---
>> >>  kernel/watchdog.c  |  5 +++--
>> >>  kernel/workqueue.c | 17 ++++++-----------
>> >>  2 files changed, 9 insertions(+), 13 deletions(-)
>> >> 
>> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> >> index 7110906..107bc38
>> >> --- a/kernel/watchdog.c
>> >> +++ b/kernel/watchdog.c
>> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>> >>  	 * update as well, the only side effect might be a cycle delay for
>> >>  	 * the softlockup check.
>> >>  	 */
>> >> -	for_each_cpu(cpu, &watchdog_allowed_mask)
>> >> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>> >>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>> >> -	wq_watchdog_touch(-1);
>> >> +		wq_watchdog_touch(cpu);
>> >
>> >Note that wq_watchdog_touch(cpu) newly always updates
>> >wq_watchdog_touched. This cycle will set the same jiffies
>> >value cpu-times to the same variable.
>> >
>> Although there is a bit of redundancy here, but the most concise way of 
>> implementation, and it is certain that it will not affect performance.
>> 
Another way to implement this is to leave wq_watchdog_touch() unchanged, but
modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():

 notrace void touch_softlockup_watchdog(void)
 {
 	touch_softlockup_watchdog_sched();
 	wq_watchdog_touch(raw_smp_processor_id());
+	wq_watchdog_touch(-1);
 }

 void touch_all_softlockup_watchdogs(void)
 	...
 	 * update as well, the only side effect might be a cycle delay for
 	 * the softlockup check.
 	 */
-	for_each_cpu(cpu, &watchdog_allowed_mask)
+	for_each_cpu(cpu, &watchdog_allowed_mask) {
 		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
+		wq_watchdog_touch(cpu);
+	}
 	wq_watchdog_touch(-1);
 }

This way wq_watchdog_touched does not get updated many times.
Which do you think is better, Petr?
>> >>  
>> >>  void touch_softlockup_watchdog_sync(void)
>> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> >> index 0d150da..be08295
>> >> --- a/kernel/workqueue.c
>> >> +++ b/kernel/workqueue.c
>> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>> >>  {
>> >>  	if (cpu >= 0)
>> >>  		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
>> >> -	else
>> >> -		wq_watchdog_touched = jiffies;
>> >> +
>> >> +	wq_watchdog_touched = jiffies;
>> >>  }
>> >>  
>> >>  static void wq_watchdog_set_thresh(unsigned long thresh)
>> >
>> >This last hunk is enough to fix the problem. wq_watchdog_touched will
>> >get updated also from cpu-specific touch_softlockup_watchdog().
>> 
>> Not enough in fact, because wq_watchdog_touched was updated in wq_watchdog_touch(), 
>> so wq_watchdog_touched can no longer be used to judge the bound pool, we must update 
>> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound judgment.
>
>I see. Your patch makes sense.
>
>I would just improve the commit message. It should better explain
>why all the twists are needed. I would write something like:
>
><proposal>
>Subject: workqueue/watchdog: Make unbound workqueues aware of
>touch_softlockup_watchdog()
>
>There are two workqueue-specific watchdog timestamps:
>
>    + @wq_watchdog_touched_cpu (per-CPU) updated by
>      touch_softlockup_watchdog()
>
>    + @wq_watchdog_touched (global) updated by
>      touch_all_softlockup_watchdogs()
>
>watchdog_timer_fn() checks only the global @wq_watchdog_touched for
>unbound workqueues. As a result, unbound workqueues are not aware
>of touch_softlockup_watchdog(). The watchdog might report a stall
>even when the unbound workqueues are blocked by a known slow code.
>
>Solution:
>
>touch_softlockup_watchdog() must touch also the global
>@wq_watchdog_touched timestamp.
>
>The global timestamp can not longer be used for bound workqueues
>because it is updated on all CPUs. Instead, bound workqueues
>have to check only @wq_watchdog_touched_cpu and these timestamp
>has to be updated for all CPUs in touch_all_softlockup_watchdogs().
>
>Beware:
>
>The change might cause the opposite problem. An unbound workqueue
>might get blocked on CPU A because of a real softlockup. The workqueue
>watchdog would miss it when the timestamp got touched on CPU B.
>
>It is acceptable because softlockups are detected by softlockup
>watchdog. The workqueue watchdog is there to detect stalls where
>a work never finishes, for example, because of dependencies of works
>queued into the same workqueue.
></proposal>
>
>How does that sound?

It's fine by me. What's your opinion, Tejun?

Thanks,
Qing
>
>Best Regards,
>Petr




* Re: Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-24  2:16       ` 王擎
@ 2021-03-24  8:50         ` Petr Mladek
  2021-03-24  9:42           ` 王擎
  0 siblings, 1 reply; 9+ messages in thread
From: Petr Mladek @ 2021-03-24  8:50 UTC (permalink / raw)
  To: 王擎
  Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Santosh Sivaraj, linux-kernel

On Wed 2021-03-24 10:16:46, 王擎 wrote:
> 
> >On Tue 2021-03-23 20:37:35, 王擎 wrote:
> >> 
> >> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> >> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> >> >> updated, while the unbound worker_pool running on its core uses 
> >> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
> >> >
> >> >By other words, unbound workqueues are not aware of the more common
> >> >touch_softlockup_watchdog() because it updates only
> >> >wq_watchdog_touched_cpu for the affected CPU. As a result,
> >> >the workqueue watchdog might report lockup in unbound workqueue
> >> >even though it is blocked by a known slow code.
> >> 
> >> Yes, this is the problem I'm talking about.
> >
> >I thought more about it. This patch prevents a false positive.
> >Could it bring an opposite problem and hide real problems?
> >
> >I mean that an unbound workqueue might get blocked on CPU A
> >because of a real softlockup. But we might not notice it because
> >CPU B is touched. Well, there are other ways how to detect
> >this situation, e.g. the softlockup watchdog.
> >
> >
> >> >> My suggestion is to update both when touch_softlockup_watchdog() is called, 
> >> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
> >> >> to check unbound worker_pool.
> >> >> 
> >> >> Signed-off-by: Wang Qing <wangqing@vivo.com>
> >> >> ---
> >> >>  kernel/watchdog.c  |  5 +++--
> >> >>  kernel/workqueue.c | 17 ++++++-----------
> >> >>  2 files changed, 9 insertions(+), 13 deletions(-)
> >> >> 
> >> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >> >> index 7110906..107bc38
> >> >> --- a/kernel/watchdog.c
> >> >> +++ b/kernel/watchdog.c
> >> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
> >> >>  	 * update as well, the only side effect might be a cycle delay for
> >> >>  	 * the softlockup check.
> >> >>  	 */
> >> >> -	for_each_cpu(cpu, &watchdog_allowed_mask)
> >> >> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
> >> >>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> >> >> -	wq_watchdog_touch(-1);
> >> >> +		wq_watchdog_touch(cpu);
> >> >
> >> >Note that wq_watchdog_touch(cpu) newly always updates
> >> >wq_watchdog_touched. This cycle will set the same jiffies
> >> >value cpu-times to the same variable.
> >> >
> >> Although there is a bit of redundancy here, but the most concise way of 
> >> implementation, and it is certain that it will not affect performance.
> >> 
> Another way to implement is wq_watchdog_touch() remain unchanged, but need 
> to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():
> notrace void touch_softlockup_watchdog(void)
> {
> 	touch_softlockup_watchdog_sched();
> 	wq_watchdog_touch(raw_smp_processor_id());
> +     wq_watchdog_touch(-1);
> }
> void touch_all_softlockup_watchdogs(void)
>  * update as well, the only side effect might be a cycle delay for
>  * the softlockup check.
> */
>  -	for_each_cpu(cpu, &watchdog_allowed_mask)
> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>   		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>  +		wq_watchdog_touch(cpu);
>  +	}
>  	wq_watchdog_touch(-1);
>   }
> So wq_watchdog_touched will not get updated many times,  
> which do you think is better, Petr?

I actually prefer the original patch. It makes wq_watchdog_touch()
easy to use. The complexity is hidden in wq-specific code.

The alternative way updates each timestamp only once but the use
is more complicated. IMHO, it is more error prone.

Best Regards,
Petr


* Re:Re: Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking
  2021-03-24  8:50         ` Petr Mladek
@ 2021-03-24  9:42           ` 王擎
  0 siblings, 0 replies; 9+ messages in thread
From: 王擎 @ 2021-03-24  9:42 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, Guilherme G. Piccoli,
	Andrey Ignatov, Vlastimil Babka, Santosh Sivaraj, linux-kernel


>On Wed 2021-03-24 10:16:46, 王擎 wrote:
>> 
>> >On Tue 2021-03-23 20:37:35, 王擎 wrote:
>> >> 
>> >> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
>> >> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
>> >> >> updated, while the unbound worker_pool running on its core uses 
>> >> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
>> >> >
>> >> >By other words, unbound workqueues are not aware of the more common
>> >> >touch_softlockup_watchdog() because it updates only
>> >> >wq_watchdog_touched_cpu for the affected CPU. As a result,
>> >> >the workqueue watchdog might report lockup in unbound workqueue
>> >> >even though it is blocked by a known slow code.
>> >> 
>> >> Yes, this is the problem I'm talking about.
>> >
>> >I thought more about it. This patch prevents a false positive.
>> >Could it bring an opposite problem and hide real problems?
>> >
>> >I mean that an unbound workqueue might get blocked on CPU A
>> >because of a real softlockup. But we might not notice it because
>> >CPU B is touched. Well, there are other ways how to detect
>> >this situation, e.g. the softlockup watchdog.
>> >
>> >
>> >> >> My suggestion is to update both when touch_softlockup_watchdog() is called, 
>> >> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
>> >> >> to check unbound worker_pool.
>> >> >> 
>> >> >> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> >> >> ---
>> >> >>  kernel/watchdog.c  |  5 +++--
>> >> >>  kernel/workqueue.c | 17 ++++++-----------
>> >> >>  2 files changed, 9 insertions(+), 13 deletions(-)
>> >> >> 
>> >> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> >> >> index 7110906..107bc38
>> >> >> --- a/kernel/watchdog.c
>> >> >> +++ b/kernel/watchdog.c
>> >> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>> >> >>  	 * update as well, the only side effect might be a cycle delay for
>> >> >>  	 * the softlockup check.
>> >> >>  	 */
>> >> >> -	for_each_cpu(cpu, &watchdog_allowed_mask)
>> >> >> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>> >> >>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>> >> >> -	wq_watchdog_touch(-1);
>> >> >> +		wq_watchdog_touch(cpu);
>> >> >
>> >> >Note that wq_watchdog_touch(cpu) newly always updates
>> >> >wq_watchdog_touched. This cycle will set the same jiffies
>> >> >value cpu-times to the same variable.
>> >> >
>> >> Although there is a bit of redundancy here, but the most concise way of 
>> >> implementation, and it is certain that it will not affect performance.
>> >> 
>> Another way to implement is wq_watchdog_touch() remain unchanged, but need 
>> to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():
>> notrace void touch_softlockup_watchdog(void)
>> {
>> 	touch_softlockup_watchdog_sched();
>> 	wq_watchdog_touch(raw_smp_processor_id());
>> +     wq_watchdog_touch(-1);
>> }
>> void touch_all_softlockup_watchdogs(void)
>>  * update as well, the only side effect might be a cycle delay for
>>  * the softlockup check.
>> */
>>  -	for_each_cpu(cpu, &watchdog_allowed_mask)
>> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>>   		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>>  +		wq_watchdog_touch(cpu);
>>  +	}
>>  	wq_watchdog_touch(-1);
>>   }
>> So wq_watchdog_touched will not get updated many times,  
>> which do you think is better, Petr?
>
>I actually prefer the original patch. It makes wq_watchdog_touch()
>easy to use. The complexity is hidden in wq-specific code.
>
>The alternative way updates each timestamp only once but the use
>is more complicated. IMHO, it is more error prone.

I agree, so I will just modify the commit log based on V2 and resend.

Thanks,
Qing
>
>Best Regards,
>Petr



